Recently I was looking for a Python function that could take a string of HTML and remove the <script> tags along with their contents. For example, you may have the following:
    html = "Hello world! <script>alert('Hi!');</script> Test."
We want to strip everything between the tags <script> and </script>, including the tags themselves. I wrote a function called strip_between() that does this:
    # strip_between()
    #
    # Strips out a string inclusively between two strings until no matches are
    # found.
    #
    # NOTE: There is an assumption made that the strings are found and only found
    # in matching pairs. If b appears before a, it doesn't make sense to remove
    # between b and a, as order matters.
    #
    # @param s The string to be searched.
    # @param a The first string to search for.
    # @param b The second string to search for.
    # @return The string with every matched span removed.
    def strip_between(s, a, b):
        z = 0            # position to resume searching from
        l = len(s) + 1   # length before the last pass; forces the first pass
        while l > len(s):
            l = len(s)
            i = s.find(a, z)
            j = s.find(b, z)
            # Stop when either string is missing or they are out of order.
            if i < 0 or j < 0 or i >= j:
                break
            # Remove everything from the start of a to the end of b.
            s = s[:i] + s[(j + len(b)):]
            z = i
        return s
It takes the input string s, the first string to search for a, and the second string to search for b. It keeps removing matching spans until no more are found.
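As the NOTE in the comments says, the function stops as soon as b turns up before a. If that is too strict for your input, one possible variation is to look for b only after each a. The sketch below is an assumption on my part rather than the function above, and the name strip_between_ordered is made up for illustration:

```python
def strip_between_ordered(s, a, b):
    """Hypothetical variant: remove each span from a through the next b after it."""
    pos = 0
    while True:
        i = s.find(a, pos)
        if i < 0:
            break  # no opening string left
        # Search for the closing string only after the opening one.
        j = s.find(b, i + len(a))
        if j < 0:
            break  # unmatched opening string: leave the rest untouched
        s = s[:i] + s[j + len(b):]
        pos = i  # resume where the removed span started
    return s


# A stray "</x>" before the first "<x>" no longer stops the removal.
print(strip_between_ordered("a </x> b <x>c</x> d <x>e</x> f", "<x>", "</x>"))
```

The stray closing string is left in place; only properly ordered pairs are removed.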
We can now run this on our test input [1]:
    result = strip_between(html, "<script", "</script>")
    print(result)
    Hello world!  Test.
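Because the call passes "<script" without the closing >, the same call should also cope with script tags that carry attributes, and with more than one script block in the input. Here is a small self-contained check; the page string is made-up sample input, and the function body is repeated only so the snippet runs on its own:

```python
def strip_between(s, a, b):
    # Same logic as the post's strip_between(), repeated so this runs standalone.
    z = 0
    l = len(s) + 1
    while l > len(s):
        l = len(s)
        i = s.find(a, z)
        j = s.find(b, z)
        if i < 0 or j < 0 or i >= j:
            break
        s = s[:i] + s[(j + len(b)):]
        z = i
    return s


# Made-up sample: one script tag with an attribute, one without.
page = ('<p>One</p><script type="text/javascript">a();</script>'
        '<p>Two</p><script>b();</script><p>Three</p>')

cleaned = strip_between(page, "<script", "</script>")
print(cleaned)
```

Both script blocks are removed and the three paragraphs are left untouched.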
Hopefully you find this useful!
[1] Note that the > has been dropped from the <script> tag in the call above; this was intentional. HTML tags can take attributes (as in XML generally), and a script tag often specifies its language as well.