Coffee Space

Strip Between

Recently I was looking for a Python function that could parse HTML and remove the <script> tags. For example, you may have the following:

html = "Hello world! <script>alert('Hi!');</script> Test."

We want to strip everything between the tags <script> and </script>, including the tags themselves. I wrote a function called strip_between() that does this:

# strip_between()
#
# Strips out a string inclusively between two strings until no matches are
# found.
#
# NOTE: There is an assumption made that the strings are found and only found
# in matching pairs. If b appears before a, it doesn't make sense to remove
# between b and a, as order matters.
#
# @param s The string to be searched.
# @param a The first string to search for.
# @param b The second string to search for.
# @return The string with every a...b span removed.
def strip_between(s, a, b):
  z = 0
  l = len(s) + 1
  # Loop for as long as the previous pass shortened the string
  while l > len(s):
    l = len(s)
    i = s.find(a, z)
    j = s.find(b, z)
    # Stop if either string is missing, or if b comes before a
    if i < 0 or j < 0 or i >= j:
      break
    # Cut out everything from the start of a to the end of b
    s = s[:i] + s[(j + len(b)):]
    z = i
  return s

It takes the input string s, the opening string a and the closing string b. On each pass it finds the next a and the next b, removes everything from the start of a to the end of b, and repeats until no further pair is found.
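To make the "matching pairs" assumption from the comment block concrete, here is a small sketch of how the loop behaves on a few short strings. The function is a condensed copy of the one above so that the snippet runs on its own, and the bracket inputs are made up purely for illustration.

```python
# Condensed copy of strip_between() so this sketch is self-contained.
def strip_between(s, a, b):
    z = 0
    while True:
        i = s.find(a, z)
        j = s.find(b, z)
        # Stop if either string is missing, or if b comes before a.
        if i < 0 or j < 0 or i >= j:
            return s
        s = s[:i] + s[j + len(b):]
        z = i

# Two matching pairs: one is removed per pass.
print(strip_between("a[1]b[2]c", "[", "]"))  # abc

# Opening string without a closing one: nothing is removed.
print(strip_between("a[1b", "[", "]"))       # a[1b

# A stray closing string before the first opening one stops the loop,
# so the later, valid pair is left in place.
print(strip_between("x]y[z]w", "[", "]"))    # x]y[z]w
```

The last case is the one the NOTE warns about: the function does not try to recover from out-of-order input, it simply returns what it has so far.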

We can now run this on our test input ¹:

result = strip_between(html, "<script", "</script>")
print(result)

Hello world!  Test.
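For comparison only, the same clean-up can be sketched with Python's standard re module. This is a different technique from strip_between(), not a replacement for it. The pattern and the strip_scripts() helper are assumptions made for this example, and the pattern would be confused by an attribute value that itself contains a > character.

```python
import re

# Assumed pattern: an opening <script tag with optional attributes, then the
# shortest possible run of text (newlines included) up to the closing tag.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL)

def strip_scripts(html):
    """Hypothetical helper: remove every <script>...</script> block."""
    return SCRIPT_BLOCK.sub("", html)

html = "Hello world! <script>alert('Hi!');</script> Test."
print(strip_scripts(html))  # Hello world!  Test.

# Attributes, mixed case and multi-line bodies are handled by the flags above.
page = 'A<script type="text/javascript">\nvar x = 1;\n</SCRIPT>B<script src="x.js"></script>C'
print(strip_scripts(page))  # ABC
```

Unlike strip_between(), this version ignores the case of the tag name, which is why the closing </SCRIPT> in the second input still matches.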

Hopefully you find this useful!

¹ Note that the > has been dropped from the opening <script> tag in the call; this was intentional. HTML tags can take attributes (as in XML generally), and the script type or language is often specified there, for example <script type="text/javascript">.