For a recent task I had a big string and I wanted to find the 'bad' substrings. I knew what a good substring looked like, and I could write regular expressions for those. For example, suppose I have the string 'Hello there', he said, 'what do you need?'... and I want to find all the non-quoted text (i.e. everything that's not surrounded by ').
It's relatively easy to make a regex that extracts all the quoted text: r"'.*?'" (Note: this is Python re syntax.)
Using re.split
You can then use Python's re.split(...) function to split the string on this regular expression. The matched text is not returned, only the text between the matches. In other words, the pieces you get back are exactly the substrings that don't match your regular expression!
You would call it like this:
import re
doc = "'Hello there', he said, 'what do you need?'..."
for match in re.split(r"'.*?'", doc):
    # match now holds some text that doesn't match the regex
    print(match)
The returned list will look like this: ['', ', he said, ', '...']. There might be empty strings (''), which you should probably ignore.
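Dropping the empty strings can be done while the list is built. A minimal sketch, reusing the same doc string; the name unquoted is only illustrative:

```python
import re

doc = "'Hello there', he said, 'what do you need?'..."

# Keep only the non-empty pieces that fall between the quoted matches.
# An empty string is falsy, so "if piece" filters it out.
unquoted = [piece for piece in re.split(r"'.*?'", doc) if piece]
print(unquoted)  # [', he said, ', '...']
```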
If you have several 'known good' patterns, and you want to find all the substrings that don't match A, or B, or C, ..., then you can combine them with |, e.g. re.split(r"'.*?'|{.*?}", doc)
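Here is a small sketch of the combined pattern in action. The sample string and the names sample, good and leftovers are made up for illustration, and the braces are escaped here purely to make it obvious that they are meant as literal characters:

```python
import re

# Hypothetical input mixing quoted text, brace-delimited text and plain text.
sample = "'quoted' plain {braced} tail"

# Either alternative counts as 'known good'.
good = r"'.*?'|\{.*?\}"

# Split on anything good, then drop the empty strings.
leftovers = [piece for piece in re.split(good, sample) if piece]
print(leftovers)  # [' plain ', ' tail']
```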
There are numerous other solutions to this problem using negative lookahead assertions [example, example].
Position of the match
One problem with this approach is that it doesn't tell you the position in the original string where your substrings occur. You may or may not need this. For my task I needed to know where the 'bad' substring was. With normal regular expression matches you can use the start, end, or span methods of the match object to find where the match is. re.split just returns raw strings, not match objects, so these are unavailable.
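For comparison, this is roughly what those position methods look like on ordinary matches. The sketch uses re.finditer, which yields match objects for the quoted (good) text, so it shows what re.split does not give you for the unquoted text:

```python
import re

doc = "'Hello there', he said, 'what do you need?'..."

# Each match object knows where it sits in doc.
for m in re.finditer(r"'.*?'", doc):
    print(m.group(), m.start(), m.end())

# span() returns the (start, end) pair in one call.
spans = [m.span() for m in re.finditer(r"'.*?'", doc)]
print(spans)  # [(0, 13), (24, 43)]
```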
If you surround your regular expression with capturing brackets, i.e. r"('.*?')", then re.split() will return all the matched (as well as unmatched) substrings.
E.g. this call: re.split("('.*?')", doc) will produce this list: ['', "'Hello there'", ', he said, ', "'what do you need?'", '...']. It includes the matched and unmatched substrings. re.split keeps the empty strings, so (counting from 0) every even numbered element is a string that doesn't match, and every odd numbered element is a string that does match.
Since every character in the original string is returned to you, you can count them and find the position of the bad substrings.
import re
doc = "'Hello there', he said, 'what do you need?'..."
offset = 0
for index, match in enumerate(re.split("('.*?')", doc)):
    if index % 2 == 0:
        # even index: text that does not match the regex
        print("Bad string %r starts at position %d" % (match, offset))
    # count every piece, matched or not, to keep the offset correct
    offset += len(match)
Which gives the following output:
Bad string '' starts at position 0
Bad string ', he said, ' starts at position 13
Bad string '...' starts at position 43
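If the counting feels fragile, one alternative is to walk the good matches with re.finditer and report the gaps between them. This is a different technique from the re.split approach, sketched here only for comparison; the function name unmatched_spans is made up for illustration. Unlike the loop above, it skips empty gaps, so nothing is reported at position 0.

```python
import re

def unmatched_spans(pattern, text):
    """Yield (start, substring) for each non-empty stretch of text
    that is not covered by a match of pattern."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            # There is unmatched text between the previous match and this one.
            yield pos, text[pos:m.start()]
        pos = m.end()
    if pos < len(text):
        # Trailing text after the last match.
        yield pos, text[pos:]

doc = "'Hello there', he said, 'what do you need?'..."
for start, bad in unmatched_spans(r"'.*?'", doc):
    print("Bad string %r starts at position %d" % (bad, start))
```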