A regular expression (regex or RE) describes a set of strings. Used well, REs are a powerful tool for text mining and manipulation.
REs can be concatenated. If A and B are both REs, then AB is also an RE. If a string p matches A and another string q matches B, then the string pq matches AB.
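The concatenation rule can be tried out directly. The following is a minimal sketch, not a definitive recipe: the patterns A and B and the strings p and q are illustrative choices, and the code is written to run under both Python 2.7 and Python 3.

```python
import re

# A, B, p and q are illustrative names chosen for this sketch.
A = r"\d+"       # matches one or more digits
B = r"[a-z]+"    # matches one or more lowercase letters
AB = A + B       # joining the pattern strings concatenates the two REs

p = "2024"       # p matches A
q = "spam"       # q matches B

match_a = re.match(A + "$", p)
match_b = re.match(B + "$", q)
match_ab = re.match(AB + "$", p + q)   # pq matches AB

print(match_ab.group(0))   # prints 2024spam
```

Joining pattern strings this way is only safe when neither pattern contains a top-level alternation (|). Wrapping each part in a non-capturing group, as in (?:A)(?:B), avoids that pitfall.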
Regexes allow the use of escaped letters (such as \d or \w) and special symbols (such as . or *) to match a wide range of strings according to certain rules.
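As a small illustration of escaped letters and special symbols, the sketch below uses made-up sample strings. It only relies on re.findall and runs under both Python 2.7 and Python 3.

```python
import re

# Sample text invented for this sketch.
text = "Order 66 shipped on 2019-05-04 for $3.50"

# \d is an escaped letter: it stands for any digit.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# \$ and \. are escaped special symbols: they match a literal '$' and '.'.
prices = re.findall(r"\$\d+\.\d{2}", text)

# Unescaped, '.' is a special symbol that matches any character except a newline.
dot_any = re.findall(r"a.c", "abc a-c a.c")
dot_literal = re.findall(r"a\.c", "abc a-c a.c")

print(dates)
print(prices)
print(dot_any)
print(dot_literal)
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.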
import re
r = re.search('Bal{2,5}','34eBallllll342')
if r: print "Positive"
else: print "Negative"
RE, seq = r'(?<=abc)def', 'abcdef' # Lookbehind: matches 'def' only when it is preceded by 'abc'
RE, seq = r'(?<=-)\w+', 'spam-egg' # Looks for a word following a hyphen (replaces the example above)
m = re.search(RE, seq)
print m.group(0)
print '''\n#Matching vs. Searching'''
print re.match("c", "abcdef")
print re.search("c", "abcdef")
print '''\n#Module'''
pattern = 'ABC'
string = 'ABCD'
prog = re.compile(pattern)
result = prog.match(string)
result = re.match(pattern, string) # Equivalent to the two lines above; re.compile is more efficient when the expression will be used several times
# prog.findall(string[, pos[, endpos]]): the optional pos and endpos arguments limit the region that is searched
print '''\n#group([group1, ...])'''
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print m.group(0)
print m.group(1)
print m.group(2)
print m.group(1,2)
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print m.group('first_name')
print m.group('last_name')
print '''\n#Named groups can also be referred to by their index:'''
print m.group(1)
print m.group(2)
print '''\n#If a group matches multiple times, only the last match is accessible:'''
m = re.match(r"(..)+", "a1b2c3") # The group matches 3 times.
print m.group(1) # Returns only the last match: 'c3'
print '''\n#groupdict([default])'''
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print m.groupdict()
print '''\n#Making a Phonebook'''
input = """Ross McFluff: 834.345.1254 155 Elm Street
Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way
Heather Albrecht: 548.326.4584 919 Park Place"""
entries = re.split("\n+", input)
print entries
print
print [re.split(":? ", entry, 3) for entry in entries]
print
print [re.split(":? ", entry, 4) for entry in entries]
print '''\n#Text Munging'''
import random
def repl(m):
    # Shuffle the inner letters of a word, keeping the first and last letters in place
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)
text = '''Some long text here.'''
print re.sub(r"(\w)(\w+)(\w)", repl, text)
print '''\n#Finding all Adverbs'''
text = "He was carefully disguised but captured quickly by police."
print re.findall(r"\w+ly", text)
print '''\n#Finding all Adverbs and their Positions'''
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
import re # http://stackoverflow.com/questions/250271/python-regex-use-how-to-get-positions-of-matches
p = re.compile("[a-z]")
for m in p.finditer('a1b2c3d4'):
    print m.start(), m.group()
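The module notes above mention that findall on a compiled pattern accepts optional pos and endpos arguments. The sketch below shows their effect on the same sample string. The variable names are illustrative, and the code runs under both Python 2.7 and Python 3.

```python
import re

prog = re.compile(r"[a-z]")
s = "a1b2c3d4"

# No limits: the whole string is searched.
all_letters = prog.findall(s)

# pos=2: the search starts at index 2, so 'a' is skipped.
from_two = prog.findall(s, 2)

# pos=2, endpos=5: only indices 2, 3 and 4 ('b2c') are searched.
window = prog.findall(s, 2, 5)

print(all_letters)
print(from_two)
print(window)
```

The module-level function re.findall(pattern, string) does not take pos and endpos, so limiting the search region this way requires a compiled pattern.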
A ^ at the beginning of a character class (square brackets) negates it: the class then matches any character that is not listed. For instance, to match every non-alphanumeric character except whitespace and the colon, one could formulate a regex like this:
[^a-zA-Z\d\s:]
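A quick way to see what the negated class picks out is to run it over a sample string. The sentence below is invented for this sketch, which runs under both Python 2.7 and Python 3.

```python
import re

# The negated class: anything that is not a letter, digit, whitespace or colon.
pattern = re.compile(r"[^a-zA-Z\d\s:]")

# Sample text made up for illustration.
sample = "Time: 12:30, place: Room #4 (west)!"

found = pattern.findall(sample)
print(found)
```

Only the comma, the hash sign, the two parentheses and the exclamation mark are reported; the colons are left out because they are listed inside the negated class.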
Add a ? after a quantifier (for example *?, +? or {2,5}?) to make it non-greedy, so that it matches as few characters as possible.
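The difference between a greedy and a non-greedy quantifier is easiest to see side by side. This is a minimal sketch with an invented sample string. It runs under both Python 2.7 and Python 3.

```python
import re

# Sample markup made up for this sketch.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? stops at the first '>' it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```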
Note: Ranges are inclusive.
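The note does not say which kind of range it means. Both character-class ranges such as [b-d] and counted quantifiers such as {2,3} include their endpoints, as this short sketch (with illustrative inputs) shows. It runs under both Python 2.7 and Python 3.

```python
import re

# Character-class range: both endpoints 'b' and 'd' are included.
in_range = re.findall(r"[b-d]", "abcde")

# Counted quantifier: l{2,3} accepts 2 or 3 repetitions, but not 1 or 4.
counts = [n for n in range(1, 5) if re.match(r"l{2,3}$", "l" * n)]

print(in_range)
print(counts)
```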
Using the UNICODE regex flag, it is possible to do something like ur'(?u)^[^\W\d_]+$', which will match any string consisting solely of alphabetic Unicode characters.
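The character class [^\W\d_] reads as "a word character that is neither a digit nor an underscore", which leaves only letters. The sketch below is one way to try it, assuming the pattern is compiled with re.UNICODE rather than the inline (?u) flag. The ur'' prefix exists only in Python 2, so the pattern is written here as a u'' string with doubled backslashes to run under both Python 2.7 and Python 3 (where str patterns are already Unicode-aware). The name alpha_only and the test strings are illustrative.

```python
import re

# Letters only: word characters minus digits and the underscore.
alpha_only = re.compile(u"^[^\\W\\d_]+$", re.UNICODE)

accented = alpha_only.match(u"caf\u00e9")          # 'cafe' with an accented e
greek = alpha_only.match(u"\u03b1\u03b2\u03b3")    # three Greek letters
with_digits = alpha_only.match(u"abc123")          # contains digits
with_underscore = alpha_only.match(u"snake_case")  # contains an underscore

print(accented is not None)
print(greek is not None)
print(with_digits is None)
print(with_underscore is None)
```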
There is an excellent book on regex [1], the official Python HOWTO guide [2] and the re module documentation [3], as well as a nice regex primer [4] and a cheat-sheet [5]. There is also an informational website dedicated to regular expressions [6].
[1] Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly, 1st edition
[2] Regular Expression HOWTO [http://docs.python.org/howto/regex.html#regex-howto]
[3] Module re [http://docs.python.org/2.7/library/re.html#module-re]
[4] Regex Primer [http://python.about.com/od/regularexpressions/a/regexprimer.htm]
[5] Cheat-Sheet [http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/]
[6] Char classes [http://www.regular-expressions.info/charclass.html]