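Regular Expression
- Import the re module
- Define a pattern: patternName = r"abc..."
1. Concept
- A regular expression (RE) is a small, highly specialized language
- It is embedded in Python and made available through the re module
2. Purpose
Processing strings:
- Match
- Replace
- Split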
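As a rough illustration of the three uses listed above, the following sketch runs one match, one replace and one split. The sample string and variable names are made up for this example.

```python
import re

text = "apple, banana;cherry"

# Match: collect every run of lowercase letters
words = re.findall(r"[a-z]+", text)

# Replace: turn every ";" into ","
normalized = re.sub(r";", ",", text)

# Split: cut the string at "," or ";" plus any spaces that follow
parts = re.split(r"[,;] *", text)

print(words)       # ['apple', 'banana', 'cherry']
print(normalized)  # apple, banana,cherry
print(parts)       # ['apple', 'banana', 'cherry']
```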
3. Character matching
- Ordinary characters
- Metacharacters
. ^ $ * + ? {} [] \ | ()
# Matching ordinary characters
>>> import re
>>> pattern = r"ab"
>>> re.findall(pattern, "123abc")
['ab']
4. Metacharacters
4.0 .
- Matches any single character (except a newline, unless re.DOTALL is set)
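A short sketch of the dot in action. The sample string is an assumption chosen to include one newline, so the default "no newline" behavior is visible.

```python
import re

# "." matches any single character except a newline
dot_matches = re.findall(r"a.c", "abc a-c a\nc ac")
print(dot_matches)  # ['abc', 'a-c']
```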
4.1 []
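- Matches one character out of the listed set
- Commonly used to specify a character set: [abc], [0-9], [a-zA-Z]
- Most metacharacters are treated as ordinary characters inside a set: [abc$]
- Complement: [^a-z] (a leading ^ negates the set)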
>>> import re
# Character set
>>> pattern = "[a-z]"
>>> re.findall(pattern, "abc")
['a', 'b', 'c']
# Complement
>>> pattern = "[^a-z]"
>>> re.findall(pattern, "abc")
[]
# Special characters
>>> pattern = "[a^$]"
>>> re.findall(pattern, "abc^$")
['a', '^', '$']
4.2 ^
- Matches the start of the line (the start of the string unless re.MULTILINE is set)
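This section has no example of its own, so here is a minimal sketch mirroring the "$" example that follows. The test strings are chosen for illustration.

```python
import re

# "^" anchors the match at the start of the string
leading = re.findall(r"^a", "abca")
not_leading = re.findall(r"^a", "bbba")

print(leading)      # ['a']  (only the first "a" counts)
print(not_leading)  # []
```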
4.3 $
- Matches the end of the line (the end of the string unless re.MULTILINE is set)
>>> pattern = "a$"
>>> re.findall(pattern, "aaab")
[]
>>> re.findall(pattern, "bbba")
['a']
4.4 \ - escape character
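A possible illustration of escaping: a backslash in front of a metacharacter makes it match literally. The sample strings are assumptions. re.escape() is a standard re function that is not mentioned in these notes, shown here only as a convenience.

```python
import re

# A backslash turns the metacharacter "$" into a literal dollar sign
prices = re.findall(r"\$\d+", "cost: $15 or $7")
print(prices)  # ['$15', '$7']

# re.escape() inserts the needed backslashes automatically
literal = re.escape("a.b")
dotted = re.findall(literal, "a.b axb")
print(dotted)  # ['a.b']
```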
4.5 Repetition
4.5.1 *
- Repetition count: [0, +inf)
>>> pattern = r"ab*"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abb")
['abb']
>>> re.findall(pattern, "abbbbbbbbbb")
['abbbbbbbbbb']
4.5.2 +
- Repetition count: [1, +inf)
>>> pattern = r"ab+"
>>> re.findall(pattern, "a")
[]
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbbb")
['abbbbbb']
4.5.3 ?
- Repetition count: [0, 1], i.e. present or absent
>>> pattern = r"ab?"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbb")
['ab']
4.5.4 {m,n}
- {m,n} repetition count: [m, n]
- {m} repetition count: exactly m
- {m,} repetition count: [m, +inf)
- If m is omitted it defaults to 0
>>> pattern = r"\d{1,3}"
>>> re.findall(pattern, "1234")
['123', '4']
>>> pattern = r"\d{1,}"
>>> re.findall(pattern, "1234")
['1234']
>>> pattern = r"\d{1}"
>>> re.findall(pattern, "1234")
['1', '2', '3', '4']
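The default lower bound is easiest to see with a pattern that omits m. This sketch also checks that *, + and ? behave like their {m,n} equivalents. The sample string is made up.

```python
import re

sample = "a ab abb abbb"

# {,2} is the same as {0,2}: the lower bound defaults to 0
up_to_two = re.findall(r"ab{,2}", sample)
print(up_to_two)  # ['a', 'ab', 'abb', 'abb']

# The shorthand quantifiers are special cases of {m,n}
star_same = re.findall(r"ab*", sample) == re.findall(r"ab{0,}", sample)
plus_same = re.findall(r"ab+", sample) == re.findall(r"ab{1,}", sample)
opt_same = re.findall(r"ab?", sample) == re.findall(r"ab{0,1}", sample)
print(star_same and plus_same and opt_same)  # True
```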
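5. Compiling regular expressions
5.1 Compiling
- The re module provides an interface to the regular expression engine,
which can compile an RE string into a pattern object
>>> import re
>>> telPatternString = r"\d{3}"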
>>> telPattern = re.compile(telPatternString)
>>> telPattern
<_sre.SRE_Pattern object at 0x01806170>
>>> telPattern.findall("1")
[]
>>> telPattern.findall("123")
['123']
>>> telPattern.findall("1234")
['123']
5.2 Passing flags when compiling
- Ignore case
>>> import re
>>> namePatternString = r"[a-z]{3}"
>>> namePattern = re.compile( namePatternString, re.IGNORECASE )
>>> namePattern.findall("abc")
['abc']
>>> namePattern.findall("abC")
['abC']
5.3 The trouble with backslashes
- With an "r" prefix on the string literal, Python does not process backslashes in any special way, so they reach the regex engine unchanged
>>> pattern = r"\\"
>>> re.findall(pattern, r"c:\dirA")
['\\']
>>> pattern = "\\"
>>> re.findall(pattern, r"c:\dirA")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\install-2.7\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "D:\Python\install-2.7\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
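To make the two layers of escaping concrete, this sketch writes the same pattern with and without the r prefix. The variable names are illustrative only.

```python
import re

path = r"c:\dirA"

# Without the r prefix, one literal backslash needs four characters:
# Python reduces "\\\\" to two backslashes, and the regex engine
# reduces those two to a single literal backslash
plain = "\\\\"
raw = r"\\"

same_pattern = (plain == raw)
found = re.findall(raw, path)

print(same_pattern)  # True
print(len(found))    # 1
```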
6. Some methods of the Regex object
6.1 match()
6.2 search()
6.3 findall()
6.4 finditer()
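Sections 6.1 to 6.4 list the methods without examples, so here is a hedged side-by-side sketch of how the four differ on one input. The pattern and text are assumptions chosen for illustration.

```python
import re

pattern = re.compile(r"\d+")
text = "ab12cd345"

# match() only succeeds if the pattern matches at position 0
at_start = pattern.match(text)
print(at_start)  # None

# search() scans forward and returns the first match
first = pattern.search(text).group()
print(first)  # 12

# findall() returns every match as a list of strings
every = pattern.findall(text)
print(every)  # ['12', '345']

# finditer() yields one Match object per match
spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(2, 4), (6, 9)]
```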
6.5 sub() subn()
- subn(pattern, repl, string, count=0, flags=0)
>>> re.sub(r"a", "x", "abca")
'xbcx'
>>> re.subn(r"a", "x", "abca")
('xbcx', 2)
6.6 split()
- split(pattern, string, maxsplit=0, flags=0)
>>> re.split(r"[^\d]", "1999-09/19 23:34:59")
['1999', '09', '19', '23', '34', '59']
>>> re.split(r"[^\d ]", "1 + 2 + 3 - 4 * 5")
['1 ', ' 2 ', ' 3 ', ' 4 ', ' 5']
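7. Some methods of the Match object
- group() returns the string matched by the RE: obj.group()
- start() start position of the match
- end() end position of the match
- span() (start position, end position)
- Check whether the returned Match object is None to tell whether the match succeeded.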
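The methods above can be sketched together as follows. The input string is invented, and the None check comes first because a failed search returns None rather than a Match object.

```python
import re

m = re.search(r"\d+", "tel: 0101234567")

if m is not None:       # a failed match would return None
    print(m.group())    # 0101234567
    print(m.start())    # 5
    print(m.end())      # 15
    print(m.span())     # (5, 15)

# No digits here, so search() returns None
missing = re.search(r"\d+", "no digits")
print(missing is None)  # True
```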
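8. re attributes
- Compilation flags
- DOTALL/S makes . match every character, including newlines
- IGNORECASE/I ignores case
- LOCALE/L locale-aware matching
- MULTILINE/M multi-line matching, affects ^ and $
- VERBOSE/X ignores whitespace and newlines in the pattern, so it can be written across lines in a """ string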
# re.S
>>> re.findall(r"a.b", "a\nb")
[]
>>> re.findall(r"a.b", "a\nb", re.S)
['a\nb']
# re.M
>>> s = """
... line1: a1
... line2: a2
... line3: a3
... """
>>> s
'\nline1: a1\nline2: a2\nline3: a3\n'
>>> re.findall(r"^line[0-9]", s)
[]
>>> re.findall(r"^line[0-9]", s, re.M)
['line1', 'line2', 'line3']
# re.X
>>> telPatternStr = r"""
... \d{3,4}
... -?
... \d{7}
... """
>>> telPatternStr
'\n\\d{3,4}\n-?\n\\d{7}\n'
>>> re.findall(telPatternStr, "011-1234567")
[]
>>> re.findall(telPatternStr, "011-1234567", re.X)
['011-1234567']
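9. Grouping - ()
- ( pattern1 | pattern2 ) matches either alternative
- When the pattern contains a group, findall() returns the group's content rather than the whole match
# Extract URLs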
>>> s = """
... <a href="www.baidu.com">baidu</a>
... <a href="www.sina.com.cn">sina</a>
... """
>>> print s
<a href="www.baidu.com">baidu</a>
<a href="www.sina.com.cn">sina</a>
>>> re.findall( r'<a href=".+">.+</a>', s )
['<a href="www.baidu.com">baidu</a>', '<a href="www.sina.com.cn">sina</a>']
>>> re.findall( r'<a href="(.+)">.+</a>', s )
['www.baidu.com', 'www.sina.com.cn']
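The transcript above shows that a group is what findall() returns, but not the alternation form. A minimal sketch of ( pattern1 | pattern2 ) follows. The non-capturing form (?:...) does not appear in these notes and is included only for contrast.

```python
import re

hosts = "www.baidu.com www.sina.com.cn"

# (a|b) matches either alternative; findall returns only the group
names = re.findall(r"www\.(baidu|sina)\.com", hosts)
print(names)  # ['baidu', 'sina']

# (?:...) groups without capturing, so findall returns the whole match
full = re.findall(r"www\.(?:baidu|sina)\.com", hosts)
print(full)  # ['www.baidu.com', 'www.sina.com']
```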
10. A small crawler
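The notes stop at this heading, so what follows is only a guess at the kind of link extraction such a crawler might do. It is not the author's crawler. The page content is hard-coded (a real crawler would download it), and extract_links and link_pattern are hypothetical names. The non-greedy .+? is not covered in the notes; it is used here so each group stops at the first closing quote or tag.

```python
import re

# Hypothetical page content; a real crawler would fetch this over HTTP
html = """
<a href="www.baidu.com">baidu</a>
<a href="www.sina.com.cn">sina</a>
"""

# Two groups: the href value and the link text
link_pattern = re.compile(r'<a href="(.+?)">(.+?)</a>')


def extract_links(page):
    """Return a list of (url, text) pairs, one per anchor tag in page."""
    return link_pattern.findall(page)


links = extract_links(html)
for url, label in links:
    print(label + " -> " + url)
```

With two groups in the pattern, findall() returns a list of tuples, which is why the loop unpacks each item into url and label.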