# Chapter 8: String Processing

## Learning Objectives

After completing this chapter, readers will be able to:

- Understand the Unicode nature and immutability of Python strings
- Use string methods fluently for searching, replacing, splitting, and formatting
- Master regular expression syntax and Python's re module
- Understand character encoding systems (ASCII, UTF-8, GBK) and encoding/decoding operations
- Apply string processing techniques to solve real-world text processing problems

## 8.1 String Basics

### 8.1.1 Creating Strings

```python
single = 'Hello'
double = "World"
multi = """This is a
multi-line
string"""
raw = r"C:\Users\name\file.txt"   # raw string: backslashes are literal
unicode_str = "你好,世界!🌍"       # any Unicode character is allowed
```
**Academic note**: Python 3's `str` type is a Unicode string; each character is a Unicode code point. The underlying implementation uses a flexible internal representation: pure-ASCII and pure-Latin-1 strings use 1 byte per character, strings whose characters all fit in the BMP use 2 bytes per character, and strings containing supplementary characters use 4 bytes per character. This optimization is specified by PEP 393 — Flexible String Representation.
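The effect of PEP 393 can be observed with `sys.getsizeof`; the exact numbers are CPython implementation details and vary by version, but the per-character cost grows with the widest code point in the string:

```python
import sys

# Same length (100 characters), different internal width (PEP 393)
ascii_s = "a" * 100    # 1 byte per character
bmp_s = "中" * 100     # 2 bytes per character (BMP code point)
astral_s = "🌍" * 100  # 4 bytes per character (supplementary plane)

print(sys.getsizeof(ascii_s))
print(sys.getsizeof(bmp_s))
print(sys.getsizeof(astral_s))
```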
### 8.1.2 Indexing and Slicing

```python
text = "Python"
print(text[0])     # P
print(text[-1])    # n
print(text[1:4])   # yth
print(text[::-1])  # nohtyP (reversed)
print(text[::2])   # Pto (every second character)
```
### 8.1.3 String Immutability

```python
text = "Python"
# text[0] = "J" would raise TypeError: strings cannot be modified in place
new_text = "J" + text[1:]          # build a new string instead
new_text = text.replace("P", "J")  # or use replace()
```
**Engineering practice**: when concatenating many strings, use `"".join(parts)` rather than `+=`. Each `+=` creates a new string object, giving O(n²) total time; `join()` allocates memory once, giving O(n).

```python
# Slow: O(n²) in the worst case
result = ""
for word in words:
    result += word

# Fast: O(n)
result = "".join(words)
```
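A quick micro-benchmark of the two approaches (absolute numbers depend on the machine, and CPython's in-place `+=` optimization can mask the quadratic behavior for simple cases, so treat this as a sketch):

```python
import timeit

words = ["word"] * 10_000

def concat_plus():
    result = ""
    for w in words:
        result += w
    return result

def concat_join():
    return "".join(words)

print(f"+= loop: {timeit.timeit(concat_plus, number=100):.4f}s")
print(f"join():  {timeit.timeit(concat_join, number=100):.4f}s")
```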
## 8.2 String Methods

### 8.2.1 Case and Whitespace

```python
text = "Hello World"
print(text.lower())       # hello world
print(text.upper())       # HELLO WORLD
print(text.title())       # Hello World
print(text.capitalize())  # Hello world
print(text.swapcase())    # hELLO wORLD

text = "  hello  "
print(text.strip())             # 'hello'
print(text.lstrip())            # 'hello  '
print(text.rstrip())            # '  hello'
print("xxhelloxx".strip("x"))   # hello
```
### 8.2.2 Searching and Replacing

```python
text = "Hello World World"
print(text.find("World"))    # 6   (index of first match)
print(text.rfind("World"))   # 12  (index of last match)
print(text.find("Python"))   # -1  (not found)
print(text.index("World"))   # 6   (like find, but raises ValueError if absent)
print(text.count("World"))   # 2

print(text.replace("World", "Python"))     # Hello Python Python
print(text.replace("World", "Python", 1))  # Hello Python World (at most 1 replacement)
```
### 8.2.3 Splitting and Joining

```python
print("apple,banana,cherry".split(","))    # ['apple', 'banana', 'cherry']
print("one two three".split())             # ['one', 'two', 'three']
print("line1\nline2\nline3".splitlines())  # ['line1', 'line2', 'line3']

print(" ".join(["Hello", "World"]))        # Hello World
print("-".join(["2024", "01", "15"]))      # 2024-01-15

print("Hello World Python".partition(" "))   # ('Hello', ' ', 'World Python')
print("Hello World Python".rpartition(" "))  # ('Hello World', ' ', 'Python')
```
### 8.2.4 Predicate Methods

```python
print("123".isdigit())           # True
print("abc".isalpha())           # True
print("abc123".isalnum())        # True
print("   ".isspace())           # True
print("hello".islower())         # True
print("HELLO".isupper())         # True
print("Hello World".istitle())   # True
print("Hello".startswith("He"))  # True
print("Hello".endswith("lo"))    # True
```
### 8.2.5 Alignment and Padding

```python
text = "Python"
print(text.center(20, "-"))  # -------Python-------
print(text.ljust(20, "*"))   # Python**************
print(text.rjust(20, "*"))   # **************Python
print("42".zfill(8))         # 00000042
```
## 8.3 String Formatting

### 8.3.1 f-strings (recommended)

```python
name = "Alice"
age = 25
pi = 3.14159

print(f"Name: {name}, Age: {age}")
print(f"Pi: {pi:.2f}")            # Pi: 3.14
print(f"Number: {42:08b}")        # binary, zero-padded: 00101010
print(f"Hex: {255:x}")            # ff
print(f"Thousands: {1234567:,}")  # 1,234,567
print(f"Percent: {0.856:.1%}")    # 85.6%

print(f"|{name:>10}|")  # right-aligned
print(f"|{name:<10}|")  # left-aligned
print(f"|{name:^10}|")  # centered

print(f"Result: {2 ** 10}")      # expressions are allowed: 1024
print(f"Upper: {name.upper()}")  # method calls too: ALICE

x = 42
print(f"{x = }")      # debugging form (3.8+): x = 42
print(f"{x = :08b}")  # x = 00101010

from datetime import datetime
now = datetime.now()
print(f"{now:%Y-%m-%d %H:%M:%S}")  # datetime format specifiers
```
### 8.3.2 format() and %-formatting

```python
# str.format()
print("Name: {}, Age: {}".format("Alice", 25))
print("Name: {0}, Age: {1}, Again: {0}".format("Alice", 25))
print("Name: {name}, Age: {age}".format(name="Alice", age=25))

# %-style formatting (legacy)
print("Name: %s, Age: %d" % ("Alice", 25))
```
**Engineering practice**: prefer f-strings — they are the most concise, the most readable, and the fastest. Use the `format()` method when you need a dynamic template; use `%`-formatting only when maintaining legacy code.
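A rough micro-benchmark of the three styles. Absolute numbers depend on the interpreter; f-strings are typically fastest because the formatting is compiled into bytecode rather than parsed from a template string at runtime:

```python
import timeit

name, age = "Alice", 25

fstring = timeit.timeit(lambda: f"Name: {name}, Age: {age}", number=100_000)
fmt = timeit.timeit(lambda: "Name: {}, Age: {}".format(name, age), number=100_000)
percent = timeit.timeit(lambda: "Name: %s, Age: %d" % (name, age), number=100_000)

print(f"f-string: {fstring:.4f}s")
print(f"format(): {fmt:.4f}s")
print(f"%-style:  {percent:.4f}s")
```

All three produce identical output; only the speed and readability differ.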
## 8.4 Regular Expressions

### 8.4.1 Basic Matching

```python
import re

text = "Hello World 123"

match = re.search(r"\d+", text)  # search anywhere in the string
if match:
    print(match.group())  # 123
    print(match.span())   # (12, 15)

print(re.findall(r"\d+", "abc 123 def 456"))  # ['123', '456']

print(re.match(r"Hello", text))  # matches only at the start of the string
print(re.match(r"World", text))  # None

print(re.sub(r"\d+", "[NUM]", "abc 123 def 456"))  # abc [NUM] def [NUM]
print(re.split(r"\s+", "Hello World Python"))      # ['Hello', 'World', 'Python']
```
### 8.4.2 Metacharacters and Character Classes

```python
import re

text = "Hello World 123 !@#"
print(re.findall(r"[aeiou]", text))   # vowels: ['e', 'o', 'o']
print(re.findall(r"[A-Z]", text))     # uppercase letters: ['H', 'W']
print(re.findall(r"[0-9]", text))     # digits: ['1', '2', '3']
print(re.findall(r"[^aeiou]", text))  # negated class: everything except vowels
```
### 8.4.3 Quantifiers

```python
import re

text = "a aa aaa aaaa"
print(re.findall(r"a+", text))   # greedy: ['a', 'aa', 'aaa', 'aaaa']
print(re.findall(r"a+?", text))  # lazy: matches one 'a' at a time
```
### 8.4.4 Groups and Capturing

```python
import re

text = "John: 25, Alice: 30, Bob: 35"
pattern = r"(\w+): (\d+)"
for match in re.finditer(pattern, text):
    print(f"Name: {match.group(1)}, Age: {match.group(2)}")

# Named groups
text = "name@example.com"
pattern = r"(?P<username>\w+)@(?P<domain>\w+\.\w+)"
match = re.search(pattern, text)
print(match.group("username"))  # name
print(match.group("domain"))    # example.com

# Non-capturing group (?:...)
pattern = r"(?:\d{3})-(\d{4})"
match = re.search(pattern, "Tel: 010-1234")
print(match.group(1))  # 1234 (the non-capturing group is not numbered)

# Backreference \1 repeats what group 1 matched
print(re.findall(r"(\w+) \1", "hello hello world world"))  # ['hello', 'world']
```
### 8.4.5 Assertions

```python
import re

text = "abc123def456"
print(re.findall(r"\d+(?=def)", text))   # lookahead: ['123']
print(re.findall(r"\d+(?!def)", text))   # negative lookahead: ['12', '456']
print(re.findall(r"(?<=abc)\d+", text))  # lookbehind: ['123']
print(re.findall(r"\bword\b", "a word wordword"))  # word boundaries: ['word']
```
### 8.4.6 Compiling and Flags

```python
import re

pattern = re.compile(r"\d+")  # compile once, reuse many times
print(pattern.findall("abc 123 def 456"))  # ['123', '456']

# re.VERBOSE allows whitespace and comments inside the pattern
pattern = re.compile(r"""
    \d{3}  # area code
    -      # separator
    \d{4}  # number
""", re.VERBOSE)
```
## 8.5 Character Encoding

### 8.5.1 Encoding and Decoding

```python
text = "你好,世界!"
encoded = text.encode("utf-8")     # str -> bytes
print(type(encoded))               # <class 'bytes'>
decoded = encoded.decode("utf-8")  # bytes -> str

# Error-handling strategies
text = "你好"
print(text.encode("ascii", errors="ignore"))   # b''
print(text.encode("ascii", errors="replace"))  # b'??'
print(text.encode("ascii", errors="xmlcharrefreplace"))  # b'&#20320;&#22909;'
```
### 8.5.2 Unicode Normalization

```python
import unicodedata

s1 = "caf\u00e9"   # é as a single precomposed character
s2 = "cafe\u0301"  # e followed by a combining acute accent
print(s1 == s2)          # False
print(len(s1), len(s2))  # 4 5

normalized = unicodedata.normalize("NFC", s2)
print(s1 == normalized)  # True
```
**Academic note**: the same Unicode character can have multiple encoded representations. NFC (Normalization Form C) prefers precomposed characters; NFD prefers decomposed forms. Normalize before comparing text, otherwise strings that look identical may compare unequal.
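Going the other direction, NFD decomposition makes the combining mark visible, which also enables a common trick for stripping accents:

```python
import unicodedata

s = "caf\u00e9"  # precomposed é (U+00E9)
nfd = unicodedata.normalize("NFD", s)
print(len(s), len(nfd))  # 4 5 -- the NFD form has one extra code point
print(hex(ord(nfd[-1]))) # 0x301, the combining acute accent

# Strip accents: decompose, then drop all combining marks
stripped = "".join(c for c in nfd if not unicodedata.combining(c))
print(stripped)  # cafe
```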
### 8.5.3 Encoding Systems in Detail

```python
def demonstrate_encoding():
    """Show how the same text encodes under different systems."""
    text = "Hello 你好"

    print("UTF-8 (variable length, 1-4 bytes):")
    utf8_bytes = text.encode("utf-8")
    print(f"  bytes: {len(utf8_bytes)}")
    print(f"  hex:   {utf8_bytes.hex()}")

    print("\nUTF-16 (2 or 4 bytes per character):")
    utf16_bytes = text.encode("utf-16")
    print(f"  bytes: {len(utf16_bytes)}")
    print(f"  hex:   {utf16_bytes.hex()}")

    print("\nGBK (Chinese encoding):")
    gbk_bytes = text.encode("gbk")
    print(f"  bytes: {len(gbk_bytes)}")
    print(f"  hex:   {gbk_bytes.hex()}")

demonstrate_encoding()
```
| Encoding | Characteristics | Typical use |
|----------|-----------------|-------------|
| UTF-8 | Variable length (1-4 bytes), ASCII-compatible | Web, cross-platform, recommended default |
| UTF-16 | 2 or 4 bytes per character, BOM marker | Windows internals, Java |
| GBK | Chinese encoding, 2 bytes per Chinese character | Legacy Chinese Windows systems |
| ASCII | Single byte, only 128 characters | English-only environments |
| Latin-1 | Single byte, 256 characters | Western European languages |
### 8.5.4 BOM and Byte Order

```python
def demonstrate_bom():
    """Show the byte order mark (BOM) under different encodings."""
    text = "Hello"

    utf8_bom = text.encode("utf-8-sig")  # UTF-8 with BOM
    utf16_le = text.encode("utf-16-le")  # little-endian, no BOM
    utf16_be = text.encode("utf-16-be")  # big-endian, no BOM

    print(f"UTF-8 with BOM: {utf8_bom[:3].hex()} (BOM: efbbbf)")
    print(f"UTF-16 LE: {utf16_le[:2].hex()} (little-endian)")
    print(f"UTF-16 BE: {utf16_be[:2].hex()} (big-endian)")

demonstrate_bom()
```
**Academic note**: the BOM (Byte Order Mark) is a special marker at the start of Unicode-encoded data that indicates byte order. UTF-8 does not need a BOM, but Windows programs often add one anyway. UTF-16/32 must deal with byte order explicitly.
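When reading files that may carry a UTF-8 BOM (common for files saved by Windows tools), decoding with the `utf-8-sig` codec strips it transparently, while plain `utf-8` leaves an invisible `\ufeff` at the start:

```python
# Simulate file content that starts with a BOM
data = "\ufeffHello".encode("utf-8")

print(repr(data.decode("utf-8")))      # '\ufeffHello' -- BOM survives
print(repr(data.decode("utf-8-sig")))  # 'Hello' -- BOM stripped

print(data.decode("utf-8") == "Hello")      # False
print(data.decode("utf-8-sig") == "Hello")  # True
```

The stray `\ufeff` is a classic cause of "identical" strings that fail to compare equal.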
## 8.6 Regular Expressions: Advanced

### 8.6.1 Common Patterns

```python
import re

class RegexPatterns:
    """A collection of commonly used regular expression patterns."""

    EMAIL = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    PHONE_CN = r'^1[3-9]\d{9}$'  # mainland China mobile number
    IP_ADDRESS = r'^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$'
    URL = r'^https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b[-a-zA-Z0-9()@:%_\+.~#?&/=]*$'
    DATE_ISO = r'^\d{4}-\d{2}-\d{2}$'
    TIME_24H = r'^([01]\d|2[0-3]):[0-5]\d(:[0-5]\d)?$'
    HEX_COLOR = r'^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$'

    @staticmethod
    def validate(pattern: str, text: str) -> bool:
        return bool(re.match(pattern, text))

print(RegexPatterns.validate(RegexPatterns.EMAIL, "user@example.com"))  # True
print(RegexPatterns.validate(RegexPatterns.IP_ADDRESS, "192.168.1.1"))  # True
```
### 8.6.2 Advanced Matching Techniques

```python
import re
import sys

def advanced_regex_techniques():
    """Advanced regular expression techniques."""
    # Optional character: 'colou?r' matches both spellings
    text = "color colour"
    print(re.findall(r"colou?r", text))  # ['color', 'colour']

    # Atomic groups (?>...) forbid backtracking inside the group, which
    # prevents catastrophic backtracking on near-miss inputs.
    # Supported by the re module only since Python 3.11.
    text = "a" * 100 + "!"
    if sys.version_info >= (3, 11):
        print(re.match(r"(?>a+)+!", text) is not None)  # True

    # Python's re module has no recursion, so truly balanced parentheses
    # cannot be matched by a regex; use a counter instead:
    text = "((a+b)*c)"
    depth = 0
    for char in text:
        if char == '(':
            depth += 1
        elif char == ')':
            depth -= 1
    print(f"final bracket depth: {depth}")  # 0 means balanced

advanced_regex_techniques()
```
### 8.6.3 Performance Optimization

```python
import re
import timeit

def regex_performance():
    """Regular expression performance tips."""
    text = "The quick brown fox jumps over the lazy dog " * 1000

    pattern = re.compile(r"\b\w{4}\b")

    def with_compile():
        return pattern.findall(text)

    def without_compile():
        # re caches compiled patterns internally, but the cache lookup
        # still costs time on every call
        return re.findall(r"\b\w{4}\b", text)

    print(f"precompiled: {timeit.timeit(with_compile, number=100):.4f}s")
    print(f"on the fly:  {timeit.timeit(without_compile, number=100):.4f}s")

    # Catastrophic backtracking: never run a nested quantifier like
    # (a+)+b against input that almost matches, e.g. "a" * 30 -- the
    # engine explores ~2^30 paths. Prefer the equivalent a+b.
    bad_pattern = r"(a+)+b"
    good_pattern = r"a+b"

regex_performance()
```
### 8.6.4 Practical Regex Utilities

```python
import re

def extract_urls(text: str) -> list[str]:
    """Extract URLs from text."""
    pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    return re.findall(pattern, text)

def extract_emails(text: str) -> list[str]:
    """Extract email addresses from text."""
    pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
    return re.findall(pattern, text)

def extract_hashtags(text: str) -> list[str]:
    """Extract social media hashtags."""
    return re.findall(r'#\w+', text)

def extract_mentions(text: str) -> list[str]:
    """Extract @mentions (the lookbehind avoids matching the
    '@example' inside email addresses)."""
    return re.findall(r'(?<!\w)@\w+', text)

def clean_text(text: str) -> str:
    """Clean text: collapse whitespace and drop special characters."""
    text = re.sub(r'\s+', ' ', text)
    # Keep word characters, whitespace, CJK ideographs, and basic punctuation
    text = re.sub(r'[^\w\s\u4e00-\u9fff.,!?;:\-()]', '', text)
    return text.strip()

def tokenize(text: str) -> list[str]:
    """Naive word tokenizer."""
    return re.findall(r'\b\w+\b', text.lower())

sample = """
Check out https://example.com and email me at user@example.com
#Python #Regex @developer
"""
print(extract_urls(sample))      # ['https://example.com']
print(extract_emails(sample))    # ['user@example.com']
print(extract_hashtags(sample))  # ['#Python', '#Regex']
print(extract_mentions(sample))  # ['@developer']
```
## 8.7 Recent Developments

### 8.7.1 f-string Enhancements (PEP 701)

Python 3.12 lifted many long-standing restrictions on f-strings:
```python
# Python 3.12+ (PEP 701); func and calculate are placeholders

# Quote reuse: the same quote character may appear inside the expression
f"{func(arg='value')}"

# Multi-line expressions and comments inside the braces
f"""Result: {
    calculate(
        x=1,
        y=2,
    )
}"""

# Arbitrarily nested f-strings
f"{f'{x=}'}"
```
### 8.7.2 New String Methods

```python
# Python 3.9+
"example.txt".removeprefix("example.")  # 'txt'
"example.txt".removesuffix(".txt")      # 'example'

# Python 3.11+: configurable cap on int <-> str conversion length
# (DoS mitigation, CVE-2020-10735)
import sys
sys.set_int_max_str_digits(4300)
```
### 8.7.3 Text-Processing Libraries

```python
# The third-party 'regex' module (pip install regex) supports
# Unicode property classes that the standard re module lacks
import regex
pattern = regex.compile(r"\p{Script=Han}+")   # runs of Han characters
matches = pattern.findall("中文English混合")   # ['中文', '混合']

# The standard library covers Unicode normalization
import unicodedata
normalized = unicodedata.normalize("NFC", "café")
```
## 8.8 Chapter Summary

This chapter covered Python's string-processing toolkit end to end:

- **String basics**: Unicode nature, immutability, indexing and slicing
- **String methods**: searching, replacing, splitting, joining, predicates, alignment
- **Formatting**: f-strings (recommended), format(), %-formatting
- **Regular expressions**: metacharacters, quantifiers, groups, assertions, compile flags
- **Character encoding**: UTF-8 encoding/decoding, Unicode normalization, BOM handling
- **Advanced regex**: common patterns, performance tuning, utility functions

### 8.8.1 Best Practices

| Scenario | Recommended approach | Reason |
|----------|---------------------|--------|
| Simple replacement | str.replace() | Concise and efficient |
| Pattern matching | re module | Flexible and powerful |
| Formatted output | f-strings | Readable and fast |
| Bulk concatenation | "".join() | O(n) complexity |
| Text comparison | unicodedata.normalize() | Handles Unicode equivalence |
### 8.8.2 Common Pitfalls

```python
import re

# Pitfall 1: concatenation in a loop (O(n²))
result = ""
for i in range(10000):
    result += str(i)
# Better:
result = "".join(str(i) for i in range(10000))

# Pitfall 2: relying on the platform default encoding
with open("file.txt", "r") as f:  # encoding depends on the OS
    content = f.read()
# Better: be explicit
with open("file.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Pitfall 3: greedy quantifiers grabbing too much
text = "<div>content</div><div>more</div>"
re.findall(r"<div>.*</div>", text)   # one match spanning both divs
re.findall(r"<div>.*?</div>", text)  # lazy: two separate matches
```
## 8.9 Exercises

### Basic

1. Write a program that counts how many times each character appears in a string.
2. Validate an email address format with a regular expression.
3. Capitalize the first letter of every word in a string (without using title()).
### Intermediate

4. Implement a simple template engine that supports {{ variable }} substitution.
5. Use regular expressions to extract attributes and content from HTML tags.
6. Write a function that converts camelCase names to snake_case (e.g. userName → user_name).
### Project

**Log parser**: write a program that:

- parses Apache/Nginx access logs with regular expressions
- extracts the IP address, timestamp, request method, URL, status code, and response size
- counts occurrences of each status code
- reports the 10 most frequently seen client IPs
- supports filtering by time range
- produces a structured report

### Discussion

1. Why are Python strings immutable? What benefits does this design decision bring?
2. How are nested quotes handled inside f-strings? Why are f-strings faster than %-formatting and format()?
3. What is the difference between Unicode NFC and NFD normalization? In what situations does it require special attention?
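As a starting point for the log-parser project above (a sketch, not a full solution — the sample line and group names are illustrative, assuming the Common Log Format):

```python
import re

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '127.0.0.1 - - [15/Jan/2024:10:30:00 +0800] "GET /index.html HTTP/1.1" 200 1024'
m = LOG_PATTERN.match(line)
if m:
    print(m.group("ip"), m.group("method"), m.group("url"), m.group("status"))
```

From here, counting status codes and top IPs is a natural fit for `collections.Counter`.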
## 8.10 Further Reading

### 8.10.1 Unicode and Encodings

### 8.10.2 Regular Expressions

### 8.10.3 Text Processing

### 8.10.4 Internationalization and Localization

Next chapter: Chapter 9 — File Operations