# Chapter 8: String Processing

## Learning Objectives

After completing this chapter, readers will be able to:

- Understand the Unicode nature and immutability of Python strings
- Use string methods fluently for searching, replacing, splitting, and formatting
- Master regular expression syntax and Python's re module
- Understand character encoding systems (ASCII, UTF-8, GBK) and encoding/decoding operations
- Apply string processing techniques to solve real-world text processing problems

## 8.1 String Basics

### 8.1.1 Creating Strings

```python
single = 'Hello'
double = "World"
multi = """This is a
multi-line
string"""
raw = r"C:\Users\name\file.txt"   # raw string: backslashes are literal
unicode_str = "你好,世界!🌍"       # any Unicode character is allowed
```
**Academic note**: Python 3's `str` type is a Unicode string; each character is a Unicode code point. The underlying implementation uses a flexible internal representation: pure-ASCII and pure-Latin-1 strings use 1 byte per character, strings whose characters all fit in the BMP use 2 bytes per character, and strings containing supplementary characters use 4 bytes per character. This optimization is specified by PEP 393 — Flexible String Representation.
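The effect of PEP 393 can be observed with `sys.getsizeof`; the exact numbers are CPython implementation details and vary by version, but the per-character cost grows with the widest code point in the string:

```python
import sys

# Same length (100 characters), different internal width (PEP 393)
ascii_s = "a" * 100    # 1 byte per character
bmp_s = "中" * 100     # 2 bytes per character (BMP code point)
astral_s = "🌍" * 100  # 4 bytes per character (supplementary plane)

print(sys.getsizeof(ascii_s))
print(sys.getsizeof(bmp_s))
print(sys.getsizeof(astral_s))
```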
### 8.1.2 Indexing and Slicing

```python
text = "Python"
print(text[0])     # P
print(text[-1])    # n
print(text[1:4])   # yth
print(text[::-1])  # nohtyP (reversed)
print(text[::2])   # Pto (every second character)
```
### 8.1.3 String Immutability

```python
text = "Python"
# text[0] = "J" would raise TypeError: strings cannot be modified in place
new_text = "J" + text[1:]          # build a new string instead
new_text = text.replace("P", "J")  # or use replace()
```
**Engineering practice**: when concatenating many strings, use `"".join(parts)` rather than `+=`. Each `+=` creates a new string object, giving O(n²) total time; `join()` allocates memory once, giving O(n).

```python
# Slow: O(n²) in the worst case
result = ""
for word in words:
    result += word

# Fast: O(n)
result = "".join(words)
```
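A quick micro-benchmark of the two approaches (absolute numbers depend on the machine, and CPython's in-place `+=` optimization can mask the quadratic behavior for simple cases, so treat this as a sketch):

```python
import timeit

words = ["word"] * 10_000

def concat_plus():
    result = ""
    for w in words:
        result += w
    return result

def concat_join():
    return "".join(words)

print(f"+= loop: {timeit.timeit(concat_plus, number=100):.4f}s")
print(f"join():  {timeit.timeit(concat_join, number=100):.4f}s")
```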
## 8.2 String Methods

### 8.2.1 Case and Whitespace

```python
text = "Hello World"
print(text.lower())       # hello world
print(text.upper())       # HELLO WORLD
print(text.title())       # Hello World
print(text.capitalize())  # Hello world
print(text.swapcase())    # hELLO wORLD

text = "  hello  "
print(text.strip())             # 'hello'
print(text.lstrip())            # 'hello  '
print(text.rstrip())            # '  hello'
print("xxhelloxx".strip("x"))   # hello
```
### 8.2.2 Searching and Replacing

```python
text = "Hello World World"
print(text.find("World"))    # 6   (index of first match)
print(text.rfind("World"))   # 12  (index of last match)
print(text.find("Python"))   # -1  (not found)
print(text.index("World"))   # 6   (like find, but raises ValueError if absent)
print(text.count("World"))   # 2

print(text.replace("World", "Python"))     # Hello Python Python
print(text.replace("World", "Python", 1))  # Hello Python World (at most 1 replacement)
```
### 8.2.3 Splitting and Joining

```python
print("apple,banana,cherry".split(","))    # ['apple', 'banana', 'cherry']
print("one two three".split())             # ['one', 'two', 'three']
print("line1\nline2\nline3".splitlines())  # ['line1', 'line2', 'line3']

print(" ".join(["Hello", "World"]))        # Hello World
print("-".join(["2024", "01", "15"]))      # 2024-01-15

print("Hello World Python".partition(" "))   # ('Hello', ' ', 'World Python')
print("Hello World Python".rpartition(" "))  # ('Hello World', ' ', 'Python')
```
### 8.2.4 Predicate Methods

```python
print("123".isdigit())           # True
print("abc".isalpha())           # True
print("abc123".isalnum())        # True
print("   ".isspace())           # True
print("hello".islower())         # True
print("HELLO".isupper())         # True
print("Hello World".istitle())   # True
print("Hello".startswith("He"))  # True
print("Hello".endswith("lo"))    # True
```
### 8.2.5 Alignment and Padding

```python
text = "Python"
print(text.center(20, "-"))  # -------Python-------
print(text.ljust(20, "*"))   # Python**************
print(text.rjust(20, "*"))   # **************Python
print("42".zfill(8))         # 00000042
```
## 8.3 String Formatting

### 8.3.1 f-strings (recommended)

```python
name = "Alice"
age = 25
pi = 3.14159

print(f"Name: {name}, Age: {age}")
print(f"Pi: {pi:.2f}")            # Pi: 3.14
print(f"Number: {42:08b}")        # binary, zero-padded: 00101010
print(f"Hex: {255:x}")            # ff
print(f"Thousands: {1234567:,}")  # 1,234,567
print(f"Percent: {0.856:.1%}")    # 85.6%

print(f"|{name:>10}|")  # right-aligned
print(f"|{name:<10}|")  # left-aligned
print(f"|{name:^10}|")  # centered

print(f"Result: {2 ** 10}")      # expressions are allowed: 1024
print(f"Upper: {name.upper()}")  # method calls too: ALICE

x = 42
print(f"{x = }")      # debugging form (3.8+): x = 42
print(f"{x = :08b}")  # x = 00101010

from datetime import datetime
now = datetime.now()
print(f"{now:%Y-%m-%d %H:%M:%S}")  # datetime format specifiers
```
### 8.3.2 format() and %-formatting

```python
# str.format()
print("Name: {}, Age: {}".format("Alice", 25))
print("Name: {0}, Age: {1}, Again: {0}".format("Alice", 25))
print("Name: {name}, Age: {age}".format(name="Alice", age=25))

# %-style formatting (legacy)
print("Name: %s, Age: %d" % ("Alice", 25))
```
**Engineering practice**: prefer f-strings — they are the most concise, the most readable, and the fastest. Use the `format()` method when you need a dynamic template; use `%`-formatting only when maintaining legacy code.
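A rough micro-benchmark of the three styles. Absolute numbers depend on the interpreter; f-strings are typically fastest because the formatting is compiled into bytecode rather than parsed from a template string at runtime:

```python
import timeit

name, age = "Alice", 25

fstring = timeit.timeit(lambda: f"Name: {name}, Age: {age}", number=100_000)
fmt = timeit.timeit(lambda: "Name: {}, Age: {}".format(name, age), number=100_000)
percent = timeit.timeit(lambda: "Name: %s, Age: %d" % (name, age), number=100_000)

print(f"f-string: {fstring:.4f}s")
print(f"format(): {fmt:.4f}s")
print(f"%-style:  {percent:.4f}s")
```

All three produce identical output; only the speed and readability differ.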
## 8.4 Regular Expressions

### 8.4.1 Basic Matching

```python
import re

text = "Hello World 123"

match = re.search(r"\d+", text)  # search anywhere in the string
if match:
    print(match.group())  # 123
    print(match.span())   # (12, 15)

print(re.findall(r"\d+", "abc 123 def 456"))  # ['123', '456']

print(re.match(r"Hello", text))  # matches only at the start of the string
print(re.match(r"World", text))  # None

print(re.sub(r"\d+", "[NUM]", "abc 123 def 456"))  # abc [NUM] def [NUM]
print(re.split(r"\s+", "Hello World Python"))      # ['Hello', 'World', 'Python']
```
### 8.4.2 Metacharacters and Character Classes

```python
import re

text = "Hello World 123 !@#"
print(re.findall(r"[aeiou]", text))   # vowels: ['e', 'o', 'o']
print(re.findall(r"[A-Z]", text))     # uppercase letters: ['H', 'W']
print(re.findall(r"[0-9]", text))     # digits: ['1', '2', '3']
print(re.findall(r"[^aeiou]", text))  # negated class: everything except vowels
```
### 8.4.3 Quantifiers

```python
import re

text = "a aa aaa aaaa"
print(re.findall(r"a+", text))   # greedy: ['a', 'aa', 'aaa', 'aaaa']
print(re.findall(r"a+?", text))  # lazy: matches one 'a' at a time
```
### 8.4.4 Groups and Capturing

```python
import re

text = "John: 25, Alice: 30, Bob: 35"
pattern = r"(\w+): (\d+)"
for match in re.finditer(pattern, text):
    print(f"Name: {match.group(1)}, Age: {match.group(2)}")

# Named groups
text = "name@example.com"
pattern = r"(?P<username>\w+)@(?P<domain>\w+\.\w+)"
match = re.search(pattern, text)
print(match.group("username"))  # name
print(match.group("domain"))    # example.com

# Non-capturing group (?:...)
pattern = r"(?:\d{3})-(\d{4})"
match = re.search(pattern, "Tel: 010-1234")
print(match.group(1))  # 1234 (the non-capturing group is not numbered)

# Backreference \1 repeats what group 1 matched
print(re.findall(r"(\w+) \1", "hello hello world world"))  # ['hello', 'world']
```
### 8.4.5 Assertions

```python
import re

text = "abc123def456"
print(re.findall(r"\d+(?=def)", text))   # lookahead: ['123']
print(re.findall(r"\d+(?!def)", text))   # negative lookahead: ['12', '456']
print(re.findall(r"(?<=abc)\d+", text))  # lookbehind: ['123']
print(re.findall(r"\bword\b", "a word wordword"))  # word boundaries: ['word']
```
### 8.4.6 Compiling and Flags

```python
import re

pattern = re.compile(r"\d+")  # compile once, reuse many times
print(pattern.findall("abc 123 def 456"))  # ['123', '456']

# re.VERBOSE allows whitespace and comments inside the pattern
pattern = re.compile(r"""
    \d{3}  # area code
    -      # separator
    \d{4}  # number
""", re.VERBOSE)
```
## 8.5 Character Encoding

### 8.5.1 Encoding and Decoding

```python
text = "你好,世界!"
encoded = text.encode("utf-8")     # str -> bytes
print(type(encoded))               # <class 'bytes'>
decoded = encoded.decode("utf-8")  # bytes -> str

# Error-handling strategies
text = "你好"
print(text.encode("ascii", errors="ignore"))   # b''
print(text.encode("ascii", errors="replace"))  # b'??'
print(text.encode("ascii", errors="xmlcharrefreplace"))  # b'&#20320;&#22909;'
```
### 8.5.2 Unicode Normalization

```python
import unicodedata

s1 = "caf\u00e9"   # é as a single precomposed character
s2 = "cafe\u0301"  # e followed by a combining acute accent
print(s1 == s2)          # False
print(len(s1), len(s2))  # 4 5

normalized = unicodedata.normalize("NFC", s2)
print(s1 == normalized)  # True
```
**Academic note**: the same Unicode character can have multiple encoded representations. NFC (Normalization Form C) prefers precomposed characters; NFD prefers decomposed forms. Normalize before comparing text, otherwise strings that look identical may compare unequal.
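Going the other direction, NFD decomposition makes the combining mark visible, which also enables a common trick for stripping accents:

```python
import unicodedata

s = "caf\u00e9"  # precomposed é (U+00E9)
nfd = unicodedata.normalize("NFD", s)
print(len(s), len(nfd))  # 4 5 -- the NFD form has one extra code point
print(hex(ord(nfd[-1]))) # 0x301, the combining acute accent

# Strip accents: decompose, then drop all combining marks
stripped = "".join(c for c in nfd if not unicodedata.combining(c))
print(stripped)  # cafe
```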
### 8.5.3 Encoding Systems in Detail

```python
def demonstrate_encoding():
    """Show how the same text encodes under different systems."""
    text = "Hello 你好"

    print("UTF-8 (variable length, 1-4 bytes):")
    utf8_bytes = text.encode("utf-8")
    print(f"  bytes: {len(utf8_bytes)}")
    print(f"  hex:   {utf8_bytes.hex()}")

    print("\nUTF-16 (2 or 4 bytes per character):")
    utf16_bytes = text.encode("utf-16")
    print(f"  bytes: {len(utf16_bytes)}")
    print(f"  hex:   {utf16_bytes.hex()}")

    print("\nGBK (Chinese encoding):")
    gbk_bytes = text.encode("gbk")
    print(f"  bytes: {len(gbk_bytes)}")
    print(f"  hex:   {gbk_bytes.hex()}")

demonstrate_encoding()
```
| Encoding | Characteristics | Typical use |
|----------|-----------------|-------------|
| UTF-8 | Variable length (1-4 bytes), ASCII-compatible | Web, cross-platform, recommended default |
| UTF-16 | 2 or 4 bytes per character, BOM marker | Windows internals, Java |
| GBK | Chinese encoding, 2 bytes per Chinese character | Legacy Chinese Windows systems |
| ASCII | Single byte, only 128 characters | English-only environments |
| Latin-1 | Single byte, 256 characters | Western European languages |
### 8.5.4 BOM and Byte Order

```python
def demonstrate_bom():
    """Show the byte order mark (BOM) under different encodings."""
    text = "Hello"

    utf8_bom = text.encode("utf-8-sig")  # UTF-8 with BOM
    utf16_le = text.encode("utf-16-le")  # little-endian, no BOM
    utf16_be = text.encode("utf-16-be")  # big-endian, no BOM

    print(f"UTF-8 with BOM: {utf8_bom[:3].hex()} (BOM: efbbbf)")
    print(f"UTF-16 LE: {utf16_le[:2].hex()} (little-endian)")
    print(f"UTF-16 BE: {utf16_be[:2].hex()} (big-endian)")

demonstrate_bom()
```
**Academic note**: the BOM (Byte Order Mark) is a special marker at the start of Unicode-encoded data that indicates byte order. UTF-8 does not need a BOM, but Windows programs often add one anyway. UTF-16/32 must deal with byte order explicitly.
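When reading files that may carry a UTF-8 BOM (common for files saved by Windows tools), decoding with the `utf-8-sig` codec strips it transparently, while plain `utf-8` leaves an invisible `\ufeff` at the start:

```python
# Simulate file content that starts with a BOM
data = "\ufeffHello".encode("utf-8")

print(repr(data.decode("utf-8")))      # '\ufeffHello' -- BOM survives
print(repr(data.decode("utf-8-sig")))  # 'Hello' -- BOM stripped

print(data.decode("utf-8") == "Hello")      # False
print(data.decode("utf-8-sig") == "Hello")  # True
```

The stray `\ufeff` is a classic cause of "identical" strings that fail to compare equal.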
## 8.6 Regular Expressions: Advanced

### 8.6.1 Common Patterns

```python
import re

class RegexPatterns:
    """A collection of commonly used regular expression patterns."""

    EMAIL = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    PHONE_CN = r'^1[3-9]\d{9}$'  # mainland China mobile number
    IP_ADDRESS = r'^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$'
    URL = r'^https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b[-a-zA-Z0-9()@:%_\+.~#?&/=]*$'
    DATE_ISO = r'^\d{4}-\d{2}-\d{2}$'
    TIME_24H = r'^([01]\d|2[0-3]):[0-5]\d(:[0-5]\d)?$'
    HEX_COLOR = r'^#?([a-fA-F0-9]{6}|[a-fA-F0-9]{3})$'

    @staticmethod
    def validate(pattern: str, text: str) -> bool:
        return bool(re.match(pattern, text))

print(RegexPatterns.validate(RegexPatterns.EMAIL, "user@example.com"))  # True
print(RegexPatterns.validate(RegexPatterns.IP_ADDRESS, "192.168.1.1"))  # True
```
### 8.6.2 Advanced Matching Techniques

```python
import re
import sys

def advanced_regex_techniques():
    """Advanced regular expression techniques."""
    # Optional character: 'colou?r' matches both spellings
    text = "color colour"
    print(re.findall(r"colou?r", text))  # ['color', 'colour']

    # Atomic groups (?>...) forbid backtracking inside the group, which
    # prevents catastrophic backtracking on near-miss inputs.
    # Supported by the re module only since Python 3.11.
    text = "a" * 100 + "!"
    if sys.version_info >= (3, 11):
        print(re.match(r"(?>a+)+!", text) is not None)  # True

    # Python's re module has no recursion, so truly balanced parentheses
    # cannot be matched by a regex; use a counter instead:
    text = "((a+b)*c)"
    depth = 0
    for char in text:
        if char == '(':
            depth += 1
        elif char == ')':
            depth -= 1
    print(f"final bracket depth: {depth}")  # 0 means balanced

advanced_regex_techniques()
```
### 8.6.3 Performance Optimization

```python
import re
import timeit

def regex_performance():
    """Regular expression performance tips."""
    text = "The quick brown fox jumps over the lazy dog " * 1000

    pattern = re.compile(r"\b\w{4}\b")

    def with_compile():
        return pattern.findall(text)

    def without_compile():
        # re caches compiled patterns internally, but the cache lookup
        # still costs time on every call
        return re.findall(r"\b\w{4}\b", text)

    print(f"precompiled: {timeit.timeit(with_compile, number=100):.4f}s")
    print(f"on the fly:  {timeit.timeit(without_compile, number=100):.4f}s")

    # Catastrophic backtracking: never run a nested quantifier like
    # (a+)+b against input that almost matches, e.g. "a" * 30 -- the
    # engine explores ~2^30 paths. Prefer the equivalent a+b.
    bad_pattern = r"(a+)+b"
    good_pattern = r"a+b"

regex_performance()
```
### 8.6.4 Practical Regex Utilities

```python
import re

def extract_urls(text: str) -> list[str]:
    """Extract URLs from text."""
    pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'
    return re.findall(pattern, text)

def extract_emails(text: str) -> list[str]:
    """Extract email addresses from text."""
    pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
    return re.findall(pattern, text)

def extract_hashtags(text: str) -> list[str]:
    """Extract social media hashtags."""
    return re.findall(r'#\w+', text)

def extract_mentions(text: str) -> list[str]:
    """Extract @mentions (the lookbehind avoids matching the
    '@example' inside email addresses)."""
    return re.findall(r'(?<!\w)@\w+', text)

def clean_text(text: str) -> str:
    """Clean text: collapse whitespace and drop special characters."""
    text = re.sub(r'\s+', ' ', text)
    # Keep word characters, whitespace, CJK ideographs, and basic punctuation
    text = re.sub(r'[^\w\s\u4e00-\u9fff.,!?;:\-()]', '', text)
    return text.strip()

def tokenize(text: str) -> list[str]:
    """Naive word tokenizer."""
    return re.findall(r'\b\w+\b', text.lower())

sample = """
Check out https://example.com and email me at user@example.com
#Python #Regex @developer
"""
print(extract_urls(sample))      # ['https://example.com']
print(extract_emails(sample))    # ['user@example.com']
print(extract_hashtags(sample))  # ['#Python', '#Regex']
print(extract_mentions(sample))  # ['@developer']
```
## 8.7 Recent Developments

### 8.7.1 f-string Enhancements (PEP 701)

Python 3.12 lifted many long-standing restrictions on f-strings:
```python
# Python 3.12+ (PEP 701); func and calculate are placeholders

# Quote reuse: the same quote character may appear inside the expression
f"{func(arg='value')}"

# Multi-line expressions and comments inside the braces
f"""Result: {
    calculate(
        x=1,
        y=2,
    )
}"""

# Arbitrarily nested f-strings
f"{f'{x=}'}"
```
### 8.7.2 New String Methods

```python
# Python 3.9+
"example.txt".removeprefix("example.")  # 'txt'
"example.txt".removesuffix(".txt")      # 'example'

# Python 3.11+: configurable cap on int <-> str conversion length
# (DoS mitigation, CVE-2020-10735)
import sys
sys.set_int_max_str_digits(4300)
```
### 8.7.3 Text-Processing Libraries

```python
# The third-party 'regex' module (pip install regex) supports
# Unicode property classes that the standard re module lacks
import regex
pattern = regex.compile(r"\p{Script=Han}+")   # runs of Han characters
matches = pattern.findall("中文English混合")   # ['中文', '混合']

# The standard library covers Unicode normalization
import unicodedata
normalized = unicodedata.normalize("NFC", "café")
```
## 8.8 Chapter Summary

This chapter covered Python's string-processing toolkit end to end:

- **String basics**: Unicode nature, immutability, indexing and slicing
- **String methods**: searching, replacing, splitting, joining, predicates, alignment
- **Formatting**: f-strings (recommended), format(), %-formatting
- **Regular expressions**: metacharacters, quantifiers, groups, assertions, compile flags
- **Character encoding**: UTF-8 encoding/decoding, Unicode normalization, BOM handling
- **Advanced regex**: common patterns, performance tuning, utility functions

### 8.8.1 Best Practices

| Scenario | Recommended approach | Reason |
|----------|---------------------|--------|
| Simple replacement | str.replace() | Concise and efficient |
| Pattern matching | re module | Flexible and powerful |
| Formatted output | f-strings | Readable and fast |
| Bulk concatenation | "".join() | O(n) complexity |
| Text comparison | unicodedata.normalize() | Handles Unicode equivalence |
### 8.8.2 Common Pitfalls

```python
import re

# Pitfall 1: concatenation in a loop (O(n²))
result = ""
for i in range(10000):
    result += str(i)
# Better:
result = "".join(str(i) for i in range(10000))

# Pitfall 2: relying on the platform default encoding
with open("file.txt", "r") as f:  # encoding depends on the OS
    content = f.read()
# Better: be explicit
with open("file.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Pitfall 3: greedy quantifiers grabbing too much
text = "<div>content</div><div>more</div>"
re.findall(r"<div>.*</div>", text)   # one match spanning both divs
re.findall(r"<div>.*?</div>", text)  # lazy: two separate matches
```
## 8.9 Exercises

### Basic

1. Write a program that counts how many times each character appears in a string.
2. Validate an email address format with a regular expression.
3. Capitalize the first letter of every word in a string (without using title()).
### Intermediate

4. Implement a simple template engine that supports {{ variable }} substitution.
5. Use regular expressions to extract attributes and content from HTML tags.
6. Write a function that converts camelCase names to snake_case (e.g. userName → user_name).
### Project

**Log parser**: write a program that:

- parses Apache/Nginx access logs with regular expressions
- extracts the IP address, timestamp, request method, URL, status code, and response size
- counts occurrences of each status code
- reports the 10 most frequently seen client IPs
- supports filtering by time range
- produces a structured report

### Discussion

1. Why are Python strings immutable? What benefits does this design decision bring?
2. How are nested quotes handled inside f-strings? Why are f-strings faster than %-formatting and format()?
3. What is the difference between Unicode NFC and NFD normalization? In what situations does it require special attention?
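As a starting point for the log-parser project above (a sketch, not a full solution — the sample line and group names are illustrative, assuming the Common Log Format):

```python
import re

# Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '127.0.0.1 - - [15/Jan/2024:10:30:00 +0800] "GET /index.html HTTP/1.1" 200 1024'
m = LOG_PATTERN.match(line)
if m:
    print(m.group("ip"), m.group("method"), m.group("url"), m.group("status"))
```

From here, counting status codes and top IPs is a natural fit for `collections.Counter`.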
## 8.10 Further Reading

### 8.10.1 Unicode and Encodings

### 8.10.2 Regular Expressions

### 8.10.3 Text Processing

### 8.10.4 Internationalization and Localization

Next chapter: Chapter 9 — File Operations