第7章字典与集合

学习目标

完成本章学习后，读者将能够：

理解字典的哈希表实现原理及其对性能的影响
熟练运用字典的各种方法与特殊字典类型（defaultdict、Counter）
掌握集合运算及其在数据去重、成员检测中的应用
理解可哈希性（Hashable）的概念及其对字典键和集合元素的限制
运用字典与集合解决实际数据处理问题

7.1 字典基础

7.1.1 创建字典

# 字面量创建
person = {"name": "Alice", "age": 25, "city": "北京"}

# 构造函数
person = dict(name="Alice", age=25, city="北京")

# 从键值对列表
person = dict([("name", "Alice"), ("age", 25)])

# 从两个列表
keys = ["name", "age", "city"]
values = ["Alice", 25, "北京"]
person = dict(zip(keys, values))

# fromkeys：统一默认值
defaults = dict.fromkeys(["host", "port", "db"], "unknown")

# 字典推导式
squares = {x: x ** 2 for x in range(5)}  # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

学术注记：Python 3.7+字典保证插入顺序（CPython 3.6已实现，3.7正式纳入语言规范）。底层使用紧凑数组+哈希索引的混合结构，相比旧实现节省约20-25%内存。字典的查找、插入、删除均摊时间复杂度为O(1)。

7.1.2 访问元素

person = {"name": "Alice", "age": 25, "city": "北京"}

# 直接索引（键不存在抛出KeyError）
print(person["name"])          # "Alice"

# get方法（键不存在返回默认值）
print(person.get("email"))            # None
print(person.get("email", "未设置"))    # "未设置"

# 遍历
for key in person:                       # 遍历键
    print(key)
for key, value in person.items():        # 遍历键值对
    print(f"{key}: {value}")
for value in person.values():            # 遍历值
    print(value)

# 成员检测（检测键，O(1)）
print("name" in person)        # True
print("Alice" in person)       # False - 检测的是键，不是值

7.1.3 修改字典

person = {"name": "Alice", "age": 25}

# 添加/修改
person["email"] = "alice@example.com"    # 添加
person["age"] = 26                       # 修改

# 批量更新
person.update({"phone": "123456", "city": "上海"})
person.update(age=27)

# setdefault：键不存在时设置默认值
person.setdefault("country", "中国")     # 添加
person.setdefault("name", "Bob")         # 已存在，不修改

# 合并（Python 3.9+）
dict1 = {"a": 1, "b": 2}
dict2 = {"b": 3, "c": 4}
merged = dict1 | dict2          # {"a": 1, "b": 3, "c": 4}
dict1 |= dict2                  # 原地合并

7.1.4 删除元素

person = {"name": "Alice", "age": 25, "city": "北京", "email": "a@b.com"}

del person["email"]                     # 删除指定键
age = person.pop("age")                 # 删除并返回值
city = person.pop("city", "未知")        # 带默认值
key, value = person.popitem()           # 删除并返回最后插入的键值对
person.clear()                          # 清空

7.2 特殊字典

7.2.1 defaultdict

from collections import defaultdict

# 自动初始化缺失键
word_counts = defaultdict(int)
for word in "the quick brown fox jumps over the lazy dog".split():
    word_counts[word] += 1

# 分组
groups = defaultdict(list)
for name, dept in [("Alice", "Eng"), ("Bob", "Sales"), ("Charlie", "Eng")]:
    groups[dept].append(name)
print(groups)  # {'Eng': ['Alice', 'Charlie'], 'Sales': ['Bob']}

# 嵌套defaultdict
tree = lambda: defaultdict(tree)
config = tree()
config["database"]["host"] = "localhost"
config["database"]["port"] = 5432

7.2.2 Counter

from collections import Counter

# 创建
text = "hello world"
counter = Counter(text)
print(counter)               # Counter({'l': 3, 'o': 2, 'h': 1, 'e': 1, ...})

# 最常见元素
print(counter.most_common(2))  # [('l', 3), ('o', 2)]

# 算术运算
c1 = Counter("aabbcc")
c2 = Counter("bbccdd")
print(c1 + c2)    # 加法：Counter({'b': 4, 'c': 4, 'a': 2, 'd': 2})
print(c1 - c2)    # 减法（负值被丢弃）
print(c1 & c2)    # 交集：取最小值
print(c1 | c2)    # 并集：取最大值

7.2.3 ChainMap

from collections import ChainMap

# 多字典视图（不复制数据）
defaults = {"theme": "dark", "language": "en", "font": "Arial"}
user_config = {"theme": "light"}
system_config = {"language": "zh"}

config = ChainMap(user_config, system_config, defaults)
print(config["theme"])     # "light" - 从user_config找到
print(config["language"])  # "zh" - 从system_config找到
print(config["font"])      # "Arial" - 从defaults找到

7.3 集合

7.3.1 创建与基本操作

# 创建
empty = set()                          # 注意：{} 是空字典，不是空集合
fruits = {"apple", "banana", "cherry"}
numbers = set([1, 2, 3, 2, 1])        # {1, 2, 3}
chars = set("hello")                   # {'h', 'e', 'l', 'o'}

# 添加与删除
fruits.add("date")                     # 添加
fruits.discard("banana")               # 安全删除（不存在不报错）
fruits.remove("cherry")                # 删除（不存在抛出KeyError）
item = fruits.pop()                    # 弹出任意元素
fruits.clear()                         # 清空

# 成员检测（O(1)，比列表O(n)快得多）
print("apple" in fruits)               # True

7.3.2 集合运算

a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7, 8}

# 并集
print(a | b)                  # {1, 2, 3, 4, 5, 6, 7, 8}
print(a.union(b))             # 等价

# 交集
print(a & b)                  # {4, 5}
print(a.intersection(b))      # 等价

# 差集
print(a - b)                  # {1, 2, 3}
print(a.difference(b))        # 等价

# 对称差集（并集减交集）
print(a ^ b)                  # {1, 2, 3, 6, 7, 8}
print(a.symmetric_difference(b))

# 子集与超集
print({1, 2} <= {1, 2, 3})    # True - 子集
print({1, 2} < {1, 2, 3})     # True - 真子集
print({1, 2, 3} >= {1, 2})    # True - 超集

# 不相交
print({1, 2}.isdisjoint({3, 4}))  # True

7.3.3 frozenset——不可变集合

frozen = frozenset([1, 2, 3])

# frozenset可作为字典键或集合元素
d = {frozenset([1, 2]): "value"}
s = {frozenset([1, 2]), frozenset([3, 4])}

学术注记：集合和字典的键都要求元素是可哈希的（Hashable）。可哈希意味着对象的生命周期内哈希值不变，且可与其他对象比较相等。不可变类型（int、str、tuple、frozenset）是可哈希的；可变类型（list、dict、set）不可哈希。

7.4 性能分析

7.4.1 时间复杂度

操作	列表	字典/集合	说明
`x in container`	O(n)	O(1)	成员检测
`container[key]`	O(1)	O(1)	索引/键访问
`container.append/add`	O(1)*	O(1)	追加/添加
删除指定元素	O(n)	O(1)	按值/键删除

# 实际性能对比
import timeit

data = list(range(100000))
data_set = set(range(100000))

# 列表查找：O(n)
print(timeit.timeit(lambda: 99999 in data, number=1000))    # 较慢

# 集合查找：O(1)
print(timeit.timeit(lambda: 99999 in data_set, number=1000))  # 极快

7.4.2 内存开销

import sys

lst = list(range(1000))
d = {i: i for i in range(1000)}
s = set(range(1000))

print(sys.getsizeof(lst))    # 约8KB
print(sys.getsizeof(d))      # 约36KB - 哈希表开销
print(sys.getsizeof(s))      # 约32KB - 哈希表开销

工程实践：集合和字典以空间换时间。当需要频繁的成员检测时，将列表转为集合可获得数量级的性能提升。但集合/字典的内存开销约为列表的4倍。

7.5 实用技巧

7.5.1 去重保序

numbers = [1, 2, 2, 3, 3, 3, 4]

# 方法1：dict保序（Python 3.7+）
unique = list(dict.fromkeys(numbers))

# 方法2：集合辅助
seen = set()
unique = [x for x in numbers if not (x in seen or seen.add(x))]

7.5.2 字典排序

scores = {"Alice": 85, "Bob": 92, "Charlie": 78}

# 按键排序
sorted_by_key = dict(sorted(scores.items()))

# 按值排序
sorted_by_value = dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

7.5.3 安全访问嵌套字典

def get_nested(d: dict, *keys, default=None):
    """安全访问嵌套字典"""
    for key in keys:
        if isinstance(d, dict):
            d = d.get(key, default)
        else:
            return default
    return d

data = {"user": {"profile": {"name": "Alice"}}}
print(get_nested(data, "user", "profile", "name"))     # "Alice"
print(get_nested(data, "user", "email", "address"))     # None

7.6 哈希表原理深入

7.6.1 哈希函数与冲突处理

def demonstrate_hash():
    """演示Python哈希函数"""
    
    print("整数哈希:")
    for i in range(5):
        print(f"  hash({i}) = {hash(i)}")
    
    print("\n字符串哈希:")
    for s in ["hello", "world", "hello"]:
        print(f"  hash('{s}') = {hash(s)}")
    
    print("\n元组哈希（元素必须可哈希）:")
    for t in [(1, 2), (1, 2), (3, 4)]:
        print(f"  hash({t}) = {hash(t)}")
    
    print("\n列表不可哈希:")
    try:
        hash([1, 2, 3])
    except TypeError as e:
        print(f"  错误: {e}")

demonstrate_hash()

学术注记：Python使用开放寻址法处理哈希冲突。当两个键哈希到同一位置时，按探测序列寻找下一个空位。探测公式为perturb = perturb >> 5，形成伪随机序列以减少聚集。删除元素时使用”墓碑标记”而非真正删除，以保持探测链完整。

7.6.2 字典内部结构

import sys

def analyze_dict_structure():
    """分析字典内部结构"""
    
    d = {i: i ** 2 for i in range(10)}
    
    print(f"字典大小: {sys.getsizeof(d)} 字节")
    print(f"键数量: {len(d)}")
    print(f"哈希表利用率: {len(d) / (sys.getsizeof(d) // 8):.2%}")
    
    print("\n字典内部属性:")
    print(f"  __len__: {d.__len__()}")
    print(f"  __hash__: {d.__hash__}")
    
    small = {}
    print(f"\n空字典大小: {sys.getsizeof(small)} 字节")

analyze_dict_structure()

7.6.3 自定义可哈希对象

from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    """可哈希的点类"""
    x: int
    y: int
    
    def __hash__(self) -> int:
        return hash((self.x, self.y))
    
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Point):
            return NotImplemented
        return self.x == other.x and self.y == other.y

p1 = Point(3, 4)
p2 = Point(3, 4)
p3 = Point(5, 6)

print(f"hash(p1) = {hash(p1)}")
print(f"p1 == p2: {p1 == p2}")
print(f"hash(p1) == hash(p2): {hash(p1) == hash(p2)}")

points = {p1: "A", p3: "B"}
print(points[Point(3, 4)])

7.6.4 哈希冲突攻击防护

import sys
import os

def demonstrate_hash_randomization():
    """演示哈希随机化（安全特性）"""
    
    print("Python 3.3+ 默认启用哈希随机化")
    print(f"PYTHONHASHSEED环境变量: {os.environ.get('PYTHONHASHSEED', 'random')}")
    
    print("\n不同运行中字符串哈希值会变化:")
    s = "hello"
    print(f"  hash('{s}') = {hash(s)}")
    print("  （每次运行值不同，防止哈希冲突攻击）")

demonstrate_hash_randomization()

学术注记：Python 3.3+默认启用哈希随机化，每次解释器启动时使用随机种子初始化哈希函数。这防止了恶意构造大量哈希冲突数据导致O(n²)性能退化的DoS攻击。可通过PYTHONHASHSEED=0禁用（仅用于调试）。

7.7 高级应用

7.7.1 字典视图对象

d = {"a": 1, "b": 2, "c": 3}

keys_view = d.keys()
values_view = d.values()
items_view = d.items()

print(f"keys视图类型: {type(keys_view)}")
print(f"keys视图: {list(keys_view)}")

d["d"] = 4
print(f"修改后keys视图: {list(keys_view)}")

print(f"\nkeys视图支持集合运算:")
print(f"  d.keys() & {'b', 'c', 'e'}: {d.keys() & {'b', 'c', 'e'}}")
print(f"  d.keys() | {'e', 'f'}: {d.keys() | {'e', 'f'}}")

7.7.2 OrderedDict与LRU缓存

from collections import OrderedDict

class LRUCache:
    """LRU缓存实现"""
    
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = OrderedDict()
    
    def get(self, key):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)
        return self.cache[key]
    
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
    
    def __repr__(self):
        return f"LRUCache({list(self.cache.items())})"

cache = LRUCache(3)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)
print(cache)
cache.get("a")
cache.put("d", 4)
print(cache)

7.7.3 WeakValueDictionary

import weakref

class Cache:
    """使用弱引用的缓存"""
    
    def __init__(self):
        self._cache = weakref.WeakValueDictionary()
    
    def get(self, key, factory):
        if key not in self._cache:
            self._cache[key] = factory()
        return self._cache[key]

class ExpensiveObject:
    def __init__(self, value):
        self.value = value
        print(f"创建对象: {value}")

cache = Cache()
obj = cache.get("key1", lambda: ExpensiveObject(1))
obj = cache.get("key1", lambda: ExpensiveObject(1))

7.8 前沿技术动态

7.8.1 字典合并运算符（PEP 584）

Python 3.9引入字典合并运算符：

dict1 = {"a": 1, "b": 2}
dict2 = {"c": 3, "d": 4}

merged = dict1 | dict2
dict1 |= dict2

7.8.2 类型化字典

from typing import TypedDict

class UserDict(TypedDict):
    name: str
    age: int
    email: str

user: UserDict = {"name": "Alice", "age": 30, "email": "alice@example.com"}

7.8.3 高性能替代方案

from immutables import Map

immutable_map = Map({"a": 1, "b": 2})
new_map = immutable_map.set("c", 3)

7.9 本章小结

本章系统介绍了Python字典与集合的完整体系：

字典：哈希表实现，O(1)查找，Python 3.7+保证插入顺序
特殊字典：defaultdict自动初始化、Counter计数、ChainMap多字典视图
集合：无序唯一元素集，支持数学集合运算，O(1)成员检测
可哈希性：字典键和集合元素必须是可哈希的不可变类型
哈希原理：开放寻址法处理冲突，哈希随机化防止攻击
性能权衡：字典/集合以空间换时间，适合频繁查找场景

7.9.1 数据结构选择指南

场景	推荐结构	原因
键值映射	dict	O(1)查找，语义清晰
成员检测	set	O(1)查找，内存高效
计数统计	Counter	内置统计方法
分组聚合	defaultdict(list)	自动初始化列表
缓存实现	OrderedDict	支持顺序操作
配置继承	ChainMap	不复制数据

7.9.2 常见陷阱与规避

# 陷阱1：迭代中修改字典
d = {"a": 1, "b": 2, "c": 3}
for key in d:
    if key == "b":
        del d[key]  # RuntimeError!

# 正确做法：遍历副本
for key in list(d.keys()):
    if key == "b":
        del d[key]

# 陷阱2：可变对象作为键
d = {}
lst = [1, 2]
# d[lst] = "value"  # TypeError!

# 正确做法：使用元组
d[tuple(lst)] = "value"

# 陷阱3：集合元素去重丢失信息
data = [{"id": 1}, {"id": 1}]
# set(data)  # TypeError!

# 正确做法：使用元组或自定义哈希
unique = {tuple(sorted(d.items())) for d in data}

7.10 练习题

基础题

创建字典存储学生信息，实现添加、删除、查询操作。
使用Counter统计一段文本中每个单词的出现频率，输出前5个高频词。
给定两个列表，使用集合运算找出它们的共同元素和独有元素。

进阶题

实现函数merge_dicts(*dicts)，合并多个字典，相同键的值相加。
使用defaultdict实现图的邻接表表示，并实现BFS遍历。
实现LRU缓存，使用OrderedDict管理访问顺序。

项目实践

文本词频分析器：编写一个程序，要求：
- 读取文本文件，统计词频
- 支持停用词过滤
- 使用Counter进行统计
- 按频率排序输出
- 支持大小写敏感/不敏感模式
- 输出统计摘要（总词数、唯一词数、Top N）

思考题

Python字典的哈希表是如何处理冲突的？为什么字典的查找是O(1)？
为什么列表不能作为字典的键？如果需要用列表作为键，应该如何解决？
defaultdict与普通字典的setdefault()方法相比，各有什么优缺点？

7.11 延伸阅读

7.11.1 哈希表理论

《算法导论》第11章 — 哈希表的数学原理与冲突解决策略
《数据结构与算法分析》 (Mark Allen Weiss) — 哈希表实现细节
PEP 412 — Key-Sharing Dictionary (https://peps.python.org/pep-0412/) — Python字典内存优化

7.11.2 Python内部实现

CPython源码 (https://github.com/python/cpython) — Objects/dictobject.c
《Python源码剖析》 (陈儒) — 字典实现的深度解析
Raymond Hettinger: Modern Python Dictionaries (PyCon 2017) — 字典演进历史

7.11.3 高级数据结构

collections模块文档 (https://docs.python.org/3/library/collections.html) — 特殊容器类型
weakref模块 (https://docs.python.org/3/library/weakref.html) — 弱引用与缓存
functools.lru_cache (https://docs.python.org/3/library/functools.html) — 内置LRU缓存装饰器

7.11.4 安全与性能

PEP 456 — Secure and Interchangeable Hash Algorithm (https://peps.python.org/pep-0456/) — 哈希安全改进
Python Hash Collision Attack (https://www.python.org/dev/peps/pep-0412/) — 哈希冲突攻击与防护
High Performance Python (Micha Gorelick) — 字典性能优化技巧

下一章：第8章字符串处理

第7章 字典与集合

学习目标

7.1 字典基础

7.1.1 创建字典

7.1.2 访问元素

7.1.3 修改字典

7.1.4 删除元素

7.2 特殊字典

7.2.1 defaultdict

7.2.2 Counter

7.2.3 ChainMap

7.3 集合

7.3.1 创建与基本操作

7.3.2 集合运算

7.3.3 frozenset——不可变集合

7.4 性能分析

7.4.1 时间复杂度

7.4.2 内存开销

7.5 实用技巧

7.5.1 去重保序

7.5.2 字典排序

7.5.3 安全访问嵌套字典

7.6 哈希表原理深入

7.6.1 哈希函数与冲突处理

7.6.2 字典内部结构

7.6.3 自定义可哈希对象

7.6.4 哈希冲突攻击防护

7.7 高级应用

7.7.1 字典视图对象

7.7.2 OrderedDict与LRU缓存

7.7.3 WeakValueDictionary

7.8 前沿技术动态

7.8.1 字典合并运算符（PEP 584）

7.8.2 类型化字典

7.8.3 高性能替代方案

7.9 本章小结

7.9.1 数据结构选择指南

7.9.2 常见陷阱与规避

7.10 练习题

基础题

进阶题

项目实践

思考题

7.11 延伸阅读

7.11.1 哈希表理论

7.11.2 Python内部实现

7.11.3 高级数据结构

7.11.4 安全与性能

第7章字典与集合