第7章 字典与集合 学习目标 完成本章学习后,读者将能够:
理解字典的哈希表实现原理及其对性能的影响 熟练运用字典的各种方法与特殊字典类型(defaultdict、Counter) 掌握集合运算及其在数据去重、成员检测中的应用 理解可哈希性(Hashable)的概念及其对字典键和集合元素的限制 运用字典与集合解决实际数据处理问题 7.1 字典基础 7.1.1 创建字典 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 person = {"name" : "Alice" , "age" : 25 , "city" : "北京" } person = dict (name="Alice" , age=25 , city="北京" ) person = dict ([("name" , "Alice" ), ("age" , 25 )]) keys = ["name" , "age" , "city" ] values = ["Alice" , 25 , "北京" ] person = dict (zip (keys, values)) defaults = dict .fromkeys(["host" , "port" , "db" ], "unknown" ) squares = {x: x ** 2 for x in range (5 )}
学术注记 :Python 3.7+字典保证插入顺序 (CPython 3.6已实现,3.7正式纳入语言规范)。底层使用紧凑数组+哈希索引的混合结构,相比旧实现节省约20-25%内存。字典的查找、插入、删除均摊时间复杂度为O(1)。
7.1.2 访问元素 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 person = {"name" : "Alice" , "age" : 25 , "city" : "北京" } print (person["name" ]) print (person.get("email" )) print (person.get("email" , "未设置" )) for key in person: print (key) for key, value in person.items(): print (f"{key} : {value} " ) for value in person.values(): print (value) print ("name" in person) print ("Alice" in person)
7.1.3 修改字典 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 person = {"name" : "Alice" , "age" : 25 } person["email" ] = "alice@example.com" person["age" ] = 26 person.update({"phone" : "123456" , "city" : "上海" }) person.update(age=27 ) person.setdefault("country" , "中国" ) person.setdefault("name" , "Bob" ) dict1 = {"a" : 1 , "b" : 2 } dict2 = {"b" : 3 , "c" : 4 } merged = dict1 | dict2 dict1 |= dict2
7.1.4 删除元素 1 2 3 4 5 6 7 person = {"name" : "Alice" , "age" : 25 , "city" : "北京" , "email" : "a@b.com" } del person["email" ] age = person.pop("age" ) city = person.pop("city" , "未知" ) key, value = person.popitem() person.clear()
7.2 特殊字典 7.2.1 defaultdict 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 from collections import defaultdictword_counts = defaultdict(int ) for word in "the quick brown fox jumps over the lazy dog" .split(): word_counts[word] += 1 groups = defaultdict(list ) for name, dept in [("Alice" , "Eng" ), ("Bob" , "Sales" ), ("Charlie" , "Eng" )]: groups[dept].append(name) print (groups) tree = lambda : defaultdict(tree) config = tree() config["database" ]["host" ] = "localhost" config["database" ]["port" ] = 5432
7.2.2 Counter 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 from collections import Countertext = "hello world" counter = Counter(text) print (counter) print (counter.most_common(2 )) c1 = Counter("aabbcc" ) c2 = Counter("bbccdd" ) print (c1 + c2) print (c1 - c2) print (c1 & c2) print (c1 | c2)
7.2.3 ChainMap 1 2 3 4 5 6 7 8 9 10 11 from collections import ChainMapdefaults = {"theme" : "dark" , "language" : "en" , "font" : "Arial" } user_config = {"theme" : "light" } system_config = {"language" : "zh" } config = ChainMap(user_config, system_config, defaults) print (config["theme" ]) print (config["language" ]) print (config["font" ])
7.3 集合 7.3.1 创建与基本操作 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 empty = set () fruits = {"apple" , "banana" , "cherry" } numbers = set ([1 , 2 , 3 , 2 , 1 ]) chars = set ("hello" ) fruits.add("date" ) fruits.discard("banana" ) fruits.remove("cherry" ) item = fruits.pop() fruits.clear() print ("apple" in fruits)
7.3.2 集合运算 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 a = {1 , 2 , 3 , 4 , 5 } b = {4 , 5 , 6 , 7 , 8 } print (a | b) print (a.union(b)) print (a & b) print (a.intersection(b)) print (a - b) print (a.difference(b)) print (a ^ b) print (a.symmetric_difference(b))print ({1 , 2 } <= {1 , 2 , 3 }) print ({1 , 2 } < {1 , 2 , 3 }) print ({1 , 2 , 3 } >= {1 , 2 }) print ({1 , 2 }.isdisjoint({3 , 4 }))
7.3.3 frozenset——不可变集合 1 2 3 4 5 frozen = frozenset ([1 , 2 , 3 ]) d = {frozenset ([1 , 2 ]): "value" } s = {frozenset ([1 , 2 ]), frozenset ([3 , 4 ])}
学术注记 :集合和字典的键都要求元素是可哈希的(Hashable) 。可哈希意味着对象的生命周期内哈希值不变,且可与其他对象比较相等。不可变类型(int、str、tuple、frozenset)是可哈希的;可变类型(list、dict、set)不可哈希。
7.4 性能分析 7.4.1 时间复杂度 操作 列表 字典/集合 说明 x in containerO(n) O(1) 成员检测 container[key]O(1) O(1) 索引/键访问 container.append/addO(1)* O(1) 追加/添加 删除指定元素 O(n) O(1) 按值/键删除
1 2 3 4 5 6 7 8 9 10 11 import timeitdata = list (range (100000 )) data_set = set (range (100000 )) print (timeit.timeit(lambda : 99999 in data, number=1000 )) print (timeit.timeit(lambda : 99999 in data_set, number=1000 ))
7.4.2 内存开销 1 2 3 4 5 6 7 8 9 import syslst = list (range (1000 )) d = {i: i for i in range (1000 )} s = set (range (1000 )) print (sys.getsizeof(lst)) print (sys.getsizeof(d)) print (sys.getsizeof(s))
工程实践 :集合和字典以空间换时间。当需要频繁的成员检测时,将列表转为集合可获得数量级的性能提升。但集合/字典的内存开销约为列表的4倍。
7.5 实用技巧 7.5.1 去重保序 1 2 3 4 5 6 7 8 numbers = [1 , 2 , 2 , 3 , 3 , 3 , 4 ] unique = list (dict .fromkeys(numbers)) seen = set () unique = [x for x in numbers if not (x in seen or seen.add(x))]
7.5.2 字典排序 1 2 3 4 5 6 7 scores = {"Alice" : 85 , "Bob" : 92 , "Charlie" : 78 } sorted_by_key = dict (sorted (scores.items())) sorted_by_value = dict (sorted (scores.items(), key=lambda x: x[1 ], reverse=True ))
7.5.3 安全访问嵌套字典 1 2 3 4 5 6 7 8 9 10 11 12 def get_nested (d: dict , *keys, default=None ): """安全访问嵌套字典""" for key in keys: if isinstance (d, dict ): d = d.get(key, default) else : return default return d data = {"user" : {"profile" : {"name" : "Alice" }}} print (get_nested(data, "user" , "profile" , "name" )) print (get_nested(data, "user" , "email" , "address" ))
7.6 哈希表原理深入 7.6.1 哈希函数与冲突处理 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 def demonstrate_hash (): """演示Python哈希函数""" print ("整数哈希:" ) for i in range (5 ): print (f" hash({i} ) = {hash (i)} " ) print ("\n字符串哈希:" ) for s in ["hello" , "world" , "hello" ]: print (f" hash('{s} ') = {hash (s)} " ) print ("\n元组哈希(元素必须可哈希):" ) for t in [(1 , 2 ), (1 , 2 ), (3 , 4 )]: print (f" hash({t} ) = {hash (t)} " ) print ("\n列表不可哈希:" ) try : hash ([1 , 2 , 3 ]) except TypeError as e: print (f" 错误: {e} " ) demonstrate_hash()
学术注记 :Python使用开放寻址法 处理哈希冲突。当两个键哈希到同一位置时,按探测序列寻找下一个空位。探测公式为perturb = perturb >> 5,形成伪随机序列以减少聚集。删除元素时使用”墓碑标记”而非真正删除,以保持探测链完整。
7.6.2 字典内部结构 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 import sysdef analyze_dict_structure (): """分析字典内部结构""" d = {i: i ** 2 for i in range (10 )} print (f"字典大小: {sys.getsizeof(d)} 字节" ) print (f"键数量: {len (d)} " ) print (f"哈希表利用率: {len (d) / (sys.getsizeof(d) // 8 ):.2 %} " ) print ("\n字典内部属性:" ) print (f" __len__: {d.__len__()} " ) print (f" __hash__: {d.__hash__} " ) small = {} print (f"\n空字典大小: {sys.getsizeof(small)} 字节" ) analyze_dict_structure()
7.6.3 自定义可哈希对象 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 from dataclasses import dataclass@dataclass(frozen=True ) class Point : """可哈希的点类""" x: int y: int def __hash__ (self ) -> int : return hash ((self .x, self .y)) def __eq__ (self, other: object ) -> bool : if not isinstance (other, Point): return NotImplemented return self .x == other.x and self .y == other.y p1 = Point(3 , 4 ) p2 = Point(3 , 4 ) p3 = Point(5 , 6 ) print (f"hash(p1) = {hash (p1)} " )print (f"p1 == p2: {p1 == p2} " )print (f"hash(p1) == hash(p2): {hash (p1) == hash (p2)} " )points = {p1: "A" , p3: "B" } print (points[Point(3 , 4 )])
7.6.4 哈希冲突攻击防护 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 import sysimport osdef demonstrate_hash_randomization (): """演示哈希随机化(安全特性)""" print ("Python 3.3+ 默认启用哈希随机化" ) print (f"PYTHONHASHSEED环境变量: {os.environ.get('PYTHONHASHSEED' , 'random' )} " ) print ("\n不同运行中字符串哈希值会变化:" ) s = "hello" print (f" hash('{s} ') = {hash (s)} " ) print (" (每次运行值不同,防止哈希冲突攻击)" ) demonstrate_hash_randomization()
学术注记 :Python 3.3+默认启用哈希随机化 ,每次解释器启动时使用随机种子初始化哈希函数。这防止了恶意构造大量哈希冲突数据导致O(n²)性能退化的DoS攻击。可通过PYTHONHASHSEED=0禁用(仅用于调试)。
7.7 高级应用 7.7.1 字典视图对象 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 d = {"a" : 1 , "b" : 2 , "c" : 3 } keys_view = d.keys() values_view = d.values() items_view = d.items() print (f"keys视图类型: {type (keys_view)} " )print (f"keys视图: {list (keys_view)} " )d["d" ] = 4 print (f"修改后keys视图: {list (keys_view)} " )print (f"\nkeys视图支持集合运算:" )print (f" d.keys() & {'b' , 'c' , 'e' } : {d.keys() & {'b' , 'c' , 'e' } }" )print (f" d.keys() | {'e' , 'f' } : {d.keys() | {'e' , 'f' } }" )
7.7.2 OrderedDict与LRU缓存 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 from collections import OrderedDictclass LRUCache : """LRU缓存实现""" def __init__ (self, capacity: int ): self .capacity = capacity self .cache = OrderedDict() def get (self, key ): if key not in self .cache: return None self .cache.move_to_end(key) return self .cache[key] def put (self, key, value ): if key in self .cache: self .cache.move_to_end(key) self .cache[key] = value if len (self .cache) > self .capacity: self .cache.popitem(last=False ) def __repr__ (self ): return f"LRUCache({list (self.cache.items())} )" cache = LRUCache(3 ) cache.put("a" , 1 ) cache.put("b" , 2 ) cache.put("c" , 3 ) print (cache)cache.get("a" ) cache.put("d" , 4 ) print (cache)
7.7.3 WeakValueDictionary 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 import weakrefclass Cache : """使用弱引用的缓存""" def __init__ (self ): self ._cache = weakref.WeakValueDictionary() def get (self, key, factory ): if key not in self ._cache: self ._cache[key] = factory() return self ._cache[key] class ExpensiveObject : def __init__ (self, value ): self .value = value print (f"创建对象: {value} " ) cache = Cache() obj = cache.get("key1" , lambda : ExpensiveObject(1 )) obj = cache.get("key1" , lambda : ExpensiveObject(1 ))
7.8 前沿技术动态 7.8.1 字典合并运算符(PEP 584) Python 3.9引入字典合并运算符:
1 2 3 4 5 dict1 = {"a" : 1 , "b" : 2 } dict2 = {"c" : 3 , "d" : 4 } merged = dict1 | dict2 dict1 |= dict2
7.8.2 类型化字典 1 2 3 4 5 6 7 8 from typing import TypedDictclass UserDict (TypedDict ): name: str age: int email: str user: UserDict = {"name" : "Alice" , "age" : 30 , "email" : "alice@example.com" }
7.8.3 高性能替代方案 1 2 3 4 from immutables import Mapimmutable_map = Map({"a" : 1 , "b" : 2 }) new_map = immutable_map.set ("c" , 3 )
7.9 本章小结 本章系统介绍了Python字典与集合的完整体系:
字典 :哈希表实现,O(1)查找,Python 3.7+保证插入顺序特殊字典 :defaultdict自动初始化、Counter计数、ChainMap多字典视图集合 :无序唯一元素集,支持数学集合运算,O(1)成员检测可哈希性 :字典键和集合元素必须是可哈希的不可变类型哈希原理 :开放寻址法处理冲突,哈希随机化防止攻击性能权衡 :字典/集合以空间换时间,适合频繁查找场景7.9.1 数据结构选择指南 场景 推荐结构 原因 键值映射 dict O(1)查找,语义清晰 成员检测 set O(1)查找,内存高效 计数统计 Counter 内置统计方法 分组聚合 defaultdict(list) 自动初始化列表 缓存实现 OrderedDict 支持顺序操作 配置继承 ChainMap 不复制数据
7.9.2 常见陷阱与规避 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 d = {"a" : 1 , "b" : 2 , "c" : 3 } for key in d: if key == "b" : del d[key] for key in list (d.keys()): if key == "b" : del d[key] d = {} lst = [1 , 2 ] d[tuple (lst)] = "value" data = [{"id" : 1 }, {"id" : 1 }] unique = {tuple (sorted (d.items())) for d in data}
7.10 练习题 基础题 创建字典存储学生信息,实现添加、删除、查询操作。
使用Counter统计一段文本中每个单词的出现频率,输出前5个高频词。
给定两个列表,使用集合运算找出它们的共同元素和独有元素。
进阶题 实现函数merge_dicts(*dicts),合并多个字典,相同键的值相加。
使用defaultdict实现图的邻接表表示,并实现BFS遍历。
实现LRU缓存,使用OrderedDict管理访问顺序。
项目实践 文本词频分析器 :编写一个程序,要求:读取文本文件,统计词频 支持停用词过滤 使用Counter进行统计 按频率排序输出 支持大小写敏感/不敏感模式 输出统计摘要(总词数、唯一词数、Top N) 思考题 Python字典的哈希表是如何处理冲突的?为什么字典的查找是O(1)?
为什么列表不能作为字典的键?如果需要用列表作为键,应该如何解决?
defaultdict与普通字典的setdefault()方法相比,各有什么优缺点?
7.11 延伸阅读 7.11.1 哈希表理论 7.11.2 Python内部实现 7.11.3 高级数据结构 7.11.4 安全与性能 下一章:第8章 字符串处理