# Chapter 9: File Operations

## Learning Objectives

After completing this chapter, readers will be able to:
- Master basic file reading and writing and the context manager pattern
- Use pathlib fluently for cross-platform path operations
- Understand the difference between text and binary files, and how encodings are handled
- Serialize and deserialize data with JSON, CSV, and pickle
- Write custom context managers to manage resource lifecycles

## 9.1 File Basics

### 9.1.1 Opening and Closing Files

```python
# Manual open/close: easy to get wrong
file = open("example.txt", "r", encoding="utf-8")
content = file.read()
file.close()

# Preferred: a context manager closes the file automatically
with open("example.txt", "r", encoding="utf-8") as file:
    content = file.read()
```
**Engineering practice**: always use the `with` statement when working with files. It guarantees the file is closed correctly even if an exception occurs, preventing resource leaks. A manual `close()` call may never run when an exception is raised.
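This guarantee can be made concrete by hand-expanding what `with` does: roughly, a `try`/`finally` around the body. A minimal sketch (the file name `demo_with.txt` is a throwaway chosen for illustration):

```python
import os

# Roughly what `with open(...) as f:` does under the hood:
f = open("demo_with.txt", "w", encoding="utf-8")
try:
    f.write("hello\n")
finally:
    f.close()  # runs whether or not write() raised

print(f.closed)  # True

# Even when the body raises, __exit__ still closes the file
# before the exception propagates:
try:
    with open("demo_with.txt", "r", encoding="utf-8") as g:
        raise ValueError("boom")
except ValueError:
    pass

print(g.closed)  # True
os.remove("demo_with.txt")
```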
### 9.1.2 File Modes

| Mode | Description | If the file exists | If the file does not exist |
|------|-------------|--------------------|----------------------------|
| `r`  | Read-only | Read | `FileNotFoundError` |
| `w`  | Write-only | Truncate | Create |
| `x`  | Exclusive write | `FileExistsError` | Create |
| `a`  | Append | Append to end | Create |
| `r+` | Read/write | Read | `FileNotFoundError` |
| `b`  | Binary mode | - | - |
| `t`  | Text mode (default) | - | - |
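The difference between `"w"` and `"x"` matters when an existing file must never be silently overwritten. A small sketch (the file name `exclusive_demo.txt` is hypothetical):

```python
import os

name = "exclusive_demo.txt"  # throwaway name for this illustration
if os.path.exists(name):
    os.remove(name)

# "x" succeeds because the file does not exist yet
with open(name, "x", encoding="utf-8") as f:
    f.write("created exactly once\n")

# A second exclusive open refuses to clobber the file
raised = False
try:
    open(name, "x", encoding="utf-8")
except FileExistsError:
    raised = True

print(raised)  # True
os.remove(name)
```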
```python
# Write mode truncates any existing content
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("line 1\n")
    f.write("line 2\n")

# Append mode adds to the end of the file
with open("output.txt", "a", encoding="utf-8") as f:
    f.write("appended content\n")

# Binary write and read
with open("data.bin", "wb") as f:
    f.write(b"\x00\x01\x02\x03")

with open("data.bin", "rb") as f:
    data = f.read()
```
### 9.1.3 The File Pointer

```python
with open("example.txt", "r", encoding="utf-8") as f:
    print(f.tell())       # current position (0 at the start)
    content = f.read(10)  # read 10 characters
    print(f.tell())       # position advanced (a byte offset, not a character count)
    f.seek(0)             # rewind to the beginning
    f.seek(5)             # jump to offset 5 (in text mode, only 0 and values
                          # returned by tell() are guaranteed to be valid)
```
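`seek()` also accepts a `whence` argument. In text mode only seeking relative to the start is supported, but in binary mode you can seek relative to the current position or the end of the file, which is handy for reading a file's tail. A sketch (`seek_demo.bin` is a throwaway name):

```python
import os

with open("seek_demo.bin", "wb") as f:
    f.write(b"0123456789")

with open("seek_demo.bin", "rb") as f:
    f.seek(-3, os.SEEK_END)  # position 3 bytes before the end
    tail = f.read()

print(tail)  # b'789'
os.remove("seek_demo.bin")
```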
## 9.2 Reading and Writing Files

### 9.2.1 Reading

```python
# Read the whole file at once
with open("example.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Iterate line by line (memory-friendly)
with open("example.txt", "r", encoding="utf-8") as f:
    for line in f:
        print(line.strip())

# Read all lines into a list
with open("example.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Read a fixed-size chunk
with open("example.txt", "r", encoding="utf-8") as f:
    chunk = f.read(1024)
```
**Engineering practice**: when processing large files, always iterate line by line with `for line in f` rather than calling `f.read()` or `f.readlines()`. The latter load the entire file into memory and can exhaust it.
### 9.2.2 Writing

```python
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("Hello, World!\n")

# writelines() does not add newlines; include them yourself
with open("output.txt", "w", encoding="utf-8") as f:
    lines = ["line 1\n", "line 2\n", "line 3\n"]
    f.writelines(lines)

# print() can write to a file via the file= argument
with open("output.txt", "w", encoding="utf-8") as f:
    print("Hello, World!", file=f)
    print("Python file operations", file=f)
```
### 9.2.3 Copying Large Files

```python
def copy_file(src: str, dst: str, chunk_size: int = 8192) -> None:
    """Copy a file in fixed-size chunks to bound memory usage."""
    with open(src, "rb") as src_file, open(dst, "wb") as dst_file:
        while chunk := src_file.read(chunk_size):
            dst_file.write(chunk)

# In practice, prefer the standard library:
import shutil
shutil.copy2("source.bin", "destination.bin")  # also preserves metadata
```
## 9.3 Path Operations

### 9.3.1 pathlib (recommended)

```python
from pathlib import Path

# Build paths with the / operator
path = Path("home") / "user" / "documents" / "file.txt"

print(path.name)    # "file.txt"
print(path.stem)    # "file"
print(path.suffix)  # ".txt"
print(path.parent)  # home/user/documents
print(Path("file.txt").resolve())  # absolute path

# Inspect the file system
print(path.exists())
print(path.is_file())
print(path.is_dir())
print(path.stat().st_size)  # size in bytes

# Create, delete, rename (independent examples)
Path("new_dir").mkdir(parents=True, exist_ok=True)
path.unlink()                # delete the file
path.rename("new_name.txt")  # rename/move

# Convenience one-shot read/write
content = Path("file.txt").read_text(encoding="utf-8")
Path("output.txt").write_text("Hello!\n", encoding="utf-8")

# Glob patterns
for py_file in Path(".").glob("*.py"):   # current directory only
    print(py_file)

for py_file in Path(".").rglob("*.py"):  # recursive
    print(py_file)
```
**Engineering practice**: prefer pathlib over os.path. pathlib is object-oriented, supports the `/` operator for joining paths, and has a more consistent API; it has been the standard approach since Python 3.4.
### 9.3.2 os.path (legacy)

```python
import os

path = "/home/user/documents/file.txt"
print(os.path.basename(path))       # "file.txt"
print(os.path.dirname(path))        # "/home/user/documents"
print(os.path.join("a", "b", "c"))  # "a/b/c" ("a\\b\\c" on Windows)
print(os.path.splitext(path))       # ("/home/user/documents/file", ".txt")
print(os.path.exists(path))
```
## 9.4 Directory Operations

### 9.4.1 Traversing Directories

```python
import os
from pathlib import Path

# List the immediate children of a directory
for item in Path(".").iterdir():
    print(f"{'[D]' if item.is_dir() else '[F]'} {item.name}")

# Recursive walk with os.walk
for root, dirs, files in os.walk("."):
    for name in files:
        print(Path(root) / name)

# Recursive glob with pathlib
for py_file in Path("src").rglob("*.py"):
    print(py_file)
```
### 9.4.2 Creating and Deleting

```python
from pathlib import Path
import shutil

Path("new_dir").mkdir()                           # fails if it already exists
Path("a/b/c").mkdir(parents=True, exist_ok=True)  # create intermediate dirs

Path("file.txt").unlink()       # delete a file
Path("empty_dir").rmdir()       # delete an empty directory only
shutil.rmtree("non_empty_dir")  # delete a directory tree recursively
```
## 9.5 Serialization

### 9.5.1 JSON

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime

data = {"name": "Alice", "age": 25, "skills": ["Python", "SQL"]}

# Serialize to a string / to a file
json_str = json.dumps(data, ensure_ascii=False, indent=2)
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Deserialize from a string / from a file
parsed = json.loads(json_str)
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Custom serializer for types JSON does not know about
def json_serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

json_str = json.dumps({"time": datetime.now()}, default=json_serializer)

# Dataclasses serialize easily via asdict()
@dataclass
class Person:
    name: str
    age: int

person = Person("Alice", 25)
json_str = json.dumps(asdict(person))
```
### 9.5.2 CSV

```python
import csv

data = [
    {"name": "Alice", "age": 25, "city": "Beijing"},
    {"name": "Bob", "age": 30, "city": "Shanghai"},
]

# newline="" prevents extra blank lines on Windows
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "city"])
    writer.writeheader()
    writer.writerows(data)

with open("data.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["name"], row["age"])
```
### 9.5.3 pickle

```python
import pickle
from datetime import datetime

# pickle handles arbitrary Python objects, including datetime
data = {"name": "Alice", "time": datetime.now()}

with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

with open("data.pkl", "rb") as f:
    loaded = pickle.load(f)

# In-memory round trip
pickled = pickle.dumps(data)
loaded = pickle.loads(pickled)
```
**Security warning**: pickle can execute arbitrary code. **Never load pickle data from an untrusted source.** In web applications, always use JSON instead of pickle.
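The warning can be demonstrated harmlessly: unpickling calls whatever callable a crafted payload names in `__reduce__`. The class name `Payload` and the use of `eval` here are illustrative stand-ins; a real attacker would return something like `(os.system, ("...",))`:

```python
import pickle

class Payload:
    def __reduce__(self):
        # eval stands in for a destructive command, to show that
        # arbitrary code runs during loading
        return (eval, ("6 * 7",))

malicious = pickle.dumps(Payload())
result = pickle.loads(malicious)  # executes eval("6 * 7"); no Payload comes back
print(result)  # 42
```

The loaded "object" is just the result of the attacker's call, and by the time you inspect it, the code has already run.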
### 9.5.4 Format Comparison

| Feature | JSON | CSV | pickle |
|---|---|---|---|
| Data types | Basic types | Plain-text tables | Arbitrary Python objects |
| Readability | High | High | Not human-readable |
| Security | Safe | Safe | Unsafe |
| Cross-language | Yes | Yes | No |
| Typical use | APIs, config | Tabular data | Python-internal caching |
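The "data types" row can be checked directly: JSON only knows its own basic types, so Python-specific details are lost on a round trip, while pickle preserves the exact object:

```python
import json
import pickle
from datetime import datetime

value = {"point": (1, 2)}  # a tuple inside a dict

# JSON: the tuple silently becomes a list
via_json = json.loads(json.dumps(value))
print(via_json["point"])    # [1, 2]

# pickle: the exact Python object comes back
via_pickle = pickle.loads(pickle.dumps(value))
print(via_pickle["point"])  # (1, 2)

# Types JSON does not know about fail outright
refused = False
try:
    json.dumps({"time": datetime.now()})
except TypeError:
    refused = True
print(refused)  # True
```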
## 9.6 Context Managers

### 9.6.1 Custom Context Managers

```python
import time


class Timer:
    """Class-based context manager: implement __enter__ and __exit__."""

    def __init__(self, name: str = ""):
        self.name = name

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        elapsed = time.perf_counter() - self.start
        print(f"{self.name} took {elapsed:.4f} s")
        return False  # do not suppress exceptions

with Timer("sum"):
    sum(range(1000000))


# Generator-based alternative via contextlib
from contextlib import contextmanager

@contextmanager
def timer(name: str = ""):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{name} took {elapsed:.4f} s")

with timer("sum"):
    sum(range(1000000))
```
### 9.6.2 Practical Context Managers

```python
from contextlib import contextmanager
import os
import tempfile


@contextmanager
def change_dir(path: str):
    """Temporarily change the working directory."""
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old_dir)


@contextmanager
def temp_file(content: str, suffix: str = ".txt"):
    """Create a temporary file with the given content; delete it afterwards."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8"
    ) as f:
        f.write(content)
        name = f.name
    try:
        yield name
    finally:
        os.unlink(name)
```
## 9.7 IO Models and Performance

### 9.7.1 Synchronous vs. Asynchronous IO

```python
import time
import asyncio
import aiofiles  # third-party: pip install aiofiles


def sync_io_example():
    """Synchronous IO: each write blocks until it completes."""
    start = time.time()
    with open("file1.txt", "w") as f:
        f.write("content1")
    with open("file2.txt", "w") as f:
        f.write("content2")
    with open("file3.txt", "w") as f:
        f.write("content3")
    print(f"sync IO took {time.time() - start:.4f} s")


async def async_io_example():
    """Asynchronous IO (requires the aiofiles library)."""
    start = time.time()
    async with aiofiles.open("file1.txt", "w") as f:
        await f.write("content1")
    async with aiofiles.open("file2.txt", "w") as f:
        await f.write("content2")
    async with aiofiles.open("file3.txt", "w") as f:
        await f.write("content3")
    print(f"async IO took {time.time() - start:.4f} s")
```
**Academic note**: Python's file IO is **blocking, synchronous IO** by default. For high-concurrency scenarios (such as a web server handling many file requests), use asynchronous IO (asyncio + aiofiles) or a thread pool. Asynchronous IO builds on the operating system's non-blocking interfaces, letting a single thread handle many IO operations.
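The thread-pool option mentioned above deserves its own sketch: CPython releases the GIL while a thread is blocked in a file system call, so `concurrent.futures.ThreadPoolExecutor` can overlap many small writes. File names like `pool_demo_0.txt` are throwaway choices for this example:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_one(index: int) -> str:
    """Write one small file; runs in a worker thread."""
    name = f"pool_demo_{index}.txt"
    with open(name, "w", encoding="utf-8") as f:
        f.write(f"content {index}")
    return name


# The pool overlaps the blocking writes across worker threads;
# map() returns results in submission order
with ThreadPoolExecutor(max_workers=4) as pool:
    names = list(pool.map(write_one, range(3)))

print(names)  # ['pool_demo_0.txt', 'pool_demo_1.txt', 'pool_demo_2.txt']

for name in names:  # clean up
    os.remove(name)
```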
### 9.7.2 Buffering and Performance

```python
import time


def benchmark_buffer():
    """Effect of buffer size on write performance."""
    data = b"x" * 10_000_000

    with open("test.bin", "wb", buffering=0) as f:  # unbuffered (binary only)
        start = time.time()
        f.write(data)
        print(f"unbuffered: {time.time() - start:.4f} s")

    with open("test.bin", "wb", buffering=8192) as f:  # 8 KB buffer
        start = time.time()
        f.write(data)
        print(f"8 KB buffer: {time.time() - start:.4f} s")

    with open("test.bin", "wb", buffering=65536) as f:  # 64 KB buffer
        start = time.time()
        f.write(data)
        print(f"64 KB buffer: {time.time() - start:.4f} s")

benchmark_buffer()
```
### 9.7.3 Memory-Mapped Files

```python
import mmap


def mmap_example():
    """Map a file into memory and modify it in place."""
    # Create a 10 MB file of zero bytes
    with open("large_file.bin", "wb") as f:
        f.write(b"\x00" * 10_000_000)

    with open("large_file.bin", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
        print(f"file size: {len(mm)} bytes")
        mm[0:4] = b"TEST"  # slice assignment writes through to the file
        mm.seek(100)
        mm.write(b"Hello")
        mm.close()


def mmap_search():
    """Search a large file without reading it all into memory."""
    with open("large_file.txt", "r", encoding="utf-8") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            index = mm.find(b"search_term")
            if index != -1:
                print(f"found at offset {index}")
```
**Academic note**: a memory-mapped file (mmap) maps a file directly into the process's virtual address space; the operating system handles paging data between memory and disk. It suits: 1) very large files (larger than available memory); 2) random access to file contents; 3) shared-memory communication between processes.
### 9.7.4 File System Monitoring

```python
import time
from watchdog.observers import Observer  # third-party: pip install watchdog
from watchdog.events import FileSystemEventHandler


class FileChangeHandler(FileSystemEventHandler):
    def on_created(self, event):
        print(f"created: {event.src_path}")

    def on_modified(self, event):
        print(f"modified: {event.src_path}")

    def on_deleted(self, event):
        print(f"deleted: {event.src_path}")


def watch_directory(path: str):
    """Watch a directory tree for changes until interrupted."""
    observer = Observer()
    observer.schedule(FileChangeHandler(), path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```
## 9.8 Current Developments

### 9.8.1 Asynchronous File IO

```python
import asyncio
import aiofiles  # third-party


async def read_file_async(path: str) -> str:
    async with aiofiles.open(path, mode="r", encoding="utf-8") as f:
        return await f.read()


async def write_file_async(path: str, content: str) -> None:
    async with aiofiles.open(path, mode="w", encoding="utf-8") as f:
        await f.write(content)


async def main():
    content = await read_file_async("data.txt")
    await write_file_async("output.txt", content.upper())

asyncio.run(main())
```
### 9.8.2 Modern Path Operations

```python
from pathlib import Path

# Create parent directories before writing
p = Path("data/output.txt")
p.parent.mkdir(parents=True, exist_ok=True)

# Compute a relative path (target must lie under base,
# otherwise relative_to raises ValueError)
base = Path("/home/user/project")
target = Path("/home/user/project/data/file.txt")
relative = target.relative_to(base)  # data/file.txt

# "**/*.py" in glob() is equivalent to rglob("*.py")
for py_file in Path(".").glob("**/*.py"):
    print(py_file)
```
### 9.8.3 High-Performance Serialization

```python
import orjson   # third-party: pip install orjson
import msgspec  # third-party: pip install msgspec

data = {"name": "Alice", "age": 30, "items": [1, 2, 3]}

# orjson returns bytes and is much faster than the stdlib json module
json_bytes = orjson.dumps(data)
loaded = orjson.loads(json_bytes)

# msgspec provides reusable encoder/decoder objects
encoder = msgspec.json.Encoder()
decoder = msgspec.json.Decoder()
encoded = encoder.encode(data)
decoded = decoder.decode(encoded)
```
### 9.8.4 File System Monitoring

```python
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler


class MyHandler(FileSystemEventHandler):
    def on_modified(self, event):
        print(f"Modified: {event.src_path}")

    def on_created(self, event):
        print(f"Created: {event.src_path}")


observer = Observer()
observer.schedule(MyHandler(), path=".", recursive=True)
observer.start()  # runs in a background thread; stop()/join() to shut down
```
## 9.9 Chapter Summary

This chapter surveyed Python's file-operation toolkit:
- **File read/write**: the `with` statement, text/binary modes, line-by-line reading
- **Path operations**: object-oriented path handling with pathlib (recommended)
- **Directory operations**: traversal, creation, deletion, copying
- **Serialization**: JSON (cross-language, safe), CSV (tabular data), pickle (Python-internal)
- **Context managers**: the standard pattern for resource management, implemented as classes or generators
- **IO models**: synchronous/asynchronous IO, buffering strategies, memory-mapped files

### 9.9.1 Best Practices

| Scenario | Recommended approach | Why |
|---|---|---|
| Ordinary file read/write | `with` + `open()` | Automatic resource management |
| Path operations | `pathlib.Path` | Object-oriented, cross-platform |
| Configuration files | JSON/YAML | Readable, cross-language |
| Large files | Line-by-line iteration / mmap | Memory-friendly |
| High-concurrency IO | asyncio + aiofiles | Non-blocking, efficient |
| Temporary files | `tempfile` module | Safe, automatic cleanup |
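The tempfile row deserves a direct example, since this chapter only used the module inside a custom context manager. `tempfile.TemporaryDirectory` hands out a throwaway directory and removes it, contents and all, on exit:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "scratch.txt"
    scratch.write_text("intermediate results\n", encoding="utf-8")
    existed_inside = scratch.exists()  # True while the block runs

# The directory and everything in it are gone now
print(existed_inside, os.path.exists(tmp))  # True False
```

Because cleanup is tied to the context manager, temporary data cannot leak even when the body raises.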
### 9.9.2 Common Pitfalls and How to Avoid Them

```python
# Pitfall 1: forgetting to close the file
f = open("file.txt", "r")
content = f.read()
# ... close() is never called

# Fix: use a context manager
with open("file.txt", "r") as f:
    content = f.read()

# Pitfall 2: relying on the platform default encoding
with open("file.txt", "r") as f:  # encoding varies by OS and locale
    content = f.read()

# Fix: always pass the encoding explicitly
with open("file.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Pitfall 3: reading a huge file into memory at once
content = open("large.txt").read()

# Fix: iterate line by line (process() is a placeholder)
with open("large.txt") as f:
    for line in f:
        process(line)

# Pitfall 4: unpickling untrusted data (arbitrary code execution!)
data = pickle.loads(untrusted_data)

# Fix: use JSON for untrusted input
data = json.loads(untrusted_data)
```
## 9.10 Exercises

**Basics**

1. Write a program that counts the lines, words, and characters in a text file.
2. Store and load a list of student records using JSON.
3. Implement a simple logger that rotates its log file by date.

**Intermediate**

4. Build a file-search tool that recursively searches a directory for files containing given content.
5. Write a program that merges multiple CSV files and removes duplicate rows.
6. Implement an INI configuration-file manager supporting reads, writes, and type conversion.
**Project**

7. **File synchronization tool**: write a program that
   - compares the file differences between two directories
   - supports incremental sync (copies only modified files)
   - uses file hashes (MD5/SHA-256) to decide whether two files are identical
   - supports exclusion rules (e.g. ignore the `.git` directory)
   - generates a sync report
   - uses pathlib for all path handling

**Discussion**

8. Why is the `with` statement safer than calling `close()` manually? What does the return value of `__exit__` do?
9. What is the core difference between JSON and pickle? Why is pickle data unsafe?
10. What advantages does pathlib have over os.path? In which scenarios is os.path still needed?
## 9.11 Further Reading

### 9.11.1 File Systems and IO

### 9.11.2 Paths and File Operations

### 9.11.3 Serialization and Data Formats

### 9.11.4 Asynchronous IO

Next chapter: Chapter 10, Classes and Objects