Chapter 9: File Operations

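Learning Objectives

After completing this chapter, readers will be able to:

Master basic file reading and writing and the context manager pattern
Use pathlib fluently for cross-platform path handling
Understand the difference between text and binary files and how encodings are handled
Serialize and deserialize data with JSON, CSV, and pickle
Write custom context managers to manage resource lifetimes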

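9.1 File Basics

9.1.1 Opening and Closing Files

# Manual management (not recommended)
file = open("example.txt", "r", encoding="utf-8")
content = file.read()
file.close()

# with statement (recommended): closes the file automatically, even if an exception occurs
with open("example.txt", "r", encoding="utf-8") as file:
    content = file.read()

Engineering practice: Always use a with statement for file operations. It closes the file even when an exception is raised, which prevents resource leaks. A manual close() call is skipped if an exception occurs before it runs.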
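The guarantee described above can be checked directly. The following is a minimal sketch that works in a temporary directory; the file name demo.txt and the deliberately raised ValueError are illustrative assumptions, not part of the chapter's examples.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.txt"   # hypothetical file name
    target.write_text("hello\n", encoding="utf-8")

    try:
        with open(target, "r", encoding="utf-8") as f:
            f.read()
            raise ValueError("simulated failure inside the with block")
    except ValueError:
        pass

    # The with statement closed the file before the exception propagated.
    closed_after_error = f.closed
    print(closed_after_error)
```

The file object is still reachable through the name f after the block, but it is already closed, which is exactly what a forgotten close() would fail to ensure.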

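9.1.2 File Modes

Mode	Description	If the file exists	If the file does not exist
`r`	Read only	Read from the start	FileNotFoundError
`w`	Write only	Truncated	Created
`x`	Exclusive creation	FileExistsError	Created
`a`	Append	Writes go to the end	Created
`r+`	Read and write	Opened without truncation	FileNotFoundError
`b`	Binary mode (combined with another mode, e.g. `rb`)	-	-
`t`	Text mode (default)	-	-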

# Write to a file
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("First line\n")
    f.write("Second line\n")

# Append content
with open("output.txt", "a", encoding="utf-8") as f:
    f.write("Appended content\n")

# Binary mode
with open("data.bin", "wb") as f:
    f.write(b"\x00\x01\x02\x03")
with open("data.bin", "rb") as f:
    data = f.read()
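The `x` mode from the table is the one mode not exercised above. As a small sketch (the file name report.txt is a hypothetical choice, and a temporary directory keeps the example self-contained), it can serve as a guard against overwriting an existing file:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"   # hypothetical file name

    # First attempt: the file does not exist yet, so "x" creates it.
    with open(report, "x", encoding="utf-8") as f:
        f.write("original\n")

    # Second attempt: the file now exists, so "x" refuses to overwrite it.
    try:
        with open(report, "x", encoding="utf-8") as f:
            f.write("overwritten\n")
        overwritten = True
    except FileExistsError:
        overwritten = False

    final_text = report.read_text(encoding="utf-8")
```

Compared with checking for existence first and then opening with `w`, the check and the creation happen in a single open() call.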

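9.1.3 The File Pointer

with open("example.txt", "r", encoding="utf-8") as f:
    print(f.tell())          # 0 - position at the start of the file
    
    content = f.read(10)     # read 10 characters
    pos = f.tell()           # opaque position in text mode; equals 10 only for single-byte text
    
    f.seek(0)                # back to the start
    f.seek(pos)              # in text mode, seek only to 0 or to a value returned by tell()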
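In binary mode the file pointer is a plain byte offset, so arbitrary seeks are well defined. The sketch below assumes a small sample payload mixing three-byte UTF-8 characters with ASCII; the file name pointer.bin is illustrative.

```python
import os
import tempfile

# Three characters that take 3 bytes each in UTF-8, followed by three ASCII bytes.
payload = "文件名abc".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pointer.bin")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)

    with open(path, "rb") as f:
        first = f.read(3)           # three bytes = one encoded character
        pos_after_read = f.tell()   # byte offset after the read
        f.seek(-3, os.SEEK_END)     # three bytes before the end of the file
        tail = f.read()
```

Here read(3) consumes three bytes, which is one character rather than three. This is why character counts and pointer positions diverge for non-ASCII text.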

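9.2 Reading and Writing Files

9.2.1 Reading Files

# Read the entire content
with open("example.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Read line by line (recommended, memory-friendly)
with open("example.txt", "r", encoding="utf-8") as f:
    for line in f:
        print(line.strip())

# Read all lines into a list
with open("example.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Read a fixed amount
with open("example.txt", "r", encoding="utf-8") as f:
    chunk = f.read(1024)     # up to 1024 characters in text mode (bytes only in binary mode)

Engineering practice: When processing large files, iterate line by line with for line in f rather than calling f.read() or f.readlines(). The latter two load the whole file into memory and can exhaust it.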
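As an illustration of that practice, the following sketch computes totals over a file while holding only one line in memory at a time. The helper name count_lines_and_chars and the sample file are assumptions made for the example.

```python
import tempfile
from pathlib import Path

def count_lines_and_chars(path):
    """Stream a text file line by line, keeping only running totals."""
    lines = chars = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:          # the file object yields one line at a time
            lines += 1
            chars += len(line)  # includes the trailing newline
    return lines, chars

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"   # hypothetical sample file
    sample.write_text("alpha\nbeta\ngamma\n", encoding="utf-8")
    totals = count_lines_and_chars(sample)
    print(totals)
```

Memory use depends on the longest line, not on the file size, so the same function works unchanged on a file of several gigabytes.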

9.2.2 Writing Files

# Write a string
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("Hello, World!\n")

# Write multiple lines (writelines does not add newlines for you)
with open("output.txt", "w", encoding="utf-8") as f:
    lines = ["First line\n", "Second line\n", "Third line\n"]
    f.writelines(lines)

# Write with print
with open("output.txt", "w", encoding="utf-8") as f:
    print("Hello, World!", file=f)
    print("Python file operations", file=f)

9.2.3 Copying Large Files

# Copy in chunks (memory-friendly)
def copy_file(src: str, dst: str, chunk_size: int = 8192) -> None:
    with open(src, "rb") as src_file, open(dst, "wb") as dst_file:
        while chunk := src_file.read(chunk_size):
            dst_file.write(chunk)

# Use shutil (recommended)
import shutil
shutil.copy2("source.bin", "destination.bin")  # also copies file metadata where possible

9.3 Path Operations

9.3.1 pathlib (recommended)

from pathlib import Path

# Build a path
path = Path("home") / "user" / "documents" / "file.txt"

# Path components
print(path.name)       # "file.txt"
print(path.stem)       # "file"
print(path.suffix)     # ".txt"
print(path.parent)     # home/user/documents (a Path object)

# Absolute path
print(Path("file.txt").resolve())

# Existence checks
print(path.exists())
print(path.is_file())
print(path.is_dir())

# File information (the file must exist)
print(path.stat().st_size)    # file size in bytes

# Create, rename, and delete
Path("new_dir").mkdir(parents=True, exist_ok=True)
path.rename("new_name.txt")        # rename; path itself still holds the old name
Path("new_name.txt").unlink()      # delete the file

# Convenience reads and writes
content = Path("file.txt").read_text(encoding="utf-8")
Path("output.txt").write_text("Hello!\n", encoding="utf-8")

# File search
for py_file in Path(".").glob("*.py"):
    print(py_file)
for py_file in Path(".").rglob("*.py"):    # recursive search
    print(py_file)

Engineering practice: Prefer pathlib over os.path. pathlib is object-oriented, joins paths with the / operator, has a more consistent API, and has been part of the standard library since Python 3.4.
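To make the comparison concrete, here is a small side-by-side sketch. It uses posixpath (the POSIX flavour of os.path) and PurePosixPath so that the results are identical on every operating system; the path itself is a made-up example.

```python
import posixpath                      # the POSIX flavour of os.path
from pathlib import PurePosixPath     # pure paths never touch the file system

raw = "/home/user/documents/report.tar.gz"   # hypothetical path
p = PurePosixPath(raw)

# String functions on the left, the matching pathlib attribute on the right.
name_pair = (posixpath.basename(raw), p.name)
parent_pair = (posixpath.dirname(raw), str(p.parent))
ext_pair = (posixpath.splitext(raw)[1], p.suffix)
joined_pair = (posixpath.join("a", "b", "c"), str(PurePosixPath("a") / "b" / "c"))

# pathlib also exposes every suffix, which os.path has no direct call for.
all_suffixes = p.suffixes
```

Each pair holds equal values. The difference is ergonomic: pathlib keeps the path as one object instead of passing strings through separate functions.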

9.3.2 os.path (legacy)

import os

path = "/home/user/documents/file.txt"
print(os.path.basename(path))     # "file.txt"
print(os.path.dirname(path))      # "/home/user/documents"
print(os.path.join("a", "b", "c"))  # "a/b/c" on POSIX; backslash-separated on Windows
print(os.path.splitext(path))     # ("/home/user/documents/file", ".txt")
print(os.path.exists(path))

9.4 Directory Operations

9.4.1 Traversing Directories

import os
from pathlib import Path

# List the current directory
for item in Path(".").iterdir():
    print(f"{'[D]' if item.is_dir() else '[F]'} {item.name}")

# Recursive traversal
for root, dirs, files in os.walk("."):
    for name in files:
        print(Path(root) / name)

# Search by pattern
for py_file in Path("src").rglob("*.py"):
    print(py_file)

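9.4.2 Creating and Deleting

from pathlib import Path
import shutil

# Create directories
Path("new_dir").mkdir()
Path("a/b/c").mkdir(parents=True, exist_ok=True)  # create parents too; no error if it already exists

# Delete
Path("file.txt").unlink()        # delete a file
Path("empty_dir").rmdir()        # delete an empty directory
shutil.rmtree("non_empty_dir")   # delete a directory and everything in it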

9.5 Serialization

9.5.1 JSON

import json
from dataclasses import dataclass, asdict
from datetime import datetime

data = {"name": "Alice", "age": 25, "skills": ["Python", "SQL"]}

# Serialize
json_str = json.dumps(data, ensure_ascii=False, indent=2)
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Deserialize
parsed = json.loads(json_str)
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Custom serialization
def json_serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")

json_str = json.dumps({"time": datetime.now()}, default=json_serializer)

# Serializing a dataclass
@dataclass
class Person:
    name: str
    age: int

person = Person("Alice", 25)
json_str = json.dumps(asdict(person))

9.5.2 CSV

import csv

# Write CSV
data = [
    {"name": "Alice", "age": 25, "city": "北京"},
    {"name": "Bob", "age": 30, "city": "上海"},
]
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "city"])
    writer.writeheader()
    writer.writerows(data)

# Read CSV (newline="" lets the csv module handle line endings itself)
with open("data.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["name"], row["age"])

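9.5.3 pickle

import pickle
from datetime import datetime

data = {"name": "Alice", "time": datetime.now()}

# Serialize to a file
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

# Deserialize from a file
with open("data.pkl", "rb") as f:
    loaded = pickle.load(f)

# Serialize to bytes
pickled = pickle.dumps(data)
loaded = pickle.loads(pickled)

Security warning: Unpickling can execute arbitrary code. Never load pickle data from an untrusted source. In web applications, exchange data as JSON rather than pickle.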
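The mechanism behind this warning can be shown harmlessly. In the sketch below, the class name Payload is hypothetical, and eval on a fixed arithmetic string stands in for whatever callable an attacker might choose.

```python
import pickle

class Payload:
    """Hypothetical class whose pickle form is 'call this function on load'."""

    def __reduce__(self):
        # pickle stores (callable, args) and invokes callable(*args) when loading.
        # Here the callable is the builtin eval with a harmless expression.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading runs eval("6 * 7"); no Payload instance is reconstructed at all.
result = pickle.loads(blob)
print(result)
```

The bytes in blob decide which function runs and with which arguments, so loading attacker-controlled bytes hands the attacker that choice.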

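9.5.4 Format Comparison

Feature	JSON	CSV	pickle
Data types	Basic types	Plain-text tables	Arbitrary Python objects
Readability	High	High	Not human-readable
Safety	Safe	Safe	Unsafe
Cross-language	Yes	Yes	No
Typical use	APIs, configuration	Tabular data	Python-internal caching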
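The "Data types" row can be observed with a round trip through each format. This is a small sketch using made-up sample data that contains a tuple and an integer key, two things JSON cannot represent directly.

```python
import json
import pickle

original = {"point": (1, 2), 1: "one"}   # sample data, chosen for illustration

# JSON has no tuple type and only string keys, so both are converted.
via_json = json.loads(json.dumps(original))

# pickle records the Python types themselves, so the round trip is exact.
via_pickle = pickle.loads(pickle.dumps(original))

print(via_json)
print(via_pickle)
```

The JSON result is still usable, but code that relied on tuple immutability or integer keys has to convert the values back explicitly.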

9.6 Context Managers

9.6.1 Custom Context Managers

import time
from contextlib import contextmanager

# Class-based implementation
class Timer:
    def __init__(self, name: str = ""):
        self.name = name
    
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        elapsed = time.perf_counter() - self.start
        print(f"{self.name} took {elapsed:.4f} s")
        return False    # do not suppress exceptions

with Timer("compute"):
    sum(range(1000000))

# Generator-based implementation
@contextmanager
def timer(name: str = ""):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{name} took {elapsed:.4f} s")

with timer("compute"):
    sum(range(1000000))
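The Timer class returns False from __exit__, which lets exceptions propagate. As a sketch of the other case, the hypothetical Suppress class below returns True for one chosen exception type. The standard library already offers contextlib.suppress for real use; this version only makes the return value visible.

```python
class Suppress:
    """Hypothetical helper: swallow one exception type, let others through."""

    def __init__(self, exc_type):
        self.exc_type = exc_type
        self.caught = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None and issubclass(exc_type, self.exc_type):
            self.caught = exc_val
            return True     # truthy: the exception stops here
        return False        # falsy: any exception keeps propagating

with Suppress(ZeroDivisionError) as guard:
    1 / 0
reached = True              # execution continues after the with block

try:
    with Suppress(ZeroDivisionError):
        int("not a number")
    propagated = False
except ValueError:
    propagated = True       # a different exception type was not suppressed
```

Returning True unconditionally would hide every error raised in the block, so a narrow type check like the one above is usually the safer design.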

9.6.2 Practical Context Managers

from contextlib import contextmanager
import os

@contextmanager
def change_dir(path: str):
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old_dir)

@contextmanager
def temp_file(content: str, suffix: str = ".txt"):
    import tempfile
    with tempfile.NamedTemporaryFile(mode="w", suffix=suffix, delete=False, encoding="utf-8") as f:
        f.write(content)
        name = f.name
    try:
        yield name
    finally:
        os.unlink(name)

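9.7 IO Models and Performance

9.7.1 Synchronous and Asynchronous IO

import time
import asyncio
import aiofiles

def sync_io_example():
    """Synchronous IO example."""
    start = time.perf_counter()
    
    with open("file1.txt", "w", encoding="utf-8") as f:
        f.write("content1")
    with open("file2.txt", "w", encoding="utf-8") as f:
        f.write("content2")
    with open("file3.txt", "w", encoding="utf-8") as f:
        f.write("content3")
    
    print(f"Synchronous IO took {time.perf_counter() - start:.4f} s")

async def async_io_example():
    """Asynchronous IO example (requires the third-party aiofiles library).

    The three writes still run one after another; the benefit is that other
    tasks on the event loop can run while each write is in progress.
    """
    start = time.perf_counter()
    
    async with aiofiles.open("file1.txt", "w", encoding="utf-8") as f:
        await f.write("content1")
    async with aiofiles.open("file2.txt", "w", encoding="utf-8") as f:
        await f.write("content2")
    async with aiofiles.open("file3.txt", "w", encoding="utf-8") as f:
        await f.write("content3")
    
    print(f"Asynchronous IO took {time.perf_counter() - start:.4f} s")

Academic note: File IO in Python is blocking and synchronous by default. In high-concurrency scenarios (such as a web server handling many file requests), use asynchronous IO (asyncio + aiofiles) or a thread pool so that file access does not stall other work. Note that aiofiles does not rely on non-blocking operating system interfaces for regular files; it delegates each blocking call to a thread pool, which lets a single event loop thread keep serving other tasks while the IO is in progress.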
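The thread pool option mentioned in the note needs no third-party library. The following sketch uses asyncio.to_thread (Python 3.9+) together with asyncio.gather so that the three writes are actually in flight at the same time; the helper names write_blocking and write_all are assumptions made for the example.

```python
import asyncio
import tempfile
from pathlib import Path

def write_blocking(path, text):
    """Ordinary blocking write; it runs in a worker thread below."""
    Path(path).write_text(text, encoding="utf-8")
    return len(text)

async def write_all(directory):
    jobs = [
        asyncio.to_thread(write_blocking, Path(directory) / f"file{i}.txt", f"content{i}")
        for i in range(1, 4)
    ]
    # gather schedules the three writes together instead of one after another.
    return await asyncio.gather(*jobs)

with tempfile.TemporaryDirectory() as tmp:
    sizes = asyncio.run(write_all(tmp))
    written = sorted(p.name for p in Path(tmp).iterdir())
    print(sizes, written)
```

For three tiny files the thread overhead likely outweighs any gain; the pattern pays off when each operation waits on slow storage or when the event loop has other tasks to serve in the meantime.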

9.7.2 Buffering and Performance

import time

def benchmark_buffer():
    """Compare how the buffer size affects many small writes."""
    chunk = b"x" * 100
    configs = [("Unbuffered", 0), ("8 KB buffer", 8192), ("64 KB buffer", 65536)]
    
    for label, size in configs:
        start = time.perf_counter()
        with open("test.bin", "wb", buffering=size) as f:
            for _ in range(100_000):      # 10 MB in total
                f.write(chunk)
        # The timer stops after the with block, so the final flush is included.
        print(f"{label}: {time.perf_counter() - start:.4f} s")

benchmark_buffer()

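9.7.3 Memory-Mapped Files

import mmap

def mmap_example():
    """Memory-mapped file: map a file into memory."""
    
    with open("large_file.bin", "wb") as f:
        f.write(b"\x00" * 10_000_000)
    
    with open("large_file.bin", "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)    # length 0 maps the whole file
        
        print(f"File size: {len(mm)} bytes")
        
        mm[0:4] = b"TEST"                # slice assignment must keep the same length
        
        mm.seek(100)
        mm.write(b"Hello")
        
        mm.close()

def mmap_search():
    """Search a large file for a byte string."""
    with open("large_file.txt", "rb") as f:    # mmap works on bytes, so open in binary mode
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            index = mm.find(b"search_term")
            if index != -1:
                print(f"Found at byte offset {index}")

Academic note: A memory-mapped file (mmap) maps a file directly into the virtual address space of the process, and the operating system pages data between memory and disk on demand. It is suited to: 1) very large files (larger than physical memory); 2) random access to file contents; 3) shared-memory communication between processes.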
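Random access is the use case that benefits most visibly. The sketch below assumes a file of fixed-width 8-byte records (a layout invented for this example) and reads one record from the middle by slicing, without reading the records before it.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 8   # assumed fixed record width for this example

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")   # hypothetical file name
    # 1000 records: b"rec00000" ... b"rec00999"
    with open(path, "wb") as f:
        for i in range(1000):
            f.write(f"rec{i:05d}".encode("ascii"))

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            # Slicing touches only the pages that hold this byte range.
            record_500 = mm[500 * RECORD_SIZE : 501 * RECORD_SIZE]
            mm[0:RECORD_SIZE] = b"REC00000"   # in-place update, same length
            mm.flush()

    with open(path, "rb") as f:
        first_on_disk = f.read(RECORD_SIZE)
```

The same lookup with ordinary file calls would need a seek followed by a read; with mmap the file behaves like a mutable bytes object, at the cost of not being able to change its length through slice assignment.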

9.7.4 File System Monitoring

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class FileChangeHandler(FileSystemEventHandler):
    def on_created(self, event):
        print(f"Created: {event.src_path}")
    
    def on_modified(self, event):
        print(f"Modified: {event.src_path}")
    
    def on_deleted(self, event):
        print(f"Deleted: {event.src_path}")

def watch_directory(path: str):
    """Watch a directory for changes until interrupted with Ctrl+C."""
    observer = Observer()
    observer.schedule(FileChangeHandler(), path, recursive=True)
    observer.start()
    
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()

9.8 Recent Developments

9.8.1 Asynchronous File IO

import asyncio
import aiofiles

async def read_file_async(path: str) -> str:
    async with aiofiles.open(path, mode='r', encoding='utf-8') as f:
        return await f.read()

async def write_file_async(path: str, content: str) -> None:
    async with aiofiles.open(path, mode='w', encoding='utf-8') as f:
        await f.write(content)

async def main():
    content = await read_file_async("data.txt")
    await write_file_async("output.txt", content.upper())

asyncio.run(main())

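9.8.2 Modern Path Operations

from pathlib import Path

# Create the parent directories of a target file if they are missing
p = Path("data/output.txt")
p.parent.mkdir(parents=True, exist_ok=True)

# Relative path computation
base = Path("/home/user/project")
target = Path("/home/user/data/file.txt")
# target is not inside base, so plain relative_to(base) raises ValueError.
# Python 3.12+ adds walk_up=True, which allows ".." components in the result.
relative = target.relative_to(base, walk_up=True)   # ../data/file.txt

# Pattern matching
for py_file in Path(".").glob("**/*.py"):
    print(py_file)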

9.8.3 High-Performance Serialization

import orjson
import msgspec

data = {"name": "Alice", "age": 30, "items": [1, 2, 3]}

json_bytes = orjson.dumps(data)
loaded = orjson.loads(json_bytes)

encoder = msgspec.json.Encoder()
decoder = msgspec.json.Decoder()
encoded = encoder.encode(data)
decoded = decoder.decode(encoded)

9.8.4 File System Monitoring

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class MyHandler(FileSystemEventHandler):
    def on_modified(self, event):
        print(f"Modified: {event.src_path}")
    
    def on_created(self, event):
        print(f"Created: {event.src_path}")

observer = Observer()
observer.schedule(MyHandler(), path=".", recursive=True)
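observer.start()
# The observer runs in a background thread. Keep the main thread alive while
# watching, then call observer.stop() and observer.join() on shutdown.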

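9.9 Chapter Summary

This chapter covered Python file operations from end to end:

File reading and writing: the with statement, text and binary modes, line-by-line reading
Path operations: object-oriented path handling with pathlib (recommended)
Directory operations: traversal, creation, deletion, copying
Serialization: JSON (cross-language and safe), CSV (tabular data), pickle (Python-internal)
Context managers: the standard pattern for resource management, implemented as a class or a generator
IO models: synchronous and asynchronous IO, buffering strategies, memory-mapped files

9.9.1 Best Practices for File Operations

Scenario	Recommended approach	Reason
Ordinary file reads and writes	with + open()	Automatic resource management
Path operations	pathlib.Path	Object-oriented, cross-platform
Configuration files	JSON/YAML	Readable, cross-language
Large files	Line-by-line iteration/mmap	Memory-friendly
High-concurrency IO	asyncio + aiofiles	Does not block the event loop
Temporary files	tempfile module	Safe, automatic cleanup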
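The tempfile row is the only entry in the table without an example elsewhere in the chapter apart from the temp_file helper. A minimal sketch of the automatic cleanup, using a hypothetical scratch file name:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "scratch.json"   # hypothetical file name
    scratch.write_text('{"ok": true}', encoding="utf-8")
    existed_inside = scratch.exists()

# Leaving the with block removed the directory and everything in it.
exists_after = scratch.exists()
print(existed_inside, exists_after)
```

The directory name is generated by the library, which avoids both name collisions and predictable paths in shared temporary locations.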

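9.9.2 Common Pitfalls and How to Avoid Them

# Pitfall 1: forgetting to close the file
f = open("file.txt", "r")
content = f.read()
# f.close() was never called!

# Correct approach
with open("file.txt", "r") as f:
    content = f.read()

# Pitfall 2: encoding errors
with open("file.txt", "r") as f:  # platform default encoding; may raise UnicodeDecodeError
    content = f.read()

# Correct approach
with open("file.txt", "r", encoding="utf-8") as f:
    content = f.read()

# Pitfall 3: running out of memory on large files
content = open("large.txt").read()  # loads the whole file into memory!

# Correct approach
with open("large.txt", encoding="utf-8") as f:
    for line in f:
        process(line)  # process() is a placeholder for your own handler

# Pitfall 4: pickle security risk
data = pickle.loads(untrusted_data)  # may execute malicious code!

# Correct approach: use JSON
data = json.loads(untrusted_data)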
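Pitfall 2 can be reproduced on any platform by writing bytes in one encoding and reading them in another. The sketch below assumes a file produced by a GBK-based tool; the file name legacy.txt is illustrative.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"            # hypothetical file name
    legacy.write_bytes("中文".encode("gbk"))      # bytes as a GBK-based tool would write them

    try:
        legacy.read_text(encoding="utf-8")       # wrong encoding for these bytes
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

    recovered = legacy.read_text(encoding="gbk")                    # the matching encoding
    lossy = legacy.read_text(encoding="utf-8", errors="replace")    # keeps going, loses data
```

Passing errors="replace" avoids the exception but substitutes U+FFFD for the undecodable bytes, so it is a fallback for inspection rather than a fix. The fix is to know and state the real encoding.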

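9.10 Exercises

Basic

Write a program that counts the lines, words, and characters in a text file.
Use JSON to store and load a list of student records.
Implement a simple logger that splits log files by date.

Intermediate

Implement a file search tool that recursively searches a given directory for files containing specific content.
Write a program that merges multiple CSV files and removes duplicates.
Implement a manager for INI-format configuration files that supports reading, writing, and type conversion.

Project

File synchronization tool: write a program that meets these requirements:
- Compare the files in two directories and report the differences
- Support incremental synchronization (copy only files that have changed)
- Use file hashes (MD5/SHA256) to decide whether two files are identical
- Support exclusion rules (for example, ignore the .git directory)
- Generate a synchronization report
- Use pathlib for path handling

Discussion

Why is the with statement safer than calling close() manually? What does the return value of __exit__ do?
What is the core difference between JSON and pickle? Why is pickle data unsafe?
What advantages does pathlib have over os.path? In which situations is os.path still needed?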

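9.11 Further Reading

9.11.1 File Systems and IO

Operating System Concepts (Silberschatz et al.): file system and IO principles
Advanced Programming in the UNIX Environment (W. Richard Stevens): file IO system calls
Python IO documentation (https://docs.python.org/3/library/io.html): the Python IO class hierarchy

9.11.2 Paths and File Operations

pathlib documentation (https://docs.python.org/3/library/pathlib.html): object-oriented path operations
PEP 428, The pathlib module (https://peps.python.org/pep-0428/): the design rationale for pathlib
shutil documentation (https://docs.python.org/3/library/shutil.html): high-level file operations

9.11.3 Serialization and Data Formats

JSON specification (https://www.json.org/): the JSON data format
RFC 8259 (https://tools.ietf.org/html/rfc8259): the IETF standard for JSON
pickle documentation (https://docs.python.org/3/library/pickle.html): Python object serialization

9.11.4 Asynchronous IO

asyncio documentation (https://docs.python.org/3/library/asyncio.html): the Python asynchronous IO framework
aiofiles (https://github.com/Tinche/aiofiles): asynchronous file operations library
《Python并发编程》 (Python Concurrent Programming): asynchronous IO and concurrency models

Next chapter: Chapter 10, Classes and Objects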
