Chapter 29: Localization and Regular Expressions

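Localization overview

Localization is the process of adapting software to a specific region or language. It includes:

Language translation: translating the user interface and text messages into the target language
Regional settings: adapting to the target region's formats for dates, times, numbers, and currency
Cultural adaptation: taking the target region's cultural differences and conventions into account

In C++, localization is implemented mainly through the facilities in the <locale> header.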

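Character encoding

Basic concepts

A character encoding is a set of rules that maps characters to binary data. Common character encodings include:

ASCII: American Standard Code for Information Interchange; represents each character with 7 bits
ISO-8859-1: the Latin-1 encoding; represents each character with 8 bits and is compatible with ASCII
UTF-8: a variable-length Unicode encoding that uses 1 to 4 bytes per character
UTF-16: a variable-length Unicode encoding that uses 2 or 4 bytes per character
UTF-32: a fixed-length Unicode encoding that uses 4 bytes per character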
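
Because UTF-8 is variable-length, the number of bytes in a string is not the number of characters. The following is a minimal sketch of counting code points, assuming the input is valid UTF-8; the function name utf8_length is hypothetical and not part of the standard library.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Continuation bytes always have the bit pattern 10xxxxxx, so every byte
// that does NOT match that pattern starts a new code point.
// Assumption: the input is valid UTF-8; no validation is performed.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, the string "\xC3\xA4pple" (the UTF-8 bytes of "äpple") occupies 6 bytes but contains 5 code points.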

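Character types in C++

C++ provides several character types for working with different encodings:

char: at least 8 bits (exactly 8 on almost all platforms); used for ASCII or UTF-8 text
wchar_t: wide character type, typically 16 bits (Windows, UTF-16) or 32 bits (Linux and macOS, UTF-32)
char16_t: introduced in C++11; a 16-bit character type for UTF-16
char32_t: introduced in C++11; a 32-bit character type for UTF-32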

String literals

C++ supports several kinds of string literal:

// Ordinary string literal (char)
const char* s1 = "Hello";

// Wide string literal (wchar_t)
const wchar_t* s2 = L"Hello";

// UTF-16 string literal (char16_t)
const char16_t* s3 = u"Hello";

// UTF-32 string literal (char32_t)
const char32_t* s4 = U"Hello";

// Raw string literal (escape sequences are not processed)
const char* s5 = R"(Hello
World)";

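Locales

Basic concepts

A locale is a set of rules that defines the language and cultural conventions of a specific region, including:

Number format: decimal point, thousands separator, and so on
Date and time format: the order of year, month, and day, separators, and so on
Currency format: currency symbol, symbol position, and so on
Character classification: the definition of uppercase, lowercase, digits, and so on
String collation: the rules for comparing characters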

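Locales in C++

C++ represents a locale with the std::locale class. Locale names are platform-specific (the names below follow the POSIX convention), and the constructor throws std::runtime_error if the named locale is not installed:

#include <locale>
#include <iostream>

int main() {
    // Use the default locale (a copy of the current global locale)
    std::locale defaultLoc;
    
    // Use specific locales
    std::locale frLoc("fr_FR.UTF-8"); // French (France)
    std::locale deLoc("de_DE.UTF-8"); // German (Germany)
    std::locale zhLoc("zh_CN.UTF-8"); // Chinese (China)
    
    // Set the global locale
    std::locale::global(std::locale("en_US.UTF-8"));
    
    return 0;
}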

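Components of a locale

A locale is made up of facets, each responsible for one specific task:

std::num_put: formatting numbers for output
std::num_get: parsing numbers from input
std::time_put: formatting dates and times for output
std::time_get: parsing dates and times from input
std::money_put: formatting monetary values for output
std::money_get: parsing monetary values from input
std::collate: string collation and comparison
std::ctype: character classification and conversion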
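
A facet can also be replaced individually, which is useful when a named system locale is not installed. The sketch below derives from std::numpunct<char> (the facet that std::num_put consults for punctuation; it is not in the list above) to imitate German number punctuation. The class name GermanStylePunct and the function format_german_style are hypothetical.

```cpp
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet that imitates German number punctuation without
// requiring a "de_DE" locale to be installed on the system.
class GermanStylePunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    // "\3" means: group digits in threes, repeating.
    std::string do_grouping() const override { return "\3"; }
};

// Formats a value with two decimals using the custom facet.
std::string format_german_style(double value) {
    std::ostringstream out;
    // The new locale takes ownership of the facet and deletes it
    // once no locale refers to it any more.
    out.imbue(std::locale(std::locale::classic(), new GermanStylePunct));
    out << std::fixed;
    out.precision(2);
    out << value;
    return out.str();
}
```

With this facet, format_german_style(1234567.891) yields "1.234.567,89".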

Using a locale

#include <locale>
#include <iostream>
#include <string>

int main() {
    // Create a locale
    std::locale loc("de_DE.UTF-8");
    
    // Output a number in German format
    std::cout.imbue(loc);
    std::cout << "Number: " << 123456.78 << std::endl;
    
    // Compare strings using the locale's collation rules
    std::string s1 = "äpple";
    std::string s2 = "banana";
    
    std::collate<char> const& coll = std::use_facet<std::collate<char>>(loc);
    int result = coll.compare(s1.data(), s1.data() + s1.length(),
                             s2.data(), s2.data() + s2.length());
    
    if (result < 0) {
        std::cout << s1 << " comes before " << s2 << std::endl;
    } else if (result > 0) {
        std::cout << s1 << " comes after " << s2 << std::endl;
    } else {
        std::cout << s1 << " is equal to " << s2 << std::endl;
    }
    
    return 0;
}
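
The examples in this chapter construct locales by name, which fails with std::runtime_error on systems where that locale is not installed. One possible way to degrade gracefully is sketched below; the helper name locale_or_classic is hypothetical, and falling back to the classic "C" locale is only one of several reasonable policies.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the named locale if the system provides it,
// otherwise falls back to the classic "C" locale.
std::locale locale_or_classic(const std::string& name) {
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        // std::locale's constructor throws when the name is unknown.
        return std::locale::classic();
    }
}
```

A call such as std::cout.imbue(locale_or_classic("de_DE.UTF-8")) then works on any machine, at the cost of silently using "C" formatting when the German locale is missing.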

Date and time formatting

Using `std::put_time`

C++11 introduced the std::put_time manipulator for formatting dates and times:

#include <iostream>
#include <iomanip>
#include <ctime>

int main() {
    // Get the current time
    std::time_t now = std::time(nullptr);
    std::tm* localTime = std::localtime(&now);
    
    // Formatted output
    std::cout << "Current time: " << std::put_time(localTime, "%Y-%m-%d %H:%M:%S") << std::endl;
    std::cout << "Date: " << std::put_time(localTime, "%d/%m/%Y") << std::endl;
    std::cout << "Time: " << std::put_time(localTime, "%I:%M %p") << std::endl;
    
    return 0;
}

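Format specifiers

Format specifiers used by std::put_time:

Specifier	Meaning	Example
`%Y`	Four-digit year	2023
`%y`	Two-digit year	23
`%m`	Two-digit month (01-12)	05
`%d`	Two-digit day of the month (01-31)	21
`%H`	Hour on the 24-hour clock (00-23)	14
`%I`	Hour on the 12-hour clock (01-12)	02
`%M`	Two-digit minute (00-59)	30
`%S`	Two-digit second (00-59)	45
`%p`	AM/PM marker	PM
`%a`	Abbreviated weekday name	Mon
`%A`	Full weekday name	Monday
`%b`	Abbreviated month name	Jan
`%B`	Full month name	January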
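
To check specifiers from the table against a known value, it helps to format a fixed std::tm instead of the current time. This is a small sketch; the helper names format_time and sample_time are hypothetical, and the stream is imbued with the classic locale so the result does not depend on the environment.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Formats a broken-down time with std::put_time into a string.
std::string format_time(const std::tm& t, const char* fmt) {
    std::ostringstream out;
    out.imbue(std::locale::classic());  // fixed, environment-independent output
    out << std::put_time(&t, fmt);
    return out.str();
}

// Builds the sample timestamp 2023-05-21 14:30:45.
std::tm sample_time() {
    std::tm t{};
    t.tm_year = 2023 - 1900;  // years since 1900
    t.tm_mon = 5 - 1;         // months since January
    t.tm_mday = 21;
    t.tm_hour = 14;
    t.tm_min = 30;
    t.tm_sec = 45;
    return t;
}
```

Here format_time(sample_time(), "%Y-%m-%d %H:%M:%S") produces "2023-05-21 14:30:45", and "%I:%M %p" produces "02:30 PM".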

Number formatting

Using stream manipulators

C++ provides a range of stream manipulators for formatting numbers. Most of them are sticky: once set, they stay in effect for later output on the same stream:

#include <iostream>
#include <iomanip>

int main() {
    double number = 123456.789;
    
    // Set the precision
    std::cout << "Precision: " << std::setprecision(5) << number << std::endl;
    
    // Fixed-point notation
    std::cout << "Fixed: " << std::fixed << std::setprecision(2) << number << std::endl;
    
    // Scientific notation
    std::cout << "Scientific: " << std::scientific << std::setprecision(2) << number << std::endl;
    
    // Hexadecimal
    std::cout << "Hex: " << std::hex << std::showbase << 255 << std::endl;
    
    // Octal
    std::cout << "Octal: " << std::oct << std::showbase << 255 << std::endl;
    
    // Boolean values
    std::cout << "Bool: " << std::boolalpha << true << std::endl;
    
    return 0;
}
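
Since manipulators such as std::hex and std::fixed remain active after use, formatting code can unintentionally change later output. Two possible ways to contain this are sketched below: formatting into a private std::ostringstream, and a small guard object that restores the stream state. The names to_fixed and FormatGuard are hypothetical.

```cpp
#include <iomanip>
#include <ios>
#include <sstream>
#include <string>

// Formats in fixed notation using a private stream, so no flags leak
// into std::cout or any other shared stream.
std::string to_fixed(double value, int digits) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(digits) << value;
    return out.str();
}

// Saves a stream's flags and precision and restores them on destruction.
class FormatGuard {
public:
    explicit FormatGuard(std::ios_base& stream)
        : stream_(stream), flags_(stream.flags()), precision_(stream.precision()) {}
    ~FormatGuard() {
        stream_.flags(flags_);
        stream_.precision(precision_);
    }
    FormatGuard(const FormatGuard&) = delete;
    FormatGuard& operator=(const FormatGuard&) = delete;

private:
    std::ios_base& stream_;
    std::ios_base::fmtflags flags_;
    std::streamsize precision_;
};
```

A typical use is to declare FormatGuard guard(std::cout); at the top of a block that writes hexadecimal or fixed-point values; when the block ends, the stream prints decimal numbers with the previous precision again.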

Formatting numbers with a locale

#include <iostream>
#include <locale>

int main() {
    // Use the German locale
    std::locale deLoc("de_DE.UTF-8");
    std::cout.imbue(deLoc);
    
    double number = 123456.789;
    std::cout << "German format: " << number << std::endl;
    
    // Use the US locale
    std::locale usLoc("en_US.UTF-8");
    std::cout.imbue(usLoc);
    std::cout << "US format: " << number << std::endl;
    
    return 0;
}

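Currency formatting

Using `std::money_put` and `std::money_get`

The std::put_money and std::get_money manipulators from <iomanip> call the std::money_put and std::money_get facets of the stream's locale. Amounts are expressed in the smallest currency unit (cents in this example):

#include <iostream>
#include <iomanip>
#include <locale>

int main() {
    // Use the US locale
    std::locale usLoc("en_US.UTF-8");
    std::cout.imbue(usLoc);
    
    // Format currency; put_money expects the amount in cents
    long double amount = 1234.56;
    std::cout << "US currency: " << std::showbase << std::put_money(amount * 100) << std::endl;
    
    // Use the German locale
    std::locale deLoc("de_DE.UTF-8");
    std::cout.imbue(deLoc);
    std::cout << "German currency: " << std::showbase << std::put_money(amount * 100) << std::endl;
    
    // Read a currency amount; get_money stores the value in cents
    std::cout.imbue(usLoc);
    std::cout << "Enter amount: ";
    std::cin.imbue(usLoc);
    if (std::cin >> std::get_money(amount)) {
        std::cout << "You entered: " << std::showbase << std::put_money(amount) << std::endl;
    }
    
    return 0;
}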

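Regular expression overview

A regular expression is a pattern used to match combinations of characters in a string. C++11 added the <regex> header to the standard library, which provides regular expression support. The default grammar is a modified version of ECMAScript regular expressions.

Uses of regular expressions

String searching: finding strings that match a pattern in a text
String replacement: replacing the parts of a text that match a pattern
String validation: checking whether a string has a particular format (such as an e-mail address or phone number)
String splitting: splitting a string according to a pattern

The C++ regular expression library

The C++ regular expression library consists mainly of the following components:

std::regex: represents a regular expression object
std::smatch: holds the result of matching a regular expression against a std::string
std::regex_match: checks whether the entire string matches the regular expression
std::regex_search: looks for a substring that matches the regular expression
std::regex_replace: replaces the substrings that match the regular expression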

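Regular expression syntax

Basic syntax

Syntax	Meaning	Example
`^`	Matches the start of the string	`^Hello` matches strings that start with "Hello"
`$`	Matches the end of the string	`World$` matches strings that end with "World"
`.`	Matches any single character (except a newline)	`H.llo` matches "Hello", "Hallo", etc.
`*`	Matches the preceding element zero or more times	`ab*` matches "a", "ab", "abb", etc.
`+`	Matches the preceding element one or more times	`ab+` matches "ab", "abb", etc.
`?`	Matches the preceding element zero or one time	`ab?` matches "a", "ab"
`{n}`	Matches the preceding element exactly n times	`a{3}` matches "aaa"
`{n,}`	Matches the preceding element at least n times	`a{2,}` matches "aa", "aaa", etc.
`{n,m}`	Matches the preceding element at least n and at most m times	`a{1,3}` matches "a", "aa", "aaa"
`|`	Matches either of two patterns	`a|b` matches "a" or "b"
`(...)`	Capture group; matches the pattern inside the parentheses	`(ab)+` matches "ab", "abab", etc.
`[abc]`	Character set; matches any one character inside the brackets	`[aeiou]` matches any vowel
`[^abc]`	Negated character set; matches any one character not inside the brackets	`[^0-9]` matches any non-digit character
`\d`	Matches a digit; equivalent to `[0-9]`	`\d+` matches one or more digits
`\D`	Matches a non-digit; equivalent to `[^0-9]`	`\D+` matches one or more non-digit characters
`\w`	Matches a word character; equivalent to `[a-zA-Z0-9_]`	`\w+` matches one or more word characters
`\W`	Matches a non-word character; equivalent to `[^a-zA-Z0-9_]`	`\W+` matches one or more non-word characters
`\s`	Matches a whitespace character (space, tab, newline, etc.)	`\s+` matches one or more whitespace characters
`\S`	Matches a non-whitespace character	`\S+` matches one or more non-whitespace characters
`\b`	Word boundary	`\bword\b` matches the whole word "word"
`\B`	Non-word boundary	`\Bword\B` matches "word" inside a longer word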
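
A few rows of the table can be checked directly in code. The two helpers below are a hypothetical convenience for experimenting (full_match and contains_match are not library functions); they compile the pattern on every call, which is fine for exploration but not for hot paths.

```cpp
#include <regex>
#include <string>

// True if the WHOLE text matches the pattern.
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// True if ANY part of the text matches the pattern.
bool contains_match(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

For instance, full_match("abb", "ab*") is true while full_match("a", "ab+") is false, and contains_match("swordfish", R"(\bword\b)") is false because "word" is not a separate word there.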

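Escape characters

In an ordinary C++ string literal the backslash is itself an escape character, so every backslash in a regular expression must be written as a double backslash:

// Regular expression that matches digits
std::regex numRegex("\\d+");

// Regular expression that matches e-mail addresses
std::regex emailRegex("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");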
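
Raw string literals avoid the double backslashes entirely, because escape sequences are not processed inside R"(...)". The sketch below shows the same pattern written both ways; the variable and function names are illustrative only.

```cpp
#include <regex>
#include <string>

// Both objects hold the same pattern: \d{3}-\d{4}
const std::regex escaped_pattern("\\d{3}-\\d{4}");  // ordinary literal, doubled backslashes
const std::regex raw_pattern(R"(\d{3}-\d{4})");     // raw literal, pattern written as-is

// Checks whether the whole string has the form ddd-dddd.
bool is_short_number(const std::string& s) {
    return std::regex_match(s, raw_pattern);
}
```

The raw form is usually easier to compare against a pattern tested elsewhere, since it can be copied without modification.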

Using the C++ regular expression library

1. `std::regex_match`

std::regex_match checks whether the entire string matches the regular expression:

#include <regex>
#include <iostream>
#include <string>

int main() {
    std::string phone = "123-456-7890";
    std::regex phoneRegex("\\d{3}-\\d{3}-\\d{4}");
    
    if (std::regex_match(phone, phoneRegex)) {
        std::cout << "Valid phone number" << std::endl;
    } else {
        std::cout << "Invalid phone number" << std::endl;
    }
    
    return 0;
}

2. `std::regex_search`

std::regex_search looks for a substring that matches the regular expression:

#include <regex>
#include <iostream>
#include <string>

int main() {
    std::string text = "My phone number is 123-456-7890. Call me!";
    std::regex phoneRegex("\\d{3}-\\d{3}-\\d{4}");
    std::smatch match;
    
    if (std::regex_search(text, match, phoneRegex)) {
        std::cout << "Found phone number: " << match.str() << std::endl;
    } else {
        std::cout << "No phone number found" << std::endl;
    }
    
    return 0;
}
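
Besides the matched text, the std::smatch result also records where the match was found. As a small sketch, the hypothetical helper first_number_offset below uses position(0) to report the offset of the first run of digits.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the offset of the first run of digits in the text,
// or std::string::npos if the text contains no digits.
std::size_t first_number_offset(const std::string& text) {
    static const std::regex digits(R"(\d+)");
    std::smatch match;
    if (!std::regex_search(text, match, digits)) {
        return std::string::npos;
    }
    // position(0) is the distance from the start of the text to the match.
    return static_cast<std::size_t>(match.position(0));
}
```

Note that the std::smatch object refers to the searched string, so that string must outlive the match object; this is why the helper takes the text by reference and finishes with the match before returning.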

3. `std::regex_replace`

std::regex_replace replaces the substrings that match the regular expression:

#include <regex>
#include <iostream>
#include <string>

int main() {
    std::string text = "My phone number is 123-456-7890. Call me!";
    std::regex phoneRegex("\\d{3}-\\d{3}-\\d{4}");
    
    // Replace every matching substring
    std::string replaced = std::regex_replace(text, phoneRegex, "***-***-****");
    std::cout << "Original: " << text << std::endl;
    std::cout << "Replaced: " << replaced << std::endl;
    
    return 0;
}

4. Capture groups

Capture groups make it possible to extract parts of the match:

#include <regex>
#include <iostream>
#include <string>

int main() {
    std::string date = "2023-05-21";
    std::regex dateRegex("(\\d{4})-(\\d{2})-(\\d{2})");
    std::smatch match;
    
    if (std::regex_match(date, match, dateRegex)) {
        std::cout << "Year: " << match[1] << std::endl;
        std::cout << "Month: " << match[2] << std::endl;
        std::cout << "Day: " << match[3] << std::endl;
    }
    
    return 0;
}
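
Capture groups can also be referenced from the replacement string of std::regex_replace, where $1, $2, and so on stand for the corresponding groups. The following sketch rearranges ISO dates; the function name iso_to_dmy is hypothetical.

```cpp
#include <regex>
#include <string>

// Rewrites every YYYY-MM-DD date in the text as DD/MM/YYYY.
std::string iso_to_dmy(const std::string& text) {
    static const std::regex iso_date(R"((\d{4})-(\d{2})-(\d{2}))");
    // $1 = year, $2 = month, $3 = day
    return std::regex_replace(text, iso_date, "$3/$2/$1");
}
```

Text that contains no date is returned unchanged, since std::regex_replace copies unmatched parts as they are.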

5. Iterators

Use std::sregex_iterator to iterate over all matches:

#include <regex>
#include <iostream>
#include <string>

int main() {
    std::string text = "Phone numbers: 123-456-7890, 987-654-3210";
    std::regex phoneRegex("\\d{3}-\\d{3}-\\d{4}");
    
    // Iterate over all matches
    std::sregex_iterator it(text.begin(), text.end(), phoneRegex);
    std::sregex_iterator end;
    
    while (it != end) {
        std::cout << "Found: " << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}
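
String splitting, listed earlier among the uses of regular expressions, can be done with std::sregex_token_iterator. Passing -1 as the submatch index makes the iterator yield the text between matches rather than the matches themselves. This is a minimal sketch; the function name split_on is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits the text at every match of the separator pattern.
std::vector<std::string> split_on(const std::string& text, const std::regex& separator) {
    // Submatch index -1 selects the pieces BETWEEN separator matches.
    std::sregex_token_iterator it(text.begin(), text.end(), separator, -1);
    std::sregex_token_iterator end;
    return std::vector<std::string>(it, end);
}
```

For example, split_on("a, b,c", std::regex(R"(\s*,\s*)")) returns the three strings "a", "b", and "c". If the text begins with a separator, the first element is an empty string.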

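Advanced uses of regular expressions

1. Named capture groups

std::regex does not support named capture groups. Its ECMAScript grammar predates the (?<name>pattern) syntax, so a pattern that uses it is rejected with std::regex_error when the regex is constructed, and std::smatch can only be indexed by number. A common substitute is to give the group indices descriptive names:

#include <regex>
#include <iostream>
#include <string>

int main() {
    // Named constants stand in for group names
    enum DateGroup { Year = 1, Month = 2, Day = 3 };
    
    std::string date = "2023-05-21";
    std::regex dateRegex("(\\d{4})-(\\d{2})-(\\d{2})");
    std::smatch match;
    
    if (std::regex_match(date, match, dateRegex)) {
        std::cout << "Year: " << match[Year] << std::endl;
        std::cout << "Month: " << match[Month] << std::endl;
        std::cout << "Day: " << match[Day] << std::endl;
    }
    
    return 0;
}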

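2. Lookahead and lookbehind assertions

The ECMAScript grammar used by std::regex supports lookahead assertions, which test what follows a position without consuming any characters:

Lookahead: (?=pattern) matches a position that is followed by pattern
Negative lookahead: (?!pattern) matches a position that is not followed by pattern

Lookbehind assertions, written (?<=pattern) and (?<!pattern) in other regex flavors, are not supported by std::regex; a pattern that contains them is rejected with std::regex_error.

#include <regex>
#include <iostream>
#include <string>

int main() {
    std::string text = "apple banana app application";
    
    // Match "app" only when it is followed by "le"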
    std::regex regex1("app(?=le)");
    std::sregex_iterator it1(text.begin(), text.end(), regex1);
    while (it1 != std::sregex_iterator()) {
        std::cout << "Match 1: " << it1->str() << std::endl;
        ++it1;
    }
    
    // Match "app" only when it is not followed by "le"
    std::regex regex2("app(?!le)");
    std::sregex_iterator it2(text.begin(), text.end(), regex2);
    while (it2 != std::sregex_iterator()) {
        std::cout << "Match 2: " << it2->str() << std::endl;
        ++it2;
    }
    
    return 0;
}
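
Because std::regex does not accept lookbehind syntax, a common workaround is to match the preceding context as ordinary text and read the part of interest from a capture group. The sketch below collects numbers that are preceded by a dollar sign; the function name dollar_amounts is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the digits of every "$<digits>" occurrence, without the "$".
// In flavors with lookbehind this could be written as (?<=\$)\d+;
// here the "$" is matched normally and group 1 holds the digits.
std::vector<std::string> dollar_amounts(const std::string& text) {
    static const std::regex priced(R"(\$(\d+))");
    std::vector<std::string> result;
    std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), priced); it != end; ++it) {
        result.push_back((*it)[1].str());
    }
    return result;
}
```

Given "cost $15, 20 units, $7", the function returns "15" and "7" and skips the bare "20".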

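3. Regular expression flags

Flags passed to the std::regex constructor change how matching behaves:

Flag	Meaning
`std::regex::icase`	Ignore case
`std::regex::nosubs`	Do not store sub-matches
`std::regex::optimize`	Favor matching speed over construction speed
`std::regex::collate`	Make character ranges such as `[a-z]` locale-sensitive
`std::regex::multiline`	Multiline mode (added in C++17, ECMAScript grammar only): `^` and `$` match at the start and end of each line

The standard library has no `dotall` flag. In the ECMAScript grammar `.` never matches a line terminator; use `[\s\S]` to match any character, including a newline.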

#include <regex>
#include <iostream>
#include <string>

int main() {
    std::string text = "Apple banana APP";
    
    // Ignore case
    std::regex regex1("apple", std::regex::icase);
    if (std::regex_search(text, regex1)) {
        std::cout << "Match (case-insensitive): yes" << std::endl;
    }
    
    // Multiline mode (std::regex::multiline requires C++17 and library support)
    std::string text2 = "Line 1\nLine 2\nLine 3";
    std::regex regex2("^Line", std::regex::multiline);
    std::sregex_iterator it(text2.begin(), text2.end(), regex2);
    while (it != std::sregex_iterator()) {
        std::cout << "Match (multiline): " << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}

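Combining localization and regular expressions

Processing text in different languages

std::regex matches char strings one byte at a time, so it cannot express a range of code points in UTF-8 text. This example therefore uses wide strings with std::wregex:

#include <regex>
#include <iostream>
#include <string>
#include <locale>

int main() {
    // Set the global locale to German; std::wregex and std::wcout pick it up
    std::locale::global(std::locale("de_DE.UTF-8"));
    std::wcout.imbue(std::locale());
    
    // Match from an accented letter (such as a German umlaut) to the end of the word
    std::wstring text = L"Äpfel Banana Über Morgen";
    std::wregex regex(L"[\u00C0-\u02AF]+\\w*");
    
    std::wsregex_iterator it(text.begin(), text.end(), regex);
    while (it != std::wsregex_iterator()) {
        std::wcout << L"Match: " << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}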

Validating internationalized input

As in the previous example, code point ranges require wide strings and std::wregex; with std::regex over char, the \u escapes below would not fit into a single char:

#include <regex>
#include <iostream>
#include <string>

int main() {
    // Match a Chinese name (2 to 4 Chinese characters)
    std::wregex chineseNameRegex(L"[\\u4E00-\\u9FA5]{2,4}");
    
    // Match Japanese hiragana
    std::wregex hiraganaRegex(L"[\\u3040-\\u309F]+");
    
    // Test data
    std::wstring name1 = L"张三";
    std::wstring name2 = L"John";
    std::wstring japanese = L"こんにちは";
    
    std::cout << std::boolalpha;
    std::cout << "name1 is a Chinese name: "
              << std::regex_match(name1, chineseNameRegex) << std::endl;
    std::cout << "name2 is a Chinese name: "
              << std::regex_match(name2, chineseNameRegex) << std::endl;
    std::cout << "japanese contains hiragana: "
              << std::regex_search(japanese, hiraganaRegex) << std::endl;
    
    return 0;
}

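Best practices

1. Localization best practices

Use std::locale: rely on the standard library's locale facilities rather than implementing your own
Consider character encoding: use UTF-8 to handle multilingual text
Avoid hard-coding: store text messages in configuration or resource files so they are easy to translate
Test different locales: test the program's behavior under different locales
Use std::wstring: use std::wstring in scenarios that require wide characters

2. Regular expression best practices

Compile regular expressions once: construct a std::regex that is used repeatedly a single time and reuse the object
Avoid complex regular expressions: overly complex patterns hurt both performance and readability
Use raw string literals: raw string literals avoid the trouble of escaping backslashes
Test regular expressions: check patterns with an online regular expression tester, keeping in mind that std::regex uses the ECMAScript flavor
Consider performance: on large texts, regular expressions can affect performance, so weigh the trade-offs
Use capture groups: well-chosen capture groups make it easier to extract information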
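
An invalid pattern is only detected at run time, when the std::regex constructor throws std::regex_error. When a pattern comes from configuration or user input, it can be validated up front. The helper name is_valid_pattern below is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns true if the pattern compiles with the default ECMAScript grammar.
bool is_valid_pattern(const std::string& pattern) {
    try {
        std::regex compiled(pattern);  // throws std::regex_error on a bad pattern
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

For example, is_valid_pattern("(\\d+)") returns true, while an unbalanced pattern such as "(" returns false. In real code the caught exception's what() and code() members can be logged to explain the failure.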

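3. Performance considerations

Locale overhead: switching locales frequently is costly and should be minimized
Regular expression overhead: both compiling and matching have a cost, so use regular expressions with care in performance-sensitive code
Cache compiled results: keep the compiled std::regex objects of patterns that are used repeatedly
Optimize regular expressions: avoid backtracking and overuse of wildcards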
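
One simple way to cache a compiled pattern is a function-local static std::regex, which is constructed the first time the function runs and reused afterwards. This is a sketch of that idea, with a hypothetical function name; other caching strategies (class members, a pattern table) may suit larger programs better.

```cpp
#include <regex>
#include <string>

// Validates strings of the form ddd-ddd-dddd.
bool looks_like_phone(const std::string& s) {
    // Compiled once, on the first call. Initialization of function-local
    // statics is thread-safe since C++11.
    static const std::regex phone(R"(\d{3}-\d{3}-\d{4})");
    return std::regex_match(s, phone);
}
```

Calling looks_like_phone in a loop then pays the compilation cost only once, instead of once per iteration as it would if the std::regex were a plain local variable.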

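Summary

Localization and regular expressions are important tools for processing text in C++:

Localization: lets software adapt to the needs of different regions and languages, improving the user experience
Regular expressions: provide powerful string processing and simplify matching, replacing, and validating text

In practice, use these tools according to the needs at hand, while balancing performance against maintainability.

After studying this chapter, readers should be familiar with C++'s localization facilities and regular expression library and be able to handle multilingual text and complex string operations.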
