第27章本地化与正则表达式

本地化（国际化）基础

本地化的重要性

本地化（Localization，简称L10n）和国际化（Internationalization，简称I18n）是现代软件 development中的重要概念：

国际化：设计和开发软件，使其能够轻松适应不同的语言和地区
本地化：将国际化的软件适配到特定的语言和地区

在C++中，本地化主要通过<locale>头文件中的设施实现。

字符编码基础

常见字符编码

编码	描述	特点
ASCII	美国信息交换标准代码	7位编码，仅支持英文字符
ISO-8859-1	Latin-1	8位编码，支持西欧语言
UTF-8	Unicode转换格式-8	可变长度编码，支持所有Unicode字符
UTF-16	Unicode转换格式-16	2或4字节编码，支持所有Unicode字符
UTF-32	Unicode转换格式-32	4字节编码，支持所有Unicode字符

C++中的字符类型

// 基本字符类型
char c1 = 'A';          // 通常为1字节，编码依赖于环境
wchar_t c2 = L'A';      // 宽字符，通常为2或4字节

// C++11引入的字符类型
char16_t c3 = u'A';     // 16位字符，用于UTF-16
char32_t c4 = U'A';     // 32位字符，用于UTF-32
char8_t c5 = u8'A';     // 8位字符，用于UTF-8 (C++20)

// 字符串字面量
const char* s1 = "Hello";          // 窄字符串
const wchar_t* s2 = L"Hello";      // 宽字符串
const char16_t* s3 = u"Hello";     // UTF-16字符串
const char32_t* s4 = U"Hello";     // UTF-32字符串
const char8_t* s5 = u8"Hello";     // UTF-8字符串 (C++20)

// 标准字符串类型
std::string str1 = "Hello";        // 窄字符串
std::wstring str2 = L"Hello";      // 宽字符串
std::u16string str3 = u"Hello";    // UTF-16字符串
std::u32string str4 = U"Hello";    // UTF-32字符串
std::u8string str5 = u8"Hello";    // UTF-8字符串 (C++20)

std::locale的使用

locale的基本概念

std::locale是C++中处理本地化的核心类，它封装了特定地区的文化和语言设置。

#include <iostream>
#include <locale>

int main() {
    // 使用系统默认locale
    std::locale default_locale;
    std::cout << "Default locale: " << default_locale.name() << std::endl;
    
    // 使用特定locale
    std::locale fr_locale("fr_FR.UTF-8");
    std::cout << "French locale: " << fr_locale.name() << std::endl;
    
    // 使用C locale
    std::locale c_locale("C");
    std::cout << "C locale: " << c_locale.name() << std::endl;
    
    return 0;
}

std::locale使用facet系统来提供不同类型的本地化服务：

std::ctype：字符分类和转换
std::collate：字符串比较
std::numpunct：数字标点
std::num_get/num_put：数字输入/输出
std::money_get/money_put：货币输入/输出
std::time_get/time_put：时间输入/输出
std::messages：消息目录访问

#include <iostream>
#include <locale>
#include <string>

int main() {
    // 创建一个法语locale
    std::locale fr_locale("fr_FR.UTF-8");
    
    // 获取数字标点facet
    const std::numpunct<char>& numpunct = std::use_facet<std::numpunct<char>>(fr_locale);
    std::cout << "Decimal point in French: '" << numpunct.decimal_point() << "'" << std::endl;
    std::cout << "Thousands separator in French: '" << numpunct.thousands_sep() << "'" << std::endl;
    
    // 获取货币标点facet
    const std::moneypunct<char>& moneypunct = std::use_facet<std::moneypunct<char>>(fr_locale);
    std::cout << "Currency symbol in French: '" << moneypunct.curr_symbol() << "'" << std::endl;
    std::cout << "Decimal point in currency: '" << moneypunct.decimal_point() << "'" << std::endl;
    
    return 0;
}

本地化数字和货币

#include <iostream>
#include <locale>
#include <iomanip>

int main() {
    // 保存当前locale
    std::locale old_locale = std::cout.getloc();
    
    // 使用美国locale
    std::cout.imbue(std::locale("en_US.UTF-8"));
    std::cout << "US format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;
    
    // 使用德国locale
    std::cout.imbue(std::locale("de_DE.UTF-8"));
    std::cout << "German format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;
    
    // 使用印度locale
    std::cout.imbue(std::locale("hi_IN.UTF-8"));
    std::cout << "Indian format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;
    
    // 恢复旧locale
    std::cout.imbue(old_locale);
    
    return 0;
}

本地化日期和时间

#include <iostream>
#include <locale>
#include <iomanip>
#include <ctime>

int main() {
    // 获取当前时间
    std::time_t now = std::time(nullptr);
    std::tm* tm_now = std::localtime(&now);
    
    // 保存当前locale
    std::locale old_locale = std::cout.getloc();
    
    // 使用美国locale
    std::cout.imbue(std::locale("en_US.UTF-8"));
    std::cout << "US format: " << std::put_time(tm_now, "%c") << std::endl;
    
    // 使用法国locale
    std::cout.imbue(std::locale("fr_FR.UTF-8"));
    std::cout << "French format: " << std::put_time(tm_now, "%c") << std::endl;
    
    // 使用日本locale
    std::cout.imbue(std::locale("ja_JP.UTF-8"));
    std::cout << "Japanese format: " << std::put_time(tm_now, "%c") << std::endl;
    
    // 恢复旧locale
    std::cout.imbue(old_locale);
    
    return 0;
}

自定义locale

#include <iostream>
#include <locale>

// 自定义numpunct facet
class CustomNumpunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override {
        return ',';
    }
    
    char do_thousands_sep() const override {
        return '.';
    }
    
    std::string do_grouping() const override {
        return "\3";
    }
};

int main() {
    // 创建自定义locale
    std::locale custom_locale(std::locale(), new CustomNumpunct);
    
    // 使用自定义locale
    std::cout.imbue(custom_locale);
    std::cout << "Custom format: " << 1234567.89 << std::endl;
    
    return 0;
}

Unicode处理

Unicode基础

Unicode是一个国际标准，为世界上所有的字符、标点符号和符号分配了唯一的数字代码点。

代码点：Unicode中每个字符的唯一编号，范围从U+0000到U+10FFFF
平面：Unicode代码点分为17个平面，每个平面包含65536个代码点
BMP：基本多文种平面（U+0000到U+FFFF），包含最常用的字符

C++中的Unicode支持

UTF-8处理

#include <iostream>
#include <string>
#include <vector>

// 计算UTF-8字符串的字符数
size_t utf8_length(const char* str) {
    size_t length = 0;
    while (*str) {
        if ((*str++ & 0xC0) != 0x80) {
            ++length;
        }
    }
    return length;
}

// 计算UTF-8字符串的字符数（使用C++11的string_view）
size_t utf8_length(const std::string& str) {
    size_t length = 0;
    for (char c : str) {
        if ((c & 0xC0) != 0x80) {
            ++length;
        }
    }
    return length;
}

// UTF-8到UTF-32的转换
std::u32string utf8_to_utf32(const std::string& utf8_str) {
    std::u32string utf32_str;
    
    for (size_t i = 0; i < utf8_str.size();) {
        unsigned char c = static_cast<unsigned char>(utf8_str[i]);
        char32_t code_point;
        
        if (c < 0x80) {
            // 单字节
            code_point = c;
            i += 1;
        } else if (c < 0xE0) {
            // 双字节
            code_point = ((c & 0x1F) << 6) | (static_cast<unsigned char>(utf8_str[i+1]) & 0x3F);
            i += 2;
        } else if (c < 0xF0) {
            // 三字节
            code_point = ((c & 0x0F) << 12) | 
                        ((static_cast<unsigned char>(utf8_str[i+1]) & 0x3F) << 6) |
                        (static_cast<unsigned char>(utf8_str[i+2]) & 0x3F);
            i += 3;
        } else {
            // 四字节
            code_point = ((c & 0x07) << 18) | 
                        ((static_cast<unsigned char>(utf8_str[i+1]) & 0x3F) << 12) |
                        ((static_cast<unsigned char>(utf8_str[i+2]) & 0x3F) << 6) |
                        (static_cast<unsigned char>(utf8_str[i+3]) & 0x3F);
            i += 4;
        }
        
        utf32_str.push_back(code_point);
    }
    
    return utf32_str;
}

// UTF-32到UTF-8的转换
std::string utf32_to_utf8(const std::u32string& utf32_str) {
    std::string utf8_str;
    
    for (char32_t code_point : utf32_str) {
        if (code_point < 0x80) {
            // 单字节
            utf8_str.push_back(static_cast<char>(code_point));
        } else if (code_point < 0x800) {
            // 双字节
            utf8_str.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
            utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else if (code_point < 0x10000) {
            // 三字节
            utf8_str.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
            utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else {
            // 四字节
            utf8_str.push_back(static_cast<char>(0xF0 | (code_point >> 18)));
            utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 12) & 0x3F)));
            utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        }
    }
    
    return utf8_str;
}

int main() {
    std::string utf8_str = u8"Hello, 世界! こんにちは! مرحبا!";
    std::cout << "UTF-8 string: " << utf8_str << std::endl;
    std::cout << "UTF-8 length (bytes): " << utf8_str.size() << std::endl;
    std::cout << "UTF-8 length (characters): " << utf8_length(utf8_str) << std::endl;
    
    // 转换为UTF-32
    std::u32string utf32_str = utf8_to_utf32(utf8_str);
    std::cout << "UTF-32 length (characters): " << utf32_str.size() << std::endl;
    
    // 转换回UTF-8
    std::string converted_back = utf32_to_utf8(utf32_str);
    std::cout << "Converted back: " << converted_back << std::endl;
    
    return 0;
}

使用ICU库

对于复杂的Unicode处理，推荐使用ICU（International Components for Unicode）库，它提供了更全面的Unicode支持。

#include <iostream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/translit.h>

int main() {
    // 创建Unicode字符串
    UnicodeString ustr = UnicodeString::fromUTF8("Hello, 世界! こんにちは!");
    
    // 转换为小写
    UnicodeString lower_str = ustr; 
    lower_str.toLower();
    
    // 转换为大写
    UnicodeString upper_str = ustr;
    upper_str.toUpper();
    
    // 音译（罗马化）
    UnicodeString romanized;
    Transliterator* transliterator = Transliterator::createInstance("Any-Latin");
    if (transliterator) {
        transliterator->transliterate(ustr, 0, ustr.length(), romanized);
        delete transliterator;
    }
    
    // 输出结果
    std::string output;
    
    std::cout << "Original: ";
    ustr.toUTF8String(output);
    std::cout << output << std::endl;
    
    std::cout << "Lowercase: ";
    lower_str.toUTF8String(output);
    std::cout << output << std::endl;
    
    std::cout << "Uppercase: ";
    upper_str.toUTF8String(output);
    std::cout << output << std::endl;
    
    std::cout << "Romanized: ";
    romanized.toUTF8String(output);
    std::cout << output << std::endl;
    
    return 0;
}

消息国际化

gettext系统

gettext是一个广泛使用的消息国际化系统，它允许开发者将消息与代码分离，便于翻译。

基本使用

标记需要国际化的消息：

#include <iostream>
#include <libintl.h>
#include <locale.h>

#define _(msg) gettext(msg)

int main() {
    // 设置locale
    setlocale(LC_ALL, "");
    
    // 设置消息目录
    bindtextdomain("myapp", "./locale");
    textdomain("myapp");
    
    // 使用国际化消息
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;
    
    return 0;
}

提取消息：

使用xgettext工具提取消息到PO文件：

1	xgettext -d myapp -o myapp.pot main.cpp

翻译消息：

创建语言特定的PO文件（如fr_FR.po）并翻译消息。

编译消息：

使用msgfmt工具将PO文件编译为MO文件：

1 2	mkdir -p locale/fr_FR/LC_MESSAGES msgfmt -o locale/fr_FR/LC_MESSAGES/myapp.mo fr_FR.po

自定义消息系统

#include <iostream>
#include <string>
#include <map>
#include <locale>

class MessageCatalog {
private:
    std::map<std::string, std::map<std::string, std::string>> catalogs;
    std::string current_locale;

public:
    MessageCatalog() {
        // 设置默认locale
        std::locale default_locale;
        current_locale = default_locale.name();
    }
    
    void set_locale(const std::string& loc) {
        current_locale = loc;
    }
    
    void add_message(const std::string& msgid, const std::string& msgstr, const std::string& loc) {
        catalogs[loc][msgid] = msgstr;
    }
    
    std::string get_message(const std::string& msgid) {
        // 尝试在当前locale中查找
        auto loc_it = catalogs.find(current_locale);
        if (loc_it != catalogs.end()) {
            auto msg_it = loc_it->second.find(msgid);
            if (msg_it != loc_it->second.end()) {
                return msg_it->second;
            }
        }
        
        // 尝试在默认locale中查找
        auto default_it = catalogs.find("C");
        if (default_it != catalogs.end()) {
            auto msg_it = default_it->second.find(msgid);
            if (msg_it != default_it->second.end()) {
                return msg_it->second;
            }
        }
        
        // 返回原始消息ID
        return msgid;
    }
};

// 全局消息目录
MessageCatalog msg_cat;

// 便捷宏
#define _(msgid) msg_cat.get_message(msgid)

int main() {
    // 添加消息
    msg_cat.add_message("Hello, World!", "Hello, World!", "C");
    msg_cat.add_message("How are you?", "How are you?", "C");
    
    msg_cat.add_message("Hello, World!", "Bonjour, Monde!", "fr_FR");
    msg_cat.add_message("How are you?", "Comment allez-vous?", "fr_FR");
    
    msg_cat.add_message("Hello, World!", "こんにちは、世界！", "ja_JP");
    msg_cat.add_message("How are you?", "お元気ですか？", "ja_JP");
    
    // 使用默认locale
    std::cout << "Default locale:" << std::endl;
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;
    
    // 使用法语locale
    std::cout << "\nFrench locale:" << std::endl;
    msg_cat.set_locale("fr_FR");
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;
    
    // 使用日语locale
    std::cout << "\nJapanese locale:" << std::endl;
    msg_cat.set_locale("ja_JP");
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;
    
    return 0;
}

正则表达式基础

正则表达式语法

C++11引入了<regex>头文件，提供了正则表达式支持。C++的正则表达式语法基于ECMAScript标准。

基本语法

语法	描述	示例
`^`	匹配字符串开头	`^Hello` 匹配以Hello开头的字符串
`$`	匹配字符串结尾	`World$` 匹配以World结尾的字符串
`.`	匹配任意单个字符（除换行符）	`H.llo` 匹配Hello, Hallo等
`*`	匹配前面的字符0次或多次	`ab*c` 匹配ac, abc, abbc等
`+`	匹配前面的字符1次或多次	`ab+c` 匹配abc, abbc等，不匹配ac
`?`	匹配前面的字符0次或1次	`ab?c` 匹配ac, abc，不匹配abbc
`{n}`	匹配前面的字符恰好n次	`a{3}` 匹配aaa
`{n,}`	匹配前面的字符至少n次	`a{2,}` 匹配aa, aaa等
`{n,m}`	匹配前面的字符n到m次	`a{2,4}` 匹配aa, aaa, aaaa
`[]`	匹配括号内的任意一个字符	`[aeiou]` 匹配任意元音字母
`[^]`	匹配括号内字符以外的任意字符	`[^0-9]` 匹配任意非数字字符
`	`	匹配左右任意一个表达式
`()`	捕获组，匹配括号内的表达式并捕获	`(ab)+` 匹配ab, abab等，并捕获ab
`\d`	匹配任意数字，等同于[0-9]	`\d+` 匹配一个或多个数字
`\D`	匹配任意非数字，等同于[^0-9]	`\D+` 匹配一个或多个非数字
`\w`	匹配任意字母、数字或下划线，等同于[a-zA-Z0-9_]	`\w+` 匹配一个或多个单词字符
`\W`	匹配任意非单词字符，等同于[^a-zA-Z0-9_]	`\W+` 匹配一个或多个非单词字符
`\s`	匹配任意空白字符（空格、制表符、换行符等）	`\s+` 匹配一个或多个空白字符
`\S`	匹配任意非空白字符	`\S+` 匹配一个或多个非空白字符
`\b`	匹配单词边界	`\bword\b` 匹配完整的word单词
`\B`	匹配非单词边界	`\Bword\B` 匹配单词内部的word

std::regex的使用

基本操作

#include <iostream>
#include <regex>
#include <string>

int main() {
    // 1. 匹配操作
    std::string text = "Hello, World!";
    std::regex pattern(R"(Hello, \w+!)" );
    
    if (std::regex_match(text, pattern)) {
        std::cout << "Text matches pattern" << std::endl;
    } else {
        std::cout << "Text does not match pattern" << std::endl;
    }
    
    // 2. 搜索操作
    std::string text2 = "The price is $45.99, discounted to $39.99";
    std::regex price_pattern(R"(\$\d+\.\d{2})" );
    
    std::smatch matches;
    if (std::regex_search(text2, matches, price_pattern)) {
        std::cout << "Found price: " << matches[0] << std::endl;
    }
    
    // 3. 替换操作
    std::string text3 = "The cat sat on the mat";
    std::regex cat_pattern(R"(cat)" );
    std::string replaced = std::regex_replace(text3, cat_pattern, "dog");
    std::cout << "Original: " << text3 << std::endl;
    std::cout << "Replaced: " << replaced << std::endl;
    
    // 4. 迭代搜索
    std::string text4 = "Contact info: john@example.com, jane@test.org";
    std::regex email_pattern(R"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})" );
    
    std::sregex_iterator it(text4.begin(), text4.end(), email_pattern);
    std::sregex_iterator end;
    
    std::cout << "Found emails:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}

捕获组

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "John Doe (john@example.com) - 30 years old";
    
    // 带捕获组的正则表达式
    std::regex pattern(R"(([A-Za-z]+) ([A-Za-z]+) \(([^)]+)\) - (\d+) years old)" );
    
    std::smatch matches;
    if (std::regex_match(text, matches, pattern)) {
        std::cout << "Full match: " << matches[0] << std::endl;
        std::cout << "First name: " << matches[1] << std::endl;
        std::cout << "Last name: " << matches[2] << std::endl;
        std::cout << "Email: " << matches[3] << std::endl;
        std::cout << "Age: " << matches[4] << std::endl;
    }
    
    return 0;
}

正则表达式标志

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Hello, WORLD! hello, world!";
    
    // 不区分大小写
    std::regex case_insensitive_pattern(R"(hello, world!)" , std::regex_constants::icase);
    
    std::smatch matches;
    if (std::regex_search(text, matches, case_insensitive_pattern)) {
        std::cout << "Found: " << matches[0] << std::endl;
    }
    
    // 多行模式
    std::string multi_line_text = "Line 1: test\nLine 2: TEST\nLine 3: Test";
    std::regex multi_line_pattern(R"(^Line \d+: test$)" , std::regex_constants::icase | std::regex_constants::multiline);
    
    std::sregex_iterator it(multi_line_text.begin(), multi_line_text.end(), multi_line_pattern);
    std::sregex_iterator end;
    
    std::cout << "Found lines:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}

高级正则表达式技巧

前瞻和后顾断言

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Price: $45.99, Discount: $10.00, Total: $35.99";
    
    // 前瞻断言：匹配$后面的数字，但不包含$
    std::regex lookahead_pattern(R"(\$(?=\d+\.\d{2}))" );
    std::string result1 = std::regex_replace(text, lookahead_pattern, "USD ");
    std::cout << "With lookahead: " << result1 << std::endl;
    
    std::string text2 = "apple pie, banana bread, cherry tart";
    
    // 后顾断言：匹配pie前面的单词，但不包含pie
    std::regex lookbehind_pattern(R"((?<=\w+) pie)" );
    std::smatch matches;
    if (std::regex_search(text2, matches, lookbehind_pattern)) {
        std::cout << "Found with lookbehind: " << matches[0] << std::endl;
    }
    
    // 负前瞻：匹配不是@gmail.com结尾的邮箱
    std::string emails = "john@example.com, jane@gmail.com, bob@test.org";
    std::regex negative_lookahead(R"([a-zA-Z0-9._%+-]+@(?!gmail\.com)[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})" );
    
    std::sregex_iterator it(emails.begin(), emails.end(), negative_lookahead);
    std::sregex_iterator end;
    
    std::cout << "Non-Gmail emails:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}

量词的贪婪与非贪婪

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "<div>First div</div><div>Second div</div>";
    
    // 贪婪匹配：尽可能匹配更多字符
    std::regex greedy_pattern(R"(<div>.*</div>)" );
    std::smatch greedy_match;
    if (std::regex_search(text, greedy_match, greedy_pattern)) {
        std::cout << "Greedy match: " << greedy_match[0] << std::endl;
    }
    
    // 非贪婪匹配：尽可能匹配更少字符
    std::regex non_greedy_pattern(R"(<div>.*?</div>)" );
    std::sregex_iterator it(text.begin(), text.end(), non_greedy_pattern);
    std::sregex_iterator end;
    
    std::cout << "Non-greedy matches:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}

条件正则表达式

#include <iostream>
#include <regex>
#include <string>

int main() {
    // 匹配电话号码，支持带区号和不带区号的格式
    std::string phone_numbers = "555-1234, (123) 555-6789";
    std::regex phone_pattern(R"((\()?(\d{3})(?(1)\) |-))?\d{3}-\d{4})" );
    
    std::sregex_iterator it(phone_numbers.begin(), phone_numbers.end(), phone_pattern);
    std::sregex_iterator end;
    
    std::cout << "Found phone numbers:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}

正则表达式与本地化的结合

处理Unicode字符

#include <iostream>
#include <regex>
#include <string>
#include <locale>

int main() {
    // 设置UTF-8 locale
    std::locale::global(std::locale("en_US.UTF-8"));
    
    // 匹配包含非ASCII字符的文本
    std::u32string text = U"Hello, 世界! こんにちは!";
    
    // 转换为UTF-8进行正则匹配
    std::string utf8_text;
    for (char32_t c : text) {
        if (c < 0x80) {
            utf8_text.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            utf8_text.push_back(static_cast<char>(0xC0 | (c >> 6)));
            utf8_text.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            utf8_text.push_back(static_cast<char>(0xE0 | (c >> 12)));
            utf8_text.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            utf8_text.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            utf8_text.push_back(static_cast<char>(0xF0 | (c >> 18)));
            utf8_text.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
            utf8_text.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            utf8_text.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    
    // 匹配包含中文字符的部分
    std::regex chinese_pattern(R"([\p{Han}]+)" , std::regex_constants::u);
    std::sregex_iterator it(utf8_text.begin(), utf8_text.end(), chinese_pattern);
    std::sregex_iterator end;
    
    std::cout << "Found Chinese characters:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }
    
    return 0;
}

本地化的正则表达式

#include <iostream>
#include <regex>
#include <string>
#include <locale>

int main() {
    // 根据不同locale使用不同的数字格式
    std::string german_number = "1.234.567,89";
    std::string us_number = "1,234,567.89";
    
    // 匹配德国数字格式
    std::regex german_pattern(R"(\d{1,3}(\.\d{3})*(,\d{2})?)" );
    
    // 匹配美国数字格式
    std::regex us_pattern(R"(\d{1,3}(,\d{3})*(\.\d{2})?)" );
    
    std::cout << "German number format:" << std::endl;
    if (std::regex_match(german_number, german_pattern)) {
        std::cout << "Valid: " << german_number << std::endl;
    }
    
    std::cout << "US number format:" << std::endl;
    if (std::regex_match(us_number, us_pattern)) {
        std::cout << "Valid: " << us_number << std::endl;
    }
    
    return 0;
}

性能优化

正则表达式性能

正则表达式的性能取决于多个因素，包括模式复杂度、输入大小和实现细节。以下是一些优化技巧：

1. 编译正则表达式

对于重复使用的正则表达式，应预编译以避免重复编译开销。

#include <iostream>
#include <regex>
#include <string>
#include <chrono>

int main() {
    std::string text = "The price is $45.99, discounted to $39.99, final price $29.99";
    
    // 测量重复编译的时间
    auto start1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i) {
        std::regex pattern(R"(\$\d+\.\d{2})" );
        std::smatch matches;
        std::regex_search(text, matches, pattern);
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto duration1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    std::cout << "Time with repeated compilation: " << duration1 << " ms" << std::endl;
    
    // 测量预编译的时间
    std::regex pattern(R"(\$\d+\.\d{2})" );
    auto start2 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i) {
        std::smatch matches;
        std::regex_search(text, matches, pattern);
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    std::cout << "Time with precompiled regex: " << duration2 << " ms" << std::endl;
    
    return 0;
}

2. 避免回溯

回溯是正则表达式性能问题的常见原因，特别是在处理大输入时。

#include <iostream>
#include <regex>
#include <string>
#include <chrono>

int main() {
    // 创建一个可能导致回溯的文本
    std::string text = std::string(1000, 'a') + "b";
    
    // 有问题的模式：可能导致大量回溯
    std::regex bad_pattern(R"(a*a*b)" );
    
    // 优化的模式：避免回溯
    std::regex good_pattern(R"(a*b)" );
    
    // 测量有问题的模式
    auto start1 = std::chrono::high_resolution_clock::now();
    std::smatch matches1;
    std::regex_search(text, matches1, bad_pattern);
    auto end1 = std::chrono::high_resolution_clock::now();
    auto duration1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    std::cout << "Time with bad pattern: " << duration1 << " ms" << std::endl;
    
    // 测量优化的模式
    auto start2 = std::chrono::high_resolution_clock::now();
    std::smatch matches2;
    std::regex_search(text, matches2, good_pattern);
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    std::cout << "Time with good pattern: " << duration2 << " ms" << std::endl;
    
    return 0;
}

3. 使用适当的正则表达式引擎

C++11提供了几种正则表达式引擎实现，可根据需要选择：

std::regex_constants::ECMAScript：默认，基于ECMAScript语法
std::regex_constants::basic：基本POSIX语法
std::regex_constants::extended：扩展POSIX语法
std::regex_constants::awk：awk风格语法
std::regex_constants::grep：grep风格语法
std::regex_constants::egrep：egrep风格语法

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Hello, World!";
    
    // 使用不同的引擎
    std::regex ecma_pattern(R"(Hello, \w+!)" , std::regex_constants::ECMAScript);
    std::regex basic_pattern("Hello, [a-zA-Z0-9_]+!" , std::regex_constants::basic);
    
    std::cout << "ECMAScript engine:" << std::endl;
    if (std::regex_match(text, ecma_pattern)) {
        std::cout << "Match found" << std::endl;
    }
    
    std::cout << "Basic engine:" << std::endl;
    if (std::regex_match(text, basic_pattern)) {
        std::cout << "Match found" << std::endl;
    }
    
    return 0;
}

本地化性能优化

1. 缓存locale对象

#include <iostream>
#include <locale>
#include <map>
#include <string>

class LocaleCache {
private:
    std::map<std::string, std::locale> cache;

public:
    std::locale get_locale(const std::string& name) {
        auto it = cache.find(name);
        if (it != cache.end()) {
            return it->second;
        }
        
        // 创建并缓存新locale
        std::locale loc(name);
        cache[name] = loc;
        return loc;
    }
};

int main() {
    LocaleCache cache;
    
    // 获取locale（首次创建）
    std::locale fr_locale = cache.get_locale("fr_FR.UTF-8");
    std::cout << "French locale: " << fr_locale.name() << std::endl;
    
    // 获取locale（从缓存中获取）
    std::locale fr_locale2 = cache.get_locale("fr_FR.UTF-8");
    std::cout << "French locale (cached): " << fr_locale2.name() << std::endl;
    
    return 0;
}

2. 避免频繁的locale切换

#include <iostream>
#include <locale>
#include <iomanip>
#include <chrono>

int main() {
    double value = 1234567.89;
    
    // 保存默认locale
    std::locale default_locale = std::cout.getloc();
    
    // 创建需要的locale
    std::locale fr_locale("fr_FR.UTF-8");
    std::locale de_locale("de_DE.UTF-8");
    std::locale us_locale("en_US.UTF-8");
    
    // 测量频繁切换locale的时间
    auto start1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::cout.imbue(fr_locale);
        std::cout << value << " ";
        std::cout.imbue(de_locale);
        std::cout << value << " ";
        std::cout.imbue(us_locale);
        std::cout << value << " ";
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto duration1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    std::cout << "\nTime with frequent locale switches: " << duration1 << " ms" << std::endl;
    
    // 测量批量处理的时间
    auto start2 = std::chrono::high_resolution_clock::now();
    std::cout.imbue(fr_locale);
    for (int i = 0; i < 1000; ++i) {
        std::cout << value << " ";
    }
    
    std::cout.imbue(de_locale);
    for (int i = 0; i < 1000; ++i) {
        std::cout << value << " ";
    }
    
    std::cout.imbue(us_locale);
    for (int i = 0; i < 1000; ++i) {
        std::cout << value << " ";
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    std::cout << "\nTime with batch processing: " << duration2 << " ms" << std::endl;
    
    // 恢复默认locale
    std::cout.imbue(default_locale);
    
    return 0;
}

实际应用案例

案例1：表单验证

#include <iostream>
#include <regex>
#include <string>
#include <vector>

class FormValidator {
private:
    std::regex email_pattern{R"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"};
    std::regex phone_pattern{R"(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"};
    std::regex zip_pattern{R"(\d{5}(-\d{4})?)"};
    std::regex date_pattern{R"(\d{4}-\d{2}-\d{2})"};

public:
    bool validate_email(const std::string& email) {
        return std::regex_match(email, email_pattern);
    }
    
    bool validate_phone(const std::string& phone) {
        return std::regex_match(phone, phone_pattern);
    }
    
    bool validate_zip(const std::string& zip) {
        return std::regex_match(zip, zip_pattern);
    }
    
    bool validate_date(const std::string& date) {
        return std::regex_match(date, date_pattern);
    }
    
    std::vector<std::string> validate_form(const std::string& email, 
                                         const std::string& phone, 
                                         const std::string& zip, 
                                         const std::string& date) {
        std::vector<std::string> errors;
        
        if (!validate_email(email)) {
            errors.push_back("Invalid email format");
        }
        
        if (!validate_phone(phone)) {
            errors.push_back("Invalid phone format");
        }
        
        if (!validate_zip(zip)) {
            errors.push_back("Invalid zip code format");
        }
        
        if (!validate_date(date)) {
            errors.push_back("Invalid date format (YYYY-MM-DD)");
        }
        
        return errors;
    }
};

int main() {
    FormValidator validator;
    
    // 测试有效输入
    std::vector<std::string> errors1 = validator.validate_form(
        "john@example.com",
        "(123) 456-7890",
        "12345-6789",
        "2023-12-25"
    );
    
    std::cout << "Valid input errors: " << errors1.size() << std::endl;
    for (const auto& error : errors1) {
        std::cout << "- " << error << std::endl;
    }
    
    // 测试无效输入
    std::vector<std::string> errors2 = validator.validate_form(
        "invalid-email",
        "1234567",
        "1234",
        "2023/12/25"
    );
    
    std::cout << "\nInvalid input errors: " << errors2.size() << std::endl;
    for (const auto& error : errors2) {
        std::cout << "- " << error << std::endl;
    }
    
    return 0;
}

案例2：日志分析

#include <iostream>
#include <regex>
#include <string>
#include <fstream>
#include <map>

class LogAnalyzer {
private:
    std::regex log_pattern{R"(\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*))"};
    std::map<std::string, int> level_counts;
    std::map<std::string, std::vector<std::string>> level_messages;

public:
    void analyze_file(const std::string& filename) {
        std::ifstream file(filename);
        if (!file) {
            std::cerr << "Could not open file: " << filename << std::endl;
            return;
        }
        
        std::string line;
        while (std::getline(file, line)) {
            analyze_line(line);
        }
        
        file.close();
    }
    
    void analyze_line(const std::string& line) {
        std::smatch matches;
        if (std::regex_match(line, matches, log_pattern)) {
            std::string timestamp = matches[1];
            std::string level = matches[2];
            std::string message = matches[3];
            
            // 统计级别
            level_counts[level]++;
            
            // 存储消息
            level_messages[level].push_back(timestamp + " - " + message);
        }
    }
    
    void print_summary() {
        std::cout << "Log level summary:" << std::endl;
        for (const auto& pair : level_counts) {
            std::cout << pair.first << ": " << pair.second << " messages" << std::endl;
        }
        
        std::cout << "\nDetailed messages:" << std::endl;
        for (const auto& pair : level_messages) {
            std::cout << "\n" << pair.first << ":" << std::endl;
            for (const auto& message : pair.second) {
                std::cout << "  " << message << std::endl;
            }
        }
    }
};

int main() {
    LogAnalyzer analyzer;
    
    // 假设有一个日志文件
    // analyzer.analyze_file("app.log");
    
    // 手动添加一些日志行进行测试
    analyzer.analyze_line("[2023-12-01 10:00:00] [INFO] Application started");
    analyzer.analyze_line("[2023-12-01 10:05:30] [ERROR] Database connection failed");
    analyzer.analyze_line("[2023-12-01 10:06:00] [INFO] Retrying database connection");
    analyzer.analyze_line("[2023-12-01 10:06:15] [INFO] Database connection established");
    analyzer.analyze_line("[2023-12-01 10:10:00] [WARNING] Disk space low");
    
    analyzer.print_summary();
    
    return 0;
}

案例3：国际化应用

#include <iostream>
#include <string>
#include <map>
#include <locale>
#include <regex>

class I18nApplication {
private:
    std::map<std::string, std::map<std::string, std::string>> messages;
    std::string current_locale;
    std::map<std::string, std::regex> localized_patterns;

public:
    I18nApplication() {
        // 设置默认locale
        std::locale default_locale;
        current_locale = default_locale.name();
        
        // 初始化消息
        init_messages();
        
        // 初始化本地化模式
        init_patterns();
    }
    
    void init_messages() {
        // 英语消息
        messages["en_US"]["greeting"] = "Hello, {0}!";
        messages["en_US"]["farewell"] = "Goodbye, {0}!";
        messages["en_US"]["welcome"] = "Welcome to our application!";
        messages["en_US"]["error"] = "An error occurred: {0}";
        
        // 法语消息
        messages["fr_FR"]["greeting"] = "Bonjour, {0}!";
        messages["fr_FR"]["farewell"] = "Au revoir, {0}!";
        messages["fr_FR"]["welcome"] = "Bienvenue dans notre application!";
        messages["fr_FR"]["error"] = "Une erreur s'est produite: {0}";
        
        // 日语消息
        messages["ja_JP"]["greeting"] = "こんにちは、{0}さん！";
        messages["ja_JP"]["farewell"] = "さようなら、{0}さん！";
        messages["ja_JP"]["welcome"] = "アプリケーションへようこそ！";
        messages["ja_JP"]["error"] = "エラーが発生しました: {0}";
    }
    
    void init_patterns() {
        // 英语模式
        localized_patterns["en_US"] = std::regex(R"(Hello,\s+(\w+))" );
        
        // 法语模式
        localized_patterns["fr_FR"] = std::regex(R"(Bonjour,\s+(\w+))" );
        
        // 日语模式
        localized_patterns["ja_JP"] = std::regex(R"(こんにちは、(\w+)さん！)" );
    }
    
    void set_locale(const std::string& loc) {
        current_locale = loc;
    }
    
    std::string get_message(const std::string& key, const std::string& param = "") {
        auto loc_it = messages.find(current_locale);
        if (loc_it == messages.end()) {
            // 回退到英语
            loc_it = messages.find("en_US");
        }
        
        auto msg_it = loc_it->second.find(key);
        if (msg_it == loc_it->second.end()) {
            return "[Missing message]";
        }
        
        std::string message = msg_it->second;
        if (!param.empty()) {
            // 简单的参数替换
            size_t pos = message.find("{0}");
            if (pos != std::string::npos) {
                message.replace(pos, 3, param);
            }
        }
        
        return message;
    }
    
    bool parse_greeting(const std::string& greeting, std::string& name) {
        auto pattern_it = localized_patterns.find(current_locale);
        if (pattern_it == localized_patterns.end()) {
            // 回退到英语
            pattern_it = localized_patterns.find("en_US");
        }
        
        std::smatch matches;
        if (std::regex_match(greeting, matches, pattern_it->second)) {
            name = matches[1];
            return true;
        }
        
        return false;
    }
    
    void run() {
        std::cout << get_message("welcome") << std::endl;
        
        // 测试不同locale
        test_locale("en_US", "John");
        test_locale("fr_FR", "Jean");
        test_locale("ja_JP", "Tanaka");
    }
    
    void test_locale(const std::string& loc, const std::string& name) {
        set_locale(loc);
        std::cout << "\nTesting locale: " << loc << std::endl;
        std::cout << get_message("greeting", name) << std::endl;
        std::cout << get_message("farewell", name) << std::endl;
        std::cout << get_message("error", "Connection failed") << std::endl;
        
        // 测试解析问候语
        std::string test_greeting;
        if (loc == "en_US") {
            test_greeting = "Hello, " + name;
        } else if (loc == "fr_FR") {
            test_greeting = "Bonjour, " + name;
        } else if (loc == "ja_JP") {
            test_greeting = "こんにちは、" + name + "さん！";
        }
        
        std::string parsed_name;
        if (parse_greeting(test_greeting, parsed_name)) {
            std::cout << "Parsed name: " << parsed_name << std::endl;
        }
    }
};

int main() {
    I18nApplication app;
    app.run();
    return 0;
}

最佳实践总结

本地化最佳实践

设计时考虑国际化：
- 避免硬编码字符串
- 使用Unicode编码
- 考虑文本长度变化（某些语言翻译后文本会变长）
- 考虑从右到左的语言（如阿拉伯语、希伯来语）
使用标准库设施：
- 优先使用std::locale进行本地化
- 使用std::string、std::wstring等标准字符串类型
- 对于复杂的Unicode处理，考虑使用ICU库
消息国际化：
- 使用gettext或类似系统进行消息国际化
- 保持消息简洁明了
- 避免在消息中包含格式依赖的内容
测试不同locale：
- 在不同locale下测试应用
- 确保数字、日期、时间格式正确
- 确保Unicode字符正确显示

正则表达式最佳实践

编写清晰的正则表达式：
- 使用原始字符串字面量（R”()”）避免转义
- 对于复杂模式，添加注释
- 合理使用捕获组
优化性能：
- 预编译频繁使用的正则表达式
- 避免回溯（使用占有量词、原子组等）
- 对于简单模式，考虑使用字符串操作
错误处理：
- 捕获正则表达式编译错误
- 验证用户输入的正则表达式模式
安全性：
- 避免使用用户输入直接构造正则表达式（防止ReDoS攻击）
- 对复杂模式设置匹配超时
测试：
- 测试边界情况
- 测试不同输入长度
- 测试国际化输入

结合使用的最佳实践

处理国际化输入：
- 使用Unicode感知的正则表达式
- 考虑不同语言的字符特性
本地化的模式匹配：
- 根据不同locale调整正则表达式模式
- 考虑本地化的数字、日期格式
性能平衡：
- 在本地化和正则表达式处理之间平衡性能
- 缓存常用的本地化模式
代码组织：
- 将本地化和正则表达式逻辑封装在专门的类中
- 使用依赖注入管理locale和模式

通过遵循这些最佳实践，可以开发出既支持国际化又高效处理文本的C++应用程序。本地化和正则表达式是现代C++编程中的重要工具，掌握它们对于开发高质量的软件至关重要。

第27章 本地化与正则表达式