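Chapter 27: Localization and Regular Expressions
Localization (Internationalization) Basics
Why Localization Matters
Localization (L10n) and internationalization (I18n) are important concepts in modern software development:
- Internationalization: designing and building software so that it adapts easily to different languages and regions
- Localization: adapting internationalized software to a specific language and region
In C++, localization is provided mainly by the facilities in the <locale> header.
Character Encoding Basics
Common Character Encodings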
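| Encoding | Description | Characteristics |
|---|---|---|
| ASCII | American Standard Code for Information Interchange | 7-bit encoding; covers English characters only |
| ISO-8859-1 | Latin-1 | 8-bit encoding; covers Western European languages |
| UTF-8 | Unicode Transformation Format, 8-bit | Variable length (1 to 4 bytes); covers all Unicode characters |
| UTF-16 | Unicode Transformation Format, 16-bit | 2 or 4 bytes per character; covers all Unicode characters |
| UTF-32 | Unicode Transformation Format, 32-bit | Fixed 4 bytes per character; covers all Unicode characters |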
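Character Types in C++
```cpp
#include <string>

// Character types
char c1 = 'A';       // narrow character; its encoding is implementation-defined
wchar_t c2 = L'A';   // wide character; 16 bits on Windows, 32 bits on most Unix-like systems
char16_t c3 = u'A';  // UTF-16 code unit (C++11)
char32_t c4 = U'A';  // UTF-32 code unit (C++11)
char8_t c5 = u8'A';  // UTF-8 code unit (C++20)

// String literals
const char* s1 = "Hello";
const wchar_t* s2 = L"Hello";
const char16_t* s3 = u"Hello";
const char32_t* s4 = U"Hello";
const char8_t* s5 = u8"Hello";

// Standard string classes
std::string str1 = "Hello";
std::wstring str2 = L"Hello";
std::u16string str3 = u"Hello";
std::u32string str4 = U"Hello";
std::u8string str5 = u8"Hello";  // C++20
```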
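Using std::locale
Basic Concepts
std::locale is the core class for localization in C++. It encapsulates the cultural and language conventions of a particular region.
```cpp
#include <iostream>
#include <locale>

int main() {
    // A default-constructed locale is a copy of the global locale ("C" at program start)
    std::locale default_locale;
    std::cout << "Default locale: " << default_locale.name() << std::endl;

    // A named locale; the constructor throws std::runtime_error if it is not installed
    std::locale fr_locale("fr_FR.UTF-8");
    std::cout << "French locale: " << fr_locale.name() << std::endl;

    // The "C" locale is always available
    std::locale c_locale("C");
    std::cout << "C locale: " << c_locale.name() << std::endl;

    return 0;
}
```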
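Named locales such as fr_FR.UTF-8 exist only if the operating system has them installed, and constructing a missing one throws std::runtime_error. The sketch below shows one way to guard against that; `locale_or_classic` is a hypothetical helper name, not part of the standard library, and falling back to the "C" locale is an assumption about what the application wants.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the named locale, or the classic "C" locale
// when the system does not know the name.
std::locale locale_or_classic(const std::string& name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

A caller could write `std::cout.imbue(locale_or_classic("fr_FR.UTF-8"));` and still get usable output on a machine without French locale data.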
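The Facet System
std::locale delivers its localization services through facets, each responsible for one category:
- std::ctype: character classification and conversion
- std::collate: string comparison
- std::numpunct: numeric punctuation
- std::num_get / std::num_put: numeric input and output
- std::money_get / std::money_put: monetary input and output
- std::time_get / std::time_put: time input and output
- std::messages: access to message catalogs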
```cpp
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::locale fr_locale("fr_FR.UTF-8");  // throws if the locale is not installed

    // Numeric punctuation facet
    const auto& numpunct = std::use_facet<std::numpunct<char>>(fr_locale);
    std::cout << "Decimal point in French: '" << numpunct.decimal_point() << "'" << std::endl;
    std::cout << "Thousands separator in French: '" << numpunct.thousands_sep() << "'" << std::endl;

    // Monetary punctuation facet
    const auto& moneypunct = std::use_facet<std::moneypunct<char>>(fr_locale);
    std::cout << "Currency symbol in French: '" << moneypunct.curr_symbol() << "'" << std::endl;
    std::cout << "Decimal point in currency: '" << moneypunct.decimal_point() << "'" << std::endl;

    return 0;
}
```
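The facet list above also includes std::ctype, which the example does not exercise. As a minimal sketch, the helpers below use it through std::use_facet; the names `to_upper` and `count_digits` are illustrative, and both default to the classic "C" locale so that they work without any installed locale data.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Uppercases a string with the std::ctype<char> facet of the given locale.
std::string to_upper(std::string text, const std::locale& loc = std::locale::classic()) {
    const auto& ctype = std::use_facet<std::ctype<char>>(loc);
    if (!text.empty()) {
        // toupper(low, high) converts the range [low, high) in place
        ctype.toupper(&text[0], &text[0] + text.size());
    }
    return text;
}

// Counts the characters that the facet classifies as digits.
std::size_t count_digits(const std::string& text,
                         const std::locale& loc = std::locale::classic()) {
    const auto& ctype = std::use_facet<std::ctype<char>>(loc);
    std::size_t count = 0;
    for (char c : text) {
        if (ctype.is(std::ctype_base::digit, c)) ++count;
    }
    return count;
}
```

Because std::ctype<char> works on single bytes, this approach is only reliable for single-byte encodings; multi-byte UTF-8 text needs a Unicode-aware library.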
Localized Numbers and Currency
```cpp
#include <iomanip>
#include <iostream>
#include <locale>

int main() {
    // Remember the stream's current locale so it can be restored
    std::locale old_locale = std::cout.getloc();

    // US format
    std::cout.imbue(std::locale("en_US.UTF-8"));
    std::cout << "US format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;

    // German format
    std::cout.imbue(std::locale("de_DE.UTF-8"));
    std::cout << "German format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;

    // Indian format
    std::cout.imbue(std::locale("hi_IN.UTF-8"));
    std::cout << "Indian format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;

    // Restore the original locale
    std::cout.imbue(old_locale);
    return 0;
}
```
Localized Dates and Times
```cpp
#include <ctime>
#include <iomanip>
#include <iostream>
#include <locale>

int main() {
    // Current time as a broken-down local time
    std::time_t now = std::time(nullptr);
    std::tm* tm_now = std::localtime(&now);  // not thread-safe: returns a pointer to static storage

    std::locale old_locale = std::cout.getloc();

    // %c is the locale's preferred date and time representation
    std::cout.imbue(std::locale("en_US.UTF-8"));
    std::cout << "US format: " << std::put_time(tm_now, "%c") << std::endl;

    std::cout.imbue(std::locale("fr_FR.UTF-8"));
    std::cout << "French format: " << std::put_time(tm_now, "%c") << std::endl;

    std::cout.imbue(std::locale("ja_JP.UTF-8"));
    std::cout << "Japanese format: " << std::put_time(tm_now, "%c") << std::endl;

    // Restore the original locale
    std::cout.imbue(old_locale);
    return 0;
}
```
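The same std::put_time call can also write into a string stream, which is handy when the formatted text is needed as a value rather than printed. The following is a sketch under that assumption; `format_time` and `make_date` are hypothetical helper names, and the default is the classic "C" locale so the result does not depend on installed locale data.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Formats a broken-down time with a strftime-style format under the given locale.
std::string format_time(const std::tm& tm, const char* fmt,
                        const std::locale& loc = std::locale::classic()) {
    std::ostringstream out;
    out.imbue(loc);
    out << std::put_time(&tm, fmt);
    return out.str();
}

// Builds a std::tm for a calendar date (only year, month, and day are set).
std::tm make_date(int year, int month, int day) {
    std::tm tm{};
    tm.tm_year = year - 1900;  // years since 1900
    tm.tm_mon = month - 1;     // months are zero-based
    tm.tm_mday = day;
    return tm;
}
```

For example, `format_time(make_date(2023, 12, 25), "%Y-%m-%d")` yields an ISO-style date, while passing a French locale with "%B" would give the localized month name.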
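Custom Locales
```cpp
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

// Custom numeric punctuation: ',' as decimal point and '.' as thousands separator
class CustomNumpunct : public std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }  // groups of three digits
};

int main() {
    // The locale takes ownership of the facet and deletes it when it is no longer used
    std::locale custom_locale(std::locale(), new CustomNumpunct);
    std::cout.imbue(custom_locale);

    // std::fixed is required here: with the default format and precision this
    // value would be printed in scientific notation
    std::cout << "Custom format: " << std::fixed << std::setprecision(2)
              << 1234567.89 << std::endl;  // 1.234.567,89

    return 0;
}
```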
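Unicode Handling
Unicode Basics
Unicode is an international standard that assigns a unique numeric code point to every character, punctuation mark, and symbol it covers.
- Code point: the unique number of a character in Unicode, ranging from U+0000 to U+10FFFF
- Plane: the code space is divided into 17 planes of 65,536 code points each
- BMP: the Basic Multilingual Plane (U+0000 to U+FFFF), which holds the most commonly used characters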
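The relationship between code points, planes, and the BMP can be sketched in a few lines. The function names below are illustrative; the surrogate-pair arithmetic follows the UTF-16 definition, in which a code point outside the BMP is split into two 16-bit units.

```cpp
#include <utility>

// Plane index (0 to 16) of a code point: each plane spans 0x10000 code points.
constexpr unsigned plane_of(char32_t cp) {
    return static_cast<unsigned>(cp >> 16);
}

// True if the code point lies in the Basic Multilingual Plane.
constexpr bool in_bmp(char32_t cp) {
    return plane_of(cp) == 0;
}

// UTF-16 surrogate pair for a code point outside the BMP.
// Precondition: 0x10000 <= cp <= 0x10FFFF.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    const char32_t v = cp - 0x10000;  // 20-bit value
    return {static_cast<char16_t>(0xD800 + (v >> 10)),    // high surrogate: top 10 bits
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate: bottom 10 bits
}
```

This is why a UTF-16 string's length in code units can exceed its length in characters as soon as it contains text from a supplementary plane, such as most emoji.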
Unicode Support in C++
Working with UTF-8
```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Counts code points in a NUL-terminated UTF-8 string: every byte that is not
// a continuation byte (10xxxxxx) starts a new code point
std::size_t utf8_length(const char* str) {
    std::size_t length = 0;
    while (*str) {
        if ((*str++ & 0xC0) != 0x80) {
            ++length;
        }
    }
    return length;
}

// Same count for a std::string
std::size_t utf8_length(const std::string& str) {
    std::size_t length = 0;
    for (char c : str) {
        if ((c & 0xC0) != 0x80) {
            ++length;
        }
    }
    return length;
}

// Decodes UTF-8 to UTF-32. Stray or truncated sequences become U+FFFD;
// continuation bytes are not otherwise validated.
std::u32string utf8_to_utf32(const std::string& utf8_str) {
    std::u32string utf32_str;
    const std::size_t n = utf8_str.size();
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(utf8_str[k]); };

    for (std::size_t i = 0; i < n;) {
        const unsigned char c = byte(i);
        char32_t code_point;
        std::size_t len;
        if (c < 0x80) {                       // 1 byte: 0xxxxxxx
            code_point = c;
            len = 1;
        } else if (c >= 0xC0 && c < 0xE0) {   // 2 bytes: 110xxxxx
            code_point = c & 0x1F;
            len = 2;
        } else if (c >= 0xE0 && c < 0xF0) {   // 3 bytes: 1110xxxx
            code_point = c & 0x0F;
            len = 3;
        } else if (c >= 0xF0 && c < 0xF8) {   // 4 bytes: 11110xxx
            code_point = c & 0x07;
            len = 4;
        } else {                              // stray continuation or invalid lead byte
            utf32_str.push_back(0xFFFD);
            ++i;
            continue;
        }
        if (i + len > n) {                    // sequence truncated at end of input
            utf32_str.push_back(0xFFFD);
            break;
        }
        for (std::size_t k = 1; k < len; ++k) {
            code_point = (code_point << 6) | (byte(i + k) & 0x3F);
        }
        utf32_str.push_back(code_point);
        i += len;
    }
    return utf32_str;
}

// Encodes UTF-32 to UTF-8
std::string utf32_to_utf8(const std::u32string& utf32_str) {
    std::string utf8_str;
    for (char32_t code_point : utf32_str) {
        if (code_point < 0x80) {
            utf8_str.push_back(static_cast<char>(code_point));
        } else if (code_point < 0x800) {
            utf8_str.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
            utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else if (code_point < 0x10000) {
            utf8_str.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
            utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else {
            utf8_str.push_back(static_cast<char>(0xF0 | (code_point >> 18)));
            utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 12) & 0x3F)));
            utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        }
    }
    return utf8_str;
}

int main() {
    // A plain literal is used because, since C++20, u8"..." has type const char8_t[]
    // and cannot initialize std::string. Assumes the source file and the execution
    // character set are UTF-8.
    std::string utf8_str = "Hello, 世界! こんにちは! مرحبا!";

    std::cout << "UTF-8 string: " << utf8_str << std::endl;
    std::cout << "UTF-8 length (bytes): " << utf8_str.size() << std::endl;
    std::cout << "UTF-8 length (characters): " << utf8_length(utf8_str) << std::endl;

    std::u32string utf32_str = utf8_to_utf32(utf8_str);
    std::cout << "UTF-32 length (characters): " << utf32_str.size() << std::endl;

    std::string converted_back = utf32_to_utf8(utf32_str);
    std::cout << "Converted back: " << converted_back << std::endl;

    return 0;
}
```
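Hand-written converters like the ones above are safest when the input has been checked first. The following is a minimal validation sketch, not a replacement for a tested library; the name `is_valid_utf8` is illustrative. It rejects wrong sequence lengths, overlong encodings, surrogate code points, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    for (std::size_t i = 0; i < n;) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        char32_t min;  // smallest code point that legitimately needs `len` bytes
        if (c < 0x80)                { len = 1; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > n) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        i += len;
    }
    return true;
}
```

Calling this before `utf8_to_utf32` lets the caller reject malformed input outright instead of decoding it into replacement characters.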
Using the ICU Library
For complex Unicode processing, the ICU (International Components for Unicode) library is recommended; it offers far more complete Unicode support than the standard library.
```cpp
#include <iostream>
#include <memory>
#include <string>
#include <unicode/translit.h>
#include <unicode/unistr.h>

// toUTF8String appends to its argument, so each conversion starts from an empty string
std::string to_utf8(const icu::UnicodeString& s) {
    std::string out;
    s.toUTF8String(out);
    return out;
}

int main() {
    // Create a Unicode string from UTF-8 input
    icu::UnicodeString ustr = icu::UnicodeString::fromUTF8("Hello, 世界! こんにちは!");

    // Case conversion
    icu::UnicodeString lower_str = ustr;
    lower_str.toLower();
    icu::UnicodeString upper_str = ustr;
    upper_str.toUpper();

    // Transliteration to Latin script; transliterate() modifies its argument in place
    icu::UnicodeString romanized = ustr;
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> transliterator(
        icu::Transliterator::createInstance(icu::UnicodeString("Any-Latin"),
                                            UTRANS_FORWARD, status));
    if (U_SUCCESS(status) && transliterator) {
        transliterator->transliterate(romanized);
    }

    std::cout << "Original: " << to_utf8(ustr) << std::endl;
    std::cout << "Lowercase: " << to_utf8(lower_str) << std::endl;
    std::cout << "Uppercase: " << to_utf8(upper_str) << std::endl;
    std::cout << "Romanized: " << to_utf8(romanized) << std::endl;

    return 0;
}
```
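Message Internationalization
The gettext System
gettext is a widely used message internationalization system. It separates messages from code so that they can be translated independently.
Basic Usage
- Mark the messages that need translation:
```cpp
#include <iostream>
#include <libintl.h>
#include <locale.h>

#define _(msg) gettext(msg)

int main() {
    // Adopt the locale from the environment (LANG, LC_ALL, ...)
    setlocale(LC_ALL, "");

    // Look for catalogs of the "myapp" domain under ./locale
    bindtextdomain("myapp", "./locale");
    textdomain("myapp");

    // Marked messages are looked up in the catalog at run time
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;

    return 0;
}
```
- Extract the messages:
Use the xgettext tool to extract the marked messages into a POT template file. The --keyword=_ option tells xgettext to treat _() as a translation marker:
```bash
xgettext --keyword=_ -d myapp -o myapp.pot main.cpp
```
- Translate the messages:
Create a language-specific PO file (such as fr_FR.po) from the template and translate its messages.
- Compile the messages:
Use the msgfmt tool to compile the PO file into an MO file:
```bash
mkdir -p locale/fr_FR/LC_MESSAGES
msgfmt -o locale/fr_FR/LC_MESSAGES/myapp.mo fr_FR.po
```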
A Custom Message System
```cpp
#include <iostream>
#include <locale>
#include <map>
#include <string>

// A minimal in-memory message catalog: locale name -> (message id -> translation)
class MessageCatalog {
private:
    std::map<std::string, std::map<std::string, std::string>> catalogs;
    std::string current_locale;

public:
    MessageCatalog() {
        // Start with the name of the global locale ("C" unless it has been changed)
        std::locale default_locale;
        current_locale = default_locale.name();
    }

    void set_locale(const std::string& loc) {
        current_locale = loc;
    }

    void add_message(const std::string& msgid, const std::string& msgstr,
                     const std::string& loc) {
        catalogs[loc][msgid] = msgstr;
    }

    std::string get_message(const std::string& msgid) const {
        // Look in the current locale first
        auto loc_it = catalogs.find(current_locale);
        if (loc_it != catalogs.end()) {
            auto msg_it = loc_it->second.find(msgid);
            if (msg_it != loc_it->second.end()) {
                return msg_it->second;
            }
        }

        // Fall back to the "C" catalog
        auto default_it = catalogs.find("C");
        if (default_it != catalogs.end()) {
            auto msg_it = default_it->second.find(msgid);
            if (msg_it != default_it->second.end()) {
                return msg_it->second;
            }
        }

        // Last resort: return the message id itself
        return msgid;
    }
};

// Global catalog and a gettext-style shorthand
MessageCatalog msg_cat;

#define _(msgid) msg_cat.get_message(msgid)

int main() {
    // Register messages
    msg_cat.add_message("Hello, World!", "Hello, World!", "C");
    msg_cat.add_message("How are you?", "How are you?", "C");

    msg_cat.add_message("Hello, World!", "Bonjour, Monde!", "fr_FR");
    msg_cat.add_message("How are you?", "Comment allez-vous?", "fr_FR");

    msg_cat.add_message("Hello, World!", "こんにちは、世界!", "ja_JP");
    msg_cat.add_message("How are you?", "お元気ですか?", "ja_JP");

    std::cout << "Default locale:" << std::endl;
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;

    std::cout << "\nFrench locale:" << std::endl;
    msg_cat.set_locale("fr_FR");
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;

    std::cout << "\nJapanese locale:" << std::endl;
    msg_cat.set_locale("ja_JP");
    std::cout << _("Hello, World!") << std::endl;
    std::cout << _("How are you?") << std::endl;

    return 0;
}
```
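The catalog above looks up the exact locale name and then jumps straight to "C". System locale names often carry a codeset suffix (for example fr_FR.UTF-8), so a lookup can benefit from trying progressively less specific names. The sketch below illustrates that idea; `fallback_chain` is a hypothetical helper, and the order it produces is a design assumption rather than something the original catalog defines.

```cpp
#include <string>
#include <vector>

// Candidate catalog names for a locale name, most specific first.
// Example: "fr_FR.UTF-8" -> {"fr_FR.UTF-8", "fr_FR", "fr", "C"}
std::vector<std::string> fallback_chain(const std::string& locale_name) {
    std::vector<std::string> chain;
    // Append a candidate unless it is empty or repeats the previous one
    auto add = [&chain](const std::string& s) {
        if (!s.empty() && (chain.empty() || chain.back() != s)) {
            chain.push_back(s);
        }
    };

    add(locale_name);
    // Drop the codeset or modifier: "fr_FR.UTF-8" -> "fr_FR"
    const std::string no_codeset = locale_name.substr(0, locale_name.find_first_of(".@"));
    add(no_codeset);
    // Drop the territory: "fr_FR" -> "fr"
    add(no_codeset.substr(0, no_codeset.find('_')));
    // The "C" catalog is always the last resort
    add("C");
    return chain;
}
```

A `get_message` implementation could iterate over this chain and return the first translation it finds, before falling back to the message id.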
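Regular Expression Basics
Regular Expression Syntax
C++11 introduced the <regex> header, which provides regular expression support. By default, std::regex uses a grammar based on the ECMAScript standard.
Basic Syntax
| Syntax | Description | Example |
|---|---|---|
| `^` | Matches the start of the string | `^Hello` matches strings that start with Hello |
| `$` | Matches the end of the string | `World$` matches strings that end with World |
| `.` | Matches any single character except a newline | `H.llo` matches Hello, Hallo, etc. |
| `*` | Matches the preceding element zero or more times | `ab*c` matches ac, abc, abbc, etc. |
| `+` | Matches the preceding element one or more times | `ab+c` matches abc, abbc, etc., but not ac |
| `?` | Matches the preceding element zero or one time | `ab?c` matches ac and abc, but not abbc |
| `{n}` | Matches the preceding element exactly n times | `a{3}` matches aaa |
| `{n,}` | Matches the preceding element at least n times | `a{2,}` matches aa, aaa, etc. |
| `{n,m}` | Matches the preceding element between n and m times | `a{2,4}` matches aa, aaa, aaaa |
| `[]` | Matches any one character listed in the brackets | `[aeiou]` matches any vowel |
| `[^]` | Matches any one character not listed in the brackets | `[^0-9]` matches any non-digit character |
| `\|` | Matches either the expression on its left or the one on its right | `a\|b` matches a or b |
| `()` | Capture group: matches the enclosed expression and captures it | `(ab)+` matches ab, abab, etc., and captures ab |
| `\d` | Matches any digit; equivalent to `[0-9]` | `\d+` matches one or more digits |
| `\D` | Matches any non-digit; equivalent to `[^0-9]` | `\D+` matches one or more non-digits |
| `\w` | Matches any letter, digit, or underscore; equivalent to `[a-zA-Z0-9_]` | `\w+` matches one or more word characters |
| `\W` | Matches any non-word character; equivalent to `[^a-zA-Z0-9_]` | `\W+` matches one or more non-word characters |
| `\s` | Matches any whitespace character (space, tab, newline, etc.) | `\s+` matches one or more whitespace characters |
| `\S` | Matches any non-whitespace character | `\S+` matches one or more non-whitespace characters |
| `\b` | Matches a word boundary | `\bword\b` matches word as a whole word |
| `\B` | Matches a position that is not a word boundary | `\Bword\B` matches word inside a longer word |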
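Several rows of the table can be checked directly with two small wrappers. This is a sketch for experimenting with the syntax, with illustrative function names; it compiles the pattern on every call, which is fine for exploration but not for hot paths.

```cpp
#include <regex>
#include <string>

// True if the whole of `text` matches `pattern` (default ECMAScript grammar).
bool full_match(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// True if `pattern` matches anywhere inside `text`.
bool contains(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

For instance, `full_match("ac", "ab*c")` succeeds because `*` allows zero occurrences of b, while `full_match("ac", "ab+c")` fails because `+` requires at least one.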
Using std::regex
Basic Operations
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // regex_match: the whole string must match the pattern
    std::string text = "Hello, World!";
    std::regex pattern(R"(Hello, \w+!)");
    if (std::regex_match(text, pattern)) {
        std::cout << "Text matches pattern" << std::endl;
    } else {
        std::cout << "Text does not match pattern" << std::endl;
    }

    // regex_search: find the first match anywhere in the string
    std::string text2 = "The price is $45.99, discounted to $39.99";
    std::regex price_pattern(R"(\$\d+\.\d{2})");
    std::smatch matches;
    if (std::regex_search(text2, matches, price_pattern)) {
        std::cout << "Found price: " << matches[0] << std::endl;
    }

    // regex_replace: replace every match
    std::string text3 = "The cat sat on the mat";
    std::regex cat_pattern(R"(cat)");
    std::string replaced = std::regex_replace(text3, cat_pattern, "dog");
    std::cout << "Original: " << text3 << std::endl;
    std::cout << "Replaced: " << replaced << std::endl;

    // sregex_iterator: iterate over all matches
    std::string text4 = "Contact info: john@example.com, jane@test.org";
    std::regex email_pattern(R"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})");
    std::sregex_iterator it(text4.begin(), text4.end(), email_pattern);
    std::sregex_iterator end;

    std::cout << "Found emails:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }

    return 0;
}
```
Capture Groups
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "John Doe (john@example.com) - 30 years old";

    // Four capture groups: first name, last name, email, age
    std::regex pattern(R"(([A-Za-z]+) ([A-Za-z]+) \(([^)]+)\) - (\d+) years old)");
    std::smatch matches;

    if (std::regex_match(text, matches, pattern)) {
        // matches[0] is the full match; matches[1..4] are the groups
        std::cout << "Full match: " << matches[0] << std::endl;
        std::cout << "First name: " << matches[1] << std::endl;
        std::cout << "Last name: " << matches[2] << std::endl;
        std::cout << "Email: " << matches[3] << std::endl;
        std::cout << "Age: " << matches[4] << std::endl;
    }

    return 0;
}
```
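Capture groups can also be referenced from the replacement text of std::regex_replace, where `$1`, `$2`, and so on stand for the captured substrings. The sketch below reorders the parts of a date this way; the function name `to_day_first` and the chosen date formats are illustrative.

```cpp
#include <regex>
#include <string>

// Rewrites every YYYY-MM-DD date in `text` as DD/MM/YYYY.
std::string to_day_first(const std::string& text) {
    // Groups: 1 = year, 2 = month, 3 = day
    static const std::regex iso_date(R"((\d{4})-(\d{2})-(\d{2}))");
    return std::regex_replace(text, iso_date, "$3/$2/$1");
}
```

Text that contains no match is returned unchanged, so the function is safe to apply to arbitrary input.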
Regular Expression Flags
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Case-insensitive matching
    std::string text = "Hello, WORLD! hello, world!";
    std::regex case_insensitive_pattern(R"(hello, world!)", std::regex_constants::icase);
    std::smatch matches;
    if (std::regex_search(text, matches, case_insensitive_pattern)) {
        std::cout << "Found: " << matches[0] << std::endl;
    }

    // Multiline matching: ^ and $ match at line boundaries.
    // The multiline flag was added in C++17 and may be missing from older
    // standard library implementations.
    std::string multi_line_text = "Line 1: test\nLine 2: TEST\nLine 3: Test";
    std::regex multi_line_pattern(
        R"(^Line \d+: test$)",
        std::regex_constants::icase | std::regex_constants::multiline);
    std::sregex_iterator it(multi_line_text.begin(), multi_line_text.end(), multi_line_pattern);
    std::sregex_iterator end;

    std::cout << "Found lines:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }

    return 0;
}
```
Advanced Regular Expression Techniques
Lookahead and Lookbehind Assertions
The ECMAScript grammar used by std::regex supports lookahead, written (?=...) and (?!...), but not lookbehind, written (?<=...) and (?<!...). Constructing a std::regex from a pattern that contains a lookbehind throws std::regex_error, so the second example below emulates it with capture groups.
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Positive lookahead: match "$" only when a price follows it
    std::string text = "Price: $45.99, Discount: $10.00, Total: $35.99";
    std::regex lookahead_pattern(R"(\$(?=\d+\.\d{2}))");
    std::string result1 = std::regex_replace(text, lookahead_pattern, "USD ");
    std::cout << "With lookahead: " << result1 << std::endl;

    // Lookbehind emulation: capture the word before " pie" instead of asserting it
    std::string text2 = "apple pie, banana bread, cherry tart";
    std::regex before_pie_pattern(R"((\w+)( pie))");
    std::smatch matches;
    if (std::regex_search(text2, matches, before_pie_pattern)) {
        std::cout << "Found '" << matches[2] << "' preceded by '" << matches[1] << "'"
                  << std::endl;
    }

    // Negative lookahead: match addresses whose domain is not gmail.com
    std::string emails = "john@example.com, jane@gmail.com, bob@test.org";
    std::regex negative_lookahead(
        R"([a-zA-Z0-9._%+-]+@(?!gmail\.com)[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})");
    std::sregex_iterator it(emails.begin(), emails.end(), negative_lookahead);
    std::sregex_iterator end;

    std::cout << "Non-Gmail emails:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }

    return 0;
}
```
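Because a lookahead consumes no input, several of them can be stacked at the same position to require independent conditions. The sketch below applies this to a password check; the policy itself (eight characters, one digit, one lowercase and one uppercase letter) and the name `is_strong_password` are assumptions made for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical policy: at least 8 characters, containing at least one digit,
// one lowercase letter, and one uppercase letter.
bool is_strong_password(const std::string& candidate) {
    // Each (?=...) is checked from the start of the string without consuming it;
    // .{8,} then enforces the minimum length.
    static const std::regex policy(R"((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,})");
    return std::regex_match(candidate, policy);
}
```

Adding another requirement means adding another lookahead, without having to enumerate every possible ordering of character classes.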
Greedy and Non-Greedy Quantifiers
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "<div>First div</div><div>Second div</div>";

    // Greedy: .* takes as much as possible, so one match spans both divs
    std::regex greedy_pattern(R"(<div>.*</div>)");
    std::smatch greedy_match;
    if (std::regex_search(text, greedy_match, greedy_pattern)) {
        std::cout << "Greedy match: " << greedy_match[0] << std::endl;
    }

    // Non-greedy: .*? takes as little as possible, so each div matches separately
    std::regex non_greedy_pattern(R"(<div>.*?</div>)");
    std::sregex_iterator it(text.begin(), text.end(), non_greedy_pattern);
    std::sregex_iterator end;

    std::cout << "Non-greedy matches:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }

    return 0;
}
```
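Conditional Regular Expressions
std::regex does not support conditional subpatterns such as (?(1)...), and a pattern that uses one throws std::regex_error. The same effect is achieved here by writing each accepted form as an explicit alternative.
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string phone_numbers = "555-1234, (123) 555-6789";

    // Optional area code in one of two forms: "(123) " or "123-"
    std::regex phone_pattern(R"((?:\(\d{3}\) |\d{3}-)?\d{3}-\d{4})");
    std::sregex_iterator it(phone_numbers.begin(), phone_numbers.end(), phone_pattern);
    std::sregex_iterator end;

    std::cout << "Found phone numbers:" << std::endl;
    while (it != end) {
        std::cout << it->str() << std::endl;
        ++it;
    }

    return 0;
}
```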
Combining Regular Expressions and Localization
Handling Unicode Characters
std::regex has no Unicode mode: there is no std::regex_constants::u flag, and Unicode property classes such as \p{Han} are not recognized. One workaround, shown below, is to match wide characters against a code point range.
```cpp
#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
    // Assumes the en_US.UTF-8 locale is installed and the terminal expects UTF-8.
    // Only wide output is used, because mixing std::cout and std::wcout on the
    // same stream is unreliable.
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());

    std::wstring text = L"Hello, 世界! こんにちは!";

    // CJK Unified Ideographs block, U+4E00 to U+9FFF; this covers the most
    // common Han characters but not the extension blocks
    std::wregex chinese_pattern(L"[\u4E00-\u9FFF]+");
    std::wsregex_iterator it(text.begin(), text.end(), chinese_pattern);
    std::wsregex_iterator end;

    std::wcout << L"Found Chinese characters:" << std::endl;
    while (it != end) {
        std::wcout << it->str() << std::endl;
        ++it;
    }

    return 0;
}
```
Localized Regular Expressions
```cpp
#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
    std::string german_number = "1.234.567,89";
    std::string us_number = "1,234,567.89";

    // German: '.' groups thousands, ',' is the decimal separator
    std::regex german_pattern(R"(\d{1,3}(\.\d{3})*(,\d{2})?)");
    // US: ',' groups thousands, '.' is the decimal separator
    std::regex us_pattern(R"(\d{1,3}(,\d{3})*(\.\d{2})?)");

    std::cout << "German number format:" << std::endl;
    if (std::regex_match(german_number, german_pattern)) {
        std::cout << "Valid: " << german_number << std::endl;
    }

    std::cout << "US number format:" << std::endl;
    if (std::regex_match(us_number, us_pattern)) {
        std::cout << "Valid: " << us_number << std::endl;
    }

    return 0;
}
```
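Validating the shape of a localized number is usually the first half of the job; the second half is turning it into a value. The sketch below does both for the German format. The name `parse_german_number` is illustrative, std::optional requires C++17, and the conversion assumes the global C locale still uses '.' as its decimal point, since std::stod depends on it.

```cpp
#include <optional>
#include <regex>
#include <string>

// Parses a German-formatted number such as "1.234.567,89".
// Returns std::nullopt when the text does not have that shape.
std::optional<double> parse_german_number(const std::string& text) {
    static const std::regex shape(R"(\d{1,3}(\.\d{3})*(,\d+)?)");
    if (!std::regex_match(text, shape)) {
        return std::nullopt;
    }
    std::string normalized;
    for (char c : text) {
        if (c == '.') continue;                    // drop thousands separators
        normalized.push_back(c == ',' ? '.' : c);  // decimal comma -> decimal point
    }
    return std::stod(normalized);
}
```

A US-formatted string such as "1,234,567.89" is rejected by the shape check rather than silently misread.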
Performance Optimization
Regular Expression Performance
Regular expression performance depends on several factors, including pattern complexity, input size, and implementation details. The following techniques help:
1. Precompile Regular Expressions
Constructing a std::regex compiles the pattern, so a regular expression that is used repeatedly should be constructed once and reused.
```cpp
#include <chrono>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "The price is $45.99, discounted to $39.99, final price $29.99";

    // Slow: the pattern is compiled on every iteration
    auto start1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i) {
        std::regex pattern(R"(\$\d+\.\d{2})");
        std::smatch matches;
        std::regex_search(text, matches, pattern);
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto duration1 =
        std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    std::cout << "Time with repeated compilation: " << duration1 << " ms" << std::endl;

    // Fast: the pattern is compiled once, outside the loop
    std::regex pattern(R"(\$\d+\.\d{2})");
    auto start2 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i) {
        std::smatch matches;
        std::regex_search(text, matches, pattern);
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 =
        std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    std::cout << "Time with precompiled regex: " << duration2 << " ms" << std::endl;

    return 0;
}
```
2. Avoid Backtracking
Backtracking is a common cause of regular expression performance problems, especially on large inputs.
```cpp
#include <chrono>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = std::string(1000, 'a') + "b";

    // Ambiguous: two adjacent a* can split the run of 'a' in many ways
    std::regex bad_pattern(R"(a*a*b)");
    // Unambiguous: a single a* matches the same strings
    std::regex good_pattern(R"(a*b)");

    auto start1 = std::chrono::high_resolution_clock::now();
    std::smatch matches1;
    std::regex_search(text, matches1, bad_pattern);
    auto end1 = std::chrono::high_resolution_clock::now();
    auto duration1 =
        std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    std::cout << "Time with bad pattern: " << duration1 << " ms" << std::endl;

    auto start2 = std::chrono::high_resolution_clock::now();
    std::smatch matches2;
    std::regex_search(text, matches2, good_pattern);
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 =
        std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    std::cout << "Time with good pattern: " << duration2 << " ms" << std::endl;

    return 0;
}
```
3. Choose an Appropriate Regular Expression Grammar
std::regex supports several pattern grammars, selected with a flag at construction. These flags change the syntax used to parse the pattern, not the matching engine behind it:
- std::regex_constants::ECMAScript: the default, based on ECMAScript syntax
- std::regex_constants::basic: POSIX basic syntax
- std::regex_constants::extended: POSIX extended syntax
- std::regex_constants::awk: awk-style syntax
- std::regex_constants::grep: grep-style syntax
- std::regex_constants::egrep: egrep-style syntax
```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Hello, World!";

    // ECMAScript grammar: \w and + are available
    std::regex ecma_pattern(R"(Hello, \w+!)", std::regex_constants::ECMAScript);

    // POSIX basic grammar: '+' is an ordinary character there, so
    // "one or more" is written as one occurrence followed by "zero or more"
    std::regex basic_pattern("Hello, [a-zA-Z0-9_][a-zA-Z0-9_]*!",
                             std::regex_constants::basic);

    std::cout << "ECMAScript grammar:" << std::endl;
    if (std::regex_match(text, ecma_pattern)) {
        std::cout << "Match found" << std::endl;
    }

    std::cout << "Basic grammar:" << std::endl;
    if (std::regex_match(text, basic_pattern)) {
        std::cout << "Match found" << std::endl;
    }

    return 0;
}
```
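The practical consequence of choosing a grammar is that the same pattern text can mean different things. The sketch below makes that visible for the `+` character; `matches_with` is an illustrative helper name, and the behavior shown relies on the POSIX rule that `+` is not a special character in basic syntax.

```cpp
#include <regex>
#include <string>

// True if the whole of `text` matches `pattern` when parsed with the given grammar flag.
bool matches_with(const std::string& text, const std::string& pattern,
                  std::regex_constants::syntax_option_type grammar) {
    return std::regex_match(text, std::regex(pattern, grammar));
}
```

With this helper, the pattern "a+" matches "aa" under ECMAScript and extended, but under basic it only matches the literal two-character text "a+".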
Localization Performance Optimization
1. Cache locale Objects
```cpp
#include <iostream>
#include <locale>
#include <map>
#include <string>

// Caches std::locale objects by name so that each one is constructed only once
class LocaleCache {
private:
    std::map<std::string, std::locale> cache;

public:
    std::locale get_locale(const std::string& name) {
        auto it = cache.find(name);
        if (it != cache.end()) {
            return it->second;  // cache hit: copying a locale is cheap
        }

        std::locale loc(name);  // throws std::runtime_error if the locale is unavailable
        cache.emplace(name, loc);
        return loc;
    }
};

int main() {
    LocaleCache cache;

    // First request constructs the locale
    std::locale fr_locale = cache.get_locale("fr_FR.UTF-8");
    std::cout << "French locale: " << fr_locale.name() << std::endl;

    // Second request is served from the cache
    std::locale fr_locale2 = cache.get_locale("fr_FR.UTF-8");
    std::cout << "French locale (cached): " << fr_locale2.name() << std::endl;

    return 0;
}
```
2. Avoid Frequent locale Switches
```cpp
#include <chrono>
#include <iomanip>
#include <iostream>
#include <locale>

int main() {
    double value = 1234567.89;

    std::locale default_locale = std::cout.getloc();
    std::locale fr_locale("fr_FR.UTF-8");
    std::locale de_locale("de_DE.UTF-8");
    std::locale us_locale("en_US.UTF-8");

    // Slow: the stream's locale is switched three times per iteration
    auto start1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::cout.imbue(fr_locale);
        std::cout << value << " ";
        std::cout.imbue(de_locale);
        std::cout << value << " ";
        std::cout.imbue(us_locale);
        std::cout << value << " ";
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto duration1 =
        std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    std::cout << "\nTime with frequent locale switches: " << duration1 << " ms" << std::endl;

    // Fast: output is batched per locale, so each locale is imbued once
    auto start2 = std::chrono::high_resolution_clock::now();
    std::cout.imbue(fr_locale);
    for (int i = 0; i < 1000; ++i) {
        std::cout << value << " ";
    }
    std::cout.imbue(de_locale);
    for (int i = 0; i < 1000; ++i) {
        std::cout << value << " ";
    }
    std::cout.imbue(us_locale);
    for (int i = 0; i < 1000; ++i) {
        std::cout << value << " ";
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto duration2 =
        std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    std::cout << "\nTime with batch processing: " << duration2 << " ms" << std::endl;

    // Restore the original locale
    std::cout.imbue(default_locale);
    return 0;
}
```
Practical Examples
Example 1: Form Validation
```cpp
#include <iostream>
#include <regex>
#include <string>
#include <vector>

class FormValidator {
private:
    // Patterns are compiled once, when the validator is constructed
    std::regex email_pattern{R"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"};
    std::regex phone_pattern{R"(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"};
    std::regex zip_pattern{R"(\d{5}(-\d{4})?)"};
    std::regex date_pattern{R"(\d{4}-\d{2}-\d{2})"};  // checks the shape only, not the calendar

public:
    bool validate_email(const std::string& email) const {
        return std::regex_match(email, email_pattern);
    }

    bool validate_phone(const std::string& phone) const {
        return std::regex_match(phone, phone_pattern);
    }

    bool validate_zip(const std::string& zip) const {
        return std::regex_match(zip, zip_pattern);
    }

    bool validate_date(const std::string& date) const {
        return std::regex_match(date, date_pattern);
    }

    // Returns one error message per invalid field
    std::vector<std::string> validate_form(const std::string& email,
                                           const std::string& phone,
                                           const std::string& zip,
                                           const std::string& date) const {
        std::vector<std::string> errors;

        if (!validate_email(email)) {
            errors.push_back("Invalid email format");
        }
        if (!validate_phone(phone)) {
            errors.push_back("Invalid phone format");
        }
        if (!validate_zip(zip)) {
            errors.push_back("Invalid zip code format");
        }
        if (!validate_date(date)) {
            errors.push_back("Invalid date format (YYYY-MM-DD)");
        }

        return errors;
    }
};

int main() {
    FormValidator validator;

    // Valid input
    std::vector<std::string> errors1 = validator.validate_form(
        "john@example.com", "(123) 456-7890", "12345-6789", "2023-12-25");

    std::cout << "Valid input errors: " << errors1.size() << std::endl;
    for (const auto& error : errors1) {
        std::cout << "- " << error << std::endl;
    }

    // Invalid input
    std::vector<std::string> errors2 = validator.validate_form(
        "invalid-email", "1234567", "1234", "2023/12/25");

    std::cout << "\nInvalid input errors: " << errors2.size() << std::endl;
    for (const auto& error : errors2) {
        std::cout << "- " << error << std::endl;
    }

    return 0;
}
```
Example 2: Log Analysis
```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <string>
#include <vector>

class LogAnalyzer {
private:
    // Groups: 1 = timestamp, 2 = level, 3 = message
    std::regex log_pattern{R"(\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*))"};
    std::map<std::string, int> level_counts;
    std::map<std::string, std::vector<std::string>> level_messages;

public:
    void analyze_file(const std::string& filename) {
        std::ifstream file(filename);
        if (!file) {
            std::cerr << "Could not open file: " << filename << std::endl;
            return;
        }

        std::string line;
        while (std::getline(file, line)) {
            analyze_line(line);
        }
    }

    // Lines that do not match the expected layout are ignored
    void analyze_line(const std::string& line) {
        std::smatch matches;
        if (std::regex_match(line, matches, log_pattern)) {
            std::string timestamp = matches[1];
            std::string level = matches[2];
            std::string message = matches[3];

            level_counts[level]++;
            level_messages[level].push_back(timestamp + " - " + message);
        }
    }

    void print_summary() const {
        std::cout << "Log level summary:" << std::endl;
        for (const auto& pair : level_counts) {
            std::cout << pair.first << ": " << pair.second << " messages" << std::endl;
        }

        std::cout << "\nDetailed messages:" << std::endl;
        for (const auto& pair : level_messages) {
            std::cout << "\n" << pair.first << ":" << std::endl;
            for (const auto& message : pair.second) {
                std::cout << "  " << message << std::endl;
            }
        }
    }
};

int main() {
    LogAnalyzer analyzer;

    // Sample log lines
    analyzer.analyze_line("[2023-12-01 10:00:00] [INFO] Application started");
    analyzer.analyze_line("[2023-12-01 10:05:30] [ERROR] Database connection failed");
    analyzer.analyze_line("[2023-12-01 10:06:00] [INFO] Retrying database connection");
    analyzer.analyze_line("[2023-12-01 10:06:15] [INFO] Database connection established");
    analyzer.analyze_line("[2023-12-01 10:10:00] [WARNING] Disk space low");

    analyzer.print_summary();

    return 0;
}
```
Example 3: An Internationalized Application
```cpp
#include <iostream>
#include <locale>
#include <map>
#include <regex>
#include <string>

class I18nApplication {
private:
    // locale name -> (message key -> message template)
    std::map<std::string, std::map<std::string, std::string>> messages;
    std::string current_locale;
    // locale name -> pattern that parses a greeting in that language
    std::map<std::string, std::regex> localized_patterns;

public:
    I18nApplication() {
        std::locale default_locale;
        current_locale = default_locale.name();

        init_messages();
        init_patterns();
    }

    void init_messages() {
        // English
        messages["en_US"]["greeting"] = "Hello, {0}!";
        messages["en_US"]["farewell"] = "Goodbye, {0}!";
        messages["en_US"]["welcome"] = "Welcome to our application!";
        messages["en_US"]["error"] = "An error occurred: {0}";

        // French
        messages["fr_FR"]["greeting"] = "Bonjour, {0}!";
        messages["fr_FR"]["farewell"] = "Au revoir, {0}!";
        messages["fr_FR"]["welcome"] = "Bienvenue dans notre application!";
        messages["fr_FR"]["error"] = "Une erreur s'est produite: {0}";

        // Japanese
        messages["ja_JP"]["greeting"] = "こんにちは、{0}さん!";
        messages["ja_JP"]["farewell"] = "さようなら、{0}さん!";
        messages["ja_JP"]["welcome"] = "アプリケーションへようこそ!";
        messages["ja_JP"]["error"] = "エラーが発生しました: {0}";
    }

    void init_patterns() {
        localized_patterns["en_US"] = std::regex(R"(Hello,\s+(\w+))");
        localized_patterns["fr_FR"] = std::regex(R"(Bonjour,\s+(\w+))");
        // Non-ASCII text is matched byte by byte; \w only matches ASCII names here
        localized_patterns["ja_JP"] = std::regex(R"(こんにちは、(\w+)さん!)");
    }

    void set_locale(const std::string& loc) {
        current_locale = loc;
    }

    // Looks up a message and substitutes `param` for the first "{0}"
    std::string get_message(const std::string& key, const std::string& param = "") {
        auto loc_it = messages.find(current_locale);
        if (loc_it == messages.end()) {
            loc_it = messages.find("en_US");  // fall back to English
        }

        auto msg_it = loc_it->second.find(key);
        if (msg_it == loc_it->second.end()) {
            return "[Missing message]";
        }

        std::string message = msg_it->second;
        if (!param.empty()) {
            size_t pos = message.find("{0}");
            if (pos != std::string::npos) {
                message.replace(pos, 3, param);
            }
        }

        return message;
    }

    // Extracts the name from a greeting written in the current locale
    bool parse_greeting(const std::string& greeting, std::string& name) {
        auto pattern_it = localized_patterns.find(current_locale);
        if (pattern_it == localized_patterns.end()) {
            pattern_it = localized_patterns.find("en_US");  // fall back to English
        }

        std::smatch matches;
        if (std::regex_match(greeting, matches, pattern_it->second)) {
            name = matches[1];
            return true;
        }

        return false;
    }

    void run() {
        std::cout << get_message("welcome") << std::endl;

        test_locale("en_US", "John");
        test_locale("fr_FR", "Jean");
        test_locale("ja_JP", "Tanaka");
    }

    void test_locale(const std::string& loc, const std::string& name) {
        set_locale(loc);
        std::cout << "\nTesting locale: " << loc << std::endl;

        std::cout << get_message("greeting", name) << std::endl;
        std::cout << get_message("farewell", name) << std::endl;
        std::cout << get_message("error", "Connection failed") << std::endl;

        // Build a greeting in the locale's language and parse it back
        std::string test_greeting;
        if (loc == "en_US") {
            test_greeting = "Hello, " + name;
        } else if (loc == "fr_FR") {
            test_greeting = "Bonjour, " + name;
        } else if (loc == "ja_JP") {
            test_greeting = "こんにちは、" + name + "さん!";
        }

        std::string parsed_name;
        if (parse_greeting(test_greeting, parsed_name)) {
            std::cout << "Parsed name: " << parsed_name << std::endl;
        }
    }
};

int main() {
    I18nApplication app;
    app.run();
    return 0;
}
```
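The `get_message` function above substitutes a single "{0}" placeholder. Translations often need several parameters, and different languages may place them in a different order, which is what numbered placeholders are for. The sketch below generalizes the substitution; `format_message` is a hypothetical helper, and leaving unknown placeholders untouched is a design choice made for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replaces each "{N}" in `tmpl` with args[N]. Placeholders without a
// matching argument, and braces that do not enclose a number, are kept as-is.
std::string format_message(const std::string& tmpl, const std::vector<std::string>& args) {
    std::string out;
    for (std::size_t i = 0; i < tmpl.size();) {
        // Only look for a closing brace when the current character opens one
        const std::size_t close = tmpl[i] == '{' ? tmpl.find('}', i) : std::string::npos;
        if (close != std::string::npos && close > i + 1) {
            const std::string inner = tmpl.substr(i + 1, close - i - 1);
            const bool numeric = inner.find_first_not_of("0123456789") == std::string::npos;
            if (numeric && inner.size() <= 3) {  // at most three digits, so stoul cannot overflow
                const std::size_t index = std::stoul(inner);
                if (index < args.size()) {
                    out += args[index];
                    i = close + 1;
                    continue;
                }
            }
        }
        out.push_back(tmpl[i++]);
    }
    return out;
}
```

A translator can then reorder parameters freely: a template such as "{1} before {0}" uses the same argument list as "{0} after {1}".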
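Best Practices Summary
Localization Best Practices
Design for internationalization from the start:
- Avoid hard-coded strings
- Use a Unicode encoding
- Allow for changes in text length (translations into some languages are longer)
- Allow for right-to-left languages (such as Arabic and Hebrew)
Use the standard library facilities:
- Prefer std::locale for localization
- Use standard string types such as std::string and std::wstring
- For complex Unicode processing, consider the ICU library
Message internationalization:
- Use gettext or a similar system for message internationalization
- Keep messages short and clear
- Avoid format-dependent content in messages
Test under different locales:
- Test the application under different locales
- Make sure numbers, dates, and times are formatted correctly
- Make sure Unicode characters are displayed correctly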
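Regular Expression Best Practices
Write clear regular expressions:
- Use raw string literals (R"(...)") to avoid escaping backslashes
- Add comments next to complex patterns
- Use capture groups judiciously
Optimize performance:
- Precompile frequently used regular expressions
- Limit backtracking by keeping alternatives and nested quantifiers unambiguous (std::regex offers neither possessive quantifiers nor atomic groups)
- For simple patterns, consider plain string operations
Error handling:
- Catch regular expression compilation errors (std::regex_error)
- Validate regular expression patterns supplied by users
Security:
- Avoid building regular expressions directly from user input (to prevent ReDoS attacks)
- std::regex has no match timeout, so bound the cost of complex patterns in another way, for example by limiting input length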
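The error-handling and security points above can be combined in one small gatekeeper for user-supplied patterns. The following is a sketch under stated assumptions: `compile_user_pattern` and `kMaxPatternLength` are hypothetical names, the limit of 256 is arbitrary, and std::optional requires C++17. A length cap reduces exposure but does not by itself rule out a pathological pattern.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical limit for this sketch; tune it for the application.
constexpr std::size_t kMaxPatternLength = 256;

// Compiles a user-supplied pattern. Returns std::nullopt when the pattern
// is too long or is not a valid regular expression.
std::optional<std::regex> compile_user_pattern(const std::string& pattern) {
    if (pattern.size() > kMaxPatternLength) {
        return std::nullopt;
    }
    try {
        return std::regex(pattern);
    } catch (const std::regex_error&) {
        // The constructor reports syntax errors by throwing std::regex_error
        return std::nullopt;
    }
}
```

Callers can then branch on the result, for example reporting "invalid pattern" to the user instead of letting the exception propagate.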
Testing:
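Best Practices for Combining the Two
Handling internationalized input:
- Use Unicode-aware regular expressions
- Take the character properties of different languages into account
Localized pattern matching:
- Adjust regular expression patterns to the locale
- Take localized number and date formats into account
Balancing performance:
- Balance the cost of localization against the cost of regular expression processing
- Cache frequently used localized patterns
Code organization:
- Encapsulate localization and regular expression logic in dedicated classes
- Use dependency injection to manage locales and patterns
Following these practices makes it possible to build C++ applications that both support internationalization and process text efficiently. Localization and regular expressions are important tools in modern C++ programming, and mastering them is essential for developing high-quality software.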