第27章 本地化与正则表达式

本地化(国际化)基础

本地化的重要性

本地化(Localization,简称L10n)和国际化(Internationalization,简称I18n)是现代软件 development中的重要概念:

  • 国际化:设计和开发软件,使其能够轻松适应不同的语言和地区
  • 本地化:将国际化的软件适配到特定的语言和地区

在C++中,本地化主要通过<locale>头文件中的设施实现。

字符编码基础

常见字符编码

编码描述特点
ASCII美国信息交换标准代码7位编码,仅支持英文字符
ISO-8859-1Latin-18位编码,支持西欧语言
UTF-8Unicode转换格式-8可变长度编码,支持所有Unicode字符
UTF-16Unicode转换格式-162或4字节编码,支持所有Unicode字符
UTF-32Unicode转换格式-324字节编码,支持所有Unicode字符

C++中的字符类型

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
// 基本字符类型
char c1 = 'A'; // 通常为1字节,编码依赖于环境
wchar_t c2 = L'A'; // 宽字符,通常为2或4字节

// C++11引入的字符类型
char16_t c3 = u'A'; // 16位字符,用于UTF-16
char32_t c4 = U'A'; // 32位字符,用于UTF-32
char8_t c5 = u8'A'; // 8位字符,用于UTF-8 (C++20)

// 字符串字面量
const char* s1 = "Hello"; // 窄字符串
const wchar_t* s2 = L"Hello"; // 宽字符串
const char16_t* s3 = u"Hello"; // UTF-16字符串
const char32_t* s4 = U"Hello"; // UTF-32字符串
const char8_t* s5 = u8"Hello"; // UTF-8字符串 (C++20)

// 标准字符串类型
std::string str1 = "Hello"; // 窄字符串
std::wstring str2 = L"Hello"; // 宽字符串
std::u16string str3 = u"Hello"; // UTF-16字符串
std::u32string str4 = U"Hello"; // UTF-32字符串
std::u8string str5 = u8"Hello"; // UTF-8字符串 (C++20)

std::locale的使用

locale的基本概念

std::locale是C++中处理本地化的核心类,它封装了特定地区的文化和语言设置。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <iostream>
#include <locale>

int main() {
// 使用系统默认locale
std::locale default_locale;
std::cout << "Default locale: " << default_locale.name() << std::endl;

// 使用特定locale
std::locale fr_locale("fr_FR.UTF-8");
std::cout << "French locale: " << fr_locale.name() << std::endl;

// 使用C locale
std::locale c_locale("C");
std::cout << "C locale: " << c_locale.name() << std::endl;

return 0;
}

facet 系统

std::locale使用facet系统来提供不同类型的本地化服务:

  • std::ctype:字符分类和转换
  • std::collate:字符串比较
  • std::numpunct:数字标点
  • std::num_get/num_put:数字输入/输出
  • std::money_get/money_put:货币输入/输出
  • std::time_get/time_put:时间输入/输出
  • std::messages:消息目录访问
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#include <iostream>
#include <locale>
#include <string>

int main() {
// 创建一个法语locale
std::locale fr_locale("fr_FR.UTF-8");

// 获取数字标点facet
const std::numpunct<char>& numpunct = std::use_facet<std::numpunct<char>>(fr_locale);
std::cout << "Decimal point in French: '" << numpunct.decimal_point() << "'" << std::endl;
std::cout << "Thousands separator in French: '" << numpunct.thousands_sep() << "'" << std::endl;

// 获取货币标点facet
const std::moneypunct<char>& moneypunct = std::use_facet<std::moneypunct<char>>(fr_locale);
std::cout << "Currency symbol in French: '" << moneypunct.curr_symbol() << "'" << std::endl;
std::cout << "Decimal point in currency: '" << moneypunct.decimal_point() << "'" << std::endl;

return 0;
}

本地化数字和货币

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#include <iostream>
#include <locale>
#include <iomanip>

int main() {
// 保存当前locale
std::locale old_locale = std::cout.getloc();

// 使用美国locale
std::cout.imbue(std::locale("en_US.UTF-8"));
std::cout << "US format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;

// 使用德国locale
std::cout.imbue(std::locale("de_DE.UTF-8"));
std::cout << "German format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;

// 使用印度locale
std::cout.imbue(std::locale("hi_IN.UTF-8"));
std::cout << "Indian format: " << std::fixed << std::setprecision(2) << 1234567.89 << std::endl;

// 恢复旧locale
std::cout.imbue(old_locale);

return 0;
}

本地化日期和时间

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#include <iostream>
#include <locale>
#include <iomanip>
#include <ctime>

int main() {
// 获取当前时间
std::time_t now = std::time(nullptr);
std::tm* tm_now = std::localtime(&now);

// 保存当前locale
std::locale old_locale = std::cout.getloc();

// 使用美国locale
std::cout.imbue(std::locale("en_US.UTF-8"));
std::cout << "US format: " << std::put_time(tm_now, "%c") << std::endl;

// 使用法国locale
std::cout.imbue(std::locale("fr_FR.UTF-8"));
std::cout << "French format: " << std::put_time(tm_now, "%c") << std::endl;

// 使用日本locale
std::cout.imbue(std::locale("ja_JP.UTF-8"));
std::cout << "Japanese format: " << std::put_time(tm_now, "%c") << std::endl;

// 恢复旧locale
std::cout.imbue(old_locale);

return 0;
}

自定义locale

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#include <iostream>
#include <locale>

// 自定义numpunct facet
class CustomNumpunct : public std::numpunct<char> {
protected:
char do_decimal_point() const override {
return ',';
}

char do_thousands_sep() const override {
return '.';
}

std::string do_grouping() const override {
return "\3";
}
};

int main() {
// 创建自定义locale
std::locale custom_locale(std::locale(), new CustomNumpunct);

// 使用自定义locale
std::cout.imbue(custom_locale);
std::cout << "Custom format: " << 1234567.89 << std::endl;

return 0;
}

Unicode处理

Unicode基础

Unicode是一个国际标准,为世界上所有的字符、标点符号和符号分配了唯一的数字代码点。

  • 代码点:Unicode中每个字符的唯一编号,范围从U+0000到U+10FFFF
  • 平面:Unicode代码点分为17个平面,每个平面包含65536个代码点
  • BMP:基本多文种平面(U+0000到U+FFFF),包含最常用的字符

C++中的Unicode支持

UTF-8处理

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
#include <iostream>
#include <string>
#include <vector>

// 计算UTF-8字符串的字符数
size_t utf8_length(const char* str) {
size_t length = 0;
while (*str) {
if ((*str++ & 0xC0) != 0x80) {
++length;
}
}
return length;
}

// 计算UTF-8字符串的字符数(使用C++11的string_view)
size_t utf8_length(const std::string& str) {
size_t length = 0;
for (char c : str) {
if ((c & 0xC0) != 0x80) {
++length;
}
}
return length;
}

// UTF-8到UTF-32的转换
std::u32string utf8_to_utf32(const std::string& utf8_str) {
std::u32string utf32_str;

for (size_t i = 0; i < utf8_str.size();) {
unsigned char c = static_cast<unsigned char>(utf8_str[i]);
char32_t code_point;

if (c < 0x80) {
// 单字节
code_point = c;
i += 1;
} else if (c < 0xE0) {
// 双字节
code_point = ((c & 0x1F) << 6) | (static_cast<unsigned char>(utf8_str[i+1]) & 0x3F);
i += 2;
} else if (c < 0xF0) {
// 三字节
code_point = ((c & 0x0F) << 12) |
((static_cast<unsigned char>(utf8_str[i+1]) & 0x3F) << 6) |
(static_cast<unsigned char>(utf8_str[i+2]) & 0x3F);
i += 3;
} else {
// 四字节
code_point = ((c & 0x07) << 18) |
((static_cast<unsigned char>(utf8_str[i+1]) & 0x3F) << 12) |
((static_cast<unsigned char>(utf8_str[i+2]) & 0x3F) << 6) |
(static_cast<unsigned char>(utf8_str[i+3]) & 0x3F);
i += 4;
}

utf32_str.push_back(code_point);
}

return utf32_str;
}

// UTF-32到UTF-8的转换
std::string utf32_to_utf8(const std::u32string& utf32_str) {
std::string utf8_str;

for (char32_t code_point : utf32_str) {
if (code_point < 0x80) {
// 单字节
utf8_str.push_back(static_cast<char>(code_point));
} else if (code_point < 0x800) {
// 双字节
utf8_str.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
} else if (code_point < 0x10000) {
// 三字节
utf8_str.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
} else {
// 四字节
utf8_str.push_back(static_cast<char>(0xF0 | (code_point >> 18)));
utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 12) & 0x3F)));
utf8_str.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
utf8_str.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
}
}

return utf8_str;
}

int main() {
std::string utf8_str = u8"Hello, 世界! こんにちは! مرحبا!";
std::cout << "UTF-8 string: " << utf8_str << std::endl;
std::cout << "UTF-8 length (bytes): " << utf8_str.size() << std::endl;
std::cout << "UTF-8 length (characters): " << utf8_length(utf8_str) << std::endl;

// 转换为UTF-32
std::u32string utf32_str = utf8_to_utf32(utf8_str);
std::cout << "UTF-32 length (characters): " << utf32_str.size() << std::endl;

// 转换回UTF-8
std::string converted_back = utf32_to_utf8(utf32_str);
std::cout << "Converted back: " << converted_back << std::endl;

return 0;
}

使用ICU库

对于复杂的Unicode处理,推荐使用ICU(International Components for Unicode)库,它提供了更全面的Unicode支持。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
#include <iostream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/translit.h>

int main() {
// 创建Unicode字符串
UnicodeString ustr = UnicodeString::fromUTF8("Hello, 世界! こんにちは!");

// 转换为小写
UnicodeString lower_str = ustr;
lower_str.toLower();

// 转换为大写
UnicodeString upper_str = ustr;
upper_str.toUpper();

// 音译(罗马化)
UnicodeString romanized;
Transliterator* transliterator = Transliterator::createInstance("Any-Latin");
if (transliterator) {
transliterator->transliterate(ustr, 0, ustr.length(), romanized);
delete transliterator;
}

// 输出结果
std::string output;

std::cout << "Original: ";
ustr.toUTF8String(output);
std::cout << output << std::endl;

std::cout << "Lowercase: ";
lower_str.toUTF8String(output);
std::cout << output << std::endl;

std::cout << "Uppercase: ";
upper_str.toUTF8String(output);
std::cout << output << std::endl;

std::cout << "Romanized: ";
romanized.toUTF8String(output);
std::cout << output << std::endl;

return 0;
}

消息国际化

gettext系统

gettext是一个广泛使用的消息国际化系统,它允许开发者将消息与代码分离,便于翻译。

基本使用

  1. 标记需要国际化的消息
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#include <iostream>
#include <libintl.h>
#include <locale.h>

#define _(msg) gettext(msg)

int main() {
// 设置locale
setlocale(LC_ALL, "");

// 设置消息目录
bindtextdomain("myapp", "./locale");
textdomain("myapp");

// 使用国际化消息
std::cout << _("Hello, World!") << std::endl;
std::cout << _("How are you?") << std::endl;

return 0;
}
  1. 提取消息

使用xgettext工具提取消息到PO文件:

1
xgettext -d myapp -o myapp.pot main.cpp
  1. 翻译消息

创建语言特定的PO文件(如fr_FR.po)并翻译消息。

  1. 编译消息

使用msgfmt工具将PO文件编译为MO文件:

1
2
mkdir -p locale/fr_FR/LC_MESSAGES
msgfmt -o locale/fr_FR/LC_MESSAGES/myapp.mo fr_FR.po

自定义消息系统

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
#include <iostream>
#include <string>
#include <map>
#include <locale>

class MessageCatalog {
private:
std::map<std::string, std::map<std::string, std::string>> catalogs;
std::string current_locale;

public:
MessageCatalog() {
// 设置默认locale
std::locale default_locale;
current_locale = default_locale.name();
}

void set_locale(const std::string& loc) {
current_locale = loc;
}

void add_message(const std::string& msgid, const std::string& msgstr, const std::string& loc) {
catalogs[loc][msgid] = msgstr;
}

std::string get_message(const std::string& msgid) {
// 尝试在当前locale中查找
auto loc_it = catalogs.find(current_locale);
if (loc_it != catalogs.end()) {
auto msg_it = loc_it->second.find(msgid);
if (msg_it != loc_it->second.end()) {
return msg_it->second;
}
}

// 尝试在默认locale中查找
auto default_it = catalogs.find("C");
if (default_it != catalogs.end()) {
auto msg_it = default_it->second.find(msgid);
if (msg_it != default_it->second.end()) {
return msg_it->second;
}
}

// 返回原始消息ID
return msgid;
}
};

// 全局消息目录
MessageCatalog msg_cat;

// 便捷宏
#define _(msgid) msg_cat.get_message(msgid)

int main() {
// 添加消息
msg_cat.add_message("Hello, World!", "Hello, World!", "C");
msg_cat.add_message("How are you?", "How are you?", "C");

msg_cat.add_message("Hello, World!", "Bonjour, Monde!", "fr_FR");
msg_cat.add_message("How are you?", "Comment allez-vous?", "fr_FR");

msg_cat.add_message("Hello, World!", "こんにちは、世界!", "ja_JP");
msg_cat.add_message("How are you?", "お元気ですか?", "ja_JP");

// 使用默认locale
std::cout << "Default locale:" << std::endl;
std::cout << _("Hello, World!") << std::endl;
std::cout << _("How are you?") << std::endl;

// 使用法语locale
std::cout << "\nFrench locale:" << std::endl;
msg_cat.set_locale("fr_FR");
std::cout << _("Hello, World!") << std::endl;
std::cout << _("How are you?") << std::endl;

// 使用日语locale
std::cout << "\nJapanese locale:" << std::endl;
msg_cat.set_locale("ja_JP");
std::cout << _("Hello, World!") << std::endl;
std::cout << _("How are you?") << std::endl;

return 0;
}

正则表达式基础

正则表达式语法

C++11引入了<regex>头文件,提供了正则表达式支持。C++的正则表达式语法基于ECMAScript标准。

基本语法

语法描述示例
^匹配字符串开头^Hello 匹配以Hello开头的字符串
$匹配字符串结尾World$ 匹配以World结尾的字符串
.匹配任意单个字符(除换行符)H.llo 匹配Hello, Hallo等
*匹配前面的字符0次或多次ab*c 匹配ac, abc, abbc等
+匹配前面的字符1次或多次ab+c 匹配abc, abbc等,不匹配ac
?匹配前面的字符0次或1次ab?c 匹配ac, abc,不匹配abbc
{n}匹配前面的字符恰好n次a{3} 匹配aaa
{n,}匹配前面的字符至少n次a{2,} 匹配aa, aaa等
{n,m}匹配前面的字符n到m次a{2,4} 匹配aa, aaa, aaaa
[]匹配括号内的任意一个字符[aeiou] 匹配任意元音字母
[^]匹配括号内字符以外的任意字符[^0-9] 匹配任意非数字字符
``匹配左右任意一个表达式
()捕获组,匹配括号内的表达式并捕获(ab)+ 匹配ab, abab等,并捕获ab
\d匹配任意数字,等同于[0-9]\d+ 匹配一个或多个数字
\D匹配任意非数字,等同于[^0-9]\D+ 匹配一个或多个非数字
\w匹配任意字母、数字或下划线,等同于[a-zA-Z0-9_]\w+ 匹配一个或多个单词字符
\W匹配任意非单词字符,等同于[^a-zA-Z0-9_]\W+ 匹配一个或多个非单词字符
\s匹配任意空白字符(空格、制表符、换行符等)\s+ 匹配一个或多个空白字符
\S匹配任意非空白字符\S+ 匹配一个或多个非空白字符
\b匹配单词边界\bword\b 匹配完整的word单词
\B匹配非单词边界\Bword\B 匹配单词内部的word

std::regex的使用

基本操作

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
#include <iostream>
#include <regex>
#include <string>

int main() {
// 1. 匹配操作
std::string text = "Hello, World!";
std::regex pattern(R"(Hello, \w+!)" );

if (std::regex_match(text, pattern)) {
std::cout << "Text matches pattern" << std::endl;
} else {
std::cout << "Text does not match pattern" << std::endl;
}

// 2. 搜索操作
std::string text2 = "The price is $45.99, discounted to $39.99";
std::regex price_pattern(R"(\$\d+\.\d{2})" );

std::smatch matches;
if (std::regex_search(text2, matches, price_pattern)) {
std::cout << "Found price: " << matches[0] << std::endl;
}

// 3. 替换操作
std::string text3 = "The cat sat on the mat";
std::regex cat_pattern(R"(cat)" );
std::string replaced = std::regex_replace(text3, cat_pattern, "dog");
std::cout << "Original: " << text3 << std::endl;
std::cout << "Replaced: " << replaced << std::endl;

// 4. 迭代搜索
std::string text4 = "Contact info: john@example.com, jane@test.org";
std::regex email_pattern(R"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})" );

std::sregex_iterator it(text4.begin(), text4.end(), email_pattern);
std::sregex_iterator end;

std::cout << "Found emails:" << std::endl;
while (it != end) {
std::cout << it->str() << std::endl;
++it;
}

return 0;
}

捕获组

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
#include <iostream>
#include <regex>
#include <string>

int main() {
std::string text = "John Doe (john@example.com) - 30 years old";

// 带捕获组的正则表达式
std::regex pattern(R"(([A-Za-z]+) ([A-Za-z]+) \(([^)]+)\) - (\d+) years old)" );

std::smatch matches;
if (std::regex_match(text, matches, pattern)) {
std::cout << "Full match: " << matches[0] << std::endl;
std::cout << "First name: " << matches[1] << std::endl;
std::cout << "Last name: " << matches[2] << std::endl;
std::cout << "Email: " << matches[3] << std::endl;
std::cout << "Age: " << matches[4] << std::endl;
}

return 0;
}

正则表达式标志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#include <iostream>
#include <regex>
#include <string>

int main() {
std::string text = "Hello, WORLD! hello, world!";

// 不区分大小写
std::regex case_insensitive_pattern(R"(hello, world!)" , std::regex_constants::icase);

std::smatch matches;
if (std::regex_search(text, matches, case_insensitive_pattern)) {
std::cout << "Found: " << matches[0] << std::endl;
}

// 多行模式
std::string multi_line_text = "Line 1: test\nLine 2: TEST\nLine 3: Test";
std::regex multi_line_pattern(R"(^Line \d+: test$)" , std::regex_constants::icase | std::regex_constants::multiline);

std::sregex_iterator it(multi_line_text.begin(), multi_line_text.end(), multi_line_pattern);
std::sregex_iterator end;

std::cout << "Found lines:" << std::endl;
while (it != end) {
std::cout << it->str() << std::endl;
++it;
}

return 0;
}

高级正则表达式技巧

前瞻和后顾断言

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
#include <iostream>
#include <regex>
#include <string>

int main() {
std::string text = "Price: $45.99, Discount: $10.00, Total: $35.99";

// 前瞻断言:匹配$后面的数字,但不包含$
std::regex lookahead_pattern(R"(\$(?=\d+\.\d{2}))" );
std::string result1 = std::regex_replace(text, lookahead_pattern, "USD ");
std::cout << "With lookahead: " << result1 << std::endl;

std::string text2 = "apple pie, banana bread, cherry tart";

// 后顾断言:匹配pie前面的单词,但不包含pie
std::regex lookbehind_pattern(R"((?<=\w+) pie)" );
std::smatch matches;
if (std::regex_search(text2, matches, lookbehind_pattern)) {
std::cout << "Found with lookbehind: " << matches[0] << std::endl;
}

// 负前瞻:匹配不是@gmail.com结尾的邮箱
std::string emails = "john@example.com, jane@gmail.com, bob@test.org";
std::regex negative_lookahead(R"([a-zA-Z0-9._%+-]+@(?!gmail\.com)[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})" );

std::sregex_iterator it(emails.begin(), emails.end(), negative_lookahead);
std::sregex_iterator end;

std::cout << "Non-Gmail emails:" << std::endl;
while (it != end) {
std::cout << it->str() << std::endl;
++it;
}

return 0;
}

量词的贪婪与非贪婪

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
#include <iostream>
#include <regex>
#include <string>

int main() {
std::string text = "<div>First div</div><div>Second div</div>";

// 贪婪匹配:尽可能匹配更多字符
std::regex greedy_pattern(R"(<div>.*</div>)" );
std::smatch greedy_match;
if (std::regex_search(text, greedy_match, greedy_pattern)) {
std::cout << "Greedy match: " << greedy_match[0] << std::endl;
}

// 非贪婪匹配:尽可能匹配更少字符
std::regex non_greedy_pattern(R"(<div>.*?</div>)" );
std::sregex_iterator it(text.begin(), text.end(), non_greedy_pattern);
std::sregex_iterator end;

std::cout << "Non-greedy matches:" << std::endl;
while (it != end) {
std::cout << it->str() << std::endl;
++it;
}

return 0;
}

条件正则表达式

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#include <iostream>
#include <regex>
#include <string>

int main() {
// 匹配电话号码,支持带区号和不带区号的格式
std::string phone_numbers = "555-1234, (123) 555-6789";
std::regex phone_pattern(R"((\()?(\d{3})(?(1)\) |-))?\d{3}-\d{4})" );

std::sregex_iterator it(phone_numbers.begin(), phone_numbers.end(), phone_pattern);
std::sregex_iterator end;

std::cout << "Found phone numbers:" << std::endl;
while (it != end) {
std::cout << it->str() << std::endl;
++it;
}

return 0;
}

正则表达式与本地化的结合

处理Unicode字符

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
#include <iostream>
#include <regex>
#include <string>
#include <locale>

int main() {
// 设置UTF-8 locale
std::locale::global(std::locale("en_US.UTF-8"));

// 匹配包含非ASCII字符的文本
std::u32string text = U"Hello, 世界! こんにちは!";

// 转换为UTF-8进行正则匹配
std::string utf8_text;
for (char32_t c : text) {
if (c < 0x80) {
utf8_text.push_back(static_cast<char>(c));
} else if (c < 0x800) {
utf8_text.push_back(static_cast<char>(0xC0 | (c >> 6)));
utf8_text.push_back(static_cast<char>(0x80 | (c & 0x3F)));
} else if (c < 0x10000) {
utf8_text.push_back(static_cast<char>(0xE0 | (c >> 12)));
utf8_text.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
utf8_text.push_back(static_cast<char>(0x80 | (c & 0x3F)));
} else {
utf8_text.push_back(static_cast<char>(0xF0 | (c >> 18)));
utf8_text.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
utf8_text.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
utf8_text.push_back(static_cast<char>(0x80 | (c & 0x3F)));
}
}

// 匹配包含中文字符的部分
std::regex chinese_pattern(R"([\p{Han}]+)" , std::regex_constants::u);
std::sregex_iterator it(utf8_text.begin(), utf8_text.end(), chinese_pattern);
std::sregex_iterator end;

std::cout << "Found Chinese characters:" << std::endl;
while (it != end) {
std::cout << it->str() << std::endl;
++it;
}

return 0;
}

本地化的正则表达式

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#include <iostream>
#include <regex>
#include <string>
#include <locale>

int main() {
// 根据不同locale使用不同的数字格式
std::string german_number = "1.234.567,89";
std::string us_number = "1,234,567.89";

// 匹配德国数字格式
std::regex german_pattern(R"(\d{1,3}(\.\d{3})*(,\d{2})?)" );

// 匹配美国数字格式
std::regex us_pattern(R"(\d{1,3}(,\d{3})*(\.\d{2})?)" );

std::cout << "German number format:" << std::endl;
if (std::regex_match(german_number, german_pattern)) {
std::cout << "Valid: " << german_number << std::endl;
}

std::cout << "US number format:" << std::endl;
if (std::regex_match(us_number, us_pattern)) {
std::cout << "Valid: " << us_number << std::endl;
}

return 0;
}

性能优化

正则表达式性能

正则表达式的性能取决于多个因素,包括模式复杂度、输入大小和实现细节。以下是一些优化技巧:

1. 编译正则表达式

对于重复使用的正则表达式,应预编译以避免重复编译开销。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#include <iostream>
#include <regex>
#include <string>
#include <chrono>

int main() {
std::string text = "The price is $45.99, discounted to $39.99, final price $29.99";

// 测量重复编译的时间
auto start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10000; ++i) {
std::regex pattern(R"(\$\d+\.\d{2})" );
std::smatch matches;
std::regex_search(text, matches, pattern);
}
auto end1 = std::chrono::high_resolution_clock::now();
auto duration1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
std::cout << "Time with repeated compilation: " << duration1 << " ms" << std::endl;

// 测量预编译的时间
std::regex pattern(R"(\$\d+\.\d{2})" );
auto start2 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10000; ++i) {
std::smatch matches;
std::regex_search(text, matches, pattern);
}
auto end2 = std::chrono::high_resolution_clock::now();
auto duration2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
std::cout << "Time with precompiled regex: " << duration2 << " ms" << std::endl;

return 0;
}

2. 避免回溯

回溯是正则表达式性能问题的常见原因,特别是在处理大输入时。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <iostream>
#include <regex>
#include <string>
#include <chrono>

int main() {
// 创建一个可能导致回溯的文本
std::string text = std::string(1000, 'a') + "b";

// 有问题的模式:可能导致大量回溯
std::regex bad_pattern(R"(a*a*b)" );

// 优化的模式:避免回溯
std::regex good_pattern(R"(a*b)" );

// 测量有问题的模式
auto start1 = std::chrono::high_resolution_clock::now();
std::smatch matches1;
std::regex_search(text, matches1, bad_pattern);
auto end1 = std::chrono::high_resolution_clock::now();
auto duration1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
std::cout << "Time with bad pattern: " << duration1 << " ms" << std::endl;

// 测量优化的模式
auto start2 = std::chrono::high_resolution_clock::now();
std::smatch matches2;
std::regex_search(text, matches2, good_pattern);
auto end2 = std::chrono::high_resolution_clock::now();
auto duration2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
std::cout << "Time with good pattern: " << duration2 << " ms" << std::endl;

return 0;
}

3. 使用适当的正则表达式引擎

C++11提供了几种正则表达式引擎实现,可根据需要选择:

  • std::regex_constants::ECMAScript:默认,基于ECMAScript语法
  • std::regex_constants::basic:基本POSIX语法
  • std::regex_constants::extended:扩展POSIX语法
  • std::regex_constants::awk:awk风格语法
  • std::regex_constants::grep:grep风格语法
  • std::regex_constants::egrep:egrep风格语法
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#include <iostream>
#include <regex>
#include <string>

int main() {
std::string text = "Hello, World!";

// 使用不同的引擎
std::regex ecma_pattern(R"(Hello, \w+!)" , std::regex_constants::ECMAScript);
std::regex basic_pattern("Hello, [a-zA-Z0-9_]+!" , std::regex_constants::basic);

std::cout << "ECMAScript engine:" << std::endl;
if (std::regex_match(text, ecma_pattern)) {
std::cout << "Match found" << std::endl;
}

std::cout << "Basic engine:" << std::endl;
if (std::regex_match(text, basic_pattern)) {
std::cout << "Match found" << std::endl;
}

return 0;
}

本地化性能优化

1. 缓存locale对象

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
#include <iostream>
#include <locale>
#include <map>
#include <string>

class LocaleCache {
private:
std::map<std::string, std::locale> cache;

public:
std::locale get_locale(const std::string& name) {
auto it = cache.find(name);
if (it != cache.end()) {
return it->second;
}

// 创建并缓存新locale
std::locale loc(name);
cache[name] = loc;
return loc;
}
};

int main() {
LocaleCache cache;

// 获取locale(首次创建)
std::locale fr_locale = cache.get_locale("fr_FR.UTF-8");
std::cout << "French locale: " << fr_locale.name() << std::endl;

// 获取locale(从缓存中获取)
std::locale fr_locale2 = cache.get_locale("fr_FR.UTF-8");
std::cout << "French locale (cached): " << fr_locale2.name() << std::endl;

return 0;
}

2. 避免频繁的locale切换

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
#include <iostream>
#include <locale>
#include <iomanip>
#include <chrono>

int main() {
double value = 1234567.89;

// 保存默认locale
std::locale default_locale = std::cout.getloc();

// 创建需要的locale
std::locale fr_locale("fr_FR.UTF-8");
std::locale de_locale("de_DE.UTF-8");
std::locale us_locale("en_US.UTF-8");

// 测量频繁切换locale的时间
auto start1 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 1000; ++i) {
std::cout.imbue(fr_locale);
std::cout << value << " ";
std::cout.imbue(de_locale);
std::cout << value << " ";
std::cout.imbue(us_locale);
std::cout << value << " ";
}
auto end1 = std::chrono::high_resolution_clock::now();
auto duration1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
std::cout << "\nTime with frequent locale switches: " << duration1 << " ms" << std::endl;

// 测量批量处理的时间
auto start2 = std::chrono::high_resolution_clock::now();
std::cout.imbue(fr_locale);
for (int i = 0; i < 1000; ++i) {
std::cout << value << " ";
}

std::cout.imbue(de_locale);
for (int i = 0; i < 1000; ++i) {
std::cout << value << " ";
}

std::cout.imbue(us_locale);
for (int i = 0; i < 1000; ++i) {
std::cout << value << " ";
}
auto end2 = std::chrono::high_resolution_clock::now();
auto duration2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
std::cout << "\nTime with batch processing: " << duration2 << " ms" << std::endl;

// 恢复默认locale
std::cout.imbue(default_locale);

return 0;
}

实际应用案例

案例1:表单验证

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
#include <iostream>
#include <regex>
#include <string>
#include <vector>

class FormValidator {
private:
std::regex email_pattern{R"([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})"};
std::regex phone_pattern{R"(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})"};
std::regex zip_pattern{R"(\d{5}(-\d{4})?)"};
std::regex date_pattern{R"(\d{4}-\d{2}-\d{2})"};

public:
bool validate_email(const std::string& email) {
return std::regex_match(email, email_pattern);
}

bool validate_phone(const std::string& phone) {
return std::regex_match(phone, phone_pattern);
}

bool validate_zip(const std::string& zip) {
return std::regex_match(zip, zip_pattern);
}

bool validate_date(const std::string& date) {
return std::regex_match(date, date_pattern);
}

std::vector<std::string> validate_form(const std::string& email,
const std::string& phone,
const std::string& zip,
const std::string& date) {
std::vector<std::string> errors;

if (!validate_email(email)) {
errors.push_back("Invalid email format");
}

if (!validate_phone(phone)) {
errors.push_back("Invalid phone format");
}

if (!validate_zip(zip)) {
errors.push_back("Invalid zip code format");
}

if (!validate_date(date)) {
errors.push_back("Invalid date format (YYYY-MM-DD)");
}

return errors;
}
};

int main() {
FormValidator validator;

// 测试有效输入
std::vector<std::string> errors1 = validator.validate_form(
"john@example.com",
"(123) 456-7890",
"12345-6789",
"2023-12-25"
);

std::cout << "Valid input errors: " << errors1.size() << std::endl;
for (const auto& error : errors1) {
std::cout << "- " << error << std::endl;
}

// 测试无效输入
std::vector<std::string> errors2 = validator.validate_form(
"invalid-email",
"1234567",
"1234",
"2023/12/25"
);

std::cout << "\nInvalid input errors: " << errors2.size() << std::endl;
for (const auto& error : errors2) {
std::cout << "- " << error << std::endl;
}

return 0;
}

案例2:日志分析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
#include <iostream>
#include <regex>
#include <string>
#include <fstream>
#include <map>

class LogAnalyzer {
private:
std::regex log_pattern{R"(\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*))"};
std::map<std::string, int> level_counts;
std::map<std::string, std::vector<std::string>> level_messages;

public:
void analyze_file(const std::string& filename) {
std::ifstream file(filename);
if (!file) {
std::cerr << "Could not open file: " << filename << std::endl;
return;
}

std::string line;
while (std::getline(file, line)) {
analyze_line(line);
}

file.close();
}

void analyze_line(const std::string& line) {
std::smatch matches;
if (std::regex_match(line, matches, log_pattern)) {
std::string timestamp = matches[1];
std::string level = matches[2];
std::string message = matches[3];

// 统计级别
level_counts[level]++;

// 存储消息
level_messages[level].push_back(timestamp + " - " + message);
}
}

void print_summary() {
std::cout << "Log level summary:" << std::endl;
for (const auto& pair : level_counts) {
std::cout << pair.first << ": " << pair.second << " messages" << std::endl;
}

std::cout << "\nDetailed messages:" << std::endl;
for (const auto& pair : level_messages) {
std::cout << "\n" << pair.first << ":" << std::endl;
for (const auto& message : pair.second) {
std::cout << " " << message << std::endl;
}
}
}
};

int main() {
LogAnalyzer analyzer;

// 假设有一个日志文件
// analyzer.analyze_file("app.log");

// 手动添加一些日志行进行测试
analyzer.analyze_line("[2023-12-01 10:00:00] [INFO] Application started");
analyzer.analyze_line("[2023-12-01 10:05:30] [ERROR] Database connection failed");
analyzer.analyze_line("[2023-12-01 10:06:00] [INFO] Retrying database connection");
analyzer.analyze_line("[2023-12-01 10:06:15] [INFO] Database connection established");
analyzer.analyze_line("[2023-12-01 10:10:00] [WARNING] Disk space low");

analyzer.print_summary();

return 0;
}

案例3:国际化应用

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
#include <iostream>
#include <string>
#include <map>
#include <locale>
#include <regex>

class I18nApplication {
private:
std::map<std::string, std::map<std::string, std::string>> messages;
std::string current_locale;
std::map<std::string, std::regex> localized_patterns;

public:
I18nApplication() {
// 设置默认locale
std::locale default_locale;
current_locale = default_locale.name();

// 初始化消息
init_messages();

// 初始化本地化模式
init_patterns();
}

void init_messages() {
// 英语消息
messages["en_US"]["greeting"] = "Hello, {0}!";
messages["en_US"]["farewell"] = "Goodbye, {0}!";
messages["en_US"]["welcome"] = "Welcome to our application!";
messages["en_US"]["error"] = "An error occurred: {0}";

// 法语消息
messages["fr_FR"]["greeting"] = "Bonjour, {0}!";
messages["fr_FR"]["farewell"] = "Au revoir, {0}!";
messages["fr_FR"]["welcome"] = "Bienvenue dans notre application!";
messages["fr_FR"]["error"] = "Une erreur s'est produite: {0}";

// 日语消息
messages["ja_JP"]["greeting"] = "こんにちは、{0}さん!";
messages["ja_JP"]["farewell"] = "さようなら、{0}さん!";
messages["ja_JP"]["welcome"] = "アプリケーションへようこそ!";
messages["ja_JP"]["error"] = "エラーが発生しました: {0}";
}

void init_patterns() {
// 英语模式
localized_patterns["en_US"] = std::regex(R"(Hello,\s+(\w+))" );

// 法语模式
localized_patterns["fr_FR"] = std::regex(R"(Bonjour,\s+(\w+))" );

// 日语模式
localized_patterns["ja_JP"] = std::regex(R"(こんにちは、(\w+)さん!)" );
}

void set_locale(const std::string& loc) {
current_locale = loc;
}

std::string get_message(const std::string& key, const std::string& param = "") {
auto loc_it = messages.find(current_locale);
if (loc_it == messages.end()) {
// 回退到英语
loc_it = messages.find("en_US");
}

auto msg_it = loc_it->second.find(key);
if (msg_it == loc_it->second.end()) {
return "[Missing message]";
}

std::string message = msg_it->second;
if (!param.empty()) {
// 简单的参数替换
size_t pos = message.find("{0}");
if (pos != std::string::npos) {
message.replace(pos, 3, param);
}
}

return message;
}

bool parse_greeting(const std::string& greeting, std::string& name) {
auto pattern_it = localized_patterns.find(current_locale);
if (pattern_it == localized_patterns.end()) {
// 回退到英语
pattern_it = localized_patterns.find("en_US");
}

std::smatch matches;
if (std::regex_match(greeting, matches, pattern_it->second)) {
name = matches[1];
return true;
}

return false;
}

void run() {
std::cout << get_message("welcome") << std::endl;

// 测试不同locale
test_locale("en_US", "John");
test_locale("fr_FR", "Jean");
test_locale("ja_JP", "Tanaka");
}

void test_locale(const std::string& loc, const std::string& name) {
set_locale(loc);
std::cout << "\nTesting locale: " << loc << std::endl;
std::cout << get_message("greeting", name) << std::endl;
std::cout << get_message("farewell", name) << std::endl;
std::cout << get_message("error", "Connection failed") << std::endl;

// 测试解析问候语
std::string test_greeting;
if (loc == "en_US") {
test_greeting = "Hello, " + name;
} else if (loc == "fr_FR") {
test_greeting = "Bonjour, " + name;
} else if (loc == "ja_JP") {
test_greeting = "こんにちは、" + name + "さん!";
}

std::string parsed_name;
if (parse_greeting(test_greeting, parsed_name)) {
std::cout << "Parsed name: " << parsed_name << std::endl;
}
}
};

int main() {
I18nApplication app;
app.run();
return 0;
}

最佳实践总结

本地化最佳实践

  1. 设计时考虑国际化

    • 避免硬编码字符串
    • 使用Unicode编码
    • 考虑文本长度变化(某些语言翻译后文本会变长)
    • 考虑从右到左的语言(如阿拉伯语、希伯来语)
  2. 使用标准库设施

    • 优先使用std::locale进行本地化
    • 使用std::stringstd::wstring等标准字符串类型
    • 对于复杂的Unicode处理,考虑使用ICU库
  3. 消息国际化

    • 使用gettext或类似系统进行消息国际化
    • 保持消息简洁明了
    • 避免在消息中包含格式依赖的内容
  4. 测试不同locale

    • 在不同locale下测试应用
    • 确保数字、日期、时间格式正确
    • 确保Unicode字符正确显示

正则表达式最佳实践

  1. 编写清晰的正则表达式

    • 使用原始字符串字面量(R”()”)避免转义
    • 对于复杂模式,添加注释
    • 合理使用捕获组
  2. 优化性能

    • 预编译频繁使用的正则表达式
    • 避免回溯(使用占有量词、原子组等)
    • 对于简单模式,考虑使用字符串操作
  3. 错误处理

    • 捕获正则表达式编译错误
    • 验证用户输入的正则表达式模式
  4. 安全性

    • 避免使用用户输入直接构造正则表达式(防止ReDoS攻击)
    • 对复杂模式设置匹配超时
  5. 测试

    • 测试边界情况
    • 测试不同输入长度
    • 测试国际化输入

结合使用的最佳实践

  1. 处理国际化输入

    • 使用Unicode感知的正则表达式
    • 考虑不同语言的字符特性
  2. 本地化的模式匹配

    • 根据不同locale调整正则表达式模式
    • 考虑本地化的数字、日期格式
  3. 性能平衡

    • 在本地化和正则表达式处理之间平衡性能
    • 缓存常用的本地化模式
  4. 代码组织

    • 将本地化和正则表达式逻辑封装在专门的类中
    • 使用依赖注入管理locale和模式

通过遵循这些最佳实践,可以开发出既支持国际化又高效处理文本的C++应用程序。本地化和正则表达式是现代C++编程中的重要工具,掌握它们对于开发高质量的软件至关重要。