cs336-01 BPE Encoding

Unicode Encoding
In Python, you can use the ord function to look up a character's Unicode code point, and chr to convert a code point back to a character. As a special case, chr(0) is the null character (NUL):
- print shows nothing visible for this character
- repr is more precise: it shows '\x00'
```python
ord('牛')     # 29275
chr(29275)    # '牛'
chr(0)        # '\x00'
print(chr(0))  # (nothing visible is printed)
"this is a test" + chr(0) + "string"           # 'this is a test\x00string'
print("this is a test" + chr(0) + "string")    # this is a teststring
```

Q: Why use UTF-8 for the tokenizer rather than UTF-16/UTF-32?

- UTF-32: every Unicode character takes a fixed 4 bytes, whether it is a, 中, or 🙂. For example, 'a' -> 00 00 00 61, '中' -> 00 00 4E 2D.
- UTF-16: common characters take 2 bytes; a few (emoji / ancient scripts) take 4 bytes. Better than UTF-32, but still somewhat large.
- UTF-8: ASCII characters take 1 byte, Chinese characters 3 bytes, emoji 4 bytes, and it is fully backward compatible with ASCII. For example, 'a' -> 61, '中' -> E4 B8 AD, '🙂' -> F0 9F 99 82.
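These sizes are easy to verify in Python with str.encode (a quick sketch; the "-be" variants avoid emitting a byte-order mark, so len() reflects the character alone):

```python
# Compare how many bytes each character needs under each encoding.
for ch in ["a", "中", "🙂"]:
    print(ch,
          len(ch.encode("utf-8")),      # 1 / 3 / 4
          len(ch.encode("utf-16-be")),  # 2 / 2 / 4
          len(ch.encode("utf-32-be")))  # 4 / 4 / 4
```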
Q: What is wrong with the code below, and how do we fix it?
```python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join(
        [bytes([b]).decode("utf-8") for b in bytestring]
    )

>>> print(decode_utf8_bytes_to_str_wrong("hello".encode("utf-8")))
hello
```

The problem: UTF-8 is a variable-length encoding, but this code forcibly decodes it one byte at a time. For example, `print(decode_utf8_bytes_to_str_wrong("中文".encode("utf-8")))` fails with: `'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data`.
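A minimal fix is to decode the whole byte sequence in one call, so the codec sees complete multi-byte characters instead of isolated bytes (sketch):

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the entire sequence at once; multi-byte characters stay intact.
    return bytestring.decode("utf-8")

print(decode_utf8_bytes_to_str("中文".encode("utf-8")))  # 中文
```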
Q: Are there two-byte sequences that are not valid Unicode (UTF-8)?

Yes. In UTF-8, the space of byte sequences is much larger than the space of Unicode characters. The only valid two-byte sequences have the form 110xxxxx 10xxxxxx, which corresponds to:

- byte 1: 0xC2 – 0xDF
- byte 2: 0x80 – 0xBF

The large number of invalid sequences left over makes fast error detection possible.
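A quick demonstration of both cases (0xC0 is outside the valid 0xC2–0xDF lead-byte range, so 0xC0 0x80, the "overlong" encoding of NUL, is rejected; 0xC2 0xA9 is a valid two-byte sequence, the © sign):

```python
# Invalid two-byte sequence: rejected immediately.
try:
    bytes([0xC0, 0x80]).decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid:", e)

# Valid two-byte sequence: decodes to the copyright sign.
print(bytes([0xC2, 0xA9]).decode("utf-8"))  # ©
```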
BPE (byte-pair encoding, Gage, 1994)

In short: each iteration merges one pair of adjacent tokens, picking the pair with the highest frequency; the next iteration again picks the most frequent pair, and so on. When two pairs tie on frequency, say "ab" and "fe", the lexicographically greater pair wins, so we pick "fe".
A simple example:
```
low low low low low
lower lower widest widest widest
newest newest newest newest newest newest
```

First, split on whitespace and count:
- low 5
- lower 2
- widest 3
- newest 6
Round 1: count all adjacent character pairs. The most frequent are (e,s) and (s,t), both at 9; by the ordering rule we pick (s,t).
Round 2: count pairs again; the most frequent is (e,st) at 9.
Running four more rounds, the merges are, in order: (o,w) (l,ow) (w,est) (n,e).
After six rounds we have the merged tokens (st) (est) (ow) (low) (west) (ne).
Applying them to the original words gives:
- low: low
- lower: low e r
- widest: w i d est
- newest: ne west
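The six rounds above can be reproduced with a short sketch (character-level for simplicity, with pair counts weighted by word frequency, as in the example):

```python
from collections import Counter

# Toy corpus from the example: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "widest": 3, "newest": 6}
words = {w: list(w) for w in word_freqs}  # each word starts as single characters

merges = []
for _ in range(6):
    pair_counts = Counter()
    for w, freq in word_freqs.items():
        toks = words[w]
        for i in range(len(toks) - 1):
            pair_counts[(toks[i], toks[i + 1])] += freq
    # Highest frequency first; ties broken by the lexicographically greater pair.
    best = max(pair_counts.items(), key=lambda x: (x[1], x[0]))[0]
    merges.append(best)
    for w in words:
        toks, i = words[w], 0
        while i < len(toks) - 1:
            if (toks[i], toks[i + 1]) == best:
                toks[i:i + 2] = [toks[i] + toks[i + 1]]
            else:
                i += 1

print(merges)            # [('s','t'), ('e','st'), ('o','w'), ('l','ow'), ('w','est'), ('n','e')]
print(words["newest"])   # ['ne', 'west']
```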
Code implementation
```python
import os
from collections import Counter

import regex


def run_train_bpe(
    input_path: str | os.PathLike,
    vocab_size: int,
    special_tokens: list[str],
    **kwargs,
) -> tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
    """Given the path to an input corpus, train a BPE tokenizer and
    output its vocabulary and merges.

    Args:
        input_path (str | os.PathLike): Path to BPE tokenizer training data.
        vocab_size (int): Total number of items in the tokenizer's vocabulary (including special tokens).
        special_tokens (list[str]): A list of string special tokens to be added to the tokenizer vocabulary.
            These strings will never be split into multiple tokens, and will always be
            kept as a single token. If these special tokens occur in the `input_path`,
            they are treated as any other string.

    Returns:
        tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
            vocab:
                The trained tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
                to bytes (token bytes)
            merges:
                BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
                representing that <token1> was merged with <token2>.
                Merges are ordered by order of creation.
    """
    # GPT-2 pretokenization pattern
    GPT2_PATTERN = regex.compile(
        r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    )

    def pretokenize(text: str) -> list[str]:
        """Split text into tokens using GPT-2 regex pattern."""
        return GPT2_PATTERN.findall(text)

    def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
        """Split text on special token boundaries."""
        # Build regex pattern to match any special token
        pattern = "|".join(map(regex.escape, special_tokens))
        # Split and keep delimiters
        parts = regex.split(f"({pattern})", text)
        return [p for p in parts if p]

    # Step 1: Initialize vocabulary with 256 byte values
    vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
    # Add special tokens to vocab
    special_token_set = set(special_tokens)
    for tok in special_tokens:
        vocab[len(vocab)] = tok.encode("utf-8")

    # Step 2: Read and preprocess the training text
    with open(input_path, "r", encoding="utf-8") as f:
        text = f.read()
    # Split on special tokens and pretokenize each part
    parts = split_on_special_tokens(text, special_tokens)
    tokens_list: list[list[bytes]] = []
    for part in parts:
        if part in special_token_set:
            # Special token: keep as single token
            tokens_list.append([part.encode("utf-8")])
        elif part:  # Non-empty non-special text
            # Regular text: pretokenize and convert to bytes
            for word in pretokenize(part):
                # Convert each character to its byte representation
                tokens_list.append([bytes([b]) for b in word.encode("utf-8")])

    merges: list[tuple[bytes, bytes]] = []

    # Step 3: BPE training loop
    while len(vocab) < vocab_size:
        # Count pair frequencies
        pair_counts: Counter[tuple[bytes, bytes]] = Counter()
        for tokens in tokens_list:
            for i in range(len(tokens) - 1):
                pair = (tokens[i], tokens[i + 1])
                pair_counts[pair] += 1
        if not pair_counts:
            break
        # Select highest frequency pair (with lexicographically larger as tiebreaker)
        best_pair = max(pair_counts.items(), key=lambda x: (x[1], x[0]))[0]
        # Create new token by merging
        new_token = best_pair[0] + best_pair[1]
        vocab[len(vocab)] = new_token
        merges.append(best_pair)
        # Update tokens by merging all occurrences of best_pair
        for tokens in tokens_list:
            i = 0
            while i < len(tokens) - 1:
                if (tokens[i], tokens[i + 1]) == best_pair:
                    # Merge the pair
                    tokens[i : i + 2] = [new_token]
                    # Don't increment i - check if the new token can merge again
                else:
                    i += 1

    return vocab, merges
```

This part of the code can be found here:
Detailed walkthrough:

BPE Tokenizer Training: What Actually Happens When You Call tokenizer.train()
Most LLM engineers treat tokenization as a solved problem. You import tiktoken, pick a vocabulary size, and move on. But when you’re training your own tokenizer from scratch — for a new domain, a new language, or just to understand what’s happening under the hood — the implementation details matter more than you’d expect.
Here’s how Byte Pair Encoding (BPE) tokenizer training actually works, drawn from a complete reference implementation.
The Core Algorithm in One Sentence
BPE iteratively merges the most frequent adjacent token pairs until your vocabulary reaches the target size.
That’s it. The rest is implementation detail — but those details determine whether your tokenizer produces sensible merges or garbage.
Step 1: GPT-2 Pretokenization (Where the Magic Happens)
Before BPE even starts, you need to split your text into chunks. GPT-2 uses a regex pattern that’s surprisingly sophisticated:
```python
GPT2_PATTERN = regex.compile(
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)
```

This pattern handles several edge cases that naive whitespace splitting misses:
- Contractions: 's, 'd, 'll, 've, 're become separate tokens — crucial for English
- Numbers vs. letters: 42 and hello keep their leading space as part of the token
- Punctuation clusters: !!! stays together instead of becoming three separate tokens
- Trailing whitespace: \s+(?!\S) captures trailing spaces that would otherwise be lost
The result? Each pretokenized word starts as a sequence of single bytes. The word “hello” becomes [b'h', b'e', b'l', b'l', b'o'].
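This initial byte split is plain Python, no tokenizer library needed:

```python
word = "hello"
# Each UTF-8 byte becomes its own single-byte token.
initial_tokens = [bytes([b]) for b in word.encode("utf-8")]
print(initial_tokens)  # [b'h', b'e', b'l', b'l', b'o']
```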
Step 2: Special Tokens Get Special Treatment
Special tokens — like <|endoftext|> or
<|startoftext|> — are a constant headache in
tokenizer implementations. The right approach: split on them
before pretokenization, then treat each special token as an
atomic unit.
```python
def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
    pattern = "|".join(map(regex.escape, special_tokens))
    parts = regex.split(f"({pattern})", text)
    return [p for p in parts if p]
```

This ensures special tokens never get broken up by the BPE merge process. They’re added directly to the vocabulary and preserved as-is.
Step 3: The Merge Loop
Here’s where vocabulary building happens:
```python
while len(vocab) < vocab_size:
    # Count all adjacent pairs across all tokens
    pair_counts: Counter[tuple[bytes, bytes]] = Counter()
    for tokens in tokens_list:
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pair_counts[pair] += 1
    if not pair_counts:
        break  # Nothing left to merge
    # Select the best pair: highest frequency, ties broken lexicographically
    best_pair = max(pair_counts.items(), key=lambda x: (x[1], x[0]))[0]
    # Create new merged token
    new_token = best_pair[0] + best_pair[1]
    vocab[len(vocab)] = new_token
    merges.append(best_pair)
    # Update all token sequences with the new merge
    for tokens in tokens_list:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == best_pair:
                tokens[i : i + 2] = [new_token]
                # Don't increment i — the new token might merge again
            else:
                i += 1
```

Each iteration:
- Counts pair frequencies across the entire corpus
- Picks the most frequent pair (with lexicographic tiebreaking)
- Creates a new token by concatenating the pair
- Updates all token sequences, replacing the pair with the new token
The tiebreaker matters more than you’d think. When two pairs have the same frequency, choosing the lexicographically greater one ensures deterministic results across runs. Without this, your tokenizer’s vocabulary could vary based on hash ordering.
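The tiebreak falls out of comparing (count, pair) tuples, a small sketch:

```python
from collections import Counter

pair_counts = Counter({(b'e', b's'): 9, (b's', b't'): 9, (b'l', b'o'): 7})
# max compares (count, pair) tuples; equal counts fall through to the pair
# itself, so the lexicographically greater pair wins deterministically.
best_pair = max(pair_counts.items(), key=lambda x: (x[1], x[0]))[0]
print(best_pair)  # (b's', b't')
```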
What This Implementation Doesn’t Do (And Why That’s Okay)
No parallel pretokenization. The code processes text sequentially. For large corpora, you’d want to parallelize this step. The GPT-2 pattern is stateless, so this is embarrassingly parallel.
No streaming. The entire corpus is loaded into memory. For production tokenizers training on terabytes, you’d need a streaming approach with approximate pair counts.
No caching of intermediate results. Each merge rescans all tokens. You could optimize this with a priority queue, but the naive approach is often fast enough for vocab sizes under 100k.
The Surprising Insight
The vocabulary starts with 256 single-byte tokens. Everything else — every word, every subword, every punctuation cluster — is built from these 256 atoms through repeated merging. Your tokenizer’s entire vocabulary is just compressed byte sequences.
This is why BPE handles any Unicode text without exploding vocabulary size. A string like 你好 becomes six UTF-8 bytes (\xe4\xbd\xa0 and \xe5\xa5\xbd, three per character), and BPE learns to merge them into meaningful subword units if they appear frequently enough.
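Easy to confirm:

```python
print("你好".encode("utf-8"))  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(len("你".encode("utf-8")), len("好".encode("utf-8")))  # 3 3
```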
The Output
The function returns two things:
- Vocabulary: A mapping from token ID to bytes ({0: b'\x00', 1: b'\x01', ..., 256: b'hello', ...})
- Merges: An ordered list of which pairs were merged ([(b'h', b'e'), (b'he', b'l'), ...])
The merge order matters for encoding. When you tokenize new text, you apply merges in the same order they were learned. Skip a merge, and you’ll get different tokenization than the training process produced.
When You’d Actually Train Your Own Tokenizer
- Domain-specific models: Legal documents, medical records, or code have very different token frequency distributions than general web text
- New languages: Training on English text produces terrible tokenization for languages with different morphological patterns
- Compression research: Tokenizer efficiency directly impacts context window utilization
- Educational purposes: Understanding tokenization helps debug why your model fails on certain inputs
The full implementation is in tests/adapters.py if you
want to trace through it yourself. The algorithm is ~100 lines of
Python, but every line exists for a reason.




