cs336-01 BPE Encoding

Unicode encoding

In Python, the ord function looks up a character's Unicode code point, and chr does the reverse lookup.

As a special case, chr(0) is the null character (NUL):

  • print shows nothing for this character
  • repr is more precise: it shows '\x00'
ord('牛') #29275
chr(29275) #'牛'
chr(0) #'\x00'
print(chr(0)) #
"this is a test" + chr(0) + "string" #'this is a test\x00string'
print("this is a test" + chr(0) + "string")#this is a teststring

NOTE

Q: Why train the tokenizer on UTF-8 bytes rather than UTF-16/UTF-32?

UTF-32: every Unicode character takes a fixed 4 bytes, whether it is a, 中, or 🙂, e.g. 'a' -> 00 00 00 61, '中' -> 00 00 4E 2D

UTF-16: common characters take 2 bytes, a few (emoji / ancient scripts) take 4. Better than UTF-32, but still on the large side.

UTF-8: ASCII characters take 1 byte, Chinese characters 3 bytes, emoji 4 bytes, and it is fully backward compatible with ASCII, e.g. 'a' -> 61, '中' -> E4 B8 AD, '🙂' -> F0 9F 99 82
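A quick comparison in Python (a small illustration; the -be codec variants are used so the plain "utf-16"/"utf-32" codecs' byte-order mark does not inflate the lengths):

```python
# Encoded byte lengths of the same characters under UTF-8/16/32.
for ch in ["a", "中", "🙂"]:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # -be: no byte-order mark prepended
    utf32 = ch.encode("utf-32-be")
    print(ch, len(utf8), len(utf16), len(utf32), utf8.hex())
```

This prints 1/2/4 bytes for a, 3/2/4 for 中, and 4/4/4 for 🙂.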


NOTE

Q: What is wrong with the following code, and how do we fix it?

def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join(
        [bytes([b]).decode("utf-8") for b in bytestring]
    )

>>> print(decode_utf8_bytes_to_str_wrong("hello".encode("utf-8")))
hello

The problem is that UTF-8 is a variable-length encoding, while the code above forcibly decodes it one byte at a time.

For example, print(decode_utf8_bytes_to_str_wrong("中文".encode("utf-8"))) fails with: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data
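The fix is to decode the byte sequence as a whole rather than byte by byte, so multi-byte characters stay intact:

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode all bytes at once; the codec handles multi-byte sequences.
    return bytestring.decode("utf-8")

print(decode_utf8_bytes_to_str("中文".encode("utf-8")))  # 中文
```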


NOTE

Q: Are there two-byte sequences that are not valid Unicode encodings?

Yes. In UTF-8, the space of byte sequences is far larger than the space of Unicode characters. The only valid two-byte sequences have the form 110xxxxx 10xxxxxx, which corresponds to:

  • byte 1: 0xC2 – 0xDF
  • byte 2: 0x80 – 0xBF

Leaving large regions of the byte space invalid enables fast error detection.


bpe (byte-pair encoding, Gage, 1994)

In short: each iteration merges one pair of adjacent tokens, choosing the pair with the highest frequency; the next iteration again picks the then-highest-frequency pair.

If two pairs tie in frequency, say "ab" and "fe", the lexicographically greater pair wins, so we pick "fe".

A simple example:

low low low low low
lower lower widest widest widest
newest newest newest newest newest newest

First, split on whitespace and count word frequencies:

  • low 5
  • lower 2
  • widest 3
  • newest 6

Round 1: count every adjacent character pair. The most frequent are (e,s) and (s,t), both with 9 occurrences; by the tiebreak we pick (s,t).

Round 2: recount; the most frequent pair is (e,st) with 9 occurrences.

Continuing for four more rounds, the merges are, in order: (o,w) (l,ow) (w,est) (n,e)

After six rounds we have the new tokens (st) (est) (ow) (low) (west) (ne).

Applying them to the original words gives:

  • low: low
  • lower: low e r
  • widest: w i d est
  • newest: ne west
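The six rounds above can be replayed with a minimal character-level BPE sketch (pure Python; pair counts are weighted by word frequency, and ties go to the lexicographically greater pair, as described):

```python
from collections import Counter

word_freqs = {"low": 5, "lower": 2, "widest": 3, "newest": 6}
words = {w: list(w) for w in word_freqs}  # each word as a token list

merges = []
for _ in range(6):
    # Count adjacent pairs, weighted by how often each word occurs.
    counts = Counter()
    for w, freq in word_freqs.items():
        toks = words[w]
        for a, b in zip(toks, toks[1:]):
            counts[(a, b)] += freq
    # Highest count wins; ties broken by the lexicographically greater pair.
    best = max(counts, key=lambda p: (counts[p], p))
    merges.append(best)
    merged = best[0] + best[1]
    # Replace every occurrence of the pair with the merged token.
    for toks in words.values():
        i = 0
        while i < len(toks) - 1:
            if (toks[i], toks[i + 1]) == best:
                toks[i:i + 2] = [merged]
            else:
                i += 1

print(merges)
# [('s','t'), ('e','st'), ('o','w'), ('l','ow'), ('w','est'), ('n','e')]
```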

Code implementation

import os
from collections import Counter

import regex


def run_train_bpe(
    input_path: str | os.PathLike,
    vocab_size: int,
    special_tokens: list[str],
    **kwargs,
) -> tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
    """Given the path to an input corpus, run train a BPE tokenizer and
    output its vocabulary and merges.

    Args:
        input_path (str | os.PathLike): Path to BPE tokenizer training data.
        vocab_size (int): Total number of items in the tokenizer's vocabulary (including special tokens).
        special_tokens (list[str]): A list of string special tokens to be added to the tokenizer vocabulary.
            These strings will never be split into multiple tokens, and will always be
            kept as a single token. If these special tokens occur in the `input_path`,
            they are treated as any other string.

    Returns:
        tuple[dict[int, bytes], list[tuple[bytes, bytes]]]:
            vocab:
                The trained tokenizer vocabulary, a mapping from int (token ID in the vocabulary)
                to bytes (token bytes)
            merges:
                BPE merges. Each list item is a tuple of bytes (<token1>, <token2>),
                representing that <token1> was merged with <token2>.
                Merges are ordered by order of creation.
    """
    # GPT-2 pretokenization pattern
    GPT2_PATTERN = regex.compile(
        r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    )

    def pretokenize(text: str) -> list[str]:
        """Split text into tokens using GPT-2 regex pattern."""
        return GPT2_PATTERN.findall(text)

    def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
        """Split text on special token boundaries."""
        if not special_tokens:
            # Guard: an empty alternation pattern would split at every position
            return [text]
        # Build regex pattern to match any special token
        pattern = "|".join(map(regex.escape, special_tokens))
        # Split and keep delimiters
        parts = regex.split(f"({pattern})", text)
        return [p for p in parts if p]

    # Step 1: Initialize vocabulary with 256 byte values
    vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}

    # Add special tokens to vocab
    special_token_set = set(special_tokens)
    for tok in special_tokens:
        vocab[len(vocab)] = tok.encode("utf-8")

    # Step 2: Read and preprocess the training text
    with open(input_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Split on special tokens and pretokenize each part
    parts = split_on_special_tokens(text, special_tokens)
    tokens_list: list[list[bytes]] = []

    for part in parts:
        if part in special_token_set:
            # Special token: keep as single token
            tokens_list.append([part.encode("utf-8")])
        elif part:  # Non-empty non-special text
            # Regular text: pretokenize and convert to bytes
            for word in pretokenize(part):
                # Convert each character to its byte representation
                tokens_list.append([bytes([b]) for b in word.encode("utf-8")])

    merges: list[tuple[bytes, bytes]] = []

    # Step 3: BPE training loop
    while len(vocab) < vocab_size:
        # Count pair frequencies
        pair_counts: Counter[tuple[bytes, bytes]] = Counter()
        for tokens in tokens_list:
            for i in range(len(tokens) - 1):
                pair = (tokens[i], tokens[i + 1])
                pair_counts[pair] += 1

        if not pair_counts:
            break

        # Select highest frequency pair (with lexicographically larger as tiebreaker)
        best_pair = max(pair_counts.items(), key=lambda x: (x[1], x[0]))[0]

        # Create new token by merging
        new_token = best_pair[0] + best_pair[1]
        vocab[len(vocab)] = new_token
        merges.append(best_pair)

        # Update tokens by merging all occurrences of best_pair
        for tokens in tokens_list:
            i = 0
            while i < len(tokens) - 1:
                if (tokens[i], tokens[i + 1]) == best_pair:
                    # Merge the pair
                    tokens[i : i + 2] = [new_token]
                    # Don't increment i - check if the new token can merge again
                else:
                    i += 1

    return vocab, merges

The code for this section can be found here:

A detailed walkthrough follows:

BPE Tokenizer Training: What Actually Happens When You Call tokenizer.train()

Most LLM engineers treat tokenization as a solved problem. You import tiktoken, pick a vocabulary size, and move on. But when you’re training your own tokenizer from scratch — for a new domain, a new language, or just to understand what’s happening under the hood — the implementation details matter more than you’d expect.

Here’s how Byte Pair Encoding (BPE) tokenizer training actually works, drawn from a complete reference implementation.

The Core Algorithm in One Sentence

BPE iteratively merges the most frequent adjacent token pairs until your vocabulary reaches the target size.

That’s it. The rest is implementation detail — but those details determine whether your tokenizer produces sensible merges or garbage.

Step 1: GPT-2 Pretokenization (Where the Magic Happens)

Before BPE even starts, you need to split your text into chunks. GPT-2 uses a regex pattern that’s surprisingly sophisticated:

GPT2_PATTERN = regex.compile(
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

This pattern handles several edge cases that naive whitespace splitting misses:

  • Contractions: 's, 't, 'm, 'd, 'll, 've, 're become separate tokens — crucial for English
  • Numbers vs. letters: 42 and hello keep their leading space as part of the token
  • Punctuation clusters: !!! stays together instead of becoming three separate tokens
  • Trailing whitespace: \s+(?!\S) captures trailing spaces that would otherwise be lost

The result? Each pretokenized word starts as a sequence of single bytes. The word “hello” becomes [b'h', b'e', b'l', b'l', b'o'].
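You can see these splits directly. The third-party regex module is required here, since the stdlib re does not support \p{L}/\p{N} property classes:

```python
import regex  # pip install regex

GPT2_PATTERN = regex.compile(
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)
# Contraction, letter runs with leading space, a number run, punctuation cluster:
print(GPT2_PATTERN.findall("I'll pay 42 dollars!!!"))
# ['I', "'ll", ' pay', ' 42', ' dollars', '!!!']
```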

Step 2: Special Tokens Get Special Treatment

Special tokens — like <|endoftext|> or <|startoftext|> — are a constant headache in tokenizer implementations. The right approach: split on them before pretokenization, then treat each special token as an atomic unit.

def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
    pattern = "|".join(map(regex.escape, special_tokens))
    parts = regex.split(f"({pattern})", text)
    return [p for p in parts if p]

This ensures special tokens never get broken up by the BPE merge process. They’re added directly to the vocabulary and preserved as-is.
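A standalone check of this splitter (redefined here with stdlib re, which suffices because the pattern contains no Unicode property classes):

```python
import re

def split_on_special_tokens(text: str, special_tokens: list[str]) -> list[str]:
    # The capture group in the split pattern keeps the delimiters themselves.
    pattern = "|".join(map(re.escape, special_tokens))
    parts = re.split(f"({pattern})", text)
    return [p for p in parts if p]

print(split_on_special_tokens("foo<|endoftext|>bar", ["<|endoftext|>"]))
# ['foo', '<|endoftext|>', 'bar']
```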

Step 3: The Merge Loop

Here’s where vocabulary building happens:

while len(vocab) < vocab_size:
    # Count all adjacent pairs across all tokens
    pair_counts: Counter[tuple[bytes, bytes]] = Counter()
    for tokens in tokens_list:
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pair_counts[pair] += 1

    if not pair_counts:
        break  # Nothing left to merge

    # Select the best pair: highest frequency, ties broken lexicographically
    best_pair = max(pair_counts.items(), key=lambda x: (x[1], x[0]))[0]

    # Create new merged token
    new_token = best_pair[0] + best_pair[1]
    vocab[len(vocab)] = new_token
    merges.append(best_pair)

    # Update all token sequences with the new merge
    for tokens in tokens_list:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == best_pair:
                tokens[i : i + 2] = [new_token]
                # Don't increment i — the new token might merge again
            else:
                i += 1

Each iteration:

  1. Counts pair frequencies across the entire corpus
  2. Picks the most frequent pair (with lexicographic tiebreaking)
  3. Creates a new token by concatenating the pair
  4. Updates all token sequences, replacing the pair with the new token

The tiebreaker matters more than you’d think. When two pairs have the same frequency, choosing the lexicographically greater one ensures deterministic results across runs. Without this, your tokenizer’s vocabulary could vary based on hash ordering.
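The tiebreak is easy to test in isolation: max over (count, pair) tuples compares counts first and falls back to byte-wise comparison of the pairs:

```python
from collections import Counter

# Two pairs tie at 9; (b'f', b'e') wins the lexicographic tiebreak.
pair_counts = Counter({(b'a', b'b'): 9, (b'f', b'e'): 9, (b'x', b'y'): 3})
best_pair = max(pair_counts.items(), key=lambda x: (x[1], x[0]))[0]
print(best_pair)  # (b'f', b'e')
```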

What This Implementation Doesn’t Do (And Why That’s Okay)

No parallel pretokenization. The code processes text sequentially. For large corpora, you’d want to parallelize this step. The GPT-2 pattern is stateless, so this is embarrassingly parallel.

No streaming. The entire corpus is loaded into memory. For production tokenizers training on terabytes, you’d need a streaming approach with approximate pair counts.

No caching of intermediate results. Each merge rescans all tokens. You could optimize this with a priority queue, but the naive approach is often fast enough for vocab sizes under 100k.

The Surprising Insight

The vocabulary starts with 256 single-byte tokens. Everything else — every word, every subword, every punctuation cluster — is built from these 256 atoms through repeated merging. Your tokenizer’s entire vocabulary is just compressed byte sequences.

This is why BPE handles any Unicode text without exploding vocabulary size. Each character in 你好 becomes three UTF-8 bytes (\xe4\xbd\xa0 and \xe5\xa5\xbd respectively), and BPE learns to merge them into meaningful subword units if they appear frequently enough.
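You can verify the byte counts directly:

```python
s = "你好"
print(s.encode("utf-8"))                      # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print([ch.encode("utf-8") for ch in s])       # one 3-byte sequence per character
```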

The Output

The function returns two things:

  1. Vocabulary: A mapping from token ID to bytes ({0: b'\x00', 1: b'\x01', ..., 256: b'he', ...})
  2. Merges: An ordered list of which pairs were merged ([(b'h', b'e'), (b'he', b'l'), ...])

The merge order matters for encoding. When you tokenize new text, you apply merges in the same order they were learned. Skip a merge, and you’ll get different tokenization than the training process produced.
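A minimal sketch of that encoding step (apply_merges is a hypothetical helper name, not part of the training code above): for one pretokenized word, replay each learned merge in training order:

```python
def apply_merges(word: bytes, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    tokens = [bytes([b]) for b in word]  # start from single bytes
    for a, b in merges:  # training order matters
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge in place, then re-check at i
            else:
                i += 1
    return tokens

print(apply_merges(b"hello", [(b"h", b"e"), (b"l", b"l"), (b"he", b"ll")]))
# [b'hell', b'o']
```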

When You’d Actually Train Your Own Tokenizer

  • Domain-specific models: Legal documents, medical records, or code have very different token frequency distributions than general web text
  • New languages: Training on English text produces terrible tokenization for languages with different morphological patterns
  • Compression research: Tokenizer efficiency directly impacts context window utilization
  • Educational purposes: Understanding tokenization helps debug why your model fails on certain inputs

The full implementation is in tests/adapters.py if you want to trace through it yourself. The algorithm is ~100 lines of Python, but every line exists for a reason.