Language Model | Training Data | Token-based | Size | Descriptions |
---|---|---|---|---|
English LM | CommonCrawl(en.00) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; About 1.85 billion n-grams; 'trie' binary with '-a 22 -q 8 -b 8' |
Mandarin LM Small | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; About 0.13 billion n-grams; 'probing' binary with default settings |
Mandarin LM Large | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; About 3.7 billion n-grams; 'probing' binary with default settings |