
performance of countTokens #68

Closed

pczekaj opened this issue Dec 8, 2024 · 7 comments
pczekaj commented Dec 8, 2024

I'm comparing the performance of gpt-tokenizer 2.7.0 and tiktoken 1.0.17. On an Intel-based Mac with Node 22.11.0 I'm consistently getting worse times for gpt-tokenizer than for tiktoken. Am I doing something wrong, or is this expected?

(screenshot: Jest timing results from the IDE)

import { countTokens } from 'gpt-tokenizer';
import { encoding_for_model } from 'tiktoken';

const SAMPLE_TEXT = 'Occaecat est tempor incididunt voluptate exercitation irure quis aliqua sunt dolor. Anim nostrud incididunt eu aliquip quis culpa do incididunt eu. Magna qui dolor deserunt sit velit. Dolor anim laborum ut ad in et occaecat enim elit culpa commodo. Sit ut sit mollit adipisicing. Labore culpa do cillum proident incididunt et. Reprehenderit nisi excepteur culpa consectetur mollit consectetur laborum';

const LONG_MSG_REPEATS = 50000;
const EXPECTED_TOKENS = 86;

const gpt35Encoding = encoding_for_model('gpt-3.5-turbo');

describe('TokenizerService', () => {
  it('gpt-tokenizer short text', () => {
    const tokens = countTokens(SAMPLE_TEXT);
    expect(tokens).toBe(EXPECTED_TOKENS);
  });

  it('tiktoken short text', () => {
    const tokens = gpt35Encoding.encode(SAMPLE_TEXT).length;
    expect(tokens).toBe(EXPECTED_TOKENS);
  });

  it('gpt-tokenizer long text', () => {
    const tokens = countTokens(SAMPLE_TEXT.repeat(LONG_MSG_REPEATS));
    expect(tokens).toBe(EXPECTED_TOKENS * LONG_MSG_REPEATS);
  });

  it('tiktoken long text', () => {
    const tokens = gpt35Encoding.encode(SAMPLE_TEXT.repeat(LONG_MSG_REPEATS)).length;
    expect(tokens).toBe(EXPECTED_TOKENS * LONG_MSG_REPEATS);
  });
});
niieani (Owner) commented Dec 9, 2024

Hi @pczekaj.

I cannot reproduce this. When I benchmark it, even with your own sample text, tiktoken is 2x slower. I'm on Node v22.11.0 and using a MacBook Pro M1 Max.

(screenshot: benchmark results, gpt-tokenizer vs tiktoken)

When other samples are included in the benchmark (English, Chinese, French, code), gpt-tokenizer is even faster (3.5x faster than tiktoken).

(screenshot: benchmark results with mixed-language samples)

How are you running the benchmark? What tool are you using?
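
For reference, here is a minimal standalone micro-benchmark sketch (using the tinybench package purely for illustration; any benchmark runner that handles warm-up and averaging works) that keeps test-runner overhead out of the measured time:

// Run as an ES module (top-level await). tinybench is an assumption here,
// not something from this thread.
import { Bench } from 'tinybench';
import { countTokens } from 'gpt-tokenizer';
import { encoding_for_model } from 'tiktoken';

// Same kind of input as the test in the issue: a sample sentence repeated many times.
const text =
  'Occaecat est tempor incididunt voluptate exercitation irure quis aliqua sunt dolor. '.repeat(
    50000,
  );

const gpt35Encoding = encoding_for_model('gpt-3.5-turbo');

// Each task runs repeatedly for ~1 second and the runner reports averaged timings.
const bench = new Bench({ time: 1000 });

bench
  .add('gpt-tokenizer countTokens', () => countTokens(text))
  .add('tiktoken encode', () => gpt35Encoding.encode(text).length);

await bench.run();
console.table(bench.table());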

pczekaj (Author) commented Dec 9, 2024

@niieani I'm executing it as a Jest test without any dedicated benchmarking software; I don't do anything special like invoking the GC or warming up. The screenshot is from my IDE, but I get similar results when running it on the command line:

npm exec jest -t "TokenizerService"
 PASS  src/services/TokenizerService.test.ts (16.426 s)
  TokenizerService
    ✓ gpt-tokenizer short text (10 ms)
    ✓ tiktoken short text (7 ms)
    ✓ gpt-tokenizer long text (11440 ms)
    ✓ tiktoken long text (4099 ms)

Test Suites: 1 passed, 1 total
Tests:       4 passed, 4 total
Snapshots:   0 total
Time:        16.54 s
Ran all test suites matching /TokenizerService/i.

I'm only checking total execution time and don't track memory consumption; changing the order of the test cases didn't affect the timing.

niieani (Owner) commented Dec 9, 2024

Okay, I've tried it with SAMPLE_TEXT.repeat(LONG_MSG_REPEATS) instead of just SAMPLE_TEXT, and I do see about 25% slower execution times.

Got a couple of fixes and additional optimizations incoming... 💨

niieani closed this as completed in 15d13b1 on Dec 9, 2024
github-actions bot commented Dec 9, 2024

🎉 This issue has been resolved in version 2.8.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

niieani (Owner) commented Dec 9, 2024

Could you try again in 2.8.0 and let me know if it's any better?
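
Upgrading should just be a matter of bumping the dependency, e.g.:

npm install gpt-tokenizer@2.8.0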

pczekaj (Author) commented Dec 9, 2024

@niieani 2.8.0 is a lot faster than 2.7.0. Execution time went down from 11440 ms to just 615 ms, which is much faster than tiktoken. Thank you very much!

niieani (Owner) commented Dec 9, 2024

Perfect! Thanks for your feedback.

Best regards
