The intended use case is calculating token counts accurately on the client side.
- Easy to use: 0 dependencies, code and data baked into a single file.
- Compatible with most LLaMA-based models (see Compatibility)
- Optimized running time: tokenize a sentence in roughly 1ms, or 2000 tokens in roughly 20ms.
- Optimized bundle size: 670KiB before minification and gzipping (the heaviest part of the tokenizer, the merge data, has been compressed into a simple and efficient binary format and then base64-encoded to bake it into the .js file)
Option 1: Install as an npm package and import as ES6 module
```
npm install llama-tokenizer-js
```

```js
import llamaTokenizer from 'llama-tokenizer-js'

console.log(llamaTokenizer.encode("Hello world!").length)
```
Option 2: Load as ES6 module with `<script>` tags in your HTML

```html
<script type="module" src="https://belladoreai.github.io/llama-tokenizer-js/llama-tokenizer.js"></script>
```
Once you have the module imported, you can encode or decode with it. Training is not supported.
When used in the browser, llama-tokenizer-js pollutes the global namespace with `llamaTokenizer`.
```
llamaTokenizer.encode("Hello world!")
> [1, 15043, 3186, 29991]

llamaTokenizer.decode([1, 15043, 3186, 29991])
> 'Hello world!'
```
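For instance, counting tokens for a prompt before sending it to a model needs only `encode`. The helper below is a minimal sketch of that use case; the `countTokens` name is illustrative and not part of the library:

```js
import llamaTokenizer from 'llama-tokenizer-js'

// Illustrative helper (not part of the library): count tokens for a prompt
// entirely on the client, with no network calls.
function countTokens(text) {
    return llamaTokenizer.encode(text).length
}

console.log(countTokens("Hello world!")) // 4
```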
You can run tests with:
The test suite is small, but it covers different edge cases very well.
Note that the tests can be run both in the browser and in Node (this is necessary because some parts of the code work differently in different environments).
Comparison to alternatives
- Some web applications make network calls to Python applications that run the Huggingface transformers tokenizer. For example, the oobabooga-text-webui exposes an API endpoint for token count. The drawback of this approach is latency: although the Python tokenizer itself is very fast, oobabooga adds a lot of overhead. In my testing, making a network call to locally running oobabooga to count tokens for short strings of text took roughly 300ms (compared to ~1ms when counting tokens client-side with llama-tokenizer-js). The latency will be even higher when a real web client is making requests over the internet. The latency issue is even worse if an application needs to iteratively trim down a prompt to get it to fit within a context limit, requiring multiple network calls (a client-side sketch of such a trimming loop is shown below).
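As a concrete illustration of the trimming loop mentioned above, the sketch below relies only on the library's `encode` and `decode` functions; the `trimToTokenLimit` helper and the 4096-token limit are hypothetical, not part of the library:

```js
import llamaTokenizer from 'llama-tokenizer-js'

// Illustrative helper: trim a prompt so that it fits within maxTokens.
// Everything runs client-side, so even repeated encode/decode calls
// involve no network round-trips.
function trimToTokenLimit(text, maxTokens = 4096) {
    const tokens = llamaTokenizer.encode(text)
    if (tokens.length <= maxTokens) return text
    // Keep the last maxTokens tokens and decode them back to text.
    return llamaTokenizer.decode(tokens.slice(-maxTokens))
}
```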
The tokenizer is the same for all LLaMA models which have been trained on top of the checkpoints (model weights) leaked by Facebook in early 2023.
Examples of compatible models:
Incompatible models are those which have been trained from scratch, not on top of the checkpoints leaked by Facebook. For example, OpenLLaMA models are incompatible. I'd be happy to adapt this to any LLaMA models that people need; just open an issue for it.
You are free to use llama-tokenizer-js for basically whatever you want (MIT license).
You are not required to give anything in exchange, but I kindly ask that you give back by linking to https://belladore.ai/tools in an appropriate place in your website. For example, you might link with the text “Using llama-tokenizer-js by belladore.ai” or something similar.