The Core Mechanics of Language Models
At the core of every large language model (LLM) like ChatGPT, there are two essential components that govern how text is processed: tokens and context windows. In this section, we’ll focus on the first concept, tokens, and explore their role in helping LLMs understand and generate human language.
What are Tokens?
In human language, we typically think of words as the smallest units of meaning. However, for language models, the smallest unit they use to process text is called a token. Tokens are like the building blocks that LLMs break words and sentences into for easier analysis and prediction.
Tokens are not exactly the same as words. While words are the conventional way we structure language, tokens can represent parts of words, whole words, or even punctuation marks. For example, the sentence "I love learning Generative AI!" might be split into several tokens like "I", "love", "learning", "Gener", "ative", "AI", and "!". Each of these tokens helps the model understand the meaning and structure of the sentence.
Tokenization: Breaking Down the Sentence
To understand how LLMs process text, it's crucial to understand the process of tokenization—the act of breaking down text into tokens. When an LLM is given a sentence, it doesn’t work directly with whole words. Instead, it divides the sentence into smaller, manageable chunks called tokens.
Consider the phrase:
"Generative AI models are revolutionizing many industries."
This sentence has 7 words, but when processed by an LLM, it could be divided into more or fewer tokens, depending on the model's tokenization method. The model might break it down into tokens like "Gener", "ative", "AI", "models", "are", "revolution", "izing", "many", "industries", plus the final period. So, even though there are only 7 words, the sentence could be split into 10 tokens.
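If you want to see a split like this yourself, you can do it in Python with OpenAI's tiktoken library. The sketch below is only an illustration under assumptions: it loads one particular encoding (cl100k_base), and the exact tokens and counts you get depend on the tokenizer your model actually uses.

import tiktoken

# Load one of OpenAI's encodings; different models use different encodings.
enc = tiktoken.get_encoding("cl100k_base")

text = "Generative AI models are revolutionizing many industries."
token_ids = enc.encode(text)

# Decode each ID individually to see the text fragment behind each token.
tokens = [enc.decode([tid]) for tid in token_ids]

print("Words: ", len(text.split()))   # 7
print("Tokens:", len(token_ids))      # usually more than the word count
print(tokens)

Whatever the exact numbers, the token count you see will typically exceed the word count, which is exactly the point of the example.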

Average Token Length
An important thing to note is that tokens are not always the same size. Typically, a token is about 4 characters or roughly ¾ of a word, but this can vary depending on the specific text being processed.
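You can check that rule of thumb against your own text by dividing the character count by the token count. This is a rough sketch that again assumes tiktoken and the cl100k_base encoding; the ratio will vary with the text and the tokenizer.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Generative AI models are revolutionizing many industries."
num_tokens = len(enc.encode(text))

# For ordinary English prose this usually lands somewhere near 4.
print(round(len(text) / num_tokens, 2), "characters per token")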
Token IDs: The Numerical Representation of Tokens
Once a text is broken into tokens, each token is assigned a unique number called a token ID. This number helps the model recognize and reference the token in a more efficient, computationally friendly way. These token IDs are the actual units that models work with, not the text itself.
For instance:
The token "AI" could have a token ID like 17527.
The token "artificial" might have a different token IDs like 497 and 20454.
Even though the words may look the same to us, the model understands them by their corresponding token IDs. This is crucial because LLMs don’t actually understand language the way humans do—they process and predict numbers (token IDs) based on patterns learned during training.
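You can watch this text-to-number mapping directly: encoding turns a string into token IDs, and decoding turns those IDs back into text. The sketch below assumes tiktoken and the cl100k_base encoding; the ID values it prints depend on the encoding and will not match the illustrative numbers above.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# encode() maps text to a list of integer token IDs.
ids = enc.encode("AI is transforming industries.")
print(ids)

# decode() maps the IDs back to the original text.
print(enc.decode(ids))

# Each ID corresponds to one token's underlying bytes.
for tid in ids:
    print(tid, enc.decode_single_token_bytes(tid))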
Capitalization and Tokenization
One interesting aspect of tokenization is how capitalization affects how a model handles a word. For example, the word "apple" in lowercase is likely treated as one token, while "Apple" with a capital 'A' may be treated differently, often as a distinct token, and depending on context it may refer to the brand rather than the fruit.
This change might seem subtle, but the model treats them as distinct units, and each would have a different token ID. So, a simple change in capitalization can lead to a significant change in how a model processes a word.
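A quick way to confirm this is to tokenize both spellings side by side. The sketch below assumes tiktoken and the cl100k_base encoding; the exact IDs differ between encodings, but the two spellings will generally map to different ID sequences.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ("apple", "Apple"):
    # The lowercase and capitalized spellings produce different token IDs.
    print(word, "->", enc.encode(word))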
Language Variability in Tokenization
Tokenization can also vary across languages. For instance, in languages that use non-Latin characters like Chinese, each character might be treated as an individual token, adding complexity to how the text is processed. In contrast, languages like English use spaces between words, so tokenization is more straightforward.
Furthermore, languages with complex grammar structures, like Finnish or Japanese, may have different tokenization rules. The model needs to adjust for these nuances, which means that tokenization can differ based on the language, the training data the model has been exposed to, and the specific rules built into the model.
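The same comparison works across languages. In the sketch below, the Chinese sentence is a made-up example meaning roughly "Language models are changing the world"; the exact counts vary by encoding, but non-Latin scripts often require more tokens per character than English.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Language models are changing the world.",
    "Chinese": "语言模型正在改变世界。",
}

for language, text in samples.items():
    num_tokens = len(enc.encode(text))
    # Non-Latin scripts often need more tokens relative to their character count.
    print(language, "->", len(text), "characters,", num_tokens, "tokens")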
Using the Online Tokenizer for Token Management
The OpenAI Online Tokenizer is a valuable tool to estimate how many tokens your text will consume when using generative AI models. Tokens can include words, punctuation, or parts of words, and knowing the exact count helps manage both performance and costs. By pasting your text into the tokenizer, you can see how it’s broken down and get an accurate token count. This helps you ensure your input fits within the model’s token limit and avoids incomplete outputs.
Why Use It:
Estimate Token Count: Check how many tokens your input will use before sending it to the model.
Optimize Inputs: Adjust phrasing to reduce token usage, saving costs.
Plan Efficiently: Make sure long texts fit within the model’s context window.
Real-World Example: Before sending a request to GPT-4o via the API (which has a 4,096-token output limit), check your input's token count to avoid truncated results.
Visit the OpenAI Online Tokenizer and test different phrases to see how they’re tokenized!
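If you prefer to estimate token counts in code rather than in the browser, a minimal sketch with the tiktoken library looks like the following. It assumes tiktoken is installed; recent releases can look up the encoding for a model name such as "gpt-4o", and the fallback simply gives a rough count with an older encoding.

import tiktoken

try:
    # Recent tiktoken versions map model names to their encodings.
    enc = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    # Older versions may not know the model; fall back to an approximate count.
    enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the history of generative AI in three short paragraphs."
num_tokens = len(enc.encode(prompt))
print("This prompt uses", num_tokens, "tokens.")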
The Importance of Tokenization
Understanding tokens and how they are processed by models is essential, as they form the basis for how LLMs understand and generate text. Even small changes in the text—like adding punctuation or switching the order of words—can alter the tokenization and thus affect how the model interprets the meaning.
The way tokenization is handled can also vary across different models. Newer versions of models might use different techniques, optimizing how they process language based on advancements in technology or updated training data.
Key Takeaways:
Tokens are the smallest units that LLMs use to understand and process language.
Each token is given a unique token ID, which allows the model to handle text numerically.
Changes like capitalization or adding punctuation can alter how text is tokenized.
Language models treat tokenization differently depending on the language and structure of the text.
The tokenization process can vary even within the same family of models, and newer versions may use different techniques.
Activities
Activity 1: Tokenize a Sentence
Instructions:
Take the following sentence:
"Language models can be complex, but they have incredible potential."
Use a tokenizer (there are many online tools available) to break down the sentence into tokens.
Count how many tokens the model outputs.
Try making a small change to the sentence—like replacing "incredible" with "amazing"—and see how the tokenization changes.
Submit your findings: Did the number of tokens increase or decrease with your change? How did the tokens change?
Activity 2: Explore Capitalization Effects
Instructions:
Choose a word like "data" or "research" and use a tokenizer to break it down into tokens.
Next, change the capitalization (e.g., "Data" or "Research") and tokenize it again.
Compare the tokenization results for both versions.