How to split text by tokens
Language models have a token limit that you should not exceed. When you split text into chunks, it is therefore a good idea to measure chunk size in tokens. Many tokenizers exist, so count tokens with the same tokenizer that the language model uses.
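As a minimal sketch of why the counting function matters, the snippet below checks a text against a token budget with a pluggable counter. The names `whitespace_token_count` and `fits_token_limit` are hypothetical, and the whitespace counter is only a stand-in; in practice you would pass the counting function of the tokenizer your model actually uses.

```python
from typing import Callable


def whitespace_token_count(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def fits_token_limit(text: str, limit: int, count_tokens: Callable[[str], int]) -> bool:
    """Return True if `text` stays within `limit` tokens under `count_tokens`."""
    return count_tokens(text) <= limit


sample = "Language models have a token limit."
print(whitespace_token_count(sample))  # 6 under the stand-in counter
print(fits_token_limit(sample, limit=6, count_tokens=whitespace_token_count))  # True
print(fits_token_limit(sample, limit=5, count_tokens=whitespace_token_count))  # False
```

A different counter can give a different answer for the same text and limit, which is why the counter should match the model.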
tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI.
We can use tiktoken to estimate tokens used. It will probably be more accurate for the OpenAI models.
- How the text is split: by character passed in.
- How the chunk size is measured: by tiktoken tokenizer.
CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can be used with tiktoken directly.
%pip install --upgrade --quiet langchain-text-splitters tiktoken
from langchain_text_splitters import CharacterTextSplitter
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
To split with a CharacterTextSplitter and then merge chunks with tiktoken, use its .from_tiktoken_encoder() method. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.
The .from_tiktoken_encoder() method takes either encoding_name as an argument (e.g. cl100k_base), or the model_name (e.g. gpt-4). All additional arguments like chunk_size, chunk_overlap, and separators are used to instantiate CharacterTextSplitter:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
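To see why a split can exceed the chunk size, here is a simplified sketch of the split-then-merge idea. It is not LangChain's implementation: `merge_by_token_count` and `toy_token_count` are hypothetical names, and the word counter stands in for a tiktoken-based length. Pieces are produced by a fixed separator, so a single piece that is already too long is passed through unchanged.

```python
from typing import Callable, List


def toy_token_count(text: str) -> int:
    """Stand-in for a tokenizer-based length (assumption): counts words."""
    return len(text.split())


def merge_by_token_count(
    text: str,
    separator: str,
    chunk_size: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        if current and count_tokens(candidate) > chunk_size:
            # Close the current chunk and start a new one with this piece.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


doc = "one two\n\nthree four five six seven\n\neight"
chunks = merge_by_token_count(doc, "\n\n", chunk_size=3, count_tokens=toy_token_count)
print([toy_token_count(c) for c in chunks])  # [2, 5, 1]: the middle chunk exceeds 3
```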
To implement a hard constraint on the chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, where each split will be recursively split if it has a larger size:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=100,
    chunk_overlap=0,
)
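The recursive idea can be sketched as follows. This is an illustration under simplifying assumptions, not the library's code: `recursive_split` and `toy_token_count` are hypothetical names, the word counter stands in for a tiktoken-based length, and the step that merges small neighbouring pieces back together is omitted for brevity.

```python
from typing import Callable, List, Sequence


def toy_token_count(text: str) -> int:
    """Stand-in for a tokenizer-based length (assumption): counts words."""
    return len(text.split())


def recursive_split(
    text: str,
    separators: Sequence[str],
    chunk_size: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Split with coarse separators first, recursing into oversized pieces."""
    if count_tokens(text) <= chunk_size or not separators:
        # Small enough, or no finer separator left to try.
        return [text] if text.strip() else []
    head, rest = separators[0], separators[1:]
    pieces: List[str] = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, rest, chunk_size, count_tokens))
    return pieces


doc = "one two\n\nthree four five six seven\n\neight"
pieces = recursive_split(doc, ["\n\n", "\n", " "], chunk_size=3, count_tokens=toy_token_count)
print(pieces)
print(all(toy_token_count(p) <= 3 for p in pieces))  # True
```

Because an oversized piece is always retried with a finer separator, every resulting piece respects the limit as long as some separator can still divide it.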
We can also load a TokenTextSplitter, which works with tiktoken directly and ensures each split is smaller than the chunk size.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our
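Splitting directly on tokens amounts to sliding a window over the sequence of token ids and decoding each window back to text. The sketch below shows only the windowing step on placeholder ids; `window_token_ids` is a hypothetical helper, not part of the library, and a real tokenizer would supply the ids and the decoding.

```python
from typing import List


def window_token_ids(token_ids: List[int], chunk_size: int, chunk_overlap: int) -> List[List[int]]:
    """Slice a token-id sequence into windows of at most `chunk_size` ids."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    windows: List[List[int]] = []
    start = 0
    while start < len(token_ids):
        end = min(start + chunk_size, len(token_ids))
        windows.append(token_ids[start:end])
        if end == len(token_ids):
            break
        # Step forward, keeping `chunk_overlap` ids from the previous window.
        start += chunk_size - chunk_overlap
    return windows


ids = list(range(10))  # placeholder ids; a real tokenizer would produce these
windows = window_token_ids(ids, chunk_size=4, chunk_overlap=1)
print(windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```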
Some written languages (e.g. Chinese and Japanese) have characters which encode to 2 or more tokens. Using the TokenTextSplitter directly can split the tokens for a character between two chunks, causing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure chunks contain valid Unicode strings.
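The effect can be reproduced at the byte level without any tokenizer. The sketch below assumes a byte-level tokenizer, where one token may cover only part of a character's UTF-8 bytes, and imitates a chunk boundary by cutting the byte string in the middle of a character. It is an analogy for illustration, not a trace of what a specific splitter does.

```python
text = "日本語"
data = text.encode("utf-8")  # each of these characters is 3 bytes in UTF-8

# Cut in the middle of the second character, as a token boundary might.
left, right = data[:4], data[4:]

left_text = left.decode("utf-8", errors="replace")
right_text = right.decode("utf-8", errors="replace")

# Both halves now contain U+FFFD replacement characters.
print(repr(left_text))
print(repr(right_text))
```

Splitting on character boundaries first and only measuring length in tokens avoids this, because each chunk is always a slice of the original string.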
spaCy
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
LangChain implements splitters based on the spaCy tokenizer.
- How the text is split: by spaCy tokenizer.
- How the chunk size is measured: by number of characters.
%pip install --upgrade --quiet spacy
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia's Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
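The combination of sentence-based splitting and character-based measurement can be sketched without spaCy. In the snippet below, `naive_sentences` is a hypothetical regex stand-in for a real sentence tokenizer, `pack_by_characters` is a hypothetical helper, and the "\n\n" joiner is an assumption made for the illustration.

```python
import re
from typing import List


def naive_sentences(text: str) -> List[str]:
    """Stand-in sentence splitter (assumption): break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_by_characters(sentences: List[str], chunk_size: int, joiner: str = "\n\n") -> List[str]:
    """Greedily join whole sentences while the chunk stays within `chunk_size` characters."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + joiner + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again."
packed = pack_by_characters(naive_sentences(text), chunk_size=60)
for chunk in packed:
    print(len(chunk), repr(chunk))
```

Chunk boundaries always fall between sentences here, while the size limit is checked with `len`, that is, in characters rather than tokens.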
SentenceTransformers
The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use.
To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:
- chunk_overlap: integer count of token overlap;
- model_name: sentence-transformer model name, defaulting to "sentence-transformers/all-mpnet-base-v2";
- tokens_per_chunk: desired token count per chunk.
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier
print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])
lorem
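The multiplier in the example above is chosen so that the repeated text cannot fit in one chunk. The arithmetic can be checked in isolation with made-up numbers: the limit of 100 below is hypothetical and not read from any model, and the sketch assumes the token count grows linearly with the number of repetitions.

```python
# Hypothetical numbers for illustration; the real limit comes from the model.
max_tokens_per_chunk = 100  # assumed per-chunk token window
tokens_per_repeat = 2       # tokens for one repetition, excluding start/stop tokens
special_tokens = 2          # start and stop tokens, added once per encoded text

# Integer division plus one is the smallest repeat count that overflows the window.
repeats = max_tokens_per_chunk // tokens_per_repeat + 1
content_tokens = repeats * tokens_per_repeat
total_tokens = content_tokens + special_tokens

print(repeats)         # 51
print(content_tokens)  # 102, already more than the window of 100
print(total_tokens)    # 104
```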
NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.
Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by NLTK tokenizer.
- How the chunk size is measured: by number of characters.
# pip install nltk
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia's Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies.
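The difference from splitting on "\n\n" shows up on text that has no blank lines. The comparison below uses a regex as a stand-in for a sentence tokenizer (an assumption; it is not how NLTK finds sentence boundaries).

```python
import re

text = (
    "Members of Congress and the Cabinet. Justices of the Supreme Court. "
    "My fellow Americans."
)

# Splitting on blank lines finds nothing to split in a single paragraph.
paragraph_pieces = text.split("\n\n")

# Stand-in for a sentence tokenizer (assumption): break after ., ! or ? plus whitespace.
sentence_pieces = re.split(r"(?<=[.!?])\s+", text)

print(len(paragraph_pieces))  # 1
print(len(sentence_pieces))   # 3
```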
KoNLPy
KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.
Token splitting involves the segmentation of text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. The effectiveness of token splitting largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens. Since tokenizers designed for the English language are not equipped to understand the unique semantic structures of other languages, such as Korean, they cannot be effectively used for Korean language processing.
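A small illustration of the problem, using only whitespace splitting as a stand-in for an English-oriented tokenizer: the resulting units keep Korean particles and endings attached to their stems. For example, "도령은" is the noun "도령" followed by the topic particle "은", and "친구들과" is "친구" followed by the plural marker "들" and the particle "과". The sample sentence is made up for this sketch.

```python
sentence = "도령은 친구들과 놀러 나갔다"

# Whitespace splitting, which works reasonably for English words,
# leaves Korean particles and endings attached to their stems.
tokens = sentence.split()
print(tokens)  # ['도령은', '친구들과', '놀러', '나갔다']

# The bare noun never appears as its own token.
print("도령" in tokens)  # False
```

A morphological analyzer is needed to separate such units.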
Token splitting for Korean with KoNLPy's Kkma Analyzer
For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text. It breaks down sentences into words and words into their respective morphemes, identifying parts of speech for each token. It can also segment a block of text into individual sentences, which is particularly useful for processing long texts.
Usage Considerations
While Kkma is renowned for its detailed analysis, this precision may reduce processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.
# pip install konlpy
# This is a long Korean document that we want to split up into its component sentences.
with open("./your_korean_doc.txt") as f:
    korean_document = f.read()
from langchain_text_splitters import KonlpyTextSplitter
text_splitter = KonlpyTextSplitter()
texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])
춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.
그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.
한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.
춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.
어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.
두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.
하지만 좋은 날들은 오래가지 않았다.
도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.
이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.
그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.
춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.
이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.
이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.
두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.
- 춘향전 (The Tale of Chunhyang)
Hugging Face tokenizer
Hugging Face has many tokenizers.
We use a Hugging Face tokenizer, GPT2TokenizerFast, to count the text length in tokens.
- How the text is split: by character passed in.
- How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.