splitTextChunks
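Split documents recursively into text chunks.
Syntax
chunkTable = splitTextChunks(str)
chunkTable = splitTextChunks(t)
chunkTable = splitTextChunks(___,Name=Value)
Description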
Text often comes in large documents. Many analysis tools, including large language models (LLMs), perform better on small chunks of text. Text Analytics Toolbox™ includes a range of functions that allow you to split large documents into semantically meaningful chunks.
chunkTable = splitTextChunks(str) recursively splits the document str into text chunks of the default target length or shorter. By default, the function first splits str into paragraphs, then splits any paragraphs longer than the target length into sentences, then splits any sentences longer than the target length into tokens.
chunkTable = splitTextChunks(t) splits a table of documents t into smaller text chunks.
chunkTable = splitTextChunks(___,Name=Value) specifies additional options using one or more name-value arguments. For example, to set a target length of 100 characters, set TargetLength to 100.
Examples
Load the example data. The file sonnets.txt contains Shakespeare's sonnets in plain text. Extract the text from sonnets.txt using the extractFileText function.
str = extractFileText("sonnets.txt");
Split str into text chunks using the splitTextChunks function. Specify the target length as 50.
chunkTable = splitTextChunks(str,TargetLength=50)
chunkTable=2322×1 table
Text
____________________________________________________
"THE SONNETS by William Shakespeare I"
"From fairest creatures we desire increase, That"
"thereby beauty's rose might never die, But as the"
"riper should by time decease, His tender heir"
"might bear his memory: But thou, contracted to"
"thine own bright eyes, Feed'st thy light's flame"
"with self-substantial fuel, Making a famine where"
"abundance lies, Thy self thy foe, to thy sweet"
"self too cruel: Thou that art now the world's"
"fresh ornament, And only herald to the gaudy"
"spring, Within thine own bud buriest thy content,"
"And tender churl mak'st waste in niggarding: Pity"
"the world, or else this glutton be, To eat the"
"world's due, by the grave and thee."
"II"
"When forty winters shall besiege thy brow, And dig"
⋮
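One way to check how the chunks relate to the target length is to measure them with strlength. The following is a minimal sketch, not part of the original example; it assumes that Text Analytics Toolbox and the sonnets.txt example file are available, and the variable names chunkLengths and longChunks are illustrative.

```matlab
% Sketch: inspect the chunk lengths produced by the example above.
% Assumes Text Analytics Toolbox and the sonnets.txt example file.
str = extractFileText("sonnets.txt");
chunkTable = splitTextChunks(str,TargetLength=50);

% Number of characters in each chunk (one value per table row).
chunkLengths = strlength(chunkTable.Text);

% Summarize the chunk lengths.
fprintf("Number of chunks: %d\n",height(chunkTable))
fprintf("Longest chunk: %d characters\n",max(chunkLengths))
fprintf("Mean chunk length: %.1f characters\n",mean(chunkLengths))

% List any chunks that exceed the target length. Assumption: with the
% default split levels, a single token longer than the target length
% might not be split further, so this table is not guaranteed to be empty.
longChunks = chunkTable(chunkLengths > 50,:)
```

If longChunks is not empty, adding "character" to SplitLevels may be worth trying, although the exact behavior for that case is not described here.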
To split multiple documents into chunks in a single table, create a table of documents and then use splitTextChunks with the table of documents as the input.
To retain metadata about the source documents, such as their filenames, add the metadata to the table as additional variables.
Create a table from:
- A variable Text that contains documents.
- A variable DocumentName that contains the names of the documents.
str1 = "Document 1 contains some text.";
str2 = "Document 2 also contains some text.";
str3 = "Document 3 is very different from documents 1 and 2. However, it, too, contains some text.";
Text = [str1;str2;str3];
DocumentName = ["Document 1";"Document 2";"Document 3"];
t = table(Text,DocumentName)
t=3×2 table
Text DocumentName
____________________________________________________________________________________________ ____________
"Document 1 contains some text." "Document 1"
"Document 2 also contains some text." "Document 2"
"Document 3 is very different from documents 1 and 2. However, it, too, contains some text." "Document 3"
Split the table of documents into text chunks using the splitTextChunks function. Specify the target length as 20.
chunkTable = splitTextChunks(t,TargetLength=20)
chunkTable=9×2 table
Text DocumentName
_____________________ ____________
"Document 1 contains" "Document 1"
"some text." "Document 1"
"Document 2 also" "Document 2"
"contains some text." "Document 2"
"Document 3 is very" "Document 3"
"different from" "Document 3"
"documents 1 and 2." "Document 3"
"However, it, too," "Document 3"
"contains some text." "Document 3"
Input Arguments
str: Input document, specified as a string array, character vector, or cell array of character vectors.
Data Types: string | char | cell
t: Input table of documents. t must have a column named Text that contains the documents. Each document must be specified as a string scalar.
Name-Value Arguments
Specify optional pairs of arguments as
Name1=Value1,...,NameN=ValueN, where Name is
the argument name and Value is the corresponding value.
Name-value arguments must appear after other arguments, but the order of the
pairs does not matter.
Example: splitTextChunks(str,TargetLength=100) sets the target length
of the output text chunks to 100.
SplitLevels: Levels at which to split the text, specified as a string array, character vector, or cell array of character vectors containing one or more of these elements:
- "paragraph"
- "sentence"
- "token"
- "character"
If you do not specify SplitLevels, then the function splits
the text at the levels of paragraphs, sentences, and tokens.
Example: SplitLevels = "token"
Example: SplitLevels = ["paragraph","sentence"]
Data Types: string | char | cell
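To see how SplitLevels changes the result, you can compare the default levels with a custom selection on the same input. The following is a minimal sketch that uses only the name-value arguments described on this page; the sample text and the variable names are illustrative. How the function handles a piece that is still longer than the target length at the last specified level is not described here, so the sketch only displays the results.

```matlab
% Sketch: compare the default split levels with custom selections.
% The sample text and variable names are illustrative.
str = "Chunking splits long text into smaller pieces. " + ...
    "Each piece should stay readable on its own. " + ...
    "Short pieces are often easier to search.";

% Default levels: paragraphs, then sentences, then tokens.
defaultChunks = splitTextChunks(str,TargetLength=40);

% Split at the sentence level only.
sentenceChunks = splitTextChunks(str,SplitLevels="sentence",TargetLength=40);

% Split into sentences first, then into tokens.
sentenceTokenChunks = splitTextChunks(str, ...
    SplitLevels=["sentence","token"],TargetLength=40);

% Display the results for comparison.
disp(defaultChunks)
disp(sentenceChunks)
disp(sentenceTokenChunks)
```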
TargetLength: Total target length of output text chunks, specified as a positive integer that represents the number of characters.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Complex Number Support: Yes
Output Arguments
chunkTable: Table of text chunks. chunkTable has a column Text that contains the text chunks, returned as string scalars.
If you specify the input documents as a table, then chunkTable
also contains all the variables in the table. For each chunk, the values of the
variables are the same as for the document from which the chunk originates.
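Because each chunk keeps the variables of its source document, you can group or filter the chunks with standard table operations. The following is a minimal sketch that reuses the table from the example on this page; the variable names counts and doc3Chunks are illustrative.

```matlab
% Sketch: use the retained metadata to group and filter chunks.
Text = ["Document 1 contains some text."; ...
    "Document 2 also contains some text."; ...
    "Document 3 is very different from documents 1 and 2. However, it, too, contains some text."];
DocumentName = ["Document 1";"Document 2";"Document 3"];
t = table(Text,DocumentName);

chunkTable = splitTextChunks(t,TargetLength=20);

% Count the chunks that originate from each source document.
counts = groupcounts(chunkTable,"DocumentName")

% Keep only the chunks that originate from one document.
doc3Chunks = chunkTable(chunkTable.DocumentName == "Document 3",:)
```

For the nine-row output shown in the example on this page, the group counts are 2, 2, and 5.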
More About
Many analysis tools, including large language models (LLMs), perform better on small chunks of text than on large documents. Text Analytics Toolbox includes a range of functions that allow you to split large documents into semantically meaningful chunks.
The splitTextChunks function splits a document recursively into text chunks
of a given target length. The function first splits a document into paragraphs. If any
of the paragraphs are longer than the target length, then the function splits those
paragraphs into sentences, and so on.
chunks = splitTextChunks(str);
1. Split your document into sections and preserve the section metadata using one of these functions:
- splitHTMLSections: Split an HTML-formatted document into HTML sections according to the section tags <h1>...</h1>, <h2>...</h2>, …, <h6>...</h6>.
- splitMarkdownSections: Split a Markdown-formatted document into Markdown sections, for example according to ATX section tags #, ##, …, ######.
- splitCustomSections: Split a document into custom sections according to custom section delimiters.
2. Split your documents or your chunks recursively into paragraphs, sentences, and tokens using the splitTextChunks function.
3. To avoid redundancy, join similar adjacent chunks using the joinSimilarTextChunks function.
4. Add overlap between adjacent text chunks using the addTextChunkOverlap function. Adding overlap reduces the risk that a split at an inopportune point changes the meaning of a sentence. For example, splitting the sentence "I would never say I love cats" into "I would never say" and "I love cats" reverses its meaning. Adding overlap in this example results in the two chunks "I would never say I love" and "never say I love cats." You can also add surrounding text to individual chunks as context by using the findTextChunkContext function.
For an example showing the advanced workflow, see Split Document Into Semantically Meaningful Text Chunks.
Retrieval-augmented generation (RAG) combines the text generation capabilities of large language models (LLMs) with reliable information contained in a set of source documents. First, retrieve documents relevant to the user prompt from the set of source documents. Then, append the relevant documents to the prompt and use the LLM to generate a response.
1. To improve the quality of the generated output, split large documents into smaller, semantically meaningful chunks.
2. Use information retrieval to identify the text chunks that are relevant to the query. For more information, see Information Retrieval with Document Embeddings.
3. Create a prompt based on the most relevant chunks. To provide the LLM with additional context, you can add text from adjacent chunks within the same section by using the findTextChunkContext function, or you can add overlap between text chunks before information retrieval by using the addTextChunkOverlap function. Create a Markdown-formatted string from the text chunks using the formatTextChunks function. For an example, see Create Large Language Model (LLM) Prompt from Text Chunk.
4. Generate an answer using an LLM. To connect to large language model APIs using MATLAB, use the Large Language Models (LLMs) with MATLAB add-on.
Version History
Introduced in R2026a