splitTextChunks

Split documents recursively into text chunks

Since R2026a

    Description

    Split documents recursively into text chunks.

    Text often comes in large documents. Many analysis tools, including large language models (LLMs), perform better on small chunks of text. Text Analytics Toolbox™ includes a range of functions that allow you to split large documents into semantically meaningful chunks.

    chunkTable = splitTextChunks(str) recursively splits the document str into text chunks of the default target length or shorter. By default, the function first splits str into paragraphs, then splits any paragraphs longer than the target length into sentences, then splits any sentences longer than the target length into tokens.

    chunkTable = splitTextChunks(t) splits a table of documents t into smaller text chunks.

    chunkTable = splitTextChunks(___,Name=Value) specifies additional options using one or more name-value arguments. For example, to set a target length of 100 characters, set TargetLength to 100.
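    The default recursion described above (paragraphs, then sentences, then tokens) can be illustrated with a small base-MATLAB sketch. This is a conceptual illustration only, not the toolbox implementation: the helper name sketchSplit, the regular expressions used to find paragraph, sentence, and token boundaries, and the greedy packing of adjacent pieces are all assumptions made for this example.

```matlab
% Conceptual sketch only; NOT how splitTextChunks is implemented.
str = "Document 3 is very different from documents 1 and 2. However, it, too, contains some text.";

% Assumed boundary patterns, ordered from coarsest to finest level.
patterns = ["\n\s*\n", ...         % paragraph: blank line (assumption)
            "(?<=[.!?])\s+", ...   % sentence: whitespace after . ! ? (assumption)
            "\s+"];                % token: any whitespace (assumption)

chunks = sketchSplit(str,20,patterns);
disp(chunks)

function chunks = sketchSplit(txt,targetLength,patterns)
% Hypothetical helper: split txt at the first pattern, pack adjacent
% pieces into chunks of at most targetLength characters, and recurse
% with the remaining patterns on any piece that is still too long.
chunks = strings(0,1);
if strlength(txt) <= targetLength || isempty(patterns)
    chunks = txt;   % short enough, or no finer level left to try
    return
end
pieces = string(regexp(char(txt),char(patterns(1)),'split'));
pieces = strip(pieces(:));
pieces(pieces == "") = [];
current = "";
for k = 1:numel(pieces)
    piece = pieces(k);
    if strlength(piece) > targetLength
        % Flush the pending chunk, then split this piece at finer levels.
        if current ~= ""
            chunks(end+1,1) = current;
            current = "";
        end
        chunks = [chunks; sketchSplit(piece,targetLength,patterns(2:end))];
    elseif current == ""
        current = piece;
    elseif strlength(current) + 1 + strlength(piece) <= targetLength
        current = current + " " + piece;   % piece still fits in this chunk
    else
        chunks(end+1,1) = current;         % chunk is full, start a new one
        current = piece;
    end
end
if current ~= ""
    chunks(end+1,1) = current;
end
end
```

    With a target length of 20, the sketch splits this one document into five chunks, each at most 20 characters long. The toolbox function may draw boundaries differently, so treat the sketch as a mental model rather than a replacement.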

    Examples

    Load the example data. The file sonnets.txt contains Shakespeare's sonnets in plain text. Extract the text from sonnets.txt using the extractFileText function.

    str = extractFileText("sonnets.txt");

    Split str into text chunks using the splitTextChunks function. Specify the target length as 50.

    chunkTable = splitTextChunks(str,TargetLength=50)
    chunkTable=2322×1 table
                                Text                        
        ____________________________________________________
    
        "THE SONNETS  by William Shakespeare  I"            
        "From fairest creatures we desire increase, That"   
        "thereby beauty's rose might never die, But as the" 
        "riper should by time decease, His tender heir"     
        "might bear his memory: But thou, contracted to"    
        "thine own bright eyes, Feed'st thy light's flame"  
        "with self-substantial fuel, Making a famine where" 
        "abundance lies, Thy self thy foe, to thy sweet"    
        "self too cruel: Thou that art now the world's"     
        "fresh ornament, And only herald to the gaudy"      
        "spring, Within thine own bud buriest thy content," 
        "And tender churl mak'st waste in niggarding: Pity" 
        "the world, or else this glutton be, To eat the"    
        "world's due, by the grave and thee."               
        "II"                                                
        "When forty winters shall besiege thy brow, And dig"
          ⋮
    
    

    To split multiple documents into chunks in a single table, create a table of documents and then use splitTextChunks with the table of documents as the input.

    To retain metadata about the source documents, such as their filenames, add the metadata to the table as additional variables.

    Create a table from:

    • A variable Text that contains documents.

    • A variable DocumentName that contains the names of the documents.

    str1 = "Document 1 contains some text.";
    str2 = "Document 2 also contains some text.";
    str3 = "Document 3 is very different from documents 1 and 2. However, it, too, contains some text.";
    Text = [str1;str2;str3];
    DocumentName = ["Document 1";"Document 2";"Document 3"];
    t = table(Text,DocumentName)
    t=3×2 table
                                                    Text                                                DocumentName
        ____________________________________________________________________________________________    ____________
    
        "Document 1 contains some text."                                                                "Document 1"
        "Document 2 also contains some text."                                                           "Document 2"
        "Document 3 is very different from documents 1 and 2. However, it, too, contains some text."    "Document 3"
    
    

    Split the table of documents into text chunks using the splitTextChunks function. Specify the target length as 20.

    chunkTable = splitTextChunks(t,TargetLength=20)
    chunkTable=9×2 table
                Text             DocumentName
        _____________________    ____________
    
        "Document 1 contains"    "Document 1"
        "some text."             "Document 1"
        "Document 2 also"        "Document 2"
        "contains some text."    "Document 2"
        "Document 3 is very"     "Document 3"
        "different from"         "Document 3"
        "documents 1 and 2."     "Document 3"
        "However, it, too,"      "Document 3"
        "contains some text."    "Document 3"
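    Because each chunk keeps the metadata of its source document, the chunk table can be summarized or filtered with ordinary table operations. The following is a minimal sketch that assumes Text Analytics Toolbox with splitTextChunks is available; the variable names chunkCounts, doc3Chunks, and NumCharacters are chosen here for illustration only.

```matlab
% Rebuild the table of documents from the example.
Text = ["Document 1 contains some text."; ...
    "Document 2 also contains some text."; ...
    "Document 3 is very different from documents 1 and 2. However, it, too, contains some text."];
DocumentName = ["Document 1";"Document 2";"Document 3"];
t = table(Text,DocumentName);

chunkTable = splitTextChunks(t,TargetLength=20);

% Count how many chunks each source document produced.
chunkCounts = groupcounts(chunkTable,"DocumentName");

% Keep only the chunks that originate from one document.
doc3Chunks = chunkTable(chunkTable.DocumentName == "Document 3",:);

% Record the length of each chunk, in characters, next to its text.
chunkTable.NumCharacters = strlength(chunkTable.Text);
```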
    
    

    Input Arguments

    Input document str, specified as a string array, character vector, or cell array of character vectors.

    Data Types: string | char | cell

    Input table of documents. t must have a column named Text that contains the documents. Each document must be specified as a string scalar.

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: splitTextChunks(str,TargetLength=100) sets the target length of the output text chunks to 100.

    Levels at which to split the text (SplitLevels), specified as a string array, character vector, or cell array of character vectors containing one or more of these elements:

    • "paragraph"

    • "sentence"

    • "token"

    • "character"

    If you do not specify SplitLevels, then the function splits the text at the levels of paragraphs, sentences, and tokens.

    Example: SplitLevels = "token"

    Example: SplitLevels = ["paragraph","sentence"]

    Data Types: string | char | cell
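    As a short sketch of combining the two name-value arguments, the call below restricts splitting to paragraphs and sentences. The sample text is made up for illustration. The page does not state how paragraph boundaries are detected or what happens to a sentence that is still longer than the target length once no finer level is allowed, so the sketch inspects the chunk lengths instead of assuming a result.

```matlab
% Made-up sample text: two paragraphs separated by a blank line.
str = "First paragraph. It has two sentences." + newline + newline + ...
    "Second paragraph, with a single sentence.";

% Split at paragraph and sentence level only; the order of the
% name-value arguments does not matter.
chunkTable = splitTextChunks(str, ...
    SplitLevels=["paragraph","sentence"], ...
    TargetLength=30);

% Inspect the chunk lengths, in characters.
chunkLengths = strlength(chunkTable.Text);
disp(chunkLengths)
```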

    Total target length of output text chunks (TargetLength), specified as a positive integer that represents the number of characters.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
    Complex Number Support: Yes

    Output Arguments

    Table of text chunks. chunkTable has a column Text that contains the text chunks returned as string scalars.

    If you specify the input documents as a table, then chunkTable also contains all the variables in the table. For each chunk, the values of the variables are the same as for the document from which the chunk originates.

    Version History

    Introduced in R2026a