textanalytics.unicode.nfd

Unicode decomposed normalized form (NFD)

    Description

    newStr = textanalytics.unicode.nfd(str) normalizes the string str to the Unicode canonical decomposition form (NFD).

    Examples

    Strings that look identical can have different underlying representations. The Unicode canonical decomposition form (NFD) gives canonically equivalent strings the same sequence of code points. This is useful when strings contain accented characters, which can often be represented in more than one way.

    Consider the string "jalapeño" which contains 8 letters.

    str = "jalapeño";
    strlength(str)
    ans = 8
    

    Normalize the string using the textanalytics.unicode.nfd function. Depending on your system, the output string may appear to be identical to the input string.

    newStr = textanalytics.unicode.nfd(str)
    newStr = 
    "jalapeño"
    

    View the number of code points in the new string.

    strlength(newStr)
    ans = 9
    

    Notice that the normalized representation includes one extra code point. In this case, the function splits the accented letter "ñ" into two separate code points. Extract the 7th and 8th code points in the normalized string. Depending on your system, the output may appear to be a single character.

    extractBetween(newStr,7,8)
    ans = 
    "ñ"
    
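    One way to confirm the split is to look at the numeric code point values. The following is a minimal sketch that uses only the base MATLAB functions char and double together with the function documented here. The variable name codePoints is illustrative. The input is built from char(241) (U+00F1, the precomposed "ñ") so that the example does not depend on how the text itself is encoded. Under canonical decomposition, the two values are expected to be 110 (U+006E, "n") and 771 (U+0303, combining tilde).

```matlab
% Build the input from an explicit code point: char(241) is U+00F1,
% the precomposed letter "ñ".
str = "jalape" + char(241) + "o";
newStr = textanalytics.unicode.nfd(str);

% Convert the 7th and 8th code points to their numeric values.
% codePoints is an illustrative variable name.
codePoints = double(char(extractBetween(newStr,7,8)))
```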

    Check whether the strings str and newStr are equal using the == operator. The operator returns false (logical 0) because the strings have different underlying representations.

    tf = str == newStr
    tf = logical
       0
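
    If the goal is to treat such strings as equal, one possible approach is to normalize both operands before comparing them. The sketch below assumes that applying NFD to an already normalized string leaves it unchanged, which is a property of the normalization forms described in [1]. The variable name tfNormalized is illustrative.

```matlab
% char(241) is U+00F1, the precomposed letter "ñ".
str = "jalape" + char(241) + "o";
newStr = textanalytics.unicode.nfd(str);

% Direct comparison sees different code point sequences.
tfRaw = str == newStr

% Normalizing both sides first compares the canonical forms instead.
tfNormalized = textanalytics.unicode.nfd(str) == textanalytics.unicode.nfd(newStr)
```

    Here tfRaw is expected to be false and tfNormalized is expected to be true.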
    
    

    Input Arguments

    Input text str, specified as a string array, character vector, or cell array of character vectors.

    Example: ["An example of a short sentence."; "A second short sentence."]

    Data Types: string | char | cell

    Output Arguments

    Output text newStr, returned as a string array, character vector, or cell array of character vectors. str and newStr have the same data type.
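
    The matching data types can be checked with a short loop over the three supported input types. This is a sketch only: the sample word and the variable names inputs, in, and out are illustrative, and char(233) is U+00E9, the precomposed "é".

```matlab
e = char(233);                      % U+00E9, precomposed "é"

% One sample of each supported input type (illustrative values).
inputs = {"caf" + e, ['caf' e], {['caf' e]}};

for k = 1:numel(inputs)
    in = inputs{k};
    out = textanalytics.unicode.nfd(in);

    % Display the input and output classes side by side.
    fprintf("%s -> %s\n", class(in), class(out));
end
```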

    References

    [1] Unicode Standard Annex #15: Unicode Normalization Forms. https://unicode.org/reports/tr15/

    Introduced in R2021a