How to read a UTF-8 encoded text file as a single character vector including white spaces and unicode special characters?
Deepu George Kurian
on 14 May 2020
Edited: Rik
on 19 Feb 2021
I am trying to read a UTF-8 encoded .txt file, "data.txt" containing sample info like this.
<title>
Fate/kaleid liner Prisma☆Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>
If I try:
data = fileread('data.txt');
Sample read data:
<title>
Fate/kaleid liner Prisma☆Illya (Fate/Kaleid Liner Prisma Illya) - MyAnimeList.net
</title>
I lose the UTF-8 encoded special characters. Here, '☆' is not read back as a single character but as a garbled sequence of its raw bytes.
If I try:
file = fopen('data.txt','r','n','UTF-8');
data = fscanf(file, '%s');
fclose(file);
Sample read data:
<title>Fate/kaleidlinerPrisma☆Illya(Fate/KaleidLinerPrismaIllya)-MyAnimeList.net</title>
This retains the Unicode characters but loses all the whitespace characters.
If I try:
file = fopen('data.txt','r','n','UTF-8');
data = textscan(file, '%s');
fclose(file);
Sample read data:
11×1 cell array
{'<title>' }
{'Fate/kaleid' }
{'liner' }
{'Prisma☆Illya' }
{'(Fate/Kaleid' }
......
The result is a cell array split at every whitespace character, even though all the Unicode characters are read correctly.
Can you suggest a way to overcome this issue?
Accepted Answer
Walter Roberson
on 14 May 2020
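file = fopen('data.txt','r','n','UTF-8');  % open for reading, decoding as UTF-8
data = fread(file, [1 inf], '*char');      % read the whole file, whitespace included, as one row char vector
fclose(file);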
3 Comments
Walter Roberson
on 14 May 2020
You have a problem: 11577.txt and New.txt are ISO-8859-1 (Latin-1) encoded, but 14829.txt is UTF-8 encoded.
It is sometimes possible to tell the difference between the two encodings, but there is no built-in routine for doing that.
If you were using R2020a or later, then fileread() would be enough: R2020a improved encoding detection and automatic use of encodings.
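One way to attempt the distinction mentioned above is a decode and re-encode round trip on the raw bytes. The following is only a sketch, not a routine from this thread: `isLikelyUtf8` is a hypothetical helper name, the byte vectors are made-up samples, and it assumes `native2unicode` substitutes a replacement character for byte sequences that are invalid in UTF-8, so the round trip no longer reproduces the input.

```matlab
% Sketch: bytes "look like" UTF-8 if decoding and re-encoding them is lossless.
% isLikelyUtf8 is a hypothetical helper name, not a MathWorks function.
isLikelyUtf8 = @(b) isequal(unicode2native(native2unicode(b, 'UTF-8'), 'UTF-8'), b);

utf8Bytes   = uint8([80 226 152 134]);  % 'P' followed by U+2606 encoded as UTF-8
latin1Bytes = uint8([99 97 102 233]);   % 'caf' followed by e-acute (0xE9) in ISO-8859-1

checkUtf8   = isLikelyUtf8(utf8Bytes);
checkLatin1 = isLikelyUtf8(latin1Bytes);

% Choose the decoder for the second sample based on the check
if checkLatin1
    txt = native2unicode(latin1Bytes, 'UTF-8');
else
    txt = native2unicode(latin1Bytes, 'ISO-8859-1');
end
```

For a real file, the bytes would come from `fread(fid, [1 inf], '*uint8')` on a file opened without an encoding argument. The check is a heuristic: pure ASCII passes for both encodings, and a Latin-1 file can, in rare cases, happen to be valid UTF-8.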
More Answers (2)
Rik
on 14 May 2020
Edited: Rik
on 14 May 2020
I wrote the readfile function for this purpose. It returns a cell array with one cell per line, but you can concatenate the cells back into one long char array if you prefer.
data=cell2mat(readfile('data.txt'));
Note: this removes all newlines. You can replace them with spaces like this:
data=readfile('data.txt');
data(2,:)={' '};
data=data(:)';
data=cell2mat(data);
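As an alternative to interleaving the separator cells by hand, `strjoin` can merge the lines with a chosen delimiter. This is a minimal sketch under assumptions: the cell array `lines` is a made-up stand-in for the output of readfile, assumed here to be a 1-by-n cell array of char vectors.

```matlab
% Hypothetical stand-in for the output of readfile('data.txt'):
% a 1-by-n cell array with one char vector per line of the file
lines = {'<title>', 'Fate/kaleid liner Prisma Illya', '</title>'};

% Join the lines into one char vector, separated by single spaces
dataSpaces = strjoin(lines, ' ');

% Or keep the line breaks by joining with a newline character instead
dataLines = strjoin(lines, char(10));
```

Joining with `char(10)` keeps the original line structure inside the single char vector, which may be closer to what `fread` with `'*char'` returns.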
MathWorks Support Team
on 19 Feb 2021
As of MATLAB R2020a, fileread accomplishes the desired task.
1 Comment
Rik
on 19 Feb 2021
Edited: Rik
on 19 Feb 2021
This is not quite true, as it doesn't work on all Unicode characters:
fid=fopen('foo.txt','w','n','UTF-8');
fprintf(fid,'%s','😀');
fclose(fid);
fid=fopen('foo.txt','rb');
fread(fid).' % display raw bytes
fclose(fid);
fileread('foo.txt') % show fileread result