Check that *.txt file is really a TXT formatted file?

Hello!
How can I detect whether the content of a *.txt file is really text formatted before processing that file with my data import parser? I search folders for all files with the extension TXT in order to work with the data stored in each of them. In principle, no problem so far. But sometimes a file has wrongly been saved with a *.TXT name although its content is not text but some binary format (e.g. it should have been named *.XLS).

Accepted Answer

Guillaume
Guillaume on 19 May 2015
It all depends on what you call a text file.
If it's an ASCII file, then the code value of the characters is limited to 0-127, so you could test if any character has a value > 127. The presence of code values in the range 0-31 with the exception of 9 (tab), 10 and 13 (new lines) would also be a strong indication that the content is not meant to be read as text. It's not a guarantee though.
If it's an extended ASCII file, then the whole range 0-255 is used. Other than semantics, there's nothing distinguishing a text file from a binary file. Again characters in the range [0-8, 11-12, 14-31] would be an indication.
If it's a UTF-8 file, there are some byte combinations that are not allowed and you could try to detect them. Again, code values in [0-31] other than tab and the new line characters are an indication that it's not meant to be text.
Perhaps, instead of trying to discriminate text files against binary, what you should be discriminating is files conforming to the format your code expects and those that don't?
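The byte-range heuristic described above could be sketched roughly as follows. This is only an illustration, not a definitive test: the function name looksLikeAsciiText and the 4096-byte sample size are assumptions made for this example, and as noted above, passing the check is no guarantee that the file is meaningful text.

```matlab
function tf = looksLikeAsciiText(filename, nbytes)
% LOOKSLIKEASCIITEXT  Heuristic check of the first NBYTES bytes of a file.
% Hypothetical helper (not a MATLAB built-in). Returns true when no byte
% is > 127 and no control code other than tab, LF or CR is present.
if nargin < 2
    nbytes = 4096;                          % sample size, arbitrary assumption
end
fid = fopen(filename, 'r');
assert(fid ~= -1, 'Could not open %s', filename);
closer = onCleanup(@() fclose(fid));        % close the file even on error

bytes = fread(fid, [1 nbytes], '*uint8');   % raw bytes, no text decoding
allowedControl = [9 10 13];                 % tab, line feed, carriage return
isControl  = bytes < 32 & ~ismember(bytes, allowedControl);
isNonAscii = bytes > 127;
tf = ~any(isControl | isNonAscii);          % an empty file counts as text here
end
```

For extended ASCII or UTF-8 input, the isNonAscii test would have to be dropped or replaced, since bytes above 127 are legitimate there.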

5 Comments

Marco
Marco on 19 May 2015
Edited: Marco on 19 May 2015
Good information, I will keep it in mind for general cases! In my case I even know that the first characters have to be the string "Instrument XYZ data file" (without the quotes). But I am unsure how to read the first "characters" if they are not characters, without my program crashing with an error message (see comment above, to Stephen). How can I safely read the first characters, or bytes, without knowing what's actually coming?
At the storage level, there's no real difference between text and binary. It's just a sequence of bytes given different meaning. Therefore, reading binary data will not crash your program. You'll just get a strange sequence of characters if it is binary data. Your program may then error, but that's because you've not checked that the data you've read is what you expect. When dealing with external data (files, user input, etc.) you should always check that it conforms to your expectation anyway.
If the files you're dealing with always start with the same string, then check that the file indeed starts with that string. You can use fscanf (or even fread):
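fid = fopen('somefile', 'rt');
assert(fid ~= -1, 'could not open file');   % fopen returns -1 on failure
expectedstart = 'Instrument XYZ data file';
% read at most as many characters as the expected header contains
filestart = fscanf(fid, '%c', numel(expectedstart));
assert(strcmp(filestart, expectedstart), ...
    'input file is not an instrument file');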
Note that even if the file passes this first check, you should be prepared for some data later on in the file not being what it should be.
filestart = char(fread(fid, numel(expectedstart)));
fread is more efficient than fscanf with %c format. However, using 'rt' and fscanf() will allow UTF-8 and UTF-16 to be interpreted. For those characters, UTF-8 would be the same as bytes, but for UTF-16 there would be a byte pair at the beginning marking that it is in UTF-16 form.
@Walter,
Can MATLAB decode UTF-16? It's certainly not listed as an option for the encoding argument of fopen.
Also,
filestart = char(fread(fid, numel(expectedstart)))';
%or
filestart = char(fread(fid, [1 numel(expectedstart)]));
%or
filestart = fread(fid, [1 numel(expectedstart)], '*char');
would be more akin to fscanf. But fread only works if the characters are ASCII (or more precisely, just one byte per code point).
UTF-8 is the same as bytes for code points < 128. Anything above that uses more than one byte per character.
Yes, MATLAB can decode UTF-16, both little endian and big endian. It can also decode UTF-32 little endian and big endian. For any of these MATLAB will issue a warning when you fopen() the file about the encoding not being supported, but really what that means is that MATLAB does not support writing files in those formats.
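The byte pair mentioned above that marks a UTF-16 file is the byte order mark (BOM). A minimal sketch for detecting it from the raw bytes, assuming the file actually starts with a BOM (many UTF-8 files do not), might look like this. The function name guessEncodingFromBom is hypothetical, and the returned strings are just labels chosen for this example.

```matlab
function enc = guessEncodingFromBom(filename)
% GUESSENCODINGFROMBOM  Look for a Unicode byte order mark at the file start.
% Hypothetical helper (not a MATLAB built-in). Returns '' when no BOM is found.
fid = fopen(filename, 'r');
assert(fid ~= -1, 'Could not open %s', filename);
closer = onCleanup(@() fclose(fid));    % close the file even on error

b = fread(fid, [1 4], '*uint8');        % at most the first four raw bytes
if numel(b) >= 3 && isequal(b(1:3), uint8([239 187 191]))       % EF BB BF
    enc = 'UTF-8';
elseif numel(b) >= 4 && isequal(b(1:4), uint8([255 254 0 0]))   % FF FE 00 00
    enc = 'UTF-32LE';   % tested before UTF-16LE, which shares the first two bytes
elseif numel(b) >= 4 && isequal(b(1:4), uint8([0 0 254 255]))   % 00 00 FE FF
    enc = 'UTF-32BE';
elseif numel(b) >= 2 && isequal(b(1:2), uint8([255 254]))       % FF FE
    enc = 'UTF-16LE';
elseif numel(b) >= 2 && isequal(b(1:2), uint8([254 255]))       % FE FF
    enc = 'UTF-16BE';
else
    enc = '';                           % no BOM: plain ASCII, UTF-8 without BOM, or binary
end
end
```

An empty result says nothing about whether the file is text; it only means no BOM was present.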


More Answers (1)

Stephen23
Stephen23 on 19 May 2015
Edited: Stephen23 on 19 May 2015
It is important to note that files themselves have no semantic meaning: they are merely lots of bits that can be interpreted in a particular way, given a known encoding. To answer your question you really need to answer this question: What exactly is a text file?
Here are two methods that you could try:
  • Read the file data, and check that all of the "characters" are within the expected character range (e.g. alphanumeric, punctuation, spaces, etc). This would work best when the data is of a limited kind (e.g. numeric data) and uses only a small character set (e.g. ASCII). This is also dependent on character encoding/format, and several other factors so it is very fragile in practice.
  • Read the first few bytes and check if it matches any known file signature. This is also fragile in practice, as it would miss formats not covered by the list of signatures.
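The second method, comparing the first few bytes against known file signatures, could be sketched as below. The function name matchFileSignature and the short signature table are assumptions for illustration only; a real list would need many more entries, which is exactly the fragility mentioned above.

```matlab
function name = matchFileSignature(filename)
% MATCHFILESIGNATURE  Compare the first bytes of a file with known signatures.
% Hypothetical helper (not a MATLAB built-in). Returns '' when nothing matches.
signatures = {
    'OLE2 compound file (e.g. legacy .xls)', [208 207 17 224 161 177 26 225]
    'ZIP container (e.g. .xlsx)',            [80 75 3 4]
    'PDF',                                   [37 80 68 70]
    };
fid = fopen(filename, 'r');
assert(fid ~= -1, 'Could not open %s', filename);
closer = onCleanup(@() fclose(fid));    % close the file even on error

head = fread(fid, [1 8], '*uint8');     % longest signature above is 8 bytes
name = '';
for k = 1:size(signatures, 1)
    sig = uint8(signatures{k, 2});
    if numel(head) >= numel(sig) && isequal(head(1:numel(sig)), sig)
        name = signatures{k, 1};        % first matching entry wins
        return
    end
end
end
```

A non-empty result for a file named *.txt is a strong hint that the file was misnamed; an empty result only means that none of the listed signatures matched.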

3 Comments

Marco
Marco on 19 May 2015
Edited: Marco on 19 May 2015
My file is ASCII (see also comment given to Guillaume), and if reading it as explained in the documentation, with
line=fgetl(fid)
this works. But do you mean I could also use the fgetl command on a binary file without my program crashing? I would use fgetl for whatever file format comes and check whether I find my expected ASCII characters. If a binary file is opened instead, fgetl would not crash with an error message but would simply read the next "line" from the binary file and put some strange-looking data into my line variable?
Stephen23
Stephen23 on 19 May 2015
Edited: Stephen23 on 21 May 2015
It won't crash, but don't use fgetl: it reads up to the next newline character, and a binary file may contain no byte sequence that looks like a newline. So this simple "line" could end up being 5 GiB of random data... or however big that file might be.
A better solution would be to use fscanf, as Guillaume explained, and to read just the number of bytes that you need to identify the file. You can find more useful file reading functions here:
And because you already know the first characters, then you can simply check that these are what the file contains.
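As a rough sketch of that bounded read, the check could be wrapped in a small function that returns true or false instead of raising an error, so it can be used to skip misnamed files. The name hasExpectedHeader is hypothetical, and the sketch assumes the header consists of single-byte (ASCII) characters.

```matlab
function tf = hasExpectedHeader(filename, expectedstart)
% HASEXPECTEDHEADER  True if the file starts with the char vector EXPECTEDSTART.
% Hypothetical helper (not a MATLAB built-in). Reads only as many bytes as
% the header is long, so a large binary file is never read in full.
tf = false;
fid = fopen(filename, 'r');
if fid == -1
    return                              % unreadable files are treated as "no match"
end
closer = onCleanup(@() fclose(fid));    % close the file even on error

raw = fread(fid, [1 numel(expectedstart)], '*uint8');   % raw bytes only
filestart = char(raw);                  % valid for single-byte characters
tf = strcmp(filestart, expectedstart);  % shorter or different content gives false
end
```

It could then be called for each file found with dir('*.txt'), for example hasExpectedHeader(name, 'Instrument XYZ data file'), and only the files for which it returns true would be passed on to the parser.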
Thanks a lot, really helpful! As I could only accept one answer, I at least gave you my vote.

