Check that *.txt file is really a TXT formatted file?

Question

1 vote

Hello!

How can I detect whether the content of a *.txt file is really text, before processing that file with my data import parser? I search folders for all files with the extension TXT in order to work with the data stored in each of them. In principle this works without problems. But sometimes a file has wrongly been saved with a *.TXT name while its content is not text but some binary format (e.g. it should have been named *.XLS).

Answer 1

Guillaume on 19 May 2015

1 vote

It all depends on what you call a text file.

If it's an ASCII file, then the code value of the characters is limited to 0-127, so you could test if any character has a value > 127. The presence of code values in the range 0-31 with the exception of 9 (tab), 10 and 13 (new lines) would also be a strong indication that the content is not meant to be read as text. It's not a guarantee though.
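A minimal sketch of the byte-range test described above, assuming plain ASCII is what the parser expects. The names isSuspect and looksLikeAscii, the temporary sample file, and the 4096-byte limit are illustrative choices, not part of the original answer.

```matlab
% Bytes that plain ASCII text should not contain: anything above 127, and
% control codes other than tab (9), line feed (10), carriage return (13).
isSuspect      = @(b) b > 127 | (b < 32 & ~ismember(b, [9 10 13]));
looksLikeAscii = @(b) ~any(isSuspect(b));

% Create a small sample file so the sketch runs on its own (hypothetical data).
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '1.5\t2.5\r\n');
fclose(fid);

% Read at most the first 4096 bytes as numbers, never as "lines".
fid = fopen(sampleFile, 'r');
head = fread(fid, [1 4096], 'uint8');
fclose(fid);
delete(sampleFile);

isText = looksLikeAscii(head);   % a heuristic, not a guarantee
```

Reading a bounded block keeps the check cheap on large files, at the cost of missing suspect bytes that only appear later in the file.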

If it's an extended ASCII file, then the whole range 0-255 is used. Other than semantics, there's nothing distinguishing a text file from a binary file. Again characters in the range [0-8, 11-12, 14-31] would be an indication.

If it's a UTF-8 file, some byte combinations are not allowed and you could try to detect them. Again, control codes in [0-8, 11-12, 14-31] are an indication that it's not meant to be text.
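One way to look for disallowed UTF-8 combinations is sketched below. It is a structural check only: it rejects bytes that can never occur in UTF-8 (192, 193, 245-255) and lead bytes that are not followed by the right number of continuation bytes (128-191). It does not catch every ill-formed sequence (for example overlong three-byte forms or encoded surrogates). All names here are illustrative.

```matlab
% Shift a logical row vector k places to the right, filling with false.
shiftR = @(m, k) [false(1, k), m(1:end-k)];

isCont     = @(b) b >= 128 & b <= 191;             % continuation bytes
neverValid = @(b) b == 192 | b == 193 | b >= 245;  % cannot occur in UTF-8

% Positions where a continuation byte must appear, derived from lead bytes:
% 194-223 need one, 224-239 need two, 240-244 need three.
expectedCont = @(b) shiftR(b >= 194 & b <= 244, 1) | ...
                    shiftR(b >= 224 & b <= 244, 2) | ...
                    shiftR(b >= 240 & b <= 244, 3);

check = @(b) ~any(neverValid(b)) && isequal(expectedCont(b), isCont(b));

% Pad with three zero bytes so a sequence truncated at the end is detected.
isStructurallyUtf8 = @(bytes) check([double(bytes(:).'), 0 0 0]);

okAccent  = isStructurallyUtf8([195 169]);  % UTF-8 bytes of U+00E9
badLatin1 = isStructurallyUtf8([233 97]);   % Latin-1 byte 233, then 'a'
```

Pure ASCII input always passes this check, so it is best combined with the control-code test rather than used alone.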

Perhaps, instead of trying to discriminate text files against binary, what you should be discriminating is files conforming to the format your code expects and those that don't?

5 Comments

Guillaume on 19 May 2015

Edited: Guillaume on 19 May 2015

@Walter,

Can MATLAB decode UTF-16? It's certainly not listed as an option for the encoding argument of fopen.

Also,

filestart = char(fread(fid, numel(expectedstart)))';
%or
filestart = char(fread(fid, [1 numel(expectedstart)]));
%or
filestart = fread(fid, [1 numel(expectedstart)], '*char');

would be more akin to fscanf. But fread only works if the characters are ASCII (or more precisely, just one byte per code point).

UTF-8 is identical to ASCII for code points below 128. Anything above that uses more than one byte per character.

Walter Roberson on 23 Aug 2016

Yes, MATLAB can decode UTF-16, both little endian and big endian. It can also decode UTF-32 little endian and big endian. For any of these MATLAB will issue a warning when you fopen() the file about the encoding not being supported, but really what that means is that MATLAB does not support writing files in those formats.

Answer 2

Stephen23 on 19 May 2015

Edited: Stephen23 on 19 May 2015

1 vote

It is important to note that files themselves have no semantic meaning: they are merely lots of bits that can be interpreted in a particular way, given a known encoding. To answer your question you really need to answer this question: What exactly is a text file?

Here are two methods that you could try:

1. Read the file data, and check that all of the "characters" are within the expected character range (e.g. alphanumeric, punctuation, spaces, etc). This would work best when the data is of a limited kind (e.g. numeric data) and uses only a small character set (e.g. ASCII). This is also dependent on character encoding/format, and several other factors so it is very fragile in practice.
2. Read the first few bytes and check if it matches any known file signature. This is also fragile in practice, as it would miss formats not covered by the list of signatures.
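The second method could be sketched as follows. The signature table is a small illustrative sample, not a complete list, and the names signatures, startsWithBytes and detectSignatures are made up for this example. The byte values are the standard magic numbers for OLE2 compound files (legacy XLS/DOC), ZIP containers (XLSX/DOCX), PDF and PNG.

```matlab
% Illustrative signature table: name plus leading bytes (decimal values).
signatures = struct( ...
    'name',  {'OLE2 (legacy XLS/DOC)', 'ZIP (XLSX/DOCX)', 'PDF', 'PNG'}, ...
    'bytes', {[208 207 17 224 161 177 26 225], ...   % D0 CF 11 E0 A1 B1 1A E1
              [80 75 3 4], ...                       % 'PK' 03 04
              [37 80 68 70], ...                     % '%PDF'
              [137 80 78 71 13 10 26 10]});          % 89 'PNG' 0D 0A 1A 0A

% True when the row vector head begins with the row vector sig.
startsWithBytes = @(head, sig) numel(head) >= numel(sig) && ...
                               isequal(head(1:numel(sig)), sig);

% Cell array with the names of all matching signatures (empty if none).
detectSignatures = @(head) {signatures(arrayfun( ...
    @(s) startsWithBytes(head, s.bytes), signatures)).name};

% In practice head would come from: fread(fid, [1 8], 'uint8')
head  = [208 207 17 224 161 177 26 225 0 0];
found = detectSignatures(head);
```

An empty result only means that none of the listed signatures matched, so this is better suited to rejecting known binary formats than to confirming that a file is text.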

3 Comments

Stephen23 on 19 May 2015

Edited: Stephen23 on 21 May 2015

It won't crash, but don't use fgetl: it reads up to the next newline character, and in a binary file there may be no combination of bits that looks like a newline. That single "line" could then end up being 5 GiB of random data, or however big the file might be.

A better solution would be to use fscanf, as Guillaume explained, and to read only as many characters as you need to identify the file. You can find more useful file reading functions here:

http://www.mathworks.com/help/matlab/low-level-file-i-o.html

Because you already know the expected first characters, you can simply check that the file starts with them.
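A minimal sketch of that check, using a bounded fread as in Guillaume's comment. The header string and the temporary sample file are hypothetical, and the precision 'uint8=>char' assumes the expected start uses one byte per character.

```matlab
% Hypothetical header that every valid data file is assumed to start with.
expectedstart = '# MyData v1';

% Create a small sample file so the sketch runs on its own.
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '%s\n1 2 3\n', expectedstart);
fclose(fid);

% Read exactly as many bytes as the expected start, never a whole "line".
fid = fopen(sampleFile, 'r');
filestart = fread(fid, [1 numel(expectedstart)], 'uint8=>char');
fclose(fid);
delete(sampleFile);

% strcmp is false if the file is shorter than expected or differs anywhere.
hasExpectedStart = strcmp(filestart, expectedstart);
```

Files that fail this test can be skipped or logged before the import parser ever sees them.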

Marco on 19 May 2015

Thanks a lot, really helpful! As I could only accept one answer, I at least gave you my vote.
