Extract data from a non-rectangular text file, efficiently

In spite of going through Matlab help and several attempts, I still do not get how to use "textscan" or other relevant functions to read from a simple non-rectangular file like the attached sample.
All I want is to efficiently extract two integers from the first line and two cell arrays (or char arrays), each containing the strings respectively to the left and to the right of | in the subsequent lines. Here is the code I tried, but it does not work:
fid = fopen('dataset.txt');
coeffs = textscan(fid, '%s%s', 1, 'Delimiter', ' ');
sequences = textscan(fid, '%s%s', 'Delimiter', '|');
fclose(fid);
Using "fileread" and "regexp" also seems to be an option, but "regexp" seems slower than "textscan".
BTW if anybody can point me to a link where I can find how to extract data from files using Matlab, with examples providing ample coverage of use cases, that would be great.

Accepted Answer

dpb on 5 Mar 2017
Edited: dpb on 5 Mar 2017
Close, but the first record is numeric, not string...
>> fid=fopen('dataset.txt');
>> n=cell2mat(textscan(fid,'%f',2,'collectoutput',1))
n =
50
200
>> s=textscan(fid,'%s%s','delimiter','|','collectoutput',1)
s =
{5x2 cell}
>> s{:}
ans =
'GATGGAGTGCGGGGTGGTTGACTAGCATGGGCCCTAGGATCGCTGACGTG' 'TTGACCAAGAACAAACGTTGTACGTATTTTCGATATAATACAGTAAGCTA'
'ACTTTTTACTAAACATAAGTTCGATTTCCACATCTTCCCGCGACCATCAG' 'TTACAGCCTGCTAATACGTTCTGTTTAAATGCGTAATTAGTAGCGCTCAG'
'TGCATGAACGACGGTAGGTCCACCCGTTGTAATGCGATAGCCTATGTAGC' 'CACAAGTTCATTTTTCAAATCGATAACCTGTGGGAGTATTCTTCGGCATC'
'GATCTTGCAGGCCGGGCGCTGGCGATCTGCGCGCGACATGGCCTGCAGTG' 'GCGACCTGCTTTTCGGTTGTAACGGGAGTGCGCCTACGCGCGCAAGATAC'
'TGAGTTTAGTCACTGATCTATAACACCAAGTGGGCGCGGTAGCCGATTAG' 'CATCTTCCCGCGACCATCAGGTTTGCCCCAGTAACGCGCCTGTTGCCTGT'
>> fid=fclose(fid);
>>
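For anyone who wants to try this without the attached file, here is a minimal sketch of the same two-step read as a script. The file name demo_dataset.txt and the short sequences in it are made up for illustration; only the layout (two numbers on the first line, then left|right pairs) follows the question.

```matlab
% Write a small stand-in file: two integers, then "left|right" records.
% File name and contents are hypothetical, not the asker's data.
fname = fullfile(tempdir, 'demo_dataset.txt');
fid = fopen(fname, 'w');
fprintf(fid, '4 8\n');
fprintf(fid, 'GATG|TTGA\n');
fprintf(fid, 'ACTT|TTAC\n');
fprintf(fid, 'TGCA|CACA\n');
fclose(fid);

fid = fopen(fname, 'r');
% Step 1: the header record holds two numeric fields.
n = cell2mat(textscan(fid, '%f', 2));
% Step 2: every remaining record is two strings split at '|'.
s = textscan(fid, '%s%s', 'Delimiter', '|', 'CollectOutput', true);
fclose(fid);

left  = s{1}(:, 1);   % strings to the left of '|'
right = s{1}(:, 2);   % strings to the right of '|'
delete(fname);        % remove the temporary demo file
```

The second textscan call simply continues from where the first one stopped, which is why the order of the calls has to follow the order of the records in the file.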
As for the lament, textscan has a fair number of examples of different types of files to study and the forum here is replete with special cases. It's not possible to have examples that cover all possible issues; one needs must look for the general principles underlying the examples and consider how they relate to the file at hand.
For the most part, the issue is simply building a format string that matches the record structure and then applying that to the file in the proper sequence for the proper number of times. The most difficult issue in general does have to do with processing undelimited strings or fixed-width files, as the C format rules on field width are based on the concept of fields separated by white space as opposed to actual firm character column counts; hence when one uses '%s' on a field that isn't really wanted to be treated as such, havoc can often ensue...
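To illustrate the fixed-width point, here is a small sketch with invented records where the fields sit in fixed columns rather than being separated consistently by white space. The column layout (1-4 ID, 5-8 count, 9-12 tag) is an assumption made up for the example. It shows '%s' splitting at white space instead of at the column boundaries, and plain column indexing as one possible workaround (not the only one).

```matlab
% Hypothetical fixed-width records: cols 1-4 = ID, 5-8 = count, 9-12 = tag.
recs = ['AB01  12XY  '; ...
        'CD02 345ZZ  '];            % char matrix, one record per row

% '%s' breaks at white space, so the count and tag run together
% in the first record: the result is {'AB01'; '12XY'}.
c = textscan(recs(1, :), '%s');
naive = c{1};

% Slicing by column index respects the actual field boundaries.
id    = cellstr(recs(:, 1:4));               % {'AB01'; 'CD02'}
count = str2double(cellstr(recs(:, 5:8)));   % [12; 345]
tag   = cellstr(recs(:, 9:12));              % {'XY'; 'ZZ'}
```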
  2 Comments
Paolo Binetti on 5 Mar 2017
Edited: Paolo Binetti on 6 Mar 2017
Thank you, dpb. I tried your solution. I ended up using a combination of regexp and fileread, exploiting extra info from my specific problem (not included in my question, because I was after a more general solution). Mine is not a general solution, but it works and is good enough.
As for what you interpreted as a lament, it's just that I spent a few hours trying to reuse code I had, then checking out the help, googling, and trying more solutions, with no joy. So, besides the specific question, I was just hoping to find a self-contained resource with lots of examples, although certainly not exhaustive. My hope was to find a way to code simple stuff like this in one minute rather than in hours and asking for help.
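The actual regexp/fileread code was not posted in the thread, so the following is only a guess at what such an approach might look like. It assumes the sequences consist of word characters only; the txt variable is a made-up stand-in for the output of fileread('dataset.txt').

```matlab
% Hypothetical file contents; in practice: txt = fileread('dataset.txt');
txt = sprintf('4 8\nGATG|TTGA\nACTT|TTAC\nTGCA|CACA\n');

% Split off the first line and read the two integers from it.
nl  = find(txt == char(10), 1);          % position of the first line feed
hdr = sscanf(txt(1:nl-1), '%d');         % column vector of the integers

% Capture the text on each side of '|' as a pair of tokens.
tok   = regexp(txt(nl+1:end), '(\w+)\|(\w+)', 'tokens');
pairs = vertcat(tok{:});                 % N-by-2 cell array
left  = pairs(:, 1);
right = pairs(:, 2);
```

Whether this is faster or slower than textscan on a given file is something to measure rather than assume.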
dpb on 6 Mar 2017
"My hope was to find a way to code simple stuff like this in one minute rather than in hours"
Sounds like you're trying to make it too complicated...it took less than a minute to write the above solution; including copying the data to make a demo test file couldn't have been more than a couple; with a couple of iterations to test and compare result with/without 'collectoutput' parameter certainly still well under five minutes, total.
The real key to proficiency in this regard is simply "time in grade"; using textscan or the other formatted input routines becomes much more manageable with practice, albeit the possible number of options and formatting may seem overpowering initially until one gains familiarity with just how C format strings work.
As an aside, I think it unfortunate that it is C that is followed/used, Fortran FORMAT forms are much simpler to write for the duplicated fields and recursion, etc., etc., and also deal with fixed-width fields much more logically than the C version.
While totally generic input routines are and have been the Holy Grail of application programming since the invention of the mechanical computer, the problem is there is simply too much variation in possible format and data content for that to be practical excepting for some special cases. TMW has built several, importdata is pretty capable but it's just not reasonable as yet to take any file as a black box and have only one routine to automagically read it into a useful form for further processing. It's trivial, of course, to simply load a cellstr textual duplicate or fread a binary image, but for most purposes that's not sufficient to do much with unless one is simply filtering the file for some content or the like.
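As a rough sketch of the "trivial" cases mentioned above, the following loads a file once as a cellstr (one cell per line) and once as raw bytes, then filters the lines for some content. The demo file name and its contents are invented for the example.

```matlab
% Create a small hypothetical file to read back.
fname = fullfile(tempdir, 'demo_raw.txt');
fid = fopen(fname, 'w');
fprintf(fid, '4 8\nGATG|TTGA\nACTT|TTAC\n');
fclose(fid);

% Textual duplicate: one cell per line, leading blanks preserved.
fid = fopen(fname, 'r');
c = textscan(fid, '%s', 'Delimiter', '\n', 'Whitespace', '');
fclose(fid);
lines = c{1};

% Binary image: every byte of the file as uint8.
fid = fopen(fname, 'r');
raw = fread(fid, Inf, '*uint8')';
fclose(fid);

% Simple content filter: keep only the lines that contain '|'.
keep = lines(~cellfun(@isempty, strfind(lines, '|')));
delete(fname);
```

Neither form is parsed into numbers or fields yet; that still needs a format string or a dedicated parser for the file at hand.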
If you have a lot of files similar to this and there is additional particular information known of their structure and you have specific processing needs, then sure, go ahead and write a specific parser for them. If the files are from some well-known source or follow some industry/academic protocol for a given field, then it could even make sense to submit it as an enhancement request or, if only of somewhat lesser commonality submit to the File Exchange for some limited notoriety for yourself...and appreciation from others with the same issue. :)
