Extract data from a non-rectangular text file, efficiently
In spite of going through Matlab help and several attempts, I still do not get how to use "textscan" or other relevant functions to read from a simple non-rectangular file like the attached sample.
All I want is to efficiently extract two integers from the first line and two cell arrays (or char arrays), containing the strings to the left and to the right of | in the subsequent lines, respectively. Here is the code I tried, but it does not work:
fid = fopen('dataset.txt');
coeffs = textscan(fid, '%s%s', 1, 'Delimiter', ' ');
sequences = textscan(fid, '%s%s', 'Delimiter', '|');
fclose(fid);
Using "fileread" and "regexp" also seems to be an option, but "regexp" seems slower than "textscan".
BTW if anybody can point me to a link where I can find how to extract data from files using Matlab, with examples providing ample coverage of use cases, that would be great.
Accepted Answer
dpb
on 5 Mar 2017
Edited: dpb
on 5 Mar 2017
Close, but the first record is numeric, not string...
>> n=cell2mat(textscan(fid,'%f',2,'collectoutput',1))
n =
50
200
>> s=textscan(fid,'%s%s','delimiter','|','collectoutput',1)
s =
{5x2 cell}
>> s{:}
ans =
'GATGGAGTGCGGGGTGGTTGACTAGCATGGGCCCTAGGATCGCTGACGTG' 'TTGACCAAGAACAAACGTTGTACGTATTTTCGATATAATACAGTAAGCTA'
'ACTTTTTACTAAACATAAGTTCGATTTCCACATCTTCCCGCGACCATCAG' 'TTACAGCCTGCTAATACGTTCTGTTTAAATGCGTAATTAGTAGCGCTCAG'
'TGCATGAACGACGGTAGGTCCACCCGTTGTAATGCGATAGCCTATGTAGC' 'CACAAGTTCATTTTTCAAATCGATAACCTGTGGGAGTATTCTTCGGCATC'
'GATCTTGCAGGCCGGGCGCTGGCGATCTGCGCGCGACATGGCCTGCAGTG' 'GCGACCTGCTTTTCGGTTGTAACGGGAGTGCGCCTACGCGCGCAAGATAC'
'TGAGTTTAGTCACTGATCTATAACACCAAGTGGGCGCGGTAGCCGATTAG' 'CATCTTCCCGCGACCATCAGGTTTGCCCCAGTAACGCGCCTGTTGCCTGT'
>> fid=fclose(fid);
>>
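The session above starts from an already open fid. For completeness, here is a minimal end-to-end sketch of the same two textscan calls. The file name dataset_demo.txt and the shortened sequences are made up for illustration and are not the attached sample; the variable names left and right are likewise just assumptions.

```matlab
% Build a small stand-in file (hypothetical name, shortened sequences).
fname = fullfile(tempdir, 'dataset_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, '50 200\n');
fprintf(fid, 'GATGG|TTGAC\n');
fprintf(fid, 'ACTTT|TTACA\n');
fclose(fid);

% Read it back: numeric header first, then the '|'-delimited records.
fid = fopen(fname, 'r');
assert(fid ~= -1, 'Could not open %s', fname);
n = cell2mat(textscan(fid, '%f', 2, 'CollectOutput', 1));        % two header values
s = textscan(fid, '%s%s', 'Delimiter', '|', 'CollectOutput', 1); % one Nx2 cell
fclose(fid);
delete(fname);

left  = s{1}(:, 1);   % strings to the left of '|'
right = s{1}(:, 2);   % strings to the right of '|'
```

With 'CollectOutput' set, both string columns come back in a single N-by-2 cell array, so splitting them into left and right is just column indexing.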
As for the lament, the textscan documentation has a fair number of examples of different types of files to study, and the forum here is replete with special cases. It's not possible to have examples that cover all possible issues; one must look for the general principles underlying the examples and consider how they relate to the file at hand.
For the most part, the issue is simply building a format string that matches the record structure and then applying it to the file in the proper sequence for the proper number of times. The most difficult issue in general has to do with processing undelimited strings or fixed-width files, as the C format rules on field width are based on the concept of fields separated by white space as opposed to actual firm character column counts; hence, when one uses '%s' on a field that isn't really meant to be treated as such, havoc can often ensue...
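One way to sidestep the field-width problem for fixed-width records is to skip format strings altogether and slice each record by column position. This is only a sketch of that alternative, not something taken from the solution above; the record rec and the column widths are made-up assumptions.

```matlab
% Hypothetical fixed-width record: two blank-padded integer fields,
% each exactly 4 columns wide.
rec    = '  12 345';
widths = [4 4];                    % assumed column widths
stops  = cumsum(widths);           % last column of each field
starts = stops - widths + 1;       % first column of each field

vals = zeros(1, numel(widths));
for k = 1:numel(widths)
    % str2double ignores the blank padding inside the slice
    vals(k) = str2double(rec(starts(k):stops(k)));
end
```

Because the slicing is driven by firm column counts, blanks inside a field cannot shift the following fields, which is exactly what white-space-based parsing cannot guarantee.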
2 Comments
dpb
on 6 Mar 2017
"My hope was to find a way to code simple stuff like this in one minute rather than in hours"
Sounds like you're trying to make it too complicated...it took less than a minute to write the above solution; including copying the data to make a demo test file couldn't have been more than a couple; with a couple of iterations to test and compare result with/without 'collectoutput' parameter certainly still well under five minutes, total.
The real key to proficiency in this regard is simply "time in grade"; using textscan or the other formatted input routines becomes much more manageable with practice, albeit the number of possible options and formats may seem overpowering initially, until one gains familiarity with just how C format strings work.
As an aside, I think it unfortunate that it is C that is followed/used; Fortran FORMAT forms are much simpler to write for duplicated fields and recursion, etc., and also deal with fixed-width fields much more logically than the C version.
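For the duplicated-fields case, a common workaround in MATLAB is to build the repeated format string with repmat, which roughly stands in for a Fortran repeat count. A minimal sketch, assuming a made-up record of four numeric fields (nFields and the data are illustrative, not from the file above):

```matlab
% Build '%f%f%f%f' instead of typing the specifier four times.
nFields = 4;                         % assumed number of numeric fields
fmt = repmat('%f', 1, nFields);

% textscan also accepts a character vector in place of a file identifier.
c = textscan('1 2 3 4', fmt, 'CollectOutput', 1);
row = c{1};                          % 1-by-4 numeric row
```

The same fmt can be passed unchanged to a textscan call on an open file; only nFields needs to change when the record width does.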
While totally generic input routines are and have been the Holy Grail of application programming since the invention of the mechanical computer, the problem is there is simply too much variation in possible format and data content for that to be practical, excepting some special cases. TMW (The MathWorks) has built several; importdata is pretty capable, but it's just not reasonable as yet to take any file as a black box and have only one routine automagically read it into a useful form for further processing. It's trivial, of course, to simply load a cellstr textual duplicate or fread a binary image, but for most purposes that's not sufficient to do much with, unless one is simply filtering the file for some content or the like.
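To make the "trivial" cases concrete, here is a rough sketch of both: loading a file as a cellstr of lines and reading the same file as raw bytes. The file name raw_demo.txt and its contents are hypothetical, and fileread plus strsplit is just one of several ways to get the cellstr.

```matlab
% Write a tiny stand-in file (hypothetical name and content).
fname = fullfile(tempdir, 'raw_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, '50 200\nGATGG|TTGAC\n');
fclose(fid);

% Textual duplicate: whole file as one char row, then one cell per line.
txt   = fileread(fname);
lines = strsplit(strtrim(txt), sprintf('\n'))';

% Binary image: the same file as a row of raw bytes.
fid = fopen(fname, 'r');
bytes = fread(fid, Inf, '*uint8')';
fclose(fid);
delete(fname);
```

Neither result knows anything about the record structure; lines still has to be split and converted before the numbers or sequences are usable, which is the point being made above.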
If you have a lot of files similar to this, there is additional particular information known about their structure, and you have specific processing needs, then sure, go ahead and write a specific parser for them. If the files are from some well-known source or follow some industry/academic protocol for a given field, then it could even make sense to submit it as an enhancement request or, if only of somewhat lesser commonality, submit it to the File Exchange for some limited notoriety for yourself...and appreciation from others with the same issue. :)