Remove intermittent text when reading in a table from a .dat file

Question

L. Borealis on 19 Feb 2021

Direct link to this question:

https://it.mathworks.com/matlabcentral/answers/750659-remove-intermittent-text-when-reading-in-a-table-from-a-dat-file

Commented: L. Borealis on 25 Feb 2021

Hi,

I am trying to use readtable to read in a .dat file. The file looks like this, where each block can contain anywhere from one to very many data rows (only the row starting with "1" is shown here).

# NetMHCIIpan version 4.0
# Input is in PEPTIDE format
# Prediction Mode: EL+BA
# Threshold for Strong binding peptides (%Rank)	2%
# Threshold for Weak binding peptides (%Rank)	10%
# Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1   HLA-DPA10103-DPB10101      AAAAAAAAAAAAAAA    3   AAAAAAAAA     0.380        Sequence      0.020745    81.44       NA      0.366182        951.24    32.45       
   
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10201
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
 
   1   HLA-DPA10103-DPB10201      BBBBBBBBBBBBBBBB    2   BBBBBBBBB     0.960        Sequence      0.491911     1.02       NA      0.712020         22.55     0.27 <=SB
     
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10202
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1[.......]

Further rows in a block would then start with 2, 3, 4, [...]. I successfully use

opts = detectImportOptions('filename.dat');
opts.DataLines = [16 Inf];
opts.VariableNamesLine = 14;
readtable(fullfile('path','filename.dat'), opts, 'ReadVariableNames', true);

for files with a large number of rows between the dashed lines, e.g.

   # Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos                     MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind      Score_BA  Affinity(nM) %Rank_BA  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1   HLA-DPA10103-DPB10101      AAAAAAAAAAAAAAA    3   AAAAAAAAA     0.380        Sequence      0.020745    81.44       NA      0.366182        951.24    32.45       
   2   HLA-....
   3   ....
   ....
   ....
   50  HLA....
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------

However, this does not work when the blocks are short (only a few rows between the dashed lines), and my code needs to be robust in either scenario.

I tried playing with the opts but did not get it to work. I would be very grateful for any advice! Maybe a method other than readtable (readtext?) is needed and then a conversion to a table? In the end I will need a table like this:

Thank you very much for your advice! I have spent a long time developing the code around this and this is the final part that keeps breaking...

Answer 1

Vimal Rathod on 22 Feb 2021

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/750659-remove-intermittent-text-when-reading-in-a-table-from-a-dat-file#answer_630409

Hi,

Please refer to the following similar question which could be helpful to you.

How do I read data (from a .dat file) seperated by lines of text into individual vectors - MATLAB Answers - MATLAB Central (mathworks.com)

1 Comment

L. Borealis on 25 Feb 2021

Thanks, Vimal!

I had actually seen that question but even the question description was not particularly clear to me. So I had left it. Thanks for pointing me back to it. I did use part of it in the end to come up with a working (yet not elegant) solution. Maybe it is useful to someone in the future:

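% Read the whole file and split it into lines
S = regexp(fileread('S:\scratch\cdr1pool2pep1\out_00.dat'), '\r?\n', 'split');
% Drop empty lines (including the trailing one left when the file ends in \n)
S = S(~cellfun('isempty', S));
% Treat every line that does not start with '#', '-', 'P' or 'N' as data (leading spaces permitted)
nonheader = cellfun(@isempty, regexp(S, '^\s*#|^\s*-|^\s*P|^\s*N'));
% First and last line index of each run of consecutive data lines
starts = strfind([false nonheader], [false true]);
stops = strfind([nonheader false], [true false]);
num_blocks = length(starts);
% Assumes every block has as many rows as the first one; otherwise the assignment in the loop errors
lenRows = stops(1) - starts(1) + 1;
S_temp = cell(num_blocks, lenRows);
for K = 1 : num_blocks
    S_temp(K,:) = S(starts(K):stops(K));
end
% Column-major reshape: row 1 of every block comes first, then row 2 of every block, ...
S = reshape(S_temp, [num_blocks*lenRows, 1]);
% Write the data lines to an intermediate file and let readtable parse them
writecell(S, 'data.dat')
opts = detectImportOptions('data.dat');
tbl = readtable('data.dat', opts);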
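
For anyone who needs blocks of different lengths, the same idea (keep only the data lines, then parse them) might also be sketched without the intermediate file. This is only a sketch under assumptions, not a tested replacement: the function name readNetMHCIIpanRows is made up for illustration, it assumes that data rows are exactly the lines beginning with an integer position, that fields are whitespace-separated with no embedded spaces, and that BindLevel is the only field that can be missing. The variable names Rank_EL, Rank_BA and Affinity_nM are my own renamings of %Rank_EL, %Rank_BA and Affinity(nM) so that they are valid MATLAB identifiers.

```matlab
function tbl = readNetMHCIIpanRows(filename)
% Hypothetical helper (not part of MATLAB): keep only the data rows of a
% NetMHCIIpan-style output file and parse them into one table.
lines = regexp(fileread(filename), '\r?\n', 'split');

% Assumption: a data row starts with optional spaces, an integer, then whitespace.
isData = ~cellfun(@isempty, regexp(lines, '^\s*\d+\s', 'once'));
rows = lines(isData);

% Column names taken from the header line, adjusted to valid identifiers.
names = {'Pos','MHC','Peptide','Of','Core','Core_Rel','Identity', ...
         'Score_EL','Rank_EL','Exp_Bind','Score_BA','Affinity_nM', ...
         'Rank_BA','BindLevel'};
numericVars = {'Pos','Of','Core_Rel','Score_EL','Rank_EL','Exp_Bind', ...
               'Score_BA','Affinity_nM','Rank_BA'};

% Split each row on runs of whitespace; a missing BindLevel stays ''.
C = repmat({''}, numel(rows), numel(names));
for k = 1:numel(rows)
    tok = strsplit(strtrim(rows{k}));
    tok = tok(1:min(end, numel(names)));   % ignore unexpected extra fields
    C(k, 1:numel(tok)) = tok;
end

% Build the table column by column; text such as 'NA' becomes NaN.
tbl = table();
for j = 1:numel(names)
    if ismember(names{j}, numericVars)
        tbl.(names{j}) = str2double(C(:, j));
    else
        tbl.(names{j}) = C(:, j);
    end
end
end
```

Saved as readNetMHCIIpanRows.m, it would be called as tbl = readNetMHCIIpanRows('out_00.dat'). Because the rows are taken in file order, blocks with 1 row and blocks with 50 rows are handled the same way, and the MHC column still tells you which allele block each row came from.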
