Read text file lines and analyze

Question

Lmm3 on 24 Jul 2017


https://it.mathworks.com/matlabcentral/answers/349955-read-text-file-lines-and-analyze

Answered: OCDER on 9 Sep 2017

Accepted Answer: Lmm3

I would appreciate help with reading and analyzing a text file. The text file (rosalind_gc1.txt) is in this format:

>Rosalind_4949

ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA

>Rosalind_7490

AACTGGGAATTTCTATATTGGGCGGTAAGCTCGGGGCAATCTATTAGTTGAATGCAACAG TAACAAACTTGCCGTCGGTCGCTGTTCGCGCAGCATTAATAATAACTCTGGCGAGTAGAT

>Rosalind_8337

CCTTGTTGTCTACCCACCAAGTCAGATAGACAGTTGGCTGTCTCCAACGCAGATTTTCTA CGCTTCATGCTCTTGCGACTCATGTCGCCTGGGTTTATTGCTTCTCTACGGGATAACCGC CCGGGCTCACTCTACCCGCGGGAAGGCCGCCCTCTCTCCCGTGTGCCTACATAA

I would like to determine the %GC for the data sets between each “>Rosalind” heading. For example, in the example above there are 3 data sets. The %GC for the text between “>Rosalind_4949” and “>Rosalind_7490” is 48.5876% and between “>Rosalind_7490” and “>Rosalind_8337” is 45.000%.

I’m trying to use the following code but I don’t know how to read the lines as blocks between each “>” and I don’t know how to concatenate the lines as I read them. I would appreciate any help.

fid = fopen('rosalind_gc1.txt');
while ~feof(fid)
    templine = fgetl(fid);
    a = strcmp(templine, '>');
    if a == 0
        G = length(strfind(templine,'G'));
        C = length(strfind(templine,'C'));
        z = length(templine);
        %Per = (G+C)*100/z
    end
end
    Per = (G+C)*100/z


Answer 1

Lmm3 on 9 Sep 2017


https://it.mathworks.com/matlabcentral/answers/349955-read-text-file-lines-and-analyze#answer_280864


The following code is what I used to read from the data file and determine %GC:

fid = fopen('rosalind_gc.txt');
G = 0;  C = 0;  z = 0;             % running counts for the current record
while ~feof(fid)
    templine = fgetl(fid);
    if ~ischar(templine)           % fgetl returns -1 at end of file
        break
    end
    templine = strtrim(templine);
    if isempty(templine)
        continue                   % skip blank lines
    end
    if templine(1) == '>'          % header line: a new record starts
        if z > 0                   % report the record that just ended
            Per = (G + C)*100/z
        end
        disp(templine)
        G = 0;  C = 0;  z = 0;     % reset the counts for the new record
    else                           % sequence line: add to the running counts
        G = G + length(strfind(templine, 'G'));
        C = C + length(strfind(templine, 'C'));
        z = z + length(templine);
    end
end
fclose(fid);
Per = (G + C)*100/z                % report the last record


Answer 2

KSSV on 24 Jul 2017


https://it.mathworks.com/matlabcentral/answers/349955-read-text-file-lines-and-analyze#answer_275272

Edited: KSSV on 24 Jul 2017

Let data.txt be your text file. You can count the number of G characters in the file as follows:

fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
N = 0 ;
for i = 1:length(S)
    N = N+length(strfind(S{i}, 'G'));
end

Without a loop:

fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
Ni = strfind(S,'G') ;
N = sum(cellfun(@numel,Ni)) ;
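The same cell array of lines can also be grouped by record rather than summed over the whole file. For context, textscan returns one cell per format specifier, so with a single '%s' the line S = S{1} simply unwraps that one cell into a cell array holding one line per entry. What follows is only a sketch, assuming every header line starts with '>'; the literal S below is a hypothetical stand-in for the result of textscan, and the names isHeader, recordId, seqLines, nG and pctGC are illustrative, not part of the answer above.

```matlab
% Hypothetical stand-in for S = S{1} after textscan: one cell per line.
S = {'>rec_1'; 'GGCA'; 'TTGC'; '>rec_2'; 'ATAT'; 'GCGC'};

isHeader = strncmp(S, '>', 1);        % true on header lines
recordId = cumsum(isHeader);          % record number that each line belongs to
seqLines = ~isHeader & recordId > 0;  % sequence lines that follow some header

ids = recordId(seqLines);
seq = S(seqLines);

% Sum per-line counts within each record.
nG    = accumarray(ids, cellfun(@(s) sum(s == 'G'), seq));
nGC   = accumarray(ids, cellfun(@(s) sum(s == 'G' | s == 'C'), seq));
len   = accumarray(ids, cellfun(@numel, seq));
pctGC = 100 * nGC ./ len;
```

For this toy input, nG is [3; 2] and pctGC is [62.5; 50], one value per header.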


Lmm3 on 25 Jul 2017

KSSV, thank you for your response. Could you explain what the line S = S{1} is doing? The code returns the total number of "G" occurrences in the data file, but do you have a suggestion for getting the "G" occurrences between each pair of headers that begin with ">Rosalind"? For example, for the data set above I would like to get 3 values: the number of G occurrences between ">Rosalind_4949" and ">Rosalind_7490", between ">Rosalind_7490" and ">Rosalind_8337", and below ">Rosalind_8337".


Answer 3

OCDER on 9 Sep 2017


https://it.mathworks.com/matlabcentral/answers/349955-read-text-file-lines-and-analyze#answer_280878


readFasta.m

If you deal with a lot of FASTA files, look into fastaread (MATLAB Bioinformatics Toolbox) or readFasta (the attached code, which I wrote for another project).

cellfun and regexp are also handy tools for this kind of task.

To get GC %:

[Header, Seq] = readFasta('Seq.txt');
PercGC = cellfun(@(S)length(regexpi(S, 'G|C'))/length(S)*100, Seq);
PercGC =
   48.5876
   45.0000
   55.1724
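For readers without the attachment or the toolbox, here is a rough sketch of how Header and Seq could be built by hand. It is not the attached readFasta and makes no claim about how that function works. It assumes '>' appears only at the start of header lines, and the variable txt is a hypothetical stand-in for the file contents (in practice something like fileread('Seq.txt')).

```matlab
% Hypothetical FASTA text; in practice this could come from fileread.
txt = sprintf('>rec_1\nGGCA\nTTGC\n>rec_2\nATAT\nGCGC\n');

% Split into records at each '>' and drop the empty piece before the first one.
records = strsplit(txt, '>');
records = records(~cellfun(@isempty, records));

Header = cell(numel(records), 1);
Seq    = cell(numel(records), 1);
for k = 1:numel(records)
    parts = strsplit(strtrim(records{k}), sprintf('\n'));
    Header{k} = strtrim(parts{1});                       % first line is the header
    joined    = strjoin(parts(2:end), '');               % concatenate sequence lines
    Seq{k}    = regexprep(joined, '\s', '');             % remove stray whitespace
end

% Same GC calculation as in the answer above.
PercGC = cellfun(@(S) length(regexpi(S, 'G|C'))/length(S)*100, Seq);
```

With this toy input, Seq is {'GGCATTGC'; 'ATATGCGC'} and PercGC is [62.5; 50].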
