searching through RNA sequence for a string of characters

10 visualizzazioni (ultimi 30 giorni)
Hi all,
I am creating a program that gets a user input (txt file) of DNA nucleotides (random lists of A, G, C, T), and gives an output file of the RNA complement (got this done), the percentages of RNA nucleotides (got this done), and asks the user if they would like to search through the RNA sequence for a particular string of nucleotides—if the string exists in the RNA file, it will tell the location in the sequence and repeat that given string.
example: the user inputs "UACG," and my program says that string of basepairs occurred at location 59.
How can I do this? My current program operates by searching through the file on a per character basis, and converts each DNA basepair to an RNA basepair.
Thank you

Risposte (3)

Joseph Cheng
Joseph Cheng il 24 Apr 2017
so, do you store the full compliment data in an array such that you can
%gen dummy data
comp = 'AGCT';
list = comp(randi(4,1,100))
position = strfind(list,'CAT')
where position will be the index in the string that has CAT in it?
  2 Commenti
John Jamison
John Jamison il 24 Apr 2017
I dont believe so. Below is some of what I do:
[base, num] = fread(dnaFile, 1, 'char');
fwrite(rnaFile, out);
Joseph Cheng
Joseph Cheng il 24 Apr 2017
well then i would suggest that you do, unless the array is too large. if you read in the whole text file into an array you should be able to then use strfind to find the position, otherwise its a tricky swapping in and out of a buffer as you search for consecutive matches by looking one by one

Accedi per commentare.


dpb
dpb il 24 Apr 2017
Modificato: dpb il 24 Apr 2017
Again, all the reason in the world to read the whole string at once as recommended earlier.
But, the answer to the question posed, given what you did previously is simply
idx=strfind(out,ASKEDFORSTRING);
AFTER the loop is complete. Fuggettabout trying to do this on a character-at-a-time basis.
idx will be empty if the sequence doesn't exist, or a vector of 1 to N position(s) where the sequence is found.
Done.
I can't emphasize enough how you should convert this from the looping you're doing to processing the input string as a vector...you'll find things will go much faster after you do, the code will become almost trivial by comparison and much simpler to develop/debug/maintain.
  6 Commenti
John Jamison
John Jamison il 24 Apr 2017
My code is the same as from the previous post. Here is my dna.txt file that I use for my file
dpb
dpb il 24 Apr 2017
Modificato: dpb il 25 Apr 2017
"My code is the same as from the previous post"
Can't be--you had to have processed the input string instead of continuing to try to read a file character-by-character or you would have had an FEOF error on the first next read.
We need to see the changes you did make...for the NaN result, it looks like you never collected a single count so the result was 0/0 for all.
ADDENDUM
OK, the input file is OK; one hurdle cleared.
What you need to do is after the above read statement, if you're still hung up on looping solution, replace the while loop with a counted for loop
for i=1:length(base)
and remove the fread statement at the end of the loop entirely.
Oh, just dawned on me--if you left the code alone, since base now refers to the entire string not just a single character, the various if clauses will never be satisfied. You must address each character in the string in turn:
for i=1:length(base)
if base(i)=='C'
...
etc., etc., etc., ...
But, of course, the thing to do when you read the full string is to use the vectorized solution that I showed you...there's where the real simplification comes from.

Accedi per commentare.


dpb
dpb il 25 Apr 2017
Modificato: dpb il 25 Apr 2017
OK, while this isn't really answering the question of the thread (I did that up above), I fixed your script to operate on the full string. This isn't fully vectorized yet but goes a fair step along the way eliminating the unnecessary big loop and switch block for the lookup table approach.
% set up lookup table inputs
DNA=['C','G','T','A']; % base characters in DNA sequence
RNA=['G','C','A','U']; % corresponding characters in RNA
% get input file, do bookkeeping for output file, etc., ..
filename=uigetfile('*.txt','Select DNA sequence file');
if filename==0, error('User Canceled'),end
[dnaFile, message] = fopen(filename, 'r');
% %if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
outputFile = [filename(1:end-4) '.rna'];
else
outputFile = [filename '.rna'];
end
[rnaFile, message] = fopen(outputFile, 'w');
% now read full string into char array
dna=fread(dnaFile,'*char').'; % orient as row vector
fclose(dnaFile); clear dnaFile % we're done with input file
% and translate DNA --> RNA
rna=RNA(arrayfun(@(c) find(c==DNA),dna));
% compute counts, frequencies
counts=zeros(size(RNA)).';
for i=1:length(RNA)
counts(i)=sum(rna==RNA(i));
end
freqs=100*counts/sum(counts);
% output results
fprintf('Base pair frequencies: \n')
for i=1:length(RNA)
fprintf('%c: %6.2f\n',RNA(i),freqs(i));
end
fwrite(rnaFile, rna);
rnaFile=fclose(rnaFile); clear rnaFile
Sample run from your dna.txt--
> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
>>
The searching would be on the variable rna as outlined above after getting the desired sequence for which to search from the user.
I attached the m-file as well...HTH.
ADDENDUM
OK, I added the lookup; using the search string you typed in the original question the nul result is the expected answer...there is no sequence like that in the output string. I ran one and picked one I knew would work and results were:
>> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
Enter RNA sequence of interest: aauu
Sequence AAUU locations (empty report-->not found)
Occurrence Location
1 3
2 16
3 28
>>
So, the lookup does work if there is anything to be found.
  2 Commenti
John Jamison
John Jamison il 25 Apr 2017
Thank you. How can I get it to search for a user input of RNA basepairs?
I have the following, and thought it would work, but it always returns a " [] ", which means it is finding nothing...
inputBase = input('Enter RNA sequence of interest: ', 's');
location = strfind(out, inputBase);
dpb
dpb il 25 Apr 2017
Modificato: dpb il 25 Apr 2017
Well, syntax looks ok but again we can't debug what we can't see, sorry.
Same old song, umpteenth verse. SHOW us enough we can see what you did and what you used to do it with (both code and data)...we can't see your terminal from here.
This can be done simply from command line with the output string you operated on and the search string as a minimum...
ADDENDUM
You did notice if you used my updated code that the output variable is now a more expressive name rna, not out, I hope?
If you're still stuck on your old code, need what said above...show us the string you're searching and for what...remember strfind is case-sensitive.

Accedi per commentare.

Categorie

Scopri di più su Genomics and Next Generation Sequencing in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by