Click to See Complete Forum and Search --> : C++ and DNA sequences


WedDa
February 10th, 2005, 06:49 PM
Hello Forum!

This is my first post here, and my first week as a C++ programmer, so please be gentle ;-)
Anyways, I'm a molecular biologist dealing with lots of DNA sequences in simple text files in GNU/Linux. I consider myself a fairly advanced computer user, but as I programmer I am a total newbie. I feel though it is time to take that big step and code some small apps myself. I've been reading up a bit on C++ and written some small programs and think I get the basics. What I do not get though is how to deal with regular expressions, which I suspect is relevant for the task below. Basically, what I want to do is simply to take a DNA sequence file (FASTA format) looking like this:

>Sequence1 # The name of sequence. The '>' signals new sequence.
ACGGTGCAATTGACCA # The actual DNA sequence string
GTCGGTTGAACCGTCA
CCGTGA
>Sequence2
GGTGCCACAAGTGGCA
GTCGATTGACCACGTA
TTTGGG

and convert it to something like this n a new file:

Sequence1 ACGGTGCAATTGACCAGTCGGTTGAACCGTCACCGTGA
Sequence2 GGTGCCACAAGTGGCAGTCGATTGACCACGTATTTGGG

without the '>' in the names.

In practice, the sequence name in the original file can be any number of characters long (or say at least 255 chars) and is always follow by a carriage return. A single can hold thousands of sequences, each of which can be tens of thousands of DNA characters long.

I have yet to find info on this in basic C++ introductions. How do I separate the sequence names from the DNA strings when the file is read so that they become different but related objects?

Are there any other special pitfalls or tricks you think I should consider or remember when writing this little program?

I guess this would be kinda simple to implement with perl, but I am here to learn C++ :-)

Very thankful for any feedback!

Regards,
Andreas

HighCommander4
February 11th, 2005, 04:33 PM
Making such a program can be very simple if you you use standard tools like the <fstream> library for file I/O, and the Standard Template Library classes, like std::string.

The insertion operator >> reads one word from the screen (in the case of an istream object like cin) or from a file (in the case of an ifstream object) into a target string (if you use std::string, the number of characters that it can read in is practically unlimited). So assuming that your sequence name and the actual sequence are separated by spaces (or newlines, tabs etc.), you can read them in like this:

ifstream fin; // ifstream object for reading file
fin.open("myfile.txt"); // open the file
std::string name1, name2, sequence1, sequence2;
fin >> name1; // stores name of sequence 1
fin >> sequence 1; // stores sequence 1
fin >> name2; // stores name of sequence 2
fin >> sequence2; // stores sequence 2

That of course is assuming that there are no spaces within each sequence of characters. If there are, you may need multiple strings to store each segment and concatenate them after using the += operator. Or, use the getline() function instead, with the > character marking the beginning of the next sequence as the character at which to stop reading.

Now, say you want to erase the > from the string name, simply do:

name1.erase(name1.begin()); // erases the first character of the string

Writing the new strings to a file is even simpler than inputting them: you use the same syntax as if you'd be outputting to the screen, except replace cout with an ofstream object:

ofstream fout;
fout.open("output_file.txt");
fout << name1 << " " << sequence1 << " " << name 2 /* etc */;

As you can see, a program like this is very simple to make. As long as you are familiar with the basic ways of handling an std::string, you can do things like this and even more complex stuff, very easily. This (http://www.sgi.com/tech/stl/basic_string.html) is a good reference for working with strings, detailing what each member function does. Note that the page is part of a larger site that describes the entire STL, not just the string class.