WedDa
February 10th, 2005, 06:49 PM
Hello Forum!
This is my first post here, and my first week as a C++ programmer, so please be gentle ;-)
Anyways, I'm a molecular biologist dealing with lots of DNA sequences in simple text files in GNU/Linux. I consider myself a fairly advanced computer user, but as a programmer I am a total newbie. I feel, though, that it is time to take that big step and code some small apps myself. I've been reading up a bit on C++, have written some small programs, and think I get the basics. What I do not get, though, is how to deal with regular expressions, which I suspect are relevant for the task below. Basically, what I want to do is simply to take a DNA sequence file (FASTA format) looking like this:
>Sequence1 # The name of sequence. The '>' signals new sequence.
ACGGTGCAATTGACCA # The actual DNA sequence string
GTCGGTTGAACCGTCA
CCGTGA
>Sequence2
GGTGCCACAAGTGGCA
GTCGATTGACCACGTA
TTTGGG
and convert it to something like this in a new file:
Sequence1 ACGGTGCAATTGACCAGTCGGTTGAACCGTCACCGTGA
Sequence2 GGTGCCACAAGTGGCAGTCGATTGACCACGTATTTGGG
without the '>' in the names.
In practice, the sequence name in the original file can be any number of characters long (or say at least 255 chars) and is always followed by a carriage return. A single file can hold thousands of sequences, each of which can be tens of thousands of DNA characters long.
I have yet to find info on this in basic C++ introductions. How do I separate the sequence names from the DNA strings when the file is read so that they become different but related objects?
Are there any other special pitfalls or tricks you think I should consider or remember when writing this little program?
I guess this would be kinda simple to implement with perl, but I am here to learn C++ :-)
Very thankful for any feedback!
Regards,
Andreas