##( mork calling hometown... come in hometown! )##

**galathaea** · December 15th, 2003, 12:40 AM

I just saw your question about block substitution matrices over in the vc++ forum and wanted to make a few comments. However, we should stick to a more appropriate forum. I'd PM you, but it's very difficult to use PMs because I easily hit the size ceiling and have to break responses into several separate messages (which raises the question of why there are limits in the first place, since I will take up just as much database space, if not more, because I won't be confined that easily -- same goes for forum posts, but at least I don't have to break things up as often).

First I should stress that those kinds of calculations are not really a focus of mine. I've tossed together a few programs before to calculate phylogenetic information from sequences to help me study, but those involved more simplified models than the programs the professionals use regularly. But I'll try to explain what I can.

Laboratories everywhere these days are cranking out gene sequences. There is a whole industry of rather poorly paid lab techs who chop up genomes, gel separate them, run PCRs, and then do various spectral or antibody tests to identify the various fragments. What this results in is a huge collection of DNA sequence information from all sorts of bacteria, plants, fungi, animals, etc. being passed around the scientific and industrial communities. ACCTGTCGATTACA...

One useful thing that can be derived from this data is the evolutionary relationships between the various species. The basic theory is that DNA sequences which differ by a lot indicate species which diverged further back in evolutionary history than species whose sequences are more closely similar.

However, one needs to come up with a good measure for the differences between sequences. One of the simpler possibilities is to just use an extended Hamming distance, where you count the number of positions at which the letters of the two sequences differ. However, this is not very useful, because one common evolutionary mutation is an insertion event, which puts a new nucleotide or series of nucleotides into a sequence somewhere in the middle of a gene. That throws off all subsequent matches and indicates a large distance between the sequences even though only one evolutionary event occurred.
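To make the insertion problem concrete, here is a minimal Python sketch. It compares a plain position-by-position Hamming count with an edit (Levenshtein) distance that allows insertions and deletions. The function names are just mine for illustration, the test sequence reuses the fragment quoted above, and edit distance is only a stand-in for the pattern-matching idea, not what the professional tools actually compute.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-letter insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete x
                curr[j - 1] + 1,            # insert y
                prev[j - 1] + (x != y),     # match or substitute
            ))
        prev = curr
    return prev[-1]


original = "ACCTGTCGATTACA"
inserted = "ACCGTGTCGATTACA"                # one extra G after the third letter

# Hamming needs equal lengths, so compare against the first 14 letters only.
print(hamming(original, inserted[:len(original)]))
print(edit_distance(original, inserted))
```

A single inserted G makes the Hamming count blow up (10 of the 14 positions disagree), while the edit distance correctly reports one event.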

So instead, the similarities are often weighted and a pattern matching algorithm is applied to measure the distances. Often the algorithms are applied to full codons (3 nucleotides, which collectively code for one amino acid in the gene's protein product). The weighting is needed because another common evolutionary event is substitution, and it has been found that substitutions are more common between amino acids that share similar biochemistries. So experiments have been done, and various substitution theories have been developed. Most often these are represented in scoring matrices, such as the various BLOSUMs. Effectively, each type of scoring matrix describes a particular evolutionary model being applied.
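As a rough sketch of what "weighted pattern matching" can look like, here is a small global alignment score (Needleman-Wunsch style dynamic programming) over amino acid letters. Everything numeric here is an assumption of mine: the `toy_score` values and the gap penalty are made up to illustrate the idea and are NOT BLOSUM entries, and the groupings (V, I, L as hydrophobic; D, E as acidic) are a deliberately tiny stand-in for a real matrix.

```python
HYDROPHOBIC = set("VIL")   # valine, isoleucine, leucine
ACIDIC = set("DE")         # aspartate, glutamate
GAP = -3                   # hypothetical penalty per inserted/deleted residue


def toy_score(a: str, b: str) -> int:
    """Made-up substitution score: NOT taken from any BLOSUM matrix."""
    if a == b:
        return 4                                   # identical residues
    if {a, b} <= HYDROPHOBIC or {a, b} <= ACIDIC:
        return 1                                   # similar biochemistry
    return -2                                      # dissimilar residues


def align_score(s: str, t: str, score=toy_score, gap: int = GAP) -> int:
    """Best global alignment score of s against t."""
    prev = [j * gap for j in range(len(t) + 1)]    # "" aligned to prefixes of t
    for i, a in enumerate(s, start=1):
        curr = [i * gap]                           # s[:i] aligned to ""
        for j, b in enumerate(t, start=1):
            curr.append(max(
                prev[j - 1] + score(a, b),         # align a with b
                prev[j] + gap,                     # a against a gap
                curr[j - 1] + gap,                 # b against a gap
            ))
        prev = curr
    return prev[-1]


print(align_score("VDL", "IDL"))   # V -> I: similar biochemistry
print(align_score("VDL", "KDL"))   # V -> K: dissimilar
```

With these toy numbers, swapping valine for isoleucine costs less than swapping it for lysine, which is the whole point of the weighting. Plugging in a real matrix would just mean replacing `toy_score` with a lookup into that matrix.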

It looks like the question is basically asking you to look up in the matrix the substitution values between various amino acids with similar properties (for example, V is valine), in order to illustrate to the student that similar amino acids have a proportionately higher likelihood of substitution. I don't have a copy of the BLOSUM50 matrix around, so I can't help much with walking through the problem, but I'm sure it's out there on the internet, maybe even on the site your question comes from. But it's basically looking things up in a matrix, and not too exciting.

However, as I mentioned above, this isn't really the field I am most interested in. Recently in particular, I have been much more focused on simulations surrounding the origins of life.
