Fortunately, those of us who have learned how to sequence know that aligning sequences is a lot easier and less time consuming than creating them. Whether you’re employing sequencing gels, Sanger-based methods, or the latest in pyrosequencing or Ion Torrent technologies, obtaining, manipulating and analyzing your sequences has never been easier. We’re going to take a look at just the basics of sequence alignment to get you started.

You must have a minimum of 2 sequences to perform an alignment. For comparing 2 sequences you’ll need to perform a “pairwise” alignment. Most programs will also align 3 or more sequences at a time (a multiple sequence alignment), but this requires a different algorithm, e.g. MUSCLE or one of the Clustal algorithms like ClustalW.

You can align several hundred to several thousand sequences if you wish, but several factors determine whether this is straightforward and simple or a time hog, if not impossible. First, you must choose an appropriate algorithm. For instance, the alignment program MUSCLE can usually handle large data sets with a premium on accuracy. For some perspective, I can usually align ~750 sequences of 1000 nucleotides each in about an hour using MUSCLE. Second, for aligning a large number of sequences, you must have sufficient computer memory and storage.

What is the difference between similarity and identity?

Identity is the degree of correlation between 2 un-gapped sequences, and indicates that the amino acids or nucleotides at a particular position are an exact match. Generally, an identity of 25% or higher suggests the potential for similarity of function; an identity of 18-25% implies similarity of structure or function. It is important to note that 2 or more completely unrelated sequences can have 20% identity or greater, so this is not a hard and fast rule. Similarity is the degree of resemblance between two sequences when they are compared, and indicates that the amino acids or nucleotides at a particular position have some properties in common (for instance, charge or hydrophobicity), but are not identical. A high percentage of similar residues can also suggest a conserved function or structure.

What is a “consensus” sequence?

A consensus sequence usually appears at the top of your alignment worktable, and each nucleotide (or amino acid) of the sequence is the residue that appears most frequently at that position in your aligned sequences. For instance, if you align 5 sequences, and the nucleotides at position 20 are A, A, T, A, and G, then the consensus sequence will have an A at position 20. Consensus sequences can be very useful when examining evolutionary relationships between sequences with high degrees of identity. It is also useful to use the consensus to identify potential gaps in your aligned sequences.

Why are gaps important?

A gap is one or more spaces in a single string of a given alignment and usually corresponds to an insertion or deletion in one or more sequences within the alignment. The insertion or deletion can be an artifact of sequencing chemistry and not indicative of the authentic DNA sequence.
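
As a rough illustration of the identity, consensus, and gap ideas described above, here is a minimal Python sketch that works on sequences that have already been aligned. The function names (`percent_identity`, `consensus`, `gap_columns`), the `-` gap character, the choice to skip gapped columns when computing identity, and the tie-breaking rule in the consensus are all assumptions made for this example. Alignment programs differ on these conventions, so treat this as a sketch rather than a description of how any particular tool works.

```python
from collections import Counter

GAP = "-"  # assumed gap character; some tools use "." instead


def percent_identity(seq_a, seq_b):
    """Percent of exactly matching positions between two aligned sequences.

    Assumption: columns where either sequence has a gap are skipped,
    which is only one of several conventions in use.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == GAP or b == GAP:
            continue  # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def consensus(aligned):
    """Most frequent symbol in each column of the alignment.

    Assumptions: ties go to the symbol seen first in the column, and a
    gap is counted like any other symbol.
    """
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )


def gap_columns(aligned):
    """0-based indices of columns that contain at least one gap."""
    return [i for i, column in enumerate(zip(*aligned)) if GAP in column]


# Five made-up aligned sequences; the last column holds A, A, T, A, G,
# mirroring the position-20 example in the text.
aligned = ["ACGA", "ACGA", "ACGT", "AC-A", "ACGG"]

consensus_seq = consensus(aligned)                       # "ACGA"
gapped = gap_columns(aligned)                            # [2]
identity = percent_identity(aligned[0], aligned[2])      # 75.0
```

Here the last column contains A, A, T, A, and G, so the consensus takes an A at that position, and the gap in the fourth sequence is flagged at column index 2 so you can check whether it reflects a real insertion or deletion or a sequencing artifact.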