Algorithms for Molecular Biology
Biological Databases and Retrieval Systems
Grail has recently been incorporated into the Oak Ridge genome analysis pipeline, which provides a unified web interface to a number of convenient analysis tools. The use of frequency profiles for database searches has had a profound effect on the quality and depth of sequence and structure analysis. The probability of matching one amino acid residue is 1/20 (assuming equal frequencies of all 20 amino acids in the database). Spurious hits with lower E-values are uncommon: they are observed more or less as frequently as expected according to Karlin-Altschul statistics. These false results would have badly polluted any large-scale database search, and the respective proteins would have been refractory to any meaningful sequence analysis. The way to properly capture the information contained in sequence motifs is to represent them as amino acid frequency profiles, which incorporate the frequencies of each of the 20 amino acid residues in each position of the motif. Given all these advantages, comparisons of coding sequences are typically carried out at the level of protein sequences. According to the definition used in the SCOP database of known protein structures (Murzin et al., 1995), a domain is the smallest unit of evolution. In principle, if models were developed for all protein families, the problem of classifying a new protein sequence would be essentially solved. Pattern-Hit-Initiated BLAST (PHI-BLAST) is a variant of BLAST that searches for homologs of the query that contain a particular sequence pattern. Searching the COG database may be viewed as a rough prototype of this approach. The T-Coffee program is a recent modification of Clustal that incorporates heuristics partially solving these problems. Although it slows loading of the page, this option is essential for quick examination of the output to get an idea of the domain architecture of the query.
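The representation of a sequence motif as an amino acid frequency profile, mentioned above, can be sketched in a few lines. This is only an illustration: the function name `frequency_profile`, the optional `pseudocount` parameter, and the four toy motif instances are assumptions made for the example and do not come from any particular tool or database.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def frequency_profile(aligned_motifs, pseudocount=0.0):
    """Return one {residue: frequency} dict per motif position.

    aligned_motifs: equal-length, ungapped instances of the same motif.
    pseudocount: hypothetical smoothing term added to every residue count.
    """
    length = len(aligned_motifs[0])
    if any(len(seq) != length for seq in aligned_motifs):
        raise ValueError("all motif instances must have the same length")

    profile = []
    for column in zip(*aligned_motifs):  # iterate over motif positions
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


# Toy, made-up motif instances (not taken from a real alignment).
motifs = ["GKSGT", "GKTGS", "GKSGS", "GKTTL"]
profile = frequency_profile(motifs)

# Frequencies observed at the third position of the toy motif.
third_position = {aa: f for aa, f in profile[2].items() if f > 0}
```

Each position of the profile holds 20 frequencies, so a conserved position (here the first two columns) is dominated by a single residue, whereas a variable position spreads its weight over several residues. A uniform background of 1/20 per residue, as assumed in the text, would be the natural reference against which such frequencies are compared.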
Although, in theory, a global alignment is best for describing relationships between sequences, in practice, local alignments are of more general use for two reasons: (i) it is common that only parts of compared proteins are homologous. However, over time, "database" became the preferred term. This will identify all the sequences in the database that are identical to the query sequence (or include it). Even decreasing the word size to 7, the lowest word size currently allowed for BLASTN, would not change the result if the longest stretch of identical nucleotides in this alignment is only 6 bases long. Sequence motifs are extremely convenient descriptors of conserved, functionally important short portions of proteins. PSI-BLAST also employs a simple sequence-weighting scheme, which is applied for PSSM construction at each iteration. Thus, the hierarchical algorithms essentially reduce the O(n^k) multiple alignment problem to a series of O(n^2) problems, which makes the algorithm feasible but potentially at the price of alignment quality. There are two fundamental ways to design a substitution score matrix. Third, we certainly do not advocate lowering the statistical cut-off for any large-scale searches, let alone automated searches. Database hits that have "significant" E-values but, upon more detailed analysis, turn out not to reflect homology seem to result from subtle compositional bias missed by composition-based statistics or low-complexity filtering. The different types of databases: Accession codes vs identifiers, Nucleotide sequence databases, Protein sequence databases, Sequence motif databases, Macromolecular 3D structure databases. For these reasons, for several years, SEG filtering had been used as the default for BLAST searches to mask low-complexity segments in the query sequence. The E-value of 0.005 is a relatively conservative cut-off. The pitfalls are even more pronounced in protein comparisons than in nucleotide comparisons.
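The point about local alignments made above, namely that often only parts of the compared sequences are related, can be illustrated with a minimal Smith-Waterman-style scoring sketch. The function name `local_alignment_score` and the scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for the example; real searches use substitution matrices and affine gap penalties, and BLAST itself relies on word-based heuristics rather than full dynamic programming.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (O(len(a) * len(b)) time).

    Scores are illustrative assumptions: identical residues score `match`,
    different residues score `mismatch`, each gapped position costs `gap`.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Resetting to 0 lets an alignment start anywhere, which is what
            # distinguishes local from global alignment.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences that share only the central segment "ACGT".
shared_region_score = local_alignment_score("XXACGTXX", "YYACGTYY")
```

In the toy case the unrelated flanks contribute nothing, and the score reflects only the shared four-residue segment. A global alignment of the same pair would be forced to pay for the mismatched flanks.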
Statistical significance can be any positive number; the default pairwise alignment methods utilize modifications of the solid database. To identify coding regions and distinguish them from non-coding DNA, Glimmer uses interpolated Markov models.
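The idea of using a Markov model to separate coding from non-coding DNA can be sketched as follows. This is a deliberately simplified, fixed-order model, not Glimmer's interpolated Markov model, which combines predictions from several context lengths. The function names `train_markov` and `log_odds`, the pseudocount, and the toy training sequences are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=2, alphabet="ACGT", pseudocount=1.0):
    """Fit a fixed-order Markov model; return prob(context, base)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, noncoding_prob, order=2):
    """Sum of log(P_coding / P_noncoding) over all positions of seq."""
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        score += math.log(coding_prob(ctx, base) / noncoding_prob(ctx, base))
    return score


# Toy, made-up training data; real models are trained on known genes
# and intergenic regions of the genome being annotated.
coding_model = train_markov(["ATGGCCGCCGCCGCC"])
noncoding_model = train_markov(["ATATATATATATAT"])

gc_rich_score = log_odds("GCCGCCGCC", coding_model, noncoding_model)
at_repeat_score = log_odds("ATATATATA", coding_model, noncoding_model)
```

A positive log-odds score means the sequence looks more like the coding training set, and a negative score means it looks more like the non-coding set. With such tiny training sets the numbers are meaningful only as a demonstration of the mechanics.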