NCBI-101:
How to search for the Tree in the Forest ... or the Gene in the Genome
Thanks to numerous sequencing projects around the world, more than 70 genomes
of bacteria, archaea and eukaryota and over 700 viral and organellar genomes
have been completely sequenced. The human genome is almost deciphered. However,
the number of genes encoding proteins is still under debate, and finding the
genes in the genome is a difficult task.
Finding the genes in microbial genomes can be as simple as finding
a translational start codon (ATG, coding for the initial methionine) preceding
a long open reading frame (ORF). In prokaryotes, the ORFs are long and not
interrupted by non-coding regions (introns), and the gene density
is high. Moreover, conserved sequence patterns are found preceding the coding
regions, e.g., the Pribnow box (a TATAAT consensus region at position -10)
or the -35 region. Thus, the genome can be scanned for such sequences in
order to check the adjacent regions for genes. Even though the genes can be
located on the plus or minus strand of the DNA, finding the protein-encoding
genes is relatively straightforward.
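The scan described above can be sketched in a few lines of Python. This is
a minimal illustration, not the method of any particular gene finder: the
function names find_orfs and reverse_complement and the min_codons threshold
are assumptions introduced here. The sketch only looks for ATG start codons
and the three standard stop codons on both strands, and it ignores promoter
signals such as the -10 and -35 regions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (the minus strand)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Find ORFs (ATG ... stop) in all three frames of both strands.

    Returns (strand, start, end, orf_sequence) tuples. Coordinates are
    0-based, end-exclusive, and refer to the strand that was scanned.
    min_codons is an illustrative cutoff counting the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence: one short ORF on the plus strand.
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    for orf in find_orfs(toy, min_codons=3):
        print(orf)
```

Positions reported for the minus strand are relative to the reverse
complement; a real tool would map them back to plus-strand coordinates and
would usually combine the ORF scan with a search for upstream signals.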
In eukaryotic genomes the situation is more complex and predicting
protein-encoding genes is more difficult. The gene density in eukaryotes is
low, which means that most of the DNA in these genomes is non-coding sequence.
Cis-acting elements in eukaryotes are not as thoroughly characterized as in
prokaryotes. Furthermore, the genes are not simply transcribed into mRNAs
as in prokaryotes; the mRNAs are also processed in different ways prior to
translation into proteins. mRNA maturation involves the removal
of non-coding regions (splicing) and/or editing (the exchange of nucleotides),
resulting in differences between the mRNA and the genomic DNA sequence. Splice
junctions have to be identified, and they can even vary, resulting in
"alternatively spliced genes" and alternative protein products. The molecular
mechanisms of RNA modification are still under investigation, but once
RNA-binding sites and RNA-binding proteins are more thoroughly understood,
this knowledge will help to predict gene structures and their variations.
Today, many computer programs take care of finding the genes within the
bulk of DNA sequence, and different gene-finding strategies are used.
In general, three approaches can be distinguished: content-based, site-based
and comparative.
Content-based methods evaluate the overall properties of a sequence, including
its codon usage. Synonymous codons (codons that stand for the same amino acid)
are not used at random: codon usage differs among species and also between
weakly and strongly expressed genes in the same organism. This bias can be
used to distinguish coding from non-coding
regions. Site-based methods look for transcription factor binding sites,
polyA signals, start and stop codons, splice junctions, and other specific
sequences or sequence patterns. Comparative methods make use of already
determined sequences by comparing sequence data. These methods are thus
"trained", and the closer the test sequences are to the sequences of the
training set, the better the results. Ultimately, gene prediction is most
reliable when the application of different methods and programs results in
the same predicted gene structure.
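As a rough illustration of the content-based idea, the following Python
sketch tabulates codon frequencies from a known coding sequence and uses them
to score a candidate window. The function names, the pseudocount and the
uniform 1/64 background are assumptions made for this example only; real
gene finders use considerably more elaborate statistical models.

```python
import math
from collections import Counter


def codon_counts(cds):
    """Count the codons of a sequence read in frame from its first base."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds):
    """Relative codon frequencies of a coding sequence."""
    counts = codon_counts(cds)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def coding_score(window, coding_freqs, pseudocount=1e-3):
    """Log-odds score of a window against a uniform codon background.

    Positive scores mean the window's codons resemble the training set.
    The pseudocount (an arbitrary choice) keeps unseen codons from
    producing log(0).
    """
    background = 1.0 / 64
    score = 0.0
    for codon, n in codon_counts(window).items():
        freq = coding_freqs.get(codon, 0.0) + pseudocount
        score += n * math.log(freq / background)
    return score


if __name__ == "__main__":
    training = "GCTGCTGCC"  # toy "training set" of three alanine codons
    freqs = codon_frequencies(training)
    print(freqs)
    print(coding_score("GCTGCT", freqs), coding_score("TTTTTT", freqs))
```

In practice the frequency table would be built from many known genes of the
organism under study, which is also why such methods perform best on
sequences that resemble their training data.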
Two divisions of the National Institutes of Health offer freely available
software programs to help with the gene annotation process.
The ORF-finder
(Open Reading Frame Finder) is available at the National Center for
Biotechnology Information (NCBI). The sequence is submitted as a GI or
accession number, or in FASTA format. The program translates it in all six
reading frames and returns a graphic that indicates the location of each ORF
found. The sequences of the predicted protein products can be submitted
directly for BLAST similarity searching. The program identifies the open
reading frames using the standard or alternative genetic codes. If you are
not sure which genetic code to apply for the organism under investigation,
check the genetic code at NCBI's Taxonomy Browser or search the database for
codon usage at the Kazusa DNA Research Institute (KDRI) in Japan.
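To make the six-frame translation concrete, here is a small Python sketch.
It is not the ORF-finder's own code; the function names are made up for
this example, and it hard-codes the standard genetic code. Under the same
assumptions, an alternative genetic code could be applied by swapping in a
different amino acid string.

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G codon order
# (first base changes slowest, third base fastest); '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    usable = len(seq) - len(seq) % 3  # drop an incomplete trailing codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translation(seq):
    """Return the translations of frames +1..+3 and -1..-3 as a dict."""
    seq = seq.upper()
    minus = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(minus[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(name, protein)
```

Stretches between two '*' characters in any of the six strings correspond to
open reading frames, which is what the graphical output of such a tool
summarizes.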
GeneMachine
from the National Human Genome Research Institute (NHGRI) is an integrated
tool intended to perform both comparative and predictive gene identification
techniques in a single run. The result file (returned in ASN.1 format by
E-mail) can then be viewed using NCBI's Sequin.
The integrated analysis programs are
- GRAIL for internal coding exon prediction,
- MZEF for coding exon prediction,
- GENSCAN for gene structure prediction,
- FGENES for gene structure prediction,
- RepeatMasker for complexity and interspersed repeat prediction,
- Sputnik for repeat prediction, and
- BLASTX and BLASTN for sequence homology searches.
Nevertheless, computational analysis and gene prediction methods have their
limits. It is still hard to find RNA genes (genes that function at the RNA
level) and very small genes. It is also a challenge to explore
alternative splicing and the multiple protein products that a single gene
can encode. However, the latest news about a program for the "Computational
identification of promoters and first exons in the human genome" comes from
R.V. Davuluri, I. Grosse and M.Q. Zhang at Cold Spring Harbor Laboratory (NY).
Their publication in Nature Genetics describes a new program called FirstEF
(First Exon Finder). Using the comparative approach, the program is reported
to recognize features ~500 bp on either side of the first exon, thus
identifying potential promoters and coding and/or non-coding first exons.
Even though the program was developed for the annotation of promoter regions
and first exons in the human genome, the authors claim it is also useful for
the annotation of other mammalian genomes.
Further reading:
- A Bibliography on Computational Gene Recognition
- Mount, David W. 2001. Bioinformatics: Sequence and Genome Analysis, Chapter 8. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York
- Baxevanis, Andreas D. and Ouellette, Francis B.F. 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Chapter 10. Wiley-Interscience
- Davuluri, Ramana V., Grosse, Ivo, Zhang, Michael Q. 2001. Computational identification of promoters and first exons in the human genome. Nature Genetics 29:412-417