Eukaryotic Gene Prediction Lab

by Samuel S. Shepard

Steps

Download the lecture slides and watch the lecture video on basic Markov chains. Read this primer on HMMs (Hidden Markov models).

Check out the genetic code on Wikipedia. Notice that there are three STOP codons: TAA, TGA, and TAG.
1. Download the coding sequence for POLA1. The sequence contains no untranslated regions or introns. Each consecutive triplet is a codon when you start from ATG, the trusty start codon.
2. Download my custom Perl script to examine the codon usage of POLA1. Save the script and the POLA1 sequence to your home directory (the Cygwin home directory on Windows), then pick one of the following options:
  1. Option 1: on Mac OS X, open /Applications/Utilities/Terminal.app to run the script.
  2. Option 2: run the script from your Linux/Unix bash shell; Perl is installed by default on many systems.
  3. Option 3: run the script from your Cygwin bash shell on Windows. Cygwin will already be installed if you followed the instructions from the SVM lab.
3. Usage:
```
perl codonUsage.txt <cds.fasta>
```
4. How often does each of the three stop codons occur in POLA1? How about the start codon?
5. What is the most frequent codon for the whole genome? Is that the most frequent codon for POLA1? Report these answers in your homework.
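
If you are curious what in-frame codon counting looks like, here is a minimal Python sketch. It is not the lab's Perl script and its output format is an assumption; the function names and the toy sequence are made up for illustration. It reads triplets starting at the first base, which is only valid for a clean CDS like the POLA1 file.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TGA", "TAG"}


def read_fasta_sequence(text):
    """Join the non-header lines of a single-record FASTA string into one uppercase sequence."""
    return "".join(
        line.strip() for line in text.splitlines() if line and not line.startswith(">")
    ).upper()


def codon_usage(cds):
    """Count in-frame codons, reading triplets from position 0 (the A of ATG)."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


# Toy CDS (hypothetical, not POLA1): ATG GCC GCC AAA TGA
toy_fasta = ">toy\nATGGCCGCC\nAAATGA\n"
counts = codon_usage(read_fasta_sequence(toy_fasta))

for codon, n in counts.most_common():
    if codon in STOP_CODONS:
        label = "stop"
    elif codon == "ATG":
        label = "start/Met"
    else:
        label = ""
    print(codon, n, label)
```

To try it on the real data, read the contents of your POLA1 FASTA file into a string and pass it to `read_fasta_sequence` in place of `toy_fasta`.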

Go to our Genomic MRI website.
1. Use POLA1_cds.fasta.txt from (2.i) to start an MRI session. Note: SRI Analyzer/Generator are frameless algorithms and slide along the sequence nucleotide by nucleotide.
2. Analyze the file using SRI Analyzer (up to mer-level 6). Notice that you start getting zeros for 4-mers and up.
3. Next, use SRI Generator to create a randomized sequence based on your POLA1 sequence. Set the mer-level to 3. Here is mine if you have trouble.
4. Download the randomized file and analyze its codon usage using the Perl program. How many stop codons are there? Do the five least frequent codons differ between the randomized sequence and POLA1? Explain your answer in your homework.
5. Hint:
```
perl codonUsage.txt <file> | sort -k3 -n | head -n5
```
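
To see the idea behind frameless counting and mer-level randomization, here is a rough Python sketch. The lab does not describe how SRI Analyzer or SRI Generator work internally, so treat this as an assumption: it counts overlapping k-mers one nucleotide at a time, then builds a random sequence from an order-(k-1) Markov chain estimated from those k-mers. All function names and the toy sequence are hypothetical.

```python
import random
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Frameless k-mer counts: the window slides one nucleotide at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unseen_kmers(seq, k):
    """How many of the 4**k possible k-mers never occur (assumes only A, C, G, T)."""
    return 4 ** k - len(kmer_counts(seq, k))


def markov_randomize(seq, k=3, rng=None):
    """Emit a random sequence of len(seq) from an order-(k-1) Markov chain
    estimated from the overlapping k-mers of seq (requires k >= 2)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = rng or random.Random()
    base_freq = Counter(seq)
    transitions = defaultdict(Counter)  # (k-1)-mer context -> counts of the next base
    for i in range(len(seq) - k + 1):
        transitions[seq[i:i + k - 1]][seq[i + k - 1]] += 1
    out = list(seq[:k - 1])  # seed with the original opening context
    while len(out) < len(seq):
        context = "".join(out[-(k - 1):])
        # Fall back to overall base composition if this context has no successor.
        options = transitions.get(context) or base_freq
        bases = list(options)
        weights = [options[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


toy = "ATGGCCGCCAAATGA" * 20  # hypothetical toy sequence, not POLA1
shuffled = markov_randomize(toy, k=3, rng=random.Random(1))
for k in range(1, 5):
    # columns: k, unseen k-mers in the original, unseen k-mers in the randomized copy
    print(k, unseen_kmers(toy, k), unseen_kmers(shuffled, k))
```

Because the chain only remembers the previous two bases when `k=3`, the randomized sequence keeps roughly the same 3-mer composition but has no notion of reading frame, which is worth keeping in mind when you count its in-frame stop codons.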

Try out gene prediction on the web.
1. Do some web searches on gene prediction programs. What technology is the most commonly used for gene prediction (Markov model related [HMM, GHMM, MM], neural networks, SVM, etc.)? Make a list of the gene prediction programs you find and their core technology in your homework.
2. GeneMark™ is the gene prediction software produced by the Borodovsky lab at GA Tech. "E" is for "Eukaryotic" and "S" is for "Self-training" in the context of "ES-3.0". Self-training requires the input sequence only and no training set; however, you must provide a sufficiently long sequence to self-train (100+ kb). Pre-built self-trained models are offered as well as standard "trained" models for different species.
3. Using Eukaryotic GeneMark.hmm, select H. sapiens from the species list, upload your sequence for POLA1, and then hit the "Start" button below. Observe the result and graph. Black bars (not lines) on the graph mark a prediction. Thin lines mark open reading frames and dashes mark stop codons. A perfect prediction would be 1..4389.
4. Now copy the whole mRNA (with UTRs) into the box and do the same prediction. Is the start site correctly predicted now? Compare with the GenBank CDS annotation to find out!
5. Notice, then, that sequence context is important to gene prediction.

Not in love with gene prediction yet? You might need a little oxytocin. No problem, just grab the GenBank sequence!

1. This sequence is the gene sequence, or pre-mRNA. It has UTRs and introns in addition to coding sequences. How many exons does it have? What are the coordinates for the coding sequences (CDS)? Report your answers in the homework.
2. The sequence viewer at PubMed is also pretty handy.
3. Run the FASTA sequence of OXT on GeneMark.hmm using your bioinformatics skills.
4. Do the predicted coordinates match the GenBank annotation? Look at the PDF generated by GeneMark: what frames are the exons in? What strand?
5. Repeat the prediction with GeneMark.hmm, only this time choose O. sativa (rice) as your species. Try a couple more species as well. Do you think that gene prediction is species-specific? How far away phylogenetically is too far? Check out the codon usage pattern for the O. sativa genome.
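
When you read CDS coordinates off a GenBank record, it can help to turn the location string into numbers. The following Python sketch handles only simple locations of the form `join(a..b,c..d)` or `complement(...)`; it ignores partial markers such as `<` and `>`. The function names are hypothetical and the coordinates are invented, not the OXT annotation.

```python
import re


def parse_location(location):
    """Parse a simple GenBank-style location into (strand, [(start, end), ...])."""
    strand = "-" if "complement(" in location else "+"
    segments = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return strand, segments


def cds_length(segments):
    """Total number of coding bases across all segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in segments)


# Hypothetical location string for illustration only
location = "join(10..51,120..200)"
strand, segments = parse_location(location)

print("strand:", strand)
for number, (start, end) in enumerate(segments, start=1):
    print(f"CDS segment {number}: {start}..{end} ({end - start + 1} nt)")
total = cds_length(segments)
print("total CDS length:", total, "multiple of 3:", total % 3 == 0)
```

A complete CDS should have a total length divisible by three, which makes a quick sanity check when you copy coordinates by hand.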

Examine TMEM23 (a.k.a. SGMS1).

1. View the protein in Sequence Viewer.
2. How many coding exons are there? How many UTR exons? Write down the coordinates for each element.
3. Download the FASTA sequence for TMEM23 and run it in GeneMark.
4. How many coding exons are predicted correctly? How many UTR exons? (If you didn't realize there was a difference, your bio prof may be...unhappy.) What are the sizes of the exons predicted correctly and of those missed? How many genes are predicted? Download my worksheet if you need extra help. Report your answers in the homework.
5. How does the prediction of coding exons differ from that of non-coding exons?
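
If you would rather not compare exon tables by eye, a small Python sketch like the one below can sort annotated exons into exact matches, partial overlaps, and misses. The function name and every coordinate here are hypothetical; substitute the annotated and predicted coordinates you wrote down for TMEM23.

```python
def compare_exons(annotated, predicted):
    """Classify each annotated exon as exactly predicted, partially overlapped, or missed.

    Both arguments are lists of (start, end) pairs with 1-based inclusive coordinates.
    """
    predicted_set = set(predicted)
    result = {"exact": [], "partial": [], "missed": []}
    for start, end in annotated:
        if (start, end) in predicted_set:
            result["exact"].append((start, end))
        elif any(p_start <= end and start <= p_end for p_start, p_end in predicted):
            result["partial"].append((start, end))
        else:
            result["missed"].append((start, end))
    return result


# Hypothetical coordinates for illustration only (not TMEM23)
annotated_exons = [(100, 250), (400, 520), (900, 960)]
predicted_exons = [(100, 250), (410, 520)]

comparison = compare_exons(annotated_exons, predicted_exons)
for category, exons in comparison.items():
    sizes = [end - start + 1 for start, end in exons]
    print(category, exons, "sizes:", sizes)
```

Listing exon sizes next to each category makes it easier to see whether the missed exons tend to be short ones.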

Last updated: 2.2011 | Author: Samuel S. Shepard, Ph.D. | Contact: sammysheep@gmail.com