Database of Orthologous Mammalian INtrons of 5 species
Human, Cow, Dog, Mouse and Rat

Alexei Fedorov, Ph.D.
Associate Professor
Department of Medicine,
University of Toledo,
email:
Alexei.Fedorov@utoledo.edu



The Database of Orthologous Mammalian INtrons Of 5 species (DOMINO5) is the latest version of the Mammalian Orthologous Intron Database (Fedorov et al. 2006).

To build a database of orthologous introns, we used five mammalian species: human, cow, dog, rat and mouse. We selected species whose full-length genomic DNA sequences have been determined and are available from NCBI. The non-human species were chosen to be evolutionarily distant from human, so that the orthologous introns in the database more accurately reflect those present in the genome of a common ancestor of mammals. The evolutionary distance between the species also makes it likely that the introns in the database are functionally significant, since they have been conserved within the gene across millions of years of speciation.

Proteins, the chief actors within the cell, translate the genetic code within the DNA into the initiation or regulation of various cellular processes. Evolutionary selection pressure acts directly on the structure of a protein, which changes when the nucleotide sequence coding for it changes. Preserving protein function across species therefore requires the corresponding DNA region to be conserved. This explains the high degree of exon conservation between species that are evolutionarily distant.

As with exons, many studies have shown extended regions of high sequence conservation in introns (e.g., nPTB [1], FGFR1, FGFR2 [2,3]). These highly conserved regions might be splice sites, transcription factor binding sites or non-coding RNAs (ncRNA) embedded within intronic sequences [4]. They might also be responsible for the conservation of intron position within a gene across species; introns that share such a position can be identified as orthologous introns.

Method:
1) Homologous protein database construction:
The first step in constructing DOMINO5 was to create a subset of homologous proteins that have a high degree of sequence similarity across the five selected species. This was done using BLAST [5]: all human protein sequences were searched against the proteins of each of the other four species, and the results for each species were compared with the results for the others. Only BLAST hits with a bit score above 80 were taken into consideration, which keeps the e-value of each hit below 2×10^-16.
A protein often has more than one isoform, and several isoforms may map onto a single isoform in another species. A simple one-way BLAST search (human proteins searched against the other organism's database) would introduce duplications and inaccuracies into the database, because more than one human isoform could show homology with a single isoform of the protein in the other organism. To establish the true association between a human isoform and the corresponding isoform in the other species, we also ran BLAST with all the proteins of the other species against the human proteins. This also resolves cases where multiple isoforms in an organism show homology with one human protein. The BLAST results for human against each of the other organisms were then compiled into one group common to all five organisms. This list includes the proteins presumed to derive from the common ancestor of the five species.
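
The two-way filtering described above can be illustrated with a minimal Python sketch. It assumes that "true association" is approximated by reciprocal best hits, which the text does not state explicitly, and that BLAST results have already been parsed into (query, subject, bit score) tuples. All function names and identifiers here are hypothetical and are not part of the original pipeline.

```python
BIT_SCORE_CUTOFF = 80.0  # threshold stated in the text; hits must score above it


def best_hits(hits, cutoff=BIT_SCORE_CUTOFF):
    """Map each query to its highest-scoring subject above the cutoff.

    `hits` is an iterable of (query_id, subject_id, bit_score) tuples,
    for example parsed from tabular BLAST output (an assumed input format).
    """
    best = {}
    for query, subject, score in hits:
        if score <= cutoff:
            continue  # discard weak hits
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_pairs(forward_hits, reverse_hits):
    """Keep human -> other pairs that the other -> human search confirms.

    Assumption: reciprocal best hits stand in for the isoform matching
    described in the text.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return {h: o for h, o in forward.items() if reverse.get(o) == h}


def common_to_all(pairs_by_species):
    """Return human proteins that have a confirmed partner in every species."""
    if not pairs_by_species:
        return {}
    shared = set.intersection(*(set(p) for p in pairs_by_species.values()))
    return {
        h: {species: pairs[h] for species, pairs in pairs_by_species.items()}
        for h in sorted(shared)
    }
```

With one entry per non-human species in `pairs_by_species`, `common_to_all` yields only those human proteins that have a confirmed homolog in all four other species.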

2) Finding Orthologous introns from homologous proteins:
An orthologous intron is defined as an intron that has the same location between corresponding exons in all five species considered for the database. Intron length and sequence do not contribute to determining orthology, which is based solely on the intron's position and phase within the DNA sequence for that protein. Extracting this information from the set of homologous proteins requires analysis of the DNA sequences for the corresponding proteins in all the species. This analysis was performed with a Perl program, CIPgenome.pl, which maps the introns onto the protein BLAST alignment between two of the organisms. Another Perl program analyzed the output of CIPgenome.pl and collected all instances where the introns were at the same corresponding position in the two organisms and had the same phase.
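
The position-and-phase criterion can be sketched in Python as follows. This is not the CIPgenome.pl implementation, whose internals are not described here. It assumes each gene is given as a list of coding exon lengths and each protein pair as two gapped alignment strings of equal length. All names are hypothetical.

```python
def intron_sites(exon_lengths):
    """Return (residue_index, phase) for each intron in a coding sequence.

    `exon_lengths` lists the coding lengths, in nucleotides, of consecutive
    exons. An intron preceded by n coding nucleotides interrupts codon
    n // 3 (0-based) and has phase n % 3.
    """
    sites = []
    offset = 0
    for length in exon_lengths[:-1]:  # an intron follows every exon but the last
        offset += length
        sites.append((offset // 3, offset % 3))
    return sites


def residue_to_column(aligned_seq):
    """Map ungapped residue indices to alignment columns ('-' marks a gap)."""
    return [col for col, ch in enumerate(aligned_seq) if ch != "-"]


def matching_introns(aligned_a, sites_a, aligned_b, sites_b):
    """Pair introns that share an alignment column and a phase.

    Returns (index_in_a, index_in_b) tuples. Intron length and sequence are
    deliberately ignored, as in the definition above.
    """
    cols_a = residue_to_column(aligned_a)
    cols_b = residue_to_column(aligned_b)
    placed_b = {
        (cols_b[res], phase): j
        for j, (res, phase) in enumerate(sites_b)
        if res < len(cols_b)
    }
    matches = []
    for i, (res, phase) in enumerate(sites_a):
        if res >= len(cols_a):
            continue  # intron lies outside the aligned region
        j = placed_b.get((cols_a[res], phase))
        if j is not None:
            matches.append((i, j))
    return matches
```

Requiring an exact column match is a strict simplification. A real pipeline might tolerate small shifts near alignment gaps, which is not attempted here.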

3) Grouping and Aligning orthologous introns:
The orthologous introns from each pair of species were compared and pruned to produce one set of orthologous introns common to all five species. The respective introns were retrieved from the intron-exon database and placed into groups (orthologous intron groups). The orthologous intron groups were then aligned using the standalone MAFFT multiple alignment software (MAFFT L-INS-i).
These multiple alignments are available on the webpage: http://bpg.utoledo.edu/~aprakash/orthologous_introns_alignments.html.
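
As an illustration of the alignment step, the sketch below writes one orthologous intron group as FASTA text and builds a MAFFT command line for the L-INS-i strategy, which MAFFT documents as `--localpair --maxiterate 1000`. The group identifiers, header format, file paths and function names are hypothetical. It assumes the `mafft` executable is installed and on the PATH, and the options actually used to build the database are not stated in the text.

```python
import subprocess


def group_to_fasta(group_id, introns):
    """Format one orthologous intron group as FASTA text.

    `introns` maps a species name to its intron nucleotide sequence.
    The header layout (group id, '|', species) is an assumption.
    """
    lines = []
    for species in sorted(introns):
        lines.append(f">{group_id}|{species}")
        seq = introns[species]
        # wrap sequence lines at 60 characters
        lines.extend(seq[i:i + 60] for i in range(0, len(seq), 60))
    return "\n".join(lines) + "\n"


def mafft_linsi_command(fasta_path):
    """Build the MAFFT command line for the L-INS-i strategy."""
    return ["mafft", "--localpair", "--maxiterate", "1000", fasta_path]


def align_group(fasta_path, out_path):
    """Run MAFFT on one group; MAFFT writes the alignment to stdout."""
    with open(out_path, "w") as out:
        subprocess.run(mafft_linsi_command(fasta_path), stdout=out, check=True)
```

Calling `align_group("group_0001.fasta", "group_0001.aln")` would align a single group. The file names are placeholders.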