FELINES:

A utility for extracting and examining EST-defined introns and exons

Written by Scott D. Drabenstot, Doris M. Kupfer (1), James D. White (1), David W. Dyer, Bruce A. Roe (1), Kent L. Buchanan (2) and Juneann W. Murphy*

Department of Microbiology and Immunology, University of Oklahoma Health Sciences Center, PO Box 26901, BMSB 1053, Oklahoma City, OK 73190, USA, (1) Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019, USA and (2) Department of Microbiology and Immunology, Tulane University Medical Center, New Orleans, LA 70112, USA

*To whom correspondence should be addressed. Tel: +1 405 271 6622; Fax: +1 405 271 3117; Email: juneann-murphy@ouhsc.edu


Last Updated: October 9, 2003

FELINES -- Finding and Examining Lots of Intron 'N' Exon Structures

BACKGROUND:

FELINES is an intron and exon alignment, extraction, and analysis utility. It was designed to automate these tasks. Most of the file names play on the cat theme of FELINES.

  • wiscrs.pl -- (write individual sequence comparisons, run spidey) pronounced like "whiskers". This front end of the FELINES utility does much of the "dirty work" of getting the sequences into shape by pairing and aligning them.

  • gumbie.pl -- (gather up mRNA based introns and exons) pronounced like "Gumby", the famous action figure. Special thanks to T. S. Eliot and his cat poems for this name. This program "sits, and sits, and sits, and sits" until you need it. Then it gets the Spidey alignments into order.

  • icat.pl -- (intron consensus analysis tool) pronounced like "I cat". This invaluable tool is everywhere in the introns, identifying potential branchsites, polypyrimidine tracts and 5' and 3' splice sites in the introns.

  • cattracts.pl -- (catenate tracts) pronounced "cat tracks". It looks for evidence of user-defined conserved motifs. Then it catenates overlapping instances of any of the motifs that are identified.

  • findnmers.pl -- pronounced "findin' mers" or "find nmers". This program finds all instances of every hexamer, or pentamer, or heptamer, etc. (as the user desires). These are then provided in one output file.
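
The catenation step described for cattracts.pl above can be pictured as merging overlapping intervals. The following is a minimal Python sketch of that idea, not the cattracts.pl code itself. It assumes that motifs are given as regular expressions, that positions are 0-based with an exclusive end, and that hits which merely touch are not merged; the function names are hypothetical.

```python
import re

def find_hits(seq, motifs):
    """Return sorted (start, end) pairs for every motif occurrence in seq."""
    hits = []
    for motif in motifs:
        # The zero-width lookahead lets overlapping occurrences of the
        # same motif be reported, e.g. TTT twice inside TTTT.
        for m in re.finditer("(?=(" + motif + "))", seq):
            hits.append((m.start(1), m.end(1)))
    return sorted(hits)

def catenate(hits):
    """Merge overlapping hits into single tracts."""
    tracts = []
    for start, end in sorted(hits):
        if tracts and start < tracts[-1][1]:
            # This hit overlaps the previous tract, so extend that tract.
            tracts[-1] = (tracts[-1][0], max(tracts[-1][1], end))
        else:
            tracts.append((start, end))
    return tracts

if __name__ == "__main__":
    sequence = "ACTTTTCTGAGGTTTA"
    print(catenate(find_hits(sequence, ["TTT", "TCT"])))
```

Sorting the hits first means each hit only has to be compared with the most recent tract, so the merge is a single pass.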


    BASICS:

    This package provides a set of Perl scripts that extends the productivity of widely available programs, enabling more complex analysis of genomic data. This package is especially useful for creating intron and exon databases using EST to genomic sequence alignments and for identifying conserved motifs in DNA/RNA sequences. The package is divided into three different layers:

    Alignment Layer:

    Each EST sequence is paired to its homologous genomic sequence and the EST sequence is aligned to the genomic sequence.

  • wiscrs.pl -- a program for pairing EST sequences with their corresponding genomic sequences. This tool requires the blastall executable and assumes that wiscrs.pl resides in the same directory as blastall. Detailed help is available within the program.

    Extraction Layer:

    Using the alignments from above, the intron and exon regions are extracted, filtered, and added to their respective datasets.

  • gumbie.pl -- a program that parses EST to genomic alignments to filter and construct intron and exon sequence databases. This program requires the Statistics::Lite module from www.cpan.org. It takes the summary files output by Spidey and extracts the genomic coordinates of the intron and exon borders. Using these coordinates, introns and exons are extracted from the individual genomic files. Detailed help is available within the program.
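
    The coordinate-based extraction described above can be illustrated with a short Python sketch. It is not the gumbie.pl code and does not parse Spidey output. It assumes the exon borders are already available as 1-based, inclusive (start, end) pairs on the forward strand of the genomic sequence; the function names are hypothetical.

```python
def introns_from_exons(exons):
    """Return intron (start, end) pairs lying between consecutive exons.

    exons: list of (start, end) pairs, 1-based and inclusive (an assumption).
    """
    exons = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        # An intron exists only if at least one base separates the exons.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns

def extract(genomic, coords):
    """Cut the given 1-based inclusive regions out of a genomic string."""
    return [genomic[start - 1:end] for start, end in coords]

if __name__ == "__main__":
    genomic = "AAAA" + "GTCCCAG" + "TTTT"
    exons = [(1, 4), (12, 15)]
    introns = introns_from_exons(exons)
    print(introns, extract(genomic, introns), extract(genomic, exons))
```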

    Analysis Layer:

    In the analysis layer, the sequences are searched for consensus sequence elements.

  • icat.pl -- a program used to filter intron databases imported into FELINES and analyze consensus intron motifs. This program requires the Statistics::Lite module from www.cpan.org. It can analyze any flat file of introns in FASTA format. It identifies the branchsite and end regions of each intron, and can then be used to identify the consensus patterns for the splice sites and the branchsite. Detailed help is available within the program.

  • findnmers.pl -- a tool to identify over-represented fixed-length polynucleotide sequences (sequence motifs). This program requires the Statistics::Lite module from www.cpan.org. It accepts any FASTA DNA/RNA files. Detailed help is available within the program. The Z score of each nmer is its number of occurrences minus the theoretical mean number of occurrences for the length of the sequence(s) examined, where the mean is based upon all possible different nmers. This difference is then divided by the actual standard deviation of the counts of all nmers in the sequence(s) examined.

  • cattracts.pl -- a FASTA search tool enabling searches for multiple consensus elements simultaneously. The program cattracts.pl requires the Statistics::Lite module from www.cpan.org. This program is useful for characterizing the location and number of sequence motifs. Detailed help is available within the program.
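
    The Z score described for findnmers.pl above can be sketched in Python as follows. This is one reading of that description rather than the findnmers.pl code. It assumes a four-letter ACGT alphabet, takes the theoretical mean to be the number of counted windows divided by the number of possible nmers, uses the sample standard deviation over the counts of all possible nmers (zero counts included), and ignores windows containing any other character.

```python
from collections import Counter
from itertools import product
from statistics import stdev

def nmer_z_scores(seqs, n):
    """Return {nmer: Z score} for all 4**n possible nmers."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        # Slide a window of length n along the sequence, one base at a time.
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    all_nmers = ["".join(p) for p in product("ACGT", repeat=n)]
    # Windows containing non-ACGT characters never appear in all_nmers,
    # so they drop out of the statistics here.
    observed = [counts[k] for k in all_nmers]
    expected = sum(observed) / len(all_nmers)
    sd = stdev(observed)
    if sd == 0:
        # Every nmer occurs equally often, so nothing is over-represented.
        return {k: 0.0 for k in all_nmers}
    return {k: (counts[k] - expected) / sd for k in all_nmers}

if __name__ == "__main__":
    scores = nmer_z_scores(["ACGTACGTTTTTTT"], 2)
    for nmer, z in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
        print(nmer, round(z, 2))
```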


    OBTAINING THE FELINES PROGRAMS:

    The FELINES programs are located at ftp://ftp.genome.ou.edu/pub/programs/.

  • FELINES_options

  • wiscrs.pl

  • gumbie.pl

  • icat.pl

  • cattracts.pl

  • findnmers.pl

    The programs of the FELINES utility are most easily used when placed in a directory that is part of the path (e.g., /usr/local/bin/). FELINES also depends on the third-party programs listed below. The FELINES authors cannot control changes made by the authors of those programs, and such changes may break FELINES.

    The package has been developed using the programs below. If any of these programs are not currently available locally, they will need to be downloaded and installed in the directory path or in their proper places.

  • standalone Spidey.linux.exe 1.35 (ftp://ftp.ncbi.nih.gov/pub/wheelan/Spidey/)

  • standalone Blast, version 2.2.2 (ftp://ftp.ncbi.nih.gov/blast/executables/)

  • Perl, version 5.8.0 (http://www.perl.org/)

  • Statistics::Lite, version 1.02 (http://www.cpan.org/)

  • all2many -- a tool for separating the genomic and EST files into separate directories. It is useful but not required. (http://www.genome.ou.edu/informatics.html)

    All testing was done on a Linux PC system; the package should work on any similar system.


    Using FELINES (short version):

    1. Create a subdirectory.

    2. Run all2many on EST and Genomic sequence FASTA files.

    3. Make genomic and EST list files.

    4. Customize options file.

    5. Run wiscrs.pl

    6. Run gumbie.pl

    7. Run icat.pl or findnmers.pl or cattracts.pl


    Using FELINES (extended version):

    To use FELINES, first create a subdirectory below the directory in which FELINES will be run. Place two FASTA files in this subdirectory: one containing the genomic sequences and one containing the EST sequences. Check the sequence names in both of these files. If they exceed about 20 characters, Spidey will truncate them and the FELINES utility will choke. This is also a good time to change any X's in either file into N's, because Spidey doesn't accept X's. You may want to remove N's as well if there are a substantial number. While you are at it, if the EST and genomic sequence names are similar, change one group. The computer will have no problem keeping them separated (provided they have at least slightly different names), but it may test your sanity.
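
    The checks above are easy to script. The Python sketch below is an illustration only and is not part of FELINES. It assumes the sequence name is the first word after ">" on each header line, and it treats the "about 20 characters" limit as a hard threshold of 20; the function name is hypothetical.

```python
def clean_fasta(lines, max_name=20):
    """Replace X with N in sequence lines and report overlong names.

    Returns (cleaned_lines, long_names). Header lines are left unchanged.
    """
    cleaned, long_names = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            if len(name) > max_name:
                long_names.append(name)
            cleaned.append(line)
        else:
            # Preserve case while swapping the ambiguity character.
            cleaned.append(line.replace("X", "N").replace("x", "n"))
    return cleaned, long_names

if __name__ == "__main__":
    example = [">Contig5 example\n", "ACGXXTTx\n"]
    cleaned, too_long = clean_fasta(example)
    print("\n".join(cleaned))
    print("names to shorten:", too_long)
```

    The sketch only reports long names rather than shortening them, since renaming by hand keeps the EST and genomic names recognizable.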

    Next, the user will need to create a separate file for each genomic and EST sequence. The easiest way to do this is using all2many (James D. White & Bruce Roe, unpublished, http://genome.ou.edu/informatics.html). The user now has a subdirectory containing all of the EST and genomic sequences in separate files.
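
    If all2many is not at hand, the splitting step can be approximated with a few lines of Python. This sketch is a stand-in and does not reproduce all2many or its options. It assumes each output file is named after the first word of the header line and that sequence names are safe to use as file names; the function name is hypothetical.

```python
import os

def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-sequence FASTA file to its own file.

    Returns the list of paths written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A header starts a new record, so close the previous file.
                if out:
                    out.close()
                parts = line[1:].split()
                name = parts[0] if parts else "seq%d" % (len(written) + 1)
                path = os.path.join(out_dir, name)
                out = open(path, "w")
                written.append(path)
            if out:
                out.write(line)
    if out:
        out.close()
    return written
```

    Because the header line is copied unchanged, Contig5 in the combined file is still Contig5 in its own file.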

    While still in the subdirectory, format the genomic sequence database for BLAST using formatdb. It is important that the file formatted for BLAST is identical to the file split apart using all2many. This is because FELINES depends on the data being consistent (Contig5 is always Contig5).

    Back in the main directory, create two files. The first will contain the file names of all of the EST sequence files. The second will contain the names of all of the genomic sequence files. Each line holds one file name, written as the relative location of that file. Also, place a copy of the options file in this directory and edit it to reflect this layout. While in the options file, adjust the other parameters to suit your needs.
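
    The two list files can be written by hand or generated. The Python sketch below shows one way to generate them and is not part of FELINES. It assumes the EST and genomic files can be told apart by a name prefix; the function name and the prefixes and file names in the usage note are hypothetical.

```python
import os

def write_list_file(seq_dir, list_path, prefix):
    """Write one path per line for every file in seq_dir starting with prefix.

    Paths are built from seq_dir, so a relative seq_dir gives relative paths.
    Returns the sorted list of matching file names.
    """
    names = sorted(n for n in os.listdir(seq_dir) if n.startswith(prefix))
    with open(list_path, "w") as out:
        for name in names:
            out.write(os.path.join(seq_dir, name) + "\n")
    return names
```

    For example, write_list_file("seqs", "est.list", "EST_") and write_list_file("seqs", "genomic.list", "Contig") would produce the two files, given a subdirectory named seqs and sequences named accordingly.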

    There are now three files in the main directory to be used and many files in the subdirectory. At this point you can run wiscrs.pl. All you have to do is enter the -O flag at the command line with the name of the options file you are using.

    Now you can run gumbie.pl. At the command line give gumbie.pl the name of the alignment files (with the -S flag), the name of the options file (with the -O flag) and any other flags that you want to use. Be sure not to mix these up. Gumbie.pl may be a "smart cat" but it can't read your mind. This sometimes catches the authors.
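
    To keep the two flags straight, the command lines can be built in a small Python sketch such as the one below. Only the -O and -S flags come from the description above. The alignment file name is hypothetical, a single alignment file is assumed, and the scripts are assumed to be on the path.

```python
def wiscrs_command(options_file):
    """Build the wiscrs.pl command line: -O names the options file."""
    return ["wiscrs.pl", "-O", options_file]

def gumbie_command(alignment_file, options_file):
    """Build the gumbie.pl command line.

    -S names the alignment file and -O names the options file.
    """
    return ["gumbie.pl", "-S", alignment_file, "-O", options_file]

# Printing the commands shows what would be typed at the shell. Passing
# either list to subprocess.run(cmd, check=True) would run it instead.
print(" ".join(wiscrs_command("FELINES_options")))
print(" ".join(gumbie_command("spidey_alignments.txt", "FELINES_options")))
```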

    Okay, now you have run wiscrs.pl and gumbie.pl. Now it's time to do the real analysis. Play around with the other programs. Each FELINES component prints its help if you run it without any flags. If you get really stuck, contact the authors (Scott Drabenstot).



    Bruce Roe, broe@ou.edu