Informatics at the University of Oklahoma's Advanced Center for Genome Technology
In addition to the programs and scripts listed below, other programs
developed at the Oklahoma University Genome Center can be obtained
via ftp from the
University of Oklahoma ACGT ftp site.
The programs described in:
Drabenstot, S. et al. 2003
"FELINES: A utility for extracting and examining
EST-defined introns and exons" Nucleic Acids Research 31, e141
can be accessed through
the FELINES package hyperlink.
Programs and scripts developed by our informatics group to aid
contig assembly, proofreading, and data quality checking, with links for downloading and documentation
- 454_base36 - perl script to convert 454/Roche base 36 universal accession number strings to/from numbers, X_Y coordinates, and timestamps.
Click here for full documentation.
- 454_multi_asm -
perl program to build a shell script to perform three 454 Newbler assemblies on 454 data (two assemblies with shortened reads) and copy the assembled contigs into chromat_dir with reduced quality scores for Phrap assembly.
Updated July 23, 2010
- all2many - perl script to split a fasta
file with multiple contigs into separate fasta files.
- autofish -
a set of tools used to search shotgun reads for those
reads that match a set of query contig sequences in order to close
gaps in a sequence assembly. Autofish performs the search using hash
words of the sequences. It is less sensitive than blastn, but performs
alignments and rejects matches with high-quality mismatches. Once the
matching reads have been identified, they can be extracted from the
input files and used to improve the assembled sequence.
Click here for more information.
- blast2many - perl script to split a blast
output file with results of multiple queries into separate blast output files.
Click here for full documentation.
- Blast2table - A program
to parse Blast output using BioPerl's Bio::Tools::Blast.pm and to write
the data from each HSP in tabular form in a variety of formats. For some
formats, the data may be modified to display the hit name as an HTML link
to Genbank. The data optionally can be sorted in various ways. Blast2table
processes a list of files named on the command line, or uses standard
input if no filename is given. Output is written to standard output.
Click here for full documentation.
NOTE: Requires the module Bio::SearchIO.pm from BioPerl to be installed.
See http://www.bioperl.org/ for more information.
Updated October 25, 2010
- blast_find_keywords - program
to read a file of Blast hits (from Blast2table, above), an outline file containing a list of keywords, and
an optional file containing EC numbers with matching names to augment the keyword list. (The EC file is the
ENZYME nomenclature database, "enzyme.dat", available from "http://www.expasy.org/enzyme/".) The program uses
the keyword list to select the "best" keyword for each "contig". The program also finds all other blast
hits with better scores than the best hit matching any EC number or keyword. The program produces lists
of the top blast hits, sorted by contig and then blast score. Once the outline file has been finalized,
the report format can be requested. blast_find_keywords replaces older programs blast_best_contigs,
blast_best_keyword, blast_best_nonkeyword, blast_print_keywords, and blast_sort_keywords.
Click here for full documentation.
- codon_usage - A program to get codon usage of nucleotide sequences of proteins in Fasta
format. For general DNA sequences (not just the in-frame DNA sequences of
proteins) the program computes di- and tri-nucleotide usage frequencies at
each base location for the forward, reverse, and/or both strands. Click here for full documentation.
- copy_454_rundata - perl script to copy subsets of 454/Roche GS20/FLX run data from a run folder to a project folder.
Click here for full documentation.
Updated July 23, 2010
- EMBL_feature_programs - a
collection of Perl scripts for working with EMBL feature files. These
programs are used at the University of Oklahoma Advanced Center for
Genome Technology in conjunction with Artemis from the Sanger Centre.
- exgap - Program for contig ordering, subclone and primer selection for primer walking and
graphic display of relationships between clone read pairs in contigs. Click here for full documentation.
- extract_454_paired_ends - perl script to separate 454/Roche GS20/FLX mixed paired-end and non-paired-end run data. This program is called by get_454_paired_ends.
Click here for full documentation.
Updated July 23, 2010
- extract_fasta - perl script
to extract all or a portion of a contig from a fasta sequence file,
and optionally the corresponding portion of a contig from a fasta qual file. The output contig may be
reversed and complemented. The output contig name may be shortened automatically. Click here for full documentation.
Updated July 23, 2010
- fastaq2phd -
perl program to convert all or part of a fasta sequence file and a matching fasta
quality file to a phd file. Also can be used to convert a multi-contig fasta
file into a set of phd files in a directory.
Click here for full documentation.
- find_fasta - a perl program to search a fasta file for patterns, which may be perl-style
regular expressions. The input sequences may be either nucleotide or amino
acid sequences, but IUB codes to represent inexact searches are not allowed;
patterns are used instead. All matches are reported. Click here for full documentation.
Updated June 22, 2010
- fix_454_rundata -
perl program to work around read misnaming in SFF files for non-Titanium runs on FLX. Also used to run processing programs before or after file transfers.
Click here for full documentation.
Updated October 25, 2010
- ftp_chromats_mac - a Perl program to transfer ABI 377 chromatogram files from a Mac to a
Unix host for further processing. The script automates the file transfer process and provides
safeguards against overwriting existing data or resending the same data twice. ftp_chromats_mac also
logs all transfers locally and on the target ftp host. Requires Mac Perl. ftp_chromats_mac.readme
- ftp_chromats_windowsNT - a Perl/TK program
to transfer chromatogram files collected on ABI 3700, Molecular Dynamics MegaBACE, MJGeneSys BaseStation, or SpectruMedix
sequencers from a Windows 95/98/NT (or Unix) computer to a Unix host for further processing. The
script automates the file transfer process and provides safeguards against overwriting existing data
or resending the same data twice. ftp_chromats_windowsNT can be configured to run phred (by Brent Ewing and Phil Green at the University of Washington Genome Center) to compute
base quality scores for each sample. These quality scores are displayed to give an overview of run
quality before the data is transferred.
ftp_chromats_windowsNT also logs all transfers locally and on the target ftp host. Requires Perl
with the Perl/TK module.
Also requires OUTkForms.pm. (See below.) ftp_chromats_windowsNT has been written for easy customization and is
designed to integrate with sheet_writer.pl. ftp_chromats_windowsNT.readme
- get_454_mids - perl script to separate 454/Roche GS20/FLX run data which uses MID tags.
Click here for full documentation.
Updated July 23, 2010
- get_454_paired_ends - perl script to separate 454/Roche GS20/FLX mixed paired-end and non-paired-end run data and process paired-end data for use by phrap.
Click here for full documentation.
- get_454_pools - perl script to separate 454/Roche GS20/FLX tagged run data and run a series of programs on each separate set of data.
Click here for full documentation.
Updated July 23, 2010
- get_amino_stats - a perl program to get amino acid counts of sequences in Fasta format. Click here for full documentation.
- get_contig_ends - Get ends of contigs for blast searches to find matching ends for gap closure.
May also be used to extract contigs longer than a minimum size, to remove
leading and trailing Xs and Ns, to shorten long contig names, to reverse and
complement the output contigs, and to reformat output sequence data lines to
a specific number of bases per line.
Click here for full documentation.
- get_fasta_stats -
Get statistics of contigs in Fasta format. A fasta quality file can also be read
to give error and quality statistics. Can be used to compute mono-, di-, and
tri-nucleotide frequencies. Can produce contig length histograms and 454 read statistics
for Newbler assembled contigs.
Click here
for full documentation.
Updated October 25, 2010
- get_multi_fasta_stats - perl script to read a list of fasta input files and output stats about each file.
- get_newbler_contig_reads -
program to extract the read sequences used by the 454/Roche Newbler assembler to form a contig.
Click here
for full documentation.
New October 25, 2010
- index_contigs_by_tag - perl script to read a fasta file and create index files based on a tag sequence contained at a fixed position
in each contig. One index file is created for each unique tag sequence at that fixed position.
Alternatively, a tag_file containing a list of desired tags may be read.
Click here for full documentation.
Updated July 23, 2010
- match_contigs -
program to match contigs for gap closure based on matching blast hits for
contigs containing the same split gene. Used by match_contig_ends.
Click here
for full documentation.
- match_contig_ends - Match ends of contigs for gap closure by performing blast searches on
contig ends to look for matching split genes. Also matches contig ends against
entire contigs to look for false joins. Uses get_contig_ends and match_contigs.
Click here
for full documentation.
- maxmatch -
a tool for comparing two sequences similar to what dotter
does. Maxmatch uses a suffix tree data structure to find exact matches,
so it is much faster, but not as sensitive as dotter.
Click here for more information.
- OUTkForms.pm - Perl/TK
module for fill-in-the-blank menu applications. OUTkForms.pm provides a
consistent appearance, using table driven input to define the form and to
define simple syntax checking and error messages for the fields.
OUTkForms.pm is used by ftp_chromats_windowsNT and Sheet_writer.pl.
Requires Perl with the Perl/TK module. Help and customization information are
included in the module. Rename OUTkForms_WindowsNT_V1.0.pm to OUTkForms.pm
and place in one of the Perl @INC directories before using.
- plate_reverse - program to rename samples from a sequencing plate that was loaded
backwards. For a 96 well plate, well A01 is named as if it were in H12,
A02 is named as if it were in H11, A03 is named as if it were in H10, ...,
B01 is named as if it were in G12, ..., H12 is named as if it were in A01.
For a 384 well plate, well A01 is named as if it were in P24, A02 is named
as if it were in P23, A03 is named as if it were in P22, ..., B01 is
named as if it were in O24, ..., P24 is named as if it were in A01. Click
here for full documentation.
- primer_check - perl
program to check a primer request file for proper format and number of primers
to be synthesized by the MerMade Oligo-nucleotide Synthesizer. Primer_check
also adds control primers from a file and reports numbers and lengths of
primers from each source. Click
here for full documentation.
- PrimOU - the University of Oklahoma
version of the UT-SWMC Primo Primer Picking Program. Changes include fixing
bugs and screening for uniqueness against existing known sequence in an
entire project.
- ReArray.exe - ReArray.exe
is a self-extracting archive file for Microsoft Windows that contains a set of program
and example data files for collecting scattered samples from one or more
wells from various plates and putting them into one destination plate using
the Beckman Biomek 2000. The Biomek uses the P20 tool to transfer up to
20uL per sample. There are two Tool Control Language programs. The program
"R96_96.tcl" rearrays many 96-well microtiter plates into one 96-well
microtiter plate. The program "R96_384.tcl" rearrays many 96-well microtiter
plates into one 384-well microtiter plate.
Click
here for more information.
- rename_454_reads_to_uaccno - perl script to automatically rename old 454 reads using X-Y coordinate based names to the new
universal accession numbers. This function is available from sort_contigs -u.
Use of this program is deprecated.
- replace_454_data -
A Perl program to remove assembled 454 GS20 files
prefixed by "454_" from both phd_dir and chromat_dir, to
run get_contig_ends for trimming Ns and Xs from the ends of the 454
contigs and for removing short contigs, and to run fastaq2phd to
create new phd files and chromat placeholders using fasta sequence and
quality files assembled from 454 GS20 runs. Get_contig_ends and
fastaq2phd can be run automatically a second time to create duplicate
phd files for the contig end sequences to force phrap to treat the 454
contigs as contigs and not allow them to become singlets.
Click here for full documentation.
- report_polyphred.pl -
This script reformats polyphred output to produce three tab-delimited files
that can be opened as Excel spreadsheets for further statistical examination.
The first file "excel_polyphred_report.txt", contains the overall report of all
SNPs (including their scores) detected for each clone. It also lists the failed
clones. The second file "excel_snp_counts.txt", contains the total count for
each of the ten possible SNPs at every position. Finally, the third output file
"ratios.txt", contains the ratios of those bases that occurred at each SNP
position.
- report_prettybase.pl -
This script reformats prettybase output to produce four tab-delimited files
that can be opened as Excel spreadsheets for further statistical examination
and a Visual Basic macro file.
The first file "snp_report.txt", contains the overall report of all
SNPs (including their scores) detected for each clone. It also lists the failed
clones. The second file "snp_counts.txt", contains the total count for
each of the ten possible SNPs at every position. The third output file
"snp_ratios.txt", contains the ratios of those bases that occurred at each SNP
position. The fourth file "snp_summary.txt", summarizes SNP statistics by
SNP position. Finally the fifth file "snp_macro.bas" provides a Visual Basic
macro for coloring cells of the "snp_report.txt" file according to SNP type.
(The "snp_report.txt" file must be opened first; then the "snp_macro.bas"
file should be imported into the report file through the macro menu in Excel
and run from there.)
- select_contigs - Select a subset of contigs from a fasta input file into a new fasta output
file. A fasta quality file also can be processed. Partial contigs can be
extracted. Can also automatically rename old 454 reads using X-Y coordinate
based names to the new universal accession numbers. Various filtering options
are available.
Click here
for full documentation.
If contigs need to be reordered or joined together, then use sort_contigs instead.
Updated July 23, 2010
- SheetWriter - a sample sheet generating program for the Macintosh for PE/ABI 377 sequencers
- sheet_writer.pl - Perl/TK
program to create run sheets for the ABI 3700 capillary DNA sequencer or
MJGeneSys BaseStation 96-lane gel sequencer. Sheet_writer.pl is easily
customizable and is written to provide information to ftp_chromats_windowsNT
for easy file transfers without re-entry of run information. Requires Perl
with the Perl/TK module. Also requires OUTkForms.pm module. (See above.)
- sort_contigs - Sort a fasta input file, alphabetically by contig name, numerically by
contig name, by contig size, or according to a file giving contig order.
The output can be a set of ordered contigs, or a single joined contig with
separator strings. A fasta quality file also can be processed. Contigs can be
reversed and complemented, and partial contigs can be extracted. Can also
automatically rename old 454 reads using X-Y coordinate based names to the new
universal accession numbers. Various filtering options are available. If
neither reordering of contigs nor joining contigs is required, then
select_contigs may be a better choice, especially for large input files,
because it uses much less memory.
Click here
for full documentation.
- split_454_pools - perl script to separate 454/Roche GS20/FLX tagged run data and run a series of programs on each separate set of data. This program is called by get_454_pools.
Click here for full documentation.
Updated October 25, 2010
- stream_file_extract - splits an input file into multiple output text files.
A file name prefix may be used, along with the names from the input file.
Data for files to be created may be embedded within data to be passed to
Standard Output. This was written to be able to pass multiple files
back from an rsh command. The program stream_file_insert may be used to
create the input file containing the embedded files to be extracted.
Click here
for full documentation.
- stream_file_insert - inserts one or more text files into the Standard
Output stream for later extraction by stream_file_extract. A file name
prefix may be removed before embedding the names into the output.
This was written to be able to pass multiple files back from an rsh
command. Only text files are supported.
Click here
for full documentation.
- unique_contigs - Read a fasta input file and output the set of unique contigs. The program
screens for both duplicate contig names and duplicate sequences.
Click here
for full documentation.
- zap_file.pl - A program to perform a global string substitution
on a binary file. The full pathname to the file may be specified,
or the program will find a file in the Perl @INC include libraries
if the Perl style relative path is known.
Click here
for full documentation.
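The renaming described in the plate_reverse entry above amounts to rotating the plate by 180 degrees, so that both the row order and the column order are reversed. The following Python sketch reproduces the well mapping given in that entry. It is not the plate_reverse source; the function name `reverse_well` and its `rows` and `cols` parameters are hypothetical and chosen only for illustration.

```python
import string


def reverse_well(well: str, rows: int = 8, cols: int = 12) -> str:
    """Return the well name a sample would carry if the plate had been
    loaded the right way round (hypothetical helper, not plate_reverse itself).

    rows/cols default to a 96-well plate (8 x 12); use rows=16, cols=24
    for a 384-well plate.
    """
    row_letters = string.ascii_uppercase[:rows]
    # str.index raises ValueError if the row letter is outside the plate.
    row_index = row_letters.index(well[0].upper())
    col = int(well[1:])
    if not 1 <= col <= cols:
        raise ValueError(f"column {col} is outside 1..{cols}")
    # Reverse both axes: first row <-> last row, first column <-> last column.
    new_row = row_letters[rows - 1 - row_index]
    new_col = cols + 1 - col
    return f"{new_row}{new_col:02d}"


if __name__ == "__main__":
    for name in ("A01", "A02", "B01", "H12"):
        print(name, "->", reverse_well(name))
    print("A01 ->", reverse_well("A01", rows=16, cols=24), "(384-well)")
```

Applying the function twice returns the original well name, which is a quick sanity check on any renaming table built this way.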
A listing of other examples of programs and scripts developed by our informatics group
to aid contig assembly, proofreading, and data quality checking
We are working on making these programs and scripts available by links from this page.
- automake - perl script to manually initiate autophrap for a single project
- autophrap.pl - master perl script with a make file to automatically assemble projects
containing new data at both high and low stringency, to select extension and coverage
primers for projects nearing completion, to compare high and low stringency assemblies,
to produce update files for Genbank submission, and to update the Web assembly report.
autophrap.pl executes several other scripts and programs. This script is automatically
run each hour by a cron job. Each project is updated automatically once new sequence data is added.
- autorep.pl - produce a report about the automatic assembly performed by autophrap.pl
- autosubmit.pl - generate asn.1 format Genbank submissions automatically using fa2htgs
- autotroll.pl - check for filenames containing spaces (messes up phrap)
- biggest_bases.pl - find the largest contigs in a fasta format file
- changex2n.in - editor scripts for changing phrap x bases to n
- check_consensus.pl - compare the latest assembly consensus with the most recent Genbank submission
- chromat_check.pl - check chromat_dirs for stupid file names containing spaces etc.
- contigpairs.pl - parse rel.out file into pairs of contigs with spanning clones
- contigs2tbl - compute table of contig sizes for Web report
- dsgaps.pl - locate DS-Gap regions in the phrap.out file
- dsregions.pl - locate DS-Gap regions in the phrap.out file and output Primo format
- find_latest.pl - a tool used by web_tabler.pl
- forcescreen.pl - force ecoli screen on phrap data bases
- initialize.pl - required by autosubmit.pl
- lastbase.in - editor scripts for postprimo processing
- makereadme - create, edit, or print 00readme.txt file for a project.
- make_autoqual.html - a perl script to produce a tabular html listing of the number of reads
with phred20 scores for each project
- overlap.pl - look up two projects in automaster.list and crossmatch to compute overlap and unique regions
- overlaps.pl - look up adjacent entries in automaster.list and crossmatch to compute overlap and unique regions
- postprimoc.pl - convert primo output to mermade output for large-insert clones
- postprimos.pl - convert primo output to mermade output with sub-clones information
- print_unique - perl program to extract key and value fields from each record of a file, compute a
single output value from the records with the same key, and output a single record for each key with
the key and value.
- printreadme.pl - print 00readme.txt files for all projects
- printrev_phrap - modified Jeremy Parson's printrev program to use phrap output (via the rel.out
file from relationships_phrap), to print the graphic output across multiple pages and to order contigs
based on forward/reverse clone linkage information.
- qualreport.pl - checks the phred20 bases, pUC vector only reads and bacterial host genomic contamination
for each sequencing gel
- readme_check.pl - check readme files for various strings
- relationships_phrap - perl program to extract read information for all contigs in a phrap .ace file.
The file produced is named rel.out.
- total_bases.pl - compute the number of bases in a fasta format file
- web_tabler.pl - build the status tables with accession numbers and web hot links
- weekly_bacterial - perl program to automate weekly Blast searches of bacterial sequencing projects for Web searches of Blast results.
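Several entries in the list above (total_bases.pl, biggest_bases.pl, contigs2tbl) come down to tallying contig lengths in a fasta format file. Since those scripts are not yet linked from this page, here is a minimal Python sketch of that kind of tally. It is an illustration under stated assumptions, not the group's code: the function names are hypothetical, the contig name is assumed to be the first word of the header line, and duplicate contig names are not handled.

```python
from typing import Dict, Iterable, List, Tuple


def contig_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each contig name to its sequence length (hypothetical helper)."""
    lengths: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the contig name is the first word of the header.
            fields = line[1:].split()
            name = fields[0] if fields else f"unnamed_{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def total_bases(lengths: Dict[str, int]) -> int:
    """Total number of bases over all contigs."""
    return sum(lengths.values())


def biggest_contigs(lengths: Dict[str, int], n: int = 3) -> List[Tuple[str, int]]:
    """The n largest contigs as (name, length) pairs, largest first."""
    return sorted(lengths.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    example = [
        ">Contig1 first example contig",
        "ACGTACGT",
        "ACGT",
        ">Contig2",
        "NNNN",
        ">Contig3",
        "ACGTACGTACGTACGTAC",
    ]
    sizes = contig_lengths(example)
    print("total bases:", total_bases(sizes))
    print("largest contigs:", biggest_contigs(sizes, n=2))
```

Reading the file line by line keeps only one length per contig in memory, which matters for large assemblies.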
Programming Related Sites of Possible Interest
Perl Language Home Page
Lincoln Stein's home page with a wealth of information and links from the author of "How to Set Up and Maintain a Web Site."
A Beginner's Guide to HTML
HTML Code Tutorial
HTML Goodies
W3Schools
ColorTool
Swish-e web site
Oracle Support
Understanding Computer Programming
Local Versions of Various Docs
Welcome to Sybase web.sql 1.2 GA Release
MySql Documentation TOC
FPC - Finger-Printed Contigs FAQ
Image Gel Visualization and Tracking Program
Bruce Roe, broe@ou.edu