Last Updated: August 24, 2005
Here is a collection of Perl scripts for working with EMBL feature files. These programs are used at the University of Oklahoma Advanced Center for Genome Technology in conjunction with Artemis from the Sanger Centre.
artemis_blast2EMBL - takes a list of blast results file names and
extracts the best blast hit (by Pvalue) for each query in a file.
The best blast hit is converted to an EMBL feature. The '-a' flag
can be specified to force the creation of a dummy EMBL feature for
queries which do not have a hit meeting the requested criteria.
The '-e' flag can be specified to write out every hit with Expect
value < 'max_expect' as a separate EMBL feature, instead of just
the top hit for each file. The program uses the Artemis Blast
"Query=" lines or the Artemis fasta sequence file headers to find
feature locations for the EMBL feature file. Output is written to
standard output.
usage: artemis_blast2EMBL [-a] [-e] [-m 'max_expect'] [-s 'min_score']
[-t 'feature_type'] [-v]
blast_file1 [blast_file2...]
or
artemis_blast2EMBL -h <-- what you are reading
where 'blast_file1', 'blast_file2', ... are the names of Blast results
files to be processed.
'max_expect' is the maximum Blast Expect value for which
information about the best blast hit is returned. The default
value for 'max_expect' is 0.0001.
'min_score' is the minimum Blast Score value for which
information about the best blast hit is returned. The default
value for 'min_score' is 50.
'feature_type' is the name of the feature_type to be created.
The default is: 'BLASTCDS'.
OPTIONS:
-a Create an EMBL feature for ALL input queries processed. If no
Blast hits were found, then a feature is created as '***no hits***'.
If the Blast Expect value is greater than 'max_expect', then the
feature is created as '***no significant hits***'. If '-a' is not
specified, then EMBL features are created only for those queries
having a best hit Expect value less than or equal to 'max_expect'.
-e Output a feature for every hit with Expect value < 'max_expect' as
a separate EMBL feature, not just the top hit for each query.
-m Specifies the maximum Expect value. See 'max_expect' above. This
option may also be specified as '-p max_expect' for backwards
compatibility.
-s Specifies the minimum Score value. See 'min_score' above.
-t Specifies the feature type to be created. See 'feature_type'
above.
-v Verbose mode -- write out some statistics and progress messages
to STDERR.
LAST_MODIFIED: August 31, 2005
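
The selection rules described above can be sketched in Python as follows. This is only an illustration and not the script's Perl code: the 'Hit' tuple and the 'pick_labels' function are hypothetical names, the hits are assumed to be already parsed, and "best" is assumed to mean the lowest Expect value. The sketch also assumes that a hit failing 'min_score' is treated the same way as one failing 'max_expect'.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str   # description of the database sequence that was hit
    score: float   # Blast Score
    expect: float  # Blast Expect value


def pick_labels(hits: List[Hit], max_expect: float = 0.0001,
                min_score: float = 50, all_queries: bool = False,
                every_hit: bool = False) -> List[str]:
    """Return the feature labels that would be written for one query."""
    if not hits:
        # '-a' behaviour: a dummy feature for a query with no hits at all
        return ["***no hits***"] if all_queries else []
    good = [h for h in hits
            if h.expect <= max_expect and h.score >= min_score]
    if not good:
        # '-a' behaviour: hits exist, but none meets the criteria
        return ["***no significant hits***"] if all_queries else []
    good.sort(key=lambda h: h.expect)  # lowest Expect value first
    # '-e' behaviour: every qualifying hit, otherwise only the top hit
    return [h.subject for h in good] if every_hit else [good[0].subject]
```

With the default thresholds, a query whose hits all have Expect values above 0.0001 produces no feature unless 'all_queries' (the '-a' flag) is set.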
crossmatch2EMBL - takes crossmatch results file names and
extracts the matches. Each match is converted to an EMBL
feature. Output is written to standard output. This program
is useful for finding the locations of features whose sequences
are known (or almost known).
usage: crossmatch2EMBL [-t 'feature_type']
crossmatch_file1 [crossmatch_file2 [crossmatch_file3 ...] ]
where 'crossmatch_file1', 'crossmatch_file2', 'crossmatch_file3', etc.
are the names of the crossmatch results files to be processed.
'feature_type' is the name of the feature_type to be created.
The default is: 'rRNA'.
OPTIONS:
-t Specifies the feature type to be created. See 'feature_type'
above.
DATE PROGRAM LAST MODIFIED: Sep 5, 2001
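
For readers unfamiliar with the output format, the Python sketch below shows how a single match could be laid out as an EMBL feature table entry. It follows the general EMBL feature table layout (feature key starting in column 6, location and qualifiers starting in column 22). The function name 'embl_feature' and the use of a '/note' qualifier are assumptions made for illustration; the qualifiers actually written by crossmatch2EMBL are not documented here.

```python
def embl_feature(key, start, end, complement=False, note=None):
    """Format one feature as EMBL feature table ('FT') lines."""
    location = f"{start}..{end}"
    if complement:
        # matches on the reverse strand are wrapped in complement()
        location = f"complement({location})"
    # 'FT' in columns 1-2, key in columns 6-20, location from column 22
    lines = [f"FT   {key:<15} {location}"]
    if note is not None:
        # qualifier lines leave the key columns blank
        lines.append(f'FT   {"":<15} /note="{note}"')
    return "\n".join(lines)


print(embl_feature("rRNA", 100, 250, note="hypothetical match"))
```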
find_split_EMBL_features - reads a fasta sequence file and a
corresponding EMBL feature file, and outputs the features that
appear to span joined contigs, where joined contigs are indicated
by a long run of Xs and/or Ns. If the "-i" flag (inverse) is
specified, then only the features that do NOT span joined contigs
are to be output. The features are sorted by "leftmost" beginning
or ending position in the output feature file.
find_split_EMBL_features does not support remote entry identifiers
(data base and accession number info) in feature locations.
USAGE: find_split_EMBL_features [-i] [-l join_len] [-v]
fasta_file EMBL_input EMBL_output
or
find_split_EMBL_features -h # the file you are reading
where 'fasta_file' is the name of the fasta sequence file which matches
'EMBL_input'.
'EMBL_input' is the name of the EMBL feature file to be processed.
'EMBL_output' is the name of the output subset EMBL feature file
to be created. A log file named "'EMBL_output'.log" is also
created.
'join_len' is the number of consecutive Xs and/or Ns used to
indicate joined contigs. Default is 50.
OPTIONS:
-h Help message - what you are reading.
-i Inverse mode. Output features that are NOT split across joined
contigs, instead of those that are.
-l 'join_len' - Specify number of consecutive Xs and/or Ns used to
join multiple contigs for Artemis processing.
-v Verbose mode. Output messages to STDERR, as well as to the log
file.
DATE PROGRAM LAST MODIFIED: Nov 16, 2001
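
The detection of joined contigs can be illustrated with a short Python sketch. It is not the script's Perl code. The function names are hypothetical, and two assumptions are made: runs are matched without regard to case, and a feature "spans" a join only when it starts before the run and ends after it.

```python
import re


def join_regions(sequence, join_len=50):
    """Return 1-based inclusive (start, end) pairs for every run of at
    least join_len consecutive X and/or N characters."""
    pattern = re.compile(r"[XNxn]{%d,}" % join_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


def spans_join(feature_start, feature_end, regions):
    """True if the feature begins before a join run and ends after it."""
    left, right = sorted((feature_start, feature_end))
    return any(left < start and right > end for start, end in regions)
```

In inverse mode ('-i'), the features kept would be those for which 'spans_join' returns False.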
genemark2artemis - converts genemark splice boundary output into EMBL
feature format for input into artemis.
USAGE: genemark2artemis GeneMark_file Artemis_feature_file
where 'GeneMark_file' is the name of the genemark output file to be read.
'Artemis_feature_file' is the name of the output EMBL feature file
to be created with the predicted exons.
DATE PROGRAM LAST MODIFIED: Oct 15, 1999
merge_EMBL_features - takes EMBL feature files that describe the
same sequence file and merges the features into one EMBL feature file.
The features are sorted by "leftmost" beginning or ending position.
Output is written to standard output. merge_EMBL_features does not
support remote entry identifiers (data base and accession number info)
in feature locations.
USAGE: merge_EMBL_features [-v] EMBL_file1 [EMBL_file2 [EMBL_file3 ...]] > EMBL_output_file
or
merge_EMBL_features -h
where 'EMBL_file1', 'EMBL_file2', 'EMBL_file3', etc.
are the names of the input EMBL feature files to be processed.
'EMBL_output_file' is the name of the output EMBL feature file to be
created.
OPTIONS:
-h Help message -- what you are reading.
-v Verbose mode. Print messages during processing.
PROGRAM LAST MODIFIED: Sep 19, 2001
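
Sorting by "leftmost" position can be sketched in Python as shown below. This is an illustration only: features are represented as hypothetical (key, location) tuples, and the leftmost position is taken to be the smallest number in the location string, so that 'complement(...)' and 'join(...)' locations are ordered by their lowest coordinate. Like the script, the sketch does not handle remote entry identifiers, because the digits in an accession number would be mistaken for positions.

```python
import re


def leftmost(location):
    """Smallest coordinate mentioned in an EMBL location string."""
    return min(int(n) for n in re.findall(r"\d+", location))


def merge_features(*feature_lists):
    """Combine several feature lists and sort them by leftmost position."""
    merged = [f for features in feature_lists for f in features]
    # sorted() is stable, so features that tie keep their input order
    return sorted(merged, key=lambda f: leftmost(f[1]))
```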
trna2EMBL - Convert tRNAscan-SE output to artemis EMBL feature format.
Input file must be an output file from tRNAscan-SE. Output is written
to standard output.
USAGE: trna2EMBL [-c colour] [-a/-t] trnascan_file > EMBL_feature_file
or
trna2EMBL -h <-- what you are reading
where 'trnascan_file' is the name of the output file from tRNAscan-SE
to be used as input.
'colour' is the color number to be used by artemis.
'EMBL_feature_file' is the name of the artemis EMBL feature file to
be created.
OPTIONS:
'-a' indicates that the created tRNA features are to be labeled with
the tRNA type and anti-codon as 'tRNA-<type>(<anticodon>)'.
For example, 'tRNA-Thr(TGT)', 'tRNA-Met(CAT)', ...
'-c colour' is used to mark the created tRNA features with a specific
color number. If '-c' is not specified, then no colour option
will be written for the newly written features.
'-t' indicates that the created tRNA features are to be labeled with
the tRNA type as 'tRNA-<type>'. For example, 'tRNA-Thr',
'tRNA-Met', ...
If '-a' and '-t' are omitted, the created tRNA features are to be labeled
as 'tRNA#', where '#' is the 'tRNA #' from the trnascan output. For
example, 'tRNA1', 'tRNA2', 'tRNA3', ...
DATE PROGRAM LAST MODIFIED: Sep 7, 2001
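
The three labeling schemes can be summarized with a small Python sketch. The function 'trna_label' and its 'mode' argument are hypothetical and only mirror the '-a' and '-t' flags; this is not the script's Perl code.

```python
def trna_label(number, trna_type, anticodon, mode=None):
    """Build a feature label from one tRNAscan-SE result.

    mode "a" mirrors '-a', mode "t" mirrors '-t', None means neither flag.
    """
    if mode == "a":
        return f"tRNA-{trna_type}({anticodon})"
    if mode == "t":
        return f"tRNA-{trna_type}"
    return f"tRNA{number}"


print(trna_label(1, "Thr", "TGT", mode="a"))
print(trna_label(1, "Thr", "TGT", mode="t"))
print(trna_label(1, "Thr", "TGT"))
```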
update_EMBL_features - reads a fasta sequence file and a corresponding
EMBL feature file, and updates the feature locations to match a
second fasta sequence file. The features are sorted by "leftmost"
beginning or ending position in the output feature file.
update_EMBL_features does not support remote entry identifiers (data
base and accession number info) in feature locations.
update_EMBL_features looks for sequences matching 'search_len' bases on
each end of each feature in order to find the location in the new sequence.
If only one end has a match, then an extended search is attempted using
'search_len'/2 bases and if necessary, 'search_len'/4 bases, of the end
that was not located. But at least 'min_search' bases will always be used.
USAGE: update_EMBL_features [-c colour] [-d debug_level]
[-l search_len] [-m] [-n min_search] [-v]
fasta_file1 EMBL_input fasta_file2 EMBL_output
or
update_EMBL_features -h # the file you are reading
where 'fasta_file1' is the name of the fasta sequence file which matches
'EMBL_input'.
'EMBL_input' is the name of the EMBL feature file to be processed.
'fasta_file2' is the name of the new fasta sequence file to be used
for re-locating the features in 'EMBL_output'.
'EMBL_output' is the name of the output EMBL feature file to be
created. A log file named "'EMBL_output'.log" is also created.
'colour' is the color number used to mark features that need to be
examined individually after relocation to the new sequence. The
default value for 'colour' is '4'.
'debug_level' is a number >= 0 for turning debug messages on and off.
The default value is 0.
'search_len' is the number of bases on each end of a feature to use
in finding the location of a match in the new sequence file.
The default value is 100. The minimum value for
'search_len' is the value for 'min_search'.
'min_search' is the minimum number of bases on each end of a feature
which must be used in finding the location of a match in the new
sequence file. The default value is 10.
OPTIONS:
-c 'colour' - If 'colour' is numeric, add a 'colour' tag to the relocated
features that need to be examined individually, because either the
sequence is changed, or else a formerly nearest neighboring feature
is missing or no longer in the same relative position (possibly
indicating a sequence break).
-d 'debug_level' - Debug_level = 0 turns debugging off. Debug_level > 0
turns debugging on, with larger values producing more output.
-h Help message - what you are reading.
-l 'search_len' - Specify number of bases on ends of features to use for
finding matching features in new sequence.
-m Mark features with their previous location by adding an extra line to
relocated features. Also remove markers from a previous run.
-n 'min_search' - Specify the minimum number of bases on each end of a
feature which must be used in finding the location of a match in the
new sequence file.
-v Verbose mode. Also print progress messages to STDERR during processing.
DATE LAST MODIFIED: June 23, 2005
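
The shrinking search described above ('search_len', then 'search_len'/2, then 'search_len'/4, but never fewer than 'min_search' bases) can be sketched in Python as follows. This is a simplified illustration and not the script's Perl code. The function names are hypothetical, only the left end of a feature is handled, the first exact match is accepted, and the sketch does not model the condition that the shorter probes are tried only when the other end has already been located.

```python
def probe_lengths(search_len=100, min_search=10):
    """Probe lengths to try, longest first, never below min_search."""
    lengths = []
    for divisor in (1, 2, 4):
        n = max(search_len // divisor, min_search)
        if n not in lengths:
            lengths.append(n)
    return lengths


def locate_left_end(old_seq, new_seq, start, search_len=100, min_search=10):
    """Return the 1-based position in new_seq where the feature's left
    end is found, or None if no probe matches exactly."""
    for n in probe_lengths(search_len, min_search):
        probe = old_seq[start - 1:start - 1 + n]
        if not probe:
            return None  # start lies beyond the end of old_seq
        hit = new_seq.find(probe)
        if hit != -1:
            return hit + 1
    return None
```

With the default values, the probe lengths tried are 100, 50, and 25 bases.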