EMBL Feature Programs

Last Updated: August 24, 2005

Here is a collection of Perl scripts for working with EMBL feature files. These programs are used at the University of Oklahoma Advanced Center for Genome Technology in conjunction with Artemis from the Sanger Centre.

  • artemis_blast2EMBL - takes a list of blast results files, extracts the best blast hit for each query, and converts it to an EMBL BLASTCDS feature. Updated - August 31, 2005

  • crossmatch2EMBL - takes crossmatch results files and extracts the matches, converting each to an EMBL feature. This program is useful for finding locations of features with known (or almost known) sequences.

  • find_split_EMBL_features - reads a fasta sequence file and a corresponding EMBL feature file, and finds the features that appear to span joined contigs, where joined contigs are indicated by a long run of Xs and/or Ns. Because Artemis does not handle multiple contigs, we sometimes join the contigs of projects that are nearing completion with runs of Ns, creating a single sequence for Artemis so that annotation can begin early. Artemis sometimes includes these runs of Ns within the Open Reading Frames that it identifies. This program finds the features that are falsely joined across contig boundaries in this way, so that they can be examined and then split, adjusted at one end, or removed.

  • genemark2artemis - converts GeneMark splice boundary output into EMBL format for input into Artemis.

  • merge_EMBL_features - takes EMBL feature files that describe the same sequence file and merges their features into one EMBL feature file. This task can also be accomplished within Artemis by using the "File/Read An Entry ...", "File/Read Features Into" and "File/Save An Entry As/New File" commands.

  • trna2EMBL - converts tRNAscan-SE output to Artemis EMBL feature format. The input file must be an output file from tRNAscan-SE.

  • update_EMBL_features - reads a fasta sequence file and a corresponding EMBL feature file, and updates the feature locations to match a second fasta sequence file. This program allows annotation to be started for a project before the sequence is finalized. update_EMBL_features automatically finds most features in the new sequence, and identifies the features that it could not locate. Updated - June 23, 2005


    Full help for artemis_blast2EMBL

    artemis_blast2EMBL - takes a list of blast results file names and
    extracts the best blast hit (by Pvalue) for each query in a file.
    The best blast hit is converted to an EMBL feature.  The '-a' flag
    can be specified to force the creation of a dummy EMBL feature for
    queries which do not have a hit meeting the requested criteria.
    The '-e' flag can be specified to write out every hit with Expect
    value < 'max_expect' as a separate EMBL feature, instead of just
    the top hit for each file.  The program uses the Artemis Blast
    "Query=" lines or the Artemis fasta sequence file headers to find
    feature locations for the EMBL feature file.  Output is written to
    standard output.
    
    
    usage:  artemis_blast2EMBL [-a] [-e] [-m 'max_expect'] [-s 'min_score']
                               [-t 'feature_type'] [-v]
                               blast_file1 [blast_file2...]
              or
            artemis_blast2EMBL -h    <-- what you are reading
    
    where 'blast_file1', 'blast_file2', ... are the names of Blast results
             files to be processed.
    
          'max_expect' is the maximum Blast Expect value for which
             information about the best blast hit is returned.  The default
             value for 'max_expect' is 0.0001.
    
          'min_score' is the minimum Blast Score value for which
             information about the best blast hit is returned.  The default
             value for 'min_score' is 50.
    
          'feature_type' is the name of the feature_type to be created.
             The default is: 'BLASTCDS'.
    
    OPTIONS:
    
      -a  Create an EMBL feature for ALL input queries processed.  If no
          Blast hits were found, then a feature is created as '***no hits***'.
          If the Blast Expect value is greater than 'max_expect', then the
          feature is created as '***no significant hits***'.  If '-a' is not
          specified, then EMBL features are created only for those queries
          having a best hit Expect value less than or equal to 'max_expect'.
    
      -e  Output a feature for every hit with Expect value < 'max_expect' as
          a separate EMBL feature, not just the top hit for each query.
    
      -m  Specifies the maximum Expect value.  See 'max_expect' above.  This
          option may also be specified as '-p max_expect' for backwards
          compatibility.
    
      -s  Specifies the minimum Score value.  See 'min_score' above.
    
      -t  Specifies the feature type to be created.  See 'feature_type'
          above.
    
      -v  Verbose mode -- write out some statistics and progress messages
          to STDERR.
    
    
    LAST_MODIFIED: August 31, 2005
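
The selection rules described in the help above can be sketched in Python as follows. This is only an illustration, not the Perl code of artemis_blast2EMBL: the Hit record and the choose_label function are hypothetical names, picking the best hit by lowest Expect value is an assumption, and so is requiring the best hit to pass the 'max_expect' and 'min_score' limits at the same time.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One Blast hit for a query (hypothetical record, for illustration)."""
    subject: str   # description of the database sequence that was hit
    expect: float  # Blast Expect value (smaller is better)
    score: float   # Blast Score (larger is better)


def choose_label(hits: Sequence[Hit],
                 max_expect: float = 0.0001,
                 min_score: float = 50,
                 create_all: bool = False) -> Optional[str]:
    """Return the text for a query's feature, or None to write no feature.

    create_all mimics the '-a' flag: a placeholder feature is produced
    even when no acceptable hit exists.
    """
    if not hits:
        return "***no hits***" if create_all else None
    # Assumption: "best" means the hit with the lowest Expect value.
    best = min(hits, key=lambda h: h.expect)
    # Assumption: both limits must be met for the hit to be reported.
    if best.expect <= max_expect and best.score >= min_score:
        return best.subject
    return "***no significant hits***" if create_all else None
```

With the default limits, a query whose only hit has an Expect value of 0.001 yields no feature unless create_all is set, in which case the placeholder text is returned.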
    


    Full help for crossmatch2EMBL

    crossmatch2EMBL - takes crossmatch results file names and
    extracts the matches.  Each match is converted to an EMBL
    feature. Output is written to standard output.  This program
    is useful for finding locations of features with known (or
    almost known) sequences.
    
    usage:  crossmatch2EMBL [-t 'feature_type']
               crossmatch_file1 [crossmatch_file2 [crossmatch_file3 ...] ]
    
    where 'crossmatch_file1', 'crossmatch_file2', 'crossmatch_file3', etc.
             are the names of the crossmatch results files to be processed.
    
          'feature_type' is the name of the feature_type to be created.
             The default is: 'rRNA'.
    
    OPTIONS:
    
      -t  Specifies the feature type to be created.  See 'feature_type'
          above.
    
    DATE PROGRAM LAST MODIFIED: Sep 5, 2001
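
For readers unfamiliar with the output format, the Python sketch below writes one match as an EMBL feature table entry. It is not taken from crossmatch2EMBL: the embl_feature function is a hypothetical helper, and the use of a /note qualifier is an assumption. The column layout (feature key starting in column 6, location and qualifiers starting in column 22) follows the EMBL feature table format.

```python
def embl_feature(key, start, end, complement=False, note=None):
    """Format one feature as EMBL feature table lines.

    key        -- feature type, e.g. 'rRNA' (the crossmatch2EMBL default)
    start, end -- 1-based inclusive base positions, start <= end
    complement -- True if the match is on the reverse strand
    note       -- optional free text (assumption: written as /note)
    """
    location = f"{start}..{end}"
    if complement:
        location = f"complement({location})"
    # 'FT' + 3 spaces puts the key in column 6; padding the key to
    # 15 characters plus one space puts the location in column 22.
    lines = [f"FT   {key:<15} {location}"]
    if note is not None:
        # Qualifier lines are 'FT' + 19 spaces, so text starts in column 22.
        lines.append("FT" + " " * 19 + f'/note="{note}"')
    return "\n".join(lines)
```

Calling embl_feature("rRNA", 1200, 2700, complement=True) gives a single FT line whose location reads complement(1200..2700).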
    


    Full help for find_split_EMBL_features

    find_split_EMBL_features - reads a fasta sequence file and a
    corresponding EMBL feature file, and outputs the features that
    appear to span joined contigs, where joined contigs are indicated
    by a long run of Xs and/or Ns.  If the "-i" flag (inverse) is
    specified, then only the features that do NOT span joined contigs
    are to be output. The features are sorted by "leftmost" beginning
    or ending position in the output feature file.
    
    find_split_EMBL_features does not support remote entry identifiers
    (data base and accession number info) in feature locations. 
    
    USAGE:  find_split_EMBL_features [-i] [-l join_len] [-v]
                                     fasta_file EMBL_input EMBL_output
              or
            find_split_EMBL_features -h       # the file you are reading
               
    
    where 'fasta_file' is the name of the fasta sequence file which matches
             'EMBL_input'.
    
          'EMBL_input' is the name of the EMBL feature file to be processed.
    
          'EMBL_output' is the name of the output subset EMBL feature file
             to be created.  A log file named "'EMBL_output'.log" is also
             created.
    
          'join_len' is the number of consecutive Xs and/or Ns used to
             indicate joined contigs.  Default is 50.
    
    
    OPTIONS:
    
      -h  Help message - what you are reading.
      -i  Inverse mode.  Output features that are NOT split across joined
          contigs, instead of those which are.
      -l 'join_len' - Specify number of consecutive Xs and/or Ns used to
          join multiple contigs for Artemis processing.
      -v  Verbose mode.  Output messages to STDERR, as well as to the log
          file.
    
    
    DATE PROGRAM LAST MODIFIED: Nov 16, 2001
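
The idea behind the test for split features can be sketched in Python as below. This is a minimal illustration, not the program's own code: contig_joins and spans_join are hypothetical names, matching lowercase x and n is an assumption, and so is treating any overlap between a feature and a run as "spanning" it.

```python
import re


def contig_joins(sequence, join_len=50):
    """Return 1-based inclusive (start, end) positions of every run of
    Xs and/or Ns that is at least join_len bases long."""
    # Assumption: lowercase x/n count as well, hence IGNORECASE.
    pattern = re.compile(r"[XN]{%d,}" % join_len, re.IGNORECASE)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


def spans_join(feature_start, feature_end, joins):
    """True if the feature range overlaps at least one join run.

    Assumption: any overlap counts; positions are 1-based inclusive.
    """
    return any(feature_start <= join_end and join_start <= feature_end
               for join_start, join_end in joins)
```

For a sequence made of 20 bases, 50 Ns and 20 more bases, contig_joins reports one run at positions 21 to 70, and a feature at 10..80 is flagged while a feature at 1..20 is not.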
    


    Full help for genemark2artemis

    genemark2artemis - converts genemark splice boundary output into EMBL
    feature format for input into artemis.
    
    
    USAGE: genemark2artemis GeneMark_file Artemis_feature_file
    
    where 'GeneMark_file' is the name of the genemark output file to be read.
    
          'Artemis_feature_file' is the name of the output EMBL feature file
             to be created with the predicted exons.
    
    
    DATE PROGRAM LAST MODIFIED: Oct 15, 1999
    


    Full help for merge_EMBL_features

    merge_EMBL_features - takes EMBL feature files, which describe the
    same sequence file and merges the features into one EMBL feature file.
    The features are sorted by "leftmost" beginning or ending position.
    Output is written to standard output.  merge_EMBL_features does not
    support remote entry identifiers (data base and accession number info)
    in feature locations. 
    
    USAGE:  merge_EMBL_features [-v] EMBL_file1 [EMBL_file2 [EMBL_file3 ...]] > EMBL_output_file
                or
            merge_EMBL_features -h
               
    
    where 'EMBL_file1', 'EMBL_file2', 'EMBL_file3', etc.
             are the names of the input EMBL feature files to be processed.
    
          'EMBL_output_file' is the name of the output EMBL feature file to be
             created.
    
    OPTIONS:
    
      -h  Help message -- what you are reading.
      -v  Verbose mode.  Print messages during processing.
    
    
    PROGRAM LAST MODIFIED: Sep 19, 2001
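
Sorting by "leftmost" position can be pictured with the short Python sketch below. It is an illustration under stated assumptions rather than the program's code: features are represented as hypothetical (key, location) pairs, and the leftmost position is taken to be the smallest number in the location string. Like the program, it does not cope with remote entry identifiers, because digits in an accession number would be mistaken for positions.

```python
import re


def leftmost(location):
    """Smallest base position mentioned in an EMBL location string,
    e.g. 10 for 'join(10..50,80..120)'."""
    return min(int(number) for number in re.findall(r"\d+", location))


def merge_features(*feature_lists):
    """Combine several lists of (key, location) pairs into one list,
    ordered by leftmost position. sorted() is stable, so features that
    start at the same base keep their input order."""
    merged = [feature for features in feature_lists for feature in features]
    return sorted(merged, key=lambda feature: leftmost(feature[1]))
```

Merging one list holding complement(500..900) and join(10..50,80..120) with another holding 200..275 returns the join feature first, then 200..275, then the complement feature.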
    


    Full help for trna2EMBL

    trna2EMBL - Convert tRNAscan-SE output to artemis EMBL feature format.
    Input file must be an output file from tRNAscan-SE.  Output is written
    to standard output.
    
    USAGE: trna2EMBL [-c colour] [-a/-t] trnascan_file > EMBL_feature_file
              or
           trna2EMBL -h        <-- what you are reading
    
      where 'trnascan_file' is the name of the output file from tRNAscan-SE
              to be used as input.
    
            'colour' is the color number to be used by artemis.
    
            'EMBL_feature_file' is the name of the Artemis EMBL feature file
              to be created.
    
    
    OPTIONS:
    
    
      '-a' indicates that the created tRNA features are to be labeled with
           the tRNA type and anti-codon as 'tRNA-<type>(<anticodon>)'.
           For example, 'tRNA-Thr(TGT)', 'tRNA-Met(CAT)', ...
    
      '-c colour' is used to mark the created tRNA features with a specific
           color number.  If '-c' is not specified, then no colour option
           will be written for the newly written features.
    
      '-t' indicates that the created tRNA features are to be labeled with
           the tRNA type as 'tRNA-<type>'. For example, 'tRNA-Thr',
           'tRNA-Met', ...
    
    If '-a' and '-t' are omitted, the created tRNA features are to be labeled
    as 'tRNA#', where '#' is the 'tRNA #' from the trnascan output. For
    example, 'tRNA1', 'tRNA2', 'tRNA3', ...
    
    
    DATE PROGRAM LAST MODIFIED: Sep 7, 2001
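
The three labeling styles can be summarized in a few lines of Python. The trna_label function and its mode argument are hypothetical names used only for this illustration; the label shapes themselves are the ones given in the help above.

```python
def trna_label(number, trna_type, anticodon, mode=None):
    """Build a feature label for one tRNAscan-SE prediction.

    number    -- the 'tRNA #' from the tRNAscan-SE output
    trna_type -- amino acid type, e.g. 'Thr'
    anticodon -- anti-codon, e.g. 'TGT'
    mode      -- 'a' for type plus anti-codon (like '-a'),
                 't' for type only (like '-t'),
                 None for the numbered default
    """
    if mode == "a":
        return f"tRNA-{trna_type}({anticodon})"
    if mode == "t":
        return f"tRNA-{trna_type}"
    return f"tRNA{number}"
```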
    


    Full help for update_EMBL_features

    
    update_EMBL_features - reads a fasta sequence file and a corresponding
    EMBL feature file, and updates the feature locations to match a
    second fasta sequence file.  The features are sorted by "leftmost"
    beginning or ending position in the output feature file.
    update_EMBL_features does not support remote entry identifiers (data
    base and accession number info) in feature locations. 
    
    update_EMBL_features looks for sequences matching 'search_len' bases on
    each end of each feature in order to find the location in the new sequence.
    If only one end has a match, then an extended search is attempted using
    'search_len'/2 bases and if necessary, 'search_len'/4 bases, of the end
    that was not located.  But at least 'min_search' bases will always be used.
    
    
    USAGE:  update_EMBL_features [-c colour] [-d debug_level]
                                 [-l search_len] [-m] [-n min_search] [-v]
                                 fasta_file1 EMBL_input fasta_file2 EMBL_output
              or
            update_EMBL_features -h       # the file you are reading
               
    
    where 'fasta_file1' is the name of the fasta sequence file which matches
             'EMBL_input'.
    
          'EMBL_input' is the name of the EMBL feature file to be processed.
    
          'fasta_file2' is the name of the new fasta sequence file to be used
             for re-locating the features in 'EMBL_output'.
    
          'EMBL_output' is the name of the output EMBL feature file to be
             created.  A log file named "'EMBL_output'.log" is also created.
    
          'colour' is the color number used to mark features that need to be
             examined individually after relocation to the new sequence.  The
             default value for 'colour' is '4'.
    
          'debug_level' is a number >= 0 for turning debug messages on and off.
             The default value is 0.
    
          'search_len' is the number of bases on each end of a feature to use
             in finding the location of a match in the new sequence file.
             The default value is 100.  The minimum value for
             'search_len' is the value for 'min_search'.
    
          'min_search' is the minimum number of bases on each end of a feature
             which must be used in finding the location of a match in the new
             sequence file.  The default value is 10.
    
    
    OPTIONS:
    
      -c 'colour' - If 'colour' is numeric, add a 'colour' tag to the relocated
          features that need to be examined individually, because either the
          sequence is changed, or else a formerly nearest neighboring feature
          is missing or no longer in the same relative position (possibly
          indicating a sequence break).
      -d 'debug_level' - Debug_level = 0 turns debugging off.  Debug_level > 0
          turns debugging on, with larger values producing more output.
      -h  Help message - what you are reading.
      -l 'search_len' - Specify number of bases on ends of features to use for
          finding matching features in new sequence.
      -m  Mark features with previous location by adding extra line to relocated
          features.  Also remove markers from previous run.
      -n 'min_search' - Specify the minimum number of bases on each end of a
          feature which must be used in finding the location of a match in the
          new sequence file.
      -v  Verbose mode.  Also print progress messages to STDERR during processing.
    
    
    DATE LAST MODIFIED: June 23, 2005
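
The end-matching search with its shorter fallback probes can be sketched in Python as below. This is a simplified illustration, not the algorithm as implemented in update_EMBL_features: all function names are hypothetical, the probes are assumed to be taken from inside the feature, a probe is assumed to count only when it occurs exactly once in the new sequence, and strand, joins, colour marking and logging are ignored.

```python
def probe_lengths(search_len=100, min_search=10):
    """Probe lengths in the order tried: search_len, then half, then a
    quarter, never shorter than min_search and without repeats."""
    lengths = []
    for divisor in (1, 2, 4):
        n = max(search_len // divisor, min_search)
        if n not in lengths:
            lengths.append(n)
    return lengths


def find_unique(probe, sequence):
    """0-based index of probe if it occurs exactly once, else None."""
    first = sequence.find(probe)
    if first == -1 or sequence.find(probe, first + 1) != -1:
        return None
    return first


def locate_end(feature, new_seq, n, left_end):
    """New 1-based position of one feature end, using n bases of that end."""
    probe = feature[:n] if left_end else feature[-n:]
    pos = find_unique(probe, new_seq)
    if pos is None:
        return None
    # Left end: first base of the probe. Right end: last base of the probe.
    return pos + 1 if left_end else pos + len(probe)


def relocate(old_seq, start, end, new_seq, search_len=100, min_search=10):
    """Return the feature's new 1-based (start, end), or None if not found."""
    feature = old_seq[start - 1:end]
    full, *shorter = probe_lengths(search_len, min_search)
    new_start = locate_end(feature, new_seq, full, left_end=True)
    new_end = locate_end(feature, new_seq, full, left_end=False)
    if new_start is None and new_end is None:
        return None  # neither end matched, so no extended search is made
    for n in shorter:  # retry only the end that is still missing
        if new_start is None:
            new_start = locate_end(feature, new_seq, n, left_end=True)
        if new_end is None:
            new_end = locate_end(feature, new_seq, n, left_end=False)
    if new_start is None or new_end is None:
        return None
    return new_start, new_end
```

With the default values the probe lengths are 100, 50 and 25 bases. With a 'search_len' of 20 and a 'min_search' of 10 they are 20 and 10, because the quarter-length probe would fall below the minimum.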
    
    


    Home Page Bruce Roe, broe@ou.edu