
Introduction

There are two types of vector that readings will often need to be screened for: the sequencing vector and, for cases where, say, whole cosmids have been shotgunned, the cloning vector. The two tasks are different. When screening for the sequencing vector we may expect to find data to exclude, both from the primer region and, when the insert is short, from the other side of the cloning site. It is also a wise precaution to check for rearrangements of the sequencing vector. When screening out cosmid vector we may find that either the 5' end, or the 3' end, or the whole of the sequence is vector. Also for the cosmid search we need to compare both strands of the sequence.

Some groups may also need to check for cloning problems by looking for the presence of known restriction sites. This is described lower down in section Screening for Restriction Sites.

The program that performs all of these tasks is called vepe, and it is normally run by the script pregap. See section `Pregap' in pregap. Usually the program is called three times: first to locate and mark the sequencing vector, then to check for vector rearrangements, and finally to locate and mark cosmid vector segments. The program operates on batches of readings using a file of file names, and modifies the reading files of those found to contain vector. All the information needed about the vector for each reading (i.e. its name, location on disk, and the cloning and primer sites used) is expected to be stored in the reading's experiment file. See section Experiment File. For notes on defining the cloning and primer sites, see section Defining the Positions of Cloning and Primer Sites for Vepe.

The algorithm compares the sequence of the reading against the vector sequence using a hashing technique, and for this it needs a "Word length", the "Number of diagonals to combine" and a "Cutoff score". The word length is the minimum number of consecutive bases that will count as a match. The algorithm treats the problem like a dot matrix comparison and finds the diagonal with the highest score. It then adds in the scores of the adjacent diagonals, as set by "Number of diagonals to combine". If the combined score is at least the "Cutoff score", the experiment file is updated to indicate the location of the vector sequence. The score represents the proportion of a diagonal that contains matching words, so the maximum score for any diagonal is 1.0.
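
The scoring described above can be illustrated with a minimal Python sketch. It is not vepe's own code: the function names, the way a diagonal's length is measured, and the choice to centre the combining window on the best diagonal are all assumptions made for illustration.

```python
from collections import defaultdict

def diagonal_scores(reading, vector, word_length=4):
    """Score each diagonal as the fraction of its cells covered by matching words.

    Assumption: a diagonal's length is the number of cells it has inside the
    reading-by-vector matrix. vepe may normalise differently.
    """
    # Hash table: word -> start positions of that word in the vector.
    positions = defaultdict(list)
    for j in range(len(vector) - word_length + 1):
        positions[vector[j:j + word_length]].append(j)

    # For each diagonal d = j - i, record the reading positions covered by matches.
    covered = defaultdict(set)
    for i in range(len(reading) - word_length + 1):
        for j in positions.get(reading[i:i + word_length], ()):
            covered[j - i].update(range(i, i + word_length))

    scores = {}
    for d, cells in covered.items():
        first = max(0, -d)                           # first reading index on diagonal d
        last = min(len(reading), len(vector) - d)    # one past the last reading index
        scores[d] = len(cells) / (last - first)
    return scores

def best_combined_score(scores, n_diagonals=7):
    """Find the best diagonal and sum the scores in a window around it.

    Assumption: the window is centred on the best diagonal.
    """
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    half = n_diagonals // 2
    total = sum(scores.get(d, 0.0) for d in range(best - half, best + half + 1))
    return best, total
```

In this sketch a reading would be marked as containing vector when the combined score is at least the cutoff, for example `best_combined_score(diagonal_scores(reading, vector))[1] >= 0.35`. One simplification to be aware of: very short diagonals at the corners of the matrix can reach a high score from a single matching word.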

The following is the start of a typical run for finding the sequencing vector.

 vepe v4.0: vector excising program. May 95
 Select task
 X  1 Mark sequencing vector
    2 Mark cloning vector
    3 Check for vector rearrangements
    4 Check for restriction sites
 ? Selection (1-4) (1) =
 ? Input file of file names=files
 ? Output file of passed file names=files.pass
 ? Output file of failed file names=files.fail
 ? Word length (2-6) (4) =
 ? Number of diagonals to combine (1-11) (7) =
 ? Cutoff score (0.10-1.00) (0.35) =
 >>>> Read number  1 length  536 
 xb54a3.s1                                                                    
 M13MP18 cloning vector from vecbase  
 Vector length =  7250
 >>>> Read number  2 length  489 
 xb54b12.s1
 ...

The "Output file of failed file names" is used to store the names of all the readings that fail the processing. The following errors can occur.

  1. Error opening experiment file (the file could not be opened)
  2. Error getting gel reading (no sequence found in experiment file)
  3. Gel reading too short to compare (the reading must be at least "Word length" long)
  4. Error reading vector file (the vector file name not found in the experiment file)
  5. Error reading cloning site (cloning site not found in experiment file)
  6. Error reading primer site (primer site not found in experiment file)
  7. Error opening vector file (vector file could not be opened)
  8. No message (failed to write vector positions to experiment file)
  9. No message (after vector removal the reading would be < 14 bases long)
  10. No message (reading found to be entirely cloning vector)
  11. No message (reading found to contain a restriction site)
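
The batch behaviour (one file of file names in, a passed list and a failed list out) can be sketched as follows. This is only an illustration under stated assumptions: `run_batch` and the `check` callback are hypothetical names, and the real program distinguishes the individual errors listed above rather than reducing them to a single pass/fail result.

```python
from pathlib import Path

def run_batch(fofn, passed_path, failed_path, check):
    """Apply check(name) to each reading named in fofn; split names into two files.

    check is a hypothetical callback that returns True when the reading passes
    and may raise OSError (e.g. the experiment file could not be opened).
    """
    lines = Path(fofn).read_text().splitlines()
    names = [line.strip() for line in lines if line.strip()]

    passed, failed = [], []
    for name in names:
        try:
            ok = check(name)
        except OSError:
            ok = False  # treat an unreadable file as a failure
        (passed if ok else failed).append(name)

    # Write one name per line, mirroring the input file of file names.
    Path(passed_path).write_text("".join(n + "\n" for n in passed))
    Path(failed_path).write_text("".join(n + "\n" for n in failed))
    return passed, failed
```

A call such as `run_batch("files", "files.pass", "files.fail", check)` would mirror the file names used in the example run above.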

The search for "cloning vector" is identical to that for the sequencing vector, but the search for "vector rearrangements" uses a simpler algorithm and looks only for a "Minimum match length". Any reading that contains a string of characters of at least this length that exactly matches a segment of the vector sequence is classed as a "vector rearrangement", and its name is not written to the file of passed file names.
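
The rearrangement check can be sketched in Python as an exact shared-substring test. The function name is hypothetical, and the set-based lookup is one simple way to do it, not necessarily how vepe does it. It relies on the fact that any exact match of at least the minimum length contains a match of exactly that length.

```python
def has_exact_match(reading, vector, min_match_length):
    """Return True if reading and vector share an exact substring of at
    least min_match_length characters."""
    if min_match_length <= 0 or len(reading) < min_match_length:
        return False
    # All substrings of the vector that are exactly min_match_length long.
    vector_words = {
        vector[j:j + min_match_length]
        for j in range(len(vector) - min_match_length + 1)
    }
    # A longer match always contains a match of exactly min_match_length.
    return any(
        reading[i:i + min_match_length] in vector_words
        for i in range(len(reading) - min_match_length + 1)
    )
```

Under this sketch, a reading for which `has_exact_match` returns True would be left out of the file of passed file names.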


This page is maintained by James Bonfield. Last generated on 29 April 1996.
URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/vepe_repe_2.html