/*****************************************************************************
#   Copyright (C) 1993-1996 by Phil Green.                          
#   All rights reserved.                           
#                                                                           
#   This software is part of a beta-test version of the swat/cross_match/phrap 
#   package.  It should not be redistributed or
#   used for any commercial purpose, including commercially funded
#   sequencing, without written permission from the author and the
#   University of Washington.
#   
#   This software is provided ``AS IS'' and any express or implied
#   warranties, including, but not limited to, the implied warranties of
#   merchantability and fitness for a particular purpose, are disclaimed.
#   In no event shall the authors or the University of Washington be
#   liable for any direct, indirect, incidental, special, exemplary, or
#   consequential damages (including, but not limited to, procurement of
#   substitute goods or services; loss of use, data, or profits; or
#   business interruption) however caused and on any theory of liability,
#   whether in contract, strict liability, or tort (including negligence
#   or otherwise) arising in any way out of the use of this software, even
#   if advised of the possibility of such damage.
#                                                                         
#*****************************************************************************/

SWAT, CROSS_MATCH
 Swat and cross_match are programs for rapid protein and nucleic acid
sequence comparison and database search, based on an efficient
implementation of the Smith-Waterman-Gotoh algorithm (Smith & Waterman
1981, Gotoh 1982).  By revising the recursion relations slightly and
making efficient use of word-packing, we have been able to
significantly reduce the number of machine instructions executed per
Smith-Waterman matrix cell, resulting in a speed enhancement of 2-3
fold over the previous best implementation (Pearson and Miller, 1992).
Swat is about 1 / 10 as fast as BLAST, making it quite adequate for
performing database searches on a workstation.  Statistical
significance of the hits is evaluated based on the empirical score
distribution for the search, taking into account dependence of the
score variance and mean on the length of the database sequence.

 Cross_match uses the same algorithm as swat, but in addition allows
the comparison of a pair of sequences to be constrained to bands of
the Smith-Waterman matrix that surround one or more matching words in
the sequences. This substantially increases speed for large-scale
nucleotide sequence comparisons without significantly compromising
sensitivity.  We have found cross_match useful for the following
tasks: comparing a set of reads to a set of vector sequences to
produce vector-masked versions of the reads; comparing contig
sequences found by two alternative assembly procedures (e.g.  phrap
and xbap) to each other; and comparing assembled contigs to the final
edited cosmid sequence, in retrospective studies of the accuracy of
fully automated assembly.  We also now routinely use cross_match to
find repeats in finished cosmid sequences and mask them out prior to
database searches.

 Swat and cross_match can also be used to search a profile against a
database.

PHRAP
Phrap is a program for shotgun sequence assembly.  Key features
include:
--Use of data quality information, both direct (from phred trace
analysis) and indirect (from pairwise read comparisons), to delineate
the likely accurate base calls in each read; this helps discriminate
repeats, permits use of the full reads in assembly, and allows a
highly accurate consensus sequence to be generated.  A probability of
error is computed for each consensus sequence position, which can be
used to focus human editing on particular regions, to automate
decision-making about where additional data are needed, and to provide
users of the final sequence with information about local variations in
quality.

--The full length of each read (or more precisely, as much as is
accurate enough to align against other reads) is used in the assembly.
Other programs typically require that reads be trimmed fairly
conservatively to remove all low-quality data, since the algorithms
are such that inaccurate data would tend to cause correct overlaps to
be rejected; this necessitates excluding a significant amount of
useful (if slightly less accurate) sequence from the initial assembly,
which then must be brought in manually later to make additional contig
joins and help resolve ambiguities. Our approach minimizes or
eliminates the need to make manual joins, and perhaps more
importantly, by using all available data it increases the chance of
correctly distinguishing repeats during assembly.  Use of full reads
does complicate analysis somewhat due to the high error rate towards
the ends of the reads and the fact that read chimerism is more likely
to occur there; thus it is essential to use methods that allow for
this. In particular, we use local (Smith-Waterman) alignments for the
pairwise read comparisons, rather than global (Needleman-Wunsch).

-- There is no masking of repeats from known families, or other use of 
biological assumptions about the sequence.

-- Aberrant data, including likely deletion and chimeric reads, and
unremoved sequencing vector, are automatically identified, and treated
specially to avoid causing problems with assembly.

-- The contig sequence is constructed as a mosaic of the highest
quality parts of the reads, rather than as a ``consensus''.  This
avoids both the complex algorithmic issues associated with multiple
alignment methods, and problems that sometimes occur with these methods
causing the consensus to be less accurate than individual reads
at some positions. The sequence produced by phrap is in fact quite
accurate, with less than 1 error per 10 kb in typical datasets.

-- Information to provide indications of accuracy and uniqueness of
assembly is generated. Possible sites where misassembly may have
occurred, or where additional data may be required are flagged.

-- Large datasets can be assembled efficiently. On a Dec Alpha
3000/900 workstation (comparable in speed to a 200 Mhz Pentium PC),
cosmid datasets typically require an average of 2 minutes, and human
BACs 10 min. to several hours.  A 36,000 read bacterial genome
(assembled at Genome Therapeutics, Inc.) required < 2 hrs.  The main
factor limiting speed of assembly is the number of repeats in the
genomic target sequence, rather than the total number of reads, since
repeats disproportionately increase the number of Smith-Waterman
comparisons that need to be performed.



PHRAPVIEW 
Phrapview is a graphical tool that provides a "global" view of the
phrap assembly, complementary to the "local" view provided by the
sequence editor CONSED. It is written in perlTk; to run it you will
need to have installed on your system a recent version of perl (perl5
or later) that includes the Tk library. This is available for free
from a number of web sites. 

 Phrapview is invoked with the command

  phrapview filename 

where filename is the name of a ".view" file that was created by
running phrap with the -view option. 	


The following information can be displayed:

(i) Basic information concerning the numbers of reads, singletons,
contigs, chimeras, etc. Each contig, and each chimeric read, is
represented by a horizontal black line proportional in length to the
contig or read size.  Note that although they are shown separately in this
display, chimeric reads in fact generally are incorporated by phrap
into one of the contigs, in a region that corresponds to one of the
two chimeric pieces; the location of this is indicated by the black
"chimera match" line connecting the chimera to a contig (see below).

(ii) Depth of coverage.  Graphs (in yellow) above and below the contig
line indicate the depth of coverage (i.e. number of reads aligned
against the contig) on each strand at each position; the orange lines
correspond to a depth of 10. In addition to the ordinary depth of
coverage (which counts all reads in the contig, and is mainly useful
to indicate regions where an abnormal deficit or excess occurs), the
"reduced depth" can also be displayed. This counts only reads that are
non-chimeric, have a positive LLR score against the contig, and have
no positive LLR scoring match to a read elsewhere in the assembly --
i.e. the "confidently placed" reads. (See phrap.doc for a description
of LLR scores). Misjoins usually occur in places where the reduced
depth is 0 on both strands.

(iii) "Matches" and "links" (these require pressing an appropriate button
to display). "Matches" are pairwise read matches between reads in
different parts of the assembly, and are indicated by a single (curved
or straight)) line connecting the midpoints of the two matching
regions.  Contig matches (between two non-chimeric reads in the
assembled contigs) and chimera matches (between a chimeric read and
itself or another read) can be displayed separately. "Links" connect the
startpoints of two reads that derive from the same subclone:
forward-reverse links correspond to reads from opposite ends of a
subclone insert, while same-strand links correspond to walking or
duplicated reads in the same direction on the insert.  Links will only
be indicated properly if the read naming convention assumed by phrap
is used (see phrap documentation).
 Each match and link is considered to be in one of three classes,
indicated by color: "problems" (red), which indicate serious
discrepancies or a significant possibility of incorrect, incomplete,
or non-unique assembly; "ok" (black), which tend to confirm the
assembly or are otherwise consistent with it; and "grayzone" (blue or
green), which may in some cases indicate problems but are probably ok.
Links are considered "ok" if the two read starts meet the expected
spacing and orientation constraints, otherwise they are marked as
"problems" (in many cases however these reflect subclone tracking
errors rather than assembly errors).  Matches are considered "ok" if
they are between reads assembled in the same location, or have LLR
scores below the cutoff; "ok" matches are not shown. Matches are
considered "problems" if they have LLR scores above the cutoff, are
non-local, and involve regions where the reduced depth is 0.
"Problem" matches between large contigs often indicate the presence of
near-perfect repeats, which may or may not have been misassembled; or
(in some cases) a join that was missed.  "Problem" matches between a
small (e.g. singleton) contig and a large one typically represent
cases where a read could not be assembled into its proper location by
phrap, due generally to some anomaly involving it (e.g. an internal
region of low quality data).  Non-local matches with LLR scores above
the cutoff which do not lie in regions of reduced depth 0 are
considered to be "grayzone". Usually these involve isolated pairs of
reads each of which extends part way into a repeat, such that the
overlapping region is too small to contain high-quality discrepancies.
The fact that the reduced depth is non-zero means that there are other
reads in the same region which do not have any positive LLR scores to
the other region, so it is unlikely that there is a significant
misassembly in such cases, although it is possible that one of the two
matching reads may have been incorrectly placed.
 Chimeras generally have one "ok" match (to the site where the read
was assembled) and one or more "problem" matches to the region from
which the other part of the chimera derives (sometimes these matches
all have negative LLR scores and are thus not shown with the default
LLR cutoff). Some chimeras actually represent subclones with internal
deletions; these tend to be obvious since the two pieces map to nearby
locations in some contig and are in the same orientation there.
 Double-headed arrows on matches indicate that the matching regions
are on opposite strands; double-headed arrows on links indicate that
the orientation of the two reads is inconsistent with the read names
(i.e. forward-reverse pairs that point in the same direction, or same
strand pairs that point in opposite directions; these are colored as
"problems" if the reads are in the same contig).
  Passing the mouse pointer (slowly!) over a match or link curve results
in its being highlighted (converted to yellow); it remains highlighted
until the pointer is moved over another curve, or a contig line.
Textual information about the highlighted match or link is shown in a
sunken box in the top central region of the display.
 It is possible to change the values of several parameters that
influence the display of the matches and links, although you probably
will not need to do this. The role of the LLR cutoff is described
above. Displace refers to the relative offset of the contig positions
of the two matching segments; together with the orientation of the
segments, it determines whether or not a match is considered "local".
It is basically a "fudge factor" that is required because the offset
between two overlapping reads is not uniquely defined, due to the
presence of insertions and deletions. The unalign parameter refers to
the size of the largest segment involved in the pairwise read match
that is unaligned against the contig sequence. If large, such segments
may indicate a misplaced read or (occasionally) an undetected chimera.
Min fwd-rev and max fwd-rev refer to the minimum and maximum allowed
separation (in bp) between (the starts of) forward-reverse read pairs;
i.e. these represent the minimum and maximum allowed subclone insert
size.  Separations outside this range (or in the wrong orientation)
are marked as problems. Min ss and max ss are the minimum and maximum
spacing between same-strand reads (i.e. size of a walking step).
 The colors, and horizontal scale, can at present only be changed by
altering the variable definitions that occur near the beginning of the
phrapview source file, and re-running the program.