Every base in every reading is given a numerical estimate of its accuracy, in the range 1 to 99. These values are stored in SCF files and copied into gap databases during assembly. The consensus calculation, which is the heart of our strategy, is described below.

First, only those bases whose accuracy estimate is above a threshold are included in the calculation. Secondly, instead of counting how many bases of each type exceed this threshold, we add up their numerical accuracy estimates. To decide whether there is a good consensus at a position, and hence whether a particular base should be assigned, we test whether one base type accounts for a sufficient proportion of the total accuracy estimate at that position. That is, for the bases above the threshold we sum the values for each base type to give Bt, where B is A, C, G, T or * ("*" is a padding character placed in the readings to produce alignment), and we sum the Bt over all base types to give S. If, for any B, Bt/S is above a second cutoff, we assign base type B to the consensus. Otherwise the consensus is denoted as unknown, "-".

An important variant of this calculation is to perform it separately for each strand of the sequence, so that a definite character (A, C, G, T, *) is produced only if each strand has a definite character and the two strands agree. More detail is given in Bonfield, J.K. and Staden, R. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23, 1406-1410 (1995).
The consensus sequence produced by gap uses one of these two methods (the single calculation or its strand-by-strand variant), so for each A, C, G, T or * appearing in it we know that there was good agreement among the data judged to be of high accuracy.
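
The calculation described above can be sketched in a few lines of Python. This is only an illustration of the description in the text, not the code used by gap. The function names, the "fwd"/"rev" strand labels and the cutoff values in the usage example are assumptions made for the sketch. Where the text says "if for any B", the sketch tests only the base type with the largest total, which gives the same answer whenever the proportion cutoff is 0.5 or more.

```python
from collections import defaultdict

BASES = "ACGT*"   # "*" is the padding character
UNKNOWN = "-"     # symbol used when no base type reaches the cutoff


def consensus_base(calls, quality_cutoff, fraction_cutoff):
    """Consensus at one position from (base, accuracy) pairs.

    quality_cutoff and fraction_cutoff are hypothetical parameter names
    for the two cutoffs described in the text.
    """
    totals = defaultdict(int)  # Bt: summed accuracy per base type
    for base, accuracy in calls:
        # Only bases whose accuracy is above the threshold take part.
        if base in BASES and accuracy > quality_cutoff:
            totals[base] += accuracy

    s = sum(totals.values())   # S: sum of all Bt
    if s == 0:
        return UNKNOWN         # nothing passed the threshold

    # Assumption: test the base type with the largest Bt.
    best = max(totals, key=totals.get)
    return best if totals[best] / s > fraction_cutoff else UNKNOWN


def consensus_base_both_strands(calls, quality_cutoff, fraction_cutoff):
    """Strand-by-strand variant from (base, accuracy, strand) triples.

    strand is "fwd" or "rev" (labels assumed for this sketch).
    """
    per_strand = []
    for strand in ("fwd", "rev"):
        subset = [(b, a) for b, a, s in calls if s == strand]
        per_strand.append(consensus_base(subset, quality_cutoff, fraction_cutoff))

    forward, reverse = per_strand
    # A definite character needs a definite call on each strand, in agreement.
    if forward != UNKNOWN and forward == reverse:
        return forward
    return UNKNOWN


if __name__ == "__main__":
    calls = [("A", 40), ("A", 30), ("C", 10), ("G", 5)]
    # G (5) is below the threshold of 8, so At = 70, Ct = 10, S = 80.
    print(consensus_base(calls, quality_cutoff=8, fraction_cutoff=0.75))  # A
    print(consensus_base(calls, quality_cutoff=8, fraction_cutoff=0.9))   # -
```

In the first call At/S = 70/80 = 0.875, which is above 0.75, so A is assigned. In the second call the same proportion is below 0.9, so the position is left as "-".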