Poisson calculations

The strategy for the shotgun approach follows the Lander and Waterman application of the Poisson distribution: Lander ES, Waterman MS "Genomic mapping by fingerprinting random clones: a mathematical analysis" Genomics 2 (3): 231-239 (1988).

updated 1-30-01


Table 1:

The probability that a base is not sequenced is P0 = e^-c, where c is the fold coverage:

  Fold                                             P0 x 100 =
coverage    P0 = e^-c                              % not sequenced  % sequenced

 0.25  P0 = e^-0.25 = 1/e^0.25         = 0.78      78%              22%
 0.50  P0 = e^-0.50 = 1/e^0.50         = 0.61      61%              39%
 0.75  P0 = e^-0.75 = 1/e^0.75         = 0.47      47%              53%
  1    P0 = e^-1 = 1/e^1 = 1/2.718     = 0.37      37%              63%
  2    P0 = e^-2 = 1/e^2 = 1/7.389     = 0.135     13.5%            86.5%
  3    P0 = e^-3 = 1/e^3 = 1/20.086    = 0.05      5%               95%
  4    P0 = e^-4 = 1/e^4 = 1/54.598    = 0.018     1.8%             98.2%
  5    P0 = e^-5 = 1/e^5 = 1/148.4     = 0.0067    0.67%            99.33%
  6    P0 = e^-6 = 1/e^6 = 1/403.4     = 0.0025    0.25%            99.75%
  7    P0 = e^-7 = 1/e^7 = 1/1096.6    = 0.0009    0.09%            99.91%
  8    P0 = e^-8 = 1/e^8 = 1/2980.96   = 0.0003    0.03%            99.97%
  9    P0 = e^-9 = 1/e^9 = 1/8103.08   = 0.0001    0.01%            99.99%
 10    P0 = e^-10 = 1/e^10 = 1/22026.5 = 0.000045  0.005%           99.995%
Note: for a given fold coverage, the % sequenced does not depend on read length.
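
The Table 1 values can be reproduced with a few lines of Python. This is a minimal sketch; the function name fraction_unsequenced is illustrative and not part of the original page. It uses the unrounded value of e^-c, so the later digits differ slightly from the rounded table entries.

```python
import math


def fraction_unsequenced(coverage: float) -> float:
    """Poisson probability that a base is covered by zero reads: P0 = e^-c."""
    return math.exp(-coverage)


# Same fold-coverage values as the rows of Table 1.
for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10):
    p0 = fraction_unsequenced(c)
    print(f"{c:>5}  P0 = {p0:.6f}  "
          f"not sequenced = {100 * p0:.4f}%  "
          f"sequenced = {100 * (1 - p0):.4f}%")
```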

Table 2:

Total Gap Length (in bases) = G e^-c, where c = fold coverage, G = target sequence length and e^-c = P0
genome size =   50kb   150kb    300kb       2Mb        4Mb        500Mb
Fold coverage  Ge^-c   Ge^-c    Ge^-c     Ge^-c      Ge^-c        Ge^-c

  1           18,500  55,500  111,000   740,000  1,480,000  185,000,000
  2            6,750  20,250   40,500   270,000    540,000   67,500,000
  3            2,500   7,500   15,000   100,000    200,000   25,000,000
  4              900   2,700    5,400    36,000     72,000    9,000,000
  5              335   1,005    2,010    13,400     26,800    3,350,000
  6              125     375      750     5,000     10,000    1,250,000
  7               45     135      270     1,800      3,600      450,000
  8               15      45       90       600      1,200      150,000
  9                5      15       30       200        400       50,000
  10               2       7       14        90        180       22,500
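
As a rough check on Table 2, the total gap length can be computed directly. This is a sketch only: total_gap_length is a hypothetical helper name, and it uses the unrounded e^-c, whereas the table multiplies G by the rounded P0 from Table 1 (so the table shows 18,500 where this prints about 18,394).

```python
import math


def total_gap_length(genome_size: int, coverage: float) -> float:
    """Expected number of unsequenced bases: G * e^-c."""
    return genome_size * math.exp(-coverage)


# Genome sizes used as the columns of Table 2, in bases.
sizes = [50_000, 150_000, 300_000, 2_000_000, 4_000_000, 500_000_000]

for c in range(1, 11):
    row = "  ".join(f"{total_gap_length(g, c):>14,.0f}" for g in sizes)
    print(f"{c:>3}  {row}")
```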

Table 3:

N = (Genome Size x Fold Cov.) / Bases Per Read = number of gels (reads) needed for a given fold coverage:
genome size =        50kb                    150kb                     300kb     
Read Length = 400    500    600       400     500     600       400     500     600
Fold cov.
    1         125    100     83       375     300     250       750     600     500
    2         250    200    167       750     600     500      1500    1200    1000
    3         375    300    250      1125     900     750      2250    1800    1500
    4         500    400    333      1500    1200    1000      3000    2400    2000
    5         625    500    417      1875    1500    1250      3750    3000    2500
    6         750    600    500      2250    1800    1500      4500    3600    3000    
    7         875    700    583      2625    2100    1750      5250    4200    3500    
    8        1000    800    667      3000    2400    2000      6000    4800    4000    
    9        1125    900    750      3375    2700    2250      6750    5400    4500    
   10        1250   1000    833      3750    3000    2500      7500    6000    5000    
    
genome size =          2Mb                          4Mb                          500Mb       
Read Length = 400      500      600        400      500      600        400      500      600
Fold cov.    
    1        5000     4000     3333      10000     8000     6667    1250000  1000000   833333
    2       10000     8000     6667      20000    16000    13333    2500000  2000000  1666667
    3       15000    12000    10000      30000    24000    20000    3750000  3000000  2500000
    4       20000    16000    13333      40000    32000    26667    5000000  4000000  3333333
    5       25000    20000    16667      50000    40000    33333    6250000  5000000  4166667
    6       30000    24000    20000      60000    48000    40000    7500000  6000000  5000000
    7       35000    28000    23333      70000    56000    46667    8750000  7000000  5833333
    8       40000    32000    26667      80000    64000    53333   10000000  8000000  6666667
    9       45000    36000    30000      90000    72000    60000   11250000  9000000  7500000
   10       50000    40000    33333     100000    80000    66667   12500000 10000000  8333333
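
The Table 3 entries follow from N = G x c / L. The sketch below assumes rounding to the nearest whole read; reads_needed is an illustrative name, not something defined on the original page.

```python
def reads_needed(genome_size: int, coverage: float, read_length: int) -> int:
    """Number of reads N giving the requested fold coverage: N = G * c / L."""
    return round(genome_size * coverage / read_length)


# Reproduce the 50kb block of Table 3 for the three read lengths.
for c in range(1, 11):
    counts = [reads_needed(50_000, c, length) for length in (400, 500, 600)]
    print(c, counts)
```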


Table 4: Number of Gaps = N e^-c

50kb Target Clone:

Read Length=          400                        500                       600           
Fold Cov.    N     e^-c    #Gaps=Ne^-c   N     e^-c   #Gaps=Ne^-c   N   e^-c   #Gaps=Ne^-c
   1        125    0.37       46       100    0.37       37       83    0.37     31
   2        250    0.135      34       200    0.135      27      167    0.135    23
   3        375    0.05       19       300    0.05       15      250    0.05     13
   4        500    0.018       9       400    0.018       7      333    0.018     6
   5        625    0.0067      4       500    0.0067      3      417    0.0067    3
   6        750    0.0025      2       600    0.0025      2      500    0.0025    1    
   7        875    0.0009      1       700    0.0009      1      583    0.0009    1    
   8       1000    0.0003      0       800    0.0003      0      667    0.0003    0    
   9       1125    0.0001      0       900    0.0001      0      750    0.0001    0    
  10       1250    0.000045    0      1000    0.000045    0      833    0.000045  0    

150kb Target Clone:

Read Length=          400                        500                       600           
Fold Cov.    N     e^-c    #Gaps=Ne^-c   N     e^-c   #Gaps=Ne^-c   N   e^-c   #Gaps=Ne^-c
   1        375    0.37      139       300    0.37      111      250    0.37     93    
   2        750    0.135     101       600    0.135      81      500    0.135    68    
   3       1125    0.05       56       900    0.05       45      750    0.05     38    
   4       1500    0.018      27      1200    0.018      22     1000    0.018    18    
   5       1875    0.0067     13      1500    0.0067     10     1250    0.0067    8    
   6       2250    0.0025      6      1800    0.0025      5     1500    0.0025    4    
   7       2625    0.0009      2      2100    0.0009      2     1750    0.0009    2    
   8       3000    0.0003      1      2400    0.0003      1     2000    0.0003    1    
   9       3375    0.0001      0      2700    0.0001      0     2250    0.0001    0    
  10       3750    0.000045    0      3000    0.000045    0     2500    0.000045  0    

300kb Target Clone:

Read Length=          400                        500                       600           
Fold Cov.    N     e^-c    #Gaps=Ne^-c   N     e^-c   #Gaps=Ne^-c   N   e^-c   #Gaps=Ne^-c
   1        750    0.37      278       600    0.37      222      500    0.37    185    
   2       1500    0.135     203      1200    0.135     162     1000    0.135   135    
   3       2250    0.05      113      1800    0.05       90     1500    0.05     75    
   4       3000    0.018      54      2400    0.018      43     2000    0.018    36    
   5       3750    0.0067     25      3000    0.0067     20     2500    0.0067   17    
   6       4500    0.0025     11      3600    0.0025      9     3000    0.0025    8    
   7       5250    0.0009      5      4200    0.0009      4     3500    0.0009    3    
   8       6000    0.0003      2      4800    0.0003      1     4000    0.0003    1
   9       6750    0.0001      1      5400    0.0001      1     4500    0.0001    0    
  10       7500    0.000045    0      6000    0.000045    0     5000    0.000045  0

2Mb Target Genome:

Read Length=          400                        500                       600           
Fold Cov.    N     e^-c    #Gaps=Ne^-c   N     e^-c   #Gaps=Ne^-c   N   e^-c   #Gaps=Ne^-c
   1       5000    0.37     1850      4000    0.37     1480     3333    0.37   1233    
   2      10000    0.135    1350      8000    0.135    1080     6667    0.135   900    
   3      15000    0.05      750     12000    0.05      600    10000    0.05    500    
   4      20000    0.018     360     16000    0.018     288    13333    0.018   240    
   5      25000    0.0067    168     20000    0.0067    134    16667    0.0067  112    
   6      30000    0.0025     75     24000    0.0025     60    20000    0.0025   50    
   7      35000    0.0009     32     28000    0.0009     25    23333    0.0009   21    
   8      40000    0.0003     12     32000    0.0003     10    26667    0.0003    8    
   9      45000    0.0001      5     36000    0.0001      4    30000    0.0001    3    
  10      50000    0.000045    2     40000    0.000045    2    33333    0.000045  1    

4Mb Target Genome:

Read Length=          400                        500                       600           
Fold Cov.    N     e^-c    #Gaps=Ne^-c   N     e^-c   #Gaps=Ne^-c   N   e^-c   #Gaps=Ne^-c
   1      10000    0.37     3700      8000    0.37     2960     6667    0.37   2467
   2      20000    0.135    2700     16000    0.135    2160    13333    0.135  1800
   3      30000    0.05     1500     24000    0.05     1200    20000    0.05   1000
   4      40000    0.018     720     32000    0.018     576    26667    0.018   480
   5      50000    0.0067    335     40000    0.0067    268    33333    0.0067  223
   6      60000    0.0025    150     48000    0.0025    120    40000    0.0025  100    
   7      70000    0.0009     63     56000    0.0009     50    46667    0.0009   42    
   8      80000    0.0003     24     64000    0.0003     19    53333    0.0003   16    
   9      90000    0.0001      9     72000    0.0001      7    60000    0.0001    6    
  10     100000    0.000045    5     80000    0.000045    4    66667    0.000045  3    

500Mb Target Genome:

Read Length=          400                        500                       600           
Fold Cov.    N     e^-c  #Gaps=Ne^-c     N     e^-c #Gaps=Ne^-c       N     e^-c #Gaps=Ne^-c
   1    1250000   0.37     462500   1000000   0.37     370000      833333   0.37   308333
   2    2500000   0.135    337500   2000000   0.135    270000     1666667   0.135  225000
   3    3750000   0.05     187500   3000000   0.05     150000     2500000   0.05   125000
   4    5000000   0.018     90000   4000000   0.018     72000     3333333   0.018   60000
   5    6250000   0.0067    41875   5000000   0.0067    33500     4166667   0.0067  27917
   6    7500000   0.0025    18750   6000000   0.0025    15000     5000000   0.0025  12500
   7    8750000   0.0009     7875   7000000   0.0009     6300     5833333   0.0009   5250
   8   10000000   0.0003     3000   8000000   0.0003     2400     6666667   0.0003   2000
   9   11250000   0.0001     1125   9000000   0.0001      900     7500000   0.0001    750
  10   12500000   0.000045    563  10000000   0.000045    450     8333333   0.000045  375
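
The gap counts in Table 4 come from multiplying the number of reads by P0. The following sketch uses the same simplified N e^-c form as the table (it does not model a minimum detectable overlap between reads). The name expected_gaps is hypothetical, and the unrounded e^-c is used, so results differ slightly from the table.

```python
import math


def expected_gaps(genome_size: int, coverage: float, read_length: int) -> float:
    """Expected number of gaps, N * e^-c, with N = G * c / L reads."""
    n_reads = genome_size * coverage / read_length
    return n_reads * math.exp(-coverage)


# 50kb target clone, 500-base reads (middle columns of the first Table 4 block).
for c in range(1, 11):
    print(c, round(expected_gaps(50_000, c, 500), 2))
```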

Table 5:

The values for each fold coverage for a 50kb cosmid (G=50,000) with an average read length of 500 bases are:

                             From Table 2:  From Table 4:
  fold   Total bases       total gap length  Number of   Gap Length/# gaps=    %
coverage  sequenced   e^-c   in bases=Ge^-c  Gaps =Ne^-c   # bases per gap  complete
   1        50000    0.37       18,500          37            500            63
   2       100000    0.135       6,750          27            250            86.5
   3       150000    0.05        2,500          15            167            95
   4       200000    0.018         900           7            129            98.2
   5       250000    0.0067        335           3            112            99.33
   6       300000    0.0025        125           2             63            99.75    
   7       350000    0.0009         45           1             45            99.91    
   8       400000    0.0003         15           1             15            99.97    
   9       450000    0.0001          5           1              5            99.99    
  10       500000    0.000045        2           1              2            99.995    

The values for each fold coverage for a 150kb BAC (G=150,000) with an average read length of 500 bases are:

                             From Table 2:  From Table 4:
  fold   Total bases       total gap length  Number of   Gap Length/# gaps=    %
coverage  sequenced   e^-c   in bases=Ge^-c  Gaps =Ne^-c   # bases per gap  complete
   1       150000    0.37       55,500         111            500           63
   2       300000    0.135      20,250          81            250           86.5
   3       450000    0.05        7,500          45            167           95
   4       600000    0.018       2,700          22            123           98.2
   5       750000    0.0067      1,005          10            101           99.33
   6       900000    0.0025        375           5             75           99.75    
   7      1050000    0.0009        135           2             68           99.91    
   8      1200000    0.0003         45           1             45           99.97    
   9      1350000    0.0001         15           1             15           99.99    
  10      1500000    0.000045        7           1              7           99.995

The values for each fold coverage for a 300kb BAC (G=300,000) with an average read length of 500 bases are:

                             From Table 2:  From Table 4:
  fold   Total bases       total gap length  Number of   Gap Length/# gaps=    %
coverage  sequenced   e^-c   in bases=Ge^-c  Gaps =Ne^-c   # bases per gap  complete
   1       300000    0.37      111,000         222            500           63
   2       600000    0.135      40,500         162            250           86.5
   3       900000    0.05       15,000          90            167           95
   4      1200000    0.018       5,400          43            126           98.2
   5      1500000    0.0067      2,010          20            101           99.33
   6      1800000    0.0025        750           9             83           99.75
   7      2100000    0.0009        270           4             68           99.91
   8      2400000    0.0003         90           1             90           99.97
   9      2700000    0.0001         30           1             30           99.99
  10      3000000    0.000045       14           1             14           99.995

The values for each fold coverage for a 2Mb Genome (G=2,000,000) with an average read length of 500 bases are:

                             From Table 2:  From Table 4:
  fold   Total bases       total gap length  Number of   Gap Length/# gaps=    %
coverage  sequenced   e^-c   in bases=Ge^-c  Gaps =Ne^-c   # bases per gap  complete
   1      2000000    0.37      740,000        1480             500           63
   2      4000000    0.135     270,000        1080             250           86.5
   3      6000000    0.05      100,000         600             167           95
   4      8000000    0.018      36,000         288             125           98.2
   5     10000000    0.0067     13,400         134             100           99.33
   6     12000000    0.0025      5,000          60              83           99.75    
   7     14000000    0.0009      1,800          25              72           99.91    
   8     16000000    0.0003        600          10              60           99.97    
   9     18000000    0.0001        200           4              50           99.99    
  10     20000000    0.000045       90           2              45           99.995    

The values for each fold coverage for a 4Mb Genome (G=4,000,000) with an average read length of 500 bases are:

                             From Table 2:  From Table 4:
  fold   Total bases       total gap length  Number of   Gap Length/# gaps=    %
coverage  sequenced   e^-c   in bases=Ge^-c  Gaps =Ne^-c   # bases per gap  complete
   1      4000000    0.37    1,480,000        2960             500           63
   2      8000000    0.135     540,000        2160             250           86.5
   3     12000000    0.05      200,000        1200             167           95
   4     16000000    0.018      72,000         576             125           98.2
   5     20000000    0.0067     26,800         268             100           99.33
   6     24000000    0.0025     10,000         120              83           99.75    
   7     28000000    0.0009      3,600          50              72           99.91    
   8     32000000    0.0003      1,200          19              63           99.97    
   9     36000000    0.0001        400           7              57           99.99    
  10     40000000    0.000045      180           4              45           99.995    

The values for each fold coverage for a 500Mb Genome (G=500,000,000) with an average read length of 500 bases are:

                             From Table 2:  From Table 4:   Gap Length/  (genome - gap length)/
  fold   Total bases       total gap length  Number of       # gaps =        # contigs =       %
coverage  sequenced   e^-c   in bases=Ge^-c  Gaps =Ne^-c   # bases/gap    contig length    complete
   1      500000000  0.37     185,000,000     370000            500             851         63
   2     1000000000  0.135     67,500,000     270000            250            1602         86.5
   3     1500000000  0.05      25,000,000     150000            167            3167         95
   4     2000000000  0.018      9,000,000      72000            125            6819         98.2
   5     2500000000  0.0067     3,350,000      33500            100           14825         99.33
   6     3000000000  0.0025     1,250,000      15000             83           33250         99.75
   7     3500000000  0.0009       450,000       6300             71           79294         99.91
   8     4000000000  0.0003       150,000       2400             63          208271         99.97
   9     4500000000  0.0001        50,000        900             56          555500         99.99
  10     5000000000  0.000045      22,500        450             50         1111061         99.995
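
The columns of Table 5 can be generated together. The sketch below is illustrative only: coverage_summary and its dictionary keys are hypothetical names, and it assumes, as the 500Mb block appears to, that the number of contigs equals the number of gaps. It uses the unrounded e^-c, so its output differs slightly from the rounded table entries. Since both gap length and gap count carry the factor e^-c, the bases-per-gap column reduces to read length / fold coverage.

```python
import math


def coverage_summary(genome_size: int, coverage: float, read_length: int) -> dict:
    """Table 5 style summary for one fold-coverage value."""
    p0 = math.exp(-coverage)                         # fraction not sequenced
    n_reads = genome_size * coverage / read_length   # N = G * c / L
    gap_bases = genome_size * p0                     # total gap length, G * e^-c
    n_gaps = n_reads * p0                            # number of gaps, N * e^-c
    return {
        "total_bases": n_reads * read_length,
        "gap_bases": gap_bases,
        "n_gaps": n_gaps,
        "bases_per_gap": gap_bases / n_gaps,         # equals L / c
        # Assumption: number of contigs taken equal to number of gaps.
        "contig_length": (genome_size - gap_bases) / n_gaps,
        "percent_complete": 100 * (1 - p0),
    }


# 500Mb genome with 500-base reads, as in the last block of Table 5.
for c in range(1, 11):
    s = coverage_summary(500_000_000, c, 500)
    print(f"{c:>3}  gaps={s['n_gaps']:>10,.0f}  "
          f"bases/gap={s['bases_per_gap']:>6.1f}  "
          f"contig={s['contig_length']:>12,.0f}  "
          f"complete={s['percent_complete']:.3f}%")
```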

References

Lander ES, Waterman MS "Genomic mapping by fingerprinting random clones: a mathematical analysis" Genomics 2 (3): 231-239 (1988)

Whitehead Institute for Biomedical Research, Cambridge Center, Massachusetts 02142.

Results from physical mapping projects have recently been reported for the genomes of Escherichia coli, Saccharomyces cerevisiae, and Caenorhabditis elegans, and similar projects are currently being planned for other organisms. In such projects, the physical map is assembled by first "fingerprinting" a large number of clones chosen at random from a recombinant library and then inferring overlaps between clones with sufficiently similar fingerprints. Although the basic approach is the same, there are many possible choices for the fingerprint used to characterize the clones and the rules for declaring overlap. In this paper, we derive simple formulas showing how the progress of a physical mapping project is affected by the nature of the fingerprinting scheme. Using these formulas, we discuss the analytic considerations involved in selecting an appropriate fingerprinting scheme for a particular project.


Arratia R, Lander ES, Tavare S, Waterman MS "Genomic mapping by anchoring random clones: a mathematical analysis" Genomics 11 (4): 806-827 (1991)

Department of Mathematics, University of Southern California, Los Angeles 90089.

A complete physical map of the DNA of an organism, consisting of overlapping clones spanning the genome, is an extremely useful tool for genomic analysis. Various methods for the construction of such physical maps are available. One approach is to assemble the physical map by "fingerprinting" a large number of random clones and inferring overlap between clones with sufficiently similar fingerprints. E.S. Lander and M.S. Waterman (1988, Genomics 2:231-239) have recently provided a mathematical analysis of such physical mapping schemes, useful for planning such a project. Another approach is to assemble the physical map by "anchoring" a large number of random clones--that is, by taking random short regions called anchors and identifying the clones containing each anchor. Here, we provide a mathematical analysis of such a physical mapping scheme.


Port E, Sun F, Martin D, Waterman MS "Genomic mapping by end-characterized random clones: a mathematical analysis" Genomics 26 (1): 84-100 (1995)

Department of Mathematics, University of Southern California, Los Angeles 90089-1113, USA.

Physical maps can be constructed by "fingerprinting" a large number of random clones and inferring overlap between clones when the fingerprints are sufficiently similar. E. Lander and M. Waterman (Genomics 2: 231-239, 1988) gave a mathematical analysis of such mapping strategies. The analysis is useful for comparing various fingerprinting methods. Recently it has been proposed that ends of clones rather than the entire clone be fingerprinted or characterized. Such fingerprints, which include sequenced clone ends, require a mathematical analysis deeper than that of Lander-Waterman. This paper studies clone islands, which can include uncharacterized regions, and also the islands that are formed entirely from the ends of clones.


Nelson DO, Speed TP "Predicting progress in directed mapping projects" Genomics 24 (1): 41-52 (1994)

Human Genome Center, Lawrence Livermore National Laboratory, Livermore, CA 94550.

Several recent mapping efforts have used so-called "directed" approaches to construct their maps. However, most, but not all, published methods for modeling the progress in physical mapping projects have been focused on random approaches, such as bottom-up fingerprinting and STS-content mapping. In addition, those few efforts that did model directed approaches used methods that required assuming that all insert lengths were the same. This assumption is unnecessary. Using properties of stationary processes, one can derive simple asymptotic formulas that apply equally to constant and variable clone lengths. Also, in the case of constant clone lengths, these results are equivalent to, and extend, those published results for directed mapping derived by other methods. Simulations show that these methods provide estimates well within the limits of uncertainty inherent in any mapping project.




Bruce Roe, broe@ou.edu