High Scoring Segment Statistics Test
High Scoring Segment Statistics Test
January 2007
I worked with Ankit Agrawal on a program that generates a distribution of possible scores, simulates a number of simulated sequences of scores based on that distribution, and tests Karlin et. al. statistics on the simulation data.
The program is according to the following assignment:
Develop a computer program for calculating the probablity of the maximum score of regions of a sequence of numbers based on the Karlin et al. formula. The program takes as input two files of real numbers. One file is a set of different numbers and the other is a set of the same size that gives the discrete probability distribution of the numbers in the first file. The program should verify that the second file is a discrete probability distribution of the first file and that the mean number of the first file is negative and that there is a positive probability of getting a positive number. Then the program computes K and lambda estimates of the Karlin et al. formula below for the two input files.
Karlin et al. formula Prob(S >= x) = 1 - exp( -KN * exp(-lambda * x) ),
where x is the maximum score of regions of a sequence of numbers of length N drawn from the first file according to the input probability distribution in the second file. The score of a region of a sequence of numbers is the sum of numbers in the region.
Next the program computes a score cutoff that occurs with a probability of 0.01.
Finally the program repeats the following task T times. The task is to generate a random number sequence of length N, to compute the maximum score of all regions of the sequence, and to estimate the probability of the maximum score with the Karlin et al. formula. The program also takes as input both T and N.
Check that the number of high scoring segments follows a Poisson distribution as derived in class.
An adjustment to the Karlin et. al. formula
Rather than calculate my observed Prob(S >= x), why not just calculate Prob(S = x)? This involves simpler programming and faster running time because there is one less nested loop. I had to convert the Karlin et. al. distribution function to a density function, by simply taking its derivative. The plot would then show an obvious unimodal shape and the view can get an idea of the expected value of the highest score of a segment in a random sequence.
The below plot shows output from the program with 210 possible scores and 10,000 simulated sequences of length 1,000. The x-axis is the maximum score of all segments of a sequence of scores. The y-axis is probability of that maximum score occurring. The green curve is the Karlin et. al. density function. The red dots are a normalized histogram of the observed maximum segment scores of the 10,000 simulated score sequences.

karlin.c
I found a version of the original BLAST module, karlin.c, used for calculated Karlin et. al. function parameters K and lambda on a Penn State University website. I made some syntactical adjustments and incorporated it into a C++ utility class.