Peak Calling

Introduction

Chromatin immunoprecipitation (ChIP) is used to find the localized binding sites of regulatory proteins. Advances in statistical analyses has dramatically increased the accuracy of the ChIP process. One such statistical technique is called peak calling. This peak calling counters the problem of false positive predictions. Peak calling is done in both of the DNA strands separately. This process is called twin peak calling. Twin peak calling combined with kernal density estimators and false discoveries rate estimations based on control libraries aids in the eradication of false positive interaction. Ensembele methods are used in filtering the peak values. In the below article prediction of human growth- associated binding proteins (GABPα) based on ChIP-seq observations.

In response to external stimuli or internal signals transcription can happen. This transcription process is controlled by a series of complex networks, agents and mechanism. Transcription factors, cofactors, nucleosomes, histone modifications, DNA methylation, micro RNAs control the transcription process. Interaction of these molecules activates and initiates the RNA polymerase complex. Chromatin Immunoprecipitation (ChIP) is used in mapping of these transcription factor binding sites (TFBS) to complex genomes.  As ChIP-seq results not only draws out the DNA associated with transcription factor (TF) of interest but also DNA fragments associated other proteins. There are challenges that remain in mapping TF binding sites. Specificity and selectivity of the antibody is the deciding factor of the ChIP studies. Antibodies can bind with other proteins of the TF family creating a non-specific signal also TFs can bind with modified or bound to the cofactors which are not recognized by antibodies. TFs can also bind to distal promoter, enhancer, introgenic regions and exons. IN DNA the binding site is generally short that is like 5 base pairs (bp). Thererfore analysis of TFBS in isolation is typically not feasible. These binding sites are frequently organized into cis-regulatory modules (CRMs). These CRMs are generally 100-1000 bp in length. TFs bind in these sites and regulate transcription rates. Computational methods for these CRMs are available and accurate. The predictions based on these methods are based on a very limited samples of verified binding sites. Based on such samples generalizing the motifs happen using mathematical models is highly challenging. Next generation ChIP-seq increases the representative samples in recent times. ChIP-Seq provides finer resolution than other methods. Computational analyses of a cell division regulator, the growth associated binding protein α-chain (GABPα), and its binding site in human genome is explained below.

Chromatin Immunoprecipitation (ChIP)

Formaldehyde cross-linking is used to anchor the protein of interest to its invivo DNA location. The DNA is either sonicated or sheared. This results in DNA sequences of length of few hundred bp. The protein molecule associated with the DNA molecule is incubated in the presence of specific antibody and immunoprecipitation is performed. 150-200 bp long DNA segments are selected by gel electrophoresis. The size of the TFs binding site varies from 5-26 bp. As discussed earlier we select 150- 200 bp DNA segments assuming to be with CRMs. But this compromises the resolution of this method. Estimation of false positive rates and background noise becomes necessary. This is done by creation of control libraries. These control libraries are controlled by reversing the cross-linking and ChIP, ChIP with a non-selective protein like IgG or with no immunoprecipitation at all. When no immunoprecipitation is done no protein is pulled down with DNA molecules. The resulting output is considered as background noise and can be used for estimating the false discovery rate.

Identification of Chromatin-Bound DNA

ChIP enriched DNA segments can be identified by two methods:

(I) Genomic tilting/ promoter microarrays (ChIP- chip)

(II) Ultra high-throughput sequencing (ChIP-Seq).

ChIP-chip

ChIP-chip works well in smaller genomes. This method is mostly used in yeast for identification of more than 100 TFBS. Only selected regions of the genome is analyzed by this method. Selected regions include promoter regions, chromosome 22 and pilot-ENCODE regions. Identification of binding sites for CREB (cAMP response element-binding protein), polycomb group TFs, mouse embryogenic stem cell regulatory network and estrogen receptor have been done by ChIP-chip. The resolution of this method is upto 500 bp in eukaryotes. High level hybridization and pre-requirement of pre-designed chip are the major limitation of this method.

ChIP-Seq

Coupled with ChIP, immunoprecipitated DNA fragments are parallel sequenced. Thus the DNA bound protein is identified. ChIP-seq produces rather a large number sequenced reads which could easily be in millions. This leads to a massive and noisy data. Statistically significant data set peaks is separated and found by mathematical analysis. The resolution of ChIP-seq is in the range of 25-200 bp. Increased sensitivity and selectivity counters the limitation of cross- hybridization. There is no need of microarray in this method.

Computational Discovery of Binding Sites from ChIP-seq Observations

The algorithmic methods for calling peaks, forms the basis of density distribution of sequencing reads. These peaks actually infers the actual binding sites. The choice of algorithm determines the resolution, sensitivity and selectivity of the peak calling method. However the algorithm are lacking far behind the experimental improvements. A diverse array of tools have been implemented for the purpose of peak calling. Each tool has their unique methods for background correction, normalization, and analyzing twin peaks for the opposite strands of the DNA. Certain tools demands a control library. Some tools like model based ChIP-seq work without a control library. Few tools employ distribution like poission for the control purpose. Binomal p-values of the peaks are used in their ranking. Recent tools employs the shift between the peaks as estimating the peaks vaklues. False discovery rate is calculate using different strategies in different tool as follows:

(I) Control libraries is used.

(II) Performing monte-carlo simulations

(III) Tag aggragation with no p-values.

(IV)  Kernel density estimation

(V)  Functional Divergence Ratio (FDR) is also used.

Quantitative Enrichment of Sequence Tags (QuEST) was developed by Anton Valouev and colleagues at Stanford. This tool uses the directionality of the sequencing reads to find genomic regions enriched with TF-bound DNA fragments. Kernel density estimation method is used to generate smoothed sequencing reads density. Local density maxima are sought. QuEST can statically analyze the peak values that indicate a higher likelihood of biologically relevant TFBS.

Growth-Activated Binding Protein (GABPα)

GABPα is a member of the EtF family (Electron transferring Flavoprotein). It is necessary in cell division. GABPα has a 10–11 bp footprint on ChIP was performed by using antibodies specific to this protein and sequencing was performed on the Genome Analyzer platform. QuEST tool is used for finding the binding site of the GABPa.

Methods

TFBS discovery using QuEST is described in nine steps

1) Sequence reads are mapped to the genome.

2) Based on the density distribution of the reads candidate peaks are called on the both strands of the DNA.

3) Extent of the Shift between the forward and reverse strand peaks on each side of the potential binding site is estimated.

4) Combine the density distribution on the two strands.

5) Peaks with significant differences to the background library are called.

6) False discovery rate is estimated in order to reduce the number of biologically irrelevant or statistically not significant peaks.

7) Run QuEST

8) Estimation of the number of potentially missed sites is done by saturation analysis.

9) The called peaks are displayed.

Mapping ChIP-seq Reads to the Genome

ChIP- seq produces several million sequencing reads (tags) of varying length. Number of tools can be used to map these sequencing reads onto the genome. The list of tools include Bowtie, MAQ, Eland, SSAKE, SHARCGS, Corona. Inputs to QuEST are the coordinates of the genome and strand of the sequencing reads. For every genomic position i, the number of high-quality forward reads C+(i) and reverse reads C-(i) is recorded.

Kernel Density Estimation

Biologically significant functional binding sites is generally indicated by loci significantly enriched ChIP-seq reads. This enrichment is computed in QuEST as kernel density. QuEST employs a non-parametric method for computing smooth estimates over noisy observation.

First, we calculate strand specific smoothed density functions H+(i) from C+(i) and H-(i) for C-(i) at nucleotide position i in the genome for the forward strand and analogously for the reverse sequencing reads where

is the kernel density function and h is the kernel density bandwidth.

h is the kernel density bandwidth, is the number of base pairs considered. It is user selectable.  The kernel density estimator is a weighted moving average of the number of sequencing reads where K (j−i/h ) denotes the weight. The normal kernel is selected here for its computational efficiency. With increasing distance of  j from i, the weight K (j−i/h) decreases for C+(j) or C−(j). The bandwidth h is adjustable and the 30 bp default is recommended for the binding sites of GABPα, which has a 10–11 bp footprint on the DNA, with low information content in the last four positions. The optimal selection of bandwidth depends on the footprint width, the experimental characteristics, and the presence of co-binding proteins, and usually determined by trial and error.

Estimating the Peak Shift

During amplification and sequencing polymerase applied attaches to the 5′ termini of the sample DNA segments. When moving towards the 3′ end, the polymerase dissociates from the DNA with a sharply increasing frequency. Therefore reads are over represented at the 5′ ends on both strands of ChIP-enriched DNA fragments. Reads from the two strands, form two peaks, one at each side of the binding site. QuEST estimates peak shift, the distance between the peaks on the forward and the reverse strands. For shift estimation, only twin peaks with high confidence are selected. For each fixed length (default: 300 bp) sliding window r, the highest local maximum of forward reads is Mr+ and of the reverse reads is Mr− ; the second highest local maximum of forward reads is Nr+ and of reverse reads is Nr−. Window r is selected for peak shift calculation if it satisfies the following three conditions:

  1. Window r is covered by more than t reads (default: 600). This condition ensures robust estimates of local maxima.
  2. Mr+ > 20Nr+ and Mr− > 20Nr−. If the highest local maximum is much greater than the second highest local maximum, then the highest local maximum is more likely to be a real peak instead of some random spike.
  3. Mr+ > 20cr+ and Mr− > 20cr− , where cr+ and cr− are the local maxima of the same window in the pseudo-ChIP library. This condition ensures that the peaks safely exceed the background level.

Let S denote the set of all selected windows. For every window r ∈ S, we compute the dr distance between the highest peaks Mr+ and Mr− . The peak shift λ is estimated by where M is the number of windows in S.

Combining Strand Densities

The above estimate for the peak shift allows us to calculate the combined densities of forward and reverse reads for both ChIPseq and control library:

H(i) = H+(i − λ) + H−(i + λ).

The combined density is the basis for peak calling below.

Peak Calling

Windows of high concentration of sequencing reads at a locus on the genome is called peaks. These peaks may represent TFBSs. Scanning the genome using narrow sliding windows (default: 21 bp) in local maxima of the combined density gives the candidate peaks. Let p1, . . . pB denote the positions of the candidate peaks; and let c1, . . . , cB denote the corresponding density in the control library. To facilitate conservative binding site predictions, a candidate at position pi will be called if and only if it satisfies all of the following criteria:

H(pi) ≥ t, where t is a user-specified threshold (default: 30) to control the false discovery rate. By definition, the false discovery rate is the (estimated) frequency of false positives with a score equal to t or higher. Increasing t decreases the false discovery rate.

Background test. Either ci ≤ τ or H(pi)/ci > r, where ci is the background density at peak i and τ is the general background threshold, and r is a user-specified “rescue” ratio, which, by default, is set to 10.

To ensure clear separation (“valley”) between neighboring peaks, a minimum of 10% drop in H(j) read density is required.

0.9 * min{H(pi−1),H(pi)} ≥ max{H(j)|pi−1 < j < pi}

and

0.9 * min{H(pi),H(pi+1)} ≥ max{H(j)|pi < j < pi+1 }.

The selection of parameter values can be arbitary.

False Discovery Rate

The proportion of erroneously called peaks that are either not binding sites or binding sites for proteins other than the TF of interest is called false discovery rate. These peaks are called since they satisfy all the three conditions above and score above the threshold value. Conditions and thresholds selected to strike a delicate balance of maximizing true positive and minimizing false positive predictions. Since there is no reference where everyone of the genomic position is reliably characterized as a binder or as a non-binder, false discovery rate is approximated by control libraries. These control libraries are created by reversing the cross-links and performing no ChIP. The library with no IP is randomly split into a pseudo-ChIP-seq library and a background set. Splitting the pseudo-ChIp library only increases the accuracy of the false-positive estimation.  It is done only if a satisfactory number of pseudo-ChIP reads are available. For compatibility the number of reads in the pseudo-ChIP-seq library must match the number of reads in the real ChIP-seq library. The peak calling procedure is performed by comparing the pseudo-ChIP-seq library to the background set.

Any pseudo-peak called in this comparison is considered false. Then peaks are called for real ChIP-seq library using the same background set.

Running QuEST on ChIP-Enriched Sequencing Reads

Sequencing reads obtained by the ChIP-seq experiments of the GABPα binding sites in human Jurkat lymphoblastoma cells were aligned to the human genome (Version hg18). Peaks are called and evaluated.

Peak Saturation

The number of peaks when using a two or more sequencing lanes. The number of peaks missed is given by saturation curves. Saturation curve where the number of peaks is a function of the number of reads. Saturation analysis can be performed by randomly selecting subsets of varying size from the original data and calculating the number of peaks for each subset as in previous steps.