Introduction to Promoter Analysis
Each cell contains a copy of the whole gene. But different cells in different tissues perform different functions. Each cell only expresses only a subset of genes. This subset is limited by various factors like specific tissues, developmental stages, physiological condition. The ability of the same set of the genome to perform differently is regulated by gene expression. At multiple points, this gene expression is controlled by chromatin modifications, transcription control, splicing, transport and translational control. The most common way of regulation is translational regulation. It is done during the transcription phase.
Regulation of Transcription
The coded DNA region responsible for the regulation is called promoter region. Each promoter contains several short DNA subsequences called binding sites (BS). These binding sites are bound by regulatory proteins called transcription factors (TF). These TFs combine to form transcriptional switches which respond to external or internal stimuli. TFs are proteins which bind to the promoter region and responsible for the control of gene expression /transcription. TFs bind to a specific site in the DNA. But they can be influenced by the presence of other protein molecules. Thus the forming the complex network required for the control of transcription.
Cis and Trans Regulation
If a DNA sequence that acts to change the expression of the gene adjacent to it is cis-acting and if it is at a distance then it is trans-acting. Promoter elements are cis-acting. Sequence controlling the expression of the TF itself is trans acting.
Regulation of Transcription
When TFs bind with the gene promoter it will either induce or repress the gene expression. The specific combination of BSs with the promoter determines the condition of the gene expression. Promoter analysis can be done by analyzing the expression levels of RNA. The assumption is that genes that have similar expression levels have similar transcriptional regulation control and common binding sites. This knowledge is used we have to find promoter regions on genome sequences. Various methods can be used to find binding sites.
The first step is to find the promoter region in the DNA sequence. Mostly the 5-end region of the gene is likely to overlap with the promoter region. A promoter is stretches of DNA located upstream of the Transcription start sites (TSS) of genes. The promoter region is important in regulation. Therefore we will try to locate TSS since promoters present before the TSS. If TSS is too short this approach will miss real BSs (false negative) and if TSS is too long we will hit a lot of wrong hits (false positive). SO the length of the TSS is very important in this approach. Mostly, TSS is specified specific. The common practice is to have TSS of length 500 -2000 bp. Analyzing both strands of the DNA is recommended and also avoid repeating sequences. Repeating sequences are inserted during course of evolution and it is not important in the transcription process.
Models for finding Binding Sites
Exact string model, string mismatches model, degenerate string model and, finally, position weight matrix are the various models used for finding the BSs.
Exact String model: The Exact String model will try to find an exact sequence in the DNA sequence
String Mismatches model: The String Mismatches model will try to find an almost the exact sequence and will tolerate a mistake in one of the positions.
Degenerate String model The Degenerate String model, also known as consensus model will try to find a sequence but allows various bases to be placed in specific positions of the sequence.
Position Weight Matrix model The Position Weight Matrix model, also known as
Position Specific Scoring Matrix model will create a matrix, where each column represents a position and each row represents a base and the value in the cell is the probability of the base to appear in the specified position. When scanning the target, we compute the total probability, while we assume that appearances of each base at any position are statistically independent.
Few other complex models have also been postulated such as PWM with spacers, Markov model (dependency between adjacent columns of PWM), Hybrid models,.
Chromatin Immunoprecipitation (ChIP)
Invivo identification of DNA elements occupied by DNA regulatory proteins can be done by this procedure. DNA and proteins in living cells are covalently cross-linked, the cells are lysed, and fragmentation of DNA is done by sonication. This protein-DNA complex is immunoprecipitated by antibodies of the binding protein. The DNA bound protein is purified during the cross-linking. This DNA region is then amplified and information about the sequence is obtained. It includes the following stages: Arresting using formaldehyde, shearing the desired proteins and then to replicate these part of the DNA by the PCR process.
Analyzing the entire genome allows protein-DNA interactions to be monitored across the entire genome. The procedure combines chromatin immunoprecipitation (ChIP) procedure with DNA microarray analysis.
Computational approaches to promoter analysis
We can divide the promoter analysis computational problem into three strategies:
- Given groups of co-regulated genes and known binding sites models (PWMs) find enriched Cis elements in the groups, for instance, using PRIMA algorithm.
- Given a set of binding site models (PWMs) find CRM (cis-regulatory-modules) which are sets of binding sites that tend to cluster together, for instance, using CREME algorithm.
- Given a set of co-regulated genes (from gene expression clustering) or putative targets of a TF (from chip-ChIP) build motif models that are enriched in the sets.
PRIMA (Promoter Integration in Microarray Analysis) is used to find TFs whose binding sites are enriched in a given set of promoters. PRIMA is used for the large-scale gene expression data analysis. Different biological conditions alter gene expression at different levels. This is measured by microarray analysis. However, the transcriptional networks that underlie the observed transcriptional modulations are not revealed directly by this method. The main aim of PRIMA is the identification of TFs that take part in these networks. It is based on an assumption is that genes that are co-expressed over multiple biological conditions are regulated by common TFs. This, in turn, leads to the assumption that they share common regulatory elements in their promoters. This algorithm is used in human genomic sequences and models for binding sites (BSs) of known TFs. PRIMA identifies TF whose BSs are significantly over-represented in a given set of promoters.
The algorithm: has inputs of a target set (e.g., a list of co-expressed genes found in a microarray experiment) and a background set (e.g., the 13K set) and PWMs of known TFs. The output of PRIMA algorithm is p-values of enriched TFs.
For each PWM:
Compute a threshold score for declaring hits of the PWM (hit = subsequence that is similar to the PWM = hypothetical BS)
Scan BG and target-set promoters for hits.
Compute enrichment score to decide whether the number of hits in the target-set is significantly higher than expected by chance, given the distribution of hits in the BG. (Synergism test: Find co-occurring pairs of TFs)
A threshold T(P) for the similarity score of the TFs PWM P is determined. This value is used to identify putative binding sites, or hits, of a TF. When similarity score of subsequences is above T(P), it is regarded as hits of P.
The threshold for each PWM is computed as follows:
Compute 2nd-order Markov-Model of BG seqs.
Generate random seqs using MM (e.g., 1,000 seqs of length 1,000 bp)
Set threshold s.t. PWM has f hits in the random sequences (e.g., f=100)
This threshold value ensures a pre-defined false-positives rate. Estimating false-negatives (positives) rate requires good positive (negative) training-sets. The enrichment score is computed as follows: Suppose each promoter has 0 or 1 hits.
Let: B = # of BG promoters
T = # of target-set promoters
b = # of hits in BG promoters
t = # of hits in target-set promoters
Then: Prob. for t hits in target-set:
Prob. for at least t hits:
In cases where the number of BSs is more in a genome. There is a chance that the that the promoter region has more than 1 hit. It is mostly done in cases when BSs are high in number to encourage transcription. It increases the possibility of getting a hit.
So, we will take into account up to 3 hits per promoter.
Let: B ,T = # of promoters in BG , target-set
b1 , b2 , b3 = # of BG promoters with 1,2,3 hits
t = total # of hits in target-set
Then: Prob. for at least t hits: (HG score)
Synergism score: Find pairs of TFs that tend to occur in the same promoters
Let: T = # of promoters in target-set
t1 , t2 = # of promoters with 1+ hits of TF 1,2
t12 = # of promoters with 1+ hits of both TFs (w/o overlaps!)
Then: Prob. for co-occurrence of at least t12:
PRIMA has been successfully used in partitioning the cell cycle-regulated genes according to their expression periodicity patterns. It is done in five clusters which correspond to different phases of the cell cycle. When the promoter sequences of these clusters were scanned for enriched PWMs, two PWMs were enriched in a specific phase cluster, but not in the 568 set as a whole.
CREME – Cis-Regulatory Module Explorer
Transcription factors whose binding sites are spatially clustered are called cis-regulatory modules. CREME is a web-server for identifying and visualizing cis-regulatory modules in the promoter regions of a given set of potentially co-regulated genes. CREME relies on knowledge of putative transcription factor binding sites (TFBS) that are annotated across the human genome using evolutionary conservation with the mouse and rat genomes. An efficient search algorithm is applied to this data set to identify combinations of transcription factors, whose binding sites tend to co-occur in close proximity within the promoter regions of the input gene set. These combinations are statistically evaluated, and significant combinations are reported and visualized.
Goal: Discover modules = groups of TFs whose BSs are abundant and tend to co-occur in close proximity in promoters of co-expressed genes. The main characteristics of these modules are: limited knowledge of TFs uses PWMs to model BSs, ignores an order of TFs within the module, does not take into account multiple hits per TF.
Module = Set of PWMs
r = # of PWMs in the module
Instance of a module = A set of hits, at least one per PWM in the module, that occur in a short interval in a promoter
w = length of interval
The inputs for this algorithm are promoter sequences of BG and target sets PWMs of known TFs, module parameters (r, w). The output of the algorithm is p-values of enriched modules.
Find enriched PWMs (p less than 0.01).
Filter similar PWMs (more than 50% overlapping hits).
Build a list of all (r,w)-modules that have instances in the target-set.
Compute Monte-Carlo enrichment score of each module (given enrichment of PWMs), and pass those with p less than 0.05.
Filter similar modules (more than 75% overlapping instances).
If we look closely at the third step of the algorithm, we shall see that if n = # of given PWMs then there are nr possible modules. Focus on the ones only those actually have instances in the target set.
Simplification: Search for modules with a consecutive instance = a promoter
interval that contains 1+ hits for each PWM in the module, and no hits for other PWMs
Finding modules with a consecutive instance in a promoter sequence using a hashing algorithm.:
Let M = list of all hits, ordered by position. We shall build a hash C of modules Copen = a hash of active modules and their starting positions.
The running time of the algorithm is O(r|M|) since Copen contains at most r modules.
Motif Finding Tools
Motif(l,d) = string M of length l that appears in many of the given promoters, each occurrence contains (exactly) d mismatches. For example, the string ”CATA” is a (4,1)-motif in AGGCCTAGGTG , GTAAACATGAAG , ACCAGAGAG.
Our Goal is given a set of t promoters, and l, d, find the (l,d)-motif(s) that appear in at least t of the promoters.
The main idea of the algorithm is: choose a projection h: 4l to 4k, hash each l-mer x in the input sequence to its bucket h(x). h(x) is constructed by choosing k (out of l) positions at random. Many instances of the motif are likely to fall into the same bucket = motif bucket, thus buckets with large count are likely to correspond to a motif.
The algorithm: Run m iterations:
Choose a random projection h.
Scan promoters using h and fill buckets.
For each bucket with a count larger than s, try to recover motif using an iterative refinement procedure.
An example of random projection, where l=5, d=1, k=3, motif: M=CATAG, h(x1x2x3x4x5 )=x1x2x5, The motif bucket is CAG. In the example, we can use any base for x3 and x4 and we look at all the sub-sequences that fall into the same bucket. And we find x3 and x4 according to the most frequent sub-sequences.
Analysis: Choosing of k and s can be very important since for larger k values we get more buckets but in every one of them there truer sub-sequence values. When k is small, we get fewer buckets, but in every one of them, there are more false positives.
Good values for k and s are as follows: k = l – d – 1 (to keep average bucket size small)
s = 2t(L − l + 1)/4k
where L = average promoter length.
The probability that a motif instance hashes to the motif bucket are:
since l-d known positions define a bucket.
The probability that fewer than s (out of t) motif instances hash to the motif bucket (in a single iteration):
The probability that s or more motif instances hash to the motif bucket in at least 1 (out of m) iteration:
Thus, the number of iterations required to ensure a certain success rate, p is:
Refinement procedure: Let: S = multiset of l-means that hashed to current bucket
fi = BG distribution of base i
A, W = 4 x l matrices
Initialize: Ai,j (# l-mers in S with base i at pos j) + fi
Repeat until convergence:
Reset A: Ai,j , fi.
Score all l-mers in promoters using W.
Add to A each l-mer with a positive score.
Compute W’ from A.
if(entropy(W’) < entropy(W)) gives (W <– W’)
Scan promoters using W, select best l-mer from each promoter (with positive score), and output their consensus.
MEME identifies likely motifs within the input set of sequences. You may specify a range of motif widths to target, as well as the number of unique motifs to search for. MEME uses Bayesian probability to incorporate prior knowledge of the similarities among amino acids into its predictions of likely motifs. The resulting motifs are output as profiles. A profile is a log-odds matrix used to judge how well an unknown sequence segment matches the motif. MEME is one of the most popular programs for motif finding. It uses the expectation maximization
(EM) approach: first obtain an initial motif, then iteratively obtain a better motif with the following two steps:
Expectation: compute the statistical composition of the current motif and find the probability of finding the site at each position in each sequence.
Maximization: These probabilities are used to update the statistical composition.
The Algorithm: Let: zi,j= prob. of BS at pos j in promoter i pb,c = prob. of base b at pos c in motif.
Choose starting p
Repeat until convergence of p:
Re-estimate z from p
Re-estimate p from z