DNA Patterns

Introduction

DNA patterns are graphical representations of DNA or RNA sequences. The technique can be applied to functional structures such as promoters and genes, as well as to larger structures such as bacterial or viral genomes.

The application of DNA patterns to promoters has led to two major observations:

1) There are 10 possible classes of eukaryotic promoters.

2) Applied to diabetes, DNA patterns led to the identification of two phenotypes of the disease.

Background

Promoters are thought to have guided evolution for more than four million years. They are a main factor in integrating different mutations under different sets of environmental conditions. Promoters are located upstream of the transcription start site (TSS) and are important for regulating gene expression in complex genomes. The promoter region consists of the core promoter and regulatory domains. It contains different promoter elements such as the TATA box, GC-box, CCAAT-box, BRE and INR box. These elements help to recognize promoters across the whole genome, so cataloguing them is important for finding promoter regions.

However, classification based on these promoter elements has proven difficult and disadvantageous with respect to the functional correlations between promoter sequences. Moreover, from an evolutionary perspective, non-coding regulatory regions change their nucleotide order more frequently, which makes binding sites very small and unstable. Promoter regions have been classified by a variety of methods, based on motif sequences and on other structural parameters such as DNA curvature, bendability, stability, nucleosome positioning, or comparison of various DNA sequences. In vertebrates, two major promoter classes have been identified and used, the TATA and CpG types; in mammals, promoters are further subclassified into TATA box–enriched and CpG-rich promoters.

To understand possible interactions between different biological processes, an alternative approach based on an overall correlation between DNA sequence features among promoter regions can be used. Kappa Index of Coincidence (Kappa IC) and (C+G)% values can also be used to classify promoter sequences. In this type of classification, the shape and density of the promoter patterns are taken as the main parameters.

In the following application, the structural properties of these patterns and the correlations between promoter sequences of several species are studied. Bioinformatics tools have led to more accurate recognition and extraction of promoter sequences. The Eukaryotic Promoter Database and the PlantProm Database are among the best examples of how experimentally determined TSSs have been compiled on a genome-wide scale. A total of 20,586 promoter sequences from Arabidopsis thaliana, Drosophila melanogaster, Homo sapiens and Oryza sativa are studied in the methodology described below. Differences in the expression of promoter regions across tissues are also represented.

Methodology

Promoter datasets

Eukaryotic promoters whose transcription start sites have been determined experimentally are described in the Eukaryotic Promoter Database and the PlantProm Database, and these collections were tested. The dataset comprises 20,586 gene promoters from the Eukaryotic Promoter Database (6,649 from Oryza sativa, 1,922 from Drosophila melanogaster and 8,512 from Homo sapiens) and the PlantProm Database (3,503 from Arabidopsis thaliana). The analysis focuses on the regions flanking the putative TSS.

From the Eukaryotic Promoter Database, promoter segments ranging from -499 bp to +100 bp relative to the TSS were extracted. From the PlantProm Database, promoter segments ranging from 200 bp upstream to 51 bp downstream of the TSS were used.
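The extraction of a segment around the TSS can be sketched as a simple slice. This is an illustration only: the function name `extract_promoter_segment`, the 0-based `tss_index` and the rule that the downstream count starts at the TSS nucleotide itself are assumptions, since coordinate conventions differ between databases.

```python
def extract_promoter_segment(sequence, tss_index, upstream, downstream):
    """Return `upstream` bases before the TSS plus `downstream` bases
    starting at the TSS nucleotide (0-based index; assumed convention)."""
    start = tss_index - upstream
    end = tss_index + downstream
    if start < 0 or end > len(sequence):
        raise ValueError("requested segment extends past the sequence ends")
    return sequence[start:end]


# Toy example: 1,000 nt of placeholder sequence with the TSS at index 500.
toy = "ACGT" * 250
segment = extract_promoter_segment(toy, tss_index=500, upstream=200, downstream=51)
print(len(segment))  # 251
```

With 200 upstream and 51 downstream positions, the slice is 251 nt long under this convention.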

Tissue-specific datasets

An available list of 6,534 tissue-specific gene names (under Tissue-Specific Genes based on Expressed Sequence Tags (ESTs)) from the TiGER database was used. The promoters of these tissue-specific genes were retrieved from the Eukaryotic Promoter Database; 2,369 promoters were found in total. The corresponding 2,369 promoter patterns were generated and sorted according to their proportion in each tissue.

Promoter patterns

A Visual Basic program called PromKappa (Promoter analysis by Kappa) was developed for promoter analysis, and a second program called PromNN (Promoter analysis by Neural Network) was developed for sorting promoter patterns. Promoter patterns were generated by PromKappa. A sliding-window approach was used to extract two types of values: Kappa IC and (C+G)%. A sliding window with a step of 1 and a window size of 30 nt made it possible to detail the structure of known promoters. Kappa IC values were plotted against (C+G)% values, forming a recognizable pattern composed of clusters of various sizes along the Y-axis. The X-coordinate of each point is a (C+G)% value and the Y-coordinate is the corresponding Kappa IC value. As expected, a large window size produced smooth promoter patterns, whereas a small window size produced sharp, distinguishable promoter characteristics that make categorization easier.
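The sliding-window step can be sketched in Python as follows. This is a minimal illustration, not the PromKappa source (which was written in Visual Basic); the function name `sliding_windows` and the placeholder sequence are assumptions made for the example.

```python
def sliding_windows(sequence, window_size=30, step=1):
    """Yield (start_index, window) for every full window along the sequence."""
    for start in range(0, len(sequence) - window_size + 1, step):
        yield start, sequence[start:start + window_size]


# Placeholder 600-nt sequence standing in for one promoter.
promoter = "ACGT" * 150
windows = list(sliding_windows(promoter))
print(len(windows))   # 571 windows of 30 nt with a step of 1
print(windows[0][1])  # first window
```

Each window would contribute one point to the pattern: its (C+G)% value as the X-coordinate and its Kappa IC value as the Y-coordinate.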

Promoter analysis

Three types of analysis were conducted. In the first analysis, a graph representing a promoter pattern was generated for each promoter sequence. In total, 20,586 graphs were generated. These graphs were sorted by their shape and density using a neural network.

In the second analysis, the center of each pattern was plotted on a graph designed to show the distribution of promoters for each species. A color scheme was used to highlight the denser surfaces. Red areas represent clusters of similar promoters, while blue areas represent unique or rare promoters.
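How the center of a pattern is computed is not stated here. As a hedged sketch, assuming the center is the arithmetic mean of the pattern's points (the function name `pattern_center` and the sample points are hypothetical):

```python
def pattern_center(points):
    """Return the mean (x, y) of a pattern given as (cg_percent, kappa_ic) pairs.

    Assumption: "center" means the centroid of the plotted points.
    """
    if not points:
        raise ValueError("a pattern needs at least one point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Three made-up points of one pattern.
print(pattern_center([(40.0, 30.0), (50.0, 35.0), (60.0, 40.0)]))  # (50.0, 35.0)
```

Plotting one such center per promoter would give a per-species distribution whose dense regions correspond to similar promoters.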

In the third analysis, the specificity of each promoter class across thirty tissues was measured using the 2,369 tissue-specific promoters.

Pattern recognition and sorting

Promoter sequences were divided into ten classes based on the largest numbers of appearances (≥100) of similar promoter patterns. Machine learning methods were used to determine the biological characteristics of promoter sequences. All patterns were analyzed and sorted by PromNN, a pattern-recognition program using 93,264 artificial neurons and a single-layer perceptron. It can learn patterns and classify them into specified classes. The neural network was trained by supervised learning on 200 input patterns. PromNN recognized ten promoter classes and reported the match score and match percentage for each promoter pattern.
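For readers unfamiliar with the technique, supervised training of a single-layer perceptron can be sketched as below. This is a generic textbook perceptron on toy data, not the PromNN implementation; the function names, the four-value input vectors and the two toy classes are all assumptions made for illustration.

```python
def predict(weights, x):
    """Return the index of the class whose linear score is highest."""
    scores = [sum(w * v for w, v in zip(row[:-1], x)) + row[-1] for row in weights]
    return max(range(len(scores)), key=lambda k: scores[k])


def train_perceptron(samples, labels, n_classes, epochs=10):
    """Multiclass perceptron rule: on a mistake, move weights toward the
    correct class and away from the wrongly predicted one."""
    n_features = len(samples[0])
    # One weight row per class; the last entry of each row is the bias.
    weights = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            guess = predict(weights, x)
            if guess != y:
                for j, v in enumerate(x):
                    weights[y][j] += v
                    weights[guess][j] -= v
                weights[y][-1] += 1.0
                weights[guess][-1] -= 1.0
    return weights


# Toy "patterns": density on the left (class 0) or on the right (class 1).
samples = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = [0, 0, 1, 1]
weights = train_perceptron(samples, labels, n_classes=2)
print(predict(weights, [1, 1, 1, 0]))  # 0 (left-heavy)
print(predict(weights, [0, 1, 1, 1]))  # 1 (right-heavy)
```

A real pattern sorter would take a rasterized pattern image as input and use one output per promoter class, but the update rule has the same form.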

Cytosine and guanine content

C+G values from each sliding window were extracted taking into account the nucleotide frequencies of the entire promoter sequence. In the first stage, the (C+G)% content of the entire promoter sequence was determined with this formula:

$$CG_{TOT} = \frac{100}{(A+T+C+G)_{TOT}} \times (C+G)_{TOT}$$

Where “TOT” (total) designates the promoter sequence.

$CG_{TOT}$ represents the percentage of cytosine and guanine of the entire promoter.

$(A+T+C+G)_{TOT}$ represents the sum of occurrences of A, T, C and G.

$(C+G)_{TOT}$ represents the sum of occurrences of C and G.

In the next stage, the value of $CG_{TOT}$ was used to calculate the (C+G)% content of the sliding window (SW):

$$CG_{SW} = \frac{CG_{TOT}}{(A+T+C+G)_{SW}} \times (C+G)_{SW}$$

Where $CG_{SW}$ represents the percentage of cytosine and guanine in the sliding window.

In this stage, the $CG_{SW}$ value is relative to $CG_{TOT}$.

$(A+T+C+G)_{SW}$ represents the sum of occurrences of A, T, C and G in the sliding window sequence.

$(C+G)_{SW}$ represents the sum of C and G occurrences in the sliding window sequence.
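The two stages described in this section can be sketched in Python. The whole-sequence value is the ordinary C+G percentage. The window value assumes that the window's C+G fraction is scaled by CG_TOT instead of 100, which is one reading of "relative to CGTOT". The function names and the toy sequence are hypothetical.

```python
def count_bases(sequence):
    """Count A, T, C and G occurrences (case-insensitive); other symbols are ignored."""
    seq = sequence.upper()
    return {base: seq.count(base) for base in "ATCG"}


def cg_total_percent(promoter):
    """Stage 1: (C+G)% of the entire promoter sequence."""
    counts = count_bases(promoter)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, T, C or G")
    return 100.0 / total * (counts["C"] + counts["G"])


def cg_window_percent(window, cg_tot):
    """Stage 2 (assumed form): window C+G fraction scaled by CG_TOT."""
    counts = count_bases(window)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("window contains no A, T, C or G")
    return cg_tot / total * (counts["C"] + counts["G"])


promoter = "ATGCGCGTAT"                    # toy sequence: 5 of 10 bases are C or G
cg_tot = cg_total_percent(promoter)
print(cg_tot)                              # 50.0
print(cg_window_percent("GCGC", cg_tot))   # 50.0
print(cg_window_percent("ATGC", cg_tot))   # 25.0
```

Under this reading, a window made only of C and G reaches CG_TOT rather than 100, so window values are expressed on a scale set by the whole promoter.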

Kappa Index of Coincidence

The index of coincidence is based on letter frequency distributions and has been used in cryptanalysis for the analysis of natural-language plaintext. The Kappa Index of Coincidence is a form of the index of coincidence used for matching two text strings.

Here, Kappa IC is used to calculate the level of “randomization” of a DNA sequence. Extracting Kappa IC and C+G content from a sliding window makes it possible to measure localized values along each promoter sequence. Kappa IC is sensitive to various degrees of sequence organization, such as simple sequence repeats (SSRs) or short tandem repeats (STRs). In the Kappa IC formula, sequences A and B have the same length N, and the sum S is incremented by 1 only when nucleotide A[i] of sequence A matches the corresponding nucleotide B[i] of sequence B.
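The matching rule can be sketched as follows. Counting position-wise matches follows the description above; expressing the count as a percentage of N is an assumption made for this example, since the normalization is not given in the text. The function names are hypothetical.

```python
def kappa_match_count(seq_a, seq_b):
    """Count positions i where seq_a[i] equals seq_b[i] (the sum S)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences A and B must have the same length N")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


def kappa_ic_percent(seq_a, seq_b):
    """Matches as a percentage of N (assumed normalization)."""
    n = len(seq_a)
    if n == 0:
        raise ValueError("sequences must not be empty")
    return 100.0 * kappa_match_count(seq_a, seq_b) / n


print(kappa_match_count("ACGTACGT", "ACGAACGA"))  # 6
print(kappa_ic_percent("ACGTACGT", "ACGAACGA"))   # 75.0
```

To score a single window, B could for instance be a shifted copy of A, so that repeats such as SSRs raise the match count; that choice is likewise an assumption.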

Results

In the analysis of promoter patterns, priority was first given to repeated patterns. Second, the species specificity of certain patterns, together with their distribution and evolutionary implications, was investigated. Third, the distribution of these promoter classes among human tissues was examined.

Promoter classification

Although promoter sequences are poorly conserved between species, they exhibit similar patterns. Each pattern is composed of vertically aligned clusters of Kappa IC (y-axis) and (C+G)% (x-axis) values. The vertical positions of these clusters form a promoter pattern with a specific form for each promoter sequence. Although the overall shape and density seem to be conserved within each class of promoters, patterns do differ in finer details. This may indicate a further organization of promoter classes into several subclasses. The shape of a pattern is explained by the presence of different structures such as simple sequence repeats (SSRs) or short tandem repeats (STRs). Among these structures, we found an interesting distribution of short and long homopolymer tracts or di- and trinucleotide formations, many of which are consistent.

1) AT-based promoters. AT-based representative patterns are distinguished by high (A + T)% and Kappa IC values.  The left side of the pattern is predominant, while the right side is significantly less pronounced.

2) CG-based promoters. These promoters are represented by patterns containing a high percentage of C+G and high Kappa IC values. CG-based promoters show a high CpG content. The right side of the pattern is predominant.

3) ATCG-compact promoters. ATCG-compact patterns characterize promoters with centrally disposed clusters, which lead to the formation of a round-shaped pattern.

4) ATCG-balanced promoters. Promoter sequences belonging to the ATCG-balanced class show an almost balanced C+G and A+T content. The right and the left side of the pattern tend to share a relative 2-fold rotational symmetry.

5) ATCG-middle promoters. ATCG-middle patterns are characterized mainly by promoter sequences containing balanced A+T and C+G values and higher than average Kappa IC values. The right side and the left side of the pattern are equally distributed. However, the central part is pronounced.

6) ATCG-less promoters. Promoters from this class are represented by an abrupt transition between two C+G threshold levels. The right side and the left side of the pattern are equally distributed; however, some sequences around the central region are missing or have a lower density.

7) AT-less promoters. Promoter sequences belonging to the AT-less class exhibit a high frequency of short CG-rich sequences. Although both sides of the pattern show a relative 2-fold rotational symmetry, the clusters on the left side of the pattern exhibit a lower density than those on the right.

8) CG-less promoters. In contrast, CG-less promoters are distinguished by a high frequency of short AT-rich sequences. The right and left sides of the pattern tend to be equally distributed; however, the clusters on the right side of the pattern exhibit a lower density than those on the left.

9) AT-spike promoters. Promoter sequences belonging to AT-spike class are represented by long repetitive sequences with a high content of A or T nucleotides. These patterns exhibit a central part and an elongated left side containing small density clusters.

10) CG-spike promoters. In contrast to AT-spike promoter architecture, these promoters are represented by long repetitive sequences with a high content of C or G nucleotides. CG-spike patterns exhibit a central part and an elongated right side containing small density clusters.

Promoter distribution

Whereas phylogenetic relationships are mostly based on sequence alignment algorithms, the Kappa IC approach is based on a frequency/content comparison. A superposition of the promoter distributions from each species shows the shared surfaces, which represent conserved promoter sequences.

Transitional states

Different chromatin structures have been studied in TATA-less and TATA-containing promoters. Various mutational events and the distribution of point mutations are influenced by chromatin structure in the promoter sequence. This chromatin-dependent distribution of point mutations may lead to a gradual shift from one promoter class to another (i.e. by disruption of poly(dA:dT) or poly(dC:dG) tracts into shorter elements), changing the predisposition for low or high levels of gene expression. Promoter patterns “trapped” in transitional states between classes may also indicate a change in the relationship of their genes to other biological pathways.

Tissue-specificity in humans

Specific interactions in particular tissues, such as muscle and heart or kidney and liver, have been reported in clusters. Interaction groups both between promoter classes and within each promoter class have been studied. In addition to these groups, the tissue order within each class further reflects the significance of the observed interactions. The highlights of our observations include:

  1. CG-based promoters have the highest percentage of occurrence (37.59%) and appear to correspond to the TATA-less class, which tends to be associated with “housekeeping” genes.
  2. AT-based promoters (5.25%) are present in all tissues except the mammary gland. The six tissues in which AT-based promoters have the highest percentages are liver, heart, kidney, lymph node, soft tissue, and muscle.
  3. AT-less promoters (14.36%) are overrepresented in uterus, while CG-less and ATCG-balanced promoters are overrepresented in testis.
  4. CG-less promoters have an occurrence of 3.98% and are present in all tissues except the spleen.
  5. There was no clear correlation regarding tissue order between AT-less and CG-less promoters. Nevertheless, we noticed that some tissues have a tendency to stay grouped, such as muscle and heart, stomach and soft tissue, larynx and colon, lymph node and liver, or bone marrow and peripheral nervous system. These groups may suggest a role of these promoters in simple feedback mechanisms among tissues responsible for maintaining homeostasis.
  6. AT-spike promoters are found especially in tissues that require high levels of gene expression such as lung, eye, pancreas, uterus, liver, soft tissue, brain, kidney, prostate, and blood.
  7. CG-spike promoters also appear to be involved in survival mechanisms. These promoters are found in large numbers especially in tissues that need short-term critical gene expression. This is supported by the first seven tissues in which these promoters are most common, namely lung, eye, brain, peripheral nervous system, spleen, heart and blood, which also tend to have a high interaction with the environment.
  8. The proportions of CG-spike and AT-spike promoters seem to be similar in the first two tissues, namely in lung and eye.
  9. The percentages of occurrence of CG-based and AT-spike promoters appear to be relative to each other and nearly complementary in all tissues.
  10. The proportions of ATCG-compact and AT-less promoters seem to have similar values in kidney and lymph node, whereas ATCG-compact and AT-based promoters appear to have similar values in bladder, skin, and uterus.
  11. There was no clear correlation regarding the tissue order between ATCG-balanced and ATCG-compact promoters. However, ATCG-balanced and ATCG-compact promoters seem to have almost equal percentages in about 16 tissues.
  12. ATCG-less promoters are rare (0.03%) and are even more enigmatic since they are mainly represented in cervix and tongue.
  13. ATCG-middle promoters are present in only nine of the thirty tissues, namely soft tissue, eye, pancreas, liver, placenta, bladder, muscle, larynx and bone marrow.