Selection of Protein Binding Sites from Random Nucleic Acid Sequences


A variety of methods have been developed to identify the DNA or RNA sequences that serve as protein binding sites. These binding sites can provide valuable information about protein function. The methods have found application in oncology, particularly in identifying proteins associated with oncogenesis. Each strategy used to find protein binding sites can yield different information about the binding site, but every strategy includes a PCR amplification step. Amplifying the selected sequences makes it possible to perform multiple rounds of selection before sequencing. The approach is commonly called the selection and amplification of binding sites (SAAB) technique.

Generalized procedure for SAAB

A typical SAAB procedure consists of the following steps:

(i) Template synthesis: A template is synthesized with a random nucleotide sequence at the binding site. Three situations arise: (a) both the binding site and the protein are known, (b) only the binding site is known, or (c) the binding protein is known and the binding site sequence has to be found. In the last case, the number of random nucleotide positions should be large.

(ii) Hybridization: The labeled double-stranded template is incubated with the protein. Usually, the protein is produced in vivo using fusion technologies. For a nonspecific binding site, a longer incubation period and a larger quantity of protein maximize the binding efficiency.

(iii) EMSA: An electrophoretic mobility shift assay (EMSA) is used to isolate the protein-bound DNA. The protein-DNA complexes are first separated by gel electrophoresis (on a non-denaturing polyacrylamide gel in the protocol described later) and then located by autoradiography.

(iv) Amplification: The bound DNA is recovered by gel excision, purified, and amplified by PCR. The amplified starting template acts as a positive control for the PCR.

(v) Labeling: The amplified binding sites are labeled and reselected for binding by EMSA. The binding step is repeated at least five times with the amplified, labeled DNA sample and the fusion protein.

(vi) Sequencing: After the final round of selection and electrophoresis, the DNA is cloned into a cloning vector. The cloned DNA is then sequenced by pyrosequencing or another technique.

This basic strategy is used to select binding sites in vitro and can be applied to genomic DNA fragments and RNA sequences. For RNA sequence libraries, the RNA is first reverse-transcribed into cDNA, which after the amplification step is transcribed back into RNA for the next selection. Because the protocol involves multiple rounds of selection, it can be used when the target sequences are rare or when only a very small amount of protein is available, provided that the affinity of the sequence-specific interaction exceeds that of nonspecific binding.
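The effect of repeated selection rounds on a rare specific sequence can be illustrated with a simple enrichment model. This is only a sketch: it assumes that in every round a fixed fraction of specific molecules and a smaller fixed fraction of nonspecific molecules are recovered in the bound band, and that PCR amplifies both equally. The function name and the numerical values are hypothetical and are not taken from the protocol.

```python
def enrich(fraction_specific, bound_specific, bound_nonspecific, rounds):
    """Return the specific-site fraction of the pool before and after each round.

    fraction_specific: starting fraction of molecules carrying a specific site
    bound_specific:    fraction of specific molecules recovered per round (assumed)
    bound_nonspecific: fraction of nonspecific molecules recovered per round (assumed)
    """
    history = [fraction_specific]
    f = fraction_specific
    for _ in range(rounds):
        kept_specific = f * bound_specific
        kept_nonspecific = (1.0 - f) * bound_nonspecific
        # After amplification the pool is renormalized to the recovered molecules.
        f = kept_specific / (kept_specific + kept_nonspecific)
        history.append(f)
    return history


# Hypothetical example: one specific site per million molecules,
# 50% of specific and 1% of nonspecific molecules recovered per round.
history = enrich(1e-6, 0.5, 0.01, rounds=5)
for round_number, fraction in enumerate(history):
    print(f"round {round_number}: specific fraction = {fraction:.3g}")
```

Under these assumptions the odds of a molecule being specific grow by the ratio of the two recovery fractions in each round, which is why selection only works when sequence-specific binding exceeds nonspecific binding.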

With an in vitro selection scheme, it is possible in principle to find the binding sites of any protein among any DNA or RNA sequences. The approach is applicable when the in vivo targets of the protein are not known. It is also used to identify sequences that bind multiprotein complexes present in a cell extract, using an antibody or affinity column that binds at least one of the proteins in the complex. It can be applied to study the DNA recognition properties of different members of a protein family, all of which bind related DNA sequences; this application relies on the sensitivity of the in vitro selection experiment. The specificity of a particular binding sequence for a particular protein can also be determined by this method. This is done with a pool sequence assay, in which only part of the oligonucleotide sequence used in the first step is known and the selected molecules are sequenced together as a pool. Pool sequencing is also applied to determine how a mutation affects binding between a particular protein and DNA sequence.

Binding Site Selection in Vitro

Preparation of Random Sequence Library

An oligonucleotide is synthesized with a random sequence containing equal amounts of A, G, C, and T. The random sequence is flanked by fixed sequences, which serve as primer sites for PCR amplification. Each PCR primer should generally be at least 16 nucleotides long. The fixed sequences should contain restriction enzyme sites that permit cloning of individual selected molecules. These sites are placed near the ends to allow PCR amplification after subcloning. The length of the random sequence depends on the particular experiment. In selections involving multiprotein complexes, the random sequence between two binding sites should be more than 30 bp long, which allows recognition of multiple sites on the DNA. When different library templates are used, and in pool sequencing, the danger of cross-contamination is high. In these cases, additional fixed bases are added as a tag that identifies the library from which a molecule is derived.
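The layout of such a library template can be sketched in a few lines of code. The flank sequences, the tag, and the function names below are hypothetical and chosen only for illustration; the flanks are 20 nucleotides long and carry an EcoRI (GAATTC) or BamHI (GGATCC) site, and "N" marks a random position.

```python
import random

# Hypothetical fixed flanks (primer sites), each containing a restriction site.
LEFT_FLANK = "CTGTCGGAATTCGCTGACGT"   # contains GAATTC (EcoRI)
RIGHT_FLANK = "ACGTCAGCGGATCCGTGCAG"  # contains GGATCC (BamHI)
MIN_PRIMER_LENGTH = 16


def make_library_template(n_random, left_flank, right_flank, tag=""):
    """Return a template string: left flank, optional library tag, N positions, right flank."""
    for flank in (left_flank, right_flank):
        if len(flank) < MIN_PRIMER_LENGTH:
            raise ValueError("fixed flank is shorter than the minimum primer length")
    return f"{left_flank}{tag}{'N' * n_random}{right_flank}"


def sample_member(template, rng):
    """Return one library member with every N replaced by a random base."""
    return "".join(rng.choice("ACGT") if base == "N" else base for base in template)


template = make_library_template(30, LEFT_FLANK, RIGHT_FLANK, tag="AT")
member = sample_member(template, random.Random(0))
print(template)
print(member)
```

The optional tag plays the role of the additional fixed bases described above: two libraries built with different tags can be told apart after sequencing.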

A double-stranded library can be produced by PCR amplification, as is done for the DNA molecules isolated during selection. However, the Klenow fragment of DNA polymerase I can be used for large-scale synthesis. In this procedure, microgram amounts of the library oligonucleotide are annealed with a 10-fold excess of PCR primer, which is extended in a large-scale polymerization reaction to complete the second strand. The double-stranded library is then run on a 14% polyacrylamide gel, from which the full-length molecules are isolated by elution.

Separation of Bound and Free Nucleic Acid Sequences by EMSA

The protein of interest and the labeled random sequence library are incubated together, and the reaction mixture is loaded onto a non-denaturing polyacrylamide gel. Separation relies on the fact that protein-bound DNA migrates more slowly than free DNA. The bound complex is identified by autoradiography and excised from the gel. After elution, the selected DNA is amplified. The advantage of this protocol is that it allows visualization of each step: it makes visible the relative ratio of bound to free DNA and of specific to nonspecific binding, and so reveals whether selection of specific binding sites has occurred during the experiment.

When the initial DNA binding reaction is set up, enough library DNA should be added that all of the potential specific binding sites being selected are available to bind. In practice, however, it is recommended to use more than the optimal concentration of the library. If multiprotein complexes are involved, a large library template is required, but complete recognition of the template is not necessary, because the recognition element constitutes only a fraction of the total length.
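Whether a given amount of library can contain every possible binding site can be estimated by comparing the number of molecules in the reaction with the number of distinct random sequences. The sketch below assumes an average mass of 650 g/mol per base pair of double-stranded DNA (a common approximation); the library length and input mass are hypothetical examples.

```python
AVOGADRO = 6.022e23
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA (approximation)


def distinct_sequences(n_random):
    """Number of different sequences possible at n_random random positions."""
    return 4 ** n_random


def molecules_in_mass(mass_ng, length_bp):
    """Approximate number of double-stranded molecules in mass_ng of DNA."""
    moles = mass_ng * 1e-9 / (length_bp * AVG_BP_MASS)
    return moles * AVOGADRO


def coverage(mass_ng, length_bp, n_random):
    """Average number of copies of each possible sequence in the sample."""
    return molecules_in_mass(mass_ng, length_bp) / distinct_sequences(n_random)


# Hypothetical example: 100 ng of a 70-bp double-stranded library.
for n_random in (10, 16, 25):
    print(n_random, f"{coverage(100, 70, n_random):.3g} copies per sequence")
```

With many random positions the coverage drops below one copy per sequence, which is tolerable when the recognition element occupies only a fraction of the random region.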

The most practical approach is to include enough input DNA in each round that only a fraction of it is bound in the selection EMSA. When the bound fraction is small, the selection is driven by sequence-specific protein-DNA interactions relative to nonspecific binding. The appropriate amount of input protein can be estimated from the predicted binding affinity. For example, a high-affinity DNA binding protein that is expected to bind its cognate site with an affinity in the nanomolar range could be used to select binding sites at a concentration of 10 nM or less, with the DNA library added at a higher concentration.

However, when a protein must form a dimer or higher-order oligomer to bind DNA, the amount of protein to add cannot be predicted and should be determined empirically, because dimerization influences the affinity of the protein.
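A rough feel for how protein concentration and affinity determine the bound fraction comes from the single-site binding isotherm. The sketch below assumes a monomeric protein, equilibrium conditions, and a free protein concentration close to the total protein concentration; the dissociation constants are hypothetical, and the approximation becomes poor when the DNA library is in large excess over the protein.

```python
def fraction_bound(protein_nM, kd_nM):
    """Fraction of a given site occupied at equilibrium (single-site isotherm)."""
    return protein_nM / (protein_nM + kd_nM)


protein_nM = 10.0
specific = fraction_bound(protein_nM, kd_nM=1.0)        # assumed nanomolar-range site
nonspecific = fraction_bound(protein_nM, kd_nM=1000.0)  # assumed micromolar-range site
print(f"specific site occupancy:    {specific:.3f}")
print(f"nonspecific site occupancy: {nonspecific:.4f}")
```

This simple model does not cover proteins that must dimerize or oligomerize to bind DNA, for which the required protein amount has to be found empirically.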

For the first round of selection, the random sequence library is end-labeled by a kinase reaction. In general, the DNA binding reaction can be performed under standard conditions, such as 50 mM Na+ or K+, 20 mM HEPES, 1 mM dithiothreitol, 3 mM MgCl2, 1 mM EDTA, and 5% glycerol.

For proteins produced by overexpression in bacteria, the detergent NP-40 is added to 0.5% to inhibit aggregation. A nonspecific competitor such as poly(dI:dC) can be added to increase the stringency of selection, provided that it does not compete significantly with sequence-specific interactions by the protein complex of interest. Incubation at room temperature for 20 min prior to electrophoresis is usually sufficient to reach equilibrium.

The EMSA gel run protocol

Materials and Reagents

  1. Dithiothreitol (DTT)
  2. Poly-dIdC
  3. 32P-labeled probe
  4. Bovine serum albumin (BSA)
  5. 5x binding buffer

5x binding buffer composition

50 mM Tris HCl (pH 8.0), 750 mM KCl, 2.5 mM EDTA, 0.5% Triton X-100, 62.5% glycerol (v/v), 1 mM DTT

Recipe for 10 ml

0.5 ml of 1 M Tris HCl (pH 8.0), 3 ml of 2.5 M KCl, 50 μl of 0.5 M EDTA (pH 8.0), 50 μl of Triton X-100, 7.87 g of glycerol; add DTT fresh before use
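The 10-ml recipe follows from the stated composition by the usual dilution relation (final concentration × final volume = stock concentration × stock volume). The short check below reproduces the listed amounts; the glycerol density of 1.26 g/ml is an assumed approximate value, and the function name is illustrative.

```python
FINAL_VOLUME_ML = 10.0
GLYCEROL_DENSITY_G_PER_ML = 1.26  # approximate density, assumed


def stock_volume_ml(final_mM, stock_mM, final_volume_ml):
    """Volume of stock needed so that final_volume_ml has concentration final_mM."""
    return final_mM * final_volume_ml / stock_mM


tris_ml = stock_volume_ml(50, 1000, FINAL_VOLUME_ML)          # 1 M Tris HCl stock
kcl_ml = stock_volume_ml(750, 2500, FINAL_VOLUME_ML)          # 2.5 M KCl stock
edta_ul = stock_volume_ml(2.5, 500, FINAL_VOLUME_ML) * 1000   # 0.5 M EDTA stock
triton_ul = 0.5 / 100 * FINAL_VOLUME_ML * 1000                # 0.5% (v/v)
glycerol_g = 62.5 / 100 * FINAL_VOLUME_ML * GLYCEROL_DENSITY_G_PER_ML

print(f"Tris HCl: {tris_ml:.2f} ml, KCl: {kcl_ml:.2f} ml")
print(f"EDTA: {edta_ul:.0f} ul, Triton X-100: {triton_ul:.0f} ul")
print(f"glycerol: {glycerol_g:.2f} g")
```

Weighing the glycerol rather than pipetting it avoids the error caused by its viscosity, which is presumably why the recipe gives it in grams.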

10x TBE buffer (1 L)

106 g of Tris base, 55 g of boric acid, 40 ml of 0.5 M EDTA (pH 8.0)


Equipment

  1. Plates
  2. Spacers
  3. Clamps
  4. Saran wrap
  5. Whatman paper


Procedure

  1. Pour the polyacrylamide gel.
     a. Assemble plates, spacers, and clamps. Seal with 1% agarose to prevent leaks.
     b. Pour a 5% polyacrylamide gel.

Plate size                     Large       Medium
H2O                            78 ml       39 ml
10x TBE                        5 ml        2.5 ml
30% acrylamide stock (19:1)    16.6 ml     8.4 ml
10% APS                        1,000 μl    500 μl

Mix well while minimizing bubble formation. Add 100 μl (large gel) or 50 μl (medium gel) of TEMED. Mix, pour, and insert the combs. The gel takes about 10 min to polymerize. After polymerization, the gel can be wrapped in saran wrap and stored at 4 °C.
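The gel table can be checked by computing the final acrylamide percentage and TBE strength from the listed volumes. The sketch below copies the volumes from the table and the TEMED volumes from the text; the dictionary keys and the function name are illustrative.

```python
# Volumes in ml, taken from the gel table and the TEMED step.
RECIPES = {
    "large": {"water": 78.0, "tbe_10x": 5.0, "acrylamide_30pct": 16.6,
              "aps_10pct": 1.0, "temed": 0.1},
    "medium": {"water": 39.0, "tbe_10x": 2.5, "acrylamide_30pct": 8.4,
               "aps_10pct": 0.5, "temed": 0.05},
}


def final_concentrations(recipe):
    """Return total volume, final acrylamide percentage, and final TBE strength."""
    total_ml = sum(recipe.values())
    return {
        "total_ml": total_ml,
        "acrylamide_pct": 30.0 * recipe["acrylamide_30pct"] / total_ml,
        "tbe_x": 10.0 * recipe["tbe_10x"] / total_ml,
    }


for size, recipe in RECIPES.items():
    result = final_concentrations(recipe)
    print(f"{size}: {result['total_ml']:.2f} ml, "
          f"{result['acrylamide_pct']:.2f}% acrylamide, {result['tbe_x']:.2f}x TBE")
```

Both plate sizes come out at roughly 5% acrylamide in about 0.5x TBE, which matches the 0.5x TBE running buffer used for electrophoresis.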

  2. Prepare 5x binding buffer.
  3. Set up the binding reaction:

1 μl of poly-dIdC (1 μg/μl in TE)

2 μl of 5x binding buffer

1 μl of labeled probe

1 μl cold competitor – unlabeled DNA fragments containing the binding sequences (if needed)

0.1 μl 100x BSA

X μl nuclear extract (5 μg protein total)

Add H2O to 10 μl final volume

Incubate for 30 min at room temperature (RT). If a supershift is required, add the antibody and incubate for an additional 30 min at RT.
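Because the nuclear extract volume depends on its protein concentration, the water volume has to be worked out for each extract. The sketch below uses the component volumes listed for the binding reaction; the function name and the example extract concentrations are hypothetical, and the antibody volume for a supershift is not included.

```python
# Fixed component volumes in ul, from the binding reaction list.
FIXED_UL = {
    "poly-dIdC": 1.0,
    "5x binding buffer": 2.0,
    "labeled probe": 1.0,
    "100x BSA": 0.1,
}
COMPETITOR_UL = 1.0


def plan_binding_reaction(extract_ug_per_ul, protein_ug=5.0,
                          final_ul=10.0, competitor=False):
    """Return (extract volume, water volume) in ul for one binding reaction."""
    extract_ul = protein_ug / extract_ug_per_ul
    used_ul = sum(FIXED_UL.values()) + extract_ul
    if competitor:
        used_ul += COMPETITOR_UL
    water_ul = final_ul - used_ul
    if water_ul < 0:
        raise ValueError("extract is too dilute to fit in the final volume")
    return extract_ul, water_ul


extract_ul, water_ul = plan_binding_reaction(2.5)
print(f"extract: {extract_ul:.1f} ul, water: {water_ul:.1f} ul")
```

The fixed components occupy 4.1 μl (5.1 μl with cold competitor), so an extract needs to be at least about 1 μg/μl to deliver 5 μg of protein within the 10-μl reaction.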

  4. While the binding reaction is incubating, pre-run the polyacrylamide gel without any sample at 150 V for 30 min, using 0.5x TBE as the running buffer. Then run the samples on the polyacrylamide gel for ~2 h at 150 V.
  5. Dry the gel (optional).

Transfer gel to Whatman paper. Cover top of gel with saran wrap and dry at 80 °C in a vacuum dryer for 1-2 h.

  6. Expose the gel.

Place gel in the cassette with reflection screen. Add film and place in -80 °C freezer.

Recovery, Amplification, and Reselection of Bound Sequences

After the first round of EMSA, protein-DNA complexes that contain a population of specific binding sites may already be apparent. However, the autoradiogram can show three different patterns: multiple complexes, nonspecific interactions, or no obvious complexes at all. The general recommendation is therefore to run the first-round selection gel for a very short time (about 20 min under standard conditions), so that the free fraction of DNA runs only about 1 cm into the gel. The entire upper 0.75 cm of the lane, which includes all DNA that has migrated at reduced mobility relative to the free fraction, is then excised from the dried gel (including the paper backing) for recovery and amplification of the DNA within.

In the second selection round, this batch selection can be repeated along with, or instead of, a full-length EMSA gel. The procedure differentiates specific from nonspecific binding: specific binding is strong while nonspecific binding is weaker, which results in a gradient of mobility in the EMSA gel. As the selection progresses, specific complexes become increasingly abundant and can be easily identified in the EMSA gel. The DNA within these specific protein-DNA complexes is isolated from full-length gels by excising slices about 0.3 cm wide from the center two-thirds of the lane. DNA molecules migrating with low mobility are left behind by this procedure.

DNA is purified from these gel slices by incubation at 37 °C for 3-4 h in 0.5 ml of 0.5 M ammonium acetate, 10 mM MgCl2, 1 mM EDTA, and 0.1% SDS. An overnight incubation increases DNA recovery. After debris is spun out and 5 μg of tRNA carrier is added to the eluate, it is extracted twice each with phenol and chloroform/isoamyl alcohol (24:1 v/v) and precipitated with ethanol. The samples are then resuspended in 0.3 M sodium acetate and reprecipitated with ethanol. Approximately one-fifth of the resuspended sample is then amplified by 35 cycles of PCR in a 100-μl reaction. The PCR is performed under standard conditions following optimization of the Mg concentration, with extreme care taken to avoid cross-contamination of samples. These conditions yield about 100 ng of DNA product and allow recovery from as few as 50 cpm of an EMSA-isolated template. The products of these reactions are electrophoresed on a 14% polyacrylamide gel, which is then stained with ethidium bromide. The sample is excised from the wet gel, eluted, and purified as above.

For subsequent selection rounds, after amplification and purification, the selected DNA pools are labeled by incorporation during PCR. Approximately 5 ng (as estimated from the preparative gel) of the purified amplified template is labeled for one cycle in a 20-μl reaction that contains 30 μCi of [32P]dTTP, 50 μM each of dATP, dGTP, and dCTP, and 100 ng of each primer in the standard PCR reaction buffer. It is important to add a high amount of primers.

Analysis of Selected DNA Molecules

Sequence Analysis of Selected Sites

The selected DNA molecules are enriched in specific binding sequences. The restriction sites within the primer sequences are used to clone individual selected molecules, whose nucleotide sequences can then be determined. The presence of a sequence motif in a high percentage of the selected molecules indicates a potential binding site consensus. This criterion misses sites that conform only partially to the consensus, so it is essential to assay binding to individual selected sequences. These sequences are recovered from plasmid clones and labeled by PCR amplification. This analysis makes it possible to discriminate specific from nonspecific binding. The most accurate identification of specific binding sites is obtained by performing biochemical footprinting analyses of binding to selected DNA molecules, or by testing binding to individual synthesized oligonucleotides with sequences based on these molecules.
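Deriving a consensus from cloned, selected sequences can be sketched as a per-position base count. The example below assumes the sequences have already been aligned and trimmed to the same length; the four sequences, the threshold, and the function names are hypothetical illustrations, not data from a selection.

```python
from collections import Counter


def position_counts(sequences):
    """Return one Counter of bases per aligned position."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [Counter(column) for column in zip(*sequences)]


def consensus(sequences, min_fraction=0.75):
    """Return the most frequent base per position, or N below min_fraction."""
    result = []
    for counts in position_counts(sequences):
        base, count = counts.most_common(1)[0]
        result.append(base if count / len(sequences) >= min_fraction else "N")
    return "".join(result)


# Hypothetical aligned sites from four selected clones.
selected = ["CAGCTG", "CACCTG", "CAGCTG", "CAGGTG"]
print(consensus(selected))                    # CAGCTG
print(consensus(selected, min_fraction=0.9))  # CANNTG
```

Raising the threshold exposes the positions that vary between clones, which are the ones where binding to individual sequences most needs to be tested directly.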

Pool Sequence Assay

For in vitro selection experiments that are to be analyzed by pool sequencing, the first step, as in the normal procedure, is to synthesize a random sequence library template, except that a portion of the binding site is fixed in sequence. As a result, the protein complex binds at a defined location on the template DNA. A total template size of about 55 bp is optimal for the sequencing protocol.

After the selection process has been completed, the resulting sequence pool is amplified by PCR and purified as in each selection round, except that it is eluted from the gel overnight to increase the yield for sequencing. Sequencing efficiency can also be increased by expanding the size of the molecules that contain the selected sequence. This is done by performing the last amplification step with primers that each overhang the ends of the starting library in the 5′ direction.

For the sequencing protocol, one of the two original PCR primers is labeled with 32P by a kinase reaction, and unincorporated label is removed with a spin column. Because these primers are small, a two-stage spin is recommended to recover most of the labeled DNA.

Sequencing protocol

Labeled primer (10 ng) is mixed with approximately 5 ng of purified amplified template pool in a 12-μl sample, which includes 1 μl of Sequenase manganese (Mn) buffer (to increase the uniformity of band intensities) and 2 μl of 5× Sequenase buffer. This mixture is placed at 95 °C for 5 min, then allowed to cool at room temperature for 1 min, during which it is quick-spun to bring down condensation. The sample is placed on ice, and then 1 μl of 0.1 M dithiothreitol and 2 μl of diluted Sequenase enzyme (1:8 in ice-cold TE, pH 7.4) are added. A 3.5-μl aliquot of this mix is then added to 2.5 μl of each of the four Sequenase dGTP termination mixes (USB). After incubation at 45 °C for 4 min, these reactions are stopped by the addition of 4 μl of Sequenase stop solution. dITP termination mixes can be employed but do not give data of the same high quality. This protocol yields sequence up to about 80 bp from the 5′ end of the primer, at which point termination by dideoxynucleotide incorporation is virtually complete. A 1.5-μl aliquot of each reaction is run on a 14% sequencing gel containing 8 M urea in 1× TBE. Prior to autoradiography, the gel is fixed in 10% acetic acid, 10% methanol, but only after the unreacted primer (which is present in vast excess over incorporated product) has been cut away to prevent its diffusion.