Exomes are whole sequences that compose of all the exons (coding regions for proteins in a genome) and cover about 1 % to 2 % of the genome depending on species. Genome-wide association studies suggest that common genetic variants explain only a small fraction of heritable risk for common diseases, raising the question of whether rare variants account for a significant fraction of unexplained heritability. While DNA sequencing costs have fallen dramatically, they remain far from what is necessary for rare and novel variants to be routinely identified at a genome-wide scale in large cohorts. Second-generation methods have been developed for targeted sequencing of all protein-coding regions (`exomes’), to reduce costs while enriching for the discovery of highly penetrant variants. Here we report on the targeted capture and massively parallel sequencing of the exomes of twelve humans. These include eight HapMap individuals representing three populations, and four unrelated individuals with a rare dominantly inherited disorder, Freeman-Sheldon syndrome (FSS). We demonstrate the sensitive and specific identification of rare and common variants in over 300 megabases (Mb) of a coding sequence. Using FSS as a proof-of-concept, we show that candidate genes for monogenic disorders can be identified by exome sequencing of a small number of unrelated, affected individuals. This strategy may be extendable to diseases with more complex genetics through larger sample sizes and appropriate weighting of nonsynonymous variants by predicting the functional impact.
Steps involved in the Exome sequencing
DNA samples, targeted capture, and massively parallel sequencing
DNA samples were obtained from Coriell Repositories (HapMap) or by M.B. (FSS). Each shotgun library was hybridized to two Agilent 244K microarrays for target enrichment, followed by washing, elution, and additional amplification. The first array targeted CCDS (2007), while the second was designed against targets poorly captured by the first array plus updates to CCDS in 2008. All sequencing was performed on the genome analyzer.
Read mapping and variant analysis
Reads were mapped to the reference human genome (UCSC hg18), initially with ELAND software for quality recalibration, and then again with Maq13. Sequence calls were also performed by Maq, and filtered to coordinates with >= 8× coverage and a phred-like15 consensus quality >= 30. Sequence calls for HapMap individuals were compared against Illumina Human1M-Duo genotypes. NA18507 SNPs from whole genome data12 were obtained. Annotations of cSNPs were based on NCBI and UCSC databases, supplemented with PolyPhen Grid Gateway24 predictions for nonsynonymous SNPs.
Identification of Coding Indels Involves
- a) Gapped alignment of unmapped reads to the genome to generate a set of candidate indels using cross-match;
- b) Ungapped alignment of all reads to the reference and alternative alleles for all candidate indels using Maq;
- c) Filtering by coverage and allelic ratio.
Genomic DNA Samples
- Targeted capture was performed on genomic DNA from 8 HapMap individuals (4 Yoruba (NA18507, NA18517, NA19129, NA19240), 2 East Asians (NA18555, NA18956), 2 European-Americans (NA12156, NA12878), 4 European-American individuals affected by Freeman-Sheldon syndrome (FSS10066, FSS10208, FSS22194, FSS24895).
- Genomic DNA for HapMap individuals was obtained from Coriell Cell Repositories.
- Genomic DNA for Freeman-Sheldon syndrome individuals was obtained by M.B.
Oligonucleotides and adaptors
All oligonucleotides were synthesized by Integrated DNA Technologies (IDT) and resuspended in nuclease-free water to a stock concentration of 100 M. Double-stranded library adaptors SLXA_1 and SLXA_2 were prepared to a final concentration of 50 M by incubating equimolar amounts of SLXA_1_HI and SLXA_1_LO together and SLXA_2_HI and SLXA_2_LO together at 95°C for 3 mins and then leaving the adaptors to cool to room temperature in the heat block.
Shotgun library construction
Shotgun libraries were generated from 10 g of genomic DNA (gDNA) using protocols modified from the standard protocol. Each library provided sufficient material for hybridization to two microarrays. For each sample, gDNA in 300l 1× Tris-EDTA was first sonicated for 30min, then end-repaired for 45 mins in a 100 l reaction volume with using 1× End-It Buffer, 10 l dNTP mix, and 10 l ATP as supplied in the End-It DNA End-Repair Kit. The fragments were then A-tailed for 20 mins at 70°C in a 100l reaction volume with 1× PCR buffer, 1.5mM MgCl2, 1mM dATP and 5U AmpliTaq DNA polymerase. Next, library adaptors SLXA_1 and SLXA_2 were ligated to the A-tailed sample in a 90l reaction volume with 1× Quick Ligation Buffer with 5ml Quick T4 DNA Ligase and each adaptor in 10× molar excess of the sample. Samples were purified on purification columns after each of these four steps and DNA concentration determined on a Nanodrop-1000 when necessary.
Each sample was subsequently size selected for fragments of size 150–250bp using gel electrophoresis on a 6% TBE-polyacrylamide gel. A gel slice containing the fragments of interest was then excised and transferred to a siliconized 0.5ml microfuge tube with a 20G needle-punched hole in the bottom. This tube was placed in a 1.5ml siliconized microfuge tube and centrifuged at 13.2rpm for 5mins to create a gel slurry that was then resuspended in 200l 1× Tris-EDTA and incubated at 65°C for 2hrs, with periodic vortexing. This allowed for passive elution of DNA, and the aqueous phase was then separated from gel fragments by centrifugation through 0.2m Ultrafiltration columns and the DNA recovered using a standard ethanol precipitation.
Recovered DNA was resuspended in EB buffer (10mM Tris-Cl, pH8.5) and the entire volume used in a 1ml bulk PCR reaction volume with 1× Master Mix and 0.5M each of primers SLXA_FOR_AMP and SLXA_REV_AMP in the following conditions – 98°C for 30s; 20 cycles at 98°C for 30s, 65°C for 10s and 72°C for 30s; and finally 72°C for 5 min. PCR products were purified across 4 ultrafiltration columns and all the eluants pooled.
Design of exome capture arrays
All well-annotated protein coding regions as defined by the CCDS was the target. Coordinates were extracted from entries with “public” status, and regions with overlapping coordinates were merged. This resulted in a target with 164,007 discontiguous regions summing to 27,931,548 bp. By comparison, coding sequence defined by all of Reference Sequence comprises 31.9 Mb (14% larger). Hybridization probes against the target were designed primarily such that they were evenly spaced across each region. Probes were also constrained a) to be relatively unique, such that the average occurrence of each 15-mer in the probe sequence is less than 1008, b) to be between 20–60 bases in length, with preference for longer probes, and c) to have a calculated melting temperature (Tm) ≤ 69°C, with preference for higher Tms. Tm was calculated by 64.9 + 41 * (number of G+Cs − 16.4) / length of probe.
Two arrays were designed and used per individual. The first array was common to all individuals, and contained 241,071 probes designed mainly against the subset of the target that was also found in a previous version of the CCDS (CCDS20070227). For most exomes, the second array was custom-designed specifically for target regions that had not been adequately represented after capture on the first array and subsequent sequencing. For two individuals (FSS10066, FSS10208), the matching was to a different individual’s first-array data. However, this did not appear to significantly impact performance, likely because features capturing poorly on the first array largely did so consistently. Additionally, all of the second arrays also targeted sequences found in CCDS20080902 that were not in CCDS20070227 and hence not targeted by the first array. A subset of arrays used lacked control grids.
Targeted capture by hybridization to DNA microarrays
Hybridizations to the arrays were performed per manufacturer’s instructions with modifications. For each enrichment, a 520l hybridization solution containing 20g of the bulk amplified gDNA library, 1× aCGH Hybridization Buffer, 1× Blocking Agent, 50g Human CotI DNA and 0.92nmol each of the blocking oligos SLXA_FOR_AMP, SLXA_REV_AMP, SLXA_FOR_AMP_rev, SLXA_REV_AMP_rev was incubated at 95°C for 3 min and then at 37°C for at least 30mins. The hybridization solution was then loaded and the hybridization chamber assembled as per manufacturer’s instructions. Incubation was done at 65°C for at least 66hrs with rotation at 20rpm in a hybridization oven.
After hybridization, the slide-gasket sandwich was removed from the chamber and placed in a 50ml conical tube filled with aCGH Wash Buffer 1. The slide was separated from the gasket while in the buffer and then washed, first with fresh aCGH Wash Buffer 1 at room temperature for 10mins on an orbital shaker set on low speed, and then in pre-warmed a CGH Wash Buffer 2 at 37°C for 5mins. Both washes were also done in 50ml conical tubes.
A Secure-Seal was then affixed firmly over the active area of the washed slide and heated briefly according to manufacturer’s instructions. One port was sealed with a seal tab and the seal chamber completely filled with approximately 1ml of hot EB (95°C). The other port was sealed and the slide incubated at 95°C on a heat block. After 5min, one port was unsealed and the solution recovered. DNA was purified from the solution using a standard ethanol precipitation.
Precipitated DNA was resuspended in EB and the entire volume used in a 50l PCR volume comprising of 1× iTaq SYBR Green Supermix with passive dye and 0.2M each of primers SLXA_FOR_AMP and SLXA_REV_AMP. Thermal cycling was done in a Real-time PCR system with the following program: 95°C for 5min, then 30 cycles of 95°C for 30sec, 55°C for 2min, and 72°C for 2min. Each sample was monitored and extracted from the PCR machine when fluorescence began to plateau. Samples were then purified and sequenced.
All sequencing of post-enrichment shotgun libraries was carried out on an Genome Analyzer as single-end 76 bp reads, following the manufacturer’s protocols and using the standard sequencing primer. Image analysis and base-calling were performed by the Genome Analyzer with default parameters but no pre-filtering of reads by quality. Quality values were recalibrated by alignment to the reference human genome with the Eland module.
The reference human genome used in these analyses was UCSC assembly hg18 (NCBI build 36.1), including unordered sequence (chrN_random.fa) but not including alternate haplotypes. For each lane, reads with calibrated qualities were extracted from the Eland export output. Base qualities were rescaled and reads mapped to the human reference genome using Maq software . Unmapped reads were dumped using the −u option and subsequently used for indel mapping. Mapped reads that overlapped target regions (“target reads”) were used for all other analyses.
All possible 76-bp reads that overlapped the aggregate target were simulated, mapped using Maq and consensus called using maq assemble with parameters −q 1 −r 0.2 −t 0.9. Target coordinates that had read depth < 76 (i.e. half of the expected depth), reflecting poor mappability, were removed from consideration for downstream analyses, leaving a 26,553,795 bp target.
All reads with a map score > 0 from each individual were merged and filtered for duplicates such that only the read with the highest aggregate base quality at any given start position and orientation was retained. Sequence calls were obtained using maq assemble with parameters −r 0.2 −t 0.9, and only coordinates with at least 8× coverage and an estimated phred-like consensus quality value of at least 30 were used for downstream variant analyses.
Comparison of sequence calls to array genotypes, dbSNP, and whole genome sequencing
For the 8 HapMap individuals, sequence calls were compared to array-based genotyping data. We excluded from consideration genotyping assays where all 8 individuals were called by the arrays as homozygous non-reference as well as the MHC locus at chr6:32500001–33300000, as both sets are likely to be error-enriched in the genotyping data. ~14.2 million non-redundant coordinates were defined by this file-set. For comparison of NA18507 cSNPs to whole genome data, variant lists were obtained from Illumina, Inc.
Identification of coding indels
Reads for which Maq was unsuccessful in identifying an ungapped alignment were converted to fasta format and mapped to the human reference genome with cross_match, using parameters –gap_ext -1 –bandwidth 10 -minmatch 20 –maxmatch 24. Output options –tags –discrep_lists –alignments –score_hist were also set.
Alignments with an indel were then filtered for those that
- a) had a score at least 40 more than the next best alignment
- b) mapped at least 75 bases of the read
- c) had no substitutions in addition to the indel, and
- d) overlapped a target region.
Reads from filtered alignments that mapped to the negative strand were then reverse complemented and, together with the rest of the filtered reads, re-mapped with cross_match using the same parameters. This was to reduce ambiguity in called indel positions due to different read orientations. After the second mapping, alignments were re-filtered using the same criteria a) through d). For each sample, a putative indel event was called if at least 2 filtered reads covered the same event. A fasta file containing the sequences of all called events +/− 75 bp, as well as the reference sequence at the same positions were then generated for each individual. All the reads from each individual were then mapped to its “indel reference” with Maq using default parameters. Reads that mapped multiple times (map score 0) or had redundant start sites were removed, after which the number of reads mapping to either the reference or the non-reference allele was counted for each individual and indel. An indel was called if there were at least 8 non-reference allele reads making up at least 30% of all reads at that genomic position. Indels were called as heterozygous if non-reference alleles were 30–70% of reads at that position, and homozygous non-reference if >70%.
For cSNP annotation, we constructed a local server that integrates data from NCBI (including dbSNP and Consensus CDS files) and from UCSC Genome Bioinformatics. We also generated PolyPhen predictions24 for all cSNPs identified here, using the PolyPhen Grid Gateway and Perl scripts. The server reads files with SNP locations and alleles and produces annotation files available for download. Annotation includes dbSNP rs IDs, overlapping-gene accession numbers, SNP function (e.g. whether coding missense), conservation scores, HapMap minor-allele frequencies, and various protein annotations (sequence, position, amino acid changes with physicochemical properties, and PolyPhen classification). Indels were considered annotated by dbSNP if an entry was found with the same allele (or reverse complemented) within 1 bp of the variant position. This was to allow for ambiguities in calling the indel position.
Calculation of genome-wide estimates
Extrapolated estimates for the genome-wide number of cSNPs of various classes were calculated based on the number of cSNP calls in that individual, the estimated sensitivity for making a variant call in that individual at any given position within the aggregate target (based on the fraction of array-based genotypes of that class that were successfully called; calculated separately for heterozygous and homozygous non-reference variants), and extrapolation to an estimated exome size of exactly 30 Mb (i.e. multiplying by 30/26.6 = 1.13). A similar approach was taken to estimate the genome-wide number of uncommon cSNPs introducing nonsense codons, starting with the number observed in each individual and extrapolating based on estimated sensitivity for heterozygote detection and an estimated exome size of exactly 30 Mb.