PhD Student, Brown University
Identifying genomic rearrangements in cancer genomes
Cancer is a disease caused by mutations in the genome that promote cell proliferation and growth. Often, these mutations are not single nucleotide changes but large-scale genomic rearrangements, called structural variants. Such rearrangements include deletions, duplications, or inversions of entire genes. These structural variants are difficult to identify using current sequencing technologies because DNA must be fragmented into small pieces before being sequenced. The resulting fragments of DNA are pieced together, like a genomic puzzle, by aligning them to the human reference genome. I developed a probabilistic model to detect regions of a cancer genome affected by structural variation. The model uses a new sequencing technology, linked-read sequencing, which provides long-range information by labelling DNA fragments that originate from the same long DNA molecule.
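To make the long-range signal from linked reads concrete, here is a minimal sketch of one commonly used heuristic: two loci that lie far apart in the reference but share many barcodes may have been adjacent on the same long molecule, which hints at a rearrangement. This is not the probabilistic model described above. The function names, the window size, and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict

def barcodes_by_window(reads, window_size=10_000):
    """Map (chromosome, window index) -> set of barcodes seen in that window.

    `reads` is an iterable of (chromosome, position, barcode) tuples.
    The 10 kb window size is an arbitrary illustrative choice.
    """
    windows = defaultdict(set)
    for chrom, pos, barcode in reads:
        windows[(chrom, pos // window_size)].add(barcode)
    return windows

def shared_barcode_fraction(windows, a, b):
    """Jaccard overlap between the barcode sets of windows a and b."""
    set_a, set_b = windows.get(a, set()), windows.get(b, set())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Toy reads; every value here is made up for the example.
reads = [
    ("chr1", 1_000, "BC1"), ("chr1", 2_500, "BC2"), ("chr1", 4_000, "BC3"),
    ("chr1", 5_001_000, "BC1"), ("chr1", 5_002_000, "BC2"),
    ("chr1", 5_003_000, "BC4"),
    ("chr1", 9_000_500, "BC5"),
]
windows = barcodes_by_window(reads)

# Windows 0 and 500 are 5 Mb apart in the reference yet share barcodes,
# so they are candidates for being joined by a rearrangement.
suspicious = shared_barcode_fraction(windows, ("chr1", 0), ("chr1", 500))
# Windows 0 and 900 share no barcodes.
background = shared_barcode_fraction(windows, ("chr1", 0), ("chr1", 900))
```

In real data, distant windows share some barcodes by chance, so a practical method would compare the observed overlap against an expected background rather than use a raw fraction.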
Abstract: Retrotransposons constitute a major source of genetic variation, and somatic retrotransposon insertions have been reported in cancer. Here, we applied TranspoSeq, a computational framework that identifies retrotransposon insertions from sequencing data, to whole genomes from 200 tumor/normal pairs across 11 tumor types as part of The Cancer Genome Atlas (TCGA) Pan-Cancer Project. In addition to novel germline polymorphisms, we find 810 somatic retrotransposon insertions primarily in lung squamous, head and neck, colorectal, and endometrial carcinomas. Many somatic retrotransposon insertions occur in known cancer genes. We find that high somatic retrotransposition rates in tumors are associated with high rates of genomic rearrangement and somatic mutation. Finally, we developed TranspoSeq-Exome to interrogate an additional 767 tumor samples with hybrid-capture exome data and discovered 35 novel somatic retrotransposon insertions into exonic regions, including an insertion into an exon of the PTEN tumor suppressor gene. The results of this large-scale, comprehensive analysis of retrotransposon movement across tumor types suggest that somatic retrotransposon insertions may represent an important class of structural variation in cancer.
Pub.: 16 May '14, Pinned: 30 Jun '17
Abstract: With the advent of next-generation sequencing technologies, we have witnessed a rapid pace of discovery of new patterns of somatic structural variation in cancer genomes, and an attempt to figure out their underlying mechanisms. Some of these mechanisms are associated with particular cancer types, and in some cases are the main cause of the structural mutations that drive the oncogenic process. This review provides an overview of the patterns of somatic structural variation and chromosomal structures that characterize cancer genomes, their causal mechanisms and their impact in oncogenesis.
Pub.: 24 Apr '15, Pinned: 30 Jun '17
Abstract: Identifying large-scale structural variation in cancer genomes continues to be a challenge to researchers. Current methods rely on genome alignments based on a reference that can be a poor fit to highly variant and complex tumor genomes. To address this challenge we developed a method that uses available breakpoint information to generate models of structural variations. We use these models as references to align previously unmapped and discordant reads from a genome. By using these models to align unmapped reads, we show that our method can help to identify large-scale variations that have been previously missed.
Pub.: 13 Aug '15, Pinned: 30 Jun '17
Abstract: Over the last decade or so, sophisticated technological advances in array-based genomics have firmly established the contribution of structural alterations in the human genome to a variety of complex developmental disorders, and also to diseases such as cancer. In fact, multiple 'novel' disorders have been identified as a direct consequence of these advances. Our understanding of the molecular events leading to the generation of these structural alterations is also expanding. Many of the models proposed to explain these complex rearrangements involve DNA breakage and the coordinated action of DNA replication, repair and recombination machinery. Here, and within the context of Genomic Disorders, we will briefly overview the principal models currently invoked to explain these chromosomal rearrangements, including Non-Allelic Homologous Recombination (NAHR), Fork Stalling Template Switching (FoSTeS), Microhomology Mediated Break-Induced Repair (MMBIR) and Breakage-fusion-bridge cycle (BFB). We will also discuss an unanticipated consequence of certain copy number variations (CNVs) whereby the CNVs potentially compromise fundamental processes controlling genomic stability including DNA replication and the DNA damage response. We will illustrate these using specific examples including Genomic Disorders (DiGeorge/Velocardiofacial syndrome, HSA21 segmental aneuploidy and rec (3) syndrome) and cell-based model systems. Finally, we will review some of the recent exciting developments surrounding specific CNVs and their contribution to cancer development as well as the latest model for cancer genome rearrangement; 'chromothripsis'.
Pub.: 02 Aug '11, Pinned: 30 Jun '17
Abstract: Structural variations (SVs) are large genomic rearrangements that vary significantly in size, making them challenging to detect with the relatively short reads from next-generation sequencing (NGS). Different SV detection methods have been developed; however, each is limited to specific kinds of SVs with varying accuracy and resolution. Previous works have attempted to combine different methods, but they still suffer from poor accuracy particularly for insertions. We propose MetaSV, an integrated SV caller which leverages multiple orthogonal SV signals for high accuracy and resolution. MetaSV proceeds by merging SVs from multiple tools for all types of SVs. It also analyzes soft-clipped reads from alignment to detect insertions accurately since existing tools underestimate insertion SVs. Local assembly in combination with dynamic programming is used to improve breakpoint resolution. Paired-end and coverage information is used to predict SV genotypes. Using simulation and experimental data, we demonstrate the effectiveness of MetaSV across various SV types and sizes. Code in Python is at http://firstname.lastname@example.org. Supplementary data are available at Bioinformatics online.
Pub.: 12 Apr '15, Pinned: 30 Jun '17
Abstract: Current clinical genomics assays primarily utilize short-read sequencing (SRS), but SRS has limited ability to evaluate repetitive regions and structural variants. Long-read sequencing (LRS) has complementary strengths, and we aimed to determine whether LRS could offer a means to identify overlooked genetic variation in patients undiagnosed by SRS.
Pub.: 22 Jun '17, Pinned: 29 Jun '17
Abstract: Paired-end sequencing is a common approach for identifying structural variation (SV) in genomes. Discrepancies between the observed and expected alignments indicate potential SVs. Most SV detection algorithms use only one of the possible signals and ignore reads with multiple alignments. This results in reduced sensitivity to detect SVs, especially in repetitive regions. We introduce GASVPro, an algorithm combining both paired read and read depth signals into a probabilistic model which can analyze multiple alignments of reads. GASVPro outperforms existing methods with a 50-90% improvement in specificity on deletions and a 50% improvement on inversions.
Pub.: 29 Mar '12, Pinned: 29 Jun '17
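The paired-read signal mentioned in the abstract above ("discrepancies between the observed and expected alignments") can be illustrated with a small sketch. This is a plain rule-based classifier of discordant pairs, not GASVPro's probabilistic model, and it ignores read depth and multiple alignments. The `ReadPair` type, the expected insert size, and the tolerance are hypothetical values chosen for the example; it assumes a library where concordant pairs align as forward strand then reverse strand.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom: str
    left_pos: int       # alignment start of the leftmost read
    right_pos: int      # alignment start of the rightmost read
    left_strand: str    # "+" or "-"
    right_strand: str   # "+" or "-"

def classify_pair(pair, mean_insert=400, max_dev=150):
    """Label a read pair by the structural variant type it may support.

    mean_insert and max_dev are assumed library parameters, not values
    taken from any of the papers listed here.
    """
    span = pair.right_pos - pair.left_pos
    if pair.left_strand == pair.right_strand:
        # Both reads on the same strand: one end may lie inside an inversion.
        return "inversion_candidate"
    if (pair.left_strand, pair.right_strand) == ("-", "+"):
        # Reads point away from each other: possible tandem duplication.
        return "tandem_duplication_candidate"
    if span > mean_insert + max_dev:
        # Mapped farther apart than the fragment length: sequence present in
        # the reference may be missing from the sample.
        return "deletion_candidate"
    if span < mean_insert - max_dev:
        # Mapped closer together than expected: the sample may carry extra
        # sequence between the reads.
        return "insertion_candidate"
    return "concordant"

pairs = [
    ReadPair("chr1", 1_000, 1_410, "+", "-"),
    ReadPair("chr1", 5_000, 9_000, "+", "-"),
    ReadPair("chr1", 7_000, 7_100, "+", "-"),
    ReadPair("chr1", 12_000, 12_400, "+", "+"),
    ReadPair("chr1", 20_000, 20_400, "-", "+"),
]
labels = [classify_pair(p) for p in pairs]
```

A single discordant pair is weak evidence; callers typically require clusters of pairs that agree on a breakpoint before reporting a variant.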
Abstract: Determining the genome sequence of an organism is challenging, yet fundamental to understanding its biology. Over the past decade, thousands of human genomes have been sequenced, contributing deeply to biomedical research. In the vast majority of cases, these have been analyzed by aligning sequence reads to a single reference genome, biasing the resulting analyses, and in general, failing to capture sequences novel to a given genome. Some de novo assemblies have been constructed free of reference bias, but nearly all were constructed by merging homologous loci into single "consensus" sequences, generally absent from nature. These assemblies do not correctly represent the diploid biology of an individual. In exactly two cases, true diploid de novo assemblies have been made, at great expense. One was generated using Sanger sequencing, and one using thousands of clone pools. Here, we demonstrate a straightforward and low-cost method for creating true diploid de novo assemblies. We make a single library from ∼1 ng of high molecular weight DNA, using the 10x Genomics microfluidic platform to partition the genome. We applied this technique to seven human samples, generating low-cost HiSeq X data, then assembled these using a new "pushbutton" algorithm, Supernova. Each computation took 2 d on a single server. Each yielded contigs longer than 100 kb, phase blocks longer than 2.5 Mb, and scaffolds longer than 15 Mb. Our method provides a scalable capability for determining the actual diploid genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.
Pub.: 07 Apr '17, Pinned: 29 Jun '17