PhD Student, Stanford
We develop contrastive PCA to explore patterns that are specific to one dataset relative to another.
Principal component analysis (PCA) is ubiquitous in data exploration and visualization. Standard PCA is limited to finding the low-dimensional structure of a single dataset. However, in many applications we have multiple datasets (e.g. treatment and control, or multiple time points) and want to explore patterns that are specific to one dataset and not shared with the other. We develop contrastive PCA, an efficient method for identifying subspaces that contrast between datasets. Our experiments show that contrastive PCA identifies dataset-specific patterns that standard PCA misses, demonstrating that it can be a powerful new tool for data exploration.
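To make the idea of a contrasting subspace concrete, here is a minimal sketch. It assumes one common formulation that the description above does not spell out: the contrastive directions are taken as the top eigenvectors of the target covariance minus a weight `alpha` times the background covariance. The function name `contrastive_pca`, the parameter `alpha`, and its default value are illustrative assumptions, not the authors' code.

```python
import numpy as np


def contrastive_pca(target, background, alpha=1.0, n_components=2):
    """Sketch of contrastive PCA under the stated assumption.

    target, background: arrays of shape (n_samples, n_features).
    alpha: hypothetical contrast weight; larger values penalise
           directions that also have high variance in the background.
    Returns (directions, projected_target).
    """
    # Center each dataset on its own mean.
    target_c = target - target.mean(axis=0)
    background_c = background - background.mean(axis=0)

    # Empirical covariance matrices, each (n_features, n_features).
    cov_target = target_c.T @ target_c / (len(target_c) - 1)
    cov_background = background_c.T @ background_c / (len(background_c) - 1)

    # Contrast matrix: target variance minus weighted background variance.
    contrast = cov_target - alpha * cov_background

    # eigh handles symmetric matrices and returns ascending eigenvalues,
    # so reorder to pick the largest ones first.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    order = np.argsort(eigvals)[::-1][:n_components]
    directions = eigvecs[:, order]

    return directions, target_c @ directions
```

With `alpha=0` the sketch reduces to ordinary PCA on the target data; increasing `alpha` shifts the result toward directions where the target varies more than the background does.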
Abstract: Once considered provocative [1], the notion that the wisdom of the crowd is superior to any individual has become itself a piece of crowd wisdom, leading to speculation that online voting may soon put credentialed experts out of business [2, 3]. Recent applications include political and economic forecasting [4, 5], evaluating nuclear safety [6], public policy [7], the quality of chemical probes [8], and possible responses to a restless volcano [9]. Algorithms for extracting wisdom from the crowd are typically based on a democratic voting procedure. They are simple to apply and preserve the independence of personal judgment [10]. However, democratic methods have serious limitations. They are biased for shallow, lowest common denominator information, at the expense of novel or specialized knowledge that is not widely shared [11, 12]. Adjustments based on measuring confidence do not solve this problem reliably [13]. Here we propose the following alternative to a democratic vote: select the answer that is more popular than people predict. We show that this principle yields the best answer under reasonable assumptions about voter behaviour, while the standard ‘most popular’ or ‘most confident’ principles fail under exactly those same assumptions. Like traditional voting, the principle accepts unique problems, such as panel decisions about scientific or artistic merit, and legal or historical disputes. The potential application domain is thus broader than that covered by machine learning and psychometric methods, which require data across multiple questions [14, 15, 16, 17, 18, 19, 20].
Pub.: 25 Jan '17, Pinned: 01 Jul '17
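The rule in the abstract above, "select the answer that is more popular than people predict", can be illustrated with a small sketch. It assumes each respondent gives both a vote and a predicted share of votes for each answer, and it scores answers by actual share minus mean predicted share. That scoring difference, the function name `surprisingly_popular`, and the input layout are assumptions made for illustration only.

```python
from collections import Counter


def surprisingly_popular(votes, predicted_shares):
    """Pick the answer whose actual vote share most exceeds its predicted share.

    votes: list with one answer per respondent.
    predicted_shares: list with one dict per respondent, mapping each
        answer to the fraction of voters that respondent expects to
        choose it (hypothetical input format).
    """
    n_votes = len(votes)
    actual = {answer: count / n_votes for answer, count in Counter(votes).items()}

    # Consider every answer that was either voted for or predicted.
    answers = sorted(set(actual) | {a for p in predicted_shares for a in p})

    # Mean predicted share per answer across all respondents.
    n_pred = len(predicted_shares)
    predicted = {
        a: sum(p.get(a, 0.0) for p in predicted_shares) / n_pred for a in answers
    }

    # The winner is the answer with the largest positive surprise.
    return max(answers, key=lambda a: actual.get(a, 0.0) - predicted[a])


# Toy data: "yes" wins the plain majority vote (6 of 10), but even the
# "no" voters expect "yes" to be popular, so "no" beats its prediction.
votes = ["yes"] * 6 + ["no"] * 4
predictions = [{"yes": 0.9, "no": 0.1}] * 6 + [{"yes": 0.7, "no": 0.3}] * 4
winner = surprisingly_popular(votes, predictions)
```

In this toy data the majority answer and the selected answer differ, which is the situation where the rule departs from a plain democratic vote.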
Abstract: T cells are defined by a heterodimeric surface receptor, the T cell receptor (TCR), that mediates recognition of pathogen-associated epitopes through interactions with peptide and major histocompatibility complexes (pMHCs). TCRs are generated by genomic rearrangement of the germline TCR locus, a process termed V(D)J recombination, that has the potential to generate marked diversity of TCRs (estimated to range from 10^15 (ref. 1) to as high as 10^61 (ref. 2) possible receptors). Despite this potential diversity, TCRs from T cells that recognize the same pMHC epitope often share conserved sequence features, suggesting that it may be possible to predictively model epitope specificity. Here we report the in-depth characterization of ten epitope-specific TCR repertoires of CD8+ T cells from mice and humans, representing over 4,600 in-frame single-cell-derived TCRαβ sequence pairs from 110 subjects. We developed analytical tools to characterize these epitope-specific repertoires: a distance measure on the space of TCRs that permits clustering and visualization, a robust repertoire diversity metric that accommodates the low number of paired public receptors observed when compared to single-chain analyses, and a distance-based classifier that can assign previously unobserved TCRs to characterized repertoires with robust sensitivity and specificity. Our analyses demonstrate that each epitope-specific repertoire contains a clustered group of receptors that share core sequence similarities, together with a dispersed set of diverse 'outlier' sequences. By identifying shared motifs in core sequences, we were able to highlight key conserved residues driving essential elements of TCR recognition. These analyses provide insights into the generalizable, underlying features of epitope-specific repertoires and adaptive immune recognition.
Pub.: 22 Jun '17, Pinned: 01 Jul '17
Abstract: In the development of a magnetic resonance imaging spectrometer, equipment fault detection relies mainly on visual inspection of reconstructed images or k-space data, combined with observation of the output waveforms via an oscilloscope. However, with these methods it can be quite difficult to detect minor design flaws that would produce image ghosting or other problems. This article presents a fault detection method that is based on acquisition and analysis of the output waveforms from the spectrometer. While a sequence is running, the spectrometer outputs, including the digital gate and the gradients, are sampled using a data acquisition card. The acquired data is then processed using a high-performance graphics processing unit to extract the feature points, which in this design are the endpoints of the waveform segments. The processing operation is composed of data filtering, differencing, and clustering. Finally, the extracted feature points are compared with the predefined feature points of the sequence to determine any design errors. This method has been used to solve image ghosting problems in our home-built spectrometer.
Pub.: 27 Jun '17, Pinned: 01 Jul '17
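The filtering, differencing, and clustering steps named in the abstract above can be sketched as follows. The abstract does not say which filter, difference order, or clustering rule is used, so the moving-average filter, the second-order difference, the gap-based grouping, and all parameter values here are assumptions. The sketch also runs on the CPU with NumPy, not on a GPU, and it assumes a piecewise-linear waveform such as a trapezoidal gradient.

```python
import numpy as np


def extract_feature_points(waveform, window=5, threshold=0.01, max_gap=3):
    """Estimate segment endpoints (sample indices) of a piecewise-linear waveform.

    window, threshold and max_gap are hypothetical tuning parameters;
    window is assumed to be odd.
    """
    # 1. Filtering: moving average, with edge padding so the output
    #    has the same length as the input and no artificial edge steps.
    half = window // 2
    padded = np.pad(waveform, half, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")

    # 2. Differencing: the second difference is near zero inside a
    #    linear segment and non-zero where the slope changes.
    #    Element j of the result is centred on sample j + 1.
    curvature = np.diff(smooth, n=2)
    candidates = np.flatnonzero(np.abs(curvature) > threshold) + 1
    if candidates.size == 0:
        return []

    # 3. Clustering: group candidate indices separated by at most
    #    max_gap samples, and report each group's mean index.
    breaks = np.flatnonzero(np.diff(candidates) > max_gap) + 1
    groups = np.split(candidates, breaks)
    return [int(round(float(group.mean()))) for group in groups]


def matches_expected(found, expected, tolerance=1):
    """Compare extracted feature points with predefined ones, sample by sample."""
    return len(found) == len(expected) and all(
        abs(f - e) <= tolerance for f, e in zip(found, expected)
    )


# Noise-free trapezoid with slope changes at samples 20, 30, 50 and 60.
samples = np.arange(80)
trapezoid = np.interp(samples, [0, 20, 30, 50, 60, 79], [0, 0, 1, 1, 0, 0])
points = extract_feature_points(trapezoid)
```

The `matches_expected` helper stands in for the final comparison against the predefined feature points of the sequence; a real check would presumably also compare amplitudes and timing in physical units.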
Abstract: Historical normal and abnormal data sets are prerequisites for process monitoring, alarm system rationalization, fault detection, and diagnosis. This paper proposes a new method to automatically find normal and abnormal data segments from historical data sets based on variational directions of multiple process variables. The minimum time duration and the minimum amplitude shift are introduced as empirical knowledge to define underlying stages in the data sets. Two major challenges in identifying these stages are addressed by using a density-based clustering algorithm. The effectiveness of the proposed method is illustrated using numerical and industrial examples.
Pub.: 12 Jun '17, Pinned: 01 Jul '17