A pinboard by
Abubakar Abid

PhD Student, Stanford


We develop contrastive PCA to explore patterns that are specific to one dataset relative to another.

Principal component analysis (PCA) is ubiquitous in data exploration and visualization. Standard PCA is limited to finding the low-dimensional structure of a single dataset. However, in many applications we have multiple datasets (e.g., treatment and control, or multiple time points) and we are interested in exploring patterns that are specific to one dataset and not shared with the other. We develop contrastive PCA, an efficient method to identify interesting subspaces that contrast between different datasets. Our experiments show that contrastive PCA identifies dataset-specific patterns that standard PCA misses, demonstrating that it can be a powerful new tool for data exploration.
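
The contrast described above can be sketched in a few lines. This is a minimal illustration, assuming the usual formulation in which the contrastive directions are the top eigenvectors of the target covariance minus a scaled background covariance. The function name `contrastive_pca`, the contrast weight `alpha`, and the use of NumPy are assumptions made for this example and are not stated in the summary above.

```python
import numpy as np

def contrastive_pca(target, background, alpha=1.0, n_components=2):
    """Sketch of contrastive PCA (assumed formulation, not the authors' code).

    target, background: arrays of shape (n_samples, n_features).
    alpha: hypothetical contrast weight; larger values penalize
           directions that also vary strongly in the background.
    Returns the projected target data and the chosen directions.
    """
    # Center each dataset on its own mean.
    target_c = target - target.mean(axis=0)
    background_c = background - background.mean(axis=0)

    # Feature-by-feature covariance matrices.
    cov_target = np.cov(target_c, rowvar=False)
    cov_background = np.cov(background_c, rowvar=False)

    # Directions with high target variance and low background variance
    # have large eigenvalues in this symmetric contrast matrix.
    contrast = cov_target - alpha * cov_background
    eigvals, eigvecs = np.linalg.eigh(contrast)

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    directions = eigvecs[:, order]
    return target_c @ directions, directions
```

With `alpha=0` the sketch reduces to standard PCA on the target data. In practice one would likely try several values of `alpha`, since the appropriate trade-off depends on the datasets.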


A solution to the single-question crowd wisdom problem

Abstract: Once considered provocative [1], the notion that the wisdom of the crowd is superior to any individual has become itself a piece of crowd wisdom, leading to speculation that online voting may soon put credentialed experts out of business [2, 3]. Recent applications include political and economic forecasting [4, 5], evaluating nuclear safety [6], public policy [7], the quality of chemical probes [8], and possible responses to a restless volcano [9]. Algorithms for extracting wisdom from the crowd are typically based on a democratic voting procedure. They are simple to apply and preserve the independence of personal judgment [10]. However, democratic methods have serious limitations. They are biased for shallow, lowest common denominator information, at the expense of novel or specialized knowledge that is not widely shared [11, 12]. Adjustments based on measuring confidence do not solve this problem reliably [13]. Here we propose the following alternative to a democratic vote: select the answer that is more popular than people predict. We show that this principle yields the best answer under reasonable assumptions about voter behaviour, while the standard ‘most popular’ or ‘most confident’ principles fail under exactly those same assumptions. Like traditional voting, the principle accepts unique problems, such as panel decisions about scientific or artistic merit, and legal or historical disputes. The potential application domain is thus broader than that covered by machine learning and psychometric methods, which require data across multiple questions [14, 15, 16, 17, 18, 19, 20].
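
The rule "select the answer that is more popular than people predict" can be illustrated with a short sketch. It assumes each respondent supplies both a vote and a predicted vote share for every answer; the function name `surprisingly_popular` and the input layout are hypothetical choices for this example, and the numbers in the usage note are made up.

```python
from collections import Counter

def surprisingly_popular(votes, predictions):
    """Pick the answer whose actual vote share most exceeds its predicted share.

    votes: list of answers, one per respondent.
    predictions: list of dicts, one per respondent, mapping each answer
                 to the share of votes that respondent expects it to get.
    """
    n = len(votes)
    actual = {answer: count / n for answer, count in Counter(votes).items()}

    # Consider every answer that was either voted for or predicted.
    answers = sorted(set(actual) | {a for p in predictions for a in p})

    # Average the predicted shares across respondents.
    predicted = {
        a: sum(p.get(a, 0.0) for p in predictions) / len(predictions)
        for a in answers
    }

    # The "surprise" is how far the actual share exceeds the prediction.
    return max(answers, key=lambda a: actual.get(a, 0.0) - predicted[a])
```

As a toy case, suppose six of ten respondents vote "yes" and expect 90% of others to agree, while four vote "no" and expect 70% to vote "yes". The majority answer is "yes", but "no" receives 40% of votes against an average prediction of 18%, so the sketch returns "no".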

Pub.: 25 Jan '17, Pinned: 01 Jul '17

Quantifiable predictive features define epitope-specific T cell receptor repertoires.

Abstract: T cells are defined by a heterodimeric surface receptor, the T cell receptor (TCR), that mediates recognition of pathogen-associated epitopes through interactions with peptide and major histocompatibility complexes (pMHCs). TCRs are generated by genomic rearrangement of the germline TCR locus, a process termed V(D)J recombination, that has the potential to generate marked diversity of TCRs (estimated to range from 10^15 (ref. 1) to as high as 10^61 (ref. 2) possible receptors). Despite this potential diversity, TCRs from T cells that recognize the same pMHC epitope often share conserved sequence features, suggesting that it may be possible to predictively model epitope specificity. Here we report the in-depth characterization of ten epitope-specific TCR repertoires of CD8+ T cells from mice and humans, representing over 4,600 in-frame single-cell-derived TCRαβ sequence pairs from 110 subjects. We developed analytical tools to characterize these epitope-specific repertoires: a distance measure on the space of TCRs that permits clustering and visualization, a robust repertoire diversity metric that accommodates the low number of paired public receptors observed when compared to single-chain analyses, and a distance-based classifier that can assign previously unobserved TCRs to characterized repertoires with robust sensitivity and specificity. Our analyses demonstrate that each epitope-specific repertoire contains a clustered group of receptors that share core sequence similarities, together with a dispersed set of diverse 'outlier' sequences. By identifying shared motifs in core sequences, we were able to highlight key conserved residues driving essential elements of TCR recognition. These analyses provide insights into the generalizable, underlying features of epitope-specific repertoires and adaptive immune recognition.
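
The general shape of a distance-based classifier can be sketched as follows. This is not the distance measure or classifier from the paper: it swaps in plain Levenshtein edit distance on single sequence strings as a stand-in, and assigns a query to the repertoire whose nearest members are closest on average. The names `edit_distance` and `classify`, the parameter `k`, and the sequences in the usage note are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (stand-in distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def classify(query, repertoires, k=2):
    """Assign `query` to the repertoire with the smallest mean distance
    to its k nearest member sequences.

    repertoires: dict mapping a repertoire label to a list of sequences.
    """
    def score(members):
        nearest = sorted(edit_distance(query, m) for m in members)[:k]
        return sum(nearest) / len(nearest)

    # Sorting the labels makes tie-breaking deterministic.
    return min(sorted(repertoires), key=lambda name: score(repertoires[name]))
```

For instance, with made-up toy repertoires `{"epitope_A": ["CASSLGQAYEQYF", "CASSLGQGYEQYF"], "epitope_B": ["CASRPDRGNTLYF", "CASRPDRGDTLYF"]}`, the query `"CASSLGQAYEQFF"` is assigned to `"epitope_A"`, because it lies within one or two edits of those members. A realistic version would need a distance that reflects paired-chain receptor structure, which this sketch does not attempt.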

Pub.: 22 Jun '17, Pinned: 01 Jul '17