A pinboard by
this curator

I do research in large user facilities and manage a lab in one to help scientists from all fields.


Machine learning is becoming more and more widespread in all kinds of scientific fields...

Computers doing science?

In the past two decades, many tech companies (Google, Facebook, etc.) made it onto the Fortune 500 list by analyzing large amounts of data with newly developed computer algorithms. This field is called machine learning, and it can also have a big impact on modern physics, chemistry and biology.

No more knob turning...

A group of Australian physicists and computer scientists built a proof-of-principle experiment consisting of a very complicated laser system that produces ultra-cold atoms in Bose-Einstein condensates (Nobel Prize in Physics, 2001). The original experiments took quite a long time; the machine learning algorithm, however, mastered the optimization task in one hour!

Where is Artificial Intelligence going?

There are many answers to this question, and today a large part of them is just speculation. Some are afraid of the future capabilities of AI; others promote research in the field and make it open source. One thing is certain: computer algorithms can do more and more each day.

In diagnostics:

Image recognition is one of the big successes of machine learning. On specific tasks, the tools currently available can be more accurate than trained doctors, thanks to the large databases the algorithms can be trained on.

In biology:

Big data, such as gene sequencing data, can also be examined with computer code to uncover hidden links between parts of the genome.

In physics and chemistry:

Computers can now predict the properties of new materials based on calculations alone! This saves unnecessary synthesis effort, which also has a large, positive environmental impact.

The opportunities...

...are endless. Since machine learning in the natural sciences is a relatively new field, there is plenty of low-hanging fruit, but the field is saturating quickly.


Computational Sensing Using Low-Cost and Mobile Plasmonic Readers Designed by Machine Learning.

Abstract: Plasmonic sensors have been used for a wide-range of biological and chemical sensing applications. Emerging nano-fabrication techniques have enabled these sensors to be cost-effectively mass-manufactured onto various types of substrates. To accompany these advances, major improvements in sensor read-out devices must also be achieved to fully realize the broad impact of plasmonic nano-sensors. Here, we propose a machine learning framework which can be used to design low-cost and mobile multi-spectral plasmonic readers that do not use traditionally employed bulky and expensive stabilized light-sources or high-resolution spectrometers. By training a feature selection model over a large set of fabricated plasmonic nano-sensors, we select the optimal set of illumination light-emitting-diodes needed to create a minimum-error refractive index prediction model, which statistically takes into account the varied spectral responses and fabrication-induced variability of a given sensor design. This computational sensing approach was experimentally validated using a modular mobile plasmonic reader. We tested different plasmonic sensors with hexagonal and square periodicity nano-hole arrays, and revealed that the optimal illumination bands differ from those that are 'intuitively' selected based on the spectral features of the sensor, e.g., transmission peaks or valleys. This framework provides a universal tool for the plasmonics community to design low-cost and mobile multi-spectral readers, helping the translation of nano-sensing technologies to various emerging applications such as wearable sensing, personalized medicine, and point-of-care diagnostics. Beyond plasmonics, other types of sensors that operate based on spectral changes can broadly benefit from this approach, including e.g., aptamer-enabled nanoparticle assays and graphene-based sensors, among others.

Pub.: 28 Jan '17, Pinned: 12 Apr '17

A predictive machine learning approach for microstructure optimization and materials design.

Abstract: This paper addresses an important materials engineering question: How can one identify the complete space (or as much of it as possible) of microstructures that are theoretically predicted to yield the desired combination of properties demanded by a selected application? We present a problem involving design of magnetoelastic Fe-Ga alloy microstructure for enhanced elastic, plastic and magnetostrictive properties. While theoretical models for computing properties given the microstructure are known for this alloy, inversion of these relationships to obtain microstructures that lead to desired properties is challenging, primarily due to the high dimensionality of microstructure space, multi-objective design requirement and non-uniqueness of solutions. These challenges render traditional search-based optimization methods incompetent in terms of both searching efficiency and result optimality. In this paper, a route to address these challenges using a machine learning methodology is proposed. A systematic framework consisting of random data generation, feature selection and classification algorithms is developed. Experiments with five design problems that involve identification of microstructures that satisfy both linear and nonlinear property constraints show that our framework outperforms traditional optimization methods with the average running time reduced by as much as 80% and with optimality that would not be achieved otherwise.

Pub.: 24 Jun '15, Pinned: 12 Apr '17

Machine-learning techniques for geochemical discrimination of 2011 Tohoku tsunami deposits.

Abstract: Geochemical discrimination has recently been recognised as a potentially useful proxy for identifying tsunami deposits in addition to classical proxies such as sedimentological and micropalaeontological evidence. However, difficulties remain because it is unclear which elements best discriminate between tsunami and non-tsunami deposits. Herein, we propose a mathematical methodology for the geochemical discrimination of tsunami deposits using machine-learning techniques. The proposed method can determine the appropriate combinations of elements and the precise discrimination plane that best discerns tsunami deposits from non-tsunami deposits in high-dimensional compositional space through the use of data sets of bulk composition that have been categorised as tsunami or non-tsunami sediments. We applied this method to the 2011 Tohoku tsunami and to background marine sedimentary rocks. After an exhaustive search of all 262,144 (= 2^18) combinations of the 18 analysed elements, we observed several tens of combinations with discrimination rates higher than 99.0%. The analytical results show that elements such as Ca and several heavy-metal elements are important for discriminating tsunami deposits from marine sedimentary rocks. These elements are considered to reflect the formation mechanism and origin of the tsunami deposits. The proposed methodology has the potential to aid in the identification of past tsunamis by using other tsunami proxies.

Pub.: 18 Nov '14, Pinned: 12 Apr '17
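
The exhaustive search described in the abstract above can be pictured as a loop over every subset of the measured elements, with a classifier scored on each subset. With 18 elements there are 2^18 = 262,144 subsets (counting the empty one). The sketch below is only an illustration under stated assumptions, not the authors' method: it swaps in a simple nearest-centroid classifier, uses four element names (only Ca is mentioned in the abstract; Fe, Cu and Zn are placeholders), and runs on a tiny synthetic data set invented for the demo.

```python
from itertools import combinations

# Number of subsets of 18 elements, as quoted in the abstract.
N_SUBSETS_18 = 2 ** 18  # 262144

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def accuracy(samples, labels, features):
    """Training accuracy of a nearest-centroid classifier using only `features`."""
    proj = [[s[f] for f in features] for s in samples]
    cents = {
        lab: centroid([p for p, l in zip(proj, labels) if l == lab])
        for lab in set(labels)
    }

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hits = 0
    for p, lab in zip(proj, labels):
        pred = min(cents, key=lambda c: dist2(p, cents[c]))
        hits += pred == lab
    return hits / len(samples)

def exhaustive_search(samples, labels, names):
    """Score every non-empty feature subset.

    Returns (best accuracy, list of best subsets as name tuples, subsets evaluated).
    """
    n = len(names)
    results = []
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            results.append((accuracy(samples, labels, subset), subset))
    best = max(acc for acc, _ in results)
    winners = [tuple(names[i] for i in s) for acc, s in results if acc == best]
    return best, winners, len(results)

# Synthetic toy data (NOT from the paper): column 0 separates the classes,
# the other columns are uninformative.
names = ["Ca", "Fe", "Cu", "Zn"]  # placeholder element names
samples = [
    [5.0, 1.0, 0.2, 0.3], [5.2, 3.0, 0.9, 0.1], [4.8, 2.0, 0.5, 0.8],
    [1.0, 1.1, 0.3, 0.2], [1.2, 2.9, 0.8, 0.2], [0.9, 2.1, 0.4, 0.7],
]
labels = ["tsunami"] * 3 + ["background"] * 3

best, winners, n_evaluated = exhaustive_search(samples, labels, names)
print(n_evaluated, best, winners)
```

With four features the loop visits 2^4 - 1 = 15 non-empty subsets; the same loop over 18 elements is what makes the search "exhaustive" and explains the 262,144 figure. A real analysis would score each subset on held-out data rather than on the training samples.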

Machine-learning-assisted materials discovery using failed experiments

Abstract: Inorganic–organic hybrid materials [1, 2, 3] such as organically templated metal oxides [1], metal–organic frameworks (MOFs) [2] and organohalide perovskites [4] have been studied for decades, and hydrothermal and (non-aqueous) solvothermal syntheses have produced thousands of new materials that collectively contain nearly all the metals in the periodic table [5, 6, 7, 8, 9]. Nevertheless, the formation of these compounds is not fully understood, and development of new compounds relies primarily on exploratory syntheses. Simulation- and data-driven approaches (promoted by efforts such as the Materials Genome Initiative [10]) provide an alternative to experimental trial-and-error. Three major strategies are: simulation-based predictions of physical properties (for example, charge mobility [11], photovoltaic properties [12], gas adsorption capacity [13] or lithium-ion intercalation [14]) to identify promising target candidates for synthetic efforts [11, 15]; determination of the structure–property relationship from large bodies of experimental data [16, 17], enabled by integration with high-throughput synthesis and measurement tools [18]; and clustering on the basis of similar crystallographic structure (for example, zeolite structure classification [19, 20] or gas adsorption properties [21]). Here we demonstrate an alternative approach that uses machine-learning algorithms trained on reaction data to predict reaction outcomes for the crystallization of templated vanadium selenites. We used information on ‘dark’ reactions (failed or unsuccessful hydrothermal syntheses) collected from archived laboratory notebooks from our laboratory, and added physicochemical property descriptions to the raw notebook information using cheminformatics techniques. We used the resulting data to train a machine-learning model to predict reaction success. When carrying out hydrothermal synthesis experiments using previously untested, commercially available organic building blocks, our machine-learning model outperformed traditional human strategies, and successfully predicted conditions for new organically templated inorganic product formation with a success rate of 89 per cent. Inverting the machine-learning model reveals new hypotheses regarding the conditions for successful product formation.

Pub.: 04 May '16, Pinned: 12 Apr '17

Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies.

Abstract: The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that takes correlation structures within the set of SNPs under investigation in a mathematically well-controlled manner into account. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008-2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI presents higher power and precision than the examined alternatives while yielding fewer false (i.e. non-replicated) and more true (i.e. replicated) discoveries when its results are validated on later GWAS studies. More than 80% of the discoveries made by COMBI upon WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.

Pub.: 29 Nov '16, Pinned: 12 Apr '17
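
The two-step idea in the abstract above (first screen for a small set of candidate SNPs, then run hypothesis tests on those candidates only, with a corrected threshold) can be sketched in a few lines. This is a simplified stand-in and not the COMBI implementation: the screening step ranks SNPs by a plain difference in means instead of support vector machine weights, the test is a permutation test, and the "threshold correction" is assumed here to be a Bonferroni correction over the k candidates. All function names and the synthetic data are hypothetical.

```python
import random

def mean_diff(column, phenotype):
    """Absolute difference in mean genotype between cases (1) and controls (0)."""
    cases = [g for g, y in zip(column, phenotype) if y == 1]
    ctrls = [g for g, y in zip(column, phenotype) if y == 0]
    return abs(sum(cases) / len(cases) - sum(ctrls) / len(ctrls))

def permutation_p(column, phenotype, n_perm, rng):
    """Permutation p-value for the mean-difference statistic."""
    observed = mean_diff(column, phenotype)
    shuffled = list(phenotype)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_diff(column, shuffled) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

def screen_then_test(columns, phenotype, k, alpha=0.05, n_perm=2000, seed=0):
    """Return indices of SNPs that survive screening and the corrected test."""
    rng = random.Random(seed)
    # Step 1 (screening): rank SNPs by a relevance score and keep the top k.
    # Assumption: mean difference stands in for the SVM weights of the paper.
    scores = [mean_diff(c, phenotype) for c in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    candidates = order[:k]
    # Step 2 (testing): test only the candidates, Bonferroni-corrected over k.
    threshold = alpha / k
    return sorted(
        j for j in candidates
        if permutation_p(columns[j], phenotype, n_perm, rng) <= threshold
    )

# Synthetic toy data (NOT from the study): 20 cases, 20 controls.
phenotype = [1] * 20 + [0] * 20
data_rng = random.Random(1)
columns = [list(phenotype)]  # SNP 0 is constructed to track the phenotype exactly
columns += [[data_rng.randint(0, 2) for _ in phenotype] for _ in range(5)]

hits = screen_then_test(columns, phenotype, k=3)
print(hits)
```

The point of the design is that the expensive multiple-testing penalty is paid over k candidates rather than over every position in the genome. Note that screening and testing on the same samples needs careful calibration in practice, which this sketch does not attempt.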

Identification of DEP domain-containing proteins by a machine learning method and experimental analysis of their expression in human HCC tissues.

Abstract: The Dishevelled/EGL-10/Pleckstrin (DEP) domain-containing (DEPDC) proteins have seven members. However, whether this superfamily can be distinguished from other proteins based only on the amino acid sequences, remains unknown. Here, we describe a computational method to segregate DEPDCs and non-DEPDCs. First, we examined the Pfam numbers of the known DEPDCs and used the longest sequences for each Pfam to construct a phylogenetic tree. Subsequently, we extracted 188-dimensional (188D) and 20D features of DEPDCs and non-DEPDCs and classified them with random forest classifier. We also mined the motifs of human DEPDCs to find the related domains. Finally, we designed experimental verification methods of human DEPDC expression at the mRNA level in hepatocellular carcinoma (HCC) and adjacent normal tissues. The phylogenetic analysis showed that the DEPDCs superfamily can be divided into three clusters. Moreover, the 188D and 20D features can both be used to effectively distinguish the two protein types. Motif analysis revealed that the DEP and RhoGAP domain was common in human DEPDCs, human HCC and the adjacent tissues that widely expressed DEPDCs. However, their regulation was not identical. In conclusion, we successfully constructed a binary classifier for DEPDCs and experimentally verified their expression in human HCC tissues.

Pub.: 22 Dec '16, Pinned: 12 Apr '17
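
The abstract above does not say how its 20-dimensional (20D) features are computed. One common 20-dimensional descriptor for protein sequences is amino-acid composition: the fraction of each of the 20 standard residues. Assuming a descriptor of that kind, a minimal sketch looks like this (the function name is hypothetical, and the paper's actual features may differ):

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return a 20-element list with the fraction of each standard residue.

    Characters outside the 20 standard codes are ignored.
    """
    seq = sequence.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]

vec = composition("ACCA")
print(len(vec), vec[:2])
```

A vector like this, computed for each protein, is the kind of fixed-length input that a classifier such as a random forest can be trained on.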

Computer vision and machine learning for robust phenotyping in genome-wide studies.

Abstract: Traditional evaluation of crop biotic and abiotic stresses are time-consuming and labor-intensive limiting the ability to dissect the genetic basis of quantitative traits. A machine learning (ML)-enabled image-phenotyping pipeline for the genetic studies of abiotic stress iron deficiency chlorosis (IDC) of soybean is reported. IDC classification and severity for an association panel of 461 diverse plant-introduction accessions was evaluated using an end-to-end phenotyping workflow. The workflow consisted of a multi-stage procedure including: (1) optimized protocols for consistent image capture across plant canopies, (2) canopy identification and registration from cluttered backgrounds, (3) extraction of domain expert informed features from the processed images to accurately represent IDC expression, and (4) supervised ML-based classifiers that linked the automatically extracted features with expert-rating equivalent IDC scores. ML-generated phenotypic data were subsequently utilized for the genome-wide association study and genomic prediction. The results illustrate the reliability and advantage of ML-enabled image-phenotyping pipeline by identifying previously reported locus and a novel locus harboring a gene homolog involved in iron acquisition. This study demonstrates a promising path for integrating the phenotyping pipeline into genomic prediction, and provides a systematic framework enabling robust and quicker phenotyping through ground-based systems.

Pub.: 09 Mar '17, Pinned: 12 Apr '17