Machine learning-based design (of proteins, small molecules and beyond)

2 minute read


Title: Machine learning-based design (of proteins, small molecules and beyond)

Date: 2020-09-16

Series: Machine Learning in Medicine Seminar (

Speaker: Jennifer Listgarten

Summary and thoughts:

Today’s speaker at the machine learning in medicine seminar series was Dr. Jennifer Listgarten from University of California, Berkeley presented the talk on protein design with machine learning. While I have some experience in clinical informatics, this area of computational biology is still new to me and it was interesting to hear the conceptual talk in this very novel field. Dr. Listgarten touched on several areas of interest in the intersection of chemistry, molecular biology and computational science including epigenetics, confounder analysis with volume images, ML-based CRISPR (new term for me - tool for editing genomes) and latent space harmonization. The primary focus of the talk, however, was on protein design using machine learning - why do we need it, and how do we achieve it? I was previously unfamiliar with the term “protein re-engineering” - developing useful and valuable substances from existing proteins. It is a fascinating area of research with applications in fluorescent tagging, antibody development, gene therapy, and virus delivery. The high throughput of sequencing data available currently (and in the past few years) has made it possible to perform computations over the sequence space and use state-of-the-art machine learning methods for protein design.

The foremost challenge for protein design is the nearly infinite search space - the search space scales by about 20^L where L is the length of the sequence. Current approach for protein design (in vivo) called ‘Directed Evolution’ ( won the Nobel Prize in Chemistry in 2018 and is the inspiration behind Dr. Listgarten’s current research. To use machine learning methods to design proteins in silico (i.e. on computer or through computer simulation instead of in lab), there are two possible alternatives to the traditional approach - use machine learning for the screening process (selection from variant pool) and/or replace search (or diversification) of proteins with intelligent search. Through intelligent searching, it is possible to develop algorithms specially designed to search the protein space and get the same or better results with fewer measurements. Thus, given a predictive model (with some uncertainty) that takes in a model of protein (eg. DNA) and outputs a property of the protein (eg. fluorescence), the goal is the generate a set of candidates (of protein sequences) to maximize/specify the property (fluorescence in this case). With model-based optimization and search algorithms designed for protein space, this becomes a feasible (but still challenging) task. I found the speaker to be very engaging and it was disappointing that she ran out of time before reaching the results and applications part of the talk. Nevertheless, there are multiple sources and papers available for people interested in the research area. I found her faculty profile discussing her career trajectory very interesting too: