Joining the Bhatt lab

My third lab rotation in my first year at Stanford took a different path than most of my previous experience. I came to Stanford expecting to research chromatin structure – 3D conformation, gene expression, functional consequences. My post history makes this interest obvious, and people in my class even referred to me as the “Chromatin Structure Guy.” However, approaching my third quarter lab rotation, I was looking for something a little different. Rotations are a great time to try something new in a research area you’re not experienced in.

I decided to rotate in Dr. Ami Bhatt’s lab. She’s an MD/PhD broadly interested in the human microbiome and its influence on human health. With dual appointments in the departments of Genetics and Hematology, she has great clinical research projects as well. Plus, the lab does interesting method development on new sequencing technologies, DNA extraction protocols and bioinformatics techniques. The microbiome research area is rapidly expanding, as gut microbial composition has been shown to play a role in a huge range of human health conditions, from psychiatry to cancer immunotherapy response. “What a great chance to work on something new for a few months!” I told myself. “I can always go back to a chromatin lab after the rotation is over.”

I never thought I would find the research so interesting, and like the lab so much.

So, I joined a poop lab. I’ll let that one sink in. We work with stool samples so much that we have to make light of it. Stool jokes are commonplace, even encouraged, in lab meeting presentations. Everyone in the lab is required to make their own “poo-moji” after they join.

My poo-moji. What a likeness!

I did my inaugural microbial DNA extraction from stool samples last week. It didn’t smell nearly as bad as I expected. Still, running this protocol always has me thinking about the potential for things to end very badly:

  1. Place frozen stool in buffer
  2. Heat to 85 degrees C
  3. Vortex violently for 1 minute
  4. ….

Yes, we have tubes full of liquid poo, heated to nearly boiling temperature, shaking about violently on the bench! You can bet I made sure those caps were on tight.

Jokes aside, my interest in this field continues to grow the more I read about the microbiome. As a start, here are some of the genomics and methods topics I find interesting at the moment:

  • Metagenomic binning. Metagenomics often centers around working on organisms without a reference genome – maybe the organism has never been sequenced before, or it has diverged so much from a reference that it’s essentially useless. Without aligning to a reference sequence, how can we cluster contigs assembled from a metagenomic sequencing experiment such that a cluster likely represents a single organism?
  • Linked reads, which provide long-range information to a typical short read genome sequencing dataset. They can massively aid in assembly and recovery of complete genomes from a metagenome.
  • k-mer analysis. How can short sequences of DNA be used to quickly classify a sequencing read, or determine if a particular organism is in a metagenomic sample? This hearkens back to some research I did in undergrad on tetranucleotide usage in bacteriophage genomes. Maybe this field isn’t too foreign after all!
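
To make the k-mer idea concrete, here is a minimal sketch of tetranucleotide counting in plain Python. It is only an illustration, not the method any particular classifier uses: the function names, the toy read and the choice to skip k-mers containing ambiguous bases are all assumptions.

```python
from collections import Counter


def kmer_counts(sequence, k=4):
    """Count every overlapping k-mer in a DNA sequence, skipping ambiguous bases."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Assumption: k-mers containing anything other than A, C, G, T are ignored.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_frequencies(sequence, k=4):
    """Normalize k-mer counts to frequencies that sum to 1."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


read = "ACGTACGTNACGT"  # hypothetical toy read
print(kmer_counts(read, k=4).most_common(2))
```

Comparing frequency profiles like these between a read and a set of reference genomes is one simple way a read could be assigned to an organism without a full alignment.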

On the biological side, there’s almost too much to list. It seems like the microbiome plays a role in every bodily process involving metabolism or the immune system. Yes, that’s basically everything. For a start:

  • Establishment of the microbiome. A newborn’s immune system has to tolerate microbes in the gut without mounting an immune overreaction, but also has to prevent pathogenic organisms from taking hold. The delicate interplay between these processes, and how the balance is maintained, is very interesting to me.
  • The microbiome’s role in cancer immunotherapy. Mice without a microbiome respond poorly to cancer immunotherapy, and the efficacy of treatment can reliably be altered with antibiotics. Although researchers have shown certain bacterial groups are associated with better or worse outcomes in patients, I’d really like to move this research beyond correlative analysis.
  • Fecal microbial transplants (FMT) for Clostridium difficile infection. FMT is one of the most effective ways to treat C. difficile, an infection typically acquired in hospitals and nursing homes that costs tens of thousands of lives per year. Transferring microbes from a healthy donor to an infected patient is one of the best treatments, but we’re not sure of the specifics of how it works. Which microbes are necessary and sufficient to displace C. diff? Attempts to engineer a curative community of bacteria by selecting individual strains have failed. Can we do better by comparing simplified microbial communities from a stool donor?

Honestly, it feels great to be done with rotations and to have a decided lab “home.” With the first year of graduate school almost over, I can now spend my time in more focused research and avoid classes for the time being. More microbiome posts to come soon!

Deep learning to understand and predict single-cell chromatin structure

In my last post, I described how to simulate ensembles of structures representing the 3D conformation of chromatin inside the nucleus. Now, I’m going to describe some of my research to use deep learning methods, particularly an autoencoder/decoder, to do some interesting things with this data:

  • Cluster structures from individual cells. The autoencoder should be able to learn a reduced-dimensionality representation of the data that will allow better clustering.
  • Reduce noise in experimental data.
  • Predict missing points in experimental data.

Something I learned early on rotating in the Kundaje lab at Stanford is that deep learning methods might seem domain specific at first. However, if you can translate your data and question into a problem that has already been studied by other researchers, you can benefit from their work and expertise. For example, if I want to use deep learning methods on 3D chromatin structure data, that will be difficult because few methods have been developed to work on point coordinates in 3D. However, the field of image processing has a wealth of deep learning research. A 3D structure can easily be represented by a 2D distance or contact map – essentially a grayscale image. By translating a 3D structure problem into a 2D image problem, we can use many of the methods and techniques already developed for image processing.
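
As a rough illustration of that translation, the sketch below turns an ordered list of 3D points into a pairwise distance map and rescales it so it can be read as a grayscale image. The random-walk structure, the function names and the divide-by-maximum scaling are assumptions made for the example, not a description of the actual pipeline.

```python
import math
import random


def distance_map(points):
    """Return the n x n matrix of Euclidean distances between ordered 3D points."""
    return [[math.dist(p, q) for q in points] for p in points]


def to_grayscale(matrix):
    """Scale a distance map into [0, 1] by dividing by its largest entry."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [row[:] for row in matrix]
    return [[value / largest for value in row] for row in matrix]


# Hypothetical toy structure: a 64-point random walk standing in for a
# simulated chromatin conformation.
random.seed(0)
points = [(0.0, 0.0, 0.0)]
for _ in range(63):
    x, y, z = points[-1]
    points.append((x + random.uniform(-1, 1),
                   y + random.uniform(-1, 1),
                   z + random.uniform(-1, 1)))

image = to_grayscale(distance_map(points))
print(len(image), len(image[0]))
```

The resulting matrix is symmetric with a zero diagonal, and each entry plays the role of one pixel intensity.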

Autoencoders and decoders

The primary model I’m going to use is a convolutional autoencoder. I’m not going into depth about the model here; see this post for an excellent review. Conceptually, an autoencoder learns a reduced representation of the input by passing it through (several) layers of convolutional filters. The reverse operation, decoding, attempts to reconstruct the original information from the reduced representation. The loss function is some difference between the input and reconstructed data, and training iteratively optimizes the weights of the model to minimize the loss.

In this simple example, an autoencoder and decoder can be thought of as squishing the input image down to a compressed encoding, then reconstructing it to the original size (decoding). The reconstruction will not be perfect, but the difference between the input and output will be minimized in training. (Source)

Data processing

In this post I’m going to be using exclusively simulated 3D structures. Each structure starts as 64 ordered 3D points, a 64×3 matrix with x,y,z coordinates. Calculating the pairwise distance between all points gives a 64×64 distance matrix. The matrix is normalized to be in [0, 1]. The matrix is also symmetric and has a diagonal of zero by definition. I used ten thousand structures simulated with the molecular dynamics pipeline, with an attempt to pick independent draws from the MD simulation. The data was split 80/20 between training and

Model architecture

For the clustering autoencoder, the goal is to reduce dimensionality as much as possible while still retaining good input information. We will accept modest loss for a significant reduction in dimensionality. I used 4 convolutional layers with 2×2 max pooling between layers. The final encoding layer was a dense layer. The decoder is essentially the inverse, with upscaling layers instead of max pooling. I implemented this model in Python using Keras with the Theano backend.

Dealing with distance map properties

The 2D distance maps I’m working with are symmetric and have a diagonal of zero. First, I tried to learn these properties through a custom regression loss function, minimizing the distance between a point i,j and its pair j,i for example. This proved to be too cumbersome, so I simply freed the model from learning these properties by using custom layers. Details of the implementation are below, because they took me a while to figure out! One custom layer sets the diagonal to zero at the end of the decoding step; the other averages the upper and lower triangle of the matrix to enforce symmetry.
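
The arithmetic behind those two layers can be sketched in plain Python. This is not the Keras layer code itself, just the operations such layers would apply to each decoded matrix; the function names and the tiny 3×3 example are hypothetical.

```python
def symmetrize(matrix):
    """Average each entry with its transpose partner so out[i][j] == out[j][i]."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]


def zero_diagonal(matrix):
    """Return a copy of the matrix with every diagonal entry set to zero."""
    return [
        [0.0 if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Hypothetical raw decoder output for a 3-point structure.
decoded = [
    [0.1, 0.4, 0.9],
    [0.6, 0.2, 0.3],
    [0.7, 0.5, 0.0],
]
cleaned = zero_diagonal(symmetrize(decoded))
for row in cleaned:
    print(row)
```

Because both steps are fixed operations with no trainable weights, the network never has to spend capacity learning symmetry or the zero diagonal.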

Clustering single-cell chromatin structure data

No real clustering here…

In the past I’ve attempted to visualize and cluster single-cell chromatin structure data. Pretty much any way I tried, on simulated and true experimental data, resulted in the “cloud” – no real variation captured by the axes. In this t-SNE plot from simulated 3D structures collapsed to 2D maps, you can see some regions of higher density, but no true clusters emerging. The output layer of the autoencoder ideally contains much of the information in the original image, at a much reduced size. By clustering this output, we will hopefully capture more meaningful variation and better discrete grouping.

Groupings of similar folding in the 3D structure!

Here are the results of clustering the reduced dimensionality representations learned by the autoencoder. I’m using the PHATE method here, which seems especially applicable if chromatin is thought to have the ability to diffuse through a set of states. Each point is represented by the decoded output in this map. You can see images with similar structure, blocks that look like topologically associated domains, start to group together, indicating similarities in the input. There’s still much work to be done here, and I don’t think clear clusters would emerge even with perfect data – the space of 3D structures is just too continuous.

Denoising and inpainting

I am particularly surprised and impressed with the usage of deep learning for image superresolution and image inpainting. The results of some of the state of the art research are shocking – the network is able to increase the resolution of a blurred image almost to original quality, or find pixels that match a scene when the information is totally absent.

With these ideas in mind, I thought I could use a similar approach to reduce noise and “inpaint” missing data in simulated chromatin structures. These tasks also use an autoencoder/decoder architecture, but I don’t care about the size of the latent space representation, so the model can be much larger. In experimental data obtained from a high-powered fluorescence microscope, some points are outliers: they appear far away from the rest of the structure and indicate something went wrong with hybridization of fluorescence probes to chromatin or the spot fitting algorithm. Some points are missed entirely; when condensed to a 2D map, these show up as entire rows and columns of missing data.

To train a model to solve these tasks, I artificially created noise or missing data in the simulated structures. Then the autoencoder/decoder was trained to predict the original, unperturbed distance matrix.
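
One way such perturbations could be generated is sketched below. Everything specific here is an assumption for illustration: the outlier fraction, the Gaussian displacement scale, the use of 0.0 as the fill value for missing entries and the straight-line toy structure.

```python
import math
import random


def add_outliers(points, fraction=0.05, scale=10.0, rng=random):
    """Displace a random subset of points far from their true position."""
    noisy = [tuple(p) for p in points]
    n_outliers = max(1, round(fraction * len(points)))
    chosen = rng.sample(range(len(points)), n_outliers)
    for index in chosen:
        noisy[index] = tuple(c + rng.gauss(0, scale) for c in noisy[index])
    return noisy, sorted(chosen)


def mask_missing(matrix, missing, fill=0.0):
    """Blank out the full row and column of every missing point."""
    missing = set(missing)
    return [
        [fill if i in missing or j in missing else value
         for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


rng = random.Random(1)
clean_points = [(float(i), 0.0, 0.0) for i in range(64)]  # hypothetical structure
clean_map = [[math.dist(p, q) for q in clean_points] for p in clean_points]

noisy_points, outliers = add_outliers(clean_points, rng=rng)
noisy_map = [[math.dist(p, q) for q in noisy_points] for p in noisy_points]
dropped = rng.sample(range(64), 3)
masked_map = mask_missing(clean_map, dropped)

# Training pairs would then be (noisy_map, clean_map) and (masked_map, clean_map).
print(len(outliers), sorted(dropped))
```

Since the clean map is known for every simulated structure, the perturbed version serves as the input and the clean version as the target.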

Here’s an example result. As you can see, the large scale features of the distance map are recovered, but the map remains noisy and undefined. Clearly the model is learning something, but it can’t perfectly predict the input distance map.

Conclusions

By transferring a problem of 3D points to a problem of 2D distance matrices, I was able to use established deep learning techniques to work on single-cell chromatin structure data. Here I only showed simulated structures, because the experimental data is still unpublished! Using an autoencoder/decoder model, we were able to better cluster distance maps into groups that represent shared structures in 3D. We were also able to achieve moderate denoising and inpainting with an autoencoder.

If I were to continue this work, there are a few areas I would focus on:

  • Deep learning on 3D structures themselves. This has been used in protein structure prediction [ref]. You can also use a voxel representation, where each voxel can be occupied or unoccupied by a point. My friend Eli Draizen is working on a similar problem.
  • Can you train a model on simulated data, where you have effectively infinite sample size and can control properties like noise, and then apply it to real-world experimental data?
  • By working exclusively with 2D images we lose a lot of information about the input structure. For example the output distance maps don’t have to obey the triangle inequality. We could use a method like Multi-Dimensional Scaling to get a 3D structure from an outputted 2D distance map, then compute distances again, and use this in the loss function.
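
The triangle inequality point is easy to check directly. The sketch below lists every triple of points in a distance map that violates it; the function name, the tolerance and the two toy matrices are hypothetical.

```python
from itertools import permutations


def triangle_violations(matrix, tolerance=1e-9):
    """List (i, j, k) triples where d(i, k) > d(i, j) + d(j, k)."""
    n = len(matrix)
    return [
        (i, j, k)
        for i, j, k in permutations(range(n), 3)
        if matrix[i][k] > matrix[i][j] + matrix[j][k] + tolerance
    ]


# Hypothetical 3-point maps: the first comes from points on a line,
# the second cannot come from any real set of points.
valid = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
invalid = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
print(triangle_violations(valid))    # []
print(triangle_violations(invalid))  # [(0, 1, 2), (2, 1, 0)]
```

A count of violations like this could serve as a quick sanity check on decoded maps. The check is cubic in the number of points, which is manageable for 64-point structures.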

Overall, though, this was an interesting project and a great way to learn about implementing a deep learning model in Keras!

Academia or Industry?

That question seems to be on the mind of a lot of the people around me lately. Junior year of my undergrad studies at Brown is almost over, and my classmates and I are starting to think of what our lives will be like after next May. For people interested in the sciences, especially biology and biotech related fields, there are two main options everyone considers: should I get a job right out of undergrad, with only a bachelor’s degree? Or should I stay in school for another 5+ years for a Ph.D? There are obvious benefits and costs to each (which I’ll cover in a later post), and everyone wants to know exactly what choice will be best for them.

Up until recently, I have wanted to go to graduate school the fall after finishing at Brown. I feel like I’m ready for what the process entails, thanks in part to the books I’ve been reading by recent (either delighted or regretful) Ph.D recipients. But now, I’m not so sure. The Brown Club of Boston Biotech Conference was a big factor in this – listening to people in industry talk about the opportunities for bachelor’s degree holders was eye opening. The starting salaries they mentioned were impressive (and not much less than what you’d get as a Ph.D). The projects looked interesting. And most importantly, the jobs are there.

I’m considering taking a year or two to work in industry before committing to grad school. After all, what’s a year delay when you’re going to be in school for another 5-7?

Books on Graduate School

Students are pretty much on their own for figuring things out after undergrad (I definitely don’t miss high school guidance counselors, though). Grad school is a confusing topic – I find that everyone talks about it, but some people don’t really know what to expect if they commit the next 5 years of their life to getting another degree. Asking students in your lab, professors and graduating seniors is a great way to get information, but it’s still hard to get a good picture of what the whole process is like. Over winter break this year I found a few good books that summarize what you can expect at a graduate program in the sciences.

  • The Ph.D Grind by Philip Guo
    This short memoir details Philip’s time at Stanford from 2006 to 2012 while he worked toward a Ph.D in computer science. His story is filled with countless ups and downs, tips to get work published and lessons about working efficiently and maintaining sanity during stressful times. It’s a free ebook, quick read, and definitely recommended (even if you’re not considering computer science).
  • Getting What You Came For by Robert Peters
    The second edition of this book was published in 1992 so it’s a little outdated (it references word processors like they’re the next big thing, for example). It is still a good general account of the entire graduate school process, though. Getting What You Came For covers everything you should be doing in undergrad, how to pick a graduate school that’s a good fit for you, how to navigate your first years in the program while finding an advisor and taking quals, how to get the thesis done, and much more. It’s definitely not a quick read, but I’m keeping a copy for reference.
  • The Ph.D Process by Dale Bloom, Jonathan Karp and Nicholas Cohen
    Another book that lays out each step in the process of getting a Ph.D. Kind of outdated, but it has anecdotes and advice from students that I think are still relevant and useful. The ebook was free through the library so I gave it a read.
  • A Guide to Academia by Prosanta Chakrabarty
    This one covers the entirety of grad school and also has information for postdocs and people looking for their first job. It’s a pretty solid overview. I think it was important to read ahead to learn about life after grad school and what to expect for the job market in academia. Also free through the library at Brown.

Of course, this is not an exhaustive list! If you have any suggestions on other books, or books that compare academia to industry, leave me a comment below.

Brown Club of Boston Biotech Conference

I just returned from an excellent conference sponsored by the Brown Club of Boston. The conference was designed to facilitate networking between Brown alumni, current students and leaders in all aspects of the biotechnology field. There was a panel of three interesting speakers, each of whom talked about their unique experience in the biotech industry.

  • Angus McQuilken, VP of Communications and Marketing at the Massachusetts Life Sciences Center

Angus spoke about the biotech boom that’s been happening in Massachusetts recently. A major reason for this is support from the state government – the Massachusetts Life Sciences Center was tasked with giving out $1 billion in funding to companies and individuals over a ten year period. Some of that money goes toward funding the internship challenge, a program that subsidizes internships for Mass residents and students working at biotech companies.

  • Mitch Sanders, Ph.D, founder and CEO of ECI Biotech

Mitch talked about his experience in biotech after his postdoc at MIT. His company is developing sensors that change color in the presence of certain bacteria and viruses – the tech is being applied to band-aids, prosthetics and food and will hopefully be on the market within the next couple of years. He also spoke about the ups and downs his company had been through in the past years and the importance of being able to adapt in the industry.

  • Elaine Crowley, founder and president of the Crowley Group

The Crowley Group is a consulting firm that provides “insightful executive coaching” to companies and individuals. Although her work is not limited to biotech, Elaine had some good lessons for the audience of students: finding employment that matches your culture is just as (if not more) important than the pay, prestige or benefits.

There was a Q&A session after the panelists spoke. I asked the three of them about the pros and cons of getting a Ph.D before working in the biotech industry, and the answers I received surprised me. More on that in a later post!

Many thanks to the members of the Brown Club of Boston who organized this event, especially Paula Freeman, who blogs about etiquette and lessons young people should learn over at jobetiquettebypaula.wordpress.com.