When we last left off, I was peering into the -80 freezer at the hundreds of stool samples I would need to analyze. In reality, a lot of experimental design work went into this project before I ever opened the freezer!
Designing a good experiment was one of the most important skills I learned in grad school. Science is already hard enough; you need to set yourself up for success from the beginning with a well-designed experiment, whether it’s wet lab or computational. I like to think about what success on a project would look like, then work backwards from there to understand the data I need to collect.
To convincingly show that a bacterium had been transmitted from the microbiome of one patient to the microbiome of another, I needed the following pieces of evidence (sketched as a simple check in the code after this list):
- At a given point in time, the bacterial genome was present in the microbiome of the source patient and undetectable in the microbiome of the recipient.
- At a later point in time, the bacterial genome was present in the microbiome of the recipient patient, and ideally persisted across multiple subsequent time points.
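To make those criteria concrete, here is a minimal sketch of the logic as a single check. Everything here is hypothetical for illustration; the data layout, detection threshold, and function are mine, not the paper’s actual implementation:

```python
# Hypothetical sketch of the transmission-evidence criteria above.
# `abundance` maps (patient_id, time_point) -> relative abundance of one genome.

DETECTION_THRESHOLD = 1e-4  # assumed limit of detection, for illustration


def consistent_with_transmission(abundance, source, recipient, t0, later_points):
    """True if the genome's pattern across time matches both criteria."""
    # Criterion 1: present in the source, undetectable in the recipient, at t0.
    present_in_source = abundance.get((source, t0), 0.0) >= DETECTION_THRESHOLD
    absent_in_recipient = abundance.get((recipient, t0), 0.0) < DETECTION_THRESHOLD
    # Criterion 2: detectable in the recipient at a later time point
    # (ideally persisting across several of them).
    appears_later = any(
        abundance.get((recipient, t), 0.0) >= DETECTION_THRESHOLD
        for t in later_points
    )
    return present_in_source and absent_in_recipient and appears_later


# e.g. genome seen at 0.2% in the source, absent in the recipient at t0,
# then at 0.05% in the recipient one month later:
abundance = {("P1", 0): 2e-3, ("P2", 0): 0.0, ("P2", 30): 5e-4}
print(consistent_with_transmission(abundance, "P1", "P2", 0, [30]))  # True
```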
Through Stanford Hospital, I also had access to a dataset of each patient’s room history. From this, I could determine when two patients were roommates. Mapping those overlapping room intervals onto the list of samples biobanked from each patient was a challenging data science problem. It took me about a month of work to design an experiment that would give me the best chance of observing patient-patient microbiome transmission, if it was happening.
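To give a flavor of the interval logic involved, here is a simplified sketch of finding when two patients shared a room. The data layout and values are invented for illustration; the real room-history dataset was far messier:

```python
from datetime import date

# Hypothetical room-history records: patient -> list of (room, start, end) stays.
room_history = {
    "patient_A": [("12-West-3", date(2019, 1, 2), date(2019, 1, 9))],
    "patient_B": [("12-West-3", date(2019, 1, 5), date(2019, 1, 14))],
}


def roommate_intervals(stays_a, stays_b):
    """Return (room, overlap_start, overlap_end) for every co-occupancy."""
    overlaps = []
    for room_a, start_a, end_a in stays_a:
        for room_b, start_b, end_b in stays_b:
            if room_a != room_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start <= end:  # the two stays actually overlap in time
                overlaps.append((room_a, start, end))
    return overlaps


print(roommate_intervals(room_history["patient_A"], room_history["patient_B"]))
# [('12-West-3', datetime.date(2019, 1, 5), datetime.date(2019, 1, 9))]
```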
The wet lab work for this project was long and monotonous. You can read about it in the methods section of the paper; in short, we did DNA extraction and 10X Genomics linked-read sequencing on all of the new samples.
When the new data came back, it was time to get cracking! The processing pipeline and data analysis I had planned would take too long to run on Stanford’s HPC cluster, so I turned to Google Cloud to parallelize everything and get it done quickly. Getting our workflows to run at scale in the cloud was certainly a learning experience, and I wrote a blog post about the effort two years ago.
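The details live in that post, but the core pattern was an embarrassingly parallel fan-out: one independent job per sample. Here is a toy stand-in using only Python’s standard library; the sample IDs and per-sample step are made up, and the real pipeline submitted containerized jobs to Google Cloud rather than running locally like this:

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical sample list; the real batch was hundreds of microbiome samples.
samples = [f"S{i:03d}" for i in range(1, 9)]


def process_sample(sample_id: str) -> str:
    # Placeholder for the real per-sample work (assembly, strain profiling).
    # Each sample is independent of the others, so the whole batch can fan
    # out across as many workers as the cluster provides.
    return f"processed {sample_id}"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(process_sample, samples):
            print(result)
```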
After assembling bacterial genomes from hundreds of microbiome samples, comparing strain-level populations with inStrain, and generating massive matrices of pairwise comparisons between all of the genomes in my samples, the true data analysis began.
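Conceptually, those comparisons boil down to an all-vs-all matrix of genome-pair similarities, scanned for near-identical genomes in different patients. A toy sketch of that structure follows; the identity function, values, and threshold are placeholders, not inStrain’s actual popANI calculation:

```python
import itertools

# Hypothetical inputs: which patient each assembled genome came from, plus a
# placeholder pairwise identity score standing in for inStrain's comparisons.
genome_patient = {"gA": "P1", "gB": "P2", "gC": "P2"}


def pairwise_identity(g1: str, g2: str) -> float:
    return 0.99999 if {g1, g2} == {"gA", "gB"} else 0.98  # placeholder values


NEAR_IDENTICAL = 0.9999  # assumed strictness, for illustration

# Scan the upper triangle of the all-vs-all comparison matrix.
for g1, g2 in itertools.combinations(genome_patient, 2):
    if (genome_patient[g1] != genome_patient[g2]
            and pairwise_identity(g1, g2) >= NEAR_IDENTICAL):
        print(f"candidate transmission: {g1} ({genome_patient[g1]}) "
              f"<-> {g2} ({genome_patient[g2]})")
```

A few key lessons from the data analysis and writing experience have stuck with me, and the challenges made me a better scientist.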
- Scrutinize your results! When I initially looked for identical bacterial genomes in samples from different patients, I found many “transmission events” that were simply the result of barcode swapping, a small degree of cross-contamination between samples sequenced on the same Illumina run. I was prepared for this outcome, and developed a method to quantify when identical genomes were likely the result of barcode swapping in the linked-read data (a heuristic in the same spirit is sketched after this list).
- Carefully evaluate negative findings. After eliminating all of the likely false positives, I found very few identical genomes shared between patients, and especially few antibiotic-resistant pathogens. At first, this was an upsetting result; I was really hoping to find lots of transmission between patients who were roommates! However, the lack of pathogen transmission allowed me to focus on the potentially more interesting cases of commensal bacteria transmitted between patients. The “negative” finding turned out to make a more interesting story.
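For the barcode-swapping point above, the intuition is that swapped barcodes deposit only a faint trace of reads: a genuinely shared genome should have substantial coverage in both samples, while a swap artifact shows deep coverage in one sample and only trace coverage in a sample from the same sequencing run. Here is a hypothetical heuristic in that spirit; the threshold and function are mine, not the method from the paper:

```python
# Hypothetical heuristic, not the paper's actual method: barcode swapping
# deposits only a trace of reads, so a "shared" genome with deep coverage in
# one sample but near-zero coverage in a co-sequenced sample is suspect.

SWAP_RATIO = 0.001  # assumed: flag if trace coverage is <0.1% of the donor's


def likely_barcode_swap(coverage_a, coverage_b, same_sequencing_run):
    """Flag an identical-genome call as probable barcode swapping."""
    if not same_sequencing_run:
        return False  # swapping only happens between samples on the same run
    low, high = sorted([coverage_a, coverage_b])
    return high > 0 and (low / high) < SWAP_RATIO


# e.g. 80x in the source sample vs 0.02x in the putative recipient, same run:
print(likely_barcode_swap(80.0, 0.02, same_sequencing_run=True))  # True
```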