Introducing the ComputeBench: the cloud development environment for bioinformatics

The following is a cross-post from the Deep Origin blog.

Today, I’m proud to announce the beta release of Deep Origin’s first product: the ComputeBench. The ComputeBench is a cloud-based environment for interactive analysis in computational biology and bioinformatics. It’s designed from the ground up for computational scientists – providing you the compute and storage resources you need, the software packages you expect, and tools for seamless collaboration with your team, all in a secure, backed-up and scalable platform.

We’re building Deep Origin for teams of computational scientists to work in a collaborative, cloud-based environment without worrying about DevOps and IT. In short, our mission is to let scientists focus on science. What we’re releasing today is the first stop in that longer-term vision. Read on to learn more about ComputeBenches, or check out the product page, pricing, and request access.

An overview of the ComputeBench.

After starting a ComputeBench, it’s one click to drop into the development environment you’re most comfortable with. Computational power, storage, and hundreds of packages are available at your fingertips!

Two trends in biological discovery highlight the importance of developing better tooling for computational biology:

  • Biological discoveries are increasingly driven by massive amounts of heterogeneous data. Look no further than the move to single-cell, spatial, long read, or multi-omic profiling, or the promises of the newest instrument vendors.
  • Biological discoveries are increasingly collaborative. In both academia and industry, science doesn’t happen in a vacuum, especially when an analysis spans different modalities and areas of expertise.

Meanwhile, tools for collaborative computational biology at scale are lacking, and scientists frequently waste hours or days solving problems that we are all too familiar with. How long have you spent troubleshooting a package installation, setting up the right infrastructure and permissions to share data with a collaborator, or reproducing an analysis from a publication? How many teams have built undifferentiated cloud infrastructure just to get to the point that they can start to do their job?

We built the ComputeBench for these scientists. We want to give every scientist superpowers to scale their analysis, without the boring stuff getting in the way. A ComputeBench has the following key features:

  • Scalable hardware: from 2-192 vCPU, 4-1536 GB of RAM, up to 16 TB of local persistent storage, and NVIDIA GPUs if you need them.
  • Software blueprints: collections of hundreds of pre-installed, validated, and versioned tools for different scientific domains. Use the tools in the blueprints, or use them as a jumping off point to customize and create your own environment. Examples of our blueprints include metagenomics, RNA sequencing, and single-cell sequencing.
  • Interact the way you want: Each blueprint provides a number of user-friendly web-based applications, like JupyterLab, RStudio, VS Code server, R Shiny, or CELLxGENE. You can also connect over SSH, and you have root access to modify your bench as you see fit.
  • Storage volumes: performant, scalable storage that can be accessed by all of your ComputeBenches to share data among your team.
  • Credential and secret management: Store your secrets and preferences, and automatically load them in every ComputeBench that you create. Share team-level variables and secrets, as well.
  • Usage and cost controls: Automatically stop idle ComputeBenches, and view transparent, up-front pricing before you create any resources.
  • Invoices for humans: Deep Origin customers receive invoices in terms that are easy to understand and traceable to a particular user or ComputeBench.

We think ComputeBenches provide the best experience for interactive analysis in computational biology – and we bet you’ll agree once you try one. To get started, we’re offering new users $500 in credits to try working on the Deep Origin platform. Since much of the benefit of our platform comes from working with your team members, we’ll also add $500 for each regular user you bring on to your organization.

Claim your $500 in credits here.

Finally, we’re always looking for ways to improve the platform. Next on our roadmap are features allowing users to submit their own software blueprints for use within their organization, and tools to manage data in buckets, both hosted by Deep Origin and elsewhere. If you have feature requests, suggestions of packages to add to a blueprint, or other feedback, please let us know.

Accelerated computing is the future of genomics

“We’re out of storage, and we’re out of compute.” I’ll never forget the 2016 Broad Institute Cancer Program meeting where Eric Banks, senior director of the Data Sciences Platform, showed the audience how much genomic data the Broad was generating. The exponential curve was plotted against the current capacity of the on-premises compute cluster – the time until intersection of the curves could be measured in months. In fact, we were generating data even faster than we could add new storage drives. This meeting sparked the Broad’s move to the cloud for the majority of data storage and compute.

While moving to the cloud may have been the simple answer seven years ago, it’s not a catch-all solution today. Genome sequencing costs have dropped precipitously, and newer high-content assays like single-cell sequencing and spatial transcriptomics are being developed every day. Storing and processing all these data requires a massive amount of resources, and a cloud bill to match (even if you’re trying to do the analysis on the cheap). We need new, more efficient ways to process, store and transfer genomic data. Enter accelerated computing.

Accelerated computing, the use of special-purpose hardware for specific computational tasks, can help solve many of the problems facing genomics today. Using Graphics Processing Units (GPUs) and similar hardware, custom-developed algorithms can reduce the time needed to run an analysis by a factor of 10 to 100. This reduction in compute time has paved the way for more efficient data processing and real-time analysis in clinical genomics.

In this series, I’ll cover three topics related to accelerated computing in genomics:

  • An overview of the basics of accelerated computing, the popular tools, and the companies developing them (this post).
  • Practical considerations. How can you use these tools today? (coming soon)
  • Algorithmic details. How do accelerated tools work to drastically decrease the runtime of common tasks? What problems in genomics are amenable to acceleration? (a future post)

Accelerated computing is a general term describing the use of specialized hardware to speed up a certain computation. Commonly used hardware in the genomics space includes:

  • Graphics Processing Units (GPUs)
  • Field Programmable Gate Arrays (FPGAs)
  • Application-Specific Integrated Circuits (ASICs)
  • Tensor Processing Units (TPUs)

While Central Processing Units (CPUs) excel at general-purpose tasks, they lack the ability to run many computations in parallel. GPUs are the opposite. While they were originally developed for video game rendering, GPUs excel at parallel computations. They can have thousands of processor cores, each capable of running a calculation in parallel. Accelerated computing takes advantage of each hardware type for the tasks it is best at. Control functionality and single-threaded work is left to the CPU, and parallelizable computations are done on the GPU. NVIDIA Clara Parabricks leads the way in GPU-accelerated genomics.

FPGAs allow for the hardware to be reconfigured on the fly to run a specific algorithm. They are used in the Illumina DRAGEN tools I’ll cover later. ASICs are less common. They have a fixed configuration and perform a limited set of functions, so they’re best used in very specific settings, like controlling the pores on an Oxford Nanopore MinION. TPUs are used in training ML models to interpret genomic data, but not in the processing of the data directly.

GPU-based training of deep learning models is standard and supported by key libraries in that ecosystem, but in typical genomics fashion, the field is 5-10 years behind other industries. We are starting to see GPU-based genomics tools being released, but they’re closed source and still gaining traction. By contrast, PyTorch, a popular open source machine learning framework, was released in 2016! My prediction is that accelerated tools will become standard in genomics as well, but there’s a lot of work to do to get there.

How can I use hardware acceleration in genomics today?

Your best bet is to use one of these developed toolkits. If you don’t have access to a machine with GPUs or FPGAs, they can be rented from AWS or GCP for a low hourly fee.

NVIDIA Clara Parabricks: GPU-accelerated alignment, variant calling, and RNA-seq

NVIDIA’s entry into GPU-accelerated genomics is the result of the 2019 acquisition of the software startup Parabricks. NVIDIA first released the software as closed access, but with the version 4.0 release in 2022, anyone can download the Docker container and run the software. Parabricks runs on most modern NVIDIA GPUs and accelerates alignment of DNA and RNA-seq data, variant calling, and other time-intensive processes by up to 80x (using a multi-GPU machine). Parabricks was designed as a drop-in replacement for common tools like GATK, and is guaranteed to produce output identical to certain GATK versions. Running the software is simple: pull the Docker container at nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 and run one of the command-line tools.
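
For example, a minimal run might look like the sketch below. The reference path, FASTQ names, and the /data bind mount are placeholders, and the fq2bam options shown are only a subset – check the Parabricks documentation for your version.

# Pull the public Parabricks container
docker pull nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1

# Align a paired-end sample and produce a sorted BAM with the fq2bam tool
# (--gpus all exposes the host GPUs to the container)
docker run --rm --gpus all -v $(pwd):/data \
    nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 \
    pbrun fq2bam \
        --ref /data/ref/GRCh38.fa \
        --in-fq /data/fastq/sample_R1.fastq.gz /data/fastq/sample_R2.fastq.gz \
        --out-bam /data/sample.bam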

The strengths of Parabricks lie in its ease of use, wide applicability, and cost effectiveness. GPUs are everywhere: on the cloud, in gaming PCs, and in servers used for ML/AI training. The Docker container is available for anyone to try for free, without a license or multiple sales calls. Parabricks also attempts to automate some processes that may have been spread across multiple tools in the past: alignments are always coordinate-sorted, for example.

The weaknesses of Parabricks come down to the limited functionality and lack of integration, at least compared to DRAGEN. Parabricks doesn’t have the advanced functionality of DRAGEN for tasks like single-cell sequencing and star-allele calling for pharmacogenomics. And you obviously can’t buy an Illumina sequencer with a competitor’s hardware and software in it!

Illumina DRAGEN: FPGA-accelerated Bio-IT platform

Illumina recognized the challenges in storage and processing of genomic data early, and acquired Edico Genome and the DRAGEN Bio-IT platform in 2018 to architect a solution. DRAGEN uses FPGAs to speed up generation of FASTQ files, alignment, variant calling, and many other processes. In addition to a standard GATK implementation, the DRAGEN team designed its own variant calling algorithms, which won two of the three Illumina categories in the PrecisionFDA Truth Challenge V2. DRAGEN also provides new algorithms for accelerated single-cell genomics, star-allele calling, and other processes.

The strengths of DRAGEN lie in the tight integration with Illumina products. You can buy an Illumina sequencer with a DRAGEN server built in, so that everything up to variant calling can be completed on the sequencer. This means the large raw data files never have to be transferred to the cloud or elsewhere, saving on storage costs (as long as you don’t need the raw data for backup or compliance purposes). The accuracy, speed, and continued improvement of the algorithms are another key advantage.

The weaknesses of DRAGEN come with the costs of using the software. Since Illumina doesn’t benefit from the purchase of FPGAs, they charge a LOT for the DRAGEN license. In fact, the license cost is 80% of the total cost when running DRAGEN on the cloud! This deters researchers in academia and lower-resourced companies from using DRAGEN, and may push them to a free alternative instead.

Nanopore and PacBio sequencers: accelerated computing right under the hood

Oxford Nanopore uses hardware acceleration at multiple places in their long-read sequencers. Pores on the flowcells are controlled by ASICs, and the more advanced multi-flowcell workstations use GPUs to accelerate data analysis. The Nanopore PromethION comes with 4 NVIDIA A100 GPUs, for example. PacBio’s new Revio sequencer has a similar arrangement, with on-board GPUs to speed up the processing of raw data. While Nanopore and PacBio sequencers both take advantage of hardware acceleration, there’s much less direct interaction with the algorithms compared to the user-facing toolkits above.

Where’s the PyTorch of genomics?

All of the accelerated genomics toolkits I’ve talked about today are being developed closed-source by publicly traded companies. That’s great for efficient development of high-performance code, but it shuts out the community of developers in academia and lower-resourced companies who might use and contribute to the code. NVIDIA had GenomeWorks, but that hasn’t seen a commit in a year and a half. Some other groups are repurposing GPU-accelerated Python libraries for single-cell analysis.

If you’re working on an open-source GPU genomics toolkit, I’d love to hear about it.

One final thought: the story of how GPUs transitioned from gaming devices to general-purpose compute accelerators is both fascinating and entertaining. It all started with a quantum chemistry professor at Stanford buying NVIDIA gaming cards in 2006 and hacking them to do the computation he needed. Acquired has a great podcast on the topic.

Bioinformatics in the cloud, on a budget

Let’s say you’re a biotech or academic lab that needs to do bioinformatics or computational biology at a reasonably large scale. You have a tight budget and you want to be as cost effective as possible. You also don’t want to build and maintain your own hardware, because you recognize the hidden costs baked into the time, effort, and security of doing so. Luckily, the last few years have seen a proliferation of “alternative” cloud providers. These providers compete with AWS, GCP, and Azure by doing a few things really well at greatly reduced prices. My main argument in this post is that by mixing services from different cloud providers, budget and cloud can mix, despite the prevailing pessimistic opinions.

To be upfront, I believe working with one of the larger public cloud providers will make your life easier and allow you to deliver results faster, with less engineering expertise. AWS has services that cover everything a biotech needs to process data in the cloud, and the integration between these services is seamless and efficient. But we’re not going for easy here, right? We’re going for cheap. And cheap means cutting some corners and making things more difficult in the name of saving your valuable dollars.

What’s the problem with the big public cloud providers? AWS allows a team to build any product imaginable and scale it infinitely. Need to build a Netflix competitor that can deliver video with low latency and maximum uptime to every corner of the world? AWS will let you do that (and bill you appropriately). With this plethora of features come many hidden costs. It can seem like AWS intentionally makes their billing practices opaque, allowing you to rack up massive bills by leaving a service running or enabling features you don’t need. In the future, I’ll do a separate post on keeping AWS costs manageable. For now, just know that you have to be careful or you can get burned – I personally know several individuals who have made costly mistakes here. Even when just looking at raw compute, AWS is priced at a large premium compared to competitors on the market. You pay for the performance, uptime, reliability, interoperability, and support.

The minimum viable bioinformatics cloud

With that out of the way, it’s time to design our bioinformatics cloud! The minimum capabilities of a system supporting a bioinformatics team include: 

    1. Interactive compute for experimentation, prototyping workflows, programming in Jupyter and RStudio and generating figures. GPUs may be needed for training machine learning models. 
    2. Cloud storage that’s accessible to all team members and other services. Ideally this system supports cheap cold storage for infrequently accessed and backup data.
    3. Container registries. Batch workflows need to access a high-bandwidth container registry for custom private and public containers. 
    4. Scalable batch compute that can be managed by a workflow manager. A team should be able to easily 10-1000X their compute with a single command line argument or config change.
    5. GPUs, databases, and other add-ons, depending on the work the team is doing. 

Where can we cut corners?

Some of the features offered by AWS matter less to a bioinformatics team.

  • The final 10% optimization of latency, uptime and performance. In research, my day isn’t ruined if a workflow completes in 24 versus 22 hours – it’s still an overnight task. Similarly, an hour of downtime on a cluster for maintenance isn’t the end of the world – I always have papers I could be reading. Beyond some limit, increasing these metrics isn’t worth the additional cost.
  • Multi-region and multi-availability zone. We’re not building Netflix, or even publicly available services. All the compute can be in one region. 
  • Infinite hot storage. I’ve found that beyond a certain point, adding more hot storage doesn’t make a team more efficient, just lazy about cleaning data up. Not all data needs to be accessed with zero latency. There has to be something similar to Parkinson’s law for this case: left unchecked, data storage will expand to fill all available space. 
  • Infinitely scalable compute. Increasing parallelization of a workflow beyond a certain point often results in increased overhead and diminishing returns. While scalability is necessary, it doesn’t need to be truly infinite.

With these requirements and cost saving measures in mind, here’s my bioinformatics in the cloud on a budget “cookbook”.

1: Interactive compute

There are two ways teams typically handle this requirement: either provide a large, central compute server for all members to share, or allow team members to provision their own compute servers. The first option requires more central management, while the second relies on each team member being able to administer their own resources.

How it’s done on AWS: EC2 instances that are always running or provisioned on-demand. You can save by paying up-front for a dedicated EC2 instance, but there’s a sneaky $2/hour fee for this service that makes it cost-inefficient until you reach a large scale.

How it can be done cheaply: Hetzner is a German company that offers dedicated servers for 10-25% of the cost of AWS. You can either configure a new server with your desired capabilities for a small setup fee, or immediately lease an existing server available on their website. These servers can have up to 64 vCPU, 1 TB RAM, and 77 TB of flash storage. 20 TB of data egress traffic is included (which would cost you over $1,800 at AWS)!

If you want to use the Hetzner Storage Box and Cloud services I mention later, you’ll want to pick a server in Europe to keep all your services in the same data center. This can create lag when connecting from the US, so I recommend using mosh instead of SSH to minimize the impact of transatlantic latency. 
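
If you haven’t used mosh before, the setup is minimal. A sketch assuming an Ubuntu server and a hypothetical hostname – mosh must be installed on both ends, and UDP ports 60000-61000 need to be reachable:

# On the Hetzner server
sudo apt-get install -y mosh

# From your local machine: connects over SSH, then switches to mosh's UDP channel
mosh youruser@your-hetzner-server.example.com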

Where you cut corners: Hetzner servers are not as high powered as AWS EC2 instances, which can easily top out at over 128 vCPU. You can’t add GPUs or get very specific hardware configurations. Hetzner dedicated servers are billed per month, while AWS EC2 instances are billed per second, offering you more flexibility. Compared to AWS, there aren’t as many integrated services at Hetzner, and some users complain that there’s more scheduled maintenance downtime.

2: Cloud storage

How it’s done on AWS: S3 buckets or Elastic File System (EFS, their implementation of NFS). Storage tiers, and the AWS intelligent tiering service, allow archival storage to be very cheap.

How it can be done cheaply: Many companies now offer infinitely scalable cloud storage for significantly cheaper than S3. They also offer free or greatly reduced data transfer rates, which can help you avoid the obscene AWS egress fees. Two of my favorite providers are Backblaze B2 and Cloudflare R2. Both of these services can be accessed with the familiar S3 API. If this service is being used to store actively analyzed data, Cloudflare wins out. Zero egress fees make up for the increased storage cost. As soon as you egress more than you store per month, Cloudflare is cheaper than Backblaze.
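
Because both B2 and R2 speak the S3 API, the standard AWS CLI works once you point it at the provider’s endpoint. A sketch for Cloudflare R2 – the bucket name, profile name, and <ACCOUNT_ID> are placeholders for your own values:

# Store the R2 access key and secret as a named AWS CLI profile
aws configure --profile r2

# Copy data to an R2 bucket by overriding the S3 endpoint
aws s3 cp results/counts.tsv s3://my-r2-bucket/project1/ \
    --profile r2 \
    --endpoint-url https://<ACCOUNT_ID>.r2.cloudflarestorage.com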

Hetzner recently released Storage Boxes, which you can purchase in predefined sizes and get storage costs down to about $2/TB/month when fully utilized. The performance of the storage boxes is very high when transferring data within a Hetzner location, making this an ideal combination for low-latency data analysis. 
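
Storage Boxes are accessed over SSH-based protocols (rsync, scp, SFTP, Samba) rather than S3. A hedged sketch using rsync – the uXXXXXX account and hostname follow Hetzner’s usual naming pattern, and port 23 is what their Storage Box docs list for SSH services, but verify both in your account:

# Sync a project directory to the Storage Box over rsync/SSH
rsync -avz --progress -e "ssh -p 23" ./project_data/ \
    uXXXXXX@uXXXXXX.your-storagebox.de:project_data/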

Where you cut corners: Using storage and compute from different providers will always be slower than staying within the AWS ecosystem. Hetzner storage boxes come in defined sizes up to 40TB, and you pay for space that you’re not using. Storage boxes also don’t support S3 or other APIs that developers desire. For true backups and archival storage, it’s hard to beat AWS Glacier at $1/TB/month. 

3: Container Registries

How it’s done on AWS: ECR (Elastic container registry) allows for public and private repositories for your team to push and pull containers. You pay for the storage costs and egress when the containers are pulled outside of the same AWS region. 

How it can be done cheaply: DockerHub offers paid plans that include image builds and 5,000 container pulls per day. The math on this one will depend on your workflow size and the need for public vs. private containers. You could also host your own registry with something like Harbor, but that’s beyond the scope of this post.
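
Whichever registry you choose, the day-to-day commands look the same. A minimal sketch with a hypothetical private DockerHub repository (my-team/aligner is a placeholder):

# Authenticate once per machine
docker login

# Build an image and push it to a private repository
docker build -t my-team/aligner:1.0.2 .
docker push my-team/aligner:1.0.2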

Where you cut corners: Again, moving outside of AWS means you lose the integration and lightning-fast container pulls. Using DockerHub or another service is one more monthly bill and account to manage.

4: Batch workflows

How it’s done on AWS: Deploy workflows to Batch or EKS (Elastic Kubernetes Service). Compute happens on autoscaling EC2 or Fargate instances, data is stored in S3 or EFS, and containers are pulled from ECR. Batch workflows is where the interoperability of AWS services really stands out, and it’s hard to replicate everything at scale without significant engineering. 

How it can be done cheaply: If on AWS, use spot instances as much as possible, and design your workflows to be resilient to spot instance reclamation (create small composable steps, parallelize as much as possible, and use larger instances for less time). If you’re not on AWS, you have three options, which I will present in order of increasing difficulty and thriftiness:

  1. Manually deploy your workflows to a few large servers on your cloud provider of choice. If you’ve containerized your workflows (you’re using containers, right?), running the same pipeline on different samples should be as easy as changing the sample sheet. This method obviously takes more oversight and doesn’t scale beyond what you can do on a few large servers.
  2. Deploy your workflow to a Kubernetes cluster at a managed k8s provider, like Digital Ocean. You can use the autoscaling features to automatically increase and decrease the number of available nodes depending on your workflow. 
  3. Deploy a Kubernetes cluster to Hetzner Cloud. Here, you’ll be managing the infrastructure from start to finish, but you can take advantage of the cheapest autoscaling instances available on the planet. I can expand this to a tutorial if there’s interest, but the basic deployment looks like this:
    1. Set up a Kubernetes cluster using something like the lightweight distribution k3s (a command sketch follows this list).
    2. Set up autoscaling with Hetzner so you don’t have to manage node pools yourself. 
    3. Nextflow and other workflow managers need storage (a persistent volume claim, or PVC) with “read write many” capabilities. You can set this up with Rook Ceph.
    4. Modify your workflow requirements so that you don’t exceed the maximum resources available with a given cloud instance. The Hetzner Cloud instances are not as CPU and memory heavy as AWS.
    5. Deploy your workflow using the storage provider and container registry of your choice!
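
A minimal sketch of step 1 above on fresh Hetzner Cloud instances. The control-plane IP, node token, and node names are placeholders, and the Hetzner cluster autoscaler (step 2) and Rook Ceph storage (step 3) are separate installs not shown here:

# On the control-plane node: install k3s (a single binary that bundles containerd)
curl -sfL https://get.k3s.io | sh -

# Print the join token that worker nodes will need
sudo cat /var/lib/rancher/k3s/server/node-token

# On each worker node: join the cluster
curl -sfL https://get.k3s.io | \
    K3S_URL=https://<CONTROL_PLANE_IP>:6443 \
    K3S_TOKEN=<NODE_TOKEN> sh -

# From your workstation, using /etc/rancher/k3s/k3s.yaml as your kubeconfig
kubectl get nodes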

These setups obviously take more time and expertise to create and manage. Ensure that your team is familiar with the technology and the tradeoffs. If you want to deploy big batch workflows with minimal configuration, it’s hard to beat the managed services at AWS.

5: GPUs and accelerated computing

How it’s done on AWS: Get an EC2 instance with a GPU. Use GPU instances within a workflow.

How it can be done cheaply: Hetzner doesn’t offer cheap GPUs yet, but other cloud providers do, like Genesis Cloud, Vast.ai, and RunPod. The obvious downside is splitting your workloads across yet another cloud provider.

General advice

These tips can apply regardless of the cloud provider and services you use. Many of these came up in a Twitter thread I posted the other day. 

  • Use spot instances whenever you can to save ~50% on compute. On AWS, set your maximum bid to the on-demand price to minimize interruptions.
  • The big cloud providers offer credits to new teams to get them on the service – I think the standard AWS deal for startups is $100k in credits for a year. They also offer grants for research teams looking to take advantage of the cloud. My best “hourly rate” in grad school was filling out a GCP credit application – about $20k for one hour of work!
  • Turn your stuff off! This goes without saying, but so much compute is wasted by just leaving servers running when they don’t need to be.
  • Get good at the cost exploration tools, and designate one team member to understand the monthly bill and track changes. 
  • Test your workflows at small scale before deploying to a big cluster. 
  • Use free and cheap accelerated compute available at Google Colab and Paperspace. 

Conclusion

Cloud computing has made large strides in the last ten years, but for use in research, we still have a long way to go. I agree with the sentiment that we’re still early in cloud. For biotechs and academic labs that don’t have access to a university cluster (or are scaling beyond what their cluster can offer), there aren’t many alternatives to cloud computing. Unfortunately, high costs and stories of researchers breaking the bank with AWS turn many people off from these solutions completely.

My goal with this post is to outline some alternative services that biotechs and academic labs can use for their storage and compute. By being thrifty and learning some new skills, I bet cloud bills could be reduced by 50% or more. However, the integration between services in AWS is still top notch, and I hope we see more innovation and competition in this space in the near future.

Do you have experience with the services I mentioned? Agree or disagree with the recommendations, or have something else to add? Please let me know in the comments below!

Why are bioinformatics workflows different?

Data workflows and pipelines are an integral part of bioinformatics. However, the tools used to write and deploy workflows in bioinformatics are different from the tools used for similar tasks in data engineering. In this post, I’ll lay out (my opinion on) the reasons for the separation between these fields, and speculate on where bioinformatics is headed in the future.

What is a bioinformatics workflow?

A bioinformatics workflow is a series of programmatic steps to transform raw data into processed results, figures, and insights. A workflow can consist of many steps, each involving different tools, parameters, reference databases, and requirements. For example, a bioinformatics workflow I developed at Loyal transforms raw sequencing data from each sample into a DNA methylation profile. This workflow has about 10 steps, uses several different open source tools, and requires the canine reference genome in addition to the raw data input.

The complexity of these workflows, along with the requirement for different programs and resources at each step, necessitates the use of “workflow managers.” These tools orchestrate the processes, dependencies, deployment, and tracking of large bioinformatics pipelines.

Individuals with data engineering experience at tech companies are always surprised when they hear about the ecosystem of bioinformatics workflow managers – the set of tools is almost completely disjoint from the big data workflow tools they’re used to. Why, then, should scientists use a bioinformatics-specific workflow manager? I have found three reasons for this separation:

  1. Differences in data type, shape and scale
  2. Differences in programs and tooling
  3. Community support behind bioinformatics workflow managers

First, which tools are used in bioinformatics and data engineering?

There are several popular bioinformatics workflow managers. A non-exhaustive list includes Nextflow, Snakemake, the Common Workflow Language (CWL), and the Workflow Description Language (WDL). These workflow managers all provide the necessary capabilities of data provenance, portability, scalability, and re-entrancy. For a more thorough review, see (Wratten et al. 2021).

In data engineering, several graph-based workflow managers are used to run tasks based on a schedule or dependencies. These include Airflow, Flyte, Dagster and Prefect. These tools are simple to wire up to databases and other cloud compute services, and easily scale to manage millions of events.

Differences in data type, shape, and scale

In bioinformatics and data engineering, the type, shape, and size of data are different. Most genomic data is stored in large compressed text files, often reaching several gigabytes per sample. The total number of samples is usually modest, limited by cost and sample availability. Individual steps in a bioinformatics pipeline commonly take files as inputs and produce files as outputs. Each step can have different compute, memory, and disk space requirements. Databases are rarely used to store results.

In contrast, data engineering workflows may consist of processing millions of events from an application, transforming images from user input, or ingesting logs from across an application stack. Data is more likely to be stored in databases, individual processing steps may be simpler and better suited to serverless architecture, and total numbers of inputs and outputs may be higher.

In short, a bioinformatics workflow may process data from 1,000 samples, where the inputs are compressed text files, each 4 GB in size. A data engineering workflow may process 20 million images, each 200 KB in size. The total amount of data flowing through the pipeline is the same, but the needs of each use case can be drastically different.

|                             | Bioinformatics                        | Data Engineering                     |
| --------------------------- | ------------------------------------- | ------------------------------------ |
| Size of data files          | Large                                  | Small                                |
| Data type                   | Compressed text, proprietary formats   | Common formats (text, images, etc.)  |
| Number of data files        | Small                                  | Large                                |
| Compute intensity per step  | Medium to large                        | Small to medium                      |
| Store results in databases? | No                                     | Yes                                  |

Differences in programs and tooling

Bioinformatics pipelines are often built by stringing together many command line tools. These tools may have different installation methods and incompatible dependencies. Bioinformatics workflow managers solve these problems by allowing for a separate environment definition or container in each step. In addition, analysis steps may be written in different scripting languages, such as Python, R, or MATLAB, all of which need to be accessible to the workflow manager.

In contrast, data engineering workflows are primarily written in a single language, which is used to define both the workflow structure and the data processing steps. For example, Dagster is written in Python and only has weak extension support for other languages.

Community support of bioinformatics-specific workflow managers

Another advantage of bioinformatics-specific workflow managers is the strong communities that have been built around these tools. nf-core is the most active, but similar groups exist for Snakemake and CWL. In nf-core, you can join thousands of scientists working on similar problems, use pipelines developed and maintained by the community, and ask for help on GitHub or Slack. Even if the community-developed pipelines don’t solve your problem exactly, they can be a great starting point for further customization. Science is all about standing on the shoulders of giants, so why should you re-implement a pipeline in Airflow when it already exists in nf-core?

An example bioinformatics workflow

The nf-core RNA-seq workflow (nf-core/rnaseq) is a community-developed pipeline for conducting all the steps in an RNA-seq analysis. Starting with raw sequencing reads in the FASTQ file format, the data go through QC, alignment to the reference genome, quantification, and differential expression calculation. This pipeline has been developed over many years and has 3,700+ commits on GitHub. The default workflow uses several different programs and has 20 steps – adopting this workflow is a guaranteed way to get results faster than writing everything from scratch.
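
Running the community pipeline is a single command once Nextflow and Docker are installed. A hedged example invocation – the samplesheet, output directory, genome key, and pinned release are illustrative, so check the nf-core/rnaseq docs for the current parameters:

# Launch nf-core/rnaseq with containerized dependencies
nextflow run nf-core/rnaseq \
    -r 3.12.0 \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --genome GRCh38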

nf-core/rnaseq metro map

What about scale?

Nextflow workflows should scale to millions of samples, as long as sufficient compute resources are available. For example, 23andMe uses Nextflow for processing genetic data from customers. However, bioinformatics workflow managers may not be the best choice when biological data shifts into the shape and scale typically managed by data engineering workflows. I’m thinking most concretely about Ginkgo Bioworks, which processes terabytes of sequencing data through their pipeline each day. The individual files processed are much smaller – jobs may take seconds to run instead of hours. Ginkgo eventually settled on a workflow composed of Airflow, Celery, and AWS Batch. Efficiency is paramount at this scale, and a whole data engineering team contributed to Ginkgo’s solution. Most biotech companies and academic labs are better off using Nextflow or another bioinformatics-specific workflow manager, which can be deployed by a single scientist.

Where is the field headed?

After working in bioinformatics for 10 years now, I have a few ideas about where the field is headed. I’m open to being wrong on any of these predictions – let me know in the comments!

  • Bioinformatics-specific workflow managers will stick around for the foreseeable future. The most powerful argument for this is the activity and excitement in communities like nf-core.
  • Nextflow is the best choice for doing bioinformatics at scale in 2022.
  • Cloud is the future, but it’s still challenging to manage a team doing bioinformatics in the cloud.
    • A large part of this is that scientists are trained working on local computers or university-built HPC clusters. The tools to abstract away the complexity of cloud computing for scientists do not exist yet.
  • A more advanced and easier to use workflow manager will be developed that overtakes nextflow in popularity and community support.
    • It will be written in Python, not a clunky DSL or an obscure language like Groovy.
    • It will natively support execution in multiple cloud environments, intelligent resource usage, and smooth logging and debugging.
    • It will have an optional graphical interface for pipeline design and monitoring.
    • It may have already been started, as redun satisfies many of these criteria.

Conclusion

Computational biologists and bioinformaticians often use domain-specific workflow managers like Snakemake, Nextflow, and CWL. To someone with a data engineering background, this may be confusing, as well-developed and efficient workflow orchestration tools already exist. Digging deeper, the differences in data type/scale, tooling, and bioinformatics-specific communities reveal strong reasons for choosing a bioinformatics-specific workflow manager, even at the highest scale.

References

  1. Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 1–8 (2021) doi:10.1038/s41592-021-01254-9.


Deploying redun to AWS Batch – troubleshooting

I recently went down the rabbit hole trying out the newest bioinformatics workflow manager, redun. While installation and running workflows locally went off without a hitch, I experienced some trouble getting jobs deployed to AWS Batch. Here’s a list of my troubleshooting steps, in case you experience the same issues. To start, I followed the instructions for the “05_aws_batch” example workflow.

I was deploying the workflow on my AWS account at Loyal. These steps may differ if you’re using a new AWS account or have different security policies in place.

Building docker images

Docker needs access to the Docker daemon socket to build and push images, which by default requires root. In practice, this often means prefixing every command with "sudo". A quick (if blunt) fix is:

sudo chmod 666 /var/run/docker.sock

Or see the longer fix in this stack overflow post.

Submitting jobs to AWS Batch

I experienced the following error when submitting jobs to AWS Batch:

upload failed: - to s3://MY-BUCKET/redun/jobs/ca27a7f20526225015b01b231bd0f1eeb0e6c7d8/status
An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

I thought this was due to an error in the “role” setting, and that was correct. I first tried using the generic role

arn:aws:iam::ACCOUNT-ID:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch

but that didn’t work.

I then added a custom IAM role to AWS with S3, EC2, ECS, and Batch permissions. I also attached the following trust policy, which allows ECS tasks to assume the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

And then everything worked as expected.

ECS unable to assume role

I heard from someone else trying redun for the first time that they were able to get the Batch submission working with the (similar) instructions in this stack overflow post.

I hope this helps anyone trying to deploy redun to AWS Batch for the first time!

Trying out redun – the newest workflow manager on the block

Workflow managers form the cornerstone of a modern bioinformatics stack. By enabling data provenance, portability, scalability, and re-entrancy, workflow managers accelerate the discovery process in any computational biology task. There are many workflow managers available to choose from (a community-sourced list holds over 300): Snakemake, Nextflow, WDL… each has its relative strengths and drawbacks.

The engineering team at Insitro saw all the existing workflow managers, and then decided to invest in building their own: redun. Why? The motivation and influences docs pages lay out many of the reasons. In short, the team wanted a workflow manager written in Python that didn’t require expressing pipelines as dataflows.

I spent a few days trying out redun – working through the examples and writing some small workflows of my own. I really like the project and the energy of open source development behind it. I’m not at the point where I’m going to re-write all of my Nextflow pipelines in redun, but I’m starting to consider the benefits of doing so.
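
Getting started really is lightweight. A minimal sketch of installing redun and running a workflow’s entry-point task locally (workflow.py and main follow the naming used in the redun examples):

# Install redun and run the 'main' task of a workflow locally
pip install redun
redun run workflow.py main

# Browse the executions and file provenance recorded in the local .redun database
redun log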

The positives I immediately noticed about redun include:

  • redun is Python. Not having to learn a domain-specific language is a huge advantage.
  • The ability to execute sub-workflows with a single command. This is helpful if you want to enter a workflow with an intermediate file type.
  • I can see redun working as a centralized way to track workflow execution and file provenance within a lab or company.
  • There are several options for the execution backend, and redun is easy to deploy to AWS Batch (with some tweaks).
  • The tutorial and example workflows were helpful for demonstrating the key concepts.

A few drawbacks, as well:

  • There hasn’t been much investment in observability or execution tracking. Compared to Nextflow Tower and other tools, redun is in the last century.
  • Similarly, there isn’t yet much community investment in redun, like there is in nf-core.
  • While redun is extremely flexible, I bet it will be more challenging for scientists to learn than Snakemake.

There will certainly be other items to add to these lists as I get more familiar with redun. For now, it’s fair to say I’m impressed, and I want to write more pipelines in redun!

Rare transmission of commensal and pathogenic bacteria in the gut microbiome of hospitalized adults (1)

My final project with the Bhatt Lab is now published! You can find the open access text at Nature Communications. I’m excited to bring this chapter of my research career to a close. The paper contains the full scientific results; here I’ll detail some of the journey and challenges along the way.

Hot off the success of my previous work studying mother-infant transmission of phages in the microbiome, I was eager to characterize other examples of transmission between human microbiomes. While mother-infant transmission of both bacteria and phages was now understood, microbiome transmission between adults was less clear. There were some hints of it happening in the literature, but nobody had fully characterized the phenomenon at a level of genomic detail that I found convincing. I’m also not counting FMT as transmission here – while it certainly results in the transfer of microbiome components from donor to recipient, I was more interested in characterizing how this phenomenon happened naturally.

In our lab, we have a stool sample biobank from patients undergoing hematopoietic cell transplantation (HCT). We’ve been collecting weekly stool samples from patients undergoing transplant at Stanford Hospital, and to date we have thousands of samples from about one thousand patients. HCT patients are prime candidates to study gut-gut bacterial transmission, due to a few key factors:

  1. Long hospital stays. The conditioning, transplant and recovery process can leave a patient hospitalized for up to months at a time. The long stays provide many opportunities for transmission to occur and many longitudinal samples for us to analyze.
  2. Roommates when recovering from transplant. At Stanford Hospital, patients were placed in double occupancy rooms when there were not active contact precautions. These periods of roommate overlap could provide an increased chance for patient-patient transmission.
  3. Frequent antibiotic use. HCT patients are prescribed antibiotics both prophylactically and in response to infection. These antibiotics kill the natural colonizers of the gut microbiome, allowing antibiotic-resistant pathogens – which may be more likely to be transmitted between patients – to dominate. Antibiotic use may also empty the niche occupied by certain bacteria and make it more likely for new colonizers to engraft long-term.
  4. High burden of infection. HCT patients frequently have potentially life-threatening infections, and the causal bacteria can originate in the gut microbiome. However, it’s currently unknown where these antibiotic resistant bacteria originate from in the first place. Could transmission from another patient be responsible?

As we thought more about the cases of infection that were caused by gut-bloodstream transmission, we identified three possibilities:

  1. The microbes existed in the patient’s microbiome prior to entering the hospital for HCT. Then, due to antibiotic use and chemotherapy, these microbes could come to dominate the gut community.
  2. Patients acquired the microbe from the hospital environment. Many of the pathogens we’re interested in are Hospital Acquired Infections (HAIs) and are known to persist for long periods of time on hospital surfaces, in sinks, etc.
  3. Patients acquired the microbe via transmission from another patient. This was the most interesting possibility to us, as it would indicate direct gut-gut transmission.

While it’s likely that all three are responsible to some degree, finding evidence for (3) would have been the most interesting to us. Identifying patient-patient microbiome transmission would be both a slam dunk for my research, and would potentially help prevent infections in this patient population. With the clear goal in mind, I opened the door of the -80 freezer to pull out the hundreds of stool samples I would need to analyze…

More to come in part 2!


Moving into aging research – in dogs!

P – H – Done

As I finish up my PhD at Stanford and consider my next career moves, I’m positive I want to work at a small and rapidly growing biotech startup. After many interviews and some serious introspection, I settled on working at Loyal, a biotech company dedicated to extending the lifespan of dogs by developing therapeutics. It seems like a crazy idea at first, but the core thesis of doing aging research in companion canines makes a lot of sense.

I believe the aging field is at an inflection point – it’s where microbiome research was 10 years ago. Back then, 16S rRNA sequencing was the state of the art, and the only question researchers were commonly asking of microbial communities was “who’s there.” We’ve since come to appreciate the ecological complexity of the microbiome, developed new genomic ways to study the identities and functions of its members, and engineered microbiome therapeutics that are starting to show signs of efficacy.

At the core of the aging thesis is the idea that aging is a disease. After all, age is the largest risk factor for death, cancer, dementia, etc. Re-framing aging as a disease allows for completely new investigations, but will not be easy from a regulatory perspective.

Lifespan vs healthspan

“Why would you want to extend the number of years someone is sick at the end of their life?”

This question is frequently asked by those unfamiliar with aging research. However, I don’t believe many in the field have a desire to prolong an unhealthy end of life. Extension of lifespan is not valuable if the extra years are not lived well. Many researchers are interested in healthspan, the number of years lived in a good state of health. One way to picture this is to imagine a “rectangularization” of the survival curve. A drug that prolongs the number of years lived in good health would be very valuable, even if it had no impact on life expectancy.

Rectangularization of the survival curve – The lines should both be the same height to start, but you get the idea.

What about the ethical implications?

News about advancements in aging research is often accompanied by fear: “won’t this just make rich people live longer?” After all, immortality has been a quest for millennia. I don’t buy into many of these criticisms, for a few reasons. First, lifespan is already heavily stratified by income, and the wealthiest individuals already have access to advanced therapies and care that others lack. Second, advances in lifespan and healthspan are likely to be slow. No immortality drug will be developed overnight. Third, many researchers are working to develop drugs for aging that are cheap and commoditized. The CEO of Loyal, Celine Halioua, has written about this at length.

I’m not new to the aging field!

Back in my undergrad research at Brown, I worked in Nicola Neretti’s lab, which was focused on the genetic and epigenetic pathways of aging. The main paper I contributed to in undergrad studied the chromatin organization of cells as they progressed into senescence, a state of permanent growth arrest that serves as a cellular model of aging. It’s great to be back!

What’s going on at Loyal?

I’ll be working on everything related to genomics and bioinformatics in dogs. This means sequencing blood and saliva samples from our laboratory and companion animals, quantifying aging at the genetic and epigenetic level, building better epigenetic clocks, and researching the breed-specific epigenetic changes that accompany aging in certain dogs. It’s exciting and fast-paced. And we’re hiring more! Whether your background is in aging science, vet med, computer science, or business operations, we need talented people. Drop me a line if you want to talk more.

Large-scale bioinformatics in the cloud with GCP, Kubernetes and Snakemake

I recently finished a large metagenomics sequencing experiment – 96 10x Genomics linked-read libraries sequenced across 25 lanes on a HiSeq 4000. This was around 2 TB of raw data (compressed fastqs). I’ll go into more detail about the project and data in another post, but here I’m just going to talk about processing the raw data.

We’re lucky to have a large compute cluster at Stanford for our everyday work. This is shared with other labs and has a priority system for managing compute resources. It’s fine for most tasks, but not up to the scope of this current project. 2 TB of raw data may not be “big” on the scale of what places like the Broad deal with on a daily basis, but it’s definitely the largest single sequencing experiment I or our lab has done. To solve this, we had to move… TO THE CLOUD!

By utilizing cloud compute, I can easily scale the compute resources to the problem at hand. Total cost is the same if you use 1 CPU for 100 hours or 100 CPUs for 1 hour… so I will parallelize this as much as possible to minimize the time taken to process the data. We use Google Cloud Platform (GCP) for bioinformatics, but you can do something similar with Amazon’s or Microsoft’s cloud compute, too. I used ideas from this blog post to port the Bhatt lab metagenomics workflows to GCP.

Step 0: Install the GCP SDK and configure a storage bucket

Install the GCP SDK to manage your instances and connect to them from the command line. Create a storage bucket for data from this project – this can be done from the GCP console on the web. Then, set up authentication as described here.
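
Step 0 boils down to a few commands. The project ID and region are placeholders, and YOUR_BUCKET_HERE matches the placeholder used in the workflow below:

# Authenticate the SDK and set a default project
gcloud auth login
gcloud config set project YOUR_PROJECT_ID

# Application-default credentials are what Snakemake picks up for GS remote files
gcloud auth application-default login

# Create a regional storage bucket for the project
gsutil mb -l us-west1 gs://YOUR_BUCKET_HERE/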

Step 1: Download the raw data

Our sequencing provider delivers raw data via an FTP server. I downloaded all the data from the FTP server and uploaded it to the storage bucket using the gsutil rsync command. Note that any reference databases (the human genome for removing human reads, for example) need to be in the cloud too.
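
For reference, the upload looked roughly like this (the local directory names are illustrative; the raw_data_renamed prefix matches what the workflow below expects):

# Upload raw fastqs in parallel (-m) to the project bucket
gsutil -m rsync -r ./raw_data_renamed gs://YOUR_BUCKET_HERE/raw_data_renamed

# Reference databases need to live in the bucket as well
gsutil -m rsync -r ./references gs://YOUR_BUCKET_HERE/references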

Step 2: Configure your workflow.

I’m going to assume you already have a snakemake workflow that works with local compute. Here, I’ll show how to transform it to work with cloud compute. I’ll use the workflow to run the 10X Genomics longranger basic program and deinterleave reads as an example. This takes in a number of samples with forward and reverse paired end reads, and outputs the processed reads as gzipped files.

The first lines import the cloud compute packages, define your storage bucket, and search for all samples matching a specific name on the cloud.

from os.path import join
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
GS = GSRemoteProvider()
GS_PREFIX = "YOUR_BUCKET_HERE"
samples, *_ = GS.glob_wildcards(GS_PREFIX + '/raw_data_renamed/{sample}_S1_L001_R1_001.fastq.gz')
print(samples)

The rest of the workflow just has a few modifications. Note that Snakemake automatically takes care of remote input and output file locations. However, you need to add the ‘GS_PREFIX’ when specifying folders as parameters. Also, if output files aren’t explicitly specified, they don’t get uploaded to remote storage. Note the use of a singularity image for the longranger rule, which automatically gets pulled on the compute node and has the longranger program in it. pigz isn’t available on the cloud compute nodes by default, so the deinterleave rule has a simple conda environment that specifies installing pigz. The full pipeline (and others) can be found at the Bhatt lab github.

rule all:
    input:
        expand('barcoded_fastq_deinterleaved/{sample}_1.fq.gz', sample=samples)

rule longranger:
    input: 
        r1 = 'raw_data_renamed/{sample}_S1_L001_R1_001.fastq.gz',
        r2 = 'raw_data_renamed/{sample}_S1_L001_R2_001.fastq.gz'
    output: 'barcoded_fastq/{sample}_barcoded.fastq.gz'
    singularity: "docker://biocontainers/longranger:v2.2.2_cv2"
    threads: 15
    resources:
        mem=30,
        time=12
    params:
        fq_dir = join(GS_PREFIX, 'raw_data_renamed'),
        outdir = join(GS_PREFIX, '{sample}'),
    shell: """
        longranger basic --fastqs {params.fq_dir} --id {wildcards.sample} \
            --sample {wildcards.sample} --disable-ui --localcores={threads}
        mv {wildcards.sample}/outs/barcoded.fastq.gz {output}
    """

rule deinterleave:
    input:
        rules.longranger.output
    output:
        r1 = 'barcoded_fastq_deinterleaved/{sample}_1.fq.gz',
        r2 = 'barcoded_fastq_deinterleaved/{sample}_2.fq.gz'
    conda: "envs/pigz.yaml"
    threads: 7
    resources: 
        mem=8,
        time=12
    shell: """
        # code inspired by https://gist.github.com/3521724
        zcat {input} | paste - - - - - - - -  | tee >(cut -f 1-4 | tr "\t" "\n" |
            pigz --best --processes {threads} > {output.r1}) | \
            cut -f 5-8 | tr "\t" "\n" | pigz --best --processes {threads} > {output.r2}
    """

Now that the input files and workflow are ready to go, we need to set up our compute cluster. Here I use a Kubernetes cluster, which has several attractive features, such as autoscaling of compute resources to match demand.

A few points of terminology that will be useful:

  • A cluster contains (potentially multiple) node pools.
  • A node pool contains multiple nodes of the same type.
  • A node is the basic compute unit, which can contain multiple CPUs.
  • A pod (as in a pod of whales) is a unit of deployed compute (a job) running on a node.

To start a cluster, run a command like this. Change the parameters to the type of machine that you need. The last line gets credentials for job submission. This starts with a single node, and enables autoscaling up to 96 nodes.

export CLUSTER_NAME="snakemake-cluster-big"
export ZONE="us-west1-b"
gcloud container clusters create $CLUSTER_NAME \
    --zone=$ZONE --num-nodes=1 \
    --machine-type="n1-standard-8" \
    --scopes storage-rw \
    --image-type=UBUNTU \
    --disk-size=500GB \
    --enable-autoscaling \
    --max-nodes=96 \
    --min-nodes=0
gcloud container clusters get-credentials --zone=$ZONE $CLUSTER_NAME

For jobs with different compute needs, you can add a new node pool like so. I used two different node pools, with 8 core nodes for preprocessing the sequencing data and aligning against the human genome, and 16 core nodes for assembly. You could also create additional high memory pools, GPU pools, etc depending on your needs. Ensure new node pools are set with --scopes storage-rw to allow writing to buckets!

gcloud container node-pools create pool2 \
    --cluster $CLUSTER_NAME \
    --zone=$ZONE --num-nodes=1 \
    --machine-type="n1-standard-16" \
    --scopes storage-rw \
    --image-type=UBUNTU \
    --disk-size=500GB \
    --enable-autoscaling \
    --max-nodes=96 \
    --min-nodes=0

When you are finished with the workflow, shut down the cluster with this command. Or let autoscaling slowly move the number of machines down to zero.

gcloud container clusters delete --zone $ZONE $CLUSTER_NAME

To run the snakemake pipeline and submit jobs to the Kubernetes cluster, use a command like this:

snakemake -s 10x_longranger.snakefile --default-remote-provider GS \
    --default-remote-prefix YOUR_BUCKET_HERE --use-singularity \
    -j 99999 --use-conda --nolock --kubernetes

Add the name of your bucket prefix. The ‘-j’ here allows (mostly) unlimited jobs to be scheduled simultaneously.

Each job will be assigned to a node with available resources. You can monitor the progress and logs with the commands shown as output. Kubernetes autoscaling takes care of provisioning new nodes when more capacity is needed, and removes nodes from the pool when they’re not needed any more. There is some lag for removing nodes, so beware of the extra costs.

While the cluster is running, you can view the number of nodes allocated and the available resources all within the browser. Clicking on an individual node or pod will give an overview of the resource usage over time.

Useful things I learned while working on this project

  • Use docker and singularity images where possible. In cases where multiple tools were needed, a simple conda environment does the trick.
  • The container image type must be set to Ubuntu (see above) for singularity images to correctly work on the cluster.
  • It’s important to define memory requirements much more rigorously when working on the cloud. Compared to our local cluster, standard GCP nodes have much less memory. I had to go through each pipeline and define an appropriate amount of memory for each job, otherwise they wouldn’t schedule from Kubernetes, or would be killed when they exceeded the limit.
  • You can only reliably use n-1 cores on each node in a Kubernetes cluster. There are always some processes running in the background on a node, and Kubernetes won’t schedule pods that request more than 100% of the node’s CPU. The threads parameter in Snakemake is an integer. Combine these two things and you can only really use 7 cores on an 8-core machine. If anyone has a way around this, please let me know!
  • When defining input and output files, you need to be much more specific. When working on the cluster, I would just specify a single output file out of many for a program, and could trust that the others would be there when I needed them. But when working with remote files, the outputs need to be specified explicitly to get uploaded to the bucket. Maybe this could be fixed with a call to directory() in the output files, but I haven’t tried that yet.
  • Snakemake automatically takes care of remote files in inputs and outputs, but paths specified in the params: section do not automatically get converted. I use paths here for specifying an output directory when a program asks for it. You need to add the GS_PREFIX to paths to ensure they’re remote. Again, might be fixed with a directory() call in the output files.
  • I haven’t been able to get configuration yaml files to work well in the cloud. I’ve just been specifying configuration parameters in the snakefile or on the command line.

Transmission of crAssphage in the microbiome

Update! This work has been published in Nature Communications.
Siranosian, B.A., Tamburini, F.B., Sherlock, G. et al. Acquisition, transmission and strain diversity of human gut-colonizing crAss-like phages. Nat Commun 11, 280 (2020). https://doi.org/10.1038/s41467-019-14103-3

Big questions in the microbiome field surround the topic of microbiome acquisition. Where do we get our first microbes from? What determines the microbes that colonize our guts from birth, and how do they change over time? What short- and long-term impacts do these microbes have on the immune system, allergies, or diseases? What impact do delivery mode and breastfeeding have on the infant microbiome?

A key finding from the work was that mothers and infants often share identical or nearly identical crAssphage sequences, suggesting direct vertical transmission. Also, I love heatmaps.

As you might expect, a major source for microbes colonizing the infant gut is immediate family members, and the mother is thought to be the major source. Thanks to foundational studies by Bäckhed, Ferretti, Yassour and others (references below), we now know that infants often acquire the primary bacterial strain from the mother’s microbiome. These microbes can have beneficial capabilities for the infant, such as the ability to digest human milk oligosaccharides, a key source of nutrients in breast milk.

The microbiome isn’t just bacteria – phages (along with fungi and archaea to a smaller extent) play key roles. Phages are viruses that predate on bacteria, depleting certain populations and exchanging genes among the bacteria they infect. Interestingly, phages were previously shown to display different inheritance patterns than bacteria, remaining individual-specific between family members and even twins (Reyes et al. 2010). CrAss-like phages are the most prevalent and abundant group of phages colonizing the human gut, and our lab was interested in the inheritance patterns of these phages.

We examined publicly available shotgun gut metagenomic datasets from two studies (Yassour et al. 2018, Bäckhed et al. 2015), containing 134 mother-infant pairs sampled extensively through the first year of life. In contrast to what has been observed for other members of the gut virome, we observed many putative transmission events of a crAss-like phage from mother to infant. The key takeaways from our research are summarized below. You can also refer to my poster from the Cold Spring Harbor Microbiome meeting for the figures supporting these points. We hope to have a new preprint (and hopefully a publication) on this research out soon!

  1. CrAssphage is not detected in infant microbiomes at birth, increases in prevalence with age, but doesn’t reach the level of adults by 12 months of age.
  2. Mothers and infants share nearly identical crAssphage genomes in 40% of cases, suggesting vertical transmission.
  3. Infants have reduced crAssphage strain diversity and typically acquire the mother’s dominant strain upon transmission.
  4. Strain diversity is mostly the result of neutral genetic variation, but infants have more nonsynonymous multiallelic sites than mothers.
  5. Strain diversity varies across the genome, and tail fiber genes are enriched for strain diversity with nonsynonymous variants.
  6. These findings extend to crAss-like phages. Vaginally born infants are more likely to have crAss-like phages than those born via C-section.

References
1. Reyes, A. et al. Viruses in the faecal microbiota of monozygotic twins and their mothers. Nature 466, 334–338 (2010).
2. Yassour, M. et al. Strain-Level Analysis of Mother-to-Child Bacterial Transmission during the First Few Months of Life. Cell Host & Microbe 24, 146-154.e4 (2018).
3. Bäckhed, F. et al. Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life. Cell Host & Microbe 17, 690–703 (2015).
4. Ferretti, P. et al. Mother-to-Infant Microbial Transmission from Different Body Sites Shapes the Developing Infant Gut Microbiome. Cell Host & Microbe 24, 133-145.e5 (2018).