Bioinformatics in the cloud, on a budget

Let's say you're a biotech or academic lab that needs to do bioinformatics or computational biology at a reasonably large scale. You have a tight budget and you want to be as cost effective as possible. You also don't want to build and maintain your own hardware, because you recognize the hidden costs baked into the time, effort, and security of doing so. Luckily, the last few years have seen a proliferation of "alternative" cloud providers. These providers compete with AWS, GCP, and Azure by doing a few things really well at greatly reduced prices. My main argument in this post is that by mixing services from different cloud providers, you can do cloud bioinformatics on a budget, despite the prevailing pessimistic opinions.

To be upfront, I believe working with one of the larger public cloud providers will make your life easier and allow you to deliver results faster, with less engineering expertise. AWS has services that cover everything a biotech needs to process data in the cloud, and the integration between these services is seamless and efficient. But we’re not going for easy here, right? We’re going for cheap. And cheap means cutting some corners and making things more difficult in the name of saving your valuable dollars.

What's the problem with the big public cloud providers? AWS allows a team to build any product imaginable and scale it infinitely. Need to build a Netflix competitor that can deliver video with low latency and maximum uptime to every corner of the world? AWS will let you do that (and bill you appropriately). With this plethora of features come many hidden costs. It can seem like AWS intentionally makes their billing practices opaque, allowing you to rack up massive bills by leaving a service running or enabling features you don't need. In the future, I'll do a separate post on keeping AWS costs manageable. For now, just know that you have to be careful or you can get burned – I personally know several people who have made costly mistakes here. Even when just looking at raw compute, AWS is priced at a large premium compared to competitors on the market. You pay for the performance, uptime, reliability, interoperability, and support.

The minimum viable bioinformatics cloud

With that out of the way, it’s time to design our bioinformatics cloud! The minimum capabilities of a system supporting a bioinformatics team include: 

    1. Interactive compute for experimentation, prototyping workflows, programming in Jupyter and RStudio and generating figures. GPUs may be needed for training machine learning models. 
    2. Cloud storage that’s accessible to all team members and other services. Ideally this system supports cheap cold storage for infrequently accessed and backup data.
    3. Container registries. Batch workflows need to access a high-bandwidth container registry for custom private and public containers. 
    4. Scalable batch compute that can be managed by a workflow manager. A team should be able to easily 10-1000X their compute with a single command line argument or config change.
    5. GPUs, databases, and other add-ons, depending on the work the team is doing. 

Where can we cut corners?

Some of the features offered by AWS matter less to a bioinformatics team.

  • The final 10% optimization of latency, uptime and performance. In research, my day isn’t ruined if a workflow completes in 24 versus 22 hours – it’s still an overnight task. Similarly, an hour of downtime on a cluster for maintenance isn’t the end of the world – I always have papers I could be reading. Beyond some limit, increasing these metrics isn’t worth the additional cost.
  • Multi-region and multi-availability zone. We’re not building Netflix, or even publicly available services. All the compute can be in one region. 
  • Infinite hot storage. I’ve found that beyond a certain point, adding more hot storage doesn’t make a team more efficient, just lazy about cleaning data up. Not all data needs to be accessed with zero latency. There has to be something similar to Parkinson’s law for this case: left unchecked, data storage will expand to fill all available space. 
  • Infinitely scalable compute. Increasing parallelization of a workflow beyond a certain point often results in increased overhead and diminishing returns. While scalability is necessary, it doesn’t need to be truly infinite.

With these requirements and cost saving measures in mind, here’s my bioinformatics in the cloud on a budget “cookbook”.

1: Interactive compute

There are two ways teams typically handle this requirement: providing a large, central compute server for all members to share, or allowing team members to provision their own compute servers. The first option requires more central management, while the second relies on each team member being able to administer their own resources.

How it's done on AWS: EC2 instances that are always running or provisioned on-demand. You can save by paying up-front for a dedicated EC2 instance, but there's a sneaky $2/hour fee for this service that makes it uneconomical until you reach a fairly large scale.

How it can be done cheaply: Hetzner is a German company that offers dedicated servers for 10-25% of the cost of AWS. You can either configure a new server with your desired capabilities for a small setup fee, or immediately lease an existing server available on their website. These servers can have up to 64 vCPU, 1TB RAM, and 77TB of flash storage. 20TB of data egress traffic is included (which would cost you over $1800 at AWS)!

If you want to use the Hetzner Storage Box and Cloud services I mention later, you’ll want to pick a server in Europe to keep all your services in the same data center. This can create lag when connecting from the US, so I recommend using mosh instead of SSH to minimize the impact of transatlantic latency. 

Where you cut corners: Hetzner servers are not as high powered as AWS EC2 instances, which can easily top out at over 128 vCPU. You can’t add GPUs or get very specific hardware configurations. Hetzner dedicated servers are billed per month, while AWS EC2 instances are billed per second, offering you more flexibility. Compared to AWS, there aren’t as many integrated services at Hetzner, and some users complain that there’s more scheduled maintenance downtime.

2: Cloud storage

How it’s done on AWS: S3 buckets or Elastic File System (EFS, their implementation of NFS). Storage tiers, and the AWS intelligent tiering service, allow archival storage to be very cheap.

How it can be done cheaply: Many companies now offer infinitely scalable cloud storage at prices significantly lower than S3. They also offer free or greatly reduced data transfer rates, which can help you avoid the obscene AWS egress fees. Two of my favorite providers are Backblaze B2 and Cloudflare R2. Both of these services can be accessed with the familiar S3 API. If this service is being used to store actively analyzed data, Cloudflare wins out: zero egress fees make up for the increased storage cost. As soon as you egress more than you store per month, Cloudflare is cheaper than Backblaze.
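
Since both providers speak the S3 protocol, trying them out is mostly a matter of pointing your existing tooling at a different endpoint. Here's a minimal sketch using boto3; the endpoint URL, credentials, and bucket name are placeholders you'd swap for the values from your provider's dashboard.

# Minimal sketch: using an S3-compatible provider (B2 or R2) through boto3.
# The endpoint, keys, and bucket below are placeholders, not real resources.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # or your B2 endpoint
    aws_access_key_id="YOUR_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload a result file and pull it back down, exactly as you would with S3.
s3.upload_file("results/sample1.bam", "my-bucket", "project1/sample1.bam")
s3.download_file("my-bucket", "project1/sample1.bam", "sample1.bam")

The AWS CLI works the same way with its --endpoint-url flag, so most existing habits and scripts carry over unchanged.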

Hetzner recently released Storage Boxes, which you can purchase in predefined sizes and get storage costs down to about $2/TB/month when fully utilized. The performance of the storage boxes is very high when transferring data within a Hetzner location, making this an ideal combination for low-latency data analysis. 

Where you cut corners: Using storage and compute from different providers will always be slower than staying within the AWS ecosystem. Hetzner storage boxes come in defined sizes up to 40TB, and you pay for space that you’re not using. Storage boxes also don’t support S3 or other APIs that developers desire. For true backups and archival storage, it’s hard to beat AWS Glacier at $1/TB/month. 

3: Container Registries

How it’s done on AWS: ECR (Elastic container registry) allows for public and private repositories for your team to push and pull containers. You pay for the storage costs and egress when the containers are pulled outside of the same AWS region. 

How it can be done cheaply: DockerHub offers paid plans that include image builds and 5000 container pulls per day. The math on this one will depend on your workflow size and the need for public vs private containers. You could also host your own registry with something like Harbor, but that's beyond the scope of this post.

Where you cut corners: Again, moving outside of AWS means you lose the integration and lightning-fast container pulls. Using DockerHub or another service is one more monthly bill and account to manage.

4: Batch workflows

How it’s done on AWS: Deploy workflows to Batch or EKS (Elastic Kubernetes Service). Compute happens on autoscaling EC2 or Fargate instances, data is stored in S3 or EFS, and containers are pulled from ECR. Batch workflows is where the interoperability of AWS services really stands out, and it’s hard to replicate everything at scale without significant engineering. 

How it can be done cheaply: If on AWS, use spot instances as much as possible, and design your workflows to be resilient to spot instance reclaims (create small composable steps, parallelize as much as possible, and use larger instances for less time). If you're not on AWS, you have three options, which I will present in order of increasing difficulty and thriftiness:

  1. Manually deploy your workflows to a few large servers on your cloud provider of choice. If you’ve containerized your workflows (you’re using containers, right?) running the same pipeline on different samples should be as easy as changing the sample sheet. This method obviously takes more oversight and doesn’t scale beyond what you can do on a few large servers. 
  2. Deploy your workflow to a Kubernetes cluster at a managed k8s provider, like Digital Ocean. You can use the autoscaling features to automatically increase and decrease the number of available nodes depending on your workflow. 
  3. Deploy a Kubernetes cluster to Hetzner Cloud. Here, you’ll be managing the infrastructure from start to finish, but you can take advantage of the cheapest autoscaling instances available on the planet. I can expand this to a tutorial if there’s interest, but the basic deployment looks like this:
    1. Set up a Kubernetes cluster using something like the lightweight distribution k3s
    2. Set up autoscaling with Hetzner so you don’t have to manage node pools yourself. 
    3. Nextflow and other workflow managers need storage (a persistent volume claim, or PVC) with “read write many” capabilities. You can set this up with Rook Ceph.
    4. Modify your workflow requirements so that you don’t exceed the maximum resources available with a given cloud instance. The Hetzner Cloud instances are not as CPU and memory heavy as AWS.
    5. Deploy your workflow using the storage provider and container registry of your choice!

These setups obviously take more time and expertise to create and manage. Ensure that your team is familiar with the technology and the tradeoffs. If you want to deploy big batch workflows with minimal configuration, it’s hard to beat the managed services at AWS.

5: GPUs and accelerated computing

How it’s done on AWS: Get an EC2 instance with a GPU. Use GPU instances within a workflow.

How it can be done cheaply: Hetzner doesn’t offer cheap GPUs yet, but other cloud providers do, like Genesis Cloud, Vast, and RunPod. The obvious downside of this is splitting your workloads up between another cloud provider.

General advice

These tips can apply regardless of the cloud provider and services you use. Many of these came up in a Twitter thread I posted the other day. 

  • Use spot instances whenever you can to save ~50% on compute. On AWS, set your maximum bid to the on-demand price to minimize interruptions.
  • The big cloud providers offer credits to new teams to get them on the service – I think the standard AWS deal for startups is $100k in credits for a year. They also offer grants for research teams looking to take advantage of the cloud. My best “hourly rate” in grad school was filling out a GCP credit application – about $20k for one hour of work!
  • Turn your stuff off! This goes without saying, but so much compute is wasted by just leaving servers running when they don’t need to be.
  • Get good at the cost exploration tools, and designate one team member to understand the monthly bill and track changes. 
  • Test your workflows at small scale before deploying to a big cluster. 
  • Use free and cheap accelerated compute available at Google Colab and Paperspace. 

Conclusion

Cloud computing has made large strides in the last ten years, but for use in research, we still have a long way to go. I agree with the sentiment that we’re still early in cloud. For biotechs and academic labs that don’t have access to a university cluster (or are scaling beyond what their cluster can offer), there aren’t many alternatives to cloud computing. Unfortunately, high costs and stories of researchers breaking the bank with AWS turn many people off from these solutions completely.

My goal with this post is to outline some alternative services that biotechs and academic labs can use for their storage and compute. By being thrifty and learning some new skills, I bet cloud bills could be reduced by 50% or more. However, the integration between services in AWS is still top notch, and I hope we see more innovation and competition in this space in the near future.

Do you have experience with the services I mentioned? Agree or disagree with the recommendations, or have something else to add? Please let me know in the comments below!

Getting an industry job after grad school

You’ve decided to move on from the academic career path after finishing your masters or PhD. Congratulations! However, making the transition out of academia can be hard, intimidating, and lonely. There are so many possible paths, rather than the linear grad school to postdoc to faculty pipeline, and it can feel like you’re leaving your community behind after years in the university system. Here’s some advice that helped me with the transition to my first biotechnology job, and a few things I learned hiring scientists and managing a team at Loyal and Formic Labs. This advice is based on my own experience and the experiences of the people close to me – it won’t be perfectly applicable to fields outside of biotechnology. I’ll cover three key areas: how to find the right position, how to apply and get the job, and how to find your people.

How to find the right position

Narrow down your search space as much as possible

There are over three thousand biotech companies in the Bay Area alone. That's a huge number compared to the 5-10 schools you likely applied to for graduate school. Your first task is to narrow the search space using a few key factors.

  1. What field do you want to work in? Maybe your PhD research was in gene therapy delivery, and you’d like to stay in that space. Congrats, you just narrowed your search space down to only 88 companies in CA (data from BioPharmGuy, considering gene therapy, RNA and peptide therapy companies).
  2. What company size would you enjoy most? This can be a hard question to answer if you haven’t had a non-academic job before, but you can use clues from grad school. Knowing what you know now, what type of lab would you ideally want to work in? One with a small team and hands-on advisor, or a large lab with many graduate students and postdocs, but limited attention from your advisor? Are you excited or frightened by the idea of working in a new lab with a young advisor, before they’ve gotten tenure? The answers to these questions can steer you towards small and big companies, and towards or away from startups.
  3. Where do you want to live? Geography is an important consideration that shouldn’t be ignored. You now have the flexibility of being independent of the university system – use it to make a choice based on cost of living, proximity to family and friends, hobbies, or the best place to raise a family. Depending on the industry, your best options may be in one of a few hubs.
  4. Do you want to work remotely? If you enjoy the tradeoffs of remote work, limit your search to positions that offer this up front. Companies will often bring the entire team together a few times a year, so be prepared to travel at least a few times a year if you go down this route.

Talk to as many people as you can

You can start this process while you're still in grad school. It's not uncommon or uncool to do "informational interviews" with people in your field. These people might be lab or university alumni, someone who has published in the same research area, or even just someone you follow online. I've had great luck reaching out to strangers on Twitter or Linkedin to talk about ideas and careers.

Search smarter, not harder

Two websites I’ve already linked hold databases of biotech companies and a biotech-specific job board: BioPharmGuy and BioSpace. Searching on these sites can be great for both company discovery and job postings. AngelList Talent can help with the search for jobs at newer startups.

Get on Twitter

Twitter is a hub for science information, new publications, job postings, and gossip in the field. Especially for the startup scene, Twitter has far more value than Linkedin. You don’t even have to post anything, just find some interesting people to follow and go from there. The #AltAcChats hashtag is a good place to start.

Your skills are general – it’s okay to change fields

The skills you learn during a PhD are more generally applicable than you may believe. Did you manage projects involving several lab members or outside collaborators? Did you mentor undergrads or new members of the lab? TA and develop material for a course? Take on a project in a new research area after jumping into the deep end of the literature pool? Recognize, promote, and sell these skills – they are valuable in whatever field you end up in. Conquering a PhD means you can learn pretty much anything.

Get connected with the venture capitalists

The best VCs have an expert bird's-eye view of their industry, and they have an incentive to place talented people at their portfolio companies. I've talked with VCs from Lux Capital, 8VC, Northpond and others at biotech meetups. They're always looking to network with talented people – they need dealflow just as much as you need a job or a term sheet!

Consider roles outside of pure research

Consider strategic operations, chief of staff, project management, VC, and other “alternative” roles. If you love being involved with science but don’t see yourself doing pure research forever, there are many ways to stay involved without opening a lab notebook.

How to apply and get the job

Your resume, cover letter, or intro needs to stand out

If you’ve identified a company and role that is a good fit for you, and you want to apply, realize that hiring managers get A LOT of resumes. This is especially true when a job is posted on Linkedin or other general job sites. If a manager only has a minute or two to devote to each resume, you have to stand out in a positive way. Maybe it’s a relevant and interesting thesis title, an open source software project you’ve contributed to, or a good word from someone working at the company. Any positive connection or good word can go a long way to getting you a first interview.

Do many, many interviews

Especially if you're unfamiliar with the interview process, or interviews make you nervous. It might seriously suck at first, but the only way to get more comfortable with interviewing is to put yourself out there and get uncomfortable. In the age of Zoom, you can interview with a company halfway across the country without ever leaving your room (or putting on pants). I'll even suggest doing early interviews with companies that you may be a good fit for, but that you know you wouldn't join. You'll learn some of the common interview questions, get practice summarizing your research experience, and learn about the salary bands for the role (you are going to ask about salary, right?).

Have something to show in public, especially if you’re interviewing for a computational or software role

This could be a personal website, Github repository, a website for a side project, or a reproducible demo analysis from a paper. You want something that can show off your programming and quantitative skills from any device connected to the internet. Be prepared to walk through design choices for the code and any areas that were particularly interesting or challenging. Good documentation is important for any software intended to be re-used – docs are valued more in industry than in academia.

I have a few personal examples that I've repeatedly sent in messages or brought up live in a Zoom interview. The bhattlab_workflows and kraken2_classification pipelines are not miracles of software engineering by any means, but they're still used by members of the Bhatt lab and others, they make nice figures, and they have good docs. My bioinformatics in the cloud post is now a few years out of date, but it shows that I have been thinking about the challenges and solutions in this field for a while.

Brush up on the latest trends, languages, and frameworks in your field

In bioinformatics, Nextflow is the most popular workflow manager, and cloud compute skills are a necessity. Being familiar with both will help in any bioinformatics interview. So, re-write a simple pipeline from grad school in Nextflow, sign up for the AWS Free Tier, and learn how to deploy it to AWS Batch. You could even write a blog post or a Twitter thread about the process, what you learned, and what you found challenging, then refer to it during an interview. A weekend of work will set you apart from those who haven't tried to make the transition.

Utilize resources at your university

Many universities have free career counseling or job boards for people in situations just like you! Make sure you take advantage of these resources. You could probably benefit from a resume review, Linkedin profile checkup, or just someone knowledgeable to talk through your different options with.

Know what you’re worth. Negotiate.

Salary and equity compensation is field and role dependent. Talking with others in positions you’re applying to is the best way to get the current numbers. Ask for a range rather than direct numbers to avoid getting too personal. Also, recognize the tradeoffs that come with company size. Startups can’t pay as well, but can compensate with equity that could be life-changing in the event of a successful exit. Later-stage or public companies will be more stable and offer more in salary without the asymmetric upside. Finally, realize an offer is just a starting point for negotiations. There’s a hard limit for every position, but most offers can be flexed for the right candidate. You can also trade salary for equity (and vice-versa) depending on your risk tolerance.

How to find your people

Find your in-person community

There are growing meetup groups for young scientists in biotech and other fields. Right now, I’m seeing these mostly advertised in the Bay Area, NYC, and Boston, but they’re rapidly expanding to other areas as well. My top two for the Bay are Bits in Bio (which also has an active Slack community with over 2000 members) and Ergo Bio’s Biotech Venture Meetups. Groups like Nucleate bring together biotech founders from around the world.

Find your online community

I feel like the network of people talking about industry jobs, trends, and advice is stronger than ever. Twitter and Slack spaces like Bits in Bio are full of friendly and talented people.

Don’t stress about finding the “perfect” industry position in your first role out of grad school

Industry is not like academia, where you must commit 4+ years to a single field, and where your life is defined by your research area. You will learn more than you expect in the first year of your new role, and if you’re not happy, you’ll be in a better place to change it a year in. It’s much easier to change jobs in industry, and each change can come with better fit and increased compensation.

Why are bioinformatics workflows different?

Data workflows and pipelines are an integral part of bioinformatics. However, the tools used to write and deploy workflows in bioinformatics are different from the tools used for similar tasks in data engineering. In this post, I'll lay out (my opinion on) the reasons for the separation between these fields, and speculate on where bioinformatics is headed in the future.

What is a bioinformatics workflow?

A bioinformatics workflow is a series of programmatic steps to transform raw data into processed results, figures, and insights. A workflow can consist of many steps, each involving different tools, parameters, reference databases, and requirements. For example, a bioinformatics workflow I developed at Loyal transforms raw sequencing data from each sample into a DNA methylation profile. This workflow has about 10 steps, uses several different open source tools, and requires the canine reference genome in addition to the raw data input.

The complexity of these workflows, along with the requirement for different programs and resources at each step, necessitates the use of "workflow managers." These tools orchestrate the processes, dependencies, deployment, and tracking of large bioinformatics pipelines.

Individuals with data engineering experience at tech companies are always surprised when they hear about the ecosystem of bioinformatics workflow managers – the set of tools is almost completely disjoint from the big data workflow tools they're used to. Why, then, should scientists use a bioinformatics-specific workflow manager? I have found three reasons for this separation:

  1. Differences in data type, shape and scale
  2. Differences in programs and tooling
  3. Community support behind bioinformatics workflow managers

First, which tools are used in bioinformatics and data engineering?

There are several popular bioinformatics workflow managers. A non-exhaustive list includes Nextflow, Snakemake, the Common Workflow Language (CWL), and the Workflow Description Language (WDL). These workflow managers all provide the necessary capabilities of data provenance, portability, scalability, and re-entrancy. For a more thorough review, see Wratten et al. (2021).

In data engineering, several graph-based workflow managers are used to run tasks based on a schedule or dependencies. These include Airflow, Flyte, Dagster and Prefect. These tools are simple to wire up to databases and other cloud compute services, and easily scale to manage millions of events.

Differences in data type, shape, and scale

In bioinformatics and data engineering, the type, shape, and size of data are different. Most genomic data is stored in large compressed text files, often reaching several gigabytes per sample. The total number of samples is often limited by the cost of sample collection and sequencing. Individual steps in a bioinformatics pipeline commonly take files as inputs and produce files as outputs. Each step can have different compute, memory, and disk space requirements. Databases are rarely used to store results.

In contrast, data engineering workflows may consist of processing millions of events from an application, transforming images from user input, or ingesting logs from across an application stack. Data is more likely to be stored in databases, individual processing steps may be simpler and better suited to serverless architecture, and total numbers of inputs and outputs may be higher.

In short, a bioinformatics workflow may process data from 1000 samples, where the input is compressed text files, each 4 GB in size. A data engineering workflow may process 20 million images, each 200 KB in size. The total amount of data flowing through the pipeline is the same (about 4 TB), but the needs for each use case can be drastically different.

Metric | Bioinformatics | Data Engineering
Size of data files | Large | Small
Data type | Compressed text, proprietary formats | Common formats (text, images, etc.)
Number of data files | Small | Large
Compute intensity per step | Medium to large | Small to medium
Store results in databases? | No | Yes

Differences in programs and tooling

Bioinformatics pipelines are often built by stringing together many command line tools. These tools may have different installation methods and incompatible dependencies. Bioinformatics workflow managers solve these problems by allowing for a separate environment definition or container in each step. Finally, analysis steps may be written in different scripting languages, such as Python, R, or MATLAB, all of which need to be accessible to the workflow manager.

In contrast, data engineering workflows are primarily written in a single language, which is used to define both the workflow structure and the data processing steps. For example, Dagster is written in Python and only has weak extension support for other languages.

Community support of bioinformatics-specific workflow managers

Another advantage of using a bioinformatics-specific workflow manager is the strong community that has been built around each of these tools. nf-core is the most active, but similar groups exist for Snakemake and CWL. In nf-core, you can join thousands of scientists working on similar problems, use pipelines developed and maintained by the community, and ask for help on GitHub or Slack. Even if the community-developed pipelines don't solve your problem exactly, they can be a great starting point for further customization. Science is all about standing on the shoulders of giants, so why should you re-implement a pipeline in Airflow when it already exists in nf-core?

An example bioinformatics workflow

The nf-core RNA-seq workflow is a community-developed pipeline for conducting all the steps of an RNA-seq analysis. Starting with raw sequencing data in the FASTQ file format, the data go through QC, alignment to the reference genome, quantification, and differential expression calculation. This pipeline has been developed over many years and has 3700+ commits on GitHub. The default workflow uses several different programs and has 20 steps – adopting this workflow is a guaranteed way to get results faster than writing everything from scratch.

The nf-core/rnaseq pipeline "metro map", summarizing the steps and tools in the default workflow.

What about scale?

Nextflow workflows should scale to millions of samples, as long as sufficient compute resources are available. For example, 23andMe uses Nextflow for processing genetic data from customers. However, bioinformatics workflow managers may not be the best choice when biological data shifts into the shape and scale typically managed by data engineering workflows. I'm thinking most concretely about Ginkgo Bioworks, which processes terabytes of sequencing data through their pipeline each day. The individual files processed are much smaller – jobs may take seconds to run instead of hours. Ginkgo eventually settled on a workflow composed of Airflow, Celery, and AWS Batch. Efficiency is paramount at this scale, and a whole data engineering team contributed to Ginkgo's solution. Most biotech companies and academic labs are better off using Nextflow or another bioinformatics-specific workflow manager, which can be deployed by a single scientist.

Where is the field headed?

After working in bioinformatics for 10 years now, I have a few ideas about where the field is headed. I’m open to being wrong on any of these predictions, let me know in the comments!

  • Bioinformatics-specific workflow managers will stick around for the foreseeable future. The most powerful argument for this is the activity and excitement in communities like nf-core.
  • Nextflow is the best choice for doing bioinformatics at scale in 2022.
  • Cloud is the future, but it’s still challenging to manage a team doing bioinformatics in the cloud.
    • A large part of this is that scientists are trained working on local computers or university-built HPC clusters. The tools to abstract away the complexity of cloud computing for scientists do not exist yet.
  • A more advanced and easier-to-use workflow manager will be developed that overtakes Nextflow in popularity and community support.
    • It will be written in Python, not a clunky DSL or an obscure language like Groovy.
    • It will natively support execution in multiple cloud environments, intelligent resource usage, and smooth logging and debugging.
    • It will have an optional graphical interface for pipeline design and monitoring.
    • It may have already been started, as Redun satisfies many of these criteria.

Conclusion

Computational biologists and bioinformaticians often use domain-specific workflow managers like Snakemake, Nextflow, and CWL. To someone with a data engineering background, this may be confusing, as well-developed and efficient workflow orchestration tools already exist. Digging deeper, the differences in data type/scale, tooling, and bioinformatics-specific communities reveal strong reasons for choosing a bioinformatics-specific workflow manager, even at the highest scale.

References

  1. Wratten, L., Wilm, A. & Göke, J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 1–8 (2021) doi:10.1038/s41592-021-01254-9.

 

Deploying redun to AWS Batch – troubleshooting

I recently went down the rabbit hole trying out the newest bioinformatics workflow manager, redun. While installation and running workflows locally went off without a hitch, I experienced some trouble getting jobs deployed to AWS Batch. Here’s a list of my troubleshooting steps, in case you experience the same issues. To start, I followed the instructions for the “05_aws_batch” example workflow.

I was deploying the workflow on my AWS account at Loyal. This may change if you’re using a new AWS account, or have different security policies in place.

Building docker images

By default, the Docker daemon socket is only accessible by root, so building and pushing images requires "sudo" before every command. You can fix this with the command sudo chmod 666 /var/run/docker.sock

Or see the longer fix in this stack overflow post.

Submitting jobs to AWS Batch

I experienced the following error when submitting jobs to AWS Batch:

upload failed: - to s3://MY-BUCKET/redun/jobs/ca27a7f20526225015b01b231bd0f1eeb0e6c7d8/status
An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

I thought this was due to an error in the “role” setting, and that was correct. I first tried using the generic role

arn:aws:iam::ACCOUNT-ID:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch

but that didn’t work.

I then added a custom IAM role to AWS with S3, EC2, ECS and Batch permissions. I also attached the following trust policy, which allows ECS tasks to assume the role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

And then everything worked as expected.

ECS unable to assume role

I heard from someone else trying redun for the first time that they were able to get the Batch submission working with the (similar) instructions in this stack overflow post.

I hope this helps anyone trying to deploy redun to AWS Batch for the first time!

Trying out redun – the newest workflow manager on the block

Workflow managers form the cornerstone of a modern bioinformatics stack. By enabling data provenance, portability, scalability, and re-entrancy, workflow managers accelerate the discovery process in any computational biology task. There are many workflow managers available to choose from (a community-sourced list holds over 300): Snakemake, Nextflow, WDL… each has its relative strengths and drawbacks.

The engineering team at Insitro saw all the existing workflow managers, and then decided to invest in building their own: redun. Why? The motivation and influences docs pages lay out many of the reasons. In short, the team wanted a workflow manager written in Python that didn’t require expressing pipelines as dataflows.

I spent a few days trying out redun – working through the examples and writing some small workflows of my own. I really like the project and the energy of open source development behind it. I’m not at the point where I’m going to re-write all of my Nextflow pipelines in redun, but I’m starting to consider the benefits of doing so.
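
For a flavor of what those small workflows look like, here's a toy sketch in the spirit of the tutorial examples (the task bodies and file names are my own placeholders, not code from the redun repo):

# workflow.py -- a toy redun workflow, not a production pipeline.
from redun import task

redun_namespace = "demo"

@task()
def count_lines(path: str) -> int:
    with open(path) as f:
        return sum(1 for _ in f)

@task()
def total(counts: list) -> int:
    return sum(counts)

@task()
def main(paths: list = ["a.txt", "b.txt"]) -> int:
    # Calling a task returns a lazy expression; redun builds the dependency
    # graph, executes it on the configured backend, and caches the results.
    return total([count_lines(p) for p in paths])

Running redun run workflow.py main executes the graph and records the call history for provenance.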

The positives I immediately noticed about redun include:

  • redun is Python. Not having to learn a domain-specific language is a huge advantage.
  • The ability to execute sub-workflows with a single command. This is helpful if you want to enter a workflow with an intermediate file type.
  • I can see redun working as a centralized way to track workflow execution and file provenance within a lab or company.
  • There are several options for the execution backend, and redun is easy to deploy to AWS Batch (with some tweaks).
  • The tutorial and example workflows were helpful for demonstrating the key concepts.

A few drawbacks, as well:

  • There hasn’t been much investment in observability or execution tracking. Compared to Nextflow Tower and other tools, redun is in the last century.
  • Similarly, there isn’t yet much community investment in redun, like there is in nf-core.
  • While redun is extremely flexible, I bet it will be more challenging for scientists to learn than Snakemake.

There will certainly be other items to add to these lists as I get more familiar with redun. For now, it’s fair to say I’m impressed, and I want to write more pipelines in redun!

Rare transmission of commensal and pathogenic bacteria in the gut microbiome of hospitalized adults (2)

When we last left off, I was peering into the -80 freezer at the hundreds of stool samples I would need to analyze. In reality, a lot of experimental design work came on this project before I ever opened up the freezer!

Designing a good experiment was one of the most important things I learned in grad school. Science is already hard enough – you need to set yourself up for success from the beginning by designing a good experiment, whether it’s wet lab or computational. I like to think about what success in this project would look like, and work backwards from success to understand the data I need to collect.

To convincingly prove that a bacterium had transmitted from the microbiome of one patient to the microbiome of another, I needed the following pieces of evidence:

  1. At a given point in time, the bacterial genome was present in the microbiome of the source patient and undetectable in the microbiome of the recipient.
  2. At a future point in time, the bacterial genome was present in the microbiome of the recipient patient, and ideally persisted for multiple future time points.

Through Stanford Hospital, I also had access to a dataset of each patient’s room history. From this, I could find when two patients were roommates. Mapping the overlapping intervals, combined with the list of samples biobanked from each patient, was a challenging data science problem. It took me about a month of work to design an experiment that would give me the best chance of observing patient-patient microbiome transmission, if it was happening.
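
At its core, the roommate mapping was an interval-overlap problem: for every pair of patients, find the windows where their room assignments overlapped in time. A stripped-down sketch of the idea (the column names and example data here are made up, and the real room-history data had many more wrinkles like transfers and re-admissions):

import pandas as pd

# Toy room-history table; one row per patient stay in a room.
stays = pd.DataFrame({
    "patient": ["A", "B", "C"],
    "room":    [12, 12, 14],
    "start":   pd.to_datetime(["2019-01-01", "2019-01-05", "2019-01-03"]),
    "end":     pd.to_datetime(["2019-01-10", "2019-01-20", "2019-01-08"]),
})

# Pair up stays in the same room, then keep pairs whose date ranges overlap.
pairs = stays.merge(stays, on="room", suffixes=("_1", "_2"))
pairs = pairs[pairs["patient_1"] < pairs["patient_2"]]
roommates = pairs[(pairs["start_1"] <= pairs["end_2"]) &
                  (pairs["start_2"] <= pairs["end_1"])]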

The wet lab work for this project was long and monotonous. You can read about it in the methods section of the paper, but we did DNA extraction and 10X Genomics linked read sequencing on all of the new samples.

When the new data came back, it was time to get cracking! The processing pipeline and data analysis I had planned would take too long to run on Stanford’s HPC cluster, so I turned to Google Cloud to get everything done with quick parallelization. The process of getting our workflows to run at scale in the cloud was certainly a learning experience, and I wrote a blog post about the effort (two years ago).

After assembling bacterial genomes from hundreds of microbiome samples, comparing strain-level populations with inStrain, and generating massive matrices comparing all sets of genomes in my samples, the true data analysis began. A few key lessons from the data analysis and writing experience have stuck with me, and the challenges made me a better scientist.

  1. Scrutinize your results! When I initially looked for identical bacterial genomes in samples from different patients, I found many “transmission events” that were simply the results of barcode swapping (when samples sequenced on an Illumina machine at the same time experience a small degree of contamination). I was prepared for this outcome, and developed a method to quantify when identical genomes were likely the result of barcode swapping in the linked read data.
  2. Carefully evaluate negative findings. After eliminating all the likely false positive results, I found very few identical genomes shared between patients, and especially few antibiotic-resistant pathogens. At first, this was an upsetting result. I was really hoping to find lots of transmission between patients who were roommates! However, the lack of pathogen transmission findings allowed me to focus on the potentially more interesting cases of commensal bacteria transmitted between patients. The "negative" finding here turned out to make a more interesting story.

 

Rare transmission of commensal and pathogenic bacteria in the gut microbiome of hospitalized adults (1)

My final project with the Bhatt Lab is now published! You can find the open access text at Nature Communications. I’m excited to bring this chapter of my research career to a close. The paper contains the full scientific results; here I’ll detail some of the journey and challenges along the way.

Hot off the success of my previous work studying mother-infant transmission of phages in the microbiome, I was eager to characterize other examples of transmission between human microbiomes. While mother-infant transmission of both bacteria and phages was now understood, microbiome transmission between adults was less clear. There were some hints of it happening in the literature, but nobody had characterized the phenomenon at a level of genomic detail that I found convincing. I'm also not counting FMT as transmission here – while it certainly results in the transfer of microbiome components from donor to recipient, I was more interested in characterizing how this phenomenon happened naturally.

In our lab, we have a stool sample biobank from patients undergoing hematopoietic cell transplantation (HCT). We’ve been collecting weekly stool samples from patients undergoing transplant at Stanford Hospital, and to date we have thousands of samples from about one thousand patients. HCT patients are prime candidates to study gut-gut bacterial transmission, due to a few key factors:

  1. Long hospital stays. The conditioning, transplant and recovery process can leave a patient hospitalized for up to months at a time. The long stays provide many opportunities for transmission to occur and many longitudinal samples for us to analyze.
  2. Roommates when recovering from transplant. At Stanford Hospital, patients were placed in double occupancy rooms when there were not active contact precautions. These periods of roommate overlap could provide an increased chance for patient-patient transmission.
  3. Frequent antibiotic use. HCT patients are prescribed antibiotics both prophylactically and in response to infection. These antibiotics kill the natural colonizers of the gut microbiome, allowing antibiotic resistant pathogens to dominate, which may be more likely to be transmitted between patients. Antibiotic use may also empty the niche occupied by certain bacteria and make it more likely for new colonizers to engraft long-term.
  4. High burden of infection. HCT patients frequently have potentially life-threatening infections, and the causal bacteria can originate in the gut microbiome. However, it’s currently unknown where these antibiotic resistant bacteria originate from in the first place. Could transmission from another patient be responsible?

As we thought more about the cases of infection that were caused by gut-bloodstream transmission, we identified three possibilities:

  1. The microbes existed in the patient’s microbiome prior to entering the hospital for HCT. Then, due to antibiotic use and chemotherapy, these microbes could come to dominate the gut community.
  2. Patients acquired the microbe from the hospital environment. Many of the pathogens we're interested in are Hospital Acquired Infections (HAIs) and are known to persist for long periods of time on hospital surfaces, in sinks, etc.
  3. Patients acquired the microbe via transmission from another patient. This was the most interesting possibility to us, as it would indicate direct gut-gut transmission.

While it’s likely that all three are responsible to some degree, finding evidence for (3) would have been the most interesting to us. Identifying patient-patient microbiome transmission would be both a slam dunk for my research, and would potentially help prevent infections in this patient population. With the clear goal in mind, I opened the door of the -80 freezer to pull out the hundreds of stool samples I would need to analyze…

More to come in part 2!

 

 

Moving into aging research – in dogs!

P – H – Done

As I finish up my PhD at Stanford and consider my next career moves, I’m positive I want to work at a small and rapidly growing biotech startup. After many interviews and some serious introspection, I settled on working at Loyal, a biotech company dedicated to extending the lifespan of dogs by developing therapeutics. It seems like a crazy idea at first, but the core thesis of doing aging research in companion canines makes a lot of sense.

I believe the aging field is at an inflection point – it's where microbiome research was 10 years ago. Back then, 16S rRNA sequencing was the state of the art, and the only question researchers were commonly asking of microbial communities was "who's there." We've since come to appreciate the ecological complexity of the microbiome, developed new genomic ways to study the identities and functions of its members, and engineered microbiome therapeutics that are starting to show signs of efficacy.

At the core of the aging thesis is the idea that aging is a disease. After all, age is the largest risk factor for death, cancer, dementia, etc. Re-framing aging as a disease allows for completely new investigations, but will not be easy from a regulatory perspective.

Lifespan vs healthspan

“Why would you want to extend the number of years someone is sick at the end of their life?”

This question is frequently asked by those unfamiliar with aging research. However, I don’t believe many in the field have a desire to prolong an unhealthy end of life. Extension of lifespan is not valuable if the extra years are not lived well. Many researchers are interested in healthspan, the number of years lived in a good state of health. One way to picture this is to imagine a “rectangularization” of the survival curve. A drug that prolongs the number of years lived in good health would be very valuable, even if it had no impact on life expectancy.

Rectangularization of the survival curve – The lines should both be the same height to start, but you get the idea.

What about the ethical implications?

News about advancements in aging research is often accompanied by fear: "won't this just make rich people live longer?" After all, immortality has been a quest for millennia. I don't buy into many of these criticisms, for a few reasons. First, lifespan is already very stratified by income, and the wealthiest individuals already have access to advanced therapies and care that others lack. Second, advances in lifespan and healthspan are likely to be slow. No immortality drug will be developed overnight. Third, many researchers are working to develop drugs for aging that are cheap and commoditized. The CEO of Loyal, Celine Halioua, has written about this at length.

I’m not new to the aging field!

Back in my undergrad research at Brown, I worked in Nicola Neretti's lab, which was focused on the genetic and epigenetic pathways of aging. The main paper I contributed to in undergrad studied the chromatin organization of cells as they progressed into senescence – a cellular version of aging. It's great to be back!

What’s going on at Loyal?

I'll be working on everything related to genomics and bioinformatics for dogs. This means sequencing blood and saliva samples from our laboratory and companion animals, quantifying aging at the genetic and epigenetic level, building better epigenetic clocks, and researching the breed-specific epigenetic changes that accompany aging in certain dogs. It's exciting and fast paced. And we're hiring more! Whether your background is in aging science, vet med, computer science, or business operations, we need talented people. Drop me a line if you want to talk more.

Tail risk hedging – replication of the VXTH index

In my last post about hedging a portfolio with options, I looked at how a complicated 4-option spread could replicate the VIX index and hedge against market volatility. Now, we’re going to look at a simpler, explicit “tail risk” hedge using VIX calls. This strategy is based on the VXTH index (VIX Tail Hedge), which buys 30 delta VIX calls with 1% of the portfolio when volatility is low, and allocates the rest into the SPX index. Looking at the performance of the index below, three things are immediately clear:

  1. VXTH did well, but not stellar, in 2008-2009
  2. VXTH slightly underperformed the benchmark during the bull market of 2010-2020
  3. VXTH absolutely skyrocketed during the COVID crash of 2020. I think this played right into the strengths of the hedging program: a rapid VIX spike, followed by quick recovery of SPX.

We’re going to look at replicating the VXTH index and extending the methodology to other portfolios, including a leveraged ETF portfolio holding UPRO and TMF.

 

Equity curves of VXTH (green) compared to SPX (black) from 2006-2020.

How does VXTH work?

The methodology is simple. Each month, look at the front-month VIX futures contract and decide how much to allocate to the hedge. With the specified fraction of the portfolio, buy 30 delta VIX calls with one month to expiration.

VIX future value (X) | Portfolio allocation to VIX calls
X <= 15 | 0%
15 < X <= 30 | 1%
30 < X <= 50 | 0.5%
X > 50 | 0%

N.B. The phrase “forward value of VIX” on the CBOE website is strange and doesn’t have an explicit meaning (at least to me). I confirmed the index is looking at the front month VIX future rather than spot VIX by examining the trade log on the CBOE website.
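
In code, the allocation rule is just a lookup against the table above. A minimal sketch (my reading of the CBOE methodology, not their reference implementation):

def vxth_hedge_allocation(front_month_vix_future: float) -> float:
    """Fraction of the portfolio spent on one-month, 30-delta VIX calls."""
    x = front_month_vix_future
    if x <= 15:
        return 0.0    # vol very low: no hedge
    elif x <= 30:
        return 0.01   # "normal" vol: full 1% allocation
    elif x <= 50:
        return 0.005  # elevated vol: calls are expensive, halve the hedge
    else:
        return 0.0    # panic levels: the crash is already priced in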

Why hedge with VIX calls?

I think the main reason for using VIX calls as a tail risk hedge is the convexity embedded in the option. In times of low vol, the calls are cheap, and a 1% allocation can buy your portfolio many, many OTM calls. But when tail risks come to fruition and VIX spikes like it did in March 2020, the value of the options goes parabolic. If you have the hedge on before everyone else in the market is trying to hedge, you're in a great position. VIX options are also very liquid in a crisis, at times when other instruments can be illiquid and difficult to unwind for big positions.

Replicating the VXTH index

Similar to the last post, I obtained VIX option data from IvyDB and historicaloptiondata.com. /VX prices were obtained from the Quandl continuous futures dataset. Backtesting was done with a custom R program. Option transactions occur at the midpoint of the bid/ask spread and have no transaction costs (big caveat here!). I first replicated VXTH, and equity curves are below. However, I’m still experiencing some tracking error compared to the benchmark, especially in 2020. I think this could be due to differences in my price data or timing luck (see the future directions section). Still, the VXTH replication captures most of the movement of the benchmark and has no drawdown in March 2020.

Equity curves for my replicated VXTH (red) compared to the benchmarks.

Extension to a UPRO/TMF portfolio

How does adding a VIX call hedge deal with the added volatility of a leveraged portfolio? Quite well! Using the same parameters and a portfolio of 55% UPRO, 45% TMF, the equity curves are below. The outperformance in 2020 isn’t very visible on the log scale, but the VIX call hedged portfolio ends the backtest with a 30% higher balance. The stats on the hedged portfolio are also excellent – improved total and risk-adjusted return, and a comparable drawdown to holding SPX alone. So far, this looks really good!

Equity curves of hedged UPRO/TMF portfolio compared to benchmarks

Metric | SPX | VXTH (benchmark) | VXTH (replicated) | UPRO/TMF | UPRO/TMF + VXTH
CAGR (%) | 7.49 | 12.2 | 8.8 | 17.1 | 19.2
Sharpe ratio (annualized) | 0.49 | 0.67 | 0.66 | 0.70 | 0.87
StdDev (annualized, %) | 15.2 | 18.3 | 13.5 | 25.0 | 22.4
Worst drawdown | 52.5% | 37.4% | 35.1% | 70.9% | 57.2%

Conclusions

Adding a small, constant allocation to VIX calls can improve the absolute and risk-adjusted returns of a portfolio of stocks or leveraged stocks/bonds, at least in the period I backtested. This method is relatively simple compared to the 4 option method I tested in the last post, and only requires management once per month, which can coincide with a monthly portfolio rebalance. There are a few optimizations I want to test before running this method live. I also need to include transaction costs and slippage into my model.

Future directions

I noticed some timing luck in replicating VXTH, specifically around the COVID crash. Slightly changing the days to expiration of the calls would result in very different outcomes, because the VIX calls could be held through the entire crash instead of sold at the "right" time. I think that's part of why VXTH did so well in March – the VIX peak was right at an option expiration, so the position was exited at just the right time. Ideally we'd strive to eliminate this timing luck from a portfolio. I can see a few ways to do this, which I'll think about implementing in my backtests:

  1. Instead of holding to expiration, positions should be dynamically opened or closed when VIX crosses one of the allocation thresholds.
  2. Holding a “ladder” of calls with different expirations to reduce the effect of timing.
  3. Daily rebalancing (probably not a good idea in practice because of transaction costs).

I want to optimize some other parameters, while being wary of the possibility of overfitting to the relatively few “tail risk” events that have happened in my dataset.

  1. Allocation amounts (probably more hedge is better with the leveraged portfolio)
  2. Hedge thresholds. Analyzing the transition matrix from one VIX state to the next may help with this.
  3. Option delta. Lower delta options will give you more convexity when the rare crashes happen, but you may not benefit from small VIX spikes.

 

Volatility as an asset class – replication of Doran (2020) and extension to a leveraged risk-parity portfolio

Introduction

This post is going to be a departure from the usual genomics tilt of this blog. I've recently been interested in the science (art?) of hedging a stock portfolio against market downturns. Hedging is difficult: it involves selecting the right asset class, the right allocation (hold too much of the hedge and you underperform in all markets), and the right time to remove the hedge (ideally at the bottom of a correction). If the VIX (CBOE Volatility Index) were directly investable, holding it as an asset in a portfolio would provide a significant edge. However, you cannot directly "buy" the VIX, and tradable VIX products (like VXX, UVXY, etc.) have notable underperformance when used as a hedge (Bašta and Molnár, 2019).

A paper by James Doran (2020) proposed that a portfolio of SPX options that is highly correlated to the VIX could be held as a long-term hedge. The portfolio buys an ITM-OTM put spread and sells an ATM-OTM call spread when the VIX is at normal values, and does not hedge when the VIX is above its mean plus one standard deviation. In this way the portfolio systematically removes the hedge when vol is the most expensive and therefore more likely to revert to the mean. For example, if SPX were at 3800 and VIX were at normal levels, the portfolio would allocate 1% to the following option spread with one month to expiration. The payoff with SPX at various levels at expiration is shown below. Importantly, this spread has positive theta, and only begins to lose if SPX closes above 3850.

        ITM/OTM %   Put/Call   Strike
Buy     5% ITM      Put        3990
Sell    5% OTM      Put        3610
Sell    ATM         Call       3800
Buy     5% OTM      Call       3990

P/L of the option spread at expiration. Cost = $8,710, max gain = $29,290, max loss = $27,710.
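
To sanity-check those numbers, here is a minimal R sketch of the position’s P/L at expiration. The $8,710 net debit is taken from the option prices in my dataset rather than computed; the strikes and the $100 SPX multiplier follow from the table above.

    # P/L at expiration of the spread above (SPX at 3800 at entry).
    spread_pl <- function(S, net_debit = 8710, mult = 100) {
      payoff <- pmax(3990 - S, 0) -   # long  5% ITM put,  strike 3990
                pmax(3610 - S, 0) -   # short 5% OTM put,  strike 3610
                pmax(S - 3800, 0) +   # short ATM call,    strike 3800
                pmax(S - 3990, 0)     # long  5% OTM call, strike 3990
      payoff * mult - net_debit
    }

    spread_pl(c(3400, 3610, 3852, 3990, 4200))
    # $29,290 at or below 3610, roughly break-even near 3851, -$27,710 at or above 3990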

I was interested in replicating the results of this paper, extending the findings to the end of 2020 (the paper stops in 2017), and finding if the option portfolio would hedge a leveraged stock portfolio holding UPRO (3X leveraged S&P500).

Step 0: Obtain data, write backtest code

Option data: I obtained end-of-day option prices for the SPX index from Stanford’s subscription to OptionMetrics for 1996-2019. 2020 data were purchased from historicaloptiondata.com.

Extended UPRO and TMF data: These products began trading in 2009, but we definitely want to include the early-2000s dotcom crash and the 2008 financial crisis in our backtests. Someone on the Bogleheads forum simulated the funds going back to 1986, and they’re available here.

Backtesting: I wrote a simple program to backtest an option portfolio in R. This program buys a 30 DTE spread as described above and typically holds to expiration. When VIX is low, a fixed percentage of the portfolio value is placed into the option portion during each rebalance, which occurs when the options expire. When VIX is high (above mean plus one standard deviation), the portfolio only holds the base asset class. If VIX transitions from low to high, the hedge is immediately abandoned, and if VIX transitions from high to low, the hedge is repurchased.
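
The regime logic is simple enough to sketch. This isn’t my actual backtest code; vix, vix_threshold (the VIX mean plus one standard deviation), and hedge_pct are stand-ins for the quantities described above.

    # Allocation decision at each rebalance (and on VIX regime changes):
    # hold the option spread only while VIX is below its mean + 1 sd.
    rebalance <- function(portfolio_value, vix, vix_threshold, hedge_pct = 0.05) {
      if (vix < vix_threshold) {
        hedge <- portfolio_value * hedge_pct   # buy the 30 DTE spread
      } else {
        hedge <- 0                             # vol is rich: hold only the base asset
      }
      list(base_asset = portfolio_value - hedge, option_hedge = hedge)
    }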

Step 1: replicate the results of Doran (2020) with the SPX index

To ensure the option backtest works as expected, I first replicated the results from the Doran paper using the SPX index. I allocated a fixed 5% to the hedge. I found performance was improved by using options 10% ITM or OTM, so these were used in all backtests. Below are the returns of these portfolios from 1996-2020, starting with $100,000. Although the hedge does well in negative markets, the underperformance in the bull market of the last 10 years is quite apparent. The hedge also didn’t protect much against the rapid COVID crash in March 2020 – I think because VIX spiked very quickly and the portfolio wasn’t hedged for much of the crash. My results don’t exactly match those in the paper (even when using a 5% spread width); I suspect differences in the option prices, especially early in the dataset, play a role.

Equity curves for option-hedged SPX portfolios. SPX: unhedged. OPT: always hedged 5%. OPTsd: hedged 5% when VIX is below the mean plus one standard deviation.

                            SPX     OPT     OPTsd
CAGR (%)                    7.49    2.91    7.08
Sharpe ratio (annualized)   0.48    0.39    0.64
Std. dev. (annualized, %)   15.3    7.71    11.23
Worst drawdown (%)          52.5    35.2    41.2

Step 2: extend the option hedge to a portfolio holding UPRO

How does the hedge work with the 3X leveraged fund UPRO? I conducted the same backtest and found that a 10% allocation to the hedge is better. This makes sense – you need more of the hedge to balance out the extreme swings in UPRO. Hedged performance is definitely better than holding UPRO alone, which has pathetic stats over this time period. The hedged portfolio also beats holding SPX alone on returns, but with more variance and an equivalent Sharpe ratio. Holding the VIX as an asset is still the winner here.

Equity curves for option-hedged UPRO portfolios. SPX: unhedged. UPRO: unhedged. UPROvixsd: holding VIX as the hedge when VIX is low. UPROoptsd: holding the option hedge when VIX is low.

                            SPX     UPRO    UPROoptsd   UPROvixsd
CAGR (%)                    7.49    9.71    15.1        21.6
Sharpe ratio (annualized)   0.48    0.20    0.49        0.53
Std. dev. (annualized, %)   15.3    46.8    31.6        40.5
Worst drawdown (%)          52.5    97.4    87.7        91.7

Comparison to a UPRO/TMF portfolio

The option-hedged portfolio needs to outperform a 55/45 UPRO/TMF portfolio for me to consider running it for real. I used portfoliovisualizer.com to compare these portfolios with monthly rebalancing.

Portfolio 1 (blue): UPROoptsd. Portfolio 2 (red): UPRO/TMF 55/45. Portfolio 3 (yellow): UPRO/VIX 70/30.

The returns with TMF have less variance than the option-hedged portfolio and end up at almost exactly the same value at the end of this period. However, from 1996-2008 the option portfolio definitely outperformed. Holding VIX is again the clear winner in both absolute and risk-adjusted returns, but it still suffers severe drawdowns.

Conclusions

I don’t think holding this portfolio will provide a significant advantage over a UPRO/TMF portfolio. Given the limitations below and the lack of a clear edge in the backtest, I won’t be voting with my wallet. The option-hedged portfolio did shine in the 1996-2008 period, where it outperformed all other portfolios (even the optimal 70/30 UPRO/VIX!) with a Sharpe ratio of 1.01 and a max drawdown of 47% in the dotcom crash. I may paper-trade this strategy to get a feel for position sizing, slippage, and fills on these spreads, though.

Limitations: Why I won’t be hedging with this method

  1. This model assumes all transactions occur at the midpoint of the bid-ask spread and does not account for transaction costs. While transaction costs are relatively small, SPX and XSP options can have bid-ask spreads much wider than SPY’s.
  2. Options can be illiquid, only purchased in fixed quantities, and difficult to adjust. Today, with SPX at 3750, buying one SPX 30 DTE 5% ITM-OTM put spread costs $16,100. Adding the call spread brings the cost down to $9,340 but raises the max loss of the position to $27,340! Trading XSP instead brings the cost down by a factor of 10. With a 1% hedge, this method is only practical for portfolios above roughly $100k; at a 5% hedge it can be used on a portfolio as small as $20k (see the arithmetic sketch after this list). Still, what do you do when the optimal amount of hedge is 1.5 XSP contracts?
  3. It’s more complicated than simply rebalancing between UPRO and TMF, requiring more active management time.
  4. The option hedge didn’t even outperform UPRO/TMF in some regards!
  5. Backtests are only backward-looking and easy to overfit to your problem.
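
The portfolio-size numbers in point 2 are just the spread cost divided by the hedge fraction; a quick sketch using the quoted prices:

    # Smallest portfolio that can hold exactly one spread at a given hedge fraction.
    spx_spread_cost <- 9340                  # put spread + call spread, quoted above
    xsp_spread_cost <- spx_spread_cost / 10  # XSP contracts are 1/10 the size of SPX
    xsp_spread_cost / 0.01  # ~ $93,400 portfolio needed for a 1% hedge
    xsp_spread_cost / 0.05  # ~ $18,700 portfolio needed for a 5% hedge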

Future directions to explore

  1. Optimal hedge amount – I didn’t optimize this scientifically; I just tried a few values and chose based on returns and Sharpe ratio.
  2. Different DTE for position opening and closing. Buying at 30 days and holding to expiration may not be optimal.
  3. Selecting strikes based on delta instead of a fixed percentage ITM/OTM. This would select different strikes in low- and high-vol regimes, but probably has a minimal impact.
  4. The max loss of these spreads can be quite high compared to the cost to enter the trade – maybe the hedge amount should be scaled by the max loss of the position, with the remainder invested in the base asset or held in cash (a sketch of this sizing rule follows this list).
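
One possible way to implement idea 4, using the example spread’s numbers; risk_budget is a made-up parameter for the fraction of the portfolio you are willing to lose if the spread goes to max loss.

    # Size the hedge by its worst case rather than by its cost.
    # Example numbers from above: $9,340 debit, $27,340 max loss per SPX contract.
    contracts_by_max_loss <- function(portfolio_value, risk_budget = 0.05,
                                      max_loss_per_contract = 27340) {
      floor(portfolio_value * risk_budget / max_loss_per_contract)
    }
    contracts_by_max_loss(1e6)  # 1 contract keeps the worst case under 5% of $1M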

Questions? Other ideas to test? Let me know! I’ll also happily release returns or code (it’s not pretty) if you are interested.

References:
1. Doran, J. S. Volatility as an asset class: Holding VIX in a portfolio. Journal of Futures Markets 40, 841–859 (2020).
2. Ayres, I. & Nalebuff, B. J. Life-Cycle Investing and Leverage: Buying Stock on Margin Can Reduce Retirement Risk. https://papers.ssrn.com/abstract=1149340 (2008).
3. Ayres, I. & Nalebuff, B. J. Diversification Across Time. https://papers.ssrn.com/abstract=1687272 (2010).
4. Bašta, M. & Molnár, P. Long-term dynamics of the VIX index and its tradable counterpart VXX. Journal of Futures Markets 39, 322–341 (2019).

Leveraged portfolio background

The leveraged portfolio idea comes from the famous “HEDGEFUNDIE’s excellent adventure” thread on the Bogleheads forum (thread 1, thread 2), with ideas going back to “lifecycle investing” and “diversification across time” from Ayres and Nalebuff (2008, 2010). Basically, it makes sense to use leverage to obtain higher investment returns when you’re young and expect higher earnings in the future. You can do this with margin, futures, LEAPS options, or leveraged index funds. The leveraged funds appear to be the easiest way to obtain consistent and cheap leverage without risk of a margin call. The portfolio holds 55% UPRO and 45% TMF (3X bonds) and typically rebalances monthly. I’ve also thrown some TQQQ (3X leveraged NASDAQ) into the mix. These portfolios outperform 100% stocks or an unleveraged 60/40 portfolio on both an absolute and a risk-adjusted return basis. However, if you could hold VIX as an asset to rebalance out of, performance would be even better. Hence my interest in replicating a VIX hedge with options.