Reducing RNA-seq batch effects by re-aligning TCGA and GTEx

At Pattern, we want to be able to compare RNA-seq data between several public sources and our growing internal datasets. We care a lot about the differential expression of certain genes, but batch effects can completely overwhelm any signal of genes overexpressed in cancer, genes changing between cancer subtypes, and similar comparisons.

A few publications [1,2] suggest that processing raw, read-level data from different public sources with a consistent bioinformatics pipeline can reduce the batch effects in RNA-seq. That makes sense — I can imagine the batch effect is the sum of effects from cohort selection, sample collection, sample processing, library preparation, sequencing, and bioinformatics methods. Eliminating the contribution of bioinformatics to this equation can surely reduce the overall batch effect, but is it enough to be noticeable? Is it worth the extra computational effort?

Several months ago, I began the journey to consistently process all the samples from The Cancer Genome Atlas (TCGA, the largest public collection of RNA-seq data from different cancers), the Genotype-Tissue Expression project (GTEx, the largest collection of RNA-seq data from many tissues of healthy individuals), and our internal collection of several hundred tumor and normal RNA-seq samples. The total was over 30,000 samples and 200TB of raw data. I used the nf-core/rnaseq pipeline, since that’s what I already ran for internal data. The challenges I encountered along the way are the subject of this post.

If you only make it this far, here is the takeaway: it is worth re-processing RNA-seq data whenever raw reads are available. The reduction in batch effects, measured as the decrease in distance between dataset centroids in PCA space, was larger than I expected. I’ve started always going back to raw reads for public RNA-seq data when they are available.

Challenge 1: Getting access to the raw data.
Raw, read-level data for TCGA and GTEx is only available after an application to the NIH because it contains information on germline genetic variants. Yes, you can get access to controlled data in industry. This is not clear from the NIH dbGaP documentation, but all it takes is one person at the company registering in eRA Commons as the Signing Official and you registering as the PI. Then complete the application for each dataset, correct the mistakes you will inevitably make, and wait a few weeks.

Challenge 2: Where are you going to run all these alignments?
I thought about running these 30k samples on AWS, but even with spot instances, the total cost would have been $20k-$30k. Instead, I chose to buy a few servers for about the same price. Now that the project is over, I still have the hardware, and my cloud bill stayed sane. The workhorse of the project was a dual AMD EPYC 9654 machine (192 cores, 384 threads) with 1.5TB of RAM and 30TB of local NVMe storage. It’s networked at 100Gb/s to a storage server with 100TB of NVMe. This is a subset of the hardware build-out I did when we moved to our new office, which should be the topic of a separate post.

Challenge 3: Downloading 200TB of raw fastq files.
Downloading TCGA controlled access data with the gdc-client tool works well. GTEx on the other hand… is another story. Despite the fact that our taxpayer dollars paid for every aspect of the GTEx project, the GTEx raw data is only available in Google Cloud Storage. I have to pay Google for permission to use the sequencing data that I already paid to collect! It would cost something like $12k to download the entirety of GTEx so I could process it on my new servers, which were sitting idle after enthusiastically consuming all of the TCGA data with no sign of slowing down. If only there was a better way…
If you read the Google Cloud Storage docs closely, you’ll notice something in the fine print. Egress from Google Cloud to Google Drive is FREE. And downloading from Google Drive is never charged. We already pay for 10s of TB of space in Drive through our Google Workspace subscription. The path emerged:
Google Cloud Storage -> Google Cloud VM -> Google Drive -> Local servers. Zero egress charges.
The only catch is a 750GB-per-day upload limit to Google Drive for each user, but each service account counts as its own user for the purposes of this limit. I had a path forward! One last detail: both TCGA and GTEx provide aligned BAM files rather than fastqs, but the raw reads can be extracted with samtools fastq. The GTEx reads required sorting before alignment.
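As a rough sketch of that extraction step (the helper function and file paths are hypothetical, the exact options depend on how each BAM was produced, and it assumes samtools is on the PATH with paired-end input), collating by read name before conversion keeps mates together:

import subprocess
from pathlib import Path

def bam_to_fastq(bam: Path, out_dir: Path) -> None:
    """Hypothetical helper: collate a BAM by read name, then extract paired fastqs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    r1 = out_dir / f"{bam.stem}_R1.fastq.gz"
    r2 = out_dir / f"{bam.stem}_R2.fastq.gz"

    # Group read pairs together without a full sort
    collate = subprocess.Popen(
        ["samtools", "collate", "-u", "-O", str(bam)],
        stdout=subprocess.PIPE,
    )
    # Write proper pairs to R1/R2; discard singletons and other reads
    subprocess.run(
        ["samtools", "fastq",
         "-1", str(r1), "-2", str(r2),
         "-0", "/dev/null", "-s", "/dev/null", "-n", "-"],
        stdin=collate.stdout,
        check=True,
    )
    collate.stdout.close()
    if collate.wait() != 0:
        raise RuntimeError(f"samtools collate failed for {bam}")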

Challenge 4: Actually running 30k samples through nf-core/rnaseq
I had to work in batches due to the large volume of temporary files generated by the nf-core/rnaseq pipeline. Processing everything at once would have generated 2PB in temporary files and results! I set up a script (sketched below) to launch batches of 500 samples at a time, upload the results I cared about to AWS, and delete everything else when the batch was complete.
I did some minor tuning of the pipeline, including disabling tools I didn’t need and changing the job resource requirements to better match my hardware. It took about 45 days of continuous runtime to process all the samples.
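The batching script was shaped roughly like the sketch below. It is a simplified outline rather than the exact script I ran: the sample sheet, bucket name, and batch bookkeeping are placeholders, and it assumes nextflow and the AWS CLI are already installed and authenticated.

import shutil
import subprocess
from pathlib import Path

import pandas as pd

BATCH_SIZE = 500
samples = pd.read_csv("all_samples.csv")  # full nf-core/rnaseq sample sheet

for i, start in enumerate(range(0, len(samples), BATCH_SIZE)):
    batch_dir = Path(f"batch_{i:03d}")
    batch_dir.mkdir(exist_ok=True)
    sheet = batch_dir / "samplesheet.csv"
    samples.iloc[start:start + BATCH_SIZE].to_csv(sheet, index=False)

    # Run the pipeline on this batch only, with a per-batch work directory
    subprocess.run(
        ["nextflow", "run", "nf-core/rnaseq",
         "-profile", "docker",
         "-work-dir", str(batch_dir / "work"),
         "--input", str(sheet),
         "--outdir", str(batch_dir / "results")],
        check=True,
    )

    # Upload the results, then delete the batch directory to reclaim local disk
    subprocess.run(
        ["aws", "s3", "sync",
         str(batch_dir / "results"),
         f"s3://my-results-bucket/rnaseq/batch_{i:03d}/"],
        check=True,
    )
    shutil.rmtree(batch_dir)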

Results

I looked at groups of samples from matched organs across the different datasets to quantify the batch effects before and after re-processing. For GTEx, nTPM (normalized transcripts per million) normalization was done across the entire collection. For TCGA, normalization was done per-project. PCA was calculated for all samples from a matched organ. The centroid of the points from each dataset was estimated in 2D or N-dimensional space, and the Euclidean distance between centroids was calculated. A distance was also calculated using only the non-tumor samples in TCGA, which were expected to be closer in PC-space to the GTEx samples. All of these results are before running any batch-correction algorithm, like ComBat.
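A minimal sketch of that distance calculation might look like the following (variable names are hypothetical; it assumes a normalized, log-transformed expression matrix with samples as rows and a parallel series of dataset labels):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def centroid_distance(expr: pd.DataFrame, dataset: pd.Series,
                      a: str, b: str, n_components: int = 2) -> float:
    """Euclidean distance between two datasets' centroids in PC space.

    expr: samples x genes matrix of normalized, log-transformed expression
    dataset: per-sample labels aligned to expr rows, e.g. "TCGA" or "GTEx"
    """
    pcs = PCA(n_components=n_components).fit_transform(expr.values)
    centroid_a = pcs[(dataset == a).to_numpy()].mean(axis=0)
    centroid_b = pcs[(dataset == b).to_numpy()].mean(axis=0)
    return float(np.linalg.norm(centroid_a - centroid_b))

# e.g. for one organ, before vs. after re-processing:
# d_before = centroid_distance(expr_original, labels, "TCGA", "GTEx")
# d_after = centroid_distance(expr_reprocessed, labels, "TCGA", "GTEx")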

In every matched organ, re-processing TCGA and GTEx RNA-seq samples with a consistent bioinformatics pipeline reduced the batch effects. The reduction was almost always larger when considering only non-tumor samples.

In the liver, where we have the most internal RNA-seq data, all three data sources were much closer in PC-space after re-processing. I’m particularly happy about this result, as it means our internally-generated data can be compared with these external sources more reliably.

References

  1. Arora, S., Pattwell, S. S., Holland, E. C. & Bolouri, H. Variability in estimated gene expression among commonly used RNA-seq pipelines. Sci. Rep. 10, 2734 (2020).
  2. Wang, Q. et al. Unifying cancer and normal RNA sequencing data from different sources. Sci. Data 5, 180061 (2018).

s3stasher simplifies using AWS S3 in Python

Working with cloud files in Python is a necessary pain at any biotech organization. While packages like pandas transparently handle S3 URIs, I still found myself writing the same boto3 code to manage files far too often. Nextflow solves this for workflows, but I’m not aware of any packages that make managing S3 files easier for script and notebook use cases.

I developed s3stasher to make working with files in AWS S3 as easy as if they were local files. The key principles are:

  • S3 objects should be referred to as full URIs at all times. It shouldn’t be necessary to split a URI into bucket and key strings.
  • Any method that reads or writes a file should transparently work on S3 URIs or local files.
  • S3 objects should be cached locally and only re-downloaded when the source object has changed.
  • Reading S3 objects should work identically while offline, assuming the user has the file cached.

Using s3stasher, you simply wrap any file read or write in a with statement, and the S3 operations happen behind the scenes.

import pandas as pd

from s3stasher import S3

# Download, cache, and read an S3 object
with S3.s3open("s3://my-bucket/my_data.csv") as f:
    my_df = pd.read_csv(f)

# Two layers of context manager are needed for traditional open operations
with S3.s3open("s3://my-bucket/unstructured.txt") as s3f:
    with open(s3f) as f:
        lines = f.readlines()

# Write a file back to s3. By default, it will be saved in the cache dir 
# to avoid an unnecessary download in the future
with S3.s3write("s3://my-bucket/my_data_new.csv") as f:
    my_df.to_csv(f)

# Other convenience functions are provided
## List objects under a prefix
uri_list = S3.s3list("s3://my-bucket/prefix/")
## Check for existence of an object
uri_exists = S3.s3exists("s3://my-bucket/unknown_file.txt")
## copy, move, remove an S3 object
S3.s3cp("s3://my-bucket/my_file_1.txt", "s3://my-bucket/my_file_2.txt")
S3.s3mv("s3://my-bucket/my_file_2.txt", "s3://my-bucket/my_file_3.txt")
S3.s3rm("s3://my-bucket/my_file_3.txt")

By default, s3stasher uses your existing AWS credentials and caches files to ~/.s3_cache. All of these options can be changed with a config file or environment variables.

You can install s3stasher with a quick pip install s3stasher.
PyPI: https://pypi.org/project/s3stasher/
GitHub: https://github.com/bsiranosian/s3stasher

Feedback and PRs welcome!