One of the first steps in the analysis of most next-generation sequencing datasets (unless you’re doing a novel genome or transcript assembly) is mapping to a reference genome. Mapping is a procedure that determines the location in the genome that each sequencing read came from. If you have good sequencing data, the aligner you chose will map most of the reads.
What about the small (usually <5%) portion of reads that fail to map, then? What can we learn from these reads? Can they be used for quality control or actual analyses?
As it turns out, a lot can be learned by analyzing unmapped reads. Let’s start by understanding how a read can fail to map to a reference genome.
- Low quality or complexity: Some sequencing reads are filled with low quality base calls – either several ‘N’ bases or poor quality scores. These reads are usually eliminated by a filtering step before any downstream analysis. Low complexity reads – homopolymer and heteropolymer repeats, for example – are also very hard to align reliably. Neither kind of read encodes much useful information, but both can be important in determining the quality of the sequencing library before further analysis. Trimming the low quality bases (if they fall in a consistent position across the dataset) is one way to improve alignment.
- Ambiguous alignment: Reads from repetitive parts of the genome may align to more than one position. In humans, this can be a large portion of the sequencing data, since over 50% of the human genome is repetitive DNA. Depending on the aligner and parameters you choose, reads with ambiguous alignments may be reported in one position or fail to map. Bowtie2, for example, reports a single alignment for ambiguous reads by default; it chooses between the best possible alignments with a random number generator.
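The tie-breaking idea for ambiguous reads can be sketched in a few lines of Python. This is only an illustration of the concept, not Bowtie2’s actual code: the function name `pick_alignment`, the `(chrom, pos, score)` tuple layout and the way the seed is handled are all assumptions made up for this example.

```python
import random


def pick_alignment(candidates, seed=0):
    """Pick one alignment for a read, breaking ties between best hits at random.

    `candidates` is a list of hypothetical (chrom, pos, score) tuples,
    where a higher score means a better alignment.
    Returns (chosen_alignment, number_of_tied_best_hits).
    An empty list means the read did not map: (None, 0).
    """
    if not candidates:
        return None, 0

    # Keep only the alignments that share the best score.
    best_score = max(score for _, _, score in candidates)
    tied = [c for c in candidates if c[2] == best_score]

    # A seeded generator makes the "random" choice reproducible.
    rng = random.Random(seed)
    return rng.choice(tied), len(tied)


hits = [("chr1", 100, 40), ("chr5", 2000, 40), ("chr9", 7, 12)]
alignment, n_tied = pick_alignment(hits, seed=1)
print(alignment, "ambiguous" if n_tied > 1 else "unique")
```

Returning the number of tied hits alongside the chosen alignment keeps the ambiguity visible, so these reads can be set aside later instead of being mistaken for uniquely mapped ones.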
How can they be useful?
Ambiguous reads can be used to find information on the repetitive part of the genome – what many scientists once called ‘junk DNA’. Repetitive sequences are actually important for
- Discordant alignment (paired end sequencing): Paired end reads should be separated by a certain number of bases (plus or minus some standard deviation) when they map to a genome. This is because paired end protocols generate DNA fragments of roughly the same length (the insert size) and sequence both ends of each fragment. A pair that maps much farther apart or much closer together than expected is discordant. Once again, the reporting of discordant alignments differs with the program and parameters.
What can you do with them?
Discordant alignments can give information about genome rearrangements, such as deletions, insertions and duplications. For example, if there’s strong evidence for two reads aligning at a distance greater than the insert size, it’s possible some DNA between the two loci was deleted. The inverse is also true: reads aligning at a distance less than the insert size can indicate novel insertions, such as retrotransposons. Peter Park’s lab at Harvard has been developing algorithms to detect these events in NGS data and has applied them to look at genome rearrangements in cancer.
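The distance-based reasoning above can be written as a small Python sketch. It is a toy classifier and not the method used by any particular tool: the function name `classify_pair`, the labels it returns and the cutoff of three standard deviations are assumptions chosen for illustration. Real structural variant callers also need many supporting pairs before they report an event.

```python
def classify_pair(observed_distance, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair by comparing its mapped distance to the expected insert size.

    observed_distance: distance between the two mates on the reference (bases)
    mean_insert, sd_insert: expected insert size and its standard deviation
    n_sd: how many standard deviations count as "too far" or "too close"
          (3.0 is an arbitrary choice for this sketch)
    """
    lower = mean_insert - n_sd * sd_insert
    upper = mean_insert + n_sd * sd_insert

    if observed_distance > upper:
        # Mates map too far apart: the sample may lack DNA present in the reference.
        return "possible deletion"
    if observed_distance < lower:
        # Mates map too close together: the sample may carry extra DNA.
        return "possible insertion"
    return "concordant"


# Example with an expected insert size of 300 +/- 30 bases.
for distance in (310, 5000, 120):
    print(distance, classify_pair(distance, mean_insert=300, sd_insert=30))
```

A single discordant pair proves very little on its own, so in practice you would look for clusters of pairs that all point to the same event.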
- The read came from another organism: A tissue sample isn’t always a pure culture of the cells you want to look at. Humans are host to a huge number of microbes, viruses and parasites that inevitably end up in a tissue sample. This community of organisms is called the microbiome, which has been increasingly studied and found to be very important in health and disease. If other organisms are present in a tissue sample that’s being sequenced, some of their DNA will be sequenced as well. These reads won’t map to the reference genome.
What can they tell us?
Sequencing reads from the microbiome can tell you a lot about the communities of bacteria, fungi and viruses living in a sample. Several studies have compared the microbiome of individuals using next generation sequencing data.
That’s all the cases I can think of for why a read wouldn’t map to the reference, although it’s possible I missed some. In my next post I’ll talk about the analysis I’ve been doing on the unmapped portion of sequencing data and some interesting results!