Highlights from ISMB – Day 1 – Benjamin Siranosian

Today was the first day of the main ISMB conference in Boston. Between volunteering duties I had a chance to catch a few interesting talks. Here’s what I thought was interesting on Sunday, July 13.

Keynote by Eugene Myers
Eugene Myers was given the Senior Scientist Award by ISCB this year and gave an excellent keynote in the afternoon. The list of projects he’s contributed to is impressive – he has basically solved a huge problem in computer science or biology every 10 years. In 1990 he co-invented the suffix array data structure (with Udi Manber), in 2000 he was essential to the human genome assembly effort at Celera Genomics, and recently he’s been working on visualization and microscopy problems in neuroscience. His keynote focused on genome assembly – first the process used at Celera and the developments leading to its algorithms, then the quality of recently published assemblies.

His main point was that the quality of recent assemblies has been decreasing. The short reads produced by NGS tech are resulting in more contigs and more gaps in the assemblies. In order to do real comparative genomics and look for structural variation and gene duplication, high-quality continuous genomes are necessary, according to Myers.

Luckily, new technology (PacBio et al.) is producing longer reads that should allow for better assemblies in the future. Myers also talked a bit about error rates – as long as errors are randomly distributed, they shouldn’t affect the quality of the resulting assembly at all. This is good news for PacBio, with its error rate of 10% or so. Myers is also working on a new assembler for long reads called DAZZLER, which he didn’t get to describe in detail (and I haven’t had the time to look into the actual methods yet), but it seems interesting. Check out his blog here.
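
To get a feel for why random errors wash out, here is a rough back-of-the-envelope sketch (my own, not something from the talk). It assumes errors are independent between reads, which is a strong simplification, and the function name is made up. A per-base plurality vote can only go wrong if at least half the reads covering a position are in error, so the binomial tail gives an upper bound on the consensus error:

```python
from math import comb


def consensus_error_bound(coverage: int, error_rate: float) -> float:
    """Upper bound on the chance a per-base plurality vote is wrong.

    Assumes each read is independently wrong at this position with
    probability `error_rate`. The vote can only fail when at least
    half of the `coverage` reads are in error.
    """
    threshold = (coverage + 1) // 2  # smallest k with k >= coverage / 2
    return sum(
        comb(coverage, k) * error_rate**k * (1 - error_rate) ** (coverage - k)
        for k in range(threshold, coverage + 1)
    )


if __name__ == "__main__":
    for cov in (1, 5, 15, 30):
        print(cov, consensus_error_bound(cov, 0.10))
```

With a 10% per-read error rate the bound drops quickly as coverage grows. If errors were systematic rather than random (say, always wrong in the same sequence context), this argument would not hold.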

Compressive Genomics by Bonnie Berger’s lab
This presentation was in a session chaired by Michael Waterman and my teacher Sorin Istrail celebrating 20 years of the Journal of Computational Biology. The Berger lab is working on new algorithms for data compression and processing to cope with the massive amount of biological data being generated these days (which is growing faster than our capacity for data storage, I should add). First, they presented a pipeline that eliminates 95% of quality scores in .fastq files before downstream processing. Only quality values of bases that are called as mutations or are otherwise interesting are retained; the rest are replaced with the mean quality. This decreases the file size and can apparently increase downstream accuracy.
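
Here is a minimal sketch of the quality-smoothing idea as I understood it. This is not the Berger lab's code: the function name is hypothetical, I'm assuming Phred+33 encoding, and I'm assuming "mean quality" means the mean over the discarded positions of a single read. How the "interesting" positions get chosen is the hard part, and it's simply passed in here.

```python
from typing import Set


def smooth_qualities(qual: str, keep: Set[int], offset: int = 33) -> str:
    """Keep quality characters at positions in `keep`; replace the rest
    with the (rounded) mean quality of the replaced positions.

    `qual` is a FASTQ quality string, assumed to be Phred+33 encoded.
    """
    scores = [ord(c) - offset for c in qual]
    rest = [s for i, s in enumerate(scores) if i not in keep]
    if not rest:  # every position is kept, nothing to smooth
        return qual
    mean_char = chr(round(sum(rest) / len(rest)) + offset)
    return "".join(c if i in keep else mean_char for i, c in enumerate(qual))


if __name__ == "__main__":
    # Position 3 (the low-quality '#') is the one "interesting" base.
    print(smooth_qualities("I5I#I", keep={3}))
```

The smoothed string is mostly one repeated character, which is exactly the kind of input general-purpose compressors like gzip handle very well.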

Second, they presented a compressive BLAST algorithm that can speed up alignments against large databases. Their method first computes clusters of entries in the BLAST database, and creates a representative sequence for each cluster. A query is first compared against these representative sequences, and then only against the constituent subjects in the clusters nearest to it. This drastically shrinks the number of alignments done and speeds up the BLAST search. There are some problems with the math behind computing clusters (the measures aren’t true distances and don’t satisfy the triangle inequality), but since BLAST is an approximate algorithm anyway, it turns out this doesn’t matter!
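
The two-stage search can be sketched in a few lines. This is a toy version under my own assumptions, not the actual compressive BLAST implementation: Hamming distance on equal-length strings stands in for real alignment scores, the clustering is a simple greedy pass, and all function names are made up. Unlike the real similarity measures, Hamming distance is a true metric, so the coarse filter here provably never misses a hit.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions (assumes equal-length strings)."""
    return sum(x != y for x, y in zip(a, b))


def build_clusters(db: List[str], radius: int) -> Dict[str, List[str]]:
    """Greedy clustering: each sequence joins the first representative
    within `radius`, otherwise it becomes a new representative."""
    clusters: Dict[str, List[str]] = {}
    for seq in db:
        for rep in clusters:
            if hamming(seq, rep) <= radius:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


def compressive_search(query: str, clusters: Dict[str, List[str]],
                       radius: int, max_dist: int) -> List[str]:
    """Coarse step: compare the query to representatives only.
    Fine step: scan members of clusters that could contain a hit."""
    hits: List[str] = []
    for rep, members in clusters.items():
        # Triangle inequality: a member within max_dist of the query
        # implies its representative is within max_dist + radius.
        if hamming(query, rep) <= max_dist + radius:
            hits.extend(m for m in members if hamming(query, m) <= max_dist)
    return hits


if __name__ == "__main__":
    db = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT",
          "CCCCCCCC", "CCCCCCCG", "GGGGGGGG"]
    clusters = build_clusters(db, radius=2)
    print(compressive_search("AAAAAATA", clusters, radius=2, max_dist=1))
```

In this example the query is only compared against three representatives before the C and G clusters are thrown out entirely. The savings grow with how redundant the database is.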

These descriptions were written from memory, but there’s more information at the Berger Lab page.

Watching the World Cup with world-class scientists
There’s nothing better than watching scientists you look up to cheer for their favorite soccer team. ISMB was nice enough to set up a big projection screen and stream the World Cup for us.

The room exploded when Germany scored in overtime – I think there are a lot more people from Europe here than from South America!

ISMB 2014 continues tomorrow. Stay tuned for more updates!
