Highlights from ISMB – Day 1

Today was the first day of the main ISMB conference in Boston. Between volunteering duties I had a chance to catch a few interesting talks. Here’s what stood out on Sunday, July 13.

Keynote by Eugene Myers
Eugene Myers was given the Senior Scientist Award by ISCB this year and gave an excellent keynote in the afternoon. The list of projects he’s contributed to is impressive – basically solving a huge problem in computer science or biology every 10 years. In 1990 he co-invented the suffix array data structure, in 2000 he was essential to the human genome assembly effort at Celera Genomics, and recently he’s been working on visualization and microscopy problems in neuroscience. His keynote focused on genome assembly – first the process used at Celera and the developments leading to the algorithms, then the quality of recently published assemblies.

His main point was that the quality of recent assemblies has been decreasing. The short reads produced by NGS tech are resulting in more contigs and more gaps in the assemblies. In order to do real comparative genomics and look for structural variation and gene duplication, high-quality continuous genomes are necessary, according to Myers.

Luckily, new technology (PacBio et al.) is producing longer reads that should allow for better assemblies in the future. Myers also talked about error rates – as long as errors are randomly distributed, they shouldn’t affect the quality of the resulting assembly at all. This is good news for PacBio, with its error rate of 10% or so. Myers is also working on a new assembler for long reads called DAZZLER, which he didn’t get to describe in detail (and I haven’t had the time to look into the actual methods yet) but it seems interesting. Check out his blog here.

Compressive Genomics by Bonnie Berger’s lab
This presentation was in a session chaired by Michael Waterman and my teacher Sorin Istrail celebrating 20 years of the Journal of Computational Biology. The Berger lab is working on new algorithms for data compression and processing to handle the massive amount of biological data being generated these days (which is growing faster than our capacity for data storage, I should add). First, they presented a pipeline that eliminates 95% of quality scores in .fastq files before downstream processing. Only the quality values of bases that are called as mutations or are otherwise interesting are retained; the rest are transformed to the mean quality. This decreases the file size and can apparently increase downstream accuracy.
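To make the quality-score idea concrete, here is a minimal sketch of how I picture it, not the Berger lab’s actual pipeline. It assumes Phred+33 encoded quality strings, assumes the mean is taken over the whole read, and uses a hypothetical `keep_positions` set to stand in for the “interesting” bases.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33) to a list of integer scores."""
    return [ord(c) - 33 for c in quality_string]


def encode_phred33(qualities):
    """Convert integer scores back to a Phred+33 quality string."""
    return "".join(chr(q + 33) for q in qualities)


def smooth_qualities(qualities, keep_positions):
    """Keep scores at keep_positions; replace all others with the read's mean.

    Assumption: the mean is computed over every base in the read and rounded
    to an integer. The real method may define the mean differently.
    """
    if not qualities:
        return []
    mean_q = round(sum(qualities) / len(qualities))
    return [q if i in keep_positions else mean_q
            for i, q in enumerate(qualities)]


# Toy read: one low-quality base at index 4 that we pretend is a variant call.
quals = decode_phred33("IIII#III")
smoothed = smooth_qualities(quals, keep_positions={4})
smoothed_string = encode_phred33(smoothed)
print(smoothed_string)
```

The smoothed string is mostly one repeated character, which is why a general-purpose compressor can shrink it so much more than the original.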

Second, they presented a compressive BLAST algorithm that can speed up alignments in large databases. Their method first computes clusters of entries in the BLAST database, and creates a representative sequence for each cluster. A query is then compared against these representative sequences, and only compared against the constituent subjects in the clusters nearest to it. This drastically shrinks the number of alignments done and speeds up the BLAST search. There are some problems with the math behind computing clusters (the measures aren’t true distances and don’t satisfy the triangle inequality) but since BLAST is an approximate algorithm anyway, it turns out this doesn’t matter!
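The cluster-then-search idea can be sketched in a few lines. This is a toy illustration written from my notes, not the published algorithm: k-mer Jaccard similarity stands in for a real alignment score, the first member of each cluster stands in for the representative sequence, and the function names and thresholds are all made up for the example.

```python
def kmers(seq, k=3):
    """Return the set of k-length words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def build_clusters(database, threshold=0.5):
    """Greedy clustering: join the first similar representative, else start a cluster."""
    clusters = []  # list of (representative, members)
    for seq in database:
        for rep, members in clusters:
            if similarity(seq, rep) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


def compressive_search(query, clusters, coarse=0.3, fine=0.6):
    """Compare against representatives first, then only inside promising clusters."""
    hits, comparisons = [], 0
    for rep, members in clusters:
        comparisons += 1
        if similarity(query, rep) < coarse:
            continue  # skip every member of this cluster
        for seq in members:
            comparisons += 1
            if similarity(query, seq) >= fine:
                hits.append(seq)
    return hits, comparisons


database = [
    "ACGTACGTAC", "ACGTACGTAA", "ACGTACGTTC",
    "TTTTGGGGCC", "TTTTGGGGCA", "TTTTGGGGCG",
    "GAGAGAGAGA", "GAGAGAGAGG",
]
clusters = build_clusters(database)
hits, comparisons = compressive_search("ACGTACGTAA", clusters)
print(len(clusters), hits, comparisons)
```

Even on this tiny database the search does fewer comparisons than checking all eight entries; the savings should grow with the size and redundancy of the database.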

These descriptions were done from memory, but there’s more information at the Berger Lab page.

Watching the World Cup with world-class scientists
There’s nothing better than watching scientists you look up to cheer for their favorite soccer team. ISMB was nice enough to set up a big projection screen and stream the World Cup for us.

The room exploded when Germany scored in overtime – I think there are a lot more people from Europe here than from South America!

ISMB 2014 continues tomorrow. Stay tuned for more updates!

ISCB Student Council Symposium 2014

I had an excellent opportunity to present my mycobacteriophage kmer usage research at the ISCB Student Council Symposium earlier today. I was one of 12 students from around the world who gave oral presentations, which spanned all walks of computational biology and bioinformatics. I thought the symposium was a huge success! Some highlights:

  • Great keynote speakers
    Dr. David Bartel (Whitehead/MIT/HHMI) gave a talk on developments in microRNA research and some really creative tech for sequencing poly(A)-tails. The technique uses a two-step imaging process on an Illumina sequencer to determine the length of the tail and the sequence of the microRNA.
    Dr. Ashlee Earl (Broad) discussed how her lab is using genomics to track pathogenicity and drug resistance in TB and other bugs. She also talked about Pilon, software developed by the Broad for improving assemblies of microbial genomes.
  • Scientific speed dating
    This was a novel concept – chat with a fellow scientist and try to describe your research in two minutes or less. The goal isn’t to find a relationship, but a new collaboration!
  • Networking opportunities
    Abhishek Pratap from Sage Bionetworks talked about software called Synapse they’ve been developing to help computational analysis of NGS data be more open and well documented. The student council is also a fan of networking in social settings, and took us all out to a pub after the symposium was finished.

Starting tomorrow, I’m volunteering at the main ISMB conference (what a great way to go to a conference when you don’t have grant support). Stay tuned for updates on interesting research that I see over the next few days!

Thoughts from the SEA-PHAGES symposium

What a weekend! The past two days have been filled with excellent student presentations, ample opportunities for networking and fruitful conversations about future research and teaching ideas. Chen and I presented our poster about alignment-free sequence analysis techniques applied to mycobacteriophage genomes on Saturday night. We must have done something right, because we came back to Janelia Farm this morning with a first place ribbon on our poster! Chen also gave his oral presentation this morning and absolutely knocked it out of the park – people have been coming up to us all day and asking how the animations were done.

We’re going to be putting up a web page summarizing our presentation, poster, results and methods in the next few days. For now, you can view the poster and check out our (unfinished) code at my GitHub. I’ll make another post here when everything is ready!

I was also very impressed with some of the research happening at other schools in the SEA-PHAGES program, and will be writing about some of it in the next few days. For now, check out some photos from Janelia.

SEA-PHAGES symposium 2014

This weekend I’m down at HHMI’s Janelia Farm Research Campus at the SEA-PHAGES undergraduate research symposium. The phage hunters class I TA is administered through HHMI and is taught at over 70 schools around the US and internationally. This symposium is a chance for undergraduates from all the schools to get together, present their research and be exposed to new ideas. Chen (one of the first year students) and I are presenting our research into tetranucleotide usage in mycobacteriophage genomes. We’ll have a poster at the session on Saturday night and Chen will be giving an oral presentation on Sunday morning.

Janelia Farm is an inspiring place to visit – something about the beautiful architecture coupled with cutting edge research really sticks with you. I hope to come back to Providence with new connections, ideas and inspirations.

Check out the poster we’ll be presenting and feel free to leave a comment with any questions about the research, phage hunters, or the symposium in general.

Counting tetranucleotides in mycobacteriophages

As a teaching assistant in Brown’s first year seminar “Phage Hunters” I led several freshman biology and computer science students in an independent bioinformatics research project. We began the semester looking for evidence of CRISPR protospacers in mycobacteriophage genomes. The idea was to use BLAST and other tools to introduce students to the bioinformatics investigation process. We covered the basics of the CRISPR/Cas system, wrote a Python script to download genome sequences from phagesdb.org, and made a local BLAST database on Brown’s computer cluster.

Things were going well with the project, but a few weeks in I was having doubts as to how statistically valid our protospacer predictions were. Then, I re-read a paper by one of the leaders in the field and discovered that they had a) already looked for protospacers and b) found no conclusive evidence in mycobacteriophages. The author of the paper was also going to be at the SEA-PHAGES symposium we were planning to present our class results at, so that really spelled the end of the CRISPR project. We needed a new idea though – the course instructors were counting on the bioinformatics team to generate some research we could bring to the symposium. My solution: frantic searching on Google Scholar for anything relevant to bioinformatics and bacteriophages.

Within a few minutes I came upon a paper (1) that looked at the usage of tetranucleotides in viral and bacterial genomes. The idea is that closely related genomes have similar signals in terms of tetranucleotide usage, and this signal can be used to look at relationships independent of alignment-based techniques. I had found a new idea for the project! This kind of analysis was also perfect for teaching bioinformatics. It introduces a lot of the concepts and language used in the field, like kmer counting and normalization. It is fairly straightforward to program, easy to apply to bacteriophage genomes and doesn’t require complicated statistics in a first level investigation.
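For anyone who hasn’t counted kmers before, here is a minimal sketch of the core step. It is an illustration rather than our project code: the function name is made up, windows containing ambiguous bases are simply skipped, and counts are normalized to plain frequencies, which is the simplest choice and not necessarily the normalization used in the paper.

```python
from collections import Counter
from itertools import product


def tetranucleotide_usage(sequence, k=4):
    """Return the frequency of every possible k-letter DNA word in sequence.

    Windows containing anything other than A, C, G or T are skipped.
    Frequencies sum to 1 whenever at least one valid window exists.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    total = sum(counts.values())
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: counts[w] / total for w in words}


# Toy sequence with one ambiguous base (N).
usage = tetranucleotide_usage("ACGTACGTNACGT")
print(len(usage), usage["ACGT"])
```

Each genome becomes a vector of 256 numbers, one per tetranucleotide, and those vectors can be compared directly without aligning anything.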

I ran with this idea for the bioinformatics project and the results were quite exciting. We found tetranucleotide usage was well conserved within mycobacteriophage clusters (a way to group phage based on pairwise nucleotide alignment and gene content comparisons) and divergent between clusters. We built phylogenetic trees that closely corresponded to published trees, looked for horizontal gene transfer and were able to accurately cluster unknown phage – all based on the usage of 4-letter words within the genomes. For a more detailed overview of the work, check out the abstract I submitted for the International Society for Computational Biology Student Council conference.
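Assigning an unknown phage to a cluster can be sketched as a nearest-centroid lookup on usage vectors. This is a simplified illustration, not our actual analysis: Euclidean distance and cluster centroids are assumptions made for the example, and the sequences and cluster names below are toy data.

```python
from itertools import product
from math import sqrt

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def usage_vector(sequence):
    """Tetranucleotide frequencies of sequence, in a fixed word order."""
    counts = dict.fromkeys(WORDS, 0)
    for i in range(len(sequence) - 3):
        word = sequence[i:i + 4]
        if word in counts:  # skips windows with ambiguous bases
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in WORDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_cluster(unknown, clusters):
    """Return (name, distance) of the cluster whose centroid is closest.

    clusters maps a cluster name to a non-empty list of member sequences.
    """
    query = usage_vector(unknown)
    best_name, best_dist = None, float("inf")
    for name, genomes in clusters.items():
        vectors = [usage_vector(g) for g in genomes]
        centroid = [sum(col) / len(col) for col in zip(*vectors)]
        dist = euclidean(query, centroid)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


# Toy data: two made-up clusters with very different word usage.
toy_clusters = {
    "A": ["ACGTACGTACGTACGT", "ACGTACGTACGAACGT"],
    "B": ["GGGCCCGGGCCCGGGC", "GGGCCCGGGCCCGGCC"],
}
name, dist = nearest_cluster("ACGTACGTACGTTCGT", toy_clusters)
print(name)
```

The same pairwise distances could also feed a tree-building method, which is roughly how usage profiles turn into a phylogeny.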

One of the first year students, Chen Ye, and I are also going to be presenting this research at the SEA-PHAGES symposium at HHMI’s Janelia Farm this weekend. Check back for an update with our poster and other thoughts from the conference!

1. Pride, D.T., Wassenaar, T.M., Ghose, C., and Blaser, M.J. (2006). Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics 7, 8.

ISCB student council 2014

I submitted some research I’ve been working on (as a byproduct of TAing a first year seminar and leading some students in an independent bioinformatics project) to the International Society for Computational Biology student council symposium. Yesterday I found out it was selected for an oral presentation! This is the first chance I’ve had to present independent research, so needless to say I’m pretty excited.

The talk is titled “Tetranucleotide usage in mycobacteriophage genomes: alignment-free methods to cluster phage and infer evolutionary relationships.” Read on for the full abstract.

Finished with junior year!

I handed in the final assignment of the semester yesterday, which marked the end of my junior year at Brown. I can safely say it was the best year I’ve had so far, with the first semester spent abroad in Budapest and the second back in the States. Looking back, I accomplished and learned a huge amount about traveling, bioinformatics, the research process, cooking, the professional environment in biotech and myself as a person.

Today I’m headed up to my friend’s house off the coast of Portland, Maine for some much needed decompression time. I’ll be back in Providence for the summer, though. As a recipient of Brown’s UTRA award I’ll be spending my summer working for Dr. Nicola Neretti on a computational epigenetics project. The project I’m working on will likely continue for my senior thesis, so stay tuned!

Blogs I follow

I don’t know how they find the time to do it, but many of today’s top bioinformatics and computational biology researchers have a regularly updated blog. Reading some of them actually gave me the inspiration to start Blogging Bioinformatics: I liked the idea of having a space where I can talk about research and other topics that interest me. Here’s a list of a few blogs I read regularly. It’s nowhere near complete, and I’m always looking for suggestions of new people to follow.

  • Homolog.us runs an exceptional bioinformatics blog. They have regularly updated content covering new research, funding situations, and commentaries on issues in the bioinformatics field.
  • assertTrue() by Kas Thomas covers interesting topics and new research in biology and bioinformatics. Many of his recent posts discuss how published genomes are too often “auto annotated,” leading to an abundance of meaningless gene calls and hypothetical proteins.
  • Living in an Ivory Basement where Titus Brown talks about bioinformatics, programming and teaching. I’ve previously mentioned content from his site in defense of publishing code for my computational biology classes on my GitHub.
  • Judge Starling by Dan Graur has posts about bioinformatics mixed in with poetry, musings about modern research practices and commentary on ENCODE (which seems to be a popular topic to blog about these days… everyone has an opinion on the consortium).
  • Bits of DNA by Lior Pachter isn’t updated as frequently but the content is always extremely well thought out and backed up by solid reasoning. Lior uses the blog to comment on issues in the computational biology and bioinformatics fields, like ENCODE, misuses of statistics in research and the state of funding in the US.
  • The Mermaid’s Tale – I just added this blog to my list. The first post currently is about a new study of resveratrol (remember how drinking red wine was supposed to extend your life?) that didn’t uncover any measurable benefit of the chemical. The real question: which body of research should be treated as fact?
  • Job Etiquette by Paula – A blog I started reading after the Brown Club of Boston Biotech Conference. Paula Freeman discusses advice for young people searching for a job and has excellent content about resumés, interviews and job etiquette in general.

Know of another blog I should add to the list? Let me know in the comments!

Husband and wife start PhD after learning of disease

Eric Minikel and Sonia Vallabh were working as an urban planner and a lawyer until 2011, when they learned Sonia has a rare heritable disease – Fatal Familial Insomnia (FFI). FFI is caused by a mutation in the PRNP gene, which encodes prion protein PrP. Although the function of PrP isn’t precisely known, the mutated form can misfold the normal form of the protein (FFI is a prion disease).

After receiving the unfortunate news from a genetic test, Eric and Sonia decided to devote their lives to researching FFI. Both left their jobs for research positions at Massachusetts General Hospital. They soon started a scientific blog, CureFFI.org, where they discuss their research progress and next steps. In addition, Eric and Sonia founded Prion Alliance, a nonprofit devoted to funding prion disease research.

Now, Eric and Sonia have decided to take their research one step further: both will be starting PhDs at Harvard Medical School in the fall. Eric has turned his previous computational and analysis skills toward bioinformatics:

“My thesis was on analyzing bicycle-accident data, and I worked with my advisor, Professor Joe Ferreira, on an analysis of Massachusetts vehicle accident and insurance data. In the course of this, I learned to code in R and to manage SQL databases, both hard skills that I use every day now in the bioinformatics world. More broadly, writing my thesis taught me how to frame and answer a research question, which has been invaluable,” he said.

Eric and Sonia’s story is truly inspiring. In the face of a diagnosis that is basically a death sentence, they chose to fight back and devote their lives to prion disease research. I guess it’s never too late to get a PhD!

Publishing code on GitHub

This semester, I’ve made an effort to get all the code I write under version control. In the past I simply maintained my codebase in Dropbox. This worked well as a backup solution and allowed me to develop the same project on my laptop and desktop without any problems (despite dealing with differences in Windows/Linux file paths). However, I’ve been involved in more collaborative coding projects this semester and Dropbox simply doesn’t cut it anymore.

Bioinformaticians as a group seem to be particularly passionate about version control and open access software – Titus Brown even says, “If you can’t be bothered to learn how to use version control, you shouldn’t be trusted to write important software.” This goes along with the open source and open access movement academics generally tend to support. Plus, we’ve all had the experience of working with poorly maintained, documented or commented code… It can really slow down the research process and be a huge hassle.

So, I’ve made a new commitment. Every piece of code I write for an academic project will be under version control on GitHub. Code for lab work that we’ve decided to publish will also make its way there (for the time being it’s held in a private Bitbucket repository, still under version control though!). This is a bit of a challenge for me – publishing code is a lot like publishing something you’ve written. You’re putting your work out there for the world to see and critique, and in a lot of cases, it’s not a finished product or something you’re quite happy with yet.

I see a lot of advantages to making code public. It should help me develop better structured, more thoughtful and well-commented code. It will allow me to share projects and ideas with anyone just by giving them my GitHub username (hint: it’s bsiranosian). I can now include my GitHub URL on things like my website and business cards, and anyone can see the kind of projects I work on. I feel like this could give me a leg up when searching for jobs and the like.

I can see a few downsides too. Academic integrity is one – I don’t want someone at Brown or another university copying my code for their homework or project. After thinking about this point though, I realized the answers to most bioinformatics problems are already available at places like Stack Overflow. It’s not my responsibility to make sure someone doesn’t plagiarize code. Titus Brown teaches an undergraduate class where students are required to hand in assignments on GitHub and hasn’t had any problems.

You can find my GitHub at https://github.com/bsiranosian