Publishing code on GitHub

This semester, I’ve made an effort to get all the code I write under version control. In the past I simply maintained my codebase in Dropbox. This worked well as a backup solution and allowed me to develop the same project on my laptop and desktop without any problems (despite dealing with differences in Windows/Linux file paths). However, I’ve been involved in more collaborative coding projects this semester and Dropbox simply doesn’t cut it anymore.

Bioinformaticians as a group seem to be particularly passionate about version control and open access software – Titus Brown even says, “If you can’t be bothered to learn how to use version control, you shouldn’t be trusted to write important software.” This goes along with the open source and open access movement academics generally tend to support. Plus, we’ve all had the experience of working with poorly maintained, poorly documented or poorly commented code… It can really slow down the research process and be a huge hassle.

So, I’ve made a new commitment. Every piece of code I write for an academic project will be under version control on GitHub. Code for lab work that we’ve decided to publish will also make its way there (for the time being it’s held in a private Bitbucket repository, still under version control though!). This is a bit of a challenge for me – publishing code is a lot like publishing something you’ve written. You’re putting your work out there for the world to see and critique, and in a lot of cases, it’s not a finished product or something you’re quite happy with yet.

I see a lot of advantages to making code public. It should help me develop better structured, more thoughtful and well-commented code. It will allow me to share projects and ideas with anyone just by giving them my GitHub username (hint: it’s bsiranosian). I can now include my GitHub URL on things like my website and business cards, and anyone can see the kind of projects I work on. I feel like this could give me a leg up when searching for jobs and the like.

I can see a few downsides too. Academic integrity is one – I don’t want someone at Brown or another university copying my code for their homework or project. After thinking about this point though, I realized the answers to most bioinformatics problems are already available at places like Stack Overflow. It’s not my responsibility to make sure someone doesn’t plagiarize code. Titus Brown teaches an undergraduate class where students are required to hand in assignments on GitHub and hasn’t had any problems.

You can find my GitHub at https://github.com/bsiranosian

k-mers are everywhere!

Many problems in bioinformatics involve working with short pieces of DNA sequence. We call these short words k-mers, where k is an integer usually less than 30 or so. A k-mer is essentially a substring of a larger sequence of DNA. If you’re a biologist you may be wondering why people would be interested in anything other than 3-mers, the codons that encode amino acids. As it turns out, k-mers are at the center of many bioinformatics techniques and are the subject of intense algorithms research.
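To make the "substring" idea concrete, here is a minimal Python sketch that lists every overlapping k-mer in a sequence. The function name kmers and the example sequence are made up for illustration; they don’t come from any particular tool.

```python
def kmers(sequence, k):
    """Yield every overlapping substring of length k, left to right."""
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


# Hypothetical example sequence, chosen only for illustration.
example = "ATGCGT"
print(list(kmers(example, 3)))  # ['ATG', 'TGC', 'GCG', 'CGT']
```

Note that neighboring k-mers overlap by k - 1 letters, which is the property assemblers take advantage of.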

Some bioinformatics areas where k-mers play a central role:

  • Genome Assembly. Assemblers based on the overlap-layout-consensus model (such as Celera) or de Bruijn graphs (like Velvet) use k-mers to build the initial data structure for genome assembly. As overlaps between k-mers are found, the assembled sequence grows!
  • Sequence Alignment. The Basic Local Alignment Search Tool, or BLAST, is arguably the most well-known product of the bioinformatics field. BLAST can find DNA sequences conserved between organisms, uncover horizontal gene transfer and explain why we can’t make a vaccine for the common cold. And it all depends on the initial matching of short k-mers from the search sequence to the database.
  • Sequencing Quality Control. Overrepresentation of k-mers in a next gen sequencing library can be diagnostic for errors and duplications. The FastQC program computes the usage of 5-mers in sequencing reads as a form of quality control.
  • Alignment-Free Sequence Analysis. My new favorite problem! Expect a post on this soon. Basically, the usage of short k-mers in a genome can be used to infer evolutionary relationships and examine horizontal gene transfer. Kind of like GC content but with more signal.
  • Codons and Repetitive Regions. Codons, the 3-letter sequences that encode the amino acids that build proteins, are essentially 3-mers with special biological function. 3-mers are also important in disease, such as the CAG repeats that cause Huntington’s disease.

K-mers are everywhere in bioinformatics. A lot of work has gone into finding ways to count k-mers in large genomes efficiently, in terms of both computation time and memory. Really impressive and cool algorithms have been developed to solve the k-mer counting problem, some of which I’ll be talking about in a later post. It turns out these little words of DNA are important after all!
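For a sense of what the counting problem looks like, here is a naive Python sketch that tallies k-mers in a dictionary. This is only a baseline under simple assumptions (one in-memory string, uppercase A/C/G/T), not one of the specialized algorithms mentioned above; the name count_kmers and the toy sequence are hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count how often each overlapping k-mer occurs in the sequence."""
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        counts[sequence[start:start + k]] += 1
    return counts


# Toy sequence for illustration only.
counts = count_kmers("ATATAT", 2)
print(counts["AT"], counts["TA"])  # 3 2
```

With a four-letter alphabet there are up to 4^k distinct k-mers, so a table like this can grow very quickly as k increases. That is one reason memory matters so much for real genomes.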

Academia or Industry?

That question seems to be on the mind of a lot of the people around me lately. Junior year of my undergrad studies at Brown is almost over, and my classmates and I are starting to think of what our lives will be like after next May. For people interested in the sciences, especially biology and biotech related fields, there are two main options everyone considers: should I get a job right out of undergrad, with only a bachelor’s degree? Or, should I stay in school for another 5+ years for a Ph.D? There are obvious benefits and costs to each (which I’ll cover in a later post), and everyone wants to know exactly what choice will be best for them.

Up until recently, I have wanted to go to graduate school the fall after finishing at Brown. I feel like I’m ready for what the process entails, thanks in part to the books I’ve been reading by recent (either delighted or regretful) Ph.D recipients. But now, I’m not so sure. The Brown Club of Boston Biotech Conference was a big factor in this – listening to people in industry talk about the opportunities for bachelor’s degree holders was eye opening. The starting salaries they mentioned were impressive (and not much less than what you’d get as a Ph.D). The projects looked interesting. And most importantly, the jobs are there.

I’m considering taking a year or two to work in industry before committing to grad school. After all, what’s a year delay when you’re going to be in school for another 5-7?

Books on Graduate School

Students are pretty much on their own for figuring things out after undergrad (I definitely don’t miss high school guidance counselors, though). Grad school is a confusing topic – I find that everyone talks about it, but some people don’t really know what to expect if they commit the next 5 years of their life to getting another degree. Asking students in your lab, professors and graduating seniors is a great way to get information, but it’s still hard to get a good picture of what the whole process is like. Over winter break this year I found a few good books that summarize what you can expect at a graduate program in the sciences.

  • The Ph.D Grind by Philip Guo
    This short memoir details Philip’s time at Stanford from 2006 to 2012 while he worked toward a Ph.D in computer science. His story is filled with countless ups and downs, tips to get work published and lessons about working efficiently and maintaining sanity during stressful times. It’s a free ebook, quick read, and definitely recommended (even if you’re not considering computer science).
  • Getting What You Came For by Robert Peters
    The second edition of this book was published in 1992 so it’s a little outdated (it references word processors like they’re the next big thing, for example). It is still a good general account of the entire graduate school process, though. Getting What You Came For covers everything you should be doing in undergrad, how to pick a graduate school that’s a good fit for you, how to navigate your first years in the program while finding an advisor and taking quals, how to get the thesis done, and much more. It’s definitely not a quick read, but I’m keeping a copy for reference.
  • The Ph.D Process by Dale Bloom, Jonathan Karp and Nicholas Cohen
    Another book that lays out each step in the process of getting a Ph.D. Kind of outdated, but it has anecdotes and advice from students that I think are still relevant and useful. The ebook was free through the library so I gave it a read.
  • A Guide to Academia by Prosanta Chakrabarty
    This one covers the entirety of grad school and also has information for postdocs and people looking for their first job. It’s a pretty solid overview. I think it was important to read ahead to learn about life after grad school and what to expect for the job market in academia. Also free through the library at Brown.

Of course, this is not an exhaustive list! If you have any suggestions on other books, or books that compare academia to industry, leave me a comment below.

Brown Club of Boston Biotech Conference

I just returned from an excellent conference sponsored by the Brown Club of Boston. The conference was designed to facilitate networking between Brown alumni, current students and leaders in all aspects of the biotechnology field. There was a panel of three interesting speakers, each of whom talked about their unique experience in the biotech industry.

  • Angus McQuilken, VP of Communications and Marketing at the Massachusetts Life Sciences Center.

Angus spoke about the biotech boom that’s been happening in Massachusetts recently. A major reason for this is support from the state government – the Massachusetts Life Sciences Center was tasked with giving out $1 billion in funding to companies and individuals over a ten year period. Some of that money goes toward funding the internship challenge, a program that subsidizes internships for Mass residents and students working at biotech companies.

  • Mitch Sanders, Ph.D, founder and CEO of ECI Biotech

Mitch talked about his experience in biotech after his postdoc at MIT. His company is developing sensors that change color in the presence of certain bacteria and viruses – the tech is being applied to band-aids, prosthetics and food and will hopefully be on the market within the next couple of years. He also spoke about the ups and downs his company had been through in the past years and the importance of being able to adapt in the industry.

  • Elaine Crowley, founder and president of the Crowley Group.

The Crowley Group is a consulting firm that provides “insightful executive coaching” to companies and individuals. Although her work is not limited to biotech, Elaine had some good lessons for the audience of students: finding employment that matches your culture is just as important as (if not more important than) the pay, prestige or benefits.

There was a Q&A session after the panelists spoke. I asked the three of them about the pros and cons of getting a Ph.D before working in the biotech industry, and the answers I received surprised me. More on that in a later post!

Many thanks to the members of the Brown Club of Boston who organized this event, especially Paula Freeman, who blogs about etiquette and lessons young people should learn over at jobetiquettebypaula.wordpress.com.