Parallelization Methods for the Distribution of High-Throughput Bioinformatics Algorithms

Date

2011-05

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

The development of high-throughput bioinformatics technologies has caused a massive influx of biological data over the course of the past decade. During this same span of time, computational hardware has also been rapidly increasing in speed while decreasing in price, multi-core processors have become standard in home and office environments, and distributed and cloud based computing has become affordable and readily available to researchers with implementations such as Amazon’s S3, Microsoft’s Azure, Google’s App Engine, and the 3Tera Cloud. Bioinformatics software tools such as BLAST, a tool for finding local alignments between a set of unknown genetic sequences versus a set of known genetic sequences, have simple interfaces and few installation requirements often so biologists can use them easily in the laboratory without needing an in-depth knowledge of how computer systems work. This, however, is rarely the case for distributed implementations of bioinformatics tools which often require the user to first set up and configure the underlying program that will handle the distribution, such as the Message Passing Interface (MPI). Once the underlying distribution algorithm is chosen, many of the software tools require the user to then configure the program to work with their chosen method and, in some cases, write the necessary source code to link the program with the underlying service. These are difficult steps for most computer scientists and are near impossible for the average biologist.

By constructing a modularized set of methods that can connect to, broadcast to, and read from a multicast created by the methods, future bioinformatics software developers will be able to construct the underlying message passing system without requiring the end-user, often a biologist, to set up and configure one of their own. Using these multicast methods will allow any program the ability to seek out and track any nodes on the network that will be used in the distributed system. This communication method allows the program to easily scale up and down depending on available nodes without direct user intervention to alter the size of the system. This system is then tested by creating a program that connects NCBI’s Basic Local Alignment Search Tool to the multicast system to allow the BLAST algorithm to then be distributed across multiple nodes. This new system will demonstrate how future programs could then connect stand alone tools, such as BLAST, to the multicast system to create programs that will execute on a distributed system and automatically scale depending on the network size without altering the tools source code.

Description

Citation