BioGrid: Ruby-based High Performance Computing for Small Labs
Several months ago I started an ambitious project: I wanted to create an automated method for performing dose response modeling. My first code for this was written in Java, and it took forever to run. I looked around the lab, and I found tons of computers that weren't being utilized to their full potential. That's when it occurred to me -- I could create a grid that harnesses the computational power of all the computers that weren't being used (most of our computers sat idle after hours, and several did nothing more than host a few internal websites).
So I decided to create some software in Ruby that would make grid-based computation easier. Granted, my collaborator's lab (Dr. Tim Zacharewski's lab) has a lot of computers (mostly due to my days as a grad student and postdoc in Tim's lab, where I'd grab as many computers as possible for various purposes), but even a small laboratory that has more than one computer can use BioGrid. At the very least, several small labs can get together and share their resources to create a grid. Even connecting two computers is better than trying to run a long process on one computer.
The data organizational philosophy for BioGrid is simple: it uses a hash. So imagine, if you will, that you have a really complex task, for instance, performing dose response modeling on all of the genes in a dose response gene expression dataset from a set of Agilent microarrays. I would break this task up into a series of jobs that can be run in parallel, in this case one job per gene. In other words, each client in the grid would have a job to perform: model the dose response gene expression for a single gene. Once that job is completed, the client sends the results of the modeling back to the server (in this case, that would be the model parameters, since we're fitting dose response models), and then asks the server for a new job (i.e., a new gene with its associated gene expression data).
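To make the hash idea concrete, here is a minimal sketch of how such a job hash might be built. The method name build_jobs and the row layout (gene identifier first, then expression values) are my own assumptions for illustration, not code taken from BioGrid.

```ruby
# Minimal sketch, not BioGrid's actual code.
# Assumption: each row is [gene_id, value_at_dose_1, value_at_dose_2, ...].
def build_jobs(rows)
  rows.each_with_object({}) do |(gene_id, *values), jobs|
    # Keys must be unique, so refuse a second row for the same gene.
    raise ArgumentError, "duplicate gene id: #{gene_id}" if jobs.key?(gene_id)

    # One job per gene: the key is the gene identifier,
    # the value is that gene's expression data as floats.
    jobs[gene_id] = values.map { |v| Float(v) }
  end
end

# Hypothetical data, for illustration only.
jobs = build_jobs([
  ["GeneA", "1.0", "2.5", "4.0"],
  ["GeneB", "0.5", "0.7", "0.9"]
])
```

Each key/value pair in the resulting hash is one job that can be handed to whichever client asks next.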
BioGrid uses the Distributed Ruby (DRb) package to do most of the grunt work. Outside of that, it's a relatively straightforward client-server architecture with three primary classes. The server is started from the ServerRunner class. The constructor (the initialize method) in ServerRunner is where the task and jobs are created. So in my dose response modeling example, this is where I would read in the file that contains the gene expression data and build the hash. The key to the hash is the gene identifier (keys need to be unique), while the value is the gene expression data. You would also need to update the IP address of the server in this class.
Generally speaking, the GridServer class would not need to be updated/changed. Exceptions to this would include any changes you want to make to the logging system, or if you wanted the server to perform some work on the results after all of the results have come back in from the clients.
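For readers who haven't used DRb, the server side can be sketched roughly as follows. This is not BioGrid's GridServer; the class name JobDispenser, its method names, the sample data, and the example URI are all hypothetical. The only real library calls are DRb.start_service and DRb.thread.join from Ruby's drb library.

```ruby
require 'drb/drb'

# Hypothetical job dispenser; a sketch of the idea, not BioGrid's GridServer.
class JobDispenser
  attr_reader :results

  def initialize(jobs)
    @pending = jobs.dup   # gene_id => expression data still to be modeled
    @results = {}         # gene_id => model parameters sent back by clients
    @lock = Mutex.new     # DRb handles remote calls in separate threads
  end

  # Hands out one [gene_id, values] pair, or nil when no jobs remain.
  def next_job
    @lock.synchronize { @pending.shift }
  end

  # Stores the result a client computed for one gene.
  def submit_result(gene_id, params)
    @lock.synchronize { @results[gene_id] = params }
    nil
  end
end

# Publishes the dispenser over DRb and blocks until the service stops.
def serve(jobs, uri)
  DRb.start_service(uri, JobDispenser.new(jobs))
  DRb.thread.join
end

# Hypothetical sample data; a real run would load the expression file.
SAMPLE_JOBS = { "GeneA" => [1.0, 2.5, 4.0], "GeneB" => [0.5, 0.7, 0.9] }.freeze

# Assumption: the URI is passed on the command line,
# e.g. ruby server_sketch.rb druby://192.0.2.10:9000
serve(SAMPLE_JOBS, ARGV[0]) if ARGV[0]
```

One limitation of this sketch: a job that has been handed out is forgotten if the client dies before reporting back, so a real server would want to track outstanding jobs and reissue them.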
The client is implemented in the GridClient class. You would need to make significant changes to this code so that it does exactly what you want it to do. For instance, the actual modeling code would be called from the GridClient class. You would also need to update the IP address of the server in this code. Starting the client is a breeze in Windows and Linux:
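For Windows, if you want the client to run in the background at low priority, you use the start command (the /low switch puts the process in the lowest priority class, so it only uses CPU time that nothing else wants):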
start /low ruby grid_client.rb
For Linux, you use the nice command (the number 15 in the example is the niceness value passed with -n; lower values mean higher priority, so 15 keeps the client out of the way of other work):
nice -n 15 ruby grid_client.rb
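To give a feel for what the client does once it is running, here is a rough sketch of the ask-for-a-job, do-the-work, send-it-back loop described earlier. It is not the real GridClient: fit_model is a placeholder for your own modeling code, and the server methods next_job and submit_result are hypothetical names. DRb.start_service and DRbObject.new_with_uri are real calls from Ruby's drb library.

```ruby
require 'drb/drb'

# Placeholder for the real modeling code; it just averages the values.
def fit_model(values)
  { mean: values.sum / values.size }
end

# Keeps requesting jobs until the server has none left.
# Returns the number of jobs this client completed.
def run_client(server)
  completed = 0
  while (job = server.next_job)
    gene_id, values = job
    server.submit_result(gene_id, fit_model(values))
    completed += 1
  end
  completed
end

# Connects to the server's published object and works through its jobs.
def main(uri)
  DRb.start_service
  server = DRbObject.new_with_uri(uri)
  puts "Completed #{run_client(server)} jobs"
end

# Assumption: the server URI is passed on the command line,
# e.g. ruby client_sketch.rb druby://192.0.2.10:9000
main(ARGV[0]) if ARGV[0]
```

Because each client simply asks for the next job whenever it finishes one, faster machines naturally end up doing more of the work without any extra scheduling logic.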
You can download BioGrid from Biocodenv.com.