MPTheCoder

BioRuby, Rails, Ubuntu...

GSoC weekly status report No.8

The 0.2 version of gff3-pltools has been released, together with the Ruby gem bio-gff3-pltools. Binary and source packages can be downloaded from the following location:

http://mamarjan.github.com/gff3-pltools/

Highlights of this release:

  • the project has been split into two repositories, gff3-pltools for the D library and utilities, and bioruby-gff3-pltools for the Ruby gem,
  • gff3-ffetch has three more options: --pass-fasta-through, --keep-comments and --keep-pragmas,
  • the Record class can now represent regular GFF3 lines, comments and pragmas, and a record’s type can be checked with is_regular(), is_comment() or is_pragma() (see the sketch after this list),
  • Feature objects now have toString() and recursiveToString() methods, and both Record and Feature objects have append_to() methods for better output performance,
  • the validation utility can be accessed from Ruby,
  • the utility for counting features is now much faster,
  • added more accessor functions for attributes defined in the spec,
  • GDC can be used instead of DMD by setting the environment variable DC to “gdc”, and the downloadable binaries are now built with GDC. Building the compiler on Ubuntu is a bit tricky; more on that can be found in this comment.
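
For illustration, here is a minimal sketch of telling the three record types apart. It assumes that parse_by_records() (shown in report No.5 below) returns a foreach-able range of Record objects and that the module path follows the bio.gff3.* layout; both are my assumptions, not the library’s documented API.

import std.stdio;
import bio.gff3.file; // assumed module path

void main() {
  // Hypothetical iteration over mixed record types.
  foreach (record; GFF3File.parse_by_records("file.gff3")) {
    if (record.is_regular())
      writeln("regular GFF3 line");
    else if (record.is_comment())
      writeln("comment");
    else if (record.is_pragma())
      writeln("pragma");
  }
}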

On Wednesday I’ll be traveling to Lodi for the EU-codefest, where I’ll be presenting the project and the current performance of the GFF3 parser and tools.

For the next release I would like to add parallelism to the parser. I’m also thinking about adding a new option to gff3-ffetch that would let the user specify which fields and attributes to output as tab-separated columns.

Filed under d bioruby gsoc

GSoC weekly status report No.7

I was hoping to get more done over the weekend, but the internet connection was down, so I had to take the weekend off :)

Otherwise I’m working toward the 0.2 version. The deadline is set for Saturday evening. What will be in it keeps changing, but for now there are new toString() and recursiveToString() methods in the Feature class, and append_to(…) methods which accept an Appender object, for more efficient output. The utility for correctly counting features is now notably faster, and gff3-ffetch has a new option for passing FASTA data through to the output.
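
As a hint of why append_to(…) helps, here is a minimal sketch; it assumes append_to() takes a std.array.Appender and writes the feature’s GFF3 text into it. The exact signature and module path are my guesses, not the library’s documented API.

import std.array : Appender, appender;
import bio.gff3.feature; // assumed module path

// Build one output buffer and let each feature append its text
// into it, instead of allocating a new string per feature.
string serialize(Feature[] features) {
  auto output = appender!string();
  foreach (feature; features)
    feature.append_to(output); // assumed signature
  return output.data;
}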

Currently in planning are: support for new types of records (pragmas and comments), GDC support and a Ruby interface for the validation utility. More could be added to this list, but I also have to make a plan for the second half of the summer, and that will take some time too.

I was hoping to use the GDC that comes with Ubuntu 12.04, but I gave up on that because of some confusing errors coming from the D standard library. I will instead try to build GDC directly from its GitHub repository and get my library to compile with it.

Making man pages available for binaries in gems is also a problem which currently has no elegant solution. I don’t want to force my users to type “gem man command”, so I’m planning to split the current repository into two: gff3-pltools for the D library and utilities, and a second repository for the Ruby library. The gff3-pltools package would then get a more traditional installation procedure and proper man pages.

Filed under gsoc d bioruby

GSoC weekly status report No.6 and v0.1.0

This post is a little bit late, but I wanted it to be the announcement of the first release, the v0.1.0… of gff3-pltools! I’m pragmatic and not very creative when it comes to naming things, as you can see.

I’ve created a minimal website for this project, which can be found here:

http://mamarjan.github.com/gff3-pltools/

There are links to binary gems for 32 and 64-bit Linux, a source package for other platforms, binary packages with the D tools only, and a link to the API docs for the Ruby library. Currently there is no gem which can be downloaded from rubygems.org, but I hope to add one later, with automatic building of the D libraries upon installation, provided the requirements are satisfied.

The binaries were tested on multiple clean Linux installations of Ubuntu and Fedora, both the latest and older releases, so I think they should work on most Linux systems without any additional requirements.

Feature-wise the D parser should be pretty much complete, but the tools are just being defined and developed. As promised, part of this release is the gff3-ffetch tool, which can be used to filter an existing file using a custom filtering expression. It’s very similar to the filtering functionality in the library, which I added last week, except that it doesn’t support the AND and OR operations. Those could be added later if there is interest, but they didn’t make it into this release. More information about this tool, with examples, can be found in the README:

https://github.com/mamarjan/gff3-pltools/blob/v0.1.0/README.md

The bio-gff3-pltools library can be used to interact with this tool to filter files or strings with GFF3 data in Ruby.

I’m currently thinking about releasing the next version in 10 days, a week before the EU-codefest, and then continuing to make a new release each week after that. That way I could make it to v1.0 before the end of the summer.

At this point I’m still not sure what will be in the next release; I’ll be defining it in the Issues on GitHub. You can follow the progress and take part in it by following this link:

https://github.com/mamarjan/gff3-pltools/issues?milestone=2&state=open

I will appreciate all input and comments, and requests for features or new tools even more.

Filed under bioruby d gsoc

GSoC weekly status report No.5

Summary of the last week

During the last week a few improvements have been made:

  • the validation messages have been improved with file names and line numbers, in compiler-error style,
  • filtering has been added,
  • replacing escaped characters has been re-implemented for a huge performance improvement. The 1GB file that used to take 10 minutes to parse because of its 6.5 million escaped characters is now parsed in 22.5 seconds, only 0.5 seconds more than when replacement is turned off,
  • added a tool for correctly counting features in a GFF3 file. This will be useful because the user can find a good value for the feature cache size by using this tool to get the correct count and the benchmark tool to get the count for a particular cache size. The tool is still slow for some files, so I’m thinking about how to improve that,
  • other small fixes, comments and similar…

More on filtering

The filtering was first implemented using classes, but was later refactored to use delegates instead. The result was 50 fewer lines of code.
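
For illustration, here is roughly what the delegate-based design looks like (my own simplified sketch, not the library’s actual code): each predicate is a delegate, and every combinator returns a new delegate that closes over its arguments, so no predicate classes need to be declared.

import bio.gff3.record; // assumed module path

alias bool delegate(string value) StringPredicate;
alias bool delegate(Record record) RecordPredicate;

// EQUALS builds a predicate over a single string value...
StringPredicate EQUALS(string expected) {
  return (string value) { return value == expected; };
}

// ...and ATTRIBUTE lifts it to a predicate over whole records.
// Assumes attributes are exposed as a string[string] associative
// array; the real accessor may differ.
RecordPredicate ATTRIBUTE(string name, StringPredicate predicate) {
  return (Record record) {
    auto value = name in record.attributes;
    return (value !is null) && predicate(*value);
  };
}

RecordPredicate OR(RecordPredicate left, RecordPredicate right) {
  return (Record record) { return left(record) || right(record); };
}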

The user can now specify a filter before parsing a file like this:

GFF3File.parse_by_records("file.gff3", NO_VALIDATION, false,
                          NO_BEFORE_FILTER,
                          OR(ATTRIBUTE("ID", EQUALS("1")),
                             ATTRIBUTE("ID", CONTAINS("2"))));

The first filter, set to none in this example, is applied before the line is parsed, which means it doesn’t support the ATTRIBUTE and FIELD predicates.

The following predicates are implemented: FIELD, ATTRIBUTE, EQUALS, CONTAINS, STARTS_WITH, AND, OR, NOT. If they’re used in a way which is not allowed, there will be a compiler error. Otherwise the allowed combinations should be logical enough to guess (but I’ll document them too).

I altered the benchmark tool a few times to test the performance, and what I found was very positive: the performance impact in the few tests I did was very small. I’ll have more data once the next tool is finished.

New week

Release early and often - it’s a mantra I’ve heard quite a few times before. So, as the group of mentors and students has agreed, every student will be releasing a gem at the end of this week.

I’m still not sure what will be in it, because support for shared libraries in D compilers on Linux has not been implemented yet. So it will probably be a combination of a command-line utility and a Ruby module which uses that utility.

What I currently have in mind is re-implementing in D the gff3-fetch utility that Pjotr developed in Ruby, to make it faster. But first I’ll implement the filtering functionality for it, so users can reduce a file to the records which are interesting to them and then parse those using a parser in Ruby, for example.

A Ruby module that would make using this utility easier for Ruby developers seems like a good idea for the first release.

Part of this utility will be support for GFF3 output, so that will be implemented too (and has already been done today to some extent).

Filed under d bioruby gsoc

GSoC weekly status report No.4

During the last week, combining records into features has been added, along with connecting features into parent-child relationships. Validation messages have been enhanced with file names and line numbers, and now look like errors reported by a compiler, which feels most natural to me.

Combining the records into features works by keeping a forward cache of a number of features (1000 by default, configurable). That means the parsing results will be correct only if records which are part of the same feature are at most 1000 features apart (or whatever cache size is set). The first implementation, which compared the IDs of records directly, required 10 minutes for a 233MB file. After switching to comparing hash values of the IDs first, and comparing the IDs themselves only when the hashes match, the parsing time was down to 45 seconds. After fixing a bug, the time is now 10 seconds for the 233MB m_hapla file :)

Linking the features into parent-child relationships works similarly, using 32-bit hashes most of the time instead of comparing strings. With this functionality turned on, the same file is parsed in 13 seconds.
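
The hash-first trick is simple enough to sketch (again, my own simplified version, not the library’s code): each cached ID carries a 32-bit hash, and the expensive string comparison runs only when the hashes match.

// Simplified sketch of hash-first ID comparison.
struct CachedID {
  uint hash;
  string id;
}

// Any cheap 32-bit string hash will do; djb2 is shown here.
uint hash32(string s) {
  uint h = 5381;
  foreach (char c; s)
    h = h * 33 + c;
  return h;
}

bool same_id(const ref CachedID cached, uint candidate_hash, string candidate) {
  // The integer comparison filters out almost all mismatches
  // cheaply; only (rare) hash matches reach the string compare.
  return cached.hash == candidate_hash && cached.id == candidate;
}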

All the measurements were done using the benchmark utility, which now has a few more options for choosing what should be run.

Otherwise I did more refactoring: I moved all the gff3_* files into a gff3 directory, so the D modules are now bio.gff3.*; parsing functions are now static methods of the GFF3File and GFF3Data classes; etc.

For the new week, I would like to add filtering to the D library, which I can then use to implement iteration over genes, mRNAs, CDS features, etc. After that the library should be pretty much complete feature-wise, at least per what was promised in the project proposal, so I’ll continue by defining the C API and developing the Ruby gem.

Filed under bioruby d gsoc

GSoC weekly status report No.2

It’s the end of the second week of GSoC and time for a new report.

I spent the last week mostly doing work based on criticism from my mentor. The D parser which parses lines into records is now in pretty good shape, and tested. Today I received a list of new issues that need to be resolved before going further, but they’re not that much work, so I can plan some new developments.

A validation utility is planned for next week, which could also be used for performance measurement. After that I will turn to making the current parser parallel.

Also, tomorrow I’ll be defending my Master’s thesis, after which I should be able to concentrate more on the GFF3 parser.

Filed under gsoc d bioruby

GSoC weekly status report No.1.2

It’s been three months since my first introduction on the BioRuby ML and it’s been great. As the GSoC community bonding period comes to an end, I would like to thank Pjotr most of all, and then all the other community members, for their help and support. It’s a great feeling to become a member of a small but growing community of enthusiasts who work together for the benefit of all of us, and for fun.

As Pjotr already did, I would like to encourage you to write blog posts about using Ruby in bioinformatics and let us include them in our RSS and news feeds on the biogems.info website. The site supports both RSS and Atom feeds now, and similar functionality will be part of the new BioRuby website once it’s finished. The code also supports including only the posts for one category/tag, so you can tag your posts with BioRuby or similar, and only those posts will be included in the RSS feed on biogems.info.

The GSoC coding period starts today. It’s time for me to roll up my sleeves and start working on the GFF3 parser full-time.

Filed under gsoc bioruby

GSoC weekly status report No.1.1

This year we GSoC students sure are a very creative group; just look at the numbering schemes for our pre-coding-period status reports - everyone has his own thing going :)

And now back to the GFF3 project. I found a few more sites with big GFF3 files; those will be great for performance testing. And Robert Buels suggested that I reuse the test suite from Perl’s Bio::GFF3::LowLevel::Parser, which I think is a great idea. I should definitely use it for completeness testing, and I will check the test suites of other GFF3 parsers too.

I have also finished the work for the first week, which basically means I’m already more than two weeks ahead of schedule. The parser is now reading data on the D side and forwarding it to Ruby line by line. That won’t be faster than reading the file from Ruby, but it’s a nice basic case to get data flowing from D to Ruby.

The rake tasks have been improved too. There are now two tasks for building the D library, “compile” and “compiledebug”, the “spec” task for running RSpec tests, and the “features” task for running Cucumber tests. The “clean” task now deletes object and library files.

There is also a problem with the D library and the garbage collector. It seems this is the problem Iain Buclaw (one of the GDC developers) warned us about. When using a D shared library, once the GC kicks in for the first time, it looks like it collects all the static data, for example the per-module variables - pretty much everything. Even when we register a chunk of malloc-allocated memory with the GC, it still gets collected. Or at least that’s what it looks like. However, Iain also assured us that this will be solved by the end of this month or the beginning of the next. My Cucumber and RSpec tests still work because they don’t allocate enough memory for the GC to run, but to be sure this issue doesn’t interfere with development at this point, I manually disabled the GC on library initialization. I haven’t tried it yet, but from what has been discussed in the forums, both 32 and 64-bit DLLs on Windows built using DMD work fine.
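
Disabling the collector itself is a one-liner from core.memory; a sketch of doing it in a library initialization function follows (the function name is hypothetical, not the library’s actual entry point).

import core.memory : GC;

// Hypothetical init function exported to the Ruby side.
extern (C) void gff3_init() {
  // Work around the shared-library GC issue: turn automatic
  // collections off until the fix in GDC lands.
  GC.disable();
}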

I also helped Pjotr with getting our blog posts included in the RSS feed on biogems.info.

Filed under gsoc bioruby D

GSoC weekly status report No.1

It has been 10 days since the GSoC results were published, and a lot has happened since then. I got to know the other students and mentors in a longish meeting on Google Hangout, I got into a discussion with my mentor on IRC in which we didn’t agree about the parallelization strategy for the parser (experiments will show who’s right), and my inbox is full of mails from my mentor and other students in which we exchanged loads of interesting ideas. Also, there was a bug in the biogems.info site…

biogems.info

We had a bug in the biogems.info website last weekend. The part of the code that was retrieving commit history data from GitHub wasn’t resistant to fluctuations in github.com availability. It seems GitHub had some problems over the weekend (probably a lot of people were working on their hobby projects), and the script that retrieves the data and generates the website was failing after it had retrieved data for only a couple of biogems. Making the script ask again for the same data when the first request wasn’t successful solved the problem.

Here is the link to the pull-request with the fix:

https://github.com/pjotrp/biogems.info/pull/13

Administrative

There is now a Git repository on GitHub for my GSoC project:

https://github.com/mamarjan/bioruby-hpc-gff3

Coding

I have already started doing the work that was planned for the first week of the coding period, to be sure the Ruby interpreter will play along nicely with the D runtime. The first impressions are good: there are now some Cucumber tests and features documenting most of the plugin interface that was intended for the first week (iterating over lines), and there are a few rake tasks for building the D library and running tests. However, the D part for the first week is still not finished.

Currently only Ubuntu 32-bit and the DMD compiler are supported. This will of course be extended over the summer.

I’m also looking into extending the rake-compiler gem to support D. Currently it can be used to build native gems written in C or Java, and there are quite a few gems that are using it. This would make it possible for future gems using D to have a similar directory structure and to reuse the tasks for compiling D libraries on different platforms.

64-bit shared libraries

There is currently no way to build 64-bit shared libraries on Linux with any of the three available D compilers. And we can expect that most users are going to use 64-bit operating systems along with 64-bit versions of Ruby.

Because of that we contacted the author of the GDC compiler, which is open source and in good shape, and received word that this problem is scheduled to be solved within a month or two. With that info, in a discussion with Pjotr and Artem, we decided that it’s good enough for us to continue the parser projects in D. We can start working on our projects using 32-bit builds of Ruby, and once the support is available we can start testing our applications in 64-bit mode.

Also, on Windows there are no issues creating 64-bit DLLs, so we don’t expect any problems with that platform.

The first performance test

At the moment there are two approaches to parallelizing the parser that I would like to compare, so I can choose the better one. In one approach, the Ruby code simply sends the file path to the D library and receives parsed data. In the other, the Ruby code reads data from the file and passes it in blocks to the D library, which then distributes the work to multiple actors.

Just to be sure that reading the data from the file into Ruby strings won’t be that expensive, I made a quick performance test. Here is the script:

filename = ARGV[0]

open(filename, "r") do |f|
  while f.read(1024*1024)
  end
end

It reads the whole file a block at a time.

The results are encouraging. Once the 1.1GB GFF3 test file was cached in memory, the script needed only 0.6 seconds to finish reading the file again (on my Intel i3 laptop with 4GB of RAM).

There is, however, one point I need to add. When the block size is reduced to 512 bytes, that is, when only 512 bytes are read at a time, the script requires 4.5 seconds instead of 0.6. So, for best performance, it is important that the Ruby part of the parser reads data in big chunks and sends it to the D parsing code.

What’s next?

Next, I’m going to search for applications which are using GFF3 files and ask around for examples that might be in development and that would benefit from a fast sequential parser. Also, a list of GFF3 files from different sources would help me make the parser more robust and useful.

Also, there seems to be huge interest in a GFF3 parser with more features, like indexing, random access and writing output, as well as support for linking into trees features that are not located close to each other in the file. A fast sequential parser could be used to generate the indexes, and the lower-level parts could be used to reorder the file for faster future use. Based on that, I think this project is a good start.

So, again, if you’re using the GFF3/GTF file formats in your research, I would like to ask you to send me example files and descriptions of how your applications use the data. This way I’ll be able to test the parser against your files and optimize it for your applications. Thank you! :)

Filed under gsoc bioruby