Yesterday I tagged the 0.4 release of gff3-pltools, and that marks the end of the summer. At least in GSoC terms. Should I say end of the project? I don’t think so. The tools can still be improved, and the Ruby bindings should follow.
The major changes since the last release include the following:
- filtering functionality has been moved to a separate utility: gff3-filter, along with a new language for specifying filtering expressions,
- conversion of selected fields to table format has been moved to a separate utility: gff3-select. However, the --select option is still part of gff3-filter,
- gff3-ffetch now assembles FASTA sequences for CDS and mRNA records and features from GFF3 and FASTA files,
- man pages have been added for the utilities.
A new filtering language
At the EU-codefest in Italy, I saw that a more flexible filtering language was necessary. It is part of this release, with a new syntax and a range of new operators: ==, !=, >, <, >=, <=, +, -, *, /, "and" and "or".
For example, to keep only the records which are above 200 nucleotides in size, you can use this command:
# gff3-filter "(field end - field start) > 200" m_hapla.gff3 -o filtered.gff3
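For readers who want to see exactly what that expression does, here is a rough Python equivalent of the length filter. This is only an illustration: gff3-filter evaluates its expressions in D, and the record layout and helper names below are mine, not part of gff3-pltools.

```python
# Illustrative sketch only; not gff3-pltools code.

def parse_line(line):
    """Split one GFF3 line into the fields the filter can reference."""
    cols = line.rstrip("\n").split("\t")
    return {"seqname": cols[0], "feature": cols[2],
            "start": int(cols[3]), "end": int(cols[4])}

def longer_than(n):
    """Python equivalent of the expression '(field end - field start) > n'."""
    return lambda rec: (rec["end"] - rec["start"]) > n

lines = [
    "chr1\t.\tgene\t100\t400\t.\t+\t.\tID=g1",   # length 300, kept
    "chr1\t.\tgene\t300\t350\t.\t+\t.\tID=g2",   # length 50, dropped
]
kept = [r for r in map(parse_line, lines) if longer_than(200)(r)]
print([r["start"] for r in kept])  # [100]
```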
New functionality in gff3-ffetch
My original idea for gff3-ffetch was to make it a general GFF3 processing utility. But that would not be in the spirit of UNIX. So now the filtering functionality lives in gff3-filter, and selection and table output in gff3-select. The gff3-ffetch tool has been rewritten to do what it was originally intended for, and that is assembling FASTA sequences of features in GFF3 files, much like the tool Pjotr created, except that the default case is now "per spec", and it's much faster.
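For the curious, the core of the CDS case can be sketched in a few lines of Python. This is not the gff3-ffetch code (that is D), just an illustration of the idea: take the CDS segments belonging to one mRNA, cut them out of the reference sequence (GFF3 coordinates are 1-based and inclusive), join them in order, and reverse-complement the result on the minus strand.

```python
# Illustrative sketch, not gff3-ffetch itself.

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def assemble_cds(reference, segments, strand):
    """segments: (start, end) pairs, 1-based inclusive, for one mRNA's CDS records."""
    parts = [reference[s - 1:e] for s, e in sorted(segments)]
    joined = "".join(parts)
    return revcomp(joined) if strand == "-" else joined

ref = "AAGGTTCCAAGGTTCC"
print(assemble_cds(ref, [(1, 4), (9, 12)], "+"))  # AAGGAAGG
print(assemble_cds(ref, [(1, 4), (9, 12)], "-"))  # CCTTCCTT
```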
All the functionality developed before was about processing data. There was no Bio part in developing or testing it, only informatics: simple transformations between input and output. However, to develop and test the current gff3-ffetch, I had to get familiar with quite a few Bio concepts.
The original idea was to create a GFF3/GTF parser in D with Ruby bindings. The Ruby bindings part didn't work out because there is still no support for D shared libraries on Linux, but instead there are now a few useful command-line tools for processing GFF3 which can be used without programming knowledge.
To me, the summer was fun, challenging, and a great experience. I even got to meet my mentor in person, and other community members too, and to make my first steps in bioinformatics. I even gave a small presentation at the EU-codefest. What a summer it was!
Thanks to everybody who made it possible: Google, Open Bioinformatics Foundation and my mentor Pjotr Prins.
The 0.3 release is available on the website:
In addition to what was described in the last weekly report, a GFF3 sorting tool has been added, grouping records which belong to the same feature, and Ruby bindings have been updated to support GTF and new options in gff3-ffetch.
The trip to Lodi was very fruitful. It was great to meet both my mentor and other community members.
Based on the input received at the codefest, I created a new plan for the second part of the summer:
Since then I have done the following:
- improved validation speed,
- added GTF support for input and output,
- table output with an option to select which fields and attributes should be in the table,
- tools for conversion to GTF and JSON,
- JSON output support, which needs some more polish.
The 0.2 version of gff3-pltools has been released, together with a Ruby gem bio-gff3-pltools. Binary and source packages can be downloaded from the following location:
Highlights of this release:
- the project has been split into two repositories, gff3-pltools for the D library and utilities, and bioruby-gff3-pltools for the Ruby gem,
- gff3-ffetch has three more options: --pass-fasta-through, --keep-comments and --keep-pragmas,
- class Record can now represent regular GFF3 lines, comments and pragmas, and the type can be deduced by asking the record if it is_regular(), is_comment() or is_pragma(),
- Feature objects have a toString(), a recursiveToString(), and both Record and Feature objects have append_to() methods for better output performance,
- the validation utility can be accessed from Ruby,
- utility for counting features is now much faster,
- added more accessor functions for attributes defined in the spec,
- GDC can be used instead of DMD by setting the environment variable DC to "gdc", and the downloadable binaries are now built with GDC. Building the compiler on Ubuntu is a bit tricky; more on that can be found in this comment.
On Wednesday I'll be traveling to Lodi for the EU-codefest, where I'll be presenting the project and the current performance of the GFF3 parser and tools.
For the next release I would like to add parallelism to the parser. I’m also thinking about adding a new option to gff3-ffetch, which would let the user specify which fields and attributes to output in tab-separated columns.
I was hoping to get more done over the weekend, but the internet connection was down, so I had to take the weekend off :)
Otherwise I'm working toward the 0.2 version. The deadline is set for Saturday evening. What will be in it keeps changing, but for now there are new toString() and recursiveToString() methods in the Feature class, and append_to(…) methods which accept an Appender object, for more efficient output. The utility for correctly counting features is now notably faster, and gff3-ffetch has a new option for passing FASTA data through to the output.
Currently in planning are: support for new types of records (pragmas and comments), GDC support and Ruby interface for the validation utility. More could be added to this list, but I also have to make a plan for the second half of the summer, and that will take some time too.
I was hoping to use the GDC which comes with Ubuntu 12.04, but I gave up on that because of some confusing errors coming from the D standard library. I will try to build GDC directly from its GitHub repository and get my library to compile with it.
Making man pages for binaries in gems is also a problem which currently has no elegant solution. I don't want to force my users to type "gem man command", so I'm planning to split the current repository into two: gff3-pltools in D, and a second repository for the Ruby library. The gff3-pltools repository would then get a more traditional installation procedure and proper man pages.
This post is a little bit late, but I wanted it to be the announcement of the first release, the v0.1.0… of gff3-pltools! I’m pragmatic and not very creative when it comes to naming things, as you can see.
I’ve created a minimal web-site for this project, which can be found here:
There are links to binary gems for 32 and 64-bit Linux, a source package for other platforms, binary packages with the D tools only, and a link to the API docs for the Ruby library. Currently there is no gem which can be downloaded from rubygems.org, but I hope to add one later, with automatic building of the D libraries upon installation, provided the requirements are satisfied.
The binaries were tested on multiple clean Linux installations of Ubuntu and Fedora, both the latest and older releases, so I think they should work on most Linux systems without any additional requirements.
Feature-wise the D parser should be pretty much complete, but the tools are still being defined and developed. As promised, part of this release is the gff3-ffetch tool, which can be used to filter an existing file using a custom filtering expression. It's very similar to the filtering functionality in the library, which I added last week, except that it doesn't support AND and OR operations. Those could be added later if there is interest, but they didn't make it into this release. More information about this tool, with examples and everything else, can be found in the README:
The bio-gff3-pltools library can be used to interact with this tool to filter files or strings with GFF3 data in Ruby.
I’m currently thinking about releasing the next version in 10 days, a week before the EU-codefest, and then to continue making new releases each week after that. That way I could make it to v1.0 before the end of the summer.
At this point I’m still not sure what will be in the next release, I’ll be defining it in the Issues on GitHub. You can follow the progress and take part in it by following this link:
I will appreciate all input and comments, but requests for features or new tools even more.
Summary of the last week
During the last week a few improvements have been made:
- the validation messages have been improved with file names and line numbers, in compiler-error style,
- filtering has been added,
- replacing of escaped characters has been re-implemented for a huge performance improvement: the 1GB file that previously required 10 minutes to parse because of its 6.5 million escaped characters is now parsed in 22.5 seconds, only 0.5 seconds more than with replacement turned off,
- added a tool for correctly counting features in a GFF3 file. This is useful because the user can find a good value for the feature cache size by using this tool to get the correct count and the benchmark tool to get the count for a particular cache size. The tool is still slow for some files, so I'm thinking about how to improve that,
- other small fixes, comments and similar…
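On the escaping speedup: I won't paste the D code here, but one common way to get that kind of improvement (and this sketch is my assumption about the approach, not the actual implementation) is to pay for unescaping only when a '%' is actually present, so the common case returns the original string and allocates nothing.

```python
# Hedged sketch; the real implementation is in D.
from urllib.parse import unquote

def replace_escaped(field):
    """Unescape a GFF3 field, but only allocate when a '%' escape exists."""
    if "%" not in field:
        return field       # common case: hand back the original string
    return unquote(field)  # rare case: build the unescaped copy

s = "plain_value"
print(replace_escaped(s) is s)            # True: no copy was made
print(replace_escaped("ID%3Dgene1"))      # ID=gene1
```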
More on filtering
The filtering was first implemented using classes, but later refactored to use delegates instead. The result was 50 fewer lines of code.
The user can now specify a filter before parsing a file like this:
GFF3File.parse_by_records("file.gff3", NO_VALIDATION, false,
The first filter, which is set to none in this example, is applied before the line is parsed, which means that filter doesn't support the ATTRIBUTE and FIELD predicates.
The following predicates are implemented: FIELD, ATTRIBUTE, EQUALS, CONTAINS, STARTS_WITH, AND, OR, NOT. In case they’re used in a way which is not allowed, there will be a compiler error. Otherwise the allowed combinations should be logical enough to guess (but I’ll document them too).
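To give a feel for how the delegate version composes, here are a few of the predicates reimagined as Python closures (the record layout is made up for the example; the real predicates are D delegates):

```python
# Illustrative Python, not the D library's API.

def FIELD(name, pred):
    """Apply a string predicate to one named field of a record."""
    return lambda rec: pred(rec[name])

def EQUALS(value):
    return lambda s: s == value

def STARTS_WITH(prefix):
    return lambda s: s.startswith(prefix)

def AND(a, b):
    return lambda rec: a(rec) and b(rec)

def NOT(a):
    return lambda rec: not a(rec)

# Keep CDS records whose seqname starts with "chr":
keep = AND(FIELD("feature", EQUALS("CDS")),
           FIELD("seqname", STARTS_WITH("chr")))
print(keep({"feature": "CDS", "seqname": "chr1"}))   # True
print(keep({"feature": "mRNA", "seqname": "chr1"}))  # False
```

Each predicate is a small function returning a closure instead of a full class, which is roughly where the 50-line saving in the D version comes from.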
I altered the benchmark tool a few times to test the performance, and what I found was very positive: the performance impact in the few tests I did was very small. I'll have more data once the next tool is finished.
Release early and often - it's a mantra I've heard quite a few times before. So, as the group of mentors and students has agreed, every student will be releasing a gem at the end of this week.
I'm still not sure what will be in it, because support for shared libraries in D compilers on Linux has not been implemented yet. So it will probably be a combination of a command-line utility and a Ruby module which uses that utility.
What I currently have in mind is re-implementing in D the gff3-fetch utility Pjotr developed in Ruby, to make it faster. But first I'll implement filtering functionality for it, so users can reduce a file to the records which interest them and then parse that with, for example, a parser in Ruby.
A Ruby module that would make using this utility easier for Ruby developers seems like a good idea for the first release.
Part of this utility will be support for GFF3 output, so that will be implemented too (and has already been done today to some extent).
During the last week, combining records into features has been added, along with connecting the features into parent-child relationships. Validation messages have been enhanced with file names and line numbers, and now look like errors reported by a compiler. That feels most natural to me.
Combining the records into features works by keeping a forward cache of a number of features (1000 by default, configurable). That means the parsing results will be correct only if records which are part of the same feature are at most 1000 features apart (or whatever cache size is set). The first implementation, which compared the IDs of records, required 10 minutes for a 233MB file. After switching to comparing hash values of the IDs first, and comparing the IDs themselves only when the hashes match, the parsing time was down to 45 seconds. After fixing a bug, the time is now 10 seconds for the 233MB m_hapla file :)
Linking the features into parent-child relationships works similarly, by using 32-bit hashes most of the time instead of comparing strings. With this functionality turned on, the same file is parsed in 13 seconds.
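The hash trick is simple enough to sketch in Python (illustrative only; the real code is D, and the cache layout below is my own invention):

```python
# Feature lookup with hash-first comparison; not the actual D cache.
import zlib

def h32(s):
    """Cheap stand-in for the 32-bit ID hash."""
    return zlib.crc32(s.encode())

cache = []  # list of (hash, id, feature) entries, newest last

def add_feature(record_id, feature):
    cache.append((h32(record_id), record_id, feature))

def find_feature(record_id):
    rh = h32(record_id)
    for fh, fid, feat in cache:
        # The integer compare rejects most non-matches; the full string
        # comparison runs only when the hashes actually match.
        if fh == rh and fid == record_id:
            return feat
    return None

add_feature("gene1", {"id": "gene1", "records": []})
print(find_feature("gene1")["id"])  # gene1
print(find_feature("gene2"))        # None
```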
All the measurements have been done using the benchmark utility, which has a few more options for setting what should be run.
Otherwise I did more refactoring, moved all the gff3_* files into a gff3 directory, so the D modules are now bio.gff3.*, parsing functions are now static methods of GFF3File and GFF3Data classes, etc.
For the new week, I would like to add filtering to the D library, which I can then use to implement iteration over genes, mRNAs, CDS features, etc. After that the library should be pretty much complete feature-wise, at least per what was promised in the project proposal, so I’ll continue by defining the C API and developing the Ruby gem.
My first report as a Master of Computer Engineering and Communications :)
Here is a list with what I’ve been working on the last week:
- more cleanup and refactoring of the validation code, README, etc.,
- made a validation utility in D, which simply reports problems found to stderr,
- made a benchmark tool with -v option for measuring parser speed with and without validation,
- after having a basic benchmark tool, I found a few places which were very bad for performance. After fixing that code, parsing a 233MB GFF3 file on a five-year-old PC took 6 seconds (without validation, with only a single thread, and with replacing of escaped characters turned off),
- made replacing escaped characters optional, because the current implementation requires the creation of additional string objects, which has a big impact on performance. There is a plan for making it faster, but it is scheduled for later,
- added minimal parallelisation, by reading the file in a separate thread.
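The reading-in-a-separate-thread setup is the classic producer/consumer pattern. A minimal Python sketch (the real code is D; chunk size and queue depth here are arbitrary, chosen tiny so the example is easy to trace):

```python
# Producer/consumer sketch: a reader thread pushes file chunks onto a
# bounded queue while the main thread consumes ("parses") them.
import io
import queue
import threading

def read_chunks(f, q, chunk_size=4):
    """Reader thread: push chunks of the file onto the queue."""
    for chunk in iter(lambda: f.read(chunk_size), ""):
        q.put(chunk)
    q.put(None)  # sentinel: no more data

data = io.StringIO("chr1\tsrc\tgene\n" * 3)  # stand-in for a GFF3 file
q = queue.Queue(maxsize=8)
reader = threading.Thread(target=read_chunks, args=(data, q))
reader.start()

parsed = []  # the "parser" here just collects the chunks
while (chunk := q.get()) is not None:
    parsed.append(chunk)
reader.join()
print("".join(parsed) == "chr1\tsrc\tgene\n" * 3)  # True
```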
Two additional days were spent on a segmentation fault in the D garbage collector which occurred when parsing a big file with a lot of errors. That should never happen, as I'm using the safe part of the D language, that is, no pointers or anything similar. The worst that should happen is an exception; a segmentation fault points to an error in the compiler, the runtime or a support library.
The minimal reproducible example is still 42 lines long:
but changing anything in it makes the segmentation fault go away. More info on this topic can be found in the discussion here:
I'll probably be posting a bug report on the D language website tomorrow.
For the coming week I would like to add more parallelisation, change the validation code so that exceptions almost never happen (and the seg fault also) and add support for merging records into features.
It’s the end of the second week of GSoC and time for a new report.
I spent the last week mostly doing work based on criticism from my mentor. The D parser which parses lines into records is now in pretty good shape, and tested. Today I received a list of new issues that need to be resolved before going further, but they're not that much work, and I can plan some new developments.
A utility for validation is planned for next week, which could also be used for performance measurement. After that, I will turn to making the current parser parallel.
Also, tomorrow I'll be defending my Master's thesis, after which I should be able to concentrate more on the GFF3 parser.