Jonasfj.dk/Blog
A blog by Jonas Finnemann Jensen


August 27, 2012
Regular Expressions on dxr.allizom.org
Filed under: Computer,English,Mozilla by jonasfj at 9:01 pm

A couple of weeks I introduced TriLite, an Sqlite extension for fast string matching. TriLite is still very much under active development and not ready for general purpose use. But over the past few weeks I’ve integrated TriLite into DXR, the source code indexing tool I’ve been working on during my internship at Mozilla. So it’s now possible to search mozilla-central using regular expressions, for an example regexp:/(?i)bug\s+#?[0-9]+/ to find references to bugs.

For those interested, I did my end-of-internship presentation of DXR last Thursday, it’s currently available on air.mozilla.org (I’m third on the list, also posted below). I didn’t reharse it very much so appoligies if it’s doesn’t make any sense. What does make sense however, is that fact that dxr.allizom.org supports substring and regular expression searches, so fast that we can facilitate incremental search.


(My end-of-internship presentation, slides available here)

Anyways, as you might have guessed from the fact that I gave an end-of-internship presentation, my internship is coming to an end. I’ll be flying home to Denmark next Saturday to finish my masters. But I’ll probably continue to actively develop TriLite and make sure this project reaches a level where it can be reused by others. I’ve already seen other suggestions for where substring search could be useful.



August 16, 2012
TriLite, Fast String Matching in Sqlite
Filed under: Computer,English,Mozilla by jonasfj at 7:27 pm

The past two weeks I’ve been working on regular expression matching for DXR. For those who doesn’t know it, DXR is the source code cross referencing tool I’m working on during my internship at Mozilla. The idea is to make DXR faster than grep and support full featured regular expressions, such that it can eventually replace MXR. The current search feature in DXR uses the FTS (Full Text Search) extension for Sqlite, with a specialized tokenizer. However, even with this specialized tokenizer DXR can’t do efficient substring matching, nor is there any way to accelerate regular expression matching. Which essentially means DXR can’t support these features, because full table scans are too slow on a server that serves many users. So to facilitate fast string matching and allow restriction on other conditions (ie. table joins), I’ve decided to write an Sqlite extension.

Introducing TriLite, an inverted trigram index for fast string matching in Sqlite. Given a text, a trigram is a substring of 3 characters, the inverted index maps from each trigram to a list of document ids containing the trigram. When evaluating a query for documents with a given substring, trigrams are extracted from the desired substring, and for each such trigram a list of document ids is fetched.  Document ids present in all lists are then fetched and tested for the substring, this reduces the number of documents that needs to fetched and tested for the substring. The approach is pretty much How Google Code Search Worked. In fact, TriLite uses re2 a regular expression engine written by the guy who wrote Google Code search.

TriLite is very much a work in progress, currently, it supports insertion and queries using substring and regular expression matching, updates and deletes haven’t been implemented yet. Anyways, compared to the inverted index structure used in the FTS extension, TriLite has a fairly naive implementation, that doesn’t try to provide a decent amortized complexity for insertion. This means that insertion can be rather slow, but maybe I’ll get around to try and do something about that later.

Nevertheless, with the database in memory I’ve been greping over the 60.000 files in mozilla-central in about 30ms. With an index overhead of 80MiB for the 390MiB text in mozilla-central, the somewhat naive inverted index implementation employed in TriLite seems pretty good. For DXR we have a static database so insertion time is not really an issue, as the indexing is done on an offline build server.

As the github page says, TriLite is no where near ready for use by anybody other than me. I’m currently working to deploy a test version of DXR with TriLite for substring and regular expression matching. Something I’m very much hoping to achieve before the end of my internship at Mozilla. So stay tuned, a blog post will follow…



August 10, 2012
Deploying DXR on dxr.mozilla.org
Filed under: Computer,English,Mozilla by jonasfj at 6:31 pm

This year I’m spending my summer vacation interning at the Mozilla office in Toronto. The past month I’ve been working on DXR, a source code indexing and cross referencing application. So far I’ve worked to deploy DXR on Mozilla infrastructure and is happy to report that dxr.mozilla.org is no longer redirecting to dxr.lanedo.com. The releng team have a build bot indexing mozilla-central, so that we always have a fresh index on dxr.mozilla.org. So in blind faith that this will never crash horribly, I’ll tell to:

Update your bookmarks to: dxr.mozilla.org

Whilst I’ve been handling most of the deployment issues, the releng team have been doing the heavy lifting when comes to automatic build, thanks! This means, that I’ve been working on cleaning up, refactoring, redesigning and rewriting parts of DXR. This sounds like a lot of behind the scenes work that nobody will ever see, however, this means that DXR now has a decent template system. And well, who would I be if I didn’t write a template for it.

You can find the tip of DXR that I’m working on at dxr.allizom.org, please check out it and let me know if there’s something you don’t like and would like to appear. I realize that the site currently have a few crashes, and don’t worry I won’t rush this into production before I’ve ratted these out. Now, as evident from this blog, I’m by no means a talented designer, so if you feel that grey should be red or whatever, please leave a color code in the comments and I’ll try it out.

I’m currently looking into replacing the full text search and my somewhat buggy text tokenizer with a trigram index that’ll facilitate proper substring matching and if we lucky regular expression matching. But I guess we will have to wait and see how that turns out. In the mean time leave a comment if you have questions, requests, suggestions and/or outcries.