Jonasfj.dk/blog

Jonasfj.dk/Blog
A blog by Jonas Finnemann Jensen

August 16, 2012

Filed under: Computer,English,Mozilla by jonasfj at 7:27 pm

The past two weeks I’ve been working on regular expression matching for DXR. For those who doesn’t know it, DXR is the source code cross referencing tool I’m working on during my internship at Mozilla. The idea is to make DXR faster than grep and support full featured regular expressions, such that it can eventually replace MXR. The current search feature in DXR uses the FTS (Full Text Search) extension for Sqlite, with a specialized tokenizer. However, even with this specialized tokenizer DXR can’t do efficient substring matching, nor is there any way to accelerate regular expression matching. Which essentially means DXR can’t support these features, because full table scans are too slow on a server that serves many users. So to facilitate fast string matching and allow restriction on other conditions (ie. table joins), I’ve decided to write an Sqlite extension.

Introducing TriLite, an inverted trigram index for fast string matching in Sqlite. Given a text, a trigram is a substring of 3 characters, the inverted index maps from each trigram to a list of document ids containing the trigram. When evaluating a query for documents with a given substring, trigrams are extracted from the desired substring, and for each such trigram a list of document ids is fetched. Document ids present in all lists are then fetched and tested for the substring, this reduces the number of documents that needs to fetched and tested for the substring. The approach is pretty much How Google Code Search Worked. In fact, TriLite uses re2 a regular expression engine written by the guy who wrote Google Code search.

TriLite is very much a work in progress, currently, it supports insertion and queries using substring and regular expression matching, updates and deletes haven’t been implemented yet. Anyways, compared to the inverted index structure used in the FTS extension, TriLite has a fairly naive implementation, that doesn’t try to provide a decent amortized complexity for insertion. This means that insertion can be rather slow, but maybe I’ll get around to try and do something about that later.

Nevertheless, with the database in memory I’ve been greping over the 60.000 files in mozilla-central in about 30ms. With an index overhead of 80MiB for the 390MiB text in mozilla-central, the somewhat naive inverted index implementation employed in TriLite seems pretty good. For DXR we have a static database so insertion time is not really an issue, as the indexing is done on an offline build server.

As the github page says, TriLite is no where near ready for use by anybody other than me. I’m currently working to deploy a test version of DXR with TriLite for substring and regular expression matching. Something I’m very much hoping to achieve before the end of my internship at Mozilla. So stay tuned, a blog post will follow…

Comments (7)

7 Comments »

This sounds awesome. Since you are not just reusing the FTS3/4 codebase, one feature request I had for FTS4 (that did not make it in) that might be worth considering is storing additional information inline with the doc-id lists.

In short, the idea is to allow your full-text index to also serve as a covering index.

As an example, for Thunderbird fulltext search, we would do a fulltext search and then attempt to rank messages by their recency, static interestingness (based on whether the message was starred, involved people in the address book, etc.) and take a LIMIT over that. Since FTS3 could not maintain additional meta-information, our queries had to do full random disk access for all messages in order to discard them down to the limit.

While one could obviously go way-too-far with this and wildly bloat the index, I think there are benefits even for dxr. For example, being able to include a few bits of extra information with your tri-grams could help indicate that the tri-gram was part of a comment, a string, an identifier in a declaration, an identifier in a usage, etc.

Of course, it could be just as easy and possibly more efficient to simply shard the data, especially if your database is small enough to be entirely cached in memory so random access costs are meaningless. I think my proposal is most advantageous in what amounts to overload cases, especially if the covering index data can simply be directly retrieved. For example, assuming you use your DXR magic to get type identifiers for everything, and were to emit the token “bar” from “foo->Release” so that the type identifier for “foo” is associated for the tri-grams generated from “Release”. Ignoring that macrology hides most of our Release calls, we could call this an overload case for XPCOM.

Using the index magic, we could potentially avoid having to do any document lookups, and first present the user with the list of all of the types that could be associated with the query to use to reduce the document search space which is good for the server, but also good for the user because they could rule out all of the reference-counting stuff related to nsISupports they simply do not care about. Note that I am coming from a world of FTS3 where offsets are encoded along with the document-id’s; I realize that without the offset information, this technique is much more susceptible to false positives and much of the cleverness is leveraging that and the likely necessity to have to scan at least part of the document anyways.

Comment by Andrew Sutherland — August 18, 2012 @ 3:56 am
I suppose it might be possible to add extra data in the document list, but keep in mind that a document can have up to n – 2 unique trigrams. Although the number of unique trigrams is smaller for large documents, each document id will be featured in many doc-list. Thus, a lot of overhead even if it’s just an int, currently I use delta list encoding with variable integer length compression, so each entry is likely 1 or 2 bytes.

As you suggest, I do not store offset information, and as I’m matching substrings and regular expressions I have to fetch and scan the documents. However, this takes place in a scalar function, which IFAIK means that it is done after all joins.

So, I think a vertical partitioning might be better. Ie. store the metadata column in a separate table and join it on the trigram index. That way the metadata column can be used to restrict the results before the document is fetched and scanned.

And if you keep metadata in a special table, that table should be rather small. Although there’s no guarantee that Sqlite won’t scatter the pages all over the database file. But I suppose that’ll always be a problem unless you vacuum.

Comment by jonasfj — August 19, 2012 @ 4:51 pm
Why 3 letters rather than 4?

I was thinking about how to implement substring indexing (for something unrelated), and was planning on using 4 bytes (because that fits neatly into a word).

Is it because people often (enough) query on three letters? Or is there something more subtle I hadn’t thought of, maybe that 2 letters doesn’t help (because too many documents will match) and 5 letters matches too few documents, and 3 happens to be optimal in some sense?

Comment by Bruce Stephens — August 30, 2012 @ 11:20 am
I haven’t had the time to read all papers on the subject, but from what I’ve read. It seems that there is too few 2-grams and too many 4-grams. While 3-grams provides a decent selectivity in most cases.

Variable length grams have been suggested. This enables all entries to have a reasonable selectivity, which is what we want.
But as I understand it, updating/deleting data from indexes with variable length grams is very hard (perhaps it’s still an open problem).

In my usecase fetching and merging the doclists it very cheap, compared to fetching and scanning a 400kb document. So I’m not scared of trigrams with almost no selectivity, ie. an associated doclist containing all document ids.

It might be interesting to use hash([k-gram]) % 2^(24) for some constant k. Take the hash under modulo would reduce the number of doclists, but might even out the selectivity.
Query length is also a concern in my usecase at least, ie. queries with n – 1 letters can’t be optimized by an n-gram index.

Comment by jonasfj — August 30, 2012 @ 3:03 pm
Yes, now I come to think about it, 3-letter queries are quite likely to be common enough to be worth optimizing. As suggested, seems likely that trigrams are the sweet spot.

My hunch was based on wanting to use a std::map and wanting to use the whole word of a key. But I think I’m persuaded that that’s mistaken.

Comment by Bruce Stephens — August 30, 2012 @ 4:02 pm
Way late to the party here but here’s a good breakdown of trigram indexing in another use case: https://swtch.com/~rsc/regexp/regexp4.html He talks a little bit about why he chose 3 vs. 2 or 4.

Comment by Ben Standefer — June 7, 2018 @ 5:55 pm
Ah, I see upon a further read that you reference the same Google Code Search summary in your article. Doh! Please disregard or delete my comment, haha.

Comment by Ben Standefer — June 7, 2018 @ 5:57 pm

7 Comments »

Leave a comment