Jonasfj.dk/Blog
A blog by Jonas Finnemann Jensen


January 8, 2014
Custom Telemetry Dashboards
Filed under: Computer,English,Mozilla by jonasfj at 6:43 pm

In the past quarter I’ve been working on analysis of telemetry pings for the telemetry dashboard, I previously outlined the analysis architecture here. Since then I’ve fixed bugs, ported scripts to C++, fixed more bugs and given the telemetry-dashboard a better user-interface with more features. There’s probably still a few bugs around, decent logging is still missing, but data aggregated is fairly stable and I don’t think we’re going to make major API changes anytime soon.

So I think it’s time to let others consume the aggregated histograms, enabling the creation of custom dashboard. In the following sections I’ll demonstrate how to get started with telemetry.js and build a custom filtered dashboard with CSV export.

Getting Started with telemetry.js

On the server-side the aggregated histograms for a given channel, version and measure is stored in a single JSON file. To reduce storage overhead we use a few tricks, such as translating filter-paths to identifiers, appending statistical fields at the end of a histogram array and computing bucket offsets from specification. This makes the server-side JSON files rather hard to read. Furthermore, we would like the flexibility to change this format, move files to a different server or perhaps do a binary encoding of histograms. To facilitate this, data access is separated from data storage with telemetry.js. This is a Javascript library to be included from telemetry.mozilla.org/v1/telemetry.js. We promise that best efforts will be made to ensure API compatibility of the telemetry.js version hosted at telemetry.mozilla.org.

I’ve used telemetry.js to create the primary dashboard hosted at telemetry.mozilla.org, so the API grants access to the data used here. I ‘ve also written extensive documentation for telemetry.js. To get you started consuming the aggregates presented on the telemetry dashboard, I’ve posted the snippet below to a jsfiddle. The code initializes telemetry.js, then proceeds load the evolution of a measure over build dates. Once loaded the code prints a histogram for each build date of the 'CYCLE_COLLECTOR' measure within 'nightly/27'. Feel free to give it a try…

Telemetry.init(function() {
    Telemetry.loadEvolutionOverBuilds(
        'nightly/28',        // from Telemetry.versions()
        'CYCLE_COLLECTOR',   // From Telemetry.measures('nightly/28', callback)
        function(histogramEvolution) {
            histogramEvolution.each(function(date, histogram) {
                print("--------------------------------");
                print("Date: " + date);
                print("Submissions: " + histogram.submissions());
                histogram.each(function(count, start, end) {
                    print(count + " hits between " + start + " and " + end);
                });
            });
        }
    );
});

Warning: the channel/version string and measure string shouldn’t be hardcoded. The list of channel/versions available is returned by Telemetry.versions() and Telemetry.measure('nightly/27', callback) invokes callback with a JSON object of measures available for the given version/channel (See documentation). I understand that it can be tempting to hardcode these values for some special dashboard, and while we don’t plan to remove data, it would be smart to test that the data is available and show a warning if not. Channel, version and measure names may be subject to change as they are changed in the repository.

CSV Export with telemetry.jquery.js

One of the really boring and hard-to-get-right parts of the telemetry dashboard is the list of selectors used to filter histograms. Luckily, the user-interface logic for selecting channel, version, measure and applying filters is implemented as a reusable jQuery widget called telemetry.jquery.js. There’s no dependency on jQuery UI, just jquery.ui.widget.js which contains the jQuery widget factory. This library makes it very easy to write a custom dashboard if you just want write the part that presents a filtered histogram.

The snippet below shows how to create a histogramfilter widget and bind to the histogramfilterchange event, which is fired whenever the selected histogram is changed and loaded. With this setup you don’t need to worry about loading, filtering or maintaining state as the synchronizeStateWithHash option sets the filter-path as window.location.hash. If you want to have multiple instances of the histogramfilter widget, you might want to disable the synchronizeStateWithHash option, and read the state directly instead, see jQuery stateful plugins tutorial for how to get/set a option like state dynamically.

Telemetry.init(function() {

  // Create histogram-filter from jquery.telemetry.js
  $('#filters').histogramfilter({
    // Synchronize selected histogram with window.location.hash
    synchronizeStateWithHash:   true,

    // This demo fetches histograms aggregated by build date
    evolutionOver:              'Builds'
  });

  // Listen for histogram-filter changes
  $('#filters').bind('histogramfilterchange', function(event, data) {
    // Check if histogram is loaded
    if (data.histogram) {
      update(data.histogram);
    } else {
      // If data.histogram is null, then we're loading...
    }
  });

});

The options for histogramfilter is will documented in the source for telemetry.jquery.js. This file can be found in the telemetry-dashboard repository, but it should be distributed with custom dashboards, as backwards compatibility isn’t a priority for this library. There is a few extra features hidden in telemetry.jquery.js, which let’s you implement custom <select> elements, choose useful defaults, change behavior, limit available histogram kinds and a few other things.

custom-telemetry-dashboard-example
A complete custom dashboard with CSV export, minimal Javascript code and bootstrap styling, as illustrated in the screenshot above, is available at gist.github.com/jonasfj/8280124. You can fork the gist and build your custom dashboards on top of it. Gist like this can be tested using bl.ocks.org. Hence, you can try to gist above at bl.ocks.org/jonasfj/8280124. You can also build custom dashboards by forking the telemetry-dashboard repository, but the code for this dashboard is more complicated and mainly useful if you want to make a small change to the existing dashboard. In this case you can host your custom telemetry-dashboard fork using GitHub Pages, in fact mozilla.github.io/telemetry-dashboard is used as a staging area for telemetry-dashboard development.



November 8, 2013
Telemetry Rebooted: Analysis Future
Filed under: Computer,English,Mozilla by jonasfj at 6:47 pm

A few days ago Mark Reid wrote a post on the Current State of Telemetry Analysis, and as he mentioned in the Performance meeting earlier today we’re still working on better tooling for analyzing telemetry logs. Lately, I’ve been working on tooling to analyze telemetry logs using a pool of Amazon EC2 Spot instances (tl;dr spot instance are cheaper, but may be terminated by Amazon). So far the spot based analysis framework have been deployed to generate and maintain the telemetry dashboard (hosted at telemetry.mozilla.org). I still need to create a few helper scripts, write documentation and offer an accessible way to submit new jobs, but I hope that the general developer will be able to write and submit analysis jobs in a few weeks. Don’t worry I’ll probably do another blog post when this becomes readily available.

If you’re wondering how telemetry logs end up in the cloud, take a look at Mark Reids fancy diagram in the Final Countdown. Inspired by this I decided that I too needed a fancy diagram to show how the histogram aggregations for telemetry dashboard is generated. Telemetry logs are stored in an S3 bucket in LZMA compressed blocks of at most 500 MB, the files are organized in folders by reason, application, channel, version, build date, submission date, so in practice there is a lot of small files too.

To aggregate histograms across all of these blocks of telemetry logs, the dashboard master node creates a series of jobs. Each job has a list of files from S3 to process and a pointer to code to use for processing. The jobs are pushed a queue (currently SQS) from where an auto-scaling pool of spot instances fetches jobs. When a spot worker has fetched a job it downloads the associated analysis code from a special analysis bucket, it then proceeds to download the blocks of telemetry log listed in the job, and process with the analysis code. When a spot worker completes a job, it uploads the output from the analysis code to special analysis bucket, and push an output ready message to a queue.

In the next step, the dashboard master node fetchs output ready messages from the queue, downloads the output generated by the spot worker, uses it to update the JSON histogram aggregates stored in web hosting bucket used for telemetry dashboard. From here telmetry.mozilla.org fetches the histogram aggregations and presents them to the user. The fancy diagram, below outlines the flow, dashed arrows indicates messaging and not data transfer.


telemetry-dashboard-flow

As I briefly mentioned, EC2 spot instances may be terminated by Amazon at any time, that’s why they are cheaper (approximately 20% of on-demand price). To avoid data loss the queue will retain messages after they’ve been fetched and requires messages to be deleted explicitly once processed, of the message isn’t deleted before some timeout, the message will be inserted back into the queue again. So if a spot instance gets terminated, the job will be reprocessed by another worker after the timeout.

By the way, did I mention that this framework processed one months telemetry logs, about 2 TB of LZMA compressed telemetry logs (approx. 20 TB uncompressed), in about 6 hours with 50 machines costing approximately 25 USD. So time and price wise it will be feasible to run an analysis job for a years worth of telemetry logs. The only problem I ran into with the dashboard is the fact that this resulted in 10,000 jobs, and each job created an output that the dashboard master node had to consume. The solution was to temporarily throw more hardware at the problem and run the output aggregation on my laptop. The dashboard master node is an m1.small EC2 node and it can easily keep up with day to day operations, as aggregation of one day only requires 500 jobs.

Anyways, you can look forward to the framework becoming more available in the coming weeks, so analyzing a few TB of telemetry logs will be very fast. In the mean time, checkout the telemetry dashboard at telemetry.mozilla.org, it has data since October 1st. I’ll probably get around to do a post on dashboard customization and aggregate consumption later, for those who would like to play the raw histogram aggregates beneath the current dashboard.



September 22, 2013
Model Checking Weighted Kripke Structures in the Browser
Filed under: Computer,English,School by jonasfj at 7:01 pm

I recently graduated from Aalborg University with Master degree in Computer Science. Since then, Lars Kærlund Østergaard and I, along with our professors Jiri Srba and Kim Guldstrand Larsen published a paper on our work at SPIN 2013. Lars and I attended the conference at Stony Brook University, New York, where I presented the paper. I’ve long planned to write a series of blog posts about these things, but since I started at Mozilla I’ve been way too busy doing other interesting things. I do, however, despise the fact that my paper is hidden behind a pay-wall at Springer. So I feel compelled to use my co-author rights and release the paper here along with a few other documents, that contains proofs and results in greater detail.

Anyways, as the headline suggests this post is mainly about model checking of Weighted Computation Tree Logic (WCTL) on Weighted Kripke Structures (WKS) in a web browser. To test out the techniques we developed for local model checking, we implemented, WKTool, a browser based model checking tool, complete with graphical user-interface, syntax highlighting, inline error messages, graphical state-space exploration, and of course on-the-fly model checking using Web Workers.

If Kripke structures and branching time logics are foreign concepts to you, this post is probably of limited interest. However, if you’re familiar with the Calculus of Communicating Systems (CCS) as used in various versions of the Concurrency Workbench, this post will give an informal introduction to the syntax of the Weighted Calculus of Communicating Systems (WCCS) and WCTL as using in WKTool.

Weighted Calculus of Communicating Systems

The following table outlines the syntax for WCCS expressions. The concepts are similar to those from Concurrency Workbench, parallel composition allows input and output actions to synchronize, and restriction forces synchronization. However, it maybe observed that actions prefixes also carry a weight. This is the weight of the transition with the action is executed, on synchronization the maximum weight is used (though this is not a technical restriction).

Expression Syntax
Process Definition M := P;
Action Prefix <c,8>.P  or <c!,8>.P or <c>.P
Atomic Label a:P
Parallel Composition P | Q
Choice P + Q
Restriction P \ {c} or P \ {c1, c2, c3}
Renaming P [a1 => a2, a3 => a4] or P [c1 -> c2, c3 -> c4]
Empty Process 0

As WCCS defines Kripke structures and not Labelled Transistion Systems (LTS) we shall also specify atomic propositions. These are also prefixed with colon, in parallel and choice composition the atomic propositions from both sub-processes are considered. In fact, the action prefix is the only construction from which atomic propositions of the sub-process isn’t considered.

Weighted Computation Tree Logic

The full syntax of WCTL is given in the help section of WKTool, conceptually there are few surprises here. WCTL features boolean operators, atomic proposition and boolean constants as expected. However, for existential and universal until modalities, WCTL adds an optional upper-bound. For example the property E a U[<=8] b holds if there exists a path such that the atomic property a holds until the atomic property b holds, and the accumulated weight at this point does not exceed 8. Similar upper-bound is available for universal until and derived modalities.

For the existential and universal weak-until modalities WCTL offers a lower-bound constraint. For example the property E a W[>=8] b holds if there exists a path such that the atomic property a always holds, or there is a path such that the atomic property a holds until the atomic property b holds and the accumulated weight at this point isn’t less than 8. Observe that the bound has no effect if there is a path where the atomic property a always holds, thus, a zero-cycle might satisfy this property.

In the examples above the nested properties are atomic, however, WKTool does in fact support nested fixed-points when the min-max encoding us used.

Weighted On-The-Fly Model Checking with WKTool

WKTool is hosted at wktool.jonasfj.dk, you can save your models to local storage (in your browser) or export them in a JSON format that can be shared or loaded later. I’m sure most of the features are easy to locate, but do notice that under “Load” > “Load Example” the 4 last examples are the scalable models used for benchmarks in our paper. Try them out, the default scaling parameters are sane and they come with properties that can be verified, some of them even have comments to help you understand what they do.

wktool-screenshot

WKTool available GNU GPLv3 and the sources are hosted at github.com/jonasfj/WKTool, it’s all in CoffeeScript using PEG.js for parsers, CodeMirror with custom lexers for highlighting, Buckets.js for data structures, Arbor.js for visualization and twitter-bootstrap for awesome styling.

 

 



March 21, 2013
CodeMirror Collaboration with Google Drive Realtime Api
Filed under: Computer,English by jonasfj at 10:37 pm

Yesterday morning, while lying in bed consider whether or not to face the snow outside, I saw a long anticipated entry in gReaderPro (a Google Reader app). I think the release of Google Drive Realtime Api is extremely exciting. The way I see it realtime collaboration is one of the few features that makes browser-based productivity applications preferable to conventional desktop applications.

At University I’ve been using Gobby with other students for years, with a local server the latency is zero, and whether we’re writing a paper in LaTeX or prototyping an algorithm, we’re always doing it in Gobby. It’s simply the best way to do pair programming, or to write a text together, even if you’re not always working on the same section. Futhermore, there’s never a merge conflict :)

Needless to say that I started reading the documentation over breakfast. And as luck would have it, Lars, who I’m writing my master with, decided that he’d rather work from home than go through the snow, saving me from having to rush out the door.

Anyways, I found sometime last night to play around with CodeMirror and the new Google Drive Realtime Api. I’ve previously had a look at Firebase, which does something similar, but Firebase doesn’t support operational transformations on strings. In Google Drive Realtime Api this is supported through the CollaborativeString object, which has events and methods for inserting text and removing ranges.

So I extended the Quickstart example to use CodeMirror for editing, after a bit of fiddling around it turned out to be quite easy to adapt the beforeChange event, such that I can all changes on the collaborativeString using insertText and removeRange methods. The following CoffeeScript snippet show how to synchronize an editor and a collaborativeString.

synchronize = (editor, coString) ->
  # Assign initial value
  editor.setValue coString.getText()

  # Mutex to avoid recursion
  ignore_change = false

  # Handle local changes
  editor.on 'beforeChange', (editor, changeObj) ->
    return if ignore_change
    from  = editor.indexFromPos(changeObj.from)
    to    = editor.indexFromPos(changeObj.to)
    text  = changeObj.text.join('\n')
    if to - from > 0
      coString.removeRange(from, to)
    if text.length > 0
      coString.insertString(from, text)

  # Handle remote text insertion
  coString.addEventListener gapi.drive.realtime.EventType.TEXT_INSERTED, (e) ->
    from  = editor.posFromIndex(e.index)
    ignore_change = true
    editor.replaceRange(e.text, from, from)
    ignore_change = false

  # Handle remote range removal
  coString.addEventListener gapi.drive.realtime.EventType.TEXT_DELETED, (e) ->
    from  = editor.posFromIndex(e.index)
    to    = editor.posFromIndex(e.index + e.text.length)
    ignore_change = true
    editor.replaceRange("", from, to)
    ignore_change = false

I’ve pushed the code to github.com/jonasfj/google-drive-realtime-examples, you can also try the demo here. The demo is a markdown editor, it’ll save the file as a shortcut in Google Drive, but it’s won’t save the file as text document, just store it as a realtime document attached to the shortcut.

So a few things I learned from this exercise, use or at least study the code samples, much of it isn’t documented and the documentation can be a bit fragmented. In particular realtime-client-utils.js was really helpful to get this off the ground.



December 29, 2012
Introducing BJSON.coffee for Binary JSON Serialization
Filed under: Computer,English by jonasfj at 2:15 pm

Lately, I’ve been working an web application which will need to save binary blobs inside JSON objects. Looking around the web it seems that base64 encoding is the method of choice in these cases. However, this adds a 30% overhead and decoding large base64 strings to Javascript typed arrays (ArrayBuffer) is an expensive tasks.

So I’ve been looking at different binary data formats: BSON, Protocol Buffers, Smile Format, UBJSON, BJSON and others. Eventually, I decided to give BJSON a try for the following reasons.

  • BJSON is easy to make a lightweight implementation
  • It can encapsulate any JSON object
  • BJSON documents can be represented as JSON objects with ArrayBuffers for binary blobs.

My primary motivation is the fact that BJSON can serialize ArrayBuffers, as an added bonus a BJSON encoding of JSON object is typically smaller than the traditional string encoding with JSON.stringify(). Now, I’m sure there is valid arguments to use another binary encoding of JSON objects, so I’m going to stop with the arguments and talk code instead…

Well, time to introduce BJSON.coffee, a CoffeeScript implementation of BJSON for modern browsers. Aparts from null, booleans, numbers, arrays and dictionaries also available JSON, the BJSON specification also defines the inclusion of binary data. The specification notes that “this is not fully transcodable“, but as you might have guessed BJSON.coffee uses ArrayBuffers to represent binary data.

Essentially, BJSON.serialize takes a JSON object that is allowed to contain ArrayBuffers and serializes to a single ArrayBuffer. While, BJSON.parse takes an ArrayBuffer and returns a JSON object which may contain ArrayBuffers.For those interested in using BJSON instead of a normal string encoding of JSON objects, there is both good and bad news. The bad news is that UTF-8 string encoding in modern browsers is so slow, that BJSON is slower than a conventional string encoding of JSON objects. Although, this might not be the case when/if the string encoding specification is implemented.

The good news is that the BJSON encoding is 5-10% smaller than the conventional string encoding of JSON objects. The table/terminal output below from my testing script, shows some common JSON objects harvested from common web APIs.

Test:                   size (JSON):   size (BJSON):  Compression:   Success:
bitly-burstphrases.json    17714          16372            16%       true    
bitly-clickrate.json         104             88            24%       true    
bitly-hotphrases.json      65665          60592            16%       true    
bitly-linkinfo.json          497            468            10%       true    
complex.json                 424            384            25%       true    
flickr-hottags.json         3610           3205            11%       true    
twitter-search.json        12968          11994            10%       true    
yahoo-weather.json          1241           1172             5%       true    
youtube-comments.json      26296          24938             5%       true    
youtube-featured.json     110873         104422             5%       true    
youtube-search.json        93202          88473             5%       true    

BJSON.coffee is available at github.com/jonasfj/BJSON.coffee. It should work in all modern browsers with support for typed arrays, Firefox 15+, Chrome 22+, IE 10+, Opera 12.1+, Safari 5.1+. However, I have pushed a github page which runs unit-tests in the browser and shows compatibility results from other browsers using Browserscope. So please visit it here, click “run tests” and help figure out where BJSON.coffee works.

Update: Being bored today I decided to a quick jsperf benchmark of JSON.stringify and BJSON.serialize to see how much slower BJSON.serialize is. You can find the test here, which seems to suggest that BJSON.serialize might be unreasonably slow at the moment. However, it seems that slow UTF-8 encoding is responsible for much of this, and I believe it is possible to improve the current UTF-8 encoding speed.



December 11, 2012
Last.fm Radio API Update – Marks the end of TheLastRipper
Filed under: Computer,English,TheLastRipper by jonasfj at 7:07 pm

Back in high school I started TheLastRipper, an audio stream recorder for last.fm. The project started as product for a school project on copyright and issues with piracy. It spawn from the unavailability of non-DRM infested music services. Which at the time drove many teenagers to piracy. Whilst, you the morality of recording internet radio can be argued. All the research we did at the time, showed it to be perfectly legal. Nevertheless, I’m quite happy that I didn’t have to defend this assertion.

Anyways, I’m glad to see that the music industry didn’t sleep for ever. These days we have digital music stores and I’m quite happy to pay for DRM-free music, and as bonus I get to support the artists as well. So it can hardly comes as surprise that I haven’t used TheLastRipper for years. Nor have I contributed to the project or attended bug reports for years. In fact it has been years since the project saw any active development.

I do occasionally get an email or see a bug report from an die-hard user of TheLastRipper, and it’s in the wake of such an email I’ve decide write this post. Partly because more people will ask why it stopped working, and partly because I had a lot of fun with project and it deserves a final post here at the end. Personally, I’ve long thought the project dead, I know the Linux client have been, so I was surprised to that anybody actually noticed it when the Radio API was updated.

rip-thelastripper
As you may have guessed from the title of this post TheLastRipper has died from being unmaintained while last.fm have updated their APIs. Honestly, I’m quite surprised last.fm have continued to support their old unofficial API for as long as they have. So this is by no means the result of last.fm taking action against TheLastRipper. Truth be told, I’m not even sad that it’s finally dead. The past many years, state of TheLastRipper have been quite embarrassing. The code base is ugly, buggy and completely unmaintainable.

Over the years, TheLastRipper have been downloaded more than 475.000 times, distributed with magazines (Computer Bild) and featured in countless blog posts from around the world. Which, considering that this started as a high school project is pretty good. It’s certainly been a great adventure and I’ve worked with a lot of people from around the world. So here at the of my last post on TheLastRipper, I’d like to say thanks to all the bug reporters, comment posters, testers, developers and people who hopefully also had fun participating in this project.

Oh, and to the few die-hard users out there, I’ll recommend that you buy your digital music from one of the many DRM-free music stores you can find :)



August 27, 2012
Regular Expressions on dxr.allizom.org
Filed under: Computer,English,Mozilla by jonasfj at 9:01 pm

A couple of weeks I introduced TriLite, an Sqlite extension for fast string matching. TriLite is still very much under active development and not ready for general purpose use. But over the past few weeks I’ve integrated TriLite into DXR, the source code indexing tool I’ve been working on during my internship at Mozilla. So it’s now possible to search mozilla-central using regular expressions, for an example regexp:/(?i)bug\s+#?[0-9]+/ to find references to bugs.

For those interested, I did my end-of-internship presentation of DXR last Thursday, it’s currently available on air.mozilla.org (I’m third on the list, also posted below). I didn’t reharse it very much so appoligies if it’s doesn’t make any sense. What does make sense however, is that fact that dxr.allizom.org supports substring and regular expression searches, so fast that we can facilitate incremental search.


(My end-of-internship presentation, slides available here)

Anyways, as you might have guessed from the fact that I gave an end-of-internship presentation, my internship is coming to an end. I’ll be flying home to Denmark next Saturday to finish my masters. But I’ll probably continue to actively develop TriLite and make sure this project reaches a level where it can be reused by others. I’ve already seen other suggestions for where substring search could be useful.



August 16, 2012
TriLite, Fast String Matching in Sqlite
Filed under: Computer,English,Mozilla by jonasfj at 7:27 pm

The past two weeks I’ve been working on regular expression matching for DXR. For those who doesn’t know it, DXR is the source code cross referencing tool I’m working on during my internship at Mozilla. The idea is to make DXR faster than grep and support full featured regular expressions, such that it can eventually replace MXR. The current search feature in DXR uses the FTS (Full Text Search) extension for Sqlite, with a specialized tokenizer. However, even with this specialized tokenizer DXR can’t do efficient substring matching, nor is there any way to accelerate regular expression matching. Which essentially means DXR can’t support these features, because full table scans are too slow on a server that serves many users. So to facilitate fast string matching and allow restriction on other conditions (ie. table joins), I’ve decided to write an Sqlite extension.

Introducing TriLite, an inverted trigram index for fast string matching in Sqlite. Given a text, a trigram is a substring of 3 characters, the inverted index maps from each trigram to a list of document ids containing the trigram. When evaluating a query for documents with a given substring, trigrams are extracted from the desired substring, and for each such trigram a list of document ids is fetched.  Document ids present in all lists are then fetched and tested for the substring, this reduces the number of documents that needs to fetched and tested for the substring. The approach is pretty much How Google Code Search Worked. In fact, TriLite uses re2 a regular expression engine written by the guy who wrote Google Code search.

TriLite is very much a work in progress, currently, it supports insertion and queries using substring and regular expression matching, updates and deletes haven’t been implemented yet. Anyways, compared to the inverted index structure used in the FTS extension, TriLite has a fairly naive implementation, that doesn’t try to provide a decent amortized complexity for insertion. This means that insertion can be rather slow, but maybe I’ll get around to try and do something about that later.

Nevertheless, with the database in memory I’ve been greping over the 60.000 files in mozilla-central in about 30ms. With an index overhead of 80MiB for the 390MiB text in mozilla-central, the somewhat naive inverted index implementation employed in TriLite seems pretty good. For DXR we have a static database so insertion time is not really an issue, as the indexing is done on an offline build server.

As the github page says, TriLite is no where near ready for use by anybody other than me. I’m currently working to deploy a test version of DXR with TriLite for substring and regular expression matching. Something I’m very much hoping to achieve before the end of my internship at Mozilla. So stay tuned, a blog post will follow…



August 10, 2012
Deploying DXR on dxr.mozilla.org
Filed under: Computer,English,Mozilla by jonasfj at 6:31 pm

This year I’m spending my summer vacation interning at the Mozilla office in Toronto. The past month I’ve been working on DXR, a source code indexing and cross referencing application. So far I’ve worked to deploy DXR on Mozilla infrastructure and is happy to report that dxr.mozilla.org is no longer redirecting to dxr.lanedo.com. The releng team have a build bot indexing mozilla-central, so that we always have a fresh index on dxr.mozilla.org. So in blind faith that this will never crash horribly, I’ll tell to:

Update your bookmarks to: dxr.mozilla.org

Whilst I’ve been handling most of the deployment issues, the releng team have been doing the heavy lifting when comes to automatic build, thanks! This means, that I’ve been working on cleaning up, refactoring, redesigning and rewriting parts of DXR. This sounds like a lot of behind the scenes work that nobody will ever see, however, this means that DXR now has a decent template system. And well, who would I be if I didn’t write a template for it.

You can find the tip of DXR that I’m working on at dxr.allizom.org, please check out it and let me know if there’s something you don’t like and would like to appear. I realize that the site currently have a few crashes, and don’t worry I won’t rush this into production before I’ve ratted these out. Now, as evident from this blog, I’m by no means a talented designer, so if you feel that grey should be red or whatever, please leave a color code in the comments and I’ll try it out.

I’m currently looking into replacing the full text search and my somewhat buggy text tokenizer with a trigram index that’ll facilitate proper substring matching and if we lucky regular expression matching. But I guess we will have to wait and see how that turns out. In the mean time leave a comment if you have questions, requests, suggestions and/or outcries.



June 28, 2012
New Domain, New wordpress after “pharma hack”
Filed under: Computer,English by jonasfj at 4:30 pm

I can’t say that I have been very active on this blog the past couple of years, nevertheless when I decided to hack a little on theme this morning I found an infestation of hidden links to all sorts of crap. Apparently this sort of thing is not unheard of on wordpress blog.

Anyways, I’ve been planning to slowly out phase the “jopsen” nickname, and having recently bought “jonasfj.dk” I decided to do an export of my wordpress blog. Setup a new one on a new domain. Dreamhost one click install made this rather easy, and I now have automatic wordpress updates.

My old domain “jopsen.dk” will now redirect to “jonasfj.dk”, and I’ve deleted the infected wordpress install. I still have a wiki running on the old domain, but it’s no longer active, so if possible I plan to render it static, at which point I’ll move it from the domain.

This does not mean that I’ll get all rid of the “jopsen” nickname, I’ve got many accounts under that name. But in the future I’ll try to move some of the frequently used accounts to “jonasfj”. I think it’s a little more sensible and gives sort of a hit about who I might be :)
Update: Changed my github username from jopsen to jonasfj. Well I think that’s all the renaming I can handle for now.

With respect to the “pharma hack” I think it’s sad that we live in a world where some people have so little respect for my time. After all the ad-space on my blog has no value compared to the costs of cleaning up after this. Worst of all is that these links where hidden, I only discovered them by accident, and I have no idea how long time they’ve there.

Anyways, I’ve disabled ftp, used a different domain, different shell user, clean wordpress install with a Google Authenticator plugin. At least that’s a start.



Older Posts »