Dan asked me the other day if I’d set up a DXR instance for Thunderbird. Around the same time I met with Chris and Mohak to discuss his work to package DXR for Fedora; he’s getting pretty close. Both of these things pushed me to do some more work on it and try a few experiments.
I started out by creating a new environment for generating my dehydra indexes, documenting as I went. I then built new DXR indexes for mozilla-central (Firefox) and comm-central (Thunderbird). I was interested to see if anything in comm-central would break my code, but it was fine.
Next I decided to create some scripts to generate new indexes. Right now I just have an old index up for people to use as a demo. Ideally we want something that is updated all the time and tracks the changes going into the tree. As I was doing this, I decided to also fix the number one complaint I hear from people: it’s too slow. The easiest way to fix this, without spending more time than I have, is to pre-generate all the marked-up source files. Right now, every time you ask for a source file, it gets parsed and marked up. Every time. For large files (I’m looking at you, sqlite3.c), this is a deal breaker. For normal files it is still annoying.
So I wrote a script to create a copy of the source tree that has already been marked up. This took ~4 hours and generated 2.4 GB of HTML. This seemed large, so I rewrote the script to pipe the marked-up source files through gzip as I went. Apache and the browser handle the compressed files just fine, and the result weighs in at ~400 MB.
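The idea can be sketched roughly as follows. This is not the actual script, just a minimal illustration in Python: `mark_up` is a hypothetical stand-in for the real markup generator (here it only escapes the text and wraps it in a `<pre>`), and the function name, file suffixes, and `.html.gz` naming are all assumptions of mine.

```python
import gzip
import html
from pathlib import Path


def mark_up(text: str) -> str:
    """Hypothetical stand-in for the real markup generator: escape and wrap in <pre>."""
    return "<html><body><pre>" + html.escape(text) + "</pre></body></html>"


def pregenerate(src_root: Path, out_root: Path, suffixes=(".c", ".cpp", ".h")) -> int:
    """Write a gzipped, marked-up copy of every matching file under src_root.

    Returns the number of files written.
    """
    count = 0
    for src in sorted(src_root.rglob("*")):
        if not src.is_file() or src.suffix not in suffixes:
            continue
        # Mirror the source layout under out_root, e.g. db/sqlite3.c -> db/sqlite3.c.html.gz
        dest = out_root / src.relative_to(src_root)
        dest = dest.with_name(dest.name + ".html.gz")
        dest.parent.mkdir(parents=True, exist_ok=True)
        text = src.read_text(encoding="utf-8", errors="replace")
        # Compress while writing, so the uncompressed HTML never touches the disk.
        with gzip.open(dest, "wt", encoding="utf-8") as out:
            out.write(mark_up(text))
        count += 1
    return count
```

Calling something like `pregenerate(Path("mozilla-central"), Path("marked-up"))` would then produce the compressed tree in one pass. The files still have to be served with a `Content-Encoding: gzip` header so the browser knows to decompress them.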
Next I wanted to see if I could replace glimpse (yes, I’ve been here before) with something licensed better for inclusion in Fedora. I decided to finally give Swish-e a try. I found it quite easy to use, and especially liked that I could define FileFilters, such that my gzipped HTML files could be indexed directly.
After building the index, I did some tests and found that the markup produced by my markup generator was breaking the Swish-e HTML parser. A few minutes later I had found and fixed the error. So I tried searching. No matter what I did I only ever got back one hit per page, which made me realize that Swish-e wasn’t going to go as far as glimpse and give me the lines themselves: results are documents, not lines in documents. It’s not a huge thing: once I have the list of matching files I can grep on a secondary pass. But it meant one more thing I’d have to write. Every time I have to write more it makes me feel like I should just rewrite the whole thing. But there’s no time for that today, so this will have to do.
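That secondary pass might look something like the sketch below. I haven’t written the real thing yet, so treat this as a guess at the shape of it: the function name is made up, it takes the list of matching files from the search step as plain paths to the gzipped HTML, and it assumes the markup keeps one source line per line of HTML so that the line numbers mean something.

```python
import gzip
import html
import re
from pathlib import Path

# Crude tag stripper; good enough for machine-generated markup, not for arbitrary HTML.
TAG = re.compile(r"<[^>]+>")


def matching_lines(paths, term):
    """Yield (path, line_number, text) for each line containing term.

    paths is the list of gzipped HTML files returned by the document-level search.
    """
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for number, raw in enumerate(f, start=1):
                # Drop the tags and undo entity escaping to get back to the source text.
                line = html.unescape(TAG.sub("", raw)).rstrip("\n")
                if term in line:
                    yield (str(path), number, line)
```

Since the index has already narrowed things down to a handful of files, scanning them line by line like this should be cheap compared to grepping the whole tree.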
Next week I’m on vacation, but when I get back I’ll set up the mozilla-central and comm-central instances using what I did today. I’ll blog the links when they go live.