apocryph.org Notes to my future self

22Jun/060

Progress on high-performance abbreviation search tool

Background

Off and on for the last year or two, I’ve been intrigued by the simple UI idioms exploited by Quicksilver for MacOX X to enable fast, intuitive, and flexible access to the information and tools on one’s own machine and across the network.

I keep meaning to implement something similar for the PC (no, I don’t think AppRocket or Launchy even begin to approach Quicksilver’s coolness), but of course I keep getting distracted. At any rate, one of the most intriguing problems that I would need to solve in order to implement QS on the PC is the abbreviation matching algorithm.

In Quicksilver, you can select items from its catalog (the files and apps it found on your machine) by typing their names in a window that appears when you press a hotkey. What’s cool is you can pick arbitrary abbreviations, like qxp for “Quark XPress” or wrd for “Microsoft Word”, and Quicksilver will find all the items matching that abbreviation. It also incorporates some heuristics and a primitive training algorithm to improve the abbreviation matches over time; for example, if most of the type when you type a w you end up typing wrd for “Microsoft Word”, then after a while you need only type w and QS will assume you want Word unless you pick something else or type more letters.

Details of the abbreviation algorithm aren’t particularly clear, though there are some hints scattered about. What I know:

  • Abbreviation letters must be in the order they appear in the title. So, qxp could match “Quark XPress”, but qpx would not.
  • Some sort of preference or priority is given to letters that match the beginning of words. So qxp would be a stronger match to Quark XPress than it would be to Request Fax Implementation.
  • A learning algorithm tracks which abbreviations match which titles and how often, and minimizes the letters required to match a title that is used more frequently

UPDATE: Nicholas from BlackTree (the company that makes Quicksilver) pointed me to the implementation of the QS string ranking algorithm. I must admit it’s pretty inscrutable to me, in no small part due to my inability to read Objective C. He tried to explain it to me, and the explanation sounds like a different way to get at the same approach I describe below.

I don’t know:

  • How well this algorithm scales; how many items can be in the catalog before it bogs down?
  • How well the algorithm works with foreign languages, particularly non-Latin alphabets and idiographic languages like Chinese
  • If Quicksilver is backed by an off-the-shelf database engine or something hand-rolled

Given these scraps of knowledge, I’ve set about implementing a high-performance abbreviation algorithm that can scale to tens or hundreds of thousands of items, and works with any Unicode writing system. I’ve written a C# and a Ruby implementation so far, and had good luck with both.

Selecting a data store

First, I had to choose a back-end data store. I knew I wanted to go with an existing database library, but I didn’t know what. I wanted Free Software with cross-platform support, which knocked out SQL Server Express and similar offerings from Oracle et al.

I tried Berkeley DB (abandoned due to shitty C# bindings, restrictive license, and lack of a SQL query facility) first. I experiemented with db4o, an object database that supports .NET and Java, but I abandoned it when it became clear it wouldn’t support the size and complexity of the data I would be using, and I didn’t like the restrictive GPL license terms.

The first serious contender I evaluated was Firebird, an open-source relational database derived from the old Borland Interbase. Firebird is covered by a pretty liberal license, well supported by an ADO.NET provider, and supports embedded and client-server modes and most of the standard RDBMS features. In fact, I decided to go w/ this database for my prototype, and completed it successfully, only to find that performance was not acceptable. Firebird simply couldn’t handle the sustained writes and heavy reads I was throwing at it with the response times I required.

Ultimately I settled on what is arguably one of the best open-source software projects the world has yet seen: SQLite. SQLite is like a cross between Berkeley DB and Firebird, but better than both. It has the full-featured SQL capabilities of Firebird, and the easy embeddability of BDB, with the most laissez-faire license terms imaginable: public domain.

On top of all that, it’s actively maintained, cross-platform, extremely lightweight, very fast, and has full-featured bindings for ADO.NET 2.0 and Ruby, as well as every other major static and dynamic language in use today.

In my first test of my prototype with SQLite, I was disappointed to find it was only 2x faster than Firebird. Then I realized I was using a full debug build; running an optimized release build resulted in nearly 50x performance improvement over the fastest I could get Firebird to run.

Of course, this isn’t an indictment of Firebird; SQLite happens to be the right tool for my job, and Firebird isn’t. SQLite also requires that you think a bit more about your data model and query structure, as it doesn’t have the intelligent query optimization features I was used to w/ SQL Server and Oracle. But, that said, once I tuned it a bit, I was able to get sub-500ms response times on my abbreviation algorithm against a 10k item database!

The Algorithm

I spent alot of time trying to find an algorithm for what amounts to non-adjacent substring matching, that could be implemented in such a way as to take advantage of high-performance relational database technology. The solution I came up with, though surely not original, was pretty novel to me.

The problem is thus: given an input string s, find all matching strings t from some collection T such that s can be formed from t by deleting characters in t. Or, a less rigorous statement, by way of example, would be some algorithm that can take a string like foo and match “FOO“, “Fear me or not”, “Firefox On Oxy”, etc, based on the fact that all letters in foo are present in every matching string, in the same order as they appear in foo, though not necessarily all together.

It occurred to me while sitting on a plane in Dulles about to take off for London for a month-long pilot with DoS. I had thought about pre-computing all possible non-adjacent substrings for each title, but that rapidly exploded beyond O(n!) complexity. I dabbled with limiting the algorithm to only leading letters in each word, but that breaks cases like qp for “Quark XPress”…

I also thought about specifying a minimum query length like 3 or 4 letters, but that’s not tenable either; users should start to get options after the first letter typed, and the results must come back quickly, even if there are a ton of them.

I had groped about desperately for research or published work on non-adjacent substring search algorithms, but got nothing.

Finally it occurred to me that I could represent the title space as a tree, with each node corresponding to a letter. So, a search for abcd would start at the root node for a, proceed to the sub-node b, and so on through the tree until d. Each node in the tree would maintain a list of all items whose titles contained a non-adjacent substring composed of that node and its ancestors. So, non-adjacent substring matching becomes a simple graph traversal followed by a query of all items associated with a node; both pretty lightweight relational operations.

I had pretty good luck with an early version of this algorithm in C# against SQLite. Most recently, I picked it up again with Ruby. Largely as a result of the idioms Ruby encourages, a further optimization occurred to me that reduced the complexity of the search algorithm by an order of magnitude, resulting in the following numbers:

  • Search a 10k item collection of real English-language files from my filesystem for test: 0.45 seconds, ~550 matching items
  • Search same collection for t: 1.48 seconds, ~5400 matching items (over 50% of total collection)
  • Build collection of ~10k items from my filesystem in ~2 minutes, at a rate of approximately 83 items/second

Now, these search numbers suck pretty bad if you compare them to the timings one would get with a text search tool like Lucene against a text file of 10k file names, for example, however the important difference is the non-adjacent substring search. In the general case such an algorithm would not be of particular use, and it produces more matches and requires more work to eliminate possible matches than does an adjacent substring search. Nonetheless, it has the potential to enable a search-by-abbreviation idiom which can be both powerful and simple, as evidenced by Quicksilver.

Next steps

There’s still along way to go from a proof of concept matching algorithm to a Quicksilver replacement. Some of the immediate steps:

  • Devise a basic scoring mechanism that can be implemented at the database level and won’t conflict with additional app-layer scoring and post-processing logic. Possibilities: preference to letters near the start of words, preference to matches which span a larger (or should it be smaller?) subset of the title, bonus for frequency of use, etc
  • Implement said basic scoring mechanism at the database level so results can be returned in some sort of match-strength order. This is important when considering the second search case above (searching for t). Search results must come back from the initial key press, even though this will obviously match a huge range of (mostly irrelevant) items. To get this initial query down to sub-second, it needs to be limited to the top n matches, where n is some number or threshold that will return all plausable matches but spare the data layer the IO associated with retrieving info for the weak matches.
  • Implement a basic data-layer filter mechanism, so queries can be filtered by item class or source. For example, contacts and Winamp tracks might be relevant or not depending upon context or user input. If one assumes a collection of 100k+ items, it would be useful to easily include or exclude collections of items at the database level where performance is at its peak.

Of course, once the algorithm is worked out, about 5% of the overall effort will be completed, however it’s an important milestone for me as it gives me something cool to play with.

17Feb/063

A SQLite Schema Provider for CodeSmith

In my spare time here in Baghdad I’ve been playing around more with SQLite, my favorite lightweight open-source database engine. I wanted to generate some data access objects using CodeSmith, but unfortunately CodeSmith can’t use the generic GetSchema() functionality in ADO.NET 2.0.

Instead, CodeSmith has its own chinsy provider model to expose schema information from a database system for the purposes of code generation based on that schema. It ships with SQL Server and ADOX support, and leaves the remaining database systems as exercises to an intrepid reader.

Thankfully, the schema provider model is open and documented, and the vendor encourages development and sharing of providers. There is an existing SQLite 3 provider but it has a few drawbacks:

  • Uses the Finisar .NET wrapper on SQLite, which is not particularly well supported, and was written for .NET FW 1.1
  • Doesn’t expose foreign key information, so templates which rely on entity relationship information (that is, almost all the templates I’m interested in) won’t work
  • Isn’t written by me :)

So, over the last couple of afternoons, I threw together my own SQLite 3.x schema provider using the outstanding .NET 2.0-only SQLite wrapper, SQLite.NET. SQLite.NET does most of the heavy lifting with its GetSchema() support; all I had to do was translate it into CodeSmith’s schema object model, and voila!.

Note that I built this against the 15 Feb 06 release of SQLite.NET, 1.0.26.2. This version has a primary key reporting bug in it for which I have submitted a patch to the author. He’s incorporated the fix and tested it, but is holding back for a few days on updating the official binary release. So, if you download this within a few days of its posting date, you may want to use the CVS binaries for SQLite.NET.

The SVN repository for this project is:

 http://svn.apocryph.org/svn/projects/Sqlite3SchemaProvider/trunk/

The code is released into the public domain; do with it what you will, but I hereby disclaim any and all liability for its misuse.

You can download a buddy build source or binary tarball based on my patched SQLite.NET 1.0.26.2, though I urge you to fetch the source from SVN and build it yourself. I built it with VS2k5 Pro and CodeSmith Pro 3.2.

To install, copy the Sqlite3SchemaProvider.dll and System.Data.SQLite.dll files from binary tarball or the output folder if you build yourself, and place them in the SchemaProviders folder in the CodeSmith install directory. Restart CodeSmith, and the SQLite3 Schema Provider should appear in the list of available providers.

Any problems email me: anelson – at – apocryph – org

Delicious Bookmarks

Recent Posts

Meta

Current Location