apocryph.org Notes to my future self

30Jun/060

Upgrading achilles from VMWare GSX Server 3.x to VMWare Server RC 2

I decided to use the long weekend to break in VMWare Server RC2. As per the release notes, I uninstalled GSX Server and rebooted prior to installing VMWare Server.

Impressively, it didn’t require another reboot, and appears to be fully compatible with my old VMs.

According to the admin guide, in order to take advantage of the new features in Server, you have to upgrade the virtual machines. This changes some of the virtual hardware, so it may have some implications for the behavior of the guest OSs.

I’ll start with carlotta, the W2k3 box I don’t use that often. I did so, and was exhorted to upgrade the VMWare tools.

It seemed to work fine. Now the FreeBSD and OpenBSD boxes…

One problem is that the VMWare Console port seems to have changed back to its old default, 900-something. I had previously changed it to tcp/8080, since the fuckwits at CI Host run a fascist firewall in front of my box that only passes a few known ports. I don’t remember how I changed it last time, but I can’t find it in the VMWare Server GUI so it’s probably a registry setting.

Sure enough, page 81 of the admin guide has this to say:

To change the port number on a Windows host or client
Add the following line to config.ini in C:\Documents and Settings\All Users\Application Data\VMware\VMware Server:
authd.port = <portNumber>
where <portNumber> is the port number that all consoles connecting to virtual machines on this host must use.

So, I added authd.port = 8080 to the config.ini file and rebooted.

The upgrade went well, and all the VMs seem to be working normally. Too easy.

28Jun/060

Maddening problem with clock slip in FreeBSD under VMWare

A few weeks ago my father pointed out that the date stamps on my blog posts were behind by a week. Upon investigation, I found that bonzo’s clock was a week behind. I updated it and declared victory.

Then, he pointed it out again a few days ago. Sure enough, it had slipped by several days. When I logged into the VMWare Console to check for options to sync the clock or whatever, I noticed a repeated error from the FreeBSD kernel that I’ve been getting on bonzo forever and always ignored:

calcru: runtime went backwards from [some big number] usec to [another] usec for pid [pid]

I googled this message, and found a whole community of FreeBSD users suffering under slipping clocks when running FreeBSD under VMWare. There’s something on the freebsd-current list, and on VMWare’s own support forums.

There are a few proposed fixes, most involving the kern.timecounter.hardware sysctl. I tried changing it from its default of APIC to TSC and to i8254, but neither worked.

I then ran across a post on the VMWare forums suggesting:

In FreeBSD:

'tools.timeSync = "true"' added to .vmx file
 sysctl -w kern.timecounter.hardware=i8254
 kldload vmmemctl (from vmware-tools) and have vmware-guestd running
 add 'kern.hz="250"' to /boot/loader.conf

I don’t have APIC or ACPI disabled in my FreeBSD host either

Now, I don’t want to run the VMware tools just to keep the clock in sync, but I did put kern.hz="250" in /boot/loader.conf and kern.timecounter.hardware=i8254 in /etc/sysctl.conf, then rebooted.

It’s been several minutes now, and the clock seems to be holding. I’m afraid I don’t understand in detail why this helps, though a VMware knowledgebase article alludes to a problem of missed timer interrupts, the fix being to reduce the frequency of the timer interrupts requested by the OS. I think that’s what kern.hz="250" does. The importance of switching the time counting method from APIC to i8254 is less clear, unless it’s just a more reliable source of ticks.

At any rate, this problem has caused me to notice that VMWare server is in RC-2. As it’s the free successor to GSX Server 3, I really need to upgrade. Perhaps over the coming long weekend…

23Jun/060

PeopleAggregator is a great idea

Like everyone else, I’ve recently heard buzz about PeopleAggregator. Though on the surface it’s simply a standards-based social networking service, upon further investigation it’s much more.

First, it’s embracing the URL-based lightweight ID management technologies I’ve posted about previously, which is another step in the direction of de facto standardization. Second, it’s building tools and standards for decentralized hosting and portability of all sorts of user information, not just identity and social network. The presentation materials on the site now specifically call out file and media storage as additional use cases.

What’s important is that PeopleAggregator isn’t just offering to host these services; that’s been done before. PeopleAggregator is trying to catalyze a distributed, decentralized, open network/grid/mesh by which identity, social network, and data can be created, manipulated, and exchanged in both human- and machine-consumable forms.

I can’t wait to learn more. Perhaps I won’t have to build the Grand Unified Storage Architecture all by myself after all…

23Jun/060

Thoughts on what a RESTful information storage service would look like

In my intermittent quest to find an excuse to use Berkeley DB XML, I find myself pondering how a RESTful information storage service would work.

The idea is not new, neither to me nor the Web community more generally. A Grand Unified Storage Architecture, where details such as location and storage media are elegantly abstracted away, a centralized, ad-hoc, decentralized, asynchronous, low-cost, dynamic, flexible, simple, advanced, powerful substrate upon which all information applications can be based. Contemplating the GUSA is somewhat akin to attempting a proof of P=NP; everyone does it at least once.

Foundation

Being RESTful, REST primitives must be employed. That HTTP is the foundation is obvious, though of course any RFC describing the GUSA would take pains to assure the reader that GUSA can be implemented over any transport protocol, while remaining silent on the tight coupling of GUSA idioms to a reliable, connection-oriented, request/response protocol and leaving actual implementation over TCP, UDP, NNTP, FTP, SMTP, IPX, AppleTalk, DEC-NET, and VINES to the industrious and woefully misguided reader.

So, one begins with the URI, which provides the standard (universal, even) means by which resources will be identified. One adds the four horsemen of HTTP verbiage, GET for reading, PUT for creating, POST for updating, and DELETE for, well, deleting.

XML is another obvious ingredient, though in what form is less clear. OpenSearch is of some obvious value, as are Atom and/or RSS. XML-RPC and SOAP are not cool enough to be RESTful, and thus are not considered. OPML may be of some value as well, though its inclusion ensures countless iterations of obscure and irrational ideological warfare on any resulting mailing list.

Where I’m going to diverge somewhat from previous visions of the GUSA is the use of a URI-based identity management technology ala LID or OpenID as the means by which clients to the store are authenticated. I think lightweight, decentralized, URL-based ID management is consistent with the RESTful faith, and perhaps more important, provides the flexibility and openness required of successful Internet scale technologies.

Basic Idioms

In its most basic form, a GUSA server provides information storage services. This service, generalized, is the ability to associate a URI with arbitrary data, and manipulate both the association and the data by way of standard HTTP operations. Really, that’s it.

A file could be uploaded with a PUT to /data/some/url/or/another/file.txt with the contents of the file in the HTTP body, which would associate that URI with the file contents. The associated data could be retrieved with a GET, updated with a POST, and removed with a DELETE.
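To make the verb mapping concrete, here is a minimal in-memory sketch in Python. Everything in it is an assumption made for illustration: the GusaStore class, its handle method, and the choice of status codes are hypothetical, and a real server would sit behind actual HTTP with persistent storage.

```python
class GusaStore:
    """Hypothetical in-memory stand-in for the store: maps URIs to bytes."""

    def __init__(self):
        self._data = {}

    def handle(self, method, uri, body=b""):
        """Dispatch one request and return (status_code, response_body)."""
        if method not in ("GET", "PUT", "POST", "DELETE"):
            return 405, b""
        if method == "PUT":
            # PUT associates the URI with the request body.
            created = uri not in self._data
            self._data[uri] = body
            return (201 if created else 200), b""
        if uri not in self._data:
            # GET, POST and DELETE all need an existing association.
            return 404, b""
        if method == "GET":
            return 200, self._data[uri]
        if method == "POST":
            # POST replaces the data already associated with the URI.
            self._data[uri] = body
            return 200, b""
        # DELETE removes the association entirely.
        del self._data[uri]
        return 204, b""
```

A PUT of /data/some/url/or/another/file.txt followed by a GET of the same URI returns the uploaded bytes, and after a DELETE the same GET yields 404.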

Arbitrary object metadata would be useful, though it’s not clear how that fits into the REST architecture stack. Custom HTTP headers are one particularly limited possibility; an XML wrapper is another.

Clients would be able to create aliases as well, where a given URI redirects to another URI, on the server or off. The exact mechanism for that isn’t clear; perhaps a special content type in the PUT identifying an XML fragment with the target URI, or a query string.

Search would be via OpenSearch, with results in RSS or Atom. Where stored information would have a URI root like /data/, search functions would be at /search/ or similar.

For objects with an XML content type, XQuery and XPath searches would be supported intrinsically, by way of BDB:XML’s XQuery support. Obviously, storage of data in XML format, or at least with an XML wrapper, would be strongly encouraged.

Administrative functions like setting ACLs would be performed in another namespace, /admin/ perhaps, with URIs like /admin/permissions/some/url/or/another/file.txt representing the XML resource describing the access policies for the /some/url/or/another/file.txt resource. Other administrative functions would be included as needed.
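The namespace mapping could be as simple as a prefix swap. The sketch below assumes that stored data lives under /data and that permissions live under /admin/permissions; the permissions_uri helper and both constants are hypothetical names, not part of any spec.

```python
DATA_ROOT = "/data"                       # assumed root for stored objects
PERMISSIONS_ROOT = "/admin/permissions"   # assumed root for access policies


def permissions_uri(data_uri):
    """Return the URI of the policy resource for a stored object's URI."""
    if not data_uri.startswith(DATA_ROOT + "/"):
        raise ValueError("not a data URI: %r" % (data_uri,))
    # Swap the /data prefix for the permissions prefix, keeping the rest.
    return PERMISSIONS_ROOT + data_uri[len(DATA_ROOT):]
```

A GET on the resulting URI would then return the XML policy document, and a POST would update it, using the same verbs as any other resource.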

What for?

The obvious question that must be asked is ‘why?’. Most importantly, to play with BDB XML and probably Rails.

More generally, a GUSA is needed to provide a heterogeneous system into which all our data can be gathered. From data on various filesystems, to contact databases, email boxes, voicemail boxes, blogs, etc., there’s a ton of information locked away in various repositories. If only everything would use a GUSA box, everything would be in the same place and life would be much better. Or at least, would operate at a higher level of abstraction.

23Jun/060

Project Idea: srcML tagger using TextMate Language Grammars

One of the recurring project ideas I have is the construction of a fast, lightweight, simple srcML tagger, which can mark up source files in an extensible set of languages using a subset of the srcML tags. Atop this tagger, an intelligent diff tool could provide more meaningful diffs between versions of a code file, and provide visualizations of the history of tokens within a source file, ala svn blame, but more meaningful.

This idea has always been held back by the difficulty of describing different languages in enough detail to do meaningful srcML tagging, but without so much detail that one ends up writing a complete parser for that language. In the past I had looked to syntax highlighting tools for inspiration, but they all struck me as ad-hoc and not particularly elegant.

No more. Having read about the language grammars in TextMate, I think they may be the simple generalization I’ve been looking for.

It doesn’t seem too difficult to implement a parser for the straightforward language grammar format, and from there not too difficult to mark up a source file according to that grammar. With translations between the language grammar elements and corresponding srcML tags, a tagged srcML document could be produced from the input file relatively easily.
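A rough Python sketch of the idea, under heavy assumptions: the GRAMMAR list imitates only the simplest regex-plus-scope-name rules of a TextMate grammar (no begin/end rules, no nesting), and the SCOPE_TO_TAG table and the tag names it emits are a hypothetical translation, not the real srcML schema.

```python
import re
from xml.sax.saxutils import escape

# Simplified rules in the spirit of TextMate's match/name pairs (assumed).
GRAMMAR = [
    (r"//[^\n]*", "comment.line.double-slash"),
    (r'"(?:\\.|[^"\\])*"', "string.quoted.double"),
    (r"\b\d+\b", "constant.numeric"),
    (r"\b(?:if|else|while|return)\b", "keyword.control"),
]

# Hypothetical translation from scope names to output tag names.
SCOPE_TO_TAG = {
    "comment.line.double-slash": "comment",
    "string.quoted.double": "literal",
    "constant.numeric": "literal",
    "keyword.control": "keyword",
}

# One alternation with a named group per rule, so a match identifies its rule.
COMBINED = re.compile("|".join(
    "(?P<r%d>%s)" % (i, pattern) for i, (pattern, _) in enumerate(GRAMMAR)))


def tag_source(source):
    """Wrap every token matched by a grammar rule in its translated tag."""
    out = []
    pos = 0
    for m in COMBINED.finditer(source):
        # Text between matches is passed through untagged (but XML-escaped).
        out.append(escape(source[pos:m.start()]))
        scope = GRAMMAR[int(m.lastgroup[1:])][1]
        tag = SCOPE_TO_TAG[scope]
        out.append("<%s>%s</%s>" % (tag, escape(m.group()), tag))
        pos = m.end()
    out.append(escape(source[pos:]))
    return "<unit>" + "".join(out) + "</unit>"
```

Supporting another language would then mean swapping in a different rule list rather than writing a new parser, which is the whole appeal.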

The end result would be a lightweight, simple toolkit that could produce meaningful code history reports ala CodeHistorian.

Someday…

23Jun/060

Weekend Project: Parody generator using RSS, POS tagging, Markov text generation

Previously I’ve noted how neat I think it would be to use Markov text generation to generate random text from a corpus, but adjust some of the Markov model parameters based on another corpus, in an attempt to yield, for example, a Seussian user manual or Edgar Allan Poe in Biblical English.

I’ve found a Perl toolkit, SVMTool, which provides the POS tagger necessary to mark up English words with their part of speech. It should be easy enough to fetch an arbitrary RSS feed, break it down into paragraphs, sentences, words, build a Markov model from that, and somehow build a hybrid Markov model from the input feed and some exemplar text. From this hybrid Markov model, amusing texts could be generated that ‘feel’ like a cross between the source feed and the exemplar corpus.
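The "somehow" is the open question. One naive possibility, sketched here in Python, is a first-order word-level model per corpus, blended by a weighted sum of transition counts. This is an assumption rather than a worked-out design: the function names are made up, the blending rule is the simplest thing that could work, and POS tagging and RSS fetching are left out entirely.

```python
import random
from collections import defaultdict


def build_model(words):
    """Map each word to a count of the words seen immediately after it."""
    model = defaultdict(lambda: defaultdict(float))
    for current, following in zip(words, words[1:]):
        model[current][following] += 1.0
    return model


def blend(feed_model, exemplar_model, exemplar_weight=0.5):
    """Naive hybrid: weighted sum of both models' transition counts."""
    hybrid = defaultdict(lambda: defaultdict(float))
    for model, weight in ((feed_model, 1.0 - exemplar_weight),
                          (exemplar_model, exemplar_weight)):
        if weight <= 0:
            continue  # a zero-weight corpus contributes nothing
        for current, followers in model.items():
            for following, count in followers.items():
                hybrid[current][following] += weight * count
    return hybrid


def generate(model, start, length, rng=random):
    """Random walk over the model, stopping early at a dead end."""
    words = [start]
    while len(words) < length:
        followers = model.get(words[-1])
        if not followers:
            break
        choices = list(followers)
        weights = [followers[w] for w in choices]
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)
```

Raising exemplar_weight pushes the output toward the exemplar’s phrasing, but the walk can only cross between corpora at words the two share, so a part-of-speech-aware blend would likely mix them better.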

22Jun/060

Great abbreviation search ideas from Quicksilver and TextMate developers

My post on abbreviation search algorithms led to an email discussion with Nicholas of BlackTree, developer of Quicksilver, the slick MacOS tool with the clever abbreviation-based search function I was trying to duplicate.

Nicholas shared a few details of Quicksilver’s inner workings, most shocking being that Quicksilver maintains the list of items in its catalog in memory, and searches this entire catalog when a user enters an abbreviation. According to Nicholas, a 10k item catalog is quite performant, though 100k tends to suffer an overload of possible matches.

Nicholas then put me in touch with Allan Odgaard, developer of the acclaimed MacOS X text editor, TextMate. Allan had done some work on a string ranking algorithm for sorting non-adjacent substring matches by match strength which took a different approach than that used by QS.

Allan humbled me by pointing out what should’ve been an obvious benchmark against my non-adjacent match algorithm: a grep against a large wordlist file, using .* patterns between each letter (needless to say, my algorithm isn’t that fast). He also described some neat heuristics he came up with for scoring the relevance of a non-adjacent substring match, which I’ll definitely have to play with.
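The benchmark Allan described amounts to something like the following Python sketch. The function names are mine, and case-insensitive matching is an assumption; his scoring heuristics are not represented here at all.

```python
import re


def abbreviation_pattern(abbreviation):
    """Turn 'qs' into the regex 'q.*s', escaping any special characters."""
    return re.compile(".*".join(re.escape(ch) for ch in abbreviation),
                      re.IGNORECASE)


def matches(abbreviation, words):
    """Return the words containing the abbreviation's letters in order."""
    pattern = abbreviation_pattern(abbreviation)
    return [word for word in words if pattern.search(word)]
```

Running matches over every line of a large wordlist file gives the baseline that any cleverer algorithm has to beat.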

Thanks to input from Nicholas and Allan, I’ve been forced to re-evaluate what had been a fundamental assumption: that a large (10k to 100k+) catalog cannot be efficiently stored in RAM, and therefore must be queried using a relational database engine. Clearly this is an assumption that at the very least bears some investigation.

Possible ideas include:

  • Build and store the catalog in memory, never touching the disk
  • Keep the catalog maintained on disk, but load it into memory on startup
  • Keep the catalog on disk, but keep a specialized data structure in memory for text matching purposes
  • Keep an in-memory string with each line containing an item title and its ID, and use a high-performance regex implementation to do the initial matching
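The last idea might look something like this Python sketch. The tab-separated title-then-ID line format is an assumption, as are the function names, and titles are assumed to contain no tabs or newlines.

```python
import re


def build_catalog(items):
    """Join (item_id, title) pairs into one string, one 'title<TAB>id' per line."""
    return "\n".join("%s\t%d" % (title, item_id) for item_id, title in items)


def search(catalog, abbreviation):
    """Return IDs of items whose title contains the letters in order."""
    # [^\t\n]* keeps each gap inside the title portion of a single line.
    gap = r"[^\t\n]*"
    body = gap.join(re.escape(ch) for ch in abbreviation)
    pattern = re.compile("^" + gap + body + gap + r"\t(\d+)$",
                         re.IGNORECASE | re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(catalog)]
```

The regex pass only does the initial filtering; ranking the surviving IDs by match strength would be a second step.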

I’ll need to experiment a bit more to explore the relative merits of each of these ideas, but I expect my concept will be considerably altered one way or another as a result of this epiphany. Thanks for the help, guys.

22Jun/060

Weekend project idea: LID/OpenID/Yadis ASP.NET Membership Provider

Another weekend project idea is to investigate the implementation of an ASP.NET Membership Provider that supports LID/OpenID/Yadis identities. There’s an existing OpenID implementation for .NET, implemented in Boo, but you have to wire it up to your app yourself. A membership provider would be more drop-in, assuming the lightweight URL-based identity idioms can be shoe-horned into the fairly narrow membership provider API.

22Jun/060

Great article on draconian corporate IT

I just read a great piece on DDJ.com entitled Is Centralized IT Killing Tech Innovation?, which seems triggered by Ray Ozzie’s remarks at Tech Ed 2k6 along the same lines.

The basic idea is that stodgy corporate IT departments focused on controlling user experience and centralizing IT authority often end up stifling attempts to use the tools employees need/want, and forcing shitty internal alternatives on them instead.

This is wholly consistent with my experience of every large IT apparatus I’ve ever dealt with. As a rule, the answer is ‘no’, and failing that, it’s ‘yes, someday’. Support is outsourced, security isn’t accountable for anything, and IT treats its own activities as the mission, rather than supporting the mission by its activities.

It is this phenomenon of IT lagging behind the needs of employees and the technologies outside the firewall that I think spurs the formation and proliferation of so-called greynets. This seems a reasonable response to a tyrannical IT apparatus. I just hope that technologies evolve fast enough to stay ahead of the IT tyrants that control the firewall.

22Jun/060

Refining my understanding of new ID technologies LID/OpenID/Yadis

In a previous post on emerging ID management technologies I came to a few conclusions about these lightweight URL-based technologies that were subsequently corrected in an email conversation with Johannes Ernst.

My confusion originated from a combination of the bias I had towards these technologies based on the initial use case for URL-based identity schemes (authenticating blog comments without site-by-site registration or a centralized repository like TypeKey), and (in my opinion) the way they are presented in the form of examples focused on blog comments, ecommerce, and site registration.

Of course, these are all great applications of a lightweight ID management technology, so I can’t fault the groups involved for advocating their technologies this way, but it obscures what I think is the fundamental abstraction which the lightweight URL-based ID management technologies offer: the ability to present a URL as representative of one’s identity, then authenticate one’s authority to use that URL for that purpose. Sure, there’s a lot of stuff built on top, like user-controlled sharing of vCard data, multiple identities, even messaging, but we’ve seen all that before. It’s the URL with distributed authentication abstraction that I see as so powerful.

To demonstrate how powerful this relatively simple concept is, consider the way the problem of ID management is handled today on the Internet. From Amazon.com to Google to blogs to newspaper sites, you’re constantly required to create an account on that site to fully participate. The reasons for this vary from nefarious spamming to accountability to marketing, but a common theme is typically some sort of unique ID (an available username, or perhaps your email address), a password, and zero or more additional bits of info.

Often the email address is verified before the account is active, but every now and again you find a laissez-faire site that doesn’t care.

At any rate, this sucks and making it go away is a prereq of any successful ID management technology. To see how the URL-based ID management systems are valuable, imagine if, when visiting a site, you had only to specify your email address, and the site could by some plumbing call back to your mail server and affirm that the user agent (browser, whatever) that submitted your email address to that site is in fact authorized to do so on behalf of that email address.

Further imagine that, through some additional tech plumbing, you could control what sites could send you email to that address, could with your permission retrieve more information about you from your mail server, and you could provision email addresses for your various activities, and link them or keep them separate as you desired.

If you replace ‘email address’ with ‘URL’ in the above example, you’ve got the current crop of lightweight URL-based ID management technologies. It’s important to understand, however, that this is by no means limited to authenticating against sites with your web browser, interactively. As Johannes pointed out, the fundamental abstraction is the URL and cryptographic evidence that a given requestor has authority to represent itself as that URL.

So, in the trivial case, this plumbing allows me (or software acting on my behalf) to say “I am the person/place/thing represented by http://apocryph.org/anelson”. Note that the actual content at this URL is of secondary importance; what really matters is that the URL is unique. There are plenty of ‘anelson’s running around the Internet, but only one ‘http://apocryph.org/anelson’.

The recipient of this assertion (an e-commerce site, a blog posting API, a corporate search engine, a Jabber server, whatever) then uses this plumbing to call back to the server hosting my identity at ‘http://apocryph.org/anelson’ (though with LID and OpenID at least, the server hosting the identity can be different than the server hosting the URL, but that’s not important to this example), and asks that server if I (where ‘I’ may be defined by a cookie, a session ID, whatever) do indeed control the identity at ‘http://apocryph.org/anelson’. The server can respond ‘yes’ or ‘no’, _or_ (and this is the piece I was missing) I (where ‘I’ is me or a software agent acting on my behalf) can provide cryptographic evidence that I control the ID URL.

This second bit is key; it means that the URL-based ID management protocols don’t assume any particular topology, and it is also an example of decentralized public key cryptography employed in a much more accessible way to the masses, who can remain blissfully unaware of all the crypto details under the covers and still benefit from its use.

Recently, the principals behind LID, OpenID, sxip, and Yadis, together with various industry players, formed the Open Source Identity Selector (or, OSIS) project. The goals of the project are:

the OSIS project brings together heads of open-source projects related to digital identity, in order

  1. to enable those projects to work independently, but aligned, so overlap of work is avoided, and the parts developed by different projects can fit
  2. to deliver an open-source identity selector as a joint effort of multiple projects, which is intended to be at least as functional, and fully compatible, with Microsoft’s CardSpace (formerly known as InfoCard) identity selector that will be shipped with Windows Vista.

This effort to collaborate on ID technologies and integrate with Microsoft’s CardSpace distributed ID management system seems to point to a real growth in these technologies. OSIS is working towards rich server- and client-side implementations, which could bring URL-based ID management to a wider range of users.

It seems possible that this family of interoperable technologies could succeed where Passport et al have failed, through a decentralized model, open specs, diverse implementations, and serious interoperability. insha’allah.
