Explorations: Googleminer, Part 1
In this month's column I'm going to take a break from exploring what's new in JDK 1.5 to poke around at what various people, most prominently Tim O'Reilly, have been calling the Internet Operating System. When people say The Internet Operating System, they're usually referring to the fact that, increasingly, many programs no longer "live" on a single machine or cluster in the way desktop applications and web applications do. Instead, programs crucially rely on the presence of multiple systems (some on the Internet), and gain most of their functionality from a combination of aggregating data from multiple Internet sources and local processing.
What I like about the phrase Internet operating system is that it's easy to extend the analogy. Just as the Windows operating system has Windows applications (applications that require Windows, justify the purchase of Windows, and are somehow unique to Windows), now that we know there's an Internet operating system, we can keep our eyes peeled for the nascent Internet applications and start thinking about how to design and build them.
So what do "Internet applications" look like? It's impossible to say. But I think that one of the most interesting classes of applications, which is just emerging, is a set of applications I'll call semantic aggregators. These applications basically take data from a wide variety of web servers (or services) and perform some user-specific (and probably computationally expensive) data manipulation based on a deep knowledge of the end user. Metaphorically, the web is a database, each web site is a table, and a semantic aggregator lets you perform a join.
In this column, we're going to build a very simple semantic aggregator, using Java and web services. The point of this column (and next month's, where we'll add a GUI) is that Java turns out to be a pretty good language for the Internet operating system.
At first glance, this might seem obvious. After all, Sun has been saying for years that "The Network is the Computer." Since the Internet is a network, it follows that the programming language Sun invented should be great for building Internet applications, especially when you factor in the vast number of additional libraries and tools that are readily available.
On the other hand, your first reaction might be that Java has been shown to be a great language for building server applications, which, in the Internet operating system metaphor, are roughly equivalent to device drivers. But when you want to patch together information from a bunch of disparate sources, work some inferential magic, and present the results to the end user in a rich user interface, maybe another language is a better choice. It's worth noting that the Open Source Applications Foundation went with Python and wxWindows for Chandler.
Since it's difficult to demonstrate that a programming language is good for a certain type of application, we're going to build an example and then say "There. That wasn't so bad." Along the way, I'll talk about why I think this is a good application and why I think it, as simple as it is, offers compelling value to a large number of people. This month, I'll build the command-line version of the application and explain how the core pieces work. Next month, I'll talk about GUIs and installers.
The Application: Googleminer
Our example application is extremely simple. To understand it, pretend you're a busy programmer who's started a shareware company (like, for example, my shareware company, Seruku) on the side. You want to keep your day job, keep improving your shareware application, and keep tabs on what the Internet thinks of your shareware company (if for no other reason than damage control -- if someone has a bad experience with your product, you want to find out what happened and make it better).
The easiest way to find out what the Internet thinks of your work is, of course, to use the world's default search engine, Google. But using the web browser interface quickly becomes tedious, and, more importantly, you don't have any way to track the changes. What if, for example, the top 85 results are the same, but number 86 is different? Detecting that manually on a day-to-day basis is hard.
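Once the results are data rather than a web page, that kind of comparison is mechanical. The following is only a minimal sketch of the idea, not code from the application itself: it assumes each result can be identified by its URL, and the class name ResultDiff, the method newUrls, and the example URLs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the column's code: compares two runs
// of the same search, treating the result URL as the identity of a hit.
class ResultDiff {

    // Returns the URLs in `current` that did not appear in `previous`,
    // in the order they appear in `current`, without duplicates.
    static List<String> newUrls(List<String> previous, List<String> current) {
        Set<String> seen = new HashSet<String>(previous);
        List<String> fresh = new ArrayList<String>();
        for (String url : current) {
            // Set.add returns false when the URL is already known.
            if (seen.add(url)) {
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for yesterday's and today's results.
        List<String> yesterday = Arrays.asList("http://a.example/", "http://b.example/");
        List<String> today = Arrays.asList("http://b.example/", "http://c.example/");

        List<String> fresh = newUrls(yesterday, today);
        System.out.println("There were " + fresh.size() + " new entries");
        for (String url : fresh) {
            System.out.println("Url: " + url);
        }
    }
}
```

Comparing by URL ignores changes in rank, title, or snippet. That seems a reasonable starting point if the goal is to spot pages you haven't seen before, but it is a design choice rather than the only option.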
What you really want is an application that runs daily, and that tells you what's different in a Google search for your search terms. Something that, for example, produces output like the following:
Search name:Seruku
Search terms:seruku
Search performed on: 2004-02-04
There were 21 new entries on 2004-02-04
The New Entries for 2004-02-04

Title: Seruku Toolbar for Internet Explorer - Download.com - Free ...
Url: http://download.com.com/3000-2379-10236010.html
Snippet:... Seruku Toolbar for Internet Explorer 1.0 Download Now Download Now Free download
3.19MB More download links. ... Publisher: Seruku. Date added: October 20, 2003. ...

Title: Seruku Toolbar for Internet Explorer
Url: http://www.filedevil.com/file/532
Snippet:... Home >> Internet >> Tools & Utilities >> Seruku Toolbar for Internet Explorer.
Seruku Toolbar for Internet Explorer 1.0. Listed: January 14th, 2004. ...

Title: Free Seruku Toolbar for Internet Explorer download at Free ...
Url: http://www.freedownloadscenter.com/Network_and_Internet/Web_Searching_To... Seruku_Toolbar_for_Internet_Explorer_SimpleDownload.html
Snippet:Seruku Toolbar for Internet Explorer 1.0. ... Download Active Email Monitor.
Seruku Toolbar for Internet Explorer download should start soon. ...

Title: Seruku Toolbar for Internet Explorer - T