Explorations: Googleminer, Part 1

by William Grosso
03/05/2004


Contents
The Application: Googleminer
The Google API
Wrapping The Google API
Using the Googleminer Application
Using the Preferences API to Store Command-Line Arguments
Final Thoughts

In this month's column I'm going to take a break from exploring what's new in JDK 1.5 to poke around at what various people, most prominently Tim O'Reilly, have been calling the Internet Operating System. When people say The Internet Operating System, they're usually referring to the fact that, increasingly, many programs no longer "live" on a single machine or cluster in the way desktop applications and web applications do. Instead, programs crucially rely on the presence of multiple systems (some on the Internet), and gain most of their functionality from a combination of aggregating data from multiple Internet sources and local processing.

What I like about the phrase Internet operating system is that it's easy to extend the analogy. Just as the Windows operating system has Windows applications (applications that require Windows, justify the purchase of Windows, and are somehow unique to Windows), now that we know there's an Internet operating system, we can keep our eyes peeled for the nascent Internet applications and start thinking about how to design and build them.

So what do "Internet applications" look like? It's impossible to say. But I think that one of the most interesting classes of applications, which is just emerging, is a set of applications I'll call semantic aggregators. These applications basically take data from a wide variety of web servers (or services) and perform some user-specific (and probably computationally expensive) data manipulation based on a deep knowledge of the end user. Metaphorically, the web is a database, each web site is a table, and a semantic aggregator lets you perform a join.

In this column, we're going to build a very simple semantic aggregator, using Java and web services. The point of this column (and next month's, where we'll add a GUI) is that Java turns out to be a pretty good language for the Internet operating system.

At first glance, this might seem obvious. After all, Sun has been saying for years that "The Network is the Computer." Since the Internet is a network, it follows that the programming language that Sun invented should be great for building Internet applications, especially when you factor in the vast number of additional libraries and tools that are readily available.

On the other hand, your first reaction might be that Java has been shown to be a great language for building server applications, which, in the Internet operating system metaphor, are roughly equivalent to device drivers. But when you want to patch together information from a bunch of disparate sources, work some inferential magic, and present the results to the end user in a rich user interface, maybe another language is a better choice. It's worth noting that the Open Source Applications Foundation went with Python and wxWindows for Chandler.

Since it's difficult to demonstrate that a programming language is good for a certain type of application, we're going to build an example and then say "There. That wasn't so bad." Along the way, I'll talk about why I think this is a good application and why I think it, as simple as it is, offers compelling value to a large number of people. This month, I'll build the command-line version of the application and explain how the core pieces work. Next month, I'll talk about GUI interfaces and installers.

The Application: Googleminer

Our example application is extremely simple. To understand it, pretend you're a busy programmer who's started a shareware company (like, for example, my shareware company, Seruku) on the side. You want to keep your day job, keep improving your shareware application, and keep tabs on what the Internet thinks of your shareware company (if for no other reason than damage control -- if someone has a bad experience with your product, you want to find out what happened and make it better).

The easiest way to find out what the Internet thinks of your work is, of course, to use the world's default search engine, Google. But using the web browser interface quickly becomes tedious, and, more importantly, you don't have any way to track the changes. What if, for example, the top 85 results are the same, but number 86 is different? Detecting that manually on a day-to-day basis is hard.

What you really want is an application that runs daily, and that tells you what's different in a Google search for your search terms. Something that, for example, produces output like the following:

Search name:Seruku
Search terms:seruku
Search performed on: 2004-02-04


There were 21 new entries on 2004-02-04

The New Entries for 2004-02-04

Title: Seruku Toolbar for Internet Explorer - Download.com - Free ...
Url: http://download.com.com/3000-2379-10236010.html
Snippet:... Seruku Toolbar for Internet Explorer 1.0 Download Now Download Now Free download
3.19MB More download links. ... Publisher: Seruku. Date added: October 20, 2003. ...

Title: Seruku Toolbar for Internet Explorer
Url: http://www.filedevil.com/file/532
Snippet:... Home >> Internet >> Tools & Utilities >> Seruku Toolbar for Internet Explorer.
Seruku Toolbar for Internet Explorer 1.0. Listed: January 14th, 2004. ...

Title: Free Seruku Toolbar for Internet Explorer download at Free ...
Url: http://www.freedownloadscenter.com/Network_and_Internet/Web_Searching_Tools/Seruku_Toolbar_for_Internet_Explorer_SimpleDownload.html
Snippet:Seruku Toolbar for Internet Explorer 1.0. ... Download Active Email Monitor.
Seruku Toolbar for Internet Explorer download should start soon. ...

Title: Seruku Toolbar for Internet Explorer - Télécharger - ZDNet
Url: http://www.zdnet.fr/telecharger/windows/fiche/0,39021748,39056613s,00.htm
Snippet:... Seruku Toolbar for Internet Explorer 1.0 Cliquer sur l'un des liens ci-dessous. http://www.seruku.com/Downloads/Toolbar/seruku_setup.exe. ...

.....

This month, we're building a command-line application that runs a Google search, using the Google API. It stores the results of the search to the local file system, and then produces a "daily difference" that lets you find out what's changed since the last time you ran a Google search. As a side note, this application has turned out to be quite useful for me. Running this daily, I've found four competitors to Seruku and a very nice review of the Seruku Toolbar.
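To make the "daily difference" idea concrete, here is a minimal sketch of one way it could be computed, assuming each stored result is identified by its URL. The `DailyDiff` class and its method are hypothetical names for illustration, not Googleminer's actual source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not taken from Googleminer itself.
class DailyDiff {
    /**
     * Returns the URLs that appear in today's results but not in yesterday's,
     * preserving today's ranking order.
     */
    static List<String> newEntries(List<String> yesterdayUrls, List<String> todayUrls) {
        Set<String> seen = new HashSet<>(yesterdayUrls);
        List<String> added = new ArrayList<>();
        for (String url : todayUrls) {
            // Set.add returns false for URLs already seen, so each new URL is reported once.
            if (seen.add(url)) {
                added.add(url);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        List<String> yesterday = Arrays.asList("http://a.example/", "http://b.example/");
        List<String> today = Arrays.asList("http://b.example/", "http://c.example/", "http://a.example/");
        System.out.println(newEntries(yesterday, today)); // prints [http://c.example/]
    }
}
```

Keying on the URL means a result that merely changes rank is not reported as new. Whether that is the right choice depends on what you want to track.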

This model, of retrieving potentially large data sets from Internet services and storing them locally, is at the heart of one style of semantic aggregators. By doing this on a daily basis, you can create large data sets of historical information in which a user might be interested. This allows you to run slow and complex calculations on the user's data sets (in essence, joins across the history of the Web) and store very large amounts of single-user state without worrying too much about performance considerations (since you're using the client CPU and hard drive). As trivial as Googleminer might seem, it's currently impossible to build an Internet-scalable version as a single web service. The amount of client state you have to store and retrieve to do historical data mining across many searches quickly becomes prohibitive. (And, really, why would you want to store that data anywhere but on the end user's machine?)
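As a rough illustration of storing each day's results locally, the sketch below writes one file per search per day. The `SnapshotStore` class, the directory layout, and the one-URL-per-line format are all assumptions made for this example. The column does not describe Googleminer's real on-disk format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.Collections;
import java.util.List;

// Hypothetical layout: <baseDir>/<searchName>/<yyyy-MM-dd>.txt, one URL per line.
class SnapshotStore {
    private final Path baseDir;

    SnapshotStore(Path baseDir) {
        this.baseDir = baseDir;
    }

    private Path fileFor(String searchName, LocalDate day) {
        // LocalDate.toString() yields ISO-8601 dates such as 2004-02-04.
        return baseDir.resolve(searchName).resolve(day + ".txt");
    }

    /** Writes the day's URLs, replacing any earlier snapshot for the same day. */
    void save(String searchName, LocalDate day, List<String> urls) {
        Path file = fileFor(searchName, day);
        try {
            Files.createDirectories(file.getParent());
            Files.write(file, urls, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the day's URLs, or returns an empty list if no snapshot exists. */
    List<String> load(String searchName, LocalDate day) {
        Path file = fileFor(searchName, day);
        if (!Files.exists(file)) {
            return Collections.emptyList();
        }
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each day lands in its own file, older snapshots stay available for later analysis. The sketch assumes the search name is safe to use as a directory name.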

I want to emphasize that Googleminer isn't a particularly new idea. For a brief while (or so Kevin Burton tells me), Technorati had a similar service. Similarly, Feedster allows you to subscribe to searches. The difference between those services and Googleminer is that we store the data from each day locally. This enables the "daily diff" functionality, and it also enables (even though we won't build this functionality this month) historical analysis and trend-tracking. You could imagine measuring the effectiveness of an advertising campaign by looking at the number of new entries in a day, or the churn in the top 200 entries, or similar metrics.
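One possible reading of "churn in the top 200 entries" is the number of URLs that were in the previous day's top N but are missing from the current day's top N. The sketch below implements that definition. The `TrendMetrics` class and the definition itself are assumptions for illustration, since the column doesn't pin the metric down.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical metric helper; the definition of "churn" is an assumption.
class TrendMetrics {
    /** Counts URLs present in the previous top-N list but missing from the current one. */
    static int churn(List<String> previousUrls, List<String> currentUrls, int topN) {
        Set<String> current = new HashSet<>(head(currentUrls, topN));
        int dropped = 0;
        for (String url : head(previousUrls, topN)) {
            if (!current.contains(url)) {
                dropped++;
            }
        }
        return dropped;
    }

    // Returns the first n elements, or the whole list if it is shorter than n.
    private static List<String> head(List<String> urls, int n) {
        return urls.subList(0, Math.min(n, urls.size()));
    }
}
```

Under this definition, a pure reordering of the same URLs counts as zero churn.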

In fact, the ideas underlying Googleminer have recently become so obvious that there is now a commercial application built on the same premise. Fortunately, since this column is mostly an illustration of an idea, and a demonstration of how to build an Internet application using Java, we don't need to build an entirely original application.

The Google API

As you probably know, Google does a fine job of indexing the public portion of the Internet (currently, it indexes 5.4 billion pages). The main Google interface is simple: it presents a text field to the user. The user types in a query, and Google returns the results. Along the way, Google acquired some pretty impressive side features. For example, if you misspell a search term, Google can often spot the mistake -- I particularly like the fact that if you search for "William Grossoo," Google will respond with "Did you mean: William Grosso."

In 2002, Google decided to expose a basic subset of their search API to anyone who wanted to use it programmatically. The best way to think about the Google API is that they've created an API that exposes everything you could discover by programmatically submitting an HTTP search request and then parsing the resulting HTML. When you think about it that way, they've done a very reasonable thing: they give you a library so that you don't have to figure out a way to scrape their HTML and interpret the results (which involves lots of potentially brittle code). In return, you agree not to perform more than 1,000 searches a day (this restriction is enforced through the use of a license key). Your code gets simpler, their servers don't get slammed, and everyone is happy.
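The 1,000-searches-a-day limit matters once you page through a deep result set, because every page costs one call. The following back-of-the-envelope sketch works out the budget. The `QuotaBudget` class is hypothetical, and the figure of 10 results per call used in the note below is an assumption, not something stated in this column.

```java
// Hypothetical helper for estimating how far the daily query limit stretches.
class QuotaBudget {
    static final int DAILY_QUERY_LIMIT = 1000; // the license-key limit described above

    /** Number of API calls needed to page through resultsWanted results. */
    static int callsNeeded(int resultsWanted, int resultsPerCall) {
        if (resultsWanted <= 0 || resultsPerCall <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        return (resultsWanted + resultsPerCall - 1) / resultsPerCall; // integer ceiling
    }

    /** Number of distinct searches of that depth that fit into one day's quota. */
    static int searchesPerDay(int resultsWanted, int resultsPerCall) {
        return DAILY_QUERY_LIMIT / callsNeeded(resultsWanted, resultsPerCall);
    }
}
```

For example, tracking the top 200 results at an assumed 10 results per call takes 20 calls, which leaves room for 50 such searches per day.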

Moreover, from the point of view of Java programmers, the Google API is a very elegant interface to the main functionality available from the Google Main Search Page. Under the covers, it uses web services standards such as SOAP and WSDL. And if you wanted to use it from a language like C or C++, you'd probably wind up having to understand all of the SOAP and WSDL. But as a Java programmer, all of the web services layer is nicely encapsulated and hidden away. (Since my feelings about SOAP roughly echo Gloria Steinem's feelings about the patriarchy, I think this is a good thing.)

In order to use the Google API, you need to perform three simple steps:

  1. Download the API SDK from here. This will get you some documentation, a few code examples, and a .jar (googleapi.jar) containing everything you need to write the code.

  2. Register for a license key (using the link on the main page). You'll have to agree to their terms of service to get a license key.

  3. Master the GoogleSearch, GoogleSearchResult, and GoogleSearchResultElement classes.

As you might expect, it's actually a very easy API to use. For example, every single search you make via the Google API has exactly the same structure:

  1. Create an instance of GoogleSearch. This instance encapsulates the remote call and query, and serves as a stub for the Google Server.

  2. Make some calls on your instance of GoogleSearch in order to configure your search. These are entirely local calls.

  3. Call the doSearch() method on your instance of GoogleSearch. The return value will be an instance of GoogleSearchResult. This call looks like a local call to your code, but is actually a network call to a Google server.

  4. Inspect the instance of GoogleSearchResult to get any metadata you might need about the way the search was performed (for example, you can call getSearchTips() to get a string containing hints about how to perform a better search). For the most part, this information isn't particularly useful. In terms of the HTML interface, the instance of GoogleSearchResult corresponds to the entire page that Google returns.

  5. Use the getResultElements() method to get all of the instances of GoogleSearchResultElement from the instance of GoogleSearchResult. These instances correspond to single entries on the HTML page, and are, in most cases, the "real information" that is returned by the search.

Thus, for example, the following static method is the entire use of the Google API in the Googleminer application. Even without the above explanation, it's pretty straightforward and easy to read.

// Requires: import com.google.soap.search.GoogleSearch;
//           import com.google.soap.search.GoogleSearchResult;
public static ResultElementList performBasicSearch() {
  String key = CurrentSearchParameters.getQueryKey();
  String searchTerms = CurrentSearchParameters.getSearchTerms();
  String searchName = CurrentSearchParameters.getSearchName();
  // ... omitted -- creates local data structures (returnValue is assumed to be among them).
  GoogleSearch search = new GoogleSearch();          // step 1 from above
  search.setKey(key);                                // step 2 from above
  search.setQueryString(searchTerms);
  search.setMaxResults(NUMBER_OF_RESULTS_PER_SEARCH);
  try {
    // Fetch one page of results per iteration until maxNumberOfResults is reached.
    for (int currentStartingPoint = 0;
        currentStartingPoint < maxNumberOfResults;
        currentStartingPoint += NUMBER_OF_RESULTS_PER_SEARCH) {
      search.setStartResult(currentStartingPoint);
      GoogleSearchResult result = search.doSearch(); // step 3 from above
      // Steps 4 and 5 from above omitted because they're local code.
    }
  } catch (Exception e) {
    // Report the failure instead of swallowing it; return whatever was collected so far.
    e.printStackTrace();
  }
  return returnValue;
}
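The paging loop in performBasicSearch() can be tried out without a license key or googleapi.jar by swapping the remote call for a stand-in. The sketch below does that. `PageFetcher` and `PagedSearch` are hypothetical names, and the early exit on a short page is an added assumption rather than something the original method is shown to do.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the remote call. In Googleminer this role is played by
// GoogleSearch.setStartResult(...) followed by doSearch().
interface PageFetcher {
    List<String> fetch(int start, int pageSize) throws Exception;
}

// Hypothetical helper that mirrors the structure of the loop above.
class PagedSearch {
    /** Collects up to maxResults items by requesting pageSize items at a time. */
    static List<String> fetchAll(PageFetcher fetcher, int maxResults, int pageSize) {
        List<String> all = new ArrayList<>();
        try {
            for (int start = 0; start < maxResults; start += pageSize) {
                List<String> page = fetcher.fetch(start, pageSize);
                all.addAll(page);
                if (page.size() < pageSize) {
                    break; // a short page means there is nothing further to fetch
                }
            }
        } catch (Exception e) {
            // Report the failure and keep whatever was collected before it.
            System.err.println("Search stopped early: " + e);
        }
        // The last page may overshoot when maxResults is not a multiple of pageSize.
        return all.size() > maxResults ? new ArrayList<>(all.subList(0, maxResults)) : all;
    }
}
```

Hiding the network call behind a small interface also makes the loop easy to test against canned data, which helps when every real call counts against the daily quota.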
