
Explorations: Googleminer, Part 1

March 5, 2004


Contents
The Application: Googleminer
The Google API
Wrapping The Google API
Using the Googleminer Application
Using the Preferences API to
Store Command-Line Arguments
Final Thoughts

In this month's column I'm going to take a break from exploring what's new in JDK 1.5 to poke around at what various people, most prominently Tim O'Reilly, have been calling the Internet Operating System. When people say The Internet Operating System, they're usually referring
to the fact that, increasingly, many programs no longer "live" on a single machine or cluster in the way desktop applications and web applications do. Instead, programs crucially rely on the presence of multiple systems (some on the Internet), and gain most of their functionality from a combination of aggregating data from multiple Internet sources and local processing.

What I like about the phrase Internet operating system is that it's easy to extend the analogy. Just as the Windows operating system has Windows applications (applications that require Windows, justify the purchase of Windows, and are somehow unique to Windows), now that we know there's an Internet operating system, we can keep our eyes peeled for the nascent Internet applications and start thinking about how to design and build them.

So what do "Internet applications" look like? It's impossible to say. But I think that one of the most interesting classes of applications, which is just emerging, is a set of applications I'll call semantic aggregators. These applications basically take data from a wide variety of web servers (or services) and perform some user-specific (and probably computationally expensive) data manipulation based on a deep knowledge of the end user. Metaphorically, the web is a database, each web site is a table, and a semantic aggregator lets you perform a join.

In this column, we're going to build a very simple semantic aggregator, using Java and web services. The point of this column (and next month's, where we'll add a GUI) is that Java turns out to be a pretty good language for the Internet operating system.

At first glance, this might seem obvious. After all, Sun has been saying for years that "The Network is the Computer." Since the Internet is a network, it follows that the programming language Sun invented should be great for building Internet applications, especially when you factor in the vast number of additional libraries and tools that are readily available.

On the other hand, you might object that Java has mostly proven itself as a language for building server applications, which, in the Internet operating system metaphor, are roughly equivalent to device drivers. But when you want to patch together information from a bunch of disparate sources, work some inferential magic, and present the results to the end user in a rich user interface, maybe another language is a better choice. It's worth noting that the Open Source Applications Foundation went with Python and wxWindows for Chandler.

Since it's difficult to demonstrate that a programming language is good for a certain type of application, we're going to build an example and then say "There. That wasn't so bad." Along the way, I'll talk about why I think this is a good application and why I think it, as simple as it is, offers compelling value to a large number of people. This month, I'll build the command-line version of the application and explain how the core pieces work. Next month, I'll talk about GUI interfaces and installers.

The Application: Googleminer

Our example application is extremely simple. To understand it, pretend you're a busy programmer who's started a shareware company (like, for example, my shareware company, Seruku) on the side. You want to keep your day job, keep improving your shareware application, and keep tabs on what the Internet thinks of your shareware company (if for no other reason than damage control -- if someone has a bad experience with your product, you want to find out what happened and make it better).

The easiest way to find out what the Internet thinks of your work is, of course, to use the world's default search engine, Google. But using the web browser interface quickly becomes tedious, and, more importantly, you don't have any way to track the changes. What if, for example, the top 85 results are the same, but number 86 is different? Detecting that manually on a day-to-day basis is hard.

What you really want is an application that runs daily, and that tells you what's different in a Google search for your search terms. Something that, for example, produces output like the following:

Search name:Seruku
Search terms:seruku
Search performed on: 2004-02-04


There were 21 new entries on 2004-02-04

The New Entries for 2004-02-04

Title: Seruku Toolbar for Internet Explorer - Download.com - Free ...
Url: http://download.com.com/3000-2379-10236010.html
Snippet:...
Seruku Toolbar for Internet Explorer 1.0 Download Now
        Download Now Free download
3.19MB More download links. ...
        Publisher: Seruku. Date added: October 20, 2003. ...

Title: Seruku Toolbar for Internet Explorer
Url: http://www.filedevil.com/file/532
Snippet:...
Home >> Internet >> Tools & Utilities >>
        Seruku Toolbar for Internet Explorer.
Seruku Toolbar
        for Internet Explorer 1.0. Listed: January 14th, 2004. ...

Title: Free Seruku Toolbar for Internet Explorer download at Free ...
Url: http://www.freedownloadscenter.com/Network_and_Internet/Web_Searching_To...     Seruku_Toolbar_for_Internet_Explorer_SimpleDownload.html
Snippet:Seruku Toolbar for Internet Explorer 1.0. ... Download Active
        Email Monitor.
Seruku Toolbar for Internet Explorer download
        should start soon. ...

Title: Seruku Toolbar for Internet Explorer - Télécharger - ZDNet
Url: http://www.zdnet.fr/telecharger/windows/fiche/0,39021748,39056613s,00.ht... Seruku Toolbar for Internet Explorer 1.0 Cliquer sur
        l'un des liens ci-dessous.
        http://www.seruku.com/Downloads/Toolbar/seruku_setup.exe. ...

.....

This month, we're building a command-line application that runs a Google search, using the Google API. It stores the results of the search to the local file system, and then produces a "daily difference" that lets you find out what's changed since the last time you ran the search. As a side note, this application has turned out to be quite useful for me. Running it daily, I've found four competitors to Seruku and a very nice review of the Seruku Toolbar.

This model, of retrieving potentially large data sets from Internet services and storing them locally, is at the heart of one style of semantic aggregators. By doing this on a daily basis, you can create large data sets of historical information in which a user might be interested. This allows you to run slow and complex calculations on the user's data sets (in essence, joins across the history of the Web) and store very large amounts of single-user state without worrying too much about performance considerations (since you're using the client CPU and hard drive). As trivial as Googleminer might seem, it's currently impossible to build an Internet-scalable version as a single web service. The amount of client state you have to store and retrieve to do historical datamining across many searches quickly becomes prohibitive. (And, really, why would you want to store that data anywhere but on the end user's machine?)

I want to emphasize that Googleminer isn't a particularly new idea. For a brief while (or so Kevin Burton tells me), Technorati had a similar service. Similarly, Feedster allows you to subscribe to searches. The difference between those services and this is: we store the data from each day locally. This enables the "daily diff" functionality. But it also enables (even though we won't build this functionality this month) historical analysis and trend-tracking. You could imagine measuring the effectiveness of an advertising campaign by looking at the number of new entries in a day, or the churn in the top 200 entries, or similar metrics.
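The "daily diff" itself is nothing more than set arithmetic over two days' worth of stored results. As a simplified sketch (using bare URL strings in place of full result records), the core operation looks like this:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class DailyDiffSketch {
  // Returns the entries present in today's results but absent from
  // yesterday's -- the "new entries" section of the daily report.
  public static List newEntries(List today, List yesterday) {
    Set seen = new HashSet(yesterday);
    List additions = new ArrayList();
    for (Iterator i = today.iterator(); i.hasNext();) {
      Object entry = i.next();
      if (!seen.contains(entry)) {
        additions.add(entry);
      }
    }
    return additions;
  }

  public static void main(String[] args) {
    List yesterday = new ArrayList();
    yesterday.add("http://download.com.com/3000-2379-10236010.html");
    yesterday.add("http://www.filedevil.com/file/532");
    List today = new ArrayList();
    today.add("http://www.filedevil.com/file/532");
    today.add("http://www.zdnet.fr/telecharger/");
    System.out.println(newEntries(today, yesterday));
    // prints [http://www.zdnet.fr/telecharger/]
  }
}
```

Run the same computation with the arguments swapped and you get the removals; the union of the two lists is the symmetric difference.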

In fact, the ideas underlying Googleminer have recently become so obvious that there is now a commercial application built on the same premise. Fortunately, since this column is mostly an illustration of an idea, and a demonstration of how to build an Internet application using Java, we don't need to build an entirely original application.

The Google API

As you probably know, Google does a fine job of indexing the public portion of the Internet (currently, it indexes 5.4 billion pages). The main Google interface is simple: it presents a text field to the user. The user types in a query, and Google returns the results. Along the way, Google acquired some pretty impressive side features. For example, if you misspell a search term, Google can often spot the mistake -- I particularly like the fact that if you search for "William Grossoo," Google will respond with "Did you mean: William Grosso."

In 2002, Google decided to expose a basic subset of their search API to anyone who wanted to use it programmatically. The best way to think about the Google API is that they've created an API that exposes everything you could discover by programmatically submitting an HTTP search request and then parsing the resulting HTML. When you think about it that way, they've done a very reasonable thing: they give you a library so that you don't have to figure out a way to scrape their HTML and interpret the results (which involves lots of potentially brittle code). In return, you agree not to perform more than 1,000 searches a day (this restriction is enforced through the use of a license key). Your code gets simpler, their servers don't get slammed, and everyone is happy.

Moreover, from the point of view of Java programmers, the Google API is a very elegant interface to the main functionality available from the Google Main Search Page. Under the covers, it uses web services standards such as SOAP and WSDL. And if you wanted to use it from a language like C or C++, you'd probably wind up having to understand all of the SOAP and WSDL. But as a Java programmer, all of the web services layer is nicely encapsulated and hidden away. (Since my feelings about SOAP roughly echo Gloria Steinem's feelings about the patriarchy, I think this is a good thing.)

In order to use the Google API, you need to perform three simple steps:

  1. Download the API SDK from the Google Web APIs site (www.google.com/apis). This will get you some documentation, a few code examples, and a .jar (Googleapi.jar) containing everything you need to write the code.

  2. Register for a license key (using the link on the main page). You'll have to agree to their terms of service to get a license key.

  3. Master the GoogleSearch, GoogleSearchResult, and GoogleSearchResultElement classes.

As you might expect, it's actually a very easy API to use. For example, every single search you make via the Google API has exactly the same structure:

  1. Create an instance of GoogleSearch. This instance encapsulates the remote call and query, and serves as a stub for the Google Server.

  2. Make some calls on your instance of GoogleSearch in order to configure your search. These are entirely local calls.

  3. Call the doSearch() method on your instance of GoogleSearch. The return value will be an instance of GoogleSearchResult. This call looks like a local call to your code, but is actually a network call to a Google server.

  4. Inspect the instance of GoogleSearchResult to get any metadata you might need about the way the search was performed (for example, you can call getSearchTips() to get a string containing hints about how to perform a better search). For the most part, this information isn't particularly useful. In terms of the HTML interface, the instance of GoogleSearchResult corresponds to the entire page that Google returns.

  5. Use the getResultElements() method to get all of the instances of GoogleSearchResultElement from the instance of GoogleSearchResult. These instances correspond to single entries on the HTML page, and are, in most cases, the "real information" that is returned by the search.

Thus, for example, the following static method is the entire use of the Google API in the Googleminer application. Even without the above explanation, it's pretty straightforward and easy to read.

public static ResultElementList performBasicSearch() {
  String key = CurrentSearchParameters.getQueryKey();
  String searchTerms = CurrentSearchParameters.getSearchTerms();
  String searchName = CurrentSearchParameters.getSearchName();
  // ... omitted-- creates local data structures.
  GoogleSearch search = new GoogleSearch(); // step 1 from above
  search.setKey(key); // step 2 from above
  search.setQueryString(searchTerms);
  search.setMaxResults(NUMBER_OF_RESULTS_PER_SEARCH);
  int currentStartingPoint = 0;
  try {
    for (currentStartingPoint = 0;
      currentStartingPoint < maxNumberOfResults;
      currentStartingPoint += NUMBER_OF_RESULTS_PER_SEARCH) {
      search.setStartResult(currentStartingPoint);
      GoogleSearchResult result = search.doSearch(); // step 3 from above
      // Steps 4 and 5 from above omitted because they're local code.
    }
  } catch (Exception e) {
    e.printStackTrace(); // at minimum, report the failure
  }
  return returnValue;
}







Wrapping The Google API

Recall that Googleminer does more than just make a call to Google -- it's going to store the results locally, and then create the list of daily differences. This means that we need to figure out how to store the information locally. The way I chose to do this is to create companion classes to GoogleSearchResult and GoogleSearchResultElement that are serializable (I used serialization for the storage mechanism) and that support the "difference" operations we need. The companion for GoogleSearchResult is ResultElementList, and the companion for GoogleSearchResultElement is ResultElement. In addition, I wrote the Comparison class -- instances of it take two instances of ResultElementList and return the results of various difference operations.
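The storage mechanism itself doesn't appear in the listings below, since it's nothing more than standard Java serialization to a dated file. Here's a sketch of the idea; the class and method names are mine, and Googleminer's actual storage code may differ in its details:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical sketch of the storage layer: one serialized file per day,
// named after the date the search was performed.
public class ResultStore {
  public static File fileForDate(File storageDirectory, Date date) {
    String name = new SimpleDateFormat("yyyy-MM-dd").format(date) + ".ser";
    return new File(storageDirectory, name);
  }

  public static void save(Serializable results, File file) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
    try {
      out.writeObject(results);
    } finally {
      out.close();
    }
  }

  public static Object load(File file)
      throws IOException, ClassNotFoundException {
    ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
    try {
      return in.readObject();
    } finally {
      in.close();
    }
  }
}
```

Producing yesterday's diff is then a matter of loading yesterday's file, loading today's, and comparing.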

Let's start our walk through the code with a look at ResultElement. ResultElement is a simple data structure that's serializable, and has nice solid implementations of both equals() and hashCode(). The only wrinkle is the use of trim in the constructor for ResultElement -- from day to day, Google will sometimes append extra white space to returned elements, which can cause "false negatives" in the comparisons unless we trim the strings. Actually, the snippets sometimes change, as well. As a result, Googleminer sometimes reports a "change" when the only thing that has changed is what Google says about the page. There's really no solution for this unless the snippets are standardized in some way.
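To see why the trimming matters, consider what happens to a comparison when Google appends a stray space one day:

```java
public class TrimDemo {
  public static void main(String[] args) {
    String monday = "Seruku Toolbar for Internet Explorer";
    String tuesday = "Seruku Toolbar for Internet Explorer  ";
    System.out.println(monday.equals(tuesday));               // false: a spurious "change"
    System.out.println(monday.trim().equals(tuesday.trim())); // true: the entry is unchanged
  }
}
```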

Here's the code for ResultElement:

public class ResultElement implements Serializable, Comparable {
  public static final long serialVersionUID = 1;
  private String _hostName;
  private String _title;
  private String _url;
  private String _snippet;
  private String _displayString;

  public ResultElement(GoogleSearchResultElement element) {
    _hostName = element.getHostName().trim();
    _title = element.getTitle().trim();
    _url = element.getURL().trim();
    _snippet = element.getSnippet().trim();
    _displayString = "Title: " + _title + "\nUrl: " + _url + "\nSnippet:" + _snippet;
  }

  public String getDisplayString() {
    return _displayString;
  }

  public int compareTo(Object object) {
    if (!(object instanceof ResultElement)) {
      return -1;
    }
    ResultElement otherElement = (ResultElement) object;
    int compare = _title.compareTo(otherElement._title);
    if (0!=compare) {
      return compare;
    }
    compare = _hostName.compareTo(otherElement._hostName);
    if (0!=compare) {
      return compare;
    }
    compare = _url.compareTo(otherElement._url);
    if (0!=compare) {
      return compare;
    }
    return _snippet.compareTo(otherElement._snippet);
  }

  public boolean equals(Object object) {
    if (!(object instanceof ResultElement)) {
      return false;
    }
    return equals((ResultElement)object);
  }

  public boolean equals(ResultElement otherElement) {
    return _displayString.equals(otherElement._displayString);
  }

  public int hashCode() {
    return _url.hashCode();
  }
}

Now that we've got ResultElement, we need a container data structure for it. We'll store instances of ResultElement in ResultElementList. In addition to being a holder class, ResultElementList needs to be serializable and support performing comparisons between two searches. In order to support comparisons, we store the instances of ResultElement in both a list and a hashed collection, and use them to perform difference operations. Here's the code for ResultElementList:

public class ResultElementList implements Serializable {
  public static final long serialVersionUID = 1;
  private String _searchName;
  private Date _searchDate;
  private HashMap _elements;
  private ArrayList _elementList;

  public ResultElementList(String searchName, Date date) {
    _elements = new HashMap();
    _elementList = new ArrayList();
    _searchName = searchName;
    _searchDate = date;
  }

  public String getSearchName() {
    return _searchName;
  }

  public Date getSearchDate() {
    return _searchDate;
  }

  public void addResultElement(ResultElement resultElement) {
    if (_elements.containsKey(resultElement)) {
      return;
    }
    _elements.put(resultElement, resultElement);
    _elementList.add(resultElement);
  }

  public boolean containsResultElement(ResultElement resultElement) {
    return _elements.containsKey(resultElement);
  }

  public int getCount() {
    return _elements.size();
  }

  public Iterator getAll() {
    return (new ArrayList(_elementList)).iterator();
  }

  public List getAdditions(ResultElementList resultElementList) {
    List returnValue = new ArrayList();
    Iterator i = _elementList.iterator();
    while (i.hasNext()) {
      ResultElement nextElement = (ResultElement) i.next();
      if (!resultElementList.containsResultElement(nextElement)) {
        returnValue.add(nextElement);
      }
    }
    return returnValue;
  }

  public List getRemovals(ResultElementList resultElementList) {
    return resultElementList.getAdditions(this);
  }

  public List getSymmetricDifference(ResultElementList resultElementList) {
    List additions = getAdditions(resultElementList);
    List removals = getRemovals(resultElementList);
    ArrayList returnValue = new ArrayList(additions);
    returnValue.addAll(removals);
    return returnValue;
  }

  public boolean equals(Object object) {
    if (!(object instanceof ResultElementList)) {
      return false;
    }
    return equals((ResultElementList)object);
  }

  public boolean equals(ResultElementList resultElementList) {
    if (!_searchName.equals(resultElementList._searchName)) {
      return false;
    }
    if (!_searchDate.equals(resultElementList._searchDate)) {
      return false;
    }
    List additions = getAdditions(resultElementList);
    if (0!= additions.size()) {
      return false;
    }
    return getCount() == resultElementList.getCount();
  }

  public int hashCode() {
    return 31 * _searchName.hashCode() + _searchDate.hashCode();
  }
}

Just like ResultElement, ResultElementList is mostly boilerplate code. And,
given ResultElement and ResultElementList, the code for performBasicSearch is now trivial. All it does is call Google and build an instance of ResultElementList. For the record, here's the complete code for performBasicSearch():

public class SearchMethods {
  private static int NUMBER_OF_RESULTS_PER_SEARCH = 10;

  public static ResultElementList performBasicSearch() {
    String key = CurrentSearchParameters.getQueryKey();
    String searchTerms = CurrentSearchParameters.getSearchTerms();
    String searchName = CurrentSearchParameters.getSearchName();
    Date date = getToday();
    ResultElementList returnValue = new ResultElementList(searchName, date);
    GoogleSearch search = new GoogleSearch();
    search.setKey(key);
    search.setQueryString(searchTerms);
    search.setMaxResults(NUMBER_OF_RESULTS_PER_SEARCH);

    int maxNumberOfResults = CurrentSearchParameters.getTermsPerSearch();
    int currentStartingPoint = 0;
    try {
      for (currentStartingPoint = 0;
        currentStartingPoint < maxNumberOfResults;
        currentStartingPoint += NUMBER_OF_RESULTS_PER_SEARCH) {
        search.setStartResult(currentStartingPoint);
        GoogleSearchResult result = search.doSearch();
        addResult(result, returnValue);
      }
    } catch (Exception e) {
      e.printStackTrace(); // at minimum, report the failure
    }
    return returnValue;
  }

  private static Date getToday() {
    return new java.sql.Date(System.currentTimeMillis());
  }

  private static void addResult(GoogleSearchResult result,
                                ResultElementList list) {
    GoogleSearchResultElement[] actualResults = result.getResultElements();
    if (null==actualResults) {
      return;
    }
    for (int counter=0; counter < actualResults.length; counter++) {
      ResultElement nextElement = new ResultElement(actualResults[counter]);
      list.addResultElement(nextElement);
    }
    return;
  }
}

The final "wrapper" class we use when dealing with Google results is a class named Comparison, which encapsulates a comparison of two instances of ResultElementList.

Using the Googleminer Application

Before we finish our examination of the code, let's stop and talk about how you use the application. The main method is contained in the com.seruku.Googleminer.CommandLineMain class. To run the program, you need to make sure Googleapi.jar is on your classpath and then type the following into a shell:

java com.seruku.Googleminer.CommandLineMain

Of course, the first time you do this, nothing will work. Instead, you'll get the following error message:

Your settings are bad. Please change them.
Query Key is
Search Name is
Search terms are
Storage Directory is
Terms per search is 500
Valid command line arguments are: -terms, -name, -storage, -Googlekey, -nresults

Each of these command-line arguments works the same way: you type the flag, a space, and then the value. The flags have the following meanings:

  • terms sets the query string to pass to Google.
  • storage sets the base storage directory. Query results are stored in subdirectories of this directory (and tagged with the day the query was made).
  • nresults sets how many results to fetch from Google.
  • name names the search. This is used to define a search-specific subdirectory of the main storage directory.
  • Googlekey sets the Google license key.
For example, the following invocation sets the search name and the number of results to fetch:

java com.seruku.Googleminer.CommandLineMain -name myfirstsearch -nresults 300

Of course, you might be wondering, "Do I have to type in all five parameters each time? How tedious." The answer is: of course not. As you will see in the next section, the application uses the Preferences API to "remember" the previous values of these parameters. That is, if you type java com.seruku.Googleminer.CommandLineMain -name myFirstSearch, it will use the name myFirstSearch for both the current search and as the default from now on.

And that's how to use the entire program. I currently have my system set up to run Googleminer every day, using the Windows task scheduler, and pipe the output to a file named SerukuDailyDiff that sits on my desktop. (More precisely, I run a natively compiled version of Googleminer.)

Every now and then, I look at the DailyDiff file and check out how things have changed. It's pretty cool.

Using the Preferences API to Store Command-Line Arguments

The values of command-line arguments are persisted through the use of
static methods defined on the CurrentSearchParameters class. Recall that performBasicSearch relied on lines such as

String key = CurrentSearchParameters.getQueryKey();

to configure the instance of GoogleSearch.

CurrentSearchParameters is simply a wrapper around the Preferences API. We use it to store the values of our parameters across invocations of the program. That is, instead of directly using the command-line arguments in the invocation of performBasicSearch, each command-line argument is processed by an instance of CommandLineArgumentHandler, which is responsible for storing it in the right place (using a set method defined on CurrentSearchParameters). This sets the value as the new default, and makes it available for performBasicSearch(). Given a set of instances of CommandLineArgumentHandler, the main() method simply finds the right handler for any given argument, and calls performAction() on the handler.
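The article doesn't print CurrentSearchParameters itself, but the pattern is easy to sketch with java.util.prefs. The preference keys and defaults below are my own guesses; only the shape of the class matters:

```java
import java.util.prefs.Preferences;

// Sketch of the CurrentSearchParameters pattern: each command-line value is
// written to the user's preference store, so it survives across runs.
// The key names and defaults here are illustrative, not Googleminer's actual ones.
public class CurrentSearchParametersSketch {
  private static final Preferences PREFS =
      Preferences.userNodeForPackage(CurrentSearchParametersSketch.class);

  public static String getQueryKey() {
    return PREFS.get("queryKey", "");
  }

  public static void setQueryKey(String key) {
    PREFS.put("queryKey", key);
  }

  public static int getTermsPerSearch() {
    return PREFS.getInt("termsPerSearch", 500);
  }

  public static void setTermsPerSearch(int n) {
    PREFS.putInt("termsPerSearch", n);
  }
}
```

A call to setQueryKey() both configures the current run and becomes the default that getQueryKey() returns the next time the program starts.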

Here, for example, is the code for the implementation of one of the handlers.

public interface CommandLineArgumentHandler {
  public String getCommand();
  public String performAction(String value);
}

public class QueryKeyCommandHandler
           implements CommandLineArgumentHandler{
  public String getCommand() {
    return "-Googlekey";
  }

  public String performAction(String value) {
    CurrentSearchParameters.setQueryKey(value);
    return null;
  }
}

Once all of the command-line arguments have been processed, we call performBasicSearch(), which retrieves the current defaults for all the arguments from CurrentSearchParameters.
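For completeness, here's a hypothetical version of that dispatch loop. The real main() may look different; the interface is repeated here so the sketch stands alone:

```java
import java.util.HashMap;
import java.util.Map;

// Repeated from the listing above so this sketch is self-contained.
interface CommandLineArgumentHandler {
  String getCommand();
  String performAction(String value);
}

// Hypothetical sketch of main()'s dispatch: arguments arrive in
// flag/value pairs, and each flag is routed to its registered handler.
public class ArgumentDispatcher {
  private final Map _handlers = new HashMap(); // command string -> handler

  public void register(CommandLineArgumentHandler handler) {
    _handlers.put(handler.getCommand(), handler);
  }

  public void dispatch(String[] args) {
    for (int i = 0; i + 1 < args.length; i += 2) {
      CommandLineArgumentHandler handler =
          (CommandLineArgumentHandler) _handlers.get(args[i]);
      if (null == handler) {
        System.err.println("Unknown argument: " + args[i]);
        continue;
      }
      handler.performAction(args[i + 1]);
    }
  }
}
```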

Final Thoughts

In approximately 750 lines of code, we implemented a simple version of Googleminer. It's a command-line application that enables you to track changes in a search over time. It's not the most impressive application ever, but it's a very simple and very straightforward illustration of two things:

  1. It's very easy to write simple Internet applications in Java. All of the code involved was boilerplate code, and it didn't require any deep thought at all. In fact, you were probably a little bored by parts of this article. But that's part of my point -- this stuff is easy to build.

  2. As more and more information sources get exposed on the Web, Internet applications are going to proliferate. (The only question is: will they proliferate according to Metcalfe's law, or Reed's law?)

The code itself, while completely boilerplate, also illustrates one interesting aspect of semantic aggregators: structurally, they're the exact opposite of web applications. In a typical web application, the server hands the browser a cookie, and stores the state itself. In Googleminer, the exact opposite happens: the "client" stores a vast amount of state, and repeatedly passes a token to the server to get more. This idea, of moving personal and customized data to the edge of the network where the users are, is, I think, an interesting one. In a future column, I'll explore putting a GUI on Googleminer using SWT.

William Grosso is the vice president of engineering for Echopass.