Wrapping The Google API
Recall that Googleminer does more than just make a call to Google -- it's going to store the results locally, and then create the list of daily differences. This means that we need to figure out how to store the information locally. The way I chose to do this is to create companion classes to GoogleSearchResult and GoogleSearchResultElement that are serializable (I used serialization for the storage mechanism) and that support the "difference" operations we need. The companion for GoogleSearchResult is ResultElementList, and the companion for GoogleSearchResultElement is ResultElement. In addition, I wrote the Comparison class -- instances of it take two instances of ResultElementList and return the results of various difference operations.
Let's start our walk through the code with a look at ResultElement. ResultElement is a simple data structure that's serializable, and has nice solid implementations of both equals() and hashCode(). The only wrinkle is the use of trim in the constructor for ResultElement -- from day to day, Google will sometimes append extra white space to returned elements, which can cause "false negatives" in the comparisons unless we trim the strings. Actually, the snippets sometimes change, as well. As a result, Googleminer sometimes reports a "change" when the only thing that has changed is what Google says about the page. There's really no solution for this unless the snippets are standardized in some way.
Here's the code for ResultElement:
public class ResultElement implements Serializable, Comparable {
public static final long serialVersionUID = 1;
private String _hostName;
private String _title;
private String _url;
private String _snippet;
private String _displayString;
public ResultElement(GoogleSearchResultElement GoogleSearchResultElement) {
_hostName = GoogleSearchResultElement.getHostName().trim();
_title = GoogleSearchResultElement.getTitle().trim();
_url = GoogleSearchResultElement.getURL().trim();
_snippet = GoogleSearchResultElement.getSnippet().trim();
_displayString = "Title: " + _title + "\nUrl: " + _url + "\nSnippet:" + _snippet;
}
public String getDisplayString() {
return _displayString;
}
public int compareTo(Object object) {
if (!(object instanceof ResultElement)) {
return -1;
}
ResultElement otherElement = (ResultElement) object;
int compare = _title.compareTo(otherElement._title);
if (0!=compare) {
return compare;
}
compare = _hostName.compareTo(otherElement._hostName);
if (0!=compare) {
return compare;
}
compare = _url.compareTo(otherElement._url);
if (0!=compare) {
return compare;
}
return _snippet.compareTo(otherElement._snippet);
}
public boolean equals(Object object) {
if (!(object instanceof ResultElement)) {
return false;
}
return equals((ResultElement)object);
}
public boolean equals(ResultElement otherElement) {
return _displayString.equals(otherElement._displayString);
}
public int hashCode() {
return _url.hashCode();
}
}
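To see why the trimming matters, here is a minimal sketch that can be run without the Google API classes. TrimmedElement and TrimDemo are hypothetical stand-ins (they are not part of Googleminer) that copy only the trim, equals(), and hashCode() logic shown above, and the URL is a placeholder.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for ResultElement that takes plain strings,
// so it can be compiled without the Google API on the classpath.
final class TrimmedElement {
    private final String url;
    private final String display;

    TrimmedElement(String title, String url, String snippet) {
        // Trim every field, as the ResultElement constructor does.
        this.url = url.trim();
        this.display = "Title: " + title.trim() + "\nUrl: " + this.url
                + "\nSnippet: " + snippet.trim();
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof TrimmedElement
                && display.equals(((TrimmedElement) other).display);
    }

    @Override
    public int hashCode() {
        // Equal display strings imply equal URLs, so this stays
        // consistent with equals().
        return url.hashCode();
    }
}

class TrimDemo {
    // Returns true when a hashed collection treats b as a duplicate of a.
    static boolean sameInSet(TrimmedElement a, TrimmedElement b) {
        Set<TrimmedElement> seen = new HashSet<>();
        seen.add(a);
        return seen.contains(b);
    }

    public static void main(String[] args) {
        TrimmedElement monday =
                new TrimmedElement("Example", "http://example.com/", "Home page");
        TrimmedElement tuesday =
                new TrimmedElement("Example ", " http://example.com/", "Home page  ");
        TrimmedElement changed =
                new TrimmedElement("Example", "http://example.com/", "New snippet");
        System.out.println(sameInSet(monday, tuesday)); // whitespace only: true
        System.out.println(sameInSet(monday, changed)); // snippet changed: false
    }
}
```

The second comparison is the snippet problem described earlier: the page is the same, but the element still counts as different.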
Now that we've got ResultElement, we need a container data structure for it. We'll store instances of ResultElement in ResultElementList. In addition to being a holder class, ResultElementList needs to be serializable and support performing comparisons between two searches. In order to support comparisons, we store the instances of ResultElement in both a list and a hashed collection, and use them to perform difference operations. Here's the code for ResultElementList:
public class ResultElementList implements Serializable {
public static final long serialVersionUID = 1;
private String _searchName;
private Date _searchDate;
private HashMap _elements;
private ArrayList _elementList;
public ResultElementList(String searchName, Date date) {
_elements = new HashMap();
_elementList = new ArrayList();
_searchName = searchName;
_searchDate = date;
}
public String getSearchName() {
return _searchName;
}
public Date getSearchDate() {
return _searchDate;
}
public void addResultElement(ResultElement resultElement) {
if (_elements.containsKey(resultElement)) {
return;
}
_elements.put(resultElement, resultElement);
_elementList.add(resultElement);
}
public boolean containsResultElement(ResultElement resultElement) {
return _elements.containsKey(resultElement);
}
public int getCount() {
return _elements.size();
}
public Iterator getAll() {
return (new ArrayList(_elementList)).iterator();
}
public List getAdditions(ResultElementList resultElementList) {
List returnValue = new ArrayList();
Iterator i = _elementList.iterator();
while (i.hasNext()) {
ResultElement nextElement = (ResultElement) i.next();
if (!resultElementList.containsResultElement(nextElement)) {
returnValue.add(nextElement);
}
}
return returnValue;
}
public List getRemovals(ResultElementList resultElementList) {
return resultElementList.getAdditions(this);
}
public List getSymmetricDifference(ResultElementList resultElementList) {
List additions = getAdditions(resultElementList);
List removals = getRemovals(resultElementList);
ArrayList returnValue = new ArrayList(additions);
returnValue.addAll(removals);
return returnValue;
}
public boolean equals(Object object) {
if (!(object instanceof ResultElementList)) {
return false;
}
return equals((ResultElementList)object);
}
public boolean equals(ResultElementList resultElementList) {
if (!_searchName.equals(resultElementList._searchName)) {
return false;
}
if (!_searchDate.equals(resultElementList._searchDate)) {
return false;
}
List additions = getAdditions(resultElementList);
if (0!= additions.size()) {
return false;
}
return getCount() == resultElementList.getCount();
}
public int hashCode() {
return 31 * _searchName.hashCode() + _searchDate.hashCode();
}
}
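Since serialization is the storage mechanism, it may help to see a round trip in isolation. The following is only a sketch: StoredSearch is a hypothetical, simplified stand-in for ResultElementList, and it writes to a byte array rather than to a file in the storage directory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical, simplified stand-in for ResultElementList.
class StoredSearch implements Serializable {
    private static final long serialVersionUID = 1L;
    final String searchName;
    final Date searchDate;
    final List<String> urls = new ArrayList<>();

    StoredSearch(String searchName, Date searchDate) {
        this.searchName = searchName;
        this.searchDate = searchDate;
    }
}

class SerializationSketch {
    // Serializes the search into a byte array.
    static byte[] save(StoredSearch search) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(search);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds a search from bytes produced by save().
    static StoredSearch load(byte[] data) {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (StoredSearch) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StoredSearch today = new StoredSearch("myfirstsearch", new Date());
        today.urls.add("http://example.com/");
        StoredSearch copy = load(save(today));
        System.out.println(copy.searchName + ": " + copy.urls.size() + " result(s)");
    }
}
```

Swapping the byte array streams for FileOutputStream and FileInputStream would give file-based storage with the same two calls.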
Just like ResultElement, ResultElementList is mostly boilerplate code. And, given ResultElement and ResultElementList, the code for performBasicSearch is now trivial. All it does is call Google and build an instance of ResultElementList. For the record, here's the complete code for performBasicSearch():
public class SearchMethods {
private static final int NUMBER_OF_RESULTS_PER_SEARCH = 10;
public static ResultElementList performBasicSearch() {
String key = CurrentSearchParameters.getQueryKey();
String searchTerms = CurrentSearchParameters.getSearchTerms();
String searchName = CurrentSearchParameters.getSearchName();
Date date = getToday();
ResultElementList returnValue =
new ResultElementList(searchName, date);
GoogleSearch search = new GoogleSearch();
search.setKey(key);
search.setQueryString(searchTerms);
search.setMaxResults(NUMBER_OF_RESULTS_PER_SEARCH);
int maxNumberOfResults = CurrentSearchParameters.getTermsPerSearch();
int currentStartingPoint = 0;
try {
for (currentStartingPoint = 0;
currentStartingPoint < maxNumberOfResults;
currentStartingPoint += NUMBER_OF_RESULTS_PER_SEARCH) {
search.setStartResult(currentStartingPoint);
GoogleSearchResult result = search.doSearch();
addResult(result, returnValue);
}
} catch (Exception e) {
// A failed request ends the loop; the results collected so far are still returned.
}
return returnValue;
}
private static Date getToday() {
return new java.sql.Date(System.currentTimeMillis());
}
private static void addResult(GoogleSearchResult result,
ResultElementList list) {
GoogleSearchResultElement[] actualResults = result.getResultElements();
if (null==actualResults) {
return;
}
for (int counter=0; counter < actualResults.length; counter++) {
ResultElement nextElement = new ResultElement(actualResults[counter]);
list.addResultElement(nextElement);
}
return;
}
}
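The paging loop in performBasicSearch() is easy to misread, so here is the offset arithmetic on its own. This is a sketch only; PagingSketch and startOffsets are hypothetical names, and no call to Google is made.

```java
import java.util.ArrayList;
import java.util.List;

class PagingSketch {
    // Lists the start offsets the loop would pass to setStartResult()
    // for a given result limit and page size.
    static List<Integer> startOffsets(int maxResults, int pageSize) {
        List<Integer> offsets = new ArrayList<>();
        for (int start = 0; start < maxResults; start += pageSize) {
            offsets.add(start);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(startOffsets(25, 10)); // prints [0, 10, 20]
    }
}
```

Because every request asks for a full page, a limit of 25 with pages of 10 can return up to 30 results; the limit is effectively rounded up to the next multiple of the page size.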
The final "wrapper" class we use when dealing with Google results is a class named Comparison, which encapsulates a comparison of two instances of ResultElementList.
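The code for Comparison isn't reproduced here, but a class of this kind could look roughly like the following sketch. ComparisonSketch and its method names are hypothetical, and it works on plain lists rather than on ResultElementList, so treat it as an illustration of the idea rather than as the class Googleminer uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical shape for a Comparison-style class: it takes an older and a
// newer result list and precomputes what was added and what was removed.
class ComparisonSketch<T> {
    private final List<T> added = new ArrayList<>();
    private final List<T> removed = new ArrayList<>();

    ComparisonSketch(List<T> older, List<T> newer) {
        // Hashed copies give constant-time membership tests, while the
        // original lists preserve the order of the results.
        Set<T> olderSet = new HashSet<>(older);
        Set<T> newerSet = new HashSet<>(newer);
        for (T item : newer) {
            if (!olderSet.contains(item)) {
                added.add(item);
            }
        }
        for (T item : older) {
            if (!newerSet.contains(item)) {
                removed.add(item);
            }
        }
    }

    List<T> getAdded() {
        return added;
    }

    List<T> getRemoved() {
        return removed;
    }

    boolean hasChanges() {
        return !added.isEmpty() || !removed.isEmpty();
    }

    public static void main(String[] args) {
        ComparisonSketch<String> diff = new ComparisonSketch<>(
                Arrays.asList("a", "b", "c"), Arrays.asList("b", "c", "d"));
        System.out.println("Added: " + diff.getAdded());     // Added: [d]
        System.out.println("Removed: " + diff.getRemoved()); // Removed: [a]
    }
}
```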
Using the Googleminer Application
Before we finish our examination of the code, let's stop and talk about how you use the application. The main method is contained in the com.seruku.Googleminer.CommandLineMain class. To run the program, you need to make sure Googleapi.jar is on your classpath and then type the following into a shell:
java com.seruku.Googleminer.CommandLineMain
Of course, the first time you do this, nothing will work. Instead, you'll get the following error message:
Your settings are bad. Please change them.
Query Key is
Search Name is
Seach terms are
Storage Directory is
Terms per search is 500
Valid command line arguments are: -terms, -name, -storage, -Googlekey, -nresults
Each of these command-line arguments works the same way: you type the flag, a space, and then the value. The flags have the following meanings:
terms sets the query string to pass to Google.
storage sets the base storage directory. Query results are stored in subdirectories of this directory (and tagged with the day the query was made).
nresults sets how many results to fetch from Google.
name names the search. This is used to define a search-specific subdirectory of the main storage directory.
Googlekey sets the Google license key.
The following example sets the search name and the number of results to fetch:
java com.seruku.Googleminer.CommandLineMain -name myfirstsearch -nresults 300
Of course, you might be wondering, "Do I have to type in all five parameters each time? How tedious." The answer is: of course not. As you will see in the next section, the application uses the Preferences API to "remember" the previous values of these parameters. That is, if you type java com.seruku.Googleminer.CommandLineMain -name myFirstSearch, it will use the name myFirstSearch for both the current search and as the default from now on.
And that's how to use the entire program. I currently have my system set up to run Googleminer every day, using the Windows task scheduler, and pipe the output to a file named SerukuDailyDiff that sits on my desktop. (More precisely, I run a natively compiled version of Googleminer.)
Every now and then, I look at the DailyDiff file and check out how things have changed. It's pretty cool.
Using the Preferences API to Store Command-Line Arguments
The values of command-line arguments are persisted through the use of static methods defined on the CurrentSearchParameters class. Recall that performBasicSearch relied on lines such as
String key = CurrentSearchParameters.getQueryKey();
to configure the instance of GoogleSearch.
CurrentSearchParameters is simply a wrapper around the Preferences API. We use it to store the values of our parameters across invocations of the program. That is, instead of directly using the command-line arguments in the invocation of performBasicSearch, each command-line argument is processed by an instance of CommandLineArgumentHandler, which is responsible for storing it in the right place (using a set method defined on CurrentSearchParameters). This sets the value as the new default, and makes it available for performBasicSearch(). Given a set of instances of CommandLineArgumentHandler, the main() method simply finds the right handler for any given argument, and calls performAction() on the handler.
Here, for example, is the code for the handler interface and for the implementation of one of the handlers.
public interface CommandLineArgumentHandler {
public String getCommand();
public String performAction(String value);
}
public class QueryKeyCommandHandler
implements CommandLineArgumentHandler{
public String getCommand() {
return "-Googlekey";
}
public String performAction(String value) {
CurrentSearchParameters.setQueryKey(value);
return null;
}
}
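The lookup that main() performs can be sketched as follows. The interface is repeated so the example compiles on its own; DispatchSketch, its in-memory SETTINGS map (standing in for CurrentSearchParameters), and the convention that a non-null return value is an error message are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

interface CommandLineArgumentHandler {
    String getCommand();
    String performAction(String value);
}

class DispatchSketch {
    // Hypothetical in-memory store standing in for CurrentSearchParameters.
    static final Map<String, String> SETTINGS = new HashMap<>();

    // Builds a handler that simply records the value under its flag.
    static CommandLineArgumentHandler storing(final String flag) {
        return new CommandLineArgumentHandler() {
            public String getCommand() {
                return flag;
            }
            public String performAction(String value) {
                SETTINGS.put(flag, value);
                return null;
            }
        };
    }

    // Reads the arguments as flag/value pairs and hands each value to the
    // matching handler. Returns null on success or an error message
    // (assumed convention, not confirmed by the article).
    static String dispatch(String[] args, CommandLineArgumentHandler[] handlers) {
        Map<String, CommandLineArgumentHandler> byFlag = new HashMap<>();
        for (CommandLineArgumentHandler handler : handlers) {
            byFlag.put(handler.getCommand(), handler);
        }
        for (int i = 0; i < args.length; i += 2) {
            CommandLineArgumentHandler handler = byFlag.get(args[i]);
            if (handler == null) {
                return "Unknown argument: " + args[i];
            }
            if (i + 1 >= args.length) {
                return "Missing value for " + args[i];
            }
            String error = handler.performAction(args[i + 1]);
            if (error != null) {
                return error;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CommandLineArgumentHandler[] handlers = { storing("-name"), storing("-nresults") };
        String error = dispatch(new String[] { "-name", "myfirstsearch" }, handlers);
        System.out.println(error == null ? SETTINGS.toString() : error);
    }
}
```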
Once all of the command-line arguments have been processed, we call performBasicSearch(), which retrieves the current defaults for all the arguments from CurrentSearchParameters.
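A wrapper of this kind around java.util.prefs.Preferences can be quite small. The sketch below is an assumption about its general shape, not the actual CurrentSearchParameters source: the class name, the node name, the key names, and the defaults (including the 500 taken from the settings printout shown earlier) are all illustrative.

```java
import java.util.prefs.Preferences;

class SearchParametersSketch {
    // Hypothetical node and key names; the real ones are not shown in the article.
    private static final Preferences PREFS =
            Preferences.userRoot().node("googleminer-sketch");
    private static final String SEARCH_NAME = "searchName";
    private static final String TERMS_PER_SEARCH = "termsPerSearch";

    static void setSearchName(String name) {
        PREFS.put(SEARCH_NAME, name);
    }

    static String getSearchName() {
        // The second argument is the default used when nothing is stored yet.
        return PREFS.get(SEARCH_NAME, "");
    }

    static void setTermsPerSearch(int count) {
        PREFS.putInt(TERMS_PER_SEARCH, count);
    }

    static int getTermsPerSearch() {
        return PREFS.getInt(TERMS_PER_SEARCH, 500);
    }

    public static void main(String[] args) {
        setSearchName("myFirstSearch");
        System.out.println("Search Name is " + getSearchName());
        System.out.println("Terms per search is " + getTermsPerSearch());
    }
}
```

Values written with put() and putInt() are saved by the Preferences implementation in a per-user backing store, which is what lets a later run of the program see them as defaults.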
Final Thoughts
In approximately 750 lines of code, we implemented a simple version of Googleminer. It's a command-line application that enables you to track changes in a search over time. It's not the most impressive application ever, but it's a very simple and very straightforward illustration of two things:
It's very easy to write simple Internet applications in Java. All of the code involved was boilerplate code, and it didn't require any deep thought at all. In fact, you were probably a little bored by parts of this article. But that's part of my point -- this stuff is easy to build.
As more and more information sources get exposed on the Web, Internet applications are going to proliferate. (The only question is: will they proliferate according to Metcalfe's law, or Reed's law?)
The code itself, while completely boilerplate, also illustrates one interesting aspect of semantic aggregators: structurally, they're the exact opposite of web applications. In a typical web application, the server hands the browser a cookie, and stores the state itself. In Googleminer, the exact opposite happens: the "client" stores a vast amount of state, and repeatedly passes a token to the server to get more. This idea, of moving personal and customized data to the edge of the network where the users are, is, I think, an interesting one. In a future column, I'll explore putting a GUI on Googleminer using SWT.