
Instant User Tracking with ClickStream

September 6, 2007


Introducing ClickStream

OpenSymphony's ClickStream is a user tracking component for Java
web applications. It lets you inspect and analyze the traffic
paths, that is, the sequence of pages that users request as they
browse your site. Such a traffic path is called a clickstream: the
logical grouping of an HTTP session identifier and the requests
associated with it, until the session ends. The good news is that
you can easily add this feature to your application by embedding
OpenSymphony's ClickStream and start taking advantage of this site
usage information.

We'll look first at how ClickStream works and what information
it collects. Then we'll proceed to configure your application to
use ClickStream. Finally we'll log the ClickStream-generated
information to a database and exploit it with standard database
queries.

Understanding the ClickStream Lifecycle

ClickStream starts tracking the user's activity as soon as the
web container creates an HTTP session. As you might know, the
Servlet 2.3 specification (part of J2EE) defines a listener model.
Session listeners are one kind of listener and are notified each
time an HTTP session is created or destroyed. ClickStream's session
listener is called ClickstreamListener. Together with
ClickstreamFilter, it forms the core of ClickStream.

ClickStream's main element is a servlet filter called
ClickstreamFilter, which intercepts every request
to a given web resource (a single page) or resource set (a group
of pages) designated by a URL pattern. Both components are
configured inside your application's web.xml file, but
don't worry about this yet; we'll look into the configuration of
the filter and the session listener in a later section.

For now, we'll take a look at the information gathered from each
clickstream.

The Logged ClickStream Data

ClickStream logs specific information from each request and
accumulates it in the corresponding ClickStream
object. This object is the one sent to the
ClickStreamLogger for logging when the session ends.
Each clickstream carries the following header data:

  • The remote host making the request, which it
    gets from request.getRemoteHost(). Remember that if
    your application runs behind a reverse proxy (a common scenario for
    firewalled applications), only the proxy's IP address is
    logged.
  • The stream start time, which is the timestamp
    when the session listener creates the clickstream, as soon as the
    application server creates the new session.
  • The last request time, the timestamp of the
    last request associated with that session.
  • The HTTP referrer (the Referer header), namely the user's
    previous page, if available.
  • The bot flag, which is set to true when the client is
    actually a crawler bot. ClickStream detects more than 250
    different kinds of bots.
  • The session ID, for association purposes.

It then retrieves and stores from each request the following
items:

  • The request protocol: HTTP 1.0, HTTP 1.1,
    etc.
  • The request parameters: the parameters
    following the ? in the URL and separated by
    &. Parameter data submitted via the HTTP POST
    method is logged as well.
  • The absolute request URI: the JSP or servlet
    mapping.
  • The session ID for associating with the
    header.
  • The remote host port used to connect to the
    server.
  • The request timestamp.
  • The remote user, if available, from the
    container's request.getRemoteUser() method.

This information is what we'll store in the database to exploit
and analyze the user's traffic paths. In the next section, we'll see
how easy it is to embed ClickStream into your application and start
capturing the page hits.
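
To make the two groups of fields concrete, here is a minimal Java sketch of the header and per-request data listed above. The class and field names (TrackedStream, TrackedRequest) are hypothetical and chosen for illustration only; they are not ClickStream's own classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the per-request data; not part of ClickStream.
final class TrackedRequest {
    final String protocol;       // e.g. "HTTP/1.1"
    final String requestUri;     // the JSP or servlet mapping
    final String parameters;     // query string or POST data, may be null
    final int remotePort;        // port the client connected from
    final long timestampMillis;  // when the request arrived
    final String remoteUser;     // null when the container reports no user

    TrackedRequest(String protocol, String requestUri, String parameters,
                   int remotePort, long timestampMillis, String remoteUser) {
        this.protocol = protocol;
        this.requestUri = requestUri;
        this.parameters = parameters;
        this.remotePort = remotePort;
        this.timestampMillis = timestampMillis;
        this.remoteUser = remoteUser;
    }
}

// Hypothetical model of the header data kept once per HTTP session.
final class TrackedStream {
    final String sessionId;
    final String remoteHost;
    final String referrer;     // may be null
    final boolean bot;
    final long startMillis;
    private final List<TrackedRequest> requests = new ArrayList<>();

    TrackedStream(String sessionId, String remoteHost, String referrer,
                  boolean bot, long startMillis) {
        this.sessionId = sessionId;
        this.remoteHost = remoteHost;
        this.referrer = referrer;
        this.bot = bot;
        this.startMillis = startMillis;
    }

    void add(TrackedRequest request) {
        requests.add(request);
    }

    // Falls back to the start time while no request has been recorded.
    long lastRequestMillis() {
        return requests.isEmpty()
                ? startMillis
                : requests.get(requests.size() - 1).timestampMillis;
    }

    List<TrackedRequest> requests() {
        return Collections.unmodifiableList(requests);
    }
}
```

One header object owns many request objects, which mirrors the one-to-many relationship that the session ID expresses in the logged data.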

Embedding ClickStream into Your Application

The first step is obviously to download the ClickStream
distribution from the OpenSymphony site (https://opensymphony.dev.java.net/). Then,
to embed ClickStream into your application, start by adding clickstream.jar and commons-logging.jar (if
your project doesn't already use this component) to the
WEB-INF/lib directory of your WAR application.

Next, edit your application's web.xml descriptor
with any text editor. You must add ClickStream's session listener
and filter. The filter is mapped to each URL pattern you
want to track with ClickStream. For example, if you want to track
every page hit, you must define the /* wildcard. On the
other hand, if you want to record only the hits directed to the
/MyServlet path, use a /MyServlet/*
wildcard. See the servlet specification
(http://java.sun.com/j2ee/tutorial/1_3-fcs/doc/Servlets8.html)
for more wildcard examples.
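
To see how such wildcards select requests, the following sketch implements a simplified version of the servlet url-pattern rules in plain Java: path-prefix patterns ending in /*, extension patterns starting with *., and exact matches. UrlPatterns is a hypothetical helper written for this illustration; the web container performs the real matching, and the special default mapping "/" is not modeled here.

```java
// Hypothetical helper illustrating servlet-style url-pattern matching.
final class UrlPatterns {
    private UrlPatterns() {
    }

    // path is the context-relative request path, e.g. "/MyServlet/report".
    static boolean matches(String pattern, String path) {
        if (pattern.endsWith("/*")) {
            // Path-prefix pattern: "/MyServlet/*" matches "/MyServlet"
            // and everything below it; "/*" matches every path.
            String prefix = pattern.substring(0, pattern.length() - 2);
            return path.equals(prefix) || path.startsWith(prefix + "/");
        }
        if (pattern.startsWith("*.")) {
            // Extension pattern: compare against the last path segment.
            String extension = pattern.substring(1); // e.g. ".jsp"
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            return lastSegment.endsWith(extension);
        }
        // Anything else must match exactly.
        return pattern.equals(path);
    }
}
```

With this helper, UrlPatterns.matches("/MyServlet/*", "/MyServlet/report") is true, while UrlPatterns.matches("*.jsp", "/images/logo.png") is false.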

The following web.xml fragment instructs the filter to
record only the hits directed to JSP and HTML pages, and registers
the session listener.

[prettify]
<filter>
    <filter-name>clickstream</filter-name>
    <filter-class>com.opensymphony.clickstream.ClickstreamFilter</filter-class>
</filter>

<filter-mapping>
    <filter-name>clickstream</filter-name>
    <url-pattern>*.jsp</url-pattern>
</filter-mapping>

<filter-mapping>
    <filter-name>clickstream</filter-name>
    <url-pattern>*.html</url-pattern>
</filter-mapping>

<listener>
    <listener-class>com.opensymphony.clickstream.ClickstreamListener</listener-class>
</listener>
[/prettify]

The two <filter-mapping> elements associate the
ClickStream filter with both extensions.

After adding the listener and filter, you can put the included
ClickStream JSPs into your web application's root directory. Both
clickstreams.jsp and viewstream.jsp are needed to
browse the ClickStream information online. Figure 1 illustrates
clickstreams.jsp, which shows all the active
clickstreams:

Figure 1. ClickStream clickstreams.jsp page

The clickstreams.jsp file lists all the active clickstreams
of the application ordered by the remote host IP. When you click one of the host IPs, ClickStream's viewstream.jsp appears,
as shown in Figure 2:

Figure 2. ClickStream viewstream.jsp page

These two pages allow you to browse the clickstreams that have
not yet been stored in the database, and are very useful
for quick browsing. The next section shows how to set up ClickStream
to log the user traffic data into a database for further processing
and analysis.

Logging the User Tracking Information to a Database

By default, ClickStream uses the Commons Logging component to
write the tracking information to the console or to log files.
In this example, we'll use a custom ClickStreamLogger
to save the information to a database. First we'll configure
ClickStream to use our logger and then we'll create the
corresponding database schema.

ClickStream offers the ability to change the logging strategy by
creating a new logging class, which implements the
ClickStreamLogger interface, and configuring its use
in the clickstream.xml file located in the
WEB-INF/classes folder. You can find the
DatabaseClickStreamLogger custom database logger and
the sample clickstream.xml configuration file in the
included source code. Our
clickstream.xml will look like this:

[prettify]
<clickstream>
    <logger class="net.java.cs.DatabaseClickStreamLogger"/>

    <bot-host name="inktomi.com"/>
    <!-- ... many more bot-host entries skipped for brevity -->
</clickstream>
[/prettify]
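
The bot-host entries feed the bot flag described earlier. How ClickStream compares a remote host against these names is not covered here, so the following is only a sketch under one plausible assumption: a client is flagged when its host name equals a configured bot host or is a subdomain of it. BotHostMatcher is a hypothetical class, not part of ClickStream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical matcher; the matching rule is an assumption for illustration.
final class BotHostMatcher {
    private final List<String> botHosts = new ArrayList<>();

    BotHostMatcher(List<String> configuredHosts) {
        // Normalize once so lookups are case-insensitive.
        for (String host : configuredHosts) {
            botHosts.add(host.toLowerCase(Locale.ROOT));
        }
    }

    boolean isBot(String remoteHost) {
        if (remoteHost == null) {
            return false;
        }
        String host = remoteHost.toLowerCase(Locale.ROOT);
        for (String botHost : botHosts) {
            // Match the domain itself or any of its subdomains.
            if (host.equals(botHost) || host.endsWith("." + botHost)) {
                return true;
            }
        }
        return false;
    }
}
```

Matching on a leading dot avoids flagging unrelated hosts whose names merely end with the same characters.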

The configuration of the logger is done through a
database.properties file. This property file is also
included in the sample code, and looks like this:

[prettify]
jdbc.driver.class=org.postgresql.Driver
jdbc.url=jdbc:postgresql://localhost/clickstream-db
jdbc.user=jdoe
jdbc.pass=secret
[/prettify]
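
As a rough idea of how a logger might consume these settings, the sketch below parses the four keys with java.util.Properties and fails fast when one is missing. JdbcSettings is a hypothetical helper written for this illustration; the sample code's DatabaseClickStreamLogger may read the file differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the four JDBC settings shown above.
final class JdbcSettings {
    final String driverClass;
    final String url;
    final String user;
    final String password;

    private JdbcSettings(String driverClass, String url, String user, String password) {
        this.driverClass = driverClass;
        this.url = url;
        this.user = user;
        this.password = password;
    }

    // Parses the text of a database.properties file.
    static JdbcSettings parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new JdbcSettings(
                require(props, "jdbc.driver.class"),
                require(props, "jdbc.url"),
                require(props, "jdbc.user"),
                require(props, "jdbc.pass"));
    }

    // Rejects missing or blank values so misconfiguration surfaces early.
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }
}
```

A logger could then load the driver with Class.forName(settings.driverClass) and open a connection with DriverManager.getConnection(settings.url, settings.user, settings.password).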

Just replace the URL, JDBC driver class, user, and password with
the appropriate values for your database. With the configuration
ready, let's create ClickStream's database schema. The
database model is made up of only two tables: one with the header
clickstream data, and the other with the detailed request
information of each clickstream. Figure 3 shows the
structure.

Figure 3. ClickStream DB schema

Execute the included SQL script, clickstream.sql, to
create the tables in your favorite database.

We are all set up; now when your application starts, it'll begin
to log the clickstream information to your database. The following
section shows how to exploit the tracking information using some
very useful metrics.

Exploiting the User Tracking Information

The fact that we've stored the user tracking information inside
a database server means that we can classify, measure, and manipulate
it at will. Some metrics you'll find very useful are:

  • The distinct user count over a period of time
  • The most-accessed pages
  • The length of the average user browsing session, in minutes

You'll find some of these SQL queries with PostgreSQL syntax in the
sample code. Most of the time, you'll want
to browse some sampled sessions to see what the user activity
looks like; you can achieve this by selecting a single session
identifier (select * from clickstream_requests where sessionid = 'nlggs2ccbeb2').
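
To make the three metrics concrete without guessing at the table columns, here is a sketch that computes them in plain Java over in-memory records. It is not the SQL from the sample code. Hit and ClickMetrics are hypothetical names, and the sketch assumes that one session ID stands for one user.

```java
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical row: one logged request.
final class Hit {
    final String sessionId;
    final String uri;
    final long timestampMillis;

    Hit(String sessionId, String uri, long timestampMillis) {
        this.sessionId = sessionId;
        this.uri = uri;
        this.timestampMillis = timestampMillis;
    }
}

final class ClickMetrics {
    private ClickMetrics() {
    }

    // Distinct sessions with at least one hit in [fromMillis, toMillis).
    static long distinctSessions(List<Hit> hits, long fromMillis, long toMillis) {
        return hits.stream()
                .filter(h -> h.timestampMillis >= fromMillis && h.timestampMillis < toMillis)
                .map(h -> h.sessionId)
                .distinct()
                .count();
    }

    // URIs ordered by hit count (descending), ties broken alphabetically.
    static List<String> mostAccessedPages(List<Hit> hits, int limit) {
        Map<String, Long> counts = hits.stream()
                .collect(Collectors.groupingBy(h -> h.uri, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Mean of (last hit - first hit) per session, in minutes.
    static double averageSessionMinutes(List<Hit> hits) {
        Map<String, LongSummaryStatistics> perSession = hits.stream()
                .collect(Collectors.groupingBy(h -> h.sessionId,
                        Collectors.summarizingLong(h -> h.timestampMillis)));
        return perSession.values().stream()
                .mapToLong(s -> s.getMax() - s.getMin())
                .average()
                .orElse(0.0) / 60_000.0;
    }
}
```

The equivalent database queries would use the same building blocks: a distinct count with a time filter, a group-by on the URI ordered by count, and a per-session difference between the latest and earliest timestamps.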

Under the Hood

You can visualize the interactions by looking at the sequence
diagram in Figure 4, which depicts the complete lifecycle of
ClickStream inside your web application.

Figure 4. ClickStream lifecycle sequence diagram

As the UML sequence diagram shows, the
ClickStream activity starts when an HTTP request arrives. If no
HTTP session is associated with the request, the web
container creates one and calls the session listeners; in this case
ClickstreamListener is notified. ClickstreamListener generates a new
ClickStream object to collect the user's page track and
stores it in the newly created session.

Then, if the request matches one of the resources defined by the
ClickstreamFilter wildcard inside the web.xml
file, the web container calls the ClickstreamFilter.
This filter adds the request information to the session's
ClickStream object. This cycle continues until the
session is explicitly invalidated or the session expires due to
user inactivity. Each page or resource the user requests is logged
into the same ClickStream object.

When the ClickstreamListener is notified about the
end of a user session, it logs the ClickStream by
calling ClickStreamLogger. By default, this component is
backed by a Jakarta Commons Logging logger,
but this can be overridden with a custom
ClickStreamLogger, as we saw earlier.
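
The lifecycle described above can be condensed into a small stand-alone simulation. This sketch uses no servlet API: a map stands in for the HTTP sessions, and the three methods play the roles of the listener's create callback, the filter, and the listener's destroy callback. LifecycleSketch and StreamSink are hypothetical names, not ClickStream classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the logger that receives a finished clickstream.
interface StreamSink {
    void log(String sessionId, List<String> requestedUris);
}

// Hypothetical simulation of the listener/filter/logger interplay.
final class LifecycleSketch {
    private final Map<String, List<String>> activeStreams = new HashMap<>();
    private final StreamSink sink;

    LifecycleSketch(StreamSink sink) {
        this.sink = sink;
    }

    // Listener role: a new session gets an empty clickstream.
    void sessionCreated(String sessionId) {
        activeStreams.put(sessionId, new ArrayList<String>());
    }

    // Filter role: each matching request is appended to the session's stream.
    void requestFiltered(String sessionId, String uri) {
        List<String> stream = activeStreams.get(sessionId);
        if (stream != null) {
            stream.add(uri);
        }
    }

    // Listener role: when the session ends, hand the stream to the logger.
    void sessionDestroyed(String sessionId) {
        List<String> stream = activeStreams.remove(sessionId);
        if (stream != null) {
            sink.log(sessionId, stream);
        }
    }

    int activeCount() {
        return activeStreams.size();
    }
}
```

Because the stream is handed to the sink only in sessionDestroyed, nothing reaches the log or the database until the session is invalidated or times out.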

Of course, you don't need to wait until the session expires to
see the clickstream information gathered during the application's
uptime. You can list your users' active clickstreams and browse
their page hits with the provided clickstreams.jsp and
viewstream.jsp pages.

Where to Go from Here

In this article we have covered how to embed ClickStream
into your web application, and we've seen how to exploit the
tracking information once it is stored in a database. Be aware that
even slight user activity can generate a massive amount of tracking
information, so it's highly recommended to prune this
information every two or three days, depending on your users'
activity. You'll probably find more uses for this information:
finding unused pages and bottlenecks, predicting traffic spikes, etc.

Diego Naya lives in Buenos Aires, Argentina, and is currently working at Argentina's biggest health care company, OSDE.