
Data Analysis and Data Mining Using Java, Jython and jHepWork

May 15, 2011

Introduction

Data mining (sometimes called knowledge discovery) is the process of analyzing and
summarizing data to obtain useful information: to understand common features and
the origin of the data, and to extract hidden predictive information.
Data mining is used in science, engineering, modeling and the analysis of financial markets.

This article discusses a free data-analysis framework called jHepWork [1]
which is widely used to facilitate data analysis and data mining (see Figure 1).
It was designed for scientists, engineers and students who need numerical and statistical computations,
data and function visualization and even symbolic computation.

jHepWork is a 100% Java package, which means it is fully object-oriented and runs on any Java Virtual Machine regardless of computer architecture. Another notable feature is that it uses the Python programming language [2] to call Java classes for numerical and statistical computation as well as for data visualization. To be more exact, jHepWork fully utilizes the power of Jython [3], which is an implementation of the Python programming language in Java.


Figure 1. jHepWork IDE with several interactive graphs.

This merger of Java and Jython is not accidental.
According to the TIOBE Programming Community Index [4],
Java is the world’s most popular programming language.
Python is among the popular scripting languages widely used in science, engineering and
education. It was also the fastest-growing programming language of 2010 according to the same TIOBE index.
jHepWork uses the Python language because its short and clear syntax is handy for calling
numerical Java libraries. As a result, data-analysis programs written with this approach are short and clear,
while still utilizing the full strength of Java.

This is somewhat different from GUI-only programs, which
typically require walking through various menus and sub-menus to perform certain tasks.
In the jHepWork approach, one can write short commands in Python to perform computations
with arbitrary algorithmic logic that can be changed at runtime.
This approach is also important for repetitive tasks: analysis code,
once saved into a file, can be executed any number of times with different inputs
(a tedious task for GUI-only programs). In some sense, the scripting approach to data mining is
similar to that of the R programming language [5].
The difference is that jHepWork is based on Jython and
takes full advantage of its object-oriented design, the Python programming language with its high-level standard library,
the power of the Java API and the jHepWork Java libraries for data manipulation and visualization.

That said, one can always use a pure Java approach to
develop data-mining analysis programs with jHepWork, since all numerical and graphical
libraries of jHepWork are implemented in 100% Java. Alternatively, one can use another scripting language,
such as BeanShell, or the Java scripting API provided by the javax.script package.
Finally, one can use the powerful Eclipse or NetBeans IDEs while editing
analysis programs.

Short tutorial

In this tutorial we illustrate the strength of jHepWork for data mining
using the Jython language.
We show how to analyze multidimensional data, display data on 2D and 3D canvases,
plot a function and perform a full-scale linear regression analysis, which is widely used in the statistical interpretation of data.

Let us assume that we have a matrix of numbers organized as:

# this is a comment
1 2 3 4
5 6 7 8
.......

(the number of rows and columns can be arbitrary).
The goal of this tutorial is to analyze this data and to extract some useful information.
The numbers can be stored in a file, which can be located on the Web.
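
As an illustration of this file layout, the following plain-Python sketch parses such text into a list of rows. It is not part of jHepWork and does not show how the PND class reads data; the function name parse_matrix and the sample string are hypothetical and used only for this example.

```python
def parse_matrix(text):
    """Parse whitespace-separated numbers, one matrix row per line.

    Lines starting with '#' are treated as comments; blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        rows.append([float(token) for token in line.split()])
    return rows


# Hypothetical sample input with the same structure as the tutorial data
sample = """# this is a comment
1 2 3 4
5 6 7 8
"""

matrix = parse_matrix(sample)
print(matrix)
```

Each inner list is one row of the matrix, so a column is obtained by taking the element with the same index from every row.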

First, make sure that the Java Virtual Machine (http://www.java.com/) is installed.
Then download the jHepWork package from http://jwork.org/jhepwork/, unzip the package file
and run the script “jhepwork.sh” (Linux/Mac) or “jhepwork.bat” (Windows).
If you do this for the first time, Jython will start creating a cache directory. This process may take twenty to forty
seconds depending on the speed of your system.
Jython needs to catalog all Java classes visible to the Java Virtual Machine since this
simplifies programming (no need to name every Java class in the import statements) and
speeds up code execution.

After start-up, you will see the jHepWork IDE as shown in Figure 1.
It is bundled with a powerful code editor and a code assist based on Java reflection.
It also has a Jython shell (below the main editor) and a BeanShell. Both help with the
interactive development of data-mining analysis code and can also be used to call external commands.
For this tutorial,
we will use the Jython shell (“JythonShell”) since one can see the program response
immediately after entering commands line by line.

The first step is to read the data into a jHepWork data container designed for
convenient data manipulation.
We will read our data from a prepared file located on the Web.
Make the JythonShell window bigger and enter the code shown below line by line, pressing [Enter] after each line:

>>> from jhplot import *
>>> pn=PND('data','http://jwork.org/jhepwork/examples/data/pnd.d')
>>> print pn.toString()

Here we create a PND object using the input file “pnd.d”
stored on the Web and print the numbers stored in this container as a check.
The PND class is located in the "jhplot" package which is shipped together with jHepWork;
this is the main jHepWork package for data manipulation and visualization.
The input file has exactly the same structure as shown before, i.e. each row is on a separate line.
We use the Python print statement
to print the string returned by the method toString().
Alternatively, one can use the pn.toTable() method to display all numbers in a sortable and searchable table.
You will see the numbers printed in the JythonShell (which displays the output of the print command).

Want to learn about the methods of the “pn” object? Just type “pn.” (the dot is important!)
and press [Ctrl]-[Space].
You will see a drop-down menu with the methods of this class.
Alternatively, one can look at the complete API of the PND class with

>>> pn.doc()   # this brings up a window with the class API

Let us continue with the analysis of our data. The first thing we want to do is to extract the
numbers from the second column and display them as a histogram (or a bar-chart density plot)
in order to understand the statistical characteristics of the data.
Assuming that the “pn” object is created as shown before, we
extract the second column using the index 1 (the first column has the index 0):

>>> p0=pn.getP0D(1)     # extract the 2nd column and put it into a 1D array
>>> print p0.getStat()  # print detailed statistical characteristics
>>> c1=HPlot('Plot')    # create a canvas to display a histogram
>>> c1.visible()        # pop up the canvas; c1.visible(False) creates the image in the background
>>> c1.setAutoRange()   # set auto-range for the X and Y axes
>>> h1=p0.getH1D(10)    # convert the 1D array into a histogram with 10 bins
>>> c1.draw(h1)         # draw the histogram

You will see a long list of statistical characteristics of the array holding the second column (object p0) and a pop-up window with the histogram of this array.
The code is self-explanatory and contains the necessary comments to explain each step.
For example, the method p0.getH1D(10) fills a one-dimensional histogram (the Java class H1D) using ten bins
between the minimum and maximum values of the array “p0” (the Java class P0D).
You may be surprised to find how many methods the H1D class contains.
According to the Java API, the histogram class
H1D has about 100 methods for data manipulation (excluding the methods for graphical representation).
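
To make the binning step concrete, here is a minimal plain-Python sketch of what extracting a column and filling a histogram with ten equal-width bins between the minimum and the maximum amounts to. It does not use the jHepWork classes. The helper names (column, mean_and_std, histogram), the toy data, the use of a sample standard deviation (dividing by n-1) and the choice of placing the maximum value in the last bin are all assumptions made for illustration; P0D and H1D may differ in such details.

```python
import math


def column(rows, index):
    """Return the column with the given 0-based index as a list."""
    return [row[index] for row in rows]


def mean_and_std(values):
    """Mean and sample standard deviation (assumed n-1 convention)."""
    n = len(values)
    mean = sum(values) / float(n)
    variance = sum((v - mean) ** 2 for v in values) / float(n - 1)
    return mean, math.sqrt(variance)


def histogram(values, nbins):
    """Count values in nbins equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; bin width would be zero")
    width = (hi - lo) / float(nbins)
    counts = [0] * nbins
    for v in values:
        # The maximum sits on the upper edge; assign it to the last bin.
        i = min(int((v - lo) / width), nbins - 1)
        counts[i] += 1
    return counts


# Hypothetical toy matrix with three columns; the second column is 0..20
rows = [[10 + i, i, 2 * i] for i in range(21)]
second = column(rows, 1)          # index 1 selects the second column
mean, std = mean_and_std(second)
counts = histogram(second, 10)
print(mean, std)
print(counts)
```

The counts always sum to the number of input values, which is a quick sanity check when comparing with a plotted histogram.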

If you want to make a file with high-quality vector graphics,
use the method c1.export("fig.pdf") (for the PDF format) or c1.export("fig.ps") (for the PostScript format).
jHepWork supports about a dozen image output formats. Figures can be generated in the background without bringing up the canvas; in this case, use the method c1.visible(0). Finally, jHepWork has a powerful input-output mechanism for each data object (histograms, functions, data arrays), which allows storing all objects in files using either the Java serialization mechanism
or simple text-based files with compression.

Scatter plot and linear regression

The next step in our analysis is to extract two arbitrary columns and to make an X-Y scatter plot in
order to find a correlation between the numbers from these columns.
In the example below we extract the second and third columns, plot them on an X-Y canvas and
then perform a least-squares linear regression:

>>> from jhplot.stat import * 
>>> p1=pn.getP1D(1,2)      # extract 2nd and 3rd columns
>>> c1=HPlot('X-Y plot')
>>> c1.visible(); c1.setAutoRange()  # show the canvas and set autorange
>>> c1.draw(p1)
>>> r=LinReg(p1)
>>> print "Intercept=",r.getIntercept(), "+/-",r.getInterceptError()
>>> print "Slope=",r.getSlope(),"+/-",r.getSlopeError()

This code should follow the code that creates the object “pn” as discussed before.
Executing this example creates an X-Y graph with the values of the second and third columns, performs a least-squares regression and prints the values of the intercept and the slope (with their statistical uncertainties) of the linear-regression line.
But how do we visualize this line? We can create a function using the values of the slope and the
intercept with Python string formatting:

>>> func='%4.2f*x+%4.2f' % (r.getSlope(),r.getIntercept()) # a string representing a function a*x+b
>>> f1=F1D( func, p1.getMin(0), p1.getMax(0))              # a function object in the data range
>>> c1.draw(f1) 

This part should follow the code discussed before.
Here we build a function a*x+b using
the slope and the intercept values in place of the symbols “a” and “b”.
Note that the string formatting reduces the precision of these
values to two decimal places (which is not too important in this
example). Then we
build a function object from the string in the X-axis range given by the data
(p1.getMin(0) is the minimum value of our data on the X-axis and p1.getMax(0) is the maximum value).
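
For readers who want to see what a least-squares fit computes, the following plain-Python sketch reproduces the slope, the intercept and their standard errors for an unweighted straight-line fit, and then builds the same kind of function string as above. The function name linear_fit and the toy data are hypothetical, and the error formulas are the standard textbook ones; the LinReg class of jHepWork may use a different convention, so treat this as a cross-check rather than a description of its internals.

```python
import math


def linear_fit(xs, ys):
    """Unweighted least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, slope_error, intercept_error).
    """
    n = len(xs)
    xbar = sum(xs) / float(n)
    ybar = sum(ys) / float(n)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Residual variance with n - 2 degrees of freedom
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s2 = sse / float(n - 2)
    slope_error = math.sqrt(s2 / sxx)
    intercept_error = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    return slope, intercept, slope_error, intercept_error


# Hypothetical toy data lying close to the line y = 2*x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, slope_error, intercept_error = linear_fit(xs, ys)
print("Slope=", slope, "+/-", slope_error)
print("Intercept=", intercept, "+/-", intercept_error)

# Same formatting as in the tutorial: two decimal places for a and b
func = '%4.2f*x+%4.2f' % (slope, intercept)
print(func)
```

The formatted string keeps only two decimal places of each coefficient, which is the loss of precision mentioned above.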

Now we can do something more: we will calculate a 95% prediction interval for the regression line [6].
The 95% prediction interval is the area in which 95% of all data points are
expected to fall. Do not confuse it with the 95% confidence interval, which is the area that has a 95% chance of containing the true regression line. jHepWork can calculate both, but here we only discuss the 95% prediction interval and plot it
in the form of a band on top of the data points:

>>> from java.awt import Color
>>> p=r.getPredictionBand(Color.green) # extract 95% prediction band
>>> p.setLegend(False)                 # do not draw the legend for this band
>>> p.setErrColor(Color.green)         # set green color for error bars
>>> c1.draw(p)                         # show on the canvas

The method getPredictionBand() returns a P1D data container with the
95% prediction interval. We show this band using error bars colored in
green using the “Color” class from the standard java.awt package.
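
The difference between the two intervals can be seen in a small plain-Python sketch that evaluates the textbook half-widths of the 95% confidence and prediction intervals at a given x for an unweighted straight-line fit. This is not the jHepWork implementation: the function name band_half_widths and the toy data are hypothetical, and the Student-t critical value is hard-coded for three degrees of freedom (five points minus two fitted parameters) because the Python standard library has no t-distribution.

```python
import math

# Two-sided 95% Student-t critical value for n - 2 = 3 degrees of freedom
# (assumption: valid only for the 5-point toy data set below).
T_CRIT_DF3 = 3.182


def band_half_widths(xs, ys, x0, t_crit):
    """Half-widths of the confidence and prediction intervals at x0.

    Returns (confidence_half_width, prediction_half_width).
    """
    n = len(xs)
    xbar = sum(xs) / float(n)
    ybar = sum(ys) / float(n)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / float(n - 2))      # residual standard deviation
    leverage = 1.0 / n + (x0 - xbar) ** 2 / sxx
    # Uncertainty of the fitted line itself
    confidence = t_crit * s * math.sqrt(leverage)
    # Uncertainty of a new observation: adds the scatter of single points
    prediction = t_crit * s * math.sqrt(1.0 + leverage)
    return confidence, prediction


# Hypothetical toy data lying close to the line y = 2*x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

conf_mid, pred_mid = band_half_widths(xs, ys, 3.0, T_CRIT_DF3)
conf_edge, pred_edge = band_half_widths(xs, ys, 5.0, T_CRIT_DF3)
print(conf_mid, pred_mid)
print(conf_edge, pred_edge)
```

The prediction half-width is always the larger of the two, and both grow as x moves away from the mean of the data, which is why such bands widen toward the edges of the plot.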

Showing data in 3D

Let us continue with this example by displaying the data in three dimensions (3D)
using three arbitrary columns.
This time we will display data for columns 1, 2, 3 and columns 1, 3, 4 using two separate interactive plot regions (the so-called “pads”).
As before, we assume that this code follows right after the previously discussed lines and
the object “pn” has already been created:

>>> c2=HPlot3D('3D plot',600,400,2,1)  # create a 600x400 canvas and make 2 drawing pads
>>> c2.visible()
>>> c2.cd(1,1);   c2.setAutoRange()    # navigate to first pad and set autorange
>>> p2=pn.getP2D(0,1,2)                # extract columns 1, 2, 3 (indices 0, 1, 2)
>>> c2.draw(p2)
>>> c2.cd(2,1);  c2.setAutoRange()     # navigate to second pad and set autorange
>>> p3=pn.getP2D(0,2,3)                # extract columns 1, 3, 4 (indices 0, 2, 3)
>>> c2.draw(p3)

Executing the above code creates two interactive 3D pads which can be rotated and zoomed.
Use the methods of the Java class “HPlot3D” to change the plot style. For example, one can change the color of the drawing box to gray
using the java.awt.Color class as c2.setBoxColor(Color(200,210,210)), which can be inserted after the pad navigation method c2.cd().

Note that, instead of the JythonShell, one can use the jHepWork editor. Create a file called “example.py”
and copy and paste the lines above (without the “>>>” prompts). To run this file using Jython, press [F8] or click the run icon on the toolbar of the jHepWork IDE.
There is one essential advantage to this approach: one can use the built-in code assist, which contains a detailed description
of all methods. For example, assuming that the “pn” object is created as shown before,
type a dot after “pn” in the editor and press [F4]:

>>> pn.  # + press [F4] to display a list of methods

Pressing [F4] brings up a table showing all methods of this class.
One can get a detailed description of each method and insert a selected method into the code editor.
Later one can make the necessary modifications to
the code and rerun it using [F8] or by clicking the run icon.

Putting it all together

Now let us run all the examples of this tutorial in one go.
The code of this tutorial is given in the file “tutorial.py”, which can be found on the jHepWork web page.
In the jHepWork IDE, go to the menu [File] and then [Open from URL]. Copy and paste this string into the URL window:

http://jwork.org/jhepwork/examples/tutorial.py

and press the button [Open] (to see the code in the editor) or [Run] (to run the code).
You will see the plots from our tutorial as shown in Figures 2 and 3.

Figure 2. Histogram (left) and linear regression analysis.

Figure 3. Showing data in 3D.

A final word. jHepWork comes with more than 200 example scripts, a detailed online tutorial and even a book describing all aspects of the Jython and jHepWork approach to data analysis.
To run the examples included in the jHepWork IDE,
simply go to the main menu, select [Tools] and then [jHPlot examples]. Then one
can open a Jython example and run it in the jHepWork IDE.

More details about the jHepWork data-analysis project can be found on the official web page [1]
and in the jHepWork book [7].

About the license: the core numerical and graphical Java libraries are licensed under the GNU General Public License v3.
The documentation, examples, installer, code assist database and language files used by the jHepWork editor
are licensed under the Creative Commons Attribution-Share Alike License, version 3.0, and are
free for non-commercial usage (academic research, science and education).


References

[1] The jHepWork project, http://jwork.org/jhepwork/

[2] The Python language, http://www.python.org/

[3] The Jython project, http://www.jython.org/

[4] TIOBE index, http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html

[5] The R project for statistical computing, http://www.r-project.org/

[6] Confidence and prediction bands, Wikipedia, http://en.wikipedia.org/wiki/Confidence_band

[7] S.V. Chekanov, Scientific Data Analysis using Jython Scripting and Java, Springer-Verlag, London, 2010, 497 p., ISBN 978-1-84996-286-5

Sergei Chekanov has been developing scientific software for analysis of large data volumes since 1995. He is a primary developer of the jHepWork data-analysis project and several open-source Java projects.
Alejandro D. P. de Astorza is a professional programmer. He contributed to the design and validation of jHepWork.

Comments

jHepWork has been renamed ScaVis (http://jwork.org/scavis/).
It should also be noted that you can install it using the JPort Java desktop (http://jwork.org/jport).