Article:  Introduction to Nutch, Part 2: Searching
Subject:  NutchBean getContent() Bug
Date:  2008-06-20 01:36:59
From:  sumved_shami


Hi

I was trying to get the content of the files stored in the crawled data, so I passed "</HTML>" as the query string.

You can check the code here:
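import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public static void main(String[] args) {
    try {
        Configuration conf = NutchConfiguration.create();
        NutchBean nb = new NutchBean(conf);

        // Search the crawled data for pages matching the closing HTML tag.
        Hits hits = nb.search(Query.parse("</HTML>", conf), 500);
        System.out.println("Length: " + hits.getLength());
        System.out.println("======================================================================================================");

        // Guard the fixed index so getHit(34) cannot run past the returned hits.
        if (hits.getLength() > 34) {
            Hit hit = hits.getHit(34);
            HitDetails hitDetails = nb.getDetails(hit);

            // getContent() returns the raw bytes stored for the page;
            // new String(byte[]) decodes them with the platform default charset.
            System.out.println(new String(nb.getContent(hitDetails)));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}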


==============================================================

However, getContent() gives me a byte array that never exceeds 65536 bytes (2^16, i.e. 64 KB),
so I am not able to get the complete content of the HTML file.
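
To confirm that a page really is being cut off, a minimal sketch like the following could be used. It does not call Nutch at all: TruncationCheck, SUSPECTED_LIMIT and the helper methods are hypothetical names, and the 65536 value is only the size observed here, assumed rather than read from any Nutch setting.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class TruncationCheck {
    // Assumed cap, taken from the size observed in this post (not read from Nutch).
    static final int SUSPECTED_LIMIT = 65536;

    /** True when the byte array is at least as large as the suspected cap. */
    static boolean atSuspectedLimit(byte[] content) {
        return content != null && content.length >= SUSPECTED_LIMIT;
    }

    /** True when no closing html tag (in any letter case) appears in the bytes. */
    static boolean missingClosingTag(byte[] content) {
        if (content == null) {
            return true;
        }
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String text = new String(content, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        return !text.contains("</html>");
    }

    /** One-line summary suitable for printing next to each hit. */
    static String describe(byte[] content) {
        int length = (content == null) ? 0 : content.length;
        return "length=" + length
                + ", atLimit=" + atSuspectedLimit(content)
                + ", missingClosingTag=" + missingClosingTag(content);
    }

    public static void main(String[] args) {
        // Stand-in data; in the real program the bytes would come from getContent().
        byte[] complete = "<html><body>done</body></HTML>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] cutOff = new byte[SUSPECTED_LIMIT];
        System.out.println(describe(complete));
        System.out.println(describe(cutOff));
    }
}
```

A page that reports both atLimit=true and missingClosingTag=true would be a strong hint that the stored content was truncated at the cap rather than being short to begin with.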

I need the complete file in its correct format to parse it properly.

Can you please help me out?
