Article: Introduction to Nutch, Part 2: Searching
Subject: NutchBean getContent() Bug
Date: 2008-06-20 01:36:59
From: sumved_shami

Hi
I was trying to get the content of the files stored in the crawled data, so I passed "</HTML>" as the query string.
You can check the code here:
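import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class ContentDump { // class name is arbitrary
    public static void main(String[] args) {
        try {
            Configuration conf = NutchConfiguration.create();
            NutchBean nb = new NutchBean(conf);
            // Ask for up to 500 hits matching the query string.
            Hits hits = nb.search(Query.parse("</HTML>", conf), 500);
            // Check for null before using hits, not after.
            if (null != hits) {
                System.out.println("Length: " + hits.getLength());
                System.out.println("======================================================================================================");
                // Only read hit 34 if the result set actually contains it.
                if (hits.getLength() > 34) {
                    Hit hit = hits.getHit(34);
                    HitDetails hitDetails = nb.getDetails(hit);
                    // Raw bytes of the stored page, decoded with the platform default charset.
                    System.out.println(new String(nb.getContent(hitDetails)));
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}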
==============================================================
But getContent() gives me a byte array smaller than 65536 bytes (2^16, i.e. 64 KB),
so I am not able to get the complete content of the HTML file.
I need the complete file, in the correct format, to parse it properly.
Can you please help me out?
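A note on the 65536 figure: it matches the default value of the http.content.limit property (and of file.content.limit) in Nutch's nutch-default.xml, so the pages may have been truncated at fetch time rather than by getContent() itself. If that is the cause, overriding the property in nutch-site.xml (a negative value disables truncation) and re-fetching should help. The sketch below is one way to confirm the symptom on the bytes returned by getContent(). The class ContentCheck, its two methods, and the constant are hypothetical helpers written for this post, not part of Nutch, and the limit value is an assumption based on the default configuration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: checks whether fetched bytes look cut off.
class ContentCheck {

    // Assumed default of http.content.limit in nutch-default.xml.
    static final int ASSUMED_CONTENT_LIMIT = 65536;

    // True when the content is at least as long as a non-negative limit,
    // which is what a fetch truncated at that limit would look like.
    static boolean reachedLimit(byte[] content, int limit) {
        return content != null && limit >= 0 && content.length >= limit;
    }

    // True when the bytes contain a closing </html> tag (case-insensitive).
    // ISO-8859-1 maps every byte to one char, so decoding never fails.
    static boolean hasClosingHtmlTag(byte[] content) {
        if (content == null) {
            return false;
        }
        String text = new String(content, StandardCharsets.ISO_8859_1);
        return text.toLowerCase(Locale.ROOT).contains("</html>");
    }

    public static void main(String[] args) {
        byte[] complete = "<html><body>ok</body></HTML>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] cutOff = new byte[ASSUMED_CONTENT_LIMIT];
        System.out.println("complete page has closing tag: " + hasClosingHtmlTag(complete));
        System.out.println("cut-off page reached limit: " + reachedLimit(cutOff, ASSUMED_CONTENT_LIMIT));
    }
}
```

In the program above, the result of nb.getContent(hitDetails) could be passed to both methods. Content that sits at the limit and has no closing tag was most likely truncated during the crawl.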