[lucene] How can I use Lucene's Highlighter to get the record in a file that matches a search string?

wanwan108 2009-10-13
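I have only been learning Lucene for a short time.
I search a text file for a string and then need to get the record (line) that contains it.
I am using Lucene's Highlighter, and running the program throws a NullPointerException: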
找到: 1  个结果!
File: E:\s\b.txt
content=null
Exception in thread "main" java.lang.NullPointerException
at java.io.StringReader.<init>(Unknown Source)
at newFile.TestQuery.main(TestQuery.java:56)

Here is my code (Lucene 2.0):
package newFile;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class TxtFileIndexer {
	public static void main(String[] args) throws Exception{

		File   indexDir = new File("E:\\index");
        File   dataDir  = new File("E:\\s"); 
//        Analyzer luceneAnalyzer = new StandardAnalyzer();
        Analyzer luceneAnalyzer = new WhitespaceAnalyzer();
        File[] dataFiles  = dataDir.listFiles();
        IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);
        indexWriter.setMaxFieldLength(99999999); // raise the per-field length limit (very important)
        long startTime = new Date().getTime();
        for(int i = 0; i < dataFiles.length; i++){
        	if(dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")){
        		System.out.println("索引文件: " + dataFiles[i].getCanonicalPath());
        		Document document = new Document();
        		Reader txtReader = new FileReader(dataFiles[i]);
        		document.add(new Field("path",dataFiles[i].getCanonicalPath(),Field.Store.YES,
                        Field.Index.NO));
        		document.add(new Field("content",txtReader));
        		indexWriter.addDocument(document);
        	}
        }
        indexWriter.optimize();
        indexWriter.close();
        long endTime = new Date().getTime();
        
        System.out.println("花了 " + (endTime - startTime) 
                           + " 毫秒创建索引! "
        		           + dataDir.getPath());        
	}
}




package newFile;

import java.io.IOException;
import java.io.StringReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class TestQuery {
	public static void main(String[] args) throws IOException, ParseException {
		Date startDate = new Date();
		Hits hits = null;
		String queryString = "张1999";
		Query query = null;
		IndexSearcher searcher = new IndexSearcher("e:/index");
		Highlighter highlighter = null;		// used to highlight matches
//		Analyzer analyzer = new StandardAnalyzer();
		Analyzer analyzer = new WhitespaceAnalyzer();
		try {
			QueryParser qp = new QueryParser("content", analyzer);
			query = qp.parse(queryString);
			
			// highlighter setup
		    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<read>","</read>");   
		    highlighter = new Highlighter(simpleHTMLFormatter,new QueryScorer(query));      
		    // 100 is the length of the context returned around the keyword; set it as needed, since the whole text cannot be returned
		    highlighter.setTextFragmenter(new SimpleFragmenter(100));
			
		} catch (ParseException e) {
			e.printStackTrace();
		}
		System.out.println("-------");
		if (searcher != null) {
			System.out.println("sercher != null");
			hits = searcher.search(query);
			System.out.println(hits.length());
			for (int i = 0; i < hits.length(); i++) {
				System.out.println(" 找到: " + hits.length() + "  个结果! ");
				Document doc = hits.doc(i);
				System.out.println("File: " + doc.get("path"));
				System.out.println("content="+doc.get("content")); // why does this return null?

				// print the highlighted fragment
	        	TokenStream tokenStream =analyzer.tokenStream("content", new StringReader(doc.get("content")));
	            System.out.println(highlighter.getBestFragment(tokenStream,hits.doc(i).get("content")));
			}
		} else {
			System.out.println("空");
		}
		Date endDate = new Date();
		System.out.println("花费" + (endDate.getTime()-startDate.getTime()) +"毫秒");
	}

}



Could someone please take a look?
wanwan108 2009-10-13
The records in the text file look like this, one per line:
张1  25  男  1310000001  深圳市  某某公司  Sun Sep 27 09:12:40 CST 2009
张2  25  男  1310000002  深圳市  某某公司  Sun Sep 27 09:12:40 CST 2009
张3  25  男  1310000003  深圳市  某某公司  Sun Sep 27 09:12:40 CST 2009
wanwan108 2009-10-13
System.out.println("content="+doc.get("content"));    The search does find the string, so why is this null?
wanwan108 2009-10-13
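Frustrating, now I get a different error. The null content was caused by the way I wrote that part of the code, which is why content could not be retrieved.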
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.lucene.search.Query.extractTerms(Ljava/util/Set;)V
at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:114)
at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:92)
at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:105)
at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:44)
at org.apache.lucene.search.highlight.QueryScorer.<init>(QueryScorer.java:49)
at newFile.TestQuery.main(TestQuery.java:37)

The error now points at this line:    highlighter = new Highlighter(simpleHTMLFormatter,new QueryScorer(query));
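
A NoSuchMethodError at runtime usually means a class was compiled against a different version of a library than the one actually loaded. Here that may mean the highlighter jar and the Lucene core jar on the classpath come from different releases, though the thread does not confirm it. A minimal diagnostic sketch, assuming nothing beyond the JDK (ClasspathCheck and its methods are hypothetical helpers, not part of Lucene):

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

/** Hypothetical diagnostic helper; class and method names are passed in as strings. */
final class ClasspathCheck {

    /** Loads a class by name without running its static initializers. */
    private static Class<?> load(String className) throws ClassNotFoundException {
        return Class.forName(className, false, ClasspathCheck.class.getClassLoader());
    }

    /** Returns the jar or directory a class was loaded from, or a placeholder if unknown. */
    static String locationOf(String className) {
        try {
            CodeSource source = load(className).getProtectionDomain().getCodeSource();
            return source == null ? "(bootstrap / unknown)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    /** Returns true if the class has a public method (declared or inherited) with this name. */
    static boolean hasPublicMethod(String className, String methodName) {
        try {
            for (Method m : load(className).getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
        } catch (ClassNotFoundException e) {
            // A class that cannot be found has no methods to report.
        }
        return false;
    }

    public static void main(String[] args) {
        String query = "org.apache.lucene.search.Query";
        String scorer = "org.apache.lucene.search.highlight.QueryScorer";
        System.out.println("Query loaded from:       " + locationOf(query));
        System.out.println("QueryScorer loaded from: " + locationOf(scorer));
        System.out.println("Query.extractTerms present: " + hasPublicMethod(query, "extractTerms"));
    }
}
```

Run it with the same classpath as TestQuery. If the two locations point at jars from different Lucene versions, or extractTerms is reported missing, aligning the jar versions is the likely fix.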
wanwan108 2009-10-13
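It displays now. But why can the search only find the first few hundred records? When I search for a string further down the file, nothing is found. Which property do I need to set?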
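
One possible cause: in Lucene 2.x, IndexWriter indexes only the first maxFieldLength terms of a field, and the default is 10,000. The posted indexer already calls setMaxFieldLength, so this would apply only if the index being searched was built without that call (an assumption; the thread does not say). The sketch below uses only the JDK to estimate how many records fit under a given term limit; TokenBudget is a hypothetical helper, and StringTokenizer only approximates what WhitespaceAnalyzer does.

```java
import java.util.StringTokenizer;

/** Hypothetical helper for estimating how many records fit under a per-field term limit. */
final class TokenBudget {

    /** Default per-field term limit in Lucene 2.x (mirrors IndexWriter.DEFAULT_MAX_FIELD_LENGTH). */
    static final int DEFAULT_MAX_FIELD_LENGTH = 10000;

    /** Counts whitespace-separated tokens, roughly as a whitespace tokenizer would. */
    static int countWhitespaceTokens(String text) {
        return new StringTokenizer(text).countTokens();
    }

    /** Estimates how many records of this shape fit within the limit. */
    static int recordsWithinLimit(String sampleRecord, int limit) {
        int perRecord = countWhitespaceTokens(sampleRecord);
        return perRecord == 0 ? 0 : limit / perRecord;
    }

    public static void main(String[] args) {
        String sample = "张1  25  男  1310000001  深圳市  某某公司  Sun Sep 27 09:12:40 CST 2009";
        System.out.println(countWhitespaceTokens(sample));                         // prints 12
        System.out.println(recordsWithinLimit(sample, DEFAULT_MAX_FIELD_LENGTH));  // prints 833
    }
}
```

With 12 tokens per record, the default limit covers roughly the first 833 records of a file indexed as a single field, which would match "the first few hundred". If the limit was the cause, the index has to be rebuilt after raising it. The Highlighter also limits how much text it analyzes by default, so a missing highlight on late records (as opposed to a missing hit) may have a separate cause.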
chonglou 2009-10-15
You did not add the content itself, so its text was never stored in the document.
chonglou 2009-10-15
document.add(new Field("content",txtReader));
Try passing a String instead of txtReader.
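
Some background on this suggestion: the Field(String, Reader) constructor indexes the text but does not store it, which is why doc.get("content") returns null even though the search finds a hit. A rough sketch of one way to follow the advice, under the assumption that each line of the file is one record and can become its own Document, so a hit maps directly to the matching line. RecordLines is a hypothetical helper, and the Lucene calls appear only as comments, so the block runs without Lucene on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: turns a text source into one String per record. */
final class RecordLines {

    /** Reads every non-blank line; each line is treated as one record. Closes the reader. */
    static List<String> readRecords(Reader source) throws IOException {
        List<String> records = new ArrayList<String>();
        BufferedReader in = new BufferedReader(source);
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().length() > 0) {
                    records.add(line);
                }
            }
        } finally {
            in.close();
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for new FileReader(dataFiles[i]) in TxtFileIndexer.
        String sample = "张1  25  男  1310000001\n\n张2  25  男  1310000002\n";
        for (String record : readRecords(new StringReader(sample))) {
            // In TxtFileIndexer, each record would become its own stored, tokenized Document:
            //   Document document = new Document();
            //   document.add(new Field("content", record, Field.Store.YES, Field.Index.TOKENIZED));
            //   indexWriter.addDocument(document);
            System.out.println(record);
        }
    }
}
```

With Field.Store.YES the text can be read back with doc.get("content") and handed to the Highlighter. Storing the whole file as a single String field would also address the null, but then a hit identifies only the file, not the record.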