Some problems searching with MMAnalyzer
zhanjianhua
2008-07-11
I recently started learning Lucene, and I've found that after tokenizing with MMAnalyzer, many English words can no longer be found by searches. I'm not sure whether this is the "stop word" behavior people talk about; if it is, what do I need to do so the analyzer keeps those words when tokenizing? My code is below. Please take a look and tell me how I can get search results. Changing new MMAnalyzer() to new SimpleAnalyzer() does make the search work, but is there any other way?
package ch2.lucenedemo.test;

import java.io.IOException;

import jeasy.analysis.MMAnalyzer;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;

public class SearchTimeCompareTest {

    public static void main(String[] args) {
        SearchTimeCompareTest st = new SearchTimeCompareTest();
        st.getSearch();
    }

    public void getSearch() {
        RAMDirectory ram = new RAMDirectory();
        IndexWriter writer;
        try {
            // The index is built with SimpleAnalyzer...
            writer = new IndexWriter(ram, new SimpleAnalyzer(), true);

            Document doc1 = new Document();
            doc1.add(new Field("content", "DIDO Thank you ",
                    Field.Store.YES, Field.Index.TOKENIZED));
            Document doc2 = new Document();
            doc2.add(new Field("content", "HERE and NOW 恭硕良",
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc1);
            writer.addDocument(doc2);
            writer.flush();
            writer.close();

            IndexSearcher searcher = new IndexSearcher(ram);
            BooleanQuery bq = new BooleanQuery();
            // ...but the query is parsed with MMAnalyzer.
            QueryParser parser = new QueryParser("content", new MMAnalyzer());
            Query query = parser.parse("NOW");
            bq.add(query, BooleanClause.Occur.SHOULD);
            System.out.println(bq.toString());

            Hits hits = searcher.search(bq);
            System.out.println(hits.length());
            if (hits.length() > 0) {
                for (int j = 0; j < hits.length(); j++) {
                    Document doc = hits.doc(j);
                    System.out.println(" " + hits.length() + " " + doc.get("content"));
                }
            }
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}

Why does it print () and then 0 when I run this? Is it because the stop words were removed from the index?
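A quick way to see what is happening is to print the tokens each analyzer actually emits. Below is a minimal standalone sketch, not part of the original post: it assumes the Lucene 2.x API used above and the jeasy MMAnalyzer jar, and the class name AnalyzerDebug is made up. If MMAnalyzer drops "NOW" as an English stop word, QueryParser is left with no terms, the BooleanQuery prints as an empty (), and zero hits come back. Note also that the index above is built with SimpleAnalyzer while the query is parsed with MMAnalyzer; using the same analyzer for indexing and searching is the usual way to avoid this kind of mismatch.

import java.io.IOException;
import java.io.StringReader;

import jeasy.analysis.MMAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerDebug {

    public static void main(String[] args) throws IOException {
        String text = "HERE and NOW 恭硕良";
        printTokens(new SimpleAnalyzer(), text);
        printTokens(new MMAnalyzer(), text);
    }

    // Dump every token the analyzer emits for the given text.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        StringBuilder line = new StringBuilder(analyzer.getClass().getName() + ":");
        Token t;
        while ((t = ts.next()) != null) {
            line.append(" [").append(t.termText()).append("]");
        }
        System.out.println(line);
    }
}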
zhanjianhua
2008-07-21
Maybe I didn't state my problem clearly. When I use this tokenizer I found online, it filters out a lot of English and Chinese words. Lucene's own StandardAnalyzer class does offer a constructor for this, Analyzer analyzer = new StandardAnalyzer(stopWords), but the MMAnalyzer I'm using doesn't seem to provide one (see the sketch after this post). Alternatively, can anyone point me to a Chinese tokenizer that works like this:
String str = "中文分词器";

It should split the string above into: 中, 文, 分, 词, 器, 中文, 文分, 分词, 词器, 中文分, 文分词, 分词器, 中文分词, 文分词器, 中文分词器. Can a tokenizer like this be found anywhere online? If so, please post a reply.
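(On the stop-word half of the question: in Lucene 2.x, StandardAnalyzer has a constructor that takes a String[] of stop words, and passing an empty array keeps every word. A minimal sketch follows; whether MMAnalyzer exposes anything comparable is not something I can confirm.)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class NoStopWordsDemo {
    public static void main(String[] args) {
        // Empty stop-word list: nothing is filtered, so English words
        // such as "and" and "now" survive indexing and query parsing.
        Analyzer analyzer = new StandardAnalyzer(new String[0]);
        System.out.println("Analyzer created: " + analyzer.getClass().getName());
    }
}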
chester60
2008-07-22
I don't know of one that splits text that finely, but the splitting rule itself is simple enough that you could write it yourself; see the sketch after this post.
The closest thing to this style of splitting is Paoding (庖丁解牛); you can find it by searching on javaeye.
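To make the "write one yourself" suggestion concrete, here is a minimal sketch assuming the Lucene 2.x API from the original post (Token, Tokenizer, TokenStream); the class names AllSubstringAnalyzer and AllSubstringTokenizer are made up for illustration. It emits every contiguous substring of the input, shortest first, which for 中文分词器 produces exactly the sequence listed in the question.

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

public class AllSubstringAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new AllSubstringTokenizer(reader);
    }

    private static class AllSubstringTokenizer extends Tokenizer {
        private String text;   // whole field value, read lazily on first call
        private int start = 0; // start offset of the next substring
        private int len = 1;   // length of the substrings currently emitted

        AllSubstringTokenizer(Reader input) {
            super(input);
        }

        public Token next() throws IOException {
            if (text == null) {
                // Read the entire input once; fine for short fields.
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[1024];
                int n;
                while ((n = input.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
                text = sb.toString();
            }
            // Walk all (start, length) pairs: every length-1 substring,
            // then every length-2 substring, and so on.
            while (len <= text.length()) {
                if (start + len <= text.length()) {
                    Token t = new Token(text.substring(start, start + len),
                                        start, start + len);
                    start++;        // slide the window one character right
                    return t;
                }
                start = 0;          // this length is exhausted,
                len++;              // grow the window and start over
            }
            return null;            // all substrings emitted
        }
    }
}

Keep in mind that an n-character field yields n(n+1)/2 tokens this way, so it only makes sense for short fields. If a bounded substring length is enough, I believe newer Lucene contrib releases include an NGramTokenizer that does much the same up to a configurable maximum size; Paoding, by contrast, is dictionary-based and will not emit every substring.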