Some problems when searching with MMAnalyzer

zhanjianhua 2008-07-11
I recently started learning Lucene, and I found that after tokenizing with MMAnalyzer a lot of English words cannot be found by search. I don't know whether this is the "stop word" behavior people talk about; if it is, what do I have to do so that those words are kept at tokenization time? My code is below. Please take a look and tell me how I can get any results. Changing new MMAnalyzer() to SimpleAnalyzer does make the search work, but is there another way?
package ch2.lucenedemo.test;

import java.io.IOException;

import jeasy.analysis.MMAnalyzer;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;

public class SearchTimeCompareTest {
    public static void main(String[] args) {
        new SearchTimeCompareTest().getSearch();
    }

    public void getSearch() {
        RAMDirectory ram = new RAMDirectory();
        try {
            // Build a small in-memory index; note the index side uses SimpleAnalyzer.
            IndexWriter writer = new IndexWriter(ram, new SimpleAnalyzer(), true);
            Document doc1 = new Document();
            doc1.add(new Field("content", "DIDO Thank you ", Field.Store.YES, Field.Index.TOKENIZED));
            Document doc2 = new Document();
            doc2.add(new Field("content", "HERE and NOW 恭硕良", Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc1);
            writer.addDocument(doc2);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(ram);
            // The query side, however, is parsed with MMAnalyzer, which
            // appears to filter "NOW" as a stop word, so the parsed
            // query comes out empty.
            QueryParser parser = new QueryParser("content", new MMAnalyzer());
            Query query = parser.parse("NOW");
            BooleanQuery bq = new BooleanQuery();
            bq.add(query, BooleanClause.Occur.SHOULD);
            System.out.println(bq.toString()); // prints ()

            Hits hits = searcher.search(bq);
            System.out.println(hits.length()); // prints 0
            for (int j = 0; j < hits.length(); j++) {
                Document doc = hits.doc(j);
                System.out.println("  " + hits.length() + " " + doc.get("content"));
            }
            searcher.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}



When I run it, why does it print

()
0

Is it because the stop word was removed from the index?
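To check which tokens actually survive, you can feed the text straight through the analyzer and print what comes out; whatever MMAnalyzer filters will simply not appear. A minimal sketch against the Lucene 2.x token API (the class name TokenDump is made up, and the field name "content" is arbitrary here):

import java.io.IOException;
import java.io.StringReader;

import jeasy.analysis.MMAnalyzer;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class TokenDump {
    public static void main(String[] args) throws IOException {
        // Dump every token MMAnalyzer emits for the second document's
        // text; any word it drops as a stop word will be missing.
        TokenStream ts = new MMAnalyzer().tokenStream(
                "content", new StringReader("HERE and NOW 恭硕良"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}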
zhanjianhua 2008-07-21
Maybe I didn't state my problem clearly. When I use this tokenizer I found online, it filters out a lot of English and Chinese words. Lucene's built-in StandardAnalyzer provides a constructor for this, Analyzer analyzer = new StandardAnalyzer(stopWords), but MMAnalyzer doesn't seem to offer one (a sketch of that StandardAnalyzer constructor follows this post). Failing that, is there a Chinese tokenizer that splits like this:
String str = "中文分词器";
split the content above into:
中,文,分,词,器,中文,文分,分词,词器,中文分,文分词,分词器,中文分词,文分词器,中文分词器
Can a tokenizer like this be found online? If anyone has one, please post a reply.
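For reference, a minimal sketch of the StandardAnalyzer constructor mentioned above: passing an empty stop-word array disables stop-word filtering entirely, so nothing is dropped at query time (the class name NoStopWordsTest is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class NoStopWordsTest {
    public static void main(String[] args) throws ParseException {
        // "and" is on Lucene's default English stop list; with an
        // empty array it survives instead of being dropped.
        StandardAnalyzer analyzer = new StandardAnalyzer(new String[0]);
        Query q = new QueryParser("content", analyzer).parse("and");
        System.out.println(q); // content:and (empty with the default list)
    }
}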
chester60 2008-07-22
I don't know of one that splits that finely, but a split like that is simple; you can write one yourself (see the sketch below).
The closest thing to this style of splitting is "庖丁解牛" (Paoding); search on JavaEye and you'll find it.
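If you do write it yourself: the split shown above is just every contiguous substring of the input, shortest first. A minimal sketch in plain Java (the class name SubstringSplitter is made up for illustration; it is not a Lucene Tokenizer, though wrapping it in one would be straightforward):

import java.util.ArrayList;
import java.util.List;

public class SubstringSplitter {
    // Emit every contiguous substring, ordered by length and then by
    // position, matching the ordering in the example above.
    public static List<String> split(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int len = 1; len <= text.length(); len++) {
            for (int start = 0; start + len <= text.length(); start++) {
                tokens.add(text.substring(start, start + len));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [中, 文, 分, 词, 器, 中文, 文分, 分词, 词器, 中文分,
        //         文分词, 分词器, 中文分词, 文分词器, 中文分词器]
        System.out.println(split("中文分词器"));
    }
}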