How to test index-writing efficiency for full-text search implemented with Lucene

stta04 2008-08-25

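How can I test the index-writing efficiency of a full-text search implemented with Lucene? What is the general approach to this kind of test?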

wsnet 2008-09-22
I came across a Benchmarking Indexing report from abroad; here it is for reference.

1 Hardware Environment

Dedicated machine for indexing: yes
CPU: Dual processor dual core Xeon CPU 3.00GHz; hyperthreading ON for 8 virtual cores
RAM: 8GB
Drive configuration: Dell EMC AX150 storage array fibre channel

2 Software environment

Lucene Version: 2.3.1
Java Version: Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
Java VM: Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_02-b05, mixed mode)
OS Version: Linux OpenSUSE 10.2 (64-bit X86-64)
Location of index: Filesystem, on attached storage

Lucene indexing variables

Number of source documents: 6,404,464
Total filesize of source documents: 141GB; Note that this is only the full-text: the metadata (title, author(s), abstract, keywords, journal name) are in addition to this
Average filesize of source documents: 22KB + metadata (see above)
Source documents storage location: Filesystem
File type of source documents: text (PDFs converted to text then gzipped)
Parser(s) used, if any: None, but the text files were gzipped and had to be decompressed by the Java application that also did the indexing
Analyzer(s) used: StandardAnalyzer
Number of fields per document: 24
Type of fields: all text; 20 stored; 3 indexed and tokenized with term vectors (full-text [not stored], title, abstract); 10 stored with no parsing
Index persistence: FSDirectory
Index size: 83GB
Number of terms: 143,298,010
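
Since the source files in this report were gzipped text that the indexing application had to decompress itself, the decompression cost is part of the measured indexing time. Below is a minimal sketch of that step using only the JDK. The class and method names are hypothetical, and UTF-8 is an assumption (the report does not state the encoding).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Lucene: reads a gzipped text document into a String
// so that it can be handed to whatever code builds the index document.
class GzipTextReader {

    /** Decompresses the whole stream and decodes it as UTF-8 (encoding is an assumption). */
    static String readGzippedText(InputStream in) {
        try (GZIPInputStream gz = new GZIPInputStream(in);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text in memory; only used here to build sample input. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = gzip("full text of one source document");
        String text = readGzippedText(new ByteArrayInputStream(sample));
        System.out.println(text.length() + " characters read");
    }
}
```

In a real test you would pass a `FileInputStream` for each source file instead of the in-memory sample. To separate decompression cost from Lucene's own cost, time this step on its own as well.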


Figures

Time taken (in ms/s as an average of at least 3 indexing runs): 20.5 hours
Time taken / 1000 docs indexed: 11.5 seconds
Memory consumption: -Xms4000m -Xmx6000m
Query speed (average time per query, not measuring any overhead outside Lucene; query types not specified): <0.01s
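
To produce figures like these yourself, the basic approach is to time the whole indexing job several times, average the runs, and normalize by document count. The sketch below uses only the JDK. The class and method names are hypothetical, and the `Runnable` stands in for your own code that opens the index writer, adds every document, and closes it.

```java
// Hypothetical timing harness, not part of Lucene.
class IndexBenchmark {

    /** Runs the indexing job `runs` times and returns the mean wall-clock time in seconds. */
    static double averageSeconds(int runs, Runnable indexJob) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        double total = 0.0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            indexJob.run();
            total += (System.nanoTime() - start) / 1e9;
        }
        return total / runs;
    }

    /** Normalizes a total indexing time to seconds per 1000 documents. */
    static double secondsPer1000Docs(double totalSeconds, long docCount) {
        return totalSeconds / (docCount / 1000.0);
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the real indexing code under test.
        Runnable indexJob = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        double avg = averageSeconds(3, indexJob);
        System.out.printf("Average over 3 runs: %.4f s%n", avg);

        // Cross-check of the report's own numbers: 20.5 hours for 6,404,464 documents.
        double reported = secondsPer1000Docs(20.5 * 3600, 6_404_464L);
        System.out.printf("Reported run: %.2f s per 1000 docs%n", reported);
    }
}
```

The cross-check gives roughly 11.5 seconds per 1000 documents, which agrees with the figure quoted in the report. For a fair comparison between runs, start each one from an empty index directory and keep the JVM heap settings fixed.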