Question about indexing PDF documents with Lucene (discussion, page 2) - Lucene enthusiasts


Question about indexing PDF documents with Lucene

javaeyes 2008-09-02

Document doc = LucenePDFDocument.getDocument(new File("C:\\file\\LuceneInActionCH.pdf"));
The problem is definitely in this line.

brunoplum 2008-09-02

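PDFBox doesn't support Chinese. I suggest a different approach: use xpdf to parse the PDF file. It simply gives you back a String, and what you do with it after that is up to you. Here's the code: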

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

	// Read the text content of a PDF by calling xpdf's pdftotext
	public String getPdfContent(String filePath) {

		// Path to the pdftotext executable
		String execute = "E:\\xpdf-3.02\\pdftotext.exe";
		// "-enc UTF-8" sets the output encoding, "-q" suppresses messages, "-" writes the text to stdout
		String[] cmd = new String[] { execute, "-enc", "UTF-8", "-q", filePath, "-" };

		StringBuilder sb = new StringBuilder();
		try {
			// Run the native command
			Process p = Runtime.getRuntime().exec(cmd);
			// Decode the output as UTF-8 and join the lines with spaces
			try (BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream(), "UTF-8"))) {
				String line;
				while ((line = br.readLine()) != null) {
					sb.append(line);
					sb.append(" ");
				}
			}
		} catch (IOException e) {
			// The command could not be started or its output could not be read;
			// report it and return whatever was read so far
			e.printStackTrace();
		}
		return sb.toString();
	}
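
For anyone adapting the method above: it never checks pdftotext's exit code, so a failed conversion just comes back as an empty string. Below is a rough sketch of one way to handle that with ProcessBuilder, assuming the same pdftotext flags as in the post. The class and method names (`PdfToTextSketch`, `buildCommand`, `readUtf8`, `extract`) are made up for illustration and are not part of xpdf or Lucene.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class PdfToTextSketch {

    // Builds the same argument list as the post: UTF-8 output, quiet mode, text to stdout.
    static List<String> buildCommand(String pdftotextPath, String pdfPath) {
        return Arrays.asList(pdftotextPath, "-enc", "UTF-8", "-q", pdfPath, "-");
    }

    // Decodes a stream as UTF-8 and joins its lines with single spaces.
    static String readUtf8(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append(' ');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Runs pdftotext and fails loudly on a non-zero exit code.
    // Assumes the executable really exists at pdftotextPath.
    static String extract(String pdftotextPath, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdfPath));
        // Send the child's stderr to our own stderr so it cannot fill up an unread pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        String text = readUtf8(p.getInputStream());
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return text;
    }

    public static void main(String[] args) {
        // Stand-in for pdftotext output: two lines of UTF-8 text containing Chinese
        // characters (written as Unicode escapes to stay independent of source encoding).
        byte[] fake = "Lucene \u5b9e\u6218\n\u7b2c\u4e00\u7ae0".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(new ByteArrayInputStream(fake)));
    }
}
```

The `main` method only exercises the decoding step on in-memory bytes, so it runs without xpdf installed. To extract a real file you would call `extract` with your own pdftotext path.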

zhjt_88 2008-09-02

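String indexPath = "C:\\index";
IndexReader reader = IndexReader.open(indexPath);
System.out.println(indexPath + "|index version: " + reader.getVersion() + "|document count: " + reader.numDocs());
// Document numbers run up to maxDoc(), so skip any deleted slots
for (int i = 0; i < reader.maxDoc(); i++) {
	if (reader.isDeleted(i)) continue;
	System.out.println(reader.document(i));
}
reader.close();
Print out your index and see what it actually contains.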

xietingyan 2008-09-03

As I recall, pdfbox 0.7.3 already supported Chinese when I used it a long time ago.

cobola 2009-07-29

The current version of Lucene has a compatibility problem.

This line can no longer create the document:
Document doc = LucenePDFDocument.getDocument(new File("C:\\file\\LuceneInActionCH.pdf"));

pjw0221 2010-03-16

The problem is actually simple. When pdfbox builds the document for the index, the contents field is indexed but not stored.
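
To make the indexed-but-not-stored point concrete, here is a toy model in plain Java. It is not Lucene's API and says nothing about how Lucene is implemented internally. `ToyIndex`, `addField`, `search` and `storedValue` are hypothetical names, and the `path` field is only an example of a stored field.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: shows why a field can be searchable yet print back as nothing.
class ToyIndex {
    // "field:term" -> ids of the documents containing it (what indexing gives you)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> field name -> original text (what storing gives you)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void addField(int docId, String name, String text, boolean store, boolean index) {
        if (index) {
            // Naive whitespace tokenizer, standing in for a real analyzer
            for (String term : text.toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(name, text);
        }
    }

    // Returns the ids of documents whose indexed field contains the term.
    Set<Integer> search(String name, String term) {
        return postings.getOrDefault(name + ":" + term.toLowerCase(), new TreeSet<>());
    }

    // Returns the original text of a stored field, or null if it was not stored.
    String storedValue(int docId, String name) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(name);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.addField(0, "path", "C:\\file\\LuceneInActionCH.pdf", true, false);
        idx.addField(0, "contents", "lucene in action", false, true);
        System.out.println(idx.search("contents", "lucene")); // the document is found
        System.out.println(idx.storedValue(0, "contents"));   // but its text was never kept
        System.out.println(idx.storedValue(0, "path"));       // a stored field comes back
    }
}
```

In Lucene 2.x the same choice is made through `Field.Store` and `Field.Index` when a field is constructed. That is why printing `reader.document(i)` can show no contents text even though searches against contents still match.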