Has anyone run into out-of-memory errors when extracting text with htmlparser? - lucene - Lucene Enthusiasts

[lucene] Has anyone run into out-of-memory errors when extracting text with htmlparser?

85600367 2010-12-08

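	import java.io.File;
	import org.htmlparser.Node;
	import org.htmlparser.Parser;
	import org.htmlparser.util.NodeList;
	import org.htmlparser.util.ParserException;
	import org.htmlparser.visitors.HtmlPage;

	// Extracts the plain text of the <body> of a local HTML file.
	public static String getDocument(File html) {
		StringBuilder text = new StringBuilder();
		try {
			Parser parser = new Parser(html.getAbsolutePath());
			parser.setEncoding("UTF-8");
			// parser.setEncoding("gbk");
			HtmlPage visitor = new HtmlPage(parser);
			parser.visitAllNodesWith(visitor);
			NodeList nodes = visitor.getBody();
			for (int i = 0; i < nodes.size(); i++) {
				Node node = nodes.elementAt(i);
				// StringBuilder avoids building a new String on every iteration.
				text.append(node.toPlainTextString());
			}
		} catch (ParserException e) {
			// Parsing failed: log it and return whatever text was collected so far
			// (the original code would hit a NullPointerException here).
			e.printStackTrace();
		}
		return text.toString();
	}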

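I am reading the text content of HTML files in a loop, and after roughly 1000+ files
(each HTML file is about 1 MB)
the line parser.visitAllNodesWith(visitor); throws an out-of-memory error.
Has anyone else run into this? Or can anyone suggest how to approach it?
Tomcat already has 1 GB of memory.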

evil9999 2010-12-08

new String(string)

85600367 2010-12-09

Uh, what does this have to do with String?
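
The thread never explains the `new String(string)` hint, so the following is only one plausible reading. On older JDKs (before Java 7u6), `substring` returned a string that shared the character array of the string it was cut from, so keeping a small substring could keep a whole 1 MB page text alive. Copying with `new String(...)` detaches it. The sketch below assumes the application keeps short snippets of each page; the class and method names (`DetachedStrings`, `detach`, `keepSnippets`) are hypothetical and do not come from the original post. Whether this is the actual cause of the poster's error is not confirmed.

```java
import java.util.ArrayList;
import java.util.List;

final class DetachedStrings {
    // Returns a copy of s. On JDKs where substring shares the parent's
    // character array, the copy gets its own, smaller storage.
    static String detach(String s) {
        return new String(s);
    }

    // Hypothetical helper: keep at most `limit` leading characters of each
    // page text, without holding on to the full page text.
    static List<String> keepSnippets(List<String> pageTexts, int limit) {
        List<String> snippets = new ArrayList<String>();
        for (String text : pageTexts) {
            int end = Math.min(limit, text.length());
            snippets.add(detach(text.substring(0, end)));
        }
        return snippets;
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<String>();
        pages.add("first page text, imagine about 1 MB of it");
        pages.add("short");
        System.out.println(keepSnippets(pages, 10));
    }
}
```

On current JDKs `substring` already copies, so the extra `new String(...)` is harmless but no longer needed.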

85600367 2010-12-22

I stumbled upon jsoup, another parsing tool.
I dropped htmlparser right away.
It really is much more convenient.
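
The poster did not share the jsoup version of the code, so this is only a minimal sketch of what an equivalent of `getDocument` might look like, assuming jsoup is on the classpath and the files are UTF-8. The class name `JsoupTextExtractor` is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class JsoupTextExtractor {
    // Parses a local HTML file and returns the text of its <body>.
    static String getDocument(File html) throws IOException {
        Document doc = Jsoup.parse(html, "UTF-8");
        return doc.body().text();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JsoupTextExtractor page1.html page2.html ...
        for (String path : args) {
            System.out.println(getDocument(new File(path)));
        }
    }
}
```

Each call builds its own `Document`, which becomes eligible for garbage collection once the returned text is all that the caller keeps.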
