    from crawler.cx_extractor_Python import cx_extractor_Python

    cx = cx_extractor_Python()
    # test_html = cx.readHtml("E:\\Documents\\123.html")
    test_html = cx.getHtml('http://news.163.com/16/0101/10/BC84MRHS00014AED.html')
    content = cx.filter_tags(test_html)
    s = cx.getText(content)
    print(s)
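@chrislinan The project you recommended has been included in HelloGitHub issue 30, and you have been added to the contributors list.
You are welcome to keep recommending excellent projects like this one and to invite others to join the HelloGitHub project. Thank you 🙏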
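Project recommendation
cx-extractor-python
https://github.com/chrislinan/cx-extractor-python
Adds multi-language support
This is a tool for extracting the main text of web pages. It is a Python version of the cx-extractor algorithm, improved over the original so that it supports both Chinese and English. It works well on news pages.
It does not need to parse the HTML, so extraction is fast, and its accuracy is high.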
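To illustrate how main text can be extracted without building an HTML parse tree, here is a minimal sketch of a text-density approach: tags are stripped with regular expressions, and the densest run of consecutive lines is kept. This is not the project's actual code. The function names `strip_tags` and `extract_main_text`, and the `block_width` and `threshold` parameters and their defaults, are assumptions made for this example.

```python
import re


def strip_tags(html: str) -> str:
    """Drop scripts, styles, comments and tags with regexes; line breaks are kept."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    return re.sub(r"(?s)<[^>]+>", "", html)


def extract_main_text(html: str, block_width: int = 3, threshold: int = 40) -> str:
    """Return the densest run of text lines (hypothetical parameters, see lead-in)."""
    lines = [line.strip() for line in strip_tags(html).split("\n")]

    # Block i holds the number of characters in lines i .. i + block_width - 1.
    blocks = [sum(len(text) for text in lines[i:i + block_width])
              for i in range(len(lines))]

    def weight(indices):
        return sum(len(lines[j]) for j in indices)

    best, current = [], []
    # The trailing 0 is a sentinel that closes a run ending on the last line.
    for i, size in enumerate(blocks + [0]):
        if size >= threshold:
            current.append(i)
        else:
            if weight(current) > weight(best):
                best = current
            current = []

    return "\n".join(lines[j] for j in best if lines[j])
```

Calling `extract_main_text(page_source)` on a news page should return the article paragraphs and skip short navigation or footer lines, provided the body is the densest region of the page. The threshold will likely need tuning per site.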