HTML Parser

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion.
Primarily used for transformation or extraction, it features filters, visitors,
custom tags and easy-to-use JavaBeans. It is a fast, robust and well-tested package.
Welcome to the homepage of HTMLParser - a super-fast real-time
parser for real-world HTML. What has attracted most developers to HTMLParser has
been its simplicity of design, its speed and its ability to handle streaming
real-world HTML.
The two fundamental use cases that are handled by the parser are
extraction and transformation
(the synthesis use case, where HTML pages are created from scratch, is better
handled by other tools closer to the source of data). While prior versions
concentrated on data extraction from web pages, Version 1.4 of the
HTMLParser has substantial improvements in the area of transforming web
pages, with simplified tag creation and editing, and verbatim toHtml() method
output.
In general, to use the HTMLParser you will need to be able to write code in
the Java programming language. Although some example programs are provided
that may be useful as they stand, it's more than likely you will need (or
want) to create your own programs or modify the ones provided to match your
intended application.
To use the library, you will need to add either the htmllexer.jar or
htmlparser.jar to your classpath when compiling and running. The
htmllexer.jar provides low-level access to generic string, remark and tag nodes on
the page in a linear, flat, sequential manner. The htmlparser.jar, which
includes the classes found in htmllexer.jar, provides access to a page as a
sequence of nested differentiated tags containing string, remark and other
tag nodes. So where the output from calls to the lexer nextNode() method
might be:
<html>
<head>
<title>
"Welcome"
</title>
</head>
<body>
etc...
the output from the parser NodeIterator would nest the tags as children of
the <html>, <head> and other nodes (here represented by indentation):
<html>
    <head>
        <title>
            "Welcome"
        </title>
    </head>
    <body>
        etc...
The parser attempts to balance opening tags with ending tags to present the
structure of the page, while the lexer simply spits out nodes. If your
application requires only modest structural knowledge of the page, and is
primarily concerned with individual, isolated nodes, you should consider
using the lightweight lexer. But if your application requires knowledge of
the nested structure of the page, for example processing tables, you will
probably want to use the full parser.
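
The difference between the flat, lexer-style view and the nested, parser-style view can be illustrated with a small self-contained sketch. It deliberately does not call the HTMLParser library: the class FlatVsNested and its methods flatNodes and nestedLines are hypothetical names made up for this illustration, and the code assumes well-formed input in which every opening tag has a matching closing tag. The real parser does considerably more work to balance tags in real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: this is NOT HTMLParser's implementation or API.
// It assumes balanced tags and ignores comments, void elements and attributes
// that contain '>' characters.
public class FlatVsNested {

    // Matches either one tag (such as <title> or </title>) or one run of text.
    private static final Pattern NODE = Pattern.compile("<[^>]+>|[^<]+");

    /** Lexer-style view: every node in document order, with no structure. */
    public static List<String> flatNodes(String html) {
        List<String> nodes = new ArrayList<>();
        Matcher matcher = NODE.matcher(html);
        while (matcher.find()) {
            String node = matcher.group().trim();
            if (!node.isEmpty()) { // skip whitespace-only text between tags
                nodes.add(node);
            }
        }
        return nodes;
    }

    /** Parser-style view: the same nodes, indented by nesting depth. */
    public static List<String> nestedLines(String html) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        for (String node : flatNodes(html)) {
            boolean endTag = node.startsWith("</");
            boolean startTag = node.startsWith("<") && !endTag;
            if (endTag && depth > 0) {
                depth--; // a closing tag lines up with its opening tag
            }
            lines.add(indent(depth) + node);
            if (startTag) {
                depth++; // everything after an opening tag is one level deeper
            }
        }
        return lines;
    }

    // Builds four spaces of indentation per nesting level.
    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("    ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Welcome</title></head><body></body></html>";
        System.out.println("Flat (lexer-style):");
        for (String node : flatNodes(html)) {
            System.out.println(node);
        }
        System.out.println("Nested (parser-style):");
        for (String line : nestedLines(html)) {
            System.out.println(line);
        }
    }
}
```

An application that only inspects isolated nodes, for example collecting every tag on a page, needs nothing beyond the flat list. Processing a table, by contrast, depends on knowing which cells belong to which row, which is the kind of structure the nested view exposes.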

Original article: https://www.cnblogs.com/lexus/p/2388604.html