PHP HTML DOM / Ganon & phpQuery & Simple HTML DOM
PHP Simple HTML DOM Parser
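I have used this one for several years, ever since its first beta. It is lightweight and works well: a single file of 1,393 lines of code.
Project page: http://simplehtmldom.sourceforge.net/
Manual: http://simplehtmldom.sourceforge.net/manual.htm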
- An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way.
- Requires PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.
PHP Simple HTML DOM Parser usage examples
Reading
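// Load the library first (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';
$html = file_get_html('http://www.google.com/');
// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}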
Modifying
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';
echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>
Fixing absolute URLs
// $uri is assumed to be a URI object for the page's own address that offers
// resolve() and __toString(); Simple HTML DOM itself does not provide one.
$baseURI = $uri;
// A <base href> element, if present, overrides the page address
foreach ($html->find('base[href]') as $elem) {
    $baseURI = $uri->resolve($elem->href);
}
foreach ($html->find('*[src]') as $elem) {
    $elem->src = $baseURI->resolve($elem->src)->__toString();
}
foreach ($html->find('*[href]') as $elem) {
    if (strtoupper($elem->tag) === 'BASE') continue;
    $elem->href = $baseURI->resolve($elem->href)->__toString();
}
foreach ($html->find('form[action]') as $elem) {
    $elem->action = $baseURI->resolve($elem->action)->__toString();
}
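The snippet above calls `$uri->resolve()` and `__toString()`, but neither Simple HTML DOM nor PHP itself ships such an object, so `$uri` has to come from somewhere else. One possibility is the hypothetical `SimpleUri` class below, a minimal sketch assuming PHP 7+ and http(s) pages. The class name and its methods are illustrative only; it ignores userinfo and is not a complete RFC 3986 implementation.

```php
<?php
// Hypothetical helper, NOT part of Simple HTML DOM: it only mimics the
// resolve()/__toString() interface that the snippet above expects from $uri.
final class SimpleUri
{
    /** @var string */
    private $uri;

    public function __construct(string $uri)
    {
        $this->uri = $uri;
    }

    public function __toString(): string
    {
        return $this->uri;
    }

    /** Resolve a (possibly relative) reference against this URI. */
    public function resolve(string $ref): SimpleUri
    {
        $rel  = parse_url($ref);
        $base = parse_url($this->uri);
        // Absolute references (http:, mailto:, javascript:, ...) and anything
        // that cannot be parsed are returned unchanged.
        if ($rel === false || isset($rel['scheme'])
            || $base === false || !isset($base['scheme'], $base['host'])) {
            return new self($ref);
        }
        // Network-path reference such as "//cdn.example.com/lib.js".
        if (isset($rel['host'])) {
            return new self($base['scheme'] . ':' . $ref);
        }
        $authority = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');
        $basePath = $base['path'] ?? '/';
        $path     = $rel['path'] ?? '';
        if ($path === '') {
            // "?query" or "#fragment" only: keep the base path.
            $path  = $basePath;
            $query = $rel['query'] ?? ($base['query'] ?? null);
        } else {
            if ($path[0] !== '/') {
                // Relative path: replace the last segment of the base path.
                $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
            }
            $path  = self::removeDotSegments($path);
            $query = $rel['query'] ?? null;
        }
        return new self($authority . $path
            . ($query !== null ? '?' . $query : '')
            . (isset($rel['fragment']) ? '#' . $rel['fragment'] : ''));
    }

    /** Collapse "." and ".." segments in an absolute path. */
    private static function removeDotSegments(string $path): string
    {
        $segments = explode('/', $path);
        $out = [];
        foreach ($segments as $segment) {
            if ($segment === '..') {
                if (count($out) > 1) {   // never pop the leading empty segment
                    array_pop($out);
                }
            } elseif ($segment !== '.') {
                $out[] = $segment;
            }
        }
        $last = end($segments);
        if ($last === '.' || $last === '..') {
            $out[] = '';                 // keep the trailing slash
        }
        return implode('/', $out);
    }
}

$uri = new SimpleUri('http://example.com/a/b/c.html');
echo $uri->resolve('../img/x.png'), "\n";             // http://example.com/a/img/x.png
echo $uri->resolve('/root.css'), "\n";                // http://example.com/root.css
echo $uri->resolve('//cdn.example.com/lib.js'), "\n"; // http://cdn.example.com/lib.js
```

For production scraping, a maintained URI library is probably the safer choice; the sketch is only meant to make the interface assumed by the loop concrete.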
Ganon
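Project page: http://code.google.com/p/ganon/
Documentation: http://code.google.com/p/ganon/w/list
This one is very powerful. I discovered it only recently and have added it to my everyday toolkit: a single file of 2,856 lines of code.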
The Ganon library gives access to HTML/XML documents in a very simple object oriented way. It eases modifying the DOM and makes finding elements easy with CSS3-like queries.
- A universal tokenizer
- An HTML/XML/RSS DOM parser
  - Ability to manipulate elements and their attributes
  - Supports invalid HTML
  - Supports UTF8
  - Can perform advanced CSS3-like queries on elements (like jQuery, with namespaces supported)
- An HTML beautifier (like HTML Tidy)
  - Minify CSS and JavaScript
  - Sort attributes, change character case, correct indentation, etc.
- Extensible
  - Parsing documents using callbacks based on current character/token
  - Operations separated in smaller functions for easy overriding
- Fast
- Easy
Ganon usage examples:
$html = file_get_dom('http://code.google.com/');
Access
Accessing elements is made easy through the CSS3-like selectors and the object model.
// Find every paragraph tag that has a class attribute and print the
// value of the class attribute
foreach ($html('p[class]') as $element) {
    echo $element->class, "<br>\n";
}
// Find the first div with ID "gc-header" and print the plain text of
// the parent element (plain text means no HTML tags, just the text)
echo $html('div#gc-header', 0)->parent->getPlainText();
// Find out how many tags there are which are "ns:tag" or "div", but not
// "a" and do not have a class attribute
echo count($html('(ns|tag, div + !a)[!class]'));
Modification
Elements can be easily modified after you've found them.
// Find all paragraph tags nested inside a div tag, change
// their ID attribute and print the new HTML code
foreach ($html('div p') as $index => $element) {
    $element->id = "id$index";
}
echo $html;
// Center all the links inside a document which start with "http://"
// and print out the new HTML
foreach ($html('a[href ^= "http://"]') as $element) {
    $element->wrap('center');
}
echo $html;
// Find all odd indexed "td" elements and change the HTML to make them links
foreach ($html('table td:odd') as $element) {
    $element->setInnerText('<a href="#">' . $element->getPlainText() . '</a>');
}
echo $html;
Beautify
Ganon can also help you beautify your code and format it properly.
dom_format($html, array('attributes_case' => CASE_LOWER));
echo $html;
phpQuery
This one is heavyweight and fairly resource-hungry: a single file of 5,702 lines of code.
phpQuery is a server-side, chainable, CSS3 selector-driven Document Object Model (DOM) API based on the jQuery JavaScript library.
The library is written in PHP 5 and also provides a Command Line Interface (CLI).
Project page: http://code.google.com/p/phpquery/
Documentation: http://code.google.com/p/phpquery/wiki/Manual
phpQuery Examples
CLI
Fetch number of downloads of all release packages
--find '.vt.col_4 a' --contents \
--getString null array_sum
PHP
Examples from demo.php
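// load the library (assumed path; adjust to where phpQuery.php lives)
require('phpQuery/phpQuery.php');
// for a PEAR installation use this instead
// require('phpQuery.php');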
INITIALIZE IT
// $doc = phpQuery::newDocumentXML();
// $doc = phpQuery::newDocumentFileXHTML('test.html');
// $doc = phpQuery::newDocumentFilePHP('test.php');
// $doc = phpQuery::newDocument('test.xml', 'application/rss+xml');
// this one defaults to text/html in utf8
$doc = phpQuery::newDocument('<div/>');
FILL IT
$doc['div']->append('<ul></ul>');
// array set changes inner html
$doc['div ul'] = '<li>1</li><li>2</li><li>3</li>';
MANIPULATE IT
$li = null;
$doc['ul > li']
    ->addClass('my-new-class')
    ->filter(':last')
    ->addClass('last-li')
    // save it anywhere in the chain
    ->toReference($li);
SELECT DOCUMENT
phpQuery::selectDocument($doc);
// documents are selected when created or by above method
// query all unordered lists in last selected document
pq('ul')->insertAfter('div');
ITERATE IT
foreach (pq('li') as $li) {
    // iteration returns PLAIN dom nodes, NOT phpQuery objects
    $tagName = $li->tagName;
    $childNodes = $li->childNodes;
    // so you NEED to wrap it within phpQuery, using pq();
    pq($li)->addClass('my-second-new-class');
}
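As the comments in the loop say, iteration yields plain DOM nodes. phpQuery builds on PHP's DOM extension, so these are ordinary `DOMElement` objects. The sketch below uses only the built-in `DOMDocument` and `DOMXPath` classes (phpQuery itself is not loaded, and the markup is made up for illustration) to show what such a node offers without the `pq()` wrapper: there is no `addClass()`, so the class attribute has to be edited by hand.

```php
<?php
// Built-in DOM extension only; phpQuery is not loaded in this sketch.
$dom = new DOMDocument();
$dom->loadHTML('<ul><li>1</li><li class="last-li">3</li></ul>');

$xpath = new DOMXPath($dom);
$tagNames = [];
foreach ($xpath->query('//li') as $li) {
    // $li is a DOMElement: it has tagName and childNodes, but no addClass().
    $tagNames[] = $li->tagName;
    // Append the new class while keeping any class that is already set.
    $classes = trim($li->getAttribute('class') . ' my-second-new-class');
    $li->setAttribute('class', $classes);
}

// Serialize just the <ul> element to inspect the result.
$ul = $dom->getElementsByTagName('ul')->item(0);
echo $dom->saveHTML($ul), "\n";
```

Wrapping each node with `pq($li)` avoids this manual attribute handling, which is why the loop above does so.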
PRINT OUTPUT
print phpQuery::getDocument($doc->getDocumentID());
// 2nd way
print phpQuery::getDocument(pq('div')->getDocumentID());
// 3rd way
print pq('div')->getDocument();
// 4th way
print $doc->htmlOuter();
// 5th way
print $doc;
// another...
print $doc['ul'];
- westmhx
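Haha, a must-have for PHP fans!
Scraping with PHP: wrestling with the DOM nearly wore me out.
The first one, Simple HTML DOM, I never did learn to use properly.
As a beginner who has only been teaching myself PHP for a few days, I'm a bit embarrassed.
What a pain. The second one runs too slowly; the first one is fast by comparison. But it still won't run on sinaapp, and I don't know why.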