kuria / simple-html-parser
Minimalistic HTML parser
Requires
- php: >=7.1
Requires (Dev)
- kuria/dev-meta: ^0.1.0
This package is auto-updated.
Last update: 2024-12-22 18:48:00 UTC
README
Minimalistic HTML parser.
Note
If you need advanced DOM manipulation, consider using kuria/dom instead.
Contents
- Features
- Requirements
- Usage
- Creating the parser
- Iterating elements
- Element array structure
- Tag name and attribute normalization
- Managing parser state
- getHtml() - get HTML content
- getSlice() - get part of the HTML
- getSliceBetween() - get content between 2 elements
- getLength() - get length of the HTML
- getEncoding() - determine encoding of the HTML document
- getEncodingTag() - find the encoding-specifying meta tag
- usesFallbackEncoding() - see if the fallback encoding is being used
- setFallbackEncoding() - set fallback encoding
- getDoctypeElement() - find the doctype element
- escape() - escape a string
- find() - match a specific element
- getOffset() - get current offset
- Example: Reading document's title
Features
- parsing opening tags
- parsing closing tags
- parsing comments
- parsing DTDs
- extracting parts of HTML content
- determining encoding of HTML documents
- handling "raw text" tags (<style>, <script>, <noscript>, etc.)
Requirements
- PHP 7.1+
Usage
Creating the parser
<?php

use Kuria\SimpleHtmlParser\SimpleHtmlParser;

$parser = new SimpleHtmlParser($html);
Iterating elements
The parser implements Iterator, so it can be traversed using the standard iterator methods.
<?php

foreach ($parser as $element) {
    print_r($element);
}

<?php

$parser->rewind();

if ($parser->valid()) {
    print_r($parser->current());
}
Element array structure
Each element is an array with the following keys. Some keys are available only for specific element types.
Element types
- SimpleHtmlParser::COMMENT - a comment, e.g. <!-- foo -->
- SimpleHtmlParser::OPENING_TAG - an opening tag, e.g. <span class="bar">
- SimpleHtmlParser::CLOSING_TAG - a closing tag, e.g. </span>
- SimpleHtmlParser::OTHER - special element, e.g. doctype, XML header
- SimpleHtmlParser::INVALID - invalid or incomplete tags
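As a rough sketch of how the type constants might be used, the loop below tallies the elements of a small document by type. It assumes that each element array exposes its type under a `type` key and that the package was installed with Composer (hence the `vendor/autoload.php` path); neither detail is stated in the text above, and the `$labels` map is purely illustrative.

```php
<?php

// Assumption: the package was installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Kuria\SimpleHtmlParser\SimpleHtmlParser;

$html = '<!doctype html><!-- note --><p class="intro">Hello</p>';
$parser = new SimpleHtmlParser($html);

// Hypothetical human-readable labels for the documented type constants.
$labels = [
    SimpleHtmlParser::COMMENT => 'comment',
    SimpleHtmlParser::OPENING_TAG => 'opening tag',
    SimpleHtmlParser::CLOSING_TAG => 'closing tag',
    SimpleHtmlParser::OTHER => 'other',
    SimpleHtmlParser::INVALID => 'invalid',
];

$counts = [];

foreach ($parser as $element) {
    // Assumption: the element type is stored under the "type" key.
    $label = $labels[$element['type']] ?? 'unknown';
    $counts[$label] = ($counts[$label] ?? 0) + 1;
}

print_r($counts);
```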
Tag name and attribute normalization
Tag and attribute names that contain only ASCII characters are lowercased.
Managing parser state
The state methods can be used to temporarily store and/or revert state of the parser.
- pushState() - push current state of the parser onto the stack
- popState() - pop (discard) state stored on top of the stack
- revertState() - pop and restore state stored on top of the stack
- countStates() - count the number of states currently on the stack
- clearStates() - discard all states
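One possible use of the state stack is a look-ahead: remember the current position, search forward, then return to where the parser was. The sketch below uses only methods described in this README; the sample HTML and the Composer autoload path are assumptions made for illustration.

```php
<?php

// Assumption: the package was installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Kuria\SimpleHtmlParser\SimpleHtmlParser;

$html = '<title>Example</title><p>First</p><p>Second</p>';
$parser = new SimpleHtmlParser($html);

$offsetBefore = $parser->getOffset();

// Remember the current position, then look ahead for a <p> tag.
$parser->pushState();
$paragraph = $parser->find(SimpleHtmlParser::OPENING_TAG, 'p');

// Pop the stored state and restore it, undoing the look-ahead.
$parser->revertState();

// If the state was restored, the offset matches the one recorded earlier
// and the stack is empty again.
var_dump($parser->getOffset() === $offsetBefore);
var_dump($parser->countStates());
```

Calling popState() instead of revertState() would discard the stored state and leave the parser at the position reached by find().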
getHtml() - get HTML content
The getHtml() method may be used to get the entire HTML content or the HTML of a single element.
<?php

$parser->getHtml(); // get entire document
$parser->getHtml($element); // get single element
getSlice() - get part of the HTML
The getSlice() method returns a part of the HTML content.
Returns an empty string for negative or out-of-bounds ranges.
<?php

$slice = $parser->getSlice(100, 200);
getSliceBetween() - get content between 2 elements
The getSliceBetween() method returns the part of the HTML content that lies between two elements (usually an opening tag and its closing tag).
<?php

$slice = $parser->getSliceBetween($openingTag, $closingTag);
getLength() - get length of the HTML
The getLength() method returns the total length of the HTML content.
getEncoding() - determine encoding of the HTML document
The getEncoding() method attempts to determine the encoding of the HTML document.
If the encoding cannot be determined or is not supported, the fallback encoding will be used instead.
This method does not alter the parser's state.
getEncodingTag() - find the encoding-specifying meta tag
The getEncodingTag() method attempts to find the <meta charset="..."> or <meta http-equiv="Content-Type" content="..."> tag in the first 1024 bytes of the HTML document.
Returns NULL if the tag was not found.
This method does not alter the parser's state.
usesFallbackEncoding() - see if the fallback encoding is being used
The usesFallbackEncoding() method indicates whether the fallback encoding is being used. This is the case when the document's encoding is not specified or is not supported.
This method does not alter the parser's state.
setFallbackEncoding() - set fallback encoding
The setFallbackEncoding() method specifies an encoding to be used in case the document has no encoding specified or specifies an unsupported encoding.
The fallback encoding must be supported by htmlspecialchars().
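The encoding methods can be tried together on a document that declares no encoding. This is a minimal sketch that assumes a Composer install and uses ISO-8859-1 as an example fallback (one of the encodings htmlspecialchars() supports); the exact form of the string returned by getEncoding() is not specified above, so the sketch only prints it.

```php
<?php

// Assumption: the package was installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Kuria\SimpleHtmlParser\SimpleHtmlParser;

// No <meta charset> or Content-Type meta tag in this document.
$html = '<title>No charset declared</title>';

$parser = new SimpleHtmlParser($html);
$parser->setFallbackEncoding('ISO-8859-1');

// There is no encoding-specifying meta tag to find.
var_dump($parser->getEncodingTag());

// With no declared encoding, the fallback should be in effect.
var_dump($parser->usesFallbackEncoding());
var_dump($parser->getEncoding());
```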
getDoctypeElement() - find the doctype element
The getDoctypeElement() method attempts to find the doctype in the first 1024 bytes of the HTML document.
Returns NULL if no doctype was found.
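A short sketch of how the doctype lookup might be combined with getHtml() to print the doctype's source text. The sample document and the Composer autoload path are assumptions.

```php
<?php

// Assumption: the package was installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Kuria\SimpleHtmlParser\SimpleHtmlParser;

$html = "<!doctype html>\n<title>Example</title>";
$parser = new SimpleHtmlParser($html);

$doctype = $parser->getDoctypeElement();

if ($doctype !== null) {
    // getHtml() with an element argument returns that element's HTML.
    echo $parser->getHtml($doctype), "\n";
} else {
    echo "No doctype found\n";
}
```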
escape() - escape a string
The escape() method escapes a string with htmlspecialchars(), using the HTML document's encoding.
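To illustrate why the encoding passed to htmlspecialchars() matters, the plain-PHP sketch below escapes the same bytes under two encodings. It does not call the parser; the ENT_QUOTES flag is chosen for illustration only, since the flags used by escape() are not stated here.

```php
<?php

// "café <b>" encoded as ISO-8859-1: the byte 0xE9 is "é" in that encoding,
// but it is not a valid UTF-8 sequence on its own.
$latin1 = "caf\xE9 <b>";

// Assumption: ENT_QUOTES is used for illustration only.
$asLatin1 = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');
$asUtf8 = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

// With the matching encoding, only the markup characters are escaped.
var_dump($asLatin1 === "caf\xE9 &lt;b&gt;"); // bool(true)

// With the wrong encoding, the invalid byte makes the result an empty string.
var_dump($asUtf8 === ''); // bool(true)
```

Escaping with the document's own encoding avoids this kind of silent data loss.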
find() - match a specific element
The find() method attempts to find a specific element starting from the current position, optionally stopping after a given number of bytes.
Returns NULL if no element was matched.
<?php

$element = $parser->find(SimpleHtmlParser::OPENING_TAG, 'title');
getOffset() - get current offset
The getOffset() method returns the current parser offset in bytes.
Example: Reading document's title
<?php

use Kuria\SimpleHtmlParser\SimpleHtmlParser;

$html = <<<HTML
<!doctype html>
<meta charset="utf-8">
<title>Foo bar</title>
<h1>Baz qux</h1>
HTML;

$parser = new SimpleHtmlParser($html);

// find the opening <title> tag, then its closing tag
$titleOpen = $parser->find(SimpleHtmlParser::OPENING_TAG, 'title');

if ($titleOpen) {
    $titleClose = $parser->find(SimpleHtmlParser::CLOSING_TAG, 'title');

    if ($titleClose) {
        // the title text is whatever lies between the two tags
        $title = $parser->getSliceBetween($titleOpen, $titleClose);

        var_dump($title);
    }
}
Output:
string(7) "Foo bar"