pcrov / unicode
Miscellaneous Unicode utility functions
Requires
- php: >=7.3
Requires (Dev)
- phpunit/phpunit: ^9.4.0
This package is auto-updated.
Last update: 2025-01-08 17:40:35 UTC
README
Functions
Namespace pcrov\Unicode.
surrogate_pair_to_code_point(int $high, int $low): int
Translates a UTF-16 surrogate pair into a single code point. Wikipedia's UTF-16 article explains what this is fairly well.
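The arithmetic behind that translation can be sketched as follows. This is a minimal illustration, not the package's own source: the name sketch_surrogate_pair_to_code_point is hypothetical, and the sketch assumes the caller passes a high surrogate in 0xD800..0xDBFF and a low surrogate in 0xDC00..0xDFFF without checking either range.

```php
<?php
// Hypothetical stand-in for illustration only; not the package's implementation.
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    // Each surrogate carries 10 bits of payload: strip the range offsets,
    // join the two halves, then add back the 0x10000 that UTF-16 subtracts.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// U+1F600 is encoded in UTF-16 as the pair D83D DE00.
printf("U+%X\n", sketch_surrogate_pair_to_code_point(0xD83D, 0xDE00)); // prints "U+1F600"
```

The two extremes of the pair ranges map to U+10000 and U+10FFFF, the first and last supplementary code points.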
utf8_find_invalid_byte_sequence(string $string): ?int
Returns the position of the first invalid byte sequence or null if the input is valid.
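To make the idea of a "position" concrete, here is one way such a lookup could be approximated with a byte-oriented regular expression built from the well-formed byte ranges in the Unicode standard. It is a sketch under assumptions, not how the package does it: the function name sketch_find_invalid_utf8_offset is hypothetical, the position is assumed to be a zero-based byte offset, and the pattern is only meant for short inputs.

```php
<?php
// Hypothetical sketch; the package's real implementation may differ.
function sketch_find_invalid_utf8_offset(string $string): ?int
{
    // One alternative per row of well-formed UTF-8 byte ranges.
    // No /u modifier, so the pattern is matched byte by byte.
    $wellFormed = '/\A(?:'
        . '[\x00-\x7F]'                        // ASCII
        . '|[\xC2-\xDF][\x80-\xBF]'            // two bytes, no overlongs
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'        // three bytes, no overlongs
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}' // three bytes
        . '|\xED[\x80-\x9F][\x80-\xBF]'        // three bytes, no surrogates
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'     // four bytes, no overlongs
        . '|[\xF1-\xF3][\x80-\xBF]{3}'         // four bytes
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'     // four bytes, up to U+10FFFF
        . ')*+/';

    if (preg_match($wellFormed, $string, $match) !== 1) {
        throw new RuntimeException('Pattern failed to run.');
    }

    // The match covers the longest well-formed prefix, so its length
    // is the offset of the first byte that breaks the encoding.
    $validLength = strlen($match[0]);
    return $validLength === strlen($string) ? null : $validLength;
}

var_dump(sketch_find_invalid_utf8_offset("abc\xFFdef")); // int(3)
var_dump(sketch_find_invalid_utf8_offset("h\xC3\xA9llo")); // NULL
```

Overlong encodings, encoded surrogates, and truncated sequences are all reported at the offset of their first byte.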
utf8_get_invalid_byte_sequence(string $string): ?string
Returns the first invalid byte sequence or null if the input is valid.
utf8_get_state_machine(): array
Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.
It takes the form [byte => [valid next byte => ...,], ...], with each byte keyed to the array of bytes that may follow it.
Example use:
function utf8_generate_all_code_points(): string
{
    $generator = function (array $machine, string $buffer = "") use (&$generator) {
        // Completed a UTF-8 encoded code point.
        if ($buffer !== "" && isset($machine["\x0"])) {
            return $buffer;
        }

        $out = "";
        foreach ($machine as $byte => $next) {
            $out .= $generator($next, $buffer . $byte);
        }
        return $out;
    };

    return $generator(utf8_get_state_machine());
}
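A machine of that shape can also be walked one byte at a time to check input as it arrives. The sketch below is illustrative only: sketch_walk_is_valid and sketch_toy_machine are hypothetical names, and the toy machine covers just ASCII and two-byte sequences so the example runs without the package installed. It assumes, as the example above does, that a state accepting the NUL byte marks a code point boundary.

```php
<?php
// Hypothetical toy machine in the [byte => [valid next byte => ...]] shape.
// It only knows ASCII and two-byte sequences; the real machine covers all of UTF-8.
function sketch_toy_machine(): array
{
    $root = [];
    // An ASCII byte completes a code point, so it leads back to the start.
    for ($b = 0x00; $b <= 0x7F; $b++) {
        $root[chr($b)] = &$root;
    }
    // A continuation byte completes a two-byte sequence.
    $continuation = [];
    for ($b = 0x80; $b <= 0xBF; $b++) {
        $continuation[chr($b)] = &$root;
    }
    // Lead bytes C2..DF expect exactly one continuation byte.
    for ($b = 0xC2; $b <= 0xDF; $b++) {
        $root[chr($b)] = $continuation;
    }
    return $root;
}

// Follows the machine byte by byte; fails on the first byte with no transition.
function sketch_walk_is_valid(array $machine, string $bytes): bool
{
    $state = $machine;
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $byte = $bytes[$i];
        if (!isset($state[$byte])) {
            return false;
        }
        $state = $state[$byte];
    }
    // Input must end on a code point boundary, not in the middle of a sequence.
    return isset($state["\x00"]);
}

var_dump(sketch_walk_is_valid(sketch_toy_machine(), "caf\xC3\xA9")); // bool(true)
var_dump(sketch_walk_is_valid(sketch_toy_machine(), "caf\xC3"));     // bool(false)
```

Because the walker keeps only the current state between bytes, the same loop could be fed from a stream rather than a complete string.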
utf8_validate(string $string): bool
Returns true if the input is valid UTF-8, false otherwise.
Data
The test/data directory holds two files containing all possible UTF-8 encoded code points,
all 1,112,064 of them: one as plain text, the other as JSON. These are not included in
packaged stable releases but can be generated with the example utf8_generate_all_code_points()
function above, which returns the plain text string.
Excerpts from the Unicode 10.0.0 standard:
Recreated here for ease of reference. Nobody likes PDFs.