teon / text-extraction
Text Extraction Library
This package's canonical repository appears to be gone and the package has been frozen as a result.
Requires
- php: >=5.3.0
- smalot/pdfparser: ~0.9
- teon/base: ~0.2
This package is not auto-updated.
Last update: 2024-01-20 14:43:15 UTC
README
PHP library for extracting text from various documents using various drivers and strategies.
Multiple extractors for each file extension/media-type are supported.
Usage
The library has two modes of operation, each with two submodes.
Modes:
1.) Use one of the extraction strategies and let the library do all the work (requires the fileinfo PHP extension); or
2.) Get the extractor(s) for your file type (by explicitly stating either the media type or the file extension) and use them manually.
Submodes: a) operate on a file path, or b) operate on file contents held in a string.
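Mode 2 needs a media type or a file extension as input. The following is a minimal sketch of how those two values might be obtained with standard PHP functions before asking the library for extractors. The helper names `guessExtension()` and `guessMediaType()` are hypothetical and not part of this package, and the sketch assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helpers, NOT part of teon/text-extraction.

/**
 * Return the lower-cased extension of a path, or '' if it has none.
 */
function guessExtension(string $filePath): string
{
    return strtolower(pathinfo($filePath, PATHINFO_EXTENSION));
}

/**
 * Return the media type detected from raw content, or null when the
 * fileinfo extension is not loaded or detection fails.
 */
function guessMediaType(string $fileContent): ?string
{
    if (!class_exists('finfo')) {
        return null; // fileinfo extension is missing
    }
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type = $finfo->buffer($fileContent);

    return $type === false ? null : $type;
}

// Values that could then be passed on to the extractor lookup.
$fileExtension = guessExtension('/tmp/Report.PDF');
$fileMediaType = guessMediaType("plain text sample\n");
```

Detecting the media type from content rather than from the file name is usually more robust against misnamed files, but it only works where the fileinfo extension is available.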
Installation with composer
composer require teon/text-extraction
General usage
1.) Fully-automatic mode:
// Instantiate
$TextExtraction = new \Teon\Text\Extraction\Extraction();

// Submode a):
$text1 = $TextExtraction->fromFile($filePath);

// Submode b):
$text2 = $TextExtraction->fromString($fileContent);
2.) Manual extractor selection mode
// Instantiate
$TextExtraction = new \Teon\Text\Extraction\ExtractorRegistry();
$ExtractorRegistry = $TextExtraction->getRegistry();

// Get appropriate extractors
$extractors1 = $ExtractorRegistry->getByMediaType($fileMediaType);
$extractors2 = $ExtractorRegistry->getByExtension($fileExtension);

// Do your magic to decide which extractor to use
$Extractor1 = $extractors1[0];
$Extractor2 = $extractors2[0];

// Submode a):
$text1 = $Extractor1->fromFile($filePath);

// Submode b):
$text2 = $Extractor2->fromString($fileContent);
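Because several extractors may be returned for one file type, the "decide which extractor to use" step could be a simple fallback loop instead of always taking the first entry. The sketch below is one possible approach, not the library's own strategy. The `ExtractorLike` interface and the stub classes are hypothetical stand-ins. Only the `fromString()` method name comes from this README, and it is an assumption that a failing extractor either throws an exception or returns an empty string.

```php
<?php
// Hypothetical stand-in for the library's extractor objects.
interface ExtractorLike
{
    public function fromString(string $fileContent): string;
}

/**
 * Try each extractor in order and return the first non-empty result.
 * Extractors that throw are skipped. Returns null if none succeeds.
 */
function extractWithFallback(array $extractors, string $fileContent): ?string
{
    foreach ($extractors as $extractor) {
        try {
            $text = trim($extractor->fromString($fileContent));
        } catch (\Exception $e) {
            continue; // assumption: a failing extractor throws
        }
        if ($text !== '') {
            return $text;
        }
    }

    return null;
}

// Stub extractors, for demonstration only.
class FailingExtractor implements ExtractorLike
{
    public function fromString(string $fileContent): string
    {
        throw new \RuntimeException('unsupported input');
    }
}

class EmptyExtractor implements ExtractorLike
{
    public function fromString(string $fileContent): string
    {
        return '';
    }
}

class UpperCaseExtractor implements ExtractorLike
{
    public function fromString(string $fileContent): string
    {
        return strtoupper($fileContent);
    }
}

$text = extractWithFallback(
    array(new FailingExtractor(), new EmptyExtractor(), new UpperCaseExtractor()),
    'hello'
);
// $text is "HELLO": the first two stubs are skipped.
```

With real extractors, the array returned by `getByMediaType()` or `getByExtension()` would take the place of the stub list.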
Before using it, you may reconfigure it:
// Get default configuration
$config = \Teon\Text\Extraction\Extraction::getDefaultConfiguration();

// Adjust it
$config['strategy']['class'] = "\\My\\Super\\Dooper\\TextExtractionStrategy";

// Instantiate with adjusted configuration
$TextExtraction = new \Teon\Text\Extraction\Extraction($config);

// Start using it
// ...
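When several settings need to change at once, the overrides can be kept in a separate array and merged onto the defaults. The sketch below uses PHP's built-in `array_replace_recursive()` for that. The `$defaults` array is a hypothetical placeholder modelled on the keys shown in this README. The real structure is whatever `getDefaultConfiguration()` returns and may differ.

```php
<?php
// Hypothetical defaults; in real code this would be the array returned by
// \Teon\Text\Extraction\Extraction::getDefaultConfiguration().
$defaults = array(
    'strategy' => array(
        'class' => 'ConcatOutput',
    ),
    'extractor' => array(
        'pdfocr' => array(
            'enabled' => false,
            'command' => null,
        ),
    ),
);

// Only the settings that should differ from the defaults.
$overrides = array(
    'extractor' => array(
        'pdfocr' => array(
            'enabled' => true,
            'command' => 'my-convert-pdf-to-tiff-and-run-tesseract.sh',
        ),
    ),
);

// Keys present in $overrides replace the defaults; all others are kept.
$config = array_replace_recursive($defaults, $overrides);
```

The merged `$config` could then be passed to the `Extraction` constructor in the same way as the adjusted configuration above.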
Usage in framework: Symfony
Install with composer, as described above:
composer require teon/text-extraction
Adjust configuration settings (app/config/config.yml or parameters.yml):
teon_text_extraction:
    strategy:
        class: ConcatOutput
    extractor:
        pdfocr:
            enabled: true
            command: my-convert-pdf-to-tiff-and-run-tesseract.sh
See the Resources/config/config.yml file for the tunable settings, or print the default configuration.
Register bundle:
/*
 * FILE: app/AppKernel.php
 */
// ...
public function registerBundles()
{
    $bundles = array(
        // ...
        new Teon\Text\Extraction\TeonTextExtractionBundle(),
        // ...
    );
    return $bundles;
}
// ...
Use in your controller:
/*
 * FILE: src/YourApp/Controller/TextExtractionController.php
 */
// ...
public function extractAction ()
{
    // Get service
    $TextExtraction = $this->get('teon_text_extraction');

    // See section "General usage" above
    // ...
}
// ...