dr4g0nsr / sitemap-crawler
Crawler for any type of site, using robots.txt and sitemap.xml as the source of URLs. Useful for cache regeneration.
1.0
2022-11-11 21:16 UTC
Requires
- php: >=7.2
- ext-curl: *
- guzzlehttp/guzzle: ^7.5
- vipnytt/sitemapparser: 1.1.5
Requires (Dev)
- dealerdirect/phpcodesniffer-composer-installer: ^0.7.0
- doctrine/annotations: ^1.2
- php-parallel-lint/php-console-highlighter: ^1.0.0
- php-parallel-lint/php-parallel-lint: ^1.3.2
- phpcompatibility/php-compatibility: ^9.3.5
- roave/security-advisories: dev-latest
- squizlabs/php_codesniffer: ^3.6.2
- yoast/phpunit-polyfills: ^1.0.0
Suggests
- ext-curl: Required for CURL handler support
- ext-intl: Required for Internationalized Domain Name (IDN) support
- psr/log: Required for using the Log middleware
This package is auto-updated.
Last update: 2025-03-12 02:36:56 UTC
README
Sitemap Crawler
A crawler that uses the sitemap to crawl a site and regenerate its cache.
Files are not stored; the point is just to trigger each URL.
Get the code using Composer:
composer require dr4g0nsr/sitemap-crawler
How to implement
Create config.php:
<?php
$settings = [
    "sleep" => 0,
    "excluded" => []
];
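The `sleep` setting throttles requests and `excluded` lists URLs to skip. A slightly fuller config sketch is below; the exact pattern format expected in `excluded` is an assumption here, so check the package source for the matching rules it actually applies:

```php
<?php
// config.php — sketch, not the package's documented defaults.
// "sleep":    seconds to wait between requests (here: 1s to be polite).
// "excluded": entries to skip while crawling (format is an assumption).
$settings = [
    "sleep"    => 1,
    "excluded" => [
        "/wp-admin/",   // hypothetical: skip admin pages
    ],
];
```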
Use code like this:
<?php
require __DIR__ . '/vendor/autoload.php';
require __DIR__ . '/config.php';

use dr4g0nsr\Crawler;

$url = 'https://candymapper.com';

print "Crawler version: " . Crawler::version() . PHP_EOL;

$crawler = new Crawler(['sleep' => 0, 'verbose' => true]);
$crawler->loadConfig(__DIR__ . '/config.php');
$sitemap = $crawler->getSitemap($url);
$crawler->crawlURLS($sitemap);
This is the simplest possible usage; you can also find it in the test subdirectory under vendor/dr4g0nsr/SitemapCrawler/test.
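Since files are not stored and the goal is only to hit each URL, you may want to prune the sitemap before crawling. Assuming `getSitemap()` returns a flat array of URL strings (an assumption; check the package source), the filtering step itself is plain PHP. The sketch below uses a hard-coded `$urls` array standing in for the sitemap result, and hypothetical PCRE exclusion patterns:

```php
<?php
// Sketch: drop URLs matching exclusion patterns before crawling.
// $urls stands in for the array assumed to come from $crawler->getSitemap($url).
$urls = [
    'https://candymapper.com/',
    'https://candymapper.com/halloween-party',
    'https://candymapper.com/wp-admin/index.php',
];

// Hypothetical exclusion patterns (PCRE); not the package's own format.
$excluded = ['#/wp-admin/#'];

$filtered = array_values(array_filter($urls, function (string $u) use ($excluded): bool {
    foreach ($excluded as $pattern) {
        if (preg_match($pattern, $u)) {
            return false; // skip excluded URL
        }
    }
    return true; // keep everything else
}));

print_r($filtered);
```

The resulting `$filtered` array could then be passed to `$crawler->crawlURLS()` in place of the raw sitemap.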