PHP Classes

How Can PHP Get All Links from the HTML Pages of a Site to Find Links that Point to Pages that Do Not Exist Using the Package Broken Links Scanner: Scan web sites to identify broken links

Last updated: 2024-09-22
Version: broken-links-scanner 1.0.0
License: MIT/X Consortium
PHP version: ...7
Categories: Debug, Tools, Searching, Validation, T..., P...
Description 

This package can scan Web sites to identify broken links.

It provides a class that can take the URL of a given site and scan it to find links that point to pages that do not exist.
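To make the idea of collecting links from a page concrete, here is a minimal sketch that pulls `href` values out of an HTML string with PHP's DOM extension. The function name `extractLinks` is hypothetical and not part of this package; how the `Scanner` class extracts links internally is not documented here, so treat this as one possible approach only. It assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper for illustration; not part of the Scanner class.
function extractLinks(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so return early.
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Skip anchors without a target and in-page fragments such as "#top".
        if ($href !== '' && $href[0] !== '#') {
            $links[] = $href;
        }
    }

    // Remove duplicates while keeping a plain list of strings.
    return array_values(array_unique($links));
}

$html = '<p><a href="/docs/">Docs</a> <a href="#top">Top</a> <a href="/docs/">Docs again</a></p>';
print_r(extractLinks($html));
```

Relative links such as `/docs/` would still need to be resolved against the page URL before they can be requested.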

The class can output the list of broken links to the current terminal console or the current Web page.

It provides options that control the time that the class will wait for the response of the remote Web server and a limit of pages that will be scanned.
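As a rough illustration of checking a single link with a response timeout, the sketch below sends a header-only request with cURL and classifies the status code. Both function names are hypothetical, and treating every 4xx or 5xx status (or no response at all) as broken is an assumption of this example; the package may apply different rules.

```php
<?php
// Hypothetical helpers for illustration; the package's own checks may differ.

/** Treat HTTP 4xx/5xx responses, or no response at all (status 0), as broken. */
function isBrokenStatus(int $status): bool
{
    return $status === 0 || $status >= 400;
}

/** Request only the headers of $url and return the HTTP status code (0 on failure). */
function fetchStatus(string $url, int $timeout = 10): int
{
    $handle = curl_init($url);
    if ($handle === false) {
        return 0;
    }

    curl_setopt_array($handle, [
        CURLOPT_NOBODY         => true,     // Header-only (HEAD) request
        CURLOPT_RETURNTRANSFER => true,     // Do not print the response
        CURLOPT_FOLLOWLOCATION => true,     // Follow redirects to the final page
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeout, // Seconds to wait for a connection
        CURLOPT_TIMEOUT        => $timeout, // Seconds to wait for the whole request
    ]);
    curl_exec($handle);

    // The status code is 0 when the request failed or timed out.
    return (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
}

var_dump(isBrokenStatus(404));
var_dump(isBrokenStatus(200));
```

A call such as `isBrokenStatus(fetchStatus('https://luminova.ng/', 10))` would combine the two. Some servers reject HEAD requests, so a fallback to a normal GET request may be needed in practice.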

A PHP library for scanning websites to identify broken links and extract relevant information

Author: Ujah Chigozie Peter (Nigeria)

Example

#!/usr/bin/php
<?php
/**
 * Luminova Framework
 * The Luminova Framework offers high-performance HMVC (Hierarchical Model-View-Controller)
 * and MVC (Model-View-Controller) architectures designed for robust web applications.
 *
 * It combines the strengths of both architectures to enhance modularity, maintainability,
 * and scalability, enabling developers to build efficient and dynamic web solutions.
 *
 * @package Luminova
 * @author Ujah Chigozie Peter
 * @copyright (c) Nanoblock Technology Ltd
 * @license See LICENSE file
 * @link http://luminova.ng
 */
require_once __DIR__ . '/vendor/autoload.php';

use Peterujah\BrokenLinks\Scanner;
/**
 * CLI Script to scan for broken links on a website.
 * Usage: php broken --url="https://luminova.ng/" --host="luminova.ng" [--timeout=10] [--path="/path/to/save"] [--output=1] [--limit=0]
 */

// Parse CLI options
$options = getopt("", [
    "url:",      // Mandatory start URL (e.g. "http://luminova.ng/docs/" or "http://luminova.ng/")
    "host:",     // Mandatory host name (e.g. "luminova.ng")
    "path::",    // Optional path to save scans
    "output::",  // Optional output flag to print broken links (1 or 0)
    "timeout::", // Optional timeout in seconds
    "limit::"    // Optional maximum number of scans
]);

// Check that both the URL and the host are provided
if (empty($options['url']) || empty($options['host'])) {
    echo "Usage: php broken --url=\"https://luminova.ng/page\" --host=\"luminova.ng\" [--timeout=10] [--path=\"/path/to/save\"] [--output=1] [--limit=0]\n";
    exit(1);
}

// Extract values from options
$url = $options['url'];
$host = $options['host'] ?? '';
$timeout = isset($options['timeout']) ? (int) $options['timeout'] : 0;
$limit = isset($options['limit']) ? (int) $options['limit'] : 0;
$path = $options['path'] ?? __DIR__ . '/scanner/logs/';
$output = (isset($options['output']) && $options['output'] == 1);

if (!is_dir($path) && !mkdir($path, 0755, true)) {
    echo "Error: Unable to create or access path: $path\n";
    exit(1);
}

// Create an instance of the Scanner class
$scanner = new Scanner($url, $host, $limit);
$scanner->setPath($path);
$scanner->cli(true);

if ($timeout > 0) {
    // Run the scan through wait(), which throws if the timeout is exceeded
    try {
        echo "Waiting for the scan to complete...\n";
        $scanner->wait($timeout, function (Scanner $scanner) use ($output) {
            // Output results if requested
            if ($output) {
                echo "\nBroken Links Found:\n";
                print_r($scanner->getBrokenLinks());
            }
            // The scan completed, so exit with a success status
            exit(0);
        });
    } catch (RuntimeException $e) {
        echo "Error: " . $e->getMessage() . "\n";
        exit(1);
    }
} else {
    // No timeout given: start the scan directly
    $scanner->start();
    if ($output) {
        echo "\nBroken Links Found:\n";
        print_r($scanner->getBrokenLinks());
    }
}

exit(0);


Details

PHP Broken Links Scanner

A PHP library for scanning websites to identify broken links and extract relevant information. Ensure that the required PHP extensions are installed, particularly cURL, for the scanner to function properly.
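Before running a scan, it can help to confirm that the required extensions are loaded. The sketch below uses PHP's built-in `extension_loaded()`; the helper name `missingExtensions` is hypothetical, and only cURL is listed because it is the only extension this document names.

```php
<?php
// Hypothetical helper: returns the names of required extensions that are not loaded.
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        static function (string $name): bool {
            return !extension_loaded($name);
        }
    ));
}

$missing = missingExtensions(['curl']);
if ($missing === []) {
    echo "All required extensions are loaded.\n";
} else {
    echo 'Missing PHP extensions: ' . implode(', ', $missing) . "\n";
}
```

The same check can be done from a shell with `php -m`, which lists the loaded modules.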

Installation is super-easy via Composer:

composer require peterujah/broken-links-scanner

CLI Usage

Use the CLI script to scan a website for broken links.

Options:

  • `--url` (required): The starting URL for the scan (e.g., `http://luminova.ng/docs/` or `http://luminova.ng/`).
  • `--host` (required): The hostname of the site to scan (e.g., `luminova.ng`).
  • `--path` (optional): Path to save the scan results.
  • `--output` (optional): Flag to control output of broken links. Use `1` to print, or `0` to suppress output (default: `0`).
  • `--timeout` (optional): Maximum time in seconds to wait for the scan to complete (default: `0`).
  • `--limit` (optional): Maximum number of scans to perform. Use `0` to scan all URLs (default: `0`).

Example Usage:

To start a scan, run the following command:

php broken --url="https://luminova.ng/" --host="luminova.ng" [--timeout=10] [--path="/scanner/logs"] [--output=0] [--limit=0]
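Since `--host` is the hostname of the start URL, it can be derived from `--url` instead of being typed twice. The following sketch uses PHP's `parse_url()`; the helper name `hostFromUrl` is hypothetical, and the CLI script itself still expects both options to be passed.

```php
<?php
// Hypothetical helper: extract the lowercase hostname from a URL, or null if there is none.
function hostFromUrl(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);

    return is_string($host) && $host !== '' ? strtolower($host) : null;
}

$url = 'https://luminova.ng/docs/';
$host = hostFromUrl($url);

if ($host === null) {
    echo "Could not determine a host for: $url\n";
} else {
    // escapeshellarg() keeps the values safe when building a shell command.
    echo 'php broken --url=' . escapeshellarg($url) . ' --host=' . escapeshellarg($host) . "\n";
}
```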

Example: Using Scanner to Scan a Website for Broken Links

Initialize Scanner with the necessary parameters.

1. Basic Usage

require_once __DIR__ . '/vendor/autoload.php';

use \Peterujah\BrokenLinks\Scanner;

// Define the starting URL for the scan
$url = 'https://luminova.ng/';
$host = 'luminova.ng';
$maxScan = 10; // Set to 0 to scan all URLs.

// Initialize the Scanner class
$scanner = new Scanner($url, $host, $maxScan);

// Optionally set the path to save scanned URLs
$path = __DIR__ . '/scanner/logs/';
$scanner->setPath($path);

2. Start the Scan and Retrieve Results

If the path is not set, you can get the output directly:

if ($scanner->start() && $scanner->isCompleted()) {
    // Get results from the scan
    $brokenLinks = $scanner->getBrokenLinks();
    $visitedUrls = $scanner->getVisitedUrls();
    $errors = $scanner->getErrors();
    $allUrls = $scanner->getUrls();

    // Output the scanned data
    echo "Broken Links:\n";
    print_r($brokenLinks);

    echo "\nVisited URLs:\n";
    print_r($visitedUrls);

    echo "\nErrors Encountered:\n";
    print_r($errors);

    echo "\nAll Extracted URLs:\n";
    print_r($allUrls);
} else {
    echo "Failed to complete the scan.\n";
}
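Once the arrays are retrieved, they can be turned into a report for later review. The sketch below builds a JSON summary; it assumes that `getBrokenLinks()` and `getVisitedUrls()` return flat lists of URL strings, which this document does not confirm, and the function name `buildReport` is hypothetical. Sample arrays stand in for real scan results so the snippet runs on its own.

```php
<?php
// Hypothetical helper: summarize scan results as pretty-printed JSON.
// Assumes both arguments are flat lists of URL strings.
function buildReport(array $brokenLinks, array $visitedUrls): string
{
    $report = [
        'generated_at'  => date(DATE_ATOM),
        'visited_count' => count($visitedUrls),
        'broken_count'  => count($brokenLinks),
        'broken'        => array_values($brokenLinks),
    ];

    // JSON_THROW_ON_ERROR raises a JsonException instead of returning false.
    return json_encode(
        $report,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
    );
}

// Sample data standing in for $scanner->getBrokenLinks() and $scanner->getVisitedUrls().
$brokenLinks = ['https://luminova.ng/missing-page'];
$visitedUrls = ['https://luminova.ng/', 'https://luminova.ng/docs/'];

echo buildReport($brokenLinks, $visitedUrls) . "\n";
```

The resulting string could be written to disk with `file_put_contents()` if a saved report is wanted.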

3. Using the wait Method

To wait for the scan to complete, you can use the wait method with a specified timeout:

$timeout = 30;

try {
    $scanner->wait($timeout, function (Scanner $scanner) {
        $brokenLinks = $scanner->getBrokenLinks();
        echo "Broken Links:\n";
        print_r($brokenLinks);
    });
} catch (RuntimeException $e) {
    echo "Error: " . $e->getMessage() . "\n";
}

> Note: When using the `wait` method, you do not need to call the `start` method separately.
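The general pattern of waiting with a timeout and a completion callback can be sketched as a polling loop. This is not the package's implementation of `wait`, which is not shown in this document; `waitUntil`, its parameters, and the polling interval are all hypothetical. As with the documented behavior, a timeout of `0` means wait indefinitely, and a `RuntimeException` signals that the timeout was exceeded.

```php
<?php
// Hypothetical sketch of a wait-with-timeout loop; not the Scanner::wait() source.
function waitUntil(
    callable $isDone,
    int $timeout,
    ?callable $onComplete = null,
    int $pollMicroseconds = 100000
): void {
    $startedAt = microtime(true);

    while (!$isDone()) {
        // A timeout of 0 disables the deadline check.
        if ($timeout > 0 && (microtime(true) - $startedAt) >= $timeout) {
            throw new RuntimeException("Timed out after {$timeout} seconds.");
        }
        usleep($pollMicroseconds); // Pause briefly before checking again
    }

    if ($onComplete !== null) {
        $onComplete();
    }
}

// Stand-in for a scan that reports completion on the third check.
$checks = 0;
waitUntil(
    function () use (&$checks): bool {
        return ++$checks >= 3;
    },
    5,
    function (): void {
        echo "Scan completed.\n";
    },
    1000
);
```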


Class Methods Documentation

__construct

  • Description: Initializes a new instance of the scanner with the specified URL and hostname.
  • Parameters:
      - `string $url`: The starting URL for the scan (e.g., `https://luminova.ng/docs/`).
      - `string $host`: The hostname for the URL to scan (e.g., `luminova.ng`).
      - `int $maxScan`: The maximum number of scans to perform (default is `0`, which means no limit).
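To show how a scan limit such as `$maxScan` can bound a crawl, here is a minimal sketch that walks an in-memory link graph instead of making HTTP requests. The function `crawl`, the `$linkGraph` structure, and the rule that a page missing from the graph counts as broken are all assumptions made for this example; the package's actual traversal may differ.

```php
<?php
/**
 * Hypothetical breadth-first crawl over an in-memory site map.
 *
 * @param array<string, string[]> $linkGraph Page URL => URLs linked from that page
 * @param int $maxScan Maximum number of pages to visit (0 means no limit)
 */
function crawl(string $startUrl, array $linkGraph, int $maxScan = 0): array
{
    $queue = [$startUrl];
    $visited = [];
    $broken = [];

    while ($queue !== []) {
        // Stop once the limit is reached.
        if ($maxScan > 0 && count($visited) >= $maxScan) {
            break;
        }

        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        // A page that is absent from the graph stands in for a missing page.
        if (!array_key_exists($url, $linkGraph)) {
            $broken[] = $url;
            continue;
        }

        foreach ($linkGraph[$url] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    return ['visited' => array_keys($visited), 'broken' => $broken];
}

$site = [
    '/'      => ['/docs/', '/about'],
    '/docs/' => ['/', '/docs/missing'],
    '/about' => [],
];

print_r(crawl('/', $site));    // No limit: every reachable page is visited
print_r(crawl('/', $site, 2)); // Limit of 2: the crawl stops before reaching the missing page
```

With a low limit, broken links deeper in the site go unreported, which is the trade-off for a shorter scan.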

isCompleted(): bool

  • Description: Checks whether the scanning process has been completed.
  • Returns: - `bool`: Returns `true` if the scan is completed; otherwise, returns `false`.

getBrokenLinks(): array

  • Description: Retrieves the list of broken URLs identified during the scan.
  • Returns: - `array`: An array containing the broken URLs.

getVisitedUrls(): array

  • Description: Retrieves the list of URLs that have been visited during the scan.
  • Returns: - `array`: An array containing the visited URLs.

getErrors(): array

  • Description: Retrieves the error messages encountered during the scan process.
  • Returns: - `array`: An array containing the error messages.

getUrls(): array

  • Description: Retrieves the list of extracted URLs during the scan.
  • Returns: - `array`: An array containing the extracted URLs.

setPath(string $path): self

  • Description: Sets the file path where scanned URLs will be saved.
  • Parameters: - `string $path`: The file path to save scanned URLs.
  • Returns: - `self`: Returns the current instance of the class for method chaining.

cli(bool $cli): self

  • Description: Sets whether the scanning results should be shown in the command line interface (CLI).
  • Parameters: - `bool $cli`: `true` if running in CLI mode; otherwise, `false`.
  • Returns: - `self`: Returns the current instance of the class for method chaining.

start(): bool

  • Description: Initiates the link scanning process.
  • Returns: - `bool`: Returns `true` if the scan completes successfully; returns `false` otherwise.
  • Throws: - `RuntimeException`: Throws an exception if the provided URL is invalid.
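Because `start()` throws on an invalid URL, it can be useful to validate input before constructing the scanner. The sketch below uses PHP's `filter_var()` and `parse_url()`; the helper name `isScannableUrl` and the restriction to `http` and `https` are assumptions of this example, and the rules the package applies are not documented here.

```php
<?php
// Hypothetical pre-check: accept only syntactically valid http(s) URLs.
function isScannableUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));

    return in_array($scheme, ['http', 'https'], true);
}

foreach (['https://luminova.ng/', 'luminova.ng', 'ftp://luminova.ng/'] as $candidate) {
    echo $candidate . ' => ' . (isScannableUrl($candidate) ? 'ok' : 'rejected') . "\n";
}
```

Even with such a pre-check, wrapping the call to `start()` in a `try`/`catch` for `RuntimeException` remains the safer choice.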

wait(int $timeout, ?callable $onComplete = null): void

  • Description: Waits for the scanning process to complete or until a specified timeout is reached. If a callback function is provided, it will be executed upon completion.
  • Parameters:
      - `int $timeout`: The maximum number of seconds to wait. If `0`, it waits indefinitely until the scan is completed.
      - `callable|null $onComplete`: An optional callback function to be executed when the scan completes.
  • Throws: - `RuntimeException`: Throws an exception if the timeout is exceeded before completion.

Files

  • `src/Scanner.php`: Class source
  • `broken`: Example script
  • `composer.json`: Auxiliary data
  • `LICENSE`: License text
  • `README.md`: Documentation
