Link Checker

⚠️ This article was originally published in 2005 at dubi.org/link-checker. The content is extremely outdated and is preserved here for nostalgic purposes only.

Takes a URL on the command line, finds the links in that page, and reports each one as either OK or BROKEN:

link-checker.php
<?php
/*
Takes a URL on the command line and parses it for links and URLs
Outputs the status of every link and URL as either OK or BROKEN
*/
function error( $str ) {
    fwrite(STDERR, $str);
    exit(1);
}
function test_url( $url ) {
    fwrite(STDOUT, " Checking \"$url\": ");
    $handle = @fopen($url, 'r');
    if ($handle) {
        fwrite(STDOUT, "OK\n");
        fclose($handle);
    }
    else {
        fwrite(STDOUT, "*BROKEN*\n");
    }
}
if ($argc != 2) {
    error("syntax: url_tester.php url\n");
}
$url = $argv[1];
/* prefix a bare www. address with http:// */
$url = preg_replace("/^www\./","http://www.",$url);
fwrite(STDOUT, "Testing $url for broken links:\n");
$file_contents = @file_get_contents($url);
/* file_get_contents() returns false on failure; an empty page is not an error */
if ($file_contents === false) {
    error("Error reading from $url. Try again later\n");
}
/* finds all anchor (<a href=) links; each ends at a quote, space, or > */
$url_pattern = "!<a href=(?:\")?([^\" >]+)!i";
preg_match_all($url_pattern, $file_contents, $url_list, PREG_PATTERN_ORDER);
fwrite(STDOUT, " ANCHOR (<a href=) URLS\n");
foreach($url_list[1] as $link) {
    /* absolute link: starts with a scheme or a literal "www." */
    if (preg_match("!^(https?://|www\.)!i", $link)) {
        /* prefix a bare www. address with http:// */
        $link = preg_replace("/^www\./","http://www.",$link);
        test_url($link);
    }
    else {
        /* relative link: join base and link with exactly one slash */
        if (preg_match("!/$!",$url) and $link[0] == '/' ) {
            test_url($url . substr($link,1));
        }
        else if (preg_match("!/$!",$url) xor $link[0] == '/' ) {
            test_url($url . $link);
        }
        else {
            test_url($url . '/' . $link);
        }
    }
}
/* finds every absolute URL in the source; each ends at a quote, space, or > */
$url_pattern = "!https?://(?:[^/\" >:]*)(?::(?:[0-9]*))?(?:/[^ >\"]*)?!i";
preg_match_all($url_pattern, $file_contents, $url_list, PREG_PATTERN_ORDER);
fwrite(STDOUT, " ALL URLS\n");
foreach($url_list[0] as $link) {
    test_url($link);
}
?>
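
The relative-link branches in the script are meant to join the base URL and the link with exactly one slash between them. A minimal sketch of that idea as a single helper follows. The name `join_url` is hypothetical and not part of the original script, and like the script it treats the base URL as a directory: it makes no attempt to handle `../` segments, query strings, or a base URL that points at a file.

```php
<?php
// Hypothetical helper (not in the original script): joins a base URL and a
// relative link so that exactly one slash separates them.
function join_url(string $base, string $link): string {
    // Drop trailing slashes from the base and leading slashes from the link,
    // then put a single slash back in between.
    return rtrim($base, '/') . '/' . ltrim($link, '/');
}

// Usage: all four slash combinations produce the same result.
echo join_url('http://www.example.com', 'about.html'), "\n";
echo join_url('http://www.example.com/', '/about.html'), "\n";
```

Both calls print `http://www.example.com/about.html`. Collapsing the three branches into one expression removes the case analysis on which side carries the slash.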

Example output:

X:\webdev>php url_tester.php http://www.google.com
Testing http://www.google.com for broken links:
ANCHOR (<a href=) URLS
Checking "http://www.google.com/options/": OK
Checking "http://www.google.com/advanced_search?hl=en": OK
Checking "http://www.google.com/preferences?hl=en": OK
Checking "http://www.google.com/language_tools?hl=en": OK
Checking "http://www.google.com/ads/": OK
Checking "http://www.google.com/intl/en/about.html": OK
ALL URLS
Checking "http://groups-beta.google.com/grphp?hl=en&tab=wg&ie=UTF-8": OK
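
The OK and BROKEN verdicts above depend only on whether `fopen()` can open the URL. If the actual HTTP status code is of interest, one possible variation is to inspect the header list that PHP's `get_headers()` returns. The sketch below is an assumption-laden alternative, not the article's method: `final_status_code` and `url_is_ok` are hypothetical names, nullable return types require PHP 7.1 or later, and treating every 2xx or 3xx code as OK is a choice made here for illustration.

```php
<?php
// Hypothetical helper: returns the status code from the LAST "HTTP/..." status
// line in a header list (redirects add one status line per hop), or null if
// the list contains no status line.
function final_status_code(array $headers): ?int {
    $code = null;
    foreach ($headers as $header) {
        if (is_string($header) && preg_match('!^HTTP/\S+\s+(\d{3})!', $header, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical wrapper: fetches the headers for $url and treats a final
// 2xx or 3xx status as OK. Assumes allow_url_fopen is enabled.
function url_is_ok(string $url): bool {
    $headers = @get_headers($url);
    if ($headers === false) {
        return false; // request failed outright (DNS error, refused, ...)
    }
    $code = final_status_code($headers);
    return $code !== null && $code >= 200 && $code < 400;
}
```

Keeping the parsing in `final_status_code` separate from the network call makes it possible to exercise the status logic on a hand-written header array without touching the network.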
