[SOLVED] Search for url in file


deadlyp99

I'm trying to build a simple web crawler.

Currently I just want to index URLs.

So the first task is to strip all the text in the file and find only the URLs.

I managed to remove all HTML tags except <a> with a simple strip_tags() call, but that is not enough.

I have gone from a full page of code to:

#main{text-align: center;//display: none;}#title{text-align: center;}#url{//display: none;}web crawler - test<a id="url" href="http://www.google.com">google</a>include("crawl.php");main("index.php");?>

 

Obviously that isn't enough, and I don't really know where to go from here. Of course, when crawling, the PHP won't be a problem (a remote server never sends its PHP source), but the CSS and other text will.

My code as it stands:

<?php

function Main($StartUrl){
	//Read the page into an array, one element per line
	$PageGut = file($StartUrl);
	//Process each line of the file
	foreach ($PageGut as $LineNumber => $Line){
		//Look for '<a href="">' type lines
		//print(htmlspecialchars($Line) . "<br />\n");
		RemoveNonUrl($Line);
	}
}


function RemoveNonUrl($Line){
	//Function removes anything NOT a url
	//Convert the string to lower case so the tags don't need
	//1000 different combinations in the array
	$LowCaseLine = strtolower($Line);

	//Strip all white space at beginning of line
	$NoWhiteSpaceBeginningLine = ltrim($LowCaseLine);

	//Strip all white space at end of line
	$NoWhiteSpaceEndLine = rtrim($NoWhiteSpaceBeginningLine);

	//Remove all the html tags but keep the <a> tags
	$RemoveHtmlKeepUrl = strip_tags($NoWhiteSpaceEndLine,"<a>");

	//Shorten the variable name
	$Line = $RemoveHtmlKeepUrl;
	echo htmlspecialchars($Line);
}
?>

All help appreciated.

PS: I'm not looking for web crawler APIs, so don't bother posting them. I want to do this from scratch.
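
For context on why the CSS survives in the output above: strip_tags() removes the tags themselves but keeps the text between them, so the contents of a <style> block come through as plain text. A minimal sketch of one way around that, assuming the whole page is held in one string rather than processed line by line (the helper name stripStyleAndScript and the sample markup are made up for illustration):

```php
<?php
// Hypothetical helper: remove <style> and <script> elements, contents included,
// then strip every remaining tag except <a>.
function stripStyleAndScript(string $html): string
{
    // \1 refers back to whichever tag name (style or script) was opened.
    $cleaned = preg_replace('#<(style|script)\b[^>]*>.*?</\1>#is', '', $html);
    // preg_replace() returns null on failure; fall back to the original input.
    return trim(strip_tags($cleaned ?? $html, '<a>'));
}

$html = '<style>#main{text-align: center;}</style><h1>web crawler - test</h1>'
      . '<a id="url" href="http://www.google.com">google</a>';
$result = stripStyleAndScript($html);
echo $result, "\n";
```

This only gets rid of the CSS and script text. The visible page text still remains, so the href values themselves still have to be pulled out of the surviving <a> tags.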

https://forums.phpfreaks.com/topic/118582-solved-search-for-url-in-file/

You can do it very simply with a single preg_match_all():

<?php
//load source code of website into a variable as a string
$url = 'http://www.phpfreaks.com/forums/index.php/topic,210772.0.html';
$string = file_get_contents($url);
if ($string === false) {
	die('Could not load ' . $url);
}
//search the string for a pattern, and store the content found inside the set of parens in the array $matches
//\b stops the pattern matching tags such as <abbr>; [^>]* keeps each match inside a single tag
preg_match_all('|<a\b[^>]*?\shref="([^"]*)"|is', $string, $matches);
//see what's inside $matches[1]
echo '<pre>' . htmlspecialchars(print_r($matches[1], true)) . '</pre>';
?>
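
If the regular expression turns out to be too brittle (it only handles double-quoted href values, for instance), PHP's bundled DOM extension can parse the markup instead. This is a swapped-in technique, not the approach posted above: a minimal sketch, assuming the DOM extension is enabled, with the function name extractLinks and the sample markup made up for illustration:

```php
<?php
// Hypothetical helper: return the href value of every <a> element in $html.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip anchors such as <a name="top"> that carry no href.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

$sample = '<p><a id="url" href="http://www.google.com">google</a> '
        . "<a href='/about'>about</a> <a name=\"top\">no link</a></p>";
$links = extractLinks($sample);
print_r($links);
```

Relative values such as /about come back exactly as written in the page, so a crawler would still need to resolve them against the page's own URL before fetching them.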
