deadlyp99 Posted August 7, 2008

I'm trying to build a simple web crawler. For now I just want to index URLs, so the first task is to strip everything else out of the file and keep only the URLs. I managed to remove all HTML tags except <a> with a simple strip_tags() call, but that is not enough. I have gone from a full page of code to:

#main{text-align: center;//display: none;}#title{text-align: center;}#url{//display: none;}web crawler - test<a id="url" href="http://www.google.com">google</a>include("crawl.php");main("index.php");?>

Obviously that isn't enough: the CSS rules, the page text and the leftover PHP are still there, and I don't really know where to go from here. When crawling real pages the PHP won't be a problem, since the server executes it before sending the page, but the CSS and other text will.

My code as it stands:

[code]
<?php
function Main($StartUrl){
    // Read the page into an array, one element per line
    $PageGut = file($StartUrl);

    // Process each line of the file
    foreach ($PageGut as $LineNumber => $Line){
        // Remove return or line feed chars at the end of the line
        //$Line = trim($PageGut[$x]);
        // Look for '<a href src="">' type lines
        //print(htmlspecialchars($Line) . "<br />\n");
        RemoveNonUrl($Line);
    }
}

function RemoveNonUrl($Line){
    // Removes anything that is NOT a URL

    // Convert the string to lower case so the tags don't need
    // 1000 different combinations in the array
    $LowCaseLine = strtolower($Line);

    // Strip all white space at the beginning of the line
    $NoWhiteSpaceBeginningLine = ltrim($LowCaseLine);

    // Strip all white space at the end of the line
    $NoWhiteSpaceEndLine = rtrim($NoWhiteSpaceBeginningLine);

    // Remove all the HTML tags but keep the <a> tags
    $RemoveHtmlKeepUrl = strip_tags($NoWhiteSpaceEndLine, "<a>");

    // Shorten the variable name
    $Line = $RemoveHtmlKeepUrl;

    echo htmlspecialchars($Line);
}
?>
[/code]

All help appreciated.

PS: I'm not looking for web crawler APIs, so don't bother posting them. I want to do this from scratch.

Link to comment https://forums.phpfreaks.com/topic/118582-solved-search-for-url-in-file/
thebadbad Posted August 7, 2008

You can do it very simply with a single preg_match_all():

[code]
<?php
// Load the source code of the page into a variable as a string
$url = 'http://www.phpfreaks.com/forums/index.php/topic,210772.0.html';
$string = file_get_contents($url);

// Search the string for the pattern, and store the content found inside
// the set of parentheses in the array $matches
preg_match_all('|<a.*?href="(.*?)"|is', $string, $matches);

// See what's inside $matches[1]
echo '<pre>' . print_r($matches[1], true) . '</pre>';
?>
[/code]
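To see what ends up in $matches without fetching a page, the same pattern can be run on a hard-coded string. This is only a minimal sketch: the sample markup is adapted from the first post, and the second link (`/about.html`) is made up for illustration.

```php
<?php
// Sample markup adapted from the first post: CSS, plain text and links.
// The second link is a hypothetical addition to show case-insensitivity.
$html = '<style>#main{text-align: center;}</style>'
      . 'web crawler - test'
      . '<a id="url" href="http://www.google.com">google</a>'
      . '<A HREF="/about.html" class="nav">about</A>';

// Same pattern as in the post: lazy match from "<a" up to href="...",
// case-insensitive (i) and with "." also matching newlines (s).
// preg_match_all() returns the number of full matches found.
$count = preg_match_all('|<a.*?href="(.*?)"|is', $html, $matches);

// $matches[0] holds each full match, e.g. <a id="url" href="http://www.google.com"
print_r($matches[0]);

// $matches[1] holds only what the parentheses captured, i.e. the URLs
print_r($matches[1]);
```

The CSS and the plain text never show up in $matches[1], because only the part inside the parentheses is captured. Note that the pattern expects double quotes around the href value.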
deadlyp99 Posted August 7, 2008

Oh perfect, that is exactly what I needed. Thanks a lot, this has been bugging me for hours.
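A regex is quick, but it depends on how the markup happens to be written (quote style, for instance). As an alternative to the regex used in this thread, not something the posters suggested, the links can also be pulled out with PHP's DOM extension. A rough sketch, assuming the DOM extension is enabled; the function name `extractLinks` and the sample markup are hypothetical:

```php
<?php
// Hypothetical helper (not from the thread): returns the href value of
// every <a> element in an HTML string, using the DOM extension.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip anchors without an href, such as <a name="top">
        if ($anchor->hasAttribute('href')) {
            $urls[] = $anchor->getAttribute('href');
        }
    }
    return $urls;
}

// Made-up sample markup: double quotes, single quotes, and no href at all
$html = "<p>web crawler - test</p>"
      . "<a id=\"url\" href=\"http://www.google.com\">google</a>"
      . "<a href='/single-quoted.html'>single quotes</a>"
      . "<a name=\"top\">no href here</a>";

$links = extractLinks($html);
print_r($links);
```

Relative links such as `/single-quoted.html` would still have to be resolved against the page URL before a crawler could follow them.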
This topic is now archived and is closed to further replies.