tommytx Posted January 22, 2007

The code below parses the web page saved in web.txt and extracts every URL that includes a full domain name. Now I would also like to extract all the ones that do not have a domain name listed; see the sample below.

************************ begin code to extract domain URLs ****************
$link = "";
$html = file_get_contents("web.txt");
$urls = '(http|file|ftp)';
$ltrs = '\w';
$gunk = '/#~:.?+=&%@!\-';
$punc = '.:?\-';
$any = "$ltrs$gunk$punc";
preg_match_all("{\b$urls:[$any]+?(?=[$punc]*[^$any]|$)}x", $html, $matches);
// Print all the URLs that were extracted.
foreach ($matches[0] as $u) {
    $link = '?url=' . urlencode($u);
    echo "<A HREF='$link'>$u</A><BR>\n";
}
// Print the total number of URLs located.
printf("Output of URLs %d URLs<P>\n", sizeof($matches[0]));
***************** End of code to extract domain URLs ****************

URLs 1 and 2 below have the domain name and extract fine. What can I add to extract the non-domain URLs, as in 3 and 4? Ideally I would simply duplicate the code above and run it again on web.txt after changing the regex so that it pulls out the short URLs.

1. <a href="httpcolon//www.idaho.com/vahud/hud_tom.htm">HUD Homes</a>
2. <a href="httpcolon//www.idaho.com/tomchambers">MLS Homes</a>
3. <a href="buy_sell_house_hampton.htm">buy_sell_house_hampton</a>
4. <a href="newport_news_homes_for_sale.htm">newport_news_homes_for_sale</a>
effigy Posted January 22, 2007

You need to make the domain portion optional, but your pattern is too general to do so. See if this topic (http://www.phpfreaks.com/forums/index.php/topic,123121.0.html) helps you match URLs better. Once you've got that working, a pattern of (?:domain_portion)? will make the domains optional.
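To make the (?:domain_portion)? idea concrete, here is a minimal sketch, not a definitive solution. It rests on a few assumptions that are not in the original posts: the pattern is anchored on href="..." attributes so that a URL without a domain has something to latch onto, the helper name extract_hrefs is hypothetical, and the sample markup uses real http:// links where the question wrote "httpcolon//".

```php
<?php
// Hypothetical helper: returns every href value, with or without a domain.
function extract_hrefs(string $html): array
{
    // Optional domain portion: scheme, "://", then a host (no slash, quote, space or ">").
    $domain = '(?:(?:http|file|ftp)://[^/"\'\s>]+)?';
    // Path portion: everything up to the closing quote.
    $path = '[^"\'\s>]*';

    // Assumption: URLs of interest sit inside quoted href attributes.
    preg_match_all("~href\s*=\s*[\"']($domain$path)[\"']~i", $html, $matches);

    // Group 1 holds the URL itself, without the surrounding href="...".
    return $matches[1];
}

// Sample markup modelled on the four links in the question.
$html = <<<HTML
<a href="http://www.idaho.com/vahud/hud_tom.htm">HUD Homes</a>
<a href="http://www.idaho.com/tomchambers">MLS Homes</a>
<a href="buy_sell_house_hampton.htm">buy_sell_house_hampton</a>
<a href="newport_news_homes_for_sale.htm">newport_news_homes_for_sale</a>
HTML;

foreach (extract_hrefs($html) as $u) {
    echo $u, "\n";
}
```

Because the domain group is optional, the same pattern covers both the full URLs and the short ones, so a second pass over web.txt should not be needed. Note that an empty href="" would also come back as an empty string; filter those out if they matter.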