Graxeon Posted October 3, 2011

I'm trying to parse 2 things:

1. Specific TD tags from a table.
2. Specific URLs from an HTML page.

Here's part of the data I'm trying to parse:

    <tr>
      <td class="f"> <a href="http://main1.site.com/x.html">Page 1</a> </td>
      <td>1572</td>
      <td class="a">Type: F</td>
      <td><img src="http://site.com/image.gif" title="N" alt="N" /></td>
      <td class="f">F</td>
    </tr>
    <tr class="x">
      <td class="m"> <a href="http://main2.site.com/x.html">Page 2</a> </td>
      <td>1771</td>
      <td class="a">Type: M</td>

Here's the parser that I'm working with:

    <?php
    $html = file_get_contents('http://www.website.com/page.html');

    // use this to only match "td" tags
    #preg_match_all ( "/(<(td)>)([^<]*)(<\/\\2>)/", $html, $matches );

    // use this to match any tags
    #preg_match_all("/(<([\w]+)[^>]*>)([^<]*)(<\/\\2>)/", $html, $matches);

    //use this to match URLs
    #preg_match_all ( "/http:\/\/[a-z0-9A-Z.]+(?(?=[\/])(.*))/", $html, $matches );

    //use this to match URLs
    #preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

    preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

    for ( $i=0; $i< count($matches[0]); $i++) {
        echo "matched: " . $matches[0][$i] . "\n<br>";
        echo "part 1: " . $matches[1][$i] . "\n<br>";
        echo "part 2: " . $matches[2][$i] . "\n<br>";
        echo "part 3: " . $matches[3][$i] . "\n<br>";
        echo "part 4: " . $matches[4][$i] . "\n\n<br>";
    }
    ?>

What I'm trying to output is:

    <a href="http://main1.site.com/x.html">Page 1</a> Hits: 1572
    <a href="http://main2.site.com/x.html">Page 2</a> Hits: 1771
    ...for the entire table

What I've managed to get out of it so far are the "Hits", using the "td" snippet. What I can't figure out is how to extract the full:

    <a href="http://main.site.com/p#.html">Page #</a>

So my question is: how can I make it look for just "<a href="http://main#.......">Page #</a>"? Currently it looks for every URL, which is not what I need.
requinix Posted October 3, 2011

Don't use regular expressions. Try DOMDocument or, if the HTML is XHTML-compatible, even SimpleXML. Both of those allow you to do very specific searches within documents - and by DOM structure, not by raw text.
codefossa Posted October 3, 2011

An example of DOMDocument to help get ya started if you don't already know how to use it. This will echo out each link's location.

    $html = file_get_contents('http://www.iana.org/domains/example/');
    $doc = new DOMDocument();
    @$doc -> loadHTML($html);   // @ hides the warnings loadHTML raises on malformed HTML
    $xp = new DOMXPath($doc);
    $hrefs = $xp -> evaluate('//a/@href');   // every href attribute of every <a>
    foreach ($hrefs as $href) {
        $href = $href -> nodeValue;
        echo "{$href}<br />";
    }
Graxeon Posted October 3, 2011

Interesting... It managed to pull out all of the URLs. But I'd have to search Google for a few hours to figure out what filters/protocol DOMDocument uses. For example, is //a/@href searching for hrefs? I understand the rest of the script, but I have no idea how that filter/parse term is being used.

I'm here: http://php.net/manual/en/class.domdocument.php

Can someone point me to the right section? xD
codefossa Posted October 3, 2011

Remember that DOMDocument is object oriented.

    //a is all of the <a>(.*)</a>
    //a/@href is selecting all the a tags again, but just pulling the href attributes
    //a['class' = 'pink']/@href will pull all the a tags' href's where class is "pink"

Hope that helps a little. Sorry if I suck at explaining.
Graxeon Posted October 4, 2011

Ok...I understand that. But the <a> tags depend on the <td>'s classes (which are f or m). I tried playing with it:

    $hrefs = $xp -> evaluate('//a["class" = "f"]/@td');
    $hrefs = $xp -> evaluate('//a["class" = "f"]/@href');

But it didn't return anything.

To make it less complex, is there a section in the manual that gives an example of searching for "http://main"? Where "main" could be any value and it would output the value of "a" (which would be "Page #").
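On the "http://main" part of the question: the expression language here is XPath 1.0, which has a starts-with() function, and a predicate on the parent <td> can restrict the match to cells of class f or m. The following is only a minimal sketch under assumptions: findMainLinks is a hypothetical helper name, and the inline $html is a trimmed stand-in for the real page, not the actual site.

```php
<?php
// Hypothetical helper: returns [href => link text] for every <a> that sits
// directly inside a <td class="f"> or <td class="m"> and whose href starts
// with "http://main".
function findMainLinks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect HTML parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);
    $links = $xp->query(
        "//td[@class='f' or @class='m']/a[starts-with(@href, 'http://main')]"
    );

    $result = [];
    foreach ($links as $a) {
        // textContent is the text between <a> and </a>, e.g. "Page 1"
        $result[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $result;
}

// Stand-in for the fetched page (assumed markup, modelled on the sample rows).
$html = <<<HTML
<table>
<tr><td class="f"><a href="http://main1.site.com/x.html">Page 1</a></td><td>1572</td></tr>
<tr class="x"><td class="m"><a href="http://main2.site.com/x.html">Page 2</a></td><td>1771</td></tr>
<tr><td class="f"><a href="http://other.site.com/y.html">Other</a></td><td>5</td></tr>
</table>
HTML;

foreach (findMainLinks($html) as $href => $text) {
    echo "{$text}: {$href}\n";
}
```

Note the @ in @class and @href: it marks an attribute. A quoted "class" is just a string literal, so "class" = "f" compares two different strings and is never true, which would explain why the attempts above returned nothing.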
codefossa Posted October 4, 2011

Here's an example to yet again retrieve the link's location.

    /* Some HTML
    <table id="myTable">
      <tr>
        <td class="gold"><a href="http://google.com"></a></td>
        <td class="black"><a href="http://youtube.com"></a></td>
      </tr>
    </table>
    */

    // Variable value will be "http://youtube.com"
    // The href attribute belongs to the <a> inside the <td>, hence the /a step.
    $black_href = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/a/@href") -> item(0) -> nodeValue;

Also, sorry about earlier. I didn't realize I chose class with '' instead of @. This example is correct though.
Graxeon Posted October 5, 2011

Hmm...tried messing with it. Can't seem to get it to echo anything :/ I also tried echoing $hrefs directly. Blank page.

    <?php
    /* domdochtml.html:
    <table id="myTable">
      <tr>
        <td class="gold"><a href="http://google.com"></td>
        <td class="black"><a href="http://youtube.com"></td>
      </tr>
    </table>
    */

    // Variable value will be "http://youtube.com"
    $html = file_get_contents('http://fixitplease.ulmb.com/domdochtml.html');
    $doc = new DOMDocument();
    @$doc -> loadHTML($html);
    $xp = new DOMXPath($doc);
    $hrefs = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/@href") -> item(0) -> nodeValue;

    /* I've also tried echoing $hrefs:
    echo $hrefs; (doesn't return anything)
    */

    foreach ($hrefs as $href) {
        $href = $href -> nodeValue;
        echo "{$href}<br />";
    }
    ?>
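Two things in that snippet would likely explain the blank page. First, td[@class = 'black']/@href asks for an href attribute on the <td> itself, but the href sits on the <a> inside it, so the node list is empty and item(0) is null. Second, -> item(0) -> nodeValue already reduces the result to a single value, so there is nothing left for foreach to loop over. Below is a minimal sketch of the output asked for in the first post (the link plus "Hits: N" per row). It rests on assumptions: linksWithHits is a hypothetical helper name, the hit count is taken to be the second <td> of each row as in the sample data, and the inline $html stands in for the real page.

```php
<?php
// Hypothetical helper: one string per table row that has a link inside a
// <td class="f"> or <td class="m">, followed by the number in the next cell.
function linksWithHits(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect HTML parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);
    $lines = [];
    // Select only the rows that contain such a link.
    foreach ($xp->query("//tr[td[@class='f' or @class='m']/a]") as $row) {
        // Paths starting with "./" are evaluated relative to the current row.
        $a    = $xp->query("./td[@class='f' or @class='m']/a", $row)->item(0);
        $hits = $xp->evaluate("string(./td[2])", $row);   // assumed: hits are in the 2nd cell
        // saveHTML($node) serializes just that element, tags included.
        $lines[] = $doc->saveHTML($a) . ' Hits: ' . trim($hits);
    }
    return $lines;
}

// Stand-in for the fetched page (assumed markup, modelled on the sample rows).
$html = <<<HTML
<table>
<tr><td class="f"><a href="http://main1.site.com/x.html">Page 1</a></td><td>1572</td><td class="a">Type: F</td></tr>
<tr class="x"><td class="m"><a href="http://main2.site.com/x.html">Page 2</a></td><td>1771</td><td class="a">Type: M</td></tr>
</table>
HTML;

foreach (linksWithHits($html) as $line) {
    echo $line . "<br />\n";
}
```

Passing the row as the second argument to query() and evaluate() keeps each lookup inside that row, which is what ties a link to its own hit count.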