Parse specific URL from HTML?

Graxeon · October 3, 2011

I'm trying to parse 2 things.

1. Specific TD tags from a table.

2. Specific URLs from an HTML page.

Here's part of the data I'm trying to parse:

<tr>
<td class="f">
<a href="http://main1.site.com/x.html">Page 1</a>
</td>
<td>1572</td>
<td class="a">Type: F</td>
<td><img src="http://site.com/image.gif" title="N" alt="N" /></td>
<td class="f">F</td>
</tr>

<tr class="x">
<td class="m">
<a href="http://main2.site.com/x.html">Page 2</a>
</td>
<td>1771</td>

<td class="a">Type: M</td>

Here's the parser that I'm working with:

<?php

$html = file_get_contents('http://www.website.com/page.html');

// use this to only match "td" tags
#preg_match_all ( "/(<(td)>)([^<]*)(<\/\\2>)/", $html, $matches );

// use this to match any tags
#preg_match_all("/(<([\w]+)[^>]*>)([^<]*)(<\/\\2>)/", $html, $matches);

//use this to match URLs
#preg_match_all ( "/http:\/\/[a-z0-9A-Z.]+(?(?=[\/])(.*))/", $html, $matches );

//use this to match URLs
#preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

for ( $i=0; $i< count($matches[0]); $i++)
{
	echo "matched: " . $matches[0][$i] . "\n<br>";
	echo "part 1: " . $matches[1][$i] . "\n<br>";
	echo "part 2: " . $matches[2][$i] . "\n<br>";
	echo "part 3: " . $matches[3][$i] . "\n<br>";
	echo "part 4: " . $matches[4][$i] . "\n\n<br>";
}

?>

What I'm trying to output is:

<a href="http://main1.site.com/x.html">Page 1</a>
Hits: 1572

<a href="http://main2.site.com/x.html">Page 2</a>
Hits: 1771

...for the entire table

What I've managed to get out of it so far are the "Hits", using the "td" snippet. What I can't figure out is how to extract the full link: <a href="http://main.site.com/p#.html">Page #</a>

So my question is how can I make it look for just "<a href="http://main#.......">Page #</a>"?

Currently it looks for every URL, which is not what I need.

requinix · October 3, 2011

Don't use regular expressions. Try DOMDocument or, if the HTML is XHTML-compatible, even SimpleXML. Both of those allow you to do very specific searches within documents - and by DOM structure, not by raw text.
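
As a rough sketch of that approach, assuming every row looks like the sample in the first post (link in the first cell, hit count in the second): the inline $html string and the $rows array are illustrative stand-ins, not part of the original page or script.

```php
<?php
// Sample markup modeled on the rows quoted in the question (assumed structure).
$html = <<<HTML
<table>
<tr>
<td class="f"><a href="http://main1.site.com/x.html">Page 1</a></td>
<td>1572</td>
<td class="a">Type: F</td>
</tr>
<tr class="x">
<td class="m"><a href="http://main2.site.com/x.html">Page 2</a></td>
<td>1771</td>
<td class="a">Type: M</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$rows = [];

// Walk each table row, then run queries relative to that row.
foreach ($xp->query('//tr') as $tr) {
    $link = $xp->query('td[1]/a', $tr)->item(0);   // link inside the first cell
    $hits = $xp->query('td[2]', $tr)->item(0);     // second cell holds the hit count
    if ($link === null || $hits === null) {
        continue;   // skip rows that do not match the assumed layout
    }
    $rows[] = [
        'href' => $link->getAttribute('href'),
        'text' => trim($link->textContent),
        'hits' => trim($hits->textContent),
    ];
}

foreach ($rows as $row) {
    printf(
        "<a href=\"%s\">%s</a>\nHits: %s\n\n",
        htmlspecialchars($row['href']),
        htmlspecialchars($row['text']),
        htmlspecialchars($row['hits'])
    );
}
```

Querying relative to each row keeps a link and its hit count paired, which is awkward to guarantee with two separate regular expressions.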

codefossa · October 3, 2011

An example of DOMDocument to help get ya started if you don't already know how to use it.

This will echo out each link's location.

$html = file_get_contents('http://www.iana.org/domains/example/');

$doc = new DOMDocument();
@$doc -> loadHTML($html);
$xp = new DOMXPath($doc);

$hrefs = $xp -> evaluate('//a/@href');

foreach ($hrefs as $href)
{
    $href = $href -> nodeValue;
    
    echo "{$href}<br />";
}

Graxeon · October 3, 2011

Interesting...

It managed to pull out all of the URLs. But I'll have to search Google for a few hours to figure out what filters/protocol DOMDocument uses.

Cause for example:

//a/@href

-is searching for href's?

I understand the rest of the script. But I have no idea how that filter/parse term is being used.

I'm here: http://php.net/manual/en/class.domdocument.php

Can someone point me to the right section? xD

codefossa · October 3, 2011

Remember that DOMDocument is object oriented.

//a is all of the <a>(.*)</a>

//a/@href is selecting all the a tags again, but just pulling the href attributes

//a['class' = 'pink']/@href will pull all the a tags' href's where class is "pink"

Hope that helps a little. Sorry if I suck at explaining.

Graxeon · October 4, 2011

Ok...I understand that.

But the <a> tags depend on the <td>'s classes (which are f or m). I tried playing with it:

$hrefs = $xp -> evaluate('//a["class" = "f"]/@td');

$hrefs = $xp -> evaluate('//a["class" = "f"]/@href');

But it didn't return anything.
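
A likely reason, as far as XPath 1.0 goes: ["class" = "f"] compares two string literals, which is never true, so every node is filtered out. The attribute test is written @class, and in this table the class sits on the <td>, not on the <a>. A small sketch, with markup assumed from the first post:

```php
<?php
// Two cells modeled on the question's table (assumed structure).
$html = '<table><tr>'
      . '<td class="f"><a href="http://main1.site.com/x.html">Page 1</a></td>'
      . '<td class="m"><a href="http://main2.site.com/x.html">Page 2</a></td>'
      . '</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings off the page
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Two different string literals are never equal, so this selects nothing.
$literal = $xp->query('//a["class" = "f"]/@href');

// @class tests the attribute. It is on the <td>, so filter the <td>
// and then step down to the <a> inside it.
$byClass = $xp->query('//td[@class = "f"]/a/@href');

echo $literal->length, "\n";
foreach ($byClass as $attr) {
    echo $attr->nodeValue, "\n";
}
```

The second query also shows why /@td cannot work: @ selects an attribute, and td is an element.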

But to make it less complex, is there a section in the manual that gives an example to search for "http://main"? Where "main" could be any value and it would output the value of "a" (which would be "Page #").
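
One way this kind of prefix search could be written is with the XPath 1.0 starts-with() function. This is only a sketch: the three links and the $labels array are made up for illustration.

```php
<?php
// Three links, two of which point at a "main..." host (illustrative data).
$html = '<p>'
      . '<a href="http://main1.site.com/x.html">Page 1</a> '
      . '<a href="http://site.com/other.html">Other</a> '
      . '<a href="http://main2.site.com/x.html">Page 2</a>'
      . '</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings off the page
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Keep only <a> elements whose href begins with "http://main",
// then read the link text of each match.
$labels = [];
foreach ($xp->query('//a[starts-with(@href, "http://main")]') as $a) {
    $labels[] = trim($a->textContent);
}

echo implode("\n", $labels), "\n";
```

Selecting the <a> element rather than its @href gives access to both the link text (textContent) and the URL (getAttribute('href')).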

codefossa · October 4, 2011

Here's an example to yet again retrieve the link's location.

/*

Some HTML

<table id="myTable">
    <tr>
        <td class="gold"><a href="http://google.com">Google</a></td>
        <td class="black"><a href="http://youtube.com">YouTube</a></td>
    </tr>
</table>

*/

// Variable value will be "http://youtube.com"
// The href attribute belongs to the <a> inside the <td>, so step into it first.
$black_href = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/a/@href") -> item(0) -> nodeValue;

Also, sorry about earlier. I didn't realize I chose class with '' instead of @. This example is correct though.

Graxeon · October 5, 2011

Hmm...tried messing with it. Can't seem to get it to echo anything :/

I also tried echoing $hrefs directly. Blank page

<?php

/*

domdochtml.html:

<table id="myTable">
    <tr>
        <td class="gold"><a href="http://google.com"></td>
        <td class="black"><a href="http://youtube.com"></td>
    </tr>
</table>

*/

// Variable value will be "http://youtube.com"

$html = file_get_contents('http://fixitplease.ulmb.com/domdochtml.html');

$doc = new DOMDocument();
@$doc -> loadHTML($html);
$xp = new DOMXPath($doc);

$hrefs = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/@href") -> item(0) -> nodeValue;

/*
I've also tried echoing $hrefs:

echo $hrefs;
(doesn't return anything)
*/

foreach ($hrefs as $href)
{
    $href = $href -> nodeValue;
    
    echo "{$href}<br />";
}


?>
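
Two things in the script above look like they would each produce a blank page. First, the path asks for @href on the <td>, which has no such attribute, so the node list is empty and item(0) is null. Second, once -> nodeValue is applied, $hrefs would be a single string, not a list that foreach can walk. A possible reworking, assuming the test file was meant to contain closed <a> tags (the inline $html and the $found array are stand-ins for illustration):

```php
<?php
// Local copy of the test markup with the <a> tags closed (an assumption
// about what domdochtml.html was meant to contain).
$html = '<table id="myTable"><tr>'
      . '<td class="gold"><a href="http://google.com">Google</a></td>'
      . '<td class="black"><a href="http://youtube.com">YouTube</a></td>'
      . '</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of hiding them with @
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Step from the <td> into its <a> before asking for href, and keep the
// whole node list so that foreach has something to iterate over.
$hrefs = $xp->query("//table[@id = 'myTable']//td[@class = 'black']/a/@href");

$found = [];
foreach ($hrefs as $href) {
    $found[] = $href->nodeValue;
    echo $href->nodeValue, "<br />\n";
}
```

While debugging, checking $hrefs->length before the loop makes an empty result visible instead of silently printing nothing.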
