
Parse specific URL from HTML?


Graxeon


I'm trying to parse 2 things.

1. Specific TD tags from a table.

2. Specific URLs from an HTML page.

 

Here's part of the data I'm trying to parse:

 

<tr>
<td class="f">
<a href="http://main1.site.com/x.html">Page 1</a>
</td>
<td>1572</td>
<td class="a">Type: F</td>
<td><img src="http://site.com/image.gif" title="N" alt="N" /></td>
<td class="f">F</td>
</tr>

<tr class="x">
<td class="m">
<a href="http://main2.site.com/x.html">Page 2</a>
</td>
<td>1771</td>

<td class="a">Type: M</td>

 

Here's the parser that I'm working with:

 

<?php

$html = file_get_contents('http://www.website.com/page.html');

// use this to only match "td" tags
#preg_match_all ( "/(<(td)>)([^<]*)(<\/\\2>)/", $html, $matches );

// use this to match any tags
#preg_match_all("/(<([\w]+)[^>]*>)([^<]*)(<\/\\2>)/", $html, $matches);

//use this to match URLs
#preg_match_all ( "/http:\/\/[a-z0-9A-Z.]+(?(?=[\/])(.*))/", $html, $matches );

//use this to match URLs
#preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

// The active pattern has two capture groups: [1] is the href, [2] is the link text.
// (Parts 3 and 4 only exist for the four-group "any tags" pattern above.)
for ( $i=0; $i< count($matches[0]); $i++)
{
	echo "matched: " . $matches[0][$i] . "\n<br>";
	echo "part 1: " . $matches[1][$i] . "\n<br>";
	echo "part 2: " . $matches[2][$i] . "\n\n<br>";
}

?>

 

What I'm trying to output is:

 

<a href="http://main1.site.com/x.html">Page 1</a>
Hits: 1572

<a href="http://main2.site.com/x.html">Page 2</a>
Hits: 1771

...for the entire table

 

What I've managed to get out of it so far are the "Hits" with the "td" snippet. What I can't figure out is how to extract the full: <a href="http://main.site.com/p#.html">Page #</a>

 

So my question is how can I make it look for just "<a href="http://main#.......">Page #</a>"?

 

Currently it looks for every URL, which is not what I need.

https://forums.phpfreaks.com/topic/248366-parse-specific-url-from-html/

An example of DOMDocument to help get ya started if you don't already know how to use it.

 

This will echo out each link's location.

 

$html = file_get_contents('http://www.iana.org/domains/example/');

$doc = new DOMDocument();
@$doc -> loadHTML($html);
$xp = new DOMXPath($doc);

$hrefs = $xp -> evaluate('//a/@href');

foreach ($hrefs as $href)
{
    $href = $href -> nodeValue;
    
    echo "{$href}<br />";
}

Interesting...

 

It managed to pull out all of the URLs. But I have to search Google for a few hours to figure out what filters/protocol DOMDocument uses :P.

 

Cause for example:

 

//a/@href

-is searching for href's?

 

I understand the rest of the script. But I have no idea how that filter/parse term is being used.

 

I'm here: http://php.net/manual/en/class.domdocument.php

Can someone point me to the right section? xD

Remember that DOMDocument is object oriented.

 

//a is all of the <a>(.*)</a>

 

//a/@href is selecting all the a tags again, but just pulling the href attributes

 

//a['class' = 'pink']/@href will pull all the a tags' href's where class is "pink"

 

Hope that helps a little.  Sorry if I suck at explaining.
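To make those three queries concrete, here is a minimal sketch run against a small made-up snippet (the markup and the example.com URLs are assumptions for illustration, not from the page being scraped). It assumes the class test is written as @class, since in XPath an attribute is addressed with @, the same way @href is.

```php
<?php
// Hypothetical sample markup, invented for this sketch.
$html = '<p><a class="pink" href="http://example.com/a">A</a>'
      . '<a class="blue" href="http://example.com/b">B</a></p>';

$doc = new DOMDocument();
@$doc->loadHTML($html);          // @ hides warnings about imperfect markup
$xp = new DOMXPath($doc);

// "//a" selects every <a> element in the document.
$links = $xp->query('//a');

// "//a/@href" selects the href attribute node of every <a>.
$allHrefs = [];
foreach ($xp->query('//a/@href') as $attr) {
    $allHrefs[] = $attr->nodeValue;
}

// The predicate [@class = "pink"] keeps only <a> elements whose
// class attribute equals "pink" before stepping to @href.
$pinkHrefs = [];
foreach ($xp->query('//a[@class = "pink"]/@href') as $attr) {
    $pinkHrefs[] = $attr->nodeValue;
}

echo $links->length, " links<br />\n";
echo implode(', ', $allHrefs), "<br />\n";
echo implode(', ', $pinkHrefs), "<br />\n";
```

The expression language here is XPath 1.0, which is what DOMXPath implements, so the XPath documentation is the place to look rather than the DOMDocument class page.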

Ok...I understand that.

 

But the <a> tags depend on the <td>'s classes (which are f or m). I tried playing with it:

 

$hrefs = $xp -> evaluate('//a["class" = "f"]/@td');

$hrefs = $xp -> evaluate('//a["class" = "f"]/@href');

 

But it didn't return anything.

 

But to make it less complex, is there a section in the manual that gives an example to search for "http://main"? Where "main" could be any value and it would output the value of "a" (which would be "Page #").
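A likely reason the two attempts above return nothing: "class" = "f" compares two string literals, which is never true, and the class belongs to the <td>, not to the <a>. As a rough sketch of one way to get the "link plus Hits" output, the query below filters on the <td> class, then uses XPath's starts-with() to keep only hrefs beginning with "http://main". The sample rows are modelled on the table in the first post; the closing tags of the second row and the extra "Not wanted" link are assumptions added so the example is self-contained.

```php
<?php
// Sample rows modelled on the question's table (partly assumed, see above).
$html = <<<HTML
<table>
<tr>
<td class="f"><a href="http://main1.site.com/x.html">Page 1</a></td>
<td>1572</td>
<td class="a">Type: F</td>
<td><a href="http://other.site.com/y.html">Not wanted</a></td>
</tr>
<tr class="x">
<td class="m"><a href="http://main2.site.com/x.html">Page 2</a></td>
<td>1771</td>
<td class="a">Type: M</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// <a> elements inside a td.f or td.m whose href starts with "http://main".
$links = $xp->query(
    '//td[@class = "f" or @class = "m"]/a[starts-with(@href, "http://main")]'
);

$results = [];
foreach ($links as $a) {
    // From the <a>, go up to its <td>, then take the next <td> (the hits).
    $hits = trim($xp->evaluate('string(../following-sibling::td[1])', $a));

    $results[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
        'hits' => $hits,
    ];
}

foreach ($results as $row) {
    echo '<a href="' . htmlspecialchars($row['href']) . '">'
       . htmlspecialchars($row['text']) . "</a><br />\n";
    echo "Hits: {$row['hits']}<br /><br />\n";
}
```

Passing the <a> node as the second argument to evaluate() makes the relative path start from that link, which is what ties each hit count to its own row.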

Here's an example to yet again retrieve the link's location.

 

/*

Some HTML

<table id="myTable">
    <tr>
        <td class="gold"><a href="http://google.com"></a></td>
        <td class="black"><a href="http://youtube.com"></a></td>
    </tr>
</table>

*/

// Variable value will be "http://youtube.com"
// (href is an attribute of the <a> inside the <td>, hence the /a step)
$black_href = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/a/@href") -> item(0) -> nodeValue;

 

Also, sorry about earlier.  I didn't realize I chose class with '' instead of @.  This example is correct though.

Hmm...tried messing with it. Can't seem to get it to echo anything :/

 

I also tried echoing $hrefs directly. Blank page :(

 

<?php

/*

domdochtml.html:

<table id="myTable">
    <tr>
        <td class="gold"><a href="http://google.com"></td>
        <td class="black"><a href="http://youtube.com"></td>
    </tr>
</table>

*/

// Variable value will be "http://youtube.com"

$html = file_get_contents('http://fixitplease.ulmb.com/domdochtml.html');

$doc = new DOMDocument();
@$doc -> loadHTML($html);
$xp = new DOMXPath($doc);

$hrefs = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/@href") -> item(0) -> nodeValue;

/*
I've also tried echoing $hrefs:

echo $hrefs;
(doesn't return anything)
*/

foreach ($hrefs as $href)
{
    $href = $href -> nodeValue;
    
    echo "{$href}<br />";
}


?>
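Two things in the script above probably explain the blank page. First, the query asks for @href directly on the <td>, but the href sits on the <a> inside it, so the node list is empty and item(0) is null. Second, $hrefs is assigned the result of -> item(0) -> nodeValue, which is a single value, and is then passed to foreach as if it were still a list. A minimal sketch of a version that echoes the link follows; the HTML is inlined rather than fetched, and the closed <a> tags with link text are assumptions made so it runs on its own.

```php
<?php
// Inline stand-in for domdochtml.html (closing </a> tags and link text assumed).
$html = '<table id="myTable"><tr>'
      . '<td class="gold"><a href="http://google.com">Google</a></td>'
      . '<td class="black"><a href="http://youtube.com">YouTube</a></td>'
      . '</tr></table>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Step into the <a> before asking for its href attribute.
$hrefs = $xp->query("//table[@id = 'myTable']//td[@class = 'black']/a/@href");

// Keep the node list for looping...
$found = [];
foreach ($hrefs as $href) {
    $found[] = $href->nodeValue;
    echo "{$href->nodeValue}<br />\n";
}

// ...or take a single value, guarding against an empty result.
$first = $hrefs->length > 0 ? $hrefs->item(0)->nodeValue : null;
```

Checking length before calling item(0) avoids reading a property of null when the query matches nothing, which is easy to miss while warnings are hidden.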
