
Parse specific URL from HTML?


Graxeon


I'm trying to parse 2 things.

1. Specific TD tags from a table.

2. Specific URLs from an HTML page.

 

Here's part of the data I'm trying to parse:

 

<tr>
<td class="f">
<a href="http://main1.site.com/x.html">Page 1</a>
</td>
<td>1572</td>
<td class="a">Type: F</td>
<td><img src="http://site.com/image.gif" title="N" alt="N" /></td>
<td class="f">F</td>
</tr>

<tr class="x">
<td class="m">
<a href="http://main2.site.com/x.html">Page 2</a>
</td>
<td>1771</td>

<td class="a">Type: M</td>

 

Here's the parser that I'm working with:

 

<?php

$html = file_get_contents('http://www.website.com/page.html');

// use this to only match "td" tags
#preg_match_all ( "/(<(td)>)([^<]*)(<\/\\2>)/", $html, $matches );

// use this to match any tags
#preg_match_all("/(<([\w]+)[^>]*>)([^<]*)(<\/\\2>)/", $html, $matches);

//use this to match URLs
#preg_match_all ( "/http:\/\/[a-z0-9A-Z.]+(?(?=[\/])(.*))/", $html, $matches );

//use this to match URLs
#preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

preg_match_all ( "/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $matches );

for ( $i=0; $i< count($matches[0]); $i++)
{
	// The active pattern has only two capture groups, so only indexes 0-2 exist.
	echo "matched: " . $matches[0][$i] . "\n<br>";
	echo "part 1: " . $matches[1][$i] . "\n<br>";
	echo "part 2: " . $matches[2][$i] . "\n\n<br>";
}

?>

 

What I'm trying to output is:

 

<a href="http://main1.site.com/x.html">Page 1</a>
Hits: 1572

<a href="http://main2.site.com/x.html">Page 2</a>
Hits: 1771

...for the entire table

 

What I've managed to get out of it so far are the "Hits" with the "td" snippet. What I can't figure out is how to extract the full link: <a href="http://main.site.com/p#.html">Page #</a>

 

So my question is how can I make it look for just "<a href="http://main#.......">Page #</a>"?

 

Currently it looks for every URL, which is not what I need.
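
For reference, here is a rough sketch of how the last pattern could be narrowed so it only matches links whose host starts with "main" plus a number, and grabs the hits from the next cell at the same time. It runs against the sample rows above rather than the live page. The names $pattern and $lines are just for illustration, and the pattern assumes the hits <td> comes directly after the cell holding the link, so it would likely break if the markup changes.

```php
<?php
// Sample rows copied from the post; a real run would use file_get_contents().
$html = <<<'HTML'
<tr>
<td class="f">
<a href="http://main1.site.com/x.html">Page 1</a>
</td>
<td>1572</td>
<td class="a">Type: F</td>
<td><img src="http://site.com/image.gif" title="N" alt="N" /></td>
<td class="f">F</td>
</tr>
<tr class="x">
<td class="m">
<a href="http://main2.site.com/x.html">Page 2</a>
</td>
<td>1771</td>
<td class="a">Type: M</td>
</tr>
HTML;

// Group 1: the whole <a> element whose href starts with "http://main" plus digits.
// Group 2: the number inside the <td> that immediately follows.
$pattern = '~(<a href="http://main\d+\.[^"]*">[^<]*</a>)\s*</td>\s*<td>(\d+)</td>~i';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$lines = [];
foreach ($matches as $m) {
    $lines[] = $m[1] . "\nHits: " . $m[2];
}

echo implode("\n\n", $lines), "\n";
```

The image URL is skipped because it is not inside an <a> tag and its host does not start with "main".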

Link to comment
Share on other sites

An example of DOMDocument to help get ya started if you don't already know how to use it.

 

This will echo out each link's location.

 

$html = file_get_contents('http://www.iana.org/domains/example/');

$doc = new DOMDocument();
@$doc -> loadHTML($html);
$xp = new DOMXPath($doc);

$hrefs = $xp -> evaluate('//a/@href');

foreach ($hrefs as $href)
{
    $href = $href -> nodeValue;
    
    echo "{$href}<br />";
}
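
A small variation on the snippet above, in case it helps: this one selects the <a> elements themselves instead of the href attributes, so the link text is available too. The inline $html string and the $links array are made up for the example so it runs without a network request, and libxml_use_internal_errors() is used in place of @ to keep parser warnings quiet.

```php
<?php
// Inline sample so the snippet runs offline; swap in file_get_contents() for a real page.
$html = '<p><a href="http://example.com/a.html">First</a> and '
      . '<a href="http://example.com/b.html">Second</a></p>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

$links = [];
// Query the elements, then read the attribute and the text from each one.
foreach ($xp->query('//a[@href]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

foreach ($links as $link) {
    echo $link['text'], ' => ', $link['href'], "\n";
}
```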

Link to comment
Share on other sites

Interesting...

 

It managed to pull out all of the URLs. But I have to search Google for a few hours to figure out what filters/protocol DOMDocument uses :P.

 

Cause for example:

 

//a/@href

-is searching for href's?

 

I understand the rest of the script. But I have no idea how that filter/parse term is being used.

 

I'm here: http://php.net/manual/en/class.domdocument.php

Can someone point me to the right section? xD

Link to comment
Share on other sites

Remember that DOMDocument is object oriented.

 

//a is all of the <a>(.*)</a>

 

//a/@href is selecting all the a tags again, but just pulling the href attributes

 

//a['class' = 'pink']/@href will pull all the a tags' href's where class is "pink"

 

Hope that helps a little.  Sorry if I suck at explaining.

Link to comment
Share on other sites

Ok...I understand that.

 

But the <a> tags depend on the <td>'s classes (which are f or m). I tried playing with it:

 

$hrefs = $xp -> evaluate('//a["class" = "f"]/@td');

$hrefs = $xp -> evaluate('//a["class" = "f"]/@href');

 

But it didn't return anything.

 

But to make it less complex, is there a section in the manual that gives an example to search for "http://main"? Where "main" could be any value and it would output the value of "a" (which would be "Page #").
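
A sketch of what is probably going on with those two attempts, plus one way to filter on "http://main". In XPath, "class" = "f" compares two string literals, which is never true, so no node passes the predicate. The attribute has to be written as @class, and since the class sits on the <td>, the filter goes there before stepping down to the <a>. The starts-with() function handles the "http://main" part. The sample table is a cut-down version of the data from the first post, and the third row is a made-up addition so there is something to filter out.

```php
<?php
$html = '<table>'
      . '<tr><td class="f"><a href="http://main1.site.com/x.html">Page 1</a></td><td>1572</td></tr>'
      . '<tr><td class="m"><a href="http://main2.site.com/x.html">Page 2</a></td><td>1771</td></tr>'
      // Hypothetical extra row, not in the original data.
      . '<tr><td class="z"><a href="http://other.site.com/y.html">Other</a></td><td>5</td></tr>'
      . '</table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// Two different string literals are compared, so the predicate is false for every node.
$literal = $xp->query('//a["class" = "f"]/@href');

// @class reads the attribute; filter the <td>, then step down to its <a>.
$byClass = $xp->query('//td[@class = "f" or @class = "m"]/a');

// starts-with() keeps only links whose href begins with the given prefix.
$byHref = $xp->query('//a[starts-with(@href, "http://main")]');

echo "literal predicate: ", $literal->length, " nodes\n";
echo "class predicate: ", $byClass->length, " nodes\n";
foreach ($byHref as $a) {
    echo $a->textContent, "\n";
}
```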

Link to comment
Share on other sites

Here's an example to yet again retrieve the link's location.

 

/*

Some HTML

<table id="myTable">
    <tr>
        <td class="gold"><a href="http://google.com"></a></td>
        <td class="black"><a href="http://youtube.com"></a></td>
    </tr>
</table>

*/

// Variable value will be "http://youtube.com"
$black_href = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/a/@href") -> item(0) -> nodeValue;

 

Also, sorry about earlier.  I didn't realize I had written 'class' in quotes instead of @class.  This example is correct though.

Link to comment
Share on other sites

Hmm...tried messing with it. Can't seem to get it to echo anything :/

 

I also tried echoing $hrefs directly. Blank page :(

 

<?php

/*

domdochtml.html:

<table id="myTable">
    <tr>
        <td class="gold"><a href="http://google.com"></td>
        <td class="black"><a href="http://youtube.com"></td>
    </tr>
</table>

*/

// Variable value will be "http://youtube.com"

$html = file_get_contents('http://fixitplease.ulmb.com/domdochtml.html');

$doc = new DOMDocument();
@$doc -> loadHTML($html);
$xp = new DOMXPath($doc);

$hrefs = $xp -> evaluate("//table[@id = 'myTable']//td[@class = 'black']/@href") -> item(0) -> nodeValue;

/*
I've also tried echoing $hrefs:

echo $hrefs;
(doesn't return anything)
*/

foreach ($hrefs as $href)
{
    $href = $href -> nodeValue;
    
    echo "{$href}<br />";
}


?>
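
For what it's worth, two things in the script above look like they would lead to a blank page. The href attribute belongs to the <a>, not the <td>, so the path needs an /a step before /@href. And -> item(0) -> nodeValue turns the result into a single value, which leaves foreach with nothing to loop over. Below is a sketch of a version that keeps the node list and prints the output asked for in the first post. It runs on an inline copy of the sample rows, and it assumes the hits are always in the first <td> after the one holding the link. The $rows array is just an illustrative name.

```php
<?php
// Inline copy of the sample rows from the first post.
$html = <<<'HTML'
<table>
<tr>
<td class="f"><a href="http://main1.site.com/x.html">Page 1</a></td>
<td>1572</td>
<td class="a">Type: F</td>
<td class="f">F</td>
</tr>
<tr class="x">
<td class="m"><a href="http://main2.site.com/x.html">Page 2</a></td>
<td>1771</td>
<td class="a">Type: M</td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

$query = '//td[@class = "f" or @class = "m"]/a[starts-with(@href, "http://main")]';

$rows = [];
// Keep the node list itself so foreach has something to iterate over.
foreach ($xp->query($query) as $a) {
    // Relative to the <a>: go up to its <td>, then take the next <td> sibling.
    $hits = $xp->evaluate('string(../following-sibling::td[1])', $a);

    // Rebuild the link markup from the attribute and the text.
    $link = '<a href="' . htmlspecialchars($a->getAttribute('href')) . '">'
          . htmlspecialchars($a->textContent) . '</a>';

    $rows[] = $link . "\nHits: " . trim($hits);
}

echo implode("\n\n", $rows), "\n";
```

The <td class="f">F</td> cell at the end of the first row is ignored because it has no <a> child.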

Link to comment
Share on other sites
