
Extracting a specific string out of HTML code


chupacabrot


I'm quite new to regular expressions. I looked through all the previously asked questions and unfortunately couldn't find an answer. I want to extract specific parts of my website and simply echo them to a new page. My list of categories is structured alphabetically like this -

<a href="architecture.html">ARCHITECTURE</a><br />
<a href="art.html">ART</a><br />
<a href="avantgarde.html">AVANTGARDE</a><br />

. . . and so on.

What I'm actually trying to do is extract all the categories as plain text and simply echo them on the screen. (In this case I need to extract every string that starts with ">A and ends with </a, assuming I don't have any other similar pattern in my code.)

I found this piece of code on Stack Overflow that is supposed to extract anything that exists between tags, but unfortunately that's not the case..

html part -

<div name="changeable_text">**GET THIS TEXT**</div>

php part -

$categories = file_get_contents( $url);

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $categories);
$xpath = new DOMXpath( $doc);

$node = $xpath->query( '//div[@name="changeable_text"]')->item( 0);

echo $node->textContent; // This will print **GET THIS TEXT**

 

I found this piece of code on Stack Overflow that is supposed to extract anything that exists between tags, but unfortunately that's not the case..

The PHP code you posted works fine. It finds the div tags that have a name attribute set to "changeable_text" and prints the text value of the first matching node.
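
For reference, here is a minimal self-contained sketch of that snippet, run against the sample div from the first post instead of a fetched page. The inline $html string and the null check are additions for illustration. Note that the attribute value inside the XPath query is quoted; an unquoted changeable_text would be read as a child element name rather than as a string, and nothing would match.

```php
<?php
// Sample markup from the first post, inlined so the sketch runs without a network request.
$html = '<div name="changeable_text">**GET THIS TEXT**</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The attribute value must be a quoted string inside the XPath expression.
$node = $xpath->query('//div[@name="changeable_text"]')->item(0);

// item(0) returns null when nothing matched, so check before reading textContent.
$text = $node !== null ? $node->textContent : '';
echo $text;
```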

My list of categories is structured alphabetically like this -

<a href="architecture.html">ARCHITECTURE</a><br />
<a href="art.html">ART</a><br />
<a href="avantgarde.html">AVANTGARDE</a><br />

. . . and so on.

 

To get all anchor tags on the page, you'd use //a as the XPath query.
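
A minimal sketch of that query, assuming the three sample links from your post as an inline string (the $html and $names variables are only illustrative):

```php
<?php
$html = '<a href="architecture.html">ARCHITECTURE</a><br />
<a href="art.html">ART</a><br />
<a href="avantgarde.html">AVANTGARDE</a><br />';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// //a matches every anchor element anywhere in the document
$names = [];
foreach ($xpath->query('//a') as $link) {
    $names[] = trim($link->textContent);
}

echo implode("\n", $names), "\n";
```

On a real page this would also pick up navigation and footer links, which is why narrowing the query to a container is usually the better choice.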

 

If you only want to get the category links, then you need to specify the container they belong to, e.g.

$categories = '<div id="categories">
<a href="architecture.html">ARCHITECTURE</a><br />
<a href="art.html">ART</a><br />
<a href="avantgarde.html">AVANTGARDE</a><br />
</div>';

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $categories);
$xpath = new DOMXpath( $doc);

// find all anchor tags within the <div id="categories"> tag
$categorylinks = $xpath->query('//div[@id="categories"]/a');

// loop through the links and echo the link text
foreach($categorylinks as $link)
{
    echo $link->textContent .'<br />';
}

Assuming that http://www.mywebsiteforexample.com/categories.html has a div called 'categories en', here's what I tried, and for some reason it still doesn't work:

<?php
$url = 'http://www.mywebsiteforexample.com/categories.html';
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $url);
$xpath = new DOMXpath( $doc);


$categorylinks = $xpath->query('//div[@id="categories en"]/a'); // please note the space in the div's name - maybe that is what causes the trouble

// loop through the links and echo the link text
foreach($categorylinks as $link)
{
    echo $link->textContent .'<br />';
}

?>
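
A likely reason this attempt returns nothing: loadHTML() expects the HTML markup itself as a string, not an address, so passing $url makes it parse the URL text as if it were the page. The sketch below shows that effect and one way to fetch the page first. The loadPage() helper is hypothetical, and the fetch relies on allow_url_fopen being enabled on the server.

```php
<?php
$url = 'http://www.mywebsiteforexample.com/categories.html';

libxml_use_internal_errors(true); // keep parser warnings out of the output

// What the script above does: the URL text itself is parsed as markup.
$doc = new DOMDocument;
$doc->loadHTML($url);
$xpath = new DOMXPath($doc);
$found = $xpath->query('//a')->length; // the URL string contains no anchor tags
echo "anchors found in the URL string: $found\n";

// Hypothetical helper: fetch the page first, then parse the returned markup.
function loadPage(string $url): ?DOMDocument
{
    $contents = @file_get_contents($url); // needs allow_url_fopen enabled
    if ($contents === false || $contents === '') {
        return null; // fetch failed or the page was empty
    }
    $doc = new DOMDocument;
    $doc->loadHTML($contents);
    return $doc;
}
```

Calling loadPage($url) and checking for null before building the DOMXPath object keeps a failed download from turning into a confusing empty result.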

well.. unfortunately it still doesn't work.. :\

<?php
$url = 'http://www.mywebsiteforexample.com/categories.html';
$contents = file_get_contents($url);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $contents);
$xpath = new DOMXpath( $doc);


$categorylinks = $xpath->query('//div[@id="categories en"]/a'); // please note the space in the div's name - maybe that is what causes the trouble

// loop through the links and echo the link text
foreach($categorylinks as $link)
{
    echo $link->textContent .'<br />';
}

?>

<!-- start of categories list -->
<div class="categories en"> </div>
<a href="../agriculture.html" target="_blank">agriculture</a><br />
<a href="../avantgarde.html" target="_blank">avantgarde</a><br />
<a href="../azyx.html" target="_blank">azyx</a><br />

Is it right that you close the div tag as soon as you open it? The closing </div> needs to go after the anchor tags:

<!-- start of categories list -->
<div class="categories en"> <!-- open div -->
<a href="../agriculture.html" target="_blank">agriculture</a><br />
<a href="../avantgarde.html" target="_blank">avantgarde</a><br />
<a href="../azyx.html" target="_blank">azyx</a><br />
</div> <!-- close div -->
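
One more thing that may be worth checking: the markup uses class="categories en", while the query in the script tests @id, so it would not match even with the div closed correctly. Below is a sketch assuming the corrected markup above as an inline string. The contains() variant is only a suggested alternative for matching a single class name when the attribute holds several.

```php
<?php
$html = '<div class="categories en">
<a href="../agriculture.html" target="_blank">agriculture</a><br />
<a href="../avantgarde.html" target="_blank">avantgarde</a><br />
<a href="../azyx.html" target="_blank">azyx</a><br />
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The original query tests @id, which this div does not have.
$byId = $xpath->query('//div[@id="categories en"]/a');

// Testing @class with the exact attribute value matches the div.
$byClass = $xpath->query('//div[@class="categories en"]/a');

// Matches the single class token "categories", whatever else the attribute holds.
$byToken = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " categories ")]/a'
);

// Collect and print the link text of the matched anchors.
$names = [];
foreach ($byClass as $link) {
    $names[] = $link->textContent;
}
echo implode("\n", $names), "\n";
```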

