
extracting specific string out of html code



I'm quite new to regular expressions. I looked through the questions already asked and unfortunately couldn't find an answer. I want to extract specific parts of my website and simply echo them to a new page. My list of categories is structured alphabetically like this -

<a href="architecture.html">ARCHITECTURE</a><br />
<a href="art.html">ART</a><br />
<a href="avantgarde.html">AVANTGARDE</a><br />

. . . and so on.

Now, what I'm actually trying to do is extract all the categories as plain text and simply echo them on the screen. In this case I need to extract every string that starts with ">A and ends with </a (assuming I don't have any other similar pattern within my code).

I found this piece of code on Stack Overflow that is supposed to extract anything that exists between tags, but unfortunately that's not the case.

html part -

<div name="changeable_text">**GET THIS TEXT**</div>

php part -

$categories = file_get_contents( $url);

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $categories);
$xpath = new DOMXpath( $doc);

$node = $xpath->query( '//div[@name="changeable_text"]')->item( 0);

echo $node->textContent; // This will print **GET THIS TEXT**

 

The PHP code you posted works fine. It finds the div tags that have a name attribute set to "changeable_text" and prints the text of the first matching node.

 

 

 

 

To get all anchor tags on the page, you'd use //a as the XPath query.
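
As a minimal sketch of that query (the sample markup and variable names here are made up for illustration, not taken from the actual site):

```php
<?php
// Hypothetical sample markup; in practice this would be the fetched page.
$html = '<p><a href="art.html">ART</a> and <a href="about.html">About</a></p>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// //a matches every anchor element anywhere in the document
$texts = [];
foreach ($xpath->query('//a') as $a) {
    $texts[] = $a->textContent;
}

echo implode(', ', $texts), PHP_EOL; // prints: ART, About
```

Because //a also picks up navigation and footer links, it is usually too broad on a real page.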

 

If you only want the category links, then you need to specify the container they belong to, e.g.

$categories = '<div id="categories">
<a href="architecture.html">ARCHITECTURE</a><br />
<a href="art.html">ART</a><br />
<a href="avantgarde.html">AVANTGARDE</a><br />
</div>';

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $categories);
$xpath = new DOMXpath( $doc);

// find all anchor tags within the <div id="categories"> tag
$categorylinks = $xpath->query('//div[@id="categories"]/a');

// loop through the links and echo the link text
foreach($categorylinks as $link)
{
    echo $link->textContent .'<br />';
}

Assuming that http://www.mywebsiteforexample.com/categories.html has a div called 'categories en', here's what I tried, and for some reason it still doesn't work:

<?php
$url = 'http://www.mywebsiteforexample.com/categories.html';
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $url);
$xpath = new DOMXpath( $doc);


$categorylinks = $xpath->query('//div[@id="categories en"]/a'); //please notice the space in the div's name - maybe that causes any trouble

// loop through the links and echo the link text
foreach($categorylinks as $link)
{
    echo $link->textContent .'<br />';
}

?>

You need to pass the url to file_get_contents first.

$contents = file_get_contents($url); // load the html into the variable

libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML($contents); // pass in the html
...
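
Putting the two steps together, here is a hedged sketch that also checks for a failed fetch, since file_get_contents returns false on failure. The helper name fetch_link_texts and the temporary file standing in for the remote page are assumptions for illustration, not something from this thread.

```php
<?php
// Hypothetical helper: load a page (URL or local path) and return the
// trimmed text of every node matched by $query.
function fetch_link_texts(string $source, string $query): array
{
    $contents = @file_get_contents($source); // false on failure
    if ($contents === false || $contents === '') {
        return []; // nothing to parse
    }

    libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->loadHTML($contents); // pass in the HTML itself, not the URL
    $xpath = new DOMXPath($doc);

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// A temporary local file stands in for the remote page so the sketch runs offline.
$tmp = tempnam(sys_get_temp_dir(), 'cat');
file_put_contents($tmp, '<div id="categories"><a href="art.html">ART</a><br /></div>');

$result = fetch_link_texts($tmp, '//div[@id="categories"]/a');
print_r($result);
unlink($tmp);
```

With a real site you would pass the page URL as the first argument instead of the temporary file.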

well.. unfortunately it still doesn't work.. :\

<?php
$url = 'http://www.mywebsiteforexample.com/categories.html';
$contents = file_get_contents($url);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $contents);
$xpath = new DOMXpath( $doc);


$categorylinks = $xpath->query('//div[@id="categories en"]/a'); //please notice the space in the div's name - maybe that causes any trouble

// loop through the links and echo the link text
foreach($categorylinks as $link)
{
    echo $link->textContent .'<br />';
}

?>

<!-- start of categories list -->
<div class="categories en"> </div>
<a href="../agriculture.html" target="_blank">agriculture</a><br />
<a href="../avantgarde.html" target="_blank">avantgarde</a><br />
<a href="../azyx.html" target="_blank">azyx</a><br />

Is it right that you close the div tag as soon as you open it? The closing div needs to go after the anchor tags:

<!-- start of categories list -->
<div class="categories en"> <!-- open div -->
<a href="../agriculture.html" target="_blank">agriculture</a><br />
<a href="../avantgarde.html" target="_blank">avantgarde</a><br />
<a href="../azyx.html" target="_blank">azyx</a><br />
</div> <!-- close div -->
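
One more thing worth checking, going by the snippets posted above: the query looks for id="categories en", but the markup uses class="categories en", so that query would match nothing even with the div closed correctly. Here is a sketch of matching on the class attribute instead, assuming the class string is exactly "categories en" (the markup is copied from the post above):

```php
<?php
$html = '<div class="categories en">
<a href="../agriculture.html" target="_blank">agriculture</a><br />
<a href="../avantgarde.html" target="_blank">avantgarde</a><br />
<a href="../azyx.html" target="_blank">azyx</a><br />
</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// @class (not @id) must equal the full attribute string, space included
$byClass = $xpath->query('//div[@class="categories en"]/a');
// the @id query finds nothing, since the div has no id attribute
$byId = $xpath->query('//div[@id="categories en"]/a');

$names = [];
foreach ($byClass as $link) {
    $names[] = $link->textContent;
}

echo $byId->length, ' match(es) by id, ', count($names), ' by class', PHP_EOL;
echo implode(', ', $names), PHP_EOL;
```

If the order or number of classes can vary, contains(concat(' ', normalize-space(@class), ' '), ' categories ') is a common XPath 1.0 idiom for matching a single class name.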