
Parsing DOMDocument and only keeping one link


twittoris


I am trying to take a specific link from my site and place it into my database. I only want links that start with CORPSEARCH.ENTITY_INFORMATION?p_nameid=

 

Can someone point me in the right direction here?

 

 

Code for this is below:

 

 

// make the cURL request to $target_url

$html= curl_exec($ch);

if (!$html) {

echo "<br />cURL error number:" .curl_errno($ch);

echo "<br />cURL error:" . curl_error($ch);

exit;

}

 

// parse the html into a DOMDocument

$dom = new DOMDocument();

@$dom->loadHTML($html);

 

// grab all the links on the page

$xpath = new DOMXPath($dom);

$hrefs = $xpath->evaluate("/html/body//a");

 

for ($i = 0; $i < $hrefs->length; $i++) {

$href = $hrefs->item($i);

$url = $href->getAttribute('href');

$sql="INSERT INTO links(cid, nlink)VALUES('$i','$url')";

$result=mysql_query($sql);

echo $result;

echo $url;

}

Here I have edited it a little and put the script online but it is still spitting out every link on the page.

 

http://empirebuildingsestate.com/table.php

 

I only want to grab links that follow this layout:

 

CORPSEARCH.ENTITY_INFORMATION?p_nameid=3236937&p_corpid=3227476&p_entity_name=%41%72%77%65%6E%20%45%71%75%69%74%69%65%73&p_name_type=%41&p_search_type=%42%45%47%49%4E%53&p_srch_results_page=0

 

$dom = new DOMDocument();

@$dom->loadHTML($html);

 

// grab all the links on the page

$xpath = new DOMXPath($dom);

$hrefs = $xpath->evaluate("/html/body//a");

 

 

for ($i = 0; $i < $hrefs->length; $i++) {

$href = $hrefs->item($i);

$url = $href->getAttribute('href');

preg_match_all(nameid,$url);

$sql="INSERT INTO links(cid, nlink)VALUES('$i','$url')";

$result=mysql_query($sql);

echo $result;

echo $url;

// if successfully insert data into database, displays message "Successful".

if($result){

echo "Successful";

echo "<BR>";

}

 

else {

echo "ERROR";

}

 

echo "<br />Link stored: $url";

}

 

 

 

?>

Use the built-in XPath function starts-with() to select only the links whose href begins with 'CORPSEARCH.ENTITY_INFORMATION'.

 

So change this

$hrefs = $xpath->evaluate("/html/body//a");

To

$hrefs = $xpath->evaluate("/html/body//a[starts-with(@href, 'CORPSEARCH.ENTITY_INFORMATION')]");

Or it can be just this

$hrefs = $xpath->evaluate("//a[starts-with(@href, 'CORPSEARCH.ENTITY_INFORMATION')]");
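To see the filter in isolation, here is a minimal sketch that runs against a small made-up HTML string instead of the cURL response. The sample markup and the variable names ($prefix, $relative, $anywhere) are assumptions for illustration only. It also shows contains() as a fallback, since starts-with() will not match if the page writes the href as an absolute URL.

```php
<?php
// Made-up markup standing in for the page fetched with cURL.
$html = '<html><body>'
      . '<a href="CORPSEARCH.ENTITY_INFORMATION?p_nameid=1">Entity</a>'
      . '<a href="http://example.com/CORPSEARCH.ENTITY_INFORMATION?p_nameid=3">Absolute</a>'
      . '<a href="CORPSEARCH.SELECT_ENTITY">Other</a>'
      . '<a href="/about.html">About</a>'
      . '</body></html>';

// Collect libxml parse warnings internally rather than silencing with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($dom);
$prefix = 'CORPSEARCH.ENTITY_INFORMATION?p_nameid=';

// Links whose href begins with the prefix (relative links only).
$relative = [];
foreach ($xpath->query("//a[starts-with(@href, '$prefix')]") as $a) {
    $relative[] = $a->getAttribute('href');
}

// Links whose href contains the prefix anywhere (also catches absolute URLs).
$anywhere = [];
foreach ($xpath->query("//a[contains(@href, '$prefix')]") as $a) {
    $anywhere[] = $a->getAttribute('href');
}

print_r($relative);
print_r($anywhere);
```

If the real page only uses relative links, starts-with() is the tighter match and is probably the one to keep.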

 

Now your loop will be

for ($i = 0; $i < $hrefs->length; $i++) {
   $href = $hrefs->item($i);
   $url = $href->getAttribute('href');
   
   echo '<p>Found:<br />' . htmlspecialchars($url) . '<br />Adding it to the database... ';
   
   // escape the URL so a quote in it cannot break the query
   $safe_url = mysql_real_escape_string($url);
   $sql = "INSERT INTO links (cid, nlink) VALUES ('$i', '$safe_url')";
   $result = mysql_query($sql);
   
   echo (($result) ? 'Success!' : 'FAIL') . '</p>';
}
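One caveat on the insert itself: the mysql_* functions were removed in PHP 7, so on a current PHP version the loop needs a different database API. Below is a rough sketch of the same insert using PDO with a prepared statement. The storeLinks() helper is hypothetical, and the demo uses an in-memory SQLite database (which assumes the pdo_sqlite extension is available) purely so it can run without a server. For MySQL you would pass a PDO object built with a mysql: DSN instead.

```php
<?php
// Hypothetical helper: inserts each URL with a prepared statement,
// so quotes or other special characters in a URL cannot break the query.
function storeLinks(PDO $pdo, array $urls): int
{
    $stmt = $pdo->prepare('INSERT INTO links (cid, nlink) VALUES (?, ?)');
    $stored = 0;
    foreach ($urls as $i => $url) {
        if ($stmt->execute([$i, $url])) {
            $stored++;
        }
    }
    return $stored;
}

// Demo setup: in-memory SQLite stands in for the real MySQL database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE links (cid INTEGER, nlink TEXT)');

// Made-up URLs; the second contains a quote on purpose.
$urls = [
    'CORPSEARCH.ENTITY_INFORMATION?p_nameid=1',
    "CORPSEARCH.ENTITY_INFORMATION?p_nameid=2&p_entity_name=O'Brien",
];

$count = storeLinks($pdo, $urls);
$rows  = $pdo->query('SELECT nlink FROM links ORDER BY cid')->fetchAll(PDO::FETCH_COLUMN);

echo "Stored $count links\n";
```

In the real script, $urls would be filled from the getAttribute('href') calls in the loop above.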
