
Need help fixing HTML when scraping with PHP Simple HTML DOM Parser



require_once 'phpSimpleHtmlDomClass.php';

$html = str_get_html('<div>
<div class="man">Name: madac</div>
<div class="man">Age: 18
<div class="man">Class: 12</div>
</div>');

$name = $html->find('div[class="man"]', 0)->innertext;
$age  = $html->find('div[class="man"]', 1)->innertext;
$cls  = $html->find('div[class="man"]', 2)->innertext;

I want to get the text from each div with class="man", but it doesn't work because the closing </div> tag is missing from the second class="man" div. Please help me fix this.

Thanks in advance.


The only difference the missing </div> makes is that the second find() returns everything up to the next unmatched </div>, nested div included, thus giving

$name = "Name: madac"
$age  = "Age: 18    <div class='man'>Class: 12</div> "  
$cls  = "Class: 12"

So you just need to find and trim off the excess "<div> ... </div>".

 

Perhaps...

$html = str_get_html('<div> 
<div class="man">Name: madac</div> 
<div class="man">Age: 18  
<div class="man">Class: 12</div> 
</div>'); 

$name = trim_html($html->find('div[class="man"]', 0)->innertext); 
$age  = trim_html($html->find('div[class="man"]', 1)->innertext); 
$cls  = trim_html($html->find('div[class="man"]', 2)->innertext);

// Return the text before the first "<", with surrounding whitespace removed.
function trim_html($str)
{
    if (($p = strpos($str, '<')) !== false) {
        $str = substr($str, 0, $p);
    }
    return trim($str);
}
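
To check the helper without loading the parser, it can be run directly against the three innertext values listed earlier in this post. This is only a sketch: the strings are hard-coded for illustration rather than scraped, and the function body is copied from the code above.

```php
<?php
// Same helper as above: keep only the text before the first '<'.
function trim_html($str)
{
    if (($p = strpos($str, '<')) !== false) {
        $str = substr($str, 0, $p);
    }
    return trim($str);
}

// Hard-coded stand-ins for the values find(...)->innertext returns.
$samples = [
    "Name: madac",
    "Age: 18    <div class='man'>Class: 12</div> ",
    "Class: 12",
];

foreach ($samples as $s) {
    echo trim_html($s), PHP_EOL;
}
```

This prints "Name: madac", "Age: 18" and "Class: 12", one per line.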

 

Edited by Barand

2 hours ago, requinix said:

A website that you control? Someone else's? Can you tell them to fix their markup so that it's, you know, syntactically valid HTML?

It's a government site, so I have no access to it. :(


I've tried your code. It worked amazingly well for me. Thank you so much.


You really need to analyze the source data to identify the different data and types of issues that can exist in order to determine the appropriate solution. For example, the solution @Barand provided would be acceptable if:

1) The only issues are missing closing tags

2) None of the "data" contains a less-than sign (<)

If there are other types of issues or if data can contain the < symbol then a different solution is in order.
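
As one possible direction for the second case, here is a rough sketch of a variant that cuts only at a < that looks like the start of a tag (a < followed by a letter or a slash), so a bare < inside the data survives. The name trim_at_tag and the sample strings are made up for illustration. It is still a heuristic: data such as "a<b" with no space after the < would be cut just the same.

```php
<?php
// Hypothetical variant of trim_html(): cut only at a '<' that starts a tag,
// i.e. one immediately followed by a letter or by '/'.
function trim_at_tag(string $str): string
{
    // Split once at the first tag-like '<'; with no match the whole string comes back.
    $parts = preg_split('~</?[a-zA-Z]~', $str, 2);
    return trim($parts[0]);
}

echo trim_at_tag("Age: 18  <div class='man'>Class: 12</div> "), PHP_EOL; // Age: 18
echo trim_at_tag("Score: 3 < 5 <div class='man'>x</div>"), PHP_EOL;      // Score: 3 < 5
```

Whether this is good enough still depends on what the real pages contain, so it is worth checking a sample of the source data first.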
