Need help fixing HTML when scraping with PHP Simple HTML DOM Parser

shaadamin44 · June 17, 2021

require_once 'phpSimpleHtmlDomClass.php';

$html = str_get_html('<div>
<div class="man">Name: madac</div>
<div class="man">Age: 18
<div class="man">Class: 12</div>
</div>');

$name = $html->find('div[class="man"]', 0)->innertext;
$age  = $html->find('div[class="man"]', 1)->innertext;
$cls  = $html->find('div[class="man"]', 2)->innertext;

I want to get the text from each div with class="man", but it doesn't work because the second one (the "Age" div) is missing its closing </div> tag. Please help me fix this.

Thanks in advance.

requinix · June 17, 2021

Add the closing div tag? Where is the HTML coming from?

shaadamin44 · June 17, 2021

16 minutes ago, requinix said:

Add the closing div tag? Where is the HTML coming from?

From a website

requinix · June 17, 2021

30 minutes ago, shaadamin44 said:

From a website

A website that you control? Someone else's? Can you tell them to fix their markup so that it's, you know, syntactically valid HTML?

Barand · June 17, 2021

The only difference that missing </div> makes is that the second find() returns the text up to the next </div>, thus giving

$name = "Name: madac"
$age  = "Age: 18    <div class='man'>Class: 12</div> "  
$cls  = "Class: 12"

So you just need to look for and trim off the excess "<div> ... </div>"

Perhaps...

$html = str_get_html('<div> 
<div class="man">Name: madac</div> 
<div class="man">Age: 18  
<div class="man">Class: 12</div> 
</div>'); 

$name = trim_html($html->find('div[class="man"]', 0)->innertext); 
$age  = trim_html($html->find('div[class="man"]', 1)->innertext); 
$cls  = trim_html($html->find('div[class="man"]', 2)->innertext);

// Keep only the text before the first "<", dropping any nested markup.
function trim_html($str)
{
    if ( ($p = strpos($str, '<'))  !== false) {
        $str = substr($str, 0, $p);
    }
    return trim($str);
}

Edited June 17, 2021 by Barand

shaadamin44 · June 17, 2021

2 hours ago, requinix said:

A website that you control? Someone else's? Can you tell them to fix their markup so that it's, you know, syntactically valid HTML?

That's a government site. I've no access to the site.

requinix · June 17, 2021

Regular expressions are also an option. Once you've drilled down as far into the HTML as you can, you can much more safely look for things like "Age: <number>".
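As a rough sketch of that idea, assuming the fragment you end up with looks like the innertext shown earlier in the thread: the helper name extract_age() and the pattern are only illustrative, not something from the library, and the pattern would need adjusting to the real page.

```php
<?php
// Assumed input: the innertext of the second div.man, including the
// nested markup left over from the missing closing tag.
$fragment = 'Age: 18  <div class="man">Class: 12</div> ';

// Hypothetical helper: returns the number that follows "Age:",
// or null when no such label/number pair is present.
function extract_age(string $text): ?int
{
    if (preg_match('/Age:\s*(\d+)/', $text, $m)) {
        return (int) $m[1];
    }
    return null;
}

$age = extract_age($fragment);
```

Because the pattern looks for the label and the digits directly, it does not care whether the surrounding tags are closed properly.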

shaadamin44 · June 17, 2021

@Barand, I tried your code and it worked perfectly for me. Thank you so much.

Psycho · June 17, 2021

You really need to analyze the source data to identify the different data and types of issues that can exist in order to determine the appropriate solution. For example, the solution @Barand provided would be acceptable if:

1) The only issues are missing closing tags

2) None of the "data" contains a less-than sign (<)

If there are other types of issues or if data can contain the < symbol then a different solution is in order.
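For instance, here is one possible variation for the second case, offered only as a sketch. It assumes the stray content always starts with something that looks like a tag (a "<" followed directly by a letter or a slash), and the function name trim_at_tag() is made up for this example.

```php
<?php
// Hypothetical variant of trim_html(): cut at the first "<" that is
// immediately followed by a letter or "/", so a bare "<" inside the
// data (e.g. "3 < 5") is left alone.
function trim_at_tag(string $str): string
{
    // The lookahead keeps the "<" test from consuming the next character;
    // limit 2 means we only split at the first tag-like "<".
    $parts = preg_split('~<(?=[a-zA-Z/])~', $str, 2);
    return trim($parts[0]);
}

$sample = 'Score: 3 < 5 <div class="man">Class: 12</div> ';
$clean  = trim_at_tag($sample);
```

This still breaks if the data itself contains tag-like text such as "a<b", which is why checking the real source data first matters.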
