Validating HTML and counting elements


NotionCommotion

Given a string $html, how can I:

1) Tell whether it is valid HTML?

2) Count the p.p elements for each data-id value (for the markup below: 1=1, 2=2, 3=1)?

$html = <<<EOD
<p>Hello</p>
<div>
 <p class="p" data-id="3"></p>
 <p class="p" data-id="2"></p>
</div>
<div>
 <p class="p" data-id="1"></p>
 <div>
  <p class="p" data-id="2"></p>
 </div>
</div>
EOD;

 


Does this help?

$html = '<html>';

$html .= <<<EOD
<p>Hello</p>
<div>
 <p class="p" data-id="3"></p>
 <p class="p" data-id="2"></p>
</div>
<div>
 <p class="p" data-id="1"></p>
 <div>
  <p class="p" data-id="2"></p>
 </div>
</div>
EOD;

$html .= '</html>';

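// simplexml_load_string() accepts only well-formed XML; it returns false otherwise
$x = simplexml_load_string($html);

// Start every expected data-id (1 to 4 here) at zero
$counts = array_fill_keys(range(1, 4), 0);

// Tally each <p class="p"> by its data-id attribute
foreach ($x->xpath('//p[@class="p"]') as $p) {
    $id = intval($p['data-id']);
    if (!isset($counts[$id])) {
        $counts[$id] = 0;   // data-id outside the expected range
    }
    $counts[$id]++;
}

echo '<pre>' . print_r($counts, true) . '</pre>';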


To validate the HTML... well, you know HTML is rather flexible. "string" is valid. "<p>string" is valid. Some stuff you might consider invalid is still acceptable to browsers.

 

I'm thinking (a) just load it with anything in PHP that supports HTML and see if it complains, or (b) try Tidy. Maybe the question should be less "is it valid or invalid?" and more whether it's supposed to be valid already, so that you can just fix it if it has minor errors?
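Option (a) could look roughly like the sketch below, which loads the string with DOMDocument and returns whatever libxml complained about. The function name htmlParseErrors is made up for illustration. Note that libxml's HTML parser is lenient and its messages vary between libxml versions, so an empty result means "nothing was flagged" rather than "valid HTML".

```php
<?php
// Hypothetical helper (not from the thread): returns libxml's parse
// messages for an HTML string. An empty array means nothing was flagged.
function htmlParseErrors($html)
{
    if (trim($html) === '') {
        return array();   // loadHTML() rejects empty input, so skip it
    }

    // Collect parse errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Leave libxml in the state the caller had it in
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

print_r(htmlParseErrors('<p>Hello</p>'));
print_r(htmlParseErrors('<xp>Hello</p>'));
```

Whether a non-empty result should block saving or only trigger a repair step is a policy decision this sketch leaves open.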


How on earth did you even end up with this weird problem? What's the overall goal you're trying to achieve?

 

The second part sounds like you're webscraping somebody else's site, but then why would you validate the markup? To file a complaint if it's invalid? Or is this your markup? Then why would you choose HTML as your data format? That's a terrible choice compared to pretty much anything else.


To validate the HTML... well, you know HTML is rather flexible. "string" is valid. "<p>string" is valid. Some stuff you might consider invalid is still acceptable to browsers.

 

I'm thinking (a) just load it with anything in PHP that supports HTML and see if it complains, or (b) try Tidy. Maybe the question should be less "is it valid or invalid?" and more whether it's supposed to be valid already, so that you can just fix it if it has minor errors?

 

With regard to loading it with anything in PHP that supports HTML, do you mean something like simplexml_load_string()? Below is the output of a slightly modified version of Barand's script (I changed the first opening <p> to <xp>). I suppose I could set a new error handler to "catch" the warning and then restore the previous error handler afterwards.

Warning: simplexml_load_string(): Entity: line 1: parser error : Opening and ending tag mismatch: xp line 1 and p in /var/www/application/lib/testing/parse.php on line 20
Warning: simplexml_load_string(): <html><xp>Hello</p> in /var/www/application/lib/testing/parse.php on line 20
Warning: simplexml_load_string(): ^ in /var/www/application/lib/testing/parse.php on line 20
Fatal error: Call to a member function xpath() on a non-object in /var/www/application/lib/testing/parse.php on line 24
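As an alternative to swapping error handlers, libxml can buffer its own errors via libxml_use_internal_errors(), and checking the return value for false avoids the fatal error on xpath(). A minimal sketch follows; the helper name loadXmlQuietly is an assumption for illustration only.

```php
<?php
// Hypothetical helper (not from the thread): returns
// array(SimpleXMLElement or null, list of error messages).
function loadXmlQuietly($xml)
{
    // Buffer libxml errors instead of raising PHP warnings
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $element = simplexml_load_string($xml);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the caller's libxml error mode
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($element === false ? null : $element, $messages);
}

list($x, $errors) = loadXmlQuietly('<html><xp>Hello</p></html>');
if ($x === null) {
    // Mismatched tags make the string ill-formed XML, so we end up here
    echo "Not well-formed:\n" . implode("\n", $errors) . "\n";
}
```

This only tests XML well-formedness, so legal HTML such as an unclosed <p> or <br> would still be rejected.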

I've never used tidy() before.  How do you envision this working?
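One way Tidy might fit in, sketched under the assumption that PHP's tidy extension is installed: parse the fragment, read the error and warning counts, and keep the repaired markup. The helper name tidyReport and the sample input are made up here, and what Tidy reports (for example about data-* attributes) depends on the libtidy version behind the extension.

```php
<?php
// Hypothetical helper (not from the thread). Returns null when the tidy
// extension is unavailable or parsing fails, otherwise an array holding
// the error count, warning count, diagnostics text and repaired markup.
function tidyReport($html)
{
    if (!extension_loaded('tidy')) {
        return null;
    }

    // show-body-only: treat the input as a fragment and emit only body content
    $config = array('show-body-only' => true);

    $tidy = tidy_parse_string($html, $config, 'utf8');
    if ($tidy === false) {
        return null;
    }

    $report = array(
        'errors'   => tidy_error_count($tidy),
        'warnings' => tidy_warning_count($tidy),
        'messages' => (string) $tidy->errorBuffer,
    );

    // Let Tidy fix what it can, then fetch the cleaned-up markup
    $tidy->cleanRepair();
    $report['repaired'] = tidy_get_output($tidy);

    return $report;
}

$report = tidyReport('<div><b>Hello</div>');
if ($report !== null) {
    printf("%d error(s), %d warning(s)\n", $report['errors'], $report['warnings']);
    echo $report['repaired'], "\n";
}
```

A possible policy would be to reject input when the error count is non-zero and to store the repaired markup otherwise, but that is a design choice rather than something Tidy dictates.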

 

Another option may be the W3C validator's API (https://validator.w3.org/docs/api.html). I don't know how that would work yet.

 

Thanks

Edited by NotionCommotion

How on earth did you even end up with this weird problem? What's the overall goal you're trying to achieve?

 

The second part sounds like you're webscraping somebody else's site, but then why would you validate the markup? To file a complaint if it's invalid? Or is this your markup? Then why would you choose HTML as your data format? That's a terrible choice compared to pretty much anything else.

 

I'm creating a simple Drupal module which allows the site's administrator to add multiple HTML strings, which will be stored in a DB and later displayed on the site. Special elements identified by a given class (I used class "p", but in practice will use something more descriptive) will be replaced with some other content (TBD whether client-side using JavaScript or server-side using PHP), and that content will be based on the data-id value. There is also a record corresponding to each data-id value, and I wish to prevent that record from being deleted while an element with that data-id value exists in any of the saved HTML strings.

 

So, I wish to validate the user's HTML, and wish to determine (i.e. count) whether an element with a given data-id attribute exists in any of the HTML strings.
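The counting half of this could be sketched with DOMDocument and DOMXPath, which tolerate ordinary HTML better than SimpleXML does. The function name countMarkedIds and the sample strings are assumptions for illustration. The class test matches whole class tokens, so class="p other" is counted too, and the class name is assumed to be a trusted constant rather than user input.

```php
<?php
// Hypothetical helper (not from the thread): tallies the data-id values of
// elements carrying the marker class across several stored HTML strings.
function countMarkedIds(array $htmlStrings, $class = 'p')
{
    $previous = libxml_use_internal_errors(true);   // keep parse warnings quiet
    $counts = array();

    // Match $class as a whole token in @class and require a data-id attribute.
    // $class is assumed trusted; it is not escaped for XPath.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")][@data-id]',
        $class
    );

    foreach ($htmlStrings as $html) {
        if (trim($html) === '') {
            continue;   // loadHTML() rejects empty input
        }
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        $xpath = new DOMXPath($doc);

        foreach ($xpath->query($query) as $node) {
            $id = $node->getAttribute('data-id');
            $counts[$id] = isset($counts[$id]) ? $counts[$id] + 1 : 1;
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $counts;
}

$stored = array(
    '<p>Hello</p><div><p class="p" data-id="3"></p><p class="p" data-id="2"></p></div>',
    '<div><p class="p" data-id="1"></p><div><p class="p" data-id="2"></p></div></div>',
);
$counts = countMarkedIds($stored);
ksort($counts);
print_r($counts);
```

With such a tally, the record for a given data-id could be treated as deletable only when that id is absent from the result.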

 

Concerns?

