Given string $html, how can I:

1) Tell that it is valid HTML?

2) Count the p.p elements for each data-id value (i.e. data-id 1 occurs once, 2 occurs twice, 3 occurs once)?

$html = <<<EOD
<p>Hello</p>
<div>
 <p class="p" data-id="3"></p>
 <p class="p" data-id="2"></p>
</div>
<div>
 <p class="p" data-id="1"></p>
 <div>
  <p class="p" data-id="2"></p>
 </div>
</div>
EOD;

 


does this help?

$html = '<html>';

$html .= <<<EOD
<p>Hello</p>
<div>
 <p class="p" data-id="3"></p>
 <p class="p" data-id="2"></p>
</div>
<div>
 <p class="p" data-id="1"></p>
 <div>
  <p class="p" data-id="2"></p>
 </div>
</div>
EOD;

$html .= '</html>';

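// Parse the markup as XML; this only works if it is well-formed.
$x = simplexml_load_string($html);

// Start every expected data-id (1 to 4) at zero.
$counts = array_fill_keys(range(1,4),0);

// Count each <p class="p"> under its data-id.
foreach ($x->xpath('//p[@class="p"]') as $p) {
    $id = intval($p['data-id']);
    $counts[$id]++;
}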

echo '<pre>' . print_r($counts, 1) . '</pre>';

To validate the HTML... well, you know HTML is rather flexible. "string" is valid. "<p>string" is valid. Some stuff you might consider invalid is still acceptable to browsers.

 

I'm thinking (a) just load it with anything in PHP that supports HTML and see if it complains, or (b) try Tidy. Maybe the real question is less "valid or invalid" and more whether the markup is supposed to be valid already, so that you can just fix it if it has minor errors?
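As a rough illustration of option (a), here is a minimal sketch that loads the string with DOMDocument::loadHTML() and collects whatever libxml reports instead of letting it raise warnings. The helper name htmlParseErrors() and the sample strings are made up for this example, and which problems get reported depends on the libxml2 version PHP was built against, so treat the result as a hint rather than a verdict on validity.

```php
<?php
// Hypothetical helper: returns the libxml messages raised while parsing $html as HTML.
function htmlParseErrors(string $html): array
{
    // Buffer parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Assumed sample input: one tidy fragment, one with crossed closing tags.
$good = '<div><p class="p" data-id="1">Hello</p></div>';
$bad  = '<div><p class="p" data-id="1">Hello</div></p>';

foreach ([$good, $bad] as $html) {
    $errors = htmlParseErrors($html);
    echo $errors ? implode("\n", $errors) : 'no parser errors', "\n";
}
```

An empty array only means the parser did not complain; it is not a conformance check.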

How on earth did you even end up with this weird problem? What's the overall goal you're trying to achieve?

 

The second part sounds like you're webscraping somebody else's site, but then why would you validate the markup? To file a complaint if it's invalid? Or is this your markup? Then why would you choose HTML as your data format? That's a terrible choice compared to pretty much anything else.

Regarding loading it with anything in PHP that supports HTML, do you mean something like simplexml_load_string()?  Below is the output of a slightly modified version of Barand's script (changed <p> to <xp>).  I suppose I can set a new error handler to "catch" the warning, and then restore the previous error handler afterwards.

Warning: simplexml_load_string(): Entity: line 1: parser error : Opening and ending tag mismatch: xp line 1 and p in /var/www/application/lib/testing/parse.php on line 20
Warning: simplexml_load_string(): <html><xp>Hello</p> in /var/www/application/lib/testing/parse.php on line 20
Warning: simplexml_load_string(): ^ in /var/www/application/lib/testing/parse.php on line 20
Fatal error: Call to a member function xpath() on a non-object in /var/www/application/lib/testing/parse.php on line 24
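Rather than swapping error handlers, one alternative (not what the script above does) is to ask libxml to buffer its errors with libxml_use_internal_errors() and read them back with libxml_get_errors(). A minimal sketch, assuming PHP 7.1 or later; loadStrict() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: parse $markup as XML.
// Returns [SimpleXMLElement or null, list of error strings].
function loadStrict(string $markup): array
{
    // Buffer libxml errors so no PHP warnings are raised.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_string($markup);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Compare with false explicitly: an empty element is also "falsy".
    return [$xml === false ? null : $xml, $errors];
}

// Same mismatched tag as in the warning output above.
[$xml, $errors] = loadStrict('<html><xp>Hello</p></html>');

if ($xml === null) {
    echo "Rejected:\n" . implode("\n", $errors) . "\n";
} else {
    echo "Parsed fine\n";
}
```

Checking the return value before calling xpath() also avoids the fatal error. Keep in mind that this parses the string as XML, so legal HTML such as an unclosed <br> is rejected as well.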

I've never used tidy() before.  How do you envision this working?
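For what it's worth, a Tidy-based check could look roughly like the sketch below. It assumes the optional tidy extension is installed; tidyReport() and the config values are illustrative choices, not something prescribed in this thread.

```php
<?php
// Hypothetical helper: run $html through Tidy and report what it found.
// Returns null when the tidy extension is not available.
function tidyReport(string $html): ?array
{
    if (!extension_loaded('tidy')) {
        return null;
    }

    $config = [
        'show-body-only' => true, // treat the input as a fragment, output only the body content
        'output-html'    => true, // emit HTML rather than XHTML/XML
    ];

    $tidy = new tidy();
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();

    return [
        'errors'   => tidy_error_count($tidy),   // problems Tidy could not fix
        'warnings' => tidy_warning_count($tidy), // problems Tidy repaired
        'log'      => (string) $tidy->errorBuffer,
        'repaired' => tidy_get_output($tidy),
    ];
}

// Assumed sample input with an unclosed <p>.
$report = tidyReport('<p class="p" data-id="1">Hello');

if ($report === null) {
    echo "tidy extension not installed\n";
} else {
    printf("%d error(s), %d warning(s)\n", $report['errors'], $report['warnings']);
    echo $report['repaired'], "\n";
}
```

One possible policy is to reject input when 'errors' is non-zero and store the repaired markup otherwise.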

 

Another option might be the W3C validator API (https://validator.w3.org/docs/api.html)?  I don't know how it would work yet.

 

Thanks


I'm creating a simple Drupal module which allows the site's administrator to add multiple HTML strings which will be stored in a DB and later displayed on the site.  Special elements identified by a given class (I used class "p", but in practice will use something more descriptive) will be replaced with some other content (TBD either client-side using JavaScript or server-side using PHP), and that content will be based on the data-id value.  There is also a record corresponding to each data-id value, and I wish to prevent that record from being deleted should an element with that data-id value exist in any of the saved HTML strings.

 

So, I wish to validate the user's HTML, and wish to determine (i.e. count) whether an element with a given data-id attribute exists in any of the HTML strings.

 

Concerns?
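The "is this data-id still in use" check described above could be sketched as follows, assuming PHP 7.1 or later. countByDataId() and canDeleteRecord() are hypothetical names, the two snippets stand in for rows loaded from the DB, and the class name is assumed to be a plain token with no quotes in it.

```php
<?php
// Hypothetical helper: count elements carrying $class, grouped by data-id,
// across all stored HTML snippets.
function countByDataId(array $snippets, string $class): array
{
    $counts = [];
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet

    // Match $class as one token of a space-separated class attribute.
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")][@data-id]',
        $class
    );

    foreach ($snippets as $html) {
        if (trim($html) === '') {
            continue; // loadHTML() does not accept empty input
        }
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        $xpath = new DOMXPath($doc);

        foreach ($xpath->query($query) as $node) {
            $id = $node->getAttribute('data-id');
            $counts[$id] = ($counts[$id] ?? 0) + 1;
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    ksort($counts);

    return $counts;
}

// Hypothetical guard: a record may be deleted only if nothing references it.
function canDeleteRecord($id, array $counts): bool
{
    return empty($counts[$id]);
}

// Assumed sample data: the markup from the first post, split into two snippets.
$snippets = [
    '<p>Hello</p><div><p class="p" data-id="3"></p><p class="p" data-id="2"></p></div>',
    '<div><p class="p" data-id="1"></p><div><p class="p" data-id="2"></p></div></div>',
];

$counts = countByDataId($snippets, 'p');
print_r($counts); // data-id 1 => 1, 2 => 2, 3 => 1

var_dump(canDeleteRecord(2, $counts)); // bool(false): still referenced
var_dump(canDeleteRecord(4, $counts)); // bool(true): not referenced anywhere
```

Using the HTML parser here means the count still works when a stored snippet is not well-formed XML.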

I've never used tidy() before.  How do you envision this working?

 

I can envision how it will work, and think it is perfect.  I might have some questions regarding doctype and the like, but if necessary I will create a separate post for those.  Thanks
