How to handle poor HTML data in PHP


jwhite68

I am receiving some fairly poor HTML from a client.

 

For example, it contains "· " instead of using <ul><li> tags to create bullets.

Does anyone have any suggestions for how I can convert this kind of content to <ul><li> tags, given a series of sentences each preceded by "· "?

Hi,

 

What you'd need to do is use a regular expression to replace each "· " with a <li> tag, and the next newline or line break with a </li>. The trickier part is placing the <ul> and </ul> tags before the first and after the last item in a block of bullets.
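A minimal sketch of that idea, in case it helps. The function name bulletsToList is hypothetical (it is not from any library), and the code assumes the input is valid UTF-8 and that every bullet item ends with a <br> tag, as in typical pasted content. It matches each run of consecutive bullet items in one pass, so the <ul> and </ul> tags land around the whole block rather than around each item:

```php
<?php
// Hypothetical helper, not from the thread or any library.
// Assumes $html is valid UTF-8 and each bullet item ends with a <br> tag.
function bulletsToList(string $html, string $bullet = '·'): string
{
    $b = preg_quote($bullet, '/');
    // One item: bullet, optional whitespace or &nbsp;, text up to the next tag, then <br>.
    $item = $b . '(?:\s|&nbsp;)*([^<]*?)\s*<br\s*\/?>\s*';

    $result = preg_replace_callback(
        '/(?:' . $item . ')+/iu', // a run of one or more consecutive items
        function (array $run) use ($item): string {
            // Pull each item's text out of the matched run.
            preg_match_all('/' . $item . '/iu', $run[0], $items);
            $list = '<ul>';
            foreach ($items[1] as $text) {
                $list .= '<li>' . $text . '</li>';
            }
            return $list . '</ul>';
        },
        $html
    );

    // preg_replace_callback() returns null on error (for example, invalid UTF-8 input).
    return $result ?? $html;
}

$sample = 'Intro<BR>· first item <BR>· second item <BR>Outro';
echo bulletsToList($sample);
// Intro<BR><ul><li>first item</li><li>second item</li></ul>Outro
```

Text that is not part of a bullet run, including any surrounding tags, is left untouched, so the result may still need tidying if the original markup nests tags badly.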

 

For reference, the WikiParser class might be worth a look. It can be downloaded from http://code.blitzaffe.com/pages/phpclasses/files/wiki_parser_52-13/view/1. Basically, it can convert lines beginning with asterisks into proper <ul>-style HTML lists. It does a bunch of other conversions as well. Take a look at it and strip out the bits you don't need.

 

Hope this helps,

Darren.

Here's an example with the bullet + nbsp I mentioned:

 

$desc3 = "<P align=left><FONT size=2><FONT color=#ff0000>ABC1 1 is a modern luxury complex, located just 50 m away from the beach strip in the heart of town<BR></FONT></FONT><FONT size=2>· 6-storey complex, 5 sections <BR>· solid brick-built structure <BR>· each residential section has a separate lobby, reception, and lift <BR>· flats from 40sq.m -190 sq.m <BR>· maisonettes from 190 sq.m - 315 sq.m <BR>· on-site parking facilities in a 2-level basement - garage sections, parking space <BR>· each residential section is thermo-insulated <BR>· stone-panelled common parts <BR>· modern lifts <BR>· flooring: terracota, laminate <BR>· 3-layer window and door frames from the USA<BR>· 24-hour security service <BR></P></FONT>";

I was able to resolve this specific issue with:

 

$output = preg_replace("/·/", "-", $data);

 

This replaces the bullet symbol, which does not display in UTF-8 (it shows up as ?), with a hyphen instead. Does anyone know the code for a bullet point symbol that will display in UTF-8?
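A symbol showing up as ? on a UTF-8 page often means the source bytes are in a single-byte encoding. The sketch below assumes (this is a guess, not something the thread confirms) that the client's data is ISO-8859-1, where the middle dot is the single byte 0xB7, and uses iconv() to re-encode it before any further processing:

```php
<?php
// Assumption: the client's data arrives as ISO-8859-1 (Latin-1),
// where the middle dot is the single byte 0xB7.
$data = "\xB7 modern lifts";

// Re-encode to UTF-8; the middle dot becomes the two bytes 0xC2 0xB7.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $data);

echo bin2hex(substr($utf8, 0, 2)); // c2b7
```

If the data turns out to be valid UTF-8 already, this step should be skipped, since converting twice would corrupt the text.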

Can anyone spot why this doesn't work:

$output = preg_replace('/·/', '\x{2022}/u', $data);

 

The bullet point symbol is supposed to be hex 2022, and I understood that /u is needed to mark it as a Unicode value. But this just displays the text as \x{2022}/u. What am I doing wrong?
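A likely explanation: the second argument of preg_replace() is a plain replacement string, not a pattern, so \x{2022} is not interpreted there, and the /u modifier only has meaning after the closing delimiter of the pattern in the first argument. Single-quoted PHP strings do not process \x escapes either, so the text is inserted literally. A sketch of two possible fixes, assuming $data is valid UTF-8 and the script itself is saved as UTF-8 (the "\u{2022}" escape needs PHP 7 or later):

```php
<?php
$data = '· modern lifts <BR>· 24-hour security service <BR>';

// The /u modifier goes on the pattern; the replacement is an ordinary string.
// "\u{2022}" in double quotes is the UTF-8 encoded bullet character (PHP 7+).
$output = preg_replace('/·/u', "\u{2022}", $data);

// Alternative that sidesteps encoding concerns: emit the HTML entity instead.
$entity = preg_replace('/·/u', '&bull;', $data);

echo $output, "\n", $entity, "\n";
```

On older PHP versions the same character can be written as the byte sequence "\xE2\x80\xA2" in a double-quoted string.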

Archived

This topic is now archived and is closed to further replies.
