
How to handle poor HTML data in PHP


jwhite68


I am receiving some fairly poor HTML from a client.

 

For example, it contains "· " instead of <ul><li> tags to create bullets.

Does anyone have suggestions for how I can convert this kind of markup to <ul><li> tags, given a series of sentences each preceded by "· "?


Hi,

 

What you'd need to do is use a regular expression to replace each "· " with a <li> tag and the next newline or <BR> with a </li>. The trickier part is placing the <ul> and </ul> tags before the first and after the last item of each block of bullets.
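One way to sketch that approach is to match each whole run of bullet lines in one pass, so the <ul> wrapper is added once per block. The function name `bulletsToList` is hypothetical, and the code assumes the input is valid UTF-8 with items separated by <BR> tags, as in the sample data in this thread. It is a starting point, not a general HTML cleaner.

```php
<?php
// Hypothetical helper (not from the thread): turns runs of "· item <BR>"
// into a <ul> list. Assumes $html is valid UTF-8.
function bulletsToList(string $html): string
{
    // One item: middle dot (U+00B7), optional spaces / NBSP / &nbsp;,
    // then the item text up to the next <BR>.
    $item = '\x{00B7}(?:\s|\x{00A0}|&nbsp;)*(.*?)\s*<br\s*\/?>\s*';

    // Match a whole run of consecutive items so <ul> wraps the block once.
    $result = preg_replace_callback(
        '/(?:' . $item . ')+/isu',
        function (array $run) use ($item): string {
            // Pull the individual item texts back out of the matched run.
            preg_match_all('/' . $item . '/isu', $run[0], $m);
            $lis = array_map(function ($text) {
                return '<li>' . trim($text) . '</li>';
            }, $m[1]);
            return '<ul>' . implode('', $lis) . '</ul>';
        },
        $html
    );

    // preg_replace_callback returns null on error (e.g. invalid UTF-8);
    // fall back to the untouched input in that case.
    return $result ?? $html;
}
```

The surrounding <P> and <FONT> markup is left untouched, so the result may still need tidying before it is valid HTML.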

 

For reference, the WikiParser class might be worth a look. It can be downloaded from http://code.blitzaffe.com/pages/phpclasses/files/wiki_parser_52-13/view/1. It converts lines beginning with asterisks into proper HTML <ul> lists, along with a number of other conversions. Take a look at it and strip out the bits you don't need.

 

Hope this helps,

Darren.


Here's an example with the bullet + nbsp I mentioned:

 

$desc3 = "<P align=left><FONT size=2><FONT color=#ff0000>ABC1 1 is a modern luxury complex, located just 50 m away from the beach strip in the heart of town<BR></FONT></FONT><FONT size=2>· 6-storey complex, 5 sections <BR>· solid brick-built structure <BR>· each residential section has a separate lobby, reception, and lift <BR>· flats from 40sq.m -190 sq.m <BR>· maisonettes from 190 sq.m - 315 sq.m <BR>· on-site parking facilities in a 2-level basement - garage sections, parking space <BR>· each residential section is thermo-insulated <BR>· stone-panelled common parts <BR>· modern lifts <BR>· flooring: terracota, laminate <BR>· 3-layer window and door frames from the USA<BR>· 24-hour security service <BR></P></FONT>";


I was able to resolve this specific issue with:

 

$output = preg_replace("/·/", "-", $data);

 

This replaces the bullet symbol, which does not display on my UTF-8 page (it shows as ?), with a hyphen. Does anyone know the code for a bullet point symbol that will display in UTF-8?
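A "?" in place of the symbol often means the text was not UTF-8 to begin with. The thread does not say which encoding the client uses, so the sketch below assumes Windows-1252, where the middle dot is the single byte 0xB7. It re-encodes the text to UTF-8 and then swaps the middle dot for the bullet character U+2022, or for the `&bull;` HTML entity. It needs the mbstring extension.

```php
<?php
// Assumption: the client's text is Windows-1252 (not confirmed in the thread).
$data = "\xB7 modern lifts";   // 0xB7 is the middle dot in Windows-1252

// Re-encode to UTF-8; the middle dot becomes the two bytes C2 B7.
$utf8 = mb_convert_encoding($data, 'UTF-8', 'Windows-1252');

// U+2022 BULLET is the three bytes E2 80 A2 in UTF-8.
$bullet = "\xE2\x80\xA2";
$output = str_replace("\xC2\xB7", $bullet, $utf8);

// Alternative that sidesteps the page encoding: use the HTML entity.
$entity = str_replace("\xC2\xB7", '&bull;', $utf8);
```

If the source turns out to be in a different encoding, only the third argument to `mb_convert_encoding` should need to change.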


Can anyone spot why this doesn't work:

$output = preg_replace('/·/', '\x{2022}/u', $data);  

 

The bullet point symbol is supposed to be hex 2022, and I understood that /u is needed to mark it as a Unicode value. But this just displays the text as \x{2022}/u. What am I doing wrong?
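A likely explanation is that `/u` is a pattern modifier, so it only has meaning after the closing delimiter of the first argument. Inside the replacement argument it is ordinary text. Likewise, `\x{2022}` is PCRE pattern syntax and is not interpreted in a replacement string, so the replacement has to be the actual character. A minimal sketch, assuming `$data` is already valid UTF-8 and PHP 7.0 or later (for the `"\u{...}"` escape):

```php
<?php
$data = "\u{00B7} modern lifts";   // sample input, assumed to be UTF-8

// /u belongs on the pattern. \x{00B7} is resolved by PCRE itself,
// so single quotes are fine for the pattern.
// The replacement is a real character: "\u{2022}" in double quotes.
$output = preg_replace('/\x{00B7}/u', "\u{2022}", $data);

echo $output, PHP_EOL;
```

With the `/u` modifier, `preg_replace` returns null when the subject is not valid UTF-8, so a null result is a hint that the input needs re-encoding first.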

