
[SOLVED] Including External HTML files formatting problem.


ajclarkson


Hi guys,

 

Here is my issue, I hope you can help as this has got me going crazy!

 

I am building an article management system for a website, and I need an easy way for the publishers to get their articles onto the site. All submitted articles come through in Word 2003 format.

 

I decided that it would be best to have the publisher save the article as an HTML file and then upload it. That way all of the formatting remains intact.

 

Then at the appropriate place I include the code:

 

<?php require_once('./articles/articlename.html'); ?> 

 

This does what I need: it includes the article with all bold text etc. intact. However, it does not handle apostrophes; it simply shows a bunch of garbage characters where the symbol should appear. I cannot find any information online on how to avoid this.

 

Can anyone please tell me where I am going wrong with this, or if there is any sort of parsing method to avoid it?

 

While we are on the subject, you guys are the experts, please tell me if my idea of publishing via html files is stupid?  :-\

 

Thanks

 

Adam

 


The apostrophe problem can have many causes. Can you post an example of an article, including the HTML headers and part of the text?

 

Regarding your second question, I think you should consider other ways of publishing your articles online. Depending on the needs of your clients and on how much work you want to put into it, you could use either a more stable format (PDF) or a dynamic format (XML). I would not recommend HTML for storing text.

 

Must be a character encoding problem. Make sure the HTML file is saved in the same encoding as the page that displays it. Otherwise, if your page is in UTF-8, you could try running the contents of the HTML file through utf8_encode():

 

<?php
echo utf8_encode(file_get_contents('./articles/articlename.html'));
?>
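One caveat, offered as a sketch rather than a definitive fix: utf8_encode() assumes the input is ISO-8859-1, and it is deprecated as of PHP 8.2. Word exports are often saved as Windows-1252 instead, where the curly apostrophe is the single byte 0x92, so converting from that encoding explicitly may work better. The helper name article_to_utf8() and the default source encoding are assumptions for illustration; check the charset in the file's own meta tag before relying on it. The mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper (not from the thread): convert an article's bytes
// to UTF-8. The source encoding defaults to Windows-1252, which is an
// assumption; pass the charset the file actually declares.
function article_to_utf8(string $raw, string $from = 'Windows-1252'): string
{
    // mb_convert_encoding(string, to_encoding, from_encoding)
    return mb_convert_encoding($raw, 'UTF-8', $from);
}
```

Usage would look like echo article_to_utf8(file_get_contents('./articles/articlename.html')); on a page that itself is served as UTF-8.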

 

The idea of publishing via HTML files isn't bad, if the HTML is written by hand (i.e. using <strong> for bold text etc.). If the HTML is exported from Word, it's very bad if you ask me. I haven't tried that for many years, but I remember how horribly messed up it was back then.

 

Why not make the publishers write the article directly on the site, in a form, and then save it in a database? If that's not an option, you could also consider the former method, just with BB code instead of HTML, where you'd write something like [bold]bold text[/bold] and then translate that to the proper HTML with PHP. That could be easier for publishers not knowing about HTML.
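A minimal sketch of the BB code idea, assuming a tiny tag set: the function name bbcode_to_html() and the [italic] tag are made up for illustration, and only [bold] comes from the post above. Raw HTML is escaped first so that only the tags produced by the translation reach the page.

```php
<?php
// Hypothetical helper: translate a very small BB-code subset to HTML.
function bbcode_to_html(string $text): string
{
    // Escape any HTML the publisher typed, so it is shown literally.
    $html = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Illustrative tag names; extend the map with whatever tags you need.
    $map = [
        '~\[bold\](.*?)\[/bold\]~is'     => '<strong>$1</strong>',
        '~\[italic\](.*?)\[/italic\]~is' => '<em>$1</em>',
    ];

    return preg_replace(array_keys($map), array_values($map), $html);
}
```

With this sketch, bbcode_to_html('[bold]bold text[/bold]') yields <strong>bold text</strong>, while any literal HTML tags in the input come out escaped.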

Thanks for your response, here is a sample of the MS Word generated HTML file:

 

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 11">
<meta name=Originator content="Microsoft Word 11">
<link rel=File-List href="application%20process%20(2)_files/filelist.xml">
<title>UK medical schools are university-based institutions that generally
provide a programme of preclinical and clinical-based study; depending on the
school and the option of intercalation, undergraduate medical education can
range from four to six years of </title>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
name="country-region"/>
<o:SmartTagType namespaceuri="urn:schemas-microsoft-com:office:smarttags"
name="place"/>
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
 <o:Author>npcsxpuser</o:Author>
 <o:Template>Normal</o:Template>
 <o:LastAuthor>Adam Clarkson</o:LastAuthor>
 <o:Revision>2</o:Revision>
 <o:TotalTime>1</o:TotalTime>
 <o:Created>2008-08-15T10:33:00Z</o:Created>
 <o:LastSaved>2008-08-15T10:33:00Z</o:LastSaved>
 <o:Pages>1</o:Pages>
 <o:Words>620</o:Words>
 <o:Characters>3540</o:Characters>
 <o:Company>University of Durham</o:Company>
 <o:Lines>29</o:Lines>
 <o:Paragraphs>8</o:Paragraphs>
 <o:CharactersWithSpaces>4152</o:CharactersWithSpaces>
 <o:Version>11.9999</o:Version>
</o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:WordDocument>
 <w:PunctuationKerning/>
 <w:ValidateAgainstSchemas/>
 <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
 <w:IgnoreMixedContent>false</w:IgnoreMixedContent>
 <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
 <w:Compatibility>
  <w:BreakWrappedTables/>
  <w:SnapToGridInCell/>
  <w:WrapTextWithPunct/>
  <w:UseAsianBreakRules/>
  <w:DontGrowAutofit/>
  <w:UseFELayout/>
 </w:Compatibility>
 <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" LatentStyleCount="156">
</w:LatentStyles>
</xml><![endif]--><!--[if !mso]><object
classid="clsid:38481807-CA0E-42D2-BF39-B33AF135CC4D" id=ieooui></object>
<style>
st1\:*{behavior:url(#ieooui) }
</style>
<![endif]-->
<style>
<!--
/* Font Definitions */
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;
mso-font-alt:\5B8B\4F53;
mso-font-charset:134;
mso-generic-font-family:auto;
mso-font-format:other;
mso-font-pitch:variable;
mso-font-signature:1 135135232 16 0 262144 0;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;
mso-font-charset:0;
mso-generic-font-family:swiss;
mso-font-pitch:variable;
mso-font-signature:536871559 0 0 0 415 0;}
@font-face
{font-family:"\@SimSun";
panose-1:0 0 0 0 0 0 0 0 0 0;
mso-font-charset:134;
mso-generic-font-family:auto;
mso-font-format:other;
mso-font-pitch:variable;
mso-font-signature:1 135135232 16 0 262144 0;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:SimSun;
mso-fareast-language:ZH-CN;}
p.MsoHeader, li.MsoHeader, div.MsoHeader
{margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
tab-stops:center 207.65pt right 415.3pt;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:SimSun;
mso-fareast-language:ZH-CN;}
p.MsoFooter, li.MsoFooter, div.MsoFooter
{margin:0cm;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
tab-stops:center 207.65pt right 415.3pt;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:SimSun;
mso-fareast-language:ZH-CN;}
p.MsoTitle, li.MsoTitle, div.MsoTitle
{margin:0cm;
margin-bottom:.0001pt;
text-align:center;
mso-pagination:widow-orphan;
font-size:10.0pt;
mso-bidi-font-size:12.0pt;
font-family:Verdana;
mso-fareast-font-family:"Times New Roman";
mso-bidi-font-family:"Times New Roman";
mso-fareast-language:EN-US;
text-decoration:underline;
text-underline:single;}
p
{mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";
mso-fareast-language:EN-US;}
/* Page Definitions */
@page
{mso-footnote-separator:url("application%20process%20\(2\)_files/header.htm") fs;
mso-footnote-continuation-separator:url("application%20process%20\(2\)_files/header.htm") fcs;
mso-endnote-separator:url("application%20process%20\(2\)_files/header.htm") es;
mso-endnote-continuation-separator:url("application%20process%20\(2\)_files/header.htm") ecs;}
@page Section1
{size:595.3pt 841.9pt;
margin:72.0pt 90.0pt 72.0pt 90.0pt;
mso-header-margin:35.4pt;
mso-footer-margin:35.4pt;
mso-footer:url("application%20process%20\(2\)_files/header.htm") f1;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Table Normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";
mso-ansi-language:#0400;
mso-fareast-language:#0400;
mso-bidi-language:#0400;}
</style>
<![endif]-->
</head>

<body lang=EN-GB style='tab-interval:36.0pt'>

<div class=Section1>

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><b><span
style='font-family:Verdana;mso-bidi-font-family:Arial;color:navy'>The application
process<o:p></o:p></span></b></p>

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><b><i><span
style='font-size:10.0pt;font-family:Verdana;mso-bidi-font-family:Arial'><o:p> </o:p></span></i></b></p>

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><b><i><span
style='font-size:10.0pt;font-family:Verdana;mso-bidi-font-family:Arial'>Christopher
Ghazala, Co-founder<o:p></o:p></span></i></b></p>

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><b><i><span
style='font-size:10.0pt;font-family:Verdana;mso-bidi-font-family:Arial'><o:p> </o:p></span></i></b></p>

<p class=MsoTitle style='text-align:justify;text-justify:inter-ideograph'><span
style='text-decoration:none;text-underline:none'>Received May 2008; published
July 2008<o:p></o:p></span></p>

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><span
style='font-size:10.0pt;font-family:Verdana;mso-bidi-font-family:Arial'><o:p> </o:p></span></p>

<p class=MsoNormal style='text-align:justify;text-justify:inter-ideograph'><st1:place
w:st="on"><st1:country-region w:st="on"><span style='font-size:10.0pt;
 font-family:Verdana;mso-bidi-font-family:Arial'>UK</span></st1:country-region></st1:place><span
style='font-size:10.0pt;font-family:Verdana;mso-bidi-font-family:Arial'>
medical schools are university-based institutions that generally provide a
programme of preclinical and clinical-based study; depending on the school and
the option of intercalation, undergraduate medical education can range from
four to six years of study. This article will discuss Chris Ghazala's experiences in the medical schools. It's all personal opinion so bear that in mind!  <o:p></o:p></span></p>


 

I have cut it short here, as the full article is probably not required; the headers and the start of the body are included.

 

With regard to a more stable solution, I would certainly be willing to put the time in. XML is something that would interest me; my only concern is that the publishers who upload the articles may not be comfortable creating an XML document.

 

If you have any suggestions, they would be more than welcome.

 

Thanks

 

 

RE: thebadbad

 

Thanks. The only problem with having the publishers write on the site is that they are supplied with an article that is already in .doc format, so I was trying to think of the easiest method!

I agree with thebadbad: the behaviour of the HTML code generated by Microsoft Word is very unpredictable, while hand-written code or code generated from BB code will be more stable.

 

However, if the articles are already in .doc, it would be very easy to print them to a PDF file and then upload them. It might look better and it is well suited for printing.

 

If you want to go into XML, that goes in the direction proposed by thebadbad.

 

In this context, as the documents are already available as Microsoft .doc or .html, my idea is:

 

- let users upload their file via the web

- when a user uploads a file, let PHP filter all HTML tags from it, except some simple formatting tags (only those you really need)

- maybe replace the formatting tags with simpler ones, in XML or simple HTML style

 

This might require some filtering experiments, but it might be very comfortable for the users.
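The filtering steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested production filter: the function name clean_word_html() and the whitelist of tags are hypothetical, and the regular expressions assume reasonably well-formed Word output. strip_tags() keeps attributes on the tags it allows, so a second pass drops Word's class and style attributes.

```php
<?php
// Hypothetical helper: reduce Word-exported HTML to a few formatting tags.
function clean_word_html(string $html): string
{
    // Keep only the <body> contents when a complete body is present.
    if (preg_match('~<body\b[^>]*>(.*)</body>~is', $html, $m)) {
        $html = $m[1];
    }

    // Drop comments (including Word's conditional comments) and any
    // style/script blocks, so their contents do not leak into the text.
    $html = preg_replace('~<!--.*?-->~s', '', $html);
    $html = preg_replace('~<(style|script)\b[^>]*>.*?</\1>~is', '', $html);

    // Whitelist of simple formatting tags (an assumed set; adjust to taste).
    $html = strip_tags($html, '<p><b><i><strong><em>');

    // Remove the attributes that strip_tags() leaves on the allowed tags.
    $html = preg_replace('~<(p|b|i|strong|em)\b[^>]*>~i', '<$1>', $html);

    return trim($html);
}
```

For a Word paragraph such as <p class=MsoNormal style='text-align:justify'><b><span style='color:navy'>Title<o:p></o:p></span></b></p>, the sketch returns <p><b>Title</b></p>.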

 

Removing all tags except basic formatting ones, say <p> and <b>, seems like a promising idea.

 

If you think the HTML from Word is bad, have you seen the XML it generates? Hehe, that was a nightmare!

 

I will look into filtering with strip_tags to see if I can achieve the desired result. Unfortunately, PDF would not allow me to integrate the article into the site closely enough, so I will offer PDFs as a download option for users.

Hi Guys,

 

Apparently I forgot that the <b> tag would turn my last post bold!

 

I have managed to solve the apostrophe problem by setting the encoding of the site to UTF-8, so that part is solved.

 

Thanks to you both for planting the seed of using strip_tags etc. to get a more stable publishing environment. In the future I may look into XML, but for now this will do.

 

Thanks for your help, once more phpfreaks saves me!
