What's the best way to get a user's Word doc converted to simple HTML and images?


Hi all,

 

I was just wondering if anybody has any experience of this.

Basically, I'm building a site for a guy and he keeps throwing tonnes of content my way, all Word docs with tables, images, etc.

The site is PHP / MySQL based and when it's handling text it's so easy to use. I want it to preserve formatting if possible and maybe where the images are (aligned left or right at the very least)

 

At the moment I have integrated a JavaScript RTE (Rich Text Editor) called FCKeditor, an earlier version of CKEditor [ http://ckeditor.com/ ]

 

I think FCKeditor is free for commercial projects and CKEditor isn't (could be wrong, but it doesn't matter too much to my main query anyhow).

 

So basically the user will have to copy in each block of text and then upload each image separately and align it left or right or whatever, which is fine by me. But for the moment I need to put some of these docs in myself, and it can be very, very time-consuming. I'm looking into ways to get his Word documents up in the most automatic way possible.

 

Here are a few of my options

 

1. wvWare  - wvware.sourceforge.net/

I'm not exactly sure how to install it but I do have a VPS so it should be at least possible. Does anybody have any experience using this ? I suppose the question I'm asking is: Does it work?

 

2. There are a few RTF to html converters here etc.: http://www.w3.org/Tools/Word_proc_filters.html

but nothing that seems to do the job for me so far. I'm afraid to start downloading them all and testing because I'm on a tight schedule here and I just need to know what will actually work for me.

 

3. I could ask the user to save the file as XML (he's bright enough) and then use SimpleXML to parse it, but I'm having trouble finding any info on Google about using SimpleXML to extract the image data to files. Perhaps storing the images in the MySQL database as binary is the best way, but I'd prefer not to have the overhead if possible and just save JPG or PNG files out and have the HTML link to them. Is there anybody out there who has done this before?

 

4. Maybe there's a client-side app for this ? Written in Java or Flash or something ???

 

5. Perhaps MHTML - I could get him to save as .MHT archive and then do something with PHP on the server side - if I can't find any other solution I will probably try this one next.
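For what it's worth on option 5: an .MHT file is a MIME multipart archive, so any standard MIME parser should be able to split it into the HTML part and the image parts. Below is a minimal sketch of that idea in Python rather than PHP (just to show the shape of it). The function name `split_mht` and the output layout are made up for illustration, and it assumes each part carries a `Content-Location` header, which may not hold for every file.

```python
import email
import os
from email import policy


def split_mht(mht_bytes, out_dir):
    """Save each part of an MHT (MIME multipart) archive as its own file.

    Returns a list of (content_type, saved_path) tuples.
    split_mht is a hypothetical helper name, not part of any library.
    """
    os.makedirs(out_dir, exist_ok=True)
    message = email.message_from_bytes(mht_bytes, policy=policy.default)
    saved = []
    for index, part in enumerate(message.walk()):
        if part.is_multipart():
            continue  # container parts hold no payload of their own
        location = str(part.get("Content-Location", ""))
        # Keep only the last path segment so nothing is written outside out_dir.
        name = os.path.basename(location.replace("\\", "/")) or f"part{index}"
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:
            # decode=True undoes base64 / quoted-printable transfer encoding
            handle.write(part.get_payload(decode=True))
        saved.append((part.get_content_type(), path))
    return saved
```

The image links inside the saved HTML would still point at the original locations, so they would need rewriting to wherever the images end up on the server.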

 

 

 

So does anybody know of the best option for this or should I stick with the Rich text editor road ?

 

Any alternative (that works and is relatively easy for the client to do) is fine.

 

Thanks in advance

I usually try to avoid responding unless I have an answer or direct criticism, but you will probably spend a lot more time trying to get an automated solution working than it would take to actually do it manually.

 

Word to HTML is never pretty, thanks to Word's crappy markup language.  If you want your pages to be clean at all, I suggest pasting as plain text and inserting the images manually, even though it will take a lot of time.

 

Also: If your company *requires* a license, you do have the option of purchasing one.  Otherwise, CKEditor is free for commercial projects just as FCKEditor is.  I currently use CKEditor, but I've had to spend a while reducing the bugs to where our clients can use it.  If you don't have the time to fix bugs, I suggest waiting for a future release of CKEditor before upgrading.

Well at least you're honest. Yeah, I thought as much myself, but there's just *SO* much crap to do. I myself have no trouble getting images out of documents, saving them as JPG and uploading them, but the client isn't that tech savvy. If it's something he can do in Word, like save as MHT and then upload, he'd get it. I understand your point and agree whole-heartedly. However, if I can get the stuff into the database and the JPGs up to the server any way at all I'd be happy, and then he can edit the kinks out of each article later on himself.

 

I just found this MHTML class and will test:

http://www.phpclasses.org/browse/file/23132.html

 

The stuff it outputs seems ok. I can strip a lot of the nasty crappy Word HTML code and tags myself with PHP if necessary but I do have a feeling that yeah this is going to be a lost cause.
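To give an idea of what stripping the Word-specific markup might look like, here is a rough regex-based sketch, written in Python (the same patterns should carry over to PHP's regex functions). `clean_word_html` is a made-up name, and the list of things it removes (conditional comments, namespaced tags like `<o:p>`, `class` / `style` / `lang` attributes, all `<span>` wrappers) is only my guess at the usual offenders, not a complete cleaner. Regexes can also misfire on text that merely looks like an attribute, so treat it as a first pass.

```python
import re


def clean_word_html(html):
    """Strip common Word-specific markup from an HTML fragment (first pass only)."""
    # Hidden conditional blocks: <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S | re.I)
    # Leftover markers such as <![if !supportLists]> and <![endif]> (content kept)
    html = re.sub(r"<!\[(?:if[^\]]*|endif)\]>", "", html, flags=re.I)
    # Namespaced Office tags such as <o:p> and </o:p> (content kept)
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.I)
    # class=MsoNormal, style="mso-...", lang=EN-GB attributes, quoted or not
    html = re.sub(
        r"""\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
        "",
        html,
        flags=re.I,
    )
    # Drop every <span ...> and </span> tag, keeping the text inside
    html = re.sub(r"</?span\b[^>]*>", "", html, flags=re.I)
    return html
```

For example, `<p class=MsoNormal><span lang=EN-GB>Hello<o:p></o:p></span></p>` comes out as `<p>Hello</p>`. Alignment done through `style` would be lost too, so that attribute may be worth keeping on images.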

 

Thanks anyway for posting your opinion.

Ok so I tested the MHTML class - it's not too bad actually. Sure the code needs to be cleaned up a little bit but here's an example:

 

http://www.thedavil.com/testing/Augustinian Saints and Blessed.doc

 

got converted to

 

http://www.thedavil.com/testing/index.htm

 

Which isn't terrible, but not great either. I'm going to do a bit more work on this MHT way of doing things, but I'm not gonna work too hard - if it doesn't work out then I'll go back to manual.

 

My next step is to see if I can get that HTML into the Rich Text Editor and whether the little kinks can be worked out there. If they can, it's just a case of moving the files around and putting the HTML into the database, and I'm onto a winner.....

 

Still dubious though

Looking at an XML version of the Word doc you posted, it doesn't look any more promising.
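On the earlier question of getting the image data out of the XML: assuming the file was saved in the Word 2003 XML format, embedded pictures are stored as base64 text inside `w:binData` elements, with a `w:name` attribute like `wordml://01000001.jpg`. A minimal Python sketch of writing those out as files follows (PHP's SimpleXML plus a base64 decode should be able to do the equivalent). The function name `extract_images` is hypothetical, and newer Word XML formats are laid out differently.

```python
import base64
import os
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 XML documents (assumed format of the saved file).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def extract_images(xml_text, out_dir):
    """Write each embedded w:binData payload to out_dir; return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    root = ET.fromstring(xml_text)
    paths = []
    for index, node in enumerate(root.iter(f"{{{W_NS}}}binData")):
        # w:name looks like "wordml://01000001.jpg"; keep only the file name.
        name = node.get(f"{{{W_NS}}}name", "")
        filename = os.path.basename(name) or f"image{index}.bin"
        path = os.path.join(out_dir, filename)
        with open(path, "wb") as handle:
            # b64decode ignores the line breaks Word inserts in the payload
            handle.write(base64.b64decode(node.text or ""))
        paths.append(path)
    return paths
```

That covers only the images. Turning the surrounding text and table markup into clean HTML is the harder part.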

 

You could try saving the Word doc as "Web page, filtered".  It's not perfect, but it's probably the best option available next to manually copying, pasting, and fixing.

Yeah maybe... I know this sounds stupid but I have to ask .... if I save as filtered HTML in Word, then re-open and save as MHT or XML, will it gain all the same crap code again?

 

I suppose if I really want to go down this road (crazy as it is) I could write a small EXE for the client that saves it as filtered HTML and then zips up all the stuff for upload. It is a bit crazy though alright. :-D
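The zipping half of that idea is simple enough. Here is a sketch in Python, assuming Word has already saved `name.htm` with its images in a sibling folder called `name_files` (that folder name is an assumption and may differ by Word version or language). `bundle_filtered_html` is a made-up helper name, and driving Word itself to do the "save as filtered" step is not covered here.

```python
import os
import zipfile


def bundle_filtered_html(html_path, zip_path):
    """Zip an exported .htm file plus its companion "<name>_files" folder.

    Returns the list of archive member names that were written.
    """
    base_dir = os.path.dirname(os.path.abspath(html_path))
    stem = os.path.splitext(os.path.basename(html_path))[0]
    assets_dir = os.path.join(base_dir, stem + "_files")  # assumed folder name
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(html_path, os.path.basename(html_path))
        added.append(os.path.basename(html_path))
        if os.path.isdir(assets_dir):
            for entry in sorted(os.listdir(assets_dir)):
                full = os.path.join(assets_dir, entry)
                if os.path.isfile(full):
                    # Keep the relative folder so the HTML's image links still resolve
                    arcname = f"{stem}_files/{entry}"
                    archive.write(full, arcname)
                    added.append(arcname)
    return added
```

On the server side the archive could then be unpacked into one directory per article, with the HTML going into the database.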

 

btw,

Is CKEditor any better than FCKeditor? (I don't mind fixing a few bugs if they're small and fixable by PHP)

I personally like CKEditor better than FCKEditor, but it's still a bit buggy even after I spent hours fixing some bugs.  Fixing the bugs is probably more involved than you want to get with it.  If I were you, I would probably stick with FCKEditor until CKEditor has gone through a couple more versions.

 

Edit: Saving a Word doc as filtered HTML and then resaving the filtered HTML as MHT seems to bring back a lot of the Word formatting junk.
