Archived

This topic is now archived and is closed to further replies.

behindspace

Converting plain text to XML with PHP

Maybe I'm going about this the wrong way. I've got numerous regular expressions that take plain text from a form and convert it to XML. The purpose is to create a book index file from a Word document. Word's own "XML" is complete trash (as is its HTML).

It's also possible that someone has already written a utility like this.

basically, I need to take a line like this:

Aardvark, 100, 110-12

and convert it to this:

[code]
<term>
<name>Aardvark</name>
<page>100</page>
<page>110-12</page>
</term>
[/code]
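
A minimal sketch of that conversion without regular expressions, assuming the first comma-separated field is always the term name and every later field is a page reference. The function name `index_line_to_xml` is made up for illustration. `htmlspecialchars()` is used so that a name like AT&T comes out as valid XML text.

```php
<?php
// Hypothetical helper: turn one flat index line into a <term> element.
// Assumes field 1 is the name and all remaining fields are pages.
function index_line_to_xml($line)
{
    $fields = array_map('trim', explode(',', $line));
    $name   = array_shift($fields);

    $xml = "<term>\n<name>" . htmlspecialchars($name) . "</name>\n";
    foreach ($fields as $page) {
        if ($page !== '') {
            $xml .= "<page>" . htmlspecialchars($page) . "</page>\n";
        }
    }
    return $xml . "</term>";
}

echo index_line_to_xml('Aardvark, 100, 110-12');
```

This breaks on names that contain a comma themselves (for example "Smith, John, 12"), so a real version would need a rule for telling name fields from page fields.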

I also have nested terms, which should end up looking like this:

[code]
<term>
<name>Chainsaw</name>
<page>50</page>
     <term>
     <name>juggling</name>
     <page>210</page>
     </term>
</term>
[/code]
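
One possible way to get the closing tags in the right place is to keep a stack of the terms that are still open. The sketch below assumes a sub-entry is marked by deeper leading whitespace than its parent; nothing in this thread says how nesting appears in the pasted text, so that rule and the name `index_to_xml` are assumptions.

```php
<?php
// Sketch only: assumes a sub-entry has more leading whitespace than its parent.
function index_to_xml($text)
{
    $out   = array();
    $stack = array();   // indent widths of the <term> elements still open

    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        if (trim($line) === '') {
            continue;   // skip blank lines
        }
        $indent = strlen($line) - strlen(ltrim($line));

        // Close every open term that is not an ancestor of this line.
        while ($stack && end($stack) >= $indent) {
            array_pop($stack);
            $out[] = '</term>';
        }

        $fields = array_map('trim', explode(',', $line));
        $out[]  = '<term>';
        $out[]  = '<name>' . htmlspecialchars(array_shift($fields)) . '</name>';
        foreach ($fields as $page) {
            $out[] = '<page>' . htmlspecialchars($page) . '</page>';
        }
        $stack[] = $indent;   // this term stays open until a shallower line
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        array_pop($stack);
        $out[] = '</term>';
    }
    return implode("\n", $out);
}

echo index_to_xml("Chainsaw, 50\n    juggling, 210\nAardvark, 100");
```

Because a term is only closed when a line at the same or a shallower indent shows up, the parent's `</term>` naturally lands after the last nested term.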

any guesses? I'm dwelling in preg_replace() Hades right now

maybe after I get the code written, I'll post it here, and see if anyone has ideas about clean up or better syntax

Show us what you've tried already.

Ken

ok, forgive me for being a complete n00b with regex. Until recently I never had to delve into them like this, so I'm sure this code could be written FAR better; I'm just awful and out of practice.

so, forgive my n00bness, and don't laugh too hard at me :(

FYI, I'm running this application on PHP 4.0.2 on Apache on my Windows box (no Linux box access in the office)

[code]
<?
    
    $word = $_POST['word'];
    
    $clean = nl2br($word);
    
    
    $str = preg_replace('/<br \/>/', '<br>', $clean);
    $strxx = preg_replace('/(\D), ([1-9][0-9][2-9][0-9])/', '$1, <y>$2<y>', $str);
    $strx1 = preg_replace('/(\D), ([1-9][1-9][0-9][0-9])/', '$1, <y>$2<y>', $strxx);
    $str01 = eregi_replace(",", " , ", $strx1);
    $str02 = eregi_replace(" +", " ", $str01);
    $str03 = preg_replace('/\-([1-9][0-9][2-9][0-9])/', '-<y>$1<y>', $str02);
    $str0A = preg_replace('/\-([1-9][1-9][0-9][0-9])/', '-<y>$1<y>', $str03);
    $str0B = preg_replace('/([1-9][1-9][0-9][0-9])/', '<y>$1<y>', $str0A);
    $str04 = preg_replace('/([2-9][0-9][0-9][0-9])/', '<y>$1<y>', $str0B);
    $str0C = preg_replace('/\–/', '-', $str04);
    $str0D = preg_replace('/á/', 'a', $str0C);
    $str0E = preg_replace('/\”/', '"', $str0D);
    $str0F = preg_replace('/\“/', '"', $str0E);
    $str0G = preg_replace('/Á/', 'A', $str0F);
    $str0H = preg_replace('/ú/', 'u', $str0G);
    $str0I = preg_replace('/ñ/', 'n', $str0H);
    $str05 = preg_replace('/(\D) , ([0-9])/', '$1</name><page>$2', $str0I);
    $str06 = preg_replace('/(\D)<br>/', '$1</name><br>', $str05);
    $str07 = preg_replace('/([0-9])<br>/', '$1</page><br>', $str06);
    $str08 = preg_replace('/([0-9]) , ([0-9])/', '$1</page><page>$2', $str07);
    $str8A = preg_replace('/<page>([0-9]{1,4}). /', '<page>$1</page>', $str08);
    $str8B = preg_replace('/<page>([0-9]{1,4})-([0-9]{1,2}). /', '<page>$1-$2</page>', $str8A);
    $str09 = preg_replace('/\n/', '<name>', $str8B);
    $str10 = preg_replace('/<name>([A-Z])/', '</term><name>$1', $str09);
    $str11 = preg_replace('/<name>/', '<term><name>', $str10);
    $str12 = preg_replace('/<br>/', '', $str11);
    $strXB = preg_replace('/<y>/', '', $str12);
    $str13 = preg_replace('/</', '&lt;', $strXB);
    $str14 = preg_replace('/>/', '&gt;', $str13);
    $str15 = preg_replace('/&gt;&lt;/', '&gt;<br>&lt;', $str14);
    $str16 = preg_replace('/&lt;term&gt;/', '<br>&lt;term&gt;', $str15);
    $str17 = preg_replace('/&lt;\/term&gt;/', '<br>&lt;/term&gt;', $str16);
    $str18 = preg_replace('/AT&T/', 'AT&amp;T', $str17);
    
    print_r($str18);
    
?>
[/code]

if you are curious as to what I am thinking on any given line, just ask... <_<
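
As an aside, the single-character clean-up steps in the middle of that chain could probably be collapsed into one lookup table with `strtr()`. This is only a sketch: the names `$replacements` and `normalize_chars` are invented here, and it assumes the script file and the posted text use the same character encoding.

```php
<?php
// Sketch: the one-character substitutions as a single lookup table.
// Assumes the script and the submitted text share one encoding.
$replacements = array(
    '–' => '-',   // en dash to plain hyphen
    '“' => '"',   // curly quotes to straight quotes
    '”' => '"',
    'á' => 'a',
    'Á' => 'A',
    'ú' => 'u',
    'ñ' => 'n',
);

function normalize_chars($text, $replacements)
{
    // strtr() with an array replaces each key in one pass over the string.
    return strtr($text, $replacements);
}

echo normalize_chars('“Señor Álvarez”, 110–12', $replacements);
```

Adding another character then means adding one array entry instead of another `preg_replace()` call and another temporary variable.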

another note:

I'm echoing back the results as html that you can copy/paste into a text document.

a few issues that I haven't figured out yet:

1.) Nested terms: I can't figure out how to keep the first term from closing until after the last nested term, so that the output ends with:

</term>
</term>

2.) I'm still stumped on moving the "See also..." text from after the page numbers to the end of the text inside the <name></name> tags, where it should be.

any help would be GREATLY appreciated
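
For the second issue, one approach is to cut the cross-reference off the line before splitting out the pages, then append it to the name. The sketch below assumes the cross-reference always starts with the words "See also" and runs to the end of the line, as in `Chainsaw, 50, 210. See also Juggling`. The function name `split_see_also` and the ". " used to join the two parts are guesses, not something stated in this thread.

```php
<?php
// Sketch only: assumes "See also ..." runs from those words to end of line.
function split_see_also($line)
{
    $seeAlso = '';
    // Capture everything before "See also" (minus trailing punctuation)
    // and the cross-reference itself.
    if (preg_match('/^(.*?)[\s.,;]*\b(See also\b.*)$/i', $line, $m)) {
        $line    = $m[1];
        $seeAlso = trim($m[2]);
    }

    $fields = array_map('trim', explode(',', $line));
    $name   = array_shift($fields);
    if ($seeAlso !== '') {
        // The joining ". " is a guess at the wanted layout.
        $name .= '. ' . $seeAlso;
    }
    return array('name' => $name, 'pages' => $fields);
}

print_r(split_see_also('Chainsaw, 50, 210. See also Juggling'));
```

The returned name is `Chainsaw. See also Juggling` and the pages are `50` and `210`, so the `<name>` and `<page>` tags can then be written out in the usual order.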

I have used the miniXML package (http://minixml.psychogenic.com/) to decode XML, but it can also be used to generate XML. Take a look and see if it meets your needs.

Ken

that package may work, but I think I'll have to write my code to create an array out of a text file then parse it
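
A rough sketch of the "array first, XML second" idea, assuming each term is stored as an array with a name, a list of pages and a list of child terms in the same shape. That array layout and the function name `term_to_xml` are made up for illustration; the step that fills the array from the text file is not shown.

```php
<?php
// Hypothetical array shape: 'name', 'pages' and 'children' (same shape again).
function term_to_xml($term, $depth = 0)
{
    $pad = str_repeat('     ', $depth);   // five spaces per nesting level
    $xml = $pad . "<term>\n"
         . $pad . '<name>' . htmlspecialchars($term['name']) . "</name>\n";
    foreach ($term['pages'] as $page) {
        $xml .= $pad . '<page>' . htmlspecialchars($page) . "</page>\n";
    }
    // Children are written before the parent's closing tag.
    foreach ($term['children'] as $child) {
        $xml .= term_to_xml($child, $depth + 1);
    }
    return $xml . $pad . "</term>\n";
}

$index = array(
    array(
        'name'     => 'Chainsaw',
        'pages'    => array('50'),
        'children' => array(
            array('name' => 'juggling', 'pages' => array('210'), 'children' => array()),
        ),
    ),
);

foreach ($index as $term) {
    echo term_to_xml($term);
}
```

With the data in an array, the nesting is handled by recursion, so the closing tags no longer depend on the order of a chain of replacements.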
