Converting plain text to XML with PHP


behindspace

maybe I'm going about this the wrong way. I've got numerous regular expressions that take plain text from a form and convert it to XML. The purpose is to create a book index file from a Word document. Word's "XML" is complete trash (as is its HTML).

Maybe I'm heading the wrong direction with this, and someone has already written a utility like this.

basically, I need to take a line like this:

Aardvark, 100, 110-12

and convert it to this:

[code]
<term>
<name>Aardvark</name>
<page>100</page>
<page>110-12</page>
</term>
[/code]
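
For the flat case, splitting each line on commas may be simpler than chaining regex replacements. A minimal sketch, assuming one entry per line and page references that look like 100 or 110-12; index_line_to_term is a made-up name, not an existing utility:

```php
<?php
// Hypothetical helper: turn one index line such as "Aardvark, 100, 110-12"
// into a <term> element. Assumes page references are "N" or "N-N".
function index_line_to_term($line)
{
    $parts = array_map('trim', explode(',', $line));
    $pages = array();

    // Peel page references off the end, so a name may itself contain commas.
    while (count($parts) > 1 && preg_match('/^\d+(-\d+)?$/', end($parts))) {
        array_unshift($pages, array_pop($parts));
    }

    // Whatever is left is the name; escape &, < and > for XML.
    $name = htmlspecialchars(implode(', ', $parts));

    $xml = "<term>\n<name>" . $name . "</name>\n";
    foreach ($pages as $page) {
        $xml .= '<page>' . $page . "</page>\n";
    }
    return $xml . "</term>\n";
}

echo index_line_to_term('Aardvark, 100, 110-12');
```

Run on the Aardvark line, this prints the same term block as shown above. A four-digit year inside a name (say "Election, 1984, 55") would be mistaken for a page here, so that case would still need special handling.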

but I also have nested terms that would end up looking like this:

[code]
<term>
<name>Chainsaw</name>
<page>50</page>
     <term>
     <name>juggling</name>
     <page>210</page>
     </term>
</term>
[/code]
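
One way to get the closing tags in the right place for nested terms is to keep a stack of the terms that are still open and only close a term once a line at the same or a shallower level turns up. A rough sketch, assuming sub-entries are indented with spaces or tabs in the pasted text (the post doesn't say how they are marked, so that part is a guess) and that names contain no commas; index_to_xml is a hypothetical name:

```php
<?php
// Sketch only: assumes sub-entries are indented in the pasted text and
// that a term's name contains no commas.
function index_to_xml($text)
{
    $xml  = '';
    $open = array();   // stack of indent widths for terms not yet closed

    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        if (trim($line) === '') {
            continue;   // skip blank lines
        }
        $indent = strlen($line) - strlen(ltrim($line));

        // Close every open term that is not a parent of this line.
        while ($open && end($open) >= $indent) {
            array_pop($open);
            $xml .= str_repeat('  ', count($open)) . "</term>\n";
        }

        $pad   = str_repeat('  ', count($open));
        $parts = array_map('trim', explode(',', $line));
        $xml  .= $pad . "<term>\n";
        $xml  .= $pad . '<name>' . htmlspecialchars(array_shift($parts)) . "</name>\n";
        foreach ($parts as $page) {
            if ($page !== '') {
                $xml .= $pad . '<page>' . htmlspecialchars($page) . "</page>\n";
            }
        }
        $open[] = $indent;   // this term stays open until a sibling or parent follows
    }

    // Close whatever is still open at the end of the input.
    while ($open) {
        array_pop($open);
        $xml .= str_repeat('  ', count($open)) . "</term>\n";
    }
    return $xml;
}

echo index_to_xml("Chainsaw, 50\n    juggling, 210\nAardvark, 100, 110-12\n");
```

With that input, the closing tag for Chainsaw is only written when the Aardvark line arrives, so juggling ends up nested inside it.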

ok, forgive me for being a complete n00b with regex. until recently I haven't had to ever delve into them like this, so I'm sure that this code can be written FAR better, I'm just awful and out of practice.

so, forgive my n00bness, and don't laugh too hard at me :(

FYI, I'm running this application on PHP 4.0.2 on Apache on my Windows box (no Linux box access in the office)

[code]
<?php
    
    $word = $_POST['word'];
    
    $clean = nl2br($word);
    
    
    $str = preg_replace('/<br \/>/', '<br>', $clean);
    $strxx = preg_replace('/(\D), ([1-9][0-9][2-9][0-9])/', '$1, <y>$2<y>', $str);
    $strx1 = preg_replace('/(\D), ([1-9][1-9][0-9][0-9])/', '$1, <y>$2<y>', $strxx);
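    // pad commas with spaces, then collapse runs of spaces
    // (str_replace/preg_replace stand in for eregi_replace, which was removed in PHP 7)
    $str01 = str_replace(',', ' , ', $strx1);
    $str02 = preg_replace('/ +/', ' ', $str01);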
    $str03 = preg_replace('/\-([1-9][0-9][2-9][0-9])/', '-<y>$1<y>', $str02);
    $str0A = preg_replace('/\-([1-9][1-9][0-9][0-9])/', '-<y>$1<y>', $str03);
    $str0B = preg_replace('/([1-9][1-9][0-9][0-9])/', '<y>$1<y>', $str0A);
    $str04 = preg_replace('/([2-9][0-9][0-9][0-9])/', '<y>$1<y>', $str0B);
    $str0C = preg_replace('/\–/', '-', $str04);
    $str0D = preg_replace('/á/', 'a', $str0C);
    $str0E = preg_replace('/\”/', '"', $str0D);
    $str0F = preg_replace('/\“/', '"', $str0E);
    $str0G = preg_replace('/Á/', 'A', $str0F);
    $str0H = preg_replace('/ú/', 'u', $str0G);
    $str0I = preg_replace('/ñ/', 'n', $str0H);
    $str05 = preg_replace('/(\D) , ([0-9])/', '$1</name><page>$2', $str0I);
    $str06 = preg_replace('/(\D)<br>/', '$1</name><br>', $str05);
    $str07 = preg_replace('/([0-9])<br>/', '$1</page><br>', $str06);
    $str08 = preg_replace('/([0-9]) , ([0-9])/', '$1</page><page>$2', $str07);
    $str8A = preg_replace('/<page>([0-9]{1,4}). /', '<page>$1</page>', $str08);
    $str8B = preg_replace('/<page>([0-9]{1,4})-([0-9]{1,2}). /', '<page>$1-$2</page>', $str8A);
    $str09 = preg_replace('/\n/', '<name>', $str8B);
    $str10 = preg_replace('/<name>([A-Z])/', '</term><name>$1', $str09);
    $str11 = preg_replace('/<name>/', '<term><name>', $str10);
    $str12 = preg_replace('/<br>/', '', $str11);
    $strXB = preg_replace('/<y>/', '', $str12);
    $str13 = preg_replace('/</', '&lt;', $strXB);
    $str14 = preg_replace('/>/', '&gt;', $str13);
    $str15 = preg_replace('/&gt;&lt;/', '&gt;<br>&lt;', $str14);
    $str16 = preg_replace('/&lt;term&gt;/', '<br>&lt;term&gt;', $str15);
    $str17 = preg_replace('/&lt;\/term&gt;/', '<br>&lt;/term&gt;', $str16);
    $str18 = preg_replace('/AT&T/', 'AT&amp;T', $str17);
    
    print_r($str18);
    
?>
[/code]

if you are curious as to what I am thinking on any given line, just ask... <_<

another note:

I'm echoing back the results as html that you can copy/paste into a text document.

a few issues that I haven't figured out yet:

1.) nested terms: I can't figure out how to keep the first term from closing until after the last nested term, so that the output ends with:

</term>
</term>

2.) I'm still stumped on moving the "See also..." text from after the page number to the end of the text inside the <name></name> tags, where it should be...

any help would be GREATLY appreciated
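
For the "See also..." text in point 2, one option is to cut the cross-reference off the line before splitting out the page numbers, then append it to the name. A minimal sketch, assuming the cross-reference trails the page list (as in "Chainsaw, 50, 210. See also juggling") and that names contain no commas; the function name and the ". " used to join name and cross-reference are guesses, not something the post specifies:

```php
<?php
// Sketch only: assumes the cross-reference trails the page list, as in
// "Chainsaw, 50, 210. See also juggling", and that names contain no commas.
function index_line_with_see_also($line)
{
    $see = '';
    // Split off a trailing "See ..." or "See also ..." cross-reference.
    if (preg_match('/^(.*?)[.;,]?\s*\b(See(?: also)?\s.+)$/i', $line, $m)) {
        $line = $m[1];
        $see  = trim($m[2]);
    }

    $parts = array_map('trim', explode(',', $line));
    $name  = array_shift($parts);
    if ($see !== '') {
        // Joining with ". " is a guess at the wanted punctuation.
        $name .= '. ' . $see;
    }

    $xml = "<term>\n<name>" . htmlspecialchars($name) . "</name>\n";
    foreach ($parts as $page) {
        if ($page !== '') {
            $xml .= '<page>' . htmlspecialchars($page) . "</page>\n";
        }
    }
    return $xml . "</term>\n";
}

echo index_line_with_see_also('Chainsaw, 50, 210. See also juggling');
```

Here the name comes out as "Chainsaw. See also juggling", followed by page elements for 50 and 210.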
