Converting plain text to XML with PHP


6 replies to this topic

#1 behindspace

behindspace
  • New Members
  • Pip
  • Newbie
  • 7 posts

Posted 06 March 2006 - 02:46 PM

Maybe I'm going about this the wrong way. I've got numerous regular expressions that take plain text from a form and convert it to XML. The purpose is to create a book index file from a Word document. Word's "XML" is complete trash (as is its HTML).

Maybe I'm heading in the wrong direction with this, and someone has already written a utility like this.

Basically, I need to take a line like this:

Aardvark, 100, 110-12

and convert it to this:

<term>
<name>Aardvark</name>
<page>100</page>
<page>110-12</page>
</term>
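
A minimal sketch of that one-line conversion without regex chains, assuming PHP 7 or later. The function name `index_line_to_xml` is made up for illustration, and it assumes that every trailing comma-separated piece that looks like a page number or range (`100`, `110-12`) is a page and everything before it is the term name:

```php
<?php
// Hypothetical helper (not from the thread): converts one index line
// such as "Aardvark, 100, 110-12" into a <term> block.
function index_line_to_xml(string $line): string
{
    // Split on commas and trim the whitespace around each piece.
    $parts = array_map('trim', explode(',', $line));

    // Peel page references off the end; whatever is left is the name,
    // so a name that itself contains a comma stays intact.
    $pages = [];
    while (count($parts) > 1 && preg_match('/^\d+(-\d+)?$/', end($parts))) {
        array_unshift($pages, array_pop($parts));
    }
    $name = implode(', ', $parts);

    $xml = "<term>\n<name>" . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . "</name>\n";
    foreach ($pages as $page) {
        $xml .= "<page>" . htmlspecialchars($page, ENT_QUOTES, 'UTF-8') . "</page>\n";
    }
    return $xml . "</term>";
}

echo index_line_to_xml('Aardvark, 100, 110-12'), "\n";
```

Page ranges typed with an en dash would need to be normalised to a hyphen first, or the pattern widened to accept both.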

I also have nested terms, which would end up looking like this:

<term>
<name>Chainsaw</name>
<page>50</page>
     <term>
     <name>juggling</name>
     <page>210</page>
     </term>
</term>
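
One possible way to get that nesting is to keep a stack of the terms that are still open and only close them when a line at the same or a shallower level turns up. The sketch below assumes PHP 7 or later and, importantly, assumes that sub-entries are marked by leading whitespace in the pasted text; the thread does not say how the nesting shows up in the input. The function name is hypothetical, and names are assumed to contain no commas:

```php
<?php
// Assumption (not stated in the thread): a sub-entry is marked by leading
// whitespace, and deeper indentation means deeper nesting.
function nested_index_to_xml(string $text): string
{
    $out   = [];
    $stack = [];   // indentation widths of the <term> elements still open

    foreach (preg_split('/\R/', $text) as $line) {
        if (trim($line) === '') {
            continue;
        }
        $indent = strlen($line) - strlen(ltrim($line));

        // Close every open term that sits at the same depth or deeper.
        while ($stack && end($stack) >= $indent) {
            array_pop($stack);
            $out[] = str_repeat('     ', count($stack)) . '</term>';
        }

        $parts = array_map('trim', explode(',', $line));
        $pad   = str_repeat('     ', count($stack));
        $out[] = $pad . '<term>';
        $out[] = $pad . '<name>' . htmlspecialchars(array_shift($parts)) . '</name>';
        foreach ($parts as $page) {
            $out[] = $pad . '<page>' . htmlspecialchars($page) . '</page>';
        }
        $stack[] = $indent;   // this term stays open until a shallower line appears
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        array_pop($stack);
        $out[] = str_repeat('     ', count($stack)) . '</term>';
    }
    return implode("\n", $out);
}

echo nested_index_to_xml("Chainsaw, 50\n    juggling, 210\n"), "\n";
```

Because the parent is only closed when the loop meets a line that is not indented deeper, the two closing `</term>` tags come out in the right order.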


#2 behindspace


Posted 06 March 2006 - 06:16 PM

Any guesses? I'm dwelling in preg_replace() Hades right now.

Maybe after I get the code written, I'll post it here and see if anyone has ideas about cleanup or better syntax.

#3 kenrbnsn

kenrbnsn
  • Staff Alumni
  • Advanced Member
  • 8,235 posts
  • LocationHillsborough, NJ, USA

Posted 06 March 2006 - 06:56 PM

Show us what you've tried already.

Ken

#4 behindspace


Posted 06 March 2006 - 07:11 PM

OK, forgive me for being a complete n00b with regex. Until recently I haven't had to delve into them like this, so I'm sure this code can be written FAR better; I'm just awful and out of practice.

So, forgive my n00bness, and don't laugh too hard at me :(

FYI, I'm running this application on PHP 4.0.2 on Apache on my Windows box (no Linux box access in the office).

<?
    
    $word = $_POST['word'];
    
    $clean = nl2br($word);
    
    
    $str = preg_replace('/<br \/>/', '<br>', $clean);
    $strxx = preg_replace('/(\D), ([1-9][0-9][2-9][0-9])/', '$1, <y>$2<y>', $str);
    $strx1 = preg_replace('/(\D), ([1-9][1-9][0-9][0-9])/', '$1, <y>$2<y>', $strxx);
    $str01 = eregi_replace(",", " , ", $strx1);
    $str02 = eregi_replace(" +", " ", $str01);
    $str03 = preg_replace('/\-([1-9][0-9][2-9][0-9])/', '-<y>$1<y>', $str02);
    $str0A = preg_replace('/\-([1-9][1-9][0-9][0-9])/', '-<y>$1<y>', $str03);
    $str0B = preg_replace('/([1-9][1-9][0-9][0-9])/', '<y>$1<y>', $str0A);
    $str04 = preg_replace('/([2-9][0-9][0-9][0-9])/', '<y>$1<y>', $str0B);
    $str0C = preg_replace('/\–/', '-', $str04);
    $str0D = preg_replace('/á/', 'a', $str0C);
    $str0E = preg_replace('/\”/', '"', $str0D);
    $str0F = preg_replace('/\“/', '"', $str0E);
    $str0G = preg_replace('/Á/', 'A', $str0F);
    $str0H = preg_replace('/ú/', 'u', $str0G);
    $str0I = preg_replace('/ñ/', 'n', $str0H);
    $str05 = preg_replace('/(\D) , ([0-9])/', '$1</name><page>$2', $str0I);
    $str06 = preg_replace('/(\D)<br>/', '$1</name><br>', $str05);
    $str07 = preg_replace('/([0-9])<br>/', '$1</page><br>', $str06);
    $str08 = preg_replace('/([0-9]) , ([0-9])/', '$1</page><page>$2', $str07);
    $str8A = preg_replace('/<page>([0-9]{1,4}). /', '<page>$1</page>', $str08);
    $str8B = preg_replace('/<page>([0-9]{1,4})-([0-9]{1,2}). /', '<page>$1-$2</page>', $str8A);
    $str09 = preg_replace('/\n/', '<name>', $str8B);
    $str10 = preg_replace('/<name>([A-Z])/', '</term><name>$1', $str09);
    $str11 = preg_replace('/<name>/', '<term><name>', $str10);
    $str12 = preg_replace('/<br>/', '', $str11);
    $strXB = preg_replace('/<y>/', '', $str12);
    $str13 = preg_replace('/</', '&lt;', $strXB);
    $str14 = preg_replace('/>/', '&gt;', $str13);
    $str15 = preg_replace('/&gt;&lt;/', '&gt;<br>&lt;', $str14);
    $str16 = preg_replace('/&lt;term&gt;/', '<br>&lt;term&gt;', $str15);
    $str17 = preg_replace('/&lt;\/term&gt;/', '<br>&lt;/term&gt;', $str16);
    $str18 = preg_replace('/AT&T/', 'AT&amp;T', $str17);
    
    print_r($str18);
    
?> 
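
The run of single-character replacements and the escaping at the end of the script above can probably be collapsed. A rough sketch, assuming PHP 7 or later and UTF-8 input (the `$map` array and the function name are invented for illustration; the character pairs are the ones from the script):

```php
<?php
// Typographic characters mapped to plain ASCII, copied from the
// replacements in the script above. Assumes this file and the input
// are both UTF-8.
$map = [
    '–' => '-',
    '“' => '"',
    '”' => '"',
    'á' => 'a',
    'Á' => 'A',
    'ú' => 'u',
    'ñ' => 'n',
];

// Hypothetical helper: strtr() with an array swaps every key for its
// value in a single pass, so no regex is needed for literal characters.
function normalise_index_text(string $text, array $map): string
{
    return strtr($text, $map);
}

echo normalise_index_text('“Núñez–Á”', $map), "\n";

// htmlspecialchars() escapes <, >, & and quotes in one call, which covers
// the separate "<", ">" and "AT&T" steps when echoing XML into a browser.
echo htmlspecialchars('<name>AT&T</name>', ENT_QUOTES, 'UTF-8'), "\n";
```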

If you are curious as to what I am thinking on any given line, just ask... <_<

#5 behindspace


Posted 06 March 2006 - 07:35 PM

Another note:

I'm echoing back the results as HTML that you can copy/paste into a text document.

A few issues that I haven't figured out yet:

1.) Nested terms: I can't figure out how to keep the first term from closing until after the last nested term, so that the output ends with:

</term>
</term>

2.) I'm still stumped on moving the "See also..." text from after the page number to the end of the text in the <name></name> tags, where it should be...

Any help would be GREATLY appreciated.
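
For the "See also" issue, one option is to split the note off the line before touching the commas, then append it to the name. This is only a sketch for PHP 7.1 or later. It assumes the note begins with the literal words "See also" and follows a period after the last page number, and the way the note is joined to the name (`". "`) is a guess at the wanted format. Both function names are hypothetical:

```php
<?php
// Hypothetical helper: separates a trailing "See also ..." note from the
// rest of an index line. Returns [entry, note]; note is '' when absent.
function split_see_also(string $line): array
{
    if (preg_match('/^(.*?)\.\s*(See also\b.*)$/i', trim($line), $m)) {
        return [$m[1], $m[2]];
    }
    return [trim($line), ''];
}

// Hypothetical helper: builds the <term> with the note moved into <name>.
// Assumes the term name itself contains no commas.
function line_with_note_to_xml(string $line): string
{
    [$entry, $note] = split_see_also($line);

    $parts = array_map('trim', explode(',', $entry));
    $name  = array_shift($parts);
    if ($note !== '') {
        $name .= '. ' . $note;   // joining format is an assumption
    }

    $xml = '<term><name>' . htmlspecialchars($name) . '</name>';
    foreach ($parts as $page) {
        $xml .= '<page>' . htmlspecialchars($page) . '</page>';
    }
    return $xml . '</term>';
}

echo line_with_note_to_xml('Chainsaw, 50, 52-53. See also juggling'), "\n";
```

Splitting the note off first means any commas inside it never get mistaken for page separators.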

#6 kenrbnsn


Posted 06 March 2006 - 07:47 PM

I have used the miniXML package (http://minixml.psychogenic.com/) to decode XML, but it can also be used to generate XML. Take a look and see if it meets your needs.

Ken

#7 behindspace


Posted 06 March 2006 - 08:47 PM

That package may work, but I think I'll have to write my code to create an array out of a text file, then parse it.
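
A rough sketch of that array-first idea, assuming PHP 7 or later: read the lines into a nested array, then render the array as XML in a second step. As before, indentation as the marker for sub-entries is an assumption, term names are assumed to contain no commas, and all three function names are hypothetical:

```php
<?php
// Step 1: turn each non-empty line into a row, then fold the rows into a tree.
function parse_index_lines(array $lines): array
{
    $rows = [];
    foreach ($lines as $line) {
        if (trim($line) === '') {
            continue;
        }
        $parts  = array_map('trim', explode(',', $line));
        $rows[] = [
            'indent' => strlen($line) - strlen(ltrim($line)),
            'name'   => array_shift($parts),
            'pages'  => $parts,
        ];
    }
    $i = 0;
    return build_terms($rows, $i, -1);
}

// Collects consecutive rows indented deeper than $parentIndent; each row's
// own children are gathered by the recursive call before moving on.
function build_terms(array $rows, int &$i, int $parentIndent): array
{
    $terms = [];
    while ($i < count($rows) && $rows[$i]['indent'] > $parentIndent) {
        $row     = $rows[$i++];
        $terms[] = [
            'name'     => $row['name'],
            'pages'    => $row['pages'],
            'children' => build_terms($rows, $i, $row['indent']),
        ];
    }
    return $terms;
}

// Step 2: render the tree; a term is closed only after its children.
function terms_to_xml(array $terms, int $depth = 0): string
{
    $pad = str_repeat('  ', $depth);
    $xml = '';
    foreach ($terms as $t) {
        $xml .= $pad . "<term>\n" . $pad . '  <name>' . htmlspecialchars($t['name']) . "</name>\n";
        foreach ($t['pages'] as $p) {
            $xml .= $pad . '  <page>' . htmlspecialchars($p) . "</page>\n";
        }
        $xml .= terms_to_xml($t['children'], $depth + 1) . $pad . "</term>\n";
    }
    return $xml;
}

// With a real file, file('index.txt', FILE_IGNORE_NEW_LINES) would supply
// the lines; 'index.txt' is a placeholder name.
$tree = parse_index_lines(['Aardvark, 100, 110-12', 'Chainsaw, 50', '    juggling, 210']);
echo terms_to_xml($tree);
```

Keeping the parse and the output separate makes it easier to inspect the array with print_r() before worrying about the XML.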





