behindspace, posted March 6, 2006:

Maybe I'm going about this the wrong way. I've got numerous regular expressions that take plain text from a form and convert it to XML. The purpose is to create a book index file from a Word document. Word's "XML" is complete trash (as is its HTML). Maybe I'm heading in the wrong direction with this, and someone has already written a utility like this.

Basically, I need to take a line like this:

Aardvark, 100, 110-12

and convert it to this:

[code]<term><name>Aardvark</name><page>100</page><page>110-12</page></term>[/code]

But I also have nested terms that would end up looking like this:

[code]
<term>
  <name>Chainsaw</name>
  <page>50</page>
  <term>
    <name>juggling</name>
    <page>210</page>
  </term>
</term>
[/code]
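
For the flat case, splitting each line on commas may be simpler than chaining regular expressions. The following is only a minimal sketch, assuming a current PHP release, that the term name comes first, and that the first field beginning with a digit starts the page list. The function name `indexLineToXml` is hypothetical.

```php
<?php
// Hypothetical helper: turn one flat index line into a <term> element.
// Assumption: every field before the first one that starts with a digit
// belongs to the term name; everything after it is a page or page range.
function indexLineToXml($line)
{
    $nameParts = array();
    $pages     = array();

    foreach (explode(',', $line) as $field) {
        $field = trim($field);
        if ($field === '') {
            continue; // skip empty fields such as a trailing comma
        }
        if (!$pages && !preg_match('/^\d/', $field)) {
            $nameParts[] = $field; // still inside the name
        } else {
            $pages[] = $field;     // page list has started
        }
    }

    // htmlspecialchars() escapes &, < and > so names like AT&T stay valid XML.
    $xml = '<term><name>' . htmlspecialchars(implode(', ', $nameParts)) . '</name>';
    foreach ($pages as $page) {
        $xml .= '<page>' . htmlspecialchars($page) . '</page>';
    }
    return $xml . '</term>';
}

echo indexLineToXml('Aardvark, 100, 110-12'), "\n";
// prints <term><name>Aardvark</name><page>100</page><page>110-12</page></term>
```

Keeping the name as "everything before the first numeric field" means an entry such as "Lincoln, Abraham, 45" keeps its comma inside the name rather than being read as a page.
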
behindspace, posted March 6, 2006:

Any guesses? I'm dwelling in preg_replace() Hades right now.

Maybe after I get the code written, I'll post it here and see if anyone has ideas about cleanup or better syntax.
kenrbnsn, posted March 6, 2006:

Show us what you've tried already.

Ken
behindspace, posted March 6, 2006:

OK, forgive me for being a complete n00b with regex. Until recently I haven't ever had to delve into them like this, so I'm sure this code can be written FAR better; I'm just awful and out of practice.

So forgive my n00bness, and don't laugh too hard at me :(

FYI, I'm running this application on PHP 4.0.2 on Apache on my Windows box (no Linux box access in the office).

[code]<?
$word = $_POST['word'];
$clean = nl2br($word);
$str = preg_replace('/<br \/>/', '<br>', $clean);
$strxx = preg_replace('/(\D), ([1-9][0-9][2-9][0-9])/', '$1, <y>$2<y>', $str);
$strx1 = preg_replace('/(\D), ([1-9][1-9][0-9][0-9])/', '$1, <y>$2<y>', $strxx);
$str01 = eregi_replace(",", " , ", $strx1);
$str02 = eregi_replace(" +", " ", $str01);
$str03 = preg_replace('/\-([1-9][0-9][2-9][0-9])/', '-<y>$1<y>', $str02);
$str0A = preg_replace('/\-([1-9][1-9][0-9][0-9])/', '-<y>$1<y>', $str03);
$str0B = preg_replace('/([1-9][1-9][0-9][0-9])/', '<y>$1<y>', $str0A);
$str04 = preg_replace('/([2-9][0-9][0-9][0-9])/', '<y>$1<y>', $str0B);
$str0C = preg_replace('/\–/', '-', $str04);
$str0D = preg_replace('/á/', 'a', $str0C);
$str0E = preg_replace('/\”/', '"', $str0D);
$str0F = preg_replace('/\“/', '"', $str0E);
$str0G = preg_replace('/Á/', 'A', $str0F);
$str0H = preg_replace('/ú/', 'u', $str0G);
$str0I = preg_replace('/ñ/', 'n', $str0H);
$str05 = preg_replace('/(\D) , ([0-9])/', '$1</name><page>$2', $str0I);
$str06 = preg_replace('/(\D)<br>/', '$1</name><br>', $str05);
$str07 = preg_replace('/([0-9])<br>/', '$1</page><br>', $str06);
$str08 = preg_replace('/([0-9]) , ([0-9])/', '$1</page><page>$2', $str07);
$str8A = preg_replace('/<page>([0-9]{1,4}). /', '<page>$1</page>', $str08);
$str8B = preg_replace('/<page>([0-9]{1,4})-([0-9]{1,2}). 
/', '<page>$1-$2</page>', $str8A);
$str09 = preg_replace('/\n/', '<name>', $str8B);
$str10 = preg_replace('/<name>([A-Z])/', '</term><name>$1', $str09);
$str11 = preg_replace('/<name>/', '<term><name>', $str10);
$str12 = preg_replace('/<br>/', '', $str11);
$strXB = preg_replace('/<y>/', '', $str12);
$str13 = preg_replace('/</', '<', $strXB);
$str14 = preg_replace('/>/', '>', $str13);
$str15 = preg_replace('/></', '><br><', $str14);
$str16 = preg_replace('/<term>/', '<br><term>', $str15);
$str17 = preg_replace('/<\/term>/', '<br></term>', $str16);
$str18 = preg_replace('/AT&T/', 'AT&T', $str17);
print_r($str18);
?>[/code]

If you are curious as to what I am thinking on any given line, just ask... <_<
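
The run of single-character replacements in the code above (dashes, curly quotes, accented letters) could probably be collapsed into one lookup table passed to `strtr()`. This is a sketch under the assumption that the form input arrives as UTF-8; on a Windows-1252 page the byte sequences in the map would be different. The name `normalizeIndexText` is hypothetical.

```php
<?php
// Sketch: one translation table instead of a chain of preg_replace() calls.
// Assumption: $text is UTF-8. The keys are written as byte escapes so the
// script does not depend on the encoding of the source file itself.
function normalizeIndexText($text)
{
    $map = array(
        "\xE2\x80\x93" => '-', // en dash
        "\xE2\x80\x9C" => '"', // left curly double quote
        "\xE2\x80\x9D" => '"', // right curly double quote
        "\xC3\xA1"     => 'a', // a with acute accent
        "\xC3\x81"     => 'A', // A with acute accent
        "\xC3\xBA"     => 'u', // u with acute accent
        "\xC3\xB1"     => 'n', // n with tilde
    );

    // With an array argument, strtr() replaces every key in a single pass
    // and never re-scans text it has already replaced.
    return strtr($text, $map);
}

echo normalizeIndexText("Se\xC3\xB1or, 10\xE2\x80\x9312"), "\n";
// prints Senor, 10-12
```

New characters can then be handled by adding a row to the table rather than another intermediate variable.
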
behindspace, posted March 6, 2006:

Another note: I'm echoing back the results as HTML that you can copy/paste into a text document. A few issues that I haven't figured out yet:

1.) Nested terms: I can't figure out how to keep the first term from closing until after the last nested term, which should result in: </term></term>
2.) I'm still stumped on moving the "See also..." text from after the page number to the end of the text in the <name></name> tags, where it should be...

Any help would be GREATLY appreciated.
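
Both issues might be easier to handle line by line than with global replacements. The sketch below rests on two assumptions that the thread does not confirm: that sub-entries are indented under their parent term in the pasted text, and that a "See also ..." note trails the page numbers on the same line. The function names `entryBody` and `indexToXml` are hypothetical, and a current PHP release is assumed.

```php
<?php
// Hypothetical helper: build the <name> and <page> elements for one entry.
function entryBody($entry)
{
    // Issue 2: peel a trailing "See also ..." note off the end of the line.
    $seeAlso = '';
    if (preg_match('/^(.*?)[\s,.;]*\b(See also\b.*)$/i', $entry, $m)) {
        $entry   = $m[1];
        $seeAlso = ' ' . trim($m[2]);
    }

    $fields = explode(',', $entry);
    $name   = trim(array_shift($fields));

    // The note is appended to the end of the name text (separator is a guess).
    $xml = '<name>' . htmlspecialchars($name . $seeAlso) . '</name>';
    foreach ($fields as $page) {
        $page = trim($page, " \t\r\n.");
        if ($page !== '') {
            $xml .= '<page>' . htmlspecialchars($page) . '</page>';
        }
    }
    return $xml;
}

// Issue 1: keep a stack of open terms so a parent closes only after its
// last nested term. Assumption: deeper indentation means a sub-entry.
function indexToXml(array $lines)
{
    $xml  = '';
    $open = array(); // indent width of each <term> that is still open

    foreach ($lines as $line) {
        if (trim($line) === '') {
            continue;
        }
        $indent = strlen($line) - strlen(ltrim($line));

        // Close every open term that is not an ancestor of this line.
        while ($open && end($open) >= $indent) {
            array_pop($open);
            $xml .= '</term>';
        }

        $xml   .= '<term>' . entryBody(trim($line));
        $open[] = $indent;
    }

    // Whatever is still open closes here, innermost first.
    return $xml . str_repeat('</term>', count($open));
}

echo indexToXml(array('Chainsaw, 50', '  juggling, 210')), "\n";
// prints <term><name>Chainsaw</name><page>50</page><term><name>juggling</name><page>210</page></term></term>
```

The stack is what produces the doubled `</term></term>`: "Chainsaw" stays open while "juggling" is written, and both close once no deeper or sibling line follows.
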
kenrbnsn, posted March 6, 2006:

I have used the miniXML package (http://minixml.psychogenic.com/) to decode XML, but it can also be used to generate XML. Take a look and see if it meets your needs.

Ken
behindspace, posted March 6, 2006:

That package may work, but I think I'll have to write my code to create an array out of a text file and then parse it.