jpratt Posted July 24, 2009

I have HTML like this:

<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p>
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...

A lot of my pages are formatted differently. What I want to do is remove all the HTML tags, but put the blocks of text in an array so I can save the different chunks in different fields in the database. The array would be filled with these blocks:

first element: title here
2nd element: first paragraph of text here
3rd element: another one here

Any ideas how I could do this?
vineld Posted July 24, 2009

Look at preg_match_all and regular expressions.
jpratt Posted July 24, 2009

I've looked into them, but all the pages I want to loop through have a different format. So <p> tags will be in different places, and <img> tags will also be in different places. There may also be div, b, em, etc. tags mixed in. I wouldn't think regular expressions would be able to accommodate that. I thought about using something like strip_tags, but it just lops everything off and returns the entire thing as one string.
vineld Posted July 24, 2009

What EXACTLY do you wish to do with the text? Do you need to know which tag it was in? How do you know which text you want where? You must work out the logic of what you want before constructing the actual code. strip_tags will simply strip all tags, as you said, so that will not help you.
alphanumetrix Posted July 24, 2009

Quoting vineld: "Look at preg_match_all and regular expressions."

PCRE functions are slow & inefficient. Just try the DOMDocument parser (depending on your PHP configuration, you should be able to use it).

*Note: parsing XML with DOMDocument is also slow & inefficient if there are better alternatives available... but this is definitely a better choice than PCRE from what you said.

For example:

function parseTagData( $tag, $xml, $single = false, $dir = true )
{
    $parser = new DOMDocument();
    // $dir === true: $xml is a file path; otherwise it is the markup itself.
    // The HTML loaders are used because markup like the sample (unclosed <img>)
    // is not well-formed XML and would be rejected by load()/loadXML().
    if ( $dir === true )
        $parser->loadHTMLFile( $xml );
    else
        $parser->loadHTML( $xml );
    $data = $parser->getElementsByTagName( $tag );
    if ( $single === false )
    {
        // Collect the text of every matching element, in document order.
        $collected = array();
        foreach ( $data as $d )
            array_push( $collected, trim( $d->nodeValue, "\n" ) );
    }
    else
    {
        // Only the first matching element is wanted.
        $collected = $data->item(0)->nodeValue;
    }
    return $collected;
}

/* should print "title here" */
echo ( parseTagData( "h3", "file.html", true ) );

/* should print "first paragraph of text here - another one here - " */
$paragraphs = parseTagData( "p", "file.html" );
foreach ( $paragraphs as $p )
    echo $p . ' - ';
jpratt Posted July 24, 2009

I looked into the DOMDocument model and it isn't very dynamic. Some of my text will be in p tags, some in span tags and so forth. I just need something like the strip_tags function, but instead of removing all the tags and returning all the text together, it needs to return an array of the chunks of text found between the tags, in the order they were encountered.
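For what it's worth, DOMDocument does not have to be queried one tag name at a time. A minimal sketch, assuming the markup is passed in as a string and that character encoding needs no special handling: DOMXPath can select every text node in document order, whatever tag it sits in. The function name extractTextChunks is made up for illustration and is not from the posts above.

```php
<?php
// Hypothetical helper: returns the non-empty text chunks of an HTML
// fragment in the order they appear, regardless of the enclosing tag.
function extractTextChunks(string $html): array
{
    $doc = new DOMDocument();
    // loadHTML() tolerates sloppy markup but reports it as warnings;
    // collect those internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $chunks = [];
    // All text nodes except those inside <script> or <style>.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $chunks[] = $text;
        }
    }
    return $chunks;
}
```

Because the query is on text nodes rather than on element names, it should not matter whether a chunk lives in a p, span, b or anything else.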
vineld Posted July 24, 2009

preg_match_all will work either way... Speed doesn't really matter unless the files are huge or you do this on the fly (which is usually not the case).

I have not used DOMDocument myself, so thanks for the tip btw!
alphanumetrix Posted July 24, 2009

Hmm... If you are going to strip all the tags, then maybe you should try the "fgetss" function. Should be here: http://php.net/fgetss
jpratt Posted July 27, 2009

So my question is: how would I implement something like preg_match_all or fgetss to get my desired results? The first tag it hits might be an img tag, an a tag, a p tag or something else. How would I loop through the string to place the chunks of text in an array?
jpratt Posted July 27, 2009

I have tried various regular expressions to remove the HTML using preg_match_all, with little or no success. Anyone know of a regular expression that will remove it?
.josh Posted July 27, 2009

What about...

$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);
.josh Posted July 27, 2009

Also, you might want to throw in a str_replace for the \n's already there, to avoid multi-line content being broken into different array elements:

$content = str_replace("\n","",$content);
$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);
jpratt Posted July 27, 2009

I don't want to go off newlines, because some tags may be in the middle of a line, such as b tags. I just need an expression to strip out all HTML tags and place the chunks of text between the tags in an array, in the order they were encountered.
.josh Posted July 27, 2009

Did you actually try the code posted? Using the example content in your OP (and adding a bit to it for example's sake), this...

$content = <<<BLAH
<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p>
blah blahblah
<p>first paragraph of text here</p>
more blah
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...
BLAH;

$content = str_replace("\n","",$content);
$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);
echo "<pre>";print_r($content);

produces this:

Array
(
    [1] => title here
    [5] => first paragraph of text here
    [6] => blah blahblah
    [7] => first paragraph of text here
    [8] => more blah
    [9] => another one here
    [11] => ...
)

Is that not what you want?

(edited to fix code tag)
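A variation on the same idea that skips the newline round trip entirely, offered as a sketch rather than a replacement: preg_split can cut the string directly at each tag, reusing the same tag pattern as above. The function name splitOnTags is invented for this example. Note that it uses an explicit empty-string check, since a bare array_filter() would also drop a chunk that is just "0".

```php
<?php
// Hypothetical helper: split an HTML string at every tag and return
// the trimmed, non-empty pieces of text in the order encountered.
function splitOnTags(string $html): array
{
    // Same tag pattern as in the posts above.
    $parts = preg_split('~<[^>]+>~', $html);
    $parts = array_map('trim', $parts);
    // Drop only truly empty pieces, then reindex from 0.
    $parts = array_filter($parts, function ($part) {
        return $part !== '';
    });
    return array_values($parts);
}
```

Inline tags such as b still split the surrounding sentence into separate chunks, which matches the "text between the tags" behaviour asked for earlier in the thread.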
jpratt Posted July 31, 2009

Sorry, after looking through things, your idea works great. Now I have to reverse engineer the thing. The text has been taken out and translated into a different language. Now I have to stick it back in, based on the English HTML. I was thinking of looping through the field and sticking it into a two-dimensional array. The first element would contain tags or text; the second would contain a 1 or 0 depending on whether it was HTML or text. If it is text, it is replaced with the new language for that section. We got the text separated from the HTML, but how do I get them all into an array together without just removing the HTML? Thanks.
.josh Posted July 31, 2009

Okay, so let me get this straight. Are you saying the overall goal is to strip the tags out, translate each chunk of text, and then put the tags back in? If so, can you post what you have as far as going from languageA -> languageB for these chunks of text?
jpratt Posted July 31, 2009

OK, here goes. The first step you helped with was getting the text out of the HTML. This I wrote to a file organized like this:

1.1 First chunk of text from first page. blah blah blah
1.2 second chunk of text. blah blah blah
1.3 third chunk of text
2.1 First chunk of text from second page. blah blah blah
2.2 second chunk of text. blah blah blah
2.3 third chunk of text
2.4 could have additional text
...

This file is given to the translator and comes back in the same format, just in another language. Now I have to stick the new language back into the HTML tags in the right place. I am not overwriting text, just creating a new entry in the database with a different language key. So I was thinking of reading the English HTML string again and walking through the string, placing the HTML and text in an array. If an element of the array is HTML, it is ignored; if it is text, it replaces the English version in the array. Then when it is done, it writes the array to the database in a new row with the new language id.
.josh Posted July 31, 2009

Hmmm, I'm still kind of fuzzy on the overall picture here. I see the "before" and "after" list. I'm unclear about this whole language key/db thing, but are you saying (going off the previous content example) you have this to start out with:

<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p>
blah blahblah
<p>first paragraph of text here</p>
more blah
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...

You run the code to put it into an array, run it through the translator, so now you have an array of text in a different language, and you want to have this:

<h3>**** ****</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>***** ******** ** **** ****</p>
*** *******
<p>***** ********** ** **** ****</p>
**** ****
<p>******* *** ****</p>
<img src='images/image2.jpg' id='imgl'>...

where the *'s are the new language text lines. Is that what you want?
jpratt Posted July 31, 2009

Yep, that's what I want. The middle of the process is outputting a file for the translator, then getting it back and importing it to get the result you described.
.josh Posted July 31, 2009

// $content represents original content string
$content = preg_replace('~(>)(?!\s+<).*?(<|$)~s',"$1[*CONTENT*]$2",$content);

//$array represents the array of translated lines...
foreach($array as $a) {
    $content = preg_replace('~\[\*CONTENT\*\]~',$a,$content,1);
}
jpratt Posted July 31, 2009

This works out great. There are only a few problems I am having. It is placing a content area between two tags with no data between them. So if I have this:

<img src='blah' align='right'><p>some text</p>

it places a content area between the img tag and the opening p tag, as well as where the true content is. Also, exporting was not a problem because I could look at each chunk of text coming in, but I have <script> tags with content between them as well. Other than this, I think we might have it about there.
.josh Posted July 31, 2009

Hmm... try changing it to this:

// $content represents original content string
$content = preg_replace('~(>)(?!\s*<).*?(<|$)~s',"$1[*CONTENT*]$2",$content);

//$array represents the array of translated lines...
foreach($array as $a) {
    $content = preg_replace('~\[\*CONTENT\*\]~',$a,$content,1);
}
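One thing to watch with the fill-in loop: preg_replace treats sequences like $1 or \1 in the replacement string as backreferences, so a translated line containing a price such as "$10" may come out altered. A possible workaround, sketched here with an invented function name (fillPlaceholders), is preg_replace_callback, whose return value is inserted literally. It also fills every placeholder in a single pass.

```php
<?php
// Hypothetical helper: replace each [*CONTENT*] placeholder with the
// next translated line, inserting the text literally.
function fillPlaceholders(string $template, array $translations): string
{
    $translations = array_values($translations); // ensure 0-based keys
    $index = 0;
    return preg_replace_callback(
        '~\[\*CONTENT\*\]~',
        function (array $match) use (&$index, $translations) {
            // If the translations run out, leave the placeholder
            // untouched so the gap is easy to spot.
            if (!array_key_exists($index, $translations)) {
                return $match[0];
            }
            return $translations[$index++];
        },
        $template
    );
}
```

Here $template stands for the string that already contains the [*CONTENT*] markers, and $translations for the array of translated lines.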
jpratt Posted July 31, 2009

That is awesome, thanks. I am still learning about regular expressions, but you have helped out tons. The last problem is the content between the script tags. I have no idea how to ignore this when creating the content 'fields'.
jpratt Posted August 3, 2009

Any idea how I would handle the content in the script tags?
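The thread stops without an answer here. One possible approach, given as an untested-in-production sketch and not as anything the posters above proposed: split the string with preg_split and PREG_SPLIT_DELIM_CAPTURE so that tags are kept, and let the pattern match a whole script (or style) block as a single "tag" so its body is never seen as text. This also yields the array of [piece, 1 or 0] pairs described earlier. The function names tokenizeHtml and translateHtml are invented, and the sketch assumes the translated chunks were exported with the same rules so the counts line up.

```php
<?php
// Hypothetical helper: break HTML into [piece, flag] pairs, where flag
// is 1 for translatable text and 0 for markup or whitespace.
function tokenizeHtml(string $html): array
{
    // Whole <script>/<style> blocks are tried first, so their bodies
    // travel with the tag instead of becoming text chunks.
    $pattern = '~(<script\b[^>]*>.*?</script\s*>'
             . '|<style\b[^>]*>.*?</style\s*>'
             . '|<[^>]+>)~is';
    $pieces = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $tokens = [];
    foreach ($pieces as $i => $piece) {
        if ($piece === '') {
            continue;
        }
        // With one capture group, odd positions are the delimiters.
        $isTag = ($i % 2 === 1);
        $isText = !$isTag && trim($piece) !== '';
        $tokens[] = [$piece, $isText ? 1 : 0];
    }
    return $tokens;
}

// Hypothetical helper: rebuild the HTML, swapping each text piece for
// the next translated line and copying everything else unchanged.
function translateHtml(string $html, array $translations): string
{
    $translations = array_values($translations);
    $next = 0;
    $out = '';
    foreach (tokenizeHtml($html) as [$piece, $isText]) {
        if ($isText && isset($translations[$next])) {
            $out .= $translations[$next++];
        } else {
            $out .= $piece;
        }
    }
    return $out;
}
```

Whitespace around a replaced text piece is not preserved in this sketch, and a stray < or > inside ordinary text can still confuse the tag pattern, so it is worth checking the output on a few real pages.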