text from html

jpratt · July 24, 2009

I have html like this:

<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p>
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...

A lot of my pages are formatted differently. So what I want to be able to do is remove all HTML tags, but put the blocks of text in an array and save the different chunks in different fields in the database. The array would be filled with these different blocks:

first element: title here

2nd element: first paragraph of text here

3rd element: another one here

Any ideas how I could do this?

vineld · July 24, 2009

Look at preg_match_all and regular expressions.

jpratt · July 24, 2009

I've looked into them, but all the pages I want to loop through have a different format. So <p> tags will be in different places and <img> tags will also be in different places. There may also be div, b, em, etc. tags mixed in. I wouldn't think regular expressions would be able to accommodate that. I thought about using something like strip_tags, but it just lops everything off and returns the entire thing as one string.

vineld · July 24, 2009

What EXACTLY is it that you wish to do with the text? Do you need to know which tag it was in? How do you know what text you want where? You must know the logic of what you want before constructing the actual code.

Strip_tags will simply strip all tags as you said so that will not help you.

alphanumetrix · July 24, 2009

vineld wrote: "Look at preg_match_all and regular expressions."

PCRE functions are slow & inefficient.

Just try the DOMDocument parser (depending on your PHP configuration, you should be able to use it).

Note: parsing with DOMDocument is also slow and inefficient compared with better alternatives, where they are available, but from what you said it is definitely a better choice than PCRE.

For example:


function parseTagData( $tag, $source, $single = false, $isFile = true ) {

	$parser = new DOMDocument();

	// Real-world HTML is rarely well-formed XML (unclosed <img> tags, etc.),
	// so use the HTML loaders and silence the parse warnings they raise.
	if ( $isFile === true )
		@$parser->loadHTMLFile( $source );
	else
		@$parser->loadHTML( $source );

	$data = $parser->getElementsByTagName( $tag );

	if ( $single === true ) {
		// Only the first match, or null if the tag never occurs.
		return $data->length > 0 ? trim( $data->item(0)->nodeValue ) : null;
	}

	$collected = array();

	foreach ( $data as $d )
		$collected[] = trim( $d->nodeValue );

	return $collected;

}

/* should print "title here" */

echo ( parseTagData( "h3", "file.html", true ) );

/* should print "first paragraph of text here - another one here - " */

$paragraphs = parseTagData( "p", "file.html" );

foreach ( $paragraphs as $p )
	echo $p . ' - ';

jpratt · July 24, 2009

I looked into the DOMDocument model and it isn't very dynamic. Some of my text will be in p tags, some in span tags and so forth. I just need something like the strip_tags function, but instead of removing all the tags and returning all the text together, it needs to return an array of the chunks of text found between the tags, in the order they were encountered.
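
For what it's worth, DOMDocument does not have to be queried one tag name at a time. A minimal sketch, assuming the DOM extension is available: an XPath query for text nodes returns every chunk of text in document order, whatever element it sits in. The helper name collectTextChunks is made up for illustration, and non-ASCII pages may need their encoding declared before loading.

```php
<?php
// Hypothetical helper (not from the thread): collect every non-empty text
// node in document order, regardless of the element that contains it.
function collectTextChunks(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about sloppy markup such as unclosed <img> tags.
    @$doc->loadHTML($html);

    $xpath  = new DOMXPath($doc);
    $chunks = array();

    // Skip text that lives inside <script> or <style> elements.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $chunks[] = $text;
        }
    }
    return $chunks;
}

$html = "<h3>title here</h3><div style='width:515px;'>"
      . "<img src='images/images1.jpg' id='imgr'>"
      . "<p>first paragraph of text here</p><p>another one here</p></div>";

print_r(collectTextChunks($html));
```

With the sample markup from the first post this should give the three chunks in the order asked for: the title, then the two paragraphs.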

vineld · July 24, 2009

preg_match_all will work either way... Speed is not really important unless the files are huge or you do this on the fly (which is usually not the case).

I have not used DOMDocument myself so thanks for the tip btw!

alphanumetrix · July 24, 2009

Hmm... If you are going to strip all the tags, then maybe you should try the "fgetss" function. Should be here: http://php.net/fgetss

jpratt · July 27, 2009

So my question is: how would I implement something like preg_match_all or fgetss to get my desired results? The first tag it hits might be an img tag, an a tag, a p tag or something else. How would I loop through the string to place the chunks of text in an array?

jpratt · July 27, 2009

I have tried various regular expressions with preg_match_all to remove the HTML, with little or no success. Does anyone know of a regular expression that will do this?

.josh · July 27, 2009

What about...

$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);

.josh · July 27, 2009

Also, might wanna throw in a str_replace for \n's already there, to avoid multi-line content being broken into diff array elements:

$content = str_replace("\n","",$content);
$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);
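
One caveat with the snippet above: array_filter() without a callback drops every chunk PHP treats as false, which includes the string "0", while whitespace-only chunks survive and the original keys are kept. A small variation, offered only as a sketch with made-up sample content, trims each chunk, drops the truly empty ones and renumbers the array:

```php
<?php
// Sample content, invented for this sketch.
$content = "<h3>title here</h3>\n<p>first paragraph</p> \n<p>0</p>";

$content = str_replace("\n", "", $content);
$content = preg_replace('~<[^>]+>~', "\n", $content);
$chunks  = array_map('trim', explode("\n", $content));

// Keep any chunk with at least one character (so "0" survives),
// then renumber the keys from 0.
$chunks = array_values(array_filter($chunks, 'strlen'));

print_r($chunks);
```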

jpratt · July 27, 2009

I don't want to go off new lines, because some tags may be in the middle of a line, such as b tags. I just need an expression to strip out all HTML tags and place the chunks of text between the tags in an array, in the order they were encountered.

.josh · July 27, 2009

did you actually try the code posted?

Using the example content in your OP (and adding a bit to it for example sake), this....

$content = <<<BLAH
<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p> blah blahblah  <p>first paragraph of text here</p> more blah
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...
BLAH;

$content = str_replace("\n","",$content);
$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);

echo "<pre>";print_r($content);

produces this:

Array
(
    [1] => title here
    [5] => first paragraph of text here
    [6] =>  blah blahblah  
    [7] => first paragraph of text here
    [8] =>  more blah
    [9] => another one here
    [11] => ...
)

Is that not what you want?


jpratt · July 31, 2009

Sorry, after looking through things, your idea works great. Now I have to reverse engineer the thing. The text has been taken out and translated into a different language, and now I have to stick it back in based on the English HTML. I was thinking of looping through the field and sticking it into a two-dimensional array: the first element would contain the tag or text, the second would contain a 1 or 0 depending on whether it is HTML or text. If it is text, it is replaced with the new language for that section. We got the text separated from the HTML, but how do I get them all into an array together without just removing the HTML? Thanks.
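
One way to get tags and text into a single array, sketched here under assumptions: preg_split() with the PREG_SPLIT_DELIM_CAPTURE flag keeps the matched tags in the result instead of throwing them away. The helper names splitTagsAndText and rebuildHtml are hypothetical, and the choice of 1 for text and 0 for HTML is an assumption, since the post does not say which way round the flag goes.

```php
<?php
// Hypothetical helper: split HTML into ordered array(chunk, flag) pairs,
// where flag is 1 for translatable text and 0 for tags or bare whitespace.
function splitTagsAndText(string $html): array
{
    // The capturing group plus PREG_SPLIT_DELIM_CAPTURE keeps the tags;
    // PREG_SPLIT_NO_EMPTY drops the empty strings between adjacent tags.
    $pieces = preg_split('~(<[^>]+>)~', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $parts = array();
    foreach ($pieces as $piece) {
        $isTag   = preg_match('~^<[^>]+>$~', $piece) === 1;
        $isText  = !$isTag && trim($piece) !== '';
        $parts[] = array($piece, $isText ? 1 : 0);
    }
    return $parts;
}

// Hypothetical helper: swap each text chunk for the next translated line
// and glue everything back together in the original order.
function rebuildHtml(array $parts, array $translated): string
{
    $out = '';
    foreach ($parts as $part) {
        list($chunk, $isText) = $part;
        if ($isText && $translated) {
            $chunk = array_shift($translated);
        }
        $out .= $chunk;
    }
    return $out;
}

$parts = splitTagsAndText("<img src='blah' align='right'><p>some text</p>");
echo rebuildHtml($parts, array('texto de ejemplo'));
```

Note that a replaced chunk loses whatever leading or trailing whitespace the English chunk had, so the translated lines may need padding if that spacing matters.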

.josh · July 31, 2009

Okay so let me get this straight. Are you saying the overall goal is to strip the tags out, translate each chunk of text, and then put the tags back in? If so, can you post what you have as far as going from languageA -> languageB for these chunks of text?

jpratt · July 31, 2009

OK here goes. The first step you helped with was getting the text out of the HTML. This I wrote to a file organized like this:

1.1 First chunk of text from first page. blah blah blah

1.2 second chunk of text. blah blah blah

1.3 third chunk of text

2.1 First chunk of text from second page. blah blah blah

2.2 second chunk of text. blah blah blah

2.3 third chunk of text

2.4 could have additional text

...

This file is given to the translator, and given back in the same format, just in another language. Now I have to stick the new language back into the HTML tags in the right place. I am not overwriting text, just creating a new entry in the database with a different language key.

So I was thinking of reading the English HTML string again and walking through it, placing the HTML and text in an array. If an element of the array is HTML, it is left alone, but if it is text, it is replaced with the translated version. Then when it is done, it writes the array to the database in a new row with the new language id.
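
Reading the translator's file back in could look roughly like this. It is only a sketch: it assumes every chunk sits on a single line that starts with its "page.chunk" number, and the function name parseTranslationFile is made up.

```php
<?php
// Hypothetical parser for the "page.chunk text" layout described above.
// Returns array(page => array(chunk => text)).
function parseTranslationFile(string $text): array
{
    $pages = array();
    // \R matches any newline sequence, so Windows line endings work too.
    foreach (preg_split('~\R~', $text) as $line) {
        // Lines that do not start with "N.N " (blank lines, etc.) are skipped.
        if (preg_match('~^(\d+)\.(\d+)\s+(.*)$~', trim($line), $m)) {
            $pages[(int) $m[1]][(int) $m[2]] = $m[3];
        }
    }
    return $pages;
}

$sample = "1.1 Premier bloc\n\n1.2 Second bloc\n\n2.1 Autre page\n";
print_r(parseTranslationFile($sample));
```

Keying the result by page and chunk number means a missing or reordered line in the returned file shows up as a gap rather than silently shifting every later chunk.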

.josh · July 31, 2009

Hmmm, I'm still kind of fuzzy on the overall picture here. I see the "before" and "after" list. I'm unclear about this whole language key/db thing, but are you saying (going off the previous content example) you have this to start out with:

<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p> blah blahblah  <p>first paragraph of text here</p> more blah
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...

You run the code to put it into an array, run it through the translator, so now you have an array of text in a different language, and you want to have this:

<h3>**** ****</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>***** ******** ** **** ****</p> *** *******  <p>***** ********** ** **** ****</p> **** ****
<p>******* *** ****</p>
<img src='images/image2.jpg' id='imgl'>...

where the *'s are the new language text lines. Is that what you want?

jpratt · July 31, 2009

Yep, that's what I want. The middle of the process is outputting a file for the translator, then getting it back and importing it to get the result you described.

.josh · July 31, 2009

// $content represents original content string
$content = preg_replace('~(>)(?!\s+<).*?(<|$)~s',"$1[*CONTENT*]$2",$content);
//$array represents the array of translated lines...
foreach($array as $a) {
  $content = preg_replace('~\[\*CONTENT\*\]~',$a,$content,1);
}

jpratt · July 31, 2009

This works out great. There are only a few problems I am having. It is placing a content area between two tags that have no data between them. So if I have this:

<img src='blah' align='right'><p>some text</p>

It places a content area between the img tag and the opening p tag as well as where the true content is.

Also, exporting was not a problem, because I could look at each chunk of text coming in, but I also have <script> tags with content between them. Other than this I think we are about there.

.josh · July 31, 2009

hmm...try changing it to this:

// $content represents original content string
$content = preg_replace('~(>)(?!\s*<).*?(<|$)~s',"$1[*CONTENT*]$2",$content);
//$array represents the array of translated lines...
foreach($array as $a) {
  $content = preg_replace('~\[\*CONTENT\*\]~',$a,$content,1);
}
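
A side note on the loop above: preg_replace() scans the replacement string for references such as $1 or \1, so a translated line that happens to contain a dollar sign followed by a digit may not come out as written. A sketch of one way around that, using made-up sample data, is preg_replace_callback(), whose return value is inserted literally:

```php
<?php
// Sample template and translated lines, invented for this sketch.
$content = "<p>[*CONTENT*]</p><p>[*CONTENT*]</p>";
$array   = array('costs $5', 'second line');

$content = preg_replace_callback(
    '~\[\*CONTENT\*\]~',
    function ($m) use (&$array) {
        // Hand out the translated lines in order; if they run out,
        // leave the placeholder in place so the gap is easy to spot.
        return $array ? array_shift($array) : $m[0];
    },
    $content
);

echo $content;
```

This also fills every placeholder in a single pass instead of rescanning the string once per translated line.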

jpratt · July 31, 2009

That is awesome, thanks. I am still learning about regular expressions, but you have helped out tons. The last problem is the content between the script tags: I have no idea how to ignore it when creating the content 'fields'.

jpratt · August 3, 2009

Any idea how I would handle the content in the script tags?
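
The thread ends without an answer to this. One possible approach, sketched here and not something anyone in the thread posted: treat a whole <script> or <style> element, body included, as if it were a single tag, so its contents never become a content field. The function name markContent is hypothetical; the [*CONTENT*] placeholder is the one used earlier in the thread.

```php
<?php
// Hypothetical helper: replace each chunk of real text with [*CONTENT*],
// leaving tags and entire <script>/<style> elements untouched.
function markContent(string $html): string
{
    // The first two alternatives swallow a script or style element whole;
    // the last one matches any ordinary tag.
    $pattern = '~(<script\b[^>]*>.*?</script>|<style\b[^>]*>.*?</style>|<[^>]+>)~is';

    // With one capturing group and PREG_SPLIT_DELIM_CAPTURE the result
    // alternates: even indexes are text, odd indexes are the matched tags.
    $pieces = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 0 && trim($piece) !== '') {
            $pieces[$i] = '[*CONTENT*]';
        }
    }
    return implode('', $pieces);
}

$html = "<img src='blah' align='right'><p>some text</p>"
      . "<script>var s = '<b>hi</b>';</script>";

echo markContent($html);
```

The same pattern could be used on the export side so that the chunks sent to the translator and the placeholders created on import stay in step.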
