
text from html


jpratt


I have HTML like this:

 

<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p>
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...

 

A lot of my pages are formatted differently. What I want to do is remove all the HTML tags, but put the blocks of text into an array so I can save the different chunks in different fields in the database. The array would be filled with these blocks:

 

first element: title here

2nd element: first paragraph of text here

3rd element: another one here

 

Any ideas how I could do this?


I've looked into them, but all the pages I want to loop through have a different format. So <p> tags will be in different places and <img> tags will also be in different places. There may also be div, b, em, etc. tags mixed in. I wouldn't think regular expressions would be able to accommodate that. I thought about using something like strip_tags, but it just lops everything off and returns the entire thing as one string.


What EXACTLY is it that you wish to do with the text? Do you need to know which tag it was in or how do you know what text you want where? You must know the logic of what you want to know before constructing the actual code.

 

strip_tags will simply strip all tags, as you said, so that will not help you.


Look at preg_match_all and regular expressions.

 

PCRE functions can be slow and brittle when used to parse HTML.

 

Just try the DOMDocument parser (depending on your PHP configuration, you should be able to use it).

 

*Note: parsing with DOMDocument is not especially fast either, and there may be better alternatives available, but from what you said it is definitely a better choice than PCRE.

 

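E.g.:

function parseTagData( $tag, $html, $single = false, $fromFile = true ) {

	$parser = new DOMDocument();

	// Real-world HTML is rarely well-formed XML, so use the HTML loaders
	// and collect parse warnings instead of printing them.
	libxml_use_internal_errors( true );

	if ( $fromFile === true )
		$parser->loadHTMLFile( $html );
	else
		$parser->loadHTML( $html );

	libxml_clear_errors();

	$data = $parser->getElementsByTagName( $tag );

	if ( $single === false ) {

		// Collect the text of every matching element.
		$collected = array();

		foreach ( $data as $d )
			array_push( $collected, trim( $d->nodeValue ) );

	} else {

		// Only the first match; null if the tag does not occur.
		$first = $data->item(0);
		$collected = $first ? trim( $first->nodeValue ) : null;

	}

	return $collected;

}

/* should print "title here" */

echo ( parseTagData( "h3", "file.html", true ) );

/* should print "first paragraph of text here - another one here - " */

$paragraphs = parseTagData( "p", "file.html" );

foreach ( $paragraphs as $p )
	echo $p . ' - ';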


I looked into DOMDocument and it isn't very dynamic. Some of my text will be in p tags, some in span tags, and so forth. I just need something like the strip_tags function, but instead of removing all the tags and returning all the text together, it needs to return an array of the chunks of text found between the tags, in the order they were encountered.
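
For what it's worth, DOMDocument does not have to be queried by tag name. A minimal sketch, assuming the markup is passed in as a string: an XPath query for text nodes returns every chunk of text in document order, whatever tag it sits in. The function name textChunks is made up for this illustration, and the '<?xml encoding="UTF-8">' prefix is a common workaround for making loadHTML read the input as UTF-8, not something from this thread.

```php
<?php
// Hypothetical helper: return the non-empty text chunks of an HTML string
// in the order they appear, regardless of which tags surround them.
function textChunks(string $html): array
{
    $doc = new DOMDocument();

    // Collect warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8; the prefix hints that to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $chunks = [];

    // Every text node that is not inside a <script> or <style> element.
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $chunks[] = $text;
        }
    }

    return $chunks;
}
```

Note that inline tags such as b or em split the surrounding text into separate chunks, because each text node is returned on its own.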


So my question is: how would I implement something like preg_match_all or fgetss to get the desired results? The first tag it hits might be an img tag, an a tag, a p tag, or something else. How would I loop through the string to place the chunks of text in an array?

Link to comment
Share on other sites

Also, you might want to throw in a str_replace for the \n's already there, to avoid multi-line content being broken into different array elements:

 

$content = str_replace("\n","",$content);
$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);
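
A variant of the same idea, as a rough sketch only: preg_split can split on the tags directly, so the temporary "\n" marker is not needed. The function name splitOnTags is hypothetical. Unlike the snippet above, it replaces line breaks with a space instead of removing them, trims each chunk, keeps a chunk that is the literal string "0" (plain array_filter would drop it), and reindexes the result from 0.

```php
<?php
// Hypothetical helper: split an HTML string on its tags and keep the text in between.
function splitOnTags(string $html): array
{
    // Collapse line breaks so multi-line text stays in one chunk.
    $html = str_replace(["\r", "\n"], ' ', $html);

    // Split wherever a tag occurs; PREG_SPLIT_NO_EMPTY drops zero-length pieces.
    $pieces = preg_split('~<[^>]+>~', $html, -1, PREG_SPLIT_NO_EMPTY);

    // Trim each piece and discard the whitespace-only ones.
    // Filtering with strlen keeps a chunk that is the string "0".
    $pieces = array_filter(array_map('trim', $pieces), 'strlen');

    // Reindex so the keys run 0, 1, 2, ...
    return array_values($pieces);
}
```

The pattern assumes no tag contains a literal ">" inside an attribute value.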


I don't want to go off new lines because some tags may be in the middle of a line, such as b tags. I just need an expression to strip out all HTML tags and place the chunks of text between the tags in an array, in the order they were encountered.


Did you actually try the code posted?

 

Using the example content in your OP (and adding a bit to it for example sake), this....

$content = <<<BLAH
<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p> blah blahblah  <p>first paragraph of text here</p> more blah
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...
BLAH;

$content = str_replace("\n","",$content);
$content = preg_replace('~<[^>]+>~',"\n",$content);
$content = explode("\n",$content);
$content = array_filter($content);

echo "<pre>";print_r($content);

 

produces this:

 

Array
(
    [1] => title here
    [5] => first paragraph of text here
    [6] =>  blah blahblah  
    [7] => first paragraph of text here
    [8] =>  more blah
    [9] => another one here
    [11] => ...
)

 

Is that not what you want?

 


Sorry, after looking through things, your idea works great. Now I have to reverse-engineer the thing. The text has been taken out and translated into a different language, and now I have to stick it back in based on the English HTML. I was thinking of looping through the field and sticking it into a two-dimensional array. The first element would contain the tags or text; the second would contain a 1 or 0 depending on whether it was HTML or text. If it is text, it is replaced with the new language for that section. We got the text separated from the HTML, but how do I get them all into an array together without just removing the HTML? Thanks.


OK, here goes. The first step you helped with was getting the text out of the HTML. I wrote this to a file organized like this:

 

1.1 First chunk of text from first page. blah blah blah

1.2 second chunk of text. blah blah blah

1.3 third chunk of text

 

2.1 First chunk of text from second page. blah blah blah

2.2 second chunk of text. blah blah blah

2.3 third chunk of text

2.4 could have additional text

 

...

 

This file is given to the translator and comes back in the same format, just in another language. Now I have to stick the new language back into the HTML tags in the right place. I am not overwriting text, just creating a new entry in the database with a different language key.
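
Reading the translated file back in could look roughly like this. It is only a sketch under assumptions: each chunk sits on a single line that starts with "page.chunk" followed by a space, as in the listing above, and anything else (blank lines, the "..." filler) is skipped. The function name parseTranslations is made up for the example.

```php
<?php
// Hypothetical helper: turn the translator's file into
// [pageNumber => [chunkNumber => text]].
function parseTranslations(string $fileContents): array
{
    $pages = [];

    // \R matches any kind of line break (\n, \r\n, \r).
    foreach (preg_split('~\R~', $fileContents) as $line) {
        // "<page>.<chunk> <text>", e.g. "2.3 third chunk of text".
        if (preg_match('~^\s*(\d+)\.(\d+)\s+(.*\S)\s*$~', $line, $m)) {
            $pages[(int) $m[1]][(int) $m[2]] = $m[3];
        }
    }

    return $pages;
}
```

A chunk whose text wraps onto a second line would need extra handling that this sketch does not attempt.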

 

So I was thinking of reading the English HTML string again and walking through it, placing the HTML and text in an array. If an element of the array is HTML, it is left alone, but if it is text, it is replaced with the translated version. Then, when it is done, the array is written to the database in a new row with the new language ID.
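
One way the walk-through described here might be sketched: preg_split with PREG_SPLIT_DELIM_CAPTURE returns the tags and the text as alternating array elements, so no separate 1/0 flag is needed (odd positions are always tags). The function name replaceText is hypothetical, and the translations are assumed to be in the same order as the English chunks.

```php
<?php
// Hypothetical helper: rebuild $html with each text chunk replaced by the
// next translated line, leaving every tag exactly as it was.
function replaceText(string $html, array $translations): string
{
    $translations = array_values($translations);

    // Alternating tokens: text, tag, text, tag, ... (the group keeps the tags).
    $tokens = preg_split('~(<[^>]+>)~', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $next = 0;
    foreach ($tokens as $i => $token) {
        $isTag = ($i % 2 === 1); // odd positions hold the captured tags
        if ($isTag || trim($token) === '') {
            continue; // leave markup and whitespace-only gaps untouched
        }
        // Keep the English text if no translation exists for this position.
        if (array_key_exists($next, $translations)) {
            $tokens[$i] = $translations[$next];
        }
        $next++;
    }

    return implode('', $tokens);
}
```

This sketch treats the contents of script and style elements as ordinary text, so those would have to be excluded separately.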


Hmmm, I'm still kind of fuzzy on the overall picture here. I see the "before" and "after" list. I'm unclear about this whole language key/db thing, but are you saying (going off the previous content example) you have this to start out with:

 

<h3>title here</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>first paragraph of text here</p> blah blahblah  <p>first paragraph of text here</p> more blah
<p>another one here</p>
<img src='images/image2.jpg' id='imgl'>...

 

You run the code to put it into an array, run it through the translator, so now you have an array of text in a different language, and you want to have this:

 

<h3>**** ****</h3>
<div style='width:515px;'>
<img src='images/images1.jpg' id='imgr'>
<p>***** ******** ** **** ****</p> *** *******  <p>***** ********** ** **** ****</p> **** ****
<p>******* *** ****</p>
<img src='images/image2.jpg' id='imgl'>...

 

where the *'s are the new language text lines.  Is that what you want?


This works out great. There are only a few problems I am having. It is placing a content area between two tags that have no data between them. So if I have this:

 

<img src='blah' align='right'><p>some text</p>

 

It places a content area between the img tag and the opening p tag as well as where the true content is.

 

Also, exporting was not a problem, because I could look at each chunk of text coming in, but I have <script> tags with content between them as well. Other than this, I think we might have it about there.


Hmm... try changing it to this:

 

// $content represents original content string
$content = preg_replace('~(>)(?!\s*<).*?(<|$)~s',"$1[*CONTENT*]$2",$content);
//$array represents the array of translated lines...
foreach($array as $a) {
  $content = preg_replace('~\[\*CONTENT\*\]~',$a,$content,1);
}
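
One caveat with the loop above: preg_replace treats sequences such as $1 or \1 in the replacement string as backreferences, so a translated line containing something like "$10" could come out mangled. A possible alternative, sketched under the assumption that the translations arrive in order, is a single pass with preg_replace_callback, whose return value is inserted literally. The function name fillContent is hypothetical, and the pattern is a simplified stand-in for the one above, not the same expression.

```php
<?php
// Hypothetical helper: replace each run of text that follows a tag with the
// next translated line, in one pass and without a placeholder marker.
function fillContent(string $html, array $translated): string
{
    $queue = array_values($translated);

    return preg_replace_callback(
        // A ">" followed by text containing at least one non-space character,
        // up to the next "<" (or the end of the string).
        '~>(\s*[^<\s][^<]*)~',
        function (array $m) use (&$queue) {
            // Fall back to the original text if the translations run out.
            $text = $queue ? array_shift($queue) : $m[1];
            return '>' . $text;
        },
        $html
    );
}
```

Because whitespace-only gaps never match, no content area is created between adjacent tags such as an img followed by an opening p.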

 

 

 


That is awesome, thanks. I am still learning about regular expressions, but you have helped out tons. One last problem is the content between the script tags: I have no idea how to ignore it when creating the content 'fields'.
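
The thread ends without an answer to this. One possible approach, offered only as a sketch: treat a whole script or style element, contents included, as a single "tag" when splitting, so that its body never shows up as a text chunk. The function name textOutsideScripts is made up for this illustration.

```php
<?php
// Hypothetical helper: return the text chunks of an HTML string, skipping
// everything between <script>...</script> and <style>...</style>.
function textOutsideScripts(string $html): array
{
    // First alternative: a complete script/style element including its body.
    // Second alternative: any other single tag.
    $pattern = '~<(script|style)\b[^>]*>.*?</\1\s*>|<[^>]+>~is';

    $chunks = [];
    foreach (preg_split($pattern, $html) as $piece) {
        $piece = trim($piece);
        if ($piece !== '') {
            $chunks[] = $piece;
        }
    }

    return $chunks;
}
```

The same alternation could be used as the delimiter pattern when putting translated text back, so that script bodies are carried through unchanged. It assumes the script body does not itself contain the closing tag as a string.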
