FishSword, posted May 25, 2011:

Hiya!

I have a file (see attached) that contains basic HTML for a page with multiple paragraphs. How do I check the length of each paragraph to find out whether it is 30 characters or over?

If the first paragraph is shorter than 30 characters, the code should move on to the next one, and so on. If a paragraph of the correct length is found, I then need to extract the text from its <p> tags. If none of the paragraphs reaches 30 characters in length, the code will need to choose the paragraph with the best (longest) length.

Any help is greatly appreciated.

Many thanks,
FishSword

[attachment deleted by admin]

gizmola, posted May 26, 2011:

There are various ways to get the text inside paragraphs into variables. Two that are often used: the preg_match function, or loading the HTML into a DOMDocument and accessing the paragraph nodes with http://www.php.net/manual/en/domdocument.getelementsbytagname.php.

In either case, once you have the text in a string, you can use strlen to get its length.

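For reference, the DOMDocument route mentioned above might look roughly like the sketch below. The function name firstParagraphOfLength and the sample markup are made up for illustration and are not from the thread. It assumes that "length" means the byte count reported by strlen and that markup nested inside a paragraph can be dropped, since textContent returns only the text.

```php
<?php
// Hypothetical helper: returns the first <p> text of at least $minLength
// bytes, otherwise the longest <p> text, or false if there is no <p> at all.
function firstParagraphOfLength(string $html, int $minLength)
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $longest = false;
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent is the paragraph text with any nested tags stripped.
        $text = trim($p->textContent);
        if (strlen($text) >= $minLength) {
            return $text;
        }
        if ($longest === false || strlen($text) > strlen($longest)) {
            $longest = $text;
        }
    }
    return $longest;
}

// Sample markup, invented for this example.
$html = '<html><body><p>12345</p><p class="test">12345678901234567890</p>'
      . '<p>1234567</p></body></html>';

echo firstParagraphOfLength($html, 10), "\n"; // first paragraph with 10+ bytes
echo firstParagraphOfLength($html, 30), "\n"; // nothing reaches 30, so the longest
```

If the page contains multi-byte characters, mb_strlen may be closer to what is wanted than strlen, which counts bytes.
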
FishSword, posted June 1, 2011:

Hi,

Thanks for your reply. How would you achieve this using preg_match? I also found out that this can be done using strpos and strlen.

Which of those two approaches would you say is better, and how would each solution be implemented?

Thanks for your help.
FishSword

Psycho, posted June 1, 2011:

This will return the first paragraph that is at least 30 characters.

    // $text is the string to be searched
    // If this is a web page then I would assume you are using something like:
    // $text = file_get_contents('http://somedomain.com/somefile.htm');
    preg_match("#<p[^>]*>(.{30,})</p>#i", $text, $match);
    $firstParagraph30orMoreCharacters = $match[1];

Edit: I just realized from your first post that if there is no paragraph of 30 or more characters, you need the longest of the ones that do exist. Give me a few minutes.

Psycho, posted June 1, 2011:

Here is a function that should do exactly as you want:

    function getParagraph($input, $minLength)
    {
        //Check for 1st paragraph of minimum length
        if(preg_match("#<p[^>]*>(.{{$minLength},})</p>#i", $input, $match))
        {
            //Return 1st para matching min length, if found
            return $match[1];
        }

        //No para of min length found, check for any paragraphs
        preg_match_all("#<p[^>]*>(.*?)</p>#i", $input, $matches);
        if(empty($matches[1]))
        {
            //No paragraphs found
            return false;
        }

        //Find longest paragraph and return it
        $longestPara = '';
        foreach($matches[1] as $para)
        {
            if(strlen($para) > strlen($longestPara))
            {
                $longestPara = $para;
            }
        }
        return $longestPara;
    }

    //Usage
    echo getParagraph($text, 30);

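One caveat with the patterns above, offered as a sketch and not as the poster's own method: without the s modifier, . does not match newlines, so a paragraph that wraps across lines is never matched, and a greedy .{30,} can run past the first </p> when several short paragraphs share a line. The variant below uses a "tempered" dot that refuses to step over a closing </p>. The name getParagraphTempered and the sample strings are made up for illustration.

```php
<?php
// Hypothetical variant: same contract as getParagraph() in the post above,
// but a match can never span more than one paragraph.
function getParagraphTempered(string $input, int $minLength)
{
    // (?:(?!</p>).) consumes one character only if it does not begin "</p>".
    // The s modifier lets . match newlines; \b keeps <p from matching <pre>.
    $first = sprintf('#<p\b[^>]*>((?:(?!</p>).){%d,})</p>#is', $minLength);
    if (preg_match($first, $input, $match)) {
        return $match[1];
    }

    // No paragraph is long enough: collect them all and keep the longest.
    if (!preg_match_all('#<p\b[^>]*>((?:(?!</p>).)*)</p>#is', $input, $matches)) {
        return false; // no paragraphs at all
    }
    $longest = '';
    foreach ($matches[1] as $para) {
        if (strlen($para) > strlen($longest)) {
            $longest = $para;
        }
    }
    return $longest;
}

// Two short paragraphs on one line: neither reaches 20 characters,
// so the longer of the two is returned.
echo getParagraphTempered('<p>12345</p> <p>1234567890</p>', 20), "\n";

// A paragraph that wraps across lines is still matched as a whole.
echo getParagraphTempered("<p>line one\nline two continues here</p>", 20), "\n";
```

Like any regex approach to HTML, this still assumes paragraphs are not nested and that every <p> has an explicit closing tag.
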
xyph, posted June 1, 2011:

You want something like this - no RegEx required. Won't work with nested tags.

    <?php

    $str = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>Page Title</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
    <p>12345</p>
    <p class="test">12345678901234567890</p>
    <p>1234567</p>
    <p>123456789012345</p>
    </body>
    </html>';

    echo get_paragraph( $str,'p',30 );

    function get_paragraph( $html, $tag, $length ) {
        $b = array(0,0);
        $i = 0;
        while( ($offset = strpos($html,'<'.$tag,$i)) !== FALSE ) {
            $start = strpos($html,'>',$offset);
            $end = strpos($html,'</'.$tag.'>',$start);
            if( ($end-$start-1) >= $length )
                return substr($html,$start+1,$end-$start-1);
            if( ($end-$start-1) > $b[0] - $b[1] )
                $b = array($end,$start+1);
            $i = $end;
        }
        return substr($html,$b[1],$b[0]-$b[1]);
    }

    ?>

Psycho, posted June 1, 2011:

xyph wrote: "You want something like this - no RegEx required. Won't work with nested tags."

Seriously? You want to loop through the entire string instead of running a simple regex? The function I provided will return the correct result after only one line of code if there are any matches over the minimum length. The remaining lines are only there in case it needs to check for the longest match below the minimum. And the majority of that code is comments.

xyph, posted June 2, 2011:

What do you think your RegEx is doing? Looping through the string.

Mine is simply an alternate way. I think benchmarks would show mine to be slightly more efficient as well, because there is no backtracking required.

Variety is the spice of life.