rupam_jaiswal Posted June 12, 2009

Hi,

My HTML looks like this:

    <meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
    <!-- message -->
    <div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
    <br />
    <br />
    <img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
    <br />
    <br />
    info!<br />
    <br />
    <div style="margin:20px; margin-top:5px">
    <div class="smallfont" style="margin-bottom:2px">Code:</div>
    <pre class="alt2" dir="ltr" style="
    margin: 0px; padding: 6px; border: 1px inset; width: 470px; height: 34px; text-align: left; overflow: auto">http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html</pre>
    </div><br />
    <div class="smallfont" style="margin-bottom:2px">Code:</div>
    <pre class="alt2" dir="ltr" style="
    margin: 0px; padding: 6px; border: 1px inset; width: 470px; height: 1490px; text-align: left; overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
    http://www.sample1.com/part1/sample_code.part01.rar</pre>
    </div></div>

I want all the values that come after Code:</div> and sit between the pre tags, e.g.

    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html

and

    http://www.sample1.com/part1/sample_code.part01.rar
    http://www.sample1.com/part1/sample_code.part01.rar

Please note that the meta tag at the start also contains the string Code:, and I don't want the value from it.

Thanks in advance
Regards

Link to comment https://forums.phpfreaks.com/topic/161908-extract-data-from-web-page/
nadeemshafi9 Posted June 12, 2009

http://www.regular-expressions.info/posix.html
http://perldoc.perl.org/perlre.html

PREG = Perl Compatible Regular Expressions (PCRE)
EREG = Regular Expression (POSIX Extended)

http://uk3.php.net/manual/en/ref.regex.php
http://uk3.php.net/manual/en/ref.pcre.php

    // values for pre
    $matches = array();

    // . means any character, * means 0 or more times, + means 1 or more times, ? means 0 or 1 time
    // the i modifier goes after the closing delimiter (no backslash) and makes the match
    // case-insensitive; s lets . match newlines; .*? stops at the first </pre>
    // preg_match_all() collects every match, preg_match() stops after the first one
    preg_match_all("/<pre>.*?<\/pre>/is", $html, $matches); // the backslash may be an issue

    // now look into the array
    echo "*****************************************<br><br>";
    echo "*******************BEFORE****************<br>";
    echo "*****************************************<br><br>";
    print_r($matches); // print_r($matches[0]);

    /*
    $matches[0] will contain an array with the text that matched the full pattern:
    $matches[0][0]; $matches[0][1]; $matches[0][2]; etc.
    $matches[1] will have an array with the text that matched the first captured
    parenthesized subpattern (if the pattern has one), and so on:
    $matches[1][0]; $matches[1][1]; $matches[1][2]; etc.
    */

    // you can cycle through and alter them
    foreach ($matches as $_matches_k => $matches_v) {
        foreach ($matches_v as $matches_v_k => $matches_v_v) {
            // strip both tags in one call; ereg_replace() is deprecated, and calling it
            // twice on $matches_v_v would throw away the first replacement
            $matches[$_matches_k][$matches_v_k] = str_ireplace(array("<pre>", "</pre>"), "", $matches_v_v);
        }
    }

    echo "*****************************************<br><br>";
    echo "*******************AFTER*****************<br>";
    echo "*****************************************<br><br>";
    print_r($matches); // print_r($matches[0]);

rupam_jaiswal Posted June 12, 2009

Thanks for your help. But I posted only a part of my HTML page. The page has several pre tags, and my concern is to:

1) get the values inside pre tags only if they come after the string Code:
2) handle pre tags that have attributes (<pre class="alt2" dir="ltr" style=" ...), so I can't use a bare <pre>.

If I use <pre.*<\/pre> or <pre(.*)<\/pre>, it still returns an empty array.

Regards

nadeemshafi9 Posted June 12, 2009

    $matches = array();
    preg_match("/Code:\i.*<pre.*>.*<\/pre>/", $html, $matches); // the backslash may be an issue
    foreach ($matches as $_matches_k => $matches_v) {
        foreach ($matches_v as $matches_v_k => $matches_v_v) {
            $attributes = array();
            // get all attributes
            preg_match('#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#', $matches_v_v, $attributes);
            print_r($attributes); // print_r($attributes[0]); print_r($attributes[1]);
            echo "********************************************";
        }
    }

nadeemshafi9 Posted June 12, 2009

    // s = lets "." match any character whatsoever, including newlines; used together, as /ms,
    // "^" and "$" are still allowed to match, respectively, just after and just before newlines
    // i means lower or upper case; modifiers go after the closing delimiter, without a backslash, e.g.
    preg_match("/<pre>.*<\/pre>/i", $html, $matches);

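A small sketch of what the s and i modifiers change in practice may help here. The sample string and the variable names are made up for illustration; the only thing taken from the thread is the <pre>...</pre> pattern.

```php
<?php
// Hypothetical sample: the tag is upper-case and its content spans two lines.
$html = "<PRE>line one\nline two</PRE>";

// No modifiers: the match is case-sensitive and "." does not match "\n".
$plain = preg_match('#<pre>(.*)</pre>#', $html, $m1);

// i only: the tag names now match, but "." still stops at the newline.
$caseOnly = preg_match('#<pre>(.*)</pre>#i', $html, $m2);

// s and i together: "." also matches newlines, so the whole block is captured.
$both = preg_match('#<pre>(.*)</pre>#si', $html, $m3);

var_dump($plain, $caseOnly, $both); // int(0), int(0), int(1)
echo $m3[1], "\n";                  // both lines of the block
```

This is likely why patterns such as <pre.*<\/pre> come back empty on HTML where the tag or its content spans several lines: without s, "." never crosses a line break.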
nadeemshafi9 Posted June 12, 2009

Basically you need to test your backslashes and forward slashes. As you can see, I'm confused. I'm not testing any of this, and I'm just about to go to bed, it's 7 in the morning lol

nadeemshafi9 Posted June 12, 2009

    preg_match("#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#", $matches_v_v, $attributes);

^ WRONG (the unescaped double quotes inside the double-quoted string break it). Use single quotes around the pattern:

    preg_match('#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#', $matches_v_v, $attributes);

nadeemshafi9 Posted June 12, 2009

Start from the beginning with a basic regex: use < and print_r the result, then go back and change your pattern to <pre, then <pre.*>. Remember to enclose it in delimiters: /<pre.*>/

nadeemshafi9 Posted June 12, 2009

    '#<'.$element_name.'(?:\s+[^>]+)?>(.*?)'

rupam_jaiswal Posted June 12, 2009

Hey, thanks for your help. I'm sorry, but it still doesn't solve my problem. I am getting an empty $matches from the very first regex:

    preg_match("/Code:\i.*<pre.*>.*<\/pre>/", $html, $matches);

nadeemshafi9 Posted June 12, 2009

I was making a proxy for the Visa payer authentication and Barclays password, because the firewall on our clients prohibits internet usage without paying.

nadeemshafi9 Posted June 12, 2009

This one will sort it out:

    '#<pre(?:\s+[^>]+)?>(.*?)'

nadeemshafi9 Posted June 12, 2009

RTFM http://perldoc.perl.org/perlre.html

nadeemshafi9 Posted June 12, 2009

.*+ matches 0 or more times and gives nothing back

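For context, .*+ is the possessive form of .*: it takes as much as it can and never backtracks. A minimal sketch on a made-up one-line tag, comparing it with the greedy and lazy forms (the sample string and variable names are assumptions, not from the thread):

```php
<?php
// Hypothetical one-line sample.
$html = '<pre class="alt2">text</pre>';

// Greedy: ".*" grabs everything, then backtracks to the LAST ">".
preg_match('#<pre.*>#', $html, $greedy);

// Lazy: ".*?" expands one character at a time and stops at the FIRST ">".
preg_match('#<pre.*?>#', $html, $lazy);

// Possessive: ".*+" grabs everything and refuses to give any of it back,
// so the ">" that follows it can never match.
$possessive = preg_match('#<pre.*+>#', $html, $none);

echo $greedy[0], "\n"; // <pre class="alt2">text</pre>
echo $lazy[0], "\n";   // <pre class="alt2">
var_dump($possessive); // int(0)
```

So a possessive .*+ followed by more pattern tends to match nothing at all, which makes it a poor fit for picking out a tag.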
rupam_jaiswal Posted June 12, 2009

Quote: ".*+ match 0 or more times and give nothing back"

I am not getting anything. What is the use of # here? Can you write the full regex?

thebadbad Posted June 12, 2009

My God, that was confusing. Are you trying to beat a record with all those posts, nadeemshafi9? Seriously.

@OP Please post any code within [code] or [php] tags.

This should grab what you're looking for:

    preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);
    echo '<pre>', print_r($matches[1], true), '</pre>';

Where $data is the HTML source code.

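For anyone who wants to run the snippet above as-is, here is a self-contained sketch. The HTML is a trimmed, hypothetical version of the page from the first post, and the extra step that splits each block into individual URLs is an assumption about what "all the values" means, not something the original snippet does.

```php
<?php
// Trimmed, hypothetical version of the HTML from the first post.
$data = <<<HTML
<meta name="description" content="New info! Code: http://www.example/index.html" />
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px; padding: 6px">http://www.sample1.com/part1.html
http://www.sample1.com/part2.html</pre>
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr">http://www.sample1.com/part1/sample_code.part01.rar</pre>
HTML;

// Same pattern as in the post: "Code:</div>", optional whitespace,
// the opening <pre ...> tag, then everything up to the next "<".
preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);

// Split each captured block on whitespace to get the individual URLs.
$urls = array();
foreach ($matches[1] as $block) {
    foreach (preg_split('/\s+/', trim($block)) as $url) {
        $urls[] = $url;
    }
}

print_r($urls);
```

The "Code:" inside the meta tag is skipped because it is not followed by </div>, and [^>]* is a negated character class, so it crosses the line break inside the style attribute without needing the s modifier.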
nrg_alpha Posted June 12, 2009

One possible solution. Example:

    $html = <<<HTML
    <meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
    <!-- message -->
    <div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
    <br />
    <br />
    <img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
    <br />
    <br />
    info!<br />
    <br />
    <div style="margin:20px; margin-top:5px">
    <div class="smallfont" style="margin-bottom:2px">Code:</div>
    <pre class="alt2" dir="ltr" style="
    margin: 0px; padding: 6px; border: 1px inset; width: 470px; height: 34px; text-align: left; overflow: auto">http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html</pre>
    </div><br />
    <div class="smallfont" style="margin-bottom:2px">Code:</div>
    <pre class="alt2" dir="ltr" style="
    margin: 0px; padding: 6px; border: 1px inset; width: 470px; height: 1490px; text-align: left; overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
    http://www.sample1.com/part1/sample_code.part01.rar</pre>
    </div></div>
    HTML;

    preg_match_all('#</div>\s*<pre[^\n]*\n(.+?)</pre>#si', $html, $matches);
    $count = count($matches[1]);
    for ($a = 0 ; $a < $count ; $a++) {
        echo $matches[1][$a] . "<br />\n";
    }

Output (via view source):

    margin: 0px; padding: 6px; border: 1px inset; width: 470px; height: 34px; text-align: left; overflow: auto">http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html<br />
    margin: 0px; padding: 6px; border: 1px inset; width: 470px; height: 1490px; text-align: left; overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
    http://www.sample1.com/part1/sample_code.part01.rar

thebadbad Posted June 12, 2009

I'm pretty sure he only wanted to grab the contents of the specified pre elements.

nrg_alpha Posted June 12, 2009

Quote: "I'm pretty sure he only wanted to grab the contents of the specified pre elements."

My bad (not sure what I was thinking there...)

    preg_match_all('#</div>\s*<pre[^>]*>(.+?)</pre>#si', $html, $matches);

Output:

    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html<br />
    http://www.sample1.com/part1/sample_code.part01.rar
    http://www.sample1.com/part1/sample_code.part01.rar

thebadbad Posted June 12, 2009

No offense, but what's the point of your post then? When I tested my snippet it worked fine.

nrg_alpha Posted June 12, 2009

Quote: "No offense, but what's the point of your post then? When I tested my snippet it worked fine."

If you are referring to:

    preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);

Yeah, it would work... but I think I would opt for .+? instead of [^<]* in case there are any tags (for whatever reason) within the <pre> (which I admit is currently not the case). To me, it's akin to trying to match everything within, say, a <b> tag: if there are any additional tags nested within the <b>, like <i>...</i>, the [^<]* stops at the first of them and the match gets botched, whereas .+? only stops matching once the closing </b> tag is found.

But yes, in this case, your solution does work. Many ways to skin a cat... this could all be done in DOM / XPath as well.

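The difference between the two capture styles can be seen on a small made-up example. The sample string, with a <b> nested inside the <pre>, is an assumption for illustration; the two patterns are the ones discussed in the thread, with the second one closed by </pre>.

```php
<?php
// Hypothetical <pre> block that contains a nested <b> tag.
$html = '<div>Code:</div> <pre class="alt2">http://a.example/1 <b>new</b> http://a.example/2</pre>';

// [^<]* stops at the first "<", which here is the nested <b>.
preg_match('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $html, $short);

// .+? with the s modifier keeps going until the closing </pre>.
preg_match('~Code:</div>\s*<pre[^>]*>(.+?)</pre>~si', $html, $full);

var_dump($short[1]); // only the text before <b>
var_dump($full[1]);  // the whole content, nested tag included
```

With no nested tags both patterns capture the same text, which matches what was observed on the original page.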
thebadbad Posted June 12, 2009

Oh yeah, you're right. I didn't think of that at all. Won't complain anymore then.

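Since DOM / XPath came up as an alternative, here is a rough sketch of that route. It assumes PHP's DOM extension is available; the HTML is a trimmed, hypothetical sample shaped like the page from the first post, and the XPath expression is one possible way to express "the pre right after a Code: label", not the only one.

```php
<?php
// Trimmed, hypothetical sample in the shape of the HTML from the first post.
$html = <<<HTML
<html><head>
<meta name="description" content="Code: http://www.example/index.html" />
</head><body>
<div><div class="smallfont">Code:</div>
<pre class="alt2">http://www.sample1.com/part1.html
http://www.sample1.com/part2.html</pre></div>
<div class="smallfont">Code:</div>
<pre class="alt2">http://www.sample1.com/part1/sample_code.part01.rar</pre>
<pre class="alt2">not preceded by a Code: label</pre>
</body></html>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Every <div> whose text is exactly "Code:", then the element right after it,
// kept only if that element is a <pre>.
$nodes = $xpath->query('//div[normalize-space(.)="Code:"]/following-sibling::*[1][self::pre]');

$blocks = array();
foreach ($nodes as $pre) {
    $blocks[] = trim($pre->nodeValue);
}

print_r($blocks);
```

The meta tag never enters the picture because only div elements are tested, and nested tags inside a pre are not a problem because nodeValue returns the text content of the whole element.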
nadeemshafi9 Posted June 12, 2009

Quote: ".*+ match 0 or more times and give nothing back"

Quote: "I am not getting anything..what the use of # here..can you write the full regex..."

Basically, I'm not sure what the difference between / and # is, but you need to start and end with one or the other:

    /<pre.*>/

I am a beginner with regex. I've been on about it for ages but only recently implemented it.

thebadbad Posted June 12, 2009

Quote: "basicaly im not sure what teh diferrence between / and #, but you need to start and end with one or the other."

They are called pattern delimiters, and can be any non-alphanumeric character. It doesn't make a difference which one you choose, but to make it easy for yourself, choose a character you won't use within your pattern (so you don't have to escape it).

nrg_alpha Posted June 12, 2009

Quote: "They are called pattern delimiters, and can be any non-alphanumeric character. And it doesn't make a difference which you choose, but to make it easy for yourself, choose a char you won't use within your pattern (so you don't have to escape it)."

To be pedantic, delimiters can be any non-whitespace, non-alphanumeric ASCII character (except a backslash).

@nadeemshafi9, you can read about this stuff here and in the PCRE section of the manual. As thebadbad mentioned, delimiter characters that appear within the pattern need to be escaped (for the most part this is true; there are oddball exceptions, but I digress). So I tend to use #....#. You'll probably see /...../ as the most common format, but I don't like using it, as the / character is used in file paths, for instance, so you would need to escape every / inside a pattern that is delimited by /..../. Other delimiters that reduce the need to escape are ~.....~ or !......!, for instance. It all boils down to a matter of personal preference (so long as the delimiters are legal, of course).

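A short sketch of the delimiter point, using a made-up path as the subject (the string and variable names are assumptions for illustration):

```php
<?php
// Hypothetical subject containing a path with forward slashes.
$subject = 'see /var/www/html/index.php for details';

// With / as the delimiter, every / inside the pattern has to be escaped.
preg_match('/\/var\/www\/(\w+)/', $subject, $a);

// With # or ~ as the delimiter, the same pattern needs no escaping.
preg_match('#/var/www/(\w+)#', $subject, $b);
preg_match('~/var/www/(\w+)~', $subject, $c);

// preg_quote() escapes a literal string, including the delimiter given as 2nd argument.
$literal = preg_quote('/var/www/', '/'); // \/var\/www\/
preg_match('/' . $literal . '(\w+)/', $subject, $d);

var_dump($a[1] === $b[1] && $b[1] === $c[1] && $c[1] === $d[1]); // bool(true)
echo $a[1], "\n"; // html
```

All four calls capture the same text; the choice of delimiter only changes how much escaping the pattern needs.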