extract data from web page

//values for pre

$matches = array();
// . means any character except a newline, * means 0 or more times, + means 1 or more times, ? means 0 or 1 time
// the i modifier goes after the closing delimiter, with no backslash, and means lower or uppercase, e.g. preg_match("/<pre>.*<\/pre>/i",  $html, $matches);
// the s modifier lets . match newlines too, which matters when the pre content spans several lines
// PREG = Perl Compatible Regular Expressions (PCRE)
// EREG = Regular Expression (POSIX Extended)
preg_match_all("/<pre>.*?<\/pre>/is",  $html, $matches);  // \/ escapes the delimiter inside the pattern

//now look into the array
echo "*****************************************<br><br>";
echo "*******************BEFORE****************<br>";
echo "*****************************************<br><br>";
print_r($matches); // print_r($matches[0]);  print_r($matches[1]); 

/*
With preg_match_all():
$matches[0] will contain an array with the text that matched the full pattern:
$matches[0][0]; $matches[0][1]; $matches[0][2]; etc.
$matches[1] will have an array with the text that matched the first captured parenthesized subpattern, and so on:
$matches[1][0]; $matches[1][1]; $matches[1][2]; etc.
*/

// you can cycle through and alter them

foreach($matches as $matches_k => $matches_v){
	foreach($matches_v as $matches_v_k => $matches_v_v){
		// ereg_replace() is deprecated as of PHP 5.3 and removed in PHP 7;
		// str_replace() is enough for fixed strings and strips both tags in one call
		// (the original second call overwrote the result of the first)
		$matches[$matches_k][$matches_v_k] = str_replace(array("<pre>", "</pre>"), "", $matches_v_v);
	}
}
echo "*****************************************<br><br>";
echo "*******************AFTER*****************<br>";
echo "*****************************************<br><br>";
print_r($matches); // print_r($matches[0]);  print_r($matches[1]);
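
A minimal sketch of how the result array differs between preg_match() and preg_match_all(), since the nested loop above only makes sense for the nested arrays that preg_match_all() returns. The sample $html string and the names $first and $all are made up for illustration; they are not from the original page.

```php
<?php
// Hypothetical sample input, not taken from the original page.
$html = "<pre>one</pre><p>skip</p><pre>two</pre>";

// preg_match() stops at the first match: $first[0] is a string (the full
// match) and $first[1] is a string (the first capture group).
preg_match('/<pre>(.*?)<\/pre>/is', $html, $first);

// preg_match_all() finds every match: $all[0] is an array of full matches
// and $all[1] is an array of first-group captures.
preg_match_all('/<pre>(.*?)<\/pre>/is', $html, $all);

print_r($first);
print_r($all);
```

With preg_match() a nested foreach would be iterating over strings, so the two-level loop is only appropriate for the preg_match_all() result.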

rupam_jaiswal · June 12, 2009

Thanks for your help.

But I have posted only a part of my HTML page. This page has several pre tags, and my concern is to 1) get values within pre tags only if they come after the string Code:

2) My pre tag has certain attributes (<pre class="alt2" dir="ltr" style=" ...) so I can't use <pre>.

If I use <pre.*<\/pre> or <pre(.*)<\/pre>, it still returns an empty array.

Regards

nadeemshafi9 · June 12, 2009


$matches = array();
// modifiers go after the closing delimiter: i = lower or uppercase, s = let . match newlines too
preg_match_all("/Code:.*?<pre[^>]*>.*?<\/pre>/is",  $html, $matches);
foreach($matches as $matches_k => $matches_v){
	foreach($matches_v as $matches_v_k => $matches_v_v){
		$attributes = array();
		//get all attributes
		preg_match_all('#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#', $matches_v_v, $attributes);
		print_r($attributes); // print_r($attributes[1]); print_r($attributes[2]);
		echo "********************************************";
	}
}

nadeemshafi9 · June 12, 2009

// s and m are modifiers, not escapes: used together, as /ms, they let the "." match any character whatsoever (newlines included), while still allowing "^" and "$" to match, respectively, just after and just before newlines

// i means lower or uppercase; it goes after the closing delimiter with no backslash, e.g. preg_match("/<pre>.*<\/pre>/i", $html, $matches);
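
A small sketch of what the i, s and m modifiers change, assuming a made-up two-line input. The variable names are illustrative only.

```php
<?php
// Hypothetical input: an uppercase tag whose content spans two lines.
$html = "<PRE>line one\nline two</PRE>";

// No modifiers: the tag case differs and . stops at the newline, so no match.
$plain = preg_match('/<pre>.*<\/pre>/', $html);

// i only: the tags now match, but . still cannot cross the newline.
$caseOnly = preg_match('/<pre>.*<\/pre>/i', $html);

// i and s together: case-insensitive, and . matches newlines as well.
$both = preg_match('/<pre>(.*)<\/pre>/is', $html, $m);

// m: ^ and $ match at the start and end of every line, not just the string.
$lines = preg_match_all('/^line \w+$/m', "line one\nline two", $lineMatches);

var_dump($plain, $caseOnly, $both, $lines);
```

This is the likely reason a pattern such as <pre.*<\/pre> finds nothing on a page where the pre content runs over several lines.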

nadeemshafi9 · June 12, 2009

basically you need to test your backslashes and forward slashes. As you can see I'm confused, I'm not testing, I'm just about to go to bed, it's 7 in the morning lol

nadeemshafi9 · June 12, 2009

preg_match("#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#", $matches_v_v, $attributes);

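^

WRONG: the unescaped double quotes inside the pattern end the double-quoted PHP string early. Wrap the pattern in single quotes instead: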

preg_match('#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#', $matches_v_v, $attributes);

nadeemshafi9 · June 12, 2009

start from the beginning with a basic regex

use <

then print_r the result

then go back and change your pattern to <pre

then <pre.*>

remember to enclose it in delimiters: /<pre.*>/
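
The step-by-step approach could be sketched like this, assuming a one-line sample string (the real page is not reproduced here). Note that the sketch swaps <pre.*> for <pre[^>]*>, because a greedy .* would run on to the last > on the line rather than stop at the end of the opening tag.

```php
<?php
// Hypothetical one-line sample; not the original page.
$html = 'Code:</div> <pre class="alt2" dir="ltr">http://www.example.com/a</pre>';

// Grow the pattern one piece at a time and check each step still matches.
$steps = array(
    '/</',                        // any opening angle bracket
    '/<pre/',                     // start of a pre tag
    '/<pre[^>]*>/',               // whole opening tag, attributes included
    '/<pre[^>]*>.*?<\/pre>/s',    // opening tag, content, closing tag
);

$results = array();
foreach ($steps as $pattern) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$pattern] = preg_match($pattern, $html, $m);
    echo $pattern, ' => ', $results[$pattern], ' : ', htmlspecialchars($m[0] ?? ''), "\n";
}
```

If a step suddenly reports 0, the piece added in that step is the one to look at.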

nadeemshafi9 · June 12, 2009

'#<'.$element_name.'(?:\s+[^>]+)?>(.*?)'

rupam_jaiswal · June 12, 2009

Hey ..

Thanks for your help. I'm sorry, but it still could not solve my problem.

I am getting empty $matches from the very first regex

preg_match("/Code:\i.*<pre.*>.*<\/pre>/", $html, $matches);

nadeemshafi9 · June 12, 2009

I was making a proxy for the Visa payer authentication and Barclays password, because the firewall on our clients prohibits usage of the internet without paying.

nadeemshafi9 · June 12, 2009

this one will sort it out

'#<pre(?:\s+[^>]+)?>(.*?)'

nadeemshafi9 · June 12, 2009

RTFM http://perldoc.perl.org/perlre.html

nadeemshafi9 · June 12, 2009

.*+ match 0 or more times and give nothing back
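
A short sketch of what "give nothing back" means in practice, assuming a made-up input string. It compares the greedy, possessive and lazy forms of the same pattern.

```php
<?php
// Hypothetical sample input.
$html = '<pre class="alt2">text</pre>';

// Greedy .* grabs the rest of the string, then backtracks until > can match.
$greedy = preg_match('/<pre.*>/', $html, $g);

// Possessive .*+ grabs the rest of the string and never backtracks,
// so the following > has nothing left to match and the pattern fails.
$possessive = preg_match('/<pre.*+>/', $html, $p);

// Lazy .*? takes as little as possible, stopping at the first >.
$lazy = preg_match('/<pre.*?>/', $html, $l);

var_dump($greedy, $g, $possessive, $p, $lazy, $l);
```

So .*+ followed by anything else in the pattern will not help here; the lazy form or a negated class such as [^>]* is the usual choice for stopping at the end of a tag.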

rupam_jaiswal · June 12, 2009

.*+ match 0 or more times and give nothing back

I am not getting anything. What is the use of # here? Can you write the full regex?

thebadbad · June 12, 2009

My God, that was confusing. Are you trying to beat a record with all those posts, nadeemshafi9? Seriously.

@OP

Please post any code within [code] or [php] tags. This should grab what you're looking for:

preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);
echo '<pre>', print_r($matches[1], true), '</pre>';

Where $data is the HTML source code.

nrg_alpha · June 12, 2009

One possible solution.

Example:

$html = <<<HTML
<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
<div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />

<div style="margin:20px; margin-top:5px">
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
</div><br />

<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>

</div></div>
HTML;

preg_match_all('#</div>\s*<pre[^\n]*\n(.+?)</pre>#si', $html, $matches);
$count = count($matches[1]);
for ($a = 0 ; $a < $count ; $a++) {
echo $matches[1][$a] . "<br />\n";
}

Output (via view source):

margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html<br />
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar

thebadbad · June 12, 2009

I'm pretty sure he only wanted to grab the contents of the specified pre elements.

nrg_alpha · June 12, 2009

I'm pretty sure he only wanted to grab the contents of the specified pre elements.

My bad (not sure what I was thinking there...)

preg_match_all('#</div>\s*<pre[^>]*>(.+?)</pre>#si', $html, $matches);

Output:

http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html<br /> 
http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar

thebadbad · June 12, 2009

No offense, but what's the point of your post then? When I tested my snippet it worked fine.

nrg_alpha · June 12, 2009

No offense, but what's the point of your post then? When I tested my snippet it worked fine.

If you are referring to:

preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);

Yeah, it would work... but I think I would opt for .+? instead of [^<]* in case there are any tags (for whatever reason) within the <pre> (which I admit is currently not the case). To me, it's akin to trying to match everything within, say, a <b> tag: if there are any additional tags nested within <b>, like <i>...</i>, the [^<]* would stop at the first nested tag, whereas .+? (followed by the closing tag in the pattern) keeps matching until the closing </b> tag is found. But yes, in this case, your solution does work. Many ways to skin a cat... this could all be done in DOM / XPath as well.
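
To illustrate the difference, here is a minimal sketch that runs both patterns from this thread against a made-up pre element containing a nested <b> tag. The input string is an assumption; the original page has no nested tags.

```php
<?php
// Hypothetical input: a pre element containing a nested <b> tag.
$html = '<div>Code:</div> <pre class="alt2">see <b>this</b> link</pre>';

// [^<]* cannot cross the nested <b>, so the capture is cut short there.
preg_match('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $html, $short);

// .+? expands only until the first </pre>, nested tags included.
preg_match('~Code:</div>\s*<pre[^>]*>(.+?)</pre>~si', $html, $full);

echo htmlspecialchars($short[1]), "\n";
echo htmlspecialchars($full[1]), "\n";
```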

thebadbad · June 12, 2009

Oh yea, you're right. I didn't think of that at all.

Won't complain anymore then

nadeemshafi9 · June 12, 2009

.*+ match 0 or more times and give nothing back

I am not getting anything. What is the use of # here? Can you write the full regex?

basically I'm not sure what the difference is between / and #, but you need to start and end with one or the other.

/<pre.*>/

I am a beginner with regex. I've been on about it for ages but only recently implemented it.

thebadbad · June 12, 2009

basically I'm not sure what the difference is between / and #, but you need to start and end with one or the other.

They are called pattern delimiters, and can be any non-alphanumeric character. And it doesn't make a difference which you choose, but to make it easy for yourself, choose a char you won't use within your pattern (so you don't have to escape it).

nrg_alpha · June 12, 2009

They are called pattern delimiters, and can be any non-alphanumeric character. And it doesn't make a difference which you choose, but to make it easy for yourself, choose a char you won't use within your pattern (so you don't have to escape it).

To be pedantic, delimiters can be any non-white space, non-alphanumeric ASCII character (except a backslash).

@nadeemshafi9, you can read about this stuff here and the pcre aspect of the manual.

As thebadbad mentioned, delimiter characters that appear within the pattern need to be escaped (for the most part this is true.. there are oddball exceptions, but I digress). So I tend to use #....#. You'll probably see /...../ as the most common format, but I don't like using it because the / character is used in file paths and closing tags, for instance, so you would need to escape every / inside a pattern that is delimited by /..../. Other characters that reduce the need to escape are ~.....~ or !......!, for instance. It all boils down to a matter of personal preference (so long as the delimiters are legal, of course).
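
As a rough illustration of the escaping point, here is the same pattern written with three different delimiters. The input string and variable names are made up for the example.

```php
<?php
// Hypothetical input containing forward slashes.
$html = '<pre>http://www.example.com/part1.html</pre>';

// With / as the delimiter, every literal / in the pattern needs a backslash.
preg_match('/<pre>(http:\/\/[^<]+)<\/pre>/', $html, $slash);

// With # or ~ as the delimiter, the same pattern needs no escaping.
preg_match('#<pre>(http://[^<]+)</pre>#', $html, $hash);
preg_match('~<pre>(http://[^<]+)</pre>~', $html, $tilde);

var_dump($slash[1] === $hash[1], $hash[1] === $tilde[1]);
```

All three calls capture the same text; only the readability of the pattern changes.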
