extract data from web page


rupam_jaiswal

Hi,

My html looks like this

 

<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
<div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />

<div style="margin:20px; margin-top:5px">
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
</div><br />

<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>

</div></div>

I want all the values that come after Code:</div> and sit between the pre tags, e.g.

    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html
    http://www.sample1.com/part1.html

and

    http://www.sample1.com/part1/sample_code.part01.rar
    http://www.sample1.com/part1/sample_code.part01.rar

Please note that the meta tag at the start also contains the string Code:, and I don't want the values from it.

Thanks in advance

Regards

 


http://www.regular-expressions.info/posix.html

http://perldoc.perl.org/perlre.html

 

PREG = Perl Compatible Regular Expressions (PCRE):

EREG = Regular Expression (POSIX Extended)

 

http://uk3.php.net/manual/en/ref.regex.php

http://uk3.php.net/manual/en/ref.pcre.php

 

// values for pre
$matches = array();
// .  means any character (newlines only with the s modifier)
// *  means 0 or more times, + means 1 or more times, ? means 0 or 1 time
// *? is the non-greedy form of *, so each match stops at the nearest </pre>
// modifiers go after the closing delimiter: i = case-insensitive, s = dot also matches newlines
// preg_* = Perl Compatible Regular Expressions (PCRE)
// ereg_* = POSIX Extended Regular Expressions (deprecated in PHP 5.3, removed in PHP 7)
// preg_match() stops at the first match; preg_match_all() collects every match
preg_match_all("/<pre>.*?<\/pre>/is", $html, $matches); // \/ escapes the delimiter inside the pattern

// now look into the array
echo "*****************************************<br><br>";
echo "*******************BEFORE****************<br>";
echo "*****************************************<br><br>";
print_r($matches); // print_r($matches[0]);  print_r($matches[1]);

/*
With preg_match_all():
$matches[0] is an array of the strings that matched the full pattern:
$matches[0][0]; $matches[0][1]; $matches[0][2]; etc.
$matches[1] is an array of the strings that matched the first parenthesized subpattern (if the pattern has one), and so on:
$matches[1][0]; $matches[1][1]; $matches[1][2]; etc.
*/

// you can cycle through and alter them

foreach ($matches as $matches_k => $matches_v) {
    foreach ($matches_v as $matches_v_k => $matches_v_v) {
        // plain string replacement is enough to strip the tags; no regex needed
        $matches[$matches_k][$matches_v_k] = str_replace(array("<pre>", "</pre>"), "", $matches_v_v);
    }
}
echo "*****************************************<br><br>";
echo "*******************AFTER*****************<br>";
echo "*****************************************<br><br>";
print_r($matches); // print_r($matches[0]);  print_r($matches[1]);
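One thing to watch with a pattern like <pre>.*</pre> on a page that has several pre blocks: a greedy .* runs to the last </pre> it can find. The sketch below is only an illustration; the input string is made up, and it assumes the s modifier is set so the dot can cross newlines.

```php
<?php
// Hypothetical input with two separate <pre> blocks.
$html = "<pre>first</pre>\n<p>other markup</p>\n<pre>second</pre>";

// Greedy: .* runs to the last </pre>, so both blocks end up in one match.
$greedy = preg_match_all('#<pre>(.*)</pre>#s', $html, $g);

// Lazy: .*? stops at the nearest </pre>, giving one match per block.
$lazy = preg_match_all('#<pre>(.*?)</pre>#s', $html, $l);

echo "greedy matches: $greedy\n";
echo "lazy matches: $lazy\n";
print_r($l[1]);
```

With several pre blocks on the page, the lazy form is usually what you want.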

 

 


Thanks for your help.

But I have posted only part of my HTML page. The page has several pre tags, and I need to:

1) get the values inside a pre tag only if it comes after the string Code:

2) handle attributes: my pre tags have attributes (<pre class="alt2" dir="ltr" style=" ...), so I can't match a bare <pre>.

If I use <pre.*<\/pre> or <pre(.*)<\/pre>, it still returns an empty array.

Regards
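A likely reason for the empty array: by default the dot does not match a newline, and both the pre tag and its contents span several lines. Here is a minimal sketch of that behaviour; the two-line input (including the part2.html URL) is made up for illustration.

```php
<?php
// Hypothetical two-line sample; the real page has many more lines.
$html = "<pre class=\"alt2\">http://www.sample1.com/part1.html\nhttp://www.sample1.com/part2.html</pre>";

// Without the s modifier, . stops at the newline, so nothing matches.
$withoutS = preg_match_all('#<pre(.*)</pre>#', $html, $m1);

// With the s modifier, . also matches newlines.
$withS = preg_match_all('#<pre[^>]*>(.*?)</pre>#s', $html, $m2);

echo "without s: $withoutS match(es)\n";
echo "with s: $withS match(es)\n";
print_r($m2[1]);
```

The [^>]* part skips over whatever attributes the tag carries, so the capture group holds only the text between the tags.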



$matches = array();
// require </div> right after Code: so the Code: text in the meta tag is skipped;
// the s modifier lets . match newlines inside the multi-line <pre> tag and its contents
preg_match_all("/Code:<\/div>\s*<pre.*?>.*?<\/pre>/is", $html, $matches);
foreach ($matches as $matches_k => $matches_v) {
    foreach ($matches_v as $matches_v_k => $matches_v_v) {
        $attributes = array();
        // get all attributes
        preg_match_all('#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#', $matches_v_v, $attributes);
        print_r($attributes); // $attributes[1] = names, $attributes[2] = quoted values
        echo "********************************************";
    }
}


My God, that was confusing. Are you trying to beat a record with all those posts, nadeemshafi9? Seriously.

 

@OP

Please post any code within [code] or [php] tags. This should grab what you're looking for:

preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);
echo '<pre>', print_r($matches[1], true), '</pre>';

Where $data is the HTML source code.
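Each entry in $matches[1] is the whole text of one pre block, so several URLs come back as one newline-separated string. If you need them one by one, something like this sketch should do; the $data here is a shortened stand-in for the real page, and $urls is just a name picked for the example.

```php
<?php
// Shortened stand-in for the page source from the original post.
$data = <<<HTML
<meta name="description" content="New info! Code: http://www.example/index.html" />
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="margin: 0px;">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
HTML;

preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);

$urls = array();
foreach ($matches[1] as $block) {
    // Split on any run of whitespace and drop empty pieces.
    $lines = preg_split('/\s+/', trim($block), -1, PREG_SPLIT_NO_EMPTY);
    $urls = array_merge($urls, $lines);
}
print_r($urls);
```

The Code: inside the meta tag is ignored because the pattern insists on </div> right after it.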


One possible solution.

 

Example:

$html = <<<HTML
<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
<div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />

<div style="margin:20px; margin-top:5px">
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
</div><br />

<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>

</div></div>
HTML;

preg_match_all('#</div>\s*<pre[^\n]*\n(.+?)</pre>#si', $html, $matches);
$count = count($matches[1]);
for ($a = 0 ; $a < $count ; $a++) {
    echo $matches[1][$a] . "<br />\n";
}

 

Output (via view source):

margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html<br />
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar


I'm pretty sure he only wanted to grab the contents of the specified pre elements.

 

My bad (not sure what I was thinking there...)

 

preg_match_all('#</div>\s*<pre[^>]*>(.+?)</pre>#si', $html, $matches);

 

Output:

http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html<br /> 
http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar


:) No offense, but what's the point of your post then? When I tested my snippet it worked fine.

 

If you are referring to:

preg_match_all('~Code:</div>\s*<pre[^>]*>([^<]*)<~i', $data, $matches);

 

Yeah, it would work... but I think I would opt for .+? instead of [^<]* in case there are any tags (for whatever reason) within the <pre> (which I admit is currently not the case). To me, it's akin to trying to match everything within, say, a <b> tag: if there are additional tags nested within the <b>, like <i>...</i>, the [^<]* stops at the first nested tag and the result gets botched, whereas .+? keeps matching until the closing </b> tag is found. But yes, in this case, your solution does work. Many ways to skin a cat... this could all be done in DOM / XPath as well.
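To make the difference concrete, here is a small sketch with a made-up <b> element that has an <i> nested inside it (not taken from the page in question):

```php
<?php
// Hypothetical input: a <b> element with an <i> element nested inside it.
$html = '<b>bold and <i>italic</i> text</b>';

// [^<]* cannot cross the nested <i>, so the capture ends early.
preg_match('~<b>([^<]*)<~', $html, $short);

// .+? is lazy but keeps going until the literal </b> is reached.
preg_match('~<b>(.+?)</b>~s', $html, $full);

echo $short[1], "\n"; // bold and
echo $full[1], "\n";  // bold and <i>italic</i> text
```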


.*+ match 0 or more times and give nothing back

I am not getting anything. What's the use of # here? Can you write the full regex?

 

 

Basically I'm not sure what the difference between / and # is, but you need to start and end with one or the other.

 

/<pre.*>/

 

I am a beginner with regex. I've been on about it for ages but only recently implemented it.


Basically I'm not sure what the difference between / and # is, but you need to start and end with one or the other.

 

They are called pattern delimiters, and can be any non-alphanumeric character. And it doesn't make a difference which you choose, but to make it easy for yourself, choose a char you won't use within your pattern (so you don't have to escape it).


They are called pattern delimiters, and can be any non-alphanumeric character. And it doesn't make a difference which you choose, but to make it easy for yourself, choose a char you won't use within your pattern (so you don't have to escape it).

 

To be pedantic, delimiters can be any non-white space, non-alphanumeric ASCII character (except a backslash).

 

@nadeemshafi9, you can read about this stuff here and the pcre aspect of the manual.

 

As thebadbad mentioned, a delimiter character that appears within the pattern needs to be escaped (for the most part this is true... there are oddball exceptions, but I digress). So I tend to use #....#. You'll probably see /...../ as the most common format, but I don't like using it because the / character shows up in file paths and URLs, for instance, so you would need to escape every / inside a pattern delimited by /..../. Other delimiters that reduce the need for escaping are ~.....~ or !......!, for instance. It all boils down to personal preference (so long as the delimiters are legal, of course).
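As a quick illustration, here is the same pattern written with both delimiters. The pattern itself is just something thrown together for the example; the URL is one of the sample links from the first post.

```php
<?php
$subject = 'http://www.sample1.com/part1/sample_code.part01.rar';

// With / as the delimiter, every / in the pattern needs a backslash.
$slash = preg_match('/^http:\/\/([^\/]+)\/(.+)$/', $subject, $a);

// With # as the delimiter, the same pattern needs no escaping.
$hash = preg_match('#^http://([^/]+)/(.+)$#', $subject, $b);

var_dump($a === $b); // both forms capture the same groups
echo $b[1], "\n";    // host part
echo $b[2], "\n";    // path part
```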

 
