[SOLVED] Using regular expressions to grab data out of a string


wintallo

Hello,

 

Right now, I'm trying to write a piece of PHP that grabs data out of a string, based on a regular expression. This is the regex:

(width)(=")[0-9]{3,4}

Say I want to get the "425" out of the following bit of HTML (stored as a string in PHP) and store it in another variable.

<object width="425" height="350">
<param name="movie" value="http://www.youtube.com/v/SRzm3wm1Qu0"></param><param name="wmode" value="transparent"></param>
<embed src="http://www.youtube.com/v/SRzm3wm1Qu0" type="application/x-shockwave-flash" wmode="transparent" width="425" height="350"></embed>
</object>

The regex matches the

width="425"

but I don't know how to use it with a PHP function to actually get the number out of the string.

 

I looked into ereg, which tests whether the pattern is in the string or not, and ereg_replace, which replaces it in the string. Neither of those functions does what I am looking for. I want to (in terms of the above example) get the "425" out of the block of HTML.

 

Thanks for the read! (and sorry if this shouldn't be in the regex forum  :))


$string ='<object width="425" height="350">
<param name="movie" value="http://www.youtube.com/v/SRzm3wm1Qu0"></param><param name="wmode" value="transparent"></param>
<embed src="http://www.youtube.com/v/SRzm3wm1Qu0" type="application/x-shockwave-flash" wmode="transparent" width="425" height="350"></embed>
</object>';
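// Capture whatever sits between the quotes after width= (subgroup 1).
preg_match_all('~width="(.*?)"~s', $string, $matches);
// $matches[1] lists subgroup 1 for every match: [1][0] comes from <object>, [1][1] from <embed>.
echo $matches[1][1];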

 

should give 425


If you want to match the substring, here's a faster alternative. It accepts only digits, so the regex engine does not have to grow a lazy .*? one character at a time:

~width="([0-9]+)"~

 

Tested:

http://nancywalshee03.freehostia.com/regextester/regex_tester.php?seeSaved=cocamcrd
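For anyone who wants to see that pattern in PHP, here is a minimal sketch. It assumes preg_match() is enough (it stops at the first match); the shortened $html string and the variable names are only for illustration.

```php
<?php
// Shortened version of the markup from the first post (illustrative only).
$html = '<object width="425" height="350"><embed width="425" height="350"></embed></object>';

// preg_match() stops at the first match and returns 1 on success, 0 on no match.
if (preg_match('~width="([0-9]+)"~', $html, $m)) {
    $fullMatch = $m[0]; // the whole match: width="425"
    $width     = $m[1]; // the first subgroup: 425
    echo $width, "\n";
}
```

$m[0] is always the full match, and the subgroups follow in order from $m[1].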

 

You have two options for matching your '425'. The first is demonstrated in the example above, where you match it in a subgroup (subgroups are the parts of the pattern in parentheses). The other option is to match only the 425 with lookaheads and lookbehinds, like so:

~(?<=width=")[0-9]+(?=")~

Tested:

http://nancywalshee03.freehostia.com/regextester/regex_tester.php?seeSaved=favns8nv
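A rough sketch of how the lookaround version might be used in PHP. The surrounding text is checked but not consumed, so the number ends up in the full match and there are no subgroups to index. The one-line $html sample is an assumption for brevity.

```php
<?php
$html = '<object width="425" height="350">';

// The lookbehind and lookahead only assert what is around the digits,
// so the full match ($m[0]) is already just the number.
$found = preg_match('~(?<=width=")[0-9]+(?=")~', $html, $m);
$width = $found ? $m[0] : null;

echo $width, "\n";
```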

 

 

The advantage of the 2nd regex is that preg_replace() can replace only the part you want and leave everything else unchanged. preg_replace() replaces the full pattern match, which in the 2nd option is only the number. With the first option, the rest of the match is not dynamic: it is the known text 'width="' and '"'. So you could still replace only the 425 and put the known parts back around it.
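To make the two replacement routes concrete, here is a small sketch. The new width of 640 and the variable names are made up for the example; both calls should produce the same string.

```php
<?php
$html     = '<object width="425" height="350">';
$newWidth = '640'; // hypothetical replacement value

// Option 2: lookarounds, so the full match (and the replacement) is only the digits.
$viaLookaround = preg_replace('~(?<=width=")[0-9]+(?=")~', $newWidth, $html);

// Option 1: capture the known parts and put them back around the new value.
// The braces in ${1} keep the reference from running into the digits that follow.
$viaGroups = preg_replace('~(width=")[0-9]+(")~', '${1}' . $newWidth . '${2}', $html);

echo $viaLookaround, "\n";
echo $viaGroups, "\n";
```

Writing '$1' . $newWidth without braces would be read as a reference to group 16 here, which is why the ${1} form is used.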

 

Of course...

There are cases where it is much harder to write a pattern that matches only the subgroups you want to replace, or where the circumstances are simply different. That is why I created a function that replaces only the subgroups within a haystack.

See it here:

http://tinyurl.com/yvkbak

 

 

Read about lookaheads, lookbehinds, and other regex methods:

http://www.regular-expressions.info/refadv.html

 

I had to figure out this knowledge the hard way :) I wish someone would have told me this.. like so.. :)

So continue spreading the knowledge!


Thanks a lot for your replies! I have a few questions though.

 

I gave the regex:

(width)(=")[0-9]{3,4}

You gave me a regex that looks a lot different:

~(?<=width=")[0-9]+(?=")~

I honestly have no idea how to read the latter regex. If I wanted to write a regex that works with preg_match_all (like the one you gave me) that matches both the example I gave above:

width="425"

and

width:425px;

how would I do that? In both cases I want the "$matches" array to contain 425.

 

 


I honestly have no idea how to read the latter regex.

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?<=                   look behind to see if there is:
    width="                'width="'
  )                      end of look-behind
  [0-9]+                 any character of: '0' to '9' (1 or more
                         times (matching the most amount possible))
  (?=                    look ahead to see if there is:
    "                      '"'
  )                      end of look-ahead
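The same breakdown can also live inside the pattern itself. This is only a sketch, assuming the PCRE x (extended) modifier: with it, whitespace in the pattern is ignored and # starts a comment that runs to the end of the line.

```php
<?php
// Same lookaround pattern, spread over several lines with the x modifier.
$pattern = '~
    (?<=width=")   # look behind: must be preceded by width="
    [0-9]+         # one or more digits, as many as possible
    (?=")          # look ahead: must be followed by a double quote
~x';

preg_match($pattern, '<object width="425" height="350">', $m);
echo $m[0], "\n";
```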

matches both...

width="425"

and

width:425px;

how would I do that?

 

~(?<=width[:=])\D?(\d+)~
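A quick sketch of how that pattern behaves on both forms. Note that \D? may swallow the opening quote, so the number should be read from subgroup 1 rather than from the full match. The two sample strings are trimmed-down assumptions based on the markup in this thread.

```php
<?php
$samples = [
    '<object width="425" height="350">',
    '<embed style="width:400px; height:326px;">',
];

$widths = [];
foreach ($samples as $sample) {
    // The full match $m[0] may include the quote; the digits are in $m[1].
    if (preg_match('~(?<=width[:=])\D?(\d+)~', $sample, $m)) {
        $widths[] = $m[1];
    }
}

print_r($widths);
```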

 


I did not see that you wanted to match 'width:425px;' in your earlier post

 

#matching just the 425

this one is very specific:

~((?<=width=")[0-9]+(?=")|(?<=width:)[0-9]+(?=px;))~

 

this one is more general:

~((?<=width=")|(?<=width:))[0-9]+((?=")|(?=px;))~
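One way to see the difference between the two is a mixed input. The string below is made up for the test: it pairs the CSS-style prefix with the attribute-style closing quote. As a sketch, the specific pattern should reject it while the general one accepts it, because the general pattern lets any prefix pair with any suffix.

```php
<?php
$specific = '~((?<=width=")[0-9]+(?=")|(?<=width:)[0-9]+(?=px;))~';
$general  = '~((?<=width=")|(?<=width:))[0-9]+((?=")|(?=px;))~';

// Made-up input: "width:" prefix followed by a closing double quote.
$mixed = 'width:425"';

$specificHit = preg_match($specific, $mixed); // prefix and suffix must belong together
$generalHit  = preg_match($general, $mixed);  // any prefix with any suffix

var_dump($specificHit, $generalHit);
```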

 

 

#matching all of it and 425 in the 2nd subgroup ($2):

~width(="|:)([0-9]+)("|px;)~
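A short sketch of reading the number from the 2nd subgroup with this pattern. The two sample strings and the $results array are assumptions for illustration.

```php
<?php
$pattern = '~width(="|:)([0-9]+)("|px;)~';

$results = [];
foreach (['<object width="425">', 'style="width:400px; height:326px;"'] as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        // $m[0] is the whole match, $m[1] the separator, $m[2] the digits, $m[3] the suffix.
        $results[] = ['full' => $m[0], 'width' => $m[2]];
    }
}

print_r($results);
```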

 

 

If you want to better understand these patterns, look up and study these symbols and what they mean:

(?<=) lookbehind

(?=) lookahead

| the OR pipe used in parentheses (matchthis|orthat)

[0-9] character classes

+ repetition symbol


Thank you so much for all your help, guys!

 

For future viewers: This is the code I used to do what I was looking for:

$movie_code = '<object width="425" height="350"><param name="movie" value="http://www.youtube.com/v/SRzm3wm1Qu0"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/SRzm3wm1Qu0" type="application/x-shockwave-flash" wmode="transparent" width="425" height="350"></embed></object>';
preg_match_all('~((?<=width=")[0-9]+(?=")|(?<=width:)[0-9]+(?=px;))~', $movie_code, $matches);
echo $matches[0][0]."<br />";
$movie_code = '<embed style="width:400px; height:326px;" id="VideoPlayback" type="application/x-shockwave-flash" src="http://video.google.com/googleplayer.swf?docId=3728266100951844857&hl=en" flashvars=""> </embed>';
preg_match_all('~((?<=width=")[0-9]+(?=")|(?<=width:)[0-9]+(?=px;))~', $movie_code, $matches);
echo $matches[0][0];

// The first "echo" outputted 425 and the second outputted 400. Yay! That's just what I needed!
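Since the same pattern is used twice above, it could be wrapped in a small helper. This is only a sketch: extract_width() is a hypothetical name, not something from the thread, and the nullable return type assumes PHP 7.1 or later. It uses preg_match() because only the first width is needed.

```php
<?php
// Hypothetical helper: returns the first width found as an integer,
// or null when neither the attribute form nor the CSS form is present.
function extract_width(string $html): ?int
{
    $pattern = '~(?<=width=")[0-9]+(?=")|(?<=width:)[0-9]+(?=px;)~';

    return preg_match($pattern, $html, $m) ? (int) $m[0] : null;
}

echo extract_width('<object width="425" height="350">'), "\n";
echo extract_width('<embed style="width:400px; height:326px;">'), "\n";
```

Returning null for "not found" lets the caller tell a missing width apart from a width of 0.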

 

Keywords for the Google Spider:

 

use regexp regex regular expressions to grab extract get pull HTML attributes parameters preg_match preg_match_all php

