
[SOLVED] URL Capturing RegEx !


d.shankar


I have these three regexes. Each of them retrieves links that the others miss.

 

 

preg_match_all("/<a (?:.*?)href=\"([^\"]+?)\"(?:[^>]*?)>/si", $src, $val);
preg_match_all("/<a[\s]+[^>]*href\s*=\s*[\"']?([^'\" >]+)['\" >]/s",$src,$val);
preg_match_all("/href=\"(.*?)\"|<frame.*?src=\"(.*?)\"/",$src,$val);

 

 

 

Is it possible to combine the three regexes into a single one?


Thanks for the reply.

 

I have an HTML source like this, and I need to extract the values of the href attributes:

 

<html>
.....
<a href="www.google.com">click</a>
<a href=www.yahoo.com>click</a> NOTE: here there is no double quote
<a href='www.yahoo.com'>click</a> NOTE: here there is a single quote
<a class=subclass href="www.ask.com">click</a>
</html>

 

Under these circumstances my code is unable to retrieve all of these links.

 

Is it possible to write a single regular expression that retrieves every href value?

 

Please help!
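
For reference, the four anchor forms listed above (double-quoted, unquoted, single-quoted, and with another attribute before href) can in principle be covered by one pattern. Here is a minimal sketch, not a full HTML parser; the variable names `$html`, `$pattern` and `$links` are only illustrative:

```php
<?php
// Sample input: the four anchors from the post above.
$html = <<<'HTML'
<a href="www.google.com">click</a>
<a href=www.yahoo.com>click</a>
<a href='www.yahoo.com'>click</a>
<a class=subclass href="www.ask.com">click</a>
HTML;

// Three alternatives: "double-quoted", 'single-quoted', or unquoted value.
// The (?| ... ) branch-reset group makes each alternative capture into group 1.
$pattern = '/<a\s[^>]*?\bhref\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';

preg_match_all($pattern, $html, $matches);
$links = $matches[1]; // group 1 holds the href values
print_r($links);
```

The `[^>]*?` part is what lets other attributes such as `class=subclass` appear before href without the match leaving the current tag.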


Try this:

$HTML = $thehtmlpage;
preg_match_all('/(?:href\s?=\s?(?:"|\'))(.*?)(?:"|\')/i', $HTML, $result, PREG_PATTERN_ORDER);
$result = $result[1]; // group 1 holds the captured attribute value, without href= and the quotes
print_r($result);

 

or

 

$HTML = $thehtmlpage;
preg_match_all('/\bwww\.[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i', $HTML , $result, PREG_PATTERN_ORDER);
$result = $result[0];

This will find URL-formatted text (anything starting with www.).


Your code doesn't work for the 4th case I mentioned above.

 

This regex works perfectly for extracting any href value:

 

preg_match_all("/<a\s+.*?href=[\"\'\s]?(.*?)>(.*?)<\/a>/i",$source,$result);

 

Now I need to get the value of the action attribute, like this:

 

<form action="new.asp" method="post">

 

Is it possible to adapt my expression above to handle the form tag?


Seems to work here, i.e.

<a class=subclass href="www.ask.com">click</a>

returns

www.ask.com

 

as for

<form action="new.asp" method="post">

 

preg_match_all('/<form(?:[^>]*?)(?:action=)(?:"|\')([^"\']*)/si', $subject, $result, PREG_SET_ORDER);

 

will find the value of action

 

or

preg_match_all('/(?:(?:href\s?=\s?|action\s?=\s?)(?:"|\'))(.*?)(?:"|\')/si', $subject, $result, PREG_SET_ORDER);

to extend the one above
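
Note that PREG_SET_ORDER changes the layout of the result array compared with the default PREG_PATTERN_ORDER, so the captured value is read differently. A small sketch of the difference, using a simplified pattern and made-up input (`$byGroup` and `$byMatch` are illustrative names):

```php
<?php
$subject = '<a href="a.html">A</a> <form action="new.asp" method="post">';
$pattern = '/(?:href|action)\s?=\s?["\'](.*?)["\']/si';

// PREG_PATTERN_ORDER (the default): $byGroup[1] lists every captured value.
preg_match_all($pattern, $subject, $byGroup, PREG_PATTERN_ORDER);
print_r($byGroup[1]);

// PREG_SET_ORDER: one entry per match; index 1 of each entry is the value.
preg_match_all($pattern, $subject, $byMatch, PREG_SET_ORDER);
foreach ($byMatch as $match) {
    echo $match[1], "\n";
}
```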


ok, had to break it up a little

<?php
$subject = '<form action="new.asp" method="post">';
$Reg1 = '/(?:(?:href\s?=\s?|action\s?=\s?)(?:"|';
$Reg2 = "\\')?)(.*?)";
$Reg3 = '(?:"|';
$Reg4 = "\\'|\s)/si";

preg_match_all($Reg1.$Reg2.$Reg3.$Reg4, $subject, $result, PREG_PATTERN_ORDER);
$result = $result[1];
print_r($result);
?>
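
The same idea can be wrapped in a small helper that takes the attribute names as a parameter, which also covers the frame src case from the first post. This is only a sketch: `extract_attr_values` is a hypothetical function name, not something from the posts above, and the pattern looks at attribute names anywhere in the text rather than inside specific tags.

```php
<?php
// Hypothetical helper: collect the values of the given attributes, whether
// they are double-quoted, single-quoted, or unquoted.
function extract_attr_values($html, array $attrs)
{
    // Escape each attribute name before building the alternation.
    $names = implode('|', array_map(function ($a) {
        return preg_quote($a, '/');
    }, $attrs));

    // Branch-reset group (?| ... ) keeps the value in group 1 for every form.
    $pattern = '/\b(?:' . $names . ')\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    preg_match_all($pattern, $html, $m);
    return $m[1];
}

$subject = '<form action="new.asp" method="post"><a href=page.php>x</a><frame src=\'top.html\'>';
print_r(extract_attr_values($subject, array('href', 'action', 'src')));
```

Because the match is not tied to a tag name, it would also pick up something like `data-href=`; tighten the pattern if that matters for your pages.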

