Newbie: Extracting links / iframes from HTML


MarkusJ
Solved by .josh

Hi, I am still learning PHP and given some HTML I am trying to extract all links and iframes from the HTML and append them to a different string.

 

I am not sure whether the way I check that the returned array has values (isset) is correct, or whether the way I append strings together (.) is right.

 

The code that I have so far is:

function GetLinksIFrames($content)
{
    $innerContent = '';
    $regex_pattern_links = "/<a href=\"(.*)\">(.*)<\/a>/";

    preg_match_all($regex_pattern_links, $content, $matches);

    for ($i = 0; $i < count($matches); $i++)
    {
        if (isset($matches[0][$i])) // Is this correct?
        {
            $innerContent = $innerContent.$matches[0][$i]." "; // Is this how to append a result to an existing string?
        }
    }

    $regex_pattern_iframe = "/<iframe src=\"(.*)\">(.*)<\/iframe>/";

    preg_match_all($regex_pattern_iframe, $content, $matches);

    for ($i = 0; $i < count($matches); $i++)
    {
        if (isset($matches[0][$i]))
        {
            $innerContent = $innerContent.$matches[0][$i]." ";
        }
    }

    return $innerContent;
}

Any help appreciated

 

Thanks
Mark

If you want, you can use JavaScript's DOM methods for that.

 

document.links and window.frames respectively do just that.

 

If you are looking for the PHP version, look up non-greedy repetitions.

 

<a href="some/path/link.php">A link here.</a><a href="another/path/link.php">Another link here.</a>

 

Your RegExp would match the whole line above as one single link, not as two ;)

Thanks for the feedback :)

 

If I can ask a direct PHP question:

if (isset($matches[0][$i])) // Is this correct?
{
    $innerContent = $innerContent.$matches[0][$i]." "; // Is this how to append a result to an existing string?
}

Is the above the best way to check for a null reference and to append a string to itself?

 

Thanks!

  • Solution

If you want to be explicit, you can check that $matches[0] exists before checking that $matches[0][$i] exists:

 

if (isset($matches[0]) && isset($matches[0][$i]))

 

But this is only for readability: isset() on a nested index already returns false, without raising a notice, when the outer key is missing, so the logic works as-is.

 

And yes, that is how you append a string to another string (concatenation).
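
A minimal sketch of the same loop, assuming the goal is simply to join every full match with a space. The helper name joinMatches and the sample HTML are made up for illustration. It iterates over $matches[0] directly, which sidesteps both the isset() check and a subtle limit in the original loop: count($matches) counts the full-match array plus one array per capture group (3 here), not the number of links found, so the original for loop stops after the third match.

```php
<?php
// Hypothetical helper: joins every full match of $pattern in $content with a space.
function joinMatches(string $pattern, string $content): string
{
    $result = '';
    // With the default PREG_PATTERN_ORDER flag, $matches[0] holds the full matches
    // and is an empty array when nothing matched, so no isset() check is needed.
    preg_match_all($pattern, $content, $matches);
    foreach ($matches[0] as $match) {
        $result .= $match . ' '; // shorthand for $result = $result . $match . ' '
    }
    return $result;
}

// Four links: a loop bounded by count($matches) would only pick up the first three.
$html = '<a href="a.php">A</a> text <a href="b.php">B</a> <a href="c.php">C</a> <a href="d.php">D</a>';
echo joinMatches('/<a href="(.*?)">(.*?)<\/a>/', $html), "\n";
```

The .= operator does exactly what the longer $innerContent = $innerContent . ... form does; it is just the more common spelling.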

 

Sidenote: In your patterns, you use (.*) in several places. This is a greedy match: it grabs as much text as it can, so two links on the same line come back as one match. You should use a non-greedy match instead: (.*?)
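
To see the difference, here is a small comparison using the two-link line quoted earlier in the thread. The variable names are only for illustration.

```php
<?php
$html = '<a href="some/path/link.php">A link here.</a>'
      . '<a href="another/path/link.php">Another link here.</a>';

// Greedy: (.*) runs to the last "> and the last </a> it can find.
$greedyCount = preg_match_all('/<a href="(.*)">(.*)<\/a>/', $html, $greedy);

// Non-greedy: (.*?) stops at the first "> and the first </a>.
$lazyCount = preg_match_all('/<a href="(.*?)">(.*?)<\/a>/', $html, $lazy);

echo $greedyCount, "\n"; // 1 (both links swallowed into one match)
echo $lazyCount, "\n";   // 2
print_r($lazy[1]);       // the two href values
```

With the greedy pattern, the first capture group ends up containing the first URL, the first link's text and most of the second tag, which is rarely what you want.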

 

Also, I would like to point out that regex isn't really the best method for parsing HTML. For example, if the links and iframes have any other attributes, use single quotes instead of double quotes, are spaced differently, or vary in any number of other ways, your regex will fail, since it doesn't account for any of that. What you should use instead is a DOM parser.
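
As a rough sketch of the DOM approach, assuming PHP's bundled DOM extension (DOMDocument) is available: the function name extractLinksAndIframes and the sample markup are hypothetical, and unlike the original function this returns the URLs as an array instead of the whole tags as one string.

```php
<?php
// Hypothetical helper: collects href values of <a> tags and src values of <iframe> tags.
function extractLinksAndIframes(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $urls[] = $a->getAttribute('href');
        }
    }
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        if ($iframe->hasAttribute('src')) {
            $urls[] = $iframe->getAttribute('src');
        }
    }
    return $urls;
}

// Extra attributes and single quotes, which the regex version would miss.
$html = "<p><a class='x' href='one.php'>One</a> <iframe width=\"10\" src=\"frame.html\"></iframe></p>";
echo implode(' ', extractLinksAndIframes($html)), "\n"; // one.php frame.html
```

Because the parser works on elements and attributes, quoting style, attribute order and spacing stop mattering.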
