Filtering Scraped Content with PHP: Add absolute URL to relative URL?


schapel

Hey all, I've run into a little problem with a script I've been working on. The script grabs a bunch of HTML from a separate web page and then displays the results on my page. I have already put in some basic filtering using PHP's strip_tags function, so that only the A, TR, TD, TH and P tags are allowed through when the code is output.
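
A minimal sketch of that filtering step, assuming the scraped markup is already in a string. The variable names and the sample markup are made up for illustration; only strip_tags() and its allowed-tags argument come from core PHP.

```php
<?php
// Hypothetical scraped markup; the real input would come from the remote page.
$scraped = '<div><p>Intro <b>bold</b></p>'
         . '<table><tr><td><a href="/test/blabla.html">text link</a></td></tr></table></div>';

// The second argument lists the tags to keep; every other tag is removed,
// but the text inside the removed tags stays in the output.
$filtered = strip_tags($scraped, '<a><tr><td><th><p>');

echo $filtered, PHP_EOL;
```

Note that strip_tags() leaves the attributes of allowed tags untouched, so the relative href values pass through unchanged.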

However, my next step is to turn the relative links that come back into absolute URLs by adding a specific URL prefix.  For example, the links are being returned as

<a href="/test/blabla.html">text link</a>

which matches how they appear on the page the HTML is grabbed from. However, I need to loop through all the A tags and add a direct URL in front of the HREF value so that the links actually work when clicked. If they are left as is, the links behave as if the target were located on my site, which it isn't.

I guess my question is: is there a simple function that does this, either in one of the major PHP frameworks or in core PHP, that I have missed? I found some long code examples that accomplish it, but I was hoping for a simple function I could use instead.

Thanks!

I'm using regular expressions to scrape the HTML I want to begin with, since from what I understood that is the fastest way.  Wouldn't the output need to be converted to XML first, and then run through the DOM functions? It seems like a very indirect way of doing it with some unnecessary steps, but then again I'm not really sure...
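
On the XML question, a short sketch may help: DOMDocument has an HTML parser of its own (loadHTML), so a scraped string can be handed to it without converting it to XML first. The sample fragment and variable names below are assumptions for illustration.

```php
<?php
// Hypothetical fragment standing in for the already-filtered scrape.
$html = '<p><a href="/test/blabla.html">text link</a></p>';

// Collect libxml parse warnings internally instead of printing them,
// since real-world markup is rarely clean.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
// loadHTML() parses HTML directly from a string; no XML conversion step is needed.
$doc->loadHTML($html);

// Gather the href of every <a> element found in the fragment.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```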

Does this look like a complicated or indirect way to solve your problem?

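<?php

// Parse the page with the HTML parser; load() expects well-formed XML.
$doc = new DOMDocument;
$doc->loadHTMLFile('page.html');

// Collect every <a> element in the document.
$items = $doc->getElementsByTagName('a');

// Overwrite each link's href attribute.
foreach ($items as $item) {
    $item->setAttribute('href', '/test/blabla.html');
}

?>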

No, I was just asking ;)

That script doesn't really fit what I'm doing, but I get the gist of what you're saying.

I already have the scraped content stored in a string variable, nicely filtered, and I need to ADD a partial string to the beginning of each HREF attribute rather than just cycle through and replace all of them.

I suppose I'll have to read up more on DOM, thanks for your help.
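
For the case described above (filtered content already in a string, a prefix to be added to each HREF), one possible DOM-based sketch follows. It is not a built-in one-liner, and several things are assumptions: $base is a placeholder host, $html is sample input, and only root-relative links (those starting with a single "/") are rewritten.

```php
<?php
// Assumptions: $base is a placeholder host and $html stands in for the filtered string.
$base = 'http://www.example.com';
$html = '<p><a href="/test/blabla.html">text link</a> '
      . '<a href="http://other.example/x">absolute</a></p>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument;
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // Prefix only root-relative links ("/path"), not "//host/path" or full URLs.
    if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
        $link->setAttribute('href', $base . $href);
    }
}

// loadHTML() wraps a fragment in <html><body>, so serialize only the body's children.
$body = $doc->getElementsByTagName('body')->item(0);
$result = '';
foreach ($body->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}

echo $result, PHP_EOL;
```

Links written relative to the current directory (for example "blabla.html" or "../x.html") are left alone here; resolving those properly needs the full URL of the page they were scraped from.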

