Removing absolute links with regex


Lukus

Hi guys

 

I've searched around but haven't found anything that suits my needs, and my regex isn't good enough to adapt what I have found. So I'd appreciate any advice you can give me to help with my problem :)

 

Basically, I've written a script which pulls HTML data from 2000+ pages, and then puts the data for each page into a new file with a new template wrapped around it. Everything works fine, except now I'd like to tidy the output by removing any absolute links found, while leaving relative links intact.

 

Here's an example of the content I'm dealing with (it's all stored in a variable, say $content):

 

[<a
href="http://www.one-absolute-link.com/test.html">site map
</a>]</em></p>
[<a href="http://www.absolute-link.com/test.html">comments</a>]
  [<a
href="http://www.another-absolute-link.com/test.html">search
</a>]</em></p>

<h4 align="left"><img src="../images/arrow.gif" width="15" height="12">  
  Header</h4>
<ul>
  <li><a href="members.html">Text1</a></li>
</ul>
<ul>
  <li><a href="agendas/index.html">Text2</a><br>
  </li>
  <li><a href="minutes/index.html">Text3</a><br>

  </li>
  <li><a href="papers/index.html">Text4</a></li>
  <li><a href="reports/index.html">Text5</a><br>
  </li>
</ul>
<p align="left">[<a href="http://www.one-absolute-link.com/test.html">Abs Link</a>] [<a href="http://www.one-absolute-link.com/test.html">Abs Link]</a></p>

 

Note how messy the HTML is. This is one of the problems I was having, as links often span multiple lines. I'd like to be able to run a function on $content which removes any absolute links it finds (along with the square brackets around them), but leaves relative links intact.

 

My ideal output would be:

 

<h4 align="left"><img src="../images/arrow.gif" width="15" height="12">  
  Header</h4>
<ul>
  <li><a href="members.html">Text1</a></li>
</ul>
<ul>
  <li><a href="agendas/index.html">Text2</a><br>
  </li>
  <li><a href="minutes/index.html">Text3</a><br>

  </li>
  <li><a href="papers/index.html">Text4</a></li>
  <li><a href="reports/index.html">Text5</a><br>
  </li>
</ul>
<p align="left"></p>

(I still expect the output HTML to be just as messy and unformatted, but that can't really be helped when dealing with so many pages.)

 

If anyone could point me in the right direction I'd be extremely grateful.

 

Thanks, and good morning :)

Luke
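
For anyone reading later, here is a rough sketch of the kind of function described above, not a definitive answer. The post doesn't show the script itself, so this is written in Python purely for illustration. The function name `remove_absolute_links` is made up, and "absolute" is assumed to mean an href starting with http:// or https://. The pattern also drops the square brackets directly around a removed link. It does not clean up stray tags such as the leftover `</em></p>`, and a regex like this will not cope with every possible piece of malformed HTML.

```python
import re

# Assumption: an "absolute" link is an <a> tag whose href starts with
# http:// or https://. Optional [ ] around the link are removed as well.
ABSOLUTE_LINK = re.compile(
    r"""\[?\s*                       # optional opening bracket
        <a\b[^>]*?\bhref\s*=\s*      # <a ... href= (attributes may span lines)
        ["']https?://[^"']*["']      # quoted absolute URL
        [^>]*>                       # rest of the opening tag
        .*?                          # link text, possibly over several lines
        </a\s*>                      # closing tag
        \s*\]?                       # optional closing bracket
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def remove_absolute_links(content: str) -> str:
    """Return content with absolute <a> links stripped; relative links stay."""
    return ABSOLUTE_LINK.sub("", content)


if __name__ == "__main__":
    sample = (
        '[<a\nhref="http://www.one-absolute-link.com/test.html">site map\n</a>]\n'
        '<li><a href="members.html">Text1</a></li>\n'
    )
    print(remove_absolute_links(sample))
```

The re.DOTALL flag is what lets the link text run across line breaks, and the non-greedy `.*?` stops at the first closing `</a>` so that neighbouring relative links are not swallowed. The same pattern should carry over to other PCRE-style regex engines with the equivalent "dot matches newline" and case-insensitive flags.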
