
Need a hand, nothing complicated I think...


dheaven69


I have developed a script that creates feeds from HTML files (or at least it should, if I could get the regex right), but it isn't working because of the regex and I can't figure it out.

 

if ($feed == false) {
        $Surl = 'myurl.com/myarticle.html';
        $feed = fetch($Surl);
        preg_match_all('/<description>(.*?)<\/description>/s', $feed, $f);
        foreach ($f[1] as $fa) {
            $feed .= $fa;
        }
        // FILTER URL
        $feed = preg_replace('/\.+/', '', $feed);
        $feed = preg_replace('/\-+/', '', $feed);
//      $feed = str_replace("<description>", "<p>", $feed);
//      $feed = str_replace("</description>", "</p>", $feed);
        $feed = strip_tags($feed);
        // makes feed item

 

OK, so I want to create a feed from an article within my website. I have the rest of the code working, but I can't manage to figure out this part.

 

I will mark the beginning and the end of the article with whatever you suggest, and I need the script to pull out the part between those tags, comments, or whatever.

 

Thank you.


Is this database driven? It might be easier to store the description in the database and then create the feed from that. (I think that's how it's usually done, for simplicity's sake.) What does 'fetch' do? (Does it return the content or just the URL?) You've got the right idea if it's content. What are these for? (Why are they necessary?)

$feed = preg_replace('/\.+/','',$feed); // This removes every run of one or more dots
$feed = preg_replace('/\-+/','',$feed); // Same for hyphens; '-' isn't a metacharacter outside a character class, so it doesn't need to be escaped
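
To make it concrete what those two lines do to your text, here is a quick sketch. The sample string is made up for illustration; it isn't from your article.

```php
<?php
// Made-up sample text (an assumption, not real article content).
$sample = 'See my-site.com... v1.2';

// Remove every run of one or more dots.
$noDots = preg_replace('/\.+/', '', $sample);

// Remove every run of one or more hyphens ('-' needs no escape here).
$noDashes = preg_replace('/-+/', '', $noDots);

echo $noDashes, "\n"; // See mysitecom v12
```

As you can see, this also eats the dots and hyphens inside ordinary sentences, URLs, and version numbers, which is why I'm asking whether you really need it.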

Post a little bit of an article (or a link), so we can help you with a specific solution.

 


The website does not have a DB, so I'm not using one.

 

The script is called on every page when it loads. It checks whether a feed has already been generated and, if not, makes one. So fetch grabs the page where the script is called, and then the script should keep only the desired part of that page.

 

fetch gets the content.

 

I got it working up to a point, but it gets almost all of the content from the page: menu items, links, all kinds of tags and so on.

 


OK, sounds good, so you're pretty much "self-scraping". Since you have control of the content that you're scraping, I'd make it easy on yourself: define a specific class or id for the part you want to scrape, like you've done with the <description> tag. I would guess that you'd only want one of those per article (correct?). So you could change this:

preg_match_all('/<description>(.*?)<\/description>/s',$feed,$f);

to

preg_match('%<description>(.*?)</description>%s',$feed,$f);

(It's generally best practice to use a delimiter other than '/' when scraping a markup language; it makes the pattern more readable because the '/' in closing tags no longer needs escaping.)
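
One thing to watch when you switch: the two functions fill the result array differently, so a foreach over $f[1] would no longer fit. A small sketch, using made-up input with two <description> blocks just to show the difference:

```php
<?php
// Made-up input (an assumption) with two marked blocks.
$html = '<description>One</description><description>Two</description>';

// '/' delimiter: the slash in </description> has to be escaped.
preg_match_all('/<description>(.*?)<\/description>/s', $html, $all);

// '%' delimiter: same pattern, no escaping needed.
preg_match('%<description>(.*?)</description>%s', $html, $first);

var_dump($all[1]);   // array holding every captured body
var_dump($first[1]); // string holding only the first captured body
```

With preg_match_all, $all[1] is an array of matches; with preg_match, $first[1] is a single string you can use directly.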

Then it's just a matter of cleaning up the markup. strip_tags is a great way to do that. What other types of formatting do you need to do? Will the markup between the <description> tags contain images, lists, or links?
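
Putting that together, here is a rough sketch of how the extraction could look. The function name extractDescription and the sample page are my own placeholders, and I'm assuming fetch() returns the page HTML as a string. Also note that your original snippet appends the matches to $feed, which already holds the whole fetched page; that is probably why the menus and links end up in the output, so this version keeps only the captured part.

```php
<?php
// Hypothetical helper (name is an assumption): returns the plain text
// between the first <description>...</description> pair, or null if
// the page has no such pair.
function extractDescription(string $html): ?string
{
    // 's' lets '.' match newlines; '.*?' stops at the first closing tag.
    if (preg_match('%<description>(.*?)</description>%s', $html, $m) !== 1) {
        return null;
    }
    // Use only the captured part; do not append it to the full page.
    return trim(strip_tags($m[1]));
}

// Stand-in for what fetch() would return (made-up markup).
$page = '<ul><li>Menu</li></ul>'
      . '<description><p>First <b>paragraph</b>.</p></description>'
      . '<div>Footer</div>';

$feed = extractDescription($page);
echo $feed, "\n"; // First paragraph.
```

If you would rather mark the article with HTML comments instead of a custom tag, the same idea should work with the marker text swapped in the pattern.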
