Need a hand, nothing complicated i think....

dheaven69 · April 18, 2007

I have developed a script that creates feeds from HTML files (or at least it should, if I could get the regex right), but it isn't working because of the regex and I can't figure it out.

if ($feed == false) {
    $Surl = 'myurl.com/myarticle.html';
    $feed = fetch($Surl);
    preg_match_all('/<description>(.*?)<\/description>/s', $feed, $f);
    foreach ($f[1] as $fa) {
        $feed .= $fa;
    }
    // FILTER URL
    $feed = preg_replace('/\.+/', '', $feed);
    $feed = preg_replace('/\-+/', '', $feed);
    // $feed = str_replace("<description>", "<p>", $feed);
    // $feed = str_replace("</description>", "</p>", $feed);
    $feed = strip_tags($feed);
    // makes feed item

OK, so I want to create a feed from an article on my website. I have the rest of the code working, but I can't figure out this part.

I will mark the beginning and the end of the article with whatever you suggest, and I need the script to pull out the part between those tags, comments, or whatever the markers are.

thank you

c4onastick · April 18, 2007

Is this database driven? It might be easier to store the description in the database, then create the feed from that. (I think that's how it's usually done, for simplicity's sake.) What does 'fetch' do? (Does it return the content or just the URL?) You've got the right idea if it's the content. What are these two lines for? (Why are they necessary?)

$feed = preg_replace('/\.+/','',$feed); // This removes all series of 1 or more dots
$feed = preg_replace('/\-+/','',$feed); // '-' isn't a metacharacter, so it doesn't need to be escaped
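To make the effect of those two calls concrete, here is a small sketch run on a made-up string (the sample text is hypothetical, not taken from the site in question). Note that it removes dots and hyphens everywhere, including inside a domain name:

```php
<?php
// Hypothetical sample text, only to show the effect of the two patterns.
$sample = 'Read more... at my-site.com -- now';

$noDots   = preg_replace('/\.+/', '', $sample);  // drops every run of dots
$noDashes = preg_replace('/-+/', '', $noDots);   // drops every run of hyphens

echo $noDashes, "\n";  // Read more at mysitecom  now
```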

Post a little bit of an article (or a link), so we can help you with a specific solution.

dheaven69 · April 18, 2007

the website does not have a DB so i'm not using one.

The script is called on every page when it loads. It checks whether a feed has already been generated, and if not, it makes one. So fetch grabs the page where the script is called, and then the script should keep only the desired part of that page.
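A minimal sketch of that "build it only if it isn't there yet" check, assuming the generated feed is stored in a file. The function name `getOrBuildFeed`, the cache path, and the stand-in builder are all hypothetical, not the actual script:

```php
<?php
// Hypothetical helper: return the stored feed if it exists, otherwise build and store it.
function getOrBuildFeed($cacheFile, $build)
{
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }
    $feed = $build();                               // only runs on a cache miss
    file_put_contents($cacheFile, $feed, LOCK_EX);  // save for the next page load
    return $feed;
}

// Demo with a temporary file and a stand-in builder that counts its calls.
$cacheFile = sys_get_temp_dir() . '/feed-sketch-' . uniqid() . '.xml';
$calls = 0;
$build = function () use (&$calls) {
    $calls++;
    return '<rss>demo</rss>';
};

$first  = getOrBuildFeed($cacheFile, $build);  // builds and stores
$second = getOrBuildFeed($cacheFile, $build);  // served from the file
unlink($cacheFile);
```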

fetch gets the content

I got it working up to a point, but it picks up almost all of the content from the page: menu items, links, all kinds of tags, and so on.

c4onastick · April 19, 2007

OK, sounds good, so you're pretty much "self-scraping". Since you have control of the content that you're scraping, I'd make it easy on yourself. Define a specific class or id for the part you want to scrape, like you've done with the <description> tag. I would guess that you'd only want one of those per article (correct?). So you could change this:

preg_match_all('/<description>(.*?)<\/description>/s',$feed,$f);

to

preg_match('%<description>(.*?)</description>%s',$feed,$f);

(It's generally best practice to use a delimiter other than '/' when scraping a markup language; it makes the pattern more readable because the slashes in closing tags no longer need escaping.)
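Here is a minimal sketch of that single-match version in use. The `$page` string is a hypothetical stand-in for whatever fetch returns. The match is stored in its own variable rather than appended to the fetched page, since the original loop does `$feed .= $fa` while `$feed` still holds the whole page, which would explain the menus and links showing up:

```php
<?php
// Hypothetical page content standing in for what fetch() returns.
$page = '<html><body><ul><li>Menu</li></ul>'
      . '<description><p>First <b>paragraph</b>.</p></description>'
      . '<p>Footer</p></body></html>';

// Keep only what sits between the markers; start from an empty string.
$description = '';
if (preg_match('%<description>(.*?)</description>%s', $page, $m) === 1) {
    $description = $m[1];
}

echo $description, "\n";  // <p>First <b>paragraph</b>.</p>
```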

Then it's just a matter of cleaning up the markup. strip_tags is a great way to do that. What other types of formatting do you need to do? Will the markup between the <description> tags contain images, lists, or links?
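As a rough sketch of that cleanup step, assuming the extracted markup looks something like the hypothetical `$description` below: `strip_tags` accepts an optional second argument listing tags to keep, so links can survive if you want them.

```php
<?php
// Hypothetical extracted markup, standing in for the text between the <description> tags.
$description = "<p>Plain text with a <a href=\"/more.html\">link</a>\n"
             . " and an <img src=\"pic.png\" alt=\"\"> image.</p>";

$textOnly  = strip_tags($description);         // removes every tag
$withLinks = strip_tags($description, '<a>');  // removes every tag except <a>

// Collapse the runs of whitespace left behind by removed tags.
$textOnly  = trim(preg_replace('/\s+/', ' ', $textOnly));
$withLinks = trim(preg_replace('/\s+/', ' ', $withLinks));

// Escape the plain text before placing it inside an XML feed element.
$forFeed = htmlspecialchars($textOnly, ENT_QUOTES, 'UTF-8');

echo $textOnly, "\n";   // Plain text with a link and an image.
echo $withLinks, "\n";  // Plain text with a <a href="/more.html">link</a> and an image.
```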

dheaven69 · April 19, 2007

Pretty much plain text, but some links, other tags, and images could appear. I don't intend that, but who knows :\

thank you for your help
