dheaven69 Posted April 18, 2007

I have developed a script that creates feeds from HTML files (or at least it should, if I could get the regex right), but it isn't working because of the regex and I can't figure it out.

if ($feed == false) {
    $Surl = 'myurl.com/myarticle.html';
    $feed = fetch($Surl);
    preg_match_all('/<description>(.*?)<\/description>/s', $feed, $f);
    foreach ($f[1] as $fa) {
        $feed .= $fa;
    }
    // FILTER URL
    $feed = preg_replace('/\.+/', '', $feed);
    $feed = preg_replace('/\-+/', '', $feed);
    // $feed = str_replace("<description>", "<p>", $feed);
    // $feed = str_replace("</description>", "</p>", $feed);
    $feed = strip_tags($feed);
    // makes feed item
}

OK, so I want to create a feed from an article within my website. I have the rest of the code working, but I can't manage to figure out this part. I will mark the article at the beginning and at the end with whatever you suggest, and I need the script to pull out the part between those tags or comments.

Thank you.
c4onastick Posted April 18, 2007

Is this database driven? It might be easier to store the description in the database and then create the feed from that. (I think that's how it's usually done, for simplicity's sake.)

What does 'fetch' do? (Does it return the content or just the URL?) You've got the right idea if it's content.

What are these for? (Why are they necessary?)

$feed = preg_replace('/\.+/','',$feed); // This removes every run of 1 or more dots
$feed = preg_replace('/\-+/','',$feed); // '-' isn't a metacharacter outside a character class, so it doesn't need to be escaped

Post a little bit of an article (or a link), so we can help you with a specific solution.
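To make the effect of those two replacements concrete, here is a minimal sketch. The sample string is made up for illustration; only the two patterns come from the thread.

```php
<?php
// Made-up sample text, not taken from the poster's site.
$text = 'Intro... see my-site.com';

// Removes every run of one or more dots, including the dot in the domain.
$noDots = preg_replace('/\.+/', '', $text);

// Removes every run of one or more hyphens.
// Outside a character class, '/-+/' and '/\-+/' match the same thing.
$noHyphens = preg_replace('/-+/', '', $noDots);
$escaped   = preg_replace('/\-+/', '', $noDots);

echo $noHyphens, "\n";
```

Note that both patterns also hit dots and hyphens inside URLs, decimals, and hyphenated words, so they are only worth keeping if that is really the intent.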
dheaven69 Posted April 18, 2007 (Author)

The website does not have a DB, so I'm not using one. The script is called in every page when it loads. It checks whether a feed has already been generated, and if not, it makes one. So fetch grabs the page where the script is called, and then the script should keep only the desired part of the page. fetch returns the content.

I made it work up to a point, but it gets almost all of the content from the page: menu items, links, all kinds of tags, and so on.
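One likely reason for getting the whole page: in the code from the first post, $feed still holds the full page when the matches are appended to it with .=, so strip_tags later leaves the menu text and everything else in place. A minimal sketch of the difference, assuming fetch returns the page HTML as a string (the $page value here is a made-up stand-in):

```php
<?php
// Hypothetical page standing in for whatever fetch() returns.
$page = '<html><body><ul><li>Menu</li></ul>'
      . '<description>Article text.</description></body></html>';

preg_match_all('%<description>(.*?)</description>%s', $page, $matches);

// Pattern from the first post: append the matches to the page itself.
$buggy = $page;
foreach ($matches[1] as $match) {
    $buggy .= $match;
}
$buggy = strip_tags($buggy); // still contains "Menu"

// Alternative: build the feed text from the captured groups only.
$description = strip_tags(implode("\n", $matches[1]));

echo $description, "\n";
```

Keeping the captured text in its own variable means nothing outside the <description> markers can leak into the feed item.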
c4onastick Posted April 19, 2007

OK, sounds good. So you're pretty much "self-scraping". Since you have control of the content that you're scraping, I'd make it easy on yourself: define a specific class or id for the specific part you want to scrape, like you've done with the <description> tag. I would guess that you'd only want one of those per article (correct?). So you could change this:

preg_match_all('/<description>(.*?)<\/description>/s',$feed,$f);

to

preg_match('%<description>(.*?)</description>%s',$feed,$f);

(It's generally best practice to use a delimiter other than '/' when scraping a markup language; it makes the pattern more readable.)

Then it's just a matter of cleaning up the markup. strip_tags is a great way to do that. What other types of formatting do you need to do? Will the markup between the <description> tags contain images, lists, or links?
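The single-match approach described here could be sketched roughly as follows. extractDescription is a hypothetical helper name, the sample HTML is made up, and the type hints assume PHP 7.1 or later; only the pattern and strip_tags come from the thread.

```php
<?php
/**
 * Returns the plain text between the first <description> pair,
 * or null when the page has no such pair.
 * (Hypothetical helper, not part of the original script.)
 */
function extractDescription(string $html): ?string
{
    // '%' as delimiter means the '/' in the closing tag needs no escaping.
    if (preg_match('%<description>(.*?)</description>%s', $html, $m) !== 1) {
        return null;
    }
    // $m[1] is the captured group; strip remaining markup and outer whitespace.
    return trim(strip_tags($m[1]));
}

// Made-up sample page.
$html = "<div id=\"menu\">Home</div>\n"
      . "<description>\n<p>First <b>article</b>.</p>\n</description>";

var_dump(extractDescription($html));
```

Returning null for a page without the markers lets the caller skip feed generation instead of emitting an empty item.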
dheaven69 Posted April 19, 2007 (Author)

Pretty much plain text, but some links and some tags could appear, and images too. I don't intend that, but who knows :\

Thank you for your help.
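If links should survive in the feed while other tags and images are dropped, one option is the second argument of strip_tags, which lists tags to keep. A minimal sketch with a made-up fragment; whether to keep links at all, and the escaping step for the XML feed, are assumptions about what the feed needs.

```php
<?php
// Made-up fragment of the kind that might sit between the <description> markers.
$fragment = '<p>See <a href="http://myurl.com/more.html">this page</a>.'
          . '<img src="pic.gif" alt="pic"></p>';

// Plain text only: every tag, including the image, is removed.
$textOnly = strip_tags($fragment);

// Keep <a> elements, drop everything else.
$withLinks = strip_tags($fragment, '<a>');

// Escape the kept markup before placing it inside an XML feed element.
$escaped = htmlspecialchars($withLinks, ENT_QUOTES, 'UTF-8');

echo $textOnly, "\n", $withLinks, "\n", $escaped, "\n";
```

strip_tags keeps the attributes of allowed tags untouched, so this is only reasonable for content you control, which is the case for a self-scraped page.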