jjk2 Posted April 18, 2009
As you know, one can use regex to find specific data in a page. However, some websites use JavaScript to hide their data, so the page you see in your browser is different from the page's HTML source. What is a possible solution? Is there any way to turn the fully rendered page back into HTML and then scrape that?
Another difficulty is scraping Flash. Is it even possible to scrape text in Flash? I do not see how it's possible, unless the .swf file is downloaded, decompiled, and matched against a regex.
Link to thread: https://forums.phpfreaks.com/topic/154581-scraping-the-fully-rendered-page-not-the-html/
.josh Posted April 18, 2009
JavaScript can't hide the data if JavaScript isn't enabled...
jjk2 (Author) Posted April 18, 2009
True, but if JavaScript is disabled, then the page will not fully render.
jackpf Posted April 18, 2009
Can you not just adapt your regex to match the correct terms?
jjk2 wrote: "is it even possible to scrape texts on flash ?"
And I wouldn't have thought so. And good luck writing a script to download and decompile a Flash file lol. Probably illegal anyway.
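
Editor's sketch of the "adapt your regex" suggestion: when a page builds its visible content with JavaScript, the underlying data is sometimes still present in the raw HTML as a literal inside an inline script. The example below is only an illustration under that assumption. The page content, the variable name `priceData`, and the helper `extract_js_array` are all hypothetical, and the approach only works when the literal happens to be valid JSON. It does not help when the data is fetched by a separate request after the page loads.

```python
import json
import re

# Hypothetical raw HTML, as an HTTP fetch would return it (no JavaScript executed).
RAW_HTML = """
<html><body>
<div id="prices"></div>
<script>
var priceData = [{"name": "widget", "price": 9.5}, {"name": "gadget", "price": 12}];
document.getElementById("prices").innerHTML = priceData.length + " items";
</script>
</body></html>
"""


def extract_js_array(html, var_name):
    """Return the JSON array assigned to `var_name` in an inline script, or []."""
    # Match: var <name> = [ ... ];  (non-greedy, up to the first "]" followed by ";")
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;",
        re.DOTALL,
    )
    match = pattern.search(html)
    if match is None:
        return []
    # Assumes the JavaScript literal is also valid JSON (double quotes, no comments).
    return json.loads(match.group(1))


items = extract_js_array(RAW_HTML, "priceData")
for item in items:
    print(item["name"], item["price"])
```

The result is an ordinary Python list of dictionaries, which is roughly the "push the output to an array" step asked for in the thread.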
jjk2 (Author) Posted April 18, 2009
Well, I need a way to read the JavaScript together with the HTML and push the output to an array. I do not know where I can find a tool or code that will read the JavaScript + HTML and push the complete output to an array, which I can then scrape.
As for Flash decompiling, what makes you think it's illegal? It's a simple way to extract data from otherwise difficult Flash. Take your armchair law & enforcement elsewhere, kiddie.
jackpf Posted April 18, 2009
Well, decompiling an exe is just a difficult way of getting data. Still illegal though. And fine, if you're going to give me shit, I'll take my suggestion elsewhere.
Axeia Posted April 18, 2009
You're facing the same problems as the big search engines like Yahoo and Google. Google whipped something up to read a little bit of text in Flash, but it's near useless atm. Don't think there's much you can do about JavaScript either. It's why pages that make heavy use of JavaScript and Flash often don't rank very well in the search results.
It's an accessibility issue that the maker of the page should avoid by never relying on JavaScript being present. As for Flash, it's not really machine accessible either. If you figure out a good way, I'm sure all the writers of search engines, as well as the writers of screen readers etc., would like to have a talk with ya.
So yes, that was basically a "just give up, it's too hard".
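
Editor's sketch of the "download the .swf and match it against a regex" idea from the first post. An SWF file starts with an 8-byte header: a 3-byte signature, a version byte, and a 4-byte length. With the `CWS` signature, everything after those 8 bytes is zlib-compressed. The helper names `swf_body` and `printable_strings` are hypothetical, and the input below is a synthetic stand-in, not a real Flash file. This only finds strings stored as plain bytes; text that a file stores in some other form (for example as references into an embedded font) would not show up, and other signatures are not handled.

```python
import re
import struct
import zlib


def swf_body(data):
    """Return the uncompressed body that follows the 8-byte SWF header."""
    signature = data[:3]
    if signature == b"FWS":  # body stored uncompressed
        return data[8:]
    if signature == b"CWS":  # body zlib-compressed after the header
        return zlib.decompress(data[8:])
    raise ValueError("unsupported signature: %r" % signature)


def printable_strings(body, min_len=4):
    """Find runs of printable ASCII that are at least `min_len` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(body)]


# Synthetic stand-in for a compressed SWF: header + zlib-compressed payload.
payload = b"\x00\x01Hello from Flash\x00\xff\x02ab\x00"
fake_swf = (
    b"CWS"
    + bytes([9])                             # version byte (arbitrary here)
    + struct.pack("<I", 8 + len(payload))    # uncompressed file length
    + zlib.compress(payload)
)

found = printable_strings(swf_body(fake_swf))
print(found)
```

The returned list can then be searched with the same regex used for ordinary HTML.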