mrhenniger

  1. As it turns out, once I removed enough of the unneeded HTML/text, writing the contents to the file caused no more issues. I will know better in the future. Mike
  2. > Are you absolutely sure that the page you are loading is the same one you are editing?
     > I've done it before where I copy a page/folder and edit the wrong page wondering why
     > I am not seeing any changes when loading the page.

     Been there, done that as well. I am absolutely sure I am editing and loading the correct scraping page. Minor changes are correctly reflected.

     > I'm thinking you may have had the page initially output the contents to the page to verify you
     > were getting the right results before changing the code to write to a file.

     Good idea as well, but the number of lines between getting the page contents and writing them to a file is minimal. Only a few lines. Definitely no output. See the revised code snippet below.

     > If you are scraping content from a location that has javascript that redirects, and then
     > output that scraped content on your page, and the javascript is executed, then yes, it's
     > going to redirect. If you don't want that to happen, in principle, you're going to have
     > to do what dalecosp said: find and strip it from the scraped content.

     OK, a bit of review first. Here is what I have developed since my original post...

     $fd = fopen($temporaryFileFlickr, 'w');
     $context = stream_context_create(array('http' => array('follow_location' => false)));
     $contents = file_get_contents($selectedURL, FALSE, $context);
     $contents = str_replace("<script", "", $contents);
     $contents = str_replace("al:ios:url", "", $contents);
     // ...many more lines like this to strip unneeded content...
     $contents = str_replace("url", "", $contents);
     fwrite($fd, $contents);
     fclose($fd);
     header("Location:" . $redirect);
     exit();

     These are the actual lines in my code, minus the many str_replace commands that are unnecessary to repeat here. As you can see, I produce no output before the redirect line. If any JavaScript survived the purge, I am not displaying or writing it into the scraping page to be executed. If I comment out the fwrite line, the redirect in the header line just before exit executes as expected. HOWEVER, if I put the fwrite line in as it should be, the page does the unwanted redirect, although with corrupted results. THAT is weird. Somehow, writing the scraped and mangled HTML & JavaScript contents to a local file results in the unexpected redirect. Here is what I think it comes down to: despite the "damage" I have done to the HTML & JavaScript, I don't think I have mangled it enough. I am going to try cutting out large portions of the text I know I don't need. I'll report my results tomorrow.

     > Or, if you are certain that is not the problem, provide the URL of the page you are hitting
     > so we can see the problem for ourselves.

     Well, that is a dilemma. Let me explain. I developed a website that documents the locations of displayed vintage aircraft and their histories (www.aerialvisuals.ca). One thing that we (a small group of admins) do is contact photographers on a popular photo hosting service and ask their permission to use their photos on our site as part of documenting the histories of airframes. For those who are willing to grant us permission, we access their photos directly from the photo hosting service by scraping the page to get the ID info and the photos themselves. We have permission from the owners of the photos to use them. However... discussing the methods this popular photo hosting service uses to secure their website, and essentially how to defeat that security, may be crossing an ethical boundary. I would therefore hesitate to post a link to the page I am trying to scrape on an open forum like this. Perhaps if one of you is really interested in this particular odd, but interesting, problem you could try sending me a private message here. I will report any progress I make on this issue. Mike
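     P.S. One thing I may try instead of the long chain of str_replace calls (only a sketch I have not run yet; the patterns are rough): drop whole <script>...</script> blocks, plus any <meta refresh> tag, with a couple of regular expressions...

     $contents = file_get_contents($selectedURL, FALSE, $context);
     // Strip complete <script>...</script> blocks rather than just the opening tag
     $contents = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $contents);
     // A <meta http-equiv="refresh"> tag can also redirect without any JavaScript at all
     $contents = preg_replace('#<meta[^>]+http-equiv=["\']refresh["\'][^>]*>#i', '', $contents);
     fwrite($fd, $contents);

     That way nothing executable (or at least far less) should survive into the capture file.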
  3. That is brilliant... Don't bother with shutting down the machinery nicely... just throw a wrench into the gears!!! So I modified the snippet as follows...

     $context = stream_context_create(array('http' => array('follow_location' => false)));
     fwrite($fd, str_replace("<script", "[script", file_get_contents($selectedURL, FALSE, $context)));
     fclose($fd);
     header("Location:" . $redirect);
     exit();

     The undesired redirection still happens, but the resulting page is messed up. I am still investigating. I'll post again if I discover something interesting. Mike
  4. I have been using file_get_contents to retrieve the HTML of web pages without issue until now. I prefer this method over curl due to the simpler syntax, but for scraping a more complex page, perhaps with a form submission, curl is the way to go. So I started scraping a new site today and discovered some odd behavior. Here is a code snippet to give some context...

     file_delete($temporaryFile); // $temporaryFile is defined as a relative path to a local text file
     $fd = fopen($temporaryFile, 'w');
     $context = stream_context_create(array('http' => array('follow_location' => false)));
     $contents = file_get_contents($selectedURL, FALSE, $context); // $selectedURL is the URL of the page being scraped
     fwrite($fd, $contents);
     fclose($fd);
     header("Location:" . $redirect); // $redirect is the intended destination page, which parses the contents of the local file, but we never reach it
     exit();

     This script runs as part of a web page, not from the command line. The code snippet is placed before the HTML <body> tag and there has been no other output to the web page (no echos, etc.). Normally, when file_get_contents stuffs the HTML of the page pointed to by $selectedURL into the variable $contents, there is no issue: the contents of $contents have no effect on the behavior of the scraping script or the rendering of the web page hosting it. HOWEVER, in this case either the actual contents, or the act of retrieving them, affects the rendering of the scraping page. You can see I write $contents to a file for post-analysis. The problem is that when the page loads I expect it to be the page specified by the URL in $redirect, yet in this particular case the page rendered is the page being scraped (not the page doing the scraping, as expected). How do I know that? I examined the contents of the written file and confirmed the page rendered comes from the contents of the written file, or at least the same source. Very odd. I have not seen this before. I suspect there is JavaScript in the page retrieved by file_get_contents that is overriding the rendering of the scraping page. There is definitely JavaScript in the page being scraped (I can see it in the captured file), but it appears complex and I am not a JavaScript expert. Does it make sense that the JavaScript of the page being scraped is affecting the page doing the scraping? I ask here since there are likely people with more expertise on the subject. I have spent all day on this and I can now say I need help. Clearly I don't want this unintended page redirection; I just want the HTML of the subject page, in a way that does not affect the page/script doing the scraping. Has anyone here seen this before? If it is JavaScript in the page being scraped that is having an effect, how can I disable that JavaScript? Any thoughts or comments would be greatly appreciated. Mike Henniger

     P.S. When I use curl to get the page by substituting the following for file_get_contents...

     $curl = curl_init();
     curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
     curl_setopt($curl, CURLOPT_FOLLOWLOCATION, FALSE);
     curl_setopt($curl, CURLOPT_AUTOREFERER, TRUE);
     curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
     curl_setopt($curl, CURLOPT_TIMEOUT, 30);
     curl_setopt($curl, CURLOPT_MAXREDIRS, 0);
     curl_setopt($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4,5)));
     curl_setopt($curl, CURLOPT_URL, $selectedURL);
     curl_setopt($curl, CURLOPT_HEADER, (int)NULL);
     curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
     $contents = trim(curl_exec($curl));
     curl_close($curl);

     ...it has the same effect, along with the unintended page redirection.
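     P.P.S. Another thing I plan to try (just a sketch, untested; the /tmp path is only an example): dump the response headers file_get_contents actually received, to confirm whether the remote server itself sent back a 30x, and write the capture somewhere outside the web root so the captured file cannot possibly be the page that gets served...

     $context = stream_context_create(array('http' => array('follow_location' => false)));
     $contents = file_get_contents($selectedURL, FALSE, $context);
     // PHP's http:// wrapper fills $http_response_header in this scope after the call above
     foreach ($http_response_header as $headerLine) {
         error_log($headerLine); // status line plus each response header
     }
     // Write the capture outside the web root for post-analysis
     $fd = fopen('/tmp/scrape_capture.html', 'w');
     fwrite($fd, $contents);
     fclose($fd);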
  5. That was it! I accessed the file via a relative path rather than the URL and it worked. I had set up the URL access for another purpose, but thought I could use it in this application as well. Apparently not. Lesson learned. Thanks!!! Mike
  6. Thanks for the tip. I will report back here when I find a solution. Mike
  7. I have a dynamic page that displays data from an XML file. When the data is edited in the database, the XML file is replaced with new content from the database. The problem is that when the display page reloads (clicking refresh in the browser), the displayed contents do not change, even though the XML file definitely changed. When the XML file is missing, the PHP code behind the scenes makes the appropriate queries to the database to regenerate it. So you would think that when I use FileZilla to go in and manually delete the XML file, then refresh the display page, I would see the latest data... but I don't. It is at least a few hours old and doesn't reflect the revised data, which was changed only minutes before. Here is the function used by the display page to get the XML data...

     function utility_AirframeData_readFileCache($refAirframeSN) {
         $fileURL = "http://www.mywebsite.ca/AirframeDossierData.php?Serial=" . $refAirframeSN . ",Ignore=" . rand(0,10000);
         $fileContents = file_get_contents($fileURL);
         if( simplexml_load_string($fileContents) === FALSE ) {
             utility_AirframeData_deleteFileCache($refAirframeSN);
             $fileContents = file_get_contents($fileURL);
         }
         return new SimpleXMLElement($fileContents);
     }

     You can see that, in an attempt to troubleshoot this, I added a parameter 'Ignore', which is ignored by the URL but presents a different URL to file_get_contents each time. This didn't work. So here is a bit about how AirframeDossierData.php works... It looks to see if the XML file exists and, if it doesn't, makes sure it is built fresh from the data in the database. Either way, once it is sure the XML file is in place, it redirects to the XML file. So file_get_contents asks for the contents of...

     http://www.mywebsite.ca/AirframeDossierData.php?Serial=12345,Ignore=9876

     ...but instead gets redirected to the XML file...

     http://www.mywebsite.ca/Airframe/Data/000/000/012/0000012345.xml

     ...and it is supposed to open this file, and you would hope it opens the file as it is "now". But even when I delete the file and it is freshly regenerated, file_get_contents presents old data. I have opened the XML file with a text editor and proven it is a new version with new data. I added some debugging code to my display page to examine the data parsed by SimpleXMLElement, and it is showing the old content. Has anyone seen anything like this before? Is there a way to force file_get_contents not to cache? I would much prefer that the only cached data be the XML file itself. Thanks in advance for any wisdom you can share. Regards, Mike Henniger
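     P.S. One approach I am considering (just a sketch; the function name and the $localXmlPath parameter are made up for illustration): read the XML straight from its local path instead of going back through the URL, which takes any caching on the HTTP side out of the picture entirely...

     function utility_AirframeData_readLocalCache($refAirframeSN, $localXmlPath) {
         // $localXmlPath would be built the same way AirframeDossierData.php builds it
         if( !file_exists($localXmlPath) ) {
             // Let the existing script regenerate the file from the database first
             file_get_contents("http://www.mywebsite.ca/AirframeDossierData.php?Serial=" . $refAirframeSN);
         }
         clearstatcache(); // make sure the freshly regenerated file is seen
         return simplexml_load_file($localXmlPath);
     }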
  8. Well I learned something new today. Thanks! I added the resource to the mysql_error call, and I finally got the error I was looking for. It pointed me right to the bug. Thanks for the help! Mike
  9. I see where you are going with this. It gave me something to think about. I think I still have a problem with MySQL (I asked the provider for the version number and am waiting to hear back), but let me make my case first. As you suspected, I do have a set of scripts for executing MySQL queries. The scripts manage the databases and users, do some security checks, etc. Here is a support function in that script...

     function logExecutionQueryError($callingFunction, $command, $error) {
         global $user;
         if( $user IS AN ADMINISTRATOR* ) {
             $aLog = $callingFunction . " - ERROR - " .
                     "Failed to execute [" . $command . "] failed with " .
                     "Error [" . $error . "]";
             logError($aLog);
         } else {
             logError("Unfortunately there was an error accessing the database. Please wait a few moments and retry. " .
                      "If the issue persists please use the contact link to inform Aerial Visuals.");
         }
     }

     * NOT ACTUAL CODE

     The logError function is rock solid. It allows a queue of error messages to be stored until the end of script execution, at which time the script may choose to dump them for viewing. I use this for troubleshooting and have been using it for years. No problems there. Here is a snippet of the function which does the actual query execution...

     ...
     $returnValue = mysql_query($command, $connSQL);
     if( $returnValue === FALSE ) {
         $error = mysql_error();
         logExecutionQueryError("UtilitiesGeneral_DBs:executeQuery", $command, $error);
     ...

     I don't give any containers a chance to be stomped before the error gets logged. This is the same execution path as for SELECTs, etc., which I don't have problems with. Using the example from before, I get this in the log...

     UtilitiesGeneral_DBs:executeQuery - ERROR - Failed to execute [UPDATE AirframeEvent SET Ordr=2, AirframeSN=12, AirframeVariantSN=0, DateStart=-820526400, DateStartDesig='Y', DateStartType='C', DateEnd=0, DateEndDesig='', DateEndType='', Ownership='', LocSN=0, LocOrdr=0, MilitarySN='44-73463', MilitaryServiceSN=2, NewConstructionNumber='', CivilRegistration='', CivilRegistrationNotes='', Markings='', PhotoSN=0, EventTypeSN=3, Notes='', SourceDesignation='', SourceUserSN=1, Help='', DateApproved=1238087851, InsertOrdr=0, InsertNotes='', Private='' WHERE Serial=166 LIMIT 1] failed with Error []

     ...and you can see there is no error logged. However, if I try to set just one column...

     UPDATE AirframeEvent SET Ordr="2" WHERE Serial=166 LIMIT 1

     ...I get no execution failure and the result is applied to the value in the database. This of course is good, but it is only one column/field and I would like to set a number of them at one time. Getting back to my original statement in this post, I am thinking it has something to do with the syntax of my multiple-column UPDATE or one of the individual values. I am going to experiment with this. I'll post here if I find something. Again, if anyone thinks of something first I would love to read about it. Thanks again for your help. I appreciate having someplace to use as a sounding board in this solo project of mine. Mike
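     P.S. One small thing I am going to double-check (sketch of the change only, same variables as above): mysql_error() with no argument reports on the last opened connection, so if anything else opens a link in between, the real error can come back blank. Passing the link resource pins it to the connection the query actually ran on...

     $returnValue = mysql_query($command, $connSQL);
     if( $returnValue === FALSE ) {
         $error = mysql_error($connSQL); // read the error from the same connection the query used
         logExecutionQueryError("UtilitiesGeneral_DBs:executeQuery", $command, $error);
     }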
  10. Just FYI...

     BIGINT: A large integer. The signed range is -9223372036854775808 to 9223372036854775807.

     I got this from here... http://dev.mysql.com/doc/refman/5.0/en/numeric-type-overview.html

     I have a few thoughts about things to check out. Still, if anyone has any ideas to share, feel free. TIA Mike
  11. I should have mentioned that I have it set up to spit out the mysql_error() result. It is giving nothing... nada... blank. Not much help there. PHP handles negative ints as dates just fine; they represent dates before Jan 1, 1970. Does MySQL not like negative values for BIGINT? (I'll try to look that up.) Mike
  12. I have a table with this structure...

     int      Serial
     int      Ordr
     int      AirframeSN
     int      AirframeVariantSN
     bigint   DateStart
     varchar  DateStartDesig
     varchar  DateStartType
     bigint   DateEnd
     varchar  DateEndDesig
     varchar  DateEndType
     varchar  Ownership
     int      LocSN
     int      LocOrdr
     varchar  MilitarySerial
     int      MilitaryServiceSN
     varchar  NewConstructionNumber
     varchar  CivilRegistration
     varchar  CivilRegistrationNotes
     varchar  Markings
     int      PhotoSN
     int      EventTypeSN
     varchar  Notes
     varchar  SourceDesignation
     int      SourceUserSN
     varchar  Help
     bigint   DateApproved
     int      InsertOrdr
     varchar  InsertNotes
     varchar  Private

     I am trying to execute this command...

     UPDATE AirframeEvent SET Ordr=2, AirframeSN=12, AirframeVariantSN=0, DateStart=-820526400, DateStartDesig='Y', DateStartType='C', DateEnd=0, DateEndDesig='', DateEndType='', Ownership='', LocSN=0, LocOrdr=0, MilitarySN='44-73463', MilitaryServiceSN=2, NewConstructionNumber='', CivilRegistration='', CivilRegistrationNotes='', Markings='', PhotoSN=0, EventTypeSN=3, Notes='', SourceDesignation='', SourceUserSN=1, Help='', DateApproved=1238087851, InsertOrdr=0, InsertNotes='', Private='' WHERE Serial=166 LIMIT 1

     Unfortunately mysql_query is returning FALSE. I confirmed the user is good for both reads and writes and the password is correct. I don't have much experience with multiple-column UPDATEs. I am sure it is a simple syntax thing, but with all of the examples I have Googled I can't see what I am doing wrong. Can someone point me towards the error? Thanks in advance for any help! Mike
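     P.S. To make long UPDATEs like this easier to eyeball (and easier to compare against the table definition), I am thinking of building them from an array. Just a sketch; buildUpdate is a made-up helper, and it assumes every value is either numeric or a plain string to be escaped...

     function buildUpdate($table, array $columns, $where) {
         $assignments = array();
         foreach( $columns as $name => $value ) {
             // Numbers go in bare; strings get escaped and quoted
             $assignments[] = $name . "=" .
                 (is_numeric($value) ? $value : "'" . mysql_real_escape_string($value) . "'");
         }
         return "UPDATE " . $table . " SET " . implode(", ", $assignments) .
                " WHERE " . $where . " LIMIT 1";
     }

     $command = buildUpdate("AirframeEvent",
                            array("Ordr" => 2, "AirframeSN" => 12, "DateStart" => -820526400 /* ... */),
                            "Serial=166");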
  13. There is no quick, easy-to-use PHP function for Google Maps. It took me a while to learn to integrate Google Maps, but it is well documented and there is a Google Group where you can post your questions. BTW... I found it took some education in JavaScript before I became good at it.
  14. Can anyone point me to some good cURL tutorials? TIA Mike
  15. I am trying to write a page to extract historical data on vintage aircraft from the FAA website. This is my first scraper. I have been successful extracting data from a standard web page, but some of the FAA pages have a JavaScript link which needs to be followed first before getting to the same data format. Here is an example...

     http://registry.faa.gov/aircraftinquiry/NNum_Results.aspx?NNumbertxt=N164TB

     When I look at the source I find this link...

     <a id="_ctl0__ctl0_MainContent_SideMenuContent_lbtnWarning" class="Results_link" href="javascript:__doPostBack('_ctl0$_ctl0$MainContent$SideMenuContent$lbtnWarning','')">Continue</a>

     The source also provides the JavaScript function being called...

     function __doPostBack(eventTarget, eventArgument) {
         if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
             theForm.__EVENTTARGET.value = eventTarget;
             theForm.__EVENTARGUMENT.value = eventArgument;
             theForm.submit();
         }
     }

     So I have been doing some surfing to try to find a way to follow this link. What I decided to do was open the page and submit to the same form the JavaScript does, in an attempt to follow it. By parsing the page I was able to define this parameter set...

     $paramSet = "__EVENTTARGET=_ctl0$_ctl0$MainContent$SideMenuContent$lbtnWarning&__EVENTARGUMENT=";

     I then used the following PHP...

     $c = curl_init();
     $ret = curl_setopt($c, CURLOPT_URL, $page);
     $ret = curl_setopt($c, CURLOPT_POST, TRUE);
     $ret = curl_setopt($c, CURLOPT_POSTFIELDS, $paramSet);
     $ret = curl_setopt($c, CURLOPT_RETURNTRANSFER, TRUE);
     $data = curl_exec($c);
     $data = htmlspecialchars($data);
     curl_close($c);

     ...which gave me this value for $data...

     <html><head><title>Object moved</title></head><body>
     <h2>Object moved to <a href="%2faircraftinquiry%2fLastResort.aspx%3faspxerrorpath%3d%2faircraftinquiry%2fNNum_Results.aspx">here</a>.</h2>
     </body></html>

     I then assumed I was to go to the URL...

     http://registry.faa.gov/aircraftinquiry/LastResort.aspx?aspxerrorpath=/aircraftinquiry/NNum_Results.aspx

     I followed this link as well, but all it tells me is "We Can Not Process Your Request At This Time." So you can see that I am not seeing any obvious errors, but I am not getting the same results with curl that I can get by manually clicking on the JavaScript link. So.......... Does anyone have any suggestions? It would be great if I could get a critique of my technique. Thanks in advance. Mike (the rookie scraper)
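     P.S. A thought on where this might be going wrong (only a guess, and the sketch below is untested against the FAA site; the cookie-jar path and the regexes are rough): ASP.NET pages generally will not accept a __doPostBack-style POST unless the hidden __VIEWSTATE / __EVENTVALIDATION fields from the original page and the session cookie are sent back with it, and the field values need to be URL-encoded (the bare $ characters in my $paramSet probably do not survive as-is)...

     $cookieJar = tempnam(sys_get_temp_dir(), 'faa');

     // First request: fetch the results page and keep its cookies and hidden fields
     $c = curl_init();
     curl_setopt($c, CURLOPT_URL, $page);
     curl_setopt($c, CURLOPT_RETURNTRANSFER, TRUE);
     curl_setopt($c, CURLOPT_COOKIEJAR, $cookieJar);
     curl_setopt($c, CURLOPT_COOKIEFILE, $cookieJar);
     $html = curl_exec($c);

     preg_match('/id="__VIEWSTATE" value="([^"]*)"/', $html, $viewState);
     preg_match('/id="__EVENTVALIDATION" value="([^"]*)"/', $html, $eventValidation);

     // Second request: replay the postback with the hidden fields, properly URL-encoded
     $paramSet = http_build_query(array(
         '__EVENTTARGET'     => '_ctl0$_ctl0$MainContent$SideMenuContent$lbtnWarning',
         '__EVENTARGUMENT'   => '',
         '__VIEWSTATE'       => isset($viewState[1]) ? $viewState[1] : '',
         '__EVENTVALIDATION' => isset($eventValidation[1]) ? $eventValidation[1] : '',
     ));
     curl_setopt($c, CURLOPT_POST, TRUE);
     curl_setopt($c, CURLOPT_POSTFIELDS, $paramSet);
     $data = curl_exec($c);
     curl_close($c);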