
wilbur_wc

Members
  • Posts

    11
  • Joined

  • Last visited

    Never

Everything posted by wilbur_wc

  1. two-part question...

     PART 1: this one is really simple, but for some reason it's just not happening for me. i need to find and remove the spaces between digits, e.g. "9 0 8" would become "908" and "6 9" would become "69". what preg_replace pattern and replacement do i need to achieve this?

     PART 2: this one is a little trickier. i also have numbers like 1,000 that are showing up as "1 ,000" or "1, 000", so i need to get rid of the unwanted space. the catch: if the line containing the number also contains the string "respective", then i need to leave the space intact, because on those lines the comma separates two distinct values rather than acting as a thousands delimiter.

     thanks so much
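
     a minimal sketch of both parts, not a definitive answer. the function names are made up for illustration, and it assumes that "line" means newline-separated text and that "respective" should match case-insensitively (so "Respectively" also counts). part 1 uses lookarounds so only spaces sitting directly between two digits are removed; part 2 only touches a comma that is followed by exactly three digits.

     ```php
     <?php
     // Part 1: drop spaces that sit directly between two digits.
     function joinSpacedDigits(string $text): string
     {
         return preg_replace('/(?<=\d) +(?=\d)/', '', $text);
     }

     // Part 2: tighten "1 ,000" / "1, 000" to "1,000", one line at a time,
     // skipping any line that contains "respective".
     function fixThousandsSpacing(string $text): string
     {
         $lines = explode("\n", $text);
         foreach ($lines as $i => $line) {
             if (stripos($line, 'respective') !== false) {
                 continue; // the comma separates two values here; leave it alone
             }
             // A comma with optional spaces around it, preceded by a digit and
             // followed by exactly three digits.
             $lines[$i] = preg_replace('/(?<=\d) *, *(?=\d{3}\b)/', ',', $line);
         }
         return implode("\n", $lines);
     }

     echo joinSpacedDigits('9 0 8 and 6 9'), "\n";   // 908 and 69
     echo fixThousandsSpacing("total 1 ,000 or 1, 000\n3, 400 and 12 respective"), "\n";
     // total 1,000 or 1,000
     // 3, 400 and 12 respective
     ```

     if both fixes are needed on the same text, running part 2 first keeps part 1 from gluing together values on the "respective" lines only by accident; part 1 as written joins digits on every line.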
  2. i need to read a pdf and convert it to raw text (with line breaks, but that's as fancy as i need it). pdflib does way more than i need and it's super expensive, and i don't really see the need to install an app on my server just to read a simple pdf. there must be an alternative out there, but i can't seem to find it, and when all of php.net seems to reference pdflib, i start to get a little discouraged. it seems like a simple pdf reader class/package would be open source somewhere... any suggestions? thanks
  3. got it working... needed to make one call, set the cookie, and then attempt the download. i'm sure there's some redundancy in there, but it works. more info: http://www.php.net/manual/en/function.curl-setopt.php

         $agent = $_SERVER[ 'HTTP_USER_AGENT' ];
         $ref_url = "http://somesite.com"; // in case they don't allow automated logins
         $data = "handle=username&password=pass"; // syntax pulled from firebug's post

         // start with an empty cookie file
         $fp = fopen( "cookie.txt", "w" );
         fclose( $fp );

         // first call: log in and store the session cookie
         $curl = curl_init();
         curl_setopt( $curl, CURLOPT_URL, "http://somesite.com/login.php" );
         curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 );
         curl_setopt( $curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC );
         curl_setopt( $curl, CURLOPT_USERPWD, "username:pass" );
         curl_setopt( $curl, CURLOPT_USERAGENT, $agent );
         curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
         curl_setopt( $curl, CURLOPT_COOKIEFILE, "cookie.txt" );
         curl_setopt( $curl, CURLOPT_COOKIEJAR, "cookie.txt" );
         curl_setopt( $curl, CURLOPT_SSLVERSION, 3 );
         curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, 0 );
         curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 0 );
         curl_setopt( $curl, CURLOPT_HEADER, true );
         curl_setopt( $curl, CURLOPT_POST, true );
         curl_setopt( $curl, CURLOPT_TIMEOUT, 40 );
         curl_setopt( $curl, CURLOPT_REFERER, $ref_url );
         curl_setopt( $curl, CURLOPT_POSTFIELDS, $data );
         ob_start();
         $result = curl_exec( $curl );
         $error = curl_error( $curl );
         ob_end_clean();
         curl_close( $curl );
         // report errors after the buffer is discarded, otherwise the message is lost
         if( $error ) echo( "<br /><--- cURL ERROR:" . $error . " --->" );
         //echo( "<br /><--- curl:" . $result . " --->" );

         // second call: reuse the cookie and request the pdf
         $curl = curl_init();
         curl_setopt( $curl, CURLOPT_URL, $this->pdfURL );
         curl_setopt( $curl, CURLOPT_RETURNTRANSFER, 1 );
         curl_setopt( $curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC );
         curl_setopt( $curl, CURLOPT_USERPWD, "username:pass" );
         curl_setopt( $curl, CURLOPT_USERAGENT, $agent );
         curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
         curl_setopt( $curl, CURLOPT_COOKIEFILE, "cookie.txt" );
         curl_setopt( $curl, CURLOPT_COOKIEJAR, "cookie.txt" );
         curl_setopt( $curl, CURLOPT_SSLVERSION, 3 );
         curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, 0 );
         curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 0 );
         curl_setopt( $curl, CURLOPT_HEADER, true );
         curl_setopt( $curl, CURLOPT_POST, true );
         curl_setopt( $curl, CURLOPT_TIMEOUT, 40 );
         curl_setopt( $curl, CURLOPT_REFERER, $ref_url );
         curl_setopt( $curl, CURLOPT_POSTFIELDS, $data );
         ob_start();
         $result = curl_exec( $curl );
         $error = curl_error( $curl );
         ob_end_clean();
         curl_close( $curl );
         if( $error ) echo( "<br /><--- cURL ERROR:" . $error . " --->" );
         echo( "<br /><--- curl:" . $result . " --->" );
  4. oh, damn... now that i look at it, i see that it's redirecting the same way it does when you do it all manually through the browser. e.g. if you try the download before logging in, you're prompted for your login, but once the login is accepted it sends you to a different entry page (the uri listed in the 302 response), from which you have to navigate back to the pdf download.

     is there a way to establish a session (the way a browser does once you've logged in) and then attempt the download? or simply re-attempt the download without losing the logged-in status? thanks
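
     keeping the logged-in state between two requests usually comes down to pointing both at the same cookie file. a rough sketch under that assumption: sessionOptions() and request() are hypothetical helper names, and the url and form fields in the usage comment are placeholders, not a real site's api.

     ```php
     <?php
     // Options shared by both requests; the cookie file carries the session
     // from the login call over to the download call.
     function sessionOptions(string $url, string $cookieFile): array
     {
         return [
             CURLOPT_URL            => $url,
             CURLOPT_RETURNTRANSFER => true,        // return the body as a string
             CURLOPT_FOLLOWLOCATION => true,        // follow 302 redirects
             CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored here
             CURLOPT_COOKIEJAR      => $cookieFile, // save received cookies here
             CURLOPT_TIMEOUT        => 40,
         ];
     }

     // Run one request and return the response body, or throw on failure.
     function request(array $options): string
     {
         $curl = curl_init();
         curl_setopt_array($curl, $options);
         $result = curl_exec($curl);
         $error = curl_error($curl);
         unset($curl); // releasing the handle writes the cookie jar to disk
         if ($result === false) {
             throw new RuntimeException('cURL error: ' . $error);
         }
         return $result;
     }

     // Usage (placeholders, needs a real site to run against):
     // $cookies = tempnam(sys_get_temp_dir(), 'cj');
     // request(sessionOptions('http://somesite.com/login.php', $cookies) + [
     //     CURLOPT_POST       => true,
     //     CURLOPT_POSTFIELDS => 'handle=username&password=pass',
     // ]);
     // $pdf = request(sessionOptions($pdfUrl, $cookies)); // plain GET this time
     ```

     the login call is a POST and the download is a plain GET; only the cookie file is shared between them.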
  5. i added that, as well as a couple of other options...

         curl_setopt( $curl, CURLOPT_HEADER, true );
         curl_setopt( $curl, CURLOPT_POST, true );
         curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );

     and now curl_exec returns:

         HTTP/1.1 302 Found
         Date: Tue, 31 Aug 2010 05:39:29 GMT
         Server: Apache
         X-Powered-By: PHP/5.2.6
         Location: /archive/2010/08/page/0001
         Cache-Control: max-age=14400
         Expires: Tue, 31 Aug 2010 09:39:29 GMT
         Content-Length: 0
         Content-Type: text/html

     does this indicate that it found the pdf? the content length of 0 concerns me. how do i write the pdf file contents to a local variable? the pdf could be up to 500kb... won't i need to wait until it's been loaded, with some sort of oncomplete callback? thanks
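
     a 302 with a content length of 0 is a redirect, not the pdf: the body is empty and the Location header says where the server wants the client to go next. a small sketch for pulling those two pieces out of a raw response captured with CURLOPT_HEADER on. parseRedirect() is a made-up name, and it only looks at the first status line it finds.

     ```php
     <?php
     // Extract the status code and Location header from a raw HTTP response.
     function parseRedirect(string $rawResponse): array
     {
         $status = null;
         $location = null;
         if (preg_match('#^HTTP/\S+\s+(\d{3})#', $rawResponse, $m)) {
             $status = (int) $m[1];
         }
         if (preg_match('/^Location:\s*(\S+)/mi', $rawResponse, $m)) {
             $location = $m[1];
         }
         return ['status' => $status, 'location' => $location];
     }

     $raw = "HTTP/1.1 302 Found\r\n"
          . "Server: Apache\r\n"
          . "Location: /archive/2010/08/page/0001\r\n"
          . "Content-Length: 0\r\n\r\n";

     $info = parseRedirect($raw);
     echo $info['status'], ' -> ', $info['location'], "\n"; // 302 -> /archive/2010/08/page/0001
     ```

     when only the status code is needed, curl_getinfo( $curl, CURLINFO_HTTP_CODE ) after curl_exec gives it without any parsing.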
  6. hmm... i've been playing around with curl, and it looks like my login is working fine, but i'm still at a loss as to how i can then load a pdf file (a large file) into a variable and know when the pdf is ready to be read. this is what i'm doing, and evidently curl_exec returns true...

         $curl = curl_init();
         curl_setopt( $curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC );
         curl_setopt( $curl, CURLOPT_USERPWD, "user:pass" );
         curl_setopt( $curl, CURLOPT_URL, $this->pdfURL );
         curl_exec( $curl );

     how do i write the pdf file to a variable (fopen/fread aren't working)? how do i track the progress of the pdf download/write? thanks
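
     curl_exec returns true here because, without CURLOPT_RETURNTRANSFER, the response is printed straight to the output and only a success flag comes back. with that option set, the return value is the body itself. the call also blocks until the transfer is done, so no completion callback is needed for a file of this size. a minimal sketch, with fetchBody() as a hypothetical helper name:

     ```php
     <?php
     // Fetch a URL and return the response body as a string.
     // curl_exec() does not return until the transfer has finished.
     function fetchBody(string $url, string $userPwd = ''): string
     {
         $curl = curl_init($url);
         curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
         curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
         curl_setopt($curl, CURLOPT_TIMEOUT, 40);
         if ($userPwd !== '') {
             curl_setopt($curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
             curl_setopt($curl, CURLOPT_USERPWD, $userPwd);
         }
         $body = curl_exec($curl);
         if ($body === false) {
             throw new RuntimeException('cURL error: ' . curl_error($curl));
         }
         return $body;
     }

     // Usage (needs a reachable url):
     // $pdf = fetchBody($pdfUrl, 'user:pass');
     // file_put_contents('download.pdf', $pdf);
     ```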
  7. i'm new to php and don't really know where to start here... i'm automating a system that scrapes a site for a particular pdf download link (got this far), downloads it, parses the pdf, etc. the problem is that you must be logged in (when viewing in the browser) in order to access the pdf. if you're not logged in, you're redirected and the download fails. i do have a proper login... how would i go about using my login in order to automate the pdf download? is there a way to send the login with the url request? or open a stream, log in, and retry the download? thanks
  8. thanks PFMaBiSmAd... that does the job, and then some... great class and i'm already up and running with it. however, i'm still curious if there's a regex solution. thanks
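
     on the regex question, one possible sketch for the markup quoted in the original post: anchor on the title text, then lazily skip ahead to the next link whose href ends in .pdf. findPdfUrl() is a made-up name. it assumes double-quoted attributes, and if the flagged item has no pdf link it will happily match one from a later item, which is why a parser is usually the safer route.

     ```php
     <?php
     // Return the first .pdf href that follows the given title text, or null.
     function findPdfUrl(string $html, string $flag): ?string
     {
         $pattern = '/' . preg_quote($flag, '/')
                  . '.*?<a\b[^>]*\bhref="([^"]+\.pdf)"/s';
         return preg_match($pattern, $html, $m) ? $m[1] : null;
     }

     $html = '<div class="content-item ">
         <div class="title"><a href="xxxxxxx">THIS IS MY SEARCH FLAG</a></div>
         <div class="tags"></div>
         <a title="View the PDF version of this article" href="this-is-the-url-i-want-to-pull.pdf" class="pdf-link">PDF</a>
     </div>';

     echo findPdfUrl($html, 'THIS IS MY SEARCH FLAG'), "\n"; // this-is-the-url-i-want-to-pull.pdf
     ```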
  9. and one note... the number of divs/tags within the main div is not constant.
  10. i'm trying to parse an html page and retrieve one div (or better yet, one specific item from within the div). here's the div that i'm looking for...

          <div class="content-item ">
              <div class="type">XXX</div>
              <div class="title"><a href="xxxxxxx">THIS IS MY SEARCH FLAG</a></div>
              <i></i>
              <div class="tags"></div>
              <a title="View the PDF version of this article" href="this-is-the-url-i-want-to-pull.pdf" class="pdf-link"><img alt="PDF" src="xxxxx" class="xxxx">PDF</a>
              <a title="xxxx" href="xxx" class="xx"><img alt="xxxx" src="xxxxx" class="xxxxx">XXXXX</a>
          </div>

      you can see that my query constant (the only thing i can consistently depend on existing in the same format) is the string represented in the html as 'THIS IS MY SEARCH FLAG', and the item i ultimately want to return is the url represented by 'this-is-the-url-i-want-to-pull.pdf'.

      i'm new to php, and regex is always something of a trial-and-error exercise for me anyhow... any help would be greatly appreciated. thanks
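
      one way to approach this without regex is php's built-in DOMDocument and DOMXPath. a sketch only, assuming the class names shown in the post (content-item, title, pdf-link) are stable; findPdfUrlDom() is a hypothetical name. because it searches inside each content-item div, the varying number of child tags doesn't matter.

      ```php
      <?php
      // Find the content-item whose title link text equals $flag and
      // return the href of its pdf-link anchor, or null if not found.
      function findPdfUrlDom(string $html, string $flag): ?string
      {
          $doc = new DOMDocument();
          // Scraped HTML is rarely valid; collect parser warnings quietly.
          $previous = libxml_use_internal_errors(true);
          $doc->loadHTML($html);
          libxml_clear_errors();
          libxml_use_internal_errors($previous);

          $xpath = new DOMXPath($doc);
          // Match the class as a whole word, tolerating extra spaces.
          $hasClass = 'contains(concat(" ", normalize-space(@class), " "), " %s ")';
          $items = $xpath->query('//div[' . sprintf($hasClass, 'content-item') . ']');

          foreach ($items as $item) {
              $title = $xpath->query('.//div[@class="title"]/a', $item)->item(0);
              if ($title === null || trim($title->textContent) !== $flag) {
                  continue; // not the item we are looking for
              }
              $link = $xpath->query('.//a[' . sprintf($hasClass, 'pdf-link') . ']', $item)->item(0);
              return $link !== null ? $link->getAttribute('href') : null;
          }
          return null;
      }

      $html = '<div class="content-item ">
          <div class="type">XXX</div>
          <div class="title"><a href="xxxxxxx">THIS IS MY SEARCH FLAG</a></div>
          <i></i>
          <div class="tags"></div>
          <a title="View the PDF version of this article" href="this-is-the-url-i-want-to-pull.pdf" class="pdf-link"><img alt="PDF" src="xxxxx" class="xxxx">PDF</a>
          <a title="xxxx" href="xxx" class="xx"><img alt="xxxx" src="xxxxx" class="xxxxx">XXXXX</a>
      </div>';

      echo findPdfUrlDom($html, 'THIS IS MY SEARCH FLAG'), "\n"; // this-is-the-url-i-want-to-pull.pdf
      ```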