Help cleaning invisible characters out of scraped data


5kyy8lu3

Hi.  I've been banging my head against the wall with this stupid problem, and I just figured out the cause.

 

I scrape data from a website.  I query my table to see if I already have everything in there.  If no match is found, I insert it. 

 

Today, I noticed even after inserting, my script kept telling me there's a new entry I need to insert, despite it actually being there when I physically check the table.

 

When I echo the query out and run it in phpMyAdmin, it finds the row, but not when the script runs the same query itself.

 

Turns out, there are several invisible characters in my string.  When I do var_dump(), it says length is 25.  When I copy and paste the string into notepad, then back into a fresh php script and wrap it in quotes and echo out a strlen(), I get 21.  There are apparently 4 invisible characters that I can't see.  I trim() everything, so that apparently didn't catch it.

 

So... is there a good way to "clean" my data before inserting or comparing it to avoid this in the future?  This wasted a ton of time and I hope to find a way to clean this junk out of my data.

 

Thanks!  It just seems like I run into this sort of thing often when it's data scraped from the web.  (breaks my regex! grr!)


This is what var_dump() shows:

 

string(25) "4 Hands Sugar & Spice"

 

strlen() shows this:

21

 

and I get nothing when I bin2hex the containing variable

 

Oh, and I should mention that when I do the strlen(), I'm copying and pasting the browser output back into a script to do the strlen().  When I wrap strlen() around the variable itself, I get 25, just like var_dump() shows.

Edited by 5kyy8lu3

You need to output the result of bin2hex(). As the name already says, this function converts a binary string into a hexadecimal representation. Now it's your job to print this on the screen.
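A minimal sketch of what printing the hex dump could look like. The variable name `$name` and its value (with an embedded tab) are made up for illustration; substitute whatever variable holds the scraped string.

```php
<?php
// Hypothetical scraped value; in the real script this comes from the scraper.
$name = "4 Hands\tSugar";

// bin2hex() only returns the hex string; nothing is shown until it is echoed.
$hex = bin2hex($name);
echo $hex, PHP_EOL;

// Splitting into byte pairs makes an unexpected byte (here 09, a tab) easier to spot.
$pairs = implode(' ', str_split($hex, 2));
echo $pairs, PHP_EOL;
```

Any byte pair that does not correspond to a character you can see in the rendered text is a candidate for the "invisible" data.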

 

Also note that the output in your browser is not an exact representation of the actual text. For example, all whitespace (tabs, spaces, newlines etc.) is “folded” into a single space character. If you want to know what the actual text looks like, you need to look at the HTML source code.
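As an alternative to opening the page source, a debugging script can escape the value before echoing it, so the browser shows the raw characters rather than interpreting them. This is only a sketch; `$name` and its contents are assumed for illustration.

```php
<?php
// Hypothetical scraped value containing an HTML entity and a run of spaces.
$name = "Sugar &amp;   Spice";

// Escaping the string makes the browser display the raw characters
// instead of treating them as markup.
$visible = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// <pre> stops the browser from collapsing the run of spaces.
echo '<pre>', $visible, '</pre>', PHP_EOL;

// strlen() counts bytes in the raw string, not what the browser renders.
echo strlen($name), PHP_EOL;
```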

 

Last but not least, do you understand the concept of character encodings? You generally can't just take text from another site and use it straight in your own application. You have to determine the source encoding, compare it with your own encoding and, if necessary, transcode the data.
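A rough sketch of the transcoding step, assuming the mbstring extension is available. The helper name `to_utf8()`, the sample value and the choice of ISO-8859-1 as the source encoding are all assumptions; the real source encoding has to be taken from the scraped site's response headers or markup.

```php
<?php
// Hypothetical helper: bring a scraped string into UTF-8.
function to_utf8(string $raw, string $sourceEncoding): string
{
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        // Nothing to transcode, but reject byte sequences that are not valid UTF-8.
        if (!mb_check_encoding($raw, 'UTF-8')) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return $raw;
    }

    // Convert from the declared source encoding to UTF-8.
    return mb_convert_encoding($raw, 'UTF-8', $sourceEncoding);
}

// "Kölsch" as ISO-8859-1 bytes (assumed example value).
$utf8 = to_utf8("K\xF6lsch", 'ISO-8859-1');
echo bin2hex($utf8), PHP_EOL;
```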

Edited by Jacques1

Oh, I didn't think about having to check the source to see the output.  I was just looking at the browser output and it was blank.  This is the output of the bin2hex(): 342048616e64732053756761722026616d703b205370696365
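Decoding that hex dump points at where the four extra characters come from: the bytes `26 61 6d 70 3b` spell out the HTML entity `&amp;`, which the browser renders as a single `&`. The sketch below reverses the dump and shows one possible cleanup. The helper name `clean_scraped()` is made up, and whether entities should be decoded before storing depends on how the rest of the application treats the data.

```php
<?php
// Hex dump from the post above, copied verbatim.
$hex = '342048616e64732053756761722026616d703b205370696365';

// Turn the hex back into the original bytes.
$raw = hex2bin($hex);
echo $raw, PHP_EOL;          // contains the literal entity when viewed as plain text
echo strlen($raw), PHP_EOL;  // byte length of the scraped value

// Hypothetical helper: decode HTML entities, then trim surrounding whitespace.
function clean_scraped(string $value): string
{
    $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($decoded);
}

$clean = clean_scraped($raw);
echo $clean, PHP_EOL;
echo strlen($clean), PHP_EOL;
```

Whatever cleanup is chosen, it needs to be applied the same way before both the insert and the lookup, otherwise the comparison will keep missing rows that are already in the table.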

 

My knowledge of character encodings is admittedly limited.

 

 

 

Stealing data from somebody else? :confused:

 

Scraping public data is not stealing lol

 

Also, if you want to know more specifically what I'm doing, it's a website that my friends and I use to keep track of the beers we've had for Flying Saucer's "UFO Club".  It gets really difficult to know which beers we've had after you get past about 50, so I made this site.  To avoid data entry errors, instead of letting users add new beers that aren't found when they do a search, I scrape Saucer's "beer menu" to get a full list of what's in stock.  In fact, my "menu" ends up being more accurate than theirs because the printed menus are often a couple of days old.

