5kyy8lu3 Posted December 20, 2014

Hi. I've been banging my head against the wall with this stupid problem and I just figured out the cause.

I scrape data from a website. I query my table to see if I already have everything in there. If no match is found, I insert it. Today I noticed that even after inserting, my script kept telling me there was a new entry to insert, despite the row actually being there when I check the table by hand. When I echo the query and run it in phpMyAdmin, it finds the row, but not when the same query runs from my script.

Turns out there are several invisible characters in my string. When I do var_dump(), it says the length is 25. When I copy and paste the string into Notepad, then back into a fresh PHP script, wrap it in quotes and echo out a strlen(), I get 21. So there are apparently 4 characters that I can't see. I trim() everything, so trim() evidently didn't catch them.

So... is there a good way to "clean" my data before inserting or comparing it, to avoid this in the future? This wasted a ton of time and I hope to find a way to clean this junk out of my data. Thanks!

It just seems like I run into this sort of thing often when it's data scraped from the web. (It breaks my regex! Grr!)

Thread: https://forums.phpfreaks.com/topic/293195-help-cleaning-invidible-characters-out-of-scraped-data/
requinix Posted December 20, 2014

What does var_dump() show and what is the bin2hex of it?
5kyy8lu3 Posted December 20, 2014

This is what var_dump() shows:

string(25) "4 Hands Sugar & Spice"

strlen() shows this:

21

And I get nothing when I bin2hex() the containing variable.

Oh, and I should mention that when I do the strlen(), I'm copying and pasting the browser output back into the script. When I wrap strlen() around the variable itself, I get 25, just like var_dump() shows.
Jacques1 Posted December 20, 2014

You need to output the result of bin2hex(). As the name already says, this function converts a binary string into a hexadecimal representation. Now it's your job to print this on the screen.

Also note that the output in your browser is not an exact representation of the actual text. For example, all whitespace (tabs, spaces, newlines etc.) is “folded” into a single space character. If you want to know what the actual text looks like, you need to look at the HTML source code.

Last but not least, do you understand the concept of character encodings? You generally can't just take text from another site and use it straight in your own application. You have to determine the source encoding, compare it with your own encoding and, if necessary, transcode the data.
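To make the advice above concrete, here is a minimal sketch of inspecting a string so that the browser cannot hide anything. The function name inspect_string() and the sample value are illustrative assumptions, not code from the thread; strlen(), bin2hex() and htmlspecialchars() are standard PHP functions.

```php
<?php
// Hypothetical sample value standing in for whatever the scraper returned.
$scraped = '4 Hands Sugar &amp; Spice';

// Hypothetical helper: report the byte length and a hex dump of a string.
// Neither value can be altered by browser rendering.
function inspect_string(string $s): string
{
    return sprintf('%d bytes: %s', strlen($s), bin2hex($s));
}

// Escape before echoing so the browser shows the raw characters
// instead of interpreting entities or markup.
echo htmlspecialchars(inspect_string($scraped)), "\n";
echo htmlspecialchars($scraped), "\n";
```

Viewing the page source, or running the script from the command line, shows the same raw bytes without the escaping step.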
Frank_b Posted December 20, 2014

Stealing data from somebody else?
5kyy8lu3 Posted December 20, 2014

Replying to Jacques1: Oh, I didn't think about having to check the source to see the output. I was just looking at the browser output and it was blank. This is the output of the bin2hex():

342048616e64732053756761722026616d703b205370696365

My knowledge of character encodings is admittedly limited.

Replying to Frank_b ("Stealing data from somebody else?"): Scraping public data is not stealing lol

Also, if you want to know more specifically what I'm doing, it's a website that my friends and I use to keep track of the beers we've had for Flying Saucer's "UFO Club". It gets really difficult to know which beers we've had once you get past about 50, so I made this site. To avoid data entry errors, instead of letting users add new beers that aren't found when they search, I scrape Saucer's "beer menu" to get a full list of what's in stock. In fact, my "menu" ends up being more accurate than theirs because the printed menus are often a couple of days old.
Barand Posted December 20, 2014

The & is actually &amp; which gives the extra four characters:

34 20 48 61 6e 64 73 20 53 75 67 61 72 20 26 61 6d 70 3b 20 53 70 69 63 65
4     H  a  n  d  s     S  u  g  a  r     &  a  m  p  ;     S  p  i  c  e
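Given that diagnosis, one possible way to "clean" scraped text before comparing or inserting it is sketched below. The helper name clean_scraped() is hypothetical and not from the thread. It assumes the input is already UTF-8 (per the earlier point about encodings, transcode first if the source site uses something else), and it only decodes one level of entities.

```php
<?php
// Hypothetical helper: normalise a scraped label before comparing or
// inserting it. Assumes $raw is valid UTF-8.
function clean_scraped(string $raw): string
{
    // Turn HTML entities such as &amp; or &nbsp; back into real characters.
    $text = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace, including the non-breaking space
    // (U+00A0) that trim() ignores, into a single ordinary space.
    // preg_replace() returns null on invalid UTF-8; keep the text as-is then.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;

    return trim($text);
}

echo strlen(clean_scraped('4 Hands Sugar &amp; Spice')), "\n";
```

Applying the same function to the value before the lookup query and before the insert should keep the two sides comparable, whatever markup the source page happens to use.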