highrise Posted June 23, 2007 Share Posted June 23, 2007 Hi Guys I'm quite new to PHP and I have a question. I have a MySQL database which has a load of musicians on it. Each musician has a field that stores their influences, which are separated by commas, e.g: The Beatles, The Beach Boys, Kiss etc etc. Now, what I would like to do is start with the influences of one musician, and then compare it to all the others to find the closest matches - i.e find the musicians that have (say) more than three influences i common. This would then allow you to see which musicians had the most influences in common with the primary one, the one you started with. Here's what I have in mind, it seems a little laborious - load the influences of the primary musician into an array using the explode command. Then step through each of the other musicians in a loop, each time putting their influences in to an array and checking to see how many matches there are between the two arrays. Whenever there is at least one match, the name of the musician and the number of matches would be put into a third array. This array would then be sorted at the end, so that the musicians with the most matches would be at the top, and the script could then display them. This would probably work, but if looking at several hundred or several thousand musicians, it would surely be quite slow - is there a better way? I don't particularly want to use a separate, normalised table at this point - I know it would be the quickest way but it would involve a lot of restructuring for me. I know that PHP has several clever inbuilt routines that might do this job for me, so if anyone can suggest anything I'd appreciate it. Thanks in advance for the help! Quote Link to comment Share on other sites More sharing options...
Wildbug Posted June 23, 2007 Share Posted June 23, 2007 Here's what I have in mind, it seems a little laborious ...*snip*[Yep]- .... is there a better way? [Yep, use a separate table] I don't particularly want to use a separate, normalised table at this point - I know it would be the quickest way [Yep] but it would involve a lot of restructuring for me. [Whoops] I'd recommend making the separate table, recoding, and chalking up the experience to planning ahead next time. However, if you still really want to keep that structure (and you probably do), I'd recommend against pulling all the data into PHP to analyze it. You should just grab enough to manufacture a query and use that to get results. (This code is a quick exercise in concept; you'll need to flesh it out with error checking, formatting, etc.) <?php // Find current user's influences $influences = mysql_result(mysql_query('SELECT influences FROM users WHERE id=' . $id),0); // Assemble a query by breaking the influences string into seperate "FIND_IN_SET('artist',influences) OR..." conditions $result = mysql_query("SELECT id,username,influences FROM users WHERE FIND_IN_SET('" . str_replace(',',"',influences) OR FIND_IN_SET('",$influences) . "',influences) AND id != $id GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC"); // Get those results if ($result && mysql_num_rows($result)) while ($row = mysql_fetch_assoc($result)) echo "$row[username] $row[influences]<br />\n"; ?> Otherwise, with a table containing userid and a single influence in a row you could use a query such as: SELECT b.uid,users.username,GROUP_CONCAT(DISTINCT b.artist SEPARATOR ', ') AS MatchingInfluences FROM influences a JOIN influences b ON a.uid!=b.uid AND a.artist=b.artist AND a.uid=$id JOIN users ON b.uid=users.user_id GROUP BY b.uid HAVING COUNT(*) > 1; Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 thanks! I'll have a play about with both those solutions, I think I can learn a lot from experimenting with each of them. Quote Link to comment Share on other sites More sharing options...
Wildbug Posted June 24, 2007 Share Posted June 24, 2007 Another problem you will have is if these artist are user-inserted, that is you have musicians typing in the names of bands. That means that there will be someone who likes "Guns 'n Roses" who will never find someone who likes "Guns & Roses" or the person who loves "GnR". Not to mention the goober who can't spell: "gunz and rosses". Not that I have a solution, but you should be aware of that. Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 one thing I was wondering about though - if I decide to go for a separate table (well, two tables, one for the influences, the other for the relationships, i.e user id / influence id), how would I go about converting them? I have about 3,000+ records all with the data in this comma delimited format. thanks Quote Link to comment Share on other sites More sharing options...
Wildbug Posted June 24, 2007 Share Posted June 24, 2007 Not too hard; you could write a PHP script to do it. In fact, I'm not sure you could do it in MySQL purely. Since it's a one-time conversion, don't worry about performance or anything. You can write a query to read in the data, explode it, check if the influence is already inserted, if not insert it, if so get id, insert uid/influence_id.... etc. Back up first. Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 thanks, I'll look into that. I probably don't need to back up if I leave the original data intact. One thing though, I will probably make the shift over to a table, but in the meantime I am having trouble with this line: $result = mysql_query("SELECT id,username,influences FROM users WHERE FIND_IN_SET('" . str_replace(',',"',influences) OR FIND_IN_SET('",$influences) . "',influences) AND id != $id GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC"); I have highlighted in bold where I think the problem lies - any ideas? Thanks again Quote Link to comment Share on other sites More sharing options...
Wildbug Posted June 24, 2007 Share Posted June 24, 2007 This snippet: <?php $influences = 'King\'s X,Spock\'s Beard,Yoko Ono'; $query = "SELECT id,username,influences FROM users WHERE FIND_IN_SET('" . str_replace( ',', "',influences) OR FIND_IN_SET('", addslashes($influences)) . "',influences) AND id != $id GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC"; echo "$query\n"; ?> Yields this output: SELECT id,username,influences FROM users WHERE FIND_IN_SET('King\'s X',influences) OR FIND_IN_SET('Spock\'s Beard',influences) OR FIND_IN_SET('Yoko Ono',influences) AND id != GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC I added addslashes() in case of apostrophes. You probably should use mysql_real_escape_string() there. So, in other words, it seems to be working for me. What errors/problems are you finding? Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 it's working now, I was having a few minor problems replacing a couple of field names, it's good! thanks. One thing I am struggling a little bit with though, is how I would fit the MySQL 'Trim' command into that string - sometimes the influences have leading spaces, other times they don't, I'd like to remove them. Quote Link to comment Share on other sites More sharing options...
Wildbug Posted June 24, 2007 Share Posted June 24, 2007 You should have the "clean up" in PHP before inserting the data into MySQL -- trim whitespace, mysql_real_escape_string, etc. If you have " bandname;bandname2;bandname3 ", you can use UPDATE users SET influences = TRIM(influences) and you'll get "bandname;bandname2;bandname3" If you have "bandname ; bandname2; bandname3", I don't think you can fix that in MySQL; you'll have to do a one-time PHP cleaner function (take them out, use preg_replace to find the whitespace around the semi-colon, write the values again), and make changes to your usual PHP to clean the names before they are put in a string and inserted. Well, maybe you could do it all in MySQL, but it'd be a pain if there are alot of spaces. UPDATE users SET influences = TRIM(influences); # Repeat each of these next two lines until no rows match UPDATE users SET influences = REPLACE(influences,' ;',';') WHERE influences LIKE '% ;%'; UPDATE users SET influences = REPLACE(influences,'; ',';') WHERE influences LIKE '%; %'; Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 it's not a lot of spaces, it's just that sometimes people input the influences like this: beatles, rolling stones, the who and other times like this beatles,rolling stones,the who so when it does the query, it does this: ' beatles', instead of 'beatles' (use of 'the' is a whole other problem of course) What I'm finding is that it is working, but sometimes you look at one users influences, and you might get one user listed, but if you look at that user, the first user is not listed as having it in common. I really appreciate the help by the way, thanks for taking the time, I'm learning as I go along Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 I added the following line: $influences =str_replace(", ",",",$influences); which removes the leading spaces after the commas. But now it only picks up those users with influence lists without commas. The problem here is this: beatles, rolling stones, the who is actually "beatles"," rolling stones","the who" and so is different from this: rolling stones, the beatles, the who since this is actually "rolling stones"," beatles","the who" if one user wrote this one way and one the other, the only match it would find would be 'the who'. I need to remove the leading spaces from the SQL query. I don't really want to change the actual data held in the record though, as this is used for display purposes. Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 ok, I think I know how to solve the problem, but I can't quite figure out how to do it. Within the string replace function, I need to set it up so that it strips the original influences and then looks, for example, for both '$influences' and ' $influences' (with a leading space). In your example, this would come out like this: SELECT id,username,influences FROM users WHERE FIND_IN_SET('King\'s X',influences) OR FIND_IN_SET(' King\'s X',influences) OR FIND_IN_SET('Spock\'s Beard',influences) OR FIND_IN_SET(' Spock\'s Beard',influences) OR FIND_IN_SET('Yoko Ono',influences) OR FIND_IN_SET(' Yoko Ono',influences) AND id != GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC see what I mean? There should be a way of putting this into the code, but I have not quite managed it yet. Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 ok, I got it to work - the final query was : $query = "SELECT id,name,influences FROM users WHERE ((FIND_IN_SET('" . str_replace( ',', "',influences) OR FIND_IN_SET('",addslashes($influences)). "',influences) OR (FIND_IN_SET('" . str_replace( ',', "',influences) OR FIND_IN_SET(' ",addslashes($influences)). "',influences)))) AND id <> '$id' GROUP BY features.id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC"; I added second line to the query. It strips out the original influences of all leading spaces, and then looks for occurences of each influence, with out without a leading space. Success! Quote Link to comment Share on other sites More sharing options...
highrise Posted June 24, 2007 Author Share Posted June 24, 2007 Gah... still a couple of problems. Firstly, I have the same issue when comparing two users - it works one way but not the other. Secondly, is this query case sensitive? Quote Link to comment Share on other sites More sharing options...
Wildbug Posted June 25, 2007 Share Posted June 25, 2007 You shouldn't be using a "static" query like that. In fact, it shouldn't work at all..... can you print out the $query there and post? And what do you mean by "it works one way but not the other?" MySQL isn't case sensitive by default; you have to use the BINARY operator to force it. I'm going to change something in the SQL -- I benchmarked using FIND_IN_SET vs. LIKE '%%', and the LIKE was about 30% faster. <?php $influences = 'King\'s X,Spock\'s Beard,Yoko Ono'; $query = "SELECT id,username,influences FROM users WHERE influences LIKE '%" . str_replace( ',', "%' OR influences LIKE '%", mysql_real_escape_string(preg_replace('/\bthe /i','',$influences)) . "%' AND id != $id GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC"; echo "$query\n"; ?> (output:) SELECT id,username,influences FROM users WHERE influences LIKE '%King\'s X%' OR influences LIKE '%Spock\'s Beard%' OR influences LIKE '%Yoko Ono%' AND id != GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC The str_replace replaces each comma with part of an SQL clause, making a string of "influences LIKE '%...%'" comparisons. You won't have to worry about spaces with the LIKE query, unless a user double spaces within a band name. Quote Link to comment Share on other sites More sharing options...
Wildbug Posted June 25, 2007 Share Posted June 25, 2007 Correction. My code is missing a parenthesis. I'm also going to just throw everything in a preg_replace. That'll take care of spaces, a leading "the", and insert the "OR influences..." part. <?php $id = 1; $influences = 'King\'s X,Spock\'s Beard,Yoko Ono,The Beatles,Thespians'; $query = "SELECT id,username,influences FROM users WHERE influences LIKE '%" . preg_replace( array( '/\bthe /i', '/\s*,\s*/' ),array( '', "%' OR influences LIKE '%" ), mysql_real_escape_string($influences)) . "%' AND id != $id GROUP BY id HAVING COUNT(*) >= 3 ORDER BY COUNT(*) DESC"; echo "$query\n"; ?> Quote Link to comment Share on other sites More sharing options...
highrise Posted July 5, 2007 Author Share Posted July 5, 2007 thanks, and sorry, I've been away - I'll try your suggestion though, it looks like a more elegant solution. Quote Link to comment Share on other sites More sharing options...
highrise Posted July 5, 2007 Author Share Posted July 5, 2007 was about to implement this code, but then realised that there is a potential issue. If someone puts an influence in like Prince, surely this method would return Prince, Prince Buster, Fresh Prince? Quote Link to comment Share on other sites More sharing options...
Wildbug Posted July 5, 2007 Share Posted July 5, 2007 Correct, the LIKE %..% is a substring match solution. If you really want exact matching, use the FIND_IN_SET method, but realize that the leading "the" problem will still cause you trouble. (Pssst, *separate table*) Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.