[SOLVED] PHP comparison, loops, arrays and file double deletion.


eslachance

Hello all!

 

I'm working on a one-time, hard-coded script (reusable only by editing the code) that goes through a directory of karaoke song files, some of which are doubles. The doubles do not have exactly the same name, but their alphanumeric characters mostly match (for example, "Gimme some lovin'" and "Gimme some loving" are considered the same).

 

To do this, I put the contents of the directory into an array. Each entry holds the song's artist and title, as well as a "stripped" version of the artist and title. The stripped version removes all characters that are not alphanumeric.

 

There are two issues I have to fix before running the code.

 

1. The first one is that some doubles are not detected properly. For example, the following are not detected as being the same:

Amy Grant - Baby Baby (amygrant - babybaby)

Amy Grant - Baby, Baby (amygrant - babybaby)

 

Zz Top - Gimme All Your Lovin' (zztop - gimmeallyourlovin)

Zz Top - Gimme All Your Lovin (zztop - gimmeallyourlovin)

 

Each line above shows artist, title, strippedartist and strippedtitle. The comparison wants strippedartist and strippedtitle to be the same between the two entries, while artist and title have to be different (so that an entry ignores "itself" when compared against the same array).

 

2. The second problem is that I cannot for the life of me figure out a simple way to delete one of the doubles but not both. Since I store the contents of the folder in an array (to avoid reading the folder 18,750 times, which is the number of songs I have), if I delete a song, the script will still "see" it in the array when it reaches the second double, and will delete that one as well.

 

I thought of three solutions and more or less rejected them all. The first is to put all the double filenames in an array (so that each song is in there) and just delete every odd or even entry, but the doubles aren't actually ordered properly, so that would mess everything up. The second is to move all the doubles to a separate folder and sort through them manually; that would reduce my workload, but it would still be a pain. The third is to put it all in a database and work from there, but I would still have similar issues, and besides, using a DB for a one-time script is overkill IMHO.

 

So anyway, here's the code I use (nothing extremely complicated):

(sorry for the large lines...)

<pre>
<?php

function stripstring($string) {
    // Used for comparison between artists and titles while ignoring special characters

    // Remove all accents (french) and replace "and" with "&" for duets and such
    $accents = array('/é/', '/è/', '/ê/', '/ë/', '/à/', '/â/', '/î/', '/ï/', '/ç/', '/ and /', '/ et /', "/n\' /");
    $removed = array('e', 'e', 'e', 'e', 'a', 'a', 'i', 'i', 'c', ' & ', ' & ', "ng ");
    $string = preg_replace($accents, $removed, $string);

    // Remove all other characters except letters and numbers
    $string = preg_replace('/[^a-zA-Z0-9]/', '', $string);
    return $string;
}

function fixstring($string) {
    // Used to permanently fix artists and titles so I don't ever need to use cleanstring() again
    $old = array('/ and /', '/ et /', '/`/');
    $new = array(' & ', ' & ', "\'");
    $string = preg_replace($old, $new, $string);
    return ucwords($string);
}


// Make sure the script won't timeout.
set_time_limit(0);

// Get new songs in an array (with clean and original file names)
$temp = opendir('F:/NewDB/SoundChoice');
$newsongs = array();
$i = 0;
while (false !== ($file = readdir($temp))) {
    // Get filename
    if ($file !== '.' AND $file !== '..') {
        $name = strtolower(substr($file, 0, -4));
        // Separate artist and title
        $nameparts = explode(" - ", $name);
        // Get artist and title strings for comparison (Stripped)
        $newsongs[$i]['compartist'] = stripstring($nameparts[0]);
        $newsongs[$i]['comptitle']  = stripstring($nameparts[1]);
        // Get artist and title strings for rewrite (Cleaned)
        $newsongs[$i]['artist'] = fixstring($nameparts[0]);
        $newsongs[$i]['title'] = fixstring($nameparts[1]);
        $i++;
    }
}

closedir($temp);


echo "there is a total of ". count($newsongs) ." songs in the new database.\n";

$count = 0;
foreach ($newsongs as $newsong) {
    $hasdouble = FALSE;
    $baseartist = $newsong['compartist'];
    $basetitle = $newsong['comptitle'];
    foreach ($newsongs as $compsong) {
        if ($compsong['compartist'] == $baseartist AND $compsong['comptitle'] == $basetitle AND $compsong['title'] != $newsong['title'] AND $compsong['artist'] != $newsong['artist']) {
            $hasdouble = TRUE;
            break;
        }
    }
    if ($hasdouble) {
        echo " <font color=red>double</font>: for {$newsong['artist']} - {$newsong['title']} ($baseartist - $basetitle)\n";
    } else {
        echo " <font color=green>unique</font>: for {$newsong['artist']} - {$newsong['title']} ($baseartist - $basetitle)\n";
    }
    $count++;
}

?> 
</pre>

Here's a method I use all the time for duplicate removal.

 

$array_with_dups = array(1, 5, 3, 6, 5, 3, 5);
$seen = array();
foreach ($array_with_dups as $key => $item) {
  if (isset($seen[$item])) {
    # Already seen this one
    unset($array_with_dups[$key]);
  } else {
    $seen[$item] = true;
  }
}
var_dump($array_with_dups);  # No longer with dups

 

The result will have all duplicates removed.  If you want to detect duplicates by another condition, such as $item['name'], then you should use $seen[$item['name']] instead of $seen[$item].
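
To make the $item['name'] variant concrete, here is a minimal sketch. The helper name dedupe_by_field() and the sample data are made up for illustration; they are not from the thread. It keeps the first item seen for each distinct value of the chosen field and drops later repeats.

```php
<?php
// Hypothetical helper (not from the thread): keep the first item for each
// distinct value of $field and drop any later item with the same value.
function dedupe_by_field(array $items, $field)
{
    $seen = array();
    foreach ($items as $key => $item) {
        $value = $item[$field];
        if (isset($seen[$value])) {
            // Already seen this value: remove the later copy
            unset($items[$key]);
        } else {
            // First time we meet this value: remember it
            $seen[$value] = true;
        }
    }
    return $items;
}

// Made-up sample data: two items share the name 'alpha'
$items = array(
    array('name' => 'alpha', 'size' => 1),
    array('name' => 'beta',  'size' => 2),
    array('name' => 'alpha', 'size' => 3),
);

$unique = dedupe_by_field($items, 'name');
print_r(array_keys($unique)); // the surviving items keep their original keys
```

Note that unset() leaves gaps in the numeric keys; array_values() can renumber the result if that matters.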

 

This method is much faster because it uses the $seen array to do lookups in constant time, instead of searching through a large array.  For 20k files I would expect it to take under one second.

I can see how that would work with a simple series of numbers, and potentially with identical strings. But this is a multi-dimensional array where two of the keys have to be identical and the rest have to be different, so unless you can help me figure out a more complex use of this, I can't really use it. I don't want you to do all the work for me, I just can't figure it out by myself (even Google has its limits).

You don't need the "different" comparison here, because you are not checking against the same list, you are checking against an auxiliary list which serves only to remember which songs you've already seen.
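
As an aside, the "different" comparison in the original nested loop also seems to explain the missed doubles: it requires the artist AND the title to both differ, so a pair like "Baby Baby" / "Baby, Baby" with an identical artist never matches. A small sketch of that reading follows; the two function names are hypothetical and only restate the condition from the first post next to a relaxed version.

```php
<?php
// Hypothetical helpers (names are not from the thread). Each song is an
// array with 'artist', 'title', 'compartist' and 'comptitle' entries.

// Condition as written in the first post: artist AND title must both differ.
function is_double_original($a, $b)
{
    return $a['compartist'] == $b['compartist']
        && $a['comptitle'] == $b['comptitle']
        && $a['title'] != $b['title']
        && $a['artist'] != $b['artist'];
}

// Relaxed condition: it is enough for artist OR title to differ.
function is_double_relaxed($a, $b)
{
    return $a['compartist'] == $b['compartist']
        && $a['comptitle'] == $b['comptitle']
        && ($a['title'] != $b['title'] || $a['artist'] != $b['artist']);
}

// The Amy Grant pair from the first post
$first = array(
    'artist' => 'Amy Grant', 'title' => 'Baby Baby',
    'compartist' => 'amygrant', 'comptitle' => 'babybaby',
);
$second = array(
    'artist' => 'Amy Grant', 'title' => 'Baby, Baby',
    'compartist' => 'amygrant', 'comptitle' => 'babybaby',
);

var_dump(is_double_original($first, $second)); // bool(false): artists are equal
var_dump(is_double_relaxed($first, $second));  // bool(true)
```

Even the relaxed version would still miss two files whose cleaned artist and title come out identical, which is one more reason to prefer the auxiliary list.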

 

For dealing with the "two keys" problem, you can just combine them into a single string using a separator that cannot appear in either string (such as a space, in this case).  Here's an example with sample data to show that it works:

 

<?php
$newsongs[] = array(
  'artist' => 'Tool',
  'title' => 'Rosetta Stoned',
  'compartist' => 'tool',
  'comptitle' => 'rosettastoned',
);
$newsongs[] = array(
  'artist' => 'Tool',
  'title' => '10000 days',
  'compartist' => 'tool',
  'comptitle' => '10000days',
);
$newsongs[] = array(
  'artist' => 'Tool',
  'title' => 'Rosetta-Stoned',
  'compartist' => 'tool',
  'comptitle' => 'rosettastoned',
);

$seen = array();
$count = 0; # number of songs processed
foreach ($newsongs as $ns_key => $newsong) {
  $seen_key = $newsong['compartist'] . ' ' . $newsong['comptitle']; # space cannot occur in either key
  if (isset($seen[$seen_key])) {
    # Seen this one before, nuke it
    print "Nuking {$newsong['title']} by {$newsong['artist']}<br>\n";
    unset($newsongs[$ns_key]);
  } else {
    # Not seen before, remember it
    $seen[$seen_key] = true;
  }
  $count++;
}

var_dump($newsongs);
?>
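
To carry this over to the files themselves, one possible approach is to keep the original file name next to the comparison key, so that only the later copy of each double ends up on a delete list. The sketch below is an assumption-laden illustration, not the poster's method: plan_duplicate_removal() and strip_simple() are hypothetical names, strip_simple() is a simplified stand-in for stripstring(), and the ".mp3" extension in the sample names is made up.

```php
<?php
// Hypothetical sketch: split file names into a "keep" list and a "delete"
// list, keeping the first file seen for each stripped artist/title pair.
// $strip is the name of the normalising function to apply to each part.
function plan_duplicate_removal(array $files, $strip)
{
    $seen = array();
    $plan = array('keep' => array(), 'delete' => array());
    foreach ($files as $file) {
        $name  = strtolower(pathinfo($file, PATHINFO_FILENAME));
        $parts = explode(' - ', $name, 2);
        if (count($parts) < 2) {
            // Not in "artist - title" form: leave the file alone
            $plan['keep'][] = $file;
            continue;
        }
        // A space is a safe separator because stripped strings contain none
        $key = call_user_func($strip, $parts[0]) . ' ' . call_user_func($strip, $parts[1]);
        if (isset($seen[$key])) {
            $plan['delete'][] = $file; // later copy of an already seen song
        } else {
            $seen[$key] = true;
            $plan['keep'][] = $file;   // first copy: this one survives
        }
    }
    return $plan;
}

// Simplified stand-in for stripstring(): keeps only letters and digits.
function strip_simple($s)
{
    return preg_replace('/[^a-z0-9]/i', '', $s);
}

// Sample names based on the first post; the extension is an assumption
$files = array(
    'Amy Grant - Baby Baby.mp3',
    'Amy Grant - Baby, Baby.mp3',
    "Zz Top - Gimme All Your Lovin'.mp3",
    'Zz Top - Gimme All Your Lovin.mp3',
);

$plan = plan_duplicate_removal($files, 'strip_simple');
print_r($plan['delete']); // review this list before deleting anything

// Actual deletion would then be a loop over $plan['delete'] calling
// unlink() with the real directory prepended to each name.
```

Printing the list first acts as a dry run. If each song is stored as several files sharing one base name, the extension would have to become part of the key, otherwise the companion files would be treated as doubles.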
