
You could try shaking your hard disk...

 

 

Seriously though, I can't think of any solution that would use fewer resources than just loading the file into an array.

 

You could load the file line by line, save each line to a new file on disk, then read them back in random order, but that's just crazy (it would use less RAM though :P ).
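 

For reference, a minimal sketch of the load-into-an-array baseline (the file names are just placeholders):

 

$lines = file( 'list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
shuffle( $lines ); // PHP's built-in in-memory shuffle
file_put_contents( 'mix-list.txt', implode( "\r\n", $lines ) . "\r\n" );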

Wow, I got it. I've been thinking about this for two days, and posting the question finally made it click:

 

Simply take the original list's last line and save it to file #2, then take the first line and save it, then the line before the last one, then the second line, and so on.

 

Then rinse and repeat until it's mixed enough. So simple, yet efficient. The point is that I need to use it on lists with 1M+ entries, where loading everything would consume too much memory.

Holy crab. I ended up with a 100+ line function for this, only to realise it takes over 1 second per KB at 100% CPU usage, which means over an hour for a single file.

 

What to do now? Any suggestions?

 

function shuffleList( $theList, $listName, $replace, $quality )
{
	// $theList:  path to the original list to shuffle
	// $listName: name of the original list
	// $replace:  whether the original list is overwritten or saved under a new name (bool)
	// $quality:  shuffle iterations to determine quality of the mix (int, >= 1)

	if ( file_exists( $theList ) )
	{
		$originalList = fopen( $theList, 'r' );
		$tmpList1 = fopen( 'mix1-' . $listName, 'w+' );
		$tmpList2 = fopen( 'mix2-' . $listName, 'w+' );

		// Count entries total. (Checking the return value of fgets()
		// instead of feof() avoids counting one extra, phantom line.)
		$total = 0;
		while ( fgets( $originalList ) !== false )
		{
			$total++;
		}
		rewind( $originalList );

		// Shuffle: each pass interleaves lines from the bottom and the
		// top of the previous pass's output.
		for ( $i = 1; $i <= $quality; $i++ )
		{
			// Determine which file to read and which to write,
			// ping-ponging between the two temp files.
			if ( $i == 1 )
			{
				$listRead = $originalList;
				$listWrite = $tmpList1;
			}
			elseif ( isPair( $i ) )
			{
				$listRead = $tmpList1;
				$listWrite = $tmpList2;
			}
			else
			{
				$listRead = $tmpList2;
				$listWrite = $tmpList1;
			}

			$counter = 0;
			$top = 1;
			$bot = 0;
			while ( $counter != $total )
			{
				$tmpCounter = 0;
				if ( isPair( $counter ) )
				{
					// Pick an entry at the bottom: skip ahead to
					// line ($total - $bot), counted from the start.
					while ( $tmpCounter != $total - $bot )
					{
						$tmpCounter++;
						$tmpEmail = fgets( $listRead );
					}
					$bot++;
				}
				else
				{
					// Pick an entry at the top: skip ahead to line $top.
					while ( $tmpCounter != $top )
					{
						$tmpCounter++;
						$tmpEmail = fgets( $listRead );
					}
					$top++;
				}
				$counter++;
				fwrite( $listWrite, trim( $tmpEmail ) . "\r\n" );
				rewind( $listRead );
			}
			rewind( $listWrite );
		}

		fclose( $originalList );

		// Save the mixed list ($listWrite is the last temp file written).
		if ( $replace == true )
		{
			$listFinal = $theList;
		}
		else
		{
			$listFinal = 'mix-' . $listName;
		}

		$listFinal = fopen( $listFinal, 'w' );

		while ( ( $line = fgets( $listWrite ) ) !== false )
		{
			fwrite( $listFinal, $line );
		}

		fclose( $listFinal );
		fclose( $tmpList1 );
		fclose( $tmpList2 );

		// Clean up the temp files.
		unlink( 'mix1-' . $listName );
		unlink( 'mix2-' . $listName );
	}
}

function isPair( $number )
{
	// true: even ("pair"), false: odd.
	return $number % 2 == 0;
}

Er, random... how is that random?

 

Just a thought, but instead of randomizing the order of the list, read a random line from the list.  Make all the lines the same length (take the longest line and pad the shorter ones with whitespace or some other delimiter).  Then use fseek(), randomizing the offset argument (and rounding up or down to the nearest newline).
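 

A minimal sketch of that idea, assuming every line has already been padded to a fixed length (the 64-byte record size and file name are made up):

 

$recordLength = 66; // 64 padded data bytes + "\r\n"
$handle = fopen( 'list-padded.txt', 'r' );
$totalLines = (int) ( filesize( 'list-padded.txt' ) / $recordLength );

// Jump straight to a random record instead of scanning line by line.
$randomLine = mt_rand( 0, $totalLines - 1 );
fseek( $handle, $randomLine * $recordLength );
$entry = rtrim( fgets( $handle ) ); // strip the padding and "\r\n"

fclose( $handle );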

Thanks for the input, guys. Great idea Crayon Violent, but I'm worried that reading a random line wouldn't be efficient at all: in a 200k+ line file, finding the last unread line would take ages, assuming I replace read ones with spaces. The question is: would it be possible to delete those bytes directly within the original file? Quite impossible, I guess.

 

ex:

 

dummy1

bob___

hello_

 

I randomly read "bob___" and delete it:

 

dummy1

hello_

 

Otherwise I'd have to do:

 

dummy1

______

hello_

 

Then randomly find a line and verify that it isn't only spaces.

 

 

Seems like a database would be the most efficient way.

You can make a good approximation of where a certain line starts; of course, this assumes that each line is of equal length. Example:

 

$line_length = 82; // in bytes (accounting for "\r\n" on Windows)
$filesize = filesize('test.txt');
echo 'Lines: ', ceil($filesize / $line_length);

 

Using something like this:

 

qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies
qqmsdlkfjlmqskdjfmlkqsjdfmlksjqdmlfkjsqmldkfjmlqskdjflmksqjdfmlkqsjdfnvsqdfeies

 

Returns "Lines: 10"

 

[ot]The good old Assembler days[/ot]


Yes, you can delete previously read lines; however, it basically involves writing all lines except the selected one to a new file. This isn't very convenient, especially with really large files, but it should be a lot less memory-intensive than keeping all of the lines in memory at once.
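 

Something like this, roughly (the function and variable names are made up; $target is the 0-based index of the line to delete):

 

function deleteLine( $path, $target )
{
	$in = fopen( $path, 'r' );
	$out = fopen( $path . '.tmp', 'w' );

	$lineNo = 0;
	while ( ( $line = fgets( $in ) ) !== false )
	{
		// Copy every line except the one being deleted.
		if ( $lineNo != $target )
		{
			fwrite( $out, $line );
		}
		$lineNo++;
	}

	fclose( $in );
	fclose( $out );

	// Swap the rewritten file in for the original.
	rename( $path . '.tmp', $path );
}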

 

But overall, if you can use a DB instead of a text file for this, then it would be a million times better to do that.
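 

With MySQL, for instance, you could let the database do the shuffling (the table and column names here are hypothetical, and note that ORDER BY RAND() itself slows down on very large tables):

 

$pdo = new PDO( 'mysql:host=localhost;dbname=test', 'user', 'password' );

// Fetch the whole list in random order; MySQL handles the shuffle.
$stmt = $pdo->query( 'SELECT address FROM emails ORDER BY RAND()' );

while ( $row = $stmt->fetch( PDO::FETCH_ASSOC ) )
{
	echo $row['address'], "\r\n";
}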

 
