
Working with large datasets


NotionCommotion


I have an application which takes some time to run, and I would like to take steps to improve its execution speed.

Sample data is provided as JSON as shown below, where the values array has a few columns and many rows.  My desired outcome is three PHP objects for mean_P51, mean_P55, and max_P56, each of which has a reference to the time array as well as its own values array.

{
    "name": "L2",
    "columns": ["time", "mean_P51", "mean_P55", "max_P56"],
    "values": [
        ["2020-06-13T14:02:02Z", 4.3527550826446255, 5.668302919254657, 0.6175362252066116],
        ["2020-06-13T14:02:12Z", 4.472219604166665, 5.493282520833331, 0.6095558604166668],
        ["2020-06-23T14:02:22Z", 4.332343173277662, 5.477678517745302, 0.6014520167014615],
        ...
        ["2020-06-23T14:02:22Z", 4.272219604166665, 5.468302919254657, 0.6195558604166668]
    ]
}

Originally, I thought it would be more efficient to iterate over the big values array once and process each column on each iteration (I envision myself walking one mile and snapping my fingers three times each foot, versus walking three miles and snapping my fingers once every foot).  What I witness, however, is that it is faster to iterate over the big values array multiple times to generate each object.  I've done a few simple tests comparing big loops nested within little loops to little loops nested within big loops, and I get similar results.  I guess this makes sense, since I am not really stopping when I snap my fingers, and if I did, walking the three miles would likely be quicker.  Is this expected behavior?
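For reference, a rough way to compare the two orderings is a micro-benchmark along these lines (the data and the work inside the loops are made-up stand-ins, so only the relative numbers matter):

$rows = range(1, 100000);      // the long dimension (many rows)
$columns = ['a', 'b', 'c'];    // the short dimension (few columns)

// Version 1: walk the big array once, with the small loop inside.
$start = hrtime(true);
foreach ($rows as $row) {
    foreach ($columns as $column) {
        $x = $row + strlen($column);   // placeholder work
    }
}
printf("small loop inside: %.1f ms\n", (hrtime(true) - $start) / 1e6);

// Version 2: walk the big array once per column.
$start = hrtime(true);
foreach ($columns as $column) {
    foreach ($rows as $row) {
        $x = $row + strlen($column);   // placeholder work
    }
}
printf("big loop inside:   %.1f ms\n", (hrtime(true) - $start) / 1e6);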

If there is too much data, I will need to use a JSON stream parser and either a generator or an iterator.  I haven't tested it yet, but I expect a generator would be more efficient than an iterator, and that using PHP's built-in json_decode() with a plain array would be more efficient than either.  Do I need to test this hypothesis, or is it likely correct?
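For what it's worth, a first comparison could be as simple as the following, where the generator just wraps the decoded data (a real stream parser would replace the json_decode() call, and the file name is only a placeholder):

$json = file_get_contents('data.json');   // placeholder path

function rowsAsArray(string $json): array
{
    return json_decode($json, true)['values'];
}

function rowsAsGenerator(string $json): \Generator
{
    foreach (json_decode($json, true)['values'] as $row) {
        yield $row;
    }
}

// Both can be consumed the same way; timing the same consumer against each
// source should show whether the generator overhead matters at this size.
foreach (rowsAsGenerator($json) as $row) {
    // process $row
}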

Any other general strategies one should take when working with large datasets?

Thanks


Not sure I understand your desired output.  Do you want to take each value item and transform it like so:

["2020-06-13T14:02:02Z", 4.3527550826446255, 5.668302919254657, 0.6175362252066116]
//becomes three objects
new p51("2020-06-13T14:02:02Z", 4.3527550826446255),
new p55("2020-06-13T14:02:02Z", 5.668302919254657),
new p56("2020-06-13T14:02:02Z", 0.6175362252066116)

 


4 hours ago, kicken said:

Not sure I understand your desired output.  Do you want to take each value item and transform it like so:

There isn't a single desired output.  Sometimes it will be fairly similar to the received data, but most often it will be like the following:

$series = [
    new Series(new Point(51), new Aggregator('mean'), new DataValues([4.3527550826446255, 4.472219604166665, 4.332343173277662, ..., 4.272219604166665])),
    new Series(new Point(55), new Aggregator('mean'), new DataValues([5.3527550826446255, 5.472219604166665, 5.332343173277662, ..., 5.272219604166665])),
    new Series(new Point(56), new Aggregator('max'),  new DataValues([0.6175362252066116, 0.609555860166668, 0.604520167014615, ..., 0.619555604166668]))
];

$timeValues = new TimeValues(["2020-06-13T14:02:02Z", "2020-06-13T14:02:12Z", "2020-06-23T14:02:22Z", ..., "2020-06-23T14:02:22Z"]);

$seriesCollection = new SeriesCollection($timeValues, ...$series);

To be flexible, I will likely just drop the raw data into some class:

class RawDataCollection
{
    public function __construct(string $name, array $columns, array $values)
    {
        //
    }
}

I'm not really sure whether it is best, or whether it even matters, to have some method to create the desired output:

class RawDataCollection
{
    public function createDataCollection(TransformerInterface $transformer):DataCollectionInterface
    {
        
    }
}

class SeriesCollectionTransformer implements TransformerInterface
{
    //
}

or do something like:

class SeriesCollection
{
    public function __construct(RawDataCollection $rawData)
    {
        //
    }
}

Was I correct in my observation that iterating over a large loop multiple times is often more efficient than iterating over it only once with a smaller loop nested inside?

Edited by NotionCommotion

3 hours ago, NotionCommotion said:

Was I correct in my observation that iterating over a large loop multiple times is often more efficient than iterating over it only once with a smaller loop nested inside?

If the overall function of the loops doesn't change between versions, then it generally doesn't matter which way you do it.

foreach (1000){ foreach (3){ something; }}  and foreach (3){ foreach (1000){ something; }} are both 3000 executions of something; and it's ultimately the time of the something block that determines the time spent.

Sometimes the order of the loops can help make whatever something; is more efficient by, say, reducing the number of variable assignments/lookups.  Other times it truly doesn't make a difference, so you'd just do whichever seems easiest to understand.
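As an illustration of the lookup point, with placeholder data rather than your actual structures: putting the short loop on the outside means each per-column bucket only has to be resolved a handful of times instead of once per row.

$rows = array_fill(0, 1000, [1, 2, 3]);   // 1000 dummy rows x 3 columns
$buckets = [[], [], []];                  // one bucket per column

foreach ($buckets as $idx => &$bucket) {  // 3 iterations
    foreach ($rows as $row) {             // 1000 iterations each
        $bucket[] = $row[$idx];           // the bucket reference is resolved once per column
    }
}
unset($bucket);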

 

I'd probably start with something like this to transform the data and see how it goes.

$data = json_decode($yourJsonString);

// One ColumnData container per column name.
$columns = array_map(function($v){
    return new ColumnData($v);
}, $data->columns);

// Single pass over the rows, appending each value to its column's container.
foreach ($data->values as $row){
    foreach ($row as $idx=>$value){
        $columns[$idx]->addValue($value);
    }
}

That makes the main loop simple and basically just shifts the data around into a more usable format.  The ColumnData class is just a simple container to hold the data temporarily.  I tested the above on a 54MB file of random data; it took about 8 seconds and used ~575MB of memory (439MB for the json_decode, 136MB for the sorting).
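If you want to reproduce that kind of measurement on your own data, a rough harness is all it takes (the file name is a placeholder, and the numbers will obviously vary by machine and data):

$start = hrtime(true);

$data = json_decode(file_get_contents('big-file.json'));   // placeholder path
$afterDecode = memory_get_peak_usage(true);

$columns = array_map(function($v){ return new ColumnData($v); }, $data->columns);
foreach ($data->values as $row){
    foreach ($row as $idx=>$value){
        $columns[$idx]->addValue($value);
    }
}

printf("time: %.1f s\n", (hrtime(true) - $start) / 1e9);
printf("peak after decode: %d MB\n", $afterDecode / 1048576);
printf("peak overall: %d MB\n", memory_get_peak_usage(true) / 1048576);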

class ColumnData {
    private $name;
    private $values;
    public function __construct($name){
        $this->name = $name;
        $this->values = [];
    }

    public function getName(){
        return $this->name;
    }

    public function getValues(){
        return $this->values;
    }

    public function addValue($value){
        $this->values[] = $value;
    }
}

The ColumnData class could be removed and simple arrays used instead; not sure how that'd affect runtime/memory usage.  Edit: Tried this; it reduced the runtime from 8 seconds to 1.5 seconds but had no effect on memory usage.
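An array-only grouping could look roughly like this (just a sketch; the collection-building code below still assumes the ColumnData objects):

$columns = [];
foreach ($data->values as $row){
    foreach ($row as $idx=>$value){
        $columns[$idx][] = $value;
    }
}
// $columns[0] now holds the time values and $columns[1..3] the per-point values,
// in the same order as $data->columns.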

Once you've got the data all grouped by column, just generate the classes you need from there.

//This assumes time is the first column.
$timeColumn = new TimeValues(array_shift($columns)->getValues());
$series = array_map(function($v){
    list($fn, $point) = explode('_', $v->getName());
    return new Series(new Point($point), new Aggregator($fn), new DataValues($v->getValues()));
}, $columns);

$collection = new SeriesCollection($timeColumn, ...$series);

 

Edited by kicken

14 hours ago, kicken said:

The ColumnData class could be removed and simple arrays used instead; not sure how that'd affect runtime/memory usage.  Edit: Tried this; it reduced the runtime from 8 seconds to 1.5 seconds but had no effect on memory usage.

Thanks kicken,

Well, that is significant.  ColumnData was applied on the short loop, so there were only a few of them, each containing a large array?  I would not expect a significant difference in time.  Guess that is why testing is always good!

I know I am a little off topic, but I still hope to get your thoughts.  Say I had a class RawData which was injected with either a stream or some data array and needed the ability to export the data in several different formats.  Would you use any of the following approaches, or some other approach?

$rawData = new RawData($stream);

$outputV1 = $rawData->createOutput(new OneOfSeveralOutputFormatters());

$outputV2 = (new OneOfSeveralOutputFormatters($rawData))->createOutput();

$outputV3 = (new OneOfSeveralOutputFormatters())->createOutput($rawData);

 


1 hour ago, NotionCommotion said:

I would not expect a significant difference in time.

It's the addValue function call that ate up the time.  Keeping the object but making the values property public and just doing $columns[$idx]->values[] = $value; resulted in the same 1.5 second run time.

1 hour ago, NotionCommotion said:

Would you use any of the following approaches, or some other approach?

I'd probably go with option 3.  It keeps the data and the formatting de-coupled, and it seems to me to be the more correct placement of the output function.  Any of them would be fine, really.
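A minimal sketch of option 3, just to show the shape (the interface and formatter names here are placeholders, not anything you've already defined):

interface OutputFormatterInterface
{
    public function createOutput(RawData $rawData);
}

class SeriesCollectionFormatter implements OutputFormatterInterface
{
    public function createOutput(RawData $rawData)
    {
        // build and return a SeriesCollection (or whatever format) from $rawData
    }
}

$outputV3 = (new SeriesCollectionFormatter())->createOutput($rawData);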

 


Thanks kicken,

Yes, I see why addValue() might take a little time.  I know performance optimization is often foolhardy, but I think this is a good time to do so.

Regarding my question about the output formatters, I definitely don't need to spend any more time dwelling on it; I'll just pick one, likely the one you suggested.

