
Combining time series data with irregular timestamps


NotionCommotion


I have some data where each point has a timestamp marking when it was measured, and not all data points are necessarily measured at the same time.

 

                Time |    Value1 |    Value2 |    Value3 |    Value4
 2020-12-15 00:05:07 |           |      27.6 |           |          
 2020-12-15 00:30:27 |           |           |        26 |          
 2020-12-15 01:20:10 |      53.9 |      25.2 |           |          
 2020-12-15 02:09:44 |           |      26.4 |           |      60.2
 2020-12-15 02:43:19 |        33 |      49.2 |           |          
 2020-12-15 03:04:44 |      30.8 |           |           |        42
 2020-12-15 03:21:46 |           |        54 |           |          
 2020-12-15 03:47:19 |      52.8 |           |      53.3 |          
 2020-12-15 04:26:20 |           |           |      27.3 |          
 2020-12-15 04:45:10 |      37.4 |           |           |          
 2020-12-15 05:24:25 |           |        42 |           |          
 2020-12-15 05:34:49 |      36.3 |           |      41.6 |          
 2020-12-15 05:36:04 |      24.2 |           |           |          
 2020-12-15 06:06:47 |      49.5 |           |           |          
 2020-12-15 07:00:07 |           |           |           |        49
 2020-12-15 07:59:17 |           |      27.6 |           |          
 2020-12-15 08:14:45 |           |           |      48.1 |      51.8
 2020-12-15 08:26:36 |           |           |      53.3 |          
 2020-12-15 08:54:23 |      50.6 |           |           |          
 2020-12-15 09:46:53 |           |           |      58.5 |      43.4
 2020-12-15 10:25:03 |           |           |           |      36.4

 

I am trying to obtain the total value (i.e. value1 + value2 + value3 + value4), and since there is no common time interval, to come up with an arbitrary one such as one hour.  It doesn't need to be exact; one option is to linearize each sample between its individual timestamps, but ideally I can do a little better.  It has been many, many years, but I recall that least squares regression used to be used for things like this; however, I think I would need to process the entire series for each value, which I expect would require many polynomial terms.  I've never used R, and it might be an option if necessary.  Not sure if there is anything baked into PHP, which would be great.    Any ideas?  Thanks

                Time |Total Value
 2020-12-15 00:00:00 |      27.3
 2020-12-15 01:00:00 |      48.4
 2020-12-15 02:00:00 |        21
 2020-12-15 03:00:00 |        22
 2020-12-15 04:00:00 |        49
 2020-12-15 05:00:00 |        38
 2020-12-15 06:00:00 |        49
 2020-12-15 07:00:00 |        29
 2020-12-15 08:00:00 |        49
 2020-12-15 09:00:00 |        46
 2020-12-15 10:00:00 |        36
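
For what it's worth, here is a minimal sketch of the "linearize each sample between timestamps" option in plain PHP. Everything here is illustrative: $allSeries is assumed to be four arrays mapping Unix timestamp => value, and an hour with no surrounding samples for a series simply contributes nothing.

<?php
// Sketch only: piecewise-linear interpolation of one irregular series
// onto an hourly grid. $points is a timestamp => value map.
function interpolateAt(array $points, int $ts): ?float
{
    ksort($points);
    $prevTs = $prevVal = null;
    foreach ($points as $t => $v) {
        if ($t === $ts) {
            return (float) $v;
        }
        if ($t > $ts) {
            if ($prevTs === null) {
                return null;                            // before the first sample
            }
            $frac = ($ts - $prevTs) / ($t - $prevTs);   // position between neighbours
            return $prevVal + $frac * ($v - $prevVal);
        }
        [$prevTs, $prevVal] = [$t, $v];
    }
    return null;                                        // after the last sample
}

// Resample all four series to the top of each hour and total them.
// $allSeries (hypothetical) holds the four timestamp => value maps.
$start = (new DateTime('2020-12-15 00:00:00'))->getTimestamp();
for ($ts = $start; $ts <= $start + 10 * 3600; $ts += 3600) {
    $total = 0.0;
    foreach ($allSeries as $points) {
        $total += interpolateAt($points, $ts) ?? 0.0;   // treat "no data" as 0
    }
    printf("%20s | %10.1f\n", date('Y-m-d H:i:s', $ts), $total);
}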

 


The simplest SQL solution would be to average everything and GROUP BY the hour (which you can do by formatting the date to Y-m-d H:00:00).

Not sure averaging is right, though. If you need something more complex then you might want to shift all the processing to PHP and go with some statistical-type analysis. Dunno. Depends on what the measurements and timestamps mean.
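
For illustration, a sketch of that query, assuming a hypothetical measurements table (measured_at DATETIME, nullable value1..value4 columns) and an existing PDO connection; all names here are invented:

<?php
// Sketch only: hourly averages via GROUP BY. AVG() skips NULLs, so each
// sparse column is averaged over its own samples; COALESCE turns an hour
// with no samples at all for a column into 0 before totalling.
$sql = "SELECT DATE_FORMAT(measured_at, '%Y-%m-%d %H:00:00') AS hr,
               COALESCE(AVG(value1), 0) + COALESCE(AVG(value2), 0)
             + COALESCE(AVG(value3), 0) + COALESCE(AVG(value4), 0) AS total
        FROM measurements
        GROUP BY hr
        ORDER BY hr";
foreach ($pdo->query($sql) as $row) {
    printf("%20s | %10.1f\n", $row['hr'], $row['total']);
}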


56 minutes ago, requinix said:

The simplest SQL solution would be to average everything and GROUP BY the hour (which you can do by formatting the date to Y-m-d H:00:00).

Not sure averaging is right, though. If you need something more complex then you might want to shift all the processing to PHP and go with some statistical-type analysis. Dunno. Depends on what the measurements and timestamps mean.

I've been doing that, but it doesn't produce pretty results when I have gaps.

I haven't yet validated the results, but the following was a pretty easy solution.

<?php
use Phpml\Regression\LeastSquares;

error_reporting(E_ALL);

require __DIR__.'/../../vendor/autoload.php';

// Per-series multiplier so the four generated series have different magnitudes.
$scale = [1, 2, 3, 4];

// Start at midnight yesterday; $date walks forward, $date2 keeps the origin.
$date = new DateTime();
$date->setTime(0, 0, 0)->sub(new DateInterval('P1D'));
$date2 = clone $date;

printf('%20s | %20s |%10s |%10s |%10s |%10s'.PHP_EOL, 'Time', 'Timestamp', 'Value1', 'Value2', 'Value3', 'Value4');

// One samples/targets pair per series, in the shape php-ml expects.
$series = array_fill(0, 4, ['samples' => [], 'targets' => []]);

for ($i = 0; $i <= 20; $i++) {
    $record = array_fill(0, 4, null);

    // Advance by a random 1 minute to 1 hour, then record a value for one
    // random series (and sometimes a second series at the same timestamp).
    $date->add(new DateInterval(sprintf('PT%dS', rand(60, 60 * 60))));
    $timestamp = $date->getTimestamp();

    $item = rand(0, 3);
    $value = rand(30, 50) * $scale[$item];
    $series[$item]['samples'][] = [$timestamp];
    $series[$item]['targets'][] = $value;
    $record[$item] = $value;

    $item = rand(0, 6);                     // > 3 means "no second value"
    if ($item <= 3 && !isset($record[$item])) {
        $value = rand(30, 50) * $scale[$item];
        $series[$item]['samples'][] = [$timestamp];
        $series[$item]['targets'][] = $value;
        $record[$item] = $value;
    }
    printf('%20s | %20d |%10s |%10s |%10s |%10s'.PHP_EOL, $date->format('Y-m-d H:i:s'), $timestamp, ...$record);
}
echo "\n\n\n\n";

// Train one linear regression per series on (timestamp) => value.
$ls = [];
foreach ($series as $item => $rs) {
    $ls[$item] = new LeastSquares();
    $ls[$item]->train($rs['samples'], $rs['targets']);
}

$interval = new DateInterval('PT1H');

printf('%20s | %20s |%10s |%10s |%10s |%10s |%10s'.PHP_EOL, 'Time', 'Timestamp', 'Value1', 'Value2', 'Value3', 'Value4', 'Total');
while ($date2 < $date) {
    $record = [];
    $sum = 0;
    $ts = $date2->getTimestamp();
    // Predict each series at the top of the hour and total them.
    foreach ($series as $item => $rs) {
        $v = round($ls[$item]->predict([$ts]), 2);
        $record[$item] = $v;
        $sum += $v;
    }
    $record[] = $sum;
    printf('%20s | %20d |%10s |%10s |%10s |%10s |%10s'.PHP_EOL, $date2->format('Y-m-d H:i:s'), $ts, ...$record);
    $date2->add($interval);
}
                Time |            Timestamp |    Value1 |    Value2 |    Value3 |    Value4
 2020-12-15 00:01:34 |           1607990494 |        37 |           |           |          
 2020-12-15 00:04:04 |           1607990644 |           |           |       120 |       164
 2020-12-15 00:51:11 |           1607993471 |           |           |       114 |          
 2020-12-15 01:47:15 |           1607996835 |           |           |       141 |       164
 2020-12-15 02:13:54 |           1607998434 |        34 |        88 |           |          
 2020-12-15 02:43:55 |           1608000235 |           |           |       120 |          
 2020-12-15 03:14:47 |           1608002087 |        40 |           |       150 |          
 2020-12-15 04:11:14 |           1608005474 |           |        84 |       114 |          
 2020-12-15 05:04:15 |           1608008655 |           |        90 |       114 |          
 2020-12-15 05:45:17 |           1608011117 |           |           |           |       188
 2020-12-15 05:52:47 |           1608011567 |           |           |       111 |          
 2020-12-15 05:56:44 |           1608011804 |           |           |           |       156
 2020-12-15 06:54:21 |           1608015261 |           |           |       114 |       120
 2020-12-15 07:21:56 |           1608016916 |           |           |       132 |          
 2020-12-15 07:30:44 |           1608017444 |        40 |           |           |          
 2020-12-15 08:06:50 |           1608019610 |           |           |           |       136
 2020-12-15 08:09:03 |           1608019743 |        50 |           |           |          
 2020-12-15 08:21:22 |           1608020482 |        35 |       100 |           |          
 2020-12-15 09:00:20 |           1608022820 |        34 |           |           |       144
 2020-12-15 09:49:22 |           1608025762 |           |           |           |       196
 2020-12-15 10:23:03 |           1608027783 |           |        76 |           |       180


                Time |            Timestamp |    Value1 |    Value2 |    Value3 |    Value4 |     Total
 2020-12-15 00:00:00 |           1607990400 |     36.63 |     90.24 |     126.1 |    158.33 |     411.3
 2020-12-15 01:00:00 |           1607994000 |     36.98 |     89.81 |    125.28 |    158.73 |     410.8
 2020-12-15 02:00:00 |           1607997600 |     37.33 |     89.37 |    124.47 |    159.13 |     410.3
 2020-12-15 03:00:00 |           1608001200 |     37.69 |     88.93 |    123.66 |    159.53 |    409.81
 2020-12-15 04:00:00 |           1608004800 |     38.04 |     88.49 |    122.84 |    159.93 |     409.3
 2020-12-15 05:00:00 |           1608008400 |     38.39 |     88.06 |    122.03 |    160.32 |     408.8
 2020-12-15 06:00:00 |           1608012000 |     38.75 |     87.62 |    121.22 |    160.72 |    408.31
 2020-12-15 07:00:00 |           1608015600 |      39.1 |     87.18 |    120.41 |    161.12 |    407.81
 2020-12-15 08:00:00 |           1608019200 |     39.45 |     86.75 |    119.59 |    161.52 |    407.31
 2020-12-15 09:00:00 |           1608022800 |     39.81 |     86.31 |    118.78 |    161.92 |    406.82
 2020-12-15 10:00:00 |           1608026400 |     40.16 |     85.87 |    117.97 |    162.31 |    406.31

 


Yeah, my initial thoughts were the same.  My memory was that LSR would come up with f(x) = c0 + c1*x + c2*x^2 + c3*x^3 + ...  The more terms, the more closely it can reflect the curve.  Obviously, I don't have enough!  For what I am doing, however, I don't think LSR is the right solution.  If I wanted to predict a value sometime in the future, it might be a good fit (especially since no one could doubt me!), but I am just trying to fit a curve within existing data.  I am sure there is a de facto standard for doing so, and most likely an existing PHP class, but I don't know what it is called and haven't found it yet.

Quote

Doesn't look valid to me. Like, the 00 hour has a single Value1 measurement of 37, but you're outputting 36.63? Shouldn't the value be, you know, 37?

Not necessarily.  If the curve is trending one way or another, either linearly or with a changing slope, etc., 36.63 might be the right result.  That being said, I don't think it is here, and I agree with you.
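
As an aside, php-ml can produce exactly those constants: LeastSquares is linear in its inputs, so feeding it powers of t as features fits f(t) = c0 + c1*t + c2*t^2 + c3*t^3 and hands the coefficients back. A sketch, with $timestamps/$values as illustrative parallel arrays:

<?php
use Phpml\Regression\LeastSquares;

require __DIR__.'/vendor/autoload.php';

// Rescale t to hours since the first sample so t^3 stays a sane magnitude;
// raw Unix timestamps cubed would swamp the fit.
$t0 = min($timestamps);
$samples = [];
foreach ($timestamps as $ts) {
    $t = ($ts - $t0) / 3600;
    $samples[] = [$t, $t ** 2, $t ** 3];
}
$reg = new LeastSquares();
$reg->train($samples, $values);
print_r($reg->getCoefficients());     // c1, c2, c3
var_dump($reg->getIntercept());       // c0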


Least squares does not come up with equations. It's meant for times when you have too much data and are trying to "average" it out to try to get an "actual" value, and from there you can easily derive an equation.
By the way, to get a good equation you have to inject a little knowledge of your own into the system. Like whether the underlying truth is polynomial or sinusoidal in nature.

In other words, to apply fun statistical models to your data, you are going to need a lot more data to work with.

Tip: now would be a good time for the question to stop being abstract.
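
One note on the sinusoidal case: a sinusoid with a known period is still linear in its coefficients, so plain least squares can fit it if the timestamps are expanded into sin/cos features first. A sketch, assuming a 24-hour cycle and illustrative $timestamps/$values arrays:

<?php
use Phpml\Regression\LeastSquares;

require __DIR__.'/vendor/autoload.php';

// Fit v(t) = a + b*sin(wt) + c*cos(wt) by handing LeastSquares the sin/cos
// terms as ordinary features. The period must be known (or guessed) up
// front; here it is assumed to be one day.
$w = 2 * M_PI / 86400;                  // angular frequency, 1-day cycle
$samples = [];
foreach ($timestamps as $ts) {
    $samples[] = [sin($w * $ts), cos($w * $ts)];
}
$reg = new LeastSquares();
$reg->train($samples, $values);

// Predict on an hourly grid using the same feature expansion.
$v = $reg->predict([sin($w * $gridTs), cos($w * $gridTs)]);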


29 minutes ago, requinix said:

Least squares does not come up with equations. It's meant for times when you have too much data and are trying to "average" it out to try to get an "actual" value, and from there you can easily derive an equation.
By the way, to get a good equation you have to inject a little knowledge of your own into the system. Like whether the underlying truth is polynomial or sinusoidal in nature.

In other words, to apply fun statistical models to your data, you are going to need a lot more data to work with.

Tip: now would be a good time for the question to stop being abstract.

Maybe it doesn't come up with equations, but it does come up with the constants in those equations.  Agreed, one needs to understand the system (unless you've got yourself a billion polynomial terms, of course).

While I seem to have gained a reputation for being abstract, and maybe it is well earned, I am not trying to be abstract now.  I need to provide the appropriate data to render a chart, and am currently providing a start date, an interval time, and 400 values per data series using a GROUP BY query.  The chart renders well when I have sufficient data samples per GROUP BY duration, but when I have fewer, I am providing 400 values of step data and Highcharts renders it as if it was actually measured that way.  One option is to provide only the values and timestamps that I have and let Highcharts use its own algorithm to smooth out the curve, but I would rather have the server perform this task.


7 hours ago, NotionCommotion said:

Maybe it doesn't come up with equations, but it does come up with the constants in those equations.

Yeah, looks like I was just being pedantic.

 

Quote

While I seem to have gained a reputation for being abstract, and maybe it is well earned, I am not trying to be abstract now.

Your question is "I have values measured inconsistently over time and I want to group similar values together and I'm currently doing some math on them".
Raises questions like "what values?" and "why are they inconsistent?" and "does it make sense to group/average them like you've been doing, and if so why and how?" and "are you trying to interpolate data or render a chart or what?"

 

Quote

I need to provide the appropriate data to render a chart, and am currently providing a start date, an interval time, and 400 values per data series using a GROUP BY query.  The chart renders well when I have sufficient data samples per GROUP BY duration, but when I have fewer, I am providing 400 values of step data and Highcharts renders it as if it was actually measured that way.  One option is to provide only the values and timestamps that I have and let Highcharts use its own algorithm to smooth out the curve, but I would rather have the server perform this task.

And you can't tell it "no, these aren't discrete values, please give me a trendline"? Surely there's some sort of options to override what it's trying to determine automatically.


2 hours ago, requinix said:

Yeah, looks like I was just being pedantic.

Nice word.  Had to look it up :)

2 hours ago, requinix said:

Your question is "I have values measured inconsistently over time and I want to group similar values together and I'm currently doing some math on them".
Raises questions like "what values?" and "why are they inconsistent?" and "does it make sense to group/average them like you've been doing, and if so why and how?" and "are you trying to interpolate data or render a chart or what?"

  • What values?  Environmental parameters such as temperature, humidity, pressure, power consumption, etc.
  • Why are they inconsistent?  Only in that the actual measurements are not taken at a common instant in time unless by chance.
  • Does it make sense to group/average them like you've been doing, and if so, why and how?  For a query over a longer duration of time, I might have over a million samples and need to group them for obvious reasons; for these cases I am using a GROUP BY and it works fine.  My need is to come up with a strategy for short durations where I don't have multiple samples per group interval.
  • Are you trying to interpolate data or render a chart or what?  Maybe, yes, and yes.  Render charts and/or provide the data in CSV format, and maybe interpolate data as a means to an end.

 

2 hours ago, requinix said:

And you can't tell it "no, these aren't discrete values, please give me a trendline"? Surely there's some sort of options to override what it's trying to determine automatically.

I don't believe so.  I can ask for the actual values with timestamps, or ask for them grouped by some time duration.  When grouping, I can ask for empty groups to be filled with NULL, some given value, the previous value, or a linearly interpolated value.  For short durations, I will have many occurrences of empty groups, and none of these group-by strategies seems to meet my needs; since I will also not have much data to process, I believe I will be better off just requesting the actual values and having PHP deal with making them presentable.


8 minutes ago, NotionCommotion said:
  • Why are they inconsistent?  Only in that the actual measurements are not taken at a common instant in time unless by chance.

They aren't even consistent with themselves. I would expect a sensor to report values on a regular basis, even if not at the same rate as another sensor, and yet, for the Value1 sensor, you have a value at midnight, one at 2:13, one at 3:14, and then nothing until 7:30?

 

8 minutes ago, NotionCommotion said:
  • Does it make sense to group/average them like you've been doing, and if so, why and how?  For a query over a longer duration of time, I might have over a million samples and need to group them for obvious reasons; for these cases I am using a GROUP BY and it works fine.  My need is to come up with a strategy for short durations where I don't have multiple samples per group interval.

Too little data and all you can do is guess...

Now, if you want to get really smart, you might be able to pool more data than you think. For example, if daily values tend to be similar then you can include the previous and next days' data in calculations. Expand out to a week or more both ways and you can derive a trendline for your trendline...


7 hours ago, requinix said:

They aren't even consistent with themselves. I would expect a sensor to report values on a regular basis, even if not at the same rate as another sensor, and yet, for the Value1 sensor, you have a value at midnight, one at 2:13, one at 3:14, and then nothing until 7:30?

That was randomly generated data; I was just trying to come up with an approach to make it look "nice".  Below is some raw and grouped data.  For this raw data subset, there aren't any rows with only a single measurement, but that is not always the case.

 

[image: raw data]

 

[image: data grouped by interval]

 

7 hours ago, requinix said:

Too little data and all you can do is guess...

Now, if you want to get really smart, you might be able to pool more data than you think. For example, if daily values tend to be similar then you can include the previous and next days' data in calculations. Expand out to a week or more both ways and you can derive a trendline for your trendline...

Really, I just want the bottom chart to look prettier.  What you are looking at is grouping into 400 samples, but it shows up as 20 horizontal 3-minute steps.  Each of these 3-minute steps spans 20 groups (20 x 20 = 400), but they are the same value because I am filling empty groups with the previous value.  I "could" fill linearly instead, but it doesn't look much better.   Highcharts can deal with irregular timestamps, but I would rather output common timestamps with data for all series, as it makes for a better UX when downloaded to CSV and is more chart-software agnostic.

[chart image]

[chart image]

 


Oh, dude, that makes so much more sense to work with. Really should have started off with that.

But first: you are combining rows and not columns, right? P49s get math-ed together, P51s, P55s, and P60s too, but you aren't mathing more than one of those together (as far as this thread is concerned), right?

Seems to me you should first try to focus on merging those rows at creation time rather than later after the fact. Given that the sensors seem to poll every 3 minutes, how about the code storing this data first tries to UPDATE an existing row created within the last minute (and only INSERTs if not)?
13:00:22: P51 and P60 happen together, no rows within the last minute, create new row as [null, 4.47, null, 16.43]
13:00:25: P49 and P55 happen together, existing row within the last minute, update row to be [0.23, 4.47, 4.75, 16.43]
13:03:22: P51 and P60 happen, no rows within the last minute, create new row...

That "one minute" can be any window you want as long as (a) it's consistently shorter than the time between every sensor's reports and (b) it's longer than the time between "sets" of sensor reports.
If the window is too long and violates (a) then the data is still good but you're losing some values because newer ones are overwriting. If the window is too short and violates (b) then you get holes in your data, and though too many holes will break the chart, that's not likely to happen and besides it isn't a critical problem if it does.

You do lose the timestamp of the second set of data, but you could fix that by also tracking timestamps of each sensor value reported. So that's five: creation time plus one for each sensor.
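
A rough sketch of that write path, assuming the data lived in SQL: a hypothetical readings table (created_at DATETIME plus one nullable column per sensor) and a PDO handle. All names are invented, and the UPDATE ... ORDER BY ... LIMIT syntax is MySQL-specific.

<?php
// Sketch only: fold a new sensor value into a row created within the last
// minute, inserting a fresh row only when none is recent enough.
function storeReading(PDO $pdo, string $column, float $value): void
{
    // Whitelist the column name since identifiers can't be bound parameters.
    if (!in_array($column, ['p49', 'p51', 'p55', 'p60'], true)) {
        throw new InvalidArgumentException("Unknown sensor column: $column");
    }
    $update = $pdo->prepare(
        "UPDATE readings SET $column = ?
         WHERE created_at >= NOW() - INTERVAL 1 MINUTE
         ORDER BY created_at DESC
         LIMIT 1"
    );
    $update->execute([$value]);
    // Caveat: MySQL reports 0 affected rows when the new value equals the
    // old one, which would trigger a redundant INSERT; fine for a sketch.
    if ($update->rowCount() === 0) {
        $pdo->prepare("INSERT INTO readings (created_at, $column) VALUES (NOW(), ?)")
            ->execute([$value]);
    }
}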


10 minutes ago, requinix said:

Oh, dude, that makes so much more sense to work with. Really should have started off with that.

But first: you are combining rows and not columns, right? P49s get math-ed together, P51s, P55s, and P60s too, but you aren't mathing more than one of those together (as far as this thread is concerned), right?

Seems to me you should first try to focus on merging those rows at creation time rather than later after the fact. Given that the sensors seem to poll every 3 minutes, how about the code storing this data first tries to UPDATE an existing row created within the last minute (and only INSERTs if not)?
13:00:22: P51 and P60 happen together, no rows within the last minute, create new row as [null, 4.47, null, 16.43]
13:00:25: P49 and P55 happen together, existing row within the last minute, update row to be [0.23, 4.47, 4.75, 16.43]
13:03:22: P51 and P60 happen, no rows within the last minute, create new row...

That "one minute" can be any window you want as long as (a) it's consistently shorter than the time between every sensor's reports and (b) it's longer than the time between "sets" of sensor reports.
If the window is too long and violates (a) then the data is still good but you're losing some values because newer ones are overwriting. If the window is too short and violates (b) then you get holes in your data, and though too many holes will break the chart, that's not likely to happen and besides it isn't a critical problem if it does.

You do lose the timestamp of the second set of data, but you could fix that by also tracking timestamps of each sensor value reported. So that's five: creation time plus one for each sensor.

The P51, P55, etc. are physical points.  I also have virtual points, which combine the columns (not rows) of multiple physical points.  Note that the data isn't stored in SQL and doesn't have the typical structure.  There is also some use case for doing math on rows, such as finding the difference between two moments in time, but I don't think it is relevant to this topic.

The data I just grabbed happened to be at about 3-minute increments.  Values can be updated two ways.  One is a poll rate, where the user configures it to poll the meters every x minutes (they just happened to be set up for 3 minutes for these four points).  The other is change of value, where the meter broadcasts (UDP) to everyone when a value changes by a specified threshold.  Change of value is often preferred, as multiple clients can silently listen and obtain the data without creating extra network traffic.

I have considered buffering the data in order to consolidate writes.  I have some cases where samples are less than a second apart, which I believe provides little value and is a bit of a headache.

The following chart is just for a single physical point, but instead of rendering as steps, I wanted it to be a beautiful curve like the one I tried to draw.  Highcharts and others can accept just the timestamp and value for the various data points highlighted in yellow and do this; however, I don't know how to deal with virtual points, which combine multiple real points with different timestamps, and it is also not ideal when downloading as CSV.  I thought there would be some "simple" PHP approach to derive a time-based formula from the real data points so I could then combine the values of all the series at each sequential time interval.

[chart image: single physical point rendered as steps, with a hand-drawn smooth curve]

 

Sorry, always in hindsight, I realize I should have provided more detail early on.  I recognize it is impossible to provide good advice without the details, but I also don't want to overwhelm others with too many of them.


Let's see...

So obviously you can't go polynomial for this. 21 data points means a 20th-order polynomial, and that's only 1 hour's worth of data. Also can't go sinusoidal because that's essentially a wavy line on top of a polynomial.
You're getting into math that's beyond my capabilities. In essence, I think you need to fit a curve to a handful of points within a moving window (say, a window of 7 points and a best-fit 3rd-order polynomial) in a way that the curves smooth out.

You're going to want a statistics library to do this.
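
Lacking a library, that moving-window fit is still doable by hand, since a low-order polynomial is linear in its coefficients: each window is a tiny least-squares solve. A sketch (quadratic fit, 7 nearest points, all names illustrative) via normal equations and Gaussian elimination:

<?php
// Sketch only: evaluate a smoothed value at $ts by least-squares fitting a
// quadratic to the $window samples nearest in time. $times/$vals are
// parallel arrays; at least 3 distinct sample times are required.
function movingPolyFit(array $times, array $vals, int $ts, int $window = 7): float
{
    // Pick the $window samples closest in time to $ts.
    $idx = array_keys($times);
    usort($idx, fn ($a, $b) => abs($times[$a] - $ts) <=> abs($times[$b] - $ts));
    $idx = array_slice($idx, 0, $window);

    // Normal equations A'A c = A'y for f(x) = c0 + c1*x + c2*x^2, with x
    // centred on $ts (in hours) so the numbers stay well-conditioned.
    $n = 3;
    $M = array_fill(0, $n, array_fill(0, $n + 1, 0.0));  // augmented matrix
    foreach ($idx as $i) {
        $x = ($times[$i] - $ts) / 3600;
        $row = [1.0, $x, $x * $x];
        for ($r = 0; $r < $n; $r++) {
            for ($c = 0; $c < $n; $c++) {
                $M[$r][$c] += $row[$r] * $row[$c];
            }
            $M[$r][$n] += $row[$r] * $vals[$i];
        }
    }

    // Gaussian elimination with partial pivoting, then back-substitution.
    for ($p = 0; $p < $n; $p++) {
        $max = $p;
        for ($r = $p + 1; $r < $n; $r++) {
            if (abs($M[$r][$p]) > abs($M[$max][$p])) {
                $max = $r;
            }
        }
        [$M[$p], $M[$max]] = [$M[$max], $M[$p]];
        for ($r = $p + 1; $r < $n; $r++) {
            $f = $M[$r][$p] / $M[$p][$p];
            for ($c = $p; $c <= $n; $c++) {
                $M[$r][$c] -= $f * $M[$p][$c];
            }
        }
    }
    $coef = array_fill(0, $n, 0.0);
    for ($r = $n - 1; $r >= 0; $r--) {
        $coef[$r] = $M[$r][$n];
        for ($k = $r + 1; $k < $n; $k++) {
            $coef[$r] -= $M[$r][$k] * $coef[$k];
        }
        $coef[$r] /= $M[$r][$r];
    }
    return $coef[0];    // fitted value at x = 0, i.e. at $ts itself
}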


I am just amazed that I can't find an existing solution.  It is the stupid irregularly spaced timestamps which really limit what is available.  A couple of resources I found are below.

  1. https://traces.readthedocs.io/en/latest/, https://datascopeanalytics.com/blog/unevenly-spaced-time-series/, and https://github.com/datascopeanalytics/traces are right on topic.  I resisted going through the code because it is written in Python and I thought I'd find something in PHP; however, this is a strong contender.
  2. https://www.php.net/manual/en/book.trader.php, and particularly the moving averages and similar functions, especially https://www.php.net/manual/en/function.trader-mavp.php, are interesting.  Very poorly documented, though, and I don't think really geared to irregular sample times.
  3. https://php-ml.readthedocs.io/en/latest/ has some good stuff, but doesn't seem to have what I am looking for.
  4. http://195.134.76.37/applets/AppletSmooth/Appl_Smooth2.html and https://en.wikipedia.org/wiki/Savitzky–Golay_filter might be relevant.  It is only for evenly spaced sample points; however, it may be possible to adapt it by coming up with a moving derivative and multiplying it by each group's duration, and I might give it a try (a related resample-then-filter sketch follows this list).
  5. https://en.wikipedia.org/wiki/Unevenly_spaced_time_series and maybe https://en.wikipedia.org/wiki/Curve_fitting and https://en.m.wikipedia.org/wiki/Smoothing are relevant.
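
Regarding item 4: since Savitzky–Golay assumes evenly spaced points, one pragmatic adaptation (simpler than the moving-derivative idea above) is to resample onto an even grid first, e.g. with linear interpolation, and then apply the filter. The standard 5-point quadratic/cubic smoothing kernel is (-3, 12, 17, 12, -3)/35:

<?php
// Sketch only: 5-point Savitzky-Golay smoothing applied to a series that
// has already been resampled onto an even grid. The two points at each
// end are left unsmoothed; $y is a 0-indexed array of values.
function savitzkyGolay5(array $y): array
{
    $kernel = [-3, 12, 17, 12, -3];
    $out = $y;
    $n = count($y);
    for ($i = 2; $i < $n - 2; $i++) {
        $acc = 0.0;
        foreach ($kernel as $k => $w) {
            $acc += $w * $y[$i + $k - 2];
        }
        $out[$i] = $acc / 35;
    }
    return $out;
}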
     

1 hour ago, Barand said:

So I'm guessing either that's a load of bovine scatology or it simplifies down to the [formula image] that you used in the table you posted.

Exactly!

By the way, good point that one wouldn't add temperature + humidity + pressure + power, but one might add f(temperature, humidity, pressure) + power.  That being said, I've only been doing so with power and energy, and not for temperature, humidity, or pressure.

