Batch processing and immutable objects


NotionCommotion

I have some queries which can take over a second to execute.  It is likely that there will be multiple queries per HTTP request and often several can be queried at once with no appreciable additional processing time.

To implement this, I have created a BatchQuery class whose constructor accepts an array of RequestObjects.  The BatchQuery's constructor will create a QueryObject for each RequestObject unless a QueryObject has already been created for a previous RequestObject which meets the latest RequestObject's needs.  These QueryObjects are then grouped as applicable and the database is queried.

My question regards how to return the data to the application.  I have two ideas:

  1. Utilize an SplObjectStorage map within BatchQuery which uses each RequestObject as a key and the applicable QueryObject as its value.  The application can then retrieve the data using $batchQuery->getData($requestObject), and BatchQuery::getData() will return $this->splObjectStorage[$requestObject]->getData().
  2. Create a public method within RequestObject which will set that object's QueryObject.  The application can then retrieve the data using $requestObject->getData(), and RequestObject::getData() will return $this->queryObject->getData().  My only concern with this approach is that RequestObject is then not immutable, though I am not certain whether this will lead to any issues.

Any recommendations on whether one approach is better than the other, or whether I should do something different altogether?  Thank you.
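
For illustration, here is a minimal sketch of the first idea. RequestObject, QueryObject and BatchQuery are the names used above, but their internals are not shown in this thread, so the string key, the placeholder rows and the getQuery() helper are assumptions made only to keep the example runnable.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in: a request is reduced to a string key describing what it needs.
final class RequestObject
{
    private string $key;

    public function __construct(string $key)
    {
        $this->key = $key;
    }

    public function getKey(): string
    {
        return $this->key;
    }
}

// Hypothetical stand-in: a real QueryObject would hold rows fetched from the database.
final class QueryObject
{
    private string $key;

    public function __construct(string $key)
    {
        $this->key = $key;
    }

    public function getData(): array
    {
        return ['rows for ' . $this->key];
    }
}

final class BatchQuery
{
    private SplObjectStorage $map;

    public function __construct(RequestObject ...$requests)
    {
        $this->map = new SplObjectStorage();
        $queries = []; // one QueryObject per distinct key
        foreach ($requests as $request) {
            $key = $request->getKey();
            // Reuse an existing QueryObject when an earlier request had the same needs.
            $queries[$key] ??= new QueryObject($key);
            $this->map[$request] = $queries[$key];
        }
    }

    public function getQuery(RequestObject $request): QueryObject
    {
        return $this->map[$request];
    }

    public function getData(RequestObject $request): array
    {
        return $this->getQuery($request)->getData();
    }
}

$a = new RequestObject('1,2,3');
$b = new RequestObject('3,2,1');
$c = new RequestObject('1,2,3');
$batchQuery = new BatchQuery($a, $b, $c);
```

Here $a and $c are distinct keys in the map but share a single QueryObject, while the request objects themselves are never modified.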


Think about #1. You're using the requests as keys and the queries as values, right? Well, which of those two is going to be unique and which may be duplicated? You should have the query as the key and the set of requests as the value.

But a generic SplObjectStorage isn't the best data structure. Please try to look at what else SPL offers. There is one particular design that will be more appropriate here. Consider the possibility that you don't actually need key/value pairs.


1 hour ago, requinix said:

Think about #1. You're using the requests as keys and the queries as values, right? Well, which of those two is going to be unique and which may be duplicated? You should have the query as the key and the set of requests as the value.

Wouldn't $request[0] and $request[2] be unique?  Also, I was first thinking of a set, but what does it gain me?

$request[0]=new Request(1,2,3);
$request[1]=new Request(3,2,1);
$request[2]=new Request(1,2,3);
$requestCollection=new RequestCollection(...$request);

class RequestCollection
{
    private $storage;
    
    public function __construct(Request ...$requests) {
        $this->storage=new SplObjectStorage();
        foreach($requests as $request) {
            $this->storage[$request]=$this->getQueryObject($request);
        }
    }

    public function getQueryObject(Request $request):QueryObject {
        //if a QueryObject exists which meets the given Request requirements, return it, else return a new QueryObject.
    }
}
1 hour ago, requinix said:

But a generic SplObjectStorage isn't the best data structure. Please try to look at what else SPL offers. There is one particular design that will be more appropriate here. Consider the possibility that you don't actually need key/value pairs.

A little hint please?

 

Lastly, do you consider my second solution bad practice?  If so, is the reason that it is difficult to troubleshoot because an array of $requests are initially at one state, then it gets inserted into some class and afterwards is at another state?  If bad practice but not for this reason, why?


33 minutes ago, NotionCommotion said:

Wouldn't $request[0] and $request[2] be unique?

...yes. You were saying that one request can have multiple queries and that multiple requests may want the same query, right?

You're going to be executing queries and then passing the result to the request(s) that wanted it. Thus the uniqueness is on the query, not the request.

Quote

Also, I was first thinking of a set, but what does it gain me?

Riddle me this: in what order do you want to execute these queries?

Quote

Lastly, do you consider my second solution bad practice?

I wouldn't say it's a bad practice, just that it's not the best/right way to do it. It's also kinda a different answer to a different question than the first solution.

Why persist the query object? I would think that you should use the object as a vehicle for executing something and passing the results to whatever needs them, so once that process is over you no longer need it. The data goes directly to the request.
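
A rough sketch of that idea, with the query as the key, the set of requests as the value, and the results handed straight to the requests. DataRequest, runBatch(), the signature string and the $execute callable are all hypothetical names invented for this example.

```php
<?php
declare(strict_types=1);

// Hypothetical request: the signature string stands in for "what this request needs".
final class DataRequest
{
    private string $signature;
    private ?array $data = null;

    public function __construct(string $signature)
    {
        $this->signature = $signature;
    }

    public function getSignature(): string
    {
        return $this->signature;
    }

    public function receive(array $data): void
    {
        $this->data = $data;
    }

    public function getData(): ?array
    {
        return $this->data;
    }
}

// Returns the number of queries actually executed.
function runBatch(callable $execute, DataRequest ...$requests): int
{
    // Query signature => list of requests waiting on that query.
    $groups = [];
    foreach ($requests as $request) {
        $groups[$request->getSignature()][] = $request;
    }

    // Run each distinct query once and pass the result to every waiting request.
    foreach ($groups as $signature => $waiting) {
        $rows = $execute((string) $signature);
        foreach ($waiting as $request) {
            $request->receive($rows);
        }
    }

    return count($groups);
}

// Stand-in for the real database call; it records which queries were run.
$executed = [];
$execute = function (string $signature) use (&$executed): array {
    $executed[] = $signature;
    return ['result of ' . $signature];
};

$r0 = new DataRequest('1,2,3');
$r1 = new DataRequest('3,2,1');
$r2 = new DataRequest('1,2,3');
$queryCount = runBatch($execute, $r0, $r1, $r2);
```

Nothing resembling a query object outlives runBatch(); each request just ends up holding its own data.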

Quote

If so, is the reason that it is difficult to troubleshoot because an array of $requests are initially at one state, then it gets inserted into some class and afterwards is at another state?

What?


Maybe my use of the word "request" brings confusion.  It is not meant to be an HTTP request, but a request for data required for part of the HTTP request.

There is a required order to execute these queries, but way outside of the scope of this question.  There are certain types of requests where the results from one query are subtracted from those of another, so I wish to do the ones which are not acted on by another first.

I expect I am going to totally lose you, but below is a higher level implementation.

abstract class BigObject implements BigObjectInterface
{
    protected $requests=[], $prop1, $prop2, $prop3;

    public function getRequests() {
        return $this->requests;
    }

    public function display() {
        $data=[];
        foreach($this->requests as $request) {
            $data[]=$request->getData();    //append each request's data instead of overwriting $data
        }
        return ['prop1'=>$this->prop1,'prop2'=>$this->prop2,'prop3'=>$this->prop3,'data'=>$data,];
    }
}

class SomeBigObject extends BigObject
{
    public function __construct(array $userData) {
        foreach($this->getStuffFromDB($userData) as $stuff) {
            //Do things unique for SomeBigObject and populate $prop1, $prop2, $prop3
            $this->requests[]=new Request($stuff, $userData);
        }
    }
}
class SomeOtherBigObject extends BigObject
{
    public function __construct(array $userData) {
        foreach($this->getStuffFromDB($userData) as $stuff) {
            //Do things unique for SomeOtherBigObject and populate $prop1, $prop2, $prop3
            $this->requests[]=new Request($stuff, $userData);
        }
    }
}

$app->get(_VER_.'/bigQuery', function (Request $request, Response $response) {
    $params=$request->getQueryParams();
    $bigObjects=[];
    foreach($params['someBigObjectData'] as $bigObjectData) {
        $bigObjects[$bigObjectData['id']]=new SomeBigObject($bigObjectData);
    }
    foreach($params['someOtherBigObjectData'] as $bigObjectData) {
        $bigObjects[$bigObjectData['id']]=new SomeOtherBigObject($bigObjectData);
    }
    $bigObjectCollection=new BigObjectCollection(...$bigObjects);

    $requests=$bigObjectCollection->getRequests();
    $requestCollection=new RequestCollection(...$requests);
    $requestCollection->crunchData();

    return $this->responder->bigObjects($response, $bigObjects);    //Responder to execute BigObject::display()
});


 


Don't worry, I understood the request thing.

1 hour ago, NotionCommotion said:

There is a required order to execute these queries, but way outside of the scope of this question.  There are certain types of requests where the results from one query are subtracted from those of another, so I wish to do the ones which are not acted on by another first.

I was thinking along the lines of wanting to execute queries either (a) in the order they arrive in, as in request #1's queries should probably begin being executed before request #2's if that's not done concurrently, or (b) according to an undescribed priority system, like where one client/request may have a higher priority than another based on something, or that certain queries are innately more important than others.

The required order thing turns out to be relevant. I was suspecting you wanted first-in-first-out, but apparently that's not quite the case.

The data structure I first had in mind was a queue (i.e., an array). Push query/request pairs into it when they come in, pop them out when you need to execute the next one. You would execute the query and notify the "request" of the result. AFAIK it's not part of SplObjectStorage's spec to maintain item insertion order, so that wouldn't be appropriate. However, maybe you need to arrange queries not just by FIFO but by some weighted combination of FIFO and query complexity. That suggests a priority queue (SplPriorityQueue), where it's mostly a queue except higher-priority items go to the front of the line, with the added wrinkle of needing to support dependencies.
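
As a sketch of the priority-queue idea: SplPriorityQueue hands out the highest priority first but makes no promise about the order of equal priorities, so the example below pairs each priority with a decreasing serial number to get FIFO behaviour within a priority level. FifoPriorityQueue, enqueue() and the priority values are hypothetical, and giving dependent queries a lower priority is just one assumed way to approximate the dependency requirement.

```php
<?php
declare(strict_types=1);

// Hypothetical wrapper around SplPriorityQueue that keeps FIFO order among equal priorities.
final class FifoPriorityQueue extends SplPriorityQueue
{
    // Counts down so that, among equal priorities, earlier inserts compare higher.
    private int $serial = PHP_INT_MAX;

    public function enqueue($job, int $priority): void
    {
        // Array priorities are compared element by element: priority first, then serial.
        $this->insert($job, [$priority, $this->serial--]);
    }
}

$queue = new FifoPriorityQueue();
$queue->enqueue('query A minus query B', 1); // assumed to depend on the other two, so it runs last
$queue->enqueue('query A', 2);
$queue->enqueue('query B', 2);

$order = [];
while (!$queue->isEmpty()) {
    $order[] = $queue->extract();
}
// $order is now: query A, query B, query A minus query B
```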

So I'll stop my reply there with more questions:
1. When one of those "certain types of requests" comes in, how are the queries and the subtraction managed? Does the request actually say to do a couple queries and a subtraction, or does it say to do one query and the system recognizes that as meaning multiple queries and some processing?
2. Any particular reason the easy queries need to come first? Get too many and they'll crowd out the complicated ones. Surely the fairest treatment would be to execute them more-or-less as they come in?
3. Say a request for query A minus query B comes in. Are queries A and/or B reasonable queries to run on their own? Could another request come in just for query A, and those results be used for the subtraction?
4. You've mentioned reusing existing queries. Does that imply caching? Or is it really a question of coincidence where multiple requests just happen to arrive in the same short time frame and involving some of the same queries?

 

Off-topic, but

$app->get(_VER_.'/bigQuery',

don't use the API version number like that. If you change the VER then this code will no longer work for the previous version but instead will suddenly work for the new version. API versioning needs to provide backwards compatibility.

If you want to support versioning well, either
(a) Send (HTTP) requests through a versioning layer first, which identifies the version number and endpoint and routes the request to the proper handler. New revisions to the API come with updates to the version layer. This is a bit complicated and probably not good to use here.
(b) An easier variation is to program endpoints into this versioning layer, as in something like $versioning->get(1, 'bigQuery', function...), then have it populate $app. It would know the highest version number and automatically fill in the gaps between versions. I can explain this in more detail - in fact I could just write the whole thing out.
(c) Hardcode the version number for each endpoint. New revisions to the API entail copy/pasting the unchanged endpoints with the incremented version number. Duplicating code here is okay because old versions must be immutable - no maintenance nightmares. Obviously you'd want to move as much code outside the endpoint callback itself as you could to reduce the sheer amount of duplication.
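
To make (b) a little more concrete, here is a minimal sketch. VersionedRoutes and its resolve() method are hypothetical and not part of any framework; the /v{n}/ path format is also an assumption. It only works out which handler serves which versioned path, and the result would then be registered with the router, for example by calling $app->get($path, $handler) for each entry.

```php
<?php
declare(strict_types=1);

// Hypothetical versioning layer; not part of Slim or any other framework.
final class VersionedRoutes
{
    /** @var array<string, array<int, callable>> endpoint => [version => handler] */
    private array $handlers = [];
    private int $latest = 1;

    public function get(int $version, string $endpoint, callable $handler): void
    {
        $this->handlers[$endpoint][$version] = $handler;
        $this->latest = max($this->latest, $version);
    }

    /**
     * Returns path => handler for every version from an endpoint's first
     * version up to the latest known version. Versions without their own
     * handler reuse the most recent earlier one.
     */
    public function resolve(): array
    {
        $routes = [];
        foreach ($this->handlers as $endpoint => $byVersion) {
            $current = null;
            for ($v = min(array_keys($byVersion)); $v <= $this->latest; $v++) {
                $current = $byVersion[$v] ?? $current;
                $routes["/v{$v}/{$endpoint}"] = $current;
            }
        }
        return $routes;
    }
}

$versioning = new VersionedRoutes();
$versioning->get(1, 'bigQuery', fn() => 'bigQuery v1');
$versioning->get(3, 'bigQuery', fn() => 'bigQuery v3');
$versioning->get(2, 'status', fn() => 'status v2');

$routes = $versioning->resolve();
```

With these registrations, /v2/bigQuery falls back to the version 1 handler, and /v1/status does not exist because the endpoint was introduced in version 2.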

 


The queries are acting on environmental data which arrives asynchronously approximately every 15 seconds, and being asynchronous, has no common timestamps.  A typical query might be to retrieve the min, max, and average value of 20 parameters in 15 minute increments over the past year as well as the difference between this year’s values and last year’s values for each of these aggregate functions.  The database experiences little performance impact retrieving multiple parameters for a given WHERE/GROUP BY clause, and thus I want to do as much work as possible for each of these boundaries.

My optimization strategy does the following:

  1. Sort parameter requests by end date.  This really isn’t a big deal and just allows the data to be subtracted without needing a separate loop.
  2. Group the parameter requests based on several variables that determine the WHERE/GROUP BY clause.
  3. Query all applicable parameters for each of these groups in a single query, and if it happens to be one where the difference between this year and last year is required, subtract the two.
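
The first two steps above could be sketched roughly as follows. The field names, the boundaryKey() format and the sort direction are assumptions, since the real ParameterRequest class is not shown; the point is only that requests sharing a WHERE/GROUP BY boundary end up in one group whose columns can be fetched in a single query.

```php
<?php
declare(strict_types=1);

// Hypothetical value object; field names are guesses based on the description in this post.
final class ParameterRequest
{
    public string $parameter;
    public string $aggregate;
    public int $sampleSeconds;
    public int $durationSeconds;
    public int $endOffsetSeconds;

    public function __construct(
        string $parameter,
        string $aggregate,
        int $sampleSeconds,
        int $durationSeconds,
        int $endOffsetSeconds = 0
    ) {
        $this->parameter = $parameter;
        $this->aggregate = $aggregate;
        $this->sampleSeconds = $sampleSeconds;
        $this->durationSeconds = $durationSeconds;
        $this->endOffsetSeconds = $endOffsetSeconds;
    }

    // Requests with the same key can share one WHERE/GROUP BY clause.
    public function boundaryKey(): string
    {
        return "{$this->sampleSeconds}|{$this->durationSeconds}|{$this->endOffsetSeconds}";
    }
}

// Returns boundary key => list of aggregate expressions to select in one query.
function groupByBoundary(ParameterRequest ...$requests): array
{
    // Step 1: sort by end offset (assumed ascending), with tie-breakers so the order is deterministic.
    usort($requests, fn(ParameterRequest $a, ParameterRequest $b) =>
        [$a->endOffsetSeconds, $a->parameter, $a->aggregate]
        <=> [$b->endOffsetSeconds, $b->parameter, $b->aggregate]);

    // Step 2: group by the variables that determine the WHERE/GROUP BY clause.
    $groups = [];
    foreach ($requests as $request) {
        $groups[$request->boundaryKey()][] = "{$request->aggregate}({$request->parameter})";
    }
    return $groups;
}

$year = 31536000; // seconds in a 365-day year
$groups = groupByBoundary(
    new ParameterRequest('temp', 'AVG', 900, $year),
    new ParameterRequest('humidity', 'MAX', 900, $year),
    new ParameterRequest('temp', 'AVG', 900, $year, $year) // same window, one year earlier
);
```

In this example the two current-year requests collapse into one group, and last year's request forms a second group whose results could then be subtracted from the first.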

In an attempt to keep things sane, my class which does this accepts an array of ParameterRequest objects which specify the parameter, aggregate function, sample size, time duration, time offset from NOW() for the end time, and optional time offset from the end time to subtract the two values.  The class then creates new Group objects which govern the WHERE/GROUP BY boundaries as well as ArchiveQuery objects which take into account both the Group and the ParameterRequest objects.

It then crunches the data, and when all is complete, something that holds one or more ParameterRequest objects must be able to access the results.

Back to my initial post, to do so I can either:

  1. Utilize an SplObjectStorage to map a ParameterRequest to an ArchiveQuery.
  2. Put the applicable ArchiveQuery object in each ParameterRequest object so it may access the results when needed.

Was trying not to get too far into the weeds, but as is typical, that was not really possible.

 

3 hours ago, requinix said:

Off-topic, but

This is actually on my list to think through and come up with a better approach.  I like your second approach, (b).  Let me give it a little thought and pose questions later if necessary.

Thanks!


So does that mean each Request is for a single Query? It sounded like a Request could have one or more Queries.

I think... that I would do the second option. Both solutions need a store for the ArchiveQueries that can be searched, and presumably one specialized for finding them, which means the most significant difference between the solutions is that one needs an SplObjectStorage and the other does not. Reducing complexity would be good.
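
A minimal sketch of the second option, assuming a write-once setter as one way to limit the mutability concern raised in the first post. The ArchiveQuery constructor, setQuery() and the exception messages are hypothetical; only the class names come from this thread.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for ArchiveQuery; the real class is not shown in this thread.
final class ArchiveQuery
{
    private array $results;

    public function __construct(array $results)
    {
        $this->results = $results;
    }

    public function getData(): array
    {
        return $this->results;
    }
}

final class ParameterRequest
{
    private ?ArchiveQuery $query = null;

    // Write-once setter: the request can change state only until its query is attached.
    public function setQuery(ArchiveQuery $query): void
    {
        if ($this->query !== null) {
            throw new LogicException('ArchiveQuery has already been set.');
        }
        $this->query = $query;
    }

    public function getData(): array
    {
        if ($this->query === null) {
            throw new LogicException('No ArchiveQuery has been attached yet.');
        }
        return $this->query->getData();
    }
}

$request = new ParameterRequest();
$request->setQuery(new ArchiveQuery(['avg' => 21.5]));
$data = $request->getData();
```

Whether the guard is worth having depends on how the requests are passed around; it simply turns an accidental second assignment into an immediate error.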
