Data Access Platform Query API - Resizing Logic


BobODell
Instructure


Hello Canvas Data consumers! The Data & Insights team is continuing to look at how we can improve the experience when using the Data Access Platform (DAP) Query API directly and we'd like to get your input.

Today, when you query the DAP API, a post-processing step in the querying layer kicks in if the result files are too large or too small. This step works as follows:

  • The result files from the query are analysed to determine whether they need repartitioning. This analysis returns true if either of the conditions below is met:

    • any file is larger than 500 MB

    • more than 40% of the result files are considered small (< 30 MB)

  • If either condition holds, the data is repartitioned with a target of roughly 128 MB per file (not guaranteed): files larger than 500 MB are split into multiple files, and files smaller than 30 MB are combined into larger ones. A sketch of this decision logic follows the list.
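
For clarity, here is a minimal sketch of that decision in Python. The thresholds are the ones quoted above; the constant and function names are illustrative, not the actual DAP implementation:

    # Thresholds quoted above; constants and function are illustrative only.
    LARGE_FILE_BYTES = 500 * 1024 * 1024   # any file above 500 MB triggers a split
    SMALL_FILE_BYTES = 30 * 1024 * 1024    # files below 30 MB count as "small"
    SMALL_FILE_RATIO = 0.40                # repartition if >40% of files are small
    TARGET_FILE_BYTES = 128 * 1024 * 1024  # best-effort target of ~128 MB per file

    def needs_repartitioning(file_sizes: list[int]) -> bool:
        """Return True if the query result files should be repartitioned."""
        if not file_sizes:
            return False
        if any(size > LARGE_FILE_BYTES for size in file_sizes):
            return True
        small = sum(1 for size in file_sizes if size < SMALL_FILE_BYTES)
        return small / len(file_sizes) > SMALL_FILE_RATIO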

This processing is "on" for all queries and can introduce unnecessary delay before the result files are returned to you.

So our questions:

  1. Is this a feature that you are relying on today? If so, what are the use cases where you care whether a file is larger than 500 MB or whether there are multiple files smaller than 30 MB?
  2. Do you need this resizing logic to happen directly in the DAP Query API, or could it be handled more efficiently in your own code and services calling the API?
  3. Would an acceptable solution be to make this an optional parameter that you could set on specific API calls, rather than taking the performance hit on all API calls?

We'd love to hear from you so please let us know your thoughts!


10 Comments
jwals
Community Participant

Hi @BobODell ,

We have a daily job that refreshes the 90 tables in the canvas namespace, stored in a PostgreSQL database. This job typically takes under ten minutes. We also have a weekly job that downloads the files for the web_logs table in the canvas_logs namespace and stores them in an AWS S3 bucket. This job frequently takes several hours, the vast majority of which is spent on Instructure's end (moving the files from Instructure's S3 to ours typically takes under a minute). It seems likely that this is because of the resizing logic you have outlined here.

In answer to your questions, then:

1. We are not relying on this feature (we did not explicitly know that the resizing step was happening).

2. Since we aren't relying on it, we don't need Instructure to do it directly. On the other hand, we don't have any resizing logic in our own code, so we can't do it more efficiently ourselves.

3. Yes, this would be an acceptable solution to us.

ColinMurtaugh
Community Champion

We're in a similar boat -- this is the first I'm hearing about this resizing logic, and TBH I don't really care how big the files are as long as we can download them. I would be in favor of a parameter to control whether or not this logic is used.

I'd also be curious to know what the real-world effects of removing the logic would be: for the large files, roughly how big could they be? And for the small files, what's the biggest batch that we would likely see? Not looking for exact numbers, but a ballpark could be helpful. 

--Colin

mclark19
Community Participant

Thanks for the question, @BobODell.

We have a set of daily jobs that grab data for each of the tables (including web_logs) and push it to S3. From there, we load into Redshift. For the most part, our jobs finish in an acceptable amount of time, though there have been occasions in the past where our token timed out while checking for job status. (As we have separate jobs for each table, this means that the response time for a single table has hit the 60-minute mark.) As others have said, we weren't aware of the resizing logic, so I can't say that we are relying on it per se. It is unclear how much of our wait time (in these extreme instances) is due to the actual processing of our job because of this logic and how much is just the general load on the system from all users.
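
For anyone hitting the same token timeout, here is a rough sketch of how one might refresh the token while polling. The endpoint paths and field names are my reading of the public DAP docs, so treat them as assumptions to verify:

    import time
    import requests

    BASE = "https://api-gateway.instructure.com"

    def get_token(client_id: str, client_secret: str) -> str:
        # OAuth2 client-credentials exchange; the token is short-lived (~1 hour).
        resp = requests.post(
            f"{BASE}/ids/auth/login",
            auth=(client_id, client_secret),
            data={"grant_type": "client_credentials"},
        )
        resp.raise_for_status()
        return resp.json()["access_token"]

    def wait_for_job(job_id: str, client_id: str, client_secret: str) -> dict:
        token = get_token(client_id, client_secret)
        issued = time.monotonic()
        while True:
            # Re-authenticate well before the token lifetime runs out, so a
            # long-running job doesn't fail at the 60-minute mark.
            if time.monotonic() - issued > 45 * 60:
                token = get_token(client_id, client_secret)
                issued = time.monotonic()
            resp = requests.get(
                f"{BASE}/dap/job/{job_id}",
                headers={"Authorization": f"Bearer {token}"},
            )
            resp.raise_for_status()
            job = resp.json()
            if job.get("status") in ("complete", "failed"):
                return job
            time.sleep(30)  # poll at a gentle interval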

In terms of answering your questions:

1. We aren't explicitly relying on the resizing feature currently. That said, as we download JSON format, our Redshift ingest is likely indirectly benefiting from the fact that we have multiple files of roughly the same size, which is more performant from what I understand. (And as @ColinMurtaugh said, it would be useful to know what kind of sizes we may be talking about in general terms.)

2. We don't have any resize logic on our end, and it would be a little tricky to add some, given our current process.

3. It's probably an acceptable solution to have a parameter we can set on our end.

- Martyn

BobODell
Instructure
Author

Thanks @mclark19 @ColinMurtaugh @jwals - the feedback is much appreciated and definitely helps. I think adding it as a parameter going forward is the right approach, and we'll work to clearly define and document it as we release it.
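
To make the idea concrete, here is a purely hypothetical example of what opting out might look like on a query request. The "repartition" flag is NOT a real DAP API parameter today, and the endpoint shape follows the current public docs as I understand them; the final name, placement, and default are still to be defined and documented:

    import requests

    token = "..."  # bearer token from the DAP auth endpoint

    body = {
        "format": "jsonl",
        # Hypothetical flag -- not a real DAP API parameter today. The final
        # name and default will be defined when the feature is released.
        "repartition": False,
    }
    resp = requests.post(
        "https://api-gateway.instructure.com/dap/query/canvas/table/web_logs/data",
        headers={"Authorization": f"Bearer {token}"},
        json=body,
    )
    resp.raise_for_status()
    print(resp.json())  # job descriptor to poll until the files are ready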

As for how large a file set can grow and how many files might be produced: I know this can vary widely given the Canvas data, but as we looked back through the history we actually aren't seeing files above 500 MB, and the upper range of the number of small files seems to be around 100, with a rare case exceeding that.

- Bob

pgo586
Community Contributor

I hope I'm not replying too late here, @BobODell. I have to say that I was aware that file-sizing logic existed, although not aware of the details. We've put in place an automated process that uses the API directly (and depends on whatever is being done right now) to retrieve snapshots and incrementals (saving them locally, and then onto our local Splunk data platform), and we would definitely NOT like to have to rewrite this code if incompatible API changes are made. To put it differently: as long as the current behavior stays the default if a new parameter is added, we have no issues with this. On the other hand, I would rather not have to update our code to add this parameter just to keep the current behavior. Does this make sense?

jawahar
Community Member

@BobODell 

Can you please explain a bit more about this? Are file sizes of 500 MB or more beneficial? How would the client handle fail/retry behavior when the download of a large file fails due to network or other interruptions? Are there any S3 range-GET parallel optimizations that the client can do on their side with the pre-signed URL?

pgo586
Community Contributor

While I appreciate this blog post @BobODell, in that it gives us an update on what the Canvas Data team is currently working on or considering, I would have expected to see it in a location easily accessible to institutions that directly use Canvas Data and depend on the current APIs in their workflows (like the Data and Analytics forum, perhaps?). I think it would be best to expose this to the larger audience before making any potentially 'breaking' API changes (I myself did not see it until two weeks after it was posted).

BobODell
Instructure
Author

@pgo586 Thanks for the feedback; we'll be sure to add additional tags to future posts for greater visibility.

@LeventeHunyadi - any feedback you can provide for @jawahar ?

LeventeHunyadi
Instructure
How would the client handle fail/retry behavior when the download of a large file fails due to network or other interruptions? Are there any S3 range-GET parallel optimizations that the client can do on their side with the pre-signed URL?

If you use the DAP client library, multiple attempts are made to fetch the files, and downloads are retried with exponential back-off logic, i.e. the client library waits for longer and longer intervals between successive retries.
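
If you call the API directly instead, the same behaviour is straightforward to reproduce. A minimal sketch, assuming a plain HTTP download of the pre-signed URL (the attempt count and delays are arbitrary choices):

    import time
    import requests

    def download_with_retries(url: str, dest: str, attempts: int = 5) -> None:
        for attempt in range(attempts):
            try:
                with requests.get(url, stream=True, timeout=60) as resp:
                    resp.raise_for_status()
                    with open(dest, "wb") as f:
                        for chunk in resp.iter_content(chunk_size=1 << 20):
                            f.write(chunk)
                return
            except requests.RequestException:
                if attempt == attempts - 1:
                    raise
                # Exponential back-off: wait 1s, 2s, 4s, ... between attempts.
                time.sleep(2 ** attempt)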

The resource URL returned by the DAP API is an AWS pre-signed URL to an S3 bucket object. You can issue HTTP Range requests (specifying a start and end byte range) to split the file into multiple parts and download them concurrently, or use any parallelization that AWS would normally permit for S3 bucket objects exposed via a pre-signed URL.
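
For example, here is a sketch of a parallel ranged download. The part size and worker count are arbitrary, and note that the parts are held in memory before being written:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    def fetch_range(url: str, start: int, end: int) -> bytes:
        resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
        resp.raise_for_status()  # expect 206 Partial Content
        return resp.content

    def parallel_download(url: str, dest: str, part_size: int = 64 * 1024 * 1024) -> None:
        # A one-byte ranged GET reveals the total size via the Content-Range
        # header, avoiding a HEAD request on the GET-signed URL.
        probe = requests.get(url, headers={"Range": "bytes=0-0"}, timeout=60)
        probe.raise_for_status()
        total = int(probe.headers["Content-Range"].rsplit("/", 1)[-1])
        ranges = [(s, min(s + part_size, total) - 1) for s in range(0, total, part_size)]
        with ThreadPoolExecutor(max_workers=4) as pool:
            parts = pool.map(lambda r: fetch_range(url, *r), ranges)
        with open(dest, "wb") as f:
            for part in parts:  # results come back in request order
                f.write(part)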

pgo586
Community Contributor

Thanks @LeventeHunyadi for your reply to @jawahar's question. I'm definitely very interested in the answer to this question too. Unfortunately, our existing workflow uses the API directly and NOT the client libraries. As such, it would be problematic (time-consuming, to say the least) to have to write that logic into our existing code, which has been working well with the API in its current state (to add some context, we have custom Canvas production LTI tools which currently depend on this code). I would therefore definitely discourage Instructure from making changes unless the default behavior stays as it is (that is, the resize logic stays the default, and the extra parameter gives you the option to turn it off). In the worst case, I could see us updating our (Node.js) code to add the additional parameter to the API calls if the current behavior cannot be preserved as the default, but we'd of course need enough advance notice from Instructure.