Help Shape Data Access Platform CLI & Client Library Next Steps

sgergely
Instructure

I have recently joined as a Senior Product Manager for the Instructure Data & Insights team. I’m overseeing the DAP Query API and the DAP CLI client and library. I look forward to meeting many of you to discuss your data and analytics priorities. 

As I start shaping the roadmap for these products, I'm eager to gather your feedback to inform our direction. Here are three questions for you:

  1. I'd greatly appreciate any feedback you have on both the CLI client and the API itself. This includes your experiences using them, working with them, and consulting their documentation.
    1. What is your current data storage architecture for the data collected using these tools? Are you storing it on-premises, in the cloud, or both? If you're using cloud storage, could you specify which provider?
  2. What are the underlying reasons or objectives behind your (or your clients') use of our CLI tool or Query API? While I understand the primary goal is data retrieval, I'm particularly interested in learning about specific use cases and the questions you aim to address with this data.
  3. How do you typically analyze the data retrieved from our CLI tool or Query API? Are you primarily examining it in isolation, or do you integrate it with data from other sources, and if so, which ones?

You can share your responses in the comments section below, or feel free to send me a direct message if you prefer to share your answers privately.

I've taken the time to review past discussions in our community forums and noticed several challenges that users have encountered. I'm committed to exceeding the standards set by my colleagues and predecessors, and to ensuring a seamless experience for all. Your continued support and feedback are invaluable as I strive to meet and even surpass your expectations.

The recent release of Data Access Platform (DAP) CLI and Client Library version 1.1.0 serves as a prime illustration of how your feedback drives improvements within our system, making us more effective and responsive to your needs.

11 Comments
mclark19
Community Participant

Thank you for the questions, @sgergely.

1. Our ETL process is based around Informatica, which is the tool of choice at our institution. We aren't using the DAP Client directly. We have one Informatica taskflow per table. The taskflow has several different components, but basically pulls data from the DAP API to AWS S3, then pushes it to AWS Redshift. Due to some initial issues with getting CSV files to load correctly, we've opted for downloading data in JSON (Parquet was not an option when we began working on the CD2 pipeline). Everything is in the cloud. We've found the DAP documentation very useful. One frustration has been the reported status of the system, which seems to always be "operational" even when we are experiencing data issues. (This has not been an issue recently.) For anyone calling the client library directly, a rough equivalent of that pull step is sketched after this list.

2. We have primarily used Canvas data to report on general LMS usage trends each semester (e.g., number of published courses, number of unique users per day). We have also used the data to answer more targeted questions pertaining to particular departmental use cases (e.g., usage differences of the same base template across several course spaces) or specific requests for information.

3. We do not integrate other data with Canvas data as part of the ETL pipeline, but we do occasionally triangulate Canvas data with information from other systems (e.g., Panopto, survey responses) and/or explicitly combine Canvas data with other data sources (usually the student information system) to respond to specific data requests.
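As mentioned in item 1: for anyone calling the client library directly instead of going through Informatica, the pull step is roughly a snapshot download. A minimal sketch, based on the documented instructure-dap-client usage; credentials come from the DAP_API_URL / DAP_CLIENT_ID / DAP_CLIENT_SECRET environment variables, the table and output directory are illustrative, and exact signatures may vary by client version:

```python
import asyncio

from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery


async def download_snapshot(namespace: str, table: str, out_dir: str) -> None:
    # DAPClient() with no arguments reads DAP_API_URL, DAP_CLIENT_ID and
    # DAP_CLIENT_SECRET from the environment.
    async with DAPClient() as session:
        query = SnapshotQuery(format=Format.JSONL, mode=None)
        # Downloads the gzipped JSONL part files for the table into out_dir.
        await session.download_table_data(namespace, table, query, out_dir)


# Illustrative table; our pipeline does the equivalent of this per table.
asyncio.run(download_snapshot("canvas", "courses", "downloads"))
```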

SpencerWilliams
Community Participant

My request is for more focus on the developer experience.

At a bare minimum, I would love it if there were a way to be directly notified of new releases (e.g., an email newsletter, GitHub releases, or anything other than having to check back manually every week or so).
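Until there is an official channel, one possible stopgap is to poll PyPI for the latest instructure-dap-client version and compare it against what is installed. A minimal sketch; scheduling and alerting are left out, and the package name is simply the one published on PyPI:

```python
import json
import urllib.request
from importlib.metadata import version

PACKAGE = "instructure-dap-client"

# Ask PyPI's JSON API for the latest published version of the package.
with urllib.request.urlopen(f"https://pypi.org/pypi/{PACKAGE}/json", timeout=30) as resp:
    latest = json.load(resp)["info"]["version"]

installed = version(PACKAGE)
if installed != latest:
    print(f"New {PACKAGE} release available: {installed} -> {latest}")
else:
    print(f"{PACKAGE} {installed} is up to date")
```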

I've seen mentions of hosting the DAP CLI codebase on GitHub. In my opinion, this would greatly enhance the developer experience: we could easily see a product roadmap (using GitHub issues with boards) and get notified about new releases (as I normally do with other packages I depend on). I'm not partial to GitHub specifically, but something similar would be fantastic.

I appreciate you and your efforts in making working with DAP CLI a more pleasurable experience. Good luck!

reynlds
Community Participant

The sync process has been broken for me since the beginning for all schemas, and I've got multiple cases open on it now. With our CD1 going away at the end of June, it's imperative that this process work reliably. The full drop/init process runs fine but takes nearly 12 hours on the canvas schema and almost a week on the canvas_logs schema (and if another DAP process is started during that window, the canvas_logs process fails out). We've also run into issues with tokens failing even though they still show an expiration date many months out, requiring the creation of a new one. I'd like to see Instructure officially offer a sync to Snowflake or something similar to bypass the DAP processes. Better release documentation (as others have stated) would also be helpful, along with a better roadmap. I'd also like to be able to easily share my SQL queries with others, and review theirs, in a dedicated space under this Group.

sgergely
Instructure
Author

@mclark19 @reynlds @SpencerWilliams thank you for your effort in writing this up! Your responses have helped us shape our roadmap, and I'm going to incorporate this feedback into product development.

Please follow my blog, as I'm going to publish more about the upcoming changes related to these products.

robotcars
Community Champion

@sgergely 

Any way to make the initdb process multi-threaded? Trying to do initdb on some of these larger tables takes considerable time. The benefit of DAP and its schema management kind of makes using other tools that can be multi-threaded cumbersome to set up and maintain, or possibly problematic if they don't align with the DAP schema.

sgergely
Instructure
Author

@robotcars thank you for sharing your issue, and I'm sorry for the struggle!

From 0.3 to 1.0 we have seen a significant speed-up in the SQL functionality.

With that in mind, could you please share your

  1. operating system type and version,
  2. SQL server type and version,
  3. Python version,
  4. and DAP CLI version?
  5. Also, can you provide more detail on what you mean by considerable time, and what you would consider an acceptable time?
  6. Which namespace and tables take considerable time?
  7. How frequently do you need to run initdb and why?
  8. What are the other tools you are referring to?

Thanks!

reynlds
Community Participant

@robotcars Your comment intrigued me, so I set about working in my Linux shell to accomplish this through scripting. Certainly not "multi-threading", but it runs a lot faster than processing one table at a time.

Basically, one script acts as a "wrapper": it first creates a current table listing and "splits" it into 5 chunk files, each containing a roughly equal share of the table names. After that, 5 scripts run concurrently, each looping through its own table-name entries and logging to the same location. Hacky and stupid, yes, but it works. My next task is to optimize it by assigning the largest tables evenly across the running processes so that one script doesn't get stuck feeding a lot of multi-GB tables, bogging everything down again.
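For anyone who would rather keep it in one place, a rough Python equivalent of the same idea (not my actual script) is to run one dap initdb per table with a bounded worker pool. The table list below is illustrative, and credentials plus the database connection string are assumed to already be configured through the DAP environment variables or extra flags:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Illustrative subset; in practice this would be the full table listing.
TABLES = ["accounts", "courses", "enrollments", "submissions"]
WORKERS = 5  # mirrors the 5 concurrent shell scripts described above


def init_table(table: str) -> int:
    # One `dap initdb` per table; credentials and the database connection
    # string are assumed to come from the environment or additional flags.
    cmd = ["dap", "initdb", "--namespace", "canvas", "--table", table]
    return subprocess.run(cmd).returncode


with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for table, rc in zip(TABLES, pool.map(init_table, TABLES)):
        status = "ok" if rc == 0 else f"failed with exit code {rc}"
        print(f"{table}: {status}")
```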

@sgergely I run a full initdb nightly because the syncdb process never runs consistently or completely for me. I've got at least one support case open on it. Recently the syncdb on web_logs started working out of the blue after failing for months...love it when that happens.

I think most of the issues with CD2 since the beginning have been related to a lack of scale and optimization of the service. One thing I like is the Identity portal that was spun up (or was it already there and just "exposed" to us for this?). I'm hopeful that it will get more and more love and be used for more processes moving forward.

fullerusa
Community Member

API feedback - I would also reiterate one of the above comments about having a way to understand the release cycle, and a method of notification other than just going and looking once in a while.

We utilize the API with a small Python script that just loops through the available tables and updates the data in a PostgreSQL database, which is a temporary solution for us. In the end we want to put the data in a Snowflake database with the rest of our data. Our PostgreSQL instance is on-prem, but it is just a simple way to get the data from Instructure, which we then need to push to Snowflake. Having the API support Snowflake would be ideal (or, if Instructure had a Snowflake instance and could share the data, that would be even better).
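For illustration, one way to express that kind of per-table loop with the DAP CLI is sketched below. This is an assumption about the shape of the loop rather than our exact script; the table names are examples, and the CLI is assumed to be configured through its environment variables (including the PostgreSQL connection string):

```python
import subprocess

# Illustrative subset of tables; the real loop walks whatever list of
# available tables we maintain.
TABLES = ["accounts", "courses", "enrollments", "submissions"]

for table in TABLES:
    # `dap syncdb` applies the latest changes for a table to the target
    # database; credentials and the connection string are assumed to be
    # provided via the DAP environment variables or extra flags.
    result = subprocess.run(["dap", "syncdb", "--namespace", "canvas", "--table", table])
    if result.returncode != 0:
        print(f"syncdb failed for {table} (exit code {result.returncode})")
```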

Once we have the data, we use it in various reporting dashboards to track usage, implementation and possible configuration issues, and any passback issues (missing assignments in the SIS that should exist, etc.).

Currently we evaluate some of the data in isolation, but only because we have not built out the Snowflake sync; ideally we would not be using the data in isolation.

 

-Brian

robotcars
Community Champion

@sgergely 

Thanks for your response.

  1. operating system type and version
    1. AmazonLinux 2023.4.20240528
  2. SQL server type and version
    1. PostgreSQL 15.5
  3. Python version
    1. 3.11
  4. DAP CLI version?
    1. 1.1.0
  5. Also, can you provide more detail on what you mean by considerable time, and what you would consider an acceptable time?
    1. Last Tuesday I ran initdb on submissions for one of our instances, with 107M rows; it took about 7 hours to import into the RDS cluster. Looking at the output of rows inserted, it was inserting each file one at a time (minus 1 line for headers).
  6. Which namespace and tables take considerable time?
    1. The canvas namespace, and any of our largest tables: submissions, enrollments, assessment_questions.
  7. How frequently do you need to run initdb and why?
    1. Assuming syncdb works regularly (I see some concerns), it should be once for dev and once for production. But we have 4 instances, and about 30 tables for each are too large for the Lambda 15-minute limit, so running initdb for these manually takes considerable time if I'm going to have to load that many rows.
  8. What are the other tools you are referring to?
    1. I was using Embulk to load tables nightly; it can import 1 file per CPU core.

 

@reynlds

Can you provide a little more detail on that solution? Are you using a snapshot query to get the files, and if so, which command do you use to read each file into the DB?

Or did you import them manually? If so, how did you handle the meta-table details and timestamp?

reynlds
Community Participant

@robotcars Using the vanilla DAP client on a RHEL 9 host with Bash and a Python virtual env. No snapshot, straight drop/init on all tables using a script that spawns multiple sub-processes concurrently, each doing its "thing" on one table at a time. Basically it's 5 unique processes, called from a single point, each processing 20 tables, one at a time. Since it all runs in the background, the only way that I can "see" it running is to review the processes (ps -ef) and watch the log file roll. Pretty sure I could optimize this even more through AWS/RDS tweaks and DB optimization, or even build out an actual cloud app for it. But for now it resides on a virtual host in our on-prem VMware environment.

Once I get it written up and added to GitHub, I'll send you a link or post one as a blog entry in this group.

 

robotcars
Community Champion

@sgergely 

Something else I've noticed is that the CLI snapshot command returns the `at` value used for chaining incrementals, but the SnapshotQuery and IncrementalQuery results (used with `since` and `until`) do not; they only provide a timestamp as a Python datetime object. Is it possible to ensure that value is provided? Am I missing something? My current workaround is to store the run timestamp of an incremental and reuse it on the next one.
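Roughly, the workaround looks like the sketch below. Names are approximate: `result.timestamp` stands in for whichever datetime attribute the result object actually exposes, and the state file, starting date, and table are just examples.

```python
import asyncio
from datetime import datetime, timezone
from pathlib import Path

from dap.api import DAPClient
from dap.dap_types import Format, IncrementalQuery

# Local state file holding the datetime returned by the previous run.
STATE = Path("canvas_accounts_since.txt")


async def run_incremental() -> None:
    if STATE.exists():
        since = datetime.fromisoformat(STATE.read_text().strip())
    else:
        # First run: pick an explicit starting point.
        since = datetime(2024, 1, 1, tzinfo=timezone.utc)

    async with DAPClient() as session:
        query = IncrementalQuery(format=Format.JSONL, mode=None, since=since, until=None)
        result = await session.get_table_data("canvas", "accounts", query)
        # `result.timestamp` stands in for whichever datetime the result
        # exposes; persist it so the next run chains from exactly this point.
        STATE.write_text(result.timestamp.isoformat())


asyncio.run(run_incremental())
```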