We, at California Northstate University, are implementing Canvas Data 2, but we are finding that unless you use the Python client library (which is not an option for us; we have no Python developers or anyone to maintain a Python environment), the documentation is extremely inadequate. We are communicating with the API using PowerShell to download selected data exports, which we then import into Microsoft SQL Server by other means. You probably don't want to do this, but as a non-developer I found no sample code out there and it was a MASSIVE struggle, so I want to share what we came up with so other people aren't starting from scratch.
First, you will need to create a Canvas Data 2 API key, which is documented here: https://community.canvaslms.com/t5/Admin-Guide/How-do-I-generate-a-Canvas-Data-2-API-key/ta-p/560955. The client ID and secret should live in system environment variables for security, but for demonstration purposes I'm stashing them in PowerShell variables because it makes the steps easier to explain. Assuming you have those, the PowerShell below generates an API access token. You can store that token and it persists for a short while (maybe 10 minutes or so?); if it expires, just re-run the request and get a new one. The expiration is reported in the response of the Invoke-RestMethod call that sets $ACCESS_TOKEN.
$CLIENT_ID="ENVIRONMENT VARIABLE, REDACTED FOR DEMONSTRATION"
$SECRET="ENVIRONMENT VARIABLE, REDACTED FOR DEMONSTRATION"
$Header = @{"Authorization" = "Basic "+[System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes("${CLIENT_ID}:${SECRET}"))}
$body = @{
"grant_type" = "client_credentials"
}
$ACCESS_TOKEN = (Invoke-RestMethod -Method POST -Header $Header -ContentType "application/x-www-form-urlencoded" -Body $body -uri "https://api-gateway.instructure.com/ids/auth/login").access_token
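If you want to see exactly when the token will expire, capture the whole login response instead of just the token. A minimal sketch, assuming the response exposes the lifetime in an expires_in field (inspect your own response if it's named differently):
# Keep the full response so the expiration is visible alongside the token
$TokenResponse = Invoke-RestMethod -Method POST -Header $Header -ContentType "application/x-www-form-urlencoded" -Body $body -Uri "https://api-gateway.instructure.com/ids/auth/login"
$ACCESS_TOKEN = $TokenResponse.access_token
$TokenResponse.expires_in   # token lifetime in seconds, if present in your response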
Great! Now we can use that access token to do some querying. I want to focus on two broad queries that are useful before we actually pull our data. First, I want to get a list of tables. This is one of the documented examples on their API page here: https://api-gateway.instructure.com/doc/ (which is infuriating to use; you cannot easily link to specific sections because the page doesn't update the URL). The following code uses the existing access token to list the tables the API key has access to:
$ReqHeader = @{"x-instauth" = "${ACCESS_TOKEN}"}
$ReqBodyJson = $ReqBody | convertto-json
$ReqContentType = "application/json"
(Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" https://api-gateway.instructure.com/dap/query/canvas/table).tables
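If you'd rather keep that list around (for reference, or to feed the batch script further down), a small variation that stores it and writes it to a file:
$Tables = (Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" -Uri "https://api-gateway.instructure.com/dap/query/canvas/table").tables
$Tables | Sort-Object | Out-File tables.txt   # one table name per line for later reference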
Run that and look at the output yourself; in my case, I'll use the "users" table as the example. We can request the schema for that table directly with the following query:
$ReqHeader = @{"x-instauth" = "${ACCESS_TOKEN}"}
$ReqContentType = "application/json"
(Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" https://api-gateway.instructure.com/dap/query/canvas/table/users/schema).schema | convertto-json | out-file users.schema.json
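If you later want to poke at that schema from PowerShell (for example while drafting SQL column definitions), you can round-trip the saved file with ConvertFrom-Json; the exact nesting varies by table, so treat this as a starting point and drill in with Get-Member:
# Read the saved schema back into an object and list its top-level properties
$UsersSchema = Get-Content users.schema.json -Raw | ConvertFrom-Json
$UsersSchema | Get-Member -MemberType NoteProperty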
That may be useful to you later. Finally, we need to download our data; in our case we want a TSV. You don't just get a link: your data request becomes a job, and when the job completes it produces one or more objects, each with a download URL that only exists for a short time. For that reason this becomes a multi-step process, compounded by the fact that PowerShell doesn't handle JSON-to-object conversion gracefully (in my opinion). The code below does all of this for the users table. If you follow along: we first create a request for a TSV file, then poll the job status until it reads complete, then take the object created by that job, download the file at the indicated URL, and, in my case, extract it with 7-Zip:
$ReqHeader = @{"x-instauth" = "${ACCESS_TOKEN}"}
$ReqContentType = "application/json"
$ReqBody = @{
"format" = "tsv"
"mode" = "expanded"
}
$ReqBodyJson = $ReqBody | convertto-json
$RequestID = (Invoke-RestMethod -Header $ReqHeader -Method POST -ContentType "application/json" -Body $ReqBodyJson https://api-gateway.instructure.com/dap/query/canvas/table/users/data)
$JobStatusURL = "https://api-gateway.instructure.com/dap/job/" + $RequestID.id
$RequestStatus = (Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" $JobStatusURL)
$RequestStatus
while ( $RequestStatus.status -ne 'complete' ) {
$RequestStatus = (Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" $JobStatusURL)
$RequestStatus.status
sleep 5
}
$ObjReqBody = convertto-json $RequestStatus.objects
$ObjReqBody
$ObjUrlID = ($RequestStatus.objects).id
$URLs = Invoke-RestMethod -Header $ReqHeader -Method POST -ContentType "application/json" -Body $ObjReqBody https://api-gateway.instructure.com/dap/object/url
$URLs
foreach( $url in $URLs ) {
($url.urls.$ObjUrlID).url
Invoke-WebRequest -Header $ReqHeader -Method GET ($url.urls.$ObjUrlID).url -OutFile users.tsv.gz
& "C:\Program Files\7-Zip\7z.exe" x users.tsv.gz
}
OK, cool, it works! But there's a lot of repeated code in there, and ultimately we want to download MULTIPLE tables, not just users, without copying the whole block for each one; ideally we just pass in a list of tables and have the files appear. That gets us the following:
# Pull the API credentials from environment variables
$CLIENT_ID = $Env:CanvasClientID
$SECRET    = $Env:CanvasSecret
$datestamp = Get-Date -Format "yyyy.MM.dd"

# Get an access token using the OAuth client_credentials grant
$Header = @{ "Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes("${CLIENT_ID}:${SECRET}")) }
$body = @{
    "grant_type" = "client_credentials"
}
$ACCESS_TOKEN = (Invoke-RestMethod -Method POST -Header $Header -ContentType "application/x-www-form-urlencoded" -Body $body -Uri "https://api-gateway.instructure.com/ids/auth/login").access_token

# Tables to export
$TableList = @(
    "courses"
    , "accounts"
    , "users"
    , "pseudonyms"
    , "enrollment_terms"
    , "roles"
    , "enrollments"
    , "scores"
)

foreach ($Table in $TableList) {
    $ReqHeader = @{ "x-instauth" = "${ACCESS_TOKEN}" }

    # Request an expanded TSV export of this table
    $ReqBody = @{
        "format" = "tsv"
        "mode"   = "expanded"
    }
    $ReqBodyJson = $ReqBody | ConvertTo-Json
    $RequestURL  = "https://api-gateway.instructure.com/dap/query/canvas/table/" + $Table + "/data"
    $RequestID   = Invoke-RestMethod -Header $ReqHeader -Method POST -ContentType "application/json" -Body $ReqBodyJson -Uri $RequestURL

    # Poll the job until it completes
    $JobStatusURL  = "https://api-gateway.instructure.com/dap/job/" + $RequestID.id
    $RequestStatus = Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" -Uri $JobStatusURL
    $RequestStatus
    while ( $RequestStatus.status -ne 'complete' ) {
        $RequestStatus = Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" -Uri $JobStatusURL
        $RequestStatus.status
        Start-Sleep -Seconds 5
    }

    # Exchange the job's object id(s) for short-lived download URLs
    $ObjReqBody = ConvertTo-Json $RequestStatus.objects
    $ObjReqBody
    $ObjUrlID = ($RequestStatus.objects).id
    $URLs = Invoke-RestMethod -Header $ReqHeader -Method POST -ContentType "application/json" -Body $ObjReqBody -Uri "https://api-gateway.instructure.com/dap/object/url"
    $URLs

    # Download, extract with 7-Zip, then delete the archive
    foreach ( $url in $URLs ) {
        ($url.urls.$ObjUrlID).url
        $DownloadFileName = $datestamp + "." + $Table + ".tsv.gz"
        Invoke-WebRequest -Header $ReqHeader -Method GET ($url.urls.$ObjUrlID).url -OutFile $DownloadFileName
        & "C:\Program Files\7-Zip\7z.exe" x $DownloadFileName
        Remove-Item $DownloadFileName
    }
}
This assumes the API key information is stored in environment variables, that 7-Zip is installed under Program Files as shown, and that the table list above contains the tables you want (you can add more from the table listing earlier in this post). In the directory it is run from, it will produce a date-stamped TSV file for each table, as desired.
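For completeness, one way to set those environment variables at the machine level (so a scheduled task can see them) is from an elevated PowerShell session; the names just need to match what the script reads ($Env:CanvasClientID and $Env:CanvasSecret):
# Run once from an elevated prompt; "Machine" scope makes the values visible to scheduled tasks
[System.Environment]::SetEnvironmentVariable("CanvasClientID", "YOUR-CLIENT-ID", "Machine")
[System.Environment]::SetEnvironmentVariable("CanvasSecret", "YOUR-SECRET", "Machine")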
I am not sure anyone out there is actually going to use this as we did, but I think the pieces here are instructive for someone looking to dip their toes into the API when they're not a Python developer.
EDIT: I am attaching (here and in the first two comments) the PowerShell script we run as a scheduled task to download the data, plus the ten SQL scripts that import the tables we specifically need, executed as a SQL Server Agent job. Those scripts handle all the data-type validation and null handling. The paths would of course need to be updated for anyone using these, and 7-Zip must be installed.
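The attached SQL scripts are the real import logic; purely as a rough illustration of the shape of that step (server, database, table, and path names here are made up, and it assumes the SqlServer module is installed), a staging load of one of the date-stamped TSVs could look like this from PowerShell:
# Hypothetical path to an extracted export
$TsvPath = "C:\CanvasData\2024.01.15.users.tsv"
$BulkInsert = @"
BULK INSERT staging.users
FROM '$TsvPath'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2);  -- FIRSTROW = 2 assumes the TSV has a header row
"@
Invoke-Sqlcmd -ServerInstance "SQLSERVER01" -Database "CanvasData" -Query $BulkInsert   # placeholder server/database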
That looks great! Thanks for sharing! Very helpful for those not using Python.
How has performance been for you? Was there anything uniquely challenging you encountered? I'm using Oracle and Python and have found the instructure-dap-client useful. My main issue has been converting the JSON schemas to Oracle DDL.
Happy scripting!
Jason
It's not fast because of the async nature, but I just stagger the jobs: this runs at 7 AM, the SQL import runs at 7:30 AM, and everything is ready to go by 8. I haven't had an issue yet.
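For anyone setting up similar staggering, registering the download script as a daily 7 AM task can be done with the ScheduledTasks cmdlets (script path and task name below are placeholders; the 7:30 import is a separate SQL Server Agent job):
# Register the download script to run daily at 7:00 AM
$Action  = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-NoProfile -File C:\Scripts\CanvasData2Download.ps1"   # hypothetical script path
$Trigger = New-ScheduledTaskTrigger -Daily -At "7:00AM"
Register-ScheduledTask -TaskName "Canvas Data 2 Download" -Action $Action -Trigger $Trigger   # may need an elevated session depending on the run-as account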
Thanks for sharing your work. I used Python, but have some additional comments:
- Suggest that you try to retrieve the table data using a non-positional format like JSONL or Parquet instead of .tsv, since Canvas occasionally adds or deletes columns (see the sketch after this list).
- You probably also found out by now that 'canvas' in the URL https://api-gateway.instructure.com/dap/query/canvas/table needs to be changed to 'canvas_logs' for the web_logs table.
- I retrieve incrementals from all tables several times a day, so I needed to multi-thread these processes. If you're only retrieving a handful of tables it won't make much difference, but multi-threading dramatically reduced the retrieval time.
- I also needed additional code for sub-properties. The Assignments table has a turnitin_settings column with originality_report_visibility, exclude_small_matches_type, etc. properties; these translated into turnitin_settings_originality_report_visibility, etc. columns.
- You might also encounter many different errors. The web_logs processing, in particular, has been temperamental.
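On the first bullet: switching the earlier PowerShell to a non-positional format should only require a change to the request body. A sketch assuming JSONL (I've left out the "mode" setting, which is a TSV/CSV rendering option; confirm against the API docs whether it applies here):
# Same data request as before, but asking for JSONL instead of TSV
$ReqBody = @{
    "format" = "jsonl"   # non-positional: each row is a self-describing JSON object
}
$ReqBodyJson = $ReqBody | ConvertTo-Json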
I'll start by saying I'm only a data guy when required; I wear all the hats here, so I'm far from versed in best practice. For our import, a CSV/TSV was the easiest for me to implement based on existing code I had access to, so there are probably better ways to do it for sure. However, since we previously used their Node tool (which they no longer support, and which gave us TSV files), my focus was on the PowerShell piece, not the SQL piece. I didn't want to use the Python tool because I'm not familiar enough with Node or Python to hang my hat on supporting either, and now that the Node tool I spent a lot of time implementing no longer claims to support v2, I'd rather use something I can support myself, namely PowerShell: it's on all my Windows servers and it interacts directly with the API. I just have a bad taste in my mouth about depending on an Instructure-provided intermediary, and I want to be able to own this top to bottom.
That's a good shout; we're not grabbing log tables at this time, but we may in the future. Our institutional research team mostly defines the data needs outside of SIS sync stuff, and they could easily start asking for that.
The sub-properties, mapping, and translation are the biggest pain in the individual SQL files. Basically it takes a huge effort to go field by field for any table we decide to add, hence sharing the work I already did. We had something like this with the Node tool as well, but with different data errors (like nulls being reported as 1900 dates).
All in all I think this will work for us, but I can see having to keep an eye on it. Still, we're an all-Windows, all-MS-SQL shop, so an automation that uses only what's already on the system is much easier for me to support. I'll need to monitor the schema files to make sure fields don't change over time, and watch for things like job errors causing the downloads to fail and our local data going stale, but that's mostly trivial.
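One low-effort way to watch for that drift is to re-pull the schema on a schedule and compare it to the saved copy, reusing $ReqHeader and the schema query from earlier:
# Fetch the current users schema and compare it to the previously saved JSON
$CurrentSchema = (Invoke-RestMethod -Header $ReqHeader -Method GET -ContentType "application/json" -Uri "https://api-gateway.instructure.com/dap/query/canvas/table/users/schema").schema | ConvertTo-Json
$SavedSchema = Get-Content users.schema.json -Raw
if ($SavedSchema.Trim() -ne $CurrentSchema.Trim()) {
    Write-Warning "users schema has changed since the saved copy was written"
}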
I understand the reluctance to use Python, but it is possible to use the DAP CLI tool without actually writing or maintaining any Python code. Since the tool is written in Python you would obviously need to have Python installed, but that should be pretty straightforward on most platforms. I would strongly encourage you to at least try it out; once it's installed it should be no different than running CLI tools that are written in PowerShell (or C++, or whatever).
The logic necessary to correctly work with the CD2 API and data files is sufficiently complex that I would not want to have to implement it all myself. Because CD2 delivers incremental updates rather than full table replacements you need to take care to apply updates in the proper order, and you need to keep track of what updates have been applied. The DAP tool does this for you, in addition to handling token refreshment.
I expect that you could still orchestrate your process using a powershell script that calls the python CLI tool (though admittedly I don't know much about powershell).
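As a rough sketch of what that orchestration might look like (the subcommand and flag names here are taken from the dap client documentation as I remember it, so verify them against dap --help on your install):
# The dap tool reads credentials from these environment variables
$Env:DAP_CLIENT_ID     = $Env:CanvasClientID
$Env:DAP_CLIENT_SECRET = $Env:CanvasSecret

foreach ($Table in @("accounts", "users", "enrollments")) {
    # Full snapshot of each table as JSONL; dap handles authentication and writes the output files itself
    dap snapshot --namespace canvas --table $Table --format jsonl
}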
--Colin
That's a great point, Colin - the dap client can be called from the command line like any other program, and it means you don't have to worry about tokens and failed API calls.
On the other hand, one of the strengths of web APIs is that they are language independent. When your APIs are complex enough that no one recommends working with them directly, maybe they should be simplified, though.
PowerShell is such a great "glue" language and is perfect for these types of tasks (chaining CLI calls, for example). Since the dap client doesn't support SQL Server, you could use it to download incrementals and snapshots and then handle the merges with your T-SQL.
Your work looks great, @JosephGullo. I like where you took it with reducing the code repetition. You could make PowerShell modules out of it that replicate some of the dap-client tasks, like downloading and querying tables and namespaces. The database-specific stuff I would leave where you have it.
The maintenance and bug fixes will be on you, though. With the Python code, you can at least count on Instructure to release fixes and upgrades and keep it in sync with the changes they implement server-side.
Thanks again for sharing! I love seeing how others are tackling the same problems.
Jason
There is a 15-minute window before the file download links expire, after which you need to fetch new download links to continue the downloads (an issue for some large tables such as assessment_questions with 50+ files). Expired file links return a 403 response along with a message indicating that they've expired, so I updated the last part of this script to use a script block ($scriptBlock) that runs each download in the background, with Start-Job initiating a background job for each URL. This lets the file downloads run simultaneously (please see the sample code below; I tested it and it worked well with our CD2 instance).
$urlCounter = 0
$jobList = @() # Array to keep track of jobs

# One background job per object/file returned by the completed DAP job
foreach ($ObjUrl in $ObjUrlID) {
    $ObjUrl              # Debug output
    $url.$ObjUrl         # Debug output
    $url.$ObjUrl.url     # Debug output
    $DownloadFilePath = $DownloadPath + $Table + "." + $urlCounter + ".tsv.gz"

    # Script block executed inside each background job
    $scriptBlock = {
        param($url, $ReqHeader, $DownloadFilePath)
        try {
            Invoke-WebRequest -Header $ReqHeader -Method GET $url -OutFile $DownloadFilePath
        } catch {
            $_.Exception.Message
        }
    }

    # Start the job and keep a handle to it
    $job = Start-Job -ScriptBlock $scriptBlock -ArgumentList $url.$ObjUrl.url, $ReqHeader, $DownloadFilePath
    $jobList += $job
    $urlCounter++
}

# Wait for all download jobs to complete
$jobList | ForEach-Object {
    $job = $_
    Wait-Job -Job $job
}
} # closes an enclosing loop from the full script (not shown in this excerpt)
} # closes the outer table loop from the full script (not shown in this excerpt)

# Optionally clean up jobs
$jobList | ForEach-Object {
    $job = $_
    Remove-Job -Job $job
}
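One small addition worth considering before the Remove-Job cleanup: since the catch block emits the error message as the job's output, Receive-Job will surface any downloads that failed, for example:
# Collect each job's output; a successful Invoke-WebRequest -OutFile emits nothing,
# so any output here is the error message from the catch block
$jobList | ForEach-Object {
    $result = Receive-Job -Job $_ -Keep
    if ($result) { Write-Warning "Download job $($_.Id) reported: $result" }
}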