I am working on a Python script that will download all of the files in the tables available to us in the Canvas namespace. I noticed that when I unzip the .gz files, some contain one file and others contain two. The ones with two files appear to include a header file and an actual data file, but the data file also includes the headers for the data. It also appears that when there are multiple files, the first one (starting with 0000 in the file name) is the header file. Does anyone know why this is?
Also, when the files download, they go into a folder like job_some-random-text instead of the name of the table. Is there some secret voodoo for getting the folder to be named after the actual table I'm downloading? I can probably rename the folder after the current item in the list once the DAP command has run, but I figure the less hacky I need to make my code, the better.
The names of the files returned do not bear any special significance. If you need to associate files returned with information in the original request, you have to maintain association in your Python script. This is exactly what the official Python client does.
A request may produce several files. Incremental query requests typically produce one or a few files, while snapshot query requests typically produce more. Specifics largely depend on the size of the table requested, the distribution of the data, the size of the data retrieved, etc. You should not make hidden assumptions about the number of files returned; your Python script should iterate over all of the files returned and process them the same way.
Each file has a header, and contains zero or more records. If the file contains no records other than the header, your script should simply skip the file. DAP distributes tasks associated with a request to multiple nodes, which work independently. If one of the tasks finds no records matching the query in the data range it has been preliminarily assigned, it will produce an empty file.
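For example, here is a minimal sketch of that loop in Python, assuming the query results have already been downloaded into an output directory as gzipped CSV part files (the glob pattern is an illustrative assumption, not something guaranteed by DAP):

import csv
import gzip
from pathlib import Path

def iter_records(output_directory):
    """Yield one dict per data row across all part files, skipping header-only files."""
    for part in sorted(Path(output_directory).glob("*.csv.gz")):
        with gzip.open(part, mode="rt", newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)   # every part file starts with a header row
            if header is None:
                continue                  # completely empty file, nothing to process
            for row in reader:            # header-only files simply yield no rows
                yield dict(zip(header, row))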
While it might be helpful to have more structure to file names, AWS S3 does not support rename on objects. (AWS S3 is the underlying data store used in passing data for query results.) Unfortunately, a nice name is not always known beforehand when tasks spin up, and we must go with some name to let tasks work independently without conflict.
All of the above features are already implemented in instructure-dap-client, the Python client library provided by Instructure. I recommend reading through the classes and functions exposed in the module dap.api. Even if you decide to roll your own integration, you may be inspired by how DAP client library implements data retrieval. (Data processing will change in major ways with the upcoming new version of the client. However, the way the client interacts with DAP API and downloads files is going to remain largely the same.)
Strongly agree with @LeventeHunyadi ! The DAP client library will save you an immense amount of work! You can either use the CLI tool or you can integrate the Python library with your own code. For example, you could write your own script that uses the DAP library to fetch the list of tables, and then use the DAP library to sync or init each one of them.
This is essentially what we do; for each table we try to sync first, and if we get a "table doesn't exist" error (meaning the table isn't in our local database yet) we try to init the table. The library will keep track of the last timestamp of each table, so you can simply re-run the whole process every few hours to keep your database in sync.
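For illustration, a rough sketch of that sync-then-init loop, driving the dap CLI from Python. The table names are placeholders, and the exact flags and the way the error is detected are assumptions; check dap syncdb --help and dap initdb --help for your client version, and note that the database connection string is assumed to come from your environment configuration.

import subprocess

def sync_or_init(table, namespace="canvas"):
    # Try an incremental sync first; fall back to a full init if the
    # table has never been loaded into the local database.
    sync = subprocess.run(
        ["dap", "syncdb", "--namespace", namespace, "--table", table],
        capture_output=True, text=True,
    )
    if sync.returncode != 0:
        subprocess.run(
            ["dap", "initdb", "--namespace", namespace, "--table", table],
            check=True,
        )

for table in ["accounts", "courses", "enrollments"]:  # illustrative table names
    sync_or_init(table)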
--Colin
Thanks @ColinMurtaugh ! I am using the DAP command for the call, although I'm calling it like this:
import subprocess

for table in table_list:
    # Build the dap CLI command for a CSV snapshot of this table
    dap_command = f'dap snapshot --table {table} --format csv --output-directory "{output_directory}"'
    # Execute the dap command in a shell
    subprocess.run(dap_command, shell=True)
I didn't include
from dap.api import DAPClient
from dap.dap_types import Credentials
but I can see where this can make life easier.
@ColinMurtaugh, have you had any success yet porting your CloudFormation stack to Canvas Data 2?
Thanks, @LeventeHunyadi ! Right now I don't plan to feed these files into a database. I see how downloading the file and dumping it directly into a table works. I'm just experimenting and using Power BI Desktop to work with the files. Eventually, we want to put these into an MS SQL database and provide Canvas dashboarding for non-admin users, like the Provost's Office.
Is there some sort of function in DAP that lets me tell it what to name a file when I request it? My next step in my script is simply to rename the folders. I guess I can have Python peek at the contents of each file, delete any file that contains only one line, and then rename the remaining files to the table names once the extra files are removed.
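For illustration, a minimal sketch of that cleanup, assuming each dap snapshot run leaves exactly one job_* folder under the output directory; that assumption, and the one-line header check, are mine rather than documented DAP behavior.

import gzip
from pathlib import Path

def tidy_job_output(output_directory, table):
    out = Path(output_directory)
    job_dir = next(out.glob("job_*"))        # the folder the CLI run just created
    for part in job_dir.glob("*.gz"):
        with gzip.open(part, mode="rt") as f:
            line_count = sum(1 for _ in f)
        if line_count <= 1:                  # header row only, no data
            part.unlink()
    renamed = out / table
    job_dir.rename(renamed)                  # name the folder after the table
    return renamed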
You can get a lot of good info from the result of the job download, including a list of downloaded files.
from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery

# namespace, table, and output_directory are assumed to be defined elsewhere
async with DAPClient() as session:
    query = SnapshotQuery(format=Format.CSV, filter=None, mode=None)
    job = await session.download_table_data(namespace, table, query, output_directory)
    for file in job.downloaded_files:
        # unzip and move out of the job directory (see the sketch below)
        ...
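One possible way to fill in that last step, assuming each entry in job.downloaded_files is (or can be converted to) the path of a gzipped part file; that assumption and the destination handling are mine, not guaranteed by the library:

import gzip
import shutil
from pathlib import Path

def unzip_and_move(downloaded_file, destination):
    src = Path(downloaded_file)                          # e.g. a part file inside the job_* folder
    target = Path(destination) / src.with_suffix("").name
    with gzip.open(src, "rb") as compressed, open(target, "wb") as plain:
        shutil.copyfileobj(compressed, plain)            # stream the decompressed bytes out
    src.unlink()                                         # drop the original from the job folder
    return target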
Reference the Instructure code examples for more information: https://data-access-platform-api.s3.amazonaws.com/client/README.html#code-examples
Jason