Canvas Data 2: status report

The content in this blog is over six months old, and the comments are closed. For the most recent product updates and discussions, you're encouraged to explore newer posts from Instructure's Product Managers.

Edina_Tipter
Instructure Alumni

We've appreciated all the great feedback on Canvas Data 2 since we released it in March. Today we are going to share some updates on planned features as well as guidance related to timelines and migration.

First up, we’re happy to report that work on weblogs (the requests table in CD1 terminology) is on schedule, and we plan to release it to production on June 21st with our normal Canvas release cycle. We know this is highly anticipated, so once it’s available we will provide guides and suggestions to make your transition as smooth as possible.

Next, we’ve added a CD2 sandbox environment, which is seeded with dummy data and refreshed every 2 hours. The sandbox is a great way to explore and test as you build out your integration before pulling your full data sets. We recommend that customers get started with the sandbox as soon as possible, before Instructure onboards your data. This will help you understand what to expect ahead of your migration.

The sandbox can be accessed with an API key, which your Customer Success Manager can provide upon request so you can begin testing. The API key will remain active for 6 months while you work on your integration. Guides and resources to help you get started are available on Community under Canvas Data 2. To highlight a few:

As always, we value your feedback, suggestions and support as we help build a data-powered future together.

21 Comments
IanGoh
Community Contributor

Going to be interesting to see how much space weblogs will take over time. LiveEvents is already like a firehose.  We get 1M LiveEvents a day.

a1222252
Community Participant

To give you some idea of volume, we have been storing CD1 requests data since 2016. We have archived 2016 - 2020 data, about 1.6TB. Our 2021 - 2023 data is currently 1.8TB. That's for around 22,000 EFTSL. The long-term average is over 7 million records per day. The trick is to separate student-generated activity from internal application-generated records.

marco_divittori
Community Participant

For requests data in CD1 we've found that each new set of request files can contain records of activity that took place 2 weeks earlier. @Edina_Tipter will the same be true for weblogs files in CD2? Thanks

Edina_Tipter
Instructure Alumni
Author

@marco_divittori Just to make sure I understand the question: do you mean that, for example, in the daily logs for June 4 you see activities that took place on, say, the 25th of May?

marco_divittori
Community Participant

@Edina_Tipter Yes, that is currently the case with request data in CD1 and we are wondering whether that will continue to be true for CD2.

a1222252
Community Participant

@Edina_Tipter we were doing some work around most recent student activity and found that pseudonym_dim.last_request_at reports the last request generated by browser-based activity and does not take account of app-based activity. Will this remain the same in CD2?

Edina_Tipter
Instructure Alumni
Author

@marco_divittori Apologies for the somewhat late reply. We don't expect to have such late data in our service (unless something unexpected happens). In the sample data we have scanned so far, which we read from a stream of logs, we haven't seen the type of irregular behaviour that you described above. The expected behaviour is to serve the data in the order in which it was produced.

Edina_Tipter
Instructure Alumni
Author

@a1222252 This field maps to pseudonyms.last_request_at in CD2 (documented as "Timestamp of when the user last logged in with this pseudonym."). As CD2 does not enrich the data, which comes directly from the Canvas DB, I would not expect a change in behaviour or semantics in this case. Sorry I could not help more.

a1222252
Community Participant

@Edina_Tipter Thanks for the update. Does that mean student activity generated using apps is not recorded in the database, and is only available using requests data?

Edina_Tipter
Instructure Alumni
Author

@a1222252 User activity is captured in a few ways, with different granularity: some of it is captured in the database, but you can get more granular results from the requests table or the live events stream. We will deliver the weblogs (aka the requests table) via CD2 at the end of June if you would like to leverage this source. What is your use case, more specifically?

Joshua_HCSD
Community Member

Did anything change with CD1 recently?  Near the end of the school year our data pulls went from manageable to several terabytes in size crashing our server from lack of space.  We are using the Eastern Michigan State University pulls slightly modified for our instance (from my understanding). 

mcarruth
Community Contributor

We are trying to migrate to Canvas Data 2 using the MySQL CLI client that was released 2023-08-23 (CLI release version 0.3.11).

We are running into an issue with the database structure. The client fails with the error: pymysql.err.DataError: (1406, "Data too long for column 'question_data' at row 5"). The column is in the assessment_questions table and its type is TEXT. MySQL supports up to 65,535 bytes in a TEXT column.
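For anyone hitting the same error, here is a minimal sketch of how one might check which MySQL text type a given value needs. It only uses MySQL's documented byte limits for TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT; the helper name smallest_text_type and the sample payload are hypothetical, and this is not how the dap client itself chooses column types.

```python
# Documented maximum sizes, in bytes, of MySQL's text column types.
MYSQL_TEXT_LIMITS = [
    ("TINYTEXT", 255),
    ("TEXT", 65_535),
    ("MEDIUMTEXT", 16_777_215),
    ("LONGTEXT", 4_294_967_295),
]


def smallest_text_type(value: str, encoding: str = "utf-8") -> str:
    """Return the smallest MySQL text type whose byte limit fits `value`.

    The limits are in bytes, not characters, so the value is encoded first.
    """
    size = len(value.encode(encoding))
    for name, limit in MYSQL_TEXT_LIMITS:
        if size <= limit:
            return name
    raise ValueError(f"{size} bytes exceeds the LONGTEXT limit")


if __name__ == "__main__":
    # Hypothetical oversized question_data payload, just for illustration.
    sample = "x" * 70_000
    print(smallest_text_type(sample))
```

Because the limit is counted in bytes, a value with multi-byte characters can overflow a TEXT column well before it reaches 65,535 characters.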

a1222252
Community Participant

@Edina_Tipter Hi Edina, we are in the process of testing CD2 downloads and have found some issues. I've sent details to isaiah.alamani@instructure.com and to the support desk.

1. Performance of the dap tool is slower than the CD1 canvasDataCli tool and is variable.

2. We have seen intermittent HTTP 502 errors. With debug logging enabled, the log file does not capture all of the messages displayed in standard output.

I've also asked Sai for the best way to report these issues, perhaps you may also have a suggestion on this question?

Regards, Stuart.
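As a stopgap for intermittent errors such as the HTTP 502s mentioned above, a download step can be wrapped in a retry with exponential backoff. The sketch below is generic and makes no assumptions about the dap client's API: TransientError and retry_with_backoff are hypothetical names, and the caller is assumed to raise TransientError when a retryable failure (for example a 502 response) is detected.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. an HTTP 502."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 50% random jitter so parallel jobs do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The sleep function is a parameter so that the waiting behaviour can be checked in a test without actually pausing.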

a1222252
Community Participant

@Edina_Tipter I should add that we are currently using dap 0.3.8.2. We are awaiting an upgrade to 0.3.11. Is there any documentation regarding the changes introduced with newer versions?

Thanks, Stuart.

Gabor_Endrodi
Community Explorer

Hello, we are working on a fix for the DataError you mentioned above and expect to release it soon. Please bear with us, and sorry for the inconvenience.

a1222679
Community Contributor

X

a1222252
Community Participant

@Edina_Tipter I note that the format of the web_logs table has been changed recently to include the user_agent value, the same as CD1 requests. Previously the web_logs table only contained the user_agent_id which joined to the user_agents table. Is there a reason for this change?

marco_divittori
Community Participant

I also see the new "user_agent" field in the data files but it's not currently reflected in the documentation or even the latest schema file. 

https://data-access-platform-api.s3.amazonaws.com/index.html#tag/web_logs

stimme
Community Coach

@a1222252 & @marco_divittori, changes to the web_logs and user_agents tables are identified in the 2023 API and CLI Change Log. (I hadn't clicked through from the Nov 22 release notes document, which states, "Canvas Data 2 has been updated within the past two weeks," to this obscure change log until this afternoon.) The reason given there is stability. Note that the user_agents table is slated for removal on December 16, 2023.

I'll add that I just queried the dap client for the web_logs schema file, and the resulting JSON includes the user_agent_id field, not the user_agent field.
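A mismatch like this between the schema file and the data files can be caught automatically. Here is a rough sketch, assuming you have already extracted the list of column names from the schema JSON (the exact layout of that JSON is not shown here) and that the data file is CSV with a header row. The function names column_drift and csv_header, and every column name other than user_agent and user_agent_id, are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def csv_header(text: str) -> List[str]:
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(text)))


def column_drift(
    schema_columns: Iterable[str], data_columns: Iterable[str]
) -> Dict[str, List[str]]:
    """Report columns present on one side but not the other."""
    schema_set, data_set = set(schema_columns), set(data_columns)
    return {
        "missing_from_schema": sorted(data_set - schema_set),
        "missing_from_data": sorted(schema_set - data_set),
    }


if __name__ == "__main__":
    # Hypothetical, heavily simplified stand-ins for a schema and a data file.
    schema_columns = ["id", "timestamp", "user_agent_id"]
    data_file = "id,timestamp,user_agent\n1,2023-11-22T00:00:00Z,Mozilla/5.0\n"
    print(column_drift(schema_columns, csv_header(data_file)))
```

Running a check like this after each download would flag a new or removed column on the day it appears, rather than when a load job fails.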

marco_divittori
Community Participant

Thanks for digging that up @stimme. According to the release notes, the new column was added on 2023-11-22. I would expect the documentation and schema file to be updated on the same day but it looks like neither has happened.

marco_divittori
Community Participant

I checked again today and it looks like the web_logs schema file has been updated to include the "user_agent" column data.