We're Updating our Data Source

Katrina-Hess
Instructure Alumni

Hello Canvas Community! 

 

There was a glitch in the matrix. We discovered a problem that affected the accuracy of your data. To ensure that data numbers are what they should be, we are updating the data source for both New Analytics and Canvas Data's request table.

 

What is changing?

In about a week or so, you'll notice an increase in the number of HTTP requests received daily, which will increase activity numbers in New Analytics. Reports will be adjusted to include the previously missing data—including historic activity. Canvas Data's request table is also affected but will only show updated activity numbers moving forward. No schema changes are involved.

 

[UPDATE 2019-11-20] We are also looking into the issue of increased request activity as stated in the comments below.

 

Why did this happen?

Canvas Data's request table and New Analytics expose HTTP request data to customers. These products rely on logs generated by Canvas web application servers which are archived to S3 periodically. A flaw was discovered in the archiving process that caused some logs to be discarded when web application servers were shut down.

 

How are we preventing this from happening again?

The Canvas DevOps team has developed a separate archiving process which does not exhibit the same data loss behaviors. Canvas Data's request table and New Analytics will be moving to this new archive source for HTTP request data.

 

We apologize for any concern this data inconsistency may have caused. This adjustment is one step in a longer journey towards data excellence at Instructure, and it improves the accuracy of your data today. We will keep working to deliver a high standard of data services, because data excellence matters to us.

 

Best, Katrina

12 Comments
rpsimon
Community Contributor

So please forgive me for being ignorant, but does this affect the Canvas Data Portal too?

robotcars
Community Champion

Shouldn't affect the Portal, Documentation, or API.

It mostly seems to be an explanation of "Large increase in requests data".

oxana
Instructure Alumni

Thank you, @r_carroll. I can confirm that the schema, documentation, and API/CLI are not affected. The request table will increase in size in the upcoming run.

a1222252
Community Participant

Hi All,

Looking at today's requests data, 75% of the records are of the form:

web_application_controller = 'conversations'

web_application_action = 'unread_count'

course_id = '\N'

conversation_id = '\N'

url = '/api/v1/conversations/unread_count'

Almost all of these (99.6%) are associated with browser sessions (user_agent like 'Mozilla%'), so the fact that the url is an API call would indicate that the traffic is system-generated.

I don't see how these records are of any use since they are clearly not generated by user activity.

While we could remove these as part of our load, the increased data quantity has a substantial impact on processing time, disk storage of the local data store, etc.
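As a rough illustration of filtering these records during a load, here is a minimal Python sketch. It assumes the requests extract is tab-delimited and, purely for illustration, that it has a header row with a `url` column; the function name `drop_unread_count` and the sample rows are hypothetical and not part of Canvas Data.

```python
import csv
import io

# URL of the polling endpoint quoted in the comment above.
NOISE_URL = "/api/v1/conversations/unread_count"


def drop_unread_count(rows):
    """Yield only rows whose url is not the unread_count endpoint."""
    for row in rows:
        if row.get("url") != NOISE_URL:
            yield row


# Hypothetical sample data; a real extract would be read from a file instead.
sample = io.StringIO(
    "id\turl\tuser_agent\n"
    "1\t/api/v1/conversations/unread_count\tMozilla/5.0\n"
    "2\t/courses/42/assignments\tMozilla/5.0\n"
)

# QUOTE_NONE keeps quote characters inside fields from being interpreted.
reader = csv.DictReader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
kept = list(drop_unread_count(reader))
print(len(kept))
```

Filtering row by row in a generator avoids holding the whole file in memory, although it does not reduce the download size or the time spent reading the file.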

Regards,

Stuart.

oxana
Instructure Alumni

Hi  @a1222252 . 

The request data we are sharing is application transactional data and is not an ideal source for calculating user activity or other meaningful analytics. We recommend using the Data Services: Live Events streaming option to eliminate system noise. Please check out our new feature in the Canvas Beta environment, Data Services: System vs User filter: https://community.canvaslms.com/docs/DOC-18081-system-and-user-event-filtering-capabilities

We are also looking for customer feedback so that we can introduce meaningful events to the service. Please share what you are using request logs for and whether there is an opportunity for us to emit events that could replace your need for request logs.

Thank you,

Oxana 

a1222252
Community Participant

Hi Oxana,

That may be so, but it doesn't explain the dramatic increase in requests data volume since 6th November caused by system-generated counts of unread conversations.

We use the full historical data set to look at user activity across entire courses and this allows us to compare year-on-year etc. We have developed a process which parses urls to derive data not available in the native requests data for records which are not API calls. From this we can obtain reasonably accurate user activity for browser-based interactions.
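To illustrate the kind of URL parsing described above, here is a minimal Python sketch. The path pattern (`/courses/<id>/<resource>/<id>`) and the function name `parse_request_url` are assumptions made for the example; the actual process used is not shown in this thread.

```python
import re
from urllib.parse import urlparse

# Assumed shape of a browser URL: /courses/<course_id>/<resource>/<resource_id>
COURSE_PATH = re.compile(
    r"^/courses/(?P<course_id>\d+)"
    r"(?:/(?P<resource>[a-z_]+)(?:/(?P<resource_id>\d+))?)?"
)


def parse_request_url(url):
    """Return course/resource parts for a non-API URL, or None if not matched."""
    path = urlparse(url).path  # drops the query string and fragment
    if path.startswith("/api/"):
        return None  # API calls are skipped, as described in the comment
    match = COURSE_PATH.match(path)
    if match is None:
        return None
    return match.groupdict()


parsed = parse_request_url("/courses/42/assignments/7?module_item_id=3")
print(parsed)
```

Parts that are absent from the path, such as the resource on a bare `/courses/42` URL, come back as `None`.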

I'll request a colleague to provide a more detailed description of what the requests data is currently used for.

Regards,

Stuart.

a1222679
Community Contributor

Hi Oxana,

I work alongside Stuart and am primarily looking at who is/isn't logging in to their courses and what resources they're accessing.  Log ins are used as a measure of student engagement (% of days logged in either total or week by week) and also to identify good times to message students.
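The "% of days logged in" measure could be computed along these lines. This is only a sketch: it assumes one timestamp per request for a single user and treats any request on a calendar day as activity for that day; the function name `percent_days_active` and the sample dates are hypothetical.

```python
from datetime import date, datetime


def percent_days_active(timestamps, start, end):
    """Percentage of calendar days in [start, end] with at least one request."""
    total_days = (end - start).days + 1
    # A set collapses several requests on the same day into one active day.
    active = {ts.date() for ts in timestamps if start <= ts.date() <= end}
    return 100.0 * len(active) / total_days


# Hypothetical request timestamps for one user during one week.
stamps = [
    datetime(2019, 11, 4, 9, 0),
    datetime(2019, 11, 4, 17, 30),
    datetime(2019, 11, 6, 8, 15),
]
pct = percent_days_active(stamps, date(2019, 11, 4), date(2019, 11, 10))
print(round(pct, 1))
```

Calling the function once per week gives the week-by-week figure, and calling it once over the whole term gives the total.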

 

Looking at where they navigate to is useful for things like the report I'm running now which is evaluating uptake of an LTI we built.

Best,

Daniel

jago_brown
Community Member

Dear Instructure,

Will these changes mean we are less likely to get duplicate id (guid) values in our Requests table in future? Or will they remove the pre-existing duplicate guids?

Jago

cronek
Community Participant

We use the GUI Page Views tool (on the User Account page) all the time to track down exactly what a user was doing when a problem occurred.  Now it is like finding a needle in a haystack to parse out the relevant views from the unread_count views.  As folks here are commenting, the Requests table was already large and now it is even more prohibitive to load locally just to look up something for a support ticket.  Maybe there is a way to filter these out of the Page Views tool or vastly increase the number of views that can be downloaded via CSV.  If you could download 3000 or even 10000 views, it would be easier to omit them locally with a sort or grep of some kind.

oxana
Instructure Alumni

Hi Daniel, have you looked into Live Events -> Session Events (Live Events: Event Type by Format)? I believe @r_carroll is using those in his application for a similar use case. We also added the referrer URL to the event to track where the end user came from. If you are ever interested in calculating LTI usage, please take a look at the asset_accessed event for asset_type = context external tool. There is currently no reliable way of getting LTI usage data out of the request logs because of the way the launch URL is captured in the logs, so we built a specific event with all the required LTI launch data to accommodate the LTI usage use case. Any time an end user launches an LTI app, we emit the event.

Thanks,

Oxana

robotcars
Community Champion

 @a1222679 ,

Some of my own documentation is below.

asset_accessed

canvas live events alpha format schema with descriptions and sample values · GitHub 

ledbelly/canvas.rb at master · ccsd/ledbelly · GitHub 

These events include referrer, url, and asset_name, as well as the fields from metadata including request_id, session_id, and client_ip, and most others from requests.

logged in

ledbelly/canvas.rb at master · ccsd/ledbelly · GitHub 

BSS
Community Explorer

Is there any update on this? Trying to find out Canvas access records when the /api/v1/conversations/unread_count is showing up every minute in page views is making parts of my job nearly impossible to do.