Celebrate Excellence in Education: Nominate Outstanding Educators by April 15!
I am having trouble downloading discussion board posts for the current semester, Summer 2020. I need to calculate the average post length for different discussion topics within different courses. For example, I have a discussion topic with 33 posts/replies, but when I download the data and query by discussion topic (joining discussion_entry_dim and discussion_entry_fact), only 5 posts appear (2 that were deleted and 3 that are active). I downloaded the Canvas data via a batch file developed by our data scientist and uploaded it to an AWS server to query the data. This same process has worked before with all discussion posts in Canvas appearing in the query for another discussion in March 2020. I tried another route of downloading, unpacking, and opening the discussion_entry_dim and discussion_entry_fact tables in Excel using the Canvas LTI Tool and only the same 5 posts appeared again. I also searched a couple random posts using wildcards (both within mySQL and Excel) to see if maybe the ID's were incorrect in the tables and I cannot find those posts within the data files. Do you have any idea where these posts could be stored? I would rather use a more robust method than just copying and pasting all the discussion posts. Thank you.
mySQL script for specific discussion topic ID:
SELECT * FROM DISCUSSION_ENTRY_DIM INNER JOIN
DISCUSSION_ENTRY_FACT ON DISCUSSION_ENTRY_DIM.id=DISCUSSION_ENTRY_FACT.discussion_entry_id
WHERE topic_id LIKE "%XXXXX";
mySQL script for trying to find discussion post message:
SELECT * FROM DISCUSSION_ENTRY_DIM INNER JOIN
DISCUSSION_ENTRY_FACT ON DISCUSSION_ENTRY_DIM.id=DISCUSSION_ENTRY_FACT.discussion_entry_id
WHERE message LIKE "%INSERT QUOTE HERE%";
Hi @mk4261
That definitely seems weird. It sounds like you were getting files from someone else, and I wonder if these were filtered or out of date. You said you also got them directly out of the Canvas Data Portal LTI?
These files should be updated daily, with a 2 day delay in what you see vs what's currently in Canvas.
It's kinda hard to tell what you're missing, but maybe I'll share a way to help you test if you are...
The dim and fact table for each set should have the same row count.
SELECT count(*) FROM discussion_entry_dim;
SELECT count(*) FROM discussion_entry_fact;
If the fact table is missing rows, the INNER JOIN will only show you rows where the JOIN works.
In your query above, if you change it to a LEFT JOIN, do you get a different result count.
-- should return the same row count as the 2 above
SELECT count(*) FROM discussion_entry_dim
INNER JOIN discussion_entry_fact ON discussion_entry_dim.id = discussion_entry_fact.discussion_entry_id
Are you missing rows?
Beyond that, I don't see anything wrong with your query. If you have access to some other tables, I did a little tinkering to try and make reviewing/comparing what you see in Canvas and the DB a little easier. This query will return a discussion topic, using the discussion id you would see in Canvas, or the URL. The results trimmed and sorted to show a threaded view, if the discussion is threaded, you can switch them out if not.
SELECT
cd.canvas_id AS canvas_course_id
, cd.name AS course_name
, dtd.canvas_id AS topic_canvas_id
, dtd.discussion_type
, dtd.title AS topic
, dtd.[message] AS topic_body
, ded.canvas_id AS discussion_entry_canvas_id
, parent_discussion_entry_id
, ded.depth
, ded.[message] AS discussion_entry -- if this is null, could be an attachment
, def.message_length AS discussion_entry_length
, dtd.workflow_state AS discussion_entry_state
, ded.created_at AS discussion_entry_created_at
, ded.updated_at AS discussion_entry_updated_at
, CASE
WHEN parent_discussion_entry_id IS NOT NULL THEN parent_discussion_entry_id % 10000000000000
ELSE ded.canvas_id
END AS thread_view
FROM discussion_entry_dim ded
JOIN discussion_entry_fact def ON def.discussion_entry_id = ded.id
JOIN discussion_topic_dim dtd ON dtd.id = def.topic_id
JOIN course_dim cd ON cd.id = dtd.course_id
WHERE dtd.canvas_id = 123456 -- short topic id seen in canvas urls
-- AND ded.workflow_state = 'active'
ORDER BY thread_view, depth, ded.created_at
-- ORDER BY ded.created_at
I tested this by looking for for discussion entries from the current summer term and grabbing one of the topic ids
SELECT
def.course_id
, def.topic_id
, count(ded.id) entries
FROM discussion_entry_dim ded
JOIN discussion_entry_fact def ON def.discussion_entry_id = ded.id
JOIN discussion_topic_dim dtd ON dtd.id = def.topic_id
WHERE def.enrollment_term_id = 100000000000001
GROUP BY def.course_id, topic_id
I grabbed one with 53 replies and opened it in Canvas. Then in the browser's Developer Tools, I took a count of the current entries. I got 62.
document.querySelectorAll('.discussion_entry')
Lot of Canvas toys... so I went further.
We also collect Live Events, discussion_entry_created in SQL.
So I counted that and found 63, maybe someone deleted one, can't see it in CD yet.
SELECT count(*)
FROM live_discussion_entry_created
WHERE discussion_topic_id = 1233456
-- 63
SELECT CONVERT(date, created_at), COUNT(*)
FROM live_discussion_entry_created
WHERE discussion_topic_id = 1233456
GROUP BY CONVERT(date, created_at)
-- 2020-06-05 2
-- 2020-06-06 7
-- 2020-06-07 1
-- 2020-06-08 21
-- 2020-06-09 13
-- 2020-06-10 9
-- 2020-06-11 3
-- 2020-06-12 7
This lines up with the 2 day delay in CD... we have 10 entries since yesterday. Tonight I should see the 3 from yesterday, and so on.
Not sure if any of that'll help, let me know.
Hi carroll-ccsd,
Thank you for your message. I checked the counts of the two tables and of the join tables and the counts are all the same.
Hi Megan,
Were you able to look and see if the counts (what you're seeing in Canvas vs CD) are just delayed by 2 days?
Hi Robert,
Yes, there is a delay of 2 days. Is that a standard delay as I was not aware of that? Also, I updated the discussion_topic_dim table which I had not done previously. I am not sure if it was the delay or not updating the discussion_topic_dim table along with the discussion_entry_dim and discussion_entry_fact tables that corrected the problem, but using my original query, I can now see all the discussion posts per topic (with the 2 day delay). Thank you again for helping me to troubleshoot this issue. I appreciate it!
The 2 day is standard. Canvas is working on a new CD product to reduce that latency.
Typically depending on your Timezone those files should be ready every evening. Since the batch takes awhile to produce, it may be possible that some tables won't have references to records in other tables until the next batch.
Getting updated files before you start the task is a good idea no matter what. If that's every day, there are ways to get the files. If you can import them nightly, you'll have the most recent files in the morning.
How to Use the Canvas Data CLI Tool
Since you're already using AWS, this might be helpful
To participate in the Instructure Community, you need to sign up or log in:
Sign In
This discussion post is outdated and has been archived. Please use the Community question forums and official documentation for the most current and accurate information.