@themidiman
I understand what you're trying to do, but am missing your definition of when a file is ready to be deleted. Is it when the file is no longer used in any current course? When the file hasn't been used in a course for a fixed period of time? Do you delete course content after a certain period of time?
The next question is what tools you have available. Do you have access to Canvas Data 2, only the REST API, just the web interface???
Some people have said that CIDI labs has a tool with file management capability, but I'm not familiar with it. It would still need some way of knowing what is no longer needed (unless perhaps you delete course content and so the file is not used anywhere).
I would do a little preliminary work in the Web UI.
I would start with the Account > Files > My Files > conversation attachments and make sure those are cleared out as much as possible. If your instructor sent a lot of attachments to students (I've had faculty send their notes via attachments rather than posting it in Canvas), that can definitely fill up quickly and they likely aren't needed anymore. I've also had faculty provide media comments to their students and had that fill up their quota really quickly. Even though that counted against their quota, it didn't stop them from recording more media, but it might have (I didn't check and this was several years ago) might cause them to not be able to upload other things.
Then we move to the Account > Files > My Files, which contains all of the images the user has uploaded. You can sort by date to get the oldest files first. That may be enough to jog the instructor's mind of "I don't need that" or "I still need that".
Determining where a file is used becomes tricky. There's no API for that.
This is where Canvas Data 2 would be beneficial.
The file information is stored in the attachments table. For example, this will give me all of the files that I've uploaded. It takes a while because none of the columns are indexed, but it comes back with a list 295 files that I've uploaded to my personal space. Note that my DAP client prefaces the namespace in front of the table, so the table's name for me is canvas__attachments.
SELECT *
FROM canvas__attachments
WHERE context_type = 'User'
AND context_id = 2175488;
You probably don't need to worry about files that are already deleted so you could check the file_state. This brings me down to 39 files.
SELECT *
FROM canvas__attachments
WHERE context_type = 'User'
AND context_id = 2175488
AND file_state = 'available';
Some of the properties include size and modified_at. This could allow you to search for large files or old files (or some combination of both).
In particular, I found a file called "health_map.PNG" that is used in a discussion. It's file_id (attachment_id in CD2) is 133128306.
Now there is a really tempting table called attachment_associations. It "Links user files to an assignment to allow grader to see the student’s submission." However, it does not link to discussions, only conversation messages, submission, course, or group. The attachement_associations has an attachment_id column, but because my file is only used in discussions, it isn't that table.
That means that we need to look through the discussions. In a discussion entry, the image might be part of an attachment or it might be part of the message. The message has type bounded_str. For MySQL, that comes through as MEDIUMTEXT, which is capped at 16 MB. If the link appears after 16 MB, it may be difficult to find. Hopefully, most professors don't post messages that long.
Just searching for the file (attachment) ID in the discussion is going to take a while. For me, it took 28.6 s (my MySQL interface times out after 30 seconds).
SELECT *
FROM canvas__discussion_entries
WHERE message LIKE '%133128306%';
Since the discussion_entries has a user_id column and it's unlikely (potentially possible, though) that anyone other than me would be posting using my file, I could specify that in the search to speed it up. I also want to make sure the discussion is still active and that I'm not getting a partial match on the file ID (if the ID was 1234, I want to make sure I don't pick up 12345 or 61234). This took my request down to 8.5 s (although caching lowered the original request to 19.6 s, so it's not as much of as in improvement as indicated, but still faster).
SELECT *
FROM canvas__discussion_entries
WHERE user_id = 2175488
AND workflow_state = 'active'
AND message LIKE '%/files/133128306/%';
To check for attachments of the file, you could add the ID there.
SELECT *
FROM canvas__discussion_entries
WHERE user_id = 2175488
AND workflow_state = 'active'
AND (message LIKE '%/files/133128306/%'
OR attachment_id = 133128306);
You probably want to combine the two queries together, rather than searching for a specific attachment ID and repeating the query.
Here is some MySQL using common table expressions to accomplish this. I set the user_id once at the top so I didn't have to specify it multiple places.
SELECT 2175488 INTO @user_id;
WITH cte1 AS (
SELECT
context_id AS user_id,
id AS attachment_id,
filename,
size,
COALESCE(modified_at, created_at) AS modified_at,
CONCAT('%/files/',id,'/%') AS link_text
FROM canvas__attachments
WHERE context_type = 'User'
AND context_id = @user_id
AND file_state = 'available'
),
cte2 AS (
SELECT
a.attachment_id,
COUNT(1) AS n
FROM cte1 a,
canvas__discussion_entries de
WHERE de.user_id = a.user_id
AND de.workflow_state = 'active'
AND (de.message LIKE a.link_text
OR de.attachment_id = a.attachment_id)
GROUP BY a.attachment_id
)
SELECT
a.attachment_id,
a.filename,
a.size,
a.modified_at
FROM cte1 a
LEFT JOIN cte2 b USING (attachment_id)
WHERE b.n IS NULL;
This gives me a list of 28 files that are not used in discussions.
However, some of them are used for profile pictures, conversation attachments, and other places. I would need additional queries to make sure that they weren't found in places I wanted to keep.
But anyway, that might give you a place to start from if you had Canvas Data 2. If you're limited to the API, then it's going to take a lot more work.