Welcome to the world of Canvas Data 2 inconsistency.
I am making some assumptions here, as your post is not completely clear to me. If I understand correctly,
- You are using the DAP client to maintain a set of tables in Postgres
- After a recent run of the client (which would use an incremental update), you have noticed missing entries in Pseudonyms
- (My assumption from what you have written) You have run a snapshot query against CD2 using the CLI or a separate client, without updating your database, and this has shown the entries as present.
My initial questions with this would be:
- Have you tried another incremental update via the DAP client, and has that fetched the missing entries?
- When was the last time you ran the update prior to the one where you found missing items?
During the Beta phase of Canvas Data 2, I had long discussions about the lack of referential integrity in CD2 - exactly this kind of thing could occur, where users existed but the matching pseudonyms did not. Instructure have classified the system as providing "eventual consistency", in that eventually the records should be present and consistent, but I have begged to differ and consider it to be "continually inconsistent".
I currently have tickets outstanding for missing data from CD2 - but that is probably not caused by this issue. Rather, it is caused by Canvas errors during updates that leave some records in inconsistent states, such that the streaming logs CD2 relies on do not reflect the actual data - which is a completely different world of pain.
What I think is happening here is a result of the way data is streamed to CD2, and the way the queries work.
In order to save space, and to make the various queries more efficient within the constraints of the design choices made, only the most recent version of every record is stored. If a record is updated multiple times in a short space of time, only the most recent state will be retained (or ever present) in CD2. Each record is timestamped with the time the record is committed into CD2 - which is not the time of the actual update in Canvas. There is no guaranteed timing of these streamed updates, and different tables can be updated at different times - so you may have newer values in users while the pseudonyms have not been updated at that point.
Because there is only one version of each record, and the timestamp only reflects when it was persisted into CD2, it is not possible to generate a set of data that is consistent across tables at any given point in time. The only thing the timestamps can be used for is to incrementally fetch data such that all records eventually get updated.
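If you want to see how far out of step your local copy is, a quick orphan check against the Postgres tables the DAP client maintains can help. This is only a sketch - the schema, table and column names (canvas.users, canvas.pseudonyms, user_id, workflow_state) are my assumptions about what your sync created, and the connection details are placeholders:

```python
# Hedged sketch: list users in the local copy that have no pseudonym row.
# Schema/table/column names are assumptions - adjust to match your own sync.
import psycopg2

conn = psycopg2.connect("dbname=canvas_data user=canvas")  # placeholder connection details
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT u.id, u.workflow_state, u.updated_at
        FROM canvas.users u
        LEFT JOIN canvas.pseudonyms p ON p.user_id = u.id
        WHERE p.id IS NULL
          AND u.workflow_state <> 'deleted'
        ORDER BY u.updated_at DESC
    """)
    for user_id, state, updated_at in cur.fetchall():
        print(f"user {user_id} ({state}, updated {updated_at}) has no pseudonym row")
conn.close()
```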
I don't have any knowledge of how locking on update or read is performed when accessing Canvas Data 2. I am guessing (and this is only a guess) that data is written into CD2 on a transactional basis per table, and that all updates for a given timestamp are committed at the same time - but this may not be the case, and updates may be committed more frequently, potentially record by record. I am also guessing that the reads (be they snapshot or incremental) do not lock at all, and are effectively dirty reads of whatever data is available. To provide consistent sets, be it for snapshot or incremental, the queries return the "until" parameter, which gives the starting point for the next incremental update.
My guess is that, to provide the best approximation of consistency, the maximum current timestamp is determined at the start of a query and returned as the "until". All records for the table are then fetched, and any greater than the "from" (for incremental queries, or with no lower bound for snapshots) and less than or equal to the "until" are returned. I believe this is a realistic assumption, as the query supports supplying the "until" parameter - which is not recommended, but can be used to effectively paginate updates - so the underlying processing obviously uses it.
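To illustrate what I mean, here is a minimal sketch of that suspected windowing behaviour. It is purely illustrative - the function and names are mine, not how CD2 is actually implemented:

```python
# Illustrative only: my guess at how an incremental extract windows records.
# 'rows' stands in for a CD2 table; each row carries the timestamp at which it
# was committed into CD2 (not the time of the original change in Canvas).

def incremental_extract(rows, since):
    """Return (selected_rows, until), where 'until' becomes the next run's 'since'."""
    # The upper bound is fixed from the highest timestamp seen at the start of the query...
    until = max(ts for _, ts in rows)
    # ...then only rows falling inside the (since, until] window are returned.
    selected = [(key, ts) for key, ts in rows if since < ts <= until]
    return selected, until

rows = [("A", 120), ("B", 180), ("Z", 200)]
selected, until = incremental_extract(rows, since=100)
print(selected)   # [('A', 120), ('B', 180), ('Z', 200)]
print(until)      # 200 - stored and passed back as 'since' on the next run
```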
If there is not full locking across updates and reads (and I highly doubt there is full locking), the possibility exists - and this is what I suspect may have happened here - that the following scenario is in play:
- A record Z is first created in Canvas at time t1
- At timestamp t2, record Z is created in Canvas Data 2 (which is later than the last incremental query done when the highest timestamp was t0)
- At time t3, record Z is updated in Canvas
- At time t4, an incremental query is requested against the table. At this point in time, the maximum timestamp on the table is t2
- The extract query commences extracting records with timestamps greater than t0 and less than or equal to t2
- At timestamp t5 (greater than t4 by a very small amount), record Z is updated in Canvas Data 2, and the timestamp is changed to t5
- At time t6 (greater than t5 by a very small amount), record Z is read for processing by the extract, and is bypassed because its timestamp t5 is greater than the t2 maximum being used for this run
If this is true, a subsequent incremental query will pick up the updated record - but the earlier creation was not obtained when expected, because of the subsequent update.
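To make that sequence concrete, here is a small simulation of the suspected race. Again, this is illustrative only, not how CD2 is actually built:

```python
# Simulation of the suspected race: a record whose timestamp is bumped mid-extract
# falls outside the current window and is silently skipped until the next run.

t0, t2, t5 = 100, 200, 205      # t0 = previous 'until'; t2 = max timestamp when the query starts

# Record Z as it stands when the incremental query begins (created in CD2 at t2):
table = {"Z": {"payload": "created", "ts": t2}}

# The query fixes its window up front...
window = (t0, t2)

# ...but before Z is actually read, the later Canvas update lands and bumps its timestamp:
table["Z"] = {"payload": "updated", "ts": t5}

# First incremental run: Z is skipped because t5 falls outside (t0, t2].
run_1 = [k for k, v in table.items() if window[0] < v["ts"] <= window[1]]
print("first run returns:", run_1)    # [] - the record appears to be 'missing'

# The next incremental run uses (t2, t5] and finally picks up the updated record.
run_2 = [k for k, v in table.items() if t2 < v["ts"] <= t5]
print("second run returns:", run_2)   # ['Z']
```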
The timestamps in the information you have provided are not conclusive - the updated-at timestamp from the snapshot shows a Zulu (UTC) timezone, while the records in your screenshot from the database have an updated-at time slightly later than that, with no timezone - so I am not sure whether the Zulu time from the extract has been shifted to a local timezone in the database.
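One quick way to settle that question is to take the Zulu timestamp from the extract, render it both as UTC and in your database server's local zone, and compare against what Postgres shows. The example values and the timezone below are placeholders, not taken from your data:

```python
# Compare a Zulu (UTC) timestamp from the extract with the naive value stored in the DB.
# The example values and the timezone are placeholders - substitute your own.
from datetime import datetime
from zoneinfo import ZoneInfo

extract_ts = datetime.fromisoformat("2024-05-01T03:15:42+00:00")   # the 'Z' value from the snapshot
db_ts = datetime.fromisoformat("2024-05-01 13:15:42")              # the naive value shown in Postgres

local = ZoneInfo("Australia/Melbourne")   # assumption: your database server's timezone

if extract_ts.astimezone(local).replace(tzinfo=None) == db_ts:
    print("DB value looks like the Zulu time shifted to local time")
elif extract_ts.replace(tzinfo=None) == db_ts:
    print("DB value is UTC, just stored without a timezone marker")
else:
    print("the values differ for some other reason")
```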