After upgrading to the 1.0.0 DAP client yesterday, I'm now getting a very specific error:
ERROR - [Errno 28] No space left on device
It appears that the DAP client is now using my /tmp location (which is very limited) to hold and assemble the various file parts prior to shipment to Postgres. If I'm wrong about this, please let me know, but it's certainly unexpected.
I can increase this location, but it would be nice to add this as an option in the DAP client so I don't have to modify multiple systems to allow for it. Right now I'm just looking for a "yea" or "nay" as to whether it's "by design".
Caching temporary data to disk was introduced in one of the more recent 0.3.x versions. Previously, the DAP client library operated in a memory-only streaming mode: it read records from the network connection and sent them directly to the database as they arrived. In more recent versions, this was changed to a buffered approach, whereby parts are first downloaded to disk and then inserted into the database by reading the files from disk.
The buffered approach was introduced in response to user complaints. Many users had a setup with a relatively slow database, and the database engine could not keep up with the speed at which records were being received over the network connection. This resulted in premature disconnects and, ultimately, a poor user experience. Others had issues with the AWS pre-signed URL expiring by the time they got around to downloading a file in the result set other than the first two or three. All of these issues are eliminated by the buffered approach, and disk size did not seem to be a major concern for the majority of our users.
If the development team wanted to re-introduce the streaming approach, they would have to carefully ensure they do so in a way that doesn't resurface the network connectivity and URL expiry issues.
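To make the buffered pipeline concrete, here is a minimal stand-in sketch. The function names and data are illustrative only, not the DAP client's actual internals: each part is written to a temporary file first, then read back and handed to the database, so network speed and database insert speed are decoupled.

```python
import tempfile

def fetch_part():
    """Stand-in for streaming one result-set part over the network."""
    yield b"1\tSection A\n"
    yield b"2\tSection B\n"

def insert_rows(rows):
    """Stand-in for the database insert step; returns rows inserted."""
    return len(list(rows))

with tempfile.NamedTemporaryFile() as buf:
    for chunk in fetch_part():   # download phase: network -> disk
        buf.write(chunk)
    buf.flush()
    buf.seek(0)
    inserted = insert_rows(buf)  # insert phase: disk -> database
print(inserted)  # 2
```

The trade-off discussed above falls out of this shape: the temporary file absorbs the mismatch between download and insert rates, at the cost of local disk space.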
I blew away my PG schema and restarted the initdb with each table getting this error:
raise ValueError("table already replicated, use `syncdb`")
...which should "perform a snapshot query of a table and persist the result in the database". It appears, however, that it's simply creating the table without populating it.
The DAP client library keeps track of replication state (e.g. the last time a table was retrieved from DAP) in a special tracking table called table_sync, found in the PostgreSQL schema instructure_dap. (In the case of MySQL, which has no schemas, the table is called instructure_dap__table_sync.) If you run the command dap dropdb, the entry for the dropped table is automatically removed from the tracking table as well. If you deleted the PostgreSQL table (or the entire schema) manually, you will need to remove the row(s) corresponding to the deleted table(s), or the DAP client library will continue to believe the table has been replicated because it still finds a matching row.
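As a concrete illustration of that cleanup, here is a sketch using an in-memory sqlite3 database as a stand-in. Note the column names below are assumptions for illustration; inspect the actual layout of table_sync in your instructure_dap schema before running a real DELETE.

```python
import sqlite3

# In-memory stand-in for the tracking table; against PostgreSQL the table
# lives at instructure_dap.table_sync and you would use your usual driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_sync (source_namespace TEXT, source_table TEXT)")
conn.execute("INSERT INTO table_sync VALUES ('canvas', 'course_sections')")

# After manually dropping canvas.course_sections, delete its tracking row so
# the DAP client no longer believes the table is already replicated:
conn.execute(
    "DELETE FROM table_sync WHERE source_namespace = ? AND source_table = ?",
    ("canvas", "course_sections"),
)
remaining = conn.execute("SELECT COUNT(*) FROM table_sync").fetchone()[0]
print(remaining)  # 0
```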
@LeventeHunyadi Before version 1.0.0, I was able to decide whether to initialize or sync a table by catching "NonExistingTableError". It seems a generic "ValueError" is now thrown in this scenario. This isn't as helpful, since ValueError can map to many scenarios.
I use MySQL with a BASH script, so I don't know if this is possible for your setup. Instead of relying on dap to fail, I queried the database through the command line to see if the table existed. Then I performed an initdb or syncdb depending on what results it gave.
I had the same issue and in my Python code I'm checking for the string "table not initialized" in the ValueError exception:
...
except ValueError as e:
    if "table not initialized" in str(e):
        # initialize the table
I agree that the NonExistingTableError exception was much cleaner.
--Colin
@James @ColinMurtaugh Thank you both for these suggestions; I'm also using a Python script to orchestrate the sync. Although it's more fragile, I'll most likely just check the error message for now.
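A sketch of that message-matching workaround: decide whether to run `initdb` or `syncdb` based on the ValueError the DAP client raises. The matched string follows the error text shown in this thread, and as noted it is fragile and may change between DAP releases.

```python
def choose_command(error):
    """Map a DAP sync failure to the command to run next.

    error is the ValueError raised by a failed syncdb attempt, or None if
    the sync succeeded. Matching on message text is admittedly brittle.
    """
    if error is None:
        return "syncdb"
    if "table not initialized" in str(error):
        return "initdb"
    raise error  # an unrelated ValueError: don't swallow it

print(choose_command(ValueError("table not initialized, use `initdb`")))  # initdb
print(choose_command(None))  # syncdb
```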
I finally woke up, harkened back to my "unix days", and changed the environment variable TEMPDIR, which (thankfully) the DAP client is honoring. So no more warnings about running out of space. I'm currently running a full drop/init on each table, and so far it's populating the new temp area with TSV files and uploading them to the DB. Fingers crossed...
I still think a switch built directly into the DAP client to redirect the temp files would be very helpful, especially for users who may not have access to modify their environment or exports:
dap initdb --namespace canvas --table assessment_questions --temp /var/lib/tmp
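Until such a switch exists, the temp location can also be redirected from the Python side before invoking the client, since Python's tempfile machinery honors TMPDIR. A sketch (the roomy directory is simulated here; in practice you would point TMPDIR at something like /var/lib/tmp):

```python
import os
import tempfile

# Stand-in for a roomy directory; substitute your actual large volume.
big_tmp = tempfile.mkdtemp()

os.environ["TMPDIR"] = big_tmp
tempfile.tempdir = None  # discard the cached default so TMPDIR is re-read
print(tempfile.gettempdir() == big_tmp)  # True
```

Note that tempfile caches its default directory on first use, so the variable must be set (and the cache reset, as above) before any temporary files are created in the process.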
Following the complete rebuild of my DB, I'm encountering only 1 error, but it's not allowing me to move past it on the course_sections table. I've tried both deleting the table from my DB and using the "dropdb" method with the same negative results:
$ dap dropdb --namespace canvas --table course_sections
INFO:pysqlsync.postgres:connecting to postgres@canvas-data-2.ccmfhqnzit0p.us-east-2.rds.amazonaws.com:5432/cd2
INFO:pysqlsync.postgres:PostgreSQL version 14.0.10 final
INFO:pysqlsync.postgres:connecting to postgres@canvas-data-2.ccmfhqnzit0p.us-east-2.rds.amazonaws.com:5432/cd2
INFO:pysqlsync.postgres:PostgreSQL version 14.0.10 final
2024-03-04 10:56:27,135 - ERROR - table not initialized, use `initdb`
$ dap initdb --namespace canvas --table course_sections
INFO:pysqlsync.postgres:connecting to postgres@canvas-data-2.ccmfhqnzit0p.us-east-2.rds.amazonaws.com:5432/cd2
INFO:pysqlsync.postgres:PostgreSQL version 14.0.10 final
INFO:pysqlsync.postgres:connecting to postgres@canvas-data-2.ccmfhqnzit0p.us-east-2.rds.amazonaws.com:5432/cd2
INFO:pysqlsync.postgres:PostgreSQL version 14.0.10 final
INFO:pysqlsync:synchronize schema with SQL:
CREATE TABLE "canvas"."course_sections" (
"id" bigint NOT NULL,
"name" varchar(255) NOT NULL,
"course_id" bigint NOT NULL,
"integration_id" varchar(255),
"created_at" timestamp NOT NULL,
"updated_at" timestamp NOT NULL,
"workflow_state" "canvas"."course_sections__workflow_state" NOT NULL,
"sis_batch_id" bigint,
"start_at" timestamp,
"end_at" timestamp,
"sis_source_id" varchar(255),
"default_section" boolean,
"accepting_enrollments" boolean,
"restrict_enrollments_to_section_dates" boolean,
"nonxlist_course_id" bigint,
"enrollment_term_id" bigint,
CONSTRAINT "pk_canvas_course_sections" PRIMARY KEY ("id")
);
COMMENT ON COLUMN "canvas"."course_sections"."id" IS 'The unique identifier for the section.';
COMMENT ON COLUMN "canvas"."course_sections"."name" IS 'The name of the section.';
COMMENT ON COLUMN "canvas"."course_sections"."course_id" IS 'The unique Canvas identifier for the course in which the section belongs.';
COMMENT ON COLUMN "canvas"."course_sections"."integration_id" IS 'The integration ID of the section. This field is only included if there is an integration set up between Canvas and SIS.';
COMMENT ON COLUMN "canvas"."course_sections"."created_at" IS 'Timestamp for when this section was entered into the system.';
COMMENT ON COLUMN "canvas"."course_sections"."updated_at" IS 'Timestamp for when the last time the section was updated.';
COMMENT ON COLUMN "canvas"."course_sections"."workflow_state" IS 'Life-cycle state for the section.';
COMMENT ON COLUMN "canvas"."course_sections"."sis_batch_id" IS 'The unique identifier for the SIS import if created through SIS.';
COMMENT ON COLUMN "canvas"."course_sections"."start_at" IS 'The start date for the section, if applicable. When a user is allowed to participate in a course. enrollment term dates, course dates, and course section dates flow together in all aspects of Canvas. Various dates allow different users to participate in the course. The hierarchy of dates are: course section dates override course dates, course dates override term dates.';
COMMENT ON COLUMN "canvas"."course_sections"."end_at" IS 'The end date for the section, if applicable. When a user is allowed to participate in a course.';
COMMENT ON COLUMN "canvas"."course_sections"."sis_source_id" IS 'Id for the correlated record for the section in the SIS (assuming SIS integration has been properly configured).';
COMMENT ON COLUMN "canvas"."course_sections"."default_section" IS 'True if this is the default section.';
COMMENT ON COLUMN "canvas"."course_sections"."accepting_enrollments" IS 'True if this section is open for enrollment.';
COMMENT ON COLUMN "canvas"."course_sections"."restrict_enrollments_to_section_dates" IS 'Restrict user enrollments to the start and end dates of the section. True when "Users can only participate in the course between these dates" is checked.';
COMMENT ON COLUMN "canvas"."course_sections"."nonxlist_course_id" IS 'The unique identifier of the original course of a cross-listed section.';
COMMENT ON COLUMN "canvas"."course_sections"."enrollment_term_id" IS 'Identifies the associated enrollment term.';
2024-03-04 10:59:00,321 - INFO - Query started with job ID: 8f3143fc-287f-4051-8454-4c3a76e37c6e
2024-03-04 10:59:00,602 - INFO - Data has been successfully retrieved:
{"id": "8f3143fc-287f-4051-8454-4c3a76e37c6e", "status": "complete", "expires_at": "2024-03-05T16:53:10Z", "objects": [{"id": "8f3143fc-287f-4051-8454-4c3a76e37c6e/part-00000-de8f428b-cf7c-40d7-a725-2f70a213916d-c000.tsv.gz"}], "schema_version": 1, "at": "2024-03-04T15:01:05Z"}
2024-03-04 10:59:10,594 - ERROR - null value in column "name" of relation "course_sections" violates not-null constraint
DETAIL: Failing row contains (29818, null, 17349, null, 2022-03-07 16:08:02.961, 2022-03-07 16:10:13.093, active, null, null, null, null, null, null, null, null, null).
In this situation I'd try to fix that record in the source system: I'd try to use the API to update the course_section with ID 29818 to give it a name.
It would be nice if the DAP library could log these data-integrity errors but move on with the rest of the process.
--Colin
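For anyone scripting Colin's suggestion, here is a hedged sketch using the Canvas REST Sections API ("Edit a section"). The host, token, and replacement name are placeholders, and you should verify the endpoint against your instance's API documentation before relying on it:

```python
import json
import urllib.request

# Give section 29818 a non-null name so the NOT NULL constraint passes.
req = urllib.request.Request(
    "https://canvas.example.edu/api/v1/sections/29818",
    data=json.dumps({"course_section": {"name": "Unnamed section"}}).encode(),
    headers={
        "Authorization": "Bearer <ACCESS_TOKEN>",
        "Content-Type": "application/json",
    },
    method="PUT",
)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

After the change propagates into the next Canvas Data 2 snapshot (which can take a few hours), the initdb should no longer trip over that row.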
The immediate remediation (as suggested by @ColinMurtaugh) is indeed changing the value of the record with the NULL value to a non-NULL value such that the constraint is no longer violated. Specifically, the name column for row 29818 in the table course_sections should be assigned a non-NULL value.
In general, however, the DAP API should not output records that do not conform to the declared schema. Previous versions of the DAP client library were more lenient, but the latest version strictly follows the schema returned by the DAP API schema endpoint.
I will bring this up with the team but I would also encourage opening a support ticket such that Instructure can properly track this issue, and report back to you with progress.
I've had to go through and fix several instances where data that should not be possible was present in the Canvas database and causing problems on import.
Yesterday's issue was in the learning_outcome_groups table, which has a title field that does not allow None (null). I was able to track it down because it was the only table with a title in the schema (I knew the error was somewhere after assignments and before quizzes) that didn't allow None. I downloaded the data with a TSV snapshot and found the offending lines (the error reported just one line, without a usable key.id or position, but there were about 6 or 7 records that needed changing). I was able to fix it in Canvas itself by using a GraphQL mutation to change the title on the offending records. I changed the titles to "Blank", and four hours later I was able to get the initdb to work.
I had to do something similar earlier (0.3.18) with datetimes that were out of whack. The maximum datetime MySQL allows is 9999-12-31 23:59:59.499999, and Canvas was sending .999999, which rounds out of range. I don't know if that was fixed in 1.0.0, because I fixed it through another GraphQL mutation, so it's now correct in your databases.
The customer shouldn't be required to track down where Canvas has allowed bad data to be stored in its tables. Canvas shouldn't be storing bad data in the first place.
Now it's working, except that I'm getting warnings with syncdb on MySQL 8 because you're using the deprecated INSERT ... ON DUPLICATE KEY UPDATE form without an alias. And I'm getting that warning on the console despite specifying --loglevel error and --logfile. I need to dig into that a little bit more.
Unfortunately, we cannot get rid of the warning message in MySQL 8.0.19 and later that is related to the absence of a column alias. This is due to a bug in aiomysql, a dependency that DAP client library uses indirectly. The warning itself is emitted by the MySQL driver, which is out of our control.
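Since the warning cannot be fixed at the source, one possible consumer-side workaround is a Python warnings filter keyed on the message text. The exact wording of the MySQL deprecation message is an assumption here; adjust the pattern to whatever your driver actually emits, and note this only helps if the warning surfaces through Python's warnings machinery rather than being printed directly.

```python
import warnings

# Ignore warnings whose message mentions the deprecated VALUES() function.
warnings.filterwarnings("ignore", message=r".*VALUES function.*")

# Quick self-check that the filter matches (catch_warnings resets filters,
# so the ignore rule is re-applied inside the recording context):
with warnings.catch_warnings(record=True) as caught:
    warnings.filterwarnings("ignore", message=r".*VALUES function.*")
    warnings.warn("'VALUES function' is deprecated", Warning)
    warnings.warn("unrelated warning", UserWarning)
print(len(caught))  # 1: only the unrelated warning got through
```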
Agreed that Instructure should be logging these errors and anomalies and moving forward with everything else. The course sections I've found that have null names should NOT be allowed by Instructure in the first place. If a teacher creates a section, it should not be possible to save it in the course without a "name", even if the name is nothing more than a single blank space.
Thanks @reynlds ! I had to use TMPDIR but that did the trick. The assignments and enrollments tables are beasts.
Found the offending section (with no enrollments) and deleted it. It had been created without a name by the instructor. Will see if it init's in a few hours.
You have covered it in your following post, but for anybody else encountering this: you have to add the name, not just delete the section, as the CD2 data element will still be present after deletion, just with a workflow_state of deleted.
I think you will be disappointed. You need to add a name. Just changing the workflow_state to deleted isn't going to fix the constraint problem - the record is not hard deleted, and will still appear in your Canvas Data 2 feed, just with the changed state.
Confirmed. A FULL database refresh (complete removal of the DB along with a full "initdb" using DAP), together with "fixing" the data in Canvas (adding a name to an object that was created without one), has allowed this process to continue. However, DAP should still log these errors and proceed, rather than simply bailing out on anything that does not fit exactly as defined. Again, it's up to Instructure to modify Canvas to NOT allow blank (null) names on course sections.
initdb (and I assume syncdb) seems to download .tsv files to your /tmp, which, if you have a big installation, can be a LOT of data for the larger tables. It consumed our tmp space.
IMO it should be using the .gz files and expanding each file only temporarily as needed. Alternatively, per reynlds:
In Linux
export TMPDIR=/somecustom/dir/withlotsofspace
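The ".gz" suggestion above can be sketched as follows: keep each downloaded part compressed and stream-decompress it line by line, rather than expanding whole TSV files into /tmp. The sample bytes stand in for one .tsv.gz part; this is an illustration of the idea, not the DAP client's actual code.

```python
import gzip
import io

# Simulate one compressed part as it would arrive from the DAP API.
part = io.BytesIO(gzip.compress(b"id\tname\n1\tIntro\n2\tLab\n"))

# Stream-decompress and split rows without materializing the full TSV.
with gzip.open(part, mode="rt") as tsv:
    rows = [line.rstrip("\n").split("\t") for line in tsv]
print(rows)  # [['id', 'name'], ['1', 'Intro'], ['2', 'Lab']]
```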
Also, data errors can be resolved for MySQL by making sure you have the correct character set and collation, e.g.
ALTER table
CD2.canvas__courses
CHARACTER SET = utf8mb4
COLLATE = utf8mb4_unicode_ci;
and/or similarly for the database.
Such a coincidence that we have just added the temporary storage location setup to our documentation: https://data-access-platform-api.s3.amazonaws.com/client/README.html#changing-the-temporary-storage-...