[ARCHIVED] Our Canvas data journey

wre0001
Community Member

In case it's useful for others, below is information on our locally hosted canvas data setup.

1. Setup a local linux server and installed Postgresql 9.5.x.

2. Used the provided Canvas data CLI to download and unpack the data after configuring API keys within the canvas admin portal.  GitHub - instructure/canvas-data-cli: Command line tool to connect and download files from Canvas Da...

3. Since we had to locally install nodejs for the cli, and the DB schema canvas provides is in json format, we wrote code in js to generate the postgres table creates DDL with comments.  Script attached.  Example output snippet:

drop table "course_dim";

create table "course_dim"

(

    "id" bigint,

    "canvas_id" bigint,

    "root_account_id" bigint,

    "account_id" bigint,

    "enrollment_term_id" bigint,

    "name" varchar(256),

    "code" varchar(256),

    "type" varchar(256),

    "created_at" timestamp DEFAULT TIMESTAMP 'epoch',

    "start_at" timestamp DEFAULT TIMESTAMP 'epoch',

    "conclude_at" timestamp DEFAULT TIMESTAMP 'epoch',

    "publicly_visible" boolean,

    "sis_source_id" varchar(256),

    "workflow_state" varchar(256),

    "wiki_id" bigint

) TABLESPACE canvasdata_tablespc;

comment on table "course_dim" is 'A course in the canvas system';

comment on column "course_dim"."wiki_id" is 'Foreign key to the wiki_dim table.';

comment on column "course_dim"."workflow_state" is 'Workflow status indicating the current state of the course, valid values are: completed, created, deleted, available, claimed';

comment on column "course_dim"."sis_source_id" is 'Correlated id for the record for this course in the SIS system (assuming SIS integration is configured)';

comment on column "course_dim"."publicly_visible" is 'True if the course is publicly visible';

comment on column "course_dim"."conclude_at" is 'Timestamp for when the course finishes';

comment on column "course_dim"."start_at" is 'Timestamp for when the course starts.';

comment on column "course_dim"."created_at" is 'Timestamp when the course object was created in Canvas';

comment on column "course_dim"."type" is 'TBD';

comment on column "course_dim"."code" is 'The code for the course (e.g. FA12 MATH 2000)';

comment on column "course_dim"."name" is 'The friendly name of the course.';

comment on column "course_dim"."enrollment_term_id" is 'Foreign key to enrollment term table';

comment on column "course_dim"."account_id" is 'The parent account for this course.';

comment on column "course_dim"."root_account_id" is 'The root account associated with this course.';

comment on column "course_dim"."canvas_id" is 'Primary key for this course in the canvas courses table.';

comment on column "course_dim"."id" is 'Unique surrogate id for a course';

.................................

4. With a bash shell, we invoked the canvasdatacli to unpack each table we were interested in and load it into postgres.  The unpacked files are just what postgres wants, but it seems the unpack by the CLI sometimes left blank lines within the data.  Also, a few times data inthe file was too large for the column as defined by the schema.  Since Postgres's "copy into" tool is an all or nothing, any error made the entire loading of that file fail.  So with some simplistic "sed" regex, the below script was used to skip a line that failed and retry.  Truncates and postgresql's "vacuum"(free space from deleted rows) was used to reduce disk space usage.

DATADIR=......../CanvasData/unpackedFiles

export DATADIR

psql -d canvasdata -c "vacuum;" 2>/dev/null

for t in `cat ~/Config/table_names.txt`

do

  sedfile=/tmp/${t}_load.sed

  errfile=/tmp/${t}_load.err

  fname=$DATADIR/${t}.txt

   # sed regex to delete first line of file

  echo "1d" > $sedfile

  echo "" > $errfile

  echo "Loading table $t from file $fname ..."

  until sed '/^$/d' < $fname | sed -f $sedfile | psql -d canvasdata -c "truncate $t; copy $t from STDIN" 2>$errfile

  do

    cat $errfile

    echo "Trying to load $t again, skipping bad input line..."

    grep "^CONTEXT: .*, line " < $errfile | sed 's/^.* line //' | sed 's/[:, ].*$/d/' >> $sedfile

  done

done

psql -d canvasdata -c "vacuum;" 2>/dev/null

Labels (2)