Managing Canvas Data with Embulk

robotcars
Community Champion

Embulk is an open-source bulk data loader that helps transfer data between various databases, storages, file formats, and cloud services (embulk.org, GitHub contributors).

 

Simply put, Embulk makes importing gzipped CSV files into any RDBMS* and managing the data and workflow necessary for Canvas Data with command-line tools easy, really easy, specifically solving issues we experience working with Canvas Data without fancier tools.

 

with support for

Linux, OSX, Windows https://github.com/embulk/embulk#quick-start

MySQL, MS SQL Server, Oracle, PostgreSQL, RedShift https://github.com/embulk/embulk-output-jdbc

 

* Embulk goes beyond SQL; see the List of Embulk Plugins by Category

 

and features useful for Canvas Data

  • Decode gzipped files
  • The ability to intelligently guess the format and data types of CSV files
  • Parallel execution of tasks, multi-threading per CPU core, and a task for each batch file
  • CSV input plugin as the default input for Embulk
  • Filter data with Filter Plugins, https://plugins.embulk.org/#filter
  • Output Data to SQL
    • Insert, Insert Direct, Replace, Merge, Truncate and Truncate Insert
    • Timestamp formatting
    • TimeZone conversion from UTC for date time columns
    • before_load and after_load, config options to run queries before (e.g., truncate) and after (e.g., rebuild indexes) the import
    • and more

 

Embulk uses YAML config files for each task; for Canvas Data this means each input source (table files) and its output destination (db table) is one file. This includes differences between staging, test and production destinations. I imagine your workflow and setup will be different from mine and many others'. You may only need a few tables, or only have one database, or you could simply use Embulk to manage, manipulate, filter and possibly join CSV files to examine with Tableau if that's your thing. For this reason, I have shared a set of config files each for MySQL, MSSQL, Oracle, and PostgreSQL. I have not worked with RedShift.
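To give a concrete idea of the shape of these configs, here is a minimal sketch of a seed file for one table. It is illustrative only: the paths, credentials, and columns below are assumptions for the example, not copies of the shared configs, which are more complete.

# illustrative seed for one Canvas Data table; paths, credentials,
# and columns are example values, not the shared configs
# generate a full config:   embulk guess seed.yml -o user_dim.yml
# inspect without loading:  embulk preview user_dim.yml
# run the load:             embulk run user_dim.yml
in:
  type: file
  path_prefix: /canvas/data/files-fetched/user_dim/
  decoders:
    - {type: gzip}
  parser:
    type: csv
    delimiter: "\t"       # Canvas Data files are tab-separated
    skip_header_lines: 0  # and ship without a header row
    columns:
      - {name: id, type: long}
      - {name: canvas_id, type: long}
      - {name: name, type: string}
out:
  type: sqlserver         # or mysql, postgresql, oracle, redshift
  host: localhost
  user: canvas
  password: '{{ env.DB_PASSWORD }}'  # liquid env lookup, works in .yml.liquid files
  database: canvas_staging
  table: user_dim
  mode: replace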

 

Our old workflow required that we attempt to maintain the newest data from Canvas Data for reporting, attendance, API services and automation, and LTIs. One of our biggest issues is the size of the daily batch without deltas; with the growing use of Canvas within our schools, importing everything can take a long time, and it is slow and unnecessary to hold six years' worth of data just to serve this semester. I tried different things in SQL and bash to quickly limit production data to the current school year, but never implemented them. LTI queries for attendance and submissions are really slow. Then some days the downloaded files are 0 bytes (we must have lost internet), or there were duplicates and the table didn't load, and it takes until 2pm to get everything loaded. Sometimes there are new columns in a table because I forgot to read the release notes, and we've truncated the table before importing, and it takes hours to import. And so on.

 

Some of these are human, some of these are manageable.

 

Our new workflow uses Embulk

  1. Download with Canvas Data CLI, some of that is documented here
  2. Import all CD tables, CSV in and SQL out, to the staging environment with Replace mode. Replace creates temporary tables for the import; if it fails, the previous version is still intact. After a successful import, Embulk drops the old table and runs the after_load queries, which I use for enumerable constraints and indexes (see the sketch after this list). I left a lot of examples in the configs.

    The Requests table config uses Insert mode to append the new rows.
  3. I use staging for Tableau reporting. For production, I only need to load the tables necessary for our LTIs and API services. Some of these configs are straight copies of the staging imports, except they point to production. Some of the configs create new tables with SQL in and SQL out, importing filtered or composite tables from query results using https://github.com/embulk/embulk-input-jdbc

    Here's an example: https://github.com/ccsd/canvas-data-embulk-configs/wiki/SQL-in-SQL-out
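To illustrate step 2, the out: half of such a config might look like the sketch below. The mode, after_load, and timezone lines show the options named above; the index and column names are made up for the example, not taken from the shared configs.

out:
  type: sqlserver
  host: localhost
  user: canvas
  password: '{{ env.DB_PASSWORD }}'
  database: canvas_staging
  table: user_dim
  mode: replace   # load into an intermediate table; swap it in only on success
  # the requests config uses mode: insert instead, appending new rows
  after_load: CREATE INDEX ix_user_dim_canvas_id ON user_dim (canvas_id)
  column_options:
    created_at: {timezone: America/Los_Angeles}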

 

Using RHEL7, 6 CPUs with 12 cores, and 16GB RAM, Embulk imports 7.9GB of CSVs into more than 1TB of SQL (no requests) in less than 4.5 hours, depending on which indexes you keep in the configs.

 

GitHub - ccsd/canvas-data-embulk-configs: YAML configs for importing Canvas Data with Embulk

 

12 Comments
James
Community Champion

carroll-ccsd Great Job!

Thanks for getting this written up; I know it was a major undertaking. I don't think I realized until I saw your video that Embulk would let you drop columns and filter rows to help get just the data you want.

Can you clarify your comment about benchmarking?

Embulk imports 7.9GB of CSVs into >1TB of SQL (no requests) in less than 4.5 hours

Do you mean 7.9GB of compressed CSV and 1TB of disk storage? Or does it actually create 1TB of SQL that you then run separately? If so, is the 4.5 hours just to create the SQL?

robotcars
Community Champion
Author

Absolutely, I didn't make that very clear.

Our compressed daily import (~7.9GB gzipped) loads to just over 1TB* on SQL Server 14.

I also delete requests from the previous sequences after downloading, keeping the one I haven't imported.

47GB unpacked on disk storage.

Last night's batch was 7.7GB; guessing the difference was requests for July 4*.

The import into SQL staging started at 20:30:02.846 and ended 7/6 01:25:54.919

Total time: 4 hours, 55 minutes, 52 seconds

Using the config with all indexes and constraints provided.

Appears there are some wide swings in those times. YMMV

Also note that I have this issue, so I had 6 trips and 3 retries last night. Hoping to spend some time on that one after next week.

*Live Events agrees.


robotcars
Community Champion
Author

In comparison, I just finished running our production import under time with bcp*.

As I shared in the OP, our production import only loads 23 of the 100 tables.

Those files are 3.2GB compressed, and the import completed in 174m43s, or 2.9 hours.

*bulk copy program, native SQL Server import tool which Embulk uses with multi-threading

robotcars
Community Champion
Author

Ever miss an incremental import of the Requests Table?

This often happens when our systems department does production patching over the weekend or corrupts my ODBC config, which leaves me recovering the requests for the missing dates the week after. This isn't an issue for any other table, because they get a full import.

Here's how I fix it with our multi-core system using GNU Parallel and Embulk.

First, I use a custom unpack script because it's faster than CD:CLI, and it doesn't add headers.

#!/usr/bin/env bash
# unpack.sh

path="/canvas/data/"

_unpack () {
  echo "[UNPACKING] $1"
  parallel zcat ::: "$path/files-fetched/$1/"*.gz > "$path/files-unpacked/$1.txt"
}

_unpack "$1"

Then, I use most but not all (14/16) cores to run a sort on the requests file, saving it back to the same file. Today my unpacked requests file was 56G (about 3 months' worth), so saving a second file isn't an option. I do a reverse key sort (-rk) on column 6, which is timestamp_day, ensuring the most recent dates are at the top of the file.

(export LC_ALL=C; sort -S1G --parallel=14 -rk 6 -o requests.txt requests.txt)

The next script splits the file by timestamp_day. Again, I wouldn't have the disk space to save the requests.txt file and all the split dates to the same disk, so I start running the script, watch the results land in the same directory, and kill the script after I get the files I need to import. Note that I use Embulk to convert timestamps to America/Los_Angeles on import and the file's timestamps are in UTC... so I keep a day before and a day after the dates I need to import.

#!/usr/bin/env bash
# requests-split-unpacked.sh

path="/canvas/data/files-unpacked/"
parallel 'awk '\''{print $0 >>"/canvas/data/files-unpacked/requests-"$6".txt"}'\' ::: "$path/requests.txt"

Now, I have the following files in the files-unpacked directory, because I want to load 10/30 and 10/31.

809M requests-2019-10-29.txt
837M requests-2019-10-30.txt
791M requests-2019-10-31.txt
818M requests-2019-11-01.txt
56G requests.txt

A copy of the requests.yml.liquid file makes the rest easy.

I name this copy requests-recovery.yml.liquid

  1. Set the path prefix to the unpacked directory, ending with the hyphen (requests-) so the import is restricted to the files with dates in their names.
  2. Set the table to requests_recovery; this temporary table lets us merge in the next step.

The rest of the file is the same as your normal requests.yml.liquid file.

in:
  type: file
  path_prefix: /canvas/data/files-unpacked/requests-
out:
  table: requests_recovery

Finally, a second copy of your requests.yml.liquid file needs to be made.

I name this one requests-merge-recovered.yml.liquid.

This one uses SQL in as the Embulk input source; see SQL in SQL out · ccsd/canvas-data-embulk-configs Wiki · GitHub

The query uses a JOIN to find all rows in requests_recovery that do not exist in requests, and selects them for import. This prevents duplicate PK constraint errors, because some rows from the day before and the day after your missing dates are likely already there, depending on your UTC/TimeZone or default timezone conversion settings.

in:
  type: sqlserver
  ...
  query: |
    SELECT rr.*
    FROM requests_recovery rr
    LEFT JOIN requests r ON r.id = rr.id
    WHERE r.id IS NULL

And one last query to make sure the gap is filled.

SELECT
     timestamp_day
     , MIN(timestamp) AS [first]
     , MAX(timestamp) AS [last]
     , COUNT(*) total
FROM requests
GROUP BY timestamp_day
ORDER BY [last] DESC


jsavage2
Community Contributor

Hi carroll-ccsd, everyone,

I've been really intrigued by this project for a while, and I hope this is the right place for this question. My apologies if it's not...

I'm curious how everyone feels about running this in production environments, and where you see the long-term prognosis heading.

I've played around with embulk in testing, and really love the concept, but the dependency on Java 7/8 with no apparent plans to support anything more modern is disconcerting. 

robotcars
Community Champion
Author

 @jsavage2 ,

This is the place.

Personally, I'm OK with running this in production, but my reasons come down to some very hard constraints and determination.

  1. I am the single person in charge of our enterprise integrations. We briefly had a second programmer; it took me 18 months to show we needed the position and could use it to do more. We built our first LTI, which was a nice case study for the online school here. I suggested a raise for that developer afterward, and it was impossible; he quickly exited to the first department that offered a higher position in February '18. We flew the position, had an unsuccessful hire, and flew it again, then in this year's budget crunch got told 'it's been vacant so long you must not need it'. :smileycry:
  2. We continue to work in a low-tech environment and budget, in that our department's budget is essentially the staff and the per-user license of the products we support. Getting funding for something like AWS, which I would love to use more of, is difficult even when our costs for on-premise infrastructure or quotes for Azure would be similar. We are just not allowed those fungible dollars. This leaves me exploring and utilizing whatever tech I can, doing more with a few Linux VMs than I'd ever thought possible. Also, Linux and data files are a pretty fantastic pair; check out harelba/q and VisiData. The AWS solutions, I imagine, would save us a lot of time on some of the FE/LTI projects we'd like to rehash.

    It's the code you don't write that has the most value.
  3. I really just need the data. Embulk isn't fancy like AWS... or is it? I mean, at the end of the day, every day, I ETL a TB+ of SQL and consume real-time data from Canvas with LEDbelly. I can answer questions and look up stats on just about anything to help our department, schools and teachers; conceptualize new ideas; and/or create Tableau reports that would be better as an LTI but work well distributed as workbooks or PDFs.
  4. I found Embulk, and shared it here after tackling the community-use aspect first, so it could be easily used by anyone with similar constraints instead of having to use a different solution; see all the other CD Loaders. A project's contributor count has quickly become a huge metric for me. More specifically, it gets hard evaluating how to maintain old projects and third-party projects while tackling new ones. Burnout is always around the corner with such a small team. So where I can, I've settled on trying to find the best tool with the most contributors that can survive the following:
    1. Is the project capable of supporting more than a single purpose, use case or environment? The Embulk configs offer multiple data sources/destinations, and Embulk isn't developed by or within the limited scope of Canvas. I can use Embulk to load other files too, including Canvas Reports, the gradebook, etc.
    2. Is the project likely to break during a Canvas deployment, or mid school year... and how quickly can I shift if necessary?
    3. How many users are going to be affected? I rarely deploy any code on the FE that will affect students and teachers, because if they depend on it and I can't reproduce or repair it within a few days, its cost of support is too high.
    4. Is the project maintained by one person or an institution? When was their last ping, activity, or response to their thread or overall support within the community? What's the likelihood they are going to continue supporting their code?

      ex:  @James is, for lack of a better term, a one-man coding machine, and this is a huge problem, because if James retires all his code should be deprecated unless someone picks it up as the maintainer, which would be a huge loss for a lot of Canvas users. I haven't seen many developers on the Canvas Community become the maintainer for an abandoned project. James' dedication and the amount of time he spends reading, listening, and thinking about the issues and the solutions he can provide in great detail has me willing to use his projects. In the past, I've raised the dead on 2 JavaScript hacks that broke after a Canvas release. However, my own User Script for Admin Tray - Sub Account Menu has been broken for a few weeks since the latest jQuery changes, because I just don't have time to update it.
    These configs and even LEDbelly are currently maintained by me... but I wrote them, and I continually work to make them easier for others to use and maintain. The technologies are borrowed, not mine: they are updated regularly via their repos, their maintainers have open support channels via IRC/GitHub/Twitter, and they can easily be maintained by the community. Embulk is a core with plugins, while LEDbelly is essentially 2 primary gems (Shoryuken and Sequel), a Ruby Hash, and a really long CASE statement.

    My projects, like many others, are used by some; I can even see that in the Embulk config and LEDbelly GitHub traffic, but few contribute even simple feedback. Some ask questions, and then, as we commonly see around here, get their answers and disappear; some of those conversations are left hanging, with unanswered questions, and we never learn how issues were solved, whether they were solved, or whether the task was abandoned entirely. It's sad to think that some have been solved and never documented for others. And yet others, who aren't developers, have spent days or weeks collaborating with me, and we've made changes that make it easier to just configure, install, and update.

    The biggest concern I have in supporting and choosing any community software in this environment is the lack of holistic collaboration: if I use a tool, I'm depending on my own time and that of the author, and I'm dependent on their ability to shift focus when Canvas breaks the project with a weekend or mid-week deployment, even when they or I have our own deadlines or issues to deal with.

    So I think long-term is possible, but some of the developer culture in the community has to level up a bit.

  5. Embulk is a Big Data tool; it has 44 contributors and dozens more via plugins, and the latest commit was last week. The developers went on to found Arm Treasure Data, and they still use the product and offer the open-source version for free: Open Source Customer Data Platform - Arm Treasure Data. There is enough community on the repo that I think it will continue to be supported, and that support for something like Java 9 is a matter of necessity and maybe time.
    [ those were just my thoughts, then I started trying to answer the question ]
    Java [version history - Wikipedia] is a weird thing to support, apparently... Java 8 was released in March 2014, 9 in September 2017, 10 and 11 both in 2018, and 12 in March 2019, but wait... 13 was released in September 2019. If I were using a core technology that released major versions multiple times a year, I don't think I'd want to implement it quickly either, especially while the current version still works. 8 is also still the supported version on Oracle Java Downloads for All Operating Systems, and when I run `sudo yum search java`, 8, 11, and 13 are the only JDK options available. Which led me to ask why... java - Why Oracle JDK 9 download ends so early? - Stack Overflow
  6. Finally, if Canvas Data ever becomes more than TSV files, Embulk offers a lot of potential, since the plugins already exist to shift to whatever Canvas does with CD 2.0. RedShift and S3 are already supported, Parquet could be an option, and other AWS options could be supported via Embulk ODBC/JDBC connectors.

Alternatively, if you find something better, or can share something better for a slightly different use case, please do! Build a Canvas Data Warehouse on AWS in 30 minutes!

We don't all have to use the same solution, but we can share and document the ones we use for others. As a community, we should probably focus our efforts on a few that work for most, instead of developing more single-tech, single-author solutions. This would greatly reduce the barriers to entry for those needing low-tech/low-cost solutions. I shared these projects because, in my first pass at collaborating with other K12 Canvas schools, I found their biggest issue was getting access to and managing the data. I'm hoping that as the data becomes more accessible we can collaborate on data use.

I'm here, I'm not retiring for a bit, I need this tech, that's why I share. :smileygrin:

jsavage2
Community Contributor

Hi carroll-ccsd,

Thank you very much for the detailed, thoughtful reply!

I really only meant to ask what the feelings were about the future of the Embulk project itself. I'm eyeing it for all the same reasons you cite.

But not being a Java guy (except occasionally by necessity), I wasn't really sure how to interpret the dependencies in light of Oracle's supposed roadmap for Java (which the Wikipedia page highlights--and color codes!--in all its absurdity), which says Java 8 is already in LTS and officially EOL in a few months, plus the new (to me, at least) terms around the TOS/"Oracle Technology Network License Agreement for Oracle Java SE" and production use.

It sounds like it's an active project and the consensus is "don't worry."

That works for me!

Thanks for all your work on this!

And if I do dive in, I can promise feedback, and possibly pull requests.

--j

robotcars
Community Champion
Author

Five years have passed since we released the initial version of Embulk. It may be a good time to release v1, but we still have had some concerns to step up to v1. It turned out that we have to give up some plugin compatibilities to resolve those concerns. For these one or two years, we have somehow managed with minimum updates, but the limit is approaching.

This year, we steel ourselves to resolve those concerns with incompatible updates, and aim to release v1 eventually. Let us explain the milestones to v1, and expected responses against the incompatibility.

...

Embulk v0.10 series, which is a milestone to v1.0

robotcars
Community Champion
Author

The master branch for this repo has been renamed to main.

kimhuang
Community Participant

@robotcars Your post is a lifesaver. Thank you so much! I have taken the "embulk" approach to upload the gz files on my Mac (downloaded by the CLI tool) into a remote Oracle database by utilizing your config files (ccsd/canvas-data-embulk-configs) on GitHub. Most of the tables loaded beautifully. I noticed that the Embulk Oracle plugin added quotes around the Oracle table names. (I wish it had not done that.) Therefore, when I need to query the content in a table, I need to quote the table name like this: select * from "user_dim".

The loading of the requests table from 453 requests gz files failed. I guess the table size might need to be increased. I will try to load one gz file first to identify the problem.
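(For anyone debugging a similar failure: one way to load just one file is the same path_prefix trick as the recovery config above, pointing the prefix at a single archive so Embulk only picks up that file. A minimal sketch, showing only the changed keys; the file name and table are illustrative, not from the shared configs.)

in:
  type: file
  # a prefix matching exactly one archive restricts the import to that file
  path_prefix: /canvas/data/files-fetched/requests/requests-00000-
out:
  table: requests_test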

I will run some queries and create some DB views to familiarize myself with the data in the tables. I hope in the near future we will be able to visualize the data in Tableau.

I really appreciate your contribution to the community. And please let me know if I can help with anything. Will keep you updated about my progress.

Thanks again,

-Kim