@schang11 ,
I'm going to provide three things to try. Hopefully one of them helps.
All files test okay with gunzip
If the files are all fine with gunzip, then you might check to make sure that you have the latest version of the zlib package for node that works with your version of node. Sometimes the ones that come with the package managers aren't the latest. I don't know that this is the issue, but there were some "unknown compression method" posts in 2014 involving the zlib.js file.
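If you want to see which versions are in play, something like this works from any shell (this assumes you installed the CLI globally with -g; adjust if you didn't):
node --version                     # your Node version
node -p "process.versions.zlib"    # the zlib version bundled with that Node
npm ls -g canvas-data-cli          # which version of the CLI is installed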
By the way, you can use this gunzip command with a BASH shell to test them. Run it in the dataFiles/requests folder.
for f in *.gz ; do gunzip -t "$f" ; done
The problem with gunzip -t *.gz is that it stops after the first error, so it doesn't check every file once it finds a bad one. There is gunzip -l *.gz (that's a lowercase L), but it only reads the headers and trailers, so it can still report values for a file that is corrupt.
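If you would rather see only the names of the bad files, a small variation on that loop works (a sketch; it assumes GNU gzip and a Bash shell, run from the dataFiles/requests folder):
for f in *.gz ; do gunzip -t "$f" 2>/dev/null || echo "corrupt: $f" ; done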
Corrupted files / Re-downloading
The rest of this might not help you if your files are all valid, but maybe someone else will find it of use, so I'm going to leave it here.
I finally broke down and installed the CLI tool so I could try to diagnose what you're saying. When I ran it, there were some request files that were incomplete and much smaller than the others. Some of the ones that were smaller were valid files.
But my error wasn't the same as yours; mine reached the maximum number of attempts trying to download a file with no URL.
The error you reported isn't complete: the files it references aren't part of the CLI tool itself but of its included libraries, and you didn't include the line that called it. Normally, a crash dump includes a stack trace. Mine had this:
/usr/lib/node_modules/canvas-data-cli/lib/FileDownloader.js:32
if (attempt > MAX_ATTEMPTS) return cb(new Error('max number of retries reached for ' + fileUrl + ', aborting'));
^
ReferenceError: fileUrl is not defined
at FileDownloader._downloadRetry (/usr/lib/node_modules/canvas-data-cli/lib/FileDownloader.js:32:94)
at null._onTimeout (/usr/lib/node_modules/canvas-data-cli/lib/FileDownloader.js:46:26)
at Timer.listOnTimeout (timers.js:92:15)
From that, I can tell to go look at line 32 of the FileDownloader.js file.
That said, I figured out enough to kind of guess what I think might be happening and make a suggestion for your problem.
When you look at the readme from the CLI tool site, it says this about the sync process.
canvasDataCli sync -c path/to/config.js
will start the sync process.
On the first sync, it will look through all the data exports and download only the latest version of any tables that are not marked as partial
and will download any files from older exports to complete a partial table.
On subsequent executions, it will check for newest data exports after the last recorded export, delete any old tables if the table is NOT a partial
table and will append new files for partial tables.
That makes it sound like it only goes back for the older files of a partial table (which requests is) on the first sync.
So it sounds like you can manually download the files, place them into the dataFiles folder, and name them so they start with the sequence number. You'll need to do that for all of the missing files between the corrupt ones and the last sequence number, which is found in the state.json file.
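For example, something along these lines (the file name here is made up; use the real name from the dump you download, prefixed with its sequence number):
mv ~/Downloads/requests-00001-abcdef12.gz dataFiles/requests/76_requests-00001-abcdef12.gz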
Unfortunately for me, the program crashed after downloading 187 out of 297 request files on the initial run and never wrote a state.json file. Four of those were incomplete. That's 3.5 GB of data it got before it crashed. Unfortunately, when you run the program again, it wipes out all existing files rather than checking whether they are the same ones it would be downloading again, so the 3.5 GB was gone. The next time, it claimed that it ran successfully and downloaded 8.5 GB of data, but it said the sequence was 52. When I downloaded the dump file manually, it said its sequence number was 81.
Even worse, it saved a schema.json file that had the 1.2.0 schema in it, even though the current one is 1.3.0. Sequence 52 isn't even the last sequence to use schema 1.2.0; that was sequence 53 for us. They may be starting with 0 instead of 1, though, so the sequence numbering may not match what's in the actual dump.
When I ran the sync again, it started over with version 1.3.0 of the schema and deleted all existing files, including the requests -- presumably because they were schema 1.2.0. So I had to download 12 GB of data (between the crash and the worthless 1.2.0 schema) before I could ever get access to the correct and most current dump. We're not a huge institution, either.
I would say the program needs its logic reworked, but it's definitely easier to use than downloading the files by hand. It would be nice if the sync command supported the --filter option like the unpack command does. You may also want to go to GitHub and file an issue; I don't know if the developer monitors the community.
However, if you're lucky and you had that initial download and you've got the state.json file, you may be able to just supplement the bad files with the manual downloads as I described. Just make sure that the file names use the sequence number from the dump, but that the sequence in state.json (if you need to modify it) is one less than the actual sequence number.
Summary of this section:
Here's how you should be able to solve the issue of corrupt files. Let's say that the first bad file starts with 76_. That means sequence 76 was corrupt. Edit the state.json file and change the sequence number to 75 (it needs to be one less than the bad one). Then re-run the sync command. It will redownload everything from 76 on, but that's still easier than doing it by hand.
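In practice that is just a backup, an edit, and a re-run (the location of state.json depends on your config, so cd to wherever yours lives first; a plain text editor works just as well as the shell):
cp state.json state.json.bak              # keep a backup before editing
# open state.json and change the sequence from 76 to 75, then:
canvasDataCli sync -c path/to/config.js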
Luckily for us, the complete dumps are only about 500 MB while the requests are 7.3 GB, so this step isn't too bad.
If you ever need to determine which *.gz files are corrupt, the gunzip loop from the first section of this reply (run from the dataFiles/requests folder) will do it.
Check hard drive space
Finally, it shouldn't be the cause of an "unknown compression method" error, but be aware of how much hard drive space you have available. There is a copy of the data in compressed form and then a copy in uncompressed form. My request files were 7.3 GB compressed and 45 GB uncompressed, so I would need a drive with at least 53 GB free just for the requests table.
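Two quick checks before running unpack (paths are from my setup; adjust to yours):
df -h .                      # free space on the drive you're running from
du -sh dataFiles/requests    # size of the compressed request files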
I ran the unpack command. 45 minutes later, it finished without any errors, so that didn't help in diagnosing the problem. Sorry.