CD2 Invalid Parquet GZip Format
I am trying out Canvas Data 2 in preparation for writing an ELT pipeline, and I noticed an issue with version 1.1.0 of the instructure-dap-client library. Whether I fetch data with the command-line tool or the Python library, I get errors about an invalid gzip format, and it only happens with Parquet files: CSV and JSONL downloads succeed. Here is the command I used (the environment variable credentials are set):
$ dap snapshot --namespace canvas --table accounts --format parquet
And this is the code I used:
import time
import asyncio
from pathlib import Path

from dotenv import load_dotenv

from dap.api import DAPClient
from dap.dap_types import Format, SnapshotQuery


async def main():
    load_dotenv()
    path = Path("logs")
    start = time.time()
    async with DAPClient() as session:
        query = SnapshotQuery(format=Format.Parquet, mode=None)
        await session.download_table_data("canvas", "accounts", query, path, decompress=True)
    print(f"Downloaded canvas accounts in {time.time() - start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(main())
In both cases the data downloads successfully, so I don't think it's an authentication issue. When I try to gunzip the Parquet files (after renaming them from "part-000*.gz.parquet" to "part-000*.parquet.gz"), I get this error:
gzip: part-0000*.parquet.gz: not in gzip format
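A quick way to see what is actually in those files is to check the magic bytes: gzip data starts with the bytes 0x1f 0x8b, while a raw Parquet file starts with the string PAR1. Here is a small diagnostic sketch (the logs directory and glob pattern are from my setup):

from pathlib import Path

# Check whether each downloaded part is gzip data or a raw Parquet file.
for f in Path("logs").glob("part-*"):
    with f.open("rb") as fh:
        header = fh.read(4)
    if header[:2] == b"\x1f\x8b":
        print(f"{f.name}: gzip-compressed")
    elif header == b"PAR1":
        print(f"{f.name}: raw Parquet, no gzip wrapper")
    else:
        print(f"{f.name}: unrecognized header {header!r}")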
When I run the code, I get this error:
Traceback (most recent call last):
File "/app/elt/CanvasSrc/CanvasData2Interaction.py", line 29, in <module>
asyncio.run(main())
File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/app/elt/CanvasSrc/CanvasData2Interaction.py", line 19, in main
await session.download_table_data(
File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 676, in download_table_data
downloaded_files = await self.download_objects(objects, directory, decompress)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 566, in download_objects
local_files = await gather_n(downloads, concurrency=DOWNLOAD_CONCURRENCY)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dap/concurrency.py", line 186, in gather_n
results = await _gather_n(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dap/concurrency.py", line 111, in _gather_n
raise exc
File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 584, in download_object
return await self.download_resource(resource, output_directory, decompress)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/dap/api.py", line 541, in download_resource
await file.write(decompressor.decompress(await stream.read()))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
zlib.error: Error -3 while decompressing data: incorrect header check
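That zlib error is the generic failure raised whenever gzip-style decompression is applied to data that has no gzip header, which is consistent with the files being raw Parquet. A standalone illustration (not the library's actual code path, just a reproduction of the error):

import zlib

# A gzip-mode decompressor (wbits = 16 + MAX_WBITS) rejects anything
# that does not begin with the gzip magic bytes.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
try:
    decompressor.decompress(b"PAR1 not gzip data")
except zlib.error as exc:
    print(exc)  # Error -3 while decompressing data: incorrect header check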
Has anyone else come across this issue?
Solved! The accepted solution is below.
Yes, you're correct. I've just tried...
tables = await dap_client.get_tables("canvas")
# tables = ['accounts']
try:
    for table in tables:
        start2 = time.time()
        await dap_client.download_table_schema(namespace, table, output_directory)
        logger.info(f"Table schema downloaded for: {table}")
        snapshot_dir = os.path.join(output_directory, "snapshot")
        os.makedirs(snapshot_dir, exist_ok=True)
        snapshot_query = SnapshotQuery(format=Format.Parquet, mode=None)
        download_result = await dap_client.download_table_data(
            namespace, table, snapshot_query, snapshot_dir, decompress=False
        )
        ...
Which works for me.
So for the text formats (Format.CSV, Format.JSONL, Format.TSV), decompress can be True or False; for Format.Parquet, use decompress=False. (I then move/rename each file to its table name in the snapshot output_directory and remove the jobxxx directory.)
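If you want to sanity-check the Parquet parts after downloading with decompress=False, you can open them directly; a minimal sketch, assuming pyarrow is installed and snapshot_dir is the directory from the snippet above:

from pathlib import Path
import pyarrow.parquet as pq

# Each part should open as a valid Parquet file despite the .gz in its name.
for part in Path(snapshot_dir).rglob("part-*"):
    table = pq.read_table(part)
    print(f"{part.name}: {table.num_rows} rows, {table.num_columns} columns")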