After upgrading to version 0.3.2 of the instructure-dap-client Python library, I'm trying to use the dap tool to fetch some tables. Most of the time I'm getting errors like this:
dap syncdb --table accounts
2023-03-29 16:08:39,614 - INFO - Query started with job ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXXXX
2023-03-29 16:08:39,614 - INFO - Query job still in status: waiting. Checking again in 5 seconds...
2023-03-29 16:08:45,040 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-03-29 16:08:50,403 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-03-29 16:08:56,918 - WARNING - Received error in response:
Traceback (most recent call last):
File "/Users/cmurtaugh/.pyenv/versions/canvas-data-2/bin/dap", line 8, in <module>
sys.exit(console_entry())
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 121, in console_entry
main()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/commands.py", line 33, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 55, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 51, in execute
await self._execute_impl(args, session)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/syncdb_command.py", line 67, in _execute_impl
await self._save_resources(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/syncdb_command.py", line 123, in _save_resources
await self._download_and_save(context_aware_object, table_def, session, db)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/syncdb_command.py", line 140, in _download_and_save
resource_array = await session.get_resources([object])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/api.py", line 421, in get_resources
response = await self._post("/dap/object/url", objects, ResourceResult)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/api.py", line 247, in _post
return await self._process(response, response_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/api.py", line 292, in _process
raise error_object
dap.dap_error.ProcessingError
Anyone else seeing this? I've tried both initdb and syncdb with a few different tables. Occasionally it works, but most of the time I get the error above.
--Colin
@ColinMurtaugh We recently found this bug as well. It's fixed and we'll release it EOD. You'll have to upgrade your CLI:
pip3 install instructure-dap-client --upgrade --upgrade-strategy=eager
Edina
Thanks - I'll keep an eye out! (I'm currently using the latest available on PyPI, 0.3.2.)
--Colin
We have identified a similar issue, which has been addressed in version 0.3.3. Feel free to reach out if the issue persists with the latest version of the DAP client library and CLI.
Hi @LeventeHunyadi --
It seems to be working a little better (I was able to sync a couple of tables successfully), but trying to init a new table gives this error now:
2023-03-30 11:33:24,655 - INFO - Downloading [object 4/4 - job b521df24-5222-4a14-95ff-7b2c76aa60cd] (downloader.py:97)
2023-03-30 11:33:32,002 - DEBUG - Inserting 224235 records from [object 2/4 - job b521df24-5222-4a14-95ff-7b2c76aa60cd] (db_operations.py:76)
Traceback (most recent call last):
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 770, in rollback
self.await_(self._transaction.rollback())
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 102, in await_only
return current.driver.switch(awaitable) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 160, in greenlet_spawn
value = await result
^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/asyncpg/transaction.py", line 219, in rollback
await self.__rollback()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/asyncpg/transaction.py", line 198, in __rollback
await self._connection.execute(query)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/asyncpg/connection.py", line 317, in execute
return await self._protocol.query(query, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "asyncpg/protocol/protocol.pyx", line 323, in query
File "asyncpg/protocol/protocol.pyx", line 707, in asyncpg.protocol.protocol.BaseProtocol._check_state
asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1116, in _rollback_impl
self.engine.dialect.do_rollback(self.connection)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 657, in do_rollback
dbapi_connection.rollback()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 772, in rollback
self._handle_exception(error)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 717, in _handle_exception
raise translated_error from error
sqlalchemy.dialects.postgresql.asyncpg.AsyncAdapt_asyncpg_dbapi.InterfaceError: <class 'asyncpg.exceptions._base.InterfaceError'>: cannot perform operation: another operation is in progress
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/cmurtaugh/.pyenv/versions/canvas-data-2/bin/dap", line 8, in <module>
sys.exit(console_entry())
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 121, in console_entry
main()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/commands.py", line 33, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 45, in execute
await self._execute_impl(args)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/initdb_command.py", line 31, in _execute_impl
await init_db(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/actions/init_db.py", line 14, in init_db
async with DatabaseConnection(connection_string) as db_connection:
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/ext/asyncio/engine.py", line 431, in close
await greenlet_spawn(self._proxied.close)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 165, in greenlet_spawn
result = context.throw(*sys.exc_info())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1232, in close
self._transaction.close()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2559, in close
self._do_close()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2697, in _do_close
self._close_impl()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2683, in _close_impl
self._connection_rollback_impl()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2675, in _connection_rollback_impl
self.connection._rollback_impl()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1118, in _rollback_impl
self._handle_dbapi_exception(e, None, None, None, None)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2326, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1116, in _rollback_impl
self.engine.dialect.do_rollback(self.connection)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 657, in do_rollback
dbapi_connection.rollback()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 772, in rollback
self._handle_exception(error)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 717, in _handle_exception
raise translated_error from error
sqlalchemy.exc.InterfaceError: (sqlalchemy.dialects.postgresql.asyncpg.InterfaceError) <class 'asyncpg.exceptions._base.InterfaceError'>: cannot perform operation: another operation is in progress
(Background on this error at: https://sqlalche.me/e/20/rvf5)
2023-03-30 11:33:32,301 - ERROR - Exception terminating connection <AdaptedConnection <asyncpg.connection.Connection object at 0x10769c4a0>> (base.py:384)
Traceback (most recent call last):
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 380, in _close_connection
self._dialect.do_terminate(connection)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 984, in do_terminate
dbapi_connection.terminate()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 793, in terminate
self._connection.terminate()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/asyncpg/connection.py", line 1344, in terminate
self._abort()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/asyncpg/connection.py", line 1371, in _abort
self._protocol.abort()
File "asyncpg/protocol/protocol.pyx", line 569, in asyncpg.protocol.protocol.BaseProtocol.abort
File "asyncpg/protocol/protocol.pyx", line 668, in asyncpg.protocol.protocol.BaseProtocol._handle_waiter_on_connection_lost
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
self._check_closed()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
2023-03-30 11:33:32,307 - ERROR - The garbage collector is trying to clean up non-checked-in connection <AdaptedConnection <asyncpg.connection.Connection object at 0x10769c4a0>>, which will be terminated. Please ensure that SQLAlchemy pooled connections are returned to the pool explicitly, either by calling ``close()`` or by using appropriate context managers to manage their lifecycle. (base.py:1036)
sys:1: SAWarning: The garbage collector is trying to clean up non-checked-in connection <AdaptedConnection <asyncpg.connection.Connection object at 0x10769c4a0>>, which will be terminated. Please ensure that SQLAlchemy pooled connections are returned to the pool explicitly, either by calling ``close()`` or by using appropriate context managers to manage their lifecycle.
--Colin
Also, when I try to initdb a large table like pseudonyms or users, I'm getting timeout errors like this:
Traceback (most recent call last):
File "/Users/cmurtaugh/.pyenv/versions/canvas-data-2/bin/dap", line 8, in <module>
sys.exit(console_entry())
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 121, in console_entry
main()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/commands.py", line 33, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 45, in execute
await self._execute_impl(args)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/initdb_command.py", line 31, in _execute_impl
await init_db(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/actions/init_db.py", line 16, in init_db
await SQLReplicator(session, db_connection).initialize(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/replicator/sql.py", line 49, in initialize
await client.download(processor)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/downloader.py", line 74, in download
await wait_n(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/concurrency.py", line 39, in wait_n
raise exc
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/downloader.py", line 72, in logged_download_and_save
await self._download(context_aware_object, processor=processor)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/downloader.py", line 101, in _download
async for record in get_json_lines_from_gzip_stream(stream):
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/payload.py", line 47, in get_json_lines_from_gzip_stream
async for chunk in stream.iter_chunked(1024 * 1024):
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/aiohttp/streams.py", line 35, in __anext__
rv = await self.read_func()
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/aiohttp/streams.py", line 385, in read
await self._wait("read")
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/aiohttp/streams.py", line 303, in _wait
with self._timer:
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/aiohttp/helpers.py", line 706, in __enter__
raise asyncio.TimeoutError from None
TimeoutError
I was able to work around it by editing line 82 of dap/api.py to have a timeout of 300 instead of 30:
timeout=aiohttp.ClientTimeout(total=300, connect=30),
So it seems that the total timeout includes the database updates as well.
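For anyone hitting the same thing, here's a minimal standalone illustration (my own sketch, not the dap client's actual code) of why raising the total helps: aiohttp's total timeout keeps ticking while each chunk is being processed.
import asyncio
import aiohttp

async def stream(url: str) -> None:
    # `total` bounds the entire request/response lifetime; `connect` bounds
    # only connection setup. Slow per-chunk processing (e.g. waiting on
    # database inserts) still counts against the `total` budget.
    timeout = aiohttp.ClientTimeout(total=300, connect=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            async for chunk in response.content.iter_chunked(1024 * 1024):
                await asyncio.sleep(0)  # stand-in for per-chunk processing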
Sorry for all the noise, but I came across another error when I tried to initdb the context_external_tools table. I assume this is due to the settings field, which is of type "object" in the schema, presumably a JSON field in the database:
Traceback (most recent call last):
File "/Users/cmurtaugh/.pyenv/versions/canvas-data-2/bin/dap", line 8, in <module>
sys.exit(console_entry())
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 121, in console_entry
main()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/commands.py", line 33, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 45, in execute
await self._execute_impl(args)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/initdb_command.py", line 31, in _execute_impl
await init_db(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/actions/init_db.py", line 16, in init_db
await SQLReplicator(session, db_connection).initialize(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/replicator/sql.py", line 49, in initialize
await client.download(processor)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/downloader.py", line 82, in download
await processor.close()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/database/init_processor.py", line 63, in close
await self._db_operations.flush()
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/database/db_operations.py", line 71, in flush
await self._execute_flush(recordsets)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/database/db_operations.py", line 79, in _execute_flush
await self._copy_func(recordset.records_to_copy)
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/dap/database/db_operations.py", line 116, in _copy_func
await driver_conn.copy_records_to_table(
File "/Users/cmurtaugh/.pyenv/versions/3.11.2/envs/canvas-data-2/lib/python3.11/site-packages/asyncpg/connection.py", line 983, in copy_records_to_table
return await self._protocol.copy_in(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "asyncpg/protocol/protocol.pyx", line 525, in copy_in
File "asyncpg/protocol/protocol.pyx", line 472, in asyncpg.protocol.protocol.BaseProtocol.copy_in
File "asyncpg/protocol/codecs/base.pyx", line 206, in asyncpg.protocol.protocol.Codec.encode
File "asyncpg/protocol/codecs/base.pyx", line 189, in asyncpg.protocol.protocol.Codec.encode_in_python
TypeError: descriptor 'encode' for 'str' objects doesn't apply to a 'dict' object
Thanks for calling attention to these issues (lack of synchronization/locking on database client objects, time-out when replicating large tables, and improper serialization of JSON type), they have been addressed in version 0.3.4. Feel free to reach out if any of these issues persist with the latest version of the DAP client library and CLI.
Thank you! It's working great so far -- really appreciate the quick response!
--Colin
As I continue to initdb our tables, I did run into another timeout (using 0.3.4) when loading our attachments table. It had downloaded 4 of 22 files when it timed out.
I'm pretty new to using async in Python, but it seems to me like the steps (query, wait, download, import) would ideally be more independent of each other, each able to run to completion however long it takes.
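Something like a bounded queue between the download and import steps is the kind of thing I'm imagining (purely illustrative; fetch and insert_records are hypothetical helpers, not dap APIs):
import asyncio

async def downloader(queue: asyncio.Queue, urls: list[str]) -> None:
    for url in urls:
        data = await fetch(url)  # hypothetical download helper
        await queue.put(data)
    await queue.put(None)  # sentinel: no more work

async def importer(queue: asyncio.Queue) -> None:
    while (batch := await queue.get()) is not None:
        await insert_records(batch)  # hypothetical database import helper

async def main(urls: list[str]) -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)  # bounded backpressure
    await asyncio.gather(downloader(queue, urls), importer(queue))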
I'm also hoping that this library makes it to GitHub with some documentation; ultimately I'd like to put together a pipeline in AWS using Lambda (maybe with Step Functions), and I'm guessing that all of the async stuff is going to be a problem there.
--Colin
I totally agree with the separation of steps into concurrent execution streams, as this has solid support with async/await. Resume functionality would also be helpful to recover from network errors. I am working with engineers to get these into the implementation.
I also agree with publishing to an open repository (e.g. GitHub) so that the project can more easily incorporate contributions. This work is underway, but we need to get answers to a few questions internally (e.g. related to product, support, legal) before it can happen.
In the past, I used aiobotocore to interact with AWS services through the async/await interface.
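For example, a minimal sketch (the bucket name is a placeholder):
import asyncio
from aiobotocore.session import get_session

async def list_bucket(bucket: str) -> None:
    session = get_session()
    # create_client returns an async context manager wrapping an S3 client
    async with session.create_client("s3") as client:
        response = await client.list_objects_v2(Bucket=bucket)
        for obj in response.get("Contents", []):
            print(obj["Key"])

asyncio.run(list_bucket("my-example-bucket"))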
We are still getting a timeout error when trying to initdb large tables (e.g. assignments):
2023-04-03 10:14:44,533 - DEBUG - Traceback (most recent call last):
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\__main__.py", line 126, in console_entry
main()
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "C:\Program Files\Python311\Lib\asyncio\runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\commands.py", line 31, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\base.py", line 45, in execute
await self._execute_impl(args)
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\initdb_command.py", line 31, in _execute_impl
await init_db(
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\actions\init_db.py", line 16, in init_db
await SQLReplicator(session, db_connection).initialize(
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\replicator\sql.py", line 49, in initialize
await client.download(processor)
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\downloader.py", line 78, in download
await wait_n(
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\concurrency.py", line 39, in wait_n
raise exc
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\downloader.py", line 76, in logged_download_and_save
await self._download(context_aware_object, processor=processor)
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\downloader.py", line 105, in _download
async for record in get_json_lines_from_gzip_stream(stream):
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\payload.py", line 23, in get_json_lines_from_gzip_stream
compressed_data = await stream.read(64 * 1024)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\aiohttp\streams.py", line 385, in read
await self._wait("read")
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\aiohttp\streams.py", line 303, in _wait
with self._timer:
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\aiohttp\helpers.py", line 721, in __exit__
raise asyncio.TimeoutError from None
TimeoutError
(__main__.py:140)
2023-04-03 10:14:44,543 - ERROR - (__main__.py:141)
And when I tried to increase the timeout as Colin mentioned, I got the following error:
2023-04-03 11:03:15,066 - DEBUG - Traceback (most recent call last):
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\__main__.py", line 126, in console_entry
main()
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "C:\Program Files\Python311\Lib\asyncio\runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\Python311\Lib\asyncio\base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\commands.py", line 31, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\base.py", line 45, in execute
await self._execute_impl(args)
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\commands\initdb_command.py", line 31, in _execute_impl
await init_db(
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\actions\init_db.py", line 16, in init_db
await SQLReplicator(session, db_connection).initialize(
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\replicator\sql.py", line 49, in initialize
await client.download(processor)
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\downloader.py", line 78, in download
await wait_n(
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\concurrency.py", line 39, in wait_n
raise exc
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\downloader.py", line 76, in logged_download_and_save
await self._download(context_aware_object, processor=processor)
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\downloader.py", line 105, in _download
async for record in get_json_lines_from_gzip_stream(stream):
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\dap\payload.py", line 23, in get_json_lines_from_gzip_stream
compressed_data = await stream.read(64 * 1024)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\aiohttp\streams.py", line 385, in read
await self._wait("read")
File "C:\Users\96122\AppData\Roaming\Python\Python311\site-packages\aiohttp\streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
(__main__.py:140)
2023-04-03 11:03:15,076 - ERROR - Response payload is not completed (__main__.py:141)
I got another error when initializing the attachments table:
2023-04-03 09:21:28,939 - ERROR - Exception terminating connection <AdaptedConnection <asyncpg.connection.Connection object at 0x10914c040>>
Traceback (most recent call last):
File "< ... >/sqlalchemy/pool/base.py", line 380, in _close_connection
self._dialect.do_terminate(connection)
File "< ... >/sqlalchemy/dialects/postgresql/asyncpg.py", line 984, in do_terminate
dbapi_connection.terminate()
File "< ... >/sqlalchemy/dialects/postgresql/asyncpg.py", line 793, in terminate
self._connection.terminate()
File "< ... >/asyncpg/connection.py", line 1344, in terminate
self._abort()
File "< ... >/asyncpg/connection.py", line 1371, in _abort
self._protocol.abort()
File "asyncpg/protocol/protocol.pyx", line 569, in asyncpg.protocol.protocol.BaseProtocol.abort
File "asyncpg/protocol/protocol.pyx", line 668, in asyncpg.protocol.protocol.BaseProtocol._handle_waiter_on_connection_lost
File "< ... >/asyncio/base_events.py", line 761, in call_soon
self._check_closed()
File "< ... >/asyncio/base_events.py", line 519, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
2023-04-03 09:21:28,946 - ERROR - The garbage collector is trying to clean up non-checked-in connection <AdaptedConnection <asyncpg.connection.Connection object at 0x10914c040>>, which will be terminated. Please ensure that SQLAlchemy pooled connections are returned to the pool explicitly, either by calling ``close()`` or by using appropriate context managers to manage their lifecycle.
sys:1: SAWarning: The garbage collector is trying to clean up non-checked-in connection <AdaptedConnection <asyncpg.connection.Connection object at 0x10914c040>>, which will be terminated. Please ensure that SQLAlchemy pooled connections are returned to the pool explicitly, either by calling ``close()`` or by using appropriate context managers to manage their lifecycle.
sys:1: RuntimeWarning: coroutine 'SnapshotClientFactory.get_client.<locals>.download.<locals>.logged_download_and_save' was never awaited
I tried changing line 80 of dap/downloader.py from this:
logged_download_and_save(obj, obj_index)
to this:
await logged_download_and_save(obj, obj_index)
which seems to have cleared up the first error. Now I'm getting an error that appears to be related to the actual data being downloaded/inserted:
File "asyncpg/protocol/protocol.pyx", line 529, in copy_in
asyncpg.exceptions.InvalidTextRepresentationError: invalid input value for enum canvas.attachments__file_state: "__dap_unspecified__"
--Colin
I tried "dap initdb --table discussion_entries" which ran for ~33 minutes before failing with:
Traceback (most recent call last):
File "< ... >/bin/dap", line 8, in <module>
sys.exit(console_entry())
^^^^^^^^^^^^^^^
File "< ... >/dap/__main__.py", line 121, in console_entry
main()
File "< ... >/dap/__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "< ... >/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "< ... >/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "< ... >/dap/commands/commands.py", line 31, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/dap/commands/base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/dap/commands/base.py", line 45, in execute
await self._execute_impl(args)
File "< ... >/dap/commands/initdb_command.py", line 31, in _execute_impl
await init_db(
File "< ... >/dap/actions/init_db.py", line 16, in init_db
await SQLReplicator(session, db_connection).initialize(
File "< ... >/dap/replicator/sql.py", line 49, in initialize
await client.download(processor)
File "< ... >/dap/downloader.py", line 78, in download
await wait_n(
File "< ... >/dap/concurrency.py", line 28, in wait_n
pending = set(
^^^^
File "< ... >/dap/concurrency.py", line 29, in <genexpr>
_make_task(awaitable) for _, awaitable in zip(range(concurrency), iterator)
^^^^^^^^^^^^^^^^^^^^^
File "< ... >/dap/concurrency.py", line 15, in _make_task
return asyncio.create_task(aw) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/asyncio/tasks.py", line 374, in create_task
task = loop.create_task(coro)
^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/asyncio/base_events.py", line 436, in create_task
task = tasks.Task(coro, loop=self, name=name, context=context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: a coroutine was expected, got None
We have identified the root cause of the following exception:
asyncpg.exceptions.InvalidTextRepresentationError: invalid input value for enum canvas.attachments__file_state: "__dap_unspecified__"
This seems to have been an oversight in how enumeration names are handled: dunder (double underscore) names have not been given special treatment, and __dap_unspecified__ is such a name. Our engineering team is already working on addressing this issue, which will be incorporated in the next patch version, 0.3.5.
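Conceptually, the fix treats such sentinels specially during value conversion, along these lines (a simplified sketch with an illustrative function name, not the actual patch):
from typing import Optional

def convert_enum_value(value: str) -> Optional[str]:
    # Dunder names such as "__dap_unspecified__" are sentinels, not real
    # enum members; map them to NULL rather than passing them to PostgreSQL
    # as enum input.
    if value.startswith("__") and value.endswith("__"):
        return None
    return value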
Unfortunately, I don't have an immediate answer (I couldn't reproduce the issue yet), but we will be looking into uncovering the root cause so that we can incorporate a fix in the upcoming version.
@LeventeHunyadi -- let me know if I can provide any more details or test anything -- happy to help!
--Colin
Fortunately, we have managed to track down most of the issues mentioned in this thread; thanks for bringing them to our attention and for providing context and detailed logs. Version 0.3.5 of the DAP client library is now available on PyPI.
We have a few remaining leads on how we can optimize performance and improve robustness. However, we would also appreciate your feedback on the new version. If you run into any issues, don't hesitate to let us know.
I was able to successfully init our wiki_pages and discussion_entry_participants tables with 0.3.5, but I'm still getting errors with submissions, quiz_submissions, discussion_entries and attachments tables. The errors are mostly similar to this:
Traceback (most recent call last):
File "< ... >/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py", line 770, in rollback
self.await_(self._transaction.rollback())
File "< ... >/lib/python3.11/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 102, in await_only
return current.driver.switch(awaitable) # type: ignore[no-any-return]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 160, in greenlet_spawn
value = await result
^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/asyncpg/transaction.py", line 219, in rollback
await self.__rollback()
File "< ... >/lib/python3.11/site-packages/asyncpg/transaction.py", line 198, in __rollback
await self._connection.execute(query)
File "< ... >/lib/python3.11/site-packages/asyncpg/connection.py", line 317, in execute
return await self._protocol.query(query, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "asyncpg/protocol/protocol.pyx", line 323, in query
File "asyncpg/protocol/protocol.pyx", line 707, in asyncpg.protocol.protocol.BaseProtocol._check_state
asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
but the error from the submissions table is different:
Traceback (most recent call last):
File "< ... >/bin/dap", line 8, in <module>
sys.exit(console_entry())
^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/__main__.py", line 121, in console_entry
main()
File "< ... >/lib/python3.11/site-packages/dap/__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "< ... >/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/commands/commands.py", line 31, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/commands/base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/commands/base.py", line 45, in execute
await self._execute_impl(args)
File "< ... >/lib/python3.11/site-packages/dap/commands/initdb_command.py", line 31, in _execute_impl
await init_db(
File "< ... >/lib/python3.11/site-packages/dap/actions/init_db.py", line 16, in init_db
await SQLReplicator(session, db_connection).initialize(
File "< ... >/lib/python3.11/site-packages/dap/replicator/sql.py", line 49, in initialize
await client.download(processor)
File "< ... >/lib/python3.11/site-packages/dap/downloader.py", line 78, in download
await wait_n(
File "< ... >/lib/python3.11/site-packages/dap/concurrency.py", line 46, in wait_n
raise exc
File "< ... >/lib/python3.11/site-packages/dap/downloader.py", line 76, in logged_download_and_save
await self._download(context_aware_object, processor=processor)
File "< ... >/lib/python3.11/site-packages/dap/downloader.py", line 106, in _download
await processor.process(record, obj)
File "< ... >/lib/python3.11/site-packages/dap/database/init_processor.py", line 57, in process
tuple(converter(record) for converter in self._converters), obj
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/database/init_processor.py", line 57, in <genexpr>
tuple(converter(record) for converter in self._converters), obj
^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/type_conversion.py", line 61, in <lambda>
return lambda record_json: _get_optional_value(
^^^^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/type_conversion.py", line 25, in _get_optional_value
return type_cast(value) if value is not None else None
^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/site-packages/dap/timestamp.py", line 32, in valid_naive_datetime
return datetime.fromisoformat(s[:-1])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: year 0 is out of range
sys:1: RuntimeWarning: coroutine 'SnapshotClientFactory.get_client.<locals>.download.<locals>.logged_download_and_save' was never awaited
Also -- this thread is getting a little unwieldy; should I start a new one? FWIW, I'd love to be able to submit these details as GitHub issues 😉
Thanks,
Colin
@ColinMurtaugh Thank you for your observations. We're investigating what the root cause might be.
To the GitHub question: we are planning to open source the client library so that developers can open issues as well as contribute to the code. This is planned to roll out by the end of April.
Interesting.
asyncpg errors appear to be synchronization errors.
asyncpg.exceptions._base.InterfaceError: cannot perform operation: another operation is in progress
We are looking into whether these errors originate from how we interact with SQLAlchemy (a Python package dependency of the DAP client library), how database driver operations are scheduled, or whether the database driver is under too much load. Can you change CONCURRENCY to 1 in downloader.py and see if you can reproduce the issue?
CONCURRENCY: int = 1
As for the following ValueError, can you share what job ID produced this output?
ValueError: year 0 is out of range
Based on the job ID, our team can look into the data, and locate the incorrect field value. In general, I would advise always including a job ID if you have a specific failure tied to malformed data.
As for the following RuntimeWarning, it seems like noise.
sys:1: RuntimeWarning: coroutine 'SnapshotClientFactory.get_client.<locals>.download.<locals>.logged_download_and_save' was never awaited
The DAP client library runs several operations concurrently (e.g. it downloads multiple files), and if there is an error, some operations never make it to the scheduler. We will adjust the code to cancel these operations explicitly so that no warning is emitted.
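Conceptually, the adjustment will look something like this (a simplified sketch with illustrative names, not the actual wait_n implementation):
import asyncio
from typing import Coroutine, Iterator

async def run_limited(coros: Iterator[Coroutine], limit: int) -> None:
    # Keep at most `limit` coroutines scheduled at any one time.
    pending = {asyncio.create_task(c) for _, c in zip(range(limit), coros)}
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                task.result()  # re-raises the first failure, if any
            # top up the pool from the remaining coroutines
            for _, c in zip(range(limit - len(pending)), coros):
                pending.add(asyncio.create_task(c))
    except Exception:
        for task in pending:
            task.cancel()  # cancel in-flight tasks
        for c in coros:
            c.close()  # close never-scheduled coroutines: no "never awaited" warning
        raise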
Re: the "year 0" error, the line immediately preceding the error was:
2023-04-05 15:37:47,917 - INFO - Downloading [object 1/16 - job 1f62eea6-98a3-482f-aacc-29b5f1302d72]
I changed CONCURRENCY to 1 and tried to init our discussion_entries table; it gave a similar error to what I saw before:
2023-04-06 08:42:10,676 - INFO - Downloading [object 1/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
2023-04-06 08:42:22,963 - INFO - Downloading [object 2/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
2023-04-06 08:42:37,596 - INFO - Downloading [object 3/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
2023-04-06 08:42:50,233 - INFO - Downloading [object 4/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
2023-04-06 08:45:40,408 - INFO - Inserted 256606 records from [object 1/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
2023-04-06 08:48:18,052 - INFO - Inserted 254436 records from [object 2/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
2023-04-06 08:50:55,695 - INFO - Inserted 254387 records from [object 3/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
2023-04-06 08:53:20,815 - INFO - Inserted 234571 records from [object 4/13 - job a8527112-8a6c-4815-a0e3-a6d86fb1c415]
Traceback (most recent call last):
File "< ... >/dap", line 8, in <module>
sys.exit(console_entry())
^^^^^^^^^^^^^^^
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 121, in console_entry
main()
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/__main__.py", line 115, in main
asyncio.run(dapCommand.execute(args))
File "< ... >/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/commands.py", line 31, in execute
executed = await super().execute(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 49, in execute
if await subcommand.execute(args):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/base.py", line 45, in execute
await self._execute_impl(args)
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/commands/initdb_command.py", line 31, in _execute_impl
await init_db(
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/actions/init_db.py", line 16, in init_db
await SQLReplicator(session, db_connection).initialize(
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/replicator/sql.py", line 49, in initialize
await client.download(processor)
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/downloader.py", line 78, in download
await wait_n(
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/concurrency.py", line 46, in wait_n
raise exc
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/downloader.py", line 76, in logged_download_and_save
await self._download(context_aware_object, processor=processor)
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/downloader.py", line 105, in _download
async for record in get_json_lines_from_gzip_stream(stream):
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/dap/payload.py", line 23, in get_json_lines_from_gzip_stream
compressed_data = await stream.read(64 * 1024)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/aiohttp/streams.py", line 385, in read
await self._wait("read")
File "< ... >/envs/canvas-data-2/lib/python3.11/site-packages/aiohttp/streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
sys:1: RuntimeWarning: coroutine 'SnapshotClientFactory.get_client.<locals>.download.<locals>.logged_download_and_save' was never awaited
Looks like a completely different error message (ignore the identical last line, which is noise; it's the second-to-last line that matters).
We have a 0.3.6 version under review that improves execution speed and stability, but most of our team are out on holiday. People will likely pick up Tuesday next week where they left off today. A release would likely be published on Wednesday or Thursday.
@LeventeHunyadi did you mean 0.3.7? I should have clarified that I am still not able to initdb several of our tables using 0.3.6. I can gather error output if that is helpful.
Please do if you can; any error traces are incredibly helpful, especially if you include a job ID. All of the performance-specific errors should now be gone in client library version 0.3.6, but there may be data-related errors, e.g. the DAP API returning some unexpected data that the tool is not yet able to handle. I can relay the issues to the right team, who can look into the data and see if anything is amiss. Also, we can continue improving the tool based on your feedback to be better prepared for these unexpected cases.
@LeventeHunyadi -- I've attached a file containing an error trace from running "dap initdb --table quiz_submissions". I'll note that I ran the command on my laptop, and the postgresql database was running in AWS RDS. It's a db.t3.small instance running postgres 13.7.
Interestingly, I was able to successfully run "dap initdb --table quiz_submissions" pointing to a postgresql database running on my laptop (pgsql 14).
If I have a chance I will experiment with a larger AWS RDS instance, and with pgsql 14. But it seems like the fact that the RDS instance is more resource-constrained than my local pgsql database has something to do with the failures.
--Colin
I was able to successfully run "initdb --table quiz_submissions" pointing to an AWS RDS db.m5.large instance, but it was painfully slow: 45 minutes (vs. 3 minutes pointing to the local pgsql db on my laptop). And the db.m5.large instance type is quite a bit more expensive than the db.t3.small that we're using already.
It appears that almost all of the db waits were "Client:ClientRead"; I'll have to see what options are available to speed that up.
Where do you run the instructure-dap-client instance, i.e. the machine on which you execute the following?
python3 -m dap initdb --table quiz_submissions
Note that there is a fair bit of network communication between instructure-dap-client and the database server into which you are inserting records, which is especially true for an initial snapshot consisting of millions of records. If the EC2 instance running instructure-dap-client and the database server are not in the same AWS region, there may be increased network latency, which will significantly reduce performance.
Client:ClientRead may be partly due to CPU load on the instance running instructure-dap-client. We are working on further reducing this load such that data is sent more promptly to the database server, which would in turn spend less time waiting.
I've been mostly running the dap client on my laptop, connected to our university network which has very good connectivity to AWS us-east-1. The RDS instance is also located in us-east-1. The load on my laptop when I run the dap command seems low but I will have to monitor it more closely. I guess I'd expect a little slower performance when connecting to the RDS instance, but I'm surprised it's more than an order of magnitude slower.
At the end of the day on Friday I started to try running the dap command on an EC2 instance (in the same region and VPC as the RDS instance), but for some reason the whole instance ground to a halt shortly after I ran the command. I couldn't even SSH into the box at that point; I had to reboot the instance. I tried both Python 3.11 and 3.9 and saw the same behavior; same thing when I tried different EC2 instance types. I haven't spent much time debugging it yet though.
@LeventeHunyadi I've made some more progress: I've been able to successfully initdb all but one of our tables, both from EC2 and from Lambda; I needed to provide at least 8GB of memory in order for the operation to complete.
Now the only table I'm having trouble with is the submissions table; still getting "ValueError: year 0 is out of range". Job ID is ea108826-87fa-4c2d-a689-dc9d16375b4d and the full trace is attached.
--Colin
Note that you can decrease the amount of memory required if you decrease concurrency, which has the value 4 by default. When snapshot pieces are processed concurrently, they have to be kept in memory simultaneously, which will increase memory consumption. Another option is to decrease the database update batch size, i.e. the number of records gathered before a database command is triggered. By default,
RECORDS_PER_FLUSH = 1000000
which is a relatively large value.
Looking at the data in your root account, it seems that the Canvas table submissions has some funky dates for the column submitted_at, e.g.
0001-01-01 00:00:00
These look invalid, and we are reaching out to Canvas teams internally to look into why these values are in the database. As an interim workaround, you can catch ValueError exceptions raised for dates, and set a default value.
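For example, a local patch along these lines (based on the valid_naive_datetime shown in the traceback; the fallback value here is just a placeholder, pick whatever default suits your schema):
from datetime import datetime

def valid_naive_datetime(s: str) -> datetime:
    try:
        return datetime.fromisoformat(s[:-1])  # strip the trailing "Z"
    except ValueError:
        # e.g. "year 0 is out of range": substitute a recognizable sentinel
        return datetime.min  # 0001-01-01 00:00:00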
I added a try/catch in timestamp.py to deal with the ValueError and I was able to successfully initdb our submissions table (running dap on my laptop, connecting to a db.t3.small AWS RDS instance). That process took just over 15 minutes.
So now my challenge is to try to get these operations to work better in AWS Lambda. The main issue there is that we're limited to 15 minutes of running time (and we don't really want the Lambda running while we're waiting for a query to complete anyway). Right now I have this in my Lambda handler:
asyncio.get_event_loop().run_until_complete(
    sync_db(
        base_url=api_base_url,
        namespace=namespace,
        table_name=table_name,
        credentials=credentials,
        connection_string=conn_str,
    )
)
This basically works as long as the whole operation completes within the Lambda timeout. Since the sync_db call includes all of:
- submitting the query
- waiting for the query job to complete
- downloading the result files
- importing the records into the database
the whole process can take a long time and sometimes times out. I'd like to be able to move at least some of these major steps into separate Lambda functions, i.e.:
Function 1: submit the query and capture the job ID
Function 2: poll the job status until it's complete
Function 3: download the results and import them into the database
Function 3 still might take quite a long time, but at least we would not be billed in Lambda for all of the time that we spend waiting for the query to complete, and it would be less likely to time out.
I'm picking through the dap library code, but so far I'm struggling a bit with how to separate the sync_db() call into the three functions I outlined above. The async nature of the code is unfamiliar to me and it's taking me some time to try to figure out how it works. Each step seems tightly tied to the next. @LeventeHunyadi if you or anyone else have any pointers/ideas, I'd love to hear them!
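Here's roughly the shape I have in mind (the helpers below are hypothetical placeholders, not functions that exist in the dap library):
def start_query_handler(event, context):
    # Function 1: submit the query and return immediately with the job ID
    job_id = start_query_job(event["table_name"])  # hypothetical
    return {"job_id": job_id, "table_name": event["table_name"]}

def poll_status_handler(event, context):
    # Function 2: invoked periodically by a Step Functions wait/choice loop
    status = get_job_status(event["job_id"])  # hypothetical
    return {**event, "status": status}

def download_import_handler(event, context):
    # Function 3: runs only once the job is complete; the long-running piece
    download_and_import(event["job_id"], event["table_name"])  # hypothetical
    return {"done": True}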
--Colin
Thanks for that info -- I think it'll make the most sense for me to wait for the client library API; glad to hear that's in the works.
I also noticed that a new 0.3.7 version is out on PyPI -- I'll be trying that out momentarily.
@ColinMurtaugh wrote: I added a try/catch in timestamp.py to deal with the ValueError and I was able to successfully initdb our submissions table ....
--Colin
Thanks for this suggestion. I was having the same issue with our Courses and EnrollmentStates tables. We're fully initdb'd now!
this did not answer my question
@ColinMurtaugh I'm getting the same "RuntimeError: Event loop is closed" and was racking my brain all yesterday. Glad to know it's not just me.
@LeventeHunyadi What is the best way to receive a notification when 0.3.5 is available?
@ptart There's an RSS feed for packages on PyPI (https://pypi.org/project/instructure-dap-client/#history), and once Instructure open-sources it, watching the project somewhere like GitHub should alert you.
@ptart We will be publishing the availability of new versions and what's changed with regard to the CLI and API in the Canvas deploy notes (https://community.canvaslms.com/t5/Canvas-Releases/Canvas-Release-Notes-2023-04-15/ta-p/560530) and the API change log (https://community.canvaslms.com/t5/Canvas-Change-Log/2023-API-and-CLI-Change-Log/ta-p/549097), respectively.
These are published on a two-week cadence.
Nevertheless, if we do bugfixes or hotfixes, it can happen that we release a new patch version. Another way to get notified when a new CLI version is released is to subscribe to the RSS feed from PyPI: https://pypi.org/rss/project/instructure-dap-client/releases.xml
Hope this helps.
Edina