Hi all,
I followed the section 'Getting latest changes with an incremental query' from the link
https://data-access-platform-api.s3.amazonaws.com/client/README.html#command-line-usage
and wrote the code below to retrieve web_logs for a datetime range.
==========================================================================================
import os
import asyncio
from datetime import datetime, timezone
from urllib.parse import ParseResult, urlparse

import aiofiles

from dap.api import DAPClient
from dap.dap_types import Credentials, Format, IncrementalQuery

base_url: str = os.environ["DAP_API_URL"]
client_id: str = os.environ["DAP_CLIENT_ID"]
client_secret: str = os.environ["DAP_CLIENT_SECRET"]
credentials = Credentials.create(client_id=client_id, client_secret=client_secret)

# timestamp returned by last snapshot or incremental query
last_seen = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
until_seen = datetime(2024, 1, 2, 0, 0, 0, tzinfo=timezone.utc)
print("Last seen: ", last_seen)
print("Until: ", until_seen)


async def main():
    async with DAPClient(base_url, credentials) as session:
        query = IncrementalQuery(
            format=Format.JSONL,
            mode=None,
            filter=None,
            since=last_seen,
            until=until_seen,
        )
        result = await session.get_table_data("canvas_logs", "web_logs", query)
        resources = await session.get_resources(result.objects)
        for resource in resources:
            components: ParseResult = urlparse(str(resource.url))
            file_path = os.path.join(
                os.getcwd(), "data", os.path.basename(components.path)
            )
            print("File path: ", file_path)
            async with session.stream_resource(resource) as stream:
                async with aiofiles.open(file_path, "wb") as file:
                    # save gzip data to file without decompressing
                    async for chunk in stream.iter_chunked(64 * 1024):
                        await file.write(chunk)


starttoday = datetime.now()
print("Start datetime:", starttoday)
asyncio.run(main());
endtoday = datetime.now()
print("End datetime:", endtoday)
========================================================================================
After running the code with Python, it reported a problem on the line 'components: ParseResult = urlparse(str(resource.url))', so I modified it to
components: ParseResult = urlparse(str(resource))
Then I ran it again, and this time it showed:
Last seen: 2024-01-01 00:00:00+00:00
Until: 2024-01-02 00:00:00+00:00
Start datetime: 2024-02-19 18:42:52.493787
File path: /home/adm1/cd2/script/data/part-00000-0a9d64fc-013f-4315-9225-260949ce4fdf-c000.json.gz
Traceback (most recent call last):
  File "/home/adm1/cd2/script/get_weblogs_date_range.py", line 51, in <module>
    asyncio.run(main());
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/adm1/cd2/script/get_weblogs_date_range.py", line 42, in main
    async with session.stream_resource(resource) as stream:
AttributeError: __aenter__
Does anyone have any idea about it? Sorry, I am a complete beginner with Python. Many thanks!
@canvastech --
This isn't really how CD2 is meant to be used; you're meant to fetch an initial snapshot of the table once, and after that fetch incremental updates on a regular basis to maintain a local replica of the original table.
If you're trying to get log events for a particular day, you should fetch everything locally first and then query your local table. The since/until parameters are not meant to be used as a way to filter the source data -- they're only meant to properly sequence the stream of updates that you apply to your local replica table. The DAP CLI will take care of all of this for you, including keeping track of the correct value to use for the "since" parameter. I know it seems like a lot of data to fetch, but this is the only way to do it correctly.
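In code, that workflow looks roughly like this with the same Python client used above (just a sketch, not tested; I'm assuming SnapshotQuery and the at/until timestamps on the results behave as described in the client README):
==========================================================================================
import asyncio
import os

from dap.api import DAPClient
from dap.dap_types import Credentials, Format, IncrementalQuery, SnapshotQuery

credentials = Credentials.create(
    client_id=os.environ["DAP_CLIENT_ID"],
    client_secret=os.environ["DAP_CLIENT_SECRET"],
)


async def maintain_replica():
    async with DAPClient(os.environ["DAP_API_URL"], credentials) as session:
        # 1. One-off: fetch a full snapshot of the table.
        snapshot = await session.get_table_data(
            "canvas_logs", "web_logs", SnapshotQuery(format=Format.JSONL, mode=None)
        )
        # ... download snapshot.objects and load them into your local store ...

        # 2. The snapshot result carries a timestamp (per the README); this becomes
        #    the "since" of the first incremental query.
        last_seen = snapshot.at

        # 3. On every later run, fetch only what changed since the last run.
        incremental = await session.get_table_data(
            "canvas_logs",
            "web_logs",
            IncrementalQuery(
                format=Format.JSONL, mode=None, filter=None, since=last_seen, until=None
            ),
        )
        # ... apply incremental.objects to the local store, then persist
        #     incremental.until as the "since" of the next run ...


asyncio.run(maintain_replica())
==========================================================================================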
--Colin
Whilst true in general, the web logs are a bit of a special case. Ignoring the quirks of the DAP client, if you are actually calling the API endpoints manually, web_logs can use the since and until parameters to do what is suggested here: getting logs for a single day (see the sketch below).
In general, you should always fetch a full snapshot and then fetch incremental updates, because the data store only contains the most recent version of the element for any given key (the one with the latest update timestamp). Web logs are never updated; they are just individual items that are logged, so in this particular case it would be quite acceptable to filter on the dates. You can't do that for any table that permits updates to records.
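For completeness, calling the API directly for a one-day window would look something like this (a rough sketch; the endpoint paths and job status names are my reading of the DAP API reference, so double-check them before relying on this):
==========================================================================================
import os
import time

import requests

API = "https://api-gateway.instructure.com"
client_id = os.environ["DAP_CLIENT_ID"]
client_secret = os.environ["DAP_CLIENT_SECRET"]

# Exchange the client credentials for a short-lived access token.
token = requests.post(
    f"{API}/ids/auth/login",
    auth=(client_id, client_secret),
    data={"grant_type": "client_credentials"},
).json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Submit an incremental query for a single day of web_logs.
job = requests.post(
    f"{API}/dap/query/canvas_logs/table/web_logs/data",
    headers=headers,
    json={
        "format": "jsonl",
        "since": "2024-01-01T00:00:00Z",
        "until": "2024-01-02T00:00:00Z",
    },
).json()

# Poll the job until it finishes (bounded, so an unexpected status cannot loop forever).
for _ in range(120):
    if job.get("status") in ("complete", "failed"):
        break
    time.sleep(5)
    job = requests.get(f"{API}/dap/job/{job['id']}", headers=headers).json()

# The completed job lists the generated objects; their presigned download URLs are
# then requested from the object-URL endpoint as described in the API reference.
print(job)
==========================================================================================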
While web_logs looks like an immutable table, it does (infrequently) receive updates to earlier records (data patches), and there are examples of this happening in the past. If you try using since and until to window on the data, you might miss updated records. For example, suppose you have a web log record with the timestamp 2024-02-10, and that record is updated on 2024-02-28. In this particular case, a since of 2024-02-01 and an until of 2024-02-15 would completely miss this record.
I agree that usually you can window on the data, because the event time of web_logs is typically closely related to the commit time; in other words, records are typically ingested into DAP soon after they are generated. However, this is not always the case, and when it is not, you might lose data.
I would recommend always chaining incremental queries, populating the since timestamp of the next incremental query request based on the until timestamp returned in the previous incremental query response, as described in the documentation.
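In code, the chaining is as simple as feeding the returned timestamp back in (a sketch; I'm assuming the incremental result exposes an until attribute, as the README's examples suggest):
==========================================================================================
from dap.dap_types import Format, IncrementalQuery


async def fetch_next_increment(session, previous_until):
    """Run one incremental query whose 'since' is the previous query's 'until'."""
    query = IncrementalQuery(
        format=Format.JSONL,
        mode=None,
        filter=None,
        since=previous_until,  # exactly the value returned by the previous response
        until=None,            # let the server choose the upper bound
    )
    result = await session.get_table_data("canvas_logs", "web_logs", query)
    # ... download and apply result.objects to the local copy ...
    return result.until  # persist this; it is the 'since' of the next request
==========================================================================================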
Whilst I understand the concepts of since (and until, which is not really practical in most cases), it concerns me that updates are being made to web_logs, which people treat as point-in-time logging of what actually happened.
Even if data fixes and patches are applied (say, for the sake of argument, there was a corruption and users had the wrong UUID for a period of time), patching and updating the actual user data, enrolments, submissions, etc. to reflect the correct values would be expected. However, I am sure most people would not expect a log of the state at the time of the HTTP call to be changed after the fact. That sort of activity really invalidates any use of those logs as an audit trail.
Completely agree with this. I have always understood the requests and web_logs tables to contain processed web server access logs. We plan to ingest this data on a daily basis using the since and until values as outlined in the documentation. It's problematic to find out now that data patches can be applied to update earlier records. @LeventeHunyadi what is the recommended approach to deal with this if performing daily incremental updates? Thanks
If you follow the practice of chaining incremental queries with one another, populating the since timestamp of the next incremental query request based on the until timestamp returned in the previous incremental query response (that you executed the day before), you are good. This will fetch all daily record updates, and any retroactive data patches that might have been applied. (Just to emphasize, data patches are rare but do occasionally occur.)
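For a daily job, that means persisting the returned until between runs; a minimal file-based sketch (the file name and layout are just illustrative):
==========================================================================================
import json
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("web_logs_cursor.json")  # hypothetical location for the cursor


def load_since() -> datetime:
    # The 'until' saved by yesterday's run becomes today's 'since'.
    return datetime.fromisoformat(json.loads(STATE_FILE.read_text())["until"])


def save_until(until: datetime) -> None:
    # Store the 'until' returned by today's incremental query for tomorrow's run.
    STATE_FILE.write_text(json.dumps({"until": until.isoformat()}))
==========================================================================================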
In general, I would caution against use of web_logs as an audit trail. Canvas Data, the predecessor to DAP/CD 2 had a long disclaimer about this:
Disclaimer: The data in the requests table [equivalent of web_logs in CD 1] is a 'best effort' attempt, and is not guaranteed to be complete or wholly accurate. This data is meant to be used for rollups and analysis in the aggregate, _not_ in isolation for auditing or other high-stakes analysis involving examining single users or small samples. As this data is generated from the Canvas request log files, not a transactional database, there are many places along the way data can be lost and/or duplicated (though uncommon).
Since web_logs data for CD 1 and CD 2 comes from the same source, the same limitations would apply.
web_logs is a derived table, with data combined from multiple sources. Because commit timestamps are row-level, the commit timestamp may change even if only a single value is updated. Some columns are truly immutable (e.g. event time) but others may get updates, even if infrequently. This is why it's safest to rely on chaining since/until as described in the documentation.
DAP, in general, does not support point-in-time queries (i.e. fetch state of DAP as seen at some earlier time). DAP API only lets you update your current local state to the latest state in DAP.