Hi all,
I followed the section 'Getting latest changes with an incremental query' from the link
https://data-access-platform-api.s3.amazonaws.com/client/README.html#command-line-usage
and wrote the code below to retrieve web_logs for a datetime range.
==========================================================================================
import os
import asyncio
from datetime import datetime, timezone
from urllib.parse import ParseResult, urlparse

import aiofiles

from dap.api import DAPClient
from dap.dap_types import Credentials, Format, IncrementalQuery

base_url: str = os.environ["DAP_API_URL"]
client_id: str = os.environ["DAP_CLIENT_ID"]
client_secret: str = os.environ["DAP_CLIENT_SECRET"]
credentials = Credentials.create(client_id=client_id, client_secret=client_secret)

# timestamp returned by last snapshot or incremental query
last_seen = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
until_seen = datetime(2024, 1, 2, 0, 0, 0, tzinfo=timezone.utc)
print("Last seen: ", last_seen)
print("Until: ", until_seen)


async def main():
    async with DAPClient(base_url, credentials) as session:
        query = IncrementalQuery(
            format=Format.JSONL,
            mode=None,
            filter=None,
            since=last_seen,
            until=until_seen,
        )
        result = await session.get_table_data("canvas_logs", "web_logs", query)
        resources = await session.get_resources(result.objects)
        for resource in resources:
            components: ParseResult = urlparse(str(resource.url))
            file_path = os.path.join(
                os.getcwd(), "data", os.path.basename(components.path)
            )
            print("File path: ", file_path)
            async with session.stream_resource(resource) as stream:
                async with aiofiles.open(file_path, "wb") as file:
                    # save gzip data to file without decompressing
                    async for chunk in stream.iter_chunked(64 * 1024):
                        await file.write(chunk)


starttoday = datetime.now()
print("Start datetime:", starttoday)
asyncio.run(main());
endtoday = datetime.now()
print("End datetime:", endtoday)
========================================================================================
After running the code with Python, it reported a problem on the line 'components: ParseResult = urlparse(str(resource.url))', so I modified it to
components: ParseResult = urlparse(str(resource))
Then I ran it again, and this time it showed:
Last seen: 2024-01-01 00:00:00+00:00
Until: 2024-01-02 00:00:00+00:00
Start datetime: 2024-02-19 18:42:52.493787
File path: /home/adm1/cd2/script/data/part-00000-0a9d64fc-013f-4315-9225-260949ce4fdf-c000.json.gz
Traceback (most recent call last):
  File "/home/adm1/cd2/script/get_weblogs_date_range.py", line 51, in <module>
    asyncio.run(main());
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/adm1/cd2/script/get_weblogs_date_range.py", line 42, in main
    async with session.stream_resource(resource) as stream:
AttributeError: __aenter__
Does anyone have any idea about it? Sorry, I am a complete beginner with Python. Many thanks!
@canvastech --
This isn't really how CD2 is meant to be used; you're meant to fetch an initial snapshot of the table once, and after that fetch incremental updates on a regular basis to maintain a local replica of the original table.
If you're trying to get log events for a particular day, you should fetch everything locally first and then query your local table. The since/until parameters are not meant to be used as a way to filter the source data -- they're only meant to properly sequence the stream of updates that you apply to your local replica table. The DAP CLI will take care of all of this for you, including keeping track of the correct value to use for the "since" parameter. I know it seems like a lot of data to fetch, but this is the only way to do it correctly.
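In code, that workflow looks roughly like this with the same Python client used above (just a sketch, not tested; I'm assuming SnapshotQuery and the at/until timestamps on the results behave as described in the client README):
==========================================================================================
import asyncio
import os

from dap.api import DAPClient
from dap.dap_types import Credentials, Format, IncrementalQuery, SnapshotQuery

credentials = Credentials.create(
    client_id=os.environ["DAP_CLIENT_ID"],
    client_secret=os.environ["DAP_CLIENT_SECRET"],
)


async def maintain_replica():
    async with DAPClient(os.environ["DAP_API_URL"], credentials) as session:
        # 1. One-off: fetch a full snapshot of the table.
        snapshot = await session.get_table_data(
            "canvas_logs", "web_logs", SnapshotQuery(format=Format.JSONL, mode=None)
        )
        # ... download snapshot.objects and load them into your local store ...

        # 2. The snapshot result carries a timestamp (per the README); this becomes
        #    the "since" of the first incremental query.
        last_seen = snapshot.at

        # 3. On every later run, fetch only what changed since the last run.
        incremental = await session.get_table_data(
            "canvas_logs",
            "web_logs",
            IncrementalQuery(
                format=Format.JSONL, mode=None, filter=None, since=last_seen, until=None
            ),
        )
        # ... apply incremental.objects to the local store, then persist
        #     incremental.until as the "since" of the next run ...


asyncio.run(maintain_replica())
==========================================================================================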
--Colin
Whilst true in general, the web logs are a bit of a special case. Ignoring the quirks of the DAP client, if you are actually calling the API endpoints manually, web_logs can use the since and until parameters to do what is suggested here: getting logs for a single day (see the sketch below).
In general, you should always fetch a full snapshot and then fetch incremental updates, because the data store only contains the most recent version of the element for any given key (the one with the latest update timestamp). Web logs are never updated; they are just individual items that are logged, so in this particular case it would be quite acceptable to filter on the dates. You can't do that for any table that permits updates to records.
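For completeness, calling the API directly for a one-day window would look something like this (a rough sketch; the endpoint paths and job status names are my reading of the DAP API reference, so double-check them before relying on this):
==========================================================================================
import os
import time

import requests

API = "https://api-gateway.instructure.com"
client_id = os.environ["DAP_CLIENT_ID"]
client_secret = os.environ["DAP_CLIENT_SECRET"]

# Exchange the client credentials for a short-lived access token.
token = requests.post(
    f"{API}/ids/auth/login",
    auth=(client_id, client_secret),
    data={"grant_type": "client_credentials"},
).json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Submit an incremental query for a single day of web_logs.
job = requests.post(
    f"{API}/dap/query/canvas_logs/table/web_logs/data",
    headers=headers,
    json={
        "format": "jsonl",
        "since": "2024-01-01T00:00:00Z",
        "until": "2024-01-02T00:00:00Z",
    },
).json()

# Poll the job until it finishes (bounded, so an unexpected status cannot loop forever).
for _ in range(120):
    if job.get("status") in ("complete", "failed"):
        break
    time.sleep(5)
    job = requests.get(f"{API}/dap/job/{job['id']}", headers=headers).json()

# The completed job lists the generated objects; their presigned download URLs are
# then requested from the object-URL endpoint as described in the API reference.
print(job)
==========================================================================================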
While web_logs looks like an immutable table, it does (infrequently) receive updates to earlier records (data patches), and there are examples of this happening in the past. If you try using since and until to window on the data, you might miss updated records. For example, suppose you have a web log record with the timestamp 2024-02-10, and that record is updated on 2024-02-28. In this particular case, a since of 2024-02-01 and an until of 2024-02-15 would completely miss this record.
I agree that usually you can window on the data, because the event time of web_logs is typically closely related to the commit time; in other words, records are typically ingested into DAP soon after they are generated. However, this is not always the case, and when it is not, you might lose data.
I would recommend always chaining incremental queries, populating the since timestamp of the next incremental query request based on the until timestamp returned in the previous incremental query response, as described in the documentation.
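In code, the chaining is as simple as feeding the returned timestamp back in (a sketch; I'm assuming the incremental result exposes an until attribute, as the README's examples suggest):
==========================================================================================
from dap.dap_types import Format, IncrementalQuery


async def fetch_next_increment(session, previous_until):
    """Run one incremental query whose 'since' is the previous query's 'until'."""
    query = IncrementalQuery(
        format=Format.JSONL,
        mode=None,
        filter=None,
        since=previous_until,  # exactly the value returned by the previous response
        until=None,            # let the server choose the upper bound
    )
    result = await session.get_table_data("canvas_logs", "web_logs", query)
    # ... download and apply result.objects to the local copy ...
    return result.until  # persist this; it is the 'since' of the next request
==========================================================================================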
Whilst I understand the concepts of since (and until, which is not really practical in most cases), it concerns me that updates are being made to web_logs, which people treat as point-in-time logging of what actually happened.
Even if data fixes and patches are applied (say, for the sake of argument, there was a corruption and users had the wrong UUID for a period of time), patching and updating the actual user data, enrolments, submissions, etc. to reflect the correct values would be expected. However, I am sure most people would not expect a log of the state at the time of the HTTP call to be changed after the fact. That sort of activity really invalidates any use of those logs as an audit trail.
Completely agree with this. I have always understood the requests and web_logs tables to contain processed web server access logs. We plan to ingest this data on a daily basis using the since and until values as outlined in the documentation. It's problematic to find out now that data patches can be applied to update earlier records. @LeventeHunyadi what is the recommended approach to deal with this if performing daily incremental updates? Thanks
If you follow the practice of chaining incremental queries with one another, populating the since timestamp of the next incremental query request based on the until timestamp returned in the previous incremental query response (that you executed the day before), you are good. This will fetch all daily record updates, and any retroactive data patches that might have been applied. (Just to emphasize, data patches are rare but do occasionally occur.)
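For a daily job, that means persisting the returned until between runs; a minimal file-based sketch (the file name and layout are just illustrative):
==========================================================================================
import json
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("web_logs_cursor.json")  # hypothetical location for the cursor


def load_since() -> datetime:
    # The 'until' saved by yesterday's run becomes today's 'since'.
    return datetime.fromisoformat(json.loads(STATE_FILE.read_text())["until"])


def save_until(until: datetime) -> None:
    # Store the 'until' returned by today's incremental query for tomorrow's run.
    STATE_FILE.write_text(json.dumps({"until": until.isoformat()}))
==========================================================================================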
In general, I would caution against use of web_logs as an audit trail. Canvas Data, the predecessor to DAP/CD 2 had a long disclaimer about this:
Disclaimer: The data in the requests table [equivalent of web_logs in CD 1] is a 'best effort' attempt, and is not guaranteed to be complete or wholly accurate. This data is meant to be used for rollups and analysis in the aggregate, _not_ in isolation for auditing or other high-stakes analysis involving examining single users or small samples. As this data is generated from the Canvas request log files, not a transactional database, there are many places along the way data can be lost and/or duplicated (though uncommon).
Since web_logs data for CD 1 and CD 2 comes from the same source, the same limitations would apply.
web_logs is a derived table, with data combined from multiple sources. Because commit timestamps are row-level, the commit timestamp may change even if only a single value is updated. Some columns are truly immutable (e.g. event time) but others may get updates, even if infrequently. This is why it's safest to rely on chaining since/until as described in the documentation.
DAP, in general, does not support point-in-time queries (i.e. fetch state of DAP as seen at some earlier time). DAP API only lets you update your current local state to the latest state in DAP.