I have the following code to retrieve page views of the user with id=123456. The very same code works when I retrieve a user's entries for a discussion forum. However, when I change the URI to https://learn.canvas.net/api/v1/users/123456/page_views to retrieve the page views, it returns the same records over and over again, no matter what the page number is. Perhaps there is no further data, yet it keeps retrieving records, which creates an endless while loop. Do you see any problems with the code?
npagina=1
control=0
while control==0:
    uri = 'https://learn.canvas.net/api/v1/users/123456/page_views?per_page=100&page=' + str(npagina)
    r = requests.get(uri, headers=headers)
    raw = r.json() #if no new data, then continues to retrieve duplicate data
    if raw != "":
        views = pd.DataFrame(raw)
        if 'id' in views.columns:
            npagina=npagina+1
            page_views = page_views.append(views)
        else:
            control = 1
I found the Pagination section of the API documentation very helpful. Here's how I retrieve paginated data in Python:
r = get('{}?per_page=100'.format(url))
paginated = r.json()
while 'next' in r.links:
    r = get(r.links['next']['url'])
    paginated.extend(r.json())
I hope this helps.
EDITS: fixed typo; refactored
Page views are treated differently than some of the other requests as they can change pretty quickly and you can get new page views added before you make the request to get the next page. That means that information that was in the first page of results might get shifted down by incoming requests and reappear in the second page of results as well.
To compensate, Canvas adds a bookmark: value to the page= parameter, rather than a specific number. Here are the results of the Link response header (I've reformatted and removed portions so more is visible).
CURRENT:
page=first&per_page=10
NEXT:
page=bookmark:WyIyMDE3LTA5LTE0VDA4OjU3OjQ5LjM2MC0wNTowMCIsIjNhOTVmMzIxLTc5Y2UtNDcyOC04N2U0LTczNDYxN2Y0ZWJhNiJd&per_page=10
FIRST:
page=first&per_page=10
Also notice that there is no rel="last" supplied here.
It is important that the second fetch contain that page=bookmark:token or it's considered a different request.
The way you're building the URLs doesn't include it, but following the next link, as dgrobani recommended, will. This matters all the more because the bookmark changes every time you fetch another page, so there is no way to predict what it will be next time. There is no notion of a page number for page_views; everything is based on that bookmark.
CURRENT:
page=bookmark:WyIyMDE3LTA5LTE0VDA4OjU3OjQ5LjM2MC0wNTowMCIsIjNhOTVmMzIxLTc5Y2UtNDcyOC04N2U0LTczNDYxN2Y0ZWJhNiJd&per_page=10
NEXT:
page=bookmark:WyIyMDE3LTA5LTEzVDE4OjQ0OjQ3Ljk3MC0wNTowMCIsIjU4OWQ4YTAyLTk3OGEtNGE5ZS1hM2EwLTVkYTZhY2ZjOTQyMyJd&per_page=10
FIRST:
page=first&per_page=10
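For anyone who wants to see this from Python, here is a minimal sketch that splits a Link header into its rel entries and pulls out the page= value. The header string is a hypothetical reconstruction modelled on the first listing above (the token EXAMPLETOKEN is a placeholder, not a real bookmark), and parse_link_header is just an illustrative helper name.

```python
import re
from urllib.parse import urlparse, parse_qs


def parse_link_header(value):
    """Return {rel: url} for each <url>; rel="name" entry in a Link header."""
    pairs = re.findall(r'<([^>]*)>\s*;\s*rel="([^"]*)"', value)
    return {rel: url for url, rel in pairs}


# Hypothetical header modelled on the listing above; the token is a placeholder.
BASE = "https://learn.canvas.net/api/v1/users/123456/page_views"
SAMPLE_LINK = (
    f'<{BASE}?page=first&per_page=10>; rel="current",'
    f'<{BASE}?page=bookmark:EXAMPLETOKEN&per_page=10>; rel="next",'
    f'<{BASE}?page=first&per_page=10>; rel="first"'
)

links = parse_link_header(SAMPLE_LINK)
# The "next" link carries a bookmark instead of a page number.
next_page = parse_qs(urlparse(links["next"]).query)["page"][0]
print(sorted(links))  # the rel names that were supplied
print(next_page)
```

If you use the Requests library, r.links already gives you the same mapping, so a parser like this is only needed when you are working with the raw header.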
That reminds me that I need to add this to my list of ways that pagination is handled in Canvas. I'm working on revising how I fetch data: I grab the rel="last" link and then iterate over the pages between 2 and "last". That works in some cases, but it won't work here, where the pages can only be fetched in series rather than in parallel.
Also, I'd watch out for the page_views API, as the results can go on for a really long time. Depending on what you're looking for, you might want to specify the dates in the original query or break out of your fetching once you reach the desired point.
Another possibility is to use Canvas Data and the requests table for most of the information and then fetch the current information that hasn't made it into Canvas Data yet from the API.
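As a rough illustration of fetching in series and breaking out once you reach the desired point, here is a sketch in Python. It makes several assumptions that are not confirmed in this thread: that each page view has a created_at timestamp in ISO 8601 format, that results arrive newest first, and that you supply your own fetch_page function (a hypothetical name) returning the items plus the next URL. The fake pages at the bottom exist only so the sketch runs without a network.

```python
from datetime import datetime, timezone


def iter_page_views(first_url, fetch_page, stop_before=None):
    """Yield page views one at a time, following "next" links in series.

    fetch_page(url) must return (items, next_url_or_None).
    stop_before is an optional timezone-aware datetime; iteration ends at the
    first item older than it (assumes newest-first ordering).
    """
    url = first_url
    while url:
        items, url = fetch_page(url)
        for item in items:
            # Assumed field name and format: "created_at", ISO 8601.
            created = datetime.fromisoformat(item["created_at"].replace("Z", "+00:00"))
            if stop_before is not None and created < stop_before:
                return
            yield item


# With Requests, fetch_page might look like this (headers is your own auth dict):
# def fetch_page(url):
#     r = requests.get(url, headers=headers)
#     return r.json(), r.links.get("next", {}).get("url")

# Fake data standing in for three pages of results.
FAKE_PAGES = {
    "page1": ([{"id": "a", "created_at": "2017-09-14T08:57:49Z"}], "page2"),
    "page2": ([{"id": "b", "created_at": "2017-09-13T18:44:47Z"},
               {"id": "c", "created_at": "2017-09-01T00:00:00Z"}], "page3"),
    "page3": ([{"id": "d", "created_at": "2017-08-01T00:00:00Z"}], None),
}

cutoff = datetime(2017, 9, 10, tzinfo=timezone.utc)
recent = [v["id"] for v in iter_page_views("page1", FAKE_PAGES.get, stop_before=cutoff)]
everything = [v["id"] for v in iter_page_views("page1", FAKE_PAGES.get)]
print(recent, everything)
```

Because the generator only requests the next page when the caller asks for more items, stopping early also avoids the remaining requests.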
Very clean and concise. Can you please go through your code?
Thank you!
I've revised my code to be slightly more compact. I think you'll understand it more easily if you start by reading both the Pagination section of the Canvas API documentation and the Link Headers section of the Requests library documentation. They're each fairly short.
My code makes an initial API call for 100 items [line 1] and stores the returned JSON in a list called "paginated" [line 2]. If the API has more than 100 items to return, the link header of the response will contain a "next" element that specifies the URL to call for the next page of results. We check for that element [line 3] and if it's there, we make a new request to the URL specified in the "next" element [line 4]. We add the JSON returned in the response to the "paginated" list [line 5]. When the API has returned the last page of the results, the link header won't contain a "next" element, and we're done.
Thank you very much for the explanation!
I just noticed that Daniel's response disappeared between the time I started writing and the time I finished writing my response. I hope he brings it back. I just explained the why since he had already explained the how.
Thank you for your detailed answer! I wonder if your examples can be adapted to Python. Is your code meant to be run in JavaScript?
I am guessing that what Daniel provided is how to run it in Python. He's the expert there, I don't know Python so I'll take his word for it. I see his code is back now (yay!) so you should refer to it.
My code isn't really code, definitely not JavaScript. It's just a portion of the response link header that's returned. So it won't run anywhere. It was just intended to explain why you needed to use the next link and your code wouldn't work -- even if it works in other places.
Thanks for that explanation, @James . I didn't know about page views bookmarks--very interesting. And very cool that you don't have to do anything different because the "next" URL handles everything behind the scenes.
The bookmark thing forces the requests to be in series. With some of my earlier code that wouldn't matter because that's how I handled it -- waiting for one request to finish before fetching the next one based off the response header link.next url. That's the way it's recommended to handle things and if you're doing it that way, like you are, then you don't need to change anything.
I don't know if Python allows multiple simultaneous requests. PHP has a library called Guzzle that handles it. The user scripts I've been writing use JavaScript within the browser, which normally allows 5 or 6 parallel requests, so doing it sequentially slows things down. That may be fine for a back-end process, but people using a web browser want speed and don't like to wait. I have been working on something to make my scripts take advantage of that ability within the browser to speed up the process, but now I've got to make sure I handle cases like page views, where the requests can't be made in parallel.
Requests, the go-to Python HTTP library, supports concurrent requests, but I haven't had a need for that yet, as all my Python code so far is batch processes. Have you encountered any throttling or other issues making multiple requests to the Canvas API?
Someone from Canvas addressed this, maybe in a late-Thursday presentation at InstructureCon. They said as long as you're hitting it sequentially, you'll never hit the threshold, but if you have a lot of processes running making multiple requests you might.
I've run three parallel PHP processes, each making sequential requests, and didn't hit it.
I did finally do some testing, fetching a list of courses in a browser as fast as it would handle, with what I believe were 6 concurrent requests. I tracked the X-Rate-* response headers: the value starts at 700 and I never got lower than 631. The threshold doesn't kick in until you reach 0. The details are in this post, which is part of a larger discussion in https://community.canvaslms.com/groups/canvas-developers/blog/2017/06/09/full-course-listing-with-so...
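If you want to watch the same number from Python, a small helper along these lines might do. This is only a sketch: the exact header name X-Rate-Limit-Remaining is my assumption (the post above only says X-Rate-*), the floor of 100 is an arbitrary illustrative value, and both function names are hypothetical.

```python
def rate_limit_remaining(headers):
    """Return the remaining quota as a float, or None if the header is absent.

    The header name is an assumption; check what your instance actually sends.
    """
    value = headers.get("X-Rate-Limit-Remaining")
    return None if value is None else float(value)


def needs_pause(headers, floor=100.0):
    """True when the remaining quota has dropped below an arbitrary floor."""
    remaining = rate_limit_remaining(headers)
    return remaining is not None and remaining < floor


# Plain dicts stand in for response headers here.
print(rate_limit_remaining({"X-Rate-Limit-Remaining": "631.0"}))
print(needs_pause({"X-Rate-Limit-Remaining": "50"}))
```

With Requests you would pass r.headers, which looks headers up case-insensitively; a plain dict like the ones above does not.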
How would you know that you need to go to page 30? You would have no idea that it was page 30 until you got there since you have no idea how many page views there are. Once you are there and you know that it's page 30, you can save the bookmark from the current link header or even the URL of the page and get back there immediately, even when page 30 is no longer page 30 because more page requests have come in since then and it's now page 34. I do not know how long bookmarks remain valid, but I don't suspect that they're generated until you actually need it. In other words there is no bookmark for page 30 until you visit page 29 and it generates it for the link headers.
If you are needing page 30 in a list of page views, it's much more likely that you're after a set of page views from a specific time period. The start_time and end_time query parameters are already part of the API call. The way to get to a specific page is to limit the start and end times to a specific enough period that it takes you right to that spot.
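A minimal sketch of building such a time-limited request URL in Python follows. The start_time and end_time parameter names come from the API as described above; the page_views_url helper, the host, the user id, and the timestamps are illustrative values only.

```python
from urllib.parse import urlencode


def page_views_url(base_url, user_id, start_time, end_time, per_page=100):
    """Build the first page_views URL for a bounded time window."""
    query = urlencode({
        "start_time": start_time,
        "end_time": end_time,
        "per_page": per_page,
    })
    return f"{base_url}/api/v1/users/{user_id}/page_views?{query}"


# Illustrative values: one day of page views for the user in the question.
url = page_views_url(
    "https://learn.canvas.net", 123456,
    "2017-09-13T00:00:00Z", "2017-09-14T00:00:00Z",
)
print(url)
```

Only this first URL is built by hand. Any further pages should still come from the next link in the response, which carries the bookmark and the time window along with it.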
I don't know for sure if/when page views used to use the page parameter and then switched to bookmarks. However, looking at the source code commits over time, it looks like it probably was page-based at one time.
I have two guesses for when that changed.
Looking at most of the code, it appears that the change came on May 12, 2012. On April 17, 2012, code that provided the total count of the number of page views was removed. At that same time, a note was added to the pagination documentation about the last link header not being sent if there are no pages or if it would be expensive to calculate the number of pages. The page view table within the web UI was modified to dynamically load the page views rather than automatically loading all of them. However, the spec file still showed page=2 as a query parameter on an API call to load the page views, and that is still there today. Even though the page count was removed from the users controller at that time, the page parameter was still being sent until November 2012. Those changes removing the total page count were released on May 12, 2012.
Before that, commits made on October 15, 2011, and released on March 3, 2012, set the page views per_page to 50, but were still passing the page parameter to the pagination.
Later, commits made on September 20, 2012, and released on November 3, 2012, switched the page views to use the Backbone/Paginated Collection, but it said that there were no user-visible changes. That is not an API call, but it is when the page_views was taken out of the users controller. That was the code where the page parameter was still being passed, so this is when it stopped being used in the user page views within the web interface.
By the way, those changes to allow the start_time and end_time were committed on July 18, 2013, and released on August 24, 2013.
My second guess is based on a single comment in a commit made on September 24, 2012, and released on November 24, 2012. That's when the page views began being stored on a Cassandra cluster. That changed the way things are handled and the note was "note that the format of the pagination headers in the /api/v1/users/X/page_views endpoint has changed."
Whether in May 2012 or November 2012, it looks like the page parameter for page_views through the API hasn't been used in over 5 years.
The first time I used page views for anything useful (seeing whether my students were viewing pages) was in June 2015. At that time, I was blindly following the next link header as I was making serial calls. It wasn't until later that I discovered that I could hack the system and fetch multiple pages at the same time by using the page query parameter. At that point, I discovered it doesn't work with page views.
The documentation for API pagination doesn't define when page= versus bookmark: is used; it just treats the links as opaque.
To retrieve additional pages, the returned `Link` headers should be used. These links should be treated as opaque. They will be absolute urls that include all parameters necessary to retrieve the desired current, next, previous, first, or last page. The one exception is that if an access_token parameter is sent for authentication, it will not be included in the returned links, and must be re-appended.
In other words, Canvas didn't intend for us to modify those links. What we're trying to do is leverage knowledge about the way that Canvas works in some places as an undocumented, unsupported work-around.
While you can't hack the page= parameter with page views, you can limit the time period that it returns page views for. That's actually more powerful functionality than being able to jump to a specific page in a list of unknown length in the first place.
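On the one exception in the quoted documentation (re-appending access_token), here is a small Python sketch that adds the token back without otherwise touching the returned link, so the bookmark stays exactly as Canvas sent it. The with_access_token helper, the example URL, and the token value are all made up for illustration; if you authenticate with a header instead, as in the original question, none of this is needed.

```python
from urllib.parse import quote


def with_access_token(link_url, token):
    """Append access_token to a returned link, leaving the rest untouched."""
    separator = "&" if "?" in link_url else "?"
    return f"{link_url}{separator}access_token={quote(token, safe='')}"


# Illustrative link and token; neither is real.
next_link = "https://learn.canvas.net/api/v1/users/123456/page_views?page=bookmark:EXAMPLETOKEN&per_page=10"
print(with_access_token(next_link, "abc123"))
```

The URL is extended by plain string concatenation rather than being parsed and rebuilt, since re-encoding the query could change how the bookmark is written.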
Hi! We have two CanvasLIVE sessions coming up on Canvas Data you might want to RSVP to or follow for more information. If you have any questions about the events, feel free to post them in the comments section at the links below so the host has the opportunity to respond. Hope to see you there!
Oct. 5 9:30am MDT / 11:30am EDT
Learning Analytics and Canvas Data
Oct. 10 12pm MDT / 2pm EDT