Since there are so many of us who use the APIs with a variety of languages and/or libraries, I'm curious how everyone handles pagination with large GET requests. It seems that this has been a hurdle for many of us to conquer before we can fully utilize the Canvas API. This could be a very interesting discussion, as we all handle the same problem in very different ways.
So I'll start...
I am partial to Python, so I use the Requests library (http://docs.python-requests.org/en/latest/), which makes working with the APIs extremely simple. Requests is a big part of what makes it easy to handle the pagination.
I start off by declaring a data_set list where each of the JSON objects will reside.
data_set = []
I then perform my initial API request and set the per_page limit to the max of 50.
I then save the response body to a variable called raw and use the built-in json() function to make it easier to work with.
uri = 'https://abc.instructure.com/api/v1/courses/12345/quizzes/67890/questions?per_page=50'
r = requests.get(uri, headers=headers)
raw = r.json()
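(The headers variable isn't defined in this snippet; as in the later examples in this thread, it would typically carry the access token. A minimal sketch, assuming a hypothetical access_token variable holding your Canvas token:)
# Not shown in the original post: headers carrying a Canvas access token
headers = {'Authorization': 'Bearer ' + access_token}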
I then loop through the responses and pull out each individual response and append them to the data_set list.
for question in raw:
    data_set.append(question)
For the next pages I use a while loop to repeat the above process using the links provided in the Link headers of the response. As long as the current URL does not equal the last URL, it performs another request, using the next URL as the URI to be sent.
while r.links['current']['url'] != r.links['last']['url']:
    r = requests.get(r.links['next']['url'], headers=headers)
    raw = r.json()
    for question in raw:
        data_set.append(question)
The loop stops when the two URLs are equal, which denotes that all requests have been completed and there are none left, meaning you have the entire data set.
You can then work with the data_set list to pull out the needed information. With some APIs this method may have to be modified slightly to accommodate how response data is returned. This also may not be the best method, as it stores the data in memory and the system could run out of memory or perform slowly, but I have not run into a memory limit.
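If memory ever did become a concern, one possible variant (just a sketch, reusing the same uri and headers from above, not part of the original script) would be a generator that yields items page by page instead of building one big list:
import requests

def paginated_get(uri, headers):
    # Yield each item from a paginated Canvas endpoint, following the 'next' links
    r = requests.get(uri, headers=headers)
    while True:
        for item in r.json():
            yield item
        if 'next' not in r.links:
            break
        r = requests.get(r.links['next']['url'], headers=headers)

# Process items as they arrive rather than storing the whole data set
for question in paginated_get(uri, headers):
    print(question['id'])  # e.g. print each question's id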
Has anyone else using this while loop condition in Python recently found it throwing new errors?
while r.links['current']['url'] != r.links['last']['url']:
.........
It looks like this is because r.links['last']['url'] is not always included in our header links? However, our scripts using this condition have been running well over a year without error, so I'm curious to know what has changed (our environment hasn't, as far as I know).
Hi Jago,
Our scripts use the same while loop condition in Python when handling pagination and we haven't noticed any new errors.
That being said, the Link header may exclude the "last" url if the total count is too expensive to compute on each request, as noted here: Pagination - Canvas LMS REST API Documentation. That may be what is causing the error in your case.
@James mentioned a possible solution for cases where the Link header does not include a "last" url could be to "cycle through using the next link if there is no last provided".
I hope this helps!
Levi
I'm not getting errors testing for a different condition in my Python loop. Perhaps you could try that and see what happens?
paginated = r.json()
while 'next' in r.links:
    r = requests.get(r.links['next']['url'], headers=headers)
    paginated.extend(r.json())
return paginated
Yes, in our Canvas client we also check for the existence of the 'next' link. That seems to be more stable than checking whether the 'current' link equals the 'last' link.
Thanks dgrobani, @levi_magnus, @ag3811 and @jperkins,
I sampled the link headers of a number of endpoints in addition to submissions before I posted, and couldn't work out why it's worked without error for so long, as I tried but failed to see r.links['last'] in any of these headers despite altering the number of objects in the request and the pagination value.
Anyway, the difference from Daniel's script above is that we test for the existence of the actual URL:
while r.links['next']['url']:
....
And I've wrapped the while in a Try/Except block so errors can be appended to a log file...
except Exception as obj:
    print(obj, file=sys.stderr)
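Pieced together, a minimal sketch of that pattern (my reading of the fragments above, assuming requests, uri, headers and a data_set list as in the earlier posts) might look like:
import sys
import requests

data_set = []
try:
    r = requests.get(uri, headers=headers)
    data_set.extend(r.json())
    while r.links['next']['url']:
        r = requests.get(r.links['next']['url'], headers=headers)
        data_set.extend(r.json())
except KeyError:
    # The last page has no 'next' link, so r.links['next'] raises KeyError,
    # which is what ends this particular loop
    pass
except Exception as obj:
    # Anything else gets written to stderr (or appended to a log file)
    print(obj, file=sys.stderr)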
Thanks for sharing. Here's what I ended up with when trying to get a list of all blueprints in a sub-account.
import requests
blueprints = []
num_blueprints = 0
headers = {'Authorization': 'Bearer ' + CANVAS_TOKEN}
url = BASE_URL + 'accounts/sis_account_id:SIS_ID_HERE/courses'
# Per page max value appears to be 100
params = {
    'blueprint': True,
    'per_page': 100
}
response = requests.get(url=url, headers=headers, params=params)
courses = response.json()
num_blueprints = num_blueprints + len(courses)
blueprints = blueprints + courses
while 'next' in response.links:
    # We don't need to include params anymore - the given 'next' URL includes previously included params
    response = requests.get(url=response.links['next']['url'], headers=headers)
    courses = response.json()
    num_blueprints = num_blueprints + len(courses)
    blueprints = blueprints + courses
print('Total blueprints:', num_blueprints)
print('Blueprints:', blueprints)
This is how I handle pagination in my C# project:
I make use of the following LinkHeader class in my code: C# Parse Link Header · GitHub
and process paginated queries like this:
private static readonly HttpClientHandler HttpClientHandler = new HttpClientHandler { AllowAutoRedirect = false };
private static readonly HttpClient HttpClient = new HttpClient(HttpClientHandler, true);
public async Task<string> Get(string queryString)
{
    using (var response = await HttpClient.GetAsync(queryString))
    {
        string content = await response.Content.ReadAsStringAsync();
        HttpHeaders headers = response.Headers;
        if (headers.TryGetValues("Link", out IEnumerable<string> values))
        {
            LinkHeader linkHeader = new LinkHeader(values.First());
            if (linkHeader.NextLink != null)
            {
                JArray objects = JArray.Parse(content);
                string nextContent = Get(linkHeader.NextLink).Result;
                objects.Merge(JArray.Parse(nextContent));
                content = objects.ToString();
            }
        }
        return content;
    }
}
I've broken down all API calls into generic Get/Post/Put/Delete calls in my Canvas API class. The Get method checks for the existence of the Link header and whether it contains a "next" link; if so, it does a recursive Get call for that next URL and concatenates the results with those of the previous GET call.
How do you build the query string that gets passed to the above function? I am using RestSharp to do most of the API calls, but I hit a roadblock when it comes to pagination. I want to use your method, but I do not know what the full query string looks like. Thanks for any help.
You can view our whole C# application, including pagination, if you like: GitHub - uvadev/UVACanvasAccess: A .NET wrapper for the Canvas LMS API.
I'm looking to use Pagination on my Canvas database in HTML and PHP, but I'm not sure where to start.
I have read the core documentation: Pagination - Canvas LMS REST API Documentation
I know I can limit the results per page with the "?per_page=n" argument. But what I want to do next is create page buttons that link to the next set of results, and that I am unsure how to do. Yes, the documentation shows the Link header options, but I don't know how to create buttons from those.
If I'm working with data from $data, how do I make pagination pages for it with PHP and HTML?
Regarding PHP, I'm not using Curl but using RollingCurlX because many of my requests are too long (list all pdf files in a sub-account or list all students in a sub-account).
Canvas will not output more than 100 results per page (even though the API fails to mention that), so a list with 10K items, which would be common for a university with 10,000 students, would take 100 individual cURL requests.
With RollingCurlX I can request the first page, check in the header how many pages there are to request, build a list of requests (one for each page) and let RollingCurlX go through them in parallel instead of sequentially.
Here's the link to RollingCurlX: GitHub - marcushat/RollingCurlX: Easy to use cURL Multi wrapper for PHP
Here's the code:
//Match the header response using RegEx.
preg_match('/(page=)(\d*)(&per_page=\d*>; rel="last")/', $page_one_header['response_header'], $last_page);
//If the last page is 1, that means there are no more pages.
//If last page is 2 or more, then the loop runs.
if ($last_page[2] > 1) {
    $current_page = 2; //Start building a request list from the second page as the first was already requested
    $request_list = array();
    while ($current_page <= $last_page[2]) {
        $request_list[] = $url.$options_build.'&page='.$current_page; //Adds a new URL to the list
        $current_page++; //Increase the page number
    }
    rollingcurl_request($token, $request_list); //Then call a function which will run through all the requests in parallel.
}
I'm still working on making this even faster. At the moment I request one list which means requesting one single page (the first) and then requesting all others in parallel.
At the moment I'm rewriting the functions so I can request all the first pages at once (for example: list all files in all courses in a sub-account) in parallel, which will then build a list with all the following pages for all the courses at once.
If you're using PHP I strongly advise you to try RollingCurlX or some other way of making parallel requests.
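For anyone not on PHP, here is a rough sketch of the same first-page-then-fan-out idea in Python with concurrent.futures (a sketch only, not the RollingCurlX code above, assuming requests plus url, headers and params as in the earlier examples, and that the endpoint supports numeric page numbers):
import requests
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import parse_qs, urlparse

def fetch_all_pages(url, headers, params, max_workers=10):
    # Fetch page 1, read the page number from the rel="last" link, then fetch pages 2..last in parallel
    first = requests.get(url, headers=headers, params=params)
    results = first.json()
    last_link = first.links.get('last', {}).get('url')
    if not last_link:
        return results  # no 'last' link: fall back to following 'next' links sequentially instead
    page_values = parse_qs(urlparse(last_link).query).get('page', [])
    if not page_values or not page_values[0].isdigit():
        return results  # bookmark-style pagination: numeric fan-out won't work here
    last_page = int(page_values[0])

    def fetch(page):
        return requests.get(url, headers=headers, params={**params, 'page': page}).json()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for page_items in pool.map(fetch, range(2, last_page + 1)):
            results.extend(page_items)
    return results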
This cannot be done with every request; only some support the numeric page IDs. In particular, fetching page views, submissions for multiple assignments/students, and soon the enrollments do not support them. There are others; this is just off the top of my head a minute before class starts.
You may want to check to make sure it matches before you rely on $last_page[2].
There are other places where the headers get duplicated (or at least they used to). Again, more error checking rather than blanket reliance on having a set pattern is needed.
Thanks for the reply James
I do have other functions to request pages without pagination. I'll choose one or the other depending on the task at hand.
I've never seen a duplicated header yet, hopefully they fixed that...
Sorry I was rushed and had to keep my previous message short. RollingCurlX sounds like it has a lot of potential and I have taken advantage of the multiple requests in other languages besides PHP -- multiCURL was always such a pain that I left my PHP code as single threaded and it is slower than it could be. Canvas doesn't appreciate people tweaking the system, but since they usually don't allow you to filter the data you want, you have to download it all to get the parts you want and people always want it to go faster. GraphQL may allow us to fetch a subset of the data faster, but it doesn't allow filtering yet either.
Along the way, I realized that you should not assume that Canvas is going to behave the same way in all cases. What I found is that it works until it doesn't work and most of the time it didn't work was when I tried to take shortcuts. Here are some of the things to be aware of for when it doesn't work.
As for pagination, you should always assume that there is going to be pagination.
Pagination won't always look like page=3. For example, I just got page=bookmark:WzMyNDY4Nzc5MF0 in a request I just made.
By header being duplicated, I meant that if you made a request with per_page=100&page=2 with stuff after it then the next link from the link headers would come back with the per_page and page portions in there twice -- once where it was and again at the end. It appears that there is no spec on what to do with duplicated query parameters. I think the page number was the same in both places, but I remember writing code to clean up the page query parameter because I didn't like the duplication. I just don't remember where I wrote it to check. I did not mean that the next link was duplicated, but the parameters within the links.
Other times it may look like page 3 without having a last link. According to the documentation on pagination, this comment is made about the type of links: 'rel="last" may also be excluded if the total count is too expensive to compute on each request.' Your code was checking $last_page[2] without making sure that it existed.
Be sure you don't set the number of concurrent requests too high. There is a limiting factor of 700. New requests have a penalty of 50, which is quickly given back, but it keeps you from asking for too much at one time. The library I use for Node JS staggers the requests so I don't hit it with all of them at the exact same time; otherwise you would be limited to about 14 concurrent requests.
Sorry I was rushed and had to keep my previous message short.
Short good
Thanks for the insights into pagination. I'll make some tests to see if I can replicate the errors and will update my code so I don't run into any problems.
Be sure you don't set the number of concurrent requests too high. There is a limiting factor of 700.
I haven't implemented anything to deal with this yet, but even going through thousands of URLs using 30 concurrent requests I haven't gotten X-Rate-Limit-Remaining below 400.
Using RollingCurlX it would be easy to implement a delay to allow these points to refill.
The way RollingCurlX works is you provide all the data for the requests (with multiple URLs), a user data variable and a callback function. This callback function gets called after each request, so for instance you could have it check X-Rate-Limit-Remaining and, once it's below a certain threshold, call wait(2) so it would block the following requests for 2 seconds.
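In Python, a sketch of the same idea (not part of the scripts above) could check the X-Rate-Limit-Remaining header after each response and pause when it dips below a chosen threshold:
import time
import requests

def throttled_get(url, headers, threshold=300, pause=2):
    # GET the URL, then sleep briefly if Canvas reports the rate-limit bucket running low
    r = requests.get(url, headers=headers)
    remaining = float(r.headers.get('X-Rate-Limit-Remaining', 700))
    if remaining < threshold:
        time.sleep(pause)  # give the bucket a couple of seconds to refill
    return r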
I'm glad you haven't hit the limiting factor yet.
There were several things I tried to get it to work relatively smoothly. This is what I eventually settled on:
My run last night at 23:00 localtime (Saturday night so low usage) took 10.6 minutes to make 3036 requests. That's about 209 ms per call or 4.8 calls per second. The lowest my x-rate-limit-remaining got was 533.928. 589 (19.4%) of the requests came back with x-rate-limit-remaining = 700.
I have found that if I try to push it too much, then I run into trouble. It may work one night, but have heavy congestion another and then it doesn't work. I'm not in that big of a hurry, so I leave it with what works for me instead of pushing it too hard.
I use plain old JS, and I have a modified version of the pagination approach @James mentioned earlier in this thread:
const nextURL = linkTxt => {
  if (linkTxt) {
    let links = linkTxt.split(",");
    let nextRegEx = new RegExp('^<(.*)>; rel="next"$');
    for (let i = 0; i < links.length; i++) {
      let matches = nextRegEx.exec(links[i]);
      if (matches && matches[1]) {
        // Return the match right away
        return matches[1];
      }
    }
  } else {
    // (condition === false) because if (!condition) actually tests true for false as well as undefined and null.
    return false;
  }
};
I'm using PowerShell, and this is what I use to get the Canvas users list in paginated form:
$Token = '<YOUR-TOKEN>'
$headers = @{"Authorization"="Bearer "+$token}
$outputFile = '.\api_Users.csv'
$allCanvasUsers = New-Object System.Collections.ArrayList @()
$pageNumber = 1
do {
    # Get a single page of data from the Canvas API.
    $resultsPage = Invoke-WebRequest `
        -Method GET `
        -Headers $headers `
        -Uri "https://<SCHOOL-URL>:443/api/v1/accounts/self/users?per_page=100&page=$pageNumber"
    $usersPage = ConvertFrom-Json $resultsPage.Content
    # Append each user object in the results page to an ArrayList.
    foreach ($user in $usersPage) {
        $allCanvasUsers.Add($user)
    }
    $pageNumber++
# Do until the RelationLink list does not contain an element with KEY='Next'.
} while ($resultsPage.RelationLink.Next -NE $NULL)
# Write the complete ArrayList to a CSV.
$allCanvasUsers | Export-CSV -UseQuotes Never -Path "$outputFile" -Delimiter "`t"
This is the approach I take, just in PHP. I know there is a more elegant way, working with the headers, but I will add that I have never had any issues assuming that the page numbers will increment like this.
Dear Friends,
<?php
$curl = curl_init();
$token = "<YOUR TOKEN>";
$pageNo = 0;

function curlCall($curl, $endpoint, $token, $pageNo) {
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_URL => 'https://<YOUR INSTANCE>.instructure.com/api/v1' . $endpoint . '?access_token=' . $token . '&per_page=100&page=' . $pageNo
    ));
}

$endpoint = "/accounts/1/users";

do {
    $pageNo++;
    curlCall($curl, $endpoint, $token, $pageNo);
    $resp = curl_exec($curl);
    $json = json_decode($resp);
    //Process $json here
} while (sizeof($json) > 0);

$pageNo = 0;
?>
Regarding the "//Process $json here" part: if I want to show the results in a table such as:
<table>
  <tr>
    <td>id</td>
    <td>name</td>
  </tr>
  <tr>
    <td>value of id from json</td>
    <td>value of name from json</td>
  </tr>
</table>
how do I show the $json elements in this table?