Community Help

maguire · ‎12-18-2023

As a former professor, I was inspired by the following article and paper to look at the vocabulary used in some Canvas course rooms in terms of Common European Framework of Reference for Languages (CEFR) levels:

Study: Some university teachers are no better at English than upper-secondary students - Universitetsläraren (un...

Teachers’ receptive and productive vocabulary sizes in English-medium instruction (tandfonline.com)

The result is two programs: one extracts the vocabulary used in a course room, and the other prunes the list of "words" and adds information about CEFR levels. Note that they are primarily directed at courses in American English (with some words in Swedish; this support is very limited at present). The vocabulary is based on a third-year course in Internetwork, a fourth-year course in Voice over IP, a fourth-year course in research methodologies and scientific writing, a course in accelerated computing, and a course in data science. [The last two courses are based on material provided by Nvidia's Deep Learning Institute under a CC-BY license that has been converted into Canvas course rooms.] The courses have accompanying videos; these have been captioned, and a wikipage has been created for each PowerPoint slide, with the corresponding transcript from the video added to the wikipage. As a result, the input material includes a very large portion of the course content. It does not include content in files, quizzes, etc.
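To make the extraction step concrete, here is a minimal sketch of how unique words might be collected from wikipage bodies. It is not the code from the repository: the names `TextExtractor` and `unique_words` and the sample `pages` are hypothetical, and it assumes the page bodies have already been fetched as HTML strings.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def unique_words(html_bodies):
    """Return the set of distinct whitespace-delimited tokens in the pages."""
    words = set()
    for body in html_bodies:
        parser = TextExtractor()
        parser.feed(body)
        parser.close()
        text = " ".join(parser.chunks)
        for token in text.split():
            # Trim surrounding punctuation but keep the case, since
            # capitalization helps when spotting names and acronyms later.
            token = token.strip(".,;:!?()[]{}\"'")
            if token:
                words.add(token)
    return words


# Hypothetical page bodies standing in for real course wikipages.
pages = [
    "<p>The <strong>router</strong> forwards IP packets.</p>",
    "<p>Each router keeps a routing table.</p>",
]
print(sorted(unique_words(pages)))
```

Keeping the original case at this stage is a deliberate choice in the sketch: the summary further down reports acronyms and capitalized words separately, which would not be possible if everything were lowercased up front.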

Hopefully, with some automated feedback to the teachers, the accessibility of the course material can be increased. It might even be useful for students to know the distribution of CEFR levels for the vocabulary in a course they are considering taking.

The first program is compute_unique_words_for_pages_in_course.py; it takes a course_id as its only argument. The second program is prune_unique_words.py, which also takes a course_id as its only argument.

The code can be found at https://github.com/gqmaguirejr/Canvas-tools
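The repository holds the actual implementation. Purely as an illustration of the idea of staged pruning, the following sketch removes one word list at a time and reports a running count after each stage. The function name `prune`, the tiny word lists, and the sample words are all hypothetical; the real lists are far larger.

```python
def prune(words, stages):
    """Apply each (label, words_to_remove) stage in order, reporting counts."""
    remaining = set(words)
    print(f"{len(remaining)} unique words - initially")
    for label, to_remove in stages:
        removed = remaining & to_remove
        remaining -= removed
        print(f"{len(remaining)} words left, {len(removed)} {label} removed")
    return remaining


# Tiny hypothetical lists, for illustration only.
place_names = {"Stockholm", "Kista"}
company_and_product_names = {"Nvidia", "Cisco"}
top_100_English_words = {"the", "of", "and"}

words = {"Stockholm", "Nvidia", "the", "of", "jitter", "codec", "multicast"}
left = prune(words, [
    ("place names", place_names),
    ("company_and_product_names", company_and_product_names),
    ("top_100_English_words", top_100_English_words),
])
print(sorted(left))
```

Because each stage only sees what the earlier stages left behind, the order of the stages affects the per-stage counts, although not the final set of remaining words.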

In addition to outputting a number of files, the prune program gives a summary in the form shown below (for a course that has 187,072 words in it):

Loading some directories

2999 entries in American3000

2003 entries in American5000

7459 words in common_English_words

376 words in common Swedish_words

Pruning the input

10540 unique words - initially

10383 words left, 157 place names removed

10326 words left, 57 misc_words_to_ignore removed

10229 words left, 97 company_and_product_names removed

10208 words left, 21 abbreviations_ending_in_period removed

10206 words left, 2 common_programming_languages removed

10079 words left, 127 domainnames removed

9738 words left, 341 improbable words removed

1688 likely acronyms

7937 unique words after filtering acronyms and single letters

7936 unique words after filtering if there is a capitalized and lower case version of the word or title case turn to lower case

7838 words left, 98 top_100_English_words removed

7197 words left, 641 thousand_most_common_word_in_English removed

5744 words left, 1453 Oxford American 3000 words removed

5061 words left, 683 Oxford American 5000 words removed

2435 words left, 2626 common English words removed

2205 words left, 230 common_swedish_words removed

2220 words left, 15 words added after processing words that appear in title case

1565 starting with a capital letter (70.50%)

638 starting with a lower case letter (28.74%)

17 starting with other letter (0.77%)

Some statistics about the CEFR levels of the words as determined by the four main data sources

The totals are the total numbers of the input words in this source.

The percentage shown following the totals indicates what portion of the words from this source used in the course pages.

The American 3000 and 5000 sources have an explicit column of plurals; the rest are considered "singular".

The level xx indicates that the word does not have a known CEFR level.

American 3000: total: 2012 (67.09%), singular: 1554, plural: 460

singular: {'A1': 543, 'A2': 443, 'B1': 305, 'B2': 263}

plural: {'A1': 520, 'A2': 442, 'B1': 303, 'B2': 261}

American 5000: total: 600 (29.96%), singular: 482, plural: 119

singular: {'B2': 188, 'C1': 294}

plural: {'B2': 188, 'C1': 294}

common English words: total: 2784 (37.32%)

{'A1': 163, 'A2': 166, 'B1': 295, 'B2': 311, 'B2x': 1, 'C1': 167, 'C2': 98, 'xx': 1582}

common Swedish words: total: common_swedish_words_count=188 (50.00%)
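The per-source statistics above can be read as a tally of CEFR levels over the words of a source that occur in the course, plus the share of the source that was used. For example, the 67.09% reported for American 3000 is consistent with 2012 of its 2999 entries. A minimal sketch of such a tally follows; the function `cefr_summary`, the sample words, and their levels are hypothetical placeholders and are not taken from any real word list.

```python
from collections import Counter


def cefr_summary(course_words, source_levels):
    """Tally CEFR levels for the source's words that occur in the course.

    source_levels maps each word in a source list to its level, with 'xx'
    standing for "no known CEFR level".
    """
    used = [w for w in source_levels if w in course_words]
    levels = Counter(source_levels[w] for w in used)
    coverage = 100.0 * len(used) / len(source_levels) if source_levels else 0.0
    return len(used), coverage, dict(sorted(levels.items()))


# Hypothetical source list and course vocabulary, for illustration only.
# The levels are placeholders, not values from an actual CEFR source.
source = {"network": "A2", "protocol": "B2", "latency": "xx", "garden": "A1"}
course = {"network", "protocol", "latency", "router"}

total, pct, by_level = cefr_summary(course, source)
print(f"total: {total} ({pct:.2f}%)")
print(by_level)
```

Words in the course that are absent from the source (here "router") do not count toward that source's total, which is why the same course can show quite different coverage across sources.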

Assessing vocabulary used in course wikipages in terms of Common European Framework of Reference for Languages (CEFR) level
