Participation Opportunity: Comparison of the Quality of AI vs Human Generated Rubrics


Summary

The Advance Development team is looking for educators interested in judging the effectiveness of an instructional tool. This research seeks to gather experimental evidence for generative AI’s ability to produce “intelligent” rubrics and determine their potential to be similar in quality to human-generated rubrics.

We are inviting you to take part in this study by completing a survey, which will take around 15 minutes. Your input would help shape the development of an innovative tool intended to empower educators with a streamlined process that promotes fairness in education.

Join us in helping validate a solution that could afford a more efficient and equitable learning experience. Click below to participate. 

 

Start Survey Now

 

Research Details

This is phase one of a three-phase research effort. If our AI-generated rubric solution passes this test, it could proceed to additional rounds of vetting. If all goes well, this solution could provide educators with a well-validated and trusted time-saver: generative AI has the potential to take an important but time-consuming task and turn it into an automated process with a “human in the loop” quality-control check.

Introduction

Across workplaces, the use of generative AI for tedious tasks is increasing, and in education, building rubrics is a prime example (Ouyang et al., 2023; U.S. Department of Education, n.d.). Rubrics are highly beneficial for both students and instructors, but they are also highly time-consuming for the instructor to create. Left unsupported, busy educators may settle for a freemium, good-enough-for-now solution. We recognize that it is in our best interest to at least test a large language model’s (LLM) ability to handle this task.

Thus, the Advance Development team at Instructure is conducting research that evaluates the quality of AI-generated rubrics compared to human-generated rubrics. We are testing a prototype that leverages generative AI to automate parts of the rubric creation process and increase the viability of using rubrics in Canvas LMS.

Background

A rubric is a valuable tool. It communicates expectations to students, helps align content to assignments and assessments, provides learners with an opportunity for critical thinking and self-evaluation, and has been shown to alleviate bias in grading (Quinn, 2020). Given these benefits, one would expect that the majority of courses in Canvas LMS would leverage rubrics. As it stands, roughly a third of courses make use of this feature.   

Even with a streamlined creation process, rubrics remain time-consuming and tedious to build. When an instructor has already taken time to outline basic goals and expectations in an assignment description, the rubric creation process can feel that much more cumbersome. Instructure’s Advance Development team recently completed a proof of concept (POC) to determine if generative AI could build a rubric via a few-shot approach.

POC 

For this POC, we examined whether the AI models could understand assignment descriptions, generate relevant criteria and ratings, assign logical point values, and produce output compatible with the Canvas API.

Prototype 

Success Criteria

1.) Retrieve an existing assignment record that does not have an associated rubric


2.) Retrieve a list of rubrics from the account to give to the models as examples


3.) Generate a prompt asking the model to generate a novel rubric given the assignment type and description and using the existing rubrics as a reference (few-shot)


4.) Feed the prompt to a chosen AWS Bedrock or OpenAI model and validate that the output matches what would be expected


5.) Transform the output into a format that the Canvas REST API can work with


6.) Utilize the Canvas REST API to associate the generated rubric with the assignment so that it can be reviewed within the context of the assignment edit page.
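
To make the six steps above concrete, here is a minimal sketch of how such a pipeline might be wired together in Python. It is illustrative only: the host URL, token placeholder, helper names, prompt wording, JSON field names, and the Bedrock model ID are assumptions rather than the prototype’s actual code, and the endpoint paths simply follow the public Canvas REST API documentation.

```python
"""Minimal sketch of the POC pipeline (illustrative, not the actual prototype)."""
import json

import boto3
import requests

CANVAS = "https://your-institution.instructure.com/api/v1"  # placeholder host
HEADERS = {"Authorization": "Bearer <CANVAS_API_TOKEN>"}     # placeholder token
bedrock = boto3.client("bedrock-runtime")


def get_assignment(course_id, assignment_id):
    """Step 1: retrieve an assignment that does not yet have an associated rubric."""
    r = requests.get(f"{CANVAS}/courses/{course_id}/assignments/{assignment_id}",
                     headers=HEADERS)
    r.raise_for_status()
    return r.json()


def get_example_rubrics(account_id, limit=3):
    """Step 2: pull a few existing account rubrics to use as few-shot examples."""
    r = requests.get(f"{CANVAS}/accounts/{account_id}/rubrics", headers=HEADERS)
    r.raise_for_status()
    return r.json()[:limit]


def build_prompt(assignment, examples):
    """Step 3: ask for a novel rubric, grounded in the assignment type and
    description, with the existing rubrics as references (few-shot)."""
    return (
        "You are helping an instructor build a grading rubric.\n"
        f"Assignment type: {assignment.get('submission_types')}\n"
        f"Assignment description:\n{assignment.get('description')}\n\n"
        "Example rubrics in the JSON structure we expect:\n"
        f"{json.dumps(examples, indent=2)}\n\n"
        "Generate a new rubric for this assignment in the same JSON structure."
    )


def generate_rubric(prompt):
    """Step 4: send the prompt to Claude 3 Haiku on AWS Bedrock and parse the reply."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [{"role": "user", "content": prompt}],
    }
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0", body=json.dumps(body)
    )
    return json.loads(json.loads(resp["body"].read())["content"][0]["text"])


def attach_rubric(course_id, assignment_id, rubric_params):
    """Steps 5-6: after reshaping the model output into the structure the Canvas
    API expects (see the transformation sketch later in this post), associate
    the rubric with the assignment for review on the assignment edit page."""
    rubric_params["rubric_association"] = {
        "association_type": "Assignment",
        "association_id": assignment_id,
        "purpose": "grading",
    }
    r = requests.post(f"{CANVAS}/courses/{course_id}/rubrics",
                      headers=HEADERS, json=rubric_params)
    r.raise_for_status()
    return r.json()
```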

Q. Can the models understand the intent of the assignment from the assignment description?

Q. Can the models generate novel criteria, ratings, and descriptions based on that understanding?

  A. Yes. Both models appear to understand the stated expectations of the assignment from the assignment description and generated criteria that reflected these expectations.

 

Q. Can the models generate criteria point values and ranges that make sense?

  A. Yes. Both models consistently created “n” criteria, with reasonable point allocations and a correct total point value.

 

Q. Can the models generate an output that is usable by the Canvas API / ecosystem?

  A. Yes. Both models were able to generate output that matched the example rubrics, and only basic structural manipulation was required to transform it into the format expected by the Canvas API (numbered hashes instead of arrays for criteria and ratings).
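
To illustrate that last point, here is a minimal sketch of the reshaping step, assuming the model returns criteria and ratings as JSON arrays; the field names and wrapper structure are illustrative assumptions, not the prototype’s exact schema.

```python
def to_canvas_params(rubric):
    """Reshape a model-generated rubric (criteria and ratings as lists) into the
    index-keyed hashes the Canvas API expects. Field names are illustrative."""
    criteria = {}
    for i, criterion in enumerate(rubric["criteria"]):
        ratings = {
            str(j): {"description": rating["description"], "points": rating["points"]}
            for j, rating in enumerate(criterion.get("ratings", []))
        }
        criteria[str(i)] = {
            "description": criterion["description"],
            "points": criterion["points"],
            "ratings": ratings,
        }
    return {"rubric": {"title": rubric.get("title", "Generated rubric"),
                       "criteria": criteria}}
```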

 

This initial success suggests that AI-generated rubrics merit further exploration. Claude 3 Haiku produced acceptable results, is cost-effective, and is widely available, making it a good candidate from a business perspective. From an educational utility perspective, however, the quality of the models’ output needs further validation. This is where you come in: we seek to have educators rate AI-generated rubrics against human-generated ones in a controlled experiment. If successful, we would proceed with building a solution available within Canvas LMS for continued validation.

 

Methods

This study is guided by one hypothesis focused on educator evaluation. We aim to evaluate the quality and practicality of AI-generated rubrics via experimental design.  

Hypothesis

Rubrics generated by AI models will be rated at least as well as human-generated rubrics in terms of educator ratings of quality, coverage, usefulness, and fairness in a controlled environment.

We posit that one or both models are capable of “exhibiting intelligent behavior equivalent to or indistinguishable from that of a human” (i.e., passing a Turing Test; Turing, 1950). The results would inform the future development and integration of this solution within Canvas LMS, ensuring that AI-generated resources provide both pedagogical value and operational efficiency.

Experimental Design

This study compares rubrics created by two different LLMs against a human-generated control across three topic areas. By employing a 3 × 3 mixed-model design, we are conducting this initial assessment of our AI-generated rubrics to understand whether they achieve similar levels of quality, coverage, usefulness, and fairness as those created by humans.

The rubric generator is fully randomized between subjects to ensure unbiased evaluation, while the three subject areas (Science, Civics, and Biology) are evaluated within subjects as a first (albeit small) step in testing the extent of the models’ generalizability. Additionally, assignment description word count will be examined to better understand whether AI-generated rubrics improve when an assignment description is more complete. If an AI-generated rubric achieves ratings that are not significantly different from the human-generated control, it stands to reason that further testing should proceed.
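
For readers curious how such a design might be analyzed, below is a hedged sketch using a linear mixed model in Python; the column names, file name, and model specification are assumptions for illustration, not the study’s actual analysis plan.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative long-format data: one row per participant x subject-area rating.
# Assumed columns: participant, generator (human / llm_a / llm_b),
# subject (Science / Civics / Biology), desc_word_count, quality_rating.
ratings = pd.read_csv("rubric_ratings.csv")  # hypothetical file

# Generator varies between subjects; subject area varies within subjects,
# so each participant gets a random intercept (grouping factor).
model = smf.mixedlm(
    "quality_rating ~ C(generator, Treatment(reference='human'))"
    " + C(subject) + desc_word_count",
    data=ratings,
    groups=ratings["participant"],
)
result = model.fit()
print(result.summary())  # generator coefficients compare each LLM to the human control
```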

Participation

We are looking for at least 250 educators across multiple disciplines, with differing levels of expertise and a wide variety of usage, to help assess these rubrics. By gathering responses from a broad range of educators, we aim to ensure that we have a sufficient sample size to determine significant differences if they exist.  
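
As a rough illustration of why a target in that range is plausible, here is a conventional power calculation for a three-group between-subjects comparison; the effect size, alpha, and power values are assumed for illustration and are not the study’s actual calculation.

```python
from statsmodels.stats.power import FTestAnovaPower

# Illustrative only: power calculation for a three-group between-subjects
# comparison (human control vs. two LLM generators). Assumed values below.
n_total = FTestAnovaPower().solve_power(
    effect_size=0.2,  # Cohen's f, a small-to-medium effect
    alpha=0.05,
    power=0.80,
    k_groups=3,
)
print(round(n_total))  # total participants needed: on the order of 250
```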

If one of the AI-generated rubrics demonstrates non-significant differences from the human-generated control on the dependent variables, our evidence would indicate that AI-generated rubrics perform at a level comparable to human-created rubrics. This result would justify further validation in a real-world setting, where we would proceed to Study 2.

Importance and Impact

This research seeks to address the pressing need for ed-tech tools to improve efficiency, reduce workload, avoid quality compromises, and retain educators' control. The study applies a Turing Test-inspired approach to assess whether AI can produce rubrics indistinguishable from those created by humans, which speaks directly to the broader AI research question of achieving "human-like" intelligent behavior in specialized tasks. This study is a step in the direction of better understanding the degree to which an AI-generated tool could be similar to a human-generated one in the nuanced context of a rubric. 

Educator acceptance of AI solutions matters to us. By involving you in the evaluation process, this study will both test the technical feasibility of AI-generated rubrics and build a better understanding of educators’ perceptions of AI’s utility and fairness. The results will inform future research on AI in instructional and administrative tasks while preserving pedagogical quality.

 

Conclusion

At Instructure, we champion a responsible, ethical approach to AI that prioritizes transparency, accountability, and user privacy, minimizes environmental impact, and fosters secure, human-centered partnerships. We aim to innovate with intention, and we know that the most impactful innovations are going to come from direct input from the educators and learners who use our platform.

This study represents an important step in evaluating the quality of generative AI content. With the active participation of diverse educators across disciplines and levels of expertise, we aim to gather insights that will guide the responsible integration of AI tools within Canvas LMS. Once we reach a sufficient sample size, we will share the results and the implications for future development. Your involvement helps pave the way toward more efficient, equitable, and teacher-empowered educational tools. 

Thank you for joining us in this endeavor to shape the future of instructional technology.

 

Start Survey Now 

 


References

Cover image designed by Freepik

Ouyang, S., Wang, S., Liu, Y., Zhong, M., Jiao, Y., Iter, D., Pryzant, R., Zhu, C., Ji, H., & Han, J. (2023). The shifted and the overlooked: A task-oriented investigation of user-GPT interactions. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. https://aclanthology.org/2023.emnlp-main.146.pdf

Quinn, D. M. (2020). Racial bias in grading: Evidence from teacher-student relationships in higher education. Educational Evaluation and Policy Analysis, 42(3), 444-467. https://doi.org/10.3102/0162373720942453 

Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433

U.S. Department of Education. (n.d.). AI Inventory. Office of the Chief Information Officer. Retrieved September 12, 2024, from https://www2.ed.gov/about/offices/list/ocio/technology/ai-inventory/index.html
