Results of a Comparison of the Effectiveness of AI vs Human Generated Rubrics


A comparison of AI-generated rubrics to those created by humans, specifically focusing on Coverage, Fairness, Quality, and Usefulness, found that AI performed well in producing practical and impartial rubrics. However, the results also indicate that instructor oversight is necessary. While AI can effectively create a fair rubric, it may struggle to align with an instructor's intended complexity, or lack thereof, in an assignment. We will start looking for partnering institutions to further validate and test a solution in early 2025.

 

Introduction

Rubrics are beneficial for both students and instructors but also time-consuming to create. This study's findings suggest that a generative AI solution powered by an LLM does well at producing an initial draft of a rubric, even with very simple prompting. Exploring the potential of AI to speed up the rubric creation process is proving promising, but keeping instructors in the loop remains important. 

Background

Rubrics are a valuable feature in Canvas LMS: they communicate expectations to students, help align content to assignments and assessments, give learners an opportunity for critical thinking and self-evaluation, and have been shown to reduce bias in grading (Quinn, 2020). Given these benefits, one would expect the majority of assignments in Canvas LMS to have a rubric attached. However, only roughly 34% of active courses in 2024 contained a rubric. Software can make it easier to grade with a rubric, but the rubric creation process can be arduous. Instructors may find that, after spending time outlining goals and expectations in an assignment description, building a rubric on top of that feels like redundant extra work.

So, what are the options? We can start from scratch and reverse engineer a rubric from the assignment description, adapt a preexisting rubric for our purposes, collaborate with colleagues to share and adapt rubrics together, or rely on AI to get us started. All of these are good options; however, one is not like the others.

[Screenshot: asking ChatGPT, “Are you good at writing rubrics?”]

While ChatGPT assures us it is quite good at writing rubrics, the Advance Development team wanted to lean on the expertise of real educators to test AI-generated rubrics against human-generated rubrics and evaluate whether they could perform equally well.  

Hypothesis

This initial study was guided by a hypothesis focused on evaluating the effectiveness of AI-generated rubrics. Specifically, the hypothesis was defined as follows:

Hypothesis: AI-generated rubrics will be rated similarly in quality to human-generated rubrics. Rubrics generated by LLMs will be perceived at least as favorably as human-generated rubrics in educator ratings of coverage, fairness, quality, and usefulness in a controlled environment.

If an AI-generated rubric is found to be as effective as a human-generated rubric, it would suggest that a generative AI tool powered by a large language model (LLM) can perform this task at a comparable level to that of a human. In other words, we hypothesized that an LLM can pass a simple Turing Test with regard to creating a rubric. 

Let’s get into the details.

Method & Design

Participants were primarily recruited through a separate post in the Community, which ran from Nov. 6th to Dec. 4th, 2024. The post provided information about the study and a link to the survey. 

The study employed an experimental design that incorporated both between-subjects and within-subjects methods. There were three assignments representing three subjects: Ethics in Science, Biology, and Sociology. Each assignment had a description, and participants were randomly assigned to evaluate one of three possible versions of the rubric (two generated by AI and the original human-generated rubric that came with the assignment). In all, there were nine distinct rubrics, and educators were randomly assigned to evaluate three of them. Additionally, the study utilized a blinded design: neither the participants nor the researcher was aware of the condition assignments during the trials.
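To make the design concrete, here is a minimal sketch of how such an assignment scheme could be generated. The version labels, participant IDs, and use of Python's random module are illustrative assumptions, not the study's actual tooling; in the live study, which rubric was AI-generated was hidden from both participants and the researcher.

```python
import random

# Three assignments, each with three rubric versions: two AI-generated and
# one human-generated. Labels are deliberately opaque to support blinding.
RUBRIC_VERSIONS = {
    "Ethics in Science": ["version_1", "version_2", "version_3"],
    "Biology": ["version_1", "version_2", "version_3"],
    "Sociology": ["version_1", "version_2", "version_3"],
}

def assign_participant(participant_id: int, seed: str = "rubric-study") -> dict:
    """Pick one rubric version per assignment for a participant.

    Within-subjects: every participant sees all three assignments.
    Between-subjects: each participant sees only one version per assignment.
    """
    rng = random.Random(f"{seed}:{participant_id}")
    return {
        assignment: rng.choice(versions)
        for assignment, versions in RUBRIC_VERSIONS.items()
    }

if __name__ == "__main__":
    for pid in range(1, 4):
        print(pid, assign_participant(pid))
```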

Participants were presented with a rubric, along with a link to the corresponding assignment description, and asked to evaluate it on four criteria: Coverage, Fairness, Quality, and Usefulness (the aggregate of which is referred to as “Effectiveness”). Following their ratings, participants provided positive and/or negative feedback tailored to their ratings.
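The post does not spell out how the four ratings are combined into the “Effectiveness” aggregate; the sketch below assumes it is a simple mean of the four criterion scores for a single evaluation, with made-up rating values.

```python
from statistics import mean

# One participant's ratings of a single rubric on the four criteria
# (values are illustrative, not study data).
ratings = {"coverage": 4, "fairness": 4, "quality": 3, "usefulness": 4}

# Assumption: "Effectiveness" is the plain average of the four criterion scores.
effectiveness = mean(ratings.values())
print(round(effectiveness, 2))  # 3.75
```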

Participants

[Chart: participant overview]

 

In total, 65 educators participated (thank you again to everyone who contributed their valuable insights!). These participants came from a diverse range of institutions and held various roles in education settings. With a range of familiarity and usage patterns for both Canvas LMS features and AI tools, this group provided both quantitative ratings of the rubrics and a rich cross-section of qualitative insights.

 

[Chart: participant institution types, roles, and years using Canvas LMS]

Most participants were associated with higher education, but many were also from K-12. On average, participants had used Canvas LMS for nearly six years. A majority were in Faculty/Teaching and Instructional Design roles, with several other roles also represented to a lesser extent.

 

[Chart: participant familiarity with the Rubric feature and use of rubrics and AI tools]

On average, participants were highly familiar with the Rubric feature in Canvas LMS (mean = 3.95). They reported using rubrics to grade students’ work sometimes to often (mean = 3.53), and likewise using AI tools in their day-to-day work sometimes to often (mean = 3.52).

 

Findings

Analyses revealed the strengths of both Claude Haiku and ChatGPT-4o, particularly in Biology and Sociology. Claude Haiku was noted for its strong rubric structure, though some aspects lacked clarity and objectivity. ChatGPT-4o received high ratings on coverage, fairness, and usefulness, and feedback indicated it outlined objectives effectively, suggesting it might be a valuable tool for supporting grading, especially in the applied and social sciences.

Claude Haiku vs Human Generated Rubrics

Claude Haiku’s rubrics scored higher on average than our human-generated rubrics across all subjects. For the Ethics in Science rubric, the lack of a significant difference suggests that both rubrics were roughly equally effective. However, significant differences were observed in the other two subjects, Biology and Sociology, where Claude Haiku’s rubrics were rated higher. This pattern highlights Claude Haiku’s potential strength in delivering effective materials in science-related subjects.


[Chart: effectiveness ratings for Claude Haiku vs. human-generated rubrics, by subject]

* differences were statistically significant
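The write-up does not name the statistical test behind these asterisks. As a hedged illustration only, the sketch below runs one plausible choice, Welch’s independent-samples t-test, plus a non-parametric alternative, on made-up per-participant effectiveness scores for a single subject; both the numbers and the choice of test are assumptions, not the study’s actual analysis.

```python
from scipy import stats

# Hypothetical per-participant effectiveness ratings (1-5 scale) for one
# subject's rubrics; these are illustrative values, not study data.
ai_rubric    = [4.00, 3.50, 4.25, 3.75, 4.00, 3.50, 4.25]
human_rubric = [3.00, 2.75, 3.25, 3.00, 2.50, 3.25, 2.75]

# Welch's t-test does not assume equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(ai_rubric, human_rubric, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# With ordinal Likert-style ratings, a non-parametric test such as
# Mann-Whitney U is also a reasonable choice.
u_stat, p_mw = stats.mannwhitneyu(ai_rubric, human_rubric, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_mw:.4f}")
```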

Participants noted that Claude Haiku generated acceptable rubric categories and criteria, and they appreciated its use of specific language. However, raters observed a lack of objectivity in its mid-range items. Claude Haiku also included “additional considerations,” which raters found confusing and problematic: there was no clear explanation of how students could turn those considerations into actions, and they were not useful in the context of grading.

ChatGPT-4o vs Human Generated Rubrics

ChatGPT-4o’s rubrics outperformed human-generated rubrics in all subjects, with significant differences observed in Biology and Sociology. In these areas, ChatGPT-4o’s rubrics were rated significantly higher, suggesting it is capable of delivering effective materials. While no significant difference was found for Ethics in Science, the results still support ChatGPT-4o’s ability to generate an effective rubric. Taken together, these results emphasize ChatGPT-4o’s ability to support teachers in their grading, particularly in the applied and social sciences.

[Chart: effectiveness ratings for ChatGPT-4o vs. human-generated rubrics, by subject]

* differences were statistically significant

Participants found ChatGPT-4o's rubrics to be comprehensive and clear, especially in addressing both general writing standards and the specific requirements of an essay. They appreciated the well-defined categories and the option for partial credit, which contributed to a fair and nuanced evaluation. The rubrics were regarded as effectively outlining the objectives and expectations, breaking down the evaluation into relevant components such as source incorporation, critique demonstration, and the display of analysis and judgment.  

Additional Observations

The subjects for the assignments were held relatively constant; all were science-focused. While the word count of the assignment descriptions varied, the average rating for overall effectiveness was not materially impacted, suggesting that a longer or more detailed input from the user is not necessarily critical for the LLM to succeed.
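One way to probe that observation, assuming you had a table of per-rubric results with a description word count and a mean effectiveness rating, is a simple correlation check. The variable names and values below are hypothetical, not study data.

```python
from scipy.stats import pearsonr

# Hypothetical data: assignment-description word counts and the mean
# effectiveness rating each rubric received (illustrative values only).
word_counts        = [180, 420, 95, 310, 260, 150, 500, 75, 340]
mean_effectiveness = [3.4, 3.6, 3.2, 3.5, 3.3, 3.6, 3.4, 3.1, 3.5]

r, p = pearsonr(word_counts, mean_effectiveness)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A small, non-significant correlation would be consistent with the
# observation that description length did not materially change ratings.
```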

While not a focus of this study, we noted a usability difference: the rubric format produced by ChatGPT-4o was easier to work with. In the absence of any prompting, it produced a tabled rubric, whereas Claude Haiku produced the information in a list format. Translating the list into a Canvas Rubric was slightly more tedious, as it required more attention to detail.

During the trials, many participants, but not all, reviewed the assignment descriptions. Comparing ratings between those who did and did not review the descriptions revealed that ChatGPT-4o further outperformed Claude Haiku. However, the Sociology assignment was a weekly activity, and neither ChatGPT-4o nor Claude Haiku adjusted for this; participant ratings were notably lower among those who reviewed the assignment description. The generated rubrics were disproportionately detailed compared to the simplicity of the task, which involved reading a popular news article and preparing for class discussion. Without additional prompting, both ChatGPT-4o and Claude Haiku defaulted to a standard 0-100 point rubric format, highlighting the importance of teachers keeping a hand on the wheel.

Summary

Our analysis sought to compare two LLMs’ ability to generate a “good enough” rubric against human-generated rubrics. We looked at ChatGPT-4o and Claude Haiku across four criteria: Coverage, Fairness, Quality, and Usefulness. The results reveal that ChatGPT-4o was rated highest on all but Quality, where it was nearly equal to Claude Haiku.

[Chart: overall ratings for ChatGPT-4o, Claude Haiku, and human-generated rubrics across Coverage, Fairness, Quality, and Usefulness]

The chart above shows the overall performance of ChatGPT-4o, Claude Haiku, and Human rubric generation across four key criteria: Coverage, Fairness, Quality, and Usefulness.

  • Coverage: ChatGPT-4o leads with an average rating of 3.946, indicating better comprehensiveness. Claude Haiku follows with an average rating of 3.473, while Human-generated content lagged behind at 3.145.
  • Fairness: ChatGPT-4o again achieved the highest average rating at 3.636, suggesting that its criteria might be viewed as more equitable. Claude Haiku’s average rating was slightly lower at 3.434, and Human-generated criteria scored the lowest at 3.000.
  • Quality: Claude Haiku marginally outperformed ChatGPT-4o in quality, with average ratings of 3.577 and 3.537, respectively. Human-generated criteria had the lowest quality rating at 2.658.
  • Usefulness: ChatGPT-4o’s average rating was highest at 3.464, surpassing Claude Haiku’s average of 3.093. In contrast, Human-generated rubrics were rated lower at 2.539, indicating that AI-generated rubrics might be more practical for end-users.
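Averaging the four criterion means reported above gives a quick back-of-the-envelope check on the overall figures. Note that this simple mean of means can differ slightly from an average taken over every individual rating, so treat it only as a sanity check.

```python
# Per-criterion average ratings as reported above
# (order: Coverage, Fairness, Quality, Usefulness).
reported = {
    "ChatGPT-4o":   [3.946, 3.636, 3.537, 3.464],
    "Claude Haiku": [3.473, 3.434, 3.577, 3.093],
    "Human":        [3.145, 3.000, 2.658, 2.539],
}

for source, scores in reported.items():
    print(f"{source}: {sum(scores) / len(scores):.3f}")
# ChatGPT-4o works out to roughly 3.646, consistent (up to rounding) with
# the overall 3.647 figure cited below.
```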

 

Overall, ChatGPT-4o achieved the highest average rating (3.647). This does not mean it outperforms every human-written rubric; rather, it performed better than the somewhat randomly selected human-generated rubrics pulled from the web. Both AI tools demonstrated strengths in creating comprehensive and practical rubrics, though the importance of teacher input remains clear. These findings suggest that while AI tools are effective for generating rubrics, instructors would be best off ensuring thoughtful alignment with the simplicity and intent of the assignment.

Is ChatGPT-4o good at writing rubrics? According to our test, signs point to yes. 

[Screenshot: ChatGPT’s modest “aw shucks” reply]

 


Conclusion

ChatGPT-4o demonstrated the strongest performance across most criteria, particularly Coverage, Fairness, and Usefulness. While Claude Haiku was competitive on Quality, the three human-generated rubrics grabbed more or less at random from the web consistently scored the lowest. This highlights the potential of AI tools like ChatGPT-4o to produce rubrics that are comprehensive, fair, and practical.

Recommendations

  • Rely on ChatGPT-4o to help create a new rubric, as it is capable of producing a good first draft. Then refine the rubric to align with your assignment’s complexity.
  • Get more familiar with prompting your AI generator. For this research, the only prompt given was “Can you make a rubric based on the following assignment description?” Better prompts lead to better results, and tech companies are coming to the rescue: Structured Prompt, for example, offers free resources that provide guidance on creating better prompts. A minimal example of a more structured prompt is sketched below.
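To make that recommendation concrete, here is a hedged sketch of sending a more structured prompt to an LLM via the OpenAI Python SDK. The model name, prompt wording, and criteria are illustrative assumptions, not what this study used; the study itself used only the single-sentence prompt quoted above.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

assignment_description = "..."  # paste your assignment description here

# A more structured prompt than the single sentence used in the study:
# it states the audience, the criteria, the point scale, and the format.
prompt = (
    "Create a grading rubric for the assignment below.\n"
    "- Audience: undergraduate students\n"
    "- Criteria: coverage of required sources, quality of analysis, clarity of writing\n"
    "- Scale: 20 points total, with descriptors for full, partial, and no credit\n"
    "- Format: a table with one row per criterion\n"
    "- Keep the rubric proportional to the scope of the task\n\n"
    f"Assignment description:\n{assignment_description}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; substitute whichever model you use
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```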

Future Directions

This study evaluated whether AI-generated rubrics could maintain high-quality standards and prove useful to educators. These results are informing the future development and integration of AI-generated rubrics and will help ensure this resource provides both pedagogical value and operational efficiency. Follow-up studies would entail validating a tool across more subject areas in a real-world setting.

 

Appendix