
Why AI Grading Has Been Getting It Wrong

May 7, 2026
10 min
Written by Tosh Polak, Head of Product

Grading is one of the most time-consuming parts of teaching, and one of the most inconsistent. Faculty at large institutions can spend dozens of hours each week reviewing submissions, writing feedback from scratch, and trying to apply rubrics consistently across hundreds of students. Meanwhile, students frequently receive generic comments that don't reflect the specific strengths or gaps in their work. The result is a system where feedback arrives too late to be useful, varies too much between instructors, and leaves both sides without a clear path forward.

Most AI grading tools were built to address the time problem. Fewer were built to address the quality problem. And almost none were built to address both without trading one off against the other.

That tension is worth examining, because it shapes every meaningful design decision in this space. It is the tension that led us to rebuild our AI Feedback & Grader solution: not to add capabilities on top of an existing architecture, but to rethink the approach from the foundation up. What follows is an account of the problems we kept running into and the principles we arrived at, which we think apply well beyond any single product.

Feedback quality degrades when speed is the only goal

The traditional AI grading workflow reinforces a bad habit: open a submission, wait for the AI, skim the output, move on. The wait, sometimes over a minute per student, adds up across a cohort of thirty or a hundred. So the natural instinct is to optimize for speed.

But speed-first design produces speed-first output. When an AI has thirty seconds to process a submission, it produces a surface reading. When it has more time, working in the background while an instructor reviews the previous student, it can reason more carefully, catch more nuance, and produce drafts that require fewer corrections.

The better design question is not "how do we make the AI faster?" but "how do we give the AI more time to think without making the instructor wait?" Those are different problems with different solutions. Background generation, where drafts are prepared for upcoming submissions while the instructor is still reviewing the current one, shifts the question entirely. The wait doesn't disappear; it moves somewhere the instructor can't feel it. Perceived wait time drops from over a minute to roughly ten to thirty seconds. Output quality improves because the AI has had more time to work. The instructor's rhythm stays intact.
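As a rough sketch of that pattern, here is what background generation can look like with Python's asyncio. Everything here is illustrative, the function names, the PREFETCH_AHEAD window, the timings, not our actual implementation; the point is simply that drafts for upcoming submissions are already in flight while the instructor reviews the current one.

```python
import asyncio

PREFETCH_AHEAD = 3  # hypothetical tuning knob: how many drafts to keep in flight

async def generate_draft(submission: str) -> str:
    # Stand-in for the slow AI call; this is where the extra thinking time goes.
    await asyncio.sleep(1.0)
    return f"draft feedback for {submission}"

async def instructor_reviews(submission: str, draft: str) -> None:
    # Stand-in for the instructor reading and editing the current draft.
    await asyncio.sleep(2.0)

async def grading_session(submissions: list[str]) -> None:
    tasks: dict[str, asyncio.Task] = {}
    for i, current in enumerate(submissions):
        # Start generation for the next few submissions in the background.
        for upcoming in submissions[i : i + PREFETCH_AHEAD]:
            if upcoming not in tasks:
                tasks[upcoming] = asyncio.create_task(generate_draft(upcoming))
        draft = await tasks[current]  # usually already finished: no felt wait
        await instructor_reviews(current, draft)

asyncio.run(grading_session([f"student_{n}" for n in range(1, 8)]))
```

Because reviewing a draft takes longer than generating the next one, every draft after the first is typically ready before the instructor asks for it.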

Why AI feedback sounds like it was written for no one in particular

Tone is where most AI grading tools quietly fail. Instructors notice it immediately: the feedback is technically accurate but reads like it could have been written for any student, in any course, on any topic. There is no voice. No relationship. No sense that this specific person's work was actually read.

This happens because most tools generate feedback without knowing anything about the instructor's preferences. The AI writes in a neutral register because no one told it to do otherwise. The instructor then spends time softening it, or making it more direct, or rewriting the opening entirely, and does that again the next session, and the one after that.

The fix is straightforward in concept and harder in execution: ask instructors to define their preferences before they grade a single submission, not after. Tone, scoring format, style by course type or department. Set it once. Let the AI work within those parameters from the first click. New instructors can start with pre-built profiles; experienced ones can refine over time.
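Concretely, a preference profile can be a small structured object that gets folded into the model's instructions before the first draft is ever generated. A hypothetical sketch, with field names invented for illustration rather than taken from our schema:

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackProfile:
    # Field names are illustrative, not a real schema.
    tone: str = "warm but direct"
    scoring_format: str = "points out of 100"
    course_type: str = "undergraduate essay"
    style_notes: list[str] = field(default_factory=list)

    def to_instructions(self) -> str:
        # Folded into the model's instructions before the first submission
        # is graded, so every draft starts in the instructor's register.
        notes = "; ".join(self.style_notes) or "none"
        return (
            f"Write feedback in a {self.tone} tone for a {self.course_type} "
            f"course. Report scores as {self.scoring_format}. "
            f"Style notes: {notes}."
        )

profile = FeedbackProfile(style_notes=[
    "open with one specific strength",
    "end with one concrete next step",
])
print(profile.to_instructions())
```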

The result is feedback that reflects the instructor's standards and voice, not a generic average of all possible instructors.

The hard truth about how AI assigns grades

Here is something few AI grading vendors will say plainly: for a long time, scores were not actually derived from rubrics. They were derived from sentiment.

If the feedback generated sounded positive, the grade was high. If the language was critical, the score dropped. The rubric was present in the interface, but it was functioning more as a label than as an actual evaluation framework. The AI was, in effect, reverse-engineering a score from its own tone.

The correct approach requires treating grading and feedback as separate processes that inform each other, rather than conflating them. A well-designed grading pipeline starts by building a scoring blueprint for each rubric criterion: a step-by-step algorithm describing what quality of work earns each score, with grade-level descriptors spelled out explicitly. That blueprint is generated before any submission is evaluated.
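To make the idea of a scoring blueprint concrete, here is a hypothetical sketch of the structure; the criterion, steps, and descriptors are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class GradeLevel:
    score: int
    descriptor: str  # what quality of work earns this score, spelled out

@dataclass
class CriterionBlueprint:
    criterion: str
    steps: list[str]          # the step-by-step evaluation algorithm
    levels: list[GradeLevel]  # explicit grade-level descriptors

# An illustrative blueprint for one rubric criterion, generated before
# any submission is evaluated.
thesis_clarity = CriterionBlueprint(
    criterion="Thesis clarity",
    steps=[
        "Locate the thesis statement",
        "Check that it makes an arguable claim",
        "Check that the body paragraphs actually support it",
    ],
    levels=[
        GradeLevel(4, "Arguable, specific, and consistently supported"),
        GradeLevel(3, "Arguable but broad; mostly supported"),
        GradeLevel(2, "Present but descriptive rather than arguable"),
        GradeLevel(1, "No identifiable thesis"),
    ],
)
```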

The grader then walks through each criterion independently, evaluating the submission against each grade level and selecting the best fit. It uses the feedback as supporting evidence, not as its source of truth. When a submission falls between two levels, the conservative score is the correct default. No inflation. No over-grading. Just a defensible, rubric-aligned result.
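A minimal sketch of that tie-breaking rule. Here `fit` stands in for the grader's level-by-level judgment, a hypothetical confidence that the submission matches each level's descriptor:

```python
def grade_criterion(fit: dict[int, float], margin: float = 0.05) -> int:
    # `fit` maps each grade level's score to the (hypothetical) confidence
    # that the submission matches that level's descriptor.
    best = max(fit.values())
    near_ties = [score for score, conf in fit.items() if best - conf < margin]
    # A submission that falls between two levels gets the lower one:
    # no inflation, no over-grading.
    return min(near_ties)

# Nearly tied between levels 3 and 4; the conservative default resolves to 3.
print(grade_criterion({4: 0.48, 3: 0.46, 2: 0.05, 1: 0.01}))  # -> 3
```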

This distinction matters enormously for institutional trust. Instructors need to be able to defend a grade if a student challenges it. A score derived from AI sentiment cannot be defended. A score derived from a documented, criterion-by-criterion evaluation can.

Why single-pass AI analysis misses too much

The earliest AI grading systems operated in a single pass: read the whole submission, generate feedback, done. For short, simple assignments, this produces adequate results. For anything complex, it is structurally insufficient.

A single pass cannot simultaneously evaluate domain accuracy, logical reasoning, structural coherence, and writing style with equal attention to each. It tends to weight whatever is most salient in the text, which means a well-written but logically flawed argument can receive better feedback than a technically accurate but poorly structured one.

A staged pipeline addresses this by separating concerns. First, a shared assignment blueprint is built once per assignment, establishing common standards, typical mistakes, and feedback rules that apply consistently across every student. This alone solves one of the most persistent problems in large-course grading: drift, where the feedback given to the first student in a batch is subtly but meaningfully different from the feedback given to the last one, because the internal reference point has shifted.

Submissions are then broken into focused chunks by type: sections, code blocks, mathematical proofs, charts, spreadsheets. Each chunk is analyzed by specialized controllers running in parallel, each focused on its own dimension of quality. The results are then refined and structured against the rubric, with the instructor's preferred tone applied at the end.
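A skeletal sketch of that fan-out stage, with invented controller names and chunk types; the real pipeline's chunking and quality dimensions differ, but the shape is the same:

```python
import asyncio

# Hypothetical specialized controllers, each focused on one dimension.
async def check_domain_accuracy(text: str) -> str:
    return f"accuracy notes on: {text[:40]}"

async def check_reasoning(text: str) -> str:
    return f"reasoning notes on: {text[:40]}"

async def check_structure(text: str) -> str:
    return f"structure notes on: {text[:40]}"

# Which controllers run for which chunk type (illustrative mapping).
CONTROLLERS = {
    "section": [check_domain_accuracy, check_reasoning, check_structure],
    "code_block": [check_domain_accuracy, check_reasoning],
    "proof": [check_reasoning, check_structure],
}

async def analyze(chunks: list[tuple[str, str]]) -> list[str]:
    # chunks: (chunk_type, text) pairs from an upstream splitter.
    # All controller calls for all chunks run concurrently.
    return list(await asyncio.gather(*[
        controller(text)
        for chunk_type, text in chunks
        for controller in CONTROLLERS.get(chunk_type, [])
    ]))

findings = asyncio.run(analyze([
    ("section", "The experiment shows..."),
    ("code_block", "def mean(xs): return sum(xs) / len(xs)"),
]))
print(len(findings))  # 5 focused findings, later refined against the rubric
```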

The output is not faster analysis. It is deeper analysis, the kind that catches what a single pass would miss, while still remaining reviewable and adjustable by the instructor before anything reaches a student.

Refinement should be a conversation, not a form

When AI-generated feedback isn't quite right, most tools require an instructor to navigate to a separate adjustment screen, fill out structured fields, and regenerate from scratch. This interrupts the grading session and frames refinement as an exception rather than an expected part of the workflow.

A better model treats refinement as conversation. The instructor types what they want to change, in plain language, directly within the feedback view: make the tone more encouraging; add a reference to the student's introduction; shorten this section and focus on the methodology gaps. The adjustment is applied inline and immediately. The session continues without interruption.
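In implementation terms, this can be as simple as routing the instructor's message, together with the current draft, back through the model as a scoped edit. A hypothetical sketch, where `call_model` is a stub for whatever completion API sits underneath:

```python
def call_model(prompt: str) -> str:
    # Stub standing in for any chat-completion API call;
    # echoes its input here so the sketch runs as-is.
    return prompt

def refine(draft: str, instruction: str) -> str:
    # The instructor's plain-language request becomes the edit instruction
    # for the current draft only; nothing else is regenerated.
    prompt = (
        "Revise the feedback below according to the instruction. "
        "Change only what the instruction asks for; keep everything else.\n"
        f"Instruction: {instruction}\n"
        f"Feedback:\n{draft}"
    )
    return call_model(prompt)

draft = "The analysis is thorough. The methodology section is thin."
draft = refine(draft, "make the tone more encouraging")
draft = refine(draft, "shorten this and focus on the methodology gaps")
```

Because each refinement is scoped to the current draft, the cost of asking is low, and that is what keeps refinement inside the session rather than off in a settings screen.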

This shift matters beyond convenience. It signals a different relationship between the instructor and the tool. The AI is not producing a finished product for passive review. It is producing a working draft in a collaborative process where the instructor's judgment is the final authority. That framing changes how instructors engage with the output.

Trust requires visibility into reasoning

Instructors do not trust output they cannot interrogate. This is a reasonable professional instinct, not a failure of faith in technology. A grade a teacher cannot explain is a grade they cannot defend, and a grade they cannot defend is a liability.

AI systems that surface their reasoning change this dynamic. When an instructor can see which parts of a submission the AI focused on, which rubric criteria it weighted most heavily, and precisely why it chose a particular score, review becomes faster and trust increases. Not because the AI is always right, but because its reasoning is legible. An instructor can validate it, correct it, or override it with full understanding of what they are changing.
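What legible reasoning means in practice is that every score ships with a record the instructor can inspect. One hypothetical shape for that record:

```python
from dataclasses import dataclass

@dataclass
class CriterionRationale:
    # Illustrative record of what gets surfaced alongside each score.
    criterion: str
    evidence: list[str]   # quoted spans of the submission the score rests on
    chosen_level: int
    why: str              # why this level, and why not the adjacent ones

rationale = CriterionRationale(
    criterion="Thesis clarity",
    evidence=['"This paper argues that..." (para. 1)'],
    chosen_level=3,
    why="Arguable but broad; paragraphs 4-5 drift from the stated claim, "
        "which rules out level 4.",
)
```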

Transparency in AI grading is not a nice-to-have. It is a prerequisite for institutional adoption. The tools that get deployed at scale will be the ones where faculty understand what the AI is doing and why.

The more important shift is still ahead

The most significant near-term development in AI-assisted grading is not on the instructor side. It is on the student side.

Giving students access to AI feedback on their own work before final submission, on pre-approved assignments, changes the fundamental feedback loop in education. Right now, most students receive feedback after the grade is already set. The feedback is retrospective: it explains what went wrong, which is useful for future assignments but useless for the one just submitted.

Feedback that arrives before submission is something different. It is formative in the truest sense, giving students the information they need to improve work that is still in progress. Instructors retain control over which assignments are eligible. The AI does not replace the instructor's final assessment. But the student's experience of the feedback process shifts from receiving a verdict to participating in a conversation.

The other meaningful development is personalization through usage. An AI that learns from every edit, comment, and adjustment an instructor makes becomes, over time, a reflection of how that instructor actually gives feedback. Not a generic approximation, but a model of how this specific teacher writes, evaluates, and communicates. The reduction in manual editing is a secondary benefit. The primary one is that the tool stops feeling like an external system the instructor is managing, and starts feeling like a genuine extension of their practice.

That shift, from tool to collaborator, is where the real potential of AI in education sits.
