How Do You Actually Grade Story-Listening? Beniko Mason on Cloze Testing

May 27, 2026

By Beniko Mason

One of the most common questions I receive from teachers is about grading. They understand the method. They see students engaged. But they face a real institutional requirement: marks must be given. How can Story-Listening and Guided Self-Selected Reading — approaches built on comprehensible input, not memorized exercises — produce grades that satisfy a school or university?

Learners can absolutely be graded, and in many ways the resulting evaluation is fairer and more accurate than what most institutions use.

Teachers often think evaluation means memorized vocabulary, practiced grammar points, rehearsed dialogues, or discrete-point test items. But Story-Listening and Guided Self-Selected Reading evaluate different kinds of visible progress: better summaries, greater comprehension, smoother reading, increased fluency, more natural writing, stronger listening ability, and improved standardized test scores.

Language acquisition itself is subconscious, but the results are clearly visible.

What Is a Cloze Test?

A cloze test is a type of language assessment in which words are removed from a connected passage of text and learners are asked to supply the missing words using contextual clues from the surrounding discourse. The technique was first described by Wilson L. Taylor in 1953, and the term "cloze" was derived from the concept of "closure" in Gestalt psychology.

Most cloze tests have been used primarily as placement instruments — to estimate which graded-reader level a learner can handle comfortably. The cloze tests I developed served a different purpose. Rather than focusing on placement into reading levels, I repeatedly administered the same long narrative-based cloze test over extended periods of time in order to observe gradual changes in learners' developing internal language system, reading behavior, and readiness for autonomous reading.

The key difference is this: a placement test tells you where a learner is. A longitudinal cloze test tells you how a learner is changing.

Why It Fits an Input-Based Classroom

The task of filling in blanks with one's own words cannot be completed without understanding how language functions within connected discourse and how meaning flows through the text. For this reason, an appropriate-word cloze test may be more suitable for measuring overall language proficiency than a multiple-choice test, which often allows students to rely on recognition, guessing, or elimination strategies rather than drawing from their own developing internal language system.

From the perspective of Pure Optimal Unified Input, optimal input develops real language competence — including what John Oller (1983) referred to as "grammar expectancy." This competence gradually develops through Story-Listening and Guided Self-Selected Reading and eventually becomes visible even on traditional standardized tests.

In the past, some researchers criticized acquisition-based approaches by claiming they were too slow and that students taught through acquisition-based methods would not perform well on standardized tests. However, my classroom experience and research suggest otherwise. When learners receive abundant optimal input, acquisition-based approaches can produce substantial development efficiently while also being easier and more enjoyable for learners than traditional skill-based approaches.

How I Built the Test

I selected a text about the atomic bombing of Hiroshima written by a Japanese elementary school girl who experienced the event herself. The topic was emotionally and culturally familiar to many of my students. The translated text was estimated to fall at approximately the sixth-grade reading level using the Flesch–Kincaid Grade Level formula.
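For teachers who want to check the readability of their own candidate passage, the standard Flesch–Kincaid Grade Level formula can be sketched as below. The function name and the sample counts are illustrative assumptions, not the counts of the Hiroshima text, and counting syllables reliably is left to whatever tool you already trust.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level computed from raw counts of a passage."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Standard formula: longer sentences and longer words raise the grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts, for illustration only.
grade = flesch_kincaid_grade(words=100, sentences=10, syllables=130)
print(round(grade, 2))
```

A result near 6 would correspond to the sixth-grade level mentioned above.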

The text itself was approximately 1,640 words long. Some researchers criticized the test for being too long. However, I came to believe that the length was actually one of the strengths of the test. Many short cloze tests measure only local sentence processing or guessing ability. Longer passages allow readers to build context, sustain comprehension, form predictions, and process meaning across extended discourse. Real reading competence involves maintaining understanding over long stretches of connected text.

I deleted approximately every tenth word and created a total of 100 items. The first paragraph remained intact so that students could establish the context before encountering blanks. A test–retest analysis yielded a reliability coefficient of .87, indicating that the instrument produces consistent scores.
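The construction steps above (leave the first paragraph intact, blank roughly every tenth word, stop at 100 items) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the procedure actually used for the original test: the function name `make_cloze`, the blank-line paragraph separator, and the strictly mechanical every-nth deletion are all illustrative choices.

```python
import re


def make_cloze(text, every=10, max_items=100, blank="_____"):
    """Return (cloze_text, answer_key); the first paragraph is left untouched."""
    paragraphs = text.split("\n\n")      # assumes paragraphs are separated by blank lines
    answers = []
    rebuilt = [paragraphs[0]]            # first paragraph stays intact for context
    position = 0                         # running word count after the first paragraph
    for paragraph in paragraphs[1:]:
        words = []
        for token in paragraph.split():
            position += 1
            # Split off surrounding punctuation so only the word itself is blanked.
            lead, core, trail = re.fullmatch(r"(\W*)(.*?)(\W*)", token).groups()
            if position % every == 0 and core and len(answers) < max_items:
                answers.append(core)
                token = f"{lead}({len(answers)}){blank}{trail}"
            words.append(token)
        rebuilt.append(" ".join(words))
    return "\n\n".join(rebuilt), answers


# Tiny illustrative passage; a real test would use the full narrative.
sample = "This opening paragraph is kept whole.\n\n" + " ".join(
    f"word{i}" for i in range(1, 31)
)
cloze_text, key = make_cloze(sample)
print(key)
```

A teacher would still want to read through the result by hand, since purely mechanical deletion can land on names or numbers that no reader could restore from context.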

Scoring: What to Accept

I used acceptable-word scoring rather than strict exact-word scoring. If a student supplied a semantically and grammatically acceptable alternative, I counted it as correct. Rigid exact-word scoring may underestimate language development because real reading involves prediction and construction of meaning rather than reproduction of identical language forms.
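Acceptable-word scoring can be pictured as checking each response against a set of answers the teacher has judged acceptable for that blank. The sketch below is illustrative: `score_cloze` is a hypothetical helper, the sample answer sets are invented, and in practice the judgment of what counts as acceptable remains the teacher's.

```python
def score_cloze(responses, acceptable):
    """Count responses that match any teacher-approved answer for their item."""
    score = 0
    for response, options in zip(responses, acceptable):
        normalized = response.strip().lower()
        # Blank responses never score; comparison ignores case and outer spaces.
        if normalized and normalized in {option.lower() for option in options}:
            score += 1
    return score


# Hypothetical three-item key: each set lists the answers judged acceptable.
key = [{"trees"}, {"an"}, {"carefully", "slowly"}]
print(score_cloze(["tree", "an", "slowly"], key))  # 2
```

Exact-word scoring is the special case in which every set contains only the original word.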

I never returned the cloze tests to the students. Instead, I kept them in individual files so that I could compare their responses over time and observe long-term development.

The changes were not merely numerical. The actual quality of the responses gradually changed over time. Blanks that were originally left unanswered were eventually filled appropriately. Responses such as "tree" became more contextually appropriate forms such as "trees." Incorrect forms such as "a" were replaced by more appropriate forms such as "an." Appropriate adverbs such as "carefully" were supplied instead of unrelated content words. The correct use of prepositions is difficult, but over time students gradually began supplying more appropriate prepositions in the cloze passages.

Test scores themselves also tended to increase steadily every year. On average, scores often increased by approximately 10 points per year. Students who initially scored around 30 often reached approximately 40 after one year and around 50 by the end of the second year.
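Because the same test is administered repeatedly and the papers are kept on file, the gain between administrations is a simple difference. A minimal sketch, with `yearly_gains` as a hypothetical helper name and the trajectory taken from the typical figures just described:

```python
def yearly_gains(scores):
    """Differences between consecutive administrations of the same cloze test."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


# Typical trajectory described above: about 30, then 40, then 50.
print(yearly_gains([30, 40, 50]))  # [10, 10]
```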

What the Scores Mean

In my own longitudinal use of this particular cloze test with my students, consistent relationships emerged between cloze test scores and students' actual reading behavior.

Students who scored below approximately 10 points were generally not ready to begin reading even the easiest starter-level books comfortably. These students usually required large amounts of comprehensible auditory input through Story-Listening before reading became manageable and emotionally comfortable.

Students who scored approximately 20 to 25 points often showed partial readiness for beginning-level reading, but their reading fluency remained fragile. Sustained reading was still difficult and sometimes emotionally tiring.

When students scored approximately 30 to 35 points at the beginning of the first year, they were generally able to begin reading starter-level readers relatively comfortably while continuing Story-Listening in class. The time required to finish a 600-headword-level reader was often approximately 30 minutes, suggesting a reading speed of roughly 80 words per minute.

Students approaching 60 points were often reaching a level at which they could begin functioning as relatively autonomous readers of connected English narrative texts. When learners reach this level, they may be said to have entered what I call the "pleasure reader" stage. At this point, students were no longer dependent primarily on instruction. They could continue developing through reading itself and begin functioning more independently as language acquirers.
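The score ranges above can be collected into a small lookup for record keeping. This is only a sketch: the ranges are approximate and specific to this one test, the hard cutoffs are an assumption made for illustration, and scores that fall between the described ranges are deliberately left uninterpreted.

```python
def describe_readiness(score: float) -> str:
    """Map a score on the 100-item cloze test to the description given in the text."""
    if score < 10:
        return "not yet ready for starter-level books; needs more Story-Listening"
    if 20 <= score <= 25:
        return "partial readiness for beginning-level reading; fluency still fragile"
    if 30 <= score <= 35:
        return "can begin starter-level readers comfortably alongside Story-Listening"
    if score >= 60:
        return "reaching the autonomous 'pleasure reader' stage"
    # The text does not describe these intermediate scores, so none is invented here.
    return "between the ranges described in the text"


for s in (8, 22, 33, 45, 60):
    print(s, "->", describe_readiness(s))
```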

Three Assessment Types in Rotation

In my own classes, I alternated among three assessment methods:

Cloze Tests — Administered at the beginning and end of the first semester and at the end of the second semester to assess progress. For second-year students, the tests were also administered before graduation.
Summarization in the Native Language — After Story-Listening and reading, students created short summaries in their native language. I did not provide a template for what an optimal summary should look like. I graded the summaries based primarily on whether the students understood the story. Some summaries were long and others short; I considered this to reflect the students' personalities rather than their level of comprehension. This was a holistic assessment of how well students understood the content.
Summarization in the Target Language — After reading a given text, students wrote summaries in English without referring to the original text. I assessed how well they expressed the content in their own words while also observing overall language development, including grammatical accuracy.

In addition to these assessments, I also administered two vocabulary tests at the beginning of the first semester, both adapted from Paul Nation's Vocabulary Levels Test. I followed the approach used by Beglar and Hunt in revising the 2000-word-level test and in constructing the 3000-word-level test. Each test contained more than 50 items, rather than the original 18 items used in Nation's version.

Student progress depends primarily on the quality of the input rather than on memorizing isolated language forms disconnected from meaningful communication. When students listen to many meaningful stories they understand and enjoy, language competence gradually develops as a natural result of comprehension and continued exposure to input.

A Practical Note for Teachers

Constructing a 100-item cloze test with an appropriate level of readability, contextual support, and reliability is not easy. For this reason, once a valid and reliable test has been developed, teachers should preserve copies of it carefully for future longitudinal use rather than returning the tests to students.

Over time, I came to see that reading ability develops gradually through large amounts of Pure Optimal Unified Input. When learners eventually become capable of reading independently for pleasure, they begin functioning as increasingly autonomous language acquirers. At that point, reading itself increasingly becomes the engine of further language development.

The cloze test does not create that development. It reveals it — steadily, honestly, semester after semester.

Beniko Mason is the creator of Story-Listening and the author of What If Input Is Enough? Her full guide on cloze testing, Beyond Placement: Cloze Testing as a Longitudinal Measure of Reading Development, is available now.
