The Multidimensional Progress Hypothesis: Why Single-Score Systems Fail Learners and What Honest Measurement Looks Like

1. The Problem: What Single Scores Actually Measure

Consider what happens every time a learner completes a Duolingo lesson. A number moves. A streak counter increments. A progress bar fills. The product communicates, unambiguously, that learning has occurred. But what, precisely, has been measured? The answer is completion. Nothing more.

The dominant progress metric in consumer educational technology — the single XP score, streak, or points total — measures engagement with the product, not competence in the skill. Duolingo's scoring system runs from 0 to 160 and tracks course-level completion. It does not distinguish between a user who answered every question correctly and one who guessed through the same lesson. It does not measure whether any of the vocabulary learned on Monday survived to Friday. It cannot tell you whether a learner who scored 80 today is more capable than the same learner who scored 80 six weeks ago.

This is not a design flaw. It is a design choice. Barasch and Silverman (2023) demonstrated empirically that gamified habit systems (streaks, points, badges) decouple engagement from the behaviours they are meant to reinforce. This is Goodhart's law at work: once a metric becomes a target, it ceases to be a good measure. In EdTech, the target is daily active users. The streak is the mechanism. Learning is the declared purpose and the first casualty.

Ebbinghaus's forgetting curve research, foundational to memory science since 1885, established that retention decays predictably without spaced reinforcement. A completion-based score cannot capture this decay. A user who completed a vocabulary module thirty days ago and has never revisited it still holds the same XP credit they earned on the day they completed it, regardless of whether any of that vocabulary remains retrievable. The score has not moved. The learning has.
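The gap between a fixed score and decaying memory can be sketched numerically. The example below assumes the common exponential simplification of the forgetting curve, R = exp(-t / S), which is not Ebbinghaus's original savings formula. The stability value of 10 days and the 50 XP award are hypothetical numbers chosen only for illustration.

```python
import math


def retention(days_since_review: float, stability_days: float) -> float:
    """Estimated recall probability under an exponential forgetting curve.

    Assumption: R = exp(-t / S), a widely used simplification.
    """
    if days_since_review < 0:
        raise ValueError("days_since_review cannot be negative")
    if stability_days <= 0:
        raise ValueError("stability_days must be positive")
    return math.exp(-days_since_review / stability_days)


# Hypothetical learner: earned 50 XP for a module and never reviewed it.
xp_credit = 50
for day in (0, 7, 30):
    # The XP column never changes; the estimated retention column does.
    print(day, xp_credit, round(retention(day, stability_days=10.0), 2))
```

The XP value printed on each row is identical, while the estimated retention falls with every day that passes without review.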

A score that cannot go down is not a measure of learning. It is a measure of attendance.

2. XP vs Fluency Points™: An Architectural Comparison

The difference between an XP system and FP Rings™ is not a matter of presentation. It is a fundamental architectural difference.

How an XP System Works

In most language learning applications, XP are awarded primarily for completion. Finishing a lesson = X points. Maintaining a streak = bonus. Answering quickly = multiplier. The quality of the response is largely irrelevant. Whether the learner guessed is irrelevant. Whether the content from last week's lesson has already been forgotten is irrelevant.

The logic is that of a video game: continuous upward progression, never downward, frequent rewards to sustain engagement. An XP score cannot go down. An XP level never reflects forgetting. And a level-42 XP total tells the learner nothing about what they can actually do.
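The completion-based logic can be sketched in a few lines. The class name, the flat base_xp value, and the streak bonus below are illustrative assumptions, not any product's actual rules. Real systems may add small accuracy bonuses that this simplification leaves out.

```python
from dataclasses import dataclass


@dataclass
class XpCounter:
    """Hypothetical completion-based counter: the total can only grow."""
    total: int = 0

    def complete_lesson(self, correct: int, attempted: int,
                        base_xp: int = 10, streak_bonus: int = 0) -> int:
        if attempted <= 0 or not 0 <= correct <= attempted:
            raise ValueError("need attempted > 0 and 0 <= correct <= attempted")
        if base_xp < 0 or streak_bonus < 0:
            raise ValueError("awards cannot be negative")
        # Accuracy (correct / attempted) is deliberately unused:
        # the award depends only on finishing the lesson.
        self.total += base_xp + streak_bonus
        return self.total


careful = XpCounter()
guesser = XpCounter()
careful.complete_lesson(correct=10, attempted=10)
guesser.complete_lesson(correct=2, attempted=10)
print(careful.total == guesser.total)
```

Under these assumptions, the learner who answered everything correctly and the learner who guessed end the lesson with the same total.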

How FP Rings™ Works

Fluency Points™ are calculated from four independent dimensions, each weighted according to its contribution to genuine language acquisition, and each scaled by the Depth Index™ — a difficulty scalar that ensures harder, more advanced work generates more precise measurement and greater reward.

Dimension	Weight	What Is Measured	What Causes a Loss
Pronunciation	30%	Azure phoneme-level scoring, prosody, accent fidelity. Depth Index™: ×1.0 (beginner) → ×1.6 (advanced). Bonus: +2 FP per newly mastered phoneme.	Phoneme score below session threshold. No pronunciation exercise = 0 FP Pronunciation.
Comprehension	25%	Listening and reading task accuracy. Context bonus: +3 FP for vocabulary used in sentences. Long-term recall bonus: +5 FP for words recalled after 7+ days.	Incorrect responses on comprehension tasks. No comprehension exercise = 0 FP Comprehension.
Production	25%	Speaking pace + Depth Index™ for syntactic complexity: ×1.0 (simple) → ×1.8 (complex). Conversation completion: +10 FP. First attempt success: +5 FP.	Low fluency or grammatical accuracy scores. No production exercise = 0 FP Production.
Retention	20%	SM-2 spaced repetition quality score. Perfect recall (5/5): ×2.0. Good (4/5): ×1.5. Poor (<3): ×0 — no FP for forgetting.	Items not reviewed on SRS schedule: ring degrades between sessions.

Zero is a legitimate score. It is accuracy, not punishment.
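One way to read the table above is as a weighted sum per dimension. The sketch below uses the four published weights but assumes the combining formula (weight × raw score × Depth Index™), a 0 to 100 raw score per dimension, and the function name session_fp; none of these are confirmed by the specification above. The fixed bonuses in the table are left out for brevity.

```python
from typing import Dict, Mapping, Optional

# Weights taken from the dimension table.
WEIGHTS: Dict[str, float] = {
    "pronunciation": 0.30,
    "comprehension": 0.25,
    "production": 0.25,
    "retention": 0.20,
}


def session_fp(scores: Mapping[str, float],
               depth: Optional[Mapping[str, float]] = None) -> Dict[str, float]:
    """Per-dimension FP for one session (assumed formula, see lead-in)."""
    unknown = set(scores) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")
    depth = depth or {}
    fp: Dict[str, float] = {}
    for dim, weight in WEIGHTS.items():
        # A dimension with no exercise in the session scores zero.
        raw = scores.get(dim, 0.0)
        fp[dim] = weight * raw * depth.get(dim, 1.0)
    return fp


# Hypothetical session: pronunciation and comprehension only.
session = session_fp(
    {"pronunciation": 80.0, "comprehension": 90.0},
    depth={"pronunciation": 1.6},
)
print({dim: round(value, 1) for dim, value in session.items()})
```

Production and Retention come out at zero here because they were not exercised, not because the learner did badly.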

The Direct Comparison

Criterion	XP System (Duolingo, etc.)	FP Rings™ (Voicely Language)
Direction possible	Upward only	Up and down based on real performance
Zero score possible	No — any completion earns points	Yes — zero if the dimension is not assessed
Dimensions measured	1 (completion / engagement)	4 independent: Pronunciation, Comprehension, Production, Retention
Difficulty weighting	Flat — easy and hard tasks earn the same XP	Depth Index™ — advanced work scales the score
Is forgetting reflected?	No — historical score unchanged	Yes — Retention ring degrades if items not reviewed
Does quality matter?	Marginally (minor accuracy bonuses)	Yes — poor performance loses points
CEFR alignment	No — proprietary 0–160 scale	Yes — each ring anchored to CEFR thresholds (A1→C2)
Can it be gamed?	Yes — clicking quickly generates XP	No — Azure, Deepgram, and SM-2 are not fooled
Actionable information	None	High — each ring signals where to focus next

A Duolingo level-42 learner and a Duolingo level-12 learner can have exactly the same actual competence. The XP score measures time spent in the app, not language acquired.

3. The Research Case for Multidimensional Competence

The claim that complex skills are multidimensional is not a product opinion. It is the consensus position of every serious framework in cognitive science, linguistics, and educational research.

The CEFR Framework

The Common European Framework of Reference for Languages (Council of Europe, 2001; updated 2020) defines language competence across distinct receptive skills (listening, reading), productive skills (speaking, writing), and interactive skills (conversation, correspondence). It has never proposed a single number as an adequate representation of a learner's capability. Any product claiming CEFR alignment while reporting a single score is misrepresenting the framework.

Cognitive Load Theory

Sweller's (1988) cognitive load theory describes how limited working memory constrains the acquisition of complex skills. Applied to language learning, it suggests that phonemic discrimination, semantic retrieval, and syntactic production place distinct demands on the learner. Training one dimension does not automatically transfer to others. No single score can represent multiple dimensions simultaneously without falsifying at least one.

Motivation and Identity

Dörnyei's (2009) L2 Motivational Self System identifies the learner's vision of their future fluent self — the Ideal L2 Self — as the primary driver of sustained language acquisition. A single composite score obscures dimensional realities and provides no actionable motivational signal. Multidimensional progress data makes the ideal self feel reachable dimension by dimension.

Retention as a Separate Cognitive Track

Spaced repetition research — from Ebbinghaus (1885) through Leitner (1972) to modern Anki efficacy studies — has consistently established that retention is a distinct cognitive track from production or comprehension. Collapsing them into a single score is not simplification. It is falsification.

4. The Depth Index™: Why Difficulty Must Scale the Score

One of the most persistent failures of single-score systems is their treatment of difficulty. In a flat XP system, a beginner completing a simple vocabulary exercise earns the same points as an advanced learner producing a grammatically complex sentence under time pressure. The tasks are not equivalent. The cognitive effort is not equivalent. The acquisitional significance is not equivalent.

The Depth Index™ is SCORA™'s solution. It is a difficulty scalar — a coefficient applied to each dimension's score that increases as the learner's demonstrated competence increases. Harder, more advanced work is worth more, because it measures more.

The Principle

The Depth Index™ is not a reward for showing up at a harder level. It is a calibration mechanism. A beginner correctly identifying a phoneme is evidence of early acquisition. An advanced learner producing that same phoneme accurately under spontaneous speech pressure, with correct prosody and accent fidelity, is evidence of deep acquisition. These are not the same thing and should not produce the same score.

This matters across the full learning arc. A learner at A2 scoring 80% on an A2-level pronunciation task and a C1 learner scoring 80% on a C1-level task are not equally competent. The Depth Index™ ensures the C1 score reflects the greater precision and complexity of what was actually assessed, so the progress curve over months and years is an accurate map of the journey, not a flat line sustained by repeating easy exercises.

The Depth Index™ in Practice — Language Learning

Pronunciation Depth Index™: scales from ×1.0 at beginner level to ×1.6 at advanced. At the top of the scale, an advanced learner accurately producing a difficult phoneme cluster earns 60% more FP than a beginner producing the same sound correctly. The difficulty is real: phonemic complexity, prosodic demand, and accent fidelity all increase with level.

Production Depth Index™: scales from ×1.0 for simple subject-verb-object sentences to ×1.8 for complex constructions with subordinate clauses, relative clauses, and conditional forms. Syntactic complexity is one of the most reliable markers of L2 acquisition depth in the SLA literature (DeKeyser, 2007).
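The two scalars can be illustrated with a small interpolation. Only the endpoints (×1.0 to ×1.6 for Pronunciation, ×1.0 to ×1.8 for Production) come from the text. The linear interpolation, the use of CEFR level as the difficulty axis, and the function names are assumptions made for this sketch. For Production, the axis would be syntactic complexity rather than learner level.

```python
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")

PRONUNCIATION_MAX = 1.6  # endpoint stated in the text
PRODUCTION_MAX = 1.8     # endpoint stated in the text


def depth_index(position: float, max_scalar: float) -> float:
    """Scalar between 1.0 and max_scalar for a difficulty position in [0, 1].

    Assumption: linear interpolation between the two stated endpoints.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    return 1.0 + (max_scalar - 1.0) * position


def cefr_position(level: str) -> float:
    """Map a CEFR level to [0, 1] (hypothetical difficulty axis)."""
    if level not in CEFR_LEVELS:
        raise ValueError(f"unknown CEFR level: {level}")
    return CEFR_LEVELS.index(level) / (len(CEFR_LEVELS) - 1)


# The same raw 80% produces different scaled scores at different depths.
for level in ("A2", "C1"):
    scaled = 80.0 * depth_index(cefr_position(level), PRONUNCIATION_MAX)
    print(level, round(scaled, 1))
```

With this mapping, an identical raw percentage yields a higher scaled score at C1 than at A2, which is the calibration behaviour described above.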

The Depth Index™ Across Verticals

The Depth Index™ is domain-agnostic. In speech therapy (Voicely Flow), spontaneous production under cognitive load scores higher than a controlled drill — because spontaneous production is the clinical goal. In strength training, a complex compound movement under load scores higher than an isolated machine exercise. In financial literacy, correct judgement on a multi-variable scenario scores higher than a single-rule decision.

The Depth Index™ means SCORA™ does not just track whether a learner is progressing. It tracks how deeply they are progressing — and in which direction the depth is growing. That is what a journey map looks like. That is what an XP counter cannot be.

5. The Sub-Ring Architecture: The Dimension Within the Dimension

Each of the four FP Rings™ is itself a multidimensional competence. SCORA™ does not simply replace one score with four. It creates a measurement hierarchy in which each primary ring rests on independently-assessed sub-dimensions — and the Depth Index™ applies within each sub-dimension.

The Pronunciation Ring — three sub-dimensions

Phonemic accuracy: the phoneme-by-phoneme score from Azure Cognitive Services.

Prosody: rhythm, intonation, and stress patterns.

Accent fidelity: the degree to which pronunciation matches the selected target accent.

The Comprehension Ring — two sub-dimensions

Listening comprehension: accuracy on listening tasks — comprehension questions, dictation, word identification in the spoken stream.

Reading comprehension: accuracy on reading tasks — text comprehension, meaning inference, contextual understanding.

The Production Ring — two sub-dimensions

Fluency: production rate, hesitation frequency and duration, speech smoothness as measured by Deepgram.

Grammatical complexity: syntactic sophistication of the learner's output, evaluated by a GPT model and scaled by the Production Depth Index™.

The Retention Ring — memory across time

Retention is the only dimension that moves between sessions. It measures the persistence of vocabulary and grammatical structures in long-term memory via the SM-2 algorithm.
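For readers unfamiliar with SM-2, the scheduling step can be sketched as follows. The interval and ease-factor rules follow the published SM-2 description. The Card type and function names are hypothetical, and using the pre-review ease factor for the interval is one of several conventions. The multipliers for quality 5, 4, and below 3 come from the Retention row of the dimension table. The table does not specify quality 3, so the ×1.0 default below is purely an assumption.

```python
from dataclasses import dataclass


@dataclass
class Card:
    repetitions: int = 0
    interval_days: int = 0
    ease: float = 2.5  # SM-2 starting ease factor


def sm2_review(card: Card, quality: int) -> Card:
    """Apply one SM-2 review with a recall quality from 0 to 5."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality < 3:
        # Failed recall: restart the sequence, ease factor unchanged.
        return Card(repetitions=0, interval_days=1, ease=card.ease)
    if card.repetitions == 0:
        interval = 1
    elif card.repetitions == 1:
        interval = 6
    else:
        interval = round(card.interval_days * card.ease)
    miss = 5 - quality
    ease = max(1.3, card.ease + 0.1 - miss * (0.08 + miss * 0.02))
    return Card(card.repetitions + 1, interval, ease)


def retention_multiplier(quality: int, q3_multiplier: float = 1.0) -> float:
    """FP multiplier for a recall quality; q3_multiplier is an assumption."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality == 5:
        return 2.0
    if quality == 4:
        return 1.5
    if quality == 3:
        return q3_multiplier  # not specified in the table
    return 0.0  # no FP for forgetting


first = sm2_review(Card(), 5)
second = sm2_review(first, 5)
third = sm2_review(second, 5)
lapsed = sm2_review(third, 2)
print(first.interval_days, second.interval_days, third.interval_days,
      lapsed.interval_days)
```

A lapse returns the item to a one-day interval, which is what lets a retention score react to forgetting instead of only accumulating.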

A learner can see that their global Pronunciation ring stands at 71 — but that phonemic accuracy is at 84 while prosody is at 58. That is the information that changes learning behaviour.
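The relationship between a ring and its sub-dimensions can be sketched with the figures above. The equal-weight mean is an assumption, since the text does not say how sub-scores combine. The accent fidelity value of 71 is a hypothetical number chosen so that the example reproduces the ring score of 71.

```python
from typing import Dict


def ring_score(sub_scores: Dict[str, float]) -> float:
    """Ring value as the unweighted mean of its sub-dimensions (assumption)."""
    if not sub_scores:
        raise ValueError("a ring needs at least one sub-dimension")
    return sum(sub_scores.values()) / len(sub_scores)


def weakest(sub_scores: Dict[str, float]) -> str:
    """Name of the lowest-scoring sub-dimension, i.e. where to focus next."""
    if not sub_scores:
        raise ValueError("a ring needs at least one sub-dimension")
    return min(sub_scores, key=lambda name: sub_scores[name])


pronunciation = {
    "phonemic_accuracy": 84.0,  # from the example in the text
    "prosody": 58.0,            # from the example in the text
    "accent_fidelity": 71.0,    # hypothetical value
}
print(ring_score(pronunciation), weakest(pronunciation))
```

The ring value alone hides the 26-point spread between phonemic accuracy and prosody. The sub-ring view is what identifies prosody as the next target.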

6. The Landscape: What Products Currently Measure

Product	What It Measures	What It Misses
Duolingo	Course completion (0–160 score scale)	Retention decay, pronunciation accuracy, real output quality, difficulty weighting
ELSA Speak	Pronunciation only — English only	Comprehension, retention, production fluency, all non-English languages, Depth Index™
Babbel	Lesson completion + vocabulary recognition	Pronunciation accuracy, retention decay, difficulty calibration
Rosetta Stone	Immersion exposure and lesson progress	All dimensions independently; no integrity constraint; no difficulty scalar
Duolingo English Test	4 subscores: Literacy, Comprehension, Conversation, Production	Assessment product only — not a daily in-app progress architecture

No consumer language app tracks all dimensions simultaneously, in real time, governed by an integrity constraint and a difficulty scalar, as part of the daily practice experience. That is the architectural gap SCORA™ fills.

7. The Framework: SCORA™

SCORA™ (Scored Competence and Observable Ring Architecture) is an open progress measurement framework that tracks multidimensional competence across independently evidenced dimensions of any given discipline. Each dimension is represented as a scored ring. Each ring earns points only when that dimension is directly assessed in a session. The Depth Index™ scales each ring's score according to the difficulty of what was assessed. The overall score is a composite of all rings, but the rings are the primary display, because the breakdown is the truth and the composite is the summary.

The Integrity Constraint

A ring does not rise unless that dimension was assessed in the current session. A user who only completed a reading exercise cannot gain Pronunciation points. A user who performed poorly on a vocabulary recall task will see their Retention ring decline. The only movement permitted without assessment is downward: the Retention ring degrades when items are not reviewed on schedule. Zero on any dimension is a legitimate score. No existing consumer EdTech product enforces this constraint.
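The constraint can be expressed as a small update rule. The sketch below assumes that a ring takes the value of its latest assessment, clamped to a 0 to 100 scale. The real update rule is not specified here, and the function name apply_session is hypothetical. Time-based decay of the Retention ring is outside the scope of this sketch.

```python
from typing import Dict, Mapping


def apply_session(rings: Mapping[str, float],
                  assessed: Mapping[str, float]) -> Dict[str, float]:
    """Return new ring values; only assessed dimensions may change."""
    unknown = set(assessed) - set(rings)
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")
    updated = dict(rings)
    for dim, score in assessed.items():
        # Assumption: the ring is set to the latest assessed score.
        # Poor performance can therefore lower it, down to zero.
        updated[dim] = max(0.0, min(100.0, score))
    return updated


before = {"pronunciation": 61.0, "comprehension": 84.0,
          "production": 55.0, "retention": 70.0}

# Reading exercise plus a poor vocabulary recall task.
after = apply_session(before, {"comprehension": 88.0, "retention": 40.0})
print(after)
```

Pronunciation and Production keep their previous values because neither was assessed, while Retention falls in response to the poor recall result.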

The Universality Principle

Domain	Implementation	Score	Dim. 1	Dim. 2	Dim. 3	Dim. 4
Language learning	FP Rings™ (Voicely Language)	Fluency Points™	Pronunciation	Comprehension	Production	Retention
Mindfulness	Calm Rings (Voicely Calm, forthcoming)	Flow Points	Presence	Regulation	Consistency	Embodiment
Speech therapy	Flow Rings (Voicely Flow, forthcoming)	Speech Points	Articulation	Fluency	Confidence	Carry-Over
Strength training	(hypothetical)	Training Points	Strength	Endurance	Mobility	Recovery
Financial literacy	(hypothetical)	Wealth Points	Awareness	Discipline	Growth	Resilience

FP Rings™ is to SCORA™ what iOS is to the concept of a mobile operating system: a specific, named implementation of a universal architecture.
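The idea of one architecture with several named implementations can be sketched as a data structure. The RingFramework type and the empty_rings helper are hypothetical names introduced for illustration. The domain, score, and dimension names are copied from the table above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RingFramework:
    """A domain-specific implementation of the four-ring architecture."""
    domain: str
    score_name: str
    dimensions: Tuple[str, str, str, str]


FP_RINGS = RingFramework(
    domain="Language learning",
    score_name="Fluency Points",
    dimensions=("Pronunciation", "Comprehension", "Production", "Retention"),
)

FLOW_RINGS = RingFramework(
    domain="Speech therapy",
    score_name="Speech Points",
    dimensions=("Articulation", "Fluency", "Confidence", "Carry-Over"),
)


def empty_rings(framework: RingFramework) -> Dict[str, float]:
    """Every ring starts at zero until its dimension is assessed."""
    return {dimension: 0.0 for dimension in framework.dimensions}


print(empty_rings(FP_RINGS))
print(empty_rings(FLOW_RINGS))
```

Only the labels change between domains. The structure of four independently scored rings stays the same.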

8. Why SCORA™ Becomes the Standard

The measurement honesty argument: AI is making granular assessment cheaper by an order of magnitude. Products that report a single engagement-based score will face a credibility problem — their score will contradict what learners can verify independently. SCORA™ is the infrastructure for telling the truth.

The regulatory alignment argument: CEFR, ACTFL, DELF, DELE, JLPT, IELTS, TOEFL — all use multidimensional assessment frameworks. No certification body anywhere uses a single composite score as a complete representation of language ability. SCORA™ is alignment with where the standards already are.

The identity argument: A learner who sees their Pronunciation ring at 61 and Comprehension ring at 84 — and understands that prosody is holding the Pronunciation ring back — has a specific, actionable picture of where their future self is being built. That specificity is motivationally powerful in a way that a number moving from 142 to 143 XP never can be.

Conclusion

The single-score progress system is not a neutral design choice. It is an architectural decision that systematically misrepresents how learning works, treats beginner and advanced effort as equivalent, optimises for engagement over growth, and leaves learners without the information they need to improve.

SCORA™ corrects all of this simultaneously. Its integrity constraint separates measurement from attendance-tracking. Its sub-ring architecture reveals the dimension within the dimension. Its Depth Index™ ensures that the further a learner travels, the more precisely their journey is mapped. And its universality principle means it is not a language learning solution — it is an EdTech infrastructure standard applicable to any discipline where competence can be independently measured.

FP Rings™, Voicely Language's implementation of SCORA™, is the first consumer product to apply all of these principles in real time, at scale, across four independently-evidenced dimensions of language competence. It is live. It is measurable. And it is the first evidence that honest progress tracking is not only theoretically sound but commercially viable.

SCORA™ is an open framework. Organisations interested in implementing it in their products are invited to contact Voicely Language at voicelylanguage.app/research.

A Note on Terminology

This paper uses "fluency" and "fluent" in the applied linguistics sense: the ability to communicate in a target language with ease, accuracy, and appropriate speed. These terms do not imply bilingualism.

References

Barasch, A., & Silverman, J. (2023). Streak pathology and habit gamification. Journal of Consumer Research.
Council of Europe. (2001). Common European Framework of Reference for Languages. Cambridge University Press.
Council of Europe. (2020). CEFR Companion Volume. Council of Europe Publishing.
DeKeyser, R. (Ed.). (2007). Practice in a second language. Cambridge University Press. https://doi.org/10.1017/CBO9780511667275
Dörnyei, Z. (2009). The L2 Motivational Self System. In Motivation, language identity and the L2 self (pp. 9–42). Multilingual Matters.
Dörnyei, Z., & Ushioda, E. (2011). Teaching and researching motivation (2nd ed.). Pearson Longman.
Ebbinghaus, H. (1885). Über das Gedächtnis. Duncker & Humblot.
Leitner, S. (1972). So lernt man lernen. Herder.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge University Press. https://doi.org/10.1017/CBO9781139524759
Sweller, J. (1988). Cognitive load during problem solving. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4