1. The Problem: What Single Scores Actually Measure
Consider what happens every time a learner completes a Duolingo lesson. A number moves. A streak counter increments. A progress bar fills. The product communicates, unambiguously, that learning has occurred. But what, precisely, has been measured? The answer is completion. Nothing more.
The dominant progress metric in consumer educational technology — the single XP score, streak, or points total — measures engagement with the product, not competence in the skill. Duolingo's scoring system runs from 0 to 160 and tracks course-level completion. It does not distinguish between a user who answered every question correctly and one who guessed through the same lesson. It does not measure whether any of the vocabulary learned on Monday survived to Friday. It cannot tell you whether a learner who scored 80 today is more capable than the same learner who scored 80 six weeks ago.
This is not a design flaw. It is a design choice. Barasch and Silverman (2023) demonstrated empirically that gamified habit systems (streaks, points, badges) decouple engagement from the behaviours they are meant to reinforce. This is Goodhart's law at work: once a measure becomes a target, it ceases to be a good measure. In EdTech, the target is daily active users. The streak is the mechanism. Learning is the declared purpose and the first casualty.
Ebbinghaus's forgetting curve research, foundational to memory science since 1885, established that retention decays predictably without spaced reinforcement. A completion-based score cannot capture this decay. A user who completed a vocabulary module thirty days ago and has never revisited it still holds the same XP credit as on the day they completed it, regardless of whether any of that vocabulary remains retrievable. The score has not moved. The learning has.
A score that cannot go down is not a measure of learning. It is a measure of attendance.
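The gap between a static completion score and decaying memory can be sketched in a few lines. The exponential form `exp(-t / S)` is a common simplification of the forgetting curve, not Ebbinghaus's original fit, and the function names and the stability value of 10 days are illustrative assumptions rather than anything a specific product uses.

```python
import math


def estimated_retention(days_since_review: float, stability_days: float) -> float:
    """Estimated recall probability under a simple exponential forgetting model.

    Assumption: retention = exp(-t / S), where S (stability) is a hypothetical
    per-item parameter. Real memory models are more elaborate.
    """
    if days_since_review < 0 or stability_days <= 0:
        raise ValueError("days_since_review must be >= 0 and stability_days > 0")
    return math.exp(-days_since_review / stability_days)


def completion_xp(xp_awarded: int, days_since_completion: float) -> int:
    """A completion-based score ignores elapsed time entirely."""
    return xp_awarded


if __name__ == "__main__":
    # XP stays flat while estimated retention falls.
    for day in (0, 7, 30):
        xp = completion_xp(50, day)
        retention = estimated_retention(day, stability_days=10.0)
        print(f"day {day:>2}: xp={xp}  estimated retention={retention:.2f}")
```

Under these assumptions the XP column never changes, while the retention estimate falls with every day that passes without review.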
2. XP vs Fluency Points™: An Architectural Comparison
The difference between an XP system and FP Rings™ is not a matter of presentation. It is a fundamental architectural difference.
How an XP System Works
In most language learning applications, XP are awarded primarily for completion. Finishing a lesson = X points. Maintaining a streak = bonus. Answering quickly = multiplier. The quality of the response is largely irrelevant. Whether the learner guessed is irrelevant. Whether the content from last week's lesson has already been forgotten is irrelevant.
The logic is that of a video game: continuous upward progression, never downward, frequent rewards to sustain engagement. An XP score cannot go down. An XP level never reflects forgetting. And an XP level of 42 gives a learner no information about what they can actually do.
How FP Rings™ Works
Fluency Points™ are calculated from four independent dimensions, each weighted according to its contribution to genuine language acquisition, and each scaled by the Depth Index™ — a difficulty scalar that ensures harder, more advanced work generates more precise measurement and greater reward.
| Dimension | Weight | What Is Measured | What Causes a Loss |
|---|---|---|---|
| Pronunciation | 30% | Azure phoneme-level scoring, prosody, accent fidelity. Depth Index™: ×1.0 (beginner) → ×1.6 (advanced). Bonus: +2 FP per newly mastered phoneme. | Phoneme score below session threshold. No pronunciation exercise = 0 FP Pronunciation. |
| Comprehension | 25% | Listening and reading task accuracy. Context bonus: +3 FP for vocabulary used in sentences. Long-term recall bonus: +5 FP for words recalled after 7+ days. | Incorrect responses on comprehension tasks. No comprehension exercise = 0 FP Comprehension. |
| Production | 25% | Speaking pace + Depth Index™ for syntactic complexity: ×1.0 (simple) → ×1.8 (complex). Conversation completion: +10 FP. First attempt success: +5 FP. | Low fluency or grammatical accuracy scores. No production exercise = 0 FP Production. |
| Retention | 20% | SM-2 spaced repetition quality score. Perfect recall (5/5): ×2.0. Good (4/5): ×1.5. Poor (<3): ×0 — no FP for forgetting. | Items not reviewed on SRS schedule: ring degrades between sessions. |
Zero is a legitimate score. It is accuracy, not punishment.
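One way to read the weights in the table is as coefficients of a weighted sum over the four rings. The sketch below assumes each ring is reported on a 0 to 100 scale and that the overall score is a plain weighted sum. The source gives the weights but not the exact formula, so `composite_score` and the sample values are hypothetical.

```python
# Weights taken from the dimension table above.
WEIGHTS = {
    "pronunciation": 0.30,
    "comprehension": 0.25,
    "production": 0.25,
    "retention": 0.20,
}


def composite_score(rings: dict[str, float]) -> float:
    """Weighted sum of ring scores (assumption: every ring is on a 0-100 scale)."""
    missing = WEIGHTS.keys() - rings.keys()
    if missing:
        raise KeyError(f"missing ring scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= rings[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(WEIGHTS[name] * rings[name] for name in WEIGHTS)


if __name__ == "__main__":
    # A ring that was never assessed contributes a legitimate zero.
    rings = {"pronunciation": 61, "comprehension": 84, "production": 70, "retention": 0}
    print(f"composite: {composite_score(rings):.1f}")
```

The zero on Retention pulls the composite down without hiding the other three rings, which is why the rings rather than the composite carry the diagnostic information.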
The Direct Comparison
| Criterion | XP System (Duolingo, etc.) | FP Rings™ (Voicely Language) |
|---|---|---|
| Direction possible | Upward only | Up and down based on real performance |
| Zero score possible | No — any completion earns points | Yes — zero if the dimension is not assessed |
| Dimensions measured | 1 (completion / engagement) | 4 independent: Pronunciation, Comprehension, Production, Retention |
| Difficulty weighting | Flat — easy and hard tasks earn the same XP | Depth Index™ — advanced work scales the score |
| Is forgetting reflected? | No — historical score unchanged | Yes — Retention ring degrades if items not reviewed |
| Does quality matter? | Marginally (minor accuracy bonuses) | Yes — poor performance loses points |
| CEFR alignment | No — proprietary 0–160 scale | Yes — each ring anchored to CEFR thresholds (A1→C2) |
| Can it be gamed? | Yes: clicking quickly generates XP | Much harder: points depend on measured speech quality (Azure, Deepgram) and scheduled recall (SM-2), not on taps |
| Actionable information | None | High — each ring signals where to focus next |
A Duolingo level-42 learner and a Duolingo level-12 learner can have exactly the same actual competence. The XP score measures time spent in the app, not language acquired.
3. The Research Case for Multidimensional Competence
The claim that complex skills are multidimensional is not a product opinion. It is the consensus position of every serious framework in cognitive science, linguistics, and educational research.
The CEFR Framework
The Common European Framework of Reference for Languages (Council of Europe, 2001; updated 2020) defines language competence across distinct receptive skills (listening, reading), productive skills (speaking, writing), and interactive skills (conversation, correspondence). It has never proposed a single number as an adequate representation of a learner's capability. Any product claiming CEFR alignment while reporting a single score is misrepresenting the framework.
Cognitive Load Theory
Sweller's (1988) cognitive load theory holds that working memory is limited and that learning depends on how heavily a task loads it. Phonemic discrimination, semantic retrieval, and syntactic production load the learner in different ways, and training one does not automatically transfer to the others. No single score can represent several such dimensions simultaneously without misrepresenting at least one.
Motivation and Identity
Dörnyei's (2009) L2 Motivational Self System identifies the learner's vision of their future fluent self — the Ideal L2 Self — as the primary driver of sustained language acquisition. A single composite score obscures dimensional realities and provides no actionable motivational signal. Multidimensional progress data makes the ideal self feel reachable dimension by dimension.
Retention as a Separate Cognitive Track
Spaced repetition research — from Ebbinghaus (1885) through Leitner (1972) to modern Anki efficacy studies — has consistently established that retention is a distinct cognitive track from production or comprehension. Collapsing them into a single score is not simplification. It is falsification.
4. The Depth Index™: Why Difficulty Must Scale the Score
One of the most persistent failures of single-score systems is their treatment of difficulty. In a flat XP system, a beginner completing a simple vocabulary exercise earns the same points as an advanced learner producing a grammatically complex sentence under time pressure. The tasks are not equivalent. The cognitive effort is not equivalent. The acquisitional significance is not equivalent.
The Depth Index™ is SCORA™'s solution. It is a difficulty scalar — a coefficient applied to each dimension's score that increases as the learner's demonstrated competence increases. Harder, more advanced work is worth more, because it measures more.
The Principle
The Depth Index™ is not a reward for showing up at a harder level. It is a calibration mechanism. A beginner correctly identifying a phoneme is evidence of early acquisition. An advanced learner producing that same phoneme accurately under spontaneous speech pressure, with correct prosody and accent fidelity, is evidence of deep acquisition. These are not the same thing and should not produce the same score.
This matters across the full learning arc. An A2 learner scoring 80% on an A2-level pronunciation task and a C1 learner scoring 80% on a C1-level task are not equally competent, because the C1 task assesses more. The Depth Index™ ensures the C1 score reflects the greater precision and complexity of what was actually assessed, so that the progress curve over months and years is an accurate map of the journey, not a flat line sustained by repeating easy exercises.
The Depth Index™ in Practice — Language Learning
Pronunciation Depth Index™: scales from ×1.0 at beginner level to ×1.6 at advanced. An advanced learner nailing a difficult phoneme cluster earns 60% more FP than a beginner producing the same sound correctly. The difficulty is real — phonemic complexity, prosodic demand, and accent fidelity all increase with level.
Production Depth Index™: scales from ×1.0 for simple subject-verb-object sentences to ×1.8 for complex constructions with subordinate clauses, relative clauses, and conditional forms. Syntactic complexity is one of the most reliable markers of L2 acquisition depth in the SLA literature (DeKeyser, 2007).
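The scaling described above can be sketched as an interpolation between the two published endpoints. The text gives only the endpoints (×1.0 and ×1.6 for pronunciation), so everything in between is an assumption here: the sketch maps "beginner" to A1 and "advanced" to C2 and moves in equal steps across the six CEFR levels. The function names are hypothetical.

```python
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


def depth_index(level: str, max_scalar: float) -> float:
    """Interpolate a difficulty scalar from 1.0 (A1) up to max_scalar (C2).

    Assumption: equal steps between CEFR levels. The source specifies only
    the endpoints, not the shape of the curve between them.
    """
    position = CEFR_LEVELS.index(level)  # raises ValueError for an unknown level
    fraction = position / (len(CEFR_LEVELS) - 1)
    return 1.0 + (max_scalar - 1.0) * fraction


def scaled_points(base_points: float, level: str, max_scalar: float) -> float:
    """Apply the difficulty scalar to a base score for one dimension."""
    return base_points * depth_index(level, max_scalar)


if __name__ == "__main__":
    # Pronunciation endpoint from the text: x1.6 at the advanced end.
    for level in CEFR_LEVELS:
        print(level, f"{scaled_points(10, level, max_scalar=1.6):.2f}")
```

The Production scalar is tied to sentence complexity rather than learner level, so it would need its own mapping. The same interpolation idea applies, with the ×1.8 endpoint in place of ×1.6.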
The Depth Index™ Across Verticals
The Depth Index™ is domain-agnostic. In speech therapy (Voicely Flow), spontaneous production under cognitive load scores higher than a controlled drill — because spontaneous production is the clinical goal. In strength training, a complex compound movement under load scores higher than an isolated machine exercise. In financial literacy, correct judgement on a multi-variable scenario scores higher than a single-rule decision.
The Depth Index™ means SCORA™ does not just track whether a learner is progressing. It tracks how deeply they are progressing — and in which direction the depth is growing. That is what a journey map looks like. That is what an XP counter cannot be.
5. The Sub-Ring Architecture: The Dimension Within the Dimension
Each of the four FP Rings™ is itself a multidimensional competence. SCORA™ does not simply replace one score with four. It creates a measurement hierarchy in which each primary ring rests on independently assessed sub-dimensions, and the Depth Index™ applies within each sub-dimension.
The Pronunciation Ring — three sub-dimensions
Phonemic accuracy: the phoneme-by-phoneme score from Azure Cognitive Services.
Prosody: rhythm, intonation, and stress patterns.
Accent fidelity: the degree to which pronunciation matches the selected target accent.
The Comprehension Ring — two sub-dimensions
Listening comprehension: accuracy on listening tasks — comprehension questions, dictation, word identification in the spoken stream.
Reading comprehension: accuracy on reading tasks — text comprehension, meaning inference, contextual understanding.
The Production Ring — two sub-dimensions
Fluency: production rate, hesitation frequency and duration, speech smoothness as measured by Deepgram.
Grammatical complexity: syntactic sophistication of the learner's output, evaluated by GPT and scaled by the Production Depth Index™.
The Retention Ring — memory across time
Retention is the only dimension that moves between sessions. It measures the persistence of vocabulary and grammatical structures in long-term memory via the SM-2 algorithm.
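For readers unfamiliar with SM-2, the scheduling step can be sketched as follows. This follows the widely circulated description of the algorithm (quality grades 0 to 5, intervals of 1 and 6 days, then interval × ease factor, with a floor of 1.3 on the ease factor). Implementations differ on details such as whether the ease factor changes after a lapse, and nothing here is claimed to match the Voicely implementation. `CardState` and `sm2_review` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardState:
    repetitions: int = 0      # consecutive successful reviews
    interval_days: int = 0    # days until the next scheduled review
    ease_factor: float = 2.5  # SM-2 starting ease factor


def sm2_review(state: CardState, quality: int) -> CardState:
    """Return the next scheduling state after a review graded 0-5."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be an integer from 0 to 5")

    if quality >= 3:
        # Successful recall: lengthen the interval.
        if state.repetitions == 0:
            interval = 1
        elif state.repetitions == 1:
            interval = 6
        else:
            interval = round(state.interval_days * state.ease_factor)
        repetitions = state.repetitions + 1
    else:
        # Lapse: restart the repetition sequence.
        repetitions = 0
        interval = 1

    # Ease factor update; this variant applies it after every review.
    penalty = (5 - quality) * (0.08 + (5 - quality) * 0.02)
    ease = max(1.3, state.ease_factor + 0.1 - penalty)
    return CardState(repetitions, interval, ease)


if __name__ == "__main__":
    state = CardState()
    for grade in (5, 5, 5, 2):
        state = sm2_review(state, grade)
        print(grade, state)
```

A lapse resets the interval to one day, which is the scheduling counterpart of a Retention ring that can fall as well as rise.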
A learner can see that their global Pronunciation ring stands at 71 — but that phonemic accuracy is at 84 while prosody is at 58. That is the information that changes learning behaviour.
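The ring and sub-ring hierarchy can be sketched as a simple roll-up. The sketch assumes a ring is the unweighted mean of its sub-dimensions, which the text does not specify. The accent fidelity value of 71 is a hypothetical figure chosen so that the example reproduces the ring score of 71 quoted above.

```python
from statistics import mean


def ring_score(sub_scores: dict[str, float]) -> float:
    """Roll sub-dimension scores up into one ring (assumption: unweighted mean)."""
    if not sub_scores:
        raise ValueError("a ring needs at least one sub-dimension")
    return mean(sub_scores.values())


def weakest_sub_dimension(sub_scores: dict[str, float]) -> str:
    """Name the sub-dimension that is holding the ring back."""
    return min(sub_scores, key=sub_scores.get)


if __name__ == "__main__":
    pronunciation = {
        "phonemic_accuracy": 84,
        "prosody": 58,
        "accent_fidelity": 71,  # hypothetical value, not from the source
    }
    print("ring:", ring_score(pronunciation))
    print("focus next on:", weakest_sub_dimension(pronunciation))
```

Keeping the sub-scores alongside the roll-up is what lets the display name the weakest sub-dimension instead of only showing the ring total.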
6. The Landscape: What Products Currently Measure
| Product | What It Measures | What It Misses |
|---|---|---|
| Duolingo | Course completion (0–160 scale), plus XP and streaks | Retention decay, pronunciation accuracy, real output quality, difficulty weighting |
| ELSA Speak | Pronunciation only — English only | Comprehension, retention, production fluency, all non-English languages, Depth Index™ |
| Babbel | Lesson completion + vocabulary recognition | Pronunciation accuracy, retention decay, difficulty calibration |
| Rosetta Stone | Immersion exposure and lesson progress | All dimensions independently; no integrity constraint; no difficulty scalar |
| Duolingo English Test | 4 subscores: Literacy, Comprehension, Conversation, Production | Assessment product only — not a daily in-app progress architecture |
No consumer language app tracks all dimensions simultaneously, in real time, governed by an integrity constraint and a difficulty scalar, as part of the daily practice experience. That is the architectural gap SCORA™ fills.
7. The Framework: SCORA™
SCORA™ — Scored Competence and Observable Ring Architecture — is an open progress measurement framework that tracks multidimensional competence across independently-evidenced dimensions of any given discipline. Each dimension is represented as a scored ring. Each ring updates only when that dimension is directly assessed in a session. The Depth Index™ scales each ring's score according to the difficulty of what was assessed. The overall score is a composite of all rings — but the rings are the primary display, because the breakdown is the truth and the composite is the summary.
The Integrity Constraint
A ring does not gain points unless that dimension was assessed in the current session. A user who only completed a reading exercise cannot gain Pronunciation points. A user who performed poorly on a vocabulary recall task will see their Retention ring decline. The one ring that moves between sessions is Retention, which degrades when scheduled reviews are missed. Zero on any dimension is a legitimate score. No existing consumer EdTech product enforces this constraint.
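The in-session part of the constraint can be sketched as an update that touches only the rings for which the session produced evidence. How a ring blends old and new evidence is not specified in the text, so the sketch simply takes the session's score for each assessed dimension. `apply_session` is a hypothetical name, and between-session Retention decay is left out.

```python
def apply_session(
    rings: dict[str, float], session_scores: dict[str, float]
) -> dict[str, float]:
    """Return updated rings; dimensions absent from session_scores are untouched.

    Assumption: an assessed ring takes the session's score directly.
    """
    unknown = session_scores.keys() - rings.keys()
    if unknown:
        raise KeyError(f"unknown dimensions: {sorted(unknown)}")
    updated = dict(rings)           # never mutate the caller's state
    updated.update(session_scores)  # only assessed dimensions change
    return updated


if __name__ == "__main__":
    before = {"pronunciation": 61, "comprehension": 80, "production": 70, "retention": 65}

    # A reading-only session cannot move the Pronunciation ring.
    after_reading = apply_session(before, {"comprehension": 84})
    print(after_reading)

    # A poor recall session lowers the Retention ring.
    after_recall = apply_session(after_reading, {"retention": 40})
    print(after_recall)
```

Because the function only writes to dimensions present in `session_scores`, there is no code path by which a reading exercise can raise a speaking score.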
The Universality Principle
| Domain | Implementation | Score | Dim. 1 | Dim. 2 | Dim. 3 | Dim. 4 |
|---|---|---|---|---|---|---|
| Language learning | FP Rings™ (Voicely Language) | Fluency Points™ | Pronunciation | Comprehension | Production | Retention |
| Mindfulness | Calm Rings (Voicely Calm, forthcoming) | Flow Points | Presence | Regulation | Consistency | Embodiment |
| Speech therapy | Flow Rings (Voicely Flow, forthcoming) | Speech Points | Articulation | Fluency | Confidence | Carry-Over |
| Strength training | (hypothetical) | Training Points | Strength | Endurance | Mobility | Recovery |
| Financial literacy | (hypothetical) | Wealth Points | Awareness | Discipline | Growth | Resilience |
FP Rings™ is to SCORA™ what iOS is to the mobile operating system: a specific, named implementation of a general architecture.
8. Why SCORA™ Becomes the Standard
The measurement honesty argument: AI is making granular assessment cheaper by an order of magnitude. Products that report a single engagement-based score will face a credibility problem — their score will contradict what learners can verify independently. SCORA™ is the infrastructure for telling the truth.
The regulatory alignment argument: CEFR, ACTFL, DELF, DELE, JLPT, IELTS, TOEFL — all use multidimensional assessment frameworks. No certification body anywhere uses a single composite score as a complete representation of language ability. SCORA™ is alignment with where the standards already are.
The identity argument: A learner who sees their Pronunciation ring at 61 and Comprehension ring at 84 — and understands that prosody is holding the Pronunciation ring back — has a specific, actionable picture of where their future self is being built. That specificity is motivationally powerful in a way that a number moving from 142 to 143 XP never can be.
Conclusion
The single-score progress system is not a neutral design choice. It is an architectural decision that systematically misrepresents how learning works, treats beginner and advanced effort as equivalent, optimises for engagement over growth, and leaves learners without the information they need to improve.
SCORA™ corrects all of this simultaneously. Its integrity constraint separates measurement from attendance-tracking. Its sub-ring architecture reveals the dimension within the dimension. Its Depth Index™ ensures that the further a learner travels, the more precisely their journey is mapped. And its universality principle means it is not a language learning solution — it is an EdTech infrastructure standard applicable to any discipline where competence can be independently measured.
FP Rings™, Voicely Language's implementation of SCORA™, is the first consumer product to apply all of these principles in real time, at scale, across four independently-evidenced dimensions of language competence. It is live. It is measurable. And it is the first evidence that honest progress tracking is not only theoretically sound but commercially viable.
SCORA™ is an open framework. Organisations interested in implementing it in their products are invited to contact Voicely Language at voicelylanguage.app/research.
A Note on Terminology
This paper uses "fluency" and "fluent" in the applied linguistics sense: the ability to communicate in a target language with ease, accuracy, and appropriate speed. These terms do not imply bilingualism.
References
- Barasch, A., & Silverman, J. (2023). Streak pathology and habit gamification. Journal of Consumer Research.
- Council of Europe. (2001). Common European Framework of Reference for Languages. Cambridge University Press.
- Council of Europe. (2020). CEFR Companion Volume. Council of Europe Publishing.
- DeKeyser, R. (Ed.). (2007). Practice in a second language. Cambridge University Press. https://doi.org/10.1017/CBO9780511667275
- Dörnyei, Z. (2009). The L2 Motivational Self System. In Motivation, language identity and the L2 self (pp. 9–42). Multilingual Matters.
- Dörnyei, Z., & Ushioda, E. (2011). Teaching and researching motivation (2nd ed.). Pearson Longman.
- Ebbinghaus, H. (1885). Über das Gedächtnis. Duncker & Humblot.
- Leitner, S. (1972). So lernt man lernen. Herder.
- Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge University Press. https://doi.org/10.1017/CBO9781139524759
- Sweller, J. (1988). Cognitive load during problem solving. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4
© 2026 Voicely Language / Mervine Gowry. SCORA™ and Depth Index™ coined May 18, 2026. All rights reserved.