Diagnostic Methodology

Published by the Society. Updated with each recalibration cycle.

1. What the Diagnostic measures

The Diagnostic is a structured assessment of five prefrontal cortex faculties — the cognitive capacities most directly engaged in sustained intellectual work and most measurably affected by habitual short-form media consumption. Each faculty was selected because it meets three criteria: it is trainable through deliberate practice, it is measurable with established psychometric methods, and its decline is observable in daily life.

Reasoning — Conditional Logic

The capacity to hold multiple conditional rules in working awareness, track entailment across variables, and identify which facts in a scenario are load-bearing and which are noise. This faculty underlies legal reasoning, debugging, negotiation, and any situation requiring structured inference. It is measured through short scenarios with named entities and conditional rules, where the Fellow must select the single valid inference from four options. Difficulty scales not by adding more rules, but by embedding rules in naturalistic prose and introducing plausible distractors — mirroring the way real-world reasoning demands signal extraction from noise.

Working Memory — Dual-Modality Span

The capacity to hold and manipulate information across two sensory channels simultaneously. This faculty is engaged whenever attention must be divided between concurrent streams — following a presentation while reading slides, processing driving directions while monitoring road signs. It is measured through a dual-modality task: visual sequences on a grid paired with auditory digit sequences, where the Fellow does not know in advance which modality will be tested. Difficulty scales by increasing sequence length, presentation speed, and cross-modal interference.

Reading Recall and Comprehension

The capacity for deliberate semantic retention — reading a passage with the explicit intention of retaining it, and doing so under cognitive load and across a delay. This is not surprise memory; the Fellow is told in advance that they will be tested both immediately and again approximately twenty minutes later. The faculty measured is the one most directly eroded by habitual skimming: the ability to read something once, hold it, and retrieve it later. Difficulty scales by passage complexity, inference depth, and density of confusable details.

Sustained Attention — Continuous Reading

The capacity to maintain vigilance and semantic tracking under sustained cognitive load. Placed deliberately in the middle of the Diagnostic, this subtest measures attention not when the Fellow is fresh but when fatigue has already begun, because that is when the faculty matters. Serious prose scrolls at a fixed reading pace; the Fellow’s task is simply to read, with unexpected comprehension probes appearing at intervals. A Fellow whose attention drifts will miss the probes, and no response strategy can substitute for having read the text. Difficulty scales by reading speed, prose density, and probe depth.

Processing Speed and Fatigue Slope

Processing speed is measured twice within a single Diagnostic sitting: once near the beginning, when the Fellow is fresh, and once at the end, under fatigue. The raw score from each administration is recorded, but the primary signal is the delta between them: the endurance score. A Fellow who scores 45 fresh and 43 tired demonstrates strong cognitive endurance. A Fellow who scores 45 fresh and 28 tired demonstrates the exact deficit the Society exists to train. This is the most diagnostic signal in the battery for the thesis that sustained cognitive performance, not peak performance, is what habitual distraction erodes.
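The endurance arithmetic can be sketched in a few lines. This is a minimal illustration only: the function name and the sign convention (fresh score minus tired score) are assumptions, not the Society's published definition.

```python
def endurance_delta(fresh_score: float, tired_score: float) -> float:
    """Drop in processing-speed score from the fresh to the fatigued administration.

    Sign convention is an assumption: a larger value means a steeper fatigue slope.
    """
    return fresh_score - tired_score


# The two Fellows described above.
strong_endurance = endurance_delta(45, 43)  # small drop
steep_slope = endurance_delta(45, 28)       # large drop
```

Under this convention the first Fellow loses 2 points across the sitting and the second loses 17.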

2. How items are calibrated

Every item in the Diagnostic has three statistical parameters, estimated from real response data:

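Difficulty: the ability level at which a Fellow’s probability of answering correctly sits halfway between the guessing floor and certainty. Absent guessing, this is the point at which the item is answered correctly 50% of the time. An item with high difficulty requires high ability to reach that point.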
Discrimination: how sharply the item separates Fellows who are above its difficulty level from those below. A high-discrimination item is a precise measurement tool; a low-discrimination item adds noise.
Guessability: the floor probability of a correct response. On a four-option multiple-choice item, random guessing produces a 25% hit rate. The guessability parameter captures this floor so that the scoring model does not confuse lucky guesses with genuine ability.

These three parameters define what is known in psychometrics as a three-parameter logistic model, or 3PL — the same mathematical framework used by the GRE, LSAT, SAT, and major clinical cognitive batteries. The model produces an S-shaped curve for each item: given a Fellow’s ability level, the curve predicts the probability that the Fellow answers correctly. When a Fellow takes the Diagnostic, the scoring engine finds the ability estimate that best explains their entire pattern of correct and incorrect responses across all items.
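For readers who want the S-shaped curve in concrete form, the standard 3PL response function can be written as below. The parameter names (theta, a, b, c) follow common IRT convention and are not taken from the Society's scoring engine.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Standard 3PL probability of a correct response.

    theta: Fellow ability
    a: discrimination (steepness of the curve)
    b: difficulty (location of the curve)
    c: guessability (lower asymptote, e.g. 0.25 for four options)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

With c = 0.25, a Fellow whose ability equals the item's difficulty has a probability of 0.625, halfway between the guessing floor and 1. A Fellow far below the difficulty approaches 0.25 and never drops beneath it.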

How parameters are learned

Item parameters are not set by the item author. They are estimated from data — specifically, from the pattern of responses across many Fellows of varying ability levels. During the initial calibration period, every Fellow sees the same fixed set of items within each subtest. Every response contributes to the statistical estimation of that item’s difficulty, discrimination, and guessability. Parameters stabilize as the response pool grows; confidence intervals on each parameter narrow with sample size.

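Ability estimates are computed with Expected A Posteriori (EAP) estimation, which reports the mean of the posterior distribution of ability (a prior combined with the likelihood of the observed responses) rather than the single value that maximizes the likelihood. EAP is more stable than maximum likelihood estimation for the relatively short adaptive tests used here, and it produces well-behaved estimates even when a Fellow answers every item correctly or incorrectly, a case in which the maximum likelihood estimate has no finite value.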
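A minimal sketch of EAP scoring on a fixed grid of ability values follows. It assumes a standard normal prior, a grid from -4 to 4, and item parameters supplied as (a, b, c) tuples; all of these are illustrative choices, and the function names are hypothetical.

```python
import math


def p_correct(theta, a, b, c):
    """Standard 3PL response probability."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def eap_estimate(responses, items, grid_min=-4.0, grid_max=4.0, n_points=81):
    """Return (posterior mean, posterior SD) of ability.

    responses: list of 0/1 outcomes
    items: list of (a, b, c) tuples, one per response
    Prior is standard normal (an assumption of this sketch).
    """
    step = (grid_max - grid_min) / (n_points - 1)
    thetas = [grid_min + i * step for i in range(n_points)]
    weights = []
    for theta in thetas:
        w = math.exp(-0.5 * theta * theta)  # unnormalized prior density
        for u, (a, b, c) in zip(responses, items):
            p = p_correct(theta, a, b, c)
            w *= p if u == 1 else (1.0 - p)  # likelihood of this response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(thetas, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(thetas, weights)) / total
    return mean, math.sqrt(var)
```

Because the prior pulls the estimate toward the centre of the scale, a response pattern of all correct or all incorrect still yields a finite estimate and a finite posterior standard deviation.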

3. How the test adapts

During the initial calibration period (approximately the first three months of operation), the Diagnostic uses a fixed set of items. Every Fellow sees the same items in the same order within each subtest. This phase exists to build the response data needed to calibrate item parameters accurately.

From approximately month four onward, the Diagnostic becomes fully adaptive using Computerized Adaptive Testing (CAT). Under CAT, each item is selected in real time based on the Fellow’s performance so far. The algorithm chooses the next item that will provide the most statistical information about the Fellow’s ability, given everything the test has already learned.

In practice, this means a Fellow who answers three items correctly at moderate difficulty will see a harder fourth item. A Fellow who misses two items at moderate difficulty will see an easier next item. The test continuously narrows in on the Fellow’s true ability level, spending most of its items at the boundary of what the Fellow can and cannot do.
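The idea of choosing the most informative next item can be sketched with the standard Fisher information formula for a 3PL item. The pool layout and function names here are hypothetical, and the sketch ignores the selection constraints that apply in practice.

```python
import math


def p_correct(theta, a, b, c):
    """Standard 3PL response probability."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def pick_next_item(theta_hat, pool, administered):
    """Return the id of the unserved item with the most information at theta_hat.

    pool: dict mapping item id -> (a, b, c)
    administered: set of item ids already served in this subtest
    """
    candidates = [item_id for item_id in pool if item_id not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))


pool = {
    "easy": (1.0, -2.0, 0.25),
    "medium": (1.0, 0.0, 0.25),
    "hard": (1.0, 2.0, 0.25),
}
```

With this pool, a current estimate of 0.0 selects the medium item and an estimate of 2.0 selects the hard one, which matches the behaviour described above: as the estimate rises, so does the difficulty of the next item.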

Constraints on item selection

The adaptive algorithm does not simply serve the statistically optimal next item without regard for the Fellow’s experience. Three constraints govern item selection:

Content balance. The algorithm avoids serving three items in a row from the same content domain within a subtest. Measurement breadth matters.
Exposure control. No single item is allowed to dominate what Fellows are served. Per-item exposure caps prevent the algorithm from over-relying on a few high-discrimination items, which protects both measurement quality and test security.
Band availability. The algorithm only selects from items whose difficulty parameters fall within a reasonable range of the current ability estimate. This prevents wild jumps in perceived difficulty.
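One way to picture the three constraints is as a filter applied to the pool before the most informative item is chosen. The sketch below is an assumption-laden illustration: the band width, the exposure cap, the item record layout, and the function name are all hypothetical values chosen for the example.

```python
def eligible_items(pool, theta_hat, recent_domains, exposure_counts,
                   band_width=1.5, exposure_cap=500, max_run=2):
    """Return ids of items that pass all three selection constraints.

    pool: dict mapping item id -> {"b": difficulty, "domain": content domain}
    recent_domains: domains of the items served so far, in order
    exposure_counts: dict mapping item id -> times served across all Fellows
    """
    # Content balance: if the last `max_run` items share a domain, block it,
    # so that a third consecutive item from that domain is never served.
    blocked = None
    if len(recent_domains) >= max_run and len(set(recent_domains[-max_run:])) == 1:
        blocked = recent_domains[-1]

    eligible = []
    for item_id, item in pool.items():
        if blocked is not None and item["domain"] == blocked:
            continue  # content balance
        if exposure_counts.get(item_id, 0) >= exposure_cap:
            continue  # exposure control
        if abs(item["b"] - theta_hat) > band_width:
            continue  # band availability
        eligible.append(item_id)
    return eligible
```

Filtering first and maximizing information second keeps the constraints hard: an item outside the band or over its cap cannot be served no matter how informative it is.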

Stopping rules

Each subtest terminates when the standard error of the ability estimate falls below a precision threshold, or when the item cap for that subtest is reached — whichever comes first. The standard error threshold ensures that measurement precision, not item count, governs test length. Some Fellows require fewer items to measure precisely; others require more.
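The stopping rule reduces to a single check after each response. The threshold of 0.30 and the cap of 25 items below are placeholder values for illustration; the Society's actual settings are not stated in this document.

```python
def should_stop(standard_error, items_served, se_threshold=0.30, item_cap=25):
    """True when the subtest should end.

    Stops when the standard error falls below the precision threshold,
    or when the item cap is reached, whichever comes first.
    Default values are hypothetical.
    """
    return standard_error < se_threshold or items_served >= item_cap
```

A Fellow measured precisely after 10 items stops early, while a Fellow whose standard error is still above the threshold continues until the cap.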

4. How Standing Δ is computed

Standing Δ is a weighted composite of five ability estimates — one per subtest — each produced by the IRT scoring engine described above. The weights reflect the Society’s assessment of each faculty’s relative importance to the construct being measured:

Subtest	Weight
Reasoning	25%
Sustained Attention	25%
Reading Recall	20%
Working Memory	20%
Processing Speed and Fatigue Slope	10%

The raw composite is rescaled to a range designed for legibility and communication. The transformation is analogous to the SAT’s rescaling of raw scores to its 200–800 range, or the GRE’s rescaling to its 130–170 range. The specific offset and multiplier are set during the initial calibration period based on the observed distribution of Fellow ability.
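The composite and its rescaling can be sketched as a weighted sum followed by a linear transformation. The weights come from the table above; the multiplier and offset are hypothetical placeholders, since the real values are set during calibration and are not given here.

```python
# Weights from the table above.
WEIGHTS = {
    "reasoning": 0.25,
    "sustained_attention": 0.25,
    "reading_recall": 0.20,
    "working_memory": 0.20,
    "processing_speed": 0.10,
}


def standing_delta(abilities, multiplier=20.0, offset=50.0):
    """Weighted composite of five ability estimates, linearly rescaled.

    abilities: dict with one ability estimate per subtest key in WEIGHTS
    multiplier, offset: placeholder rescaling constants (assumptions)
    """
    raw = sum(WEIGHTS[name] * abilities[name] for name in WEIGHTS)
    return multiplier * raw + offset
```

Because the transformation is linear, it changes the units of the score without changing the ordering of Fellows or the relative size of the gaps between them.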

Standing Δ is uncapped. There is no theoretical ceiling. As the Fellow population grows and the top of the distribution becomes denser, further improvement becomes harder to achieve — but it is never impossible. Tier names provide legibility at a glance; the number itself is the credential.

What the weights mean

Reasoning and Sustained Attention together account for half of Standing Δ. This reflects the Society’s thesis: the capacity for structured inference and the capacity for sustained cognitive engagement are the two faculties most consequential to intellectual life and most measurably affected by habitual distraction. Reading Recall and Working Memory together account for another 40%. Processing Speed, while included for its diagnostic value — particularly the fatigue slope — is de-emphasized in the composite because raw speed is less trainable and less central to the construct than the other four faculties.

Weights are locked at these values for the initial operating period. Any adjustment requires a formal architectural decision record, published notice to Fellows, and a recalibration cycle.

5. What Δ is not

Δ is an institution-internal score. It measures a Fellow’s performance on this institution’s assessments relative to other Fellows. Several things follow from this:

Δ is not an IQ test. IQ tests are standardized instruments normed against general population samples, administered under controlled clinical conditions, and interpreted by licensed professionals. Δ is none of these things. It is a training benchmark produced by a specific set of assessments within a specific institution.

Δ is not a medical assessment. It does not diagnose any condition. It does not screen for any condition. It does not claim clinical validity for any purpose. A Fellow’s Δ should not be presented to a medical professional as evidence of cognitive function or dysfunction.

Δ is not a diagnosis. No Δ value, subscale profile, or trend line should be interpreted as indicating the presence or absence of any neurological or psychiatric condition.

Δ is not portable across institutions. A Δ of +74 at this Society means +74 on this Society’s assessments, relative to this Society’s Fellow population. It does not correspond to any score on any other assessment, at any other institution, using any other methodology. Comparisons to external benchmarks are not valid.

Δ is not final. Every Diagnostic sitting is a measurement, not a verdict. Standing Δ is expected to change over time with practice, and it is designed to. A single sitting captures ability at a single point in time, under the conditions of that sitting.

6. What the confidence interval means

Every Standing Δ reported on a Fellow’s case report includes a confidence interval — a range within which the Fellow’s true ability is estimated to lie with a stated probability (typically 95%).

A Standing Δ reported as +74 (±4) means: given the Fellow’s pattern of responses and the precision of the items they encountered, we estimate with 95% confidence that the Fellow’s true composite ability falls between +70 and +78. The reported value of +74 is the point estimate, the single number that best explains the response pattern. The interval communicates how much uncertainty remains.
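The relationship between a standard error and the reported interval can be sketched as follows. This assumes a normal approximation, under which a 95% interval spans 1.96 standard errors on either side of the point estimate; the document does not state which method the Society uses.

```python
def confidence_interval(point_estimate, standard_error, z=1.96):
    """Return (lower, upper) bounds under a normal approximation.

    z=1.96 corresponds to 95% coverage; other levels need other z values.
    """
    half_width = z * standard_error
    return point_estimate - half_width, point_estimate + half_width
```

Under this assumption, the ±4 in the example above corresponds to a standard error of roughly 2.04 on the Δ scale.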

Why confidence intervals matter

Two Fellows with Standing Δ values of +74 and +72 may or may not differ in true ability. If both have confidence intervals of ±4, the intervals overlap substantially, and the difference is not statistically meaningful. If both have confidence intervals of ±1, the difference is more likely to reflect a genuine gap.
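The comparison can be made concrete by measuring how much two reported intervals overlap. This is a rough heuristic rather than a formal significance test, and the function name is hypothetical.

```python
def overlap_width(delta_a, half_a, delta_b, half_b):
    """Width of the region shared by two intervals, or 0.0 if they do not overlap.

    Each interval is given as a point estimate and a half-width (the ± value).
    """
    lower = max(delta_a - half_a, delta_b - half_b)
    upper = min(delta_a + half_a, delta_b + half_b)
    return max(0.0, upper - lower)
```

For +74 (±4) and +72 (±4) the shared region is 6 points wide. For +74 (±1) and +72 (±1) the intervals only touch at +73, so the shared width is 0.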

Confidence intervals narrow with more data. Each additional item adds information, so the interval tightens as a subtest proceeds. A Fellow whose subtest reaches the item cap before the precision threshold is met will tend to have a wider interval. A Fellow whose responses are highly consistent will tend to have a narrower one. Over successive monthly Diagnostics, confidence intervals on a Fellow’s trajectory provide a clearer picture than point estimates alone.

Why they are shown

Most consumer assessments hide uncertainty. They report a single number and let the user assume it is exact. The Society reports confidence intervals because honest measurement requires it, and because Fellows who understand their interval make better decisions about their practice. A Fellow whose interval is wide knows that their next Diagnostic sitting carries more information than usual. A Fellow whose interval is narrow knows that small changes in Standing Δ are more likely to be real.

7. How items are reviewed

The integrity of Δ depends on the integrity of the item bank. Items are subject to continuous review across four dimensions.

Bias screening

Every item is screened for differential item functioning (DIF): a difference in an item’s difficulty across demographic groups at the same ability level. An item that is harder for one group than another, after controlling for measured ability, is flagged for review. Flagged items are revised to remove the source of bias or retired from the bank.

DIF screening is applied across age cohorts and, as the Fellow population grows, across other relevant dimensions. The Society does not adjust Δ for any demographic variable — the score is computed identically for every Fellow. But the items themselves must measure the construct fairly, and DIF screening is how that fairness is enforced.
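This document does not name the DIF statistic the Society uses. As one common illustration, the Mantel-Haenszel common odds ratio compares two groups within matched ability bands; the sketch below assumes that approach, and the input layout and function name are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across ability bands.

    strata: list of (ref_correct, ref_wrong, focal_correct, focal_wrong),
            one tuple per band of Fellows with similar measured ability.
    A value near 1.0 suggests no DIF; values far from 1.0 suggest the item
    favours one group after controlling for ability.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_correct, ref_wrong, focal_correct, focal_wrong in strata:
        n = ref_correct + ref_wrong + focal_correct + focal_wrong
        if n == 0:
            continue  # empty band contributes nothing
        numerator += ref_correct * focal_wrong / n
        denominator += ref_wrong * focal_correct / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no usable strata")
    return numerator / denominator
```

Matching on ability band is what separates DIF from a simple difference in group averages: two groups may differ in overall performance without any single item behaving unfairly.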

Content review

Items are reviewed for factual accuracy, clarity of language, and alignment with the construct being measured. An item that is difficult because it is poorly worded is not measuring the intended faculty — it is measuring reading tolerance for ambiguity, which is a different thing. Content review catches these cases.

For the Reasoning subtest, every item — whether hand-authored, generated, or edited — is verified against a formal logic solver before it can be served. The solver confirms that exactly one answer option is entailed by the premises, that the distractors are plausible but not provable, and that no unintended inference path leads to a distractor. This verification is automated and runs as part of the item pipeline, not as a manual review step.
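The core check, that exactly one option is entailed by the premises, can be illustrated with a brute-force truth-table enumeration. This is a stand-in for a real logic solver and only works for small propositional items; the function names and the example item are hypothetical.

```python
from itertools import product


def entailed(premises, conclusion, variables):
    """True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if all(p(world) for p in premises) and not conclusion(world):
            return False  # counterexample found
    return True


def exactly_one_entailed(premises, options, variables):
    """True if precisely one answer option follows from the premises."""
    return sum(entailed(premises, o, variables) for o in options) == 1


# Example item: "If it rains, the path is wet. It rains."
variables = ["rain", "wet", "cold"]
premises = [
    lambda w: (not w["rain"]) or w["wet"],  # rain implies wet
    lambda w: w["rain"],
]
options = [
    lambda w: w["wet"],       # the valid inference
    lambda w: w["cold"],      # plausible but not provable
    lambda w: not w["wet"],
    lambda w: not w["rain"],
]
```

A useful side effect of counting entailed options is that contradictory premises are caught as well: they entail every option vacuously, so the count is not one and the item is rejected.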

Item retirement

Items are retired from active service when any of the following conditions are met:

The item’s discrimination parameter falls below a minimum threshold, indicating that it no longer separates Fellows effectively.
The item’s difficulty parameter has drifted significantly from its calibrated value, indicating that the Fellow population’s relationship to the item has changed — often because the item’s content has become more widely known.
DIF screening flags the item and review determines that revision is not feasible.
The item’s exposure count exceeds the cap, indicating that it has been served too many times and may be compromised by memorization or sharing.
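The four retirement conditions can be expressed as a single check that returns every reason an item qualifies. All thresholds, field names, and the function name below are hypothetical; the document does not publish the actual values.

```python
def retirement_reasons(item, min_discrimination=0.5, max_drift=0.5, exposure_cap=500):
    """Return the list of retirement conditions an item currently meets.

    item: dict with keys
      "a"             current discrimination estimate
      "b"             current difficulty estimate
      "b_calibrated"  difficulty at original calibration
      "dif_flagged"   True if DIF screening flagged the item
      "revisable"     True if review judged revision feasible
      "exposures"     number of times the item has been served
    An empty list means the item stays in the active pool.
    """
    reasons = []
    if item["a"] < min_discrimination:
        reasons.append("low discrimination")
    if abs(item["b"] - item["b_calibrated"]) > max_drift:
        reasons.append("difficulty drift")
    if item["dif_flagged"] and not item["revisable"]:
        reasons.append("unrevisable DIF")
    if item["exposures"] > exposure_cap:
        reasons.append("exposure cap exceeded")
    return reasons
```

Returning the reasons rather than a bare yes or no keeps a record of why each item left the pool, which fits the retention of retired items for historical analysis.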

Retired items are removed from the active pool but retained in the database for historical analysis and parameter research.

Annual recalibration

Once per year, the full item bank undergoes a recalibration cycle. All item parameters are re-estimated from the accumulated response data. Items whose parameters have shifted significantly are flagged for review. The composite weighting is evaluated against subscale correlation data. The confidence interval methodology is audited. The results are published in an updated version of this document.

Recalibration does not retroactively change any Fellow’s historical Standing Δ. Past scores are locked at the values computed at the time of the sitting. Recalibration ensures that future measurements remain accurate as the item bank and Fellow population evolve.

Glossary

3PL (Three-Parameter Logistic model): The statistical model used to describe the relationship between a Fellow’s ability and the probability of answering an item correctly. The three parameters are difficulty, discrimination, and guessability.
CAT (Computerized Adaptive Testing): A testing methodology in which the next item is selected in real time based on the Fellow’s performance so far, maximizing the statistical information gained from each item.
Confidence interval: A range around a point estimate within which the true value is estimated to lie with a stated probability.
DIF (Differential Item Functioning): A difference in an item’s difficulty across demographic groups after controlling for ability, detected by statistical screening.
EAP (Expected A Posteriori): The estimation method used to compute ability from response patterns. Reports the mean of the posterior distribution of ability rather than the value that maximizes the likelihood.
IRT (Item Response Theory): The branch of psychometrics concerned with modelling the relationship between latent ability and item responses. The mathematical foundation of the Diagnostic’s scoring.
Standing Δ: The Fellow’s official composite score, updated only at monthly Diagnostic sittings.
Working Δ: A projection of the Fellow’s next expected Standing Δ, updated with every scored practice Examination. Not used in official rankings or profiles.