How GeoReputation calculates visibility scores
The 0-1000 visibility score combines weighted coverage (70%) and weighted quality (30%) across the prompts and models in your scoring set. This doc walks through the formula, the tiering, and what the number does and does not capture.
A GeoReputation visibility score is a number between 0 and 1000 that summarizes how well an AI assistant represents a brand across a set of prompts. This doc describes exactly how that number is produced from the underlying answer data, including the formula, the weights, and the decisions we made about edge cases.
If the math here ever drifts from the production code, the doc is wrong. The Last Reviewed date at the top of the page is the freshness signal.
The shape of the underlying answer
A scoring run executes a configured set of prompts against a configured set of AI models. Each (prompt, model) pair produces one MonitoringAnswer with three signals:
- mentioned — was the brand named in the response?
- rank — if mentioned, what position (1st, 2nd, 3rd, etc.) did the brand appear at in the recommended list?
- quality — a 0-1 score reflecting how prominently and favorably the brand was discussed, computed from mention rank, recommendation status, and surrounding context.
Each answer is actually a roll-up of three samples (we run every prompt against every model three times in a row) so transient model variability does not dominate the score. The sample-level normalization produces a mention rate between 0 and 1 instead of a single bit.
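As a rough illustration of the roll-up described above, the mention rate can be thought of as the share of samples that named the brand. The function name and the list-of-booleans input are assumptions made for this sketch, not the production data model.

```python
def mention_rate(sample_mentions: list[bool]) -> float:
    """Return the share of samples (0-1) in which the brand was named.

    `sample_mentions` is a hypothetical input: one boolean per sample,
    normally three per (prompt, model) pair.
    """
    if not sample_mentions:
        raise ValueError("at least one sample is required")
    # True counts as 1 and False as 0, so the sum is the number of mentions.
    return sum(sample_mentions) / len(sample_mentions)


if __name__ == "__main__":
    # Brand named in two of the three samples.
    print(mention_rate([True, False, True]))
```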
Run-level formula
A single run score combines two components: coverage and quality.
Both components are weighted by a per-prompt reliability factor that accounts for how stable each prompt is across model invocations. Volatile prompts (different answers on different runs) get a smaller weight than stable ones; the default range is 0.5 to 1.5.
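The line-level formula lives in the production code, so the following is only a sketch of how the two components might combine. It assumes coverage is the weighted mean mention rate, quality is the weighted mean quality over answers with at least one mention, and the result is scaled to 0-1000. The Answer class, its field names, and run_score are hypothetical.

```python
from dataclasses import dataclass

# Defaults described in this doc; in production they are read from
# SCORING_V2_COVERAGE_WEIGHT and SCORING_V2_QUALITY_WEIGHT.
COVERAGE_WEIGHT = 0.70
QUALITY_WEIGHT = 0.30


@dataclass
class Answer:
    """Hypothetical per-(prompt, model) record used only in this sketch."""
    mention_rate: float  # 0-1, share of samples that named the brand
    quality: float       # 0-1, only meaningful when mention_rate > 0
    weight: float        # reliability weight applied to this answer


def run_score(answers: list[Answer]) -> int:
    """Combine weighted coverage and weighted quality into a 0-1000 score."""
    total_weight = sum(a.weight for a in answers)
    if total_weight == 0:
        return 0

    # Assumption: coverage is the weighted mean of mention rates.
    coverage = sum(a.weight * a.mention_rate for a in answers) / total_weight

    # Assumption: quality is averaged only over answers that mention the brand.
    mentioned = [a for a in answers if a.mention_rate > 0]
    mentioned_weight = sum(a.weight for a in mentioned)
    if mentioned_weight == 0:
        # Zero coverage: quality is undefined, so the score is 0.
        return 0
    quality = sum(a.weight * a.quality for a in mentioned) / mentioned_weight

    return round(1000 * (COVERAGE_WEIGHT * coverage + QUALITY_WEIGHT * quality))
```

Under these assumptions, a run where every answer mentions the brand with quality 1.0 scores 1000, and a run with no mentions scores 0.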
Why 70/30
Coverage is the larger weight because being mentioned at all is the threshold question: if an AI assistant does not name you, no amount of quality on the mentions that do happen will fix that. Quality is the secondary signal that distinguishes a brand mentioned in passing from one given a confident recommendation.
The exact split is tunable via settings (SCORING_V2_COVERAGE_WEIGHT, SCORING_V2_QUALITY_WEIGHT). We landed on 70/30 after testing 80/20 and 60/40 on the brands we had data for; in the cases we hand-checked, 70/30 produced the rank order that best matched our own judgment of which brand had the stronger visibility.
Stability weighting
Each answer carries a stability score from 0 to 1 based on whether the three samples agreed with each other. The weight applied to that answer is:
weight = 0.75 + (0.25 × stability_score)
A perfectly stable answer (same outcome across all three samples) gets weight 1.0. A perfectly volatile answer (every sample different) gets weight 0.75. The discount is intentionally gentle: volatile prompts still count, just less.
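The weight formula above translates directly into a small helper. The function name and the range check are illustrative choices for this sketch; how stability_score itself is derived from the three samples is not shown here.

```python
def stability_weight(stability_score: float) -> float:
    """Map a 0-1 stability score to an answer weight between 0.75 and 1.0."""
    if not 0.0 <= stability_score <= 1.0:
        raise ValueError("stability_score must be between 0 and 1")
    return 0.75 + 0.25 * stability_score


if __name__ == "__main__":
    for s in (0.0, 0.5, 1.0):
        print(s, stability_weight(s))
```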
Brand score (across runs)
A brand has its own score, which is the per-prompt mean of the most recent run. If you ran 12 prompts and got per-prompt scores of [800, 750, 600, 900, 700, 850, 500, 600, 700, 800, 750, 900], the brand score is round(8850 / 12) = 738.
A few important properties:
- Only prompts marked include_in_geo_score = true contribute. Prompts you have archived or marked exploratory are excluded.
- Only models in the scoring set (your configured AI models) count. Adding a new model does not retroactively change historical scores; the per-answer snapshot of "was this model in the scoring set at the time?" is what gets filtered on.
- Era-guarded: when the scoring methodology changes meaningfully (which has happened twice since launch), historical runs are tagged with the era they ran under. The current brand score only averages runs from the current era so the number means the same thing across the chart.
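A minimal sketch of the per-prompt averaging with the include_in_geo_score filter applied. The PromptScore class and brand_score function are hypothetical, and returning None when no prompt qualifies is an assumption of this sketch rather than documented behavior. The model and era filters are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptScore:
    """Hypothetical per-prompt result from the most recent run."""
    prompt_id: str
    score: int                  # 0-1000
    include_in_geo_score: bool  # False for archived or exploratory prompts


def brand_score(prompt_scores: list[PromptScore]) -> Optional[int]:
    """Average the scores of prompts that are flagged for scoring."""
    scored = [p.score for p in prompt_scores if p.include_in_geo_score]
    if not scored:
        # Assumption: no qualifying prompts means no brand score.
        return None
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return round(sum(scored) / len(scored))


if __name__ == "__main__":
    prompts = [
        PromptScore("p1", 800, True),
        PromptScore("p2", 600, True),
        PromptScore("p3", 100, False),  # exploratory, so excluded
    ]
    print(brand_score(prompts))
```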
From score to tier
The 0-1000 score maps to one of six tier labels. These labels are the canonical vocabulary used across the UI, exports, and AI-generated narratives.
| Score range | Tier label | What it means |
|---|---|---|
| 0 | Not Visible | The AI assistant never named the brand across the scored prompts. |
| 1-199 | Low | The brand surfaces occasionally but is not a default answer. |
| 200-399 | Limited | The brand appears on some prompts but with low frequency or weak quality. |
| 400-599 | Moderate | The brand is a reliable mention on roughly half of relevant prompts. |
| 600-799 | Good | The brand is a regular answer with reasonable rank and quality across most prompts. |
| 800-1000 | Strong | The brand is the default or near-default answer to its target prompts. |
The bands are intentionally wide. We chose round 200-point intervals over finer-grained bands so a 5-point swing on a noisy run does not flip a brand between tiers. The cost is that two brands inside the same band can have meaningfully different scores; tier is for narrative, not for fine comparison.
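The table above can be expressed as a simple lookup. This is a sketch that mirrors the documented bands; the function name is illustrative and not necessarily what the rubric modules use.

```python
def score_to_tier(score: int) -> str:
    """Map a 0-1000 visibility score to its tier label."""
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    if score == 0:
        return "Not Visible"
    if score < 200:
        return "Low"
    if score < 400:
        return "Limited"
    if score < 600:
        return "Moderate"
    if score < 800:
        return "Good"
    return "Strong"


if __name__ == "__main__":
    for s in (0, 150, 399, 400, 738, 1000):
        print(s, score_to_tier(s))
```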
What the score does NOT capture
Worth naming the limits in the same place as the formula:
- Sentiment. We do not currently weight a mention more or less based on whether the AI described the brand favorably. A neutral mention and a glowing mention contribute equally to coverage; quality measures prominence, not sentiment.
- Reasoning correctness. If an AI mentions the brand but says something factually wrong about it, that does not lower the score. Detecting hallucinated claims about a brand is a separate pipeline.
- Brand-mention context outside the answer. The score only reflects what the AI said in response to the prompt. Off-platform brand health (review sites, social, news) is tracked separately.
Edge cases worth knowing
Zero coverage
If no prompt in a run produced a mention, coverage is 0 and quality is undefined (we cannot average quality across zero mentions). The score is 0, which maps to the "Not Visible" tier.
Single-model runs
A run with only one model produces a valid score, but the score is much noisier because there is no cross-model averaging. We surface this in the UI as a confidence indicator on the run-detail page.
Fallback answers
When the live API call to an AI model fails, we sometimes fall back to a cached or retried answer. Runs where more than 10% of answers came from fallback get a "low confidence" badge in the dashboard; the score is still produced but should be read with that caveat in mind.
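The badge rule can be sketched as a threshold check. The function name and its count-based inputs are assumptions; the only thing taken from this doc is the "more than 10%" threshold, which is treated here as strictly greater than.

```python
def is_low_confidence(fallback_answers: int, total_answers: int) -> bool:
    """Return True when more than 10% of a run's answers came from fallback."""
    if total_answers <= 0:
        # Assumption: an empty run gets no badge in this sketch.
        return False
    # Integer comparison avoids floating-point edge cases at exactly 10%.
    return fallback_answers * 10 > total_answers


if __name__ == "__main__":
    print(is_low_confidence(3, 30))  # exactly 10%
    print(is_low_confidence(4, 30))  # just over 10%
```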
Where to dig deeper
Source code references for anyone who wants the line-level math:
- services/scoring_v2.py: the run-level formula, coverage and quality computation, stability weighting.
- services/scoring_display.py: the brand score helper that averages per-prompt scores into a single number.
- services/score_rubric.py and frontend/lib/score-rubric.ts: the score-to-tier mapping (kept in lockstep across backend and frontend).