How GeoReputation calculates visibility scores

The 0-1000 visibility score combines weighted coverage (70%) and weighted quality (30%) across the prompts and models in your scoring set. This doc walks through the formula, the tiering, and what the number does and does not capture.

Last reviewed: June 8, 2026 · 5 min read · Published Jun 8, 2026

A GeoReputation visibility score is a number between 0 and 1000 that summarizes how well an AI assistant represents a brand across a set of prompts. This doc describes exactly how that number is produced from the underlying answer data, including the formula, the weights, and the decisions we made about edge cases.

If the math here ever drifts from the production code, the doc is wrong. The Last Reviewed date at the top of the page is the freshness signal.

The shape of the underlying answer

A scoring run executes a configured set of prompts against a configured set of AI models. Each (prompt, model) pair produces one MonitoringAnswer with three signals:

  • mentioned — was the brand named in the response?
  • rank — if mentioned, what position (1st, 2nd, 3rd, etc.) did the brand appear at in the recommended list?
  • quality — a 0-1 score reflecting how prominently the brand was discussed, computed from mention rank, recommendation status, and surrounding context.

Each answer is actually a roll-up of three samples (we run every prompt against every model three times in a row) so transient model variability does not dominate the score. The sample-level normalization produces a mention rate between 0 and 1 instead of a single bit.
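The roll-up from three samples to a mention rate can be sketched as below. This is a minimal illustration, assuming each sample reduces to a single mentioned/not-mentioned flag; the function name mention_rate is hypothetical and not taken from the production code.

```python
def mention_rate(sample_mentions: list[bool]) -> float:
    """Fraction of samples in which the brand was named (0.0 to 1.0)."""
    if not sample_mentions:
        raise ValueError("at least one sample is required")
    # True counts as 1 and False as 0, so the sum is the number of mentions.
    return sum(sample_mentions) / len(sample_mentions)


# Brand named in two of the three samples for one (prompt, model) pair.
rate = mention_rate([True, False, True])
```

With three samples the rate can only take the values 0, 1/3, 2/3, or 1, which is still more informative than a single yes/no bit.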

Run-level formula

A single run score combines two components: coverage and quality.

Both components are weighted by a per-prompt reliability factor that accounts for how stable each prompt is across model invocations. Volatile prompts (different answers on different runs) get a smaller weight than stable ones; the default range is 0.5 to 1.5.
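One way the two components could combine is sketched below. It rests on several assumptions: coverage is the weighted mean mention rate, quality is the weighted mean over answers with at least one mention, and the blend is scaled to 0-1000 and rounded. The Answer class, its fields, and run_score are hypothetical names, not the code in services/scoring_v2.py.

```python
from dataclasses import dataclass

COVERAGE_WEIGHT = 0.7  # assumed default, per the 70/30 split
QUALITY_WEIGHT = 0.3


@dataclass
class Answer:
    mention_rate: float  # 0..1, share of samples that named the brand
    quality: float       # 0..1, only meaningful when mention_rate > 0
    weight: float = 1.0  # hypothetical per-answer reliability weight


def run_score(answers: list[Answer]) -> int:
    """Blend weighted coverage and weighted quality into a 0-1000 score."""
    total_weight = sum(a.weight for a in answers)
    if total_weight == 0:
        return 0
    coverage = sum(a.weight * a.mention_rate for a in answers) / total_weight

    # Quality is averaged only over answers where the brand appeared.
    mentioned = [a for a in answers if a.mention_rate > 0]
    mentioned_weight = sum(a.weight for a in mentioned)
    if mentioned_weight == 0:
        return 0  # zero coverage: quality is undefined, score is 0
    quality = sum(a.weight * a.quality for a in mentioned) / mentioned_weight

    return round(1000 * (COVERAGE_WEIGHT * coverage + QUALITY_WEIGHT * quality))


# One answer with a full mention at quality 0.8, one with no mention.
score = run_score([Answer(1.0, 0.8), Answer(0.0, 0.0)])  # 590
```

In this sketch a brand mentioned on half the answers at quality 0.8 scores 1000 × (0.7 × 0.5 + 0.3 × 0.8) = 590.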

Why 70/30

Coverage is the larger weight because being mentioned at all is the threshold question: if an AI assistant does not name you, no amount of quality on the mentions that do happen will fix that. Quality is the secondary signal that distinguishes a brand mentioned in passing from one given a confident recommendation.

The exact split is tunable via settings (SCORING_V2_COVERAGE_WEIGHT, SCORING_V2_QUALITY_WEIGHT). We landed on 70/30 after testing 80/20 and 60/40 on the brands we had data for; 70/30 produced the rank order that matched our internal taste for "this brand has stronger visibility than that one" in the cases we hand-checked.

Stability weighting

Each answer carries a stability score from 0 to 1 based on whether the three samples agreed with each other. The weight applied to that answer is:

    weight = 0.75 + (0.25 × stability_score)

A perfectly stable answer (same outcome across all three samples) gets weight 1.0. A perfectly volatile answer (every sample different) gets weight 0.75. The discount is intentionally gentle: volatile prompts still count, just less.
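The weight formula translates directly into code. The sketch below assumes the stability score has already been computed and clamped to the 0-1 range; stability_weight is a hypothetical name.

```python
def stability_weight(stability_score: float) -> float:
    """Map a 0..1 stability score onto a weight between 0.75 and 1.0."""
    if not 0.0 <= stability_score <= 1.0:
        raise ValueError("stability_score must be between 0 and 1")
    return 0.75 + 0.25 * stability_score


fully_stable = stability_weight(1.0)    # 1.0
fully_volatile = stability_weight(0.0)  # 0.75
```

Because the floor is 0.75, a volatile answer still carries three quarters of the influence of a stable one.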

Brand score (across runs)

A brand has its own score, which is the per-prompt mean of the most recent run. If you ran 12 prompts and got per-prompt scores of [800, 750, 600, 900, 700, 850, 500, 600, 700, 800, 750, 900], the brand score is round(8850 / 12) = 738.
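The averaging step can be sketched as follows, using a smaller three-prompt run for illustration. brand_score is a hypothetical helper, not the function in services/scoring_display.py, and the sketch assumes an empty run yields 0. Python's built-in round sends exact halves to the nearest even integer, which may or may not match production rounding.

```python
def brand_score(per_prompt_scores: list[int]) -> int:
    """Mean of the per-prompt scores from the most recent run, rounded."""
    if not per_prompt_scores:
        return 0  # assumption: no scored prompts means a score of 0
    return round(sum(per_prompt_scores) / len(per_prompt_scores))


# Three prompts scoring 800, 600, and 700 average to 700.
score = brand_score([800, 600, 700])
```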

A few important properties:

  • Only prompts marked include_in_geo_score = true contribute. Prompts you have archived or marked exploratory are excluded.
  • Only models in the scoring set (your configured AI models) count. Adding a new model does not retroactively change historical scores; the per-answer snapshot of "was this model in the scoring set at the time?" is what gets filtered on.
  • Era-guarded: when the scoring methodology changes meaningfully (which has happened twice since launch), historical runs are tagged with the era they ran under. The current brand score only averages runs from the current era so the number means the same thing across the chart.
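The three properties above amount to a filter applied before averaging. A minimal sketch, assuming each stored answer carries the include flag, a snapshot of scoring-set membership, and an era tag; AnswerRecord, its field names other than include_in_geo_score, and eligible_answers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnswerRecord:
    prompt_id: str
    include_in_geo_score: bool   # False for archived or exploratory prompts
    model_in_scoring_set: bool   # snapshot taken when the answer was recorded
    era: str                     # methodology era the run executed under


def eligible_answers(records: list[AnswerRecord], current_era: str) -> list[AnswerRecord]:
    """Keep only the answers that should contribute to the brand score."""
    return [
        r for r in records
        if r.include_in_geo_score
        and r.model_in_scoring_set
        and r.era == current_era
    ]
```

Filtering on the stored snapshot rather than the current configuration is what keeps historical scores from shifting when a model is added later.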

From score to tier

The 0-1000 score maps to one of six tier labels. These labels are the canonical vocabulary used across the UI, exports, and AI-generated narratives.

GeoReputation visibility tier rubric

Score range | Tier label  | What it means
0           | Not Visible | The AI assistant never named the brand across the scored prompts.
1-199       | Low         | The brand surfaces occasionally but is not a default answer.
200-399     | Limited     | The brand appears on some prompts but with low frequency or weak quality.
400-599     | Moderate    | The brand is a reliable mention on roughly half of relevant prompts.
600-799     | Good        | The brand is a regular answer with reasonable rank and quality across most prompts.
800-1000    | Strong      | The brand is the default or near-default answer to its target prompts.

The bands are intentionally wide. We chose round 200-point intervals over finer-grained bands so a 5-point swing on a noisy run does not flip a brand between tiers. The cost is that two brands inside the same band can have meaningfully different scores; tier is for narrative, not for fine comparison.
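The rubric can be expressed as a small lookup. This is an illustrative sketch of the mapping, assuming integer scores; tier_for and TIER_UPPER_BOUNDS are hypothetical names, not the contents of services/score_rubric.py.

```python
# Inclusive upper bound of each non-zero band, in ascending order.
TIER_UPPER_BOUNDS = [
    (199, "Low"),
    (399, "Limited"),
    (599, "Moderate"),
    (799, "Good"),
    (1000, "Strong"),
]


def tier_for(score: int) -> str:
    """Return the tier label for a 0-1000 visibility score."""
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    if score == 0:
        return "Not Visible"  # exact zero is its own tier
    for upper, label in TIER_UPPER_BOUNDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: score already range-checked")
```

Scores of 599 and 600 land in different tiers ("Moderate" and "Good"), so a brand sitting on a boundary can still flip on a small swing even with wide bands.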

What the score does NOT capture

Worth naming the limits in the same place as the formula:

  • Sentiment. We do not currently weight a mention more or less based on whether the AI described the brand favorably. A neutral mention and a glowing mention contribute equally to coverage; quality measures prominence, not sentiment.
  • Reasoning correctness. If an AI mentions the brand but says something factually wrong about it, that does not lower the score. Detecting hallucinated claims about a brand is a separate pipeline.
  • Brand-mention context outside the answer. The score only reflects what the AI said in response to the prompt. Off-platform brand health (review sites, social, news) is tracked separately.

Edge cases worth knowing

Zero coverage

If no prompt in a run produced a mention, coverage is 0 and quality is undefined (we cannot average quality across zero mentions). The score is 0, which maps to the "Not Visible" tier.

Single-model runs

A run with only one model produces a valid score, but the score is much noisier because there is no cross-model averaging. We surface this in the UI as a confidence indicator on the run-detail page.

Fallback answers

When the live API call to an AI model fails, we sometimes fall back to a cached or retried answer. Runs where more than 10% of answers came from fallback get a "low confidence" badge in the dashboard; the score is still produced but should be read with that caveat in mind.
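The badge rule is a simple threshold check. A minimal sketch, assuming each answer records whether it came from fallback and that an empty run gets no badge; is_low_confidence and FALLBACK_THRESHOLD are hypothetical names.

```python
FALLBACK_THRESHOLD = 0.10  # badge applies when strictly more than 10% fell back


def is_low_confidence(fallback_flags: list[bool]) -> bool:
    """True when the share of fallback answers in a run exceeds the threshold."""
    if not fallback_flags:
        return False  # assumption: an empty run gets no badge
    share = sum(fallback_flags) / len(fallback_flags)
    return share > FALLBACK_THRESHOLD


# Two fallback answers out of ten is 20%, which exceeds the threshold.
badge = is_low_confidence([True, True] + [False] * 8)
```

The comparison is strict, so a run with exactly 10% fallback answers does not get the badge in this sketch.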

Where to dig deeper

Source code references for anyone who wants the line-level math:

  • services/scoring_v2.py — the run-level formula, coverage and quality computation, stability weighting.
  • services/scoring_display.py — the brand score helper that averages per-prompt scores into a single number.
  • services/score_rubric.py and frontend/lib/score-rubric.ts — the score-to-tier mapping (kept in lockstep across backend and frontend).