Benchmarks

Numbers we publish are reproducible or they don't ship. Raw data lives on this page.

1. How AI-ready are real semantic models?

We scanned 40 public semantic models from GitHub with the Semanticus AI-readiness analyzer: Microsoft's own samples, well-known community tools, real production repos and training material. Scans are offline and read metadata only. Every repo is pinned by commit, and the raw per-model scores are downloadable below, so anyone can re-run the scan and check us.
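Pinning by commit is what makes a re-run comparable: everyone scans the same bytes. As a minimal sketch of fetching one repo at a pinned commit with plain git, assuming nothing about the actual corpus tooling: the function name `fetch_pinned`, its parameters and the placeholder values in the usage note are hypothetical, and the real manifest format in tools/readiness-corpus may differ.

```python
import subprocess
from pathlib import Path


def fetch_pinned(repo, commit, dest, run=subprocess.run):
    """Fetch exactly one commit of a GitHub repo into `dest`.

    `repo` is "owner/name"; `commit` is the full pinned SHA.
    Hypothetical helper for illustration, not the Semanticus tooling.
    `run` is injectable so the command list can be inspected offline.
    """
    dest = Path(dest)
    url = f"https://github.com/{repo}.git"
    commands = [
        ["git", "init", "--quiet", str(dest)],
        ["git", "-C", str(dest), "remote", "add", "origin", url],
        # Shallow-fetch only the pinned commit, not a moving branch tip.
        ["git", "-C", str(dest), "fetch", "--quiet", "--depth", "1", "origin", commit],
        ["git", "-C", str(dest), "checkout", "--quiet", "FETCH_HEAD"],
    ]
    for cmd in commands:
        run(cmd, check=True)  # raises CalledProcessError if git fails
    return commands
```

Usage would look like `fetch_pinned("owner/repo", "<full commit SHA from the raw scores file>", "corpus/owner-repo")`. Fetching by SHA rather than by branch name is the design point: a branch can move between our scan and yours, a commit cannot.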

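Median: 54.1 out of 100, taking the upper of the two middle scores (averaging them, 53.6 and 54.1, gives 53.85). Either way, that's an F.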

Grade distribution: 28 F · 9 D · 2 C · 1 B

Every official Microsoft sample in the corpus graded F. These are not broken models; they render their reports fine. They fail the things an AI consumer needs: descriptions, unambiguous names, synonyms, linguistic schema, Prep-for-AI settings, documented relationships.

Disclosed bias: public, source-controlled models are the polished end of the ecosystem. Authors published them for an audience, and teams using git are the engineering-mature minority. Typical private client models are rougher than this sample, so the median here is plausibly an upper bound.

Download the raw scores (JSON) · scanned 4 July 2026 · scanner and corpus are in the Semanticus repository under tools/readiness-corpus.

Model (repo)	Kind	Grade	Score
DaniBunny/Fabric-DE-CICD	training	F	39
CSCfi/antero	community	F	39.1
microsoft/fabric-samples	microsoft sample	F	41.4
tomatminceddata/PBIR_XRAY	community	F	43.1
microsoft/Analysis-Services	microsoft sample	F	43.3
Cyberlorians/nistframework	community	F	43.3
MeteoWatch/MeteoWatch	community	F	45.5
microsoft/fabric-racing-sim	microsoft sample	F	46.2
Azure/tech-debt-analytics	microsoft sample	F	46.6
kevchant/GitHub-FUAM-Deploymenator	community	F	46.9
miguelASL/Eurocopa_Espana	community	F	47.2
Mike-Honey/covid-19-au-vaccinations	community	F	47.6
ecotte/Fabric-Monitoring-RTI	community	F	48.3
bcgov/moh-APO-Reporting	community	F	49.5
ayodejiayodele/github-developer-metrics	community	F	51.1
kerski/fabric-dataops-patterns	training	F	51.2
Open-Education-AI/OEAI	community	F	51.3
sonbaoharryson/Data_Engineer_JobPulse_Project	community	F	52.5
djouallah/aemo_fabric	community	F	53.4
microsoft/PowerBI-LogAnalytics-Template-Reports	microsoft sample	F	53.6
FHaurum/FHSQLMonitor	community	F	54.1
PacktPublishing/Microsoft-Power-BI-Cookbook	training	F	54.9
alisonpezzott/reactor-pbi-maio-25	training	F	55.3
aditiv101/Youtube_analytics_dashboard	community	F	56.2
ProdataSQL/FinancialModelling	community	F	56.4
CareTogether/CareTogether-PowerBI	community	F	57.6
PBI-DataVizzle/pbi_content	community	F	58.6
DataChant/Trello-Power-BI	community	F	59.9
alisonpezzott/pbi-docs	community	D	60
jurgenfolz/WorldDataReport	community	D	61.4
NelsonNeba/Workforce-Hiring-Optimization-Dashboard-	community	D	61.8
stephbruno/Power-BI-Accessibility-Checker	community	D	63.1
vlpatkosdani/powerbi-cicd-with-githubactions-demos	community	D	63.5
jurgenfolz/Stock-Intelligence	community	D	67
RuiRomano/fabric-cli-powerbi-cicd-sample	community	D	68.2
pbi-tools/sales-sample	community	D	69
jeremypj/budget-intelligence-ynab	community	D	69
Rede-DSBR/DocPBI2	community	C	70.3
jeremypj/Power-BI-for-BigTime	community	C	72.6
data-goblin/power-bi-visual-templates	community	B	81.5
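
The summary figures can be re-derived from the table itself. The sketch below copies the 40 scores in table order and recomputes the grade counts and the median. The grade cut-offs (90/80/70/60) are an assumption: this page does not state them, though they are consistent with every row above. The names `SCORES`, `GRADE_FLOORS`, `grade` and `summarize` are ours for illustration and are not part of the analyzer.

```python
from collections import Counter
from statistics import median, median_high

# Scores copied from the table on this page, in table order.
SCORES = [
    39, 39.1, 41.4, 43.1, 43.3, 43.3, 45.5, 46.2, 46.6, 46.9,
    47.2, 47.6, 48.3, 49.5, 51.1, 51.2, 51.3, 52.5, 53.4, 53.6,
    54.1, 54.9, 55.3, 56.2, 56.4, 57.6, 58.6, 59.9, 60, 61.4,
    61.8, 63.1, 63.5, 67, 68.2, 69, 69, 70.3, 72.6, 81.5,
]

# ASSUMPTION: conventional 10-point bands; not confirmed by the page.
GRADE_FLOORS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def grade(score):
    """Map a 0-100 score to a letter using the assumed bands."""
    for floor, letter in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


def summarize(scores):
    """Return count, grade distribution and both median conventions."""
    return {
        "n": len(scores),
        "grades": dict(Counter(grade(s) for s in scores)),
        # Upper of the two middle values for an even-sized sample.
        "median_high": median_high(scores),
        # Mean of the two middle values for an even-sized sample.
        "median": median(scores),
    }


if __name__ == "__main__":
    print(summarize(SCORES))
```

With 40 scores there are two middle values, so the convention matters: `median_high` returns the upper one, while `median` averages the two. The table is rounded to one decimal, so a re-run against the raw JSON may differ slightly in the last digit.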

2. Does the Pro referee actually help? (in progress)

The claim behind Semanticus Pro is that an AI assistant wrapped in the engine's verify/probe/benchmark loop writes more-correct DAX than the same assistant retrying on raw error text. We are measuring exactly that, and we pre-registered the method before running it:

Three arms, same assistant, byte-identical prompts: FREE (one shot, no tools), FREE+ (retries on raw errors: the honest control), PRO (the enforced loop: multi-context probe, equivalence proof, benchmark-pick).
The headline is PRO minus FREE+, never PRO minus FREE, so mere retrying can't be credited to the loop.
Correctness is executed, not judged: 24 frozen tasks, each scored across a filter-context matrix against independently computed reference vectors, where blank, zero and error are three different answers. A task passes only at 100% of cells.
Where Pro does NOT help gets published as a headline, not a footnote. Expect the basic tier to show little or no delta; the loop earns its keep on context transition, time intelligence at mixed grains, and non-additive measures.
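
To make the scoring rule concrete, here is a minimal sketch of all-cells-must-match scoring and the PRO minus FREE+ headline. It is not the harness itself: the names `Special`, `cell_matches`, `task_passes`, `pass_rate` and `headline_delta`, and the numeric tolerance, are assumptions for illustration. The values in the usage note are made up to show the mechanics and are not results.

```python
import math
from enum import Enum


class Special(Enum):
    """Non-numeric cell outcomes; neither one ever equals a number."""
    BLANK = "blank"
    ERROR = "error"


def cell_matches(actual, expected, tol=1e-9):
    """Compare one filter-context cell against its reference value.

    Blank, zero and error are three different answers: BLANK != 0.0.
    ASSUMPTION: numeric cells are compared within a small tolerance.
    """
    if isinstance(actual, Special) or isinstance(expected, Special):
        return actual is expected
    return math.isclose(actual, expected, rel_tol=tol, abs_tol=tol)


def task_passes(actual_vec, reference_vec):
    """A task passes only if 100% of its cells match the reference."""
    if len(actual_vec) != len(reference_vec):
        return False
    return all(cell_matches(a, e) for a, e in zip(actual_vec, reference_vec))


def pass_rate(results):
    """Fraction of tasks passed; `results` maps task id -> bool."""
    return sum(results.values()) / len(results)


def headline_delta(pro, free_plus):
    """PRO minus FREE+ over the same task set, never PRO minus FREE."""
    if pro.keys() != free_plus.keys():
        raise ValueError("arms must be scored on the same tasks")
    return pass_rate(pro) - pass_rate(free_plus)
```

For example, `task_passes([Special.BLANK, 10.0], [0.0, 10.0])` is `False`, because a blank where the reference says zero is a wrong answer, and one wrong cell fails the task. With illustrative inputs `pro = {"t1": True, "t2": True, "t3": True, "t4": False}` and `free_plus = {"t1": True, "t2": True, "t3": False, "t4": False}`, `headline_delta(pro, free_plus)` returns `0.25`.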

Results, raw run logs and full session transcripts will be published on this page.