SHOT Clubhouse · Pulse SportHub · data topology (staging · 2026-06-03)

How data flows through the SportHub

The SportHub is fed by three independent data planes that converge on one serving schema, hub.*, and it is deliberately kept apart from the legacy content.* Pulse feed.

(1) A live plane: Supabase edge functions on pg_cron pull fixtures, scores and standings from ESPN's free public API.
(2) A scraped plane: a standalone Python scraper running on Stevie's homelab (not Supabase, not the app) lands raw multi-source history into the isolated scrape.* schema, which a daily edge-function ETL folds into hub.*.
(3) An entity-resolution plane: pulse_curated.* reconciles the same club/player across every source.

The React frontend reads only hub.* via PostgREST.

Short answer to "where does scrape.* fit vs hub.*": scrape.* is the raw, messy, source-shaped landing zone (written only by the external scraper, never read by the app); hub.* is the clean, canonical, public-read layer the frontend queries.

The topology

[Topology diagram, summarised as text]

Plane 1 · live (ESPN): ESPN API (site.api.espn.com) → edge fns on pg_cron: fixtures-sync (hourly), match-detail-sync (1 min), + standings ingest → hub.*

Plane 2 · scraped (homelab) + Plane 3 · resolution: standalone scraper (Python · homelab) → scrape.* (55 raw tables: StatsBomb · Understat · Transfermarkt · TSDB · …) → scrape-etl (edge fn · daily, ⚠ resource-limited) → pulse_curated.* (canonical_clubs/players, resolution_queue) → hub.*

Serving layer → frontend: hub.* · serving schema (leagues · teams · matches · standings · match_events/stats/lineups/shots · players · follows · notification_subscriptions) → PostgREST .schema('hub'), public-read RLS → React Query → football/data/hooks.ts → SportHub UI

Separate · legacy main-pulse (NOT the hub): pulse-sports-sync · pulse-espn-news crons (*/5, 30s, hourly) → content.* (sports_events · pulse_articles) → legacy Pulse feed, NOT the SportHub

The four stages, walked through

1. Plane 1: live data from ESPN · Supabase edge functions on pg_cron

The "what's happening now" data — fixtures, scores, league tables — comes from ESPN's free public API (no key) via three edge functions on pg_cron. Nothing is hand-maintained.

Edge fn | Cron | Writes
pulse-fixtures-sync | 0 * * * * (hourly) | hub.matches (fixtures, scores, dates, status)
pulse-match-detail-sync | * * * * * (every min) | hub.match_events / match_stats / lineups (in-progress matches only)
standings ingest | (operational now) | hub.standings (the real ESPN table, ingested, not computed)

ESPN adapter capabilities = ['fixtures','liveState','standings']. The provider layer is abstracted (ESPN is the free primary; the dispatcher can fail over to other adapters).
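The failover behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the SportHub's actual provider layer: only the three capability names come from the text, while `ProviderAdapter`, `dispatch`, `load` and the `fallback` adapter are hypothetical names. The calls are synchronous to keep the sketch short; real adapters would presumably be async.

```typescript
// Capability names are taken from the text; everything else is hypothetical.
type Capability = "fixtures" | "liveState" | "standings";

interface ProviderAdapter {
  name: string;
  capabilities: Capability[];
  // Synchronous for brevity; a real adapter would likely return a Promise.
  load(capability: Capability): unknown;
}

interface DispatchResult {
  provider: string;
  data: unknown;
}

// Try adapters in priority order. Adapters that lack the capability are
// skipped, and a thrown error falls through to the next adapter.
function dispatch(adapters: ProviderAdapter[], capability: Capability): DispatchResult {
  const errors: string[] = [];
  for (const adapter of adapters) {
    if (adapter.capabilities.indexOf(capability) === -1) continue;
    try {
      return { provider: adapter.name, data: adapter.load(capability) };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(adapter.name + ": " + message);
    }
  }
  const detail = errors.length > 0 ? errors.join("; ") : "no adapter supports it";
  throw new Error('cannot serve "' + capability + '" (' + detail + ")");
}

// Example wiring: ESPN as the primary, plus a made-up fallback adapter.
const espn: ProviderAdapter = {
  name: "espn",
  capabilities: ["fixtures", "liveState", "standings"],
  load: () => {
    throw new Error("upstream timeout"); // simulate an outage
  },
};

const fallback: ProviderAdapter = {
  name: "fallback",
  capabilities: ["fixtures"],
  load: (capability) => ({ capability, rows: [] }),
};

// The ESPN call throws, so the fixtures request is served by the fallback.
const result = dispatch([espn, fallback], "fixtures");
```

Ordering the adapter list by priority keeps the primary/secondary decision in one place, and the capability check means a fallback never gets asked for data it cannot supply.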

2. Plane 2: the scraper · standalone, external, NOT edge functions

Where the scraping happens: a self-contained Python scraper (a pulse-scraper Docker container) on Stevie's local machine / homelab VPS — explicitly not inside Supabase and not a pg_cron job. It is scheduled by local cron / apscheduler.

Is it a one-time scrape? No — it does a one-time historical backfill (everything each source offers, going back as far as available) and then runs daily off-peak delta scrapes (01:00–05:00 UTC). It writes raw data into the isolated scrape.* schema (55 tables), one shape per source:

Source | scrape.* tables | What it holds
StatsBomb | statsbomb_events/_360_frames/_lineups/_matches | event-level data, freeze frames, lineups
Understat | understat_shots/_matches/_players/_teams | xG + shot coordinates
Transfermarkt | tm_clubs/_players/_transfers/_market_values/_news… | transfers, valuations, news
TheSportsDB | tsdb_teams/_players | badges + metadata
ESPN / fd.co.uk / OpenLigaDB | espn_matches · fdcouk_matches · openligadb_matches | match results across feeds
World Cup history | wc_* (~30 tables) | full WC history (squads, goals, refs…)
Wikidata / Wikipedia | wikidata_* · wikipedia_articles | cross-reference identifiers + prose
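The backfill-then-delta schedule described above can be expressed as a small decision function. The real scraper is Python and its scheduling code is not shown here, so this TypeScript sketch only illustrates the rule: `inDeltaWindow`, `nextMode` and `ScrapeMode` are hypothetical names, and treating the window as half-open (01:00 inclusive, 05:00 exclusive) and the backfill as unrestricted by the window are both assumptions.

```typescript
// Assumption: the off-peak window is half-open, [01:00, 05:00) UTC.
const WINDOW_START_HOUR_UTC = 1;
const WINDOW_END_HOUR_UTC = 5;

// True when `now` falls inside the daily delta-scrape window.
function inDeltaWindow(now: Date): boolean {
  const hour = now.getUTCHours();
  return hour >= WINDOW_START_HOUR_UTC && hour < WINDOW_END_HOUR_UTC;
}

type ScrapeMode = "backfill" | "delta" | "idle";

// Hypothetical rule: run the historical backfill once (assumed here to
// ignore the window), then only run delta scrapes inside the window.
function nextMode(backfillDone: boolean, now: Date): ScrapeMode {
  if (!backfillDone) return "backfill";
  return inDeltaWindow(now) ? "delta" : "idle";
}

// 02:30 UTC is inside the window; 05:00 UTC is just outside it.
const insideWindow = nextMode(true, new Date(Date.UTC(2026, 5, 3, 2, 30)));
const outsideWindow = nextMode(true, new Date(Date.UTC(2026, 5, 3, 5, 0)));
```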

A daily edge function — pulse-scrape-etl (0 6 * * *) — is the bridge: it reads scrape.*, resolves canonical entities into pulse_curated.*, then projects clean rows into hub.teams / hub.players / hub.match_shots. It currently fails with WORKER_RESOURCE_LIMIT (needs batching) — which is why hub.players and hub.match_shots are still empty.
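The text only says the ETL "needs batching"; one possible shape for that, sketched under stated assumptions, is a cursor-based loop that handles a bounded number of fixed-size pages per invocation and resumes from the saved cursor on the next run. None of these names (`runEtlSlice`, `ReadPage`, `makeReader`, `ShotRow`) come from the actual pulse-scrape-etl code, and the in-memory reader stands in for a query such as "rows with id greater than the cursor, ordered by id, limited to the batch size".

```typescript
// Hypothetical row shape; the real scrape.* columns are not shown in the text.
interface ShotRow {
  id: number;
  xg: number;
}

interface Page<T> {
  rows: T[];
  lastId: number; // id of the last row returned, or the input cursor if empty
}

type ReadPage<T> = (cursor: number, limit: number) => Page<T>;

interface EtlProgress {
  cursor: number;
  processed: number;
  done: boolean;
}

// Process at most `maxBatches` pages of `batchSize` rows, then stop and
// report where to resume, so one invocation does a bounded amount of work.
function runEtlSlice<T>(
  readPage: ReadPage<T>,
  writeBatch: (rows: T[]) => void,
  start: number,
  batchSize: number,
  maxBatches: number
): EtlProgress {
  let cursor = start;
  let processed = 0;
  for (let i = 0; i < maxBatches; i++) {
    const page = readPage(cursor, batchSize);
    if (page.rows.length > 0) {
      writeBatch(page.rows);
      processed += page.rows.length;
      cursor = page.lastId;
    }
    // A short page means the source is exhausted.
    if (page.rows.length < batchSize) return { cursor, processed, done: true };
  }
  return { cursor, processed, done: false };
}

// In-memory stand-in for a keyset-paginated read (input must be sorted by id).
function makeReader<T extends { id: number }>(all: T[]): ReadPage<T> {
  return (cursor, limit) => {
    const rows = all.filter((r) => r.id > cursor).slice(0, limit);
    const lastId = rows.length > 0 ? rows[rows.length - 1].id : cursor;
    return { rows, lastId };
  };
}

const source: ShotRow[] = [1, 2, 3, 4, 5].map((id) => ({ id, xg: 0.1 }));
let written: ShotRow[] = [];
const sink = (rows: ShotRow[]) => {
  written = written.concat(rows);
};

// First invocation handles two batches of two rows, the second finishes.
const firstRun = runEtlSlice(makeReader(source), sink, 0, 2, 2);
const secondRun = runEtlSlice(makeReader(source), sink, firstRun.cursor, 2, 2);
```

Persisting the returned cursor between invocations (for example in a small state table) would let the daily run pick up where the previous one stopped instead of re-reading everything; where that state would live is left open here.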

3. Plane 3: entity resolution · pulse_curated.*

The same club/player appears under a different name and id in every source. pulse_curated.canonical_clubs / canonical_players hold one canonical id per real entity, carrying every source's id (tm_club_id, statsbomb_team_id, understat_team_id, tsdb_team_id, wikidata_qid…). Ambiguous matches wait in resolution_queue. This is how the messy scrape.* rows get reconciled into one hub.teams row before serving.
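As a rough illustration of that lookup, the sketch below resolves a source-specific id to a canonical club and parks anything that does not match exactly once in a queue. The source-id column names are the ones listed above; `canonical_id`, the string id types, the `QueueItem` shape, `resolveClub` and the placeholder values are all assumptions made for the example, not the real pulse_curated.* definitions.

```typescript
// Source-id column names come from the text; value types are assumed.
type SourceIdColumn =
  | "tm_club_id"
  | "statsbomb_team_id"
  | "understat_team_id"
  | "tsdb_team_id"
  | "wikidata_qid";

interface CanonicalClub {
  canonical_id: string; // hypothetical column name
  name: string;
  tm_club_id?: string;
  statsbomb_team_id?: string;
  understat_team_id?: string;
  tsdb_team_id?: string;
  wikidata_qid?: string;
}

// Hypothetical shape for an entry awaiting manual or later resolution.
interface QueueItem {
  column: SourceIdColumn;
  sourceId: string;
  rawName: string;
}

// Return the canonical club when exactly one row carries this source id;
// otherwise (no match, or more than one) queue the row and return null.
function resolveClub(
  clubs: CanonicalClub[],
  queue: QueueItem[],
  column: SourceIdColumn,
  sourceId: string,
  rawName: string
): CanonicalClub | null {
  const matches = clubs.filter((club) => club[column] === sourceId);
  if (matches.length === 1) return matches[0];
  queue.push({ column, sourceId, rawName });
  return null;
}

// Placeholder data only; these ids are not real source identifiers.
const clubs: CanonicalClub[] = [
  { canonical_id: "club-1", name: "Example FC", tm_club_id: "tm-1", understat_team_id: "us-1" },
  { canonical_id: "club-2", name: "Sample United", tm_club_id: "tm-2" },
];
const queue: QueueItem[] = [];

const known = resolveClub(clubs, queue, "understat_team_id", "us-1", "Example");
const unknown = resolveClub(clubs, queue, "statsbomb_team_id", "sb-9", "Mystery Town");
```

Here `known` resolves to the `club-1` row, while `unknown` is null and leaves one entry in `queue`, mirroring how ambiguous matches wait in resolution_queue.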

4. Serving: hub.* → the frontend · the only schema the app reads

hub.* is the clean, canonical, public-read layer, exposed via PostgREST. The React frontend (pulse/sporthub) queries it directly with supabase.schema('hub') in football/data/hooks.ts. Current staging contents:

hub table | Holds | Source | Rows
leagues | 7 competitions + ESPN ids + logos | ESPN / seed | 7
teams | clubs + real ESPN badges | ESPN + scrape-etl | 135
matches | full-season fixtures + scores | ESPN (fixtures-sync) | 2130
standings | real league tables (ingested) | ESPN standings | 168
players | squads + nationality | scrape-etl | 0 ⚠
match_events / stats / lineups | live enrichment | match-detail-sync (live-only) | 0 ⚠
match_shots | shot maps + xG | scrape-etl (Understat) | 0 ⚠
follows / notification_subscriptions | user follow graph | app | 0

hub.matches is the hub's own match header — no foreign key into content.*. That isolation is the whole reason the schema was split out.
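For readers who want to see what a read against hub.* looks like on the wire, the sketch below builds the kind of PostgREST request that a supabase.schema('hub') query corresponds to: a GET under /rest/v1/ with an Accept-Profile header naming the schema. It does not reproduce football/data/hooks.ts; `buildHubRequest`, the example project URL, the `ANON_KEY` placeholder and the `league_id` filter column are assumptions for illustration.

```typescript
interface RestRequest {
  url: string;
  headers: Record<string, string>;
}

// Build (but do not send) a PostgREST read against a table in the hub schema.
// PostgREST selects a non-default schema for reads via the Accept-Profile header.
function buildHubRequest(
  baseUrl: string,
  anonKey: string,
  table: string,
  query: Record<string, string>
): RestRequest {
  const params = Object.keys(query)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(query[key]))
    .join("&");
  return {
    url: baseUrl + "/rest/v1/" + table + (params ? "?" + params : ""),
    headers: {
      apikey: anonKey,
      Authorization: "Bearer " + anonKey,
      "Accept-Profile": "hub",
    },
  };
}

// Placeholder project URL and key; league_id is a hypothetical column.
const standingsRequest = buildHubRequest(
  "https://example.supabase.co",
  "ANON_KEY",
  "standings",
  { select: "*", league_id: "eq.1" }
);
```

Because the schema is chosen per request, the same client can never reach content.* by accident through this path, which fits the isolation described above.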

5. Separate: legacy main-pulse · content.* (NOT the SportHub)

Don't confuse these. The original Pulse feed lives in content.*: content.sports_events (fed by pulse-sports-sync-fixtures */5 + pulse-sports-sync-live 30s) and content.pulse_articles (fed by the many pulse-espn-news-* crons). The SportHub was deliberately built as a separate hub.* schema so it owns its data and doesn't entangle with this legacy news/scores feed.