METHODOLOGY // How The Open Brief processes data

Methodology

This is a longer-than-usual methodology page on purpose. We think the only ethical way to publish about politicians is to be transparent about how the data is gathered and where we made calls. Read the first paragraph of any section if you want the gist; read the rest if you want to verify or contest.

The short version: we pull public records (parliament rosters, pecuniary interests, ministerial diaries, donations >$20k, Hansard, charities register) from their primary sources, plus news and commentary articles. AI helps us extract topics and assemble biographies; every AI-produced claim is checked back to a quoted source. Thresholds and model versions on this page are read from the live system; display caps are documented to match the briefing code and reviewed when the caps change.

P Principles & scope

The Open Brief aggregates public-record data about NZ politics: who’s in parliament, what they’ve declared, who they meet, what they say in chambers, and how the press / commentariat / lobby groups talk about them. The intent is observational transparency — a researcher’s scratchpad with provenance, not a verdict generator.

Display semantics are defamation-safe: every fact is shown with an "as observed on $date" stamp, every claim traces back to a quoted source, and conflicting data is rendered side-by-side rather than collapsed to a single “best” reading.

! What this platform won’t do

The negative space of the methodology — what we’ve committed to not doing. These are public commitments so we have something to point to when (not if) someone tries to push the platform off-mission.

  1. No prediction. We don’t forecast election results, corruption likelihood, conflicts of interest, or who will / should win anything. Every figure on the site is observational.
  2. No defamatory inference from aggregated data. An MP meeting a sector representative is information; it isn’t a claim of wrongdoing. We label, we don’t accuse.
  3. No scoring of MPs on ideology or trustworthiness. We’ll tell you what was said, by whom, when. We won’t roll that into a number that ranks them.
  4. No republishing of article bodies. Commentary and news bodies are captured for AI processing only and never rendered verbatim. Every excerpt links back to the source.
  5. No inference of private opinions or unrecorded behaviour. Hansard, pecuniary register, ministerial diaries, public commentary — that’s the universe.
  6. No paid placement, sponsored topics, or featured pins. The home-page pin list is editorial; nothing on the site is for sale.
  7. No external direction on what to surface. No party, candidate, sitting MP, active campaign, lobby group, media outlet, or funder gets to decide what appears on the site or where it appears.
  8. No money from political parties, candidates, sitting MPs, or active campaigns. If that ever changes, this line will say so before any money moves.

If we ever break one of these: the breach belongs in this section, dated, with the reason. A commitment that quietly changes is worse than one that’s never made.

S Data sources & cadence

Every dataset is pulled from a public, primary-source feed. Re-runs are safe by design: each ingest pass deduplicates against what already exists rather than re-importing.

Source | Where it comes from | What we keep | Notes
Parliament rosters | data.govt.nz open-data CSV | One row per MP per term | Retired MPs are never deleted; only new-term entries progress history.
Pecuniary register | Annual Parliament PDF, parsed and chunked per MP by an LLM | One row per declared item | Categories and item text are copied verbatim from the register.
Donations >$20k | Electoral Commission register | One row per disclosed donation | The page is bot-protected; we use a managed browser to fetch it. Smaller donations are not in the public register and are not on this site.
Charities | Charities Services register (open data) | One row per registered org, plus registration / deregistration events | Status flips emit a dated event visible on the org page.
Beehive diaries | Per-minister HTML scrape | One row per ministerial meeting | Only ministers we’ve explicitly added to the watch list are tracked.
Hansard | parliament.nz speech XML | One row per recorded contribution | Floor debates only; ministerial speeches and press releases come via Beehive.
News | RSS feeds for most outlets; managed browser for Stuff & NZ Herald (which block plain HTTP) | One row per article | Seven outlets: RNZ, Stuff, NZ Herald, ODT, 1News, Newsroom, The Spinoff. Where outlets publish a separate Māori-news desk (RNZ’s Te Manu Kōrihi, NZ Herald’s Kāhu, Stuff’s Pou Tiaki), we pull that section feed alongside the main one so Māori-perspective stories aren’t filtered out by the general-news desk. The Spinoff only exposes a whole-site feed; we filter for politics downstream.
Commentary | 31 RSS sources plus 18 sources whose front pages we crawl with a managed browser | One row per article | 49 sources total; classifications are listed under Source orientations. Includes Māori-perspective sources (E-Tangata, Waatea News, Tina Ngata, MATA with Mihingarangi Forbes, Te Hiku Media’s Haukāinga desk) so the corpus reflects discourse that mainstream outlets under-cover.
Social: Twitter / X | X API v2 Basic tier: one timeline pull per registered handle | One row per tweet (excluding pure retweets); quote tweets get a cross-link row | Registry combines a small curated list (party leaders, journalists, commentators) with every sitting MP whose handle the mp_handles ingester resolves from Wikidata, party-website scrape, or the operator-curated seed file. Cap: 10k tweets/month under Basic-tier quota.
Social: Reddit | Public r/<sub>.json endpoints, no auth needed (Bright Data’s pre-cleared Reddit dataset where datacenter IPs are blocked) | One row per submission; replies are not ingested | Curated subreddit registry: NZ-wide (newzealand, aotearoa, NewZealandPolitics, nzpolitics) plus city subs (auckland, wellington, chch, dunedin, tauranga). Author is the redditor handle; same person-resolution path as Twitter.
Social: YouTube | YouTube Data API v3: uploads playlist + top-N comment threads per channel; transcripts via yt-dlp | One row per video (including Shorts) and one per comment; comments chain to their video via parent_external_id | Registry of NZ political channels: the six parliamentary parties, selected MP channels, the coordinator lobby orgs, alt-media commentators, and mainstream outlets. Comments are pulled for videos inside a 14-day window.
Social: Facebook (organic posts) | Bright Data Facebook Pages-Posts dataset (asynchronous trigger→poll→snapshot). Distinct from the Meta Ad Library row below: these are organic page posts, not paid ads | One row per post on social_posts (sponsored posts dropped; those belong to the ads pipeline); engagement carries likes, comments, shares and video views | Registry of public Pages: the six parliamentary parties, party leaders & senior ministers, the article-named coordinator lobby orgs (Taxpayers’ Union, NZ Initiative, Family First, Hobson’s Pledge, Free Speech Union, Sensible Sentencing Trust, plus left-side balance), and a media baseline. Page crawls are slow, so the poll runs once daily and bounds each crawl to the last 7 days. Author is the Page slug; page→org attribution is not yet auto-resolved.
Social: Meta Ad Library (Facebook + Instagram) | Meta Graph API /ads_archive, with an identity-verified user token (gated to political / social-issue ads only) | One row per ad on its own table (meta_ads, not social_posts); ads carry spend and impressions as ranges rather than single numbers, plus demographic and regional breakdowns; no engagement counters because ads aren’t posts | Registry covers the six parliamentary parties plus the article-named coordinator lobby orgs (Taxpayers’ Union, NZ Initiative, Family First, Hobson’s Pledge, Free Speech Union, Sensible Sentencing Trust). Filter is the regulated “Paid for by…” byline disclaimer, with keyword fallback. Long-lived token auto-refreshes against the app secret before expiry; the cron is self-sustaining.
MP twitter handles | Wikidata SPARQL (P2002); party-website caucus pages via managed browser; operator-curated JSON seed | One handle per sitting MP, with provenance (which source resolved it) | Cheap sources land first; manual seed only fills remaining gaps. Coverage on the 54th parliament: ~71% of sitting MPs, biased toward the front bench.
OIA releases | FYI.org.nz per-authority feeds & request pages (Alaveteli, the platform behind WhatDoTheyKnow.com) | One row per strict-mode request that anchors to a known person or org; cover-letter text + section-of-act citations + downloaded attachment files | See OIA releases below for the four-gate filter (status, allow-listed authority, subject-period, tenure-aware anchor) and the defamation-safe defaults.

Honest note: ingest cadence is operator-driven, not auto-scheduled. The ingest passes are idempotent — each run only picks up what’s new or has changed.
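
A minimal sketch of what "idempotent" means here, assuming each row has a stable identifier and that change detection is done by hashing the row content. The names `ingest_pass`, `external_id` and the in-memory `store` are illustrative assumptions, not the platform's actual schema or code.

```python
import hashlib
import json


def content_hash(row: dict) -> str:
    """Stable hash of a row's content (keys sorted so ordering does not matter)."""
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ingest_pass(store: dict, rows: list) -> dict:
    """Apply one ingest pass and return counts of what actually changed.

    `store` maps a hypothetical external_id to (hash, row).
    """
    counts = {"inserted": 0, "updated": 0, "skipped": 0}
    for row in rows:
        key = row["external_id"]
        digest = content_hash(row)
        existing = store.get(key)
        if existing is None:
            store[key] = (digest, row)      # never seen before
            counts["inserted"] += 1
        elif existing[0] != digest:
            store[key] = (digest, row)      # seen, but the content changed
            counts["updated"] += 1
        else:
            counts["skipped"] += 1          # identical: re-run is a no-op
    return counts


store = {}
feed = [
    {"external_id": "don-1", "amount": 25000},
    {"external_id": "don-2", "amount": 40000},
]
first = ingest_pass(store, feed)
second = ingest_pass(store, feed)  # same feed again: nothing is re-imported
```

Running the same feed twice leaves the store unchanged on the second pass, which is the property that makes operator-driven re-runs safe.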

PR Identifying who's who

A name in a news article, ministerial diary, or Hansard speech is tied to a single canonical MP record using a normalised key: lowercased, with honorifics stripped (Rt Hon, The Hon, Hon, Rev Dr, Dr, Sir, Dame … matched longest-first), apostrophes removed, other punctuation collapsed to spaces. Macrons are preserved.
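
The normalised key described above could be sketched as follows. This is an illustration under stated assumptions: the honorific list contains only the titles named in the text (the real list is longer), the function name `person_key` is hypothetical, and the NFC normalisation step is an added precaution rather than something the text describes.

```python
import re
import unicodedata

# Only the honorifics named in the text; the source list continues ("…").
HONORIFICS = ["Rt Hon", "The Hon", "Hon", "Rev Dr", "Dr", "Sir", "Dame"]

# Longest-first so "Rt Hon" is tried before "Hon".
_alternatives = "|".join(
    re.escape(h) for h in sorted(HONORIFICS, key=len, reverse=True)
)
_HONORIFIC_RE = re.compile(rf"^(?:(?:{_alternatives})\.?\s+)+", re.IGNORECASE)


def person_key(name: str) -> str:
    """Build a matching key: honorifics stripped, lowercased, macrons kept."""
    # Assumption: compose characters first so "ā" is a single code point.
    key = unicodedata.normalize("NFC", name.strip())
    key = _HONORIFIC_RE.sub("", key)
    key = key.lower()
    key = re.sub(r"['\u2018\u2019]", "", key)   # apostrophes are removed outright
    key = re.sub(r"[^\w\s]", " ", key)          # other punctuation becomes a space
    return re.sub(r"\s+", " ", key).strip()


keys = [
    person_key("Rt Hon Christopher Luxon"),   # "christopher luxon"
    person_key("Hon Damien O'Connor"),        # "damien oconnor"
    person_key("Tākuta Ferris"),              # macron preserved
    person_key("LUXTON, Jo"),                 # "luxton jo"
]
```

Note that the feed form "LUXTON, Jo" and the display form "Jo Luxton" produce different keys under this sketch, which is why a second alias per MP is needed.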

Each MP carries up to two aliases — the public display form ("Jo Luxton") and the original feed form ("LUXTON, Jo") — so a mention written either way resolves to the same person. When a name is ambiguous (overlap between MPs, partial match), a Jaro-Winkler fuzzy comparison breaks the tie. Borderline cases land in a human-review queue rather than being silently merged.
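
For readers unfamiliar with Jaro-Winkler, here is a compact reference implementation plus a three-way decision. The similarity function follows the standard definition; the `AUTO_MATCH` and `REVIEW` cut-offs and the `classify` helper are placeholders, since the text does not state the thresholds the platform uses.

```python
def jaro(a: str, b: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    window = max(max(la, lb) // 2 - 1, 0)
    a_hit = [False] * la
    b_hit = [False] * lb
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [a[i] for i in range(la) if a_hit[i]]
    b_seq = [b[j] for j in range(lb) if b_hit[j]]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / la + matches / lb + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


# Placeholder thresholds: assumptions for illustration only.
AUTO_MATCH = 0.93
REVIEW = 0.85


def classify(score: float) -> str:
    """Map a similarity score to match / human review / no match."""
    if score >= AUTO_MATCH:
        return "match"
    if score >= REVIEW:
        return "review"
    return "no-match"
```

The middle band is what feeds the human-review queue: anything that is neither clearly the same name nor clearly different is held rather than merged.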

Honest note: first-name-only mentions ("Chris said…") fall back to the fuzzy match and may be missed or mis-attributed where two MPs share a first name. Macron-insensitive collation isn't applied yet, so "Māori" and "Maori" can produce different keys in some places.

OR Identifying organisations

Org names normalise via the same pipeline plus a legal-form suffix fold table (Ltd → limited, Inc → incorporated, LLP, PLC, Co). A "the " prefix is stripped. Apostrophes are collapsed (not just stripped) so "St Bernadette's" and "St Bernadettes" canonicalise identically — that's a deliberate choice to fix Charities Register duplicates.
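
A sketch of the organisation key under the same caveats as any illustration here. Only the two suffix mappings the text spells out (Ltd and Inc) are included; LLP, PLC and Co are listed in the text without their target forms, so they are left out rather than guessed. The function name `org_key` is hypothetical.

```python
import re

# Only the mappings stated in the text; the real fold table has more entries.
SUFFIX_FOLD = {
    "ltd": "limited",
    "inc": "incorporated",
}


def org_key(name: str) -> str:
    """Lowercase, collapse apostrophes, drop a leading "the", fold the legal suffix."""
    key = name.strip().lower()
    key = re.sub(r"['\u2018\u2019]", "", key)   # "bernadette's" -> "bernadettes"
    key = re.sub(r"[^\w\s]", " ", key)          # other punctuation becomes a space
    tokens = key.split()
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
    if tokens:
        tokens[-1] = SUFFIX_FOLD.get(tokens[-1], tokens[-1])
    return " ".join(tokens)


same = org_key("St Bernadette's") == org_key("St Bernadettes")
folded = org_key("The Example Trust Ltd.")   # hypothetical organisation name
```

Because the apostrophe is dropped before keys are compared, the original spelling cannot be recovered from the key alone, which is the lossiness the note below refers to.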

When duplicate organisation records are discovered, they’re merged into a survivor record. Directorships and meetings written before the merge still resolve to the survivor at display time — the older record stays in place for audit, but the page you read shows the canonical view.
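
Display-time resolution of merged records can be pictured as following a pointer from each merged record to its survivor. The `merged_into` mapping and the identifiers below are hypothetical; the text does not describe how the platform stores merges.

```python
def resolve_survivor(org_id: str, merged_into: dict) -> str:
    """Follow merge pointers until reaching a record that was never merged away."""
    seen = set()
    while org_id in merged_into:
        if org_id in seen:
            raise ValueError(f"merge cycle detected at {org_id}")
        seen.add(org_id)
        org_id = merged_into[org_id]
    return org_id


# org-17 was merged into org-4, which was later merged into org-2.
merged_into = {"org-17": "org-4", "org-4": "org-2"}

# Rows written before the merges still point at the old ids.
directorships = [("person-1", "org-17"), ("person-2", "org-2")]

# The stored rows are untouched; only the displayed view is canonical.
display = [(person, resolve_survivor(org, merged_into)) for person, org in directorships]
```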

Honest note: apostrophe collapse is lossy and irreversible. We've not seen a case where it's wrong, but it's a precedent for silent canonicalisation that aliases don't capture.

C Source orientations

Every commentary source carries two orthogonal classifications: kind (government / party / media-news / media-opinion / media-podcast / iwi-maori / lobby, plus social and social-ads for platform feeds) and lean (left / centre-left / centre / centre-right / right / n/a, or mixed for social feeds). Lobby sources additionally carry a sector.

How we labelled each source

Labels are assigned by us — not by an algorithm, an LLM, or an automated coverage analysis — using the following rubric, applied in order:

  1. Self-identification when it’s authoritative. Political parties are labelled by the party they are; lobby groups carry the sector their charter declares. We don’t second-guess "the National Party is a centre-right party".
  2. Established academic / third-party assessments where they exist for that source class:
    • The Manifesto Project’s RILE scale (party manifesto coding) for the lean of NZ political parties.
    • The AllSides five-point media-bias methodology as a reference for distinguishing left / centre-left / centre / centre-right / right.
    • The Bryce Edwards / Democracy Project taxonomy for NZ-specific commentary classifications where the AllSides frame doesn’t map cleanly.
  3. Editorial judgement for everything else — substacks, niche blogs, advocacy outlets without a published policy stance. We name this as editorial judgement rather than dressing it up as measurement.

Labels apply at the source level, not the article level. A centre-right outlet can publish a critical article about a centre-right party; the lean tag is about the outlet’s editorial centre of gravity, not any one piece of coverage.

If you’re a source and disagree with your label

Editorial judgement is contestable, and the labels here will attract scrutiny. The point of this section is to make the challenge process visible:

  1. Tell us what you think the label should be and why — ideally referencing the rubric above (self-identification, academic frame, or your own published positioning statement).
  2. We review, weighing your argument against the rubric and any published evidence. We may consult the Democracy Project or an equivalent NZ political-science source.
  3. The decision — whether or not we change the label — is logged on this page with the date and a one-line summary of the reasoning, regardless of outcome. A label that was quietly changed is harder to trust than a label that was publicly debated.

The intake process is being designed. Until a public form ships, requests can be sent to [email protected] with subject line “Source orientation review”. We’ll acknowledge within five working days. The contact will be replaced by an in-app form — this section will say so when that lands.

Slug | Display name | Kind | Lean | Sector | Note
beehive | Beehive (Govt) | government | n/a | | Official government press releases via beehive.govt.nz.
e-tangata | E-Tangata | iwi-maori | n/a | | Sunday political/cultural essays from a Māori/Pasifika lens.
waatea | Waatea News | iwi-maori | n/a | | Daily Māori news + politics.
tina-ngata | Tina Ngata | iwi-maori | n/a | | Decolonisation / Treaty commentary; activist voice.
rnz-mata | MATA (Mihingarangi Forbes) | iwi-maori | n/a | | RNZ podcast: newsmakers and Māori commentators on weekly NZ politics through a Māori lens.
tehiku-haukainga | Te Hiku Haukāinga | iwi-maori | n/a | | Te Hiku Media news desk for the Far North iwi (Ngāti Kuri, Te Aupōuri, Ngāi Takoto, Te Rarawa, Ngāti Kahu).
greenpeace-nz | Greenpeace Aotearoa | lobby | centre-left | environment | Environmental advocacy; centre-left on policy positions.
family-first | Family First NZ | lobby | right | social-conservative | Socially conservative campaigning org.
taxpayers-union | Taxpayers Union | lobby | right | tax-economic | Right-libertarian fiscal advocacy; runs the Curia poll.
ctu | CTU | lobby | left | labour | Council of Trade Unions; labour-aligned.
salvation-army | Salvation Army (NZ) | lobby | centre | social-progressive | Social-policy & parliamentary unit; broadly centre-left on poverty.
nz-initiative | NZ Initiative | lobby | right | business | Business-aligned think tank; right-libertarian.
hobsons-pledge | Hobson's Pledge | lobby | right | social-conservative | Anti-co-governance / one-law-for-all advocacy.
free-speech-union | Free Speech Union | lobby | right | civil-liberties | Free-speech advocacy; right-libertarian.
forest-and-bird | Forest & Bird | lobby | centre-left | environment | Conservation / environmental advocacy.
helen-clark-fdn | Helen Clark Foundation | lobby | centre-left | social-progressive | Centre-left policy think tank.
maxim-institute | Maxim Institute | lobby | centre-right | social-conservative | Faith-aligned policy think tank; centre-right on social.
groundswell | Groundswell NZ | lobby | right | rural | Rural / farmer protest movement; right-aligned on environmental regulation.
business-nz | BusinessNZ | lobby | centre-right | business | Peak business advocacy body.
cpag | CPAG | lobby | left | social-progressive | Child Poverty Action Group.
ema | EMA | lobby | centre-right | business | Employers and Manufacturers Association.
rnz | RNZ | media-news | centre | | Public broadcaster; centre by NZ convention.
stuff | Stuff | media-news | centre-left | | Stuff Group; centre-left on social issues per analyst consensus.
nzherald | NZ Herald | media-news | centre-right | | NZME flagship; centre-right editorial line.
odt | Otago Daily Times | media-news | centre | | Allied Press regional daily; centre.
tvnz | 1News (TVNZ) | media-news | centre | | TVNZ public broadcaster.
newsroom | Newsroom | media-news | centre | | Long-form independent; centre.
rnz-comment | RNZ Comment | media-opinion | centre | | RNZ "On The Inside" comment & analysis.
newsroom-opinion | Newsroom Opinion | media-opinion | centre | | Newsroom site feed; mostly centre with some centre-left.
spinoff | The Spinoff | media-opinion | centre-left | | Auckland-based digital outlet; reliably progressive on social issues.
pundit | Pundit | media-opinion | centre | | Watkin / Geddis / Easton et al: academic/journalist commentary.
conversation-nz | The Conversation NZ | media-opinion | centre-left | | Academic commentary; broadly centre-left by author pool.
kiwiblog | Kiwiblog | media-opinion | right | | David Farrar: ex-National pollster; reliable centre-right voice.
daily-blog | The Daily Blog | media-opinion | left | | Martyn Bradbury: left-progressive blog roster.
the-standard | The Standard | media-opinion | left | | Collective left-wing blog.
no-right-turn | No Right Turn | media-opinion | left | | Idiot/Savant: civil-liberties / OIA-driven; left-aligned.
karl-du-fresne | Karl du Fresne | media-opinion | centre-right | | Veteran journalist; centre-right cultural commentary.
homepaddock | Homepaddock | media-opinion | centre-right | | Ele Ludemann: rural-conservative blogger.
nzcpr | NZCPR | media-opinion | right | | Muriel Newman: NZ Centre for Political Research; right-libertarian.
point-of-order | Point of Order | media-opinion | centre-right | | Centre-right policy commentary.
werewolf | Werewolf | media-opinion | left | | Gordon Campbell: long-form left-wing political analysis.
democracy-project | Democracy Project | media-opinion | centre | | Bryce Edwards: annotated meta-aggregator.
the-kaka | The Kākā | media-opinion | centre-left | | Bernard Hickey: economics / political economy.
bowalley-road | Bowalley Road | media-opinion | left | | Chris Trotter: long-established left commentary.
blue-review | The Blue Review | media-opinion | centre-right | | Liam Hehir: centre-right newsletter.
verity-johnson | Verity Johnson | media-opinion | centre-left | | Op-ed columnist; broadly centre-left social.
working-group | The Working Group | media-podcast | centre | | Bradbury (left) + Grant (right) cross-partisan panel; weekly. Heavy politician guest density.
cross-party-lines-audio | Cross Party Lines (audio) | media-podcast | centre | | Goff (Lab) + Finlayson (Nat) + Collins panel; weekly. Companion text feed lives in commentary.cross-party-lines.
big-hairy-news | Big Hairy News | media-podcast | centre-left | | Brittenden + Chewie progressive news roundup; daily. DOC Studios.
labour-party | Labour Party | party | left | | Centre-left major party.
national-party | National Party | party | centre-right | | Centre-right major party.
act-party | ACT Party | party | right | | Right-libertarian / classical-liberal party.
nz-first-party | NZ First | party | centre-right | | Populist / nationalist party; positioned centre-right in policy.
twitter | Twitter / X | social | mixed | | X API v2 Basic tier feed; per-post stance/sentiment classified.
reddit | Reddit | social | mixed | | Public JSON via reddit-mcp; r/newzealand and sister subs.
facebook | Facebook (Meta) | social | mixed | | Meta Content Library posts; per-post stance/sentiment classified.
meta-ads | Meta Ads (Political) | social-ads | n/a | | Meta Ad Library political/issue ads; spend + reach.

Honest note: these are editorial calls, not measurements. New commentary sources we onboard show no orientation until they’ve been classified — until then they default to “n/a” lean in the discourse views.

Rh Source reach scoring

In plain English: a quote in Stuff (~2.3M monthly NZ readers) and a quote in a specialist political blog have very different audiences. Treating every source equally would silently weight a 500-reader Substack the same as a major masthead. We attach a 0–100 reach score to each source so visualisations can weight what they show by how many people are likely to encounter it — without crushing tail voices to nothing.

The technical version

Each source carries a single reach score in source_reach_score, recomputed by the reach ingester (ingester reach refresh && ingester reach score). The score is composed from three signals, tried in priority order, with a fourth "unknown" state when none of them is available:

  1. Anchored: a hand-curated monthly_uniques figure from a published source (Nielsen Online Ratings via NZ media trade press, NZ On Air "Where Are The Audiences?" report, or Roy Morgan readership). Honest measured data; load-bearing for the top NZ outlets where these are periodically published. Each anchor row stores the publishing source, the period, and a citation URL so every number on the table below traces back to its origin. Score formula: 100 × log10(uniques) / log10(5000000).
  2. Estimated: the Tranco global daily rank for the source's domain (free, no auth). Tranco itself is a research-grade composite of Cisco Umbrella + Majestic + Cloudflare Radar + Chrome UX Report. Used when no anchor exists. Ordering is reliable; the absolute number is a model output, not a measurement. Score formula normalises log-rank between the top cap (1000) and the list size (1000000): 100 × (1 − (log10(rank) − log10(1000)) / (log10(1000000) − log10(1000))). Rank is clamped to 1000 at the top (score = 100) and 1000000 at the bottom (score = 0).
  3. Inferred: an explicit editorial floor for known-credible-but-tail commentators that Tranco doesn't reach (e.g. Werewolf, Bowalley Road, a substack newsletter). Without a floor these would silently score zero and be excluded from any reach-weighted view. Floor values (typically 8–18) are editorial judgement, named as such.
  4. Unknown: no anchor, no Tranco rank, no floor. Score is 0 with confidence unknown rather than collapsing to estimated; the difference matters when reading the table.
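
The two formulas and the priority order above can be checked with a few lines of code. This sketch uses the constants stated in the text; the function names and the `reach_score` signature are illustrative assumptions, not the ingester's actual interface.

```python
import math

ANCHOR_CEILING = 5_000_000   # denominator in the anchored formula
TRANCO_TOP = 1_000           # ranks at or above this score 100
TRANCO_SIZE = 1_000_000      # ranks at or below this score 0


def anchored_score(monthly_uniques: float) -> float:
    """100 x log10(uniques) / log10(5,000,000)."""
    return 100 * math.log10(monthly_uniques) / math.log10(ANCHOR_CEILING)


def tranco_score(rank: float) -> float:
    """Log-rank normalised between the top cap and the list size, with clamping."""
    rank = min(max(rank, TRANCO_TOP), TRANCO_SIZE)
    span = math.log10(TRANCO_SIZE) - math.log10(TRANCO_TOP)
    return 100 * (1 - (math.log10(rank) - math.log10(TRANCO_TOP)) / span)


def reach_score(uniques=None, rank=None, floor=None):
    """Return (score, confidence), trying the signals in priority order."""
    if uniques is not None:
        return round(anchored_score(uniques), 2), "anchored"
    if rank is not None:
        return round(tranco_score(rank), 2), "estimated"
    if floor is not None:
        return float(floor), "inferred"
    return 0.0, "unknown"


stuff = reach_score(uniques=2_300_000, rank=3462)   # anchor wins over Tranco
clamped = reach_score(rank=977)                      # above the top cap
```

With the figures from the live table, 2.3M monthly uniques gives an anchored score of about 94.97 and a Tranco rank of 3462 gives about 82.02, matching the Stuff row.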

Quote weighting

Visualisations that aggregate by reach use weight = log1p(score) rather than the raw score. The dampening keeps the spread between Stuff (~95) and a small but credible blog (~15) within ~1.6× rather than ~6×, which is defensible without crushing tail credibility. This is a per-quote weighting for aggregation, not a per-source ranking shown to readers. The site does not present a "this source is more important than that one" leaderboard.
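
A quick arithmetic check of the dampening, using the two example scores from the paragraph above. The helper name `quote_weight` is illustrative.

```python
import math


def quote_weight(score: float) -> float:
    """Per-quote aggregation weight: natural log of (1 + score)."""
    return math.log1p(score)


raw_ratio = 95 / 15                                    # about 6.3
damped_ratio = quote_weight(95) / quote_weight(15)     # about 1.65
```

The natural log compresses the gap between the largest and smallest sources from roughly sixfold to under twofold, while a score of 0 still maps to a weight of 0.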

What the score is — and is not

  • Is: a per-source weight used inside aggregation maths so a quote in a 1.6M-reader site doesn't get arithmetically buried by a thousand quotes in 500-reader sites.
  • Is not: an editorial judgement of quality, accuracy, or credibility. A small blog can score low on reach and high on signal; the score addresses one axis only.
  • Is not: a complete substitute for reading the source. Two sources with the same reach score may have very different audiences (rural vs urban, generalist vs specialist). The score collapses "how many people see this" into one number; it can't capture WHO sees it.

Known limitations

  1. Global-domain over-credit. Some sources we ingest sit on global domains (theconversation.com, greenpeace.org). Tranco's rank reflects worldwide traffic, not the NZ-specific subset, and the estimated score over-credits their real NZ reach. Rows affected carry an upper-bound warning in the table below.
  2. Blogspot / Substack subdomains. Tranco ranks registered domains, not subdomains, so bowalleyroad.blogspot.com can't get a Tranco rank distinct from blogspot.com itself. Affected sources are scored via floor instead.
  3. Anchor figures are point-in-time leaks. Nielsen NZ Online Ratings are subscription data; we capture the most recent figure that's been published in trade press (Stoppress, Spinoff Media, NZH Media Insider). Refresh cadence is operator-driven, not real-time.
  4. Cloudflare Radar evaluated and rejected. We tried the Cloudflare Radar API as an additional signal; the bucketing was coarser than Tranco for our source mix and the NZ-filtered top 200 contained zero news/media domains. The token is in the env file for future cross-checks but unused in scoring.

Live source reach table

Every source we ingest, with its current score and confidence. Sorted by score descending. Read the confidence column before the number: anchored means a published audience figure; estimated means a Tranco-derived estimate; inferred means an editorial floor; unknown means we have no signal at all.

Source Score Confidence Breakdown
conversation-nz 100.0 estimated tranco_rank=977 tranco_score=100 
stuff 95.0 anchored anchor_period=2025-10 anchor_score=94.97 anchor_source=nielsen-via-trade-press anchor_uniques=2.3e+06 tranco_rank=3462 tranco_score=82.02 
nzherald 93.6 anchored anchor_period=2025-Q3 anchor_score=93.59 anchor_source=nielsen-via-trade-press anchor_uniques=1.86e+06 tranco_rank=3315 tranco_score=82.65 
rnz 92.9 anchored anchor_period=2025-10 anchor_score=92.85 anchor_source=nielsen-via-trade-press anchor_uniques=1.66e+06 tranco_rank=8692 tranco_score=68.7 
rnz-comment 92.9 anchored anchor_period=2025-10 anchor_score=92.85 anchor_source=nielsen-via-trade-press anchor_uniques=1.66e+06 tranco_rank=8692 tranco_score=68.7 
tvnz 91.3 anchored anchor_period=2025-Q2 anchor_score=91.27 anchor_source=roy-morgan anchor_uniques=1.3e+06 tranco_rank=38118 tranco_score=47.3 
spinoff 85.1 anchored anchor_period=2025-Q3 anchor_score=85.07 anchor_source=nielsen-via-trade-press anchor_uniques=500000 tranco_rank=53870 tranco_score=42.29 
greenpeace-nz 83.0 estimated tranco_rank=3239 tranco_score=82.99 
newsroom 82.8 anchored anchor_period=2025-Q3 anchor_score=82.76 anchor_source=nielsen-via-trade-press anchor_uniques=350000 tranco_rank=68409 tranco_score=38.83 
newsroom-opinion 82.8 anchored anchor_period=2025-Q3 anchor_score=82.76 anchor_source=nielsen-via-trade-press anchor_uniques=350000 tranco_rank=68409 tranco_score=38.83 
odt 80.6 anchored anchor_period=2025-Q3 anchor_score=80.58 anchor_source=nielsen-via-trade-press anchor_uniques=250000 tranco_rank=27193 tranco_score=52.18 
daily-blog 20.1 estimated tranco_rank=249150 tranco_score=20.12 
democracy-project 18.0 inferred floor_score=18 
the-kaka 18.0 inferred floor_score=18 
kiwiblog 15.7 estimated tranco_rank=338616 tranco_score=15.68 
working-group 15.0 inferred floor_score=15 
bowalley-road 12.0 inferred floor_score=12 
cross-party-lines-audio 12.0 inferred floor_score=12 
no-right-turn 12.0 inferred floor_score=12 
point-of-order 12.0 inferred floor_score=12 
tina-ngata 12.0 inferred floor_score=12 
waatea 10.8 estimated tranco_rank=472980 tranco_score=10.84 
big-hairy-news 10.0 inferred floor_score=10 
blue-review 10.0 inferred floor_score=10 
karl-du-fresne 10.0 inferred floor_score=10 
salvation-army 9.8 estimated tranco_rank=508646 tranco_score=9.79 
homepaddock 8.0 inferred floor_score=8 
verity-johnson 8.0 inferred floor_score=8 
national-party 7.7 estimated tranco_rank=587943 tranco_score=7.69 
e-tangata 4.9 estimated tranco_rank=713000 tranco_score=4.9 
the-standard 1.7 estimated tranco_rank=889186 tranco_score=1.7 
forest-and-bird 1.0 estimated tranco_rank=932376 tranco_score=1.01 
labour-party 0.8 estimated tranco_rank=945314 tranco_score=0.81 
act-party 0.0 estimated tranco_rank=1.219967e+06 tranco_score=0 
beehive 0.0 estimated tranco_rank=1.166039e+06 tranco_score=0 
business-nz 0.0 estimated tranco_rank=1.180417e+06 tranco_score=0 
cpag 0.0 estimated tranco_rank=3.164105e+06 tranco_score=0 
ctu 0.0 estimated tranco_rank=1.815686e+06 tranco_score=0 
ema 0.0 estimated tranco_rank=2.211362e+06 tranco_score=0 
family-first 0.0 estimated tranco_rank=2.208173e+06 tranco_score=0 
free-speech-union 0.0 unknown
groundswell 0.0 unknown
helen-clark-fdn 0.0 unknown
hobsons-pledge 0.0 unknown
maxim-institute 0.0 unknown
nz-first-party 0.0 estimated tranco_rank=2.71397e+06 tranco_score=0 
nz-initiative 0.0 estimated tranco_rank=1.289971e+06 tranco_score=0 
nzcpr 0.0 estimated tranco_rank=2.03065e+06 tranco_score=0 
pundit 0.0 estimated tranco_rank=4.377399e+06 tranco_score=0 
taxpayers-union 0.0 estimated tranco_rank=1.814108e+06 tranco_score=0 
werewolf 0.0 estimated tranco_rank=3.376909e+06 tranco_score=0 

Honest note: floor scores for tail commentators are editorial judgement, not measurement. The table makes that visible (confidence=inferred); we don't dress up an editorial call as a number with implied measurement precision.

O OIA releases

Official Information Act disclosures are ingested from FYI.org.nz, a NZ-localised fork of the WhatDoTheyKnow Alaveteli platform. Every release that survives the four filter gates below is written as one event (or several, for multi-anchor requests) against the entities it names — subject to a tenure-overlap check that ensures a 2018 OIA can never surface on a current MP's briefing unless that MP held the relevant role in 2018.

Filter gates (cheap-first)

  1. Status filter. Only requests whose FYI-level status is one of successful, partially_successful are considered. Refused / withdrawn / awaiting-classification requests are dropped, even if they cite section-of-act grounds we could analyse — the per-cycle reviewer queue (Phase 6) will handle those.
  2. Authority allow-list. Of the ~3,200 public bodies FYI catalogues, the ingester only polls the operator-curated subset that maps to existing alitheia organisations. Today that's 92 authorities (central agencies + crown entities + major councils + universities). Single-canonical-name collisions with charity rows are suppressed via the /admin/oia-authorities page rather than auto-merged.
  3. Subject-period resolution. Each request page goes through the fyi-oia-subject-period-2026-05a extractor to recover the time window the request was about (e.g. “briefings between 1 April 2024 and 31 March 2025”). If the LLM can't infer one, we fall back to the request date and stamp subject_period_source = request_date_fallback so downstream views can tell.
  4. Tenure-aware anchor. The fyi-oia-attribution-2026-05a pipeline scans title + response text for known people (surname match against the role-tenure registry) and known publicly-anchored organisations. For each person hit, positions.TenureOverlaps checks that the person held a role during the resolved subject period. If no entity survives the tenure check, the request is deleted (cascade kills its attachments and blobs). Tenure-overlap-required: true.
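
The tenure check in gate 4 is an interval-overlap test, and the cheap-first ordering means it only runs once the status and allow-list gates have passed. The sketch below assumes the subject period has already been resolved (gate 3); `Tenure`, `tenure_overlaps` and `passes_gates` are hypothetical names, not the `positions.TenureOverlaps` implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Tenure:
    start: date
    end: Optional[date]  # None means the role is still held


def tenure_overlaps(tenure: Tenure, period_start: date, period_end: date) -> bool:
    """True when the role was held at any point inside the subject period."""
    tenure_end = tenure.end or date.max
    return tenure.start <= period_end and period_start <= tenure_end


ACCEPTED_STATUSES = {"successful", "partially_successful"}


def passes_gates(status, authority, allow_list, tenures, period_start, period_end) -> bool:
    """Run the gates cheapest-first; stop at the first failure."""
    if status not in ACCEPTED_STATUSES:       # gate 1: status filter
        return False
    if authority not in allow_list:           # gate 2: authority allow-list
        return False
    # gate 3 (subject-period resolution) is assumed done by the caller
    return any(                               # gate 4: tenure-aware anchor
        tenure_overlaps(t, period_start, period_end) for t in tenures
    )


# Hypothetical minister whose role began in late 2023 and is ongoing.
minister = Tenure(start=date(2023, 11, 27), end=None)
allow = {"example-ministry"}

old_request = passes_gates(
    "successful", "example-ministry", allow, [minister],
    date(2018, 1, 1), date(2018, 12, 31),
)
current_request = passes_gates(
    "successful", "example-ministry", allow, [minister],
    date(2024, 4, 1), date(2025, 3, 31),
)
```

Here the 2018 request fails the tenure check and would be dropped, while the 2024-25 request anchors to the minister.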

What we store

  • The request title and final response cover-letter text (PII-redacted by FYI before publication).
  • Section-of-act citations (e.g. 9(2)(a), 18(c)(i)) extracted by regex.
  • Every released attachment, downloaded to blob storage and SHA-deduped. PDFs are optionally fed through the extract pipeline; the produced markdown lives at a sibling blob key.
  • One oia_response_published event per anchored mention, with the authority as the event's object pointer.
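
As an illustration of the regex extraction mentioned in the list above, the pattern below picks up citations shaped like the two examples given. It is an assumed pattern written for this page, not the one the pipeline uses, and it would need broadening for other citation styles.

```python
import re

# One or two digits followed by one or more parenthesised parts, e.g. 9(2)(a).
SECTION_RE = re.compile(r"\b\d{1,2}(?:\([0-9a-z]{1,4}\))+", re.IGNORECASE)


def extract_citations(text: str) -> list:
    """Return section-of-act citations in order of first appearance, deduplicated."""
    seen = set()
    found = []
    for match in SECTION_RE.finditer(text):
        citation = match.group(0)
        if citation not in seen:
            seen.add(citation)
            found.append(citation)
    return found


letter = "Some information is withheld under sections 9(2)(a) and 18(c)(i); see also 9(2)(a)."
citations = extract_citations(letter)
```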

Defamation-safe defaults

We never store the requester's name (citizen PII), and we never publish a mention that doesn't pass the tenure check. The request rows themselves are dropped — not stored unanchored — if no entity survives attribution. That trades volume for soundness: better to ingest one well-anchored release than to surface ten loosely-anchored ones that turn out to name the wrong person.

Honest note: the surname-match path can't disambiguate two MPs who share a surname. Phase 6 will route the ambiguous cases through the existing match-review queue. For now the pipeline skips surnames shorter than five characters to keep false positives off the briefing pages.

T Topic extraction

In plain English: AI reads each article and tags it with a few short topic phrases. Each new tag is compared against the existing list; if it’s very close to an existing topic we treat it as the same; if it’s clearly different we add it as new; if it’s ambiguous a second AI pass decides “same topic or not?”. No fixed master list — the topic taxonomy grows from what the corpus is actually talking about.

The technical version

Topics are emergent: extracted by an LLM from each commentary item then canonicalised against existing labels via embedding similarity. No fixed taxonomy.

  1. Extract. A local large-language model (Qwen 3.6 35B-A3B, run on our own hardware — no third-party API and no data sent off-site) reads each article and emits 3–7 short topic phrases plus a one-line stance / framing for each, and a one-sentence summary. The article body is truncated to 6000 characters before the prompt to keep responses snappy.
  2. Embed. Each topic phrase is converted into a vector representation using a 768-dimensional embedding model (BGE-base-en-v1.5).
  3. Canonicalise. The new vector is compared against existing topic vectors using cosine similarity. Three zones:
    • ≥ 0.85 → treated as another way of saying the existing canonical (automatic).
    • < 0.75 → treated as a brand-new canonical topic (automatic).
    • in between → a small verifier model is asked “is this the same political topic?”; its decision is cached so the same question is never re-asked.
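
The three zones can be sketched as follows. The thresholds come from the list above; the function names and the zone labels ("alias", "new", "verify") are illustrative assumptions, and the verifier call and its cache are left out.

```python
import math

SAME_THRESHOLD = 0.85  # at or above: another phrasing of the existing canonical
NEW_THRESHOLD = 0.75   # below: a brand-new canonical topic


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def canonicalise_zone(best_similarity: float) -> str:
    """Map the best similarity against existing topics to one of three zones."""
    if best_similarity >= SAME_THRESHOLD:
        return "alias"   # automatic: same topic
    if best_similarity < NEW_THRESHOLD:
        return "new"     # automatic: new canonical topic
    return "verify"      # ambiguous band: hand off to the verifier model
```

Note that a similarity of exactly 0.75 falls in the verifier band, because the "new topic" zone is a strict less-than.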

Honest note: the 0.85/0.75 thresholds were chosen from a BGE-base experiment baseline, not validated against the live NZ corpus. They've been stable enough not to refactor; that's not the same as having been measured.

Pr Press topic extraction

In plain English: the same AI that tags commentary articles also tags news articles. Both flow into the SAME canonical topic list, so when an opinion blog and an RNZ report talk about “cost of living”, they both land on the same canonical topic and we can compare them side-by-side.

The technical version

News topic extraction (M28) mirrors the commentary pipeline described above and writes to a parallel news_article_topics table. Crucially, the canonical topics table and the topic_aliases + topic_alias_verifications tables are shared between corpora — an alias phrasing seen first in a Stuff article and later in a Greens press release lands on a single canonical topic via embedding similarity. This is what makes the cross-corpus comparison views (the /press/comparison quadrant, the in-press panel on each topic detail page, the press × discourse rows on MP briefings) work without a manual mapping table.

What we don't do on news (yet): the stance classifier only runs on commentary, not news. News framing is implicit (lede choice, source selection, headline emphasis) rather than explicit opinion stance, and the existing 5-class classifier would mostly produce “neutral-explainer” on news content, diluting the signal it captures from commentary. A news-specific framing classifier is a future addition.

The seven news outlets currently in the corpus — RNZ, Stuff, NZ Herald, ODT, 1News, Newsroom, and The Spinoff — have lean assignments in source_orientations (see the table in the source orientations section). New outlets need both a registry entry in internal/sources/news/registry.go and a row in source_orientations.

So Social topic extraction

In plain English: the same canonical topic list also covers Twitter, Reddit, YouTube and Facebook posts. A tweet about “cost of living” and an opinion piece on the same subject land on one topic, so the social lens shows you which voices on which platforms are amplifying or pushing back on what the press is reporting.

The technical version

Social topic extraction (M68) follows the same pipeline as press and commentary: an LLM tags each post with up to four canonical topics, the labels resolve through the same topic_aliases table, and rows land on social_post_topics — a parallel table to news_article_topics and commentary_item_topics. Cross-corpus comparison (the social column on each topic detail page, the social anomaly cards on /social) reads from this shared canonical spine without a manual mapping.

Two classifiers run on top of the topic edges:

  • Per-post sentiment — whole-post emotional valence (positive / neutral / negative). Stamped on social_posts directly because sentiment is a property of the post, not of any one topic it touches.
  • Per-edge stance — whether a post that touches a topic is supportive, critical, or neutral toward that topic. Stamped on social_post_topics so the same post can be supportive of one topic and critical of another inside one tweet thread.

Honest note: the registry is built from a curated slice (party leaders + key journalists / commentators) plus every sitting MP whose handle the mp_handles ingester has resolved. Coverage is biased toward MPs who post on X — many backbenchers post primarily on Facebook / Instagram, and the unpaid post-style activity on those platforms is not currently in the corpus. Political and social-issue ads on Facebook / Instagram are ingested via the Meta Ad Library and live on a separate meta_ads table — they show up on spending / disclosure surfaces, not in the /social topic counts. The /social numbers should be read as “the X-and-Reddit slice of the political conversation”, not the totality of online political speech.

Px Press × discourse balance filter

In plain English: the comparison page only shows topics that are talked about in both corpora. A topic that's only ever in news, or only ever in opinion blogs, isn't a fair comparison — you'd be comparing something to nothing. We require a minimum number of items on each side over the trailing 12 weeks before a topic counts as “balanced”.

The technical version

The /press/comparison view applies a balance filter: a topic must have at least 5 items in EACH corpus over the trailing 12 calendar weeks to appear in the quadrant or the lead-lag table. This is enforced in db.CoverageGapTopics(); per-topic deep-dives (/press/topics/{slug} and /discourse/topics/{slug}) do not apply the filter and will render however much data exists.
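
The balance rule itself is small enough to state as code. This is a sketch under the assumption that each corpus is represented as 12 weekly item counts for one topic; it is not the body of db.CoverageGapTopics(), and the name is_balanced is hypothetical.

```python
MIN_ITEMS_PER_CORPUS = 5  # floor stated above, over the trailing 12 calendar weeks


def is_balanced(news_weekly: list[int], discourse_weekly: list[int]) -> bool:
    """True when BOTH corpora reach the minimum item count in the window."""
    return (
        sum(news_weekly) >= MIN_ITEMS_PER_CORPUS
        and sum(discourse_weekly) >= MIN_ITEMS_PER_CORPUS
    )
```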

Why filter. An asymmetric topic — one with consistent coverage in news but never in commentary, or vice versa — produces a degenerate comparison: the cross-correlation is undefined when one series is constant zero, and the quadrant chart collapses to one axis. The 5-item threshold is a heuristic floor: roughly “at least once every other week on each side” over the 12-week window. Below that, lead-lag estimates are dominated by single spikes rather than a real signal.

Trade-off. The filter excludes new topics (only just emerging in one corpus), niche topics (e.g. highly technical policy areas only the policy blogs cover), and corpus-asymmetric topics (e.g. lobby campaign launches that the press doesn't pick up). These are visible individually on /press and /discourse, just not in the comparison view.

Z-scores, not raw counts. The quadrant axes are each corpus's deviation from its own 12-week mean — (current_week_n - mean) / max(mean, 1). Strictly, that is a mean-scaled deviation rather than a textbook z-score, because the denominator is the mean and not the standard deviation. It puts a niche topic with consistent low coverage and a hot topic with consistent high coverage in the same frame: both sit near zero when running at their typical level. The quadrant rewards change, not volume.
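
A worked sketch of the axis formula above. Whether the 12-week window includes the current week is not stated here, so the window is passed in explicitly; the function name deviation_score is an assumption.

```python
def deviation_score(current_week_n: int, window_counts: list[int]) -> float:
    """(current_week_n - mean) / max(mean, 1) over the supplied weekly window."""
    if not window_counts:
        raise ValueError("window_counts must contain at least one week")
    mean = sum(window_counts) / len(window_counts)
    # max(mean, 1) stops very quiet topics from dividing by a near-zero mean.
    return (current_week_n - mean) / max(mean, 1)
```

A niche topic running at its usual 2 items a week and a hot topic running at its usual 100 both score 0.0; doubling the hot topic to 200 scores 1.0.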

Lead-lag. For each balanced topic we compute the Pearson correlation of the news-corpus weekly volume against the discourse-corpus weekly volume at lag offsets from -4 to +4 weeks, and report the lag with maximum |r|. With only 12 weekly samples this is a fragile estimate; we hide rows with |r| < 0.25 because below that threshold the “best” lag is essentially noise.
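
The lag search can be sketched in plain Python. The sign convention (a positive lag means news leads discourse) and the names pearson and best_lag are assumptions for this example; the -4 to +4 range and the 0.25 floor are taken from the paragraph above.

```python
import math
from typing import Optional


def pearson(xs: list[float], ys: list[float]) -> Optional[float]:
    """Pearson r, or None when a series is constant or too short."""
    n = len(xs)
    if n < 3 or n != len(ys):
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant series
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def best_lag(news, discourse, max_lag=4, min_abs_r=0.25):
    """Return (lag, r) with the largest |r|, or None if below the display floor.

    Assumed convention: lag > 0 pairs news[t] with discourse[t + lag],
    i.e. news moved first.
    """
    best = None
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = news[: len(news) - lag], discourse[lag:]
        else:
            xs, ys = news[-lag:], discourse[:lag]
        r = pearson(xs, ys)
        if r is not None and (best is None or abs(r) > abs(best[1])):
            best = (lag, r)
    if best is None or abs(best[1]) < min_abs_r:
        return None
    return best
```

At a lag of 4 only 8 of the 12 weekly pairs overlap, which is one concrete reason the estimate is fragile.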

Mt Media transparency findings

In plain English: the /media page turns the same detections that power the rest of the site into one-line, plain-English observations a non-analyst reader can pick up at a glance. Every claim links back to the underlying surface so a reader can verify or contest it.

The technical version

Section 1 findings are rule-based, not LLM-generated. We pick from four rule families, score each candidate by signal strength, and surface the top five.

  • Coverage gap — fires when the discourse z-score exceeds the press z-score by > 1.5 on the same balanced topic. Wording: “X is being discussed publicly but N of M outlets have been quiet”.
  • Lead-lag — fires when |Pearson r| > 0.7 and |best lag| ≥ 2 weeks on the cross-corpus weekly-volume series. Direction follows the sign of the lag.
  • Framing shift — fires when an alias-drift detection on a topic has been adopted by ≥ 3 outlets in the last 14 days.
  • Outlet spike — fires when a press-side anomaly card shows ratio ≥ 3.0 (current items vs prior 4-week mean).
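
The four firing conditions read naturally as predicates. This is a sketch: the thresholds are the ones listed above, but the function names, the guard against a zero prior mean, and the (signal_strength, finding) tuple shape are assumptions, and the actual signal-strength scoring is not reproduced.

```python
def coverage_gap_fires(discourse_z: float, press_z: float) -> bool:
    return discourse_z - press_z > 1.5


def lead_lag_fires(r: float, best_lag_weeks: int) -> bool:
    return abs(r) > 0.7 and abs(best_lag_weeks) >= 2


def framing_shift_fires(outlets_adopting_last_14_days: int) -> bool:
    return outlets_adopting_last_14_days >= 3


def outlet_spike_fires(current_items: int, prior_4wk_mean: float) -> bool:
    # Assumption: a zero prior mean never fires (no ratio can be computed).
    return prior_4wk_mean > 0 and current_items / prior_4wk_mean >= 3.0


def top_findings(candidates: list[tuple[float, str]], limit: int = 5) -> list[str]:
    """Keep the `limit` findings with the highest signal strength."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return [finding for _, finding in ranked[:limit]]
```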

Minimum denominators. Outlets with fewer than 20 articles in the 12-week window are hidden from the scorecard; topics with fewer than 5 items in either corpus are excluded from the gap and lead-lag analyses. These floors are deliberately conservative — a small denominator produces statistical noise that reads as certainty if rendered without context.

No scoring. The scorecard reports volume, lean (editor-assigned), lead-lag tendency, top topics, and coverage gaps. It does not rank outlets on bias, accuracy, or quality. Observations only.

Corrections. Every finding carries a per-line “Report this finding” link that opens the standard corrections form with the finding pre-filled. Submissions land in the same workflow as every other correction on the site; the public log is at /corrections/log.

St Stance classification

For every article-topic pairing, the topic extractor also emits a short framing phrase capturing the stance / angle taken. A separate small classifier (Qwen 3-4B, also local) reads each framing and assigns one of these labels:

  • supportive — endorses or argues for the topic / position.
  • critical — argues against; identifies a problem; calls for change.
  • dismissive — dismisses without substantive engagement.
  • neutral-explainer — informational; lays out the situation without taking sides.
  • mocking — satirical, sardonic, or sneering register.

Framings are paraphrased, not quoted. The framing phrase is generated by the topic extractor as its own short characterisation of how the piece treats the topic — it doesn’t need to appear verbatim in the source, and usually won’t. We render framings in italics rather than quotation marks so the difference is visible: italic = our paraphrase, any actual quotations would be in quote marks. The source link beside each framing is the canonical record; click through to read the original piece.

The five-way scheme draws on Boudana (2016) on supportive/critical asymmetry and Tandoc et al. (2018) on snark-as-distinct-register. It's narrower than full sentiment intensity, wider than binary supportive/critical (which collapses dismissal into criticism, masking a real rhetorical difference).

The model is constrained to return one of those five labels and a confidence score; we run it deterministically (temperature 0) with five concrete worked examples spanning the categories. The breakdown bar on a topic page only renders when at least 3 framings have been classified for that topic; below the threshold the section is hidden so readers don’t infer a ratio from too-few datapoints.
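
The minimum-framings rule for the breakdown bar might look like this. It is a sketch: the label set and the threshold of 3 come from this section, while the function name and the choice to return shares as fractions are assumptions.

```python
from collections import Counter
from typing import Optional

STANCE_LABELS = ("supportive", "critical", "dismissive", "neutral-explainer", "mocking")
MIN_FRAMINGS = 3  # below this the breakdown bar is hidden


def stance_breakdown(labels: list[str]) -> Optional[dict[str, float]]:
    """Share of each stance label, or None when too few framings are classified."""
    if len(labels) < MIN_FRAMINGS:
        return None
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in STANCE_LABELS}
```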

What stance does not mean

Stance is recorded per article, not per author or per source. A “critical” framing on one piece doesn’t make the source critical of the topic in general — many outlets publish a range of stances on the same topic over time. Aggregations on the topic page are read as “across all articles classified so far”, full stop. We don’t roll stances up to the source level or use them to characterise an author.

Validation

Stance is the highest-risk classification on the platform — it’s an interpretive call about authorial intent. Current validation is honest but limited:

  • The five-shot prompt was iterated against ~50 hand-graded framings during development to land on the example set currently in use.
  • We don’t yet have a regular validation pass against a held-out gold set; that’s a known gap and is on the roadmap.
  • Per-call confidence is recorded but not currently surfaced in the UI.

Known failure modes

The classifier is small (a few-billion-parameter local model) and can be expected to miss in roughly these cases:

  • Sarcasm and irony — particularly dry NZ political satire that depends on shared context. A piece that reads as supportive on the surface but is actually mocking can be miscategorised in either direction.
  • In-group humour and shibboleths — if a framing relies on knowing a particular figure or running joke, the model may not.
  • Te reo Māori phrases and Māori cultural context. The model has limited te reo coverage; framings that hinge on a Māori-language phrase or a tikanga reference may land as “neutral-explainer” when the actual stance is sharper.
  • Multi-stance articles. A long piece that criticises one aspect of a topic while supporting another is compressed into a single stance label per (article, topic) pairing. We don’t do paragraph-level segmentation.
  • NZ-specific cultural and political context the model wasn’t exposed to during training — regional issues, smaller-party positions, electorate-specific dynamics.

If you think a stance label is wrong

Misclassifications happen. Until a flag-from-page form ships, send the topic-page URL plus the framing you think is mislabelled to [email protected] with subject line “Stance review”. We’ll review, correct the label if warranted, and log the change.

An Anomaly detection

The detector runs three passes — topic, MP, source — comparing each subject's current calendar week against its prior 4-week mean. A row is emitted when:

  • current_n ≥ 4 (drops low-volume noise);
  • current_n / prior_mean ≥ 3.0;
  • prior_mean > 0 (no cold-start cards — "X never happened before, now it has" isn't a step-change in the same sense).
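
The three conditions combine into a single check. A minimal sketch, assuming the prior window arrives as a list of weekly counts; the name is_spike is hypothetical.

```python
MIN_CURRENT = 4   # drops low-volume noise
MIN_RATIO = 3.0   # current week vs prior 4-week mean


def is_spike(current_n: int, prior_weekly_counts: list[int]) -> bool:
    """True when the current week is a step-change against the prior mean."""
    if not prior_weekly_counts:
        return False
    prior_mean = sum(prior_weekly_counts) / len(prior_weekly_counts)
    if prior_mean <= 0:
        return False  # no cold-start cards
    return current_n >= MIN_CURRENT and current_n / prior_mean >= MIN_RATIO
```

With a prior mean of 4 items a week, 12 items fires (ratio exactly 3.0) and 11 does not.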

Before any pass runs, the detector checks that the corpus has at least 3 distinct prior calendar weeks of data. Without that history we can’t tell signal from cold-start noise, so the detector exits silently. The detector is also safe to re-run at any cadence: the same spike on the same day will only ever produce one card.

Honest note: the prior window is calendar-aligned. A Monday cron run sees full prior-week counts; a Friday cron sees a still-accumulating bucket and may flag late.

D How each Discourse view is built

Weekly digest cards

Top 5 topics by current calendar-week volume. Trend arrow compares against the prior 4-week mean. Each card carries the top 3 contributing sources and one sample framing (the longest non-empty framing this week).

12-week heatmap

Top 25 topics ranked by 4-week recency, against the last 12 calendar weeks. Cells with zero items are not emitted; the UI zero-fills.

Source × topic sparkline matrix

Top 10 topics × top 8 sources × 12 weekly counts. Source order is by total volume across the matrix; topic order by total volume desc. Each cell is an inline SVG sparkline coloured by source lean. Empty cells render as a dashed midline so “quiet” reads differently from “missing”.

Topic co-occurrence graph

Top 40 topics by 8-week volume; edges where two topics co-occurred in ≥ 2 articles. Node size scales with item count; quartile-bucketed colour categories. Layout is ECharts default force layout (no UMAP yet — positions vary across page loads).

Pe Pecuniary register

The annual Register of Pecuniary Interests is published as a single PDF. We parse it to text and then run a structured extraction prompt against each MP’s section, producing one row per declared item with the category heading, the original description, the named organisation (when one is clearly stated), and the role (Director / Trustee / Shareholder / Beneficiary, when stated).

Roles are read from the register text exactly as written; we don’t standardise variants. The MP page currently shows up to 200 directorship rows per MP — no pagination beyond that.

Honest note: role values are kept as written. A row that says "Director; Shareholder" is shown that way, not split into two normalised entries.

M Within-portfolio meetings

A meeting is flagged "within portfolio" when the counterparty organisation’s primary market sector intersects with the minister’s portfolio sectors. The portfolio → sector lookup is a hand-maintained map. Cross-cutting portfolios — ones that legitimately touch every market — are deliberately unmapped, because tagging every meeting under "Economic Growth" or "Trade and Investment" as in-portfolio would make the highlight meaningless.

Portfolio | Sectors
acc | healthcare
agriculture | agriculture_food
arts, culture and heritage | advocacy_ngo, retail_consumer
auckland | — deliberately unmapped —
biosecurity | agriculture_food
building and construction | construction_infrastructure
child poverty reduction | advocacy_ngo
children | advocacy_ngo
climate change | energy_resources
commerce and consumer affairs | — deliberately unmapped —
community and voluntary sector | advocacy_ngo
conservation | energy_resources
corrections | government_crown
courts | professional_services
customs | transport_logistics
defence | government_crown
deputy leader of the house | — deliberately unmapped —
disability issues | healthcare, advocacy_ngo
economic growth | — deliberately unmapped —
education | education
emergency management and recovery | government_crown
energy | energy_resources
environment | energy_resources
ethnic communities | advocacy_ngo
finance | financial_services
food safety | agriculture_food
foreign affairs | government_crown
forestry | agriculture_food
gcsb and nzsis | government_crown
government’s response to the royal commission’s report into historical abuse in state care and in the care of faith-based institutions | — deliberately unmapped —
health | healthcare
housing | construction_infrastructure
hunting and fishing | agriculture_food
immigration | government_crown
infrastructure | construction_infrastructure
internal affairs | government_crown
justice | professional_services
land information | construction_infrastructure
local government | government_crown
media and communications | technology_media
mental health | healthcare
ministerial services | — deliberately unmapped —
māori crown relations | government_crown
māori development | advocacy_ngo
national security and intelligence | government_crown
oceans and fisheries | agriculture_food
pacific peoples | advocacy_ngo
police | government_crown
prevention of family and sexual violence | advocacy_ngo, healthcare
public service and digitising government | technology_media, government_crown
racing | retail_consumer
rail | transport_logistics
regional development | — deliberately unmapped —
regulation | government_crown
resources | energy_resources
revenue | financial_services
rma reform | construction_infrastructure
rural communities | agriculture_food
science, innovation and technology | technology_media
seniors | healthcare, advocacy_ngo
small business and manufacturing | retail_consumer, professional_services
social development and employment | advocacy_ngo
social investment | financial_services
south island | — deliberately unmapped —
space | technology_media
sport and recreation | retail_consumer
state owned enterprises | government_crown, energy_resources, transport_logistics
statistics | government_crown
tertiary education | education
tourism and hospitality | retail_consumer
trade and investment | — deliberately unmapped —
transport | transport_logistics
treaty of waitangi negotiations | government_crown
veterans | healthcare, advocacy_ngo
whānau ora | advocacy_ngo, healthcare
women | advocacy_ngo
workplace relations and safety | advocacy_ngo, professional_services
youth | advocacy_ngo, education
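
The within-portfolio flag is a set-membership test against the map above. A sketch using four rows of the table; the dictionary and function names are assumptions, and unmapped portfolios are modelled as empty sets.

```python
# A few rows from the table above; unmapped portfolios carry an empty set.
PORTFOLIO_SECTORS: dict[str, set[str]] = {
    "health": {"healthcare"},
    "seniors": {"healthcare", "advocacy_ngo"},
    "transport": {"transport_logistics"},
    "economic growth": set(),  # deliberately unmapped
}


def is_within_portfolio(minister_portfolios: list[str], org_primary_sector: str) -> bool:
    """True when the counterparty's primary sector is in any portfolio's sectors."""
    sectors: set[str] = set()
    for portfolio in minister_portfolios:
        sectors |= PORTFOLIO_SECTORS.get(portfolio.lower(), set())
    return org_primary_sector in sectors
```

Because an unmapped portfolio contributes no sectors, a minister who holds only Economic Growth never produces a within-portfolio highlight.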

Briefings show up to 50 most-recent meetings per minister.

$ Donations

Donations come from the NZ Electoral Commission register of donations exceeding $20,000. Smaller donations are not on this site — they're not in the public register.

Donor names are recorded as disclosed. Addresses are stored in full for audit but reduced to the locality (city / town) on display. The heuristic: split on commas, take the last non-empty segment, strip a trailing 4-digit NZ postcode. "23 Armstrong Avenue, Carterton 5713" becomes "Carterton".
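
The display heuristic is short enough to show in full. This is a sketch of the three steps as described, not the site's code; the function name display_locality is an assumption.

```python
import re

# A trailing 4-digit postcode, either on its own or after whitespace.
TRAILING_POSTCODE = re.compile(r"(?:^|\s+)\d{4}$")


def display_locality(address: str) -> str:
    """Split on commas, take the last non-empty segment, strip the postcode."""
    segments = [part.strip() for part in address.split(",") if part.strip()]
    if not segments:
        return ""
    return TRAILING_POSTCODE.sub("", segments[-1]).strip()
```

One consequence of taking the last segment is that an address ending in a country or region name would display that name rather than the town.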

Ch Charities sidecar

For organisations registered as charities, we ingest the Charities Services register as a sidecar: NZBN, registration number, charitable purpose, main sector / activity / beneficiary, and registration / deregistration dates. Status flips (Registered → Deregistered) emit events visible on the org page.

B AI-assembled biographies

MP biographies are produced by an AI research agent that is given a name plus a small toolkit: it can search the web, fetch a page and store a snapshot, query the platform database for what we already know, and finalise structured claims. Each claim must be backed by at least one verbatim quote from a stored snapshot. Snapshots are content-addressed, so re-extracting the same page never double-counts as multiple sources.

Each claim is corroborated according to source tier:

State | Rule
confirmed | One Tier-1 (authoritative, e.g. Hansard) source or ≥ 2 distinct Tier-2 sources.
unverified | Single Tier-2 source, or Tier-3 only.
disputed | Sources disagree on dates, values, or other key fields.
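
The corroboration rules can be sketched as a small function. Two assumptions are made here and are not stated in the table: that "disputed" takes precedence over the other states, and that the input list already contains one tier number per distinct source.

```python
def corroboration_state(source_tiers: list[int], sources_disagree: bool) -> str:
    """source_tiers holds the tier (1, 2 or 3) of each distinct supporting source."""
    if sources_disagree:
        return "disputed"  # assumed to override the other states
    tier1 = sum(1 for tier in source_tiers if tier == 1)
    tier2 = sum(1 for tier in source_tiers if tier == 2)
    if tier1 >= 1 or tier2 >= 2:
        return "confirmed"
    return "unverified"  # single Tier-2 source, or Tier-3 only
```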

Honest note: “confirmed” is a corroboration claim, not a truth claim. We’re saying that one authoritative source, or at least two distinct Tier-2 sources, support this fact; we are not saying the fact itself is true.

What “admin review” actually means

We want to be precise here, because “reviewed” is the kind of word that quietly does a lot of work. Bios sit in an internal queue until a platform operator (us, not an independent editorial board) clears them. When we describe a bio as reviewed, the named reviewer has:

  1. Read the generated prose end to end.
  2. Clicked through each citation and verified that the quoted text actually appears in the source it points to (no hallucinated quotes, no off-by-one paragraph drift).
  3. Checked that the corroboration tags — confirmed, unverified, disputed — reflect the evidence on file.
  4. Removed or tagged any claim the cited evidence doesn’t actually support.

That’s a sourcing check, not editorial polish. The reviewer’s name and the date are recorded internally with the run; the public MP page deliberately doesn’t carry a “reviewed by” byline, because the reviewer is a platform operator and we don’t want to imply an independent editorial board that doesn’t exist. The internal record is auditable on request, and every page has a one-click flag a problem button so a real reader can challenge anything that slipped through.

Regeneration triggers a fresh review. If a bio is regenerated — because new sources have surfaced, the agent prompt has changed, or a previous review found problems — the new run lands back in the queue and the prior approval doesn’t carry over. The version you read is either “awaiting review” or “approved against this exact set of claims”, never partway between.

Treat every bio as a starting reference, not a final source: the corroboration tags are designed to make the level of confidence legible, not to replace your own cross-checking.

@R At-risk institutions early warning

The /at-risk surface flags NZ public-sector institutions exhibiting patterns consistent with coordinated institutional targeting. The four-stage detection model is informed by published research:

  • Levitsky & Ziblatt (2018) on democratic erosion through the gradual capture of constraining institutions rather than through dramatic constitutional rupture.
  • Benkler, Faris & Roberts (2018), Network Propaganda, on asymmetrically polarised media ecosystems and how grievance content propagates from partisan blogs through partisan outlets into mainstream press.
  • Phillips (2018), The Oxygen of Amplification, on the role of mainstream media repetition in laundering fringe narratives into general-audience cycles.
  • Hannah, Hattotuwa & Taylor (2022), The murmuration of information disorders, on Aotearoa-specific disinformation ecologies, including the targeting of regulatory and oversight bodies as a norm-setting vector for far-right policy positions.

Politically agnostic by design

The detector recognises a pattern, not an ideology. Coordinator-node-ness is a behavioural classification — an org or person whose activity pattern shows synchronous targeting of multiple regulators with predominantly grievance-framed content. Anyone exhibiting that pattern surfaces as a candidate, regardless of political alignment. We expect (and welcome) right-wing and left-wing coordinator networks to be evaluated under the same criteria. To demonstrate the score is not "everyone gets criticised", we run the same model on negative-control institutions (Reserve Bank, IRD, ACC) and expect them to fail the threshold.

Admit gate

Cascade detections are not auto-published. They land in an admin queue at /admin/at-risk-queue and require a human admit + a written narrative note before appearing on /at-risk. Rejections are logged with a reason and an optional suppression window; rejected cascades don't re-spawn until the window expires or the underlying detector version changes. This gate is the load-bearing design choice — the public claim that an institution is being targeted by a coordinated network is politically loaded, and a reviewable false-positive policy is essential. To be precise about what this gate isn’t: it’s a sourcing-and-sanity admit by a platform operator, not editorial polish by an independent editorial board.

The four stages (early warning only)

  1. Volume signal (Stage 1). Mention rate across discourse, press, audio, and social rises sharply against the institution’s own trailing baseline. Fires when current-week count is ≥ 4 items and at least 3× the trailing 4-week mean (or ≥ 8 items when the prior baseline is sub-1). Indicates increased attention; does not in itself imply coordination.
  2. Delegitimising framing (Stage 2). Items frame the institution as illegitimate or as overreaching its mandate, rather than engaging substantively. Fires when grievance-framed share ≥ 30% over the trailing 30 days and grievance items ≥ 4. The framing classifier is a 4-way LLM call distinguishing grievance from defence, neutral, and substantive criticism — the latter category protects legitimate accountability journalism from getting bundled with the playbook signal.
  3. Infrastructural amplification (Stage 3). Two or more confirmed coordinator nodes (orgs or people with coordinator_role set) authoring or posting in the trailing 14 days about the institution’s topics. Coordinator-node status is behavioural and admit-gated (see “Auto-discovery” below).
  4. Mainstreaming (Stage 4). Two or more news articles carrying the delegitimising frame in the trailing 14 days, and Stage 3 also fired. This captures the “oxygen of amplification” effect — the moment a partisan frame crosses into general-audience press cycles. Without Stage 3, Stage 4 cannot fire: press picking up a story is not by itself a coordination signal.

Stage 5 (policy capture — defunding / restructuring / abolition) is not part of the early-warning detector. By the time policy capture happens, the warning is no longer early. We track the outcome manually as an outcome_status field on the institution row so cascades can be reviewed retrospectively and the historical record stays intact.

Risk score

The headline number on each /at-risk row is computed as:

risk = (1 if Stage 1 fired else 0)
     + (Stage 2 grievance_share * 2)
     + (log10(coordinator_count + 1) * 3   when Stage 3 fired)
     + (2 if Stage 4 fired else 0)
risk *= speed_multiplier

speed_multiplier rewards a tight cadence between Stage 1 first-detected and the latest fired stage — floor 1.0 (single stage / wide gap), capped at 3.0 (all four stages firing within a week). The article's central thesis is "the speed is the point": an 8-month cascade with stages back-to-back reads very differently from the same end-state arrived at over years.
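
A runnable reading of the formula above. The exact function that maps stage cadence to speed_multiplier is not given on this page, so it is taken as an input here; applying the grievance term whether or not Stage 2 fired is one reading of the formula and should be treated as an assumption.

```python
import math


def risk_score(
    stage1_fired: bool,
    grievance_share: float,   # Stage 2 share, 0.0 to 1.0
    stage3_fired: bool,
    coordinator_count: int,
    stage4_fired: bool,
    speed_multiplier: float = 1.0,
) -> float:
    """Sum the stage terms, then scale by the cadence multiplier (1.0 to 3.0)."""
    if not 1.0 <= speed_multiplier <= 3.0:
        raise ValueError("speed_multiplier is floored at 1.0 and capped at 3.0")
    score = 1.0 if stage1_fired else 0.0
    score += grievance_share * 2
    if stage3_fired:
        score += math.log10(coordinator_count + 1) * 3
    if stage4_fired:
        score += 2
    return score * speed_multiplier
```

For example, with all stages fired, a 50% grievance share, 9 coordinator nodes and a multiplier of 2.0, the terms are 1 + 1 + 3 + 2 = 7, giving a score of 14.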

Auto-discovery

Tracked institutions are not hand-curated. A discovery pipeline runs over canonical topic labels, applies a heuristic name-pattern match (* Authority, * Commission, * Council, * Board, Office of the *, etc.), and an LLM classifier confirms each candidate as an NZ public-sector regulator / oversight body / commission / tribunal / research body. Confirmed candidates flow through the admit gate before they're tracked. Coordinator-node candidates use the same auto-discovery + admit pattern.

Stage 5 caveats and what's missing

The current detector does not ingest Parliament bills, select committee submissions, or continuous Hansard polling. Stage 5 evidence (members' bills, ministerial announcements) therefore relies on whatever surfaces in commentary scrapes of beehive.govt.nz and on-demand Hansard searches. This is a known gap; closing it is in the Phase 2 roadmap so the detector can move from early-warning to outcome confirmation.

In Insights overview

The Insights page surfaces cross-corpus analytical views that cut across the news, commentary, social, and audio corpora. Each section answers a different question about NZ’s political information ecosystem. All analysis is derived from the same underlying data the rest of the site presents — no additional data sources or models are used. The insights are period-aware (tied to the site-wide period picker) and refresh with each page load.

The seven sections below document the analytical method behind each view. All use SQL aggregations over the existing tables; no machine learning is applied beyond the topic extraction and stance classification already documented above.

AT Actor trajectories

Question: Who is gaining or losing attention across channels, and is the sentiment around them shifting?

The actor trajectory view aggregates per-person mention counts across four channels (news articles via news_article_mentions, commentary via commentary_item_mentions, audio via audio_item_mentions, and social posts via social_posts.author_person_id) into weekly bins over a trailing 6-week window.

The top 15 actors by total mention volume are selected. For each, the sentiment arc is derived from stance-classified commentary edges: the ratio of supportive vs critical/mocking stances on commentary items that mention the actor, grouped by week.

Limitations

  • Social post mentions pipeline is not yet wired (0 rows), so the social channel only counts posts authored by MPs, not posts about them.
  • Sentiment is derived from commentary stance only, not from news or audio tone.
  • Mention counts are not normalised by corpus volume — a week with more articles will show higher counts even if the actor’s share-of-voice is unchanged.

AS Agenda setting

Question: Which channel surfaces topics first, and where does each topic’s coverage actually live?

The first-mover donut counts how often each channel (commentary, news, social, audio) published the earliest item for a given topic across the trailing window. Only topics appearing in 2+ channels are counted.

The channel share-of-voice table shows where each topic’s coverage actually lives, normalised for channel volume. The method is a two-step normalisation:

  1. Channel normalisation: each channel’s raw item count for a topic is divided by the channel’s total item count in the period. This converts raw counts into “% of this channel’s attention.” A topic with 50 news articles out of 7,000 total = 0.7% of news attention; 500 social posts out of 60,000 total = 0.8% of social attention.
  2. Rebasing: the four normalised channel values are then rescaled to sum to 100% for each topic. This answers “of the attention this topic gets across all channels (adjusted for volume), what share comes from each channel?”
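
The two steps can be checked with the figures used in step 1. A minimal sketch; the function name channel_share and the dictionary shapes are assumptions.

```python
def channel_share(
    topic_counts: dict[str, int], channel_totals: dict[str, int]
) -> dict[str, float]:
    """Normalise each channel's count by its total, then rebase to 100%."""
    # Step 1: share of each channel's attention that this topic takes.
    normalised = {
        channel: topic_counts.get(channel, 0) / total
        for channel, total in channel_totals.items()
        if total > 0
    }
    # Step 2: rescale so the channel values sum to 100 for this topic.
    denominator = sum(normalised.values())
    if denominator == 0:
        return {channel: 0.0 for channel in normalised}
    return {channel: 100 * value / denominator for channel, value in normalised.items()}
```

With 50 of 7,000 news articles and 500 of 60,000 social posts, the rebased split is roughly 46% news and 54% social, even though the raw counts differ tenfold.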

Topics are sorted by total volume (most discussed first). Only topics with ≥10 total items across ≥2 channels are included. The raw item count is shown underneath each percentage so readers can judge the absolute scale.

Limitations

  • Publication timestamps are taken as-is from RSS feeds and scrapes; ingestion delay (the time between publication and our first poll) is not subtracted, so the first-mover signal is approximate.
  • A topic appearing first in commentary may reflect a press release (which we ingest as commentary) rather than independent commentary.
  • Normalisation compensates for volume differences between channels but does not account for reach. A single RNZ article may reach more people than 500 Reddit posts, yet each article or post counts as exactly one item in its own channel.
  • The channel totals used for normalisation include all topic-tagged items, not just political ones. A channel with a large non-political corpus (e.g. YouTube) will have its political topics appear as a smaller share of total output.

TL Topic lifecycle

Question: How long do topics persist, and what shape does their activity arc take?

Topics are classified into four lifecycle stages based on their weekly volume series over the trailing 6 weeks:

  • Persistent — current volume ≥ 70% of peak volume. The topic is still near its maximum and shows no sign of declining.
  • Decaying — current volume has dropped below 70% of peak and is not recovering. The default state for most topics past their initial spike.
  • Flash — the topic appeared in only one calendar week. A single-week spike that didn’t persist.
  • Resurgent — the topic’s most recent week showed higher volume than the week before (a bounce), but it has not re-reached its peak.
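The four stage definitions can overlap (a topic can bounce week-on-week while also sitting below 70% of peak), so any implementation needs an order of precedence. The sketch below assumes one plausible order: flash, then persistent, then resurgent, with decaying as the fallback. That ordering, the function name, and the input shape are assumptions made for illustration, not a description of the live classifier.

```python
def classify_lifecycle(weekly):
    """Classify a topic from its weekly item counts (oldest first, trailing 6 weeks).

    The precedence flash > persistent > resurgent > decaying is an assumption.
    """
    active_weeks = sum(1 for v in weekly if v > 0)
    if active_weeks == 0:
        raise ValueError("topic has no activity in the window")
    if active_weeks == 1:
        return "flash"  # appeared in only one calendar week

    peak = max(weekly)
    current = weekly[-1]
    if current >= 0.7 * peak:
        return "persistent"  # still within 70% of peak
    if current > weekly[-2]:
        return "resurgent"  # week-on-week bounce, but still below the 70% line
    return "decaying"


for series in ([0, 0, 0, 12, 0, 0], [5, 8, 10, 9, 9, 8],
               [2, 20, 10, 4, 2, 5], [2, 20, 10, 6, 4, 3]):
    print(series, classify_lifecycle(series))
```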

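Half-life (in weeks) is estimated as (active_weeks − 1) × ln(0.5) / ln(current / peak): the half-life implied by the ratio of current to peak volume, assuming exponential decay across the observed active window. Because ln(current / peak) is zero when current equals peak, the estimate is undefined for topics still at or above peak; it is also undefined for single-week flashes, where the active window has zero length.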
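As a worked illustration of the half-life formula, the sketch below returns None for the undefined cases. Treating a current volume of zero as undefined is an extra assumption (the logarithm has no value there); the function name is hypothetical.

```python
import math


def half_life_weeks(active_weeks, current, peak):
    """Half-life in weeks implied by the current/peak ratio, or None if undefined."""
    if active_weeks <= 1:
        return None  # single-week flash: no elapsed window
    if current >= peak:
        return None  # still at or above peak: ln(current / peak) is not negative
    if current <= 0:
        return None  # assumption: zero current volume is treated as undefined
    return (active_weeks - 1) * math.log(0.5) / math.log(current / peak)


# A topic active for 5 weeks whose volume fell from a peak of 40 to 10 items:
# the ratio is 0.25, i.e. two halvings over 4 elapsed weeks.
print(half_life_weeks(5, 10, 40))
```

In that example the volume halved twice over four elapsed weeks, so the implied half-life is two weeks.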

Limitations

  • Half-life assumes exponential decay, which is a simplification. Real topic decay curves are often step-shaped (a few days of coverage, then silence).
  • Topics with fewer than 5 total items across all channels are excluded.

M$ Money & narrative

Question: Do organisations that spend more money (via donations or political advertising) receive more media coverage?

The view joins three data sources: donations (from the Electoral Commission’s >$20k disclosure register), Meta ad spend (from the Facebook Ad Library API, using the upper spend estimate), and media mentions (news + commentary articles that mention people affiliated with each organisation via the party_current join).

Social sentiment is drawn from posts authored by the organisation’s own social accounts (social_posts.author_org_id) rather than from posts about the organisation, so it reflects the org’s output tone, not public reception.

What this is not

This is a transparency table, not a causation claim. Larger parties naturally receive more donations, spend more on advertising, and attract more news coverage. The table lets readers see all three dimensions side-by-side; drawing causal inferences requires controlling for party size, incumbency, and news cycle dynamics, which we do not attempt.

Limitations

  • Donation data covers only the >$20k register, which is election-cycle-dependent. Gaps between cycles are by design, not data loss.
  • Ad spend uses the upper estimate from the Meta API, which can overstate actual expenditure.
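  • Coverage is attributed to the party via people.party_current, so it does not account for MPs who have changed party: coverage of such an MP is attributed to their current party, including items published before the switch.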

“Follow the money”: third-party & union spend

A companion view on the Insights page widens the lens beyond parties to third parties — unions, lobby groups and issue campaigns. It draws on four disclosed money signals, deliberately split into two separate views because their disclosure windows are not comparable: a live current-cycle chart (donations + Meta + Google) and a clearly-dated historical block (the 2023 regulated promoter returns, shown on their own and never summed with live data). The four signals:

  1. Donations received — Electoral Commission >$20,000 register (parties only).
  2. Meta ad spend — Facebook/Instagram Ad Library, upper estimate, linked to the advertiser’s org via the resolved meta_ads.page_org_id.
  3. Google ad spend — Search/YouTube, from the public BigQuery dataset bigquery-public-data.google_political_ads.advertiser_stats filtered to NZ advertisers (whole-NZD figures, linked via google_ad_spend.advertiser_org_id).
  4. Regulated promoter expenses — the only source with a per-medium breakdown, including newspaper and other print. Taken from the Electoral Commission’s registered third-party promoter expense returns; the headline total is scraped from the summary page and the itemised lines are read from each return PDF (these are filled AcroForm PDFs that ordinary text extraction misses, so we decode them with a document-capable LLM and only keep a return’s line items when they reconcile to its stated total within 5%).
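The 5% reconciliation check on decoded promoter returns can be sketched as below. It assumes the tolerance is measured relative to the return's stated total, which the text does not spell out; the function name and the sample amounts are hypothetical.

```python
def reconciles(line_items, stated_total, tolerance=0.05):
    """True when the decoded line items sum to within 5% of the stated total.

    Assumption: the 5% is relative to the stated total, not to the itemised sum.
    """
    if stated_total <= 0:
        return False  # nothing meaningful to reconcile against
    return abs(sum(line_items) - stated_total) <= tolerance * stated_total


# Hypothetical return: itemised lines sum to 105,000 against a stated 108,000.
print(reconciles([40_000, 35_000, 30_000], 108_000))
# A decode that missed a line falls outside the tolerance and is discarded.
print(reconciles([40_000, 35_000], 108_000))
```

When the check fails, only the line items are dropped; the headline total scraped from the summary page is a separate value and is unaffected.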

Different disclosure windows — the totals are not a single comparable period. This is the most important caveat for reading the chart honestly:

  • Donations publish in real time but only in election years; sparse between cycles by design (see the Donations section). We are currently in the 2026 election year.
  • Promoter returns cover the 2023 General Election only, regulated period 14 July–13 October 2023, and only promoters who spent over $100,000. The 2026 returns are filed after the election, so 2026 promoter spend is not yet disclosed.
  • Meta is rolling (every political/issue ad in our corpus, ~2020 onward).
  • Google is a cumulative advertiser lifetime total over the programme’s history.

Read each bar as “total disclosed money we can currently see for this actor,” not as spend within one timeframe.

Organisation identity, lean, and unions

All four signals are rolled up onto one canonical organisation row, so an actor advertising under several names (“ACT” on Meta, “ACT New Zealand” in the donations register) counts once. Trade unions are tagged with a dedicated sector and surfaced distinctly, in response to questions about union political spending. The left/right filter uses a hand-curated lean attached to each tracked party, union and lobby; organisations we have not classified carry no lean and appear only in the unfiltered view. Lean is an editorial judgement, not a measurement — treat it as a navigation aid, not a verdict.

Independent (third-party) advertising is, by law, separate from a party’s own campaign. Where a union both donates to a party and runs its own ads, that is alignment, not evidence of coordination or that donation money funded the ads — NZ disclosure does not expose any such link, and we do not assert one.

EC Echo chambers & framing divergence

Question: Are left-leaning and right-leaning commentary sources talking about the same topics but framing them very differently?

For each canonical topic covered in commentary during the trailing 4 weeks, the view groups items by source orientation (from source_orientations.lean): left / centre-left vs right / centre-right. It then computes the dominant stance (MODE()) from each orientation group and a divergence score: the absolute difference in the ratio of supportive-to-total stances between left and right.

A divergence score of 0% means both sides have the same share of supportive items. A score of 100% means one side is entirely supportive while the other has no supportive items at all (only critical, mocking or dismissive ones). Topics with fewer than 3 items from either orientation are excluded to avoid noise.
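A minimal sketch of the divergence calculation for a single topic, assuming items arrive as (lean, stance) pairs. The exact label strings, the function name, and the return shape are illustrative assumptions; the dominant stance is computed with a simple most-common count standing in for MODE().

```python
from collections import Counter

LEFT = {"left", "centre-left"}
RIGHT = {"right", "centre-right"}


def divergence(items, min_items=3):
    """Return dominant stances and the divergence score (%) for one topic.

    items: iterable of (lean, stance) pairs (hypothetical shape).
    Returns None when either side has fewer than min_items items.
    Centre sources are ignored here, matching the limitation noted below.
    """
    left = [stance for lean, stance in items if lean in LEFT]
    right = [stance for lean, stance in items if lean in RIGHT]
    if len(left) < min_items or len(right) < min_items:
        return None

    def supportive_ratio(stances):
        return sum(1 for s in stances if s == "supportive") / len(stances)

    def dominant(stances):
        return Counter(stances).most_common(1)[0][0]

    return {
        "left_dominant": dominant(left),
        "right_dominant": dominant(right),
        "divergence_pct": 100.0 * abs(supportive_ratio(left) - supportive_ratio(right)),
    }


sample = [
    ("left", "supportive"), ("centre-left", "supportive"),
    ("left", "supportive"), ("left", "critical"),
    ("right", "critical"), ("centre-right", "critical"),
    ("right", "supportive"), ("right", "mocking"),
    ("centre", "supportive"),  # not used in the calculation
]
print(divergence(sample))
```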

Limitations

  • Source orientation labels are hand-curated. New sources are unclassified until an editor assigns a lean.
  • The divergence score compares supportive vs critical/mocking/dismissive ratios; it does not capture framing divergence (same conclusion, different language).
  • Centre sources are counted but not used in the divergence calculation.

AC Cross-channel anomaly clusters

Question: Which topics triggered volume spikes in multiple channels simultaneously?

The existing anomaly detector (documented above under Anomalies) fires independently for social, discourse, and news. This view joins the three anomaly tables on subject_kind = 'topic' and topic_id, looking for topics that appear in two or more of the three tables within the selected period.

Synchronised anomalies are stronger signals than isolated spikes. When social media, commentary, and news all spike on the same topic in the same period, a genuine developing story is a more likely explanation than random noise. The cluster cards display the per-channel ratio so readers can see which channel spiked hardest.
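The cluster join can be sketched in memory as a group-by on topic_id. The record shape (channel, subject_kind, topic_id, ratio, dismissed) and the function name are hypothetical stand-ins for the three anomaly tables; keeping the highest ratio when a channel fires more than once is also an assumption.

```python
def anomaly_clusters(anomalies, min_channels=2):
    """Group topic anomalies by topic_id and keep topics seen in 2+ channels.

    anomalies: iterable of dicts with keys channel, subject_kind, topic_id,
    ratio and (optionally) dismissed. This shape is hypothetical.
    Returns {topic_id: {channel: highest ratio}}.
    """
    by_topic = {}
    for a in anomalies:
        if a["subject_kind"] != "topic" or a.get("dismissed", False):
            continue  # non-topic subjects and dismissed anomalies are excluded
        per_channel = by_topic.setdefault(a["topic_id"], {})
        # Assumption: if a channel fired more than once, show its largest ratio.
        per_channel[a["channel"]] = max(a["ratio"], per_channel.get(a["channel"], 0.0))
    return {t: chans for t, chans in by_topic.items() if len(chans) >= min_channels}


sample = [
    {"channel": "social", "subject_kind": "topic", "topic_id": 1, "ratio": 3.2},
    {"channel": "news", "subject_kind": "topic", "topic_id": 1, "ratio": 2.1},
    {"channel": "social", "subject_kind": "topic", "topic_id": 2, "ratio": 4.0},
    {"channel": "social", "subject_kind": "topic", "topic_id": 3, "ratio": 2.5},
    {"channel": "news", "subject_kind": "topic", "topic_id": 3, "ratio": 2.2,
     "dismissed": True},
    {"channel": "news", "subject_kind": "person", "topic_id": 1, "ratio": 9.9},
]
print(anomaly_clusters(sample))
```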

Limitations

  • The join is on topic_id, not on time window — two anomalies firing three days apart in the same period still count as a cluster.
  • Dismissed anomalies (reviewed and marked as noise by an editor) are excluded.
  • Audio anomalies are not yet included in the cluster (the audio anomaly detector is a separate pipeline).

RX Talk radio exclusives

Question: What topics are discussed on talk radio but absent from written media?

The view compares the topic set from audio items (talk radio transcripts, filtered to politically_relevant = true) against the combined topic set from commentary items and news articles over the trailing 4 weeks. A topic is flagged as a radio exclusive when it has 3+ audio items and the written-media count is ≤ 30% of the audio count.
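The flagging rule reduces to a two-part test per topic. The sketch below uses integer arithmetic for the 30% comparison so the boundary case is not affected by floating-point rounding; the function and parameter names are hypothetical.

```python
def is_radio_exclusive(audio_count, written_count, min_audio=3, max_written_pct=30):
    """True when a topic has 3+ audio items and written-media items are at most
    30% of the audio count (written = commentary items + news articles)."""
    if audio_count < min_audio:
        return False
    # Integer form of: written_count <= 0.30 * audio_count
    return written_count * 100 <= max_written_pct * audio_count


for audio, written in [(10, 3), (10, 4), (2, 0), (3, 0)]:
    print(audio, written, is_radio_exclusive(audio, written))
```

Because the test is a ratio, a topic with heavy written coverage can still be flagged if its audio volume is more than three times larger still.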

These topics represent the “parallel conversation” happening on air that journalists haven’t picked up or can’t cover. They may signal emerging public concerns before they surface in print, or they may reflect talk-radio editorial preferences.

Limitations

  • Audio coverage is limited to 7 talk shows. Topics exclusive to shows we don’t ingest are invisible.
  • The 30% threshold is a heuristic; it has not been calibrated against a ground-truth dataset.
  • Topic canonicalisation may merge an audio-specific label with a broader written-media label, causing false negatives.

L Known limitations & biases

  1. Commentary lean classifications are hand-curated; new sources don't get a label until manually added.
  2. Portfolio → sector map is intentionally sparse; cross-cutting portfolios go untagged.
  3. Embedding canonicalisation thresholds (0.85/0.75) are baseline values, not validated against live data.
  4. Anomaly detector is calendar-aligned; mid-week reads are partial.
  5. Apostrophe collapse in org normalisation is lossy by design.
  6. Body truncation (6000 chars before topic extraction) is uncalibrated.
  7. Address-region heuristic is fragile on non-standard NZ formats.
  8. First-name-only mention resolution depends on fuzzy match.
  9. Topic and stance models are local Qwen variants. The choice was throughput-driven (Claude latency was prohibitive at corpus volume) rather than rigorously benchmarked against alternatives.
  10. Briefing display caps (50 meetings, 50 news, 200 directorships) are hard limits with no pagination.
  11. Stance bar is hidden below 3 classified edges; absence is "not enough signal yet", not "no stance".
  12. Anomaly detector skips entirely if the corpus has fewer than 3 distinct prior calendar weeks of data.

R Reproducibility & audit

Every ingest run records its outcome in an audit log, and every data point on the site carries an "as observed on" stamp so a reader can see exactly when we last verified it.

Thresholds and model versions quoted on this page are read from the live system at render time via the internal/methodology package; if one of those values changes in the running service it shows up here on the next page load. Display caps (briefing meeting / news / directorship limits, topic body truncation) are mirrored constants — the methodology package and the calling code hold the same number in two places, and any drift is caught by review when the cap moves.

Spotted something wrong on this page? Report a correction.