Access-IQ: Case Study

Executive summary

Access-IQ is a data engineering and analytics platform built for Northshire NHS Trust — a simulated engagement, scoped and delivered as a real consultancy project. Northshire Trust generates thousands of operational data points a day across electronic health records, appointments, urgent care and diagnostics — yet leadership can’t answer the questions that matter most:

Where are our A&E pressures concentrated, and which populations are most affected?

Are we meeting our referral-to-treatment targets equitably across all patient groups?

Which communities are experiencing the worst access outcomes — and are we closing the gap?

Is our performance on health inequality improving or deteriorating over time?

Access-IQ is the platform that answers them: a production-grade pipeline spanning two AWS accounts that ingests three sources into a governed Bronze / Silver / Gold lakehouse, models inequality metrics to public-health standard, and surfaces them in a dashboard a Trust director can use — defined entirely as code that tears itself down to $0 when idle.

Business outcomes

Single source of truth

One governed, auditable dataset behind every board report — down from 4+ conflicting extracts.

Inequality made visible

Disparities measurable across IMD, ethnicity, age and sex — for the first time, simultaneously.

Operational action

Managers can see where inequity emerges, trace the cause, and track interventions over time.

Sustainable by design

$0 idle, ~90 min cold deploy to live dashboard, auto-protected against budget overruns.

Discovery

The engagement began with a 2.5-hour discovery workshop with six stakeholders, spanning operational, clinical, strategic, and EDI leadership. The goal was to understand the current reporting landscape and surface the questions the Trust needed to answer.

“We report RTT performance to the board monthly, but it's a single Trust-wide number. When NHSE ask whether the 18-week target is equitable across population groups, we can't answer.”

Dr Sarah Mitchell, Chief Operating Officer

“Are DNA rates higher in certain ethnic groups? Is that language, transport, digital access? I can't even get the first number reliably, let alone the "why".”

Rachel Dunmore, Equality, Diversity & Inclusion Lead

“My A&E breach rates look fine Trust-wide, but certain postcodes present later, with higher acuity, and wait longer in department.”

Dr Amir Patel, Clinical Lead, Urgent & Emergency Care

“We know the data exists: three systems, five extracts, none of them talking to each other.”

James Okafor, Director of Strategy & Transformation

The workshop included a dedicated data landscape session, mapping every source system the Trust relied on for operational reporting. It identified five data sources across three systems (an EHR on Postgres, a daily SFTP appointment feed, and a diagnostics S3 export), each with its own identifiers, refresh cadence, and quality characteristics. Ethnicity coding was present but incomplete (~85% coverage). IMD was derivable from postcode but not stored. Community care records existed in a separate system with no patient identifier linkage, and were descoped to Phase 2.

Across all conversations, five themes emerged consistently:

Fragmentation
Inconsistent definitions
Retrospective-only reporting
Equity blind spots
Manual toil

The group converged on one definition of success: tell a clear story of where inequities exist, and how interventions are moving them — backed by governed data.

Requirements

The discovery workshop produced a signed-off Business Requirements Document covering 39 requirements across three tiers: functional, non-functional, and compliance. In NHS-adjacent work, compliance requirements carry equal weight to functional ones — they shape architecture from day one, not as an afterthought. MoSCoW prioritisation was used to make explicit scope decisions for Phase 1.

23 Must: the core platform

10 Should: enhanced capability

5 Won't (Phase 1)

Deliberately out of Phase 1: community-care linkage, predictive models, real-time streaming, multi-Trust federation, and clinical write-back — descoped to keep the timeline credible.

Functional

Idempotent ingestion of three heterogeneous sources
RTT (>18wk) & DM01 (>6wk) breach detection
Wait times by IMD, ethnicity, age & sex
Slope & Relative Index of Inequality
Small-cell suppression for counts < 5
Self-serve dashboard for executive and clinical users

Non-functional

Under $500/month, alerting at 80% of ceiling
Fully defined as code — no console drift
CI quality gates on every PR
Operable by a 3-person analytics team

Compliance

Non-reversible, keyed pseudonymisation (DSPT / Caldicott)
Encryption at rest & in transit
Least-privilege, role-based access
Full source-to-dashboard audit trail
Network isolation — no public path to source

Solution design

The architecture spans two AWS accounts connected by a VPC peering connection. The Trust account holds the source systems: RDS Postgres, Transfer Family SFTP, and S3 exports. The Platform account runs the analytics stack with controlled, auditable cross-account IAM access. Every data path between the two is explicit; there are no ad hoc connections between environments.

Trust account (source)

RDS Postgres: EHR + urgent care
Transfer Family: SFTP appointment drops
S3: diagnostics & provider exports

VPC peering

Platform account (analytics)

ECS Fargate: parallel ingestion + Prefect
S3 data lake: Bronze, KMS-encrypted
Redshift Serverless: Silver + Gold, Spectrum
Lambda: HMAC UDF, budget teardown

Bronze (raw append-only Parquet) → Silver (conformed, pseudonymised) → Gold (dimensional marts, SII/RII)

Access-IQ models a consultancy engagement across two AWS accounts: a Trust account standing in for the NHS client (operational source systems — Postgres EHR, SFTP appointment drops, S3 diagnostic exports) and a Platform account owned by the ‘vendor’ (ingestion, warehouse, analytics). The two are VPC-peered, and the boundary is deliberate — it forces every cross-org data movement to be explicit.

Data moves through a Bronze → Silver → Gold medallion architecture. Bronze lands in an S3 data lake as Parquet; Silver and Gold are modelled in dbt on Redshift. Bronze is exposed to Redshift through a Glue Data Catalog and Redshift Spectrum external tables, so dbt can query raw data in place without first loading it into the warehouse. Environments are split dev/prod as separate accounts, with secrets namespaced per environment.

Two operational facts shape the runtime design:

The infrastructure is ephemeral. Compute, warehouse, and networking are deployed at the start of a working session and destroyed at the end to avoid idle cost. Between sessions, the only durable state is what lives in S3: there is no always-on database or scheduler holding pipeline state.

The dataset is a 12-month backfill that then runs live. History is generated to look as though the platform has been ingesting daily for a year; a Trust-side Lambda then releases one new business day of data every 30 minutes, which the Platform pipeline ingests on a schedule. Consequently the pipeline processes one business day at a time, and ingest_date throughout the system denotes the business date of the data, not the wall-clock time of ingestion. The choice is deliberate: backfilled history partitions by the day each record actually belongs to.

daily-ingest flow (ECS Fargate)

Postgres: EHR, PyArrow
SFTP: appts, SHA-256
Trust S3: copy_object

↓ all three complete

Spectrum refresh → dbt Silver → GE gate → dbt Gold → Gold export

The Great Expectations gate is blocking — Gold is never built on unvalidated person-level data.

By concern:

Idempotent Ingestion

Every ingestion run is governed by a manifest: a small JSON record written to S3 at the end of the run that serves as both the audit record and the control signal. Four properties fall out of that single primitive.

Idempotency. Before opening a database connection, an SFTP transport, or listing a single S3 object, each ingestion path checks whether the latest manifest for that source and date already succeeded, and returns early if so (Snippet 1, Snippet 2). The check fails open: a corrupt or unreadable manifest re-ingests rather than skipping, so malformed state is never mistaken for success. Because the skip is keyed on business date, a scheduler that fires repeatedly only does work when a date hasn't yet been ingested successfully.

Append-only auditability. Append-only is structural, not enforced by a policy (Snippet 3). Every run mints a fresh uuid4 run_id and writes its manifest to a key namespaced by that id, so two runs can never collide, and nothing in the codebase deletes or overwrites a manifest. A failed run and its successful re-run coexist as a complete audit trail; "latest wins" is simply max(LastModified) over the prefix — exactly what the idempotency check reads back.

Source-matched integrity. Each manifest records source, environment, run ID, business date, status, start/end timing, an error list, and a per-entity outputs block. Integrity controls are matched to the source rather than applied uniformly (Snippet 4): SFTP is an opaque third-party file drop with no native integrity metadata, so every file is SHA-256'd on read to fingerprint exactly what landed; S3 objects already carry an ETag content hash; and the Postgres path reads a live query result, where a byte digest of the derived Parquet would be meaningless. The natural integrity control for the Postgres path is a row count; that has not yet been implemented and is in the backlog.

Incremental resume. Because the pipeline processes one business day at a time and there is no state between sessions, every run must answer "which day do I process next?" with no database and no cursor file. The manifests are the cursor (Snippet 5). discover_latest_successful_date scans the durable manifest history for a source, newest-first, and returns the most recent date with a successful run; discover_next_business_date then advances all sources together as min(latest_successful_date) + 1 day. The min is deliberate — if one source fails or lags, taking the minimum guarantees none races ahead and leaves a gap, so dbt and the Gold layer always see complete days. The mechanism is stateless and crash-safe: an interrupted run leaves no successful manifest, so the next run rediscovers the same target and continues, letting a freshly redeployed pipeline pick up exactly where the previous session stopped.

Taken together, the manifest is a single primitive behind four properties — idempotency, append-only auditability, source-matched integrity, and incremental resume — rather than four mechanisms bolted on.

idempotency.py

# src/access_iq/ingestion/idempotency.py:25-42
def should_skip_if_already_successful(*, s3: Any, bucket: str, manifest_prefix: str) -> bool:
    manifest_prefix = normalize_manifest_prefix(manifest_prefix)
    key = _latest_manifest_key(s3=s3, bucket=bucket, prefix=manifest_prefix)
    if not key:
        return False

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    try:
        manifest = json.loads(body)
    except (TypeError, json.JSONDecodeError):
        log.warning("manifest_decode_failed", bucket=bucket, key=key)  #fail open: re-ingest, don't skip on bad data
        return False

    if not isinstance(manifest, dict):
        log.warning("manifest_not_dict", bucket=bucket, key=key)  #fail open: never treat malformed manifest as success
        return False

    return bool(manifest.get("status") == "success")
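The helper _latest_manifest_key is called above but not shown. As an illustration only, a "latest wins" lookup over a prefix might look like the sketch below. The function name latest_manifest_key and the StubS3 class are hypothetical stand-ins, not part of the codebase; the only boto3 surface assumed is the list_objects_v2 paginator.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any, Iterator


def latest_manifest_key(*, s3: Any, bucket: str, prefix: str) -> str | None:
    """Return the key of the newest object under `prefix`, or None if there is none."""
    paginator = s3.get_paginator("list_objects_v2")
    latest: dict[str, Any] | None = None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # "latest wins": keep whichever object was modified most recently
            if latest is None or obj["LastModified"] > latest["LastModified"]:
                latest = obj
    return latest["Key"] if latest else None


class StubS3:
    """In-memory stand-in for a boto3 S3 client, for demonstration only."""

    def __init__(self, objects: list[dict[str, Any]]) -> None:
        self._objects = objects

    def get_paginator(self, operation: str) -> "StubS3":
        if operation != "list_objects_v2":
            raise ValueError(operation)
        return self

    def paginate(self, *, Bucket: str, Prefix: str) -> Iterator[dict[str, Any]]:
        matching = [o for o in self._objects if o["Key"].startswith(Prefix)]
        if matching:
            yield {"Contents": matching}
        else:
            yield {}  # like S3, an empty listing has no "Contents" entry


# A failed run followed by its successful re-run: both manifests coexist.
demo_prefix = "_manifests/source=sftp/ingest_date=2025-03-10/"
demo_s3 = StubS3(
    [
        {"Key": demo_prefix + "run_id=aaa.json",
         "LastModified": datetime(2025, 3, 10, 9, 0, tzinfo=timezone.utc)},
        {"Key": demo_prefix + "run_id=bbb.json",
         "LastModified": datetime(2025, 3, 10, 9, 30, tzinfo=timezone.utc)},
    ]
)
print(latest_manifest_key(s3=demo_s3, bucket="example-bucket", prefix=demo_prefix))
```

Because the lookup reads LastModified rather than anything inside the manifest body, it stays cheap: only the single winning object is ever downloaded and parsed.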

postgres.py

# src/access_iq/ingestion/postgres.py:104-115
if should_skip_if_already_successful(
    s3=s3, bucket=platform_bucket, manifest_prefix=manifest_prefix
):
    bound_log.info("ingest_skipped", reason="latest_manifest_success")
    return {
        "source": db,
        "run_id": run_id,
        "env": env,
        "ingest_date": ingest_date.isoformat(),
        "status": "skipped",
        "reason": "latest_manifest_success",
    }

manifests.py

# src/access_iq/ingestion/manifests.py:48-49, 58-80

# run_id is defined as str(uuid.uuid4())  #fresh per run — guarantees a unique manifest key

def build_manifest_key(*, source: str, ingest_date: str, run_id: str) -> str:
    #run_id namespaces the key, so two runs can never target the same object
    return f"_manifests/source={source}/ingest_date={ingest_date}/run_id={run_id}.json"


def write_manifest(
    *, s3: Any, bucket: str, manifest: Manifest, kms_key_arn: str | None = None
) -> str:
    key = build_manifest_key(
        source=manifest.source,
        ingest_date=manifest.ingest_date,
        run_id=manifest.run_id,
    )
    body = json.dumps(manifest.model_dump(), indent=2, default=str).encode("utf-8")
    #only ever put_object to a fresh run_id key — no delete, no overwrite anywhere
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
        **s3_kms_args(kms_key_arn),
    )
    return key

sftp.py

# src/access_iq/ingestion/sftp.py:31-32, 134-137
def sha256_bytes(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

# ... within ingest_sftp_directory_to_bronze:
with sftp.open(remote_path, "rb") as f:
    data = f.read()

digest = sha256_bytes(data)  #fingerprint exactly what landed — SFTP gives no integrity metadata
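To make the source-matched idea concrete, the sketch below shows one way the per-entity integrity fields could differ by source. It is an assumption-laden illustration: the function name integrity_fields and the field names sha256, bytes, etag and row_count are hypothetical, and the Postgres row-count branch reflects the backlog item described in the text rather than shipped behaviour.

```python
from __future__ import annotations

import hashlib


def integrity_fields(
    source: str,
    *,
    data: bytes | None = None,
    etag: str | None = None,
    row_count: int | None = None,
) -> dict[str, object]:
    """Pick the integrity evidence that is meaningful for each source type."""
    if source == "sftp":
        if data is None:
            raise ValueError("sftp integrity needs the file bytes")
        # Opaque file drop: fingerprint the exact bytes that landed
        return {"sha256": hashlib.sha256(data).hexdigest(), "bytes": len(data)}
    if source == "s3":
        if etag is None:
            raise ValueError("s3 integrity needs the object's ETag")
        # S3 already supplies an ETag; record it rather than re-hashing
        return {"etag": etag}
    if source == "postgres":
        if row_count is None:
            raise ValueError("postgres integrity needs a row count")
        # Live query result: a byte digest would be meaningless, so count rows
        return {"row_count": row_count}
    raise ValueError(f"unknown source: {source}")


print(integrity_fields("sftp", data=b"abc"))
print(integrity_fields("postgres", row_count=1250))
```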

manifests.py

# src/access_iq/ingestion/manifests.py:83-119
def discover_latest_successful_date(*, s3: Any, bucket: str, source: str) -> date | None:
    """Find the latest ingest_date with a successful manifest for a source.

    Scans manifest prefixes by ingest_date, checks from newest to oldest,
    and returns the first date whose latest manifest has status=success.
    """
    prefix = normalize_manifest_prefix(f"_manifests/source={source}/")
    paginator = s3.get_paginator("list_objects_v2")

    dates: set[date] = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            match = _re.search(r"ingest_date=(\d{4}-\d{2}-\d{2})", cp["Prefix"])
            if match:
                dates.add(date.fromisoformat(match.group(1)))

    if not dates:
        return None

    for d in sorted(dates, reverse=True):  #newest first — stop at the first proven-good date
        mp = build_manifest_prefix(source=source, ingest_date=d.isoformat())
        latest: dict[str, Any] | None = None
        for page in paginator.paginate(Bucket=bucket, Prefix=mp):
            for obj in page.get("Contents", []):
                if latest is None or obj["LastModified"] > latest["LastModified"]:
                    latest = obj
        if not latest:
            continue
        body = s3.get_object(Bucket=bucket, Key=latest["Key"])["Body"].read()
        try:
            manifest_data = json.loads(body)
            if isinstance(manifest_data, dict) and manifest_data.get("status") == "success":
                return d
        except (TypeError, json.JSONDecodeError):
            continue

    return None
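The snippet above finds the latest successful date for one source; the companion step, discover_next_business_date, is described in the text but not shown. A minimal sketch of its rule, min(latest_successful_date) + 1 day across sources, follows. The function name next_business_date, the backfill_start parameter, and the handling of a source with no successful run are assumptions for illustration.

```python
from __future__ import annotations

from datetime import date, timedelta


def next_business_date(
    latest_by_source: dict[str, date | None], *, backfill_start: date
) -> date:
    """Advance all sources together: the slowest source sets the pace."""
    if not latest_by_source:
        raise ValueError("at least one source is required")
    # Assumption: if any source has never succeeded, start from the beginning
    if any(d is None for d in latest_by_source.values()):
        return backfill_start
    # min() guarantees no source races ahead and leaves a gap behind it
    slowest = min(d for d in latest_by_source.values() if d is not None)
    return slowest + timedelta(days=1)


latest = {
    "postgres": date(2025, 3, 10),
    "sftp": date(2025, 3, 9),  # lagging source
    "trust_s3": date(2025, 3, 10),
}
print(next_business_date(latest, backfill_start=date(2024, 3, 1)))
```

With sftp one day behind, the target is 10 March for every source. The paths that already succeeded for that date skip via their manifests, so only the lagging source does real work, and repeating the call after a crash yields the same target.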

Data Model

The Gold layer is a Kimball star schema: four fact tables and six conformed dimensions, each fact declared at a single grain — fct_wait_times one row per referral, fct_urgent_care one row per A&E attendance, fct_utilisation one row per appointment, and fct_inequality one row per metric × period × stratifier × stratum (Diagram 1). The three transactional facts share conformed dimensions — dim_patient, dim_site, dim_date (joined on a month grain), plus dim_specialty for referrals — so any measure slices the same way across pathways. fct_inequality is a derived aggregate: it re-aggregates fct_wait_times and fct_urgent_care to the stratum grain, reusing those facts' measure definitions, and computes the DNA-rate metric from Silver appointments directly. Joining dim_patient for stratifiers keeps demographics consistent with every other mart. (Two of the three disparity metrics are fact-sourced today; unifying the DNA metric onto fct_utilisation would make all three consistently fact-derived — a small, deliberate piece of outstanding work.)

A single discipline runs through the layer: demographics are derived once and conformed everywhere. The Bronze patient feed arrives with its own imd_decile column, but Silver does not trust it. It discards the supplied value and re-derives IMD decile from each patient's LSOA code through a controlled LSOA→IMD lookup seed (Snippet 1). Gold never recomputes demographics: facts that need IMD, ethnicity, age or sex obtain them by joining dim_patient rather than carrying their own copies, and Silver appointments deliberately omits imd_decile to force this. The result is one auditable lineage (Bronze LSOA → seed lookup → dim_patient → every fact), so two marts can never disagree on a patient's deprivation decile.

erDiagram
    DIM_PATIENT   ||--o{ FCT_WAIT_TIMES  : patient_sk
    DIM_PATIENT   ||--o{ FCT_URGENT_CARE : patient_sk
    DIM_PATIENT   ||--o{ FCT_UTILISATION : patient_sk
    DIM_SITE      ||--o{ FCT_WAIT_TIMES  : site_sk
    DIM_SITE      ||--o{ FCT_URGENT_CARE : site_sk
    DIM_SITE      ||--o{ FCT_UTILISATION : site_sk
    DIM_SPECIALTY ||--o{ FCT_WAIT_TIMES  : specialty_sk
    DIM_DATE      ||--o{ FCT_WAIT_TIMES  : treatment_month
    DIM_DATE      ||--o{ FCT_URGENT_CARE : arrival_month
    DIM_DATE      ||--o{ FCT_UTILISATION : appointment_month
    DIM_DATE      ||--o{ FCT_INEQUALITY  : period

    FCT_WAIT_TIMES  ||--o{ FCT_INEQUALITY : "feeds (derived)"
    FCT_URGENT_CARE ||--o{ FCT_INEQUALITY : "feeds (derived)"
    DIM_IMD       ||--o{ FCT_INEQUALITY : "stratum domain"
    DIM_ETHNICITY ||--o{ FCT_INEQUALITY : "stratum domain"

    DIM_PATIENT {
        varchar patient_sk PK
        varchar age_band
        varchar sex
        varchar ethnicity_ons
        int     imd_decile
    }
    FCT_WAIT_TIMES {
        varchar referral_id "grain"
        varchar patient_sk FK
        varchar specialty_sk FK
        varchar site_sk FK
        date    treatment_month
        int     wait_days
    }
    FCT_URGENT_CARE {
        varchar uc_log_id "grain"
        varchar patient_sk FK
        varchar site_sk FK
        date    arrival_month
        bool    four_hour_breach_flag
    }
    FCT_UTILISATION {
        varchar appointment_id "grain"
        varchar patient_sk FK
        varchar site_sk FK
        date    appointment_month
    }
    FCT_INEQUALITY {
        varchar metric_name "grain"
        date    period "grain"
        varchar stratifier "grain"
        varchar stratum "grain"
        float   metric_value
        float   sii_value
        float   rii_value
    }

patients.sql

-- dbt/models/silver/patients.sql:39-50, 60
with_imd AS (
    SELECT
        deduped.*,
        lsoa_imd.imd_decile  AS _lsoa_imd_decile,   -- re-derived from LSOA, not Bronze's imd_decile
        imd.imd_label,
        imd.deprivation_level
    FROM deduped
    LEFT JOIN {{ ref('seed_lsoa_imd_lookup') }} lsoa_imd
        ON lsoa_imd.lsoa_code = deduped.lsoa_code
    LEFT JOIN {{ ref('seed_imd') }} imd
        ON imd.imd_decile = lsoa_imd.imd_decile
)
-- final SELECT:
    _lsoa_imd_decile AS imd_decile,                 -- Bronze's own imd_decile column is discarded
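The same "discard and re-derive" rule can be expressed outside SQL. The sketch below is illustrative only: conform_imd is a hypothetical name, the LSOA codes and deciles are made-up sample values, and a plain dict stands in for the seed lookup table.

```python
from __future__ import annotations


def conform_imd(
    patients: list[dict[str, object]], lsoa_to_decile: dict[str, int]
) -> list[dict[str, object]]:
    """Replace any supplied imd_decile with the value from the controlled lookup."""
    conformed = []
    for patient in patients:
        # Drop the Bronze-supplied value unconditionally, even if it looks plausible
        row = {k: v for k, v in patient.items() if k != "imd_decile"}
        # LEFT JOIN semantics: an unmatched or missing LSOA yields None, not an error
        row["imd_decile"] = lsoa_to_decile.get(str(patient.get("lsoa_code")))
        conformed.append(row)
    return conformed


lookup = {"E01000001": 9, "E01000002": 2}  # sample values, not real IMD data
bronze = [
    {"patient_sk": "p1", "lsoa_code": "E01000001", "imd_decile": 3},  # supplied 3 is ignored
    {"patient_sk": "p2", "lsoa_code": "E01999999", "imd_decile": 5},  # LSOA not in the lookup
]
for row in conform_imd(bronze, lookup):
    print(row)
```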

Inequality Metrics

The Slope Index of Inequality (SII) and Relative Index of Inequality (RII) are the standard PHE/OHID measures of health disparity across socioeconomic groups. Both are computed in pure Redshift SQL — no Python, no R, no external process. Ridit scores are derived for each IMD decile from cumulative population window functions (Snippet 1), then SII and RII are computed via a closed-form weighted least-squares regression in the spirit of the Fingertips SII/RII approach, in a single CTE chain (Snippet 2). The ridit and regression logic runs only on the imd_decile stratifier; ethnicity, age and sex report raw metric values (Snippet 3), since the ordinal-ranking assumption behind SII/RII doesn't apply to non-ordered categories. Small-cell suppression is applied in the final SELECT (Snippet 3): any stratum with fewer than 5 patients has both count and metric value set to NULL, per NHS Digital statistical disclosure rules.

fct_inequality.sql

-- ── Ridit scores for SII/RII (IMD decile stratifier only) ───────────────────
-- ridit_i = (cumulative_pop_up_to_i - 0.5 * pop_i) / total_pop
-- IMD decile ordered ascending (1 = most deprived)

with_ridit AS (
    SELECT
        b.*,
        CASE
            WHEN b.stratifier = 'imd_decile' THEN
                (SUM(b.population_count) OVER (
                    PARTITION BY b.metric_name, b.period, b.stratifier
                    -- window spans ALL strata; cast to int only for imd, else 'Female'::int would throw
                    ORDER BY CASE WHEN b.stratifier = 'imd_decile' THEN b.stratum::integer ELSE 0 END
                    ROWS UNBOUNDED PRECEDING
                ) - 0.5 * b.population_count)
                * 1.0
                / NULLIF(SUM(b.population_count) OVER (
                    PARTITION BY b.metric_name, b.period, b.stratifier
                ), 0)
            ELSE NULL                                  -- non-ordinal stratifiers get no ridit score
        END AS ridit_score,
        CASE
            WHEN b.stratifier = 'imd_decile' THEN
                b.population_count * 1.0               -- *1.0 forces float; integer division would zero the weight
                / NULLIF(SUM(b.population_count) OVER (
                    PARTITION BY b.metric_name, b.period, b.stratifier
                ), 0)
            ELSE NULL
        END AS weight
    FROM base b
)

fct_inequality.sql

sii_rii AS (
    SELECT
        metric_name,
        period,
        -- SII: weighted least-squares slope of metric_value on ridit score.
        -- The normalising term is SUM(weight), not COUNT(*): combining the unweighted
        -- row count with weighted sums does not give the least-squares slope.
        (SUM(weight::float) * SUM(weight::float * ridit_score::float * metric_value::float)
         - SUM(weight::float * ridit_score::float) * SUM(weight::float * metric_value::float))
        / NULLIF(
            SUM(weight::float) * SUM(weight::float * ridit_score::float * ridit_score::float)
            - POWER(SUM(weight::float * ridit_score::float), 2),
            0
        ) AS sii_value,
        -- RII: the same slope divided by the population-weighted mean of the metric
        (SUM(weight::float) * SUM(weight::float * ridit_score::float * metric_value::float)
         - SUM(weight::float * ridit_score::float) * SUM(weight::float * metric_value::float))
        / NULLIF(
            SUM(weight::float) * SUM(weight::float * ridit_score::float * ridit_score::float)
            - POWER(SUM(weight::float * ridit_score::float), 2),
            0
        )
        / NULLIF(
            SUM(population_count::float * metric_value::float) / NULLIF(SUM(population_count::float), 0),
            0
        ) AS rii_value
    FROM with_ridit
    WHERE stratifier = 'imd_decile' AND ridit_score IS NOT NULL
    GROUP BY metric_name, period
)
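A small reference calculation is useful for cross-checking the SQL outside the warehouse. The sketch below computes ridit scores and then the standard weighted least-squares slope, with RII defined as in the text (slope divided by the population-weighted mean). Function names and sample figures are illustrative assumptions, not the project's code or data.

```python
from __future__ import annotations


def ridit_scores(populations: list[float]) -> list[float]:
    """ridit_i = (cumulative population up to i - 0.5 * pop_i) / total population."""
    total = sum(populations)
    scores, cumulative = [], 0.0
    for pop in populations:
        cumulative += pop
        scores.append((cumulative - 0.5 * pop) / total)
    return scores


def sii_rii(
    populations: list[float], values: list[float]
) -> tuple[float | None, float | None]:
    """Return (SII, RII) for strata ordered from decile 1 upwards."""
    total = sum(populations)
    if total == 0:
        return None, None  # mirrors NULLIF(..., 0) in the SQL
    w = [p / total for p in populations]
    x = ridit_scores(populations)
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, values))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, values))
    denominator = sw * swxx - swx**2
    if denominator == 0:
        return None, None  # a single stratum has no slope
    sii = (sw * swxy - swx * swy) / denominator
    mean = swy / sw
    rii = sii / mean if mean != 0 else None
    return sii, rii


# Ten equal deciles with a metric that falls linearly from most to least deprived
pops = [100.0] * 10
metric = [20.0 - 10.0 * r for r in ridit_scores(pops)]
print(sii_rii(pops, metric))
```

For data that is exactly linear in the ridit score, the slope is recovered exactly: here the metric drops by 10 units across the whole deprivation range, so SII is -10 and RII is -10 divided by the mean of 15.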

fct_inequality.sql

-- ── Final SELECT with small-cell suppression (D-07) ─────────────────────────

SELECT
    r.metric_name,
    r.period,
    r.stratifier,
    r.stratum,
    -- NHS Digital small-cell suppression: <5 patients → NULL out count AND value
    CASE WHEN r.population_count < 5 THEN NULL ELSE r.population_count END AS population_count,
    CASE WHEN r.population_count < 5 THEN NULL ELSE r.metric_value    END  AS metric_value,
    r.ridit_score,
    s.sii_value,
    s.rii_value
FROM with_ridit r
LEFT JOIN sii_rii s
    ON r.metric_name = s.metric_name
    AND r.period = s.period
    AND r.stratifier = 'imd_decile'   -- SII/RII attach to imd rows only; ethnicity/age/sex keep raw metric_value
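The suppression rule is simple enough to restate as a unit-testable function, which can help when validating exports downstream. This is a sketch: the function name and sample rows are hypothetical, and only the threshold of 5 comes from the text.

```python
from __future__ import annotations

SUPPRESSION_THRESHOLD = 5  # strata with fewer than 5 patients are suppressed


def suppress_small_cells(
    rows: list[dict[str, object]], threshold: int = SUPPRESSION_THRESHOLD
) -> list[dict[str, object]]:
    """Null out both the count and the metric for any small stratum."""
    out = []
    for row in rows:
        suppressed = dict(row)
        count = row["population_count"]
        if count is not None and count < threshold:
            # Both fields go: a visible rate with a hidden count can still disclose
            suppressed["population_count"] = None
            suppressed["metric_value"] = None
        out.append(suppressed)
    return out


sample = [
    {"stratum": "1", "population_count": 4, "metric_value": 0.25},
    {"stratum": "2", "population_count": 5, "metric_value": 0.20},
]
for row in suppress_small_cells(sample):
    print(row)
```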

Data Quality

Quality is enforced at three layers of increasing strictness, each catching a different class of problem.

A Bronze readiness gate — a standalone Python/pandas profiling tool (make ready) the operator runs against raw Bronze before building Silver — checks seven structural properties: entity completeness, primary-key uniqueness, join-key existence, null rates on critical columns, join-key type consistency across entities, referential integrity, and date-range coverage (Snippet 1). It exits non-zero on any blocking failure, surfacing a structurally broken Bronze before Silver work begins.

On every pipeline run, dbt schema tests validate Silver and Gold — 38 Silver and 59 Gold tests spanning primary-key uniqueness, null rates, accepted value sets, numeric ranges and row-count bounds, a mix of dbt's core generic tests and the dbt-expectations package (Snippet 2). Gold additionally pins each fact's declared grain with a compound-uniqueness test.

After Silver builds and before Gold, a Great Expectations gate validates the four person-level Silver tables: patients, encounters, referrals and diagnoses (Snippet 3). It is deliberately tolerant where nulls are legitimate: patient_sk may be up to 20% null on encounters and referrals (mostly=0.80), because roughly 14% of those rows belong to patients quarantined by NHS Mod-11 validation. The gate writes a PASS/FAIL row per table to gold._dq_results; a dbt pre-hook on every Gold model then refuses to build if any table failed or if the gate hasn't run at all that day (Snippet 4). The hook is fail-closed, so Gold is never built on unvalidated data. A failure also emits a CloudWatch metric that drives an alarm.

The three layers escalate in enforcement: the Bronze gate is an operator pre-flight check, the dbt schema tests run on every build, and the GE gate is fully automated and fail-closed.

readiness_gate.py

def run_readiness_checks(*, settings: Settings) -> list[CheckResult]:
    """Execute all 7 readiness checks against live Bronze data."""
    entity_dfs = load_all_bronze_entities(
        aws_profile=settings.aws_profile,
        aws_region=settings.aws_region,
        platform_bucket=settings.platform_bucket,
    )

    all_results: list[CheckResult] = []
    all_results.extend(check_entity_completeness(entity_dfs=entity_dfs))     # 1 entity completeness
    all_results.extend(check_pk_unique_nonnull(entity_dfs=entity_dfs))       # 2 PK unique & non-null
    all_results.extend(check_join_key_existence(entity_dfs=entity_dfs))      # 3 join-key existence
    all_results.extend(check_null_rates(entity_dfs=entity_dfs))              # 4 null rates (critical cols blocking)
    all_results.extend(check_type_consistency(entity_dfs=entity_dfs))        # 5 join-key type consistency across entities
    all_results.extend(check_referential_integrity(entity_dfs=entity_dfs))   # 6 referential integrity (orphan FKs)
    all_results.extend(check_date_range_coverage(entity_dfs=entity_dfs))     # 7 date-range coverage
    ...
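The individual checks are not shown in the excerpt above. As a rough idea of their shape, the sketch below implements a primary-key check and the non-zero exit rule. It is a dependency-free stand-in (the real tool is described as pandas-based), and the CheckResult fields, function signatures and sample rows are all assumptions.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    check: str
    entity: str
    passed: bool
    blocking: bool
    detail: str = ""


def check_pk_unique_nonnull(
    rows_by_entity: dict[str, list[dict[str, object]]], pk_by_entity: dict[str, str]
) -> list[CheckResult]:
    """One result per entity: the primary key must be present and unique."""
    results = []
    for entity, rows in rows_by_entity.items():
        keys = [row.get(pk_by_entity[entity]) for row in rows]
        nulls = sum(k is None for k in keys)
        counts = Counter(k for k in keys if k is not None)
        duplicates = sum(c - 1 for c in counts.values() if c > 1)
        results.append(
            CheckResult(
                check="pk_unique_nonnull",
                entity=entity,
                passed=(nulls == 0 and duplicates == 0),
                blocking=True,
                detail=f"nulls={nulls}, duplicates={duplicates}",
            )
        )
    return results


def exit_code(results: list[CheckResult]) -> int:
    """Non-zero if any blocking check failed, so `make` stops the build."""
    return 1 if any(r.blocking and not r.passed for r in results) else 0


demo_rows = {
    "patients": [{"patient_id": 1}, {"patient_id": 1}, {"patient_id": None}],
    "sites": [{"site_id": "A"}, {"site_id": "B"}],
}
demo_results = check_pk_unique_nonnull(demo_rows, {"patients": "patient_id", "sites": "site_id"})
print(demo_results, exit_code(demo_results))
```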

_silver_models.yml

  - name: patients
    description: "Conformed patient demographics with HMAC pseudonymised patient_sk, IMD from LSOA join"
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: patient_sk
        description: "HMAC-SHA-256 of NHS number - primary key"
        tests:
          - not_null
          - unique
      - name: _ingest_date
        tests:
          - not_null
      - name: age_band
        tests:
          - dbt_expectations.expect_column_values_to_be_in_set:
              value_set:
                ["0-15", "16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
              config:
                severity: warn
      - name: ethnicity_ons
        tests:
          - dbt_expectations.expect_column_values_to_not_be_null:
              config:
                severity: warn
      - name: imd_decile
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 1
              max_value: 10

run_ge_gate.py

    if table_name == "patients":
        suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="patient_sk"))
        suite.add_expectation(
            gx.expectations.ExpectColumnDistinctValuesToBeInSet(
                column="sex",
                value_set=["M", "F", "I", "U"],
            )
        )
    elif table_name == "encounters":
        suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="encounter_id"))
        # ~14% of encounters have NULL patient_sk from Mod-11 quarantined patients
        suite.add_expectation(
            gx.expectations.ExpectColumnValuesToNotBeNull(column="patient_sk", mostly=0.80)
        )
    elif table_name == "referrals":
        suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="referral_id"))
        # ~14% of referrals have NULL patient_sk from Mod-11 quarantined patients
        suite.add_expectation(
            gx.expectations.ExpectColumnValuesToNotBeNull(column="patient_sk", mostly=0.80)
        )
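The mostly argument sets the minimum fraction of rows that must satisfy an expectation. The sketch below restates that threshold in plain Python to show why a 14% null rate passes at mostly=0.80 while a 25% null rate does not. It is an illustration of the arithmetic only, not Great Expectations code; the function name is hypothetical and treating an empty column as passing is an assumption.

```python
from __future__ import annotations


def not_null_mostly(values: list[object], mostly: float = 1.0) -> bool:
    """True when at least `mostly` of the values are non-null."""
    if not values:
        return True  # assumption: nothing to violate in an empty column
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= mostly


# 14% null: comfortably inside the 20% tolerance
quarantine_level = ["sk"] * 86 + [None] * 14
# 25% null: something beyond Mod-11 quarantine has gone wrong
degraded = ["sk"] * 75 + [None] * 25

print(not_null_mostly(quarantine_level, mostly=0.80))
print(not_null_mostly(degraded, mostly=0.80))
```

The gap between the expected 14% and the 20% ceiling is the headroom: normal day-to-day variation passes, while a broken join or a bad feed that nulls a large share of keys still fails the gate.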

check_ge_gate.sql

{%- if execute -%}
  {%- set gate_query %}
    SELECT
        COALESCE(SUM(CASE WHEN run_status = 'FAILED' THEN 1 ELSE 0 END), 0) AS failures,
        COUNT(*) AS total_runs
    FROM gold._dq_results
    WHERE run_date = CURRENT_DATE
  {%- endset %}

  {%- set results = run_query(gate_query) -%}
  {%- set failures = results.columns[0].values()[0] -%}
  {%- set total    = results.columns[1].values()[0] -%}

  {%- if total == 0 -%}                            -- fail-closed: GE hasn't run today → block Gold
    {{ exceptions.raise_compiler_error(
       "GE gate: No GE validation runs found for today (" ~ modules.datetime.date.today().isoformat() ~ "). "
       ~ "Run `make dq-gate` or `python dbt/scripts/run_ge_gate.py` before `dbt build --select gold`."
    ) }}
  {%- elif failures > 0 -%}                        -- any FAILED Silver table → block Gold
    {{ exceptions.raise_compiler_error(
       "GE gate FAILED: " ~ failures ~ " Silver table(s) have DQ failures. "
       ~ "Gold promotion blocked. Check gold._dq_results for details."
    ) }}
  {%- endif -%}
{%- endif -%}
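The decision the pre-hook makes has three outcomes, and the important one is the first: no evidence counts as a failure. A sketch of that logic as a plain function, with a hypothetical name and return shape, makes the fail-closed ordering easy to test.

```python
from __future__ import annotations


def gold_gate_decision(statuses_today: list[str]) -> tuple[bool, str]:
    """Decide whether Gold may build, given today's run_status values."""
    total = len(statuses_today)
    failures = sum(status == "FAILED" for status in statuses_today)
    if total == 0:
        # Fail-closed: absence of validation results is treated as a block
        return False, "no validation runs found for today"
    if failures > 0:
        return False, f"{failures} Silver table(s) have DQ failures"
    return True, "all validated tables passed"


print(gold_gate_decision([]))
print(gold_gate_decision(["PASSED", "FAILED", "PASSED", "PASSED"]))
print(gold_gate_decision(["PASSED"] * 4))
```

A fail-open version would return True for the empty list, which is exactly the case where the gate was skipped or crashed before writing results.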

Security & Compliance

NHS numbers never reach an analyst-readable table. The raw number lands only in the restricted Bronze and quarantine schemas; every analyst-facing Silver and Gold table carries a pseudonymous patient_sk instead. That surrogate is produced by a Redshift Lambda UDF (f_hmac_nhs_number) that HMAC-SHA-256s the NHS number, fetching the per-environment key from Secrets Manager on cold start and caching it for the warm lifetime of the Lambda (Snippet 1). HMAC rather than bare SHA-256 is deliberate: a 10-digit NHS number has only a 10¹⁰ keyspace, trivially rainbow-tabled, so a keyed hash is what makes the surrogate non-reversible without the key. The hash is deterministic within an environment, so pseudonymised rows still join across tables; and because each environment has its own key, a development data leak can't be correlated to production patients. The UDF never logs its inputs and maps NULL→NULL, so a missing value can neither crash the batch nor leak.

Invalid NHS numbers are caught, not trusted. A Mod-11 check-digit macro validates every number; rows that fail are routed to a quarantine table with a rejection reason rather than silently dropped, while valid rows flow on to the conformed patient model (Snippet 2). The two models are complementary halves of one filter — nothing is lost.

Access is least-privilege by prefix: every S3 grant is scoped to a path, not a bucket (Snippet 3). The ingestion role writes only bronze/* and _manifests/*; the Spectrum role reads the whole lake but writes only gold_export/*; and the dashboard reader, a dedicated IAM user because Streamlit Community Cloud needs static credentials, gets read-only gold_export/* plus an explicit KMS decrypt grant. The Spectrum role produces the export and the dashboard user consumes it from the same prefix; the dashboard user cannot write there or read anything outside it. No role has wildcard S3 access.
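As an illustration of what a prefix-scoped grant looks like, the sketch below builds a policy document for the dashboard reader. The function name, statement IDs, bucket name and key ARN are placeholders invented for the example; the project's actual policy is defined in its infrastructure code and may differ.

```python
from __future__ import annotations

import json


def dashboard_reader_policy(
    *, bucket: str, kms_key_arn: str, prefix: str = "gold_export/"
) -> dict[str, object]:
    """Read-only access to one prefix, plus decrypt on the lake's KMS key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadGoldExportObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Object-level grant is limited to keys under the prefix
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Sid": "ListGoldExportOnly",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                # ListBucket is a bucket-level action, so the prefix is a condition
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "DecryptExport",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": kms_key_arn,
            },
        ],
    }


policy = dashboard_reader_policy(
    bucket="example-platform-lake",  # placeholder bucket name
    kms_key_arn="arn:aws:kms:eu-west-2:111122223333:key/EXAMPLE",  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```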

handler.py

"""
...
Threat mitigations
------------------
* T-05-01: Input NHS numbers are **never** logged.
* T-05-02: ``None`` inputs produce ``None`` outputs (no crash/leak).
* T-05-03: Only the key ARN is in the env var; the key itself is fetched
  at runtime from Secrets Manager via IAM role.
"""

import hashlib
import hmac
import json
import os

import boto3

_KEY_CACHE: bytes | None = None


def _get_key() -> bytes:
    """Fetch HMAC key from Secrets Manager (cached after first call)."""
    global _KEY_CACHE  # noqa: PLW0603
    if _KEY_CACHE is not None:                       # cold start fetches; warm invocations reuse
        return _KEY_CACHE

    arn = os.environ["HMAC_KEY_SECRET_ARN"]          # only the ARN is in the env, never the key
    client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId=arn)
    payload = json.loads(resp["SecretString"])
    _KEY_CACHE = str(payload["key"]).encode("utf-8")
    return _KEY_CACHE


def handler(event: dict, context: object) -> str:  # noqa: ARG001
    # Redshift batch contract: event["arguments"] = [["nhs1"], ["nhs2"], [None], ...]
    key = _get_key()
    results: list[str | None] = []
    for row in event["arguments"]:
        value = row[0]
        if value is None:
            results.append(None)                     # NULL→NULL: missing value can't crash or leak
        else:
            results.append(hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest())
    return json.dumps({"success": True, "results": results})
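The two properties claimed for the surrogate (deterministic within an environment, uncorrelated across environments) can be checked locally without AWS. The following is a minimal sketch, not the deployed UDF: pseudonymise_batch is a hypothetical stand-in that takes the key as an argument instead of reading Secrets Manager, both keys are invented, and the NHS number is a made-up value.

```python
import hashlib
import hmac
import json


def pseudonymise_batch(arguments, key):
    """Mimic the Redshift batch contract: one result per input row, NULL stays NULL."""
    results = []
    for row in arguments:
        value = row[0]
        if value is None:
            results.append(None)
        else:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            results.append(digest.hexdigest())
    return json.dumps({"success": True, "results": results})


# Hypothetical keys standing in for the per-environment secrets.
DEV_KEY = b"dev-only-example-key"
PROD_KEY = b"prod-only-example-key"

# Made-up NHS number, repeated to show that equal inputs give equal surrogates.
batch = [["9990000018"], [None], ["9990000018"]]

dev = json.loads(pseudonymise_batch(batch, DEV_KEY))["results"]
prod = json.loads(pseudonymise_batch(batch, PROD_KEY))["results"]

print(dev[0] == dev[2])   # same key, same input: rows still join
print(dev[0] == prod[0])  # different key: surrogates do not line up
```

Passing the key in as a parameter is only a convenience for the sketch; it keeps the hashing logic testable while leaving key retrieval to the real handler.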

nhs_mod11_check.sql

{% macro nhs_mod11_check(col) %}
    {%- set cleaned = "REGEXP_REPLACE(" ~ col ~ ", '[^0-9]', '')" -%}
    CASE
        WHEN LENGTH({{ cleaned }}) != 10
            THEN 'invalid_format'
        WHEN 11 - MOD( CAST(SUBSTRING({{ cleaned }},1,1) AS INT)*10
                     + CAST(SUBSTRING({{ cleaned }},2,1) AS INT)*9
                     + ... + CAST(SUBSTRING({{ cleaned }},9,1) AS INT)*2, 11) = 10
            THEN 'mod11_invalid'        -- remainder 10 is not a valid NHS check digit
        WHEN <recomputed check digit> != CAST(SUBSTRING({{ cleaned }},10,1) AS INT)
            THEN 'mod11_mismatch'       -- computed check digit ≠ the 10th digit
        ELSE NULL                       -- valid
    END
{% endmacro %}

-- The complementary filters — valid rows to patients, failures to quarantine (both verbatim):

-- dbt/models/silver/patients.sql:21-31
nhs_validated AS (
    SELECT *,
        {{ nhs_mod11_check('nhs_pseudo_id') }} AS _nhs_validation_failure
    FROM bronze
),
valid_only AS (
    SELECT *
    FROM nhs_validated
    WHERE _nhs_validation_failure IS NULL          -- valid → conformed patient model
),

-- dbt/models/silver/quarantine/patients_quarantine.sql:48-50
FROM deduped_validated
WHERE _rn = 1
  AND _nhs_validation_failure IS NOT NULL          -- invalid → quarantine, with rejection_reason
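For readers who want the check-digit arithmetic spelled out, here is a Python sketch of the standard Mod-11 algorithm that returns the same three rejection reasons as the macro. It is an illustration, not a port of the dbt code: the mapping of a computed 11 to check digit 0 follows the published algorithm and is assumed to match the elided part of the macro, split_rows is a hypothetical helper, and every number below is made up.

```python
import re

WEIGHTS = range(10, 1, -1)  # 10, 9, ..., 2 for the first nine digits


def nhs_mod11_check(raw):
    """Return None for a valid number, otherwise a rejection reason."""
    cleaned = re.sub(r"[^0-9]", "", raw)
    if len(cleaned) != 10:
        return "invalid_format"
    total = sum(int(digit) * weight for digit, weight in zip(cleaned[:9], WEIGHTS))
    check = 11 - total % 11
    if check == 10:
        return "mod11_invalid"   # 10 can never be a check digit
    if check == 11:
        check = 0                # remainder 0 maps to check digit 0
    if check != int(cleaned[9]):
        return "mod11_mismatch"
    return None


def split_rows(rows):
    """Route each row to exactly one of two outputs, so nothing is dropped."""
    valid, quarantine = [], []
    for nhs in rows:
        reason = nhs_mod11_check(nhs)
        if reason is None:
            valid.append(nhs)
        else:
            quarantine.append((nhs, reason))
    return valid, quarantine


valid, quarantine = split_rows(["9990000018", "9990000019", "12345"])
print(valid)
print(quarantine)
```

Because every row takes exactly one branch in split_rows, the two outputs always add up to the input, which is the same guarantee the IS NULL / IS NOT NULL pair gives in SQL.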

iam.py

# infra/access_iq_infra/stacks/iam.py:60-71  — ingestion role writes ONLY bronze/* + _manifests/*
iam.PolicyStatement(
    actions=[
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
    ],
    resources=[
        f"arn:aws:s3:::{platform_bucket.bucket_name}/_manifests/*",
        f"arn:aws:s3:::{platform_bucket.bucket_name}/bronze/*",
    ],
),

# infra/access_iq_infra/stacks/warehouse.py:99-101  — Spectrum reads all, writes ONLY gold_export/*
lake_bucket.grant_read(spectrum_role)
# Gold export: UNLOAD writes Parquet to gold_export/ prefix (Phase 7, D-05)
lake_bucket.grant_write(spectrum_role, "gold_export/*")
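The dashboard reader's grant is described in the text but not shown in the snippet. As a rough sketch of what a prefix-scoped, read-only policy of that shape could look like, the function below builds the policy document as a plain dictionary. Everything here is an assumption for illustration: the function name, the bucket name, the key ARN, and the decision to include a prefix-conditioned s3:ListBucket statement (needed only if the reader lists files rather than fetching known keys).

```python
import json


def dashboard_reader_policy(bucket_name, kms_key_arn, prefix="gold_export/"):
    """Build a read-only IAM policy document limited to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadExportObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/{prefix}*"],
            },
            {
                "Sid": "ListExportPrefixOnly",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "DecryptExport",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": [kms_key_arn],
            },
        ],
    }


# Hypothetical identifiers, for illustration only.
policy = dashboard_reader_policy(
    "example-lake-bucket",
    "arn:aws:kms:eu-west-2:123456789012:key/EXAMPLE",
)
print(json.dumps(policy, indent=2))
```

No statement carries a write action or a wildcard action, so a leaked static credential would expose the published Gold export and nothing upstream of it.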

Observability & Operations

All pipeline events are emitted as structured JSON via structlog to per-source CloudWatch log groups — one per ingestion source, plus the orchestration pipeline. Fifteen metric filters parse those logs into custom metrics on two schemas: ingestion filters key on a $.status field (success/failed per source), while downstream stages emit named $.event records — GE gate, dbt Silver/Gold, export, pipeline completion (Snippet 1). Those metrics feed six classes of alarm: ingestion failure, GE gate failure, GE validation/checkpoint error, pipeline staleness (48h), Gold export staleness (50h), and budget threshold. The staleness alarms invert the usual pattern — they fire on the absence of a completion event over the window, treating missing data as breaching.

A separate EventBridge rule catches ECS container exits — OOM kills and non-zero exits (Snippet 2). These are invisible to application logging because the container dies before it can emit a log event, so the signal has to come from the ECS control plane, not the app.

The budget alarm is a hard safety net. When monthly actual spend crosses 80% of the ceiling ($10 dev / $20 prod), AWS Budgets publishes to an SNS topic — locked down to accept publishes only from the Budgets service — which triggers a Lambda that notifies via SNS and Slack, then destroys the five ephemeral stacks: compute, warehouse, network, observability and iam (Snippet 3). Crucially, data preservation isn't a matter of the Lambda's list being right: its IAM policy scopes cloudformation:DeleteStack to exactly those five stack ARNs, so it physically cannot delete the stateful stacks (lake, secrets, catalog, ecr) even if it tried. Data is preserved by permission, not by hope. Spend stops.

observability.py

# Ingestion: filter on a $.status field, one metric per source
mf_failed = logs.MetricFilter(
    self,
    f"MetricFilter-{safe_id}",
    log_group=lg,
    filter_pattern=logs.FilterPattern.string_value("$.status", "=", "failed"),
    metric_namespace=metric_namespace,
    metric_name=f"IngestionFailed-{source}",
    metric_value="1",
    default_value=0,
)

# Downstream stages: filter on named $.event records (7 filters)
for event_name, metric_name in (
    ("ge_gate_failed", "GEGateFailed"),
    ("ge_gate_passed", "GEGatePassed"),
    ("gold_export_complete", "GoldExportComplete"),
    ("dbt_silver_complete", "DbtSilverComplete"),
    ("dbt_gold_complete", "DbtGoldComplete"),
    ("pipeline_complete", "PipelineComplete"),
    ("validation_error", "ValidationError"),
):
    mf = logs.MetricFilter(
        self,
        f"MetricFilter-{metric_name}",
        log_group=pipeline_lg,
        filter_pattern=logs.FilterPattern.string_value("$.event", "=", event_name),
        metric_namespace=metric_namespace,
        metric_name=metric_name,
        metric_value="1",
        default_value=0,
    )
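To make the two ideas above concrete, the following is a local simulation in plain Python, not CloudWatch code. metric_points mimics what a JSON metric filter does (emit 1 for each log line whose field equals a value), and is_stale mimics a staleness alarm that treats an empty window as breaching. Both function names, the sample log lines and the timestamps are invented for the example.

```python
import json
from datetime import datetime, timedelta, timezone


def metric_points(log_lines, field, expected):
    """Emit 1 per JSON log line whose field equals the expected value."""
    points = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON lines cannot match a JSON field pattern
        if isinstance(record, dict) and record.get(field) == expected:
            points.append(1)
    return points


def is_stale(completion_times, now, window):
    """True when no completion event falls inside the trailing window."""
    return not any(now - window <= t <= now for t in completion_times)


lines = [
    '{"event": "dbt_silver_complete", "duration_s": 412}',
    '{"event": "pipeline_complete", "duration_s": 5230}',
    "plain text line that is not JSON",
    '{"status": "failed", "source": "appointments"}',
]

now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_run = datetime(2024, 1, 7, 12, 0, tzinfo=timezone.utc)  # 72 hours earlier

print(metric_points(lines, "event", "pipeline_complete"))
print(metric_points(lines, "status", "failed"))
print(is_stale([last_run], now, timedelta(hours=48)))
```

The staleness check never looks for a failure record. It fires because nothing arrived, which is why a pipeline that silently stops running is still caught.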

observability.py

ecs_oom_rule = events.Rule(
    self,
    "EcsOomRule",
    rule_name=f"{cfg.app_name}-{cfg.env_name}-ecs-oom-detection",
    description="Detect ECS task OOM kills and non-zero exit codes",
    event_pattern=events.EventPattern(
        source=["aws.ecs"],
        detail_type=["ECS Task State Change"],
        detail={
            "lastStatus": ["STOPPED"],
            "stoppedReason": [{"prefix": "Essential container in task exited"}],
        },
    ),
)
ecs_oom_rule.add_target(targets.SnsTopic(sns_topic))
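The rule as shown matches on the stoppedReason prefix rather than on an exit code, so whatever consumes the SNS message may want to tell an OOM kill from an ordinary non-zero exit, or from a clean exit if those also arrive. The sketch below shows one way to do that from the event payload. The classify_task_stop function and the sample event are hypothetical; the containers, exitCode and reason fields are those ECS includes in Task State Change events.

```python
def classify_task_stop(event):
    """Label an ECS Task State Change event by why its containers stopped."""
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return "ignore"
    containers = detail.get("containers", [])
    # ECS reports memory kills in the container-level reason string.
    if any("OutOfMemory" in (c.get("reason") or "") for c in containers):
        return "oom_kill"
    if any(c.get("exitCode") not in (None, 0) for c in containers):
        return "non_zero_exit"
    return "clean_exit"


# Hypothetical event, trimmed to the fields the classifier reads.
sample_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {
        "lastStatus": "STOPPED",
        "stoppedReason": "Essential container in task exited",
        "containers": [
            {
                "name": "pipeline",
                "exitCode": 137,
                "reason": "OutOfMemoryError: Container killed due to memory usage",
            }
        ],
    },
}

print(classify_task_stop(sample_event))
```

Checking the reason string before the exit code matters because an OOM kill also carries a non-zero exit code (137), and the more specific label is the more useful one in an alert.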

budget.py

def handler(event, context):
    stacks = os.environ["EPHEMERAL_STACKS"].split(",")   # injected by CDK, not hardcoded
    ...
    cf = boto3.client("cloudformation", region_name=region)
    for stack in stacks:
        try:
            cf.delete_stack(StackName=stack)
            print(f"Initiated destroy: {stack}")
        except Exception as e:
            print(f"Skip {stack}: {e}")

# The guarantee — IAM scopes deletion to exactly the ephemeral stacks, and the topic can't be spoofed · budget.py:100-106, 160-169

# Only AWS Budgets may publish to the teardown topic (T-09-02)
topic.add_to_resource_policy(
    iam.PolicyStatement(
        principals=[iam.ServicePrincipal("budgets.amazonaws.com")],
        actions=["sns:Publish"],
        resources=[topic.topic_arn],
    )
)

# DeleteStack scoped to the named ephemeral stacks ONLY — stateful stacks are unreachable
teardown_fn.add_to_role_policy(
    iam.PolicyStatement(
        actions=["cloudformation:DeleteStack"],
        resources=[
            f"arn:aws:cloudformation:{teardown_region}:{teardown_account}:stack/{name}/*"
            for name in ephemeral_stack_names
        ],
    )
)
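The effect of scoping DeleteStack by ARN can be illustrated with a small local check. This is a sketch only: it uses Python's fnmatchcase as a rough approximation of IAM's "*" wildcard matching, and the stack names, region, account ID and stack ID suffix are all hypothetical rather than taken from the project.

```python
from fnmatch import fnmatchcase

# Hypothetical stack names; the real ones are injected by CDK.
EPHEMERAL = ["aiq-dev-compute", "aiq-dev-warehouse", "aiq-dev-network",
             "aiq-dev-observability", "aiq-dev-iam"]
STATEFUL = ["aiq-dev-lake", "aiq-dev-secrets", "aiq-dev-catalog", "aiq-dev-ecr"]

REGION = "eu-west-2"
ACCOUNT = "123456789012"


def stack_arn_pattern(name):
    """Resource pattern of the same shape as the policy statement above."""
    return f"arn:aws:cloudformation:{REGION}:{ACCOUNT}:stack/{name}/*"


def may_delete(stack_arn, allowed_patterns):
    """Approximate the IAM resource match for cloudformation:DeleteStack."""
    return any(fnmatchcase(stack_arn, pattern) for pattern in allowed_patterns)


allowed = [stack_arn_pattern(name) for name in EPHEMERAL]


def example_arn(name):
    # Real stack ARNs end in a generated ID; this suffix is a placeholder.
    return f"arn:aws:cloudformation:{REGION}:{ACCOUNT}:stack/{name}/0000-example"


for name in EPHEMERAL + STATEFUL:
    print(name, may_delete(example_arn(name), allowed))
```

Because each pattern ends the stack name with a slash before the wildcard, a stateful stack whose name merely begins with an ephemeral name would still not match.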

Showcase

Three dashboard pages, each structured around questions Trust leadership needs to answer. The dashboard reads static Gold Parquet exports via DuckDB — no live warehouse connection required, available 24/7 at $0/month hosting cost.

Business question

Are our referral-to-treatment times within acceptable limits across the Trust?

Insight

Trust-wide KPIs suggest moderate pressure — but provider-level breakdown reveals two sites with P90 waits above 250 days, more than double the Trust average. The aggregate number was masking a concentrated problem.

Decision enabled

Commission a targeted capacity review at the two outlier sites. The provider-level breakdown provides the evidence base — without it, the issue would remain invisible in Trust-wide reporting.

Business question

Are we providing equitable access to care across all socioeconomic groups?

Insight

Patients in the most deprived areas wait nearly three times longer than those in the least deprived — a 91-day gap between Decile 1 and Decile 10. The gradient is consistent across all deprivation levels, not an isolated outlier at either extreme. The SII of 147.32 quantifies the full slope for board reporting.

Decision enabled

Target pathway redesign and outreach at the highest-deprivation deciles. The SII gives leadership a single trackable figure for quarterly inequality-reduction commitments — and a baseline to measure whether interventions are closing the gap.
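For readers unfamiliar with the Slope Index of Inequality (SII), the sketch below shows the usual construction: place each deprivation group at the midpoint of its cumulative population share, fit a population-weighted straight line through the group means, and report the slope. How the Gold model computes it is not shown in this case study, so treat this as a generic illustration. The function name is hypothetical, the groups are ordered from least to most deprived so that a positive slope means longer waits with deprivation, and the figures are invented, not Northshire data.

```python
def slope_index_of_inequality(groups):
    """groups: (population, mean_outcome) pairs, ordered least to most deprived."""
    total = sum(population for population, _ in groups)
    ranks, weights, outcomes = [], [], []
    cumulative = 0.0
    for population, outcome in groups:
        share = population / total
        ranks.append(cumulative + share / 2)  # midpoint of the group's share
        weights.append(share)
        outcomes.append(outcome)
        cumulative += share
    # Population-weighted least-squares slope of outcome on rank.
    x_bar = sum(w * x for w, x in zip(weights, ranks))
    y_bar = sum(w * y for w, y in zip(weights, outcomes))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, ranks, outcomes))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, ranks))
    return sxy / sxx


# Invented example: ten equal-sized groups, mean wait rising 10 days per group.
example_groups = [(1000, 55 + 10 * i) for i in range(10)]

sii = slope_index_of_inequality(example_groups)
extreme_gap = example_groups[-1][1] - example_groups[0][1]
print(round(sii, 2), extreme_gap)
```

In this invented case the slope is 100 days while the gap between the two extreme groups is 90, because the fitted line runs across the whole population rather than between two group midpoints. That is why an SII and a simple decile gap are different numbers and should be read separately.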

Business question

When does A&E come under the most pressure — and are we staffed to match it?

Insight

Demand concentrates heavily on Saturday and Sunday evenings between 17:00 and 21:00 — a pattern invisible in weekly or monthly aggregate reporting. Weekday staffing patterns don't reflect where the actual pressure falls.

Decision enabled

Redesign weekend triage staffing rotas to match observed demand peaks. The heatmap provides the evidence base for a commissioner-facing business case — visual, specific, and impossible to dismiss with aggregate data.

Open the live dashboard ↗

Outcomes

The platform runs end to end from a cold deploy — three parallel ingests, 20 dbt models, automated quality gates and dashboard export — in about 90 minutes, and back to $0 in 40.

Must requirements: 22 of 23 delivered · 1 partial

Should requirements: 9 of 10 delivered · 1 gap

lines of Python across the platform

lines of infrastructure as code

CDK stacks · full deploy/destroy lifecycle

20 dbt models · 10 Silver + 10 Gold

Architecture Decision Records

$0 idle cost · from $2 a working session

What worked

The two-account boundary forced every access path through explicit IAM and VPC peering — a more secure design than a single account.
Manifest-based idempotency handled every re-run and backfill without special cases.
Ephemeral infrastructure made configuration drift impossible — nothing survives a teardown unless it's in code.

What I'd change

Build one vertical slice end-to-end before scaling horizontally — building all of each layer first meant cross-layer issues surfaced late, as expensive rework.
Wire live-stack integration tests from week one — several Spectrum and cross-account IAM bugs only surfaced during manual deploys.

What's next

From moving data to learning from it.

The successor turns the Gold marts into features: demand forecasting to predict capacity pressure weeks ahead, and a DNA-risk model that flags at-risk patients for proactive outreach — measured back through the same inequality dashboard, closing the loop between prediction and outcome. That's the next case study.