Securely Exposing Hardware-in-the-Loop: Access Controls for Pi and GPU Testbeds
Practical guide to securely expose Raspberry Pi AI HAT and NVLink GPU testbeds to CI with least privilege and auditability.
Hook: stop production surprises — secure your hardware-in-the-loop (HIL) testbeds
Environment drift, fragile CI jobs that suddenly fail on hardware, and untracked access to expensive GPU and edge devices are everyday risks for engineering teams in 2026. If your CI pipeline can trigger a job that touches a Raspberry Pi AI HAT or an NVLink-connected GPU node without fine-grained controls and immutable audit trails, you’re one misconfigured secret or rogue PR away from downtime, data exposure, or costly misuse.
The 2026 context: why HIL security matters now
Two hardware trends that shaped late 2025 and early 2026 make this topic urgent: the rapid adoption of capable edge AI modules (for example, the Raspberry Pi 5 and the new AI HAT+ 2) and tighter integration between heterogeneous CPUs and GPUs through fabrics such as NVLink Fusion. Major industry moves — like RISC-V silicon vendors integrating NVLink support — mean heterogeneous clusters are becoming mainstream, not exotic research setups.
“The new AI HAT+ 2 unlocks generative AI for the Raspberry Pi 5.” — ZDNET (late 2025)
“SiFive will integrate Nvidia's NVLink Fusion infrastructure with its RISC-V processor IP platforms.” — Forbes (Jan 2026)
These advances let teams run real, production-like inference and multi-GPU training experiments in preprod — but they also increase the attack surface. In 2026 organizations must balance developer velocity with zero-trust controls, least privilege, hardware attestation, and auditable access to HIL assets.
Threat model and security goals
Before designing controls, make the threat model explicit:
- Compromised CI runner or pipeline credentials launching arbitrary jobs.
- Insider misuse or misconfiguration allowing broad access to device firmware or data stored on devices.
- Telemetry or model artifacts exfiltration from edge devices or GPU nodes.
- Unintended hardware usage leading to cost overruns (for example, GPUs left running 24/7).
Security goals derived from that model:
- Least privilege — only the exact test job has the minimum device access and for the minimum time.
- Auditability — every session, command, and job artifact is logged and tamper-evident.
- Reproducibility — provisioned test environments are ephemeral and created from versioned IaC manifests.
- Cost controls — devices scale to zero or sleep when idle; expensive GPUs are autoscaled.
Secure architecture patterns for exposing HIL to CI
There are three practical patterns teams adopt; choose the one that fits your scale and risk appetite:
- Self-hosted runner per device — a runner lives on or next to the hardware (Pi or GPU node). Pros: direct low-latency access and simple scheduling. Cons: higher blast radius unless hardened and ephemeral.
- Shared device pool behind a hardware gateway — a central service mediates test requests, queues runs, and binds jobs to devices. Pros: fine-grained access control and centralized auditing. Cons: added complexity and potential gateway target for attackers.
- Kubernetes-native device orchestration — devices are nodes in a K8s cluster (or mapped via device proxies), and CI spawns jobs as pods using device plugins (NVIDIA device plugin or custom device-plugin for Pi peripherals). Pros: RBAC, admission control, and common tooling. Cons: requires K8s expertise and careful node isolation.
Network topology — segmentation and zero-trust
Design the network so hardware is reachable only via controlled paths:
- Place Pi and GPU clusters in a dedicated VPC/subnet with strict NACLs or security groups.
- Use a bastion or hardware gateway that implements mTLS and performs session authentication and authorization. Avoid exposing SSH directly to CI runners.
- Consider per-job ephemeral tunnels (short-lived WireGuard or mTLS reverse tunnels) to bind a CI job to a device for the test duration.
- Use DNS-based allowlists and egress restrictions so test artifacts only flow to approved storage endpoints.
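One way to realize the egress restriction above is a host firewall with a default-deny output policy. A minimal nftables sketch follows; the table name, addresses, and the assumption of an internal-only DNS resolver are all illustrative:

```shell
# Default-deny egress from a testbed host; permit established flows plus
# HTTPS to the artifact store and hardware gateway only (addresses illustrative).
nft add table inet hil
nft add chain inet hil egress '{ type filter hook output priority 0; policy drop; }'
nft add rule inet hil egress ct state established,related accept
nft add rule inet hil egress ip daddr 10.20.0.10 tcp dport 443 accept  # artifact store
nft add rule inet hil egress ip daddr 10.20.0.11 tcp dport 443 accept  # hardware gateway
nft add rule inet hil egress udp dport 53 ip daddr 10.20.0.2 accept    # internal DNS only
```

Apply rules like these via your IaC tooling rather than by hand, so the allowlist is version-controlled alongside the rest of the testbed configuration.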
Identity and Access Control: short-lived credentials and workload identity
Replace long-lived keys with short-lived tokens and proven identity flows:
- Use OIDC for CI identity federation — GitHub Actions, GitLab, and many CI systems support OIDC to request cloud tokens for specific jobs.
- Issue short-lived SSH certificates via an SSH CA (HashiCorp Vault, Smallstep) to access devices. Certificates should be valid for the job window only — follow a security checklist for granting automated access.
- On Kubernetes, use Workload Identity (GKE) or projected tokens to grant ServiceAccounts ephemeral cloud credentials instead of embedding keys.
- Map CI pipeline identities to resource-specific roles (e.g., an action can only request access to the ‘gpu-test-queue’ resource).
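As a concrete sketch of the SSH-CA flow above, here is how a job might obtain a short-lived certificate from HashiCorp Vault's SSH secrets engine. The mount path `ssh-client-signer`, the role `ci-job`, and the principal and host names are illustrative:

```shell
# Sign the runner's public key; the resulting certificate is only valid
# for 15 minutes and only for the "hil-test" principal.
vault write -field=signed_key ssh-client-signer/sign/ci-job \
    public_key=@"$HOME/.ssh/id_ed25519.pub" \
    valid_principals="hil-test" \
    ttl=15m > "$HOME/.ssh/id_ed25519-cert.pub"

# The device's sshd trusts the CA public key via TrustedUserCAKeys,
# so no per-device authorized_keys entries are needed.
ssh -i "$HOME/.ssh/id_ed25519" hil-test@gpu-testbed-01
```

Because the certificate expires on its own, revocation after the job window is a safety net rather than the primary control.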
Enforcing least privilege on devices
At the OS and container level:
- Run test workloads in containers with minimal capabilities: drop CAP_SYS_ADMIN and unnecessary CAP_*; use seccomp, AppArmor, or SELinux profiles.
- Use device cgroup allowlists (cgroup v2 device ACLs) or udev rules to expose only the specific device nodes the job needs (GPU device file or an I2C device on Pi).
- Prefer non-privileged pods in Kubernetes; if node-level access is required, limit it with admission controllers and explicit annotations.
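The container-level controls above can be combined in a single invocation. A hedged example using rootless Podman, assuming a hypothetical `hil-gpio-test` image and a GPIO-based test suite:

```shell
# Run a GPIO test in a rootless container that sees only /dev/gpiochip0.
# All capabilities are dropped; the default seccomp profile stays enabled,
# the rootfs is read-only, and privilege escalation is blocked.
podman run --rm \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  --device /dev/gpiochip0 \
  --read-only --tmpfs /tmp \
  localhost/hil-gpio-test:latest ./run-tests.sh
```

Any additional device node a test needs should be added as another explicit `--device` flag, never by falling back to `--privileged`.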
Session control and recording
For interactive debugging sessions, implement session management:
- Use tools that provide session recording (tlog, sudo with session logging, or commercial session management products) and stream session transcripts to immutable storage.
- For SSH, force the use of an audit proxy that logs all input/output and stores logs in your SIEM.
CI integrations — patterns and examples
CI system integration should prioritize ephemeral provisioning, policy enforcement, and result capture. Below are pragmatic approaches:
Pattern A — GitHub Actions + actions-runner-controller + K8s device plugin
Use the actions-runner-controller project to create ephemeral runner pods that claim GPU resources via the NVIDIA device plugin or bind to a Pi node via nodeSelector.
# Example: a RunnerDeployment that schedules ephemeral runners onto NVLink GPU nodes
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runner
spec:
  replicas: 1
  template:
    spec:
      labels:
        - gpu-test
      nodeSelector:
        hardware-type: gpu-nvlink
      tolerations:
        - key: "gpu"
          operator: "Exists"
      resources:
        limits:
          nvidia.com/gpu: 1
This lets you keep RBAC and admission policies applied at pod creation time. The runner pod will have only 1 GPU visible and is ephemeral — delete the pod and the runner goes away.
Pattern B — Self-hosted runner on Pi with strict sandboxing
When tests require direct hardware (audio codec, camera, GPIO), run a self-hosted runner on the Pi but enforce:
- Containerized jobs only; use rootless container runtimes like Podman or rootless Docker.
- Device allowlisting; enable secure boot and persistent kernel crash logging (pstore) where the hardware supports them.
- Automatic runner reprovisioning via cloud-init/Terraform for lab scale.
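To keep Pi-hosted runners single-use, register them in ephemeral mode so each accepts exactly one job and then deregisters; reprovisioning recreates the runner from a clean image. A sketch using GitHub's actions-runner `config.sh` (repository URL, labels, and hostname are illustrative):

```shell
# Register an ephemeral runner: it deregisters itself after one job,
# so a compromised job cannot linger on the device.
./config.sh --url https://github.com/your-org/your-repo \
  --token "$RUNNER_REG_TOKEN" \
  --ephemeral \
  --labels pi5,ai-hat \
  --name "pi-bench-07"
./run.sh
```

Pair this with a systemd unit or cloud-init hook that re-registers the runner only after the device passes its post-job health checks.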
Pattern C — Gateway + job queue (recommended for mixed fleets)
Build a small authorization gateway that accepts signed job requests from CI, performs policy checks, and binds jobs to devices. Gateway responsibilities:
- Authorize the CI OIDC assertion and check job claims (labels, test name, artifact destinations).
- Enforce device allowlists and allocate resources from a dynamic inventory.
- Spin up ephemeral tunnels (SSH or mTLS) and revoke them after the job.
Raspberry Pi AI HAT specifics — protect edge AI modules
Pi devices pose particular risks: local storage with secrets, exposed peripheral buses (I2C, SPI), and simpler OS stacks. Hardening checklist:
- Disable unneeded services and secure SSH with certificate-based auth only.
- Use OS-level attestation where possible: enable secure boot or sign your kernels and images.
- Lock down peripheral access via udev rules and only mount required device files into containers.
- Ensure AI HAT firmware is up-to-date and that your update pipeline is signed and auditable.
- Record sensor and peripheral interactions in a tamper-evident log (push to central S3/Blob with versioning).
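The peripheral lockdown step above can be expressed as a udev rule. A sketch that grants a dedicated group access to the AI HAT's I2C bus while every other device node keeps its default root-only permissions (the group name and rule filename are illustrative):

```shell
# Only members of "hil-test" may open /dev/i2c-1; mount that single node
# into the test container rather than the whole /dev tree.
sudo tee /etc/udev/rules.d/99-hil-i2c.rules >/dev/null <<'EOF'
SUBSYSTEM=="i2c-dev", KERNEL=="i2c-1", GROUP="hil-test", MODE="0660"
EOF
sudo udevadm control --reload-rules && sudo udevadm trigger
```

Manage rules like this from your provisioning manifests so a reflashed device comes back with the same peripheral policy.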
NVLink-connected GPU testbeds — rules for multi-GPU safety
NVLink creates a high-speed fabric that only makes sense inside a physical cluster. Key operational controls:
- Do not attempt to virtualize NVLink across a WAN — keep NVLink topologies local and schedule jobs requiring NVLink via your scheduler (Slurm, K8s with nodeAffinity).
- Use NVIDIA MIG (multi-instance GPU) where possible to create bounded GPU slices for tests that need isolation but not NVLink-level bandwidth.
- Install the NVIDIA GPU Operator to unify drivers, DCGM monitoring, and device plugin lifecycle management.
- Limit NVLink-enabled job submission to an allowlisted CI identity and require manual approval for jobs that request full NVLink fabric access.
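For the MIG recommendation above, partitioning is driven through `nvidia-smi`. A sketch for an A100-class GPU (profile IDs vary by GPU model, and the node must be drained first since enabling MIG resets the GPU):

```shell
# Enable MIG mode on GPU 0, then carve two isolated 3g.20gb slices
# (profile ID 9 on A100) with default compute instances.
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# List the resulting MIG devices; each can be handed to a separate
# test job via the device plugin with no cross-slice visibility.
nvidia-smi -L
```

Tests that genuinely need NVLink bandwidth cannot run inside MIG slices, which is exactly why full-fabric access deserves the manual-approval gate described above.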
Auditability and forensics — the logs you need
Audits fail when there are gaps. Capture these sources:
- CI job metadata and OIDC assertions (who requested the job, repo/PR, commit hash).
- Session recordings (SSH, interactive terminals) stored immutably.
- Device telemetry: dmesg, journalctl, NVIDIA DCGM telemetry, and Pi system logs.
- Kubernetes audit logs and admission controller decisions (deny/allow, policy names).
- Network flows to and from devices (VPC flow logs, eBPF-based capture if needed), shipped directly to your SIEM.
Ensure logs are sent to a tamper-evident store (object storage with versioning and access logs). For high-assurance setups, integrate a WORM (write-once read-many) storage or object lock policy — if you operate under special compliance constraints consider a sovereign cloud migration plan.
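For the WORM requirement, S3 Object Lock is one concrete option. A hedged sketch, assuming a bucket named `hil-audit-logs` that was created with Object Lock enabled (it cannot be retrofitted onto an existing bucket without AWS support involvement):

```shell
# Apply a default COMPLIANCE-mode retention of one year: objects cannot be
# deleted or overwritten by any identity, including root, until it expires.
aws s3api put-object-lock-configuration \
  --bucket hil-audit-logs \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}}
  }'
```

GOVERNANCE mode is a softer alternative that privileged identities can bypass; choose it only if your compliance regime permits.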
Concrete example: ephemeral GPU job requested from GitHub Actions
Summary flow:
- GitHub Action triggers and uses OIDC to request an ephemeral cloud token for the job.
- The token is used to call your hardware gateway API to request a GPU slot labeled nvlink-ready.
- Gateway runs policy checks (repo allowlist, PR status) and allocates a GPU node. It issues a short-lived SSH certificate scoped to the job.
- Gateway creates an ephemeral tunnel to the node and returns the tunnel endpoint and certificate to the runner.
- Runner executes the test, streams logs to a signed artifact store, and on completion the gateway revokes the certificate and tears down the tunnel.
Benefits: no long-lived credentials on runners, central approval and logging, and devices returned to the pool automatically.
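The first two steps of that flow can be sketched from inside a GitHub Actions job that declares `permissions: id-token: write`. The `ACTIONS_ID_TOKEN_REQUEST_*` environment variables are provided by GitHub; the gateway host and its `/v1/allocate` API are hypothetical:

```shell
# Fetch a job-scoped OIDC token, bound to the gateway's expected audience.
OIDC_TOKEN=$(curl -sSf \
  -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
  "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=hil-gateway" | jq -r '.value')

# Exchange the assertion at the gateway; it validates repo/PR claims,
# allocates a device, and returns a tunnel endpoint plus SSH certificate.
curl -sSf -X POST https://hil-gateway.internal/v1/allocate \
  -H "Authorization: Bearer $OIDC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"device_class": "nvlink-ready", "ttl_seconds": 900}'
```

The OIDC token embeds the repository, ref, and workflow identity as signed claims, so the gateway can enforce its allowlist without any shared secret on the runner.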
Operational playbook: onboarding, rotation, and cost controls
- Inventory and label every device with canonical metadata (hardware-type, capabilities, last-patch-date) and surface that in your operational dashboards (designing resilient operational dashboards).
- Automate OS and firmware patching and validate with smoke tests before rejoining the pool.
- Rotate any device credentials and certificates routinely; use a secret manager (Vault) to issue dynamic credentials.
- Implement autosleep/scale-to-zero for Pi test benches and autoscale-to-zero for GPU pools; pair this with power orchestration and local PDU/UPS management (micro‑DC PDU & UPS orchestration); bill back usage to teams to avoid resource leaks.
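When devices are Kubernetes nodes, the inventory and patch-rotation steps above map onto standard labeling and cordoning commands (node name and label values are illustrative):

```shell
# Record canonical metadata on the node so schedulers and dashboards can
# filter on it; label values may contain dots but not slashes.
kubectl label node pi-bench-07 hardware-type=pi5-aihat2 last-patch-date=2026.01.15

# Take the node out of the pool while patching, then rejoin it only
# after firmware updates and smoke tests pass.
kubectl cordon pi-bench-07
kubectl uncordon pi-bench-07
```

Driving these commands from the patching pipeline, rather than by hand, keeps the dashboard metadata and the actual device state from drifting apart.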
Compliance checklist (SOC2/ISO-friendly controls)
- Access control — documented RBAC policies, short-lived credentials, and principle of least privilege.
- Change management — IaC manifests for device provisioning and version-controlled test harnesses.
- Monitoring & logging — centralized logs, alerts for abnormal hardware usage, and retention policies.
- Incident response — playbook for device compromise and forensic collection steps.
- Data protection — ensure test artifact destinations are approved and encrypted at rest/in transit.
If you need to meet public-sector standards consider how FedRAMP or similar approvals affect platform choices.
2026 and beyond — future-proofing your HIL security
Expect these trends to influence HIL strategies:
- Hardware attestation becomes common — ARM/SoC vendors and Pi-class boards will ship stronger root-of-trust and attestation stacks, letting you prove device identity before scheduling tests.
- RISC-V + NVLink will enable new heterogeneous nodes (as announced in 2026), making scheduler-side topology awareness essential.
- Confidential compute and TEEs will be used for sensitive model testing in preprod, isolating weights and datasets even from ops teams.
- Policy-as-code will move to real-time enforcement, applied by admission controllers and service meshes at the hardware access level.
Final checklist — immediate actions you can take this week
- Audit who currently can start CI jobs that touch hardware. Remove any generic ‘ci-bot’ wildcards.
- Set up an SSH CA and rotate to certificate-based device auth for at least one test device — consult vendor comparisons for identity tooling (identity vendor comparison).
- Implement one ephemeral runner workflow in a sandbox: use actions-runner-controller or ephemeral self-hosted runners.
- Start shipping device logs to a central, versioned object store and ensure retention is configured.
Call to action
If you’re responsible for preprod environments or CI infrastructure, start by hardening one hardware path end-to-end: OIDC-based job identity, short-lived credentials (SSH certs), an authorization gateway, and immutable logs. Want a checklist tailored to your stack (GitHub/GitLab/Jenkins + Kubernetes + Terraform)? Contact our preprod.cloud team for a security review or grab our open-source policy templates and Kubernetes manifests to get an ephemeral runner prototype running in a few hours.
Actionable takeaway: prioritize ephemeral credentials + centralized gating for all HIL jobs. It’s the simplest change that reduces blast radius dramatically while keeping developer velocity.
Related Reading
- Identity Verification Vendor Comparison: Accuracy, Bot Resilience, and Pricing
- Security Checklist for Granting AI Desktop Agents Access to Company Machines
- Designing Resilient Operational Dashboards for Distributed Teams — 2026 Playbook
- GPU End-of-Life and What It Means for Esports PCs: The RTX 5070 Ti Case Study
- Field Report: Micro‑DC PDU & UPS Orchestration for Hybrid Cloud Bursts (2026)