9 — Architecture Decisions

ADR-001: Java 21 + Quarkus 3.x over Python/FastAPI or Spring Boot

Status: Revised (was Python + FastAPI) Context: The MVP targets a microservice architecture deployed as container-native services. Two competing concerns: development speed (solo, LLM-assisted) and production reliability (trust infrastructure, compile-time safety). Python + FastAPI was the original choice for speed; on review, Java 21 + Quarkus fits the microservice target better and Quarkus's dev mode removes most of the iteration speed penalty.

Decision: Java 21 + Quarkus 3.x with GraalVM Native Image for production builds.

Stack breakdown:

| Layer | Technology | Replaces |
|---|---|---|
| REST API | RESTEasy Reactive (JAX-RS) | FastAPI routers |
| Persistence | Hibernate ORM + Panache | SQLAlchemy |
| Migrations | Liquibase | Alembic |
| Validation | Hibernate Validator (Bean Validation) | Pydantic |
| Portal templates | Qute | Jinja2 |
| Spider scheduler | Quarkus Scheduler | APScheduler |
| HTTP client (Spider) | Quarkus REST Client Reactive | aiohttp |
| Health checks | SmallRye Health (/q/health) | Manual /health route |
| Metrics | Micrometer | Manual |
| Security (API key) | Quarkus Security custom identity provider | Custom middleware |
| Build | Maven 3.9 | pip / uvicorn |
| Testing | JUnit 5 + @QuarkusTest + RestAssured | pytest |
| Production binary | GraalVM Native Image | Python interpreter |
| Dev loop | quarkus dev (JVM mode, live reload, continuous testing) | uvicorn --reload |

Rationale:

  • Compile-time safety: Quarkus resolves dependency injection, validation, and REST binding at build time — not at runtime via reflection. Errors that would surface as 500 at runtime in Python surface as build failures in Quarkus.
  • Purpose-built for microservices: Quarkus's design assumption is container-native, independently deployable services. Spring Boot was designed for monoliths first and microservices second; Quarkus is the reverse.
  • Native image quality: GraalVM Native Image works cleanly with Quarkus because Quarkus uses no runtime reflection by default. Spring Boot's native image support requires reflection hints for anything that Spring's runtime proxy model touches. Quarkus native: ~50–80MB RAM per service, ~100ms startup.
  • Dev mode removes the speed penalty: quarkus dev gives instant live reload, continuous test execution, and Dev UI — the iteration loop is comparable to Python for day-to-day development. Native build only runs for the production image.
  • Java 21 virtual threads: Reactive-style concurrency (needed for the Spider's async HTTP checks) without the reactive programming model's complexity. @RunOnVirtualThread on the Spider scheduler gives scalable, blocking-style I/O without Mutiny/Reactor boilerplate (see the sketch after this list).
  • Liquibase: User's existing tool, Quarkus has a first-class Liquibase extension — no migration cost.
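A minimal sketch of that pattern — assuming the Quarkus scheduler extension accepts @RunOnVirtualThread on @Scheduled methods; the class name, interval, and target URL are illustrative, not the actual Spider code:

```java
// Illustrative sketch of a scheduled liveness check on a virtual thread.
import io.quarkus.scheduler.Scheduled;
import io.smallrye.common.annotation.RunOnVirtualThread;
import jakarta.enterprise.context.ApplicationScoped;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

@ApplicationScoped
public class LivenessCheckJob {

    private final HttpClient http = HttpClient.newHttpClient();

    // Plain blocking-style code; the virtual thread makes the blocking send() cheap,
    // so no Mutiny/Reactor pipeline is needed.
    @Scheduled(every = "15m")
    @RunOnVirtualThread
    void checkEndpoints() {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://example.org/q/health")).build();
        try {
            HttpResponse<Void> response = http.send(request, HttpResponse.BodyHandlers.discarding());
            // persist liveness_status, avg_response_ms, last_checked_at for the service here
        } catch (Exception e) {
            // record a failed check (e.g. increment consecutive_failures)
        }
    }
}
```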

Development model: Code and test in JVM mode (quarkus dev). CI builds the native image. Production container runs the native binary.

Rejected alternatives:

  • Python + FastAPI: dynamic typing; no compile-time safety; memory/startup acceptable but native image not available; retained for the SDK (client-side), not server-side
  • Spring Boot 3.x + GraalVM Native: native image works but requires reflection hints for Spring's proxy model; more operational complexity than Quarkus native for the same result
  • Go: fastest native binary; no JVM; but Carsten's background is JVM-based; no meaningful advantage over Quarkus native for this use case

Consequence for SDK: The apix-sdk-python and apix-sdk-typescript remain Python and TypeScript — the server being Java has no impact on client SDK language.


ADR-002: PostgreSQL + JSONB over MongoDB

Status: Decided Context: BSM is a structured document with a fixed core and a flexible optional section (regulatory, pricing). Registry operations are relational (search by capability, filter by country, join with liveness status).

Decision: PostgreSQL 16 with a JSONB column for the BSM payload.

Rationale: Relational integrity where it matters (service_id, liveness_status, timestamps are typed columns). JSONB for BSM payload flexibility without a separate document store. A single database engine to maintain. Liquibase manages migrations (see ADR-008).

Rejected alternatives:

  • MongoDB: no relational joins; schema migration story is weaker; adds operational complexity for no benefit
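A minimal sketch of the resulting mapping, assuming Hibernate ORM with Panache (ADR-001); the entity and column names are illustrative, not the project's actual schema:

```java
// Illustrative entity: typed relational columns plus a jsonb payload column.
import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;
import java.time.Instant;

@Entity
public class ServiceRecord extends PanacheEntity {

    // Relational integrity where it matters: typed, indexable columns.
    @Column(nullable = false, unique = true)
    public String serviceId;

    public String livenessStatus;

    public Instant lastCheckedAt;

    // Flexible BSM payload stored as a PostgreSQL jsonb column.
    @JdbcTypeCode(SqlTypes.JSON)
    @Column(columnDefinition = "jsonb")
    public String bsmPayload;
}
```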

ADR-003: Caddy over nginx or Traefik

Status: Decided Context: Need TLS termination and a reverse proxy. Solo operator, no DevOps team.

Decision: Caddy.

Rationale: Automatic TLS via Let's Encrypt with zero configuration. The Caddyfile is ~10 lines versus 50+ for an equivalent nginx config. Traefik adds Docker label complexity not needed for a two-service setup.

Rejected alternatives:

  • nginx: more control, more config; cert renewal needs certbot cron; higher solo maintenance burden
  • Traefik: good for dynamic service discovery; overkill for fixed 2-service Docker Compose

ADR-004: HTMX + Qute over React/Vue for Portal

Status: Revised (was HTMX + Jinja2 / FastAPI — updated for the Quarkus stack) Context: The portal is admin-grade, not consumer-grade. Primary users are registrants (submit BSM) and the admin (assign O-levels). No real-time requirements. The stack is now Quarkus, so Jinja2 (Python) is not available.

Decision: HTMX + Qute templates served from the Quarkus portal application.

Rationale: Qute is Quarkus's native, type-safe templating engine. Type-safe means template errors surface at build time, not at render time — consistent with the compile-time safety rationale of ADR-001. No JS build pipeline. No npm. HTMX handles dynamic form behaviour (inline validation, partial page updates) without a JS framework. Template hot-reload works in quarkus dev mode.

Qute specifics: Templates live in src/main/resources/templates/. Type-safe binding via @CheckedTemplate — the Java compiler verifies that template variables exist and are of the declared type. Significantly safer than Jinja2's runtime-only variable resolution. (A sketch follows the rejected alternatives below.)

Rejected alternatives:

  • React/Vue: overkill for admin portal; adds build pipeline maintenance; SPA adds complexity without user benefit
  • Freemarker / Thymeleaf: both work with Quarkus but are not type-safe; Qute is the idiomatic Quarkus choice
  • Jinja2: Python only; not available in Quarkus
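A minimal sketch of the @CheckedTemplate binding described under "Qute specifics", following the documented Quarkus pattern; the resource, template, and variable names are illustrative:

```java
// Illustrative type-safe Qute binding: the build fails if the template
// references a variable not declared on the native method signature.
import io.quarkus.qute.CheckedTemplate;
import io.quarkus.qute.TemplateInstance;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import java.util.List;

@Path("/search")
public class SearchPage {

    @CheckedTemplate
    static class Templates {
        // Backed by src/main/resources/templates/SearchPage/results.html.
        static native TemplateInstance results(List<String> services);
    }

    @GET
    @Produces(MediaType.TEXT_HTML)
    public TemplateInstance get() {
        return Templates.results(List.of("gleif-api", "opencorporates-api"));
    }
}
```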

ADR-005: Automated O-1 / O-2 / O-3 Verification in MVP; O-4 / O-5 Post-MVP

Status: Decided Context: The trust model has six organisation levels (O-0 to O-5). The original assumption was to defer all automated verification to Phase 2. On review: O-1, O-2, and O-3 are achievable in the MVP timeframe and are essential for the PoC to credibly demonstrate the trust model — not just describe it. O-4 and O-5 require human reviewers (Accredited Verifiers) and are genuinely post-MVP.

Decision: Implement automated verification for O-1, O-2, and O-3 in the MVP. O-4 and O-5 remain manual / post-MVP.

| Level | Mechanism | External dependency | MVP? |
|---|---|---|---|
| O-1 Identity Verified | DNS TXT record proof of domain ownership; business email MX check | Standard DNS resolver — no external API | Yes |
| O-2 Legal Entity Verified | GLEIF REST API (primary); OpenCorporates API (fallback for registrants without LEI) | GLEIF (free, public); OpenCorporates (free tier) | Yes |
| O-2 pre-condition | Sanctions screening against OFAC SDN + EU consolidated + UN SC lists | Public datasets; downloaded and cached locally; no live API call at check time | Yes |
| O-3 Hygiene Verified | HTTP fetch of /.well-known/security.txt; DNS DMARC + SPF lookup; reachability of Privacy Policy + ToS URLs | HTTP fetcher + DNS — no external API | Yes |
| O-4 Operationally Verified | Accredited Verifier assessment — human review | Accredited Verifier network | No |
| O-5 Audited | Third-party audit certificate (SOC 2 / ISO 27001) | Audit body | No |
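To make the O-1 mechanism concrete, a minimal sketch of the DNS TXT proof check, assuming dnsjava (the library named in ADR-009); the _apix-challenge record name and token format are illustrative assumptions, not the project's actual convention:

```java
// Illustrative O-1 check: does the domain publish the expected TXT proof token?
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.TXTRecord;
import org.xbill.DNS.Type;

public class DomainProofCheck {

    // Returns true if a TXT record at _apix-challenge.<domain> contains the token
    // handed to the registrant, e.g. "apix-verify=<challenge>".
    static boolean hasProofToken(String domain, String expectedToken) throws Exception {
        Record[] records = new Lookup("_apix-challenge." + domain, Type.TXT).run();
        if (records == null) {
            return false;
        }
        for (Record record : records) {
            for (String value : ((TXTRecord) record).getStrings()) {
                if (value.contains(expectedToken)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```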

Rationale: O-1 and O-3 require only DNS + HTTP — zero external API dependencies, implementable in hours. O-2 via GLEIF is one REST call against a well-documented public API. Sanctions screening uses locally cached public datasets — no live API dependency at verification time, only at download time. The combined effort is ~1–2 weeks of focused work, and the result is a PoC that demonstrates the trust model end-to-end rather than describing it.
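And the O-2 lookup is indeed a single REST call — a minimal sketch against the public GLEIF LEI Records API, with error handling and JSON parsing elided:

```java
// Illustrative O-2 lookup: fetch the LEI record for a registrant-supplied LEI.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GleifLookup {

    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Returns the raw JSON:API document for the given LEI, or null if it is unknown.
    static String fetchLeiRecord(String lei) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://api.gleif.org/api/v1/lei-records/" + lei))
                .header("Accept", "application/vnd.api+json")
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 ? response.body() : null;
    }
}
```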

Consequences:

  • Configuration gains: GLEIF_API_URL, OPENCORPORATES_API_KEY, SANCTIONS_CACHE_PATH, SANCTIONS_REFRESH_INTERVAL_DAYS (Quarkus application.properties / environment, per ADR-001)
  • New verification module (apix-verification, see ADR-009) with 6 components (C-31 to C-36)
  • New Liquibase changeset (see ADR-008): verification_status, olevel, olevel_checked_at, sanctions_cleared columns
  • Verification tests (C-37 to C-41) use mocked external APIs — no live network calls in test suite
  • Admin portal still shows pending verifications; admin can override any O-level manually (important for edge cases and for O-4/O-5 placeholders)

Rejected alternative: Fully manual O-level assignment. Rejected because the PoC then cannot demonstrate automated trust elevation — the most important differentiator from a static directory.


ADR-006: Two-VPS Hetzner Deployment (APIX Application + Gitea)

Status: Revised (was single VPS — updated for Gitea separation per ADR-010) Context: Gitea requires dedicated hosting separate from the APIX application to avoid coupled failure domains. Code hosting and application hosting failing together during a deployment is an unacceptable blast radius.

Decision: Two Hetzner CX22 VPS instances, both in FSN1 (Falkenstein, Germany):

| VPS | Purpose | Services |
|---|---|---|
| apix-app | APIX application (Docker Swarm) | registry, spider, portal, PostgreSQL, Caddy |
| apix-gitea | Code + CI/CD (Docker Compose) | Gitea, Caddy, act_runner (JVM), act_runner (GraalVM native) |

Rationale: Decoupled failure domains. A deployment to apix-app cannot affect Gitea. A Gitea restart cannot affect the running registry. The GraalVM native build runner runs on apix-gitea — it is CPU-intensive but isolated from the running application services.

Cost: 2 × CX22 = ~€8.70/month. Acceptable for PoC.

Backup: The PostgreSQL dump (pg_dump on apix-app) and the Gitea data volume (apix-gitea) are backed up daily to their respective Hetzner volumes.

Exit path: Post-funding: Hetzner Managed Database for HA PostgreSQL; multi-region Gitea replication; dedicated build runner on a larger instance.


ADR-007: Register Verification APIs as Reference APIX Entries

Status: Decided Context: APIX uses external public APIs (GLEIF, OpenCorporates, EU Sanctions list, Companies House) as part of its automated O-level verification pipeline. These APIs are themselves exactly the kind of atomic, independently callable services that the APIX model is designed to make discoverable. Registering them in APIX serves two purposes: (1) it demonstrates the model by making the verification infrastructure itself discoverable, and (2) it creates a natural outreach opportunity — once registered as reference entries, BSF can invite the operators to self-upgrade their registrations.

Decision: BSF registers GLEIF, OpenCorporates, EU Sanctions (eu-sanctions.io or equivalent public endpoint), and Companies House UK as reference APIX entries at O-0/O-1 during the MVP build (Week 5). These are registered by BSF as the registrant, clearly labelled as "reference registration — operator not yet self-registered."

| Service | Capability tag | Initial O-level | Target O-level (post-outreach) |
|---|---|---|---|
| GLEIF API | legal-entity.lookup | O-1 (domain verified) | O-2+ if GLEIF self-registers |
| OpenCorporates API | company.lookup | O-1 | O-2+ if OC self-registers |
| EU Sanctions endpoint | sanctions.screen | O-1 | O-2+ |
| Companies House UK | org.verify.uk | O-1 | O-2+ |

Rationale: These registrations cost BSF one afternoon of work and produce four real, meaningful entries in the registry. They also demonstrate recursion: APIX verifies organisations using services that are themselves registered in APIX. This is a strong narrative for the STF application and for founding member pitches. The outreach to GLEIF and Companies House to self-upgrade their registrations is also a legitimate business development activity.

Constraints: BSF's Terms of Service must explicitly permit third-party reference registrations at O-0. Admin override allows BSF to mark these entries as "reference — not operator-maintained" to avoid misleading consuming agents about SLA.

Rejected alternative: Only register self-operated services. Rejected because it leaves the registry with fewer entries and misses the recursive demonstration value.


ADR-009: Maven Multi-Module Project with Separated Scheduler

Status: Decided Context: The registry API and the Spider scheduler are distinct concerns with different scaling and deployment characteristics. The API must be responsive at all times; the scheduler runs on a fixed interval and is latency-insensitive. Bundling them into a single deployable couples their release cycles and prevents independent scaling. Similarly, shared types (enums, DTOs) and the verification pipeline should not be duplicated across services.

Decision: Maven multi-module project with five modules:

| Module | Type | Depends on | Responsibility |
|---|---|---|---|
| apix-common | Plain Java 21 library | — | Shared enums (OLevel, LivenessStatus), DTOs (BsmPayload, ServiceSummaryDto), VerificationResult record |
| apix-verification | Plain Java 21 library | apix-common | O-level elevation pipeline; uses java.net.http.HttpClient and dnsjava; no Quarkus dependency — fully testable without Quarkus context |
| apix-registry | Quarkus 3.x app | apix-common, apix-verification | REST API (HATEOAS), BSM registration, capability search, Liquibase migrations (schema owner) |
| apix-spider | Quarkus 3.x app | apix-common | @Scheduled liveness checks, OpenAPI/MCP spec verification; connects to shared DB; does not run migrations |
| apix-portal | Quarkus 3.x app | apix-common | HTMX + Qute web portal; calls apix-registry via REST Client; admin O-level assignment |

Rationale:

  • Scheduler independence: Spider can be restarted, redeployed, or scaled independently of the API. A Spider bug cannot take down the registry. Quarkus @Scheduled inside the registry would tie their lifecycles together.
  • Plain Java library for verification: apix-verification uses java.net.http.HttpClient (Java 11+) and dnsjava — no Quarkus runtime needed. This means all verification logic is unit-testable with plain JUnit, with no @QuarkusTest overhead. The registry wraps the pipeline in a CDI bean (VerificationOrchestrator) that injects Quarkus config and calls the library (see the sketch after this list).
  • Schema ownership: Registry runs Liquibase at startup. Spider connects to the same PostgreSQL instance but has Liquibase disabled (quarkus.liquibase.migrate-at-start=false). Spider has its own ServiceRecord entity mapped to the same table — it only reads endpoint_url and writes liveness_status, last_checked_at, uptime_30d_percent, avg_response_ms, consecutive_failures.
  • Parent POM as BOM: Quarkus BOM imported in parent manages all transitive version alignment. Each Quarkus module inherits plugin config via <parent>. Plain Java modules only inherit maven-compiler-plugin config (Java 21 release).
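A minimal sketch of that split — a framework-free pipeline class in apix-verification wrapped by a CDI bean in apix-registry. Only VerificationOrchestrator is named in this ADR; the OLevelPipeline class and the config key are illustrative assumptions:

```java
// Illustrative sketch; in the real project the two classes live in different modules.
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.config.inject.ConfigProperty;

// apix-verification: plain Java 21, no Quarkus dependency, unit-testable with plain JUnit.
class OLevelPipeline {
    private final String gleifApiUrl;

    OLevelPipeline(String gleifApiUrl) {
        this.gleifApiUrl = gleifApiUrl;
    }

    // Would run the O-1..O-3 checks via java.net.http.HttpClient and dnsjava.
    String verify(String domain) {
        return "O-1"; // placeholder result
    }
}

// apix-registry: thin CDI wrapper that feeds Quarkus config into the library.
@ApplicationScoped
public class VerificationOrchestrator {

    private final OLevelPipeline pipeline;

    @Inject
    public VerificationOrchestrator(
            @ConfigProperty(name = "apix.verification.gleif-url") String gleifApiUrl) {
        this.pipeline = new OLevelPipeline(gleifApiUrl);
    }

    public String verify(String domain) {
        return pipeline.verify(domain);
    }
}
```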

Consequence for Docker Compose: Three independently deployable containers (registry, spider, portal) + PostgreSQL + Caddy. Each has its own Dockerfile with multi-stage GraalVM native build. On CX22 (4GB): 3 × ~80MB native = ~240MB + PostgreSQL ~256MB + Caddy ~20MB ≈ 516MB total — comfortable headroom.

Rejected alternative: Single Quarkus application with Spider as @Scheduled bean inside the registry. Rejected because it couples API and scheduler lifecycles, prevents independent scaling, and violates the single-responsibility principle that microservices are meant to enforce.


ADR-008: Liquibase over Flyway for Database Migrations

Status: Decided Context: Database schema migrations are required for both the initial schema and the incremental additions (verification status columns, liveness metrics). Both Liquibase and Flyway have first-class Quarkus extensions. The question is which fits the microservice context and the team's existing knowledge better.

Decision: Liquibase with XML changesets.

Rationale: Carsten already knows Liquibase — this is the primary decision factor for a solo MVP. The operational risk of learning a new migration tool while simultaneously building on a new framework (Quarkus) is not justified by Flyway's marginal simplicity advantage. Liquibase's rollback support, changeset contexts (dev vs prod), and precondition checks provide more control for trust infrastructure that must handle schema changes carefully. Quarkus's quarkus-liquibase extension runs changelogs at startup automatically — an identical developer experience to Flyway.

Consequence: Changesets live in src/main/resources/db/changelog/. The master changelog is db.changelog-master.xml; individual changesets live in db/changelog/changes/.

Rejected alternative: Flyway. More common in the microservice community; simpler mental model; SQL-first. Rejected because the switching cost (learning a new tool under time pressure) outweighs the simplicity benefit for a solo developer who already knows Liquibase.


ADR-010: Self-Hosted Gitea as Primary; GitHub as Automated Push Mirror

Status: Decided Context: The project requires code hosting and a Docker container registry for CI/CD artifacts. Options: GitHub (public, US-hosted), Gitea self-hosted (European sovereignty), GitLab self-hosted (heavier). The BSF's sovereignty narrative demands European-hosted, non-commercially-controlled infrastructure. GitHub remains relevant for community visibility (STF application, IETF credibility, developer adoption).

Decision: Gitea self-hosted on a dedicated Hetzner CX22 VPS as the authoritative remote. GitHub is a read-only push mirror, updated automatically by Gitea on every push to main. Gitea Container Registry hosts all Docker images.

Infrastructure:

  • Hetzner CX22 (FSN1, Germany) dedicated to Gitea — separate from the APIX application VPS
  • Gitea runs in Docker Compose on the Gitea VPS (Gitea + Caddy for TLS)
  • Gitea Container Registry enabled (OCI-compatible; images pushed as gitea.botstandards.org/<org>/<module>:<tag>)
  • SQLite for Gitea's own database — solo team, no concurrent write pressure; eliminates second PostgreSQL instance on the Gitea VPS
  • Gitea Actions enabled; act_runner installed on the Gitea VPS for JVM-mode builds (fast, low CPU)
  • Native image builds run on a separate Gitea Actions runner on the APIX VPS (scheduled, not on every push — CPU-intensive)
  • GitHub push mirror: configured in Gitea repository settings; pushes to github.com/bot-standards-foundation/<repo> on every main branch push; GitHub repo is read-only for external contributors (PRs accepted via GitHub, mirrored to Gitea)

Rationale:

  • All code, all build artifacts, all CI pipelines run on European infrastructure under BSF control
  • Gitea Actions is GitHub Actions-compatible YAML — no migration cost if ever moving to GitHub Actions
  • Container images pulled from Gitea registry at deploy time — no DockerHub dependency
  • GitHub mirror preserves community discoverability without surrendering control
  • SQLite for Gitea removes a second PostgreSQL instance; Gitea's write load (a solo developer) is trivially within SQLite's capacity

Rejected alternatives:

  • GitHub as primary: contradicts sovereignty narrative; US-controlled; acceptable for mirror only
  • GitLab self-hosted: heavier resource requirements; Gitea is sufficient for one developer

ADR-011: Docker Swarm Single-Node for Zero-Downtime Production Deployment

Status: Decided Context: The APIX registry is trust infrastructure — downtime during deployments damages credibility with registered services, consuming agents, and the STF reviewer. Docker Compose's standard up -d stops the old container before the new one is healthy, causing a brief outage. Kubernetes is operationally out of scope for a solo developer. A zero-downtime deployment mechanism is required.

Decision: Docker Swarm single-node mode for production on the APIX VPS. Local development continues to use Docker Compose (simpler, no Swarm overhead). Production uses a docker-stack.yml with deploy.update_config.order: start-first and health-check gating.

How zero-downtime works:

  1. CI pushes new image to Gitea registry
  2. Deploy step runs docker service update --image <new-image> <service> via SSH
  3. Swarm starts the new container and waits for its health check to pass (a readiness-check sketch follows this list)
  4. Once healthy, Swarm begins routing traffic to the new container
  5. Old container is stopped and removed
  6. If health check never passes: automatic rollback to previous image (rollback_config)
  7. Caddy routes to the Swarm service VIP — it never needs reconfiguring during rolling updates
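The health check that gates steps 3–6 is the standard SmallRye Health endpoint from ADR-001. A minimal sketch of a readiness probe, assuming MicroProfile Health; the database probe shown is illustrative (Quarkus also ships a built-in datasource check):

```java
// Illustrative readiness check exposed under /q/health/ready; the Swarm health
// check (and the deploy pipeline) only treat the new container as healthy once it is UP.
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import javax.sql.DataSource;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Readiness;

@Readiness
@ApplicationScoped
public class DatabaseReadinessCheck implements HealthCheck {

    @Inject
    DataSource dataSource;

    @Override
    public HealthCheckResponse call() {
        try (var connection = dataSource.getConnection()) {
            return HealthCheckResponse.up("database");
        } catch (Exception e) {
            return HealthCheckResponse.down("database");
        }
    }
}
```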

Swarm stack config per service:

deploy:
  replicas: 1
  update_config:
    order: start-first
    failure_action: rollback
    delay: 10s
  rollback_config:
    order: stop-first
  restart_policy:
    condition: on-failure
    delay: 5s
    max_attempts: 3

Local vs production parity:

| Concern | Local (docker-compose.yml) | Production (docker-stack.yml) |
|---|---|---|
| Orchestrator | Docker Compose | Docker Swarm single-node |
| Images | Built from source (quarkus dev) | Pre-built native images from Gitea registry |
| TLS | None | Caddy auto-cert |
| Rolling updates | Not supported | start-first with health check gate |
| Secrets | .env file | Docker Swarm secrets |

Rejected alternatives:

  • Docker Compose only: brief downtime on every deploy; not acceptable for trust infrastructure
  • Kubernetes (k3s): zero-downtime capable but operationally too heavy for a solo developer
  • Traefik instead of Caddy: Traefik has better native Swarm label integration but adds complexity; Caddy routing to Swarm service VIP achieves the same result without replacing the reverse proxy

ADR-012: Three-Stage CI/CD Pipeline

Status: Decided Context: GraalVM native image builds take 10–15 minutes. Running them on every push would make the CI feedback loop unusable for active development. Conversely, deploying JVM-mode images to production is not acceptable — native images are required for the memory profile and startup time targets. Production deployments must be independently tested before going live.

Decision: Three distinct CI stages with independent triggers:

| Stage | Trigger | Runner | Duration | Output |
|---|---|---|---|---|
| 1 — Fast cycle | Every push to any branch | Gitea VPS act_runner (JVM) | ~3–5 min | JVM build pass/fail; unit + @QuarkusTest results |
| 2 — Native build | Merge to main | APIX VPS act_runner (GraalVM) | ~10–15 min | Native images pushed to Gitea Container Registry |
| 3 — Deploy | Git tag (v*) | Gitea VPS act_runner | ~2 min | Zero-downtime Swarm rolling update; health check verified; rollback on failure |

Stage 1 — Fast cycle (.gitea/workflows/ci-fast.yml):

  • mvn verify in JVM mode on all modules
  • @QuarkusTest with Testcontainers (PostgreSQL) for registry + spider
  • WireMock-based tests for verification pipeline
  • No Docker build; no native compilation

Stage 2 — Native build (.gitea/workflows/ci-native.yml):

  • mvn package -Pnative -Dquarkus.native.container-build=true for each Quarkus module
  • Docker multi-stage build produces native image
  • Integration test of native container (@QuarkusIntegrationTest)
  • Push tagged image to Gitea registry: gitea.botstandards.org/bsf/<module>:main-<sha>

Stage 3 — Deploy (.gitea/workflows/deploy.yml):

  • SSH to APIX VPS
  • docker service update --image <new-image> apix_registry (and spider, portal)
  • Wait for Swarm health check confirmation
  • Verify /q/health endpoint returns UP
  • On failure: Swarm auto-rollback; pipeline fails with alert

Rationale: Stage separation gives a fast feedback loop (developer doesn't wait 15 min for native build feedback) while ensuring production always runs tested native images. The deploy stage is a separate, explicit action — no code is deployed to production without a human creating a git tag.

Consequence: Requires two Gitea Actions runners:

  • Gitea VPS: JVM runner (Stage 1 + Stage 3) — low CPU requirement
  • APIX VPS: GraalVM native runner (Stage 2) — CPU-intensive; runs on a schedule or on-demand, not concurrently with the running application

ADR-013: Server-Side i18n via Quarkus @MessageBundle; EN + DE for MVP

Status: Decided Context: The portal must be usable by German-speaking founding member candidates (manufacturing sector, logistics operators) without requiring them to work in English. The STF application emphasises European focus — a German-language portal is consistent with that narrative. The stack is Quarkus + Qute; multiple i18n approaches exist and the choice must remain consistent with the compile-time safety rationale of ADR-001.

Decision: Server-side i18n using Quarkus @MessageBundle for static UI strings with Qute type-safe injection. Locale resolved from Accept-Language header with apix-locale cookie override. English (EN) and German (DE) for MVP. Language switcher in base layout.

How it works:

  • Messages.java — @MessageBundle-annotated interface; one method per translatable string key (a sketch follows this list)
  • messages.properties — English default strings (all keys defined here)
  • messages_de.properties — German strings (same keys, DE values)
  • Quarkus resolves the correct properties file at build time and injects it into Qute templates
  • Templates use {msg:someKey} syntax — the Qute compiler verifies the key exists on Messages.java at build time
  • LocaleResolver.java CDI bean: reads Accept-Language header; falls back to apix-locale cookie if present; returns java.util.Locale
  • PortalResource injects LocaleResolver; passes locale to template rendering context
  • POST /locale (LocaleResource.java) — sets apix-locale cookie; redirects to Referer; used by language switcher
  • Tour and help content (JavaScript structures) are built by HelpContentService in the resolved locale and rendered into each page as a <script> block via Qute — consuming agents and portal users both receive pre-localized strings; no client-side translation layer
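A minimal sketch of the bundle interface, assuming the standard Quarkus Qute i18n annotations; the key names are illustrative, not the portal's actual keys:

```java
// Illustrative message bundle: English defaults inline, German overrides in
// messages_de.properties (e.g. register_title=Dienst registrieren).
import io.quarkus.qute.i18n.Message;
import io.quarkus.qute.i18n.MessageBundle;

@MessageBundle // default bundle name "msg"
public interface Messages {

    @Message("Register a service")
    String register_title();

    @Message("Capability tags")
    String capability_tags();
}
```

In a template a key is referenced through the msg namespace (e.g. {msg:register_title}); per the build-time verification described above, a missing or misspelled key fails the build rather than the request.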

Rationale:

  • Server-side rendering is the natural fit for Quarkus + Qute — no JS i18n library needed, no build pipeline
  • @MessageBundle gives compile-time verification that all string keys exist — consistent with ADR-001 compile-time safety
  • Baking tour content into the page as pre-localized JSON means the help engine (help.js) receives already-resolved strings; no translation lookup at runtime
  • Cookie-based locale preference survives page navigation without requiring a user account
  • EN + DE covers the BSF operating language (EN) and the primary founding member market (DE); other locales can be added by adding a new properties file — no code change

Rejected alternatives:

  • Client-side i18n (data-i18n attributes + JS TRANSLATIONS object as in the used-books reference): works for a single-file app but loses compile-time key checking; breaks the Qute type-safe model; requires maintaining a separate JS translation layer alongside the server-side one
  • Separate JSON locale files served as static assets: decouples translations from build; loses key verification; requires a JS runtime translation layer and additional fetch call per page load

ADR-014: Client-Side Help Overlay Engine with Server-Rendered Tour Content

Status: Decided Context: Portal users — registrants submitting their first BSM, agent developers querying the registry for the first time, admins assigning O-levels — need in-context guidance at the exact moment they are performing an action. A static FAQ page requires users to context-switch. The portal must also work as a self-guided demo for the STF reviewer and founding member pitches. The used-books application in this repository contains a proven pattern: a four-wing spotlight overlay with a draggable tour card, progress dots, and a separate page-level help drawer.

Decision: Client-side JS help overlay engine (help.js) adapted from the used-books pattern. Tour and page-help content is server-rendered into each page via Qute as a locale-aware JS data structure. No external tour library dependency.

Architecture:

  • help.js — single file, no framework; ~350 lines; manages the full overlay lifecycle:
    • Four <div> dimming wings (help-dim-top/left/right/bottom) that cut out a spotlight window around the current target element
    • Highlight ring (help-highlight) positioned over the target
    • Draggable tour card (help-card) with header drag handle (cursor-grab), group icon, progress dots, title, state indicator, body text, Back / Skip / Next buttons
    • Page-level static help drawer (help-drawer) sliding in from the right; contains: "Guided Tours" section (list of tours relevant to the current page) + "Page Help" section (static explanation of the current page)
    • Context filter: the drawer shows only tours whose pages array includes the current page ID (<body data-page-id="...">)
    • tourCheckAndNext(): validates any required form state before advancing a step; configurable per step
  • Tour data injection: each Qute template embeds a <script> block with two page-scoped globals rendered at request time in the resolved locale:
    <script>
    window.PAGE_TOURS = {tours};
    window.PAGE_HELP  = {pageHelp};
    </script>
    
    {tours} and {pageHelp} are String parameters passed by PortalResource — pre-serialized JSON produced by HelpContentService in the resolved locale. Qute renders them into the page; help.js reads them on window.onload.
  • TourDefinition.java + TourStep.java — Java records defining the data model for tour content (see the sketch after this list)
  • HelpContentService.java — CDI bean; builds locale-resolved TourDefinition list per page; serializes to JSON; 5 tours defined for MVP:
| Tour ID | Pages | Steps |
|---|---|---|
| tour-agent-setup | / (home) | 3: root endpoint URL → HATEOAS links JSON → capability query example |
| tour-register | /register | 5: open form → BSM name + description → capability tags → O-level meaning → submit |
| tour-search | /search | 3: enter capability → read results → interpret liveness badge |
| tour-trust | /service/{id} | 4: O-level indicator → S-level indicator → liveness badge → last_checked_at |
| tour-admin | /admin | 4: pending verifications list → assign O-level → reference registration flag → API key reminder |
  • templates/layout.html — Qute base layout (all pages extend this); includes: help button (?) in nav bar; overlay HTML (4 wing divs + highlight ring + tour card + progress dots + state indicator); help drawer shell; <script src="/help.js">; language switcher form
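A minimal sketch of that data model and its serialization, assuming Jackson for the JSON step; the record fields and tour content are illustrative, not the project's actual model:

```java
// Illustrative tour data model plus the serialization HelpContentService performs
// before handing the JSON string to the Qute template as {tours}.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class TourModelSketch {

    // One step of a guided tour: which element to spotlight and what to say.
    record TourStep(String targetSelector, String title, String body) {}

    // A tour: its id, the page ids it applies to, and its ordered steps.
    record TourDefinition(String id, List<String> pages, List<TourStep> steps) {}

    public static void main(String[] args) throws Exception {
        TourDefinition tour = new TourDefinition(
                "tour-search",
                List.of("search"),
                List.of(new TourStep("#capability-input", "Enter a capability",
                        "Type a capability tag, e.g. legal-entity.lookup")));

        String json = new ObjectMapper().writeValueAsString(List.of(tour));
        System.out.println(json); // what help.js reads from window.PAGE_TOURS
    }
}
```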

Rationale:

  • Client-side overlay requires no round-trips per step — smooth UX for a step-through walkthrough
  • Server-rendering tour content in the resolved locale via Qute keeps i18n consistent with ADR-013 — one locale resolution point, no client-side translation map
  • Spotlight overlay moves out of the way when dragged — user can see the target element while reading the explanation, unlike a modal
  • The used-books pattern is already in production in an adjacent project; adaptation cost is low; no learning curve
  • No external CDN dependency means the help system works offline and does not introduce a third-party privacy concern

Rejected alternatives:

  • Shepherd.js / Driver.js: well-maintained but external JS dependency; overkill for five tours in a portal with known pages; adds CDN or bundler dependency
  • Pure modal help without overlay: user cannot see the element being explained while reading the explanation; defeats the purpose of contextual guidance
  • Help text embedded directly in Qute templates: clutters the template; cannot be stepped through; not filterable by page context; not locale-switchable without full re-render