Implement apix-registry with IoT sunset/decommission lifecycle and full BDD suite

- REST API: register, patch, O-level, replacements, history, search endpoints
- IoT lifecycle validations: future sunset, lock-before-release, sunset-passed-before-decommission
- DB schema: Liquibase changesets 001–008 (services, versions, replacements, sunset-at column)
- @ColumnTransformer(write="?::jsonb") on bsm_payload fields to avoid JDBC varchar→jsonb rejection
- Jandex plugin on apix-common + quarkus.index-dependency so @NotBlank validators resolve at runtime
- quarkus-logging-json extension added; quarkus.log.console.json=false is now a recognised key
- Fix requireSunsetBeforeLockRelease: Boolean.TRUE.equals instead of !Boolean.FALSE.equals (null guard; see sketch below)
- BDD suite: 27 scenarios / 213 steps across 5 feature files (sunset-lock, decommission, replacement, discovery, anonymity)
- Test infrastructure: JDBC TRUNCATE in @Before for DB isolation, Arc.container() for clock control — no test endpoints in production code
- sunsetAt truncated to microseconds in BDD steps to match Postgres timestamptz precision
- Cucumber step fixes: singular/plural candidate(s), lastResponse propagation in replacementsReturnsNCandidates
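
A minimal before/after sketch of the null guard (variable name illustrative):

```java
Boolean flag = null;                          // policy flag unset on the entity
boolean before = !Boolean.FALSE.equals(flag); // true: null slipped through the old check
boolean after  = Boolean.TRUE.equals(flag);   // false: null now means "not required"
```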

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
commit b2a16a8be7 (Carsten Rehfeld, 2026-05-08 09:13:26 +02:00): 71 changed files with 5480 additions and 0 deletions
---
arc42: "1 — Introduction and Goals"
status: stub
---
## 1.1 MVP Goal Statement
TODO: Define what must be provable at the end of the PoC phase.
Key question: What does a Sovereign Tech Fund reviewer need to see to confirm this is real running infrastructure?
## 1.2 Quality Goals
TODO: Top 3–5 quality goals, measurable.
Example dimensions: Queryability, Correctness of liveness status, Registration reliability, Availability.
## 1.3 Stakeholders
| Role | Expectation |
|---|---|
| STF reviewer | Running public URL, queryable, real services registered |
| Agent developer | Capability search returns structured, machine-readable results |
| Service registrant | Registration via portal or API; status visible within minutes |
| BSF (Carsten) | Deployable solo; maintainable; demonstrable to founding members |
## 1.4 Out of Scope (MVP)
- Billing and commercial tiers
- Automated O-level / S-level verification
- Multi-region redundancy
- Full CE/regulatory BSM validation
- Agent Enterprise composition layer
- IoT device template persistence (DC-1)
---
arc42: "2 — Architecture Constraints"
status: stub
---
## 2.1 Technical Constraints
| Constraint | Rationale |
|---|---|
| Hosted on Hetzner (EU) | European sovereignty narrative; cost; GDPR residency |
| Docker Compose (dev) / single-node Docker Swarm (prod) | Solo maintainability; zero-downtime deploys without Kubernetes overhead (see ADR-011) |
| Java 21 + Quarkus 3.x | Compile-time safety; container-native microservices; GraalVM native image fits the CX22 budget (see ADR-001) |
| PostgreSQL 16 | Relational integrity + JSONB flexibility for BSM payload |
| Caddy reverse proxy | Auto-TLS (Let's Encrypt); zero-config HTTPS |
| Open source (Apache 2.0) | STF requirement; community credibility |
| HTTPS mandatory | Trust infrastructure must be served over TLS — non-negotiable even for PoC |
## 2.2 Organisational Constraints
| Constraint | Rationale |
|---|---|
| Solo developer | All components must be maintainable by one person |
| LLM-assisted development | Accepted; all generated code must be reviewed before commit |
| Public GitHub repository | STF requires open-source deliverables; also community signal |
| No external team dependencies | No waiting on others; all unblocked decisions are made by Carsten |
## 2.3 Regulatory Constraints
| Constraint | Rationale |
|---|---|
| GDPR-lite | Only data stored: registrant email (for contact), service URL, BSM payload. No analytics, no tracking. |
| No PII in logs | Even at DEBUG level — email addresses must not appear in log output |
| No secrets in images or Git | API keys and DB credentials via runtime env only |
## 2.4 Convention Constraints
| Constraint | Rationale |
|---|---|
| HATEOAS API style | Core APIX Internet-Draft requirement; agents must be able to navigate from root URL |
| IETF Internet-Draft alignment | BSM field names must match draft-rehfeld-bot-service-index-00 |
| PlantUML for all diagrams | Project convention (not Mermaid) |
| arc42 documentation structure | This document set |
---
arc42: "3 — Context and Scope"
status: stub
---
## 3.1 Business Context
TODO: PlantUML system context diagram.
External actors:
- **Autonomous Agent** — queries the index by capability; reads BSM; consumes registered services
- **Service Registrant** — submits BSM via portal or API; receives registration confirmation
- **Spider** — automated crawler (internal); checks liveness of registered services against external endpoints
- **Admin (BSF)** — assigns O-levels; approves pending registrations; monitors registry health
- **External Service Endpoints** — the actual services being registered; queried by Spider for liveness
```plantuml
@startuml context
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Context.puml
Person(agent, "Autonomous Agent", "Queries registry by capability; consumes registered services")
Person(registrant, "Service Registrant", "Submits BSM; monitors registration status")
Person(admin, "BSF Admin", "Assigns O-levels; approves registrations")
System(apix, "APIX Registry", "Global, queryable index of machine-consumable services")
System_Ext(ext_service, "External Service Endpoint", "The registered service; queried by Spider for liveness")
Rel(agent, apix, "Capability query / BSM fetch", "HTTPS/JSON")
Rel(registrant, apix, "BSM registration / status check", "HTTPS/JSON or Portal")
Rel(admin, apix, "O-level assignment / moderation", "Portal (API-key)")
Rel(apix, ext_service, "Liveness check / spec fetch", "HTTPS", "Spider")
@enduml
```
## 3.2 Technical Context
TODO: PlantUML technical context diagram showing network boundaries.
Components inside the system boundary:
- Caddy (reverse proxy, TLS termination)
- API service (Quarkus, `apix-registry`)
- Portal service (Quarkus, HTMX + Qute, `apix-portal`)
- Spider service (Quarkus scheduler, `apix-spider`)
- PostgreSQL (registry database)
## 3.3 External Interface Table
| Interface | Direction | Protocol | Data |
|---|---|---|---|
| Capability query | Agent → API | HTTPS GET | Query params: `capability`, `country`, `olevel`; Response: BSM list |
| BSM registration | Registrant → API | HTTPS POST | BSM JSON payload + API key header |
| Service detail | Agent → API | HTTPS GET | BSM + liveness status |
| HATEOAS root | Agent → API | HTTPS GET | Navigation links JSON |
| Liveness check | Spider → Ext. service | HTTPS GET | HTTP status + response time |
| OpenAPI fetch | Spider → Ext. service | HTTPS GET | OpenAPI JSON spec |
| Admin portal | Admin → Portal | HTTPS | Browser; HTML form |
---
arc42: "4 — Solution Strategy"
status: stub
---
## 4.1 Technology Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Language + framework | Java 21 + Quarkus 3.x | Compile-time safety; purpose-built for microservices; GraalVM native image first-class (see ADR-001) |
| Production binary | GraalVM Native Image | ~50–80MB RAM per service; ~100ms startup; fits Hetzner CX22 with headroom |
| Dev loop | `quarkus dev` (JVM mode) | Live reload + continuous testing; native build only for production image |
| Persistence | Hibernate ORM + Panache | Standard Quarkus persistence; Panache active record reduces boilerplate |
| BSM payload | PostgreSQL JSONB + `@JdbcTypeCode(SqlTypes.JSON)` | Flexible schema for optional BSM fields without a separate document store (see the sketch after this table) |
| Migrations | Liquibase | User's existing tool; first-class Quarkus extension; rollback + context support (see ADR-008) |
| Reverse proxy | Caddy | Auto-TLS with Let's Encrypt; minimal config (see ADR-003) |
| Portal rendering | HTMX + Qute | No JS build pipeline; type-safe templates (build-time error on missing variables); idiomatic Quarkus (see ADR-004) |
| Spider concurrency | Java 21 virtual threads (`@RunOnVirtualThread`) | Non-blocking HTTP checks without reactive programming complexity |
| HTTP client (Spider) | Quarkus REST Client Reactive | Declarative; integrates with Quarkus DI and fault tolerance extensions |
| Build tool | Maven 3.9 | Quarkus documentation is Maven-first; Quarkus Maven plugin handles native build |
| Testing | JUnit 5 + `@QuarkusTest` + RestAssured + WireMock | `@QuarkusTest` starts real application context; RestAssured for HTTP assertions; WireMock for external API mocks |
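
The JSONB row above, as a sketch on the registry entity; entity and column names follow section 5.5 and the schema conventions used in this document, so treat the exact names as assumptions. The `@ColumnTransformer` write expression casts the bound parameter so PostgreSQL accepts the JSON value over JDBC:

```java
import org.hibernate.annotations.ColumnTransformer;
import org.hibernate.annotations.JdbcTypeCode;
import org.hibernate.type.SqlTypes;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Table;

// Sketch: flexible BSM payload stored as PostgreSQL jsonb on the full entity.
@Entity
@Table(name = "services")
public class ServiceRecord extends PanacheEntity {

    @Column(name = "endpoint_url", nullable = false, unique = true)
    public String endpointUrl;

    // Hibernate 6 serializes the field as JSON; the write transformer casts
    // the bound parameter to jsonb so a plain varchar bind is not rejected.
    @JdbcTypeCode(SqlTypes.JSON)
    @ColumnTransformer(write = "?::jsonb")
    @Column(name = "bsm_payload", columnDefinition = "jsonb")
    public BsmPayload bsmPayload; // shared DTO from apix-common
}
```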
## 4.2 Architectural Patterns
| Pattern | Application |
|---|---|
| HATEOAS | `IndexResource` returns all navigation links; agents navigate from root without prior knowledge |
| Repository pattern | DB access in `ServiceRepository` (Panache); business logic in `RegistryService`; resources are thin |
| Compile-time DI | Quarkus CDI resolves all injection at build time; no runtime reflection surprises |
| Scheduler-based Spider | `@Scheduled(every="15m")` on `SpiderScheduler`; stateless per run; virtual threads for concurrent checks |
| Verification pipeline | Sequential O-level elevation (O-1 → sanctions → O-2 → O-3); each step is an independent CDI bean |
| API key on writes | Single shared key for MVP via custom Quarkus Security identity provider; per-registrant keys post-MVP |
| Fail-fast validation | BSM validated at boundary via Bean Validation (`@Valid` on JAX-RS resource); invalid BSM rejected with `422` + field-level constraint violation details (see 8.2) |
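
A sketch of the fail-fast boundary (resource and DTO names from section 5; the mapper that converts violations into the 8.2 error shape is assumed):

```java
import jakarta.validation.Valid;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/api/register")
public class RegisterResource {

    // @Valid triggers Bean Validation on the deserialized payload before the
    // method body runs; violations never reach the business logic. A custom
    // ConstraintViolationException mapper renders the 422 shape from 8.2.
    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    @Produces(MediaType.APPLICATION_JSON)
    public Response register(@Valid BsmPayload bsm) {
        // delegate to RegistryService / VerificationOrchestrator (omitted)
        return Response.status(Response.Status.CREATED).build();
    }
}
```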
## 4.3 Quality Goal → Decision Mapping
| Quality Goal | Architecture Decision |
|---|---|
| Compile-time safety | Quarkus CDI + Bean Validation + Qute type-safe templates — errors at build time, not runtime |
| Queryability | HATEOAS root + capability search; JPQL + JSONB operator query in ServiceRepository |
| Liveness accuracy | SpiderScheduler every 15 min; `last_checked_at` + `uptime_30d_percent` exposed in response |
| Registration reliability | Idempotent `UPSERT` on endpoint URL; Liquibase migrations with rollback support |
| Security hygiene | HTTPS via Caddy; API key on write endpoints; no PII in logs; non-root container user |
| Solo maintainability | Docker Compose; `quarkus dev` for local loop; single JVM language across all services |
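
The "Queryability" row maps to a repository query along these lines (a sketch, assuming the column names used elsewhere in this document; the JSONB containment operator `@>` is native SQL, not JPQL):

```java
import java.util.List;

import io.quarkus.hibernate.orm.panache.PanacheRepository;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class ServiceRepository implements PanacheRepository<ServiceRecord> {

    // @> avoids the jsonb "?" operator, which would collide with JDBC
    // positional parameters; CAST keeps the named parameter portable.
    public List<ServiceRecord> searchByCapability(String capability, String country) {
        return getEntityManager()
                .createNativeQuery("""
                        SELECT * FROM services
                        WHERE liveness_status = 'live'
                          AND bsm_payload -> 'capabilities' @> to_jsonb(CAST(:cap AS text))
                          AND bsm_payload ->> 'country' = :country
                        """, ServiceRecord.class)
                .setParameter("cap", capability)
                .setParameter("country", country)
                .getResultList();
    }
}
```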
## 4.4 MVP Shortcuts (Accepted Technical Debt)
| Shortcut | Exit Path |
|---|---|
| O-4 / O-5 assigned manually | Accredited Verifier integration post-MVP |
| Single shared API key | Per-registrant key management + OAuth2 post-MVP |
| No rate limiting on read endpoints | Caddy rate_limit directive when traffic warrants |
| OpenAPI / MCP parsers validate presence only | Field-level spec comparison in Spider post-MVP |
| Single-region deployment | Hetzner multi-region + Managed Database post-funding |
| No billing | Commercial tier in Phase 2 |
| No CI/CD pipeline initially | Three-stage Gitea Actions pipeline (see ADR-012) |
---
arc42: "5 — Building Block View"
status: stub
---
## 5.1 Level 1 — Maven Module Structure
```plantuml
@startuml modules
skinparam packageStyle rectangle
package "apix-common\n(plain Java 21 library)" as common {
component [OLevel\nLivenessStatus\nBsmPayload\nServiceSummaryDto\nVerificationResult] as dtos
}
package "apix-verification\n(plain Java 21 library)" as verification {
component [O1DnsVerifier\nO2GleifVerifier\nO2OpenCorporatesVerifier\nO3HygieneVerifier\nSanctionsScreener\nVerificationPipeline] as verifiers
}
package "apix-registry\n(Quarkus 3.x app)" as registry {
component [IndexResource\nServiceResource\nRegisterResource] as res
component [RegistryService\nVerificationOrchestrator] as svc
component [ServiceRecord\nServiceRepository] as repo
component [Liquibase\nmigrations] as lb
}
package "apix-spider\n(Quarkus 3.x app)" as spider {
component [SpiderScheduler\nLivenessFetcher\nLivenessEvaluator\nOpenApiParser\nMcpParser] as spider_core
component [SpiderServiceView\nSpiderRepository] as spider_repo
}
package "apix-portal\n(Quarkus 3.x app)" as portal {
component [PortalResource\nAdminResource\nRegistryClient] as portal_res
component [Qute templates] as templates
}
verification ..> common : depends on
registry ..> common : depends on
registry ..> verification : depends on
spider ..> common : depends on
portal ..> common : depends on
@enduml
```
## 5.2 Level 1 — Deployment View (Docker Compose)
```plantuml
@startuml deploy_l1
package "Docker Compose — Hetzner CX22" {
component [Caddy\n:80 / :443] as caddy
component [apix-registry\n:8180 (internal)] as registry
component [apix-spider\n:8082 (internal only)] as spider
component [apix-portal\n:8081 (internal)] as portal
database [PostgreSQL 16\n:5432 (internal)] as db
}
cloud Internet
Internet --> caddy : HTTPS
caddy --> registry : /api/*
caddy --> portal : /*
registry --> db : Hibernate ORM (Liquibase owner)
spider --> db : Hibernate ORM (Liquibase disabled)
portal --> registry : REST Client (HTTP internal)
spider --> [External Services] : liveness checks (HTTPS)
@enduml
```
## 5.3 Level 2 — apix-registry Internals
```plantuml
@startuml level2_registry
package "apix-registry" {
component [IndexResource] as r_index
component [ServiceResource] as r_svc
component [RegisterResource] as r_reg
component [RegistryService] as svc
component [VerificationOrchestrator] as orch
component [ServiceRepository\n(Panache)] as repo
component [Bean Validation\n(@Valid on JAX-RS)] as val
}
r_index --> svc
r_svc --> svc
r_reg --> val
r_reg --> svc
r_reg --> orch
svc --> repo
orch --> [VerificationPipeline\n(apix-verification)]
orch --> repo : persist VerificationResult
@enduml
```
## 5.4 Level 2 — apix-spider Internals
```plantuml
@startuml level2_spider
package "apix-spider" {
component [SpiderScheduler\n@Scheduled(every=15m)] as sched
component [LivenessFetcher\n@RestClient\n@RunOnVirtualThread] as fetcher
component [LivenessEvaluator\n(pure logic)] as eval
component [OpenApiParser] as oa
component [McpParser] as mcp
component [SpiderRepository\n(Panache)] as repo
}
sched --> repo : load services due for check
sched --> fetcher : dispatch per service
fetcher --> eval : HTTP status + response_ms
fetcher --> oa : spec URL
fetcher --> mcp : MCP URL
eval --> repo : write LivenessStatus
oa --> repo : write spec validation result
mcp --> repo : write spec validation result
@enduml
```
## 5.5 Component Responsibility Table
| Module / Component | Type | Responsibility |
|---|---|---|
| `apix-common` | Plain Java library | Shared enums and DTOs; no framework dependency; used by all modules |
| `apix-verification` | Plain Java library | O-level elevation pipeline; pure logic + external HTTP/DNS calls via `java.net.http`; no Quarkus context |
| `apix-registry` | Quarkus app | REST API (HATEOAS); BSM registration + validation; capability search; schema owner (Liquibase) |
| `apix-spider` | Quarkus app | Scheduled liveness checks; OpenAPI/MCP spec verification; writes liveness metrics to DB; independent lifecycle |
| `apix-portal` | Quarkus app | Human-readable web portal (HTMX + Qute); registration form; admin O-level view; calls registry via REST Client |
| `VerificationOrchestrator` | CDI bean (registry) | Bridge between Quarkus config injection and the plain-Java `VerificationPipeline`; persists results |
| `LivenessEvaluator` | Plain class (spider) | Pure function: HTTP status + response time → `LivenessStatus`; no I/O; testable without Quarkus |
| `ServiceRecord` | Panache entity (registry) | Full entity — all columns; schema owner |
| `SpiderServiceView` | Panache entity (spider) | Read/write subset of `services` table — only liveness columns; does not run migrations (see the sketch after this table) |
| PostgreSQL | Database | Single shared instance; registry owns schema; spider and portal are consumers |
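
A sketch of the spider's partial mapping (column names per ADR-009; treat them as assumptions):

```java
import java.time.Instant;

import io.quarkus.hibernate.orm.panache.PanacheEntityBase;
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// Maps only the columns the spider touches; ServiceRecord in apix-registry
// remains the full entity and the schema owner (Liquibase runs there only).
@Entity
@Table(name = "services")
public class SpiderServiceView extends PanacheEntityBase {

    @Id
    public Long id;

    @Column(name = "endpoint_url")
    public String endpointUrl; // read

    @Column(name = "liveness_status")
    public String livenessStatus; // write

    @Column(name = "last_checked_at")
    public Instant lastCheckedAt; // write

    @Column(name = "uptime_30d_percent")
    public Double uptime30dPercent; // write
}
```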
---
arc42: "6 — Runtime View"
status: stub
---
## Scenario 1 — Agent Queries Registry by Capability
```plantuml
@startuml sc1
actor Agent
participant "Caddy" as caddy
participant "API Service" as api
database "PostgreSQL" as db
Agent -> caddy : GET /api/services?capability=inventory.read&country=DE
caddy -> api : forward
api -> db : SELECT services WHERE capability MATCH AND country=DE AND liveness=live
db --> api : [ServiceRecord, ...]
api --> caddy : 200 OK — JSON array of BSM summaries with _links
caddy --> Agent : response
@enduml
```
## Scenario 2 — Service Registrant Submits BSM via Portal
```plantuml
@startuml sc2
actor Registrant
participant "Caddy" as caddy
participant "Portal Service" as portal
participant "API Service" as api
database "PostgreSQL" as db
Registrant -> caddy : POST /register (form submit)
caddy -> portal : forward
portal -> api : POST /api/register (BSM JSON + API key)
api -> api : validate BSM (Bean Validation)
alt BSM invalid
api --> portal : 422 Unprocessable — validation errors
portal --> Registrant : form with errors highlighted
else BSM valid
api -> db : UPSERT service record
db --> api : service_id
api --> portal : 201 Created — service_id + status URL
portal --> Registrant : "Registered. Status: pending O-level."
end
@enduml
```
## Scenario 3 — Spider Liveness Check
```plantuml
@startuml sc3
participant "Spider Scheduler" as sched
participant "Fetcher" as fetcher
participant "Evaluator" as eval
participant "DB Writer" as writer
database "PostgreSQL" as db
participant "External Service" as ext
sched -> db : SELECT services WHERE next_check <= NOW()
db --> sched : [ServiceRecord, ...]
loop for each service
sched -> fetcher : check(service_url, spec_url)
fetcher -> ext : GET service_url (timeout: 5s)
ext --> fetcher : HTTP response
fetcher -> eval : (status_code, response_time_ms)
eval --> writer : liveness=live|degraded|unreachable, checked_at=NOW()
writer -> db : UPDATE liveness_status WHERE service_id=X
end
@enduml
```
## Scenario 4 — Agent Navigates via HATEOAS
```plantuml
@startuml sc4
actor Agent
participant "API Service" as api
Agent -> api : GET /api/
api --> Agent : 200 OK — { "_links": { "search": "/api/services{?capability,country}", "register": "/api/register", "health": "/api/health" } }
Agent -> api : GET /api/services?capability=slot.book
api --> Agent : 200 OK — [{ "id": "...", "name": "...", "_links": { "self": "/api/services/{id}" } }, ...]
Agent -> api : GET /api/services/{id}
api --> Agent : 200 OK — full BSM record + liveness status
@enduml
```
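
Scenario 4's root document maps to a trivially small resource; a sketch with the link relations shown above:

```java
import java.util.Map;

import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

@Path("/api")
public class IndexResource {

    // The HATEOAS entry point: an agent needs only this URL; every other
    // endpoint is reachable through the returned _links.
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Map<String, Object> root() {
        return Map.of("_links", Map.of(
                "search", "/api/services{?capability,country}",
                "register", "/api/register",
                "health", "/api/health"));
    }
}
```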
---
arc42: "7 — Deployment View"
status: stub
---
## 7.1 Hetzner Deployment Diagram
```plantuml
@startuml deploy
node "Hetzner CX22\n(2 vCPU, 4GB RAM, Ubuntu 24.04)" as hetzner {
component [Caddy\n:80, :443] as caddy
component [apix-registry\n:8180 (internal)] as api
component [apix-portal\n:8081 (internal)] as portal
component [apix-spider\n:8082 (internal only)] as spider
database [PostgreSQL 16\n:5432 (internal)] as db
folder "Hetzner Volume\n(20GB)" as vol
}
cloud "Internet" {
actor Agent
actor Registrant
}
cloud "Let's Encrypt" as le
Agent --> caddy : HTTPS :443
Registrant --> caddy : HTTPS :443
caddy --> api
caddy --> portal
caddy <--> le : ACME cert renewal
api --> db
spider --> db
db --> vol : data persistence
@enduml
```
## 7.2 Environment Table
| Setting | Dev | Prod (Hetzner) |
|---|---|---|
| TLS | None (HTTP only) | Auto via Caddy + Let's Encrypt |
| DB | postgres:16 local container | postgres:16 container, data on Hetzner volume |
| Spider interval | 2 min (fast feedback) | 15 min |
| API key | `dev-key-insecure` | Strong random key, env var only |
| Log level | DEBUG | INFO |
| Port exposure | All ports exposed to host | Only :80, :443 via Caddy; all others internal |
## 7.3 Backup and Restore
- `backup.sh` runs via cron daily at 03:00 UTC
- Executes `pg_dump` into `/backup/apix_$(date +%Y%m%d).sql.gz`
- Backup directory mounted on Hetzner volume (separate from DB data volume)
- Retain last 7 dumps; older files deleted by script
- Restore: `gunzip -c apix_YYYYMMDD.sql.gz | psql` — documented in `infra/hetzner/RESTORE.md`
## 7.4 Domain and DNS
TODO: Confirm domain name (OQ-MVP-01).
Planned DNS setup:
- `registry.apix.dev` (or `index.botstandards.org`) → Hetzner VPS IP (A record)
- TTL: 300s initially for fast propagation during setup
Caddy will automatically obtain and renew the TLS certificate once the A record resolves to the server IP.
---
arc42: "8 — Crosscutting Concepts"
status: stub
---
## 8.1 Logging
- Format: structured JSON in production (`quarkus-logging-json`); human-readable in dev
- Log levels: DEBUG (dev only), INFO (operational events), WARNING (recoverable anomalies), ERROR (failures needing attention)
- What is logged per component:
| Component | INFO events | WARNING events | ERROR events |
|---|---|---|---|
| API | request method+path+status+duration | validation failure (BSM) | DB connection failure |
| Spider | check start/end, service_id, liveness result, duration | response time > 3s | fetch timeout, DB write failure |
| Portal | form submission received | — | API call failure |
- **Never log:** email addresses, API keys, DB credentials, any PII
- Request IDs: generate UUID per request, include in all log lines for that request
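
A sketch of the request-ID rule using a JAX-RS filter and the JBoss logging MDC (a matching response filter that clears the entry is omitted):

```java
import java.util.UUID;

import org.jboss.logging.MDC;

import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.ext.Provider;

@Provider
public class RequestIdFilter implements ContainerRequestFilter {

    // One UUID per request; every log line emitted while handling the
    // request carries the same requestId. The JSON log format can emit
    // MDC entries as structured fields.
    @Override
    public void filter(ContainerRequestContext ctx) {
        MDC.put("requestId", UUID.randomUUID().toString());
    }
}
```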
## 8.2 Error Handling
- All API errors return structured JSON: `{ "error": "string", "detail": "string", "code": "APIX-ERR-XXX" }` (see the mapper sketch after this list)
- HTTP status codes:
- `400` — malformed request (not JSON, missing content-type)
- `422` — BSM validation failure (Bean Validation; include field-level errors)
- `401` — missing or invalid API key on write endpoints
- `404` — service not found
- `429` — rate limit exceeded
- `500` — internal server error (never expose stack trace to client)
- Spider errors are logged but do not crash the scheduler; failed service → `liveness=unreachable`
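
A sketch of the error contract as an exception mapper (the exception type and error record are illustrative):

```java
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
import jakarta.ws.rs.ext.ExceptionMapper;
import jakarta.ws.rs.ext.Provider;

class ServiceNotFoundException extends RuntimeException {
    ServiceNotFoundException(String message) { super(message); }
}

@Provider
public class ServiceNotFoundMapper implements ExceptionMapper<ServiceNotFoundException> {

    // Mirrors the structured error shape above; no stack trace leaves the server.
    public record ApiError(String error, String detail, String code) {}

    @Override
    public Response toResponse(ServiceNotFoundException e) {
        return Response.status(404)
                .type(MediaType.APPLICATION_JSON)
                .entity(new ApiError("not_found", e.getMessage(), "APIX-ERR-404"))
                .build();
    }
}
```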
## 8.3 Security Hygiene (MVP-grade)
| Control | Implementation | What this is NOT |
|---|---|---|
| HTTPS | Caddy auto-TLS; HTTP redirects to HTTPS | Not HSTS with long max-age (add post-MVP) |
| Write endpoint auth | `X-API-Key` header checked against env var | Not per-user keys (add post-MVP) |
| Rate limiting on writes | Caddy `rate_limit` directive: 10 req/min per IP on `/api/register` | Not full DDoS protection |
| No secrets in Git | `.env.example` only; `.env` in `.gitignore` | Not secret scanning CI (add post-MVP) |
| No PII in logs | Enforced by convention; no log of `registrant_email` field | Not automated PII detection |
| Non-root containers | All Dockerfiles use `USER appuser` | Not read-only filesystem (add post-MVP) |
## 8.4 BSM Validation
- Validation layer: Bean Validation constraints on `BsmPayload` (Hibernate Validator; see the sketch after this list)
- Required fields (per Internet-Draft): `name`, `version`, `description`, `capabilities[]`, `endpoint`, `contact_email`
- Optional fields validated if present: `olevel`, `slevel`, `pricing`, `regulatory`
- On validation failure: `422` with field-level error list
- Re-registration (same `endpoint` URL): treated as update (UPSERT); BSM version must be >= existing version
- Schema version stored with each record; enables future migration
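
The required-field list above, sketched as Bean Validation constraints on the shared DTO (JSON field names per the draft; Java names are camelCase assumptions):

```java
import java.util.List;

import jakarta.validation.constraints.Email;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.NotEmpty;

public class BsmPayload {

    @NotBlank public String name;
    @NotBlank public String version;
    @NotBlank public String description;
    @NotEmpty public List<@NotBlank String> capabilities;
    @NotBlank public String endpoint;
    @NotBlank @Email public String contactEmail;

    // optional fields (olevel, slevel, pricing, regulatory) are validated
    // only when present; omitted here
}
```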
## 8.5 Liveness Check
- **"Live"** = HTTP 2xx response within 5 seconds from the registered `endpoint` URL
- **"Degraded"** = HTTP 2xx but response time > 3 seconds
- **"Unreachable"** = timeout, connection refused, or non-2xx response
- Status transitions: any state → any state on each check (no hysteresis in MVP)
- Check frequency: 15 min in prod, 2 min in dev
- `last_checked_at` timestamp always exposed in API response
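
These rules reduce to a pure function (the 5 s timeout is enforced by the fetcher; a timeout or connection error arrives here as a null status code). A sketch matching the `LivenessEvaluator` role from section 5.5:

```java
import java.time.Duration;

public final class LivenessEvaluator {

    public enum LivenessStatus { LIVE, DEGRADED, UNREACHABLE } // shared enum lives in apix-common

    public LivenessStatus evaluate(Integer statusCode, Duration responseTime) {
        if (statusCode == null || statusCode < 200 || statusCode >= 300) {
            return LivenessStatus.UNREACHABLE; // timeout, refused, or non-2xx
        }
        if (responseTime.compareTo(Duration.ofSeconds(3)) > 0) {
            return LivenessStatus.DEGRADED;    // 2xx but slower than 3 s
        }
        return LivenessStatus.LIVE;            // 2xx within 3 s
    }
}
```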
## 8.6 Idempotency
- `POST /api/register` with the same `endpoint` URL: UPSERT (update BSM, reset liveness to `pending`); see the sketch after this list
- Spider re-check: always overwrites previous liveness status — idempotent by design
- DB migrations (Liquibase): each changeset is forward-only; re-running skips already-applied changesets (Liquibase tracks applied changesets in `DATABASECHANGELOG` table)
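
A sketch of the registration UPSERT (method and field names are assumptions; `endpointUrl` is unique per the schema):

```java
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import jakarta.transaction.Transactional;

@ApplicationScoped
public class RegistryService {

    @Inject ServiceRepository repository;

    // Keyed on the endpoint URL: re-registration updates the payload and
    // resets liveness to pending instead of creating a duplicate row.
    @Transactional
    public ServiceRecord upsert(BsmPayload bsm) {
        ServiceRecord record = repository.find("endpointUrl", bsm.endpoint)
                .firstResultOptional()
                .orElseGet(ServiceRecord::new);
        record.endpointUrl = bsm.endpoint;
        record.bsmPayload = bsm;
        record.livenessStatus = "pending";
        repository.persist(record); // insert if new; managed update otherwise
        return record;
    }
}
```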
## 8.7 Internationalisation (i18n)
See ADR-013 for the full decision and rationale.
**Locale resolution order (highest priority first):**
1. `apix-locale` cookie (set by the language switcher via `POST /locale`)
2. `Accept-Language` request header (browser preference)
3. Default: `en` (English)
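
A sketch of that resolution order (supported locales hardcoded for MVP):

```java
import java.util.List;
import java.util.Locale;

import jakarta.enterprise.context.RequestScoped;
import jakarta.ws.rs.core.Cookie;
import jakarta.ws.rs.core.HttpHeaders;

@RequestScoped
public class LocaleResolver {

    private static final List<String> SUPPORTED = List.of("en", "de");

    public Locale resolve(HttpHeaders headers) {
        // 1. explicit choice via the language switcher cookie
        Cookie cookie = headers.getCookies().get("apix-locale");
        if (cookie != null && SUPPORTED.contains(cookie.getValue())) {
            return Locale.forLanguageTag(cookie.getValue());
        }
        // 2. browser preference, ordered by q-value
        for (Locale candidate : headers.getAcceptableLanguages()) {
            if (SUPPORTED.contains(candidate.getLanguage())) {
                return candidate;
            }
        }
        // 3. default
        return Locale.ENGLISH;
    }
}
```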
**String externalisation:**
- All user-visible strings in Qute templates are referenced via the message bundle namespace (`{msg:<key>}`) — not hardcoded
- `Messages.java` (`@MessageBundle`) declares all keys; Quarkus compiler verifies usage at build time
- `messages.properties` — English; `messages_de.properties` — German; adding a locale requires only a new properties file
- Keys follow the pattern `<section>.<element>` (e.g. `nav.register`, `service.oLevel.label`, `admin.pending.title`)
**Help / tour content:**
- Tour titles, step headings, and step body text are defined in `HelpContentService` using `Messages` keys, resolved to the request locale
- The resolved tour data is serialized to JSON and embedded in each page as `window.PAGE_TOURS` + `window.PAGE_HELP` — no client-side translation lookup at runtime
- Adding a translated tour step requires only adding the key to `Messages.java` + both properties files
**What is not localised in MVP:**
- Error messages from Bean Validation (return as-is in EN; acceptable for API-layer errors)
- Log messages (always EN)
- BSM content submitted by registrants (stored as-is; not translated)
**Language switcher:**
- `<form method="post" action="/locale">` with `<input name="lang" value="de|en">` in the base layout
- `POST /locale`: validates lang against `["en", "de"]`; sets `apix-locale` cookie (path `/`, SameSite=Lax, HttpOnly); redirects to `Referer` header
- Language switcher is rendered in the base layout; available on every portal page
## 8.8 Human-Readable Service Detail (Index Level 2 Entry)
The machine-readable service entry (`GET /api/services/{id}` returning JSON) and the human-readable portal page (`GET /services/{id}` returning HTML) represent the same data. The HTML version is designed for a human making a go/no-go decision about using a service — not for a machine parsing a schema.
**Design principle:** answer four questions in order, above the fold where possible:
1. **Who is this?** — name, description
2. **Can I trust them?** — O-level with plain-English explanation, liveness uptime, last-verified date
3. **What exactly does it do?** — capabilities, pricing
4. **How do I call it?** — endpoint, spec links, example snippet
**Trust level presentation:**
- O-level and S-level are never shown as bare codes (O-2, S-1) to human visitors — always rendered as `badge + level name + 2-sentence explanation`
- The explanation is locale-resolved from `Messages` (keys `service.oLevel.N.description`) — not hardcoded in the template
- O-level badge color conveys confidence tier at a glance: grey (O-0), blue (O-1/O-2/O-3), green (O-4/O-5)
- "Reference entry by BSF" badge is shown prominently when `isReferenceEntry=true` — prevents a human from mistaking a BSF-registered third-party service for one that has self-registered
**Liveness presentation:**
- Status displayed as colored dot + label (LIVE / DEGRADED / UNREACHABLE) — not as an enum string
- Uptime percentage and average response time are formatted human values ("98.4%", "142 ms") computed by `ServiceDetailViewModelFactory`, not raw floats
- Last-checked timestamp shown relative ("8 minutes ago") with the absolute ISO date in a `title` attribute tooltip — humans read relative time faster; machines read absolute
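
A sketch of the relative-timestamp computation owned by the view-model factory (thresholds and wording are assumptions):

```java
import java.time.Duration;
import java.time.Instant;

public final class RelativeTime {

    // "8 minutes ago" for the badge; the absolute ISO instant goes into
    // the title attribute tooltip alongside it.
    public static String format(Instant lastCheckedAt, Instant now) {
        Duration age = Duration.between(lastCheckedAt, now);
        if (age.toMinutes() < 1)  return "just now";
        if (age.toMinutes() < 60) return age.toMinutes() + " minutes ago";
        if (age.toHours() < 24)   return age.toHours() + " hours ago";
        return age.toDays() + " days ago";
    }
}
```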
**Separation of concerns:**
- `ServiceDetailViewModelFactory` (portal module) owns all human-readable computation: relative timestamps, color class selection, O-level description lookup, GLEIF LEI URL construction
- The Qute template (`service.html`) contains no business logic — it renders what the view model provides
- The registry API is not changed for this feature; the portal fetches the existing full-detail endpoint and enriches the response client-side in the portal
**Integration section (collapsible):**
- The raw endpoint URL and a minimal HTTP example are provided for developers who discover the service through the portal rather than via agent query
- Link to `GET /api/services/{id}` (machine-readable JSON) is included — a developer can use the portal as a discovery UI and then switch to the machine API
- This collapsible is closed by default to keep the human trust signals prominent
---
arc42: "9 — Architecture Decisions"
status: stub
---
## ADR-001: Java 21 + Quarkus 3.x over Python/FastAPI or Spring Boot
**Status:** Revised (was Python + FastAPI)
**Context:** The MVP targets a microservice architecture deployed as container-native services. Two competing concerns: development speed (solo, LLM-assisted) and production reliability (trust infrastructure, compile-time safety). Python + FastAPI was the original choice for speed; on review, Java 21 + Quarkus fits the microservice target better and Quarkus's dev mode removes most of the iteration speed penalty.
**Decision:** Java 21 + Quarkus 3.x with GraalVM Native Image for production builds.
**Stack breakdown:**
| Layer | Technology | Replaces |
|---|---|---|
| REST API | RESTEasy Reactive (JAX-RS) | FastAPI routers |
| Persistence | Hibernate ORM + Panache | SQLAlchemy |
| Migrations | Liquibase | Alembic |
| Validation | Hibernate Validator (Bean Validation) | Pydantic |
| Portal templates | Qute | Jinja2 |
| Spider scheduler | Quarkus Scheduler | APScheduler |
| HTTP client (Spider) | Quarkus REST Client Reactive | aiohttp |
| Health checks | SmallRye Health (`/q/health`) | Manual `/health` route |
| Metrics | Micrometer | Manual |
| Security (API key) | Quarkus Security custom identity provider | Custom middleware |
| Build | Maven 3.9 | pip / uvicorn |
| Testing | JUnit 5 + @QuarkusTest + RestAssured | pytest |
| Production binary | GraalVM Native Image | Python interpreter |
| Dev loop | `quarkus dev` (JVM mode, live reload, continuous testing) | `uvicorn --reload` |
**Rationale:**
- **Compile-time safety:** Quarkus resolves dependency injection, validation, and REST binding at build time — not at runtime via reflection. Errors that would surface as `500` at runtime in Python surface as build failures in Quarkus.
- **Purpose-built for microservices:** Quarkus's design assumption is container-native, independently deployable services. Spring Boot was designed for monoliths first and microservices second; Quarkus is the reverse.
- **Native image quality:** GraalVM Native Image works cleanly with Quarkus because Quarkus uses no runtime reflection by default. Spring Boot's native image support requires reflection hints for anything that Spring's runtime proxy model touches. Quarkus native: ~50–80MB RAM per service, ~100ms startup.
- **Dev mode removes the speed penalty:** `quarkus dev` gives instant live reload, continuous test execution, and Dev UI — the iteration loop is comparable to Python for day-to-day development. Native build only runs for the production image.
- **Java 21 virtual threads:** Reactive-style concurrency (needed for the Spider's async HTTP checks) without reactive programming model complexity. `@RunOnVirtualThread` on the Spider scheduler gives non-blocking I/O without Mutiny/Reactor boilerplate.
- **Liquibase:** User's existing tool, Quarkus has a first-class Liquibase extension — no migration cost.
**Development model:** Code and test in JVM mode (`quarkus dev`). CI builds the native image. Production container runs the native binary.
**Rejected alternatives:**
- Python + FastAPI: dynamic typing; no compile-time safety; memory/startup acceptable but native image not available; retained for the SDK (client-side), not server-side
- Spring Boot 3.x + GraalVM Native: native image works but requires reflection hints for Spring's proxy model; more operational complexity than Quarkus native for the same result
- Go: fastest native binary; no JVM; but Carsten's background is JVM-based; no meaningful advantage over Quarkus native for this use case
**Consequence for SDK:** The apix-sdk-python and apix-sdk-typescript remain Python and TypeScript — the server being Java has no impact on client SDK language.
---
## ADR-002: PostgreSQL + JSONB over MongoDB
**Status:** Decided
**Context:** BSM is a structured document with a fixed core and a flexible optional section (`regulatory`, `pricing`). Registry operations are relational (search by capability, filter by country, join with liveness status).
**Decision:** PostgreSQL 16 with JSONB column for BSM payload
**Rationale:** Relational integrity where it matters (service_id, liveness_status, timestamps are typed columns). JSONB for BSM payload flexibility without a separate document store. Single database engine to maintain. Liquibase manages migrations (see ADR-008).
**Rejected alternatives:**
- MongoDB: no relational joins; schema migration story is weaker; adds operational complexity for no benefit
---
## ADR-003: Caddy over nginx or Traefik
**Status:** Decided
**Context:** Need TLS termination and reverse proxy. Solo operator, no DevOps team.
**Decision:** Caddy
**Rationale:** Automatic TLS via Let's Encrypt with zero configuration. Caddyfile is 10 lines vs nginx config of 50+. Traefik adds Docker label complexity not needed for a two-service setup.
**Rejected alternatives:**
- nginx: more control, more config; cert renewal needs certbot cron; higher solo maintenance burden
- Traefik: good for dynamic service discovery; overkill for fixed 2-service Docker Compose
---
## ADR-004: HTMX + Qute over React/Vue for Portal
**Status:** Revised (was HTMX + Jinja2 / FastAPI — updated for Quarkus stack)
**Context:** Portal is admin-grade, not consumer-grade. Primary users are registrants (submit BSM) and admin (assign O-levels). No real-time requirements. Stack is now Quarkus, so Jinja2 (Python) is not available.
**Decision:** HTMX + Qute templates served from the Quarkus portal application.
**Rationale:** Qute is Quarkus's native, type-safe templating engine. Type-safe means template errors surface at build time, not at render time — consistent with the compile-time safety rationale of ADR-001. No JS build pipeline. No npm. HTMX handles dynamic form behaviour (inline validation, partial page updates) without a JS framework. Template hot-reload works in `quarkus dev` mode.
**Qute specifics:** Templates live in `src/main/resources/templates/`. Type-safe binding via `@CheckedTemplate` — the Java compiler verifies that template variables exist and are of the declared type. Significantly safer than Jinja2's runtime-only variable resolution.
**Rejected alternatives:**
- React/Vue: overkill for admin portal; adds build pipeline maintenance; SPA adds complexity without user benefit
- Freemarker / Thymeleaf: both work with Quarkus but are not type-safe; Qute is the idiomatic Quarkus choice
- Jinja2: Python only; not available in Quarkus
---
## ADR-005: Automated O-1 / O-2 / O-3 Verification in MVP; O-4 / O-5 Post-MVP
**Status:** Decided
**Context:** The trust model has six organisation levels (O-0 to O-5). The original assumption was to defer all automated verification to Phase 2. On review: O-1, O-2, and O-3 are achievable in the MVP timeframe and are essential for the PoC to credibly demonstrate the trust model — not just describe it. O-4 and O-5 require human reviewers (Accredited Verifiers) and are genuinely post-MVP.
**Decision:** Implement automated verification for O-1, O-2, and O-3 in the MVP. O-4 and O-5 remain manual / post-MVP.
| Level | Mechanism | External dependency | MVP? |
|---|---|---|---|
| O-1 Identity Verified | DNS TXT record proof of domain ownership; business email MX check | Standard DNS resolver — no external API | Yes |
| O-2 Legal Entity Verified | GLEIF REST API (primary); OpenCorporates API (fallback for registrants without LEI) | GLEIF (free, public); OpenCorporates (free tier) | Yes |
| O-2 pre-condition | Sanctions screening against OFAC SDN + EU consolidated + UN SC lists | Public datasets; downloaded and cached locally; no live API call at check time | Yes |
| O-3 Hygiene Verified | HTTP fetch of `/.well-known/security.txt`; DNS DMARC + SPF lookup; reachability of Privacy Policy + ToS URLs | HTTP fetcher + DNS — no external API | Yes |
| O-4 Operationally Verified | Accredited Verifier assessment — human review | Accredited Verifier network | No |
| O-5 Audited | Third-party audit certificate (SOC 2 / ISO 27001) | Audit body | No |
**Rationale:** O-1 and O-3 require only DNS + HTTP — zero external API dependencies, implementable in hours. O-2 via GLEIF is one REST call against a well-documented public API. Sanctions screening uses locally cached public datasets — no live API dependency at verification time, only at download time. The combined effort is ~1–2 weeks of focused work, and the result is a PoC that demonstrates the trust model end-to-end rather than describing it.
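
A sketch of the O-1 DNS TXT proof using `dnsjava` (the resolver library named in ADR-009); the record name and token format are assumptions pending the draft's authoritative definition:

```java
import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.TXTRecord;
import org.xbill.DNS.Type;

public class O1DnsVerifier {

    // The registrant publishes a TXT record carrying a registry-issued token;
    // finding the token under the domain proves control of the zone.
    public boolean domainProven(String domain, String expectedToken) throws Exception {
        Lookup lookup = new Lookup("_apix-verify." + domain, Type.TXT);
        Record[] records = lookup.run();
        if (lookup.getResult() != Lookup.SUCCESSFUL || records == null) {
            return false;
        }
        for (Record record : records) {
            for (Object value : ((TXTRecord) record).getStrings()) {
                if (expectedToken.equals(value)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```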
**Consequences:**
- Configuration gains `GLEIF_API_URL`, `OPENCORPORATES_API_KEY`, `SANCTIONS_CACHE_PATH`, `SANCTIONS_REFRESH_INTERVAL_DAYS` (injected from `application.properties` / environment)
- The `apix-verification` module gains 6 components (C-31 to C-36)
- New Liquibase changeset: `verification_status`, `olevel`, `olevel_checked_at`, `sanctions_cleared` columns
- Verification tests (C-37 to C-41) use mocked external APIs — no live network calls in test suite
- Admin portal still shows pending verifications; admin can override any O-level manually (important for edge cases and for O-4/O-5 placeholders)
**Rejected alternative:** Fully manual O-level assignment. Rejected because the PoC then cannot demonstrate automated trust elevation — the most important differentiator from a static directory.
---
## ADR-006: Two-VPS Hetzner Deployment (APIX Application + Gitea)
**Status:** Revised (was single VPS — updated for Gitea separation per ADR-010)
**Context:** Gitea requires dedicated hosting separate from the APIX application to avoid coupled failure domains. Code hosting and application hosting failing together during a deployment is an unacceptable blast radius.
**Decision:** Two Hetzner CX22 VPS instances, both in FSN1 (Falkenstein, Germany):
| VPS | Purpose | Services |
|---|---|---|
| `apix-app` | APIX application (Docker Swarm) | registry, spider, portal, PostgreSQL, Caddy |
| `apix-gitea` | Code + CI/CD (Docker Compose) | Gitea, Caddy, act_runner (JVM) |
**Rationale:** Decoupled failure domains. A deployment to `apix-app` cannot affect Gitea. A Gitea restart cannot affect the running registry. The CPU-intensive GraalVM native build runner runs on the APIX VPS, scheduled so that it does not compete with the running application services (see ADR-010 and ADR-012).
**Cost:** 2 × CX22 = ~€8.70/month. Acceptable for PoC.
**Backup:** Both VPS: pg_dump (apix-app) and Gitea data volume (apix-gitea) backed up daily to their respective Hetzner volumes.
**Exit path:** Post-funding: Hetzner Managed Database for HA PostgreSQL; multi-region Gitea replication; dedicated build runner on a larger instance.
---
## ADR-007: Register Verification APIs as Reference APIX Entries
**Status:** Decided
**Context:** APIX uses external public APIs (GLEIF, OpenCorporates, EU Sanctions list, Companies House) as part of its automated O-level verification pipeline. These APIs are themselves exactly the kind of atomic, independently callable services that the APIX model is designed to make discoverable. Registering them in APIX serves two purposes: (1) it demonstrates the model by making the verification infrastructure itself discoverable, and (2) it creates a natural outreach opportunity — once registered as reference entries, BSF can invite the operators to self-upgrade their registrations.
**Decision:** BSF registers GLEIF, OpenCorporates, EU Sanctions (eu-sanctions.io or equivalent public endpoint), and Companies House UK as reference APIX entries at O-0/O-1 during the MVP build (Week 5). These are registered by BSF as the registrant, clearly labelled as "reference registration — operator not yet self-registered."
| Service | Capability tag | Initial O-level | Target O-level (post-outreach) |
|---|---|---|---|
| GLEIF API | `legal-entity.lookup` | O-1 (domain verified) | O-2+ if GLEIF self-registers |
| OpenCorporates API | `company.lookup` | O-1 | O-2+ if OC self-registers |
| EU Sanctions endpoint | `sanctions.screen` | O-1 | O-2+ |
| Companies House UK | `org.verify.uk` | O-1 | O-2+ |
**Rationale:** These registrations cost BSF one afternoon of work and produce four real, meaningful entries in the registry. They also demonstrate recursion: APIX verifies organisations using services that are themselves registered in APIX. This is a strong narrative for the STF application and for founding member pitches. The outreach to GLEIF and Companies House to self-upgrade their registrations is also a legitimate business development activity.
**Constraints:** BSF's Terms of Service must explicitly permit third-party reference registrations at O-0. Admin override allows BSF to mark these entries as "reference — not operator-maintained" to avoid misleading consuming agents about SLA.
**Rejected alternative:** Only register self-operated services. Rejected because it leaves the registry with fewer entries and misses the recursive demonstration value.
---
## ADR-009: Maven Multi-Module Project with Separated Scheduler
**Status:** Decided
**Context:** The registry API and the Spider scheduler are distinct concerns with different scaling and deployment characteristics. The API must be responsive at all times; the scheduler runs on a fixed interval and is latency-insensitive. Bundling them into a single deployable couples their release cycles and prevents independent scaling. Similarly, shared types (enums, DTOs) and the verification pipeline should not be duplicated across services.
**Decision:** Maven multi-module project with five modules:
| Module | Type | Depends on | Responsibility |
|---|---|---|---|
| `apix-common` | Plain Java 21 library | — | Shared enums (`OLevel`, `LivenessStatus`), DTOs (`BsmPayload`, `ServiceSummaryDto`), `VerificationResult` record |
| `apix-verification` | Plain Java 21 library | `apix-common` | O-level elevation pipeline; uses `java.net.http.HttpClient` and `dnsjava`; no Quarkus dependency — fully testable without Quarkus context |
| `apix-registry` | Quarkus 3.x app | `apix-common`, `apix-verification` | REST API (HATEOAS), BSM registration, capability search, Liquibase migrations (schema owner) |
| `apix-spider` | Quarkus 3.x app | `apix-common` | `@Scheduled` liveness checks, OpenAPI/MCP spec verification; connects to shared DB; does **not** run migrations |
| `apix-portal` | Quarkus 3.x app | `apix-common` | HTMX + Qute web portal; calls `apix-registry` via REST Client; admin O-level assignment |
**Rationale:**
- **Scheduler independence:** Spider can be restarted, redeployed, or scaled independently of the API. A Spider bug cannot take down the registry. Quarkus `@Scheduled` inside the registry would tie their lifecycles together.
- **Plain Java library for verification:** `apix-verification` uses `java.net.http.HttpClient` (Java 11+) and `dnsjava` — no Quarkus runtime needed. This means all verification logic is unit-testable with plain JUnit, with no `@QuarkusTest` overhead. The registry wraps the pipeline in a CDI bean (`VerificationOrchestrator`) that injects Quarkus config and calls the library.
- **Schema ownership:** Registry runs Liquibase at startup. Spider connects to the same PostgreSQL instance but has Liquibase disabled (`quarkus.liquibase.migrate-at-start=false`). Spider has its own `SpiderServiceView` entity mapped to the same table — it only reads `endpoint_url` and writes `liveness_status`, `last_checked_at`, `uptime_30d_percent`, `avg_response_ms`, `consecutive_failures`.
- **Parent POM as BOM:** Quarkus BOM imported in parent manages all transitive version alignment. Each Quarkus module inherits plugin config via `<parent>`. Plain Java modules only inherit `maven-compiler-plugin` config (Java 21 release).
**Consequence for Docker Compose:** Three independently deployable containers (registry, spider, portal) + PostgreSQL + Caddy. Each has its own Dockerfile with multi-stage GraalVM native build. On CX22 (4GB): 3 × ~80MB native = ~240MB + PostgreSQL ~256MB + Caddy ~20MB ≈ 516MB total — comfortable headroom.
**Rejected alternative:** Single Quarkus application with Spider as `@Scheduled` bean inside the registry. Rejected because it couples API and scheduler lifecycles, prevents independent scaling, and violates the single-responsibility principle that microservices are meant to enforce.
---
## ADR-008: Liquibase over Flyway for Database Migrations
**Status:** Decided
**Context:** Database schema migrations are required for both the initial schema and the incremental additions (verification status columns, liveness metrics). Both Liquibase and Flyway have first-class Quarkus extensions. The question is which fits the microservice context and the team's existing knowledge better.
**Decision:** Liquibase with XML changesets.
**Rationale:** Carsten already knows Liquibase — this is the primary decision factor for a solo MVP. The operational risk of learning a new migration tool while building a new framework (Quarkus) simultaneously is not justified by Flyway's marginal simplicity advantage. Liquibase's rollback support, changeset contexts (dev vs prod), and precondition checks provide more control for a trust infrastructure that must handle schema changes carefully. Quarkus's `quarkus-liquibase` extension runs changelogs at startup automatically — identical developer experience to Flyway.
**Consequence:** Changesets live in `src/main/resources/db/changelog/`. Master changelog at `db.changelog-master.xml`; individual changesets in `db/changelog/changes/`.
**Rejected alternative:** Flyway. More common in microservice community; simpler mental model; SQL-first. Rejected because the switching cost (learning new tool under time pressure) outweighs the simplicity benefit for a solo developer who already knows Liquibase.
---
## ADR-010: Self-Hosted Gitea as Primary; GitHub as Automated Push Mirror
**Status:** Decided
**Context:** The project requires code hosting and a Docker container registry for CI/CD artifacts. Options: GitHub (public, US-hosted), Gitea self-hosted (European sovereignty), GitLab self-hosted (heavier). The BSF's sovereignty narrative demands European-hosted, non-commercially-controlled infrastructure. GitHub remains relevant for community visibility (STF application, IETF credibility, developer adoption).
**Decision:** Gitea self-hosted on a dedicated Hetzner CX22 VPS as the authoritative remote. GitHub is a read-only push mirror, updated automatically by Gitea on every push to main. Gitea Container Registry hosts all Docker images.
**Infrastructure:**
- Hetzner CX22 (FSN1, Germany) dedicated to Gitea — separate from the APIX application VPS
- Gitea runs in Docker Compose on the Gitea VPS (Gitea + Caddy for TLS)
- Gitea Container Registry enabled (OCI-compatible; images pushed as `gitea.botstandards.org/<org>/<module>:<tag>`)
- SQLite for Gitea's own database — solo team, no concurrent write pressure; eliminates second PostgreSQL instance on the Gitea VPS
- Gitea Actions enabled; act_runner installed on the Gitea VPS for JVM-mode builds (fast, low CPU)
- Native image builds run on a separate Gitea Actions runner on the APIX VPS (scheduled, not on every push — CPU-intensive)
- GitHub push mirror: configured in Gitea repository settings; pushes to `github.com/bot-standards-foundation/<repo>` on every main branch push; GitHub repo is read-only for external contributors (PRs accepted via GitHub, mirrored to Gitea)
**Rationale:**
- All code, all build artifacts, all CI pipelines run on European infrastructure under BSF control
- Gitea Actions is GitHub Actions-compatible YAML — no migration cost if ever moving to GitHub Actions
- Container images pulled from Gitea registry at deploy time — no DockerHub dependency
- GitHub mirror preserves community discoverability without surrendering control
- SQLite for Gitea removes a second PostgreSQL instance; Gitea's write load (a solo developer) is trivially within SQLite's capacity
**Rejected alternatives:**
- GitHub as primary: contradicts sovereignty narrative; US-controlled; acceptable for mirror only
- GitLab self-hosted: heavier resource requirements; Gitea is sufficient for one developer
---
## ADR-011: Docker Swarm Single-Node for Zero-Downtime Production Deployment
**Status:** Decided
**Context:** The APIX registry is trust infrastructure — downtime during deployments damages credibility with registered services, consuming agents, and the STF reviewer. Docker Compose's standard `up -d` stops the old container before the new one is healthy, causing a brief outage. Kubernetes is operationally out of scope for a solo developer. A zero-downtime deployment mechanism is required.
**Decision:** Docker Swarm single-node mode for production on the APIX VPS. Local development continues to use Docker Compose (simpler, no Swarm overhead). Production uses a `docker-stack.yml` with `deploy.update_config.order: start-first` and health-check gating.
**How zero-downtime works:**
1. CI pushes new image to Gitea registry
2. Deploy step runs `docker service update --image <new-image> <service>` via SSH
3. Swarm starts new container and waits for its health check to pass
4. Once healthy, Swarm begins routing traffic to the new container
5. Old container is stopped and removed
6. If health check never passes: automatic rollback to previous image (`rollback_config`)
7. Caddy routes to the Swarm service VIP — it never needs reconfiguring during rolling updates
**Swarm stack config per service:**
```yaml
deploy:
  replicas: 1
  update_config:
    order: start-first
    failure_action: rollback
    delay: 10s
  rollback_config:
    order: stop-first
  restart_policy:
    condition: on-failure
    delay: 5s
    max_attempts: 3
```
**Local vs production parity:**
| Concern | Local (`docker-compose.yml`) | Production (`docker-stack.yml`) |
|---|---|---|
| Orchestrator | Docker Compose | Docker Swarm single-node |
| Images | Built from source (`quarkus dev`) | Pre-built native images from Gitea registry |
| TLS | None | Caddy auto-cert |
| Rolling updates | Not supported | `start-first` with health check gate |
| Secrets | `.env` file | Docker Swarm secrets |
**Rejected alternatives:**
- Docker Compose only: brief downtime on every deploy; not acceptable for trust infrastructure
- Kubernetes (k3s): zero-downtime capable but operationally too heavy for a solo developer
- Traefik instead of Caddy: Traefik has better native Swarm label integration but adds complexity; Caddy routing to Swarm service VIP achieves the same result without replacing the reverse proxy
---
## ADR-012: Three-Stage CI/CD Pipeline
**Status:** Decided
**Context:** GraalVM native image builds take 10–15 minutes. Running them on every push would make the CI feedback loop unusable for active development. Conversely, deploying JVM-mode images to production is not acceptable — native images are required for the memory profile and startup time targets. Production deployments must be independently tested before going live.
**Decision:** Three distinct CI stages with independent triggers:
| Stage | Trigger | Runner | Duration | Output |
|---|---|---|---|---|
| **1 — Fast cycle** | Every push to any branch | Gitea VPS act_runner (JVM) | ~3–5 min | JVM build pass/fail; unit + `@QuarkusTest` results |
| **2 — Native build** | Merge to `main` | APIX VPS act_runner (GraalVM) | ~10–15 min | Native images pushed to Gitea Container Registry |
| **3 — Deploy** | Git tag (`v*`) | Gitea VPS act_runner | ~2 min | Zero-downtime Swarm rolling update; health check verified; rollback on failure |
**Stage 1 — Fast cycle (`.gitea/workflows/ci-fast.yml`):**
- `mvn verify` in JVM mode on all modules
- `@QuarkusTest` with Testcontainers (PostgreSQL) for registry + spider
- WireMock-based tests for verification pipeline
- No Docker build; no native compilation
**Stage 2 — Native build (`.gitea/workflows/ci-native.yml`):**
- `mvn package -Pnative -Dquarkus.native.container-build=true` for each Quarkus module
- Docker multi-stage build produces native image
- Integration test of native container (`@QuarkusIntegrationTest`)
- Push tagged image to Gitea registry: `gitea.botstandards.org/bsf/<module>:main-<sha>`
**Stage 3 — Deploy (`.gitea/workflows/deploy.yml`):**
- SSH to APIX VPS
- `docker service update --image <new-image> apix_registry` (and spider, portal)
- Wait for Swarm health check confirmation
- Verify `/q/health` endpoint returns UP
- On failure: Swarm auto-rollback; pipeline fails with alert
**Rationale:** Stage separation gives a fast feedback loop (developer doesn't wait 15 min for native build feedback) while ensuring production always runs tested native images. The deploy stage is a separate, explicit action — no code is deployed to production without a human creating a git tag.
**Consequence:** Requires two Gitea Actions runners:
- Gitea VPS: JVM runner (Stage 1 + Stage 3) — low CPU requirement
- APIX VPS: GraalVM native runner (Stage 2) — CPU-intensive; runs on a schedule or on-demand, not concurrently with the running application
---
## ADR-013: Server-Side i18n via Quarkus @MessageBundle; EN + DE for MVP
**Status:** Decided
**Context:** The portal must be usable by German-speaking founding member candidates (manufacturing sector, logistics operators) without requiring them to work in English. The STF application emphasises European focus — a German-language portal is consistent with that narrative. The stack is Quarkus + Qute; multiple i18n approaches exist and the choice must remain consistent with the compile-time safety rationale of ADR-001.
**Decision:** Server-side i18n using Quarkus `@MessageBundle` for static UI strings with Qute type-safe injection. Locale resolved from `Accept-Language` header with `apix-locale` cookie override. English (EN) and German (DE) for MVP. Language switcher in base layout.
**How it works:**
- `Messages.java`: `@MessageBundle`-annotated interface; one method per translatable string key
- `messages.properties` — English default strings (all keys defined here)
- `messages_de.properties` — German strings (same keys, DE values)
- Quarkus verifies all message keys at build time; the properties file matching the request locale is selected at render time and injected into Qute templates
- Templates use the message bundle namespace (`{msg:someKey}`) — the Qute compiler verifies at build time that the key exists on `Messages.java`
- `LocaleResolver.java` CDI bean: reads the `apix-locale` cookie if present; otherwise falls back to the `Accept-Language` header; returns `java.util.Locale`
- `PortalResource` injects `LocaleResolver`; passes locale to template rendering context
- `POST /locale` (`LocaleResource.java`) — sets `apix-locale` cookie; redirects to `Referer`; used by language switcher
- **Tour and help content** (JavaScript structures) are built by `HelpContentService` in the resolved locale and rendered into each page as a `<script>` block via Qute — consuming agents and portal users both receive pre-localized strings; no client-side translation layer
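
A sketch of the bundle interface (keys and default strings are illustrative; method names stand in for the dotted key convention in 8.7, and `messages_de.properties` supplies the same keys with German values):

```java
import io.quarkus.qute.i18n.Message;
import io.quarkus.qute.i18n.MessageBundle;

// Referenced from templates as {msg:nav_register} etc.; a missing or
// misspelled key fails the build, not the request.
@MessageBundle
public interface Messages {

    @Message("Register a service")
    String nav_register();

    @Message("Organisation level")
    String service_oLevel_label();
}
```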
**Rationale:**
- Server-side rendering is the natural fit for Quarkus + Qute — no JS i18n library needed, no build pipeline
- `@MessageBundle` gives compile-time verification that all string keys exist — consistent with ADR-001 compile-time safety
- Baking tour content into the page as pre-localized JSON means the help engine (`help.js`) receives already-resolved strings; no translation lookup at runtime
- Cookie-based locale preference survives page navigation without requiring a user account
- EN + DE covers the BSF operating language (EN) and the primary founding member market (DE); other locales can be added by adding a new properties file — no code change
**Rejected alternatives:**
- Client-side i18n (`data-i18n` attributes + JS TRANSLATIONS object as in the `used-books` reference): works for a single-file app but loses compile-time key checking; breaks the Qute type-safe model; requires maintaining a separate JS translation layer alongside the server-side one
- Separate JSON locale files served as static assets: decouples translations from build; loses key verification; requires a JS runtime translation layer and additional `fetch` call per page load
---
## ADR-014: Client-Side Help Overlay Engine with Server-Rendered Tour Content
**Status:** Decided
**Context:** Portal users — registrants submitting their first BSM, agent developers querying the registry for the first time, admins assigning O-levels — need in-context guidance at the exact moment they are performing an action. A static FAQ page requires users to context-switch. The portal must also work as a self-guided demo for the STF reviewer and founding member pitches. The `used-books` application in this repository contains a proven pattern: a four-wing spotlight overlay with a draggable tour card, progress dots, and a separate page-level help drawer.
**Decision:** Client-side JS help overlay engine (`help.js`) adapted from the `used-books` pattern. Tour and page-help content is server-rendered into each page via Qute as a locale-aware JS data structure. No external tour library dependency.
**Architecture:**
- `help.js` — single file, no framework; ~350 lines; manages the full overlay lifecycle:
- Four `<div>` dimming wings (`help-dim-top/left/right/bottom`) that cut out a spotlight window around the current target element
- Highlight ring (`help-highlight`) positioned over the target
- Draggable tour card (`help-card`) with header drag handle (`cursor-grab`), group icon, progress dots, title, state indicator, body text, Back / Skip / Next buttons
- Page-level static help drawer (`help-drawer`) sliding in from the right; contains: "Guided Tours" section (list of tours relevant to the current page) + "Page Help" section (static explanation of the current page)
- Context filter: the drawer shows only tours whose `pages` array includes the current page ID (`<body data-page-id="...">`)
- `tourCheckAndNext()`: validates any required form state before advancing a step; configurable per step
- **Tour data injection:** each Qute template embeds a `<script>` block with two page-scoped globals rendered at request time in the resolved locale:
```html
<script>
window.PAGE_TOURS = {tours.raw};
window.PAGE_HELP = {pageHelp.raw};
</script>
```
`{tours}` and `{pageHelp}` are `String` parameters passed by `PortalResource` — pre-serialized JSON produced by `HelpContentService` in the resolved locale. `.raw` bypasses Qute's default HTML escaping, which would otherwise turn the JSON quotes into `&quot;` inside the `<script>` block; `help.js` reads both globals on `window.onload`.
- `TourDefinition.java` + `TourStep.java` — Java records defining the data model for tour content
- `HelpContentService.java` — CDI bean; builds the locale-resolved `TourDefinition` list per page and serializes it to JSON (a sketch follows this section); 5 tours defined for MVP:
| Tour ID | Pages | Steps |
|---|---|---|
| `tour-agent-setup` | `/` (home) | 3: root endpoint URL → HATEOAS links JSON → capability query example |
| `tour-register` | `/register` | 5: open form → BSM name + description → capability tags → O-level meaning → submit |
| `tour-search` | `/search` | 3: enter capability → read results → interpret liveness badge |
| `tour-trust` | `/service/{id}` | 4: O-level indicator → S-level indicator → liveness badge → last_checked_at |
| `tour-admin` | `/admin` | 4: pending verifications list → assign O-level → reference registration flag → API key reminder |
- `templates/layout.html` — Qute base layout (all pages extend this); includes: help button (?) in nav bar; overlay HTML (4 wing divs + highlight ring + tour card + progress dots + state indicator); help drawer shell; `<script src="/help.js">`; language switcher form
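A minimal sketch of the tour data model and the per-page JSON build described above — record components, the Jackson usage, and all strings are assumptions for illustration, not the project's actual code:
```java
package org.acme.portal.help;

import java.util.List;
import java.util.Locale;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

/** One tour step: the element to spotlight and what the draggable card says. */
record TourStep(String targetSelector, String title, String body) {}

/** A guided tour, shown only on pages whose data-page-id appears in pages. */
record TourDefinition(String id, List<String> pages, List<TourStep> steps) {}

@ApplicationScoped
public class HelpContentService {

    @Inject
    ObjectMapper mapper; // Quarkus supplies a configured Jackson instance

    /** Builds the tours relevant to one page in the given locale, pre-serialized for Qute. */
    public String toursJsonFor(String pageId, Locale locale) throws JsonProcessingException {
        List<TourDefinition> all = List.of(
                new TourDefinition("tour-search", List.of("/search"), List.of(
                        new TourStep("#capability-input",
                                msg("tourSearchEnterCapability", locale),
                                msg("tourSearchEnterCapabilityBody", locale)))));
        // Context filter: only tours whose pages list includes the current page ID.
        return mapper.writeValueAsString(
                all.stream().filter(t -> t.pages().contains(pageId)).toList());
    }

    private String msg(String key, Locale locale) {
        return key; // placeholder — the real service resolves keys via the ADR-013 bundle
    }
}
```
`PortalResource` would pass the returned string as the `{tours}` template parameter shown earlier.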
**Rationale:**
- Client-side overlay requires no round-trips per step — smooth UX for a step-through walkthrough
- Server-rendering tour content in the resolved locale via Qute keeps i18n consistent with ADR-013 — one locale resolution point, no client-side translation map
- The draggable tour card moves out of the way — the user can see the spotlighted target element while reading the explanation, unlike a modal
- The `used-books` pattern is already in production in an adjacent project; adaptation cost is low; no learning curve
- No external CDN dependency means the help system works offline and does not introduce a third-party privacy concern
**Rejected alternatives:**
- Shepherd.js / Driver.js: well-maintained but external JS dependency; overkill for five tours in a portal with known pages; adds CDN or bundler dependency
- Pure modal help without overlay: user cannot see the element being explained while reading the explanation; defeats the purpose of contextual guidance
- Help text embedded directly in Qute templates: clutters the template; cannot be stepped through; not filterable by page context; not locale-switchable without full re-render
+52
View File
@@ -0,0 +1,52 @@
---
arc42: "10 — Quality Requirements"
status: stub
---
## 10.1 Quality Tree
```
Quality
├── Functionality
│   ├── Capability search returns relevant results
│   ├── HATEOAS navigation works from root URL without prior knowledge
│   └── BSM validation rejects invalid submissions with actionable errors
├── Reliability
│   ├── Liveness status reflects actual service state within one check interval
│   └── Registry survives VPS restart (data persisted to volume)
├── Security Hygiene
│   ├── All traffic over HTTPS
│   ├── Write endpoints reject unauthenticated requests
│   └── No credentials or PII in logs or Git
└── Operability
    ├── Deployable from scratch on a new Hetzner VPS in < 30 minutes
    ├── Health endpoint reflects actual DB connectivity
    └── Logs provide enough context to diagnose a registration failure without a debugger
```
## 10.2 Quality Scenarios
| # | Stimulus | Response | Measurable Outcome |
|---|---|---|---|
| QS-01 | Agent sends `GET /api/services?capability=inventory.read` | Returns list of matching services with BSM summaries and `_links` | Response time < 500ms; result includes at least 1 registered service |
| QS-02 | Registrant submits BSM with missing required field | API returns 422 with field-level error identifying the missing field | Error response includes field name and reason; no partial write to DB |
| QS-03 | Registered service goes offline | Spider marks it `unreachable` within 15 min | `liveness_status=unreachable` and updated `last_checked_at` in API response |
| QS-04 | Agent sends `GET /api/` (root) | Returns JSON with `_links` to search, register, and health endpoints | No prior knowledge of path structure required; all links resolvable |
| QS-05 | VPS is rebooted | All services come back up automatically; registry data intact | `docker compose up` on restart (via restart policy); 0 data loss |
| QS-06 | Unauthenticated POST to `/api/register` | 401 Unauthorized | No registration created; API key required |
| QS-07 | STF reviewer opens portal in browser | Homepage shows registry stats + search; registration form works | Zero errors in browser console; form submits successfully |
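For illustration, one plausible shape of the QS-04 root response — the link relation names and URI template are assumptions, not the normative contract:
```json
{
  "_links": {
    "self":     { "href": "/api/" },
    "search":   { "href": "/api/services{?capability}", "templated": true },
    "register": { "href": "/api/register" },
    "health":   { "href": "/api/health" }
  }
}
```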
## 10.3 MVP Acceptance Criteria
The PoC is **done** when all of the following are true:
- [ ] Public URL is reachable over HTTPS
- [ ] `GET /api/` returns valid HATEOAS navigation links
- [ ] `GET /api/services?capability=X` returns at least 1 result for at least 3 distinct capability queries
- [ ] At least 5 real services are registered (not demo fixtures)
- [ ] Spider has run at least one full check cycle and updated liveness status for all registered services
- [ ] Portal registration form accepts a valid BSM and shows confirmation
- [ ] Admin O-level assignment works via portal
- [ ] `GET /api/health` returns 200 with DB status
- [ ] No credentials or PII appear in `docker compose logs` output
- [ ] `infra/hetzner/provision.sh` + `deploy.sh` installs and starts the full stack on a fresh Hetzner VPS
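As a sketch of the health criterion, a minimal readiness check assuming SmallRye Health and an injected datasource — the class name and check label are hypothetical:
```java
package org.acme.registry.health;

import javax.sql.DataSource;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;
import org.eclipse.microprofile.health.HealthCheck;
import org.eclipse.microprofile.health.HealthCheckResponse;
import org.eclipse.microprofile.health.Readiness;

@Readiness
@ApplicationScoped
public class DatabaseHealthCheck implements HealthCheck {

    @Inject
    DataSource dataSource;

    @Override
    public HealthCheckResponse call() {
        // A real round-trip proves DB connectivity, not just that the pool exists.
        try (var conn = dataSource.getConnection();
             var stmt = conn.createStatement()) {
            stmt.execute("SELECT 1");
            return HealthCheckResponse.up("database");
        } catch (Exception e) {
            return HealthCheckResponse.down("database");
        }
    }
}
```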
+31
View File
@@ -0,0 +1,31 @@
---
arc42: "11 — Risks and Technical Debt"
status: stub
---
## 11.1 Risk Register
| # | Risk | Probability | Impact | Mitigation |
|---|---|---|---|---|
| R-01 | Big tech ships a competing agent service directory before PoC is done | Medium | High | Speed is the primary mitigation. PoC by end of 2026. IETF draft establishes prior art regardless of PoC state. |
| R-02 | Chicken-and-egg: no real registrants → registry looks empty → no agents query it → no registrant motivation | High | High | Pre-seed with 5 real services (self + Lexnexum + 3 outreach targets) before any public announcement. Never launch empty. |
| R-03 | Solo bus factor: Carsten gets sick/unavailable | Medium | High | All infra as code (GitHub); `provision.sh` + `deploy.sh` must be runnable by anyone with Hetzner access. No undocumented steps. |
| R-04 | Hetzner VPS data loss (disk failure) | Low | High | Daily pg_dump to separate Hetzner volume. Restore documented and tested. |
| R-05 | Spider causes load on registrant services (aggressive checking) | Low | Medium | 15-min interval; 5s timeout; respect `Crawl-delay` in robots.txt if present; opt-out mechanism in BSM. |
| R-06 | STF rejects application despite PoC | Medium | Medium | PoC also serves founding member pitch and IETF credibility regardless of STF outcome. |
| R-07 | IETF draft does not progress / working group not formed | Medium | Medium | APIX can operate as a de-facto standard regardless of IETF formal status (as DNS did). |
## 11.2 Technical Debt Log
Accepted shortcuts in the MVP, with explicit exit paths:
| # | Debt | Accepted Because | Exit Path | Priority |
|---|---|---|---|---|
| TD-01 | Manual O-level assignment | Automated GLEIF/domain check is weeks of work; manual is safe for PoC | Automated O-1 (DNS/domain) + O-2 (GLEIF) in Phase 2 | High |
| TD-02 | Single shared API key | Per-registrant key management requires auth layer; premature for PoC | OAuth2 / per-registrant key management post-MVP | High |
| TD-03 | No rate limiting on read endpoints | PoC traffic too low to warrant it | Caddy rate_limit directives when traffic warrants | Medium |
| TD-04 | No full OpenAPI spec field validation by Spider | Field-level validation requires schema comparison logic; overkill for PoC | Spider `openapi_parser.py` extension post-MVP | Medium |
| TD-05 | Single-region deployment | Multi-region requires DB replication; solo can't maintain safely | Hetzner Managed Database + multi-region post-funding | Low (PoC SLA is acceptable) |
| TD-06 | No CI/CD pipeline | Solo dev; manual deploy via `deploy.sh` is sufficient | GitHub Actions pipeline post-MVP | Low |
| TD-07 | No TLS for Spider → DB connection | Both on same Docker network; no external exposure | TLS on internal connections post-MVP if required by audit | Low |
| TD-08 | Spider does not yet honour registrant `robots.txt` (the `Crawl-delay` handling named as the R-05 mitigation is planned, not implemented) | Most registered services won't have agent-specific crawl rules yet | Add robots.txt check to Spider fetcher when needed | Low |
+23
View File
@@ -0,0 +1,23 @@
---
arc42: "12 — Glossary"
status: stub
---
| Term | Definition |
|---|---|
| **APIX** | API Index — the global, neutral, machine-queryable registry of services for autonomous agents |
| **BSM** | Bot Service Manifest — the structured metadata document that describes a machine-consumable service (capabilities, endpoint, trust level, pricing) |
| **Spider** | The automated APIX crawler that periodically checks liveness and spec compliance of registered services |
| **O-level** | Organisation trust level (O-0 to O-5). O-0: unverified. O-1: domain ownership confirmed. O-2: legal entity verified. Higher levels require additional compliance verification. |
| **S-level** | Service trust level. Reflects technical verification of the service against its declared BSM spec. |
| **Liveness** | Operational status of a registered service as last measured by the Spider. States: `pending`, `live`, `degraded`, `unreachable`. |
| **AE** | Agent Enterprise — an autonomous agent that composes APIX-registered services into a workflow and potentially earns on each execution |
| **HATEOAS** | Hypermedia as the Engine of Application State — REST architectural constraint where the client navigates entirely via links returned by the server; no out-of-band URL knowledge required |
| **DC-1** | Device Class registration — the APIX registration record for an IoT device class (BSM template); persists beyond the original operator's cloud service lifetime |
| **Capability** | A machine-readable tag describing what a service does (e.g., `inventory.read`, `slot.book`, `customs.doc`). The primary search dimension in APIX. |
| **GLEIF** | Global Legal Entity Identifier Foundation — the data source used for automated O-2 legal entity verification |
| **Internet-Draft** | `draft-rehfeld-bot-service-index-00` — the IETF submission that formalises the APIX specification |
| **PoC** | Proof of Concept — the MVP deployment described in this document |
| **STF** | Sovereign Tech Fund — the German federal funding body; primary target for APIX infrastructure funding |
| **BSF** | Bot Standards Foundation — the Swiss Stiftung that governs the APIX standard and operates the reference index |
| **UPSERT** | Insert-or-update DB operation — used for re-registration: same endpoint URL updates existing record rather than creating a duplicate |