Add service go-live playbook to run against services before they go live.

2026-03-19 23:41:30 -07:00
parent 9b13f5a5b7
commit c52a7e1abe
1 changed file with 145 additions and 0 deletions
--- a/playbooks/service-golive.md
+++ b/playbooks/service-golive.md
@@ -0,0 +1,145 @@
+# Playbook: Service Go-Live Review
+
+Use this playbook before exposing any service to external access through Nginx Proxy Manager (NPM).
+When invoked, read the project in the current working directory and work through each section as an interactive checklist.
+
+---
+
+## How to Use
+
+Tell the AI: _"Use the service-golive playbook to review this project"_
+
+The AI will:
+1. Read the project files in the current directory
+2. Work through each section below
+3. For each item, report PASS, FAIL, or WARN with specific findings
+4. At the end, give a go/no-go recommendation
+
+Do not proceed to the next section until the current one is resolved or explicitly deferred.
+
+---
+
+## Section 1: Feature & Improvement Review
+
+Goal: Catch missing functionality before users find it.
+
+- [ ] Does the service have a health check endpoint (e.g. `/health` or `/ping`)?
+- [ ] Are all intended routes/endpoints implemented and reachable?
+- [ ] Is there a meaningful error response for bad input (not raw stack traces)?
+- [ ] Is the UI free of obvious UX gaps or incomplete flows (if applicable)?
+- [ ] Is there logging in place to capture errors and key events?
+- [ ] Is the code free of TODO/FIXME/HACK comments that indicate unfinished work?
+- [ ] Does the service handle its own startup failures gracefully (exits cleanly, logs reason)?
+
+**AI Action:** List any gaps found with file and line references. Ask the user whether to fix now or defer.
+
+---
+
+## Section 2: Performance Review
+
+Goal: Ensure the service won't collapse under real load.
+
+- [ ] Are database queries using indexes on columns used in WHERE/JOIN/ORDER BY clauses?
+- [ ] Is the code free of N+1 query patterns (a loop that fires a query per item)?
+- [ ] Is connection pooling configured for the database?
+- [ ] Are large responses paginated?
+- [ ] Are blocking operations (file I/O, external API calls) kept from running synchronously in an async context?
+- [ ] Are static assets (if any) being served through Nginx, not the app?
+- [ ] Is all data loaded into memory bounded (e.g. no `SELECT *` without a limit)?
+- [ ] Are background tasks or scheduled jobs using a proper queue/worker model (not threading hacks)?
+- [ ] Is Gzip/Brotli compression enabled in Nginx for text responses?
+
+**AI Action:** Flag any issues with specific file references. Suggest fixes. Ask user to confirm or defer.
+
+---
+
+## Section 3: Security Audit
+
+Goal: Do not put a vulnerable service on the internet. Be thorough.
+
+### 3a. Secrets & Credentials
+- [ ] No hardcoded passwords, tokens, API keys, or secrets in any source file
+- [ ] `.env` file is in `.gitignore` and not committed
+- [ ] `.env.example` exists with placeholder values only
+- [ ] No secrets in Docker Compose files (use `env_file` or environment variable references, not literal values)
+- [ ] No secrets in Nginx config files
+
+### 3b. Authentication & Authorization
+- [ ] All non-public endpoints require authentication
+- [ ] Authentication tokens/sessions have an expiry
+- [ ] Password hashing uses bcrypt, argon2, or scrypt — not MD5/SHA1
+- [ ] There is no default admin password that ships with the service
+- [ ] Role/permission checks exist if the app has multiple access levels
+- [ ] Failed login attempts are rate-limited or account-locked after N failures
+
+### 3c. Input Validation & Injection
+- [ ] All user input is validated server-side (not just client-side)
+- [ ] SQL queries use parameterized statements or ORM — no string concatenation
+- [ ] File upload paths are sanitized — no path traversal possible
+- [ ] HTML output is escaped to prevent XSS (or a framework handles this automatically)
+- [ ] Redirects only go to allowed/relative URLs — no open redirect
+- [ ] JSON deserialization does not allow arbitrary object instantiation
+
+### 3d. HTTP & Nginx Security Headers
+Verify the Nginx config for the proxy host includes:
+- [ ] `X-Frame-Options: DENY` or `SAMEORIGIN`
+- [ ] `X-Content-Type-Options: nosniff`
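+- [ ] `X-XSS-Protection: 0`, or the header omitted (the legacy browser XSS filter is deprecated; rely on `Content-Security-Policy` instead)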
+- [ ] `Referrer-Policy: strict-origin-when-cross-origin`
+- [ ] `Content-Security-Policy` header defined (even if broad to start)
+- [ ] `Strict-Transport-Security` (HSTS) with `max-age` >= 31536000
+- [ ] Server version header suppressed (`server_tokens off`)
+- [ ] Unnecessary HTTP methods disabled (e.g. TRACE, DELETE if not used)
+
+### 3e. TLS / HTTPS
+- [ ] TLS certificate is valid and not self-signed for production
+- [ ] HTTP traffic redirects to HTTPS (not served in parallel)
+- [ ] TLS 1.0 and 1.1 disabled — only TLS 1.2+ allowed
+- [ ] Weak cipher suites disabled
+- [ ] Certificate expiry is monitored (NPM auto-renews, but verify it's configured)
+
+### 3f. Docker & Container Security
+- [ ] Containers do not run as root (check `user:` in Compose or Dockerfile `USER` instruction)
+- [ ] No container has `privileged: true` unless there is a documented reason
+- [ ] No unnecessary host volume mounts (especially `/var/run/docker.sock` unless intentional)
+- [ ] Container images are not using `latest` tag in production
+- [ ] Docker socket is not exposed to the external network
+- [ ] Resource limits (`mem_limit`, `cpus`) are set on containers
+
+### 3g. Network & Exposure
+- [ ] Only ports 80 and 443 are exposed publicly; no app ports (e.g. 8000, 3000) are directly open to the internet
+- [ ] NPM proxy host has an access list or basic auth if the service is internal-only
+- [ ] Rate limiting is configured in Nginx or the app for API endpoints
+- [ ] The service does not expose an admin panel (e.g. `/admin`, `/dashboard`) without additional auth
+- [ ] Database ports (3306, 5432, 6379) are NOT exposed beyond the Docker network
+- [ ] SSH is not running inside any container
+
+### 3h. Dependency & Supply Chain
+- [ ] Dependencies are pinned to specific versions (not `*` or `latest`)
+- [ ] No known CVEs in dependencies (run `trivy fs .` or `pip-audit` / `npm audit`)
+- [ ] No abandoned or unmaintained packages with known issues
+- [ ] Docker base images are from official/verified sources
+
+**AI Action:** Run the following tools if available:
+- `bandit -r . -ll`: Python static security analysis
+- `trivy fs . --severity HIGH,CRITICAL`: dependency and filesystem CVE scan
+- `docker scout cves <image>`: container image vulnerability scan
+
+Report all FAIL/WARN findings. Do not proceed to the go-live recommendation until critical issues are resolved.
+
+---
+
+## Section 4: Go-Live Decision
+
+After all sections are complete:
+
+- List all unresolved FAIL/WARN findings grouped by severity: **CRITICAL / HIGH / MEDIUM / LOW**
+- **CRITICAL or HIGH unresolved = NO GO.** These must be fixed before external access.
+- **MEDIUM/LOW unresolved** = user decides whether to defer with documented acceptance
+- Provide a final summary:
+  - Total checks: X
+  - Passed: X
+  - Failed (critical): X
+  - Failed (non-critical): X
+  - Deferred: X
+  - **Recommendation: GO / NO GO / GO WITH CONDITIONS**
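
The Section 4 decision rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Finding` class, its fields, and the function names are hypothetical and not part of the playbook, and mapping unresolved MEDIUM/LOW findings to "GO WITH CONDITIONS" is one reading of "user decides whether to defer with documented acceptance".

```python
from dataclasses import dataclass

# Severity labels are taken from the playbook. Everything else here
# (Finding, recommend, group_unresolved) is a hypothetical illustration.
SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")
BLOCKING = {"CRITICAL", "HIGH"}


@dataclass(frozen=True)
class Finding:
    check: str              # short name of the checklist item
    severity: str           # one of SEVERITIES
    resolved: bool = False  # True once the issue has been fixed

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")


def group_unresolved(findings):
    """Group the names of unresolved findings by severity."""
    grouped = {sev: [] for sev in SEVERITIES}
    for f in findings:
        if not f.resolved:
            grouped[f.severity].append(f.check)
    return grouped


def recommend(findings):
    """Return "GO", "NO GO", or "GO WITH CONDITIONS"."""
    unresolved = [f for f in findings if not f.resolved]
    # Any unresolved CRITICAL or HIGH finding blocks external access.
    if any(f.severity in BLOCKING for f in unresolved):
        return "NO GO"
    # Assumption: remaining MEDIUM/LOW findings are deferred by the user
    # with documented acceptance, which becomes the "condition".
    if unresolved:
        return "GO WITH CONDITIONS"
    return "GO"


if __name__ == "__main__":
    review = [
        Finding("HSTS header", "HIGH"),
        Finding("Pinned dependencies", "LOW"),
        Finding("Health check endpoint", "MEDIUM", resolved=True),
    ]
    print(group_unresolved(review))
    print("Recommendation:", recommend(review))
```

Keeping the blocking severities in a single set makes the NO GO threshold easy to audit and to tighten later, for example by adding MEDIUM for internet-facing services.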