# Playbook: Service Go-Live Review

Use this playbook before exposing any service to external access through Nginx Proxy Manager (NPM).
When invoked, read the project directory in the current working directory and work through each section as an interactive checklist.

---

## How to Use

Tell the AI: _"Use the service-golive playbook to review this project: https://git.chns.tech/CHNS/AI/raw/branch/main/playbooks/service-golive.md"_

The AI will:
1. Read the project files in the current directory
2. Work through each section below
3. For each item, report PASS, FAIL, or WARN with specific findings
4. At the end, give a go/no-go recommendation

Do not proceed to the next section until the current one is resolved or explicitly deferred.

---

## Section 1: Feature & Improvement Review

Goal: Catch missing functionality before users find it.

- [ ] Does the service have a health check endpoint (e.g. `/health` or `/ping`)?
- [ ] Are all intended routes/endpoints implemented and reachable?
- [ ] Is there a meaningful error response for bad input (not raw stack traces)?
- [ ] Is the UI (if applicable) free of obvious UX gaps and incomplete flows?
- [ ] Is there logging in place to capture errors and key events?
- [ ] Is the code free of TODO/FIXME/HACK comments that indicate unfinished work?
- [ ] Does the service handle its own startup failures gracefully (exits cleanly, logs reason)?

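The health check item above needs very little code. The following is a minimal sketch using only the Python standard library, assuming the service is written in Python. The `/health` path comes from the checklist; `health_payload`, `HealthHandler`, and the `db_ok` flag are illustrative names, not part of any existing project.

```python
import json
from http.server import BaseHTTPRequestHandler


def health_payload(db_ok: bool) -> tuple[int, bytes]:
    """Return (status code, JSON body) for the health endpoint."""
    status = 200 if db_ok else 503
    body = json.dumps({"status": "ok" if db_ok else "degraded"}).encode("utf-8")
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # Assumption: a real service would probe its dependencies here
        # instead of hardcoding db_ok=True.
        status, body = health_payload(db_ok=True)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8000), HealthHandler).serve_forever()` from `http.server` would serve the endpoint. The port is an arbitrary choice.
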
### 1a. Ntfy Admin Notifications

Goal: Ensure the super admin is alerted to significant events without having to monitor logs manually.

- [ ] Is Ntfy (or equivalent push notification system) integrated into the application?
- [ ] Are admin-relevant events triggering Ntfy notifications?

**If Ntfy is NOT implemented**, flag as WARN and recommend the following events for notification coverage based on what the app does:

| Event | Severity | Why it matters |
|---|---|---|
| Successful admin login | High | Detect unauthorized admin access |
| Failed admin login (threshold reached) | High | Brute-force indicator |
| New user registration | Medium | Visibility into who is joining |
| User account deletion | Medium | Audit trail for removals |
| Role/permission escalation | High | Privilege change could indicate compromise |
| Password reset requested | Medium | Could indicate account takeover attempt |
| Rate limit triggered | Medium | Abuse or misconfigured client |
| API key created or revoked | High | Credential lifecycle event |
| Service startup / crash recovery | Medium | Unexpected restarts need awareness |
| High error rate (e.g. 5xx spike) | High | App health degrading in production |
| Large data export initiated | Medium | Data exfiltration risk indicator |
| Config or environment change detected | High | Unplanned changes should be visible |

**AI Action:** Search the codebase for Ntfy integration (look for `ntfy`, `ntfy.sh`, or HTTP POST calls to a notification endpoint). If none found, list the above recommended events as WARN items and ask the user whether to implement before go-live or defer.

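If the integration is missing, one of the events above could be published roughly as follows. This is a sketch, assuming an Ntfy server that accepts a plain HTTP POST to a topic URL with `Title` and `Priority` headers. The server URL, topic name, and both function names are hypothetical placeholders.

```python
import urllib.request


def build_ntfy_request(base_url: str, topic: str, message: str,
                       title: str, priority: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes one message."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title, "Priority": priority},
        method="POST",
    )


def notify_admin(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical server and topic; real values belong in configuration, not code.
req = build_ntfy_request(
    "https://ntfy.example.internal", "admin-alerts",
    "Failed admin login threshold reached",
    title="Brute-force indicator", priority="high",
)
```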
---

**AI Action:** List any gaps found with file and line references. Ask the user whether to fix now or defer.

---

## Section 2: Performance Review

Goal: Ensure the service won't collapse under real load.

- [ ] Are database queries using indexes on columns used in WHERE/JOIN/ORDER BY clauses?
- [ ] Is the code free of N+1 query patterns (a loop that fires a query per item)?
- [ ] Is connection pooling configured for the database?
- [ ] Are large responses paginated?
- [ ] Are blocking operations (file I/O, external API calls) kept from running synchronously in an async context?
- [ ] Are static assets (if any) being served through Nginx, not the app?
- [ ] Is memory use bounded, with no unbounded data loads (e.g. `SELECT *` with no `LIMIT`)?
- [ ] Are background tasks or scheduled jobs using a proper queue/worker model (not threading hacks)?
- [ ] Is Gzip/Brotli compression enabled in Nginx for text responses?

**AI Action:** Flag any issues with specific file references. Suggest fixes. Ask user to confirm or defer.

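For reviewers unsure what an N+1 pattern looks like in practice, the sketch below contrasts it with a single paginated JOIN. It uses an in-memory SQLite database. The `authors`/`posts` schema and both function names are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    CREATE INDEX idx_posts_author ON posts(author_id);
    INSERT INTO authors VALUES (1, 'ana'), (2, 'ben');
    INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")


def titles_n_plus_one(conn):
    """Anti-pattern: one query for the authors, then one more per author."""
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(conn, limit=100, offset=0):
    """One JOIN, bounded by LIMIT/OFFSET so the response stays paginated."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM posts AS p JOIN authors AS a ON a.id = p.author_id
        ORDER BY a.id, p.id
        LIMIT ? OFFSET ?
        """,
        (limit, offset),
    )
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result
```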
---

## Section 3: Security Audit

Goal: Do not put a vulnerable service on the internet. Be thorough.

### 3a. Secrets & Credentials
- [ ] No hardcoded passwords, tokens, API keys, or secrets in any source file
- [ ] `.env` file is in `.gitignore` and not committed
- [ ] `.env.example` exists with placeholder values only
- [ ] No secrets in Docker Compose files (use `env_file` or environment variable references, not literal values)
- [ ] No secrets in Nginx config files

### 3b. Authentication & Authorization
- [ ] All non-public endpoints require authentication
- [ ] Authentication tokens/sessions have an expiry
- [ ] Password hashing uses bcrypt, argon2, or scrypt, not MD5/SHA1
- [ ] There is no default admin password that ships with the service
- [ ] Role/permission checks exist if the app has multiple access levels
- [ ] Failed login attempts are rate-limited or account-locked after N failures

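As a reference point for the password hashing item, here is a minimal sketch using scrypt from the Python standard library (`hashlib.scrypt`, available when Python is built against a suitable OpenSSL). The cost parameters and function names are illustrative assumptions, not project requirements.

```python
import hashlib
import hmac
import os

# Assumed cost parameters for illustration; tune them for the target hardware.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); store both alongside the user record."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, expected)
```

A project that already depends on a bcrypt or argon2 library satisfies the same check. The point is a salted, deliberately slow hash rather than a bare MD5/SHA1 digest.
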
### 3c. Input Validation & Injection
- [ ] All user input is validated server-side (not just client-side)
- [ ] SQL queries use parameterized statements or ORM, with no string concatenation
- [ ] File upload paths are sanitized, with no path traversal possible
- [ ] HTML output is escaped to prevent XSS (or a framework handles this automatically)
- [ ] Redirects only go to allowed/relative URLs, with no open redirect
- [ ] JSON deserialization does not allow arbitrary object instantiation

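Two of the items above, parameterized SQL and upload path sanitization, can be illustrated with the standard library alone. This is a sketch under assumed names: the `users` table, `find_user`, and `safe_upload_path` are hypothetical, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
import sqlite3
from pathlib import Path


def find_user(conn: sqlite3.Connection, username: str):
    # The driver binds the value; it is never spliced into the SQL text.
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchone()


def safe_upload_path(upload_root: Path, filename: str) -> Path:
    """Resolve the target and refuse anything that escapes upload_root."""
    root = upload_root.resolve()
    target = (root / filename).resolve()
    if not target.is_relative_to(root):
        raise ValueError(f"path traversal attempt: {filename!r}")
    return target


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES ('alice')")
```

With string concatenation, an input such as `alice' OR '1'='1` would change the query's meaning. With a bound parameter it is compared as a literal username and matches nothing.
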
### 3d. HTTP & Nginx Security Headers
Verify the Nginx config for the proxy host includes:
- [ ] `X-Frame-Options: DENY` or `SAMEORIGIN`
- [ ] `X-Content-Type-Options: nosniff`
- [ ] `X-XSS-Protection: 0`, or the header omitted (the legacy value `1; mode=block` targets a browser XSS auditor that modern browsers have removed; rely on `Content-Security-Policy` instead)
- [ ] `Referrer-Policy: strict-origin-when-cross-origin`
- [ ] `Content-Security-Policy` header defined (even if broad to start)
- [ ] `Strict-Transport-Security` (HSTS) with `max-age` >= 31536000
- [ ] Server version header suppressed (`server_tokens off`)
- [ ] Unnecessary HTTP methods disabled (e.g. TRACE, DELETE if not used)

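Besides reading the Nginx config, the headers can be verified against what the proxy actually returns. The sketch below audits a dictionary of response headers for a subset of the items above. `REQUIRED_HEADERS` and `audit_headers` are illustrative names, and the FAIL/WARN split is an assumption about how strictly each item should be judged.

```python
import re

REQUIRED_HEADERS = {
    "x-frame-options": re.compile(r"^(DENY|SAMEORIGIN)$", re.IGNORECASE),
    "x-content-type-options": re.compile(r"^nosniff$", re.IGNORECASE),
    "referrer-policy": re.compile(r"^strict-origin-when-cross-origin$", re.IGNORECASE),
    "content-security-policy": re.compile(r".+"),
}
MIN_HSTS_MAX_AGE = 31536000  # one year, the threshold named in the checklist


def audit_headers(headers: dict[str, str]) -> list[str]:
    """Return a list of findings; an empty list means every check passed."""
    seen = {name.lower(): value.strip() for name, value in headers.items()}
    findings = []
    for name, pattern in REQUIRED_HEADERS.items():
        if name not in seen:
            findings.append(f"FAIL: {name} missing")
        elif not pattern.match(seen[name]):
            findings.append(f"WARN: {name} has unexpected value {seen[name]!r}")
    hsts = seen.get("strict-transport-security", "")
    match = re.search(r"max-age=(\d+)", hsts, re.IGNORECASE)
    if match is None or int(match.group(1)) < MIN_HSTS_MAX_AGE:
        findings.append("FAIL: strict-transport-security missing or max-age too low")
    return findings
```

The input could come from a live request, for example `dict(urllib.request.urlopen(url).headers.items())`, where `url` is the public address of the proxy host.
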
### 3e. TLS / HTTPS
- [ ] TLS certificate is valid and not self-signed for production
- [ ] HTTP traffic redirects to HTTPS (not served in parallel)
- [ ] TLS 1.0 and 1.1 disabled, with only TLS 1.2+ allowed
- [ ] Weak cipher suites disabled
- [ ] Certificate expiry is monitored (NPM auto-renews, but verify it's configured)

### 3f. Docker & Container Security
- [ ] Containers do not run as root (check `user:` in Compose or Dockerfile `USER` instruction)
- [ ] No container has `privileged: true` unless there is a documented reason
- [ ] No unnecessary host volume mounts (especially `/var/run/docker.sock` unless intentional)
- [ ] Container images are not using `latest` tag in production
- [ ] Docker socket is not exposed to the external network
- [ ] Resource limits (`mem_limit`, `cpus`) are set on containers

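The `latest` tag item can be checked mechanically once the image references have been collected from the Compose file or Dockerfile. The helper below is a small sketch with a hypothetical name. It treats an untagged reference as floating, since Docker resolves a missing tag to `latest`, and treats digest-pinned references as fixed.

```python
def uses_floating_tag(image: str) -> bool:
    """Return True if the image reference resolves to the `latest` tag."""
    if "@" in image:
        # Pinned by digest, e.g. app@sha256:...
        return False
    # Only the part after the last "/" can carry a tag; an earlier colon
    # belongs to a registry port such as registry.local:5000.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        return True
    return name.rsplit(":", 1)[1] == "latest"
```

For example, `uses_floating_tag("registry.local:5000/app")` is true because no tag is given, while `uses_floating_tag("registry.local:5000/app:2.1")` is false.
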
**AI Action:** Run the following tools if available:
- `bandit -r . -ll`: Python static security analysis
- `trivy fs . --severity HIGH,CRITICAL`: dependency and filesystem CVE scan
- `docker scout cves <image>`: container image vulnerability scan

Report all FAIL/WARN findings. Do not proceed to the go-live recommendation until critical issues are resolved.

### 3g. Network & Exposure
- [ ] Only ports 80 and 443 are exposed publicly; no app ports (e.g. 8000, 3000) are directly open to the internet
- [ ] NPM proxy host has access list or basic auth if the service is internal-only
- [ ] Rate limiting is configured in Nginx or the app for API endpoints
- [ ] The service does not expose an admin panel (e.g. `/admin`, `/dashboard`) without additional auth
- [ ] Database ports (3306, 5432, 6379) are NOT exposed beyond the Docker network
- [ ] SSH is not running inside any container

### 3h. Dependency & Supply Chain
- [ ] Dependencies are pinned to specific versions (not `*` or `latest`)
- [ ] No known CVEs in dependencies (run `trivy fs .` or `pip-audit` / `npm audit`)
- [ ] No abandoned or unmaintained packages with known issues
- [ ] Docker base images are from official/verified sources

---

## Section 4: Go-Live Decision

After all sections are complete:

- List all unresolved findings grouped by severity: **CRITICAL / HIGH / MEDIUM / LOW**
- **CRITICAL or HIGH unresolved = NO GO.** These must be fixed before external access.
- **MEDIUM/LOW unresolved** = user decides whether to defer with documented acceptance
- Provide a final summary:
  - Total checks: X
  - Passed: X
  - Failed (critical): X
  - Failed (non-critical): X
  - Deferred: X
  - **Recommendation: GO / NO GO / GO WITH CONDITIONS**

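The decision rule above can be expressed as a small function. This is a sketch, not a prescribed implementation: the `Finding` type and function names are hypothetical, and it assumes that any unresolved MEDIUM/LOW finding has been deferred by the user with documented acceptance, which is what maps it to GO WITH CONDITIONS.

```python
from collections import Counter
from dataclasses import dataclass

BLOCKING = {"CRITICAL", "HIGH"}


@dataclass
class Finding:
    check: str
    severity: str  # CRITICAL, HIGH, MEDIUM, or LOW
    resolved: bool = False


def unresolved_by_severity(findings: list[Finding]) -> Counter:
    """Count unresolved findings per severity for the final summary."""
    return Counter(f.severity for f in findings if not f.resolved)


def recommend(findings: list[Finding]) -> str:
    """Apply the rule: any unresolved CRITICAL or HIGH finding blocks go-live."""
    open_counts = unresolved_by_severity(findings)
    if any(open_counts[severity] for severity in BLOCKING):
        return "NO GO"
    if sum(open_counts.values()) > 0:
        # Assumption: remaining MEDIUM/LOW items were explicitly deferred.
        return "GO WITH CONDITIONS"
    return "GO"
```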