diff --git a/awx-deployment-troubleshooting.md b/awx-deployment-troubleshooting.md
new file mode 100644
index 0000000..1d32e07
--- /dev/null
+++ b/awx-deployment-troubleshooting.md
@@ -0,0 +1,195 @@
+# AWX Operator Deployment Troubleshooting Guide
+
+## Environment
+
+- **AWX Operator Version:** 2.19.1
+- **AWX Version:** 24.6.1
+- **Platform:** k3s
+- **Storage Provisioner:** Longhorn
+
+---
+
+## Issue 1: Database Migration Check Fails
+
+### Symptom
+
+The operator fails at the `Check for pending migrations` task with:
+
+```
+ValueError: invalid literal for int() with base 10: 'error executing command in container:
+failed to exec in container: failed to create exec ...: task ...'
+```
+
+The `awx-task` deployment shows `unavailableReplicas: 1`.
+
+### Root Cause
+
+The operator attempts to `kubectl exec` into the `awx-task` container to run `awx-manage showmigrations`, but the container isn't running. The `init-database` init container is stuck because it cannot connect to PostgreSQL.
+
+### Resolution
+
+Fix the underlying PostgreSQL issue (see Issues 2-4 below). Once PostgreSQL is healthy, the operator succeeds on its next reconciliation loop.
+
+---
+
+## Issue 2: PostgreSQL Pod Not Created (Missing StatefulSet)
+
+### Symptom
+
+No PostgreSQL StatefulSet or pod exists in the `awx` namespace. The operator doesn't attempt to create one.
+
+### Root Cause
+
+The `awx-postgres-configuration` secret existed but had an empty/unset `host` value. The operator saw the secret, assumed an external database was configured, and skipped creating the managed PostgreSQL StatefulSet.
+
+### Resolution
+
+Delete the broken secret and let the operator recreate it with correct managed database values:
+
+```bash
+kubectl delete secret -n awx awx-postgres-configuration
+kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
+```
+
+The operator will regenerate the secret with `host: awx-postgres-15` and create the StatefulSet.
+
+---
+
+## Issue 3: Orphaned PVC Blocking Operator Progress
+
+### Symptom
+
+The operator reconciliation loop fails or hangs. A previously deleted PVC left the operator in a bad state.
+
+### Root Cause
+
+Deleting a PVC that the operator's managed StatefulSet depends on breaks the expected state. The operator may not recover automatically.
+
+### Resolution
+
+Clean up all related resources and let the operator rebuild:
+
+```bash
+kubectl delete statefulset -n awx awx-postgres-15
+kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
+kubectl delete secret -n awx awx-postgres-configuration
+kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
+```
+
+---
+
+## Issue 4: PostgreSQL Permission Denied on Data Directory
+
+### Symptom
+
+The PostgreSQL pod fails to start with:
+
+```
+mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
+```
+
+### Root Cause
+
+Longhorn provisions volumes mounted as root with restrictive permissions. The `fsGroupChangePolicy: OnRootMismatch` setting doesn't trigger a recursive chown because the volume root directory appears correctly owned, but subdirectory creation by the postgres user (UID 26) still fails.
+
+### Resolution
+
+**Option A - Fix fsGroupChangePolicy (try first):**
+
+In the AWX CR, set `fsGroupChangePolicy: Always` to force Kubernetes to recursively apply ownership before the container starts:
+
+```yaml
+spec:
+  postgres_storage_class: longhorn
+  postgres_security_context:
+    runAsUser: 0
+    runAsGroup: 0
+    fsGroup: 0
+    fsGroupChangePolicy: Always
+```
+
+Then delete the StatefulSet and PVC, and re-apply the CR so the operator recreates them:
+
+```bash
+kubectl delete statefulset -n awx awx-postgres-15
+kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
+kubectl apply -f awx.yaml
+```
+
+**Option B - Patch StatefulSet with init container (if Option A fails):**
+
+After the operator creates the StatefulSet, patch it to add a permissions-fixing init container:
+
+```bash
+kubectl patch statefulset awx-postgres-15 -n awx --type=json \
+  -p='[{"op":"add","path":"/spec/template/spec/initContainers","value":[{"name":"fix-perms","image":"busybox","command":["sh","-c","chown -R 26:26 /var/lib/pgsql/data && chmod 700 /var/lib/pgsql/data"],"volumeMounts":[{"name":"postgres-15","mountPath":"/var/lib/pgsql/data"}],"securityContext":{"runAsUser":0}}]}]'
+```
+
+Then restart the PostgreSQL pod:
+
+```bash
+kubectl delete pod -n awx -l app.kubernetes.io/name=awx-postgres-15
+```
+
+> **Note:** The operator may revert this patch on the next reconciliation. If so, Option A or switching to a StorageClass that respects fsGroup natively is the long-term fix.
+
+---
+
+## Key Differences: security_context_settings vs postgres_security_context
+
+| CR Field | Applies To |
+|----------|-----------|
+| `security_context_settings` | AWX web and task pods |
+| `postgres_security_context` | Managed PostgreSQL pod |
+
+These are independent. Setting one does not affect the other.
+
+---
+
+## Useful Diagnostic Commands
+
+```bash
+# Check all AWX resources
+kubectl get all -n awx
+
+# Check PVC status
+kubectl get pvc -n awx
+
+# Check postgres secret configuration
+kubectl get secret -n awx awx-postgres-configuration -o jsonpath="{.data.host}" | base64 -d
+
+# Watch operator logs
+kubectl logs -n awx deployment/awx-operator-controller-manager -f --tail=50
+
+# Check postgres pod logs
+kubectl logs -n awx -l app.kubernetes.io/name=awx-postgres-15
+
+# Force operator re-reconciliation
+kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
+```
+
+---
+
+## Full Recovery Procedure (Nuclear Option)
+
+If the deployment is in a completely broken state, reset everything and let the operator rebuild from scratch:
+
+```bash
+# Delete all managed resources
+kubectl delete deployment -n awx awx-task awx-web
+kubectl delete statefulset -n awx awx-postgres-15
+kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
+kubectl delete secret -n awx awx-postgres-configuration
+kubectl delete secret -n awx awx-app-credentials
+kubectl delete secret -n awx awx-admin-password
+kubectl delete secret -n awx awx-broadcast-websocket
+kubectl delete secret -n awx awx-receptor-ca
+kubectl delete secret -n awx awx-receptor-work-signing
+
+# Restart the operator
+kubectl rollout restart deployment -n awx awx-operator-controller-manager
+
+# The operator will recreate everything from the AWX CR
+```
+
+> **Warning:** This deletes all AWX state including admin passwords and database data. Only use if you have no data to preserve or have a backup.