# AWX Operator Deployment Troubleshooting Guide

## Environment

- **AWX Operator Version:** 2.19.1
- **AWX Version:** 24.6.1
- **Platform:** k3s
- **Storage Provisioner:** Longhorn

---

## Issue 1: Database Migration Check Fails

### Symptom

The operator fails at the `Check for pending migrations` task with:

```
ValueError: invalid literal for int() with base 10: 'error executing command in container: failed to exec in container: failed to create exec ...: task ...'
```

The `awx-task` deployment shows `unavailableReplicas: 1`.

### Root Cause

The operator attempts to `kubectl exec` into the `awx-task` container to run `awx-manage showmigrations`, but the container isn't running. The `init-database` init container is stuck because it cannot connect to PostgreSQL.

### Resolution

Fix the underlying PostgreSQL issue (see Issues 2-4 below). Once PostgreSQL is healthy, the operator will succeed on its next reconciliation loop.

---

## Issue 2: PostgreSQL Pod Not Created (Missing StatefulSet)

### Symptom

No postgres StatefulSet or pod exists in the `awx` namespace. The operator doesn't attempt to create one.

### Root Cause

The `awx-postgres-configuration` secret existed but had an empty/unset `host` value. The operator saw the secret, assumed an external database was configured, and skipped creating the managed PostgreSQL StatefulSet.

### Resolution

Delete the broken secret and let the operator recreate it with correct managed database values:

```bash
kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=now
```

The operator will regenerate the secret with `host: awx-postgres-15` and create the StatefulSet.

---

## Issue 3: Orphaned PVC Blocking Operator Progress

### Symptom

The operator reconciliation loop fails or hangs. A previously deleted PVC left the operator in a bad state.

### Root Cause

Deleting a PVC that the operator's managed StatefulSet depends on breaks the expected state. The operator may not recover automatically.

### Resolution

Clean up all related resources and let the operator rebuild:

```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=now
```

---

## Issue 4: PostgreSQL Permission Denied on Data Directory

### Symptom

The postgres pod fails to start with:

```
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
```

### Root Cause

Longhorn provisions volumes mounted as root with restrictive permissions. The `fsGroupChangePolicy: OnRootMismatch` setting doesn't trigger a recursive chown because the volume root directory appears correctly owned, yet subdirectory creation by the postgres user (UID 26) still fails.

### Resolution

**Option A: Fix fsGroupChangePolicy (try first)**

In the AWX CR, set `fsGroupChangePolicy: Always` to force Kubernetes to recursively apply ownership before the container starts:

```yaml
spec:
  postgres_storage_class: longhorn
  postgres_security_context:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0
    fsGroupChangePolicy: Always
```

Then delete the StatefulSet and PVC and let the operator recreate them:

```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl apply -f awx.yaml
```

**Option B: Patch StatefulSet with init container (if Option A fails)**

After the operator creates the StatefulSet, patch it to add a permissions-fixing init container:

```bash
kubectl patch statefulset awx-postgres-15 -n awx --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/initContainers","value":[{"name":"fix-perms","image":"busybox","command":["sh","-c","chown -R 26:26 /var/lib/pgsql/data && chmod 700 /var/lib/pgsql/data"],"volumeMounts":[{"name":"postgres-15","mountPath":"/var/lib/pgsql/data"}],"securityContext":{"runAsUser":0}}]}]'
```

Then restart the postgres pod:

```bash
kubectl delete pod -n awx -l app.kubernetes.io/name=awx-postgres-15
```

> **Note:** The operator may revert this patch on the next reconciliation. If so, Option A or switching to a StorageClass that respects fsGroup natively is the long-term fix.

---

## Key Differences: security_context_settings vs postgres_security_context

| CR Field | Applies To |
|----------|-----------|
| `security_context_settings` | AWX web and task pods |
| `postgres_security_context` | Managed PostgreSQL pod |

These are independent. Setting one does not affect the other.

---

## Useful Diagnostic Commands

```bash
# Check all AWX resources
kubectl get all -n awx

# Check PVC status
kubectl get pvc -n awx

# Check postgres secret configuration
kubectl get secret -n awx awx-postgres-configuration -o jsonpath="{.data.host}" | base64 -d

# Watch operator logs
kubectl logs -n awx deployment/awx-operator-controller-manager -f --tail=50

# Check postgres pod logs
kubectl logs -n awx -l app.kubernetes.io/name=awx-postgres-15

# Force operator re-reconciliation
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```

---

## Full Recovery Procedure (Nuclear Option)

If the deployment is in a completely broken state, reset everything and let the operator rebuild from scratch:

```bash
# Delete all managed resources
kubectl delete deployment -n awx awx-task awx-web
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl delete secret -n awx awx-app-credentials
kubectl delete secret -n awx awx-admin-password
kubectl delete secret -n awx awx-broadcast-websocket
kubectl delete secret -n awx awx-receptor-ca
kubectl delete secret -n awx awx-receptor-work-signing

# Restart the operator
kubectl rollout restart deployment -n awx awx-operator-controller-manager

# The operator will recreate everything from the AWX CR
```

> **Warning:** This deletes all AWX state including admin passwords and database data.
> Only use this if you have no data to preserve or you have a backup.
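
## Sketch: Classifying the Secret's `host` Value

The empty-`host` condition from Issue 2 is easy to miss when reading raw base64 output, so a small wrapper around the secret check may help. The following is a minimal sketch, not part of the operator or of AWX: the function name `classify_pg_host` and its four output labels are hypothetical, and it assumes that `awx-postgres-15` is the managed host name, as stated in Issue 2. It takes the still-encoded value of `.data.host` and reports whether the secret looks broken, managed, or external.

```shell
#!/bin/sh
# Hypothetical helper (not an AWX or operator command).
# $1: base64-encoded value of .data.host from awx-postgres-configuration.
# Prints one of: missing, invalid, managed, external.
classify_pg_host() {
  encoded="$1"

  # jsonpath prints nothing when the key is absent or empty.
  if [ -z "$encoded" ]; then
    echo "missing"
    return 0
  fi

  # Decode; treat undecodable input as a separate case.
  decoded=$(printf '%s' "$encoded" | base64 -d 2>/dev/null) || {
    echo "invalid"
    return 0
  }

  case "$decoded" in
    "")              echo "missing" ;;   # key present but empty
    awx-postgres-15) echo "managed" ;;   # assumed managed host name (see Issue 2)
    *)               echo "external" ;;  # any other host
  esac
}

# Usage against a live cluster (requires kubectl access to the awx namespace):
#   classify_pg_host "$(kubectl get secret -n awx awx-postgres-configuration \
#     -o jsonpath='{.data.host}')"
```

A result of `missing` corresponds to the broken-secret case in Issue 2. A result of `external` with no external database actually configured would suggest the same cleanup.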