AWX Operator Deployment Troubleshooting Guide
Environment
- AWX Operator Version: 2.19.1
- AWX Version: 24.6.1
- Platform: k3s
- Storage Provisioner: Longhorn
Issue 1: Database Migration Check Fails
Symptom
The operator fails at the Check for pending migrations task with:
ValueError: invalid literal for int() with base 10: 'error executing command in container:
failed to exec in container: failed to create exec ...: task ...'
The awx-task deployment shows unavailableReplicas: 1.
Root Cause
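The operator attempts to kubectl exec into the awx-task container to run awx-manage showmigrations, but the container isn't running. The exec error text is returned in place of the expected migration count, and parsing it as an integer raises the ValueError. The container isn't running because the init-database init container is stuck: it cannot connect to PostgreSQL.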
Resolution
Fix the underlying PostgreSQL issue (see Issues 2-4 below). Once postgres is healthy, the operator will succeed on its next reconciliation loop.
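Before working through the PostgreSQL issues, it can help to confirm that awx-task really is stuck in its init phase. The sketch below interprets the STATUS column printed by kubectl get pods. The helper names pod_status_hint and awx_pod_hints are hypothetical, and the mapping is a deliberate simplification rather than a complete list of pod states.

```shell
# Map a STATUS value from `kubectl get pods` to a rough hint.
# pod_status_hint is a hypothetical helper, not part of kubectl or the operator.
pod_status_hint() {
  case "$1" in
    Init:*)            echo "init-pending" ;;  # e.g. Init:0/1, Init:CrashLoopBackOff
    Running|Completed) echo "ok" ;;
    *)                 echo "check-events" ;;  # anything else needs a closer look
  esac
}

# Print name, status and hint for every pod in a namespace (defaults to awx).
awx_pod_hints() {
  kubectl get pods -n "${1:-awx}" --no-headers |
    while read -r name _ready status _rest; do
      printf '%s\t%s\t%s\n' "$name" "$status" "$(pod_status_hint "$status")"
    done
}
```

Calling awx_pod_hints awx against the cluster and seeing init-pending next to the awx-task pod is consistent with the root cause described above.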
Issue 2: PostgreSQL Pod Not Created (Missing StatefulSet)
Symptom
No postgres StatefulSet or pod exists in the awx namespace. The operator doesn't attempt to create one.
Root Cause
The awx-postgres-configuration secret exists but has an empty or unset host value. The operator sees the secret, assumes an external database is configured, and skips creating the managed PostgreSQL StatefulSet.
Resolution
Delete the broken secret and let the operator recreate it with correct managed database values:
kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
The operator will regenerate the secret with host: awx-postgres-15 and create the StatefulSet.
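To tell a broken secret apart from a healthy one, the decoded host value can be classified before and after the fix. This is a minimal sketch: classify_pg_host and check_pg_secret are hypothetical helper names, and the managed host name awx-postgres-15 is taken from this guide rather than derived from the operator.

```shell
# classify_pg_host HOST: label the decoded `host` value of the secret.
# Hypothetical helper; "managed" matches only the name used in this guide.
classify_pg_host() {
  case "$1" in
    "")              echo "empty" ;;     # the broken state described above
    awx-postgres-15) echo "managed" ;;   # operator-managed PostgreSQL
    *)               echo "external" ;;  # some other database host
  esac
}

# Read the host from the live secret and classify it.
# A missing secret also yields an empty string, so it reports "empty".
check_pg_secret() {
  host=$(kubectl get secret -n awx awx-postgres-configuration \
           -o jsonpath="{.data.host}" | base64 -d)
  classify_pg_host "$host"
}
```

A result of empty from check_pg_secret matches the symptom in this issue; managed is the expected result once the operator has regenerated the secret.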
Issue 3: Orphaned PVC Blocking Operator Progress
Symptom
The operator reconciliation loop fails or hangs after a PVC used by the managed PostgreSQL StatefulSet was deleted, leaving the operator in a bad state.
Root Cause
Deleting a PVC that the operator's managed StatefulSet depends on breaks the expected state. The operator may not recover automatically.
Resolution
Clean up all related resources and let the operator rebuild:
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
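Because this cleanup is destructive, it may be worth printing the commands for review before running them. The sketch below only prints; it does not execute anything. pg_reset_plan is a hypothetical helper, and the naming pattern it generalizes (NAME-postgres-15 and so on) is an assumption based on the single instance named awx in this guide.

```shell
# pg_reset_plan [NAMESPACE] [NAME]: print the cleanup commands for review.
# Assumption: resource names follow the pattern seen in this guide for an
# AWX instance called "awx"; other instance names are untested.
pg_reset_plan() {
  ns="${1:-awx}"
  name="${2:-awx}"
  echo "kubectl delete statefulset -n $ns ${name}-postgres-15"
  echo "kubectl delete pvc -n $ns postgres-15-${name}-postgres-15-0"
  echo "kubectl delete secret -n $ns ${name}-postgres-configuration"
  # The timestamp is left unexpanded so it is evaluated when the plan is run.
  echo "kubectl annotate awx -n $ns $name --overwrite restartedAt=\$(date +%s)"
}
```

After checking the output of pg_reset_plan awx awx, the plan can be executed with pg_reset_plan awx awx | sh.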
Issue 4: PostgreSQL Permission Denied on Data Directory
Symptom
The postgres pod fails to start with:
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
Root Cause
Longhorn provisions volumes mounted as root with restrictive permissions. The fsGroupChangePolicy: OnRootMismatch setting doesn't trigger a recursive chown because the volume root directory appears correctly owned, yet subdirectory creation by the postgres user (UID 26) still fails.
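The difference between the two policies can be modeled in a few lines. This is a simplified sketch, not the kubelet's actual implementation: the real check also inspects permission bits on the volume root, whereas the hypothetical fsgroup_action helper below compares only the root directory's group with the requested fsGroup.

```shell
# fsgroup_action POLICY ROOT_GID FS_GROUP
# Simplified model of whether ownership is applied recursively on mount.
# Hypothetical helper; compares only the group of the volume root.
fsgroup_action() {
  policy="$1"; root_gid="$2"; fs_group="$3"
  if [ "$policy" = "Always" ]; then
    echo "recursive-chown"          # applied on every mount
  elif [ "$root_gid" != "$fs_group" ]; then
    echo "recursive-chown"          # OnRootMismatch and the root differs
  else
    echo "skip"                     # OnRootMismatch and the root already matches
  fi
}
```

In this model, fsgroup_action OnRootMismatch 0 0 prints skip, which corresponds to the situation described above: the root looks correct, so nothing below it is touched.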
Resolution
Option A — Fix fsGroupChangePolicy (try first):
In the AWX CR, set fsGroupChangePolicy: Always to force Kubernetes to recursively apply ownership before the container starts:
spec:
  postgres_storage_class: longhorn
  postgres_security_context:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0
    fsGroupChangePolicy: Always
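Then delete the StatefulSet and PVC so the operator recreates them with the new settings (deleting the PVC discards any existing database data):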
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl apply -f awx.yaml
Option B — Patch StatefulSet with init container (if Option A fails):
After the operator creates the StatefulSet, patch it to add a permissions-fixing init container:
kubectl patch statefulset awx-postgres-15 -n awx --type=json \
-p='[{"op":"add","path":"/spec/template/spec/initContainers","value":[{"name":"fix-perms","image":"busybox","command":["sh","-c","chown -R 26:26 /var/lib/pgsql/data && chmod 700 /var/lib/pgsql/data"],"volumeMounts":[{"name":"postgres-15","mountPath":"/var/lib/pgsql/data"}],"securityContext":{"runAsUser":0}}]}]'
Then restart the postgres pod:
kubectl delete pod -n awx -l app.kubernetes.io/name=awx-postgres-15
Note: The operator may revert this patch on the next reconciliation. If so, Option A or switching to a StorageClass that respects fsGroup natively is the long-term fix.
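To find out whether the operator has reverted the patch, the init container names on the live StatefulSet can be checked after a reconciliation. A minimal sketch, assuming the init container is named fix-perms as in the patch above; has_init_container and patch_still_present are hypothetical helper names.

```shell
# has_init_container NAME LIST: succeed if NAME is in the space-separated LIST.
has_init_container() {
  for n in $2; do
    [ "$n" = "$1" ] && return 0
  done
  return 1
}

# Query the live StatefulSet and check for the fix-perms init container.
patch_still_present() {
  names=$(kubectl get statefulset awx-postgres-15 -n awx \
            -o jsonpath='{.spec.template.spec.initContainers[*].name}')
  has_init_container fix-perms "$names"
}
```

If patch_still_present returns a non-zero status after the operator has reconciled, the patch was most likely reverted.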
Key Differences: security_context_settings vs postgres_security_context
| CR Field | Applies To |
|---|---|
| security_context_settings | AWX web and task pods |
| postgres_security_context | Managed PostgreSQL pod |
These are independent. Setting one does not affect the other.
Useful Diagnostic Commands
# Check all AWX resources
kubectl get all -n awx
# Check PVC status
kubectl get pvc -n awx
# Check postgres secret configuration
kubectl get secret -n awx awx-postgres-configuration -o jsonpath="{.data.host}" | base64 -d
# Watch operator logs
kubectl logs -n awx deployment/awx-operator-controller-manager -f --tail=50
# Check postgres pod logs
kubectl logs -n awx -l app.kubernetes.io/name=awx-postgres-15
# Force operator re-reconciliation
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
Full Recovery Procedure (Nuclear Option)
If the deployment is in a completely broken state, reset everything and let the operator rebuild from scratch:
# Delete all managed resources
kubectl delete deployment -n awx awx-task awx-web
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl delete secret -n awx awx-app-credentials
kubectl delete secret -n awx awx-admin-password
kubectl delete secret -n awx awx-broadcast-websocket
kubectl delete secret -n awx awx-receptor-ca
kubectl delete secret -n awx awx-receptor-work-signing
# Restart the operator
kubectl rollout restart deployment -n awx awx-operator-controller-manager
# The operator will recreate everything from the AWX CR
Warning: This deletes all AWX state, including admin passwords and database data. Use it only if you have no data to preserve or you have a backup.
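Given how destructive this procedure is, a small confirmation guard can reduce the chance of running it against the wrong namespace. The sketch below is one possible approach; confirm_reset is a hypothetical helper, not a kubectl feature.

```shell
# confirm_reset NAMESPACE: read one line from stdin and succeed only if it
# matches NAMESPACE exactly. Hypothetical guard, not part of kubectl.
confirm_reset() {
  printf 'Type the namespace name (%s) to confirm the reset: ' "$1" >&2
  read -r answer || return 1
  [ "$answer" = "$1" ]
}
```

It can be chained in front of the first destructive command, for example confirm_reset awx && kubectl delete deployment -n awx awx-task awx-web, so that nothing is deleted unless the namespace name is typed correctly.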