SnarfCode/awx-deployment-troubleshooting.md
2026-05-21 12:56:57 -04:00

AWX Operator Deployment Troubleshooting Guide

Environment

  • AWX Operator Version: 2.19.1
  • AWX Version: 24.6.1
  • Platform: k3s
  • Storage Provisioner: Longhorn

Issue 1: Database Migration Check Fails

Symptom

The operator fails at the Check for pending migrations task with:

ValueError: invalid literal for int() with base 10: 'error executing command in container:
failed to exec in container: failed to create exec ...: task ...'

The awx-task deployment shows unavailableReplicas: 1.

Root Cause

The operator attempts to kubectl exec into the awx-task container to run awx-manage showmigrations, but the container isn't running. The init-database init container is stuck because it cannot connect to PostgreSQL.
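
To confirm this diagnosis, the init container's logs can be read directly. The following is a minimal sketch, assuming the AWX instance is named awx in the awx namespace as elsewhere in this guide. The function name init_database_logs is hypothetical and not part of any tool.

```shell
# Hypothetical helper: print recent logs from the init-database init container
# of the task deployment. Namespace and instance name both default to "awx".
init_database_logs() {
  ns="${1:-awx}"
  name="${2:-awx}"
  lines="${3:-20}"
  kubectl logs -n "$ns" "deployment/${name}-task" -c init-database --tail="$lines"
}

# Usage (requires cluster access):
#   init_database_logs
#   init_database_logs awx awx 100
```

Repeated connection errors in this output would point to PostgreSQL rather than to the operator itself.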

Resolution

Fix the underlying PostgreSQL issue (see Issues 2-4 below). Once PostgreSQL is healthy, the init-database init container can complete, the awx-task container starts, and the operator's migration check succeeds on its next reconciliation loop.


Issue 2: PostgreSQL Pod Not Created (Missing StatefulSet)

Symptom

No postgres StatefulSet or pod exists in the awx namespace. The operator doesn't attempt to create one.

Root Cause

The awx-postgres-configuration secret existed but had an empty/unset host value. The operator saw the secret, assumed an external database was configured, and skipped creating the managed PostgreSQL StatefulSet.
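
A quick way to spot this condition is to decode the secret's host key and test whether it is empty. This is a sketch only: describe_pg_host is a hypothetical helper that takes the base64 value as returned by kubectl's jsonpath output, and it assumes a base64 binary that accepts -d.

```shell
# Hypothetical helper: decode a base64-encoded 'host' value and report
# whether it is empty. Takes the encoded string as its only argument.
describe_pg_host() {
  encoded="$1"
  host="$(printf '%s' "$encoded" | base64 -d 2>/dev/null || true)"
  if [ -z "$host" ]; then
    echo "host is empty or unset"
  else
    echo "host=$host"
  fi
}

# Usage (requires cluster access):
#   describe_pg_host "$(kubectl get secret -n awx awx-postgres-configuration \
#     -o jsonpath='{.data.host}')"
```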

Resolution

Delete the broken secret and let the operator recreate it with correct managed database values:

kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)

The operator will regenerate the secret with host: awx-postgres-15 and create the StatefulSet.
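
Reconciliation is not instantaneous, so it can help to poll until the StatefulSet appears. The sketch below assumes the resource names used in this guide. wait_for_statefulset and the POLL_INTERVAL variable are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: poll until a StatefulSet exists or the attempts run out.
# Arguments: namespace, StatefulSet name, number of checks (default 30).
# POLL_INTERVAL sets the number of seconds between checks (default 10).
wait_for_statefulset() {
  ns="$1"
  sts="$2"
  tries="${3:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if kubectl get statefulset -n "$ns" "$sts" >/dev/null 2>&1; then
      echo "statefulset/$sts exists"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  echo "statefulset/$sts not found after $tries checks" >&2
  return 1
}

# Usage (requires cluster access):
#   wait_for_statefulset awx awx-postgres-15
```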


Issue 3: Orphaned PVC Blocking Operator Progress

Symptom

The operator's reconciliation loop fails or hangs after the PostgreSQL PVC was deleted manually, leaving the operator in a bad state.

Root Cause

Deleting a PVC that the operator's managed StatefulSet depends on breaks the expected state. The operator may not recover automatically.

Resolution

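Clean up all related resources and let the operator rebuild. Deleting the PVC discards any existing database data unless the volume's reclaim policy retains it: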

kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
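
After the rebuild, the new PVC should reach the Bound phase. A minimal check might look like the following, assuming the PVC name used above. pvc_is_bound is a hypothetical helper name.

```shell
# Hypothetical helper: report whether a PVC has reached the Bound phase.
# Arguments: namespace (default "awx"), PVC name (default from this guide).
pvc_is_bound() {
  ns="${1:-awx}"
  pvc="${2:-postgres-15-awx-postgres-15-0}"
  phase="$(kubectl get pvc -n "$ns" "$pvc" -o jsonpath='{.status.phase}' 2>/dev/null || true)"
  if [ "$phase" = "Bound" ]; then
    echo "bound"
  else
    echo "not bound (phase: ${phase:-missing})"
    return 1
  fi
}

# Usage (requires cluster access):
#   pvc_is_bound
```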

Issue 4: PostgreSQL Permission Denied on Data Directory

Symptom

The postgres pod fails to start with:

mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

Root Cause

Longhorn mounts newly provisioned volumes owned by root with restrictive permissions. With fsGroupChangePolicy: OnRootMismatch, Kubernetes skips the recursive ownership change because the volume's root directory already appears correctly owned, yet the postgres user (UID 26) still cannot create subdirectories.

Resolution

Option A — Fix fsGroupChangePolicy (try first):

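In the AWX CR, set fsGroupChangePolicy: Always so that Kubernetes recursively applies ownership to the volume whenever it is mounted, before the container starts. The example below also runs the PostgreSQL pod as root (UID and GID 0):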

spec:
  postgres_storage_class: longhorn
  postgres_security_context:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0
    fsGroupChangePolicy: Always

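Then delete the StatefulSet and PVC and re-apply the CR so that the operator recreates them (deleting the PVC discards any existing database data):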

kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl apply -f awx.yaml
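
Once the operator has recreated the StatefulSet, it is worth confirming that the setting actually reached the pod template. The sketch below reads it back with jsonpath. check_fsgroup_policy is a hypothetical helper, and the default names are the ones used in this guide.

```shell
# Hypothetical helper: read fsGroupChangePolicy from the StatefulSet's pod
# template and report whether it is set to Always.
check_fsgroup_policy() {
  ns="${1:-awx}"
  sts="${2:-awx-postgres-15}"
  policy="$(kubectl get statefulset -n "$ns" "$sts" \
    -o jsonpath='{.spec.template.spec.securityContext.fsGroupChangePolicy}' 2>/dev/null || true)"
  case "$policy" in
    Always)
      echo "ok: fsGroupChangePolicy=Always"
      ;;
    "")
      echo "unset: no fsGroupChangePolicy on $sts"
      return 1
      ;;
    *)
      echo "mismatch: fsGroupChangePolicy=$policy"
      return 1
      ;;
  esac
}

# Usage (requires cluster access):
#   check_fsgroup_policy
```

An unset result would suggest that the CR field was not picked up, in which case the field name and indentation in awx.yaml are the first things to re-check.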

Option B — Patch StatefulSet with init container (if Option A fails):

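After the operator creates the StatefulSet, patch it to add a permissions-fixing init container. Note that a JSON Patch add at this path replaces any initContainers list already present on the StatefulSet: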

kubectl patch statefulset awx-postgres-15 -n awx --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/initContainers","value":[{"name":"fix-perms","image":"busybox","command":["sh","-c","chown -R 26:26 /var/lib/pgsql/data && chmod 700 /var/lib/pgsql/data"],"volumeMounts":[{"name":"postgres-15","mountPath":"/var/lib/pgsql/data"}],"securityContext":{"runAsUser":0}}]}]'

Then restart the postgres pod:

kubectl delete pod -n awx -l app.kubernetes.io/name=awx-postgres-15

Note: The operator may revert this patch on its next reconciliation. If it does, use Option A or switch to a StorageClass that honors fsGroup natively as the long-term fix.
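
Whether the patch has survived a reconciliation can be checked by listing the StatefulSet's init container names. This sketch assumes the init container name fix-perms from the patch above. fix_perms_present is a hypothetical helper name.

```shell
# Hypothetical helper: check whether the fix-perms init container is still
# present on the StatefulSet's pod template.
fix_perms_present() {
  ns="${1:-awx}"
  sts="${2:-awx-postgres-15}"
  # jsonpath prints the init container names separated by spaces
  names="$(kubectl get statefulset -n "$ns" "$sts" \
    -o jsonpath='{.spec.template.spec.initContainers[*].name}' 2>/dev/null || true)"
  case " $names " in
    *" fix-perms "*)
      echo "present"
      ;;
    *)
      echo "absent"
      return 1
      ;;
  esac
}

# Usage (requires cluster access):
#   fix_perms_present || echo "patch was reverted or never applied"
```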


Key Differences: security_context_settings vs postgres_security_context

CR Field                     Applies To
security_context_settings    AWX web and task pods
postgres_security_context    Managed PostgreSQL pod

These are independent. Setting one does not affect the other.


Useful Diagnostic Commands

# Check all AWX resources
kubectl get all -n awx

# Check PVC status
kubectl get pvc -n awx

# Check postgres secret configuration
kubectl get secret -n awx awx-postgres-configuration -o jsonpath="{.data.host}" | base64 -d

# Watch operator logs
kubectl logs -n awx deployment/awx-operator-controller-manager -f --tail=50

# Check postgres pod logs
kubectl logs -n awx -l app.kubernetes.io/name=awx-postgres-15

# Force operator re-reconciliation
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)

Full Recovery Procedure (Nuclear Option)

If the deployment is in a completely broken state, reset everything and let the operator rebuild from scratch:

# Delete all managed resources
kubectl delete deployment -n awx awx-task awx-web
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl delete secret -n awx awx-app-credentials
kubectl delete secret -n awx awx-admin-password
kubectl delete secret -n awx awx-broadcast-websocket
kubectl delete secret -n awx awx-receptor-ca
kubectl delete secret -n awx awx-receptor-work-signing

# Restart the operator
kubectl rollout restart deployment -n awx awx-operator-controller-manager

# The operator will recreate everything from the AWX CR

Warning: This deletes all AWX state, including the admin password and all database data. Use it only if you have no data to preserve or have a backup.
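
Before running the deletions, the secrets listed above can be exported to local files. The following is a sketch under the assumption that those six secret names are the complete set. backup_awx_secrets and the default directory name are hypothetical. It does not back up database data, and the exported files contain credentials, so they should be stored accordingly.

```shell
# Hypothetical helper: export the AWX secrets named in this guide to YAML
# files. Arguments: namespace (default "awx"), output directory.
backup_awx_secrets() {
  ns="${1:-awx}"
  dir="${2:-awx-secret-backup}"
  mkdir -p "$dir" || return 1
  for s in awx-postgres-configuration awx-app-credentials awx-admin-password \
           awx-broadcast-websocket awx-receptor-ca awx-receptor-work-signing; do
    if kubectl get secret -n "$ns" "$s" -o yaml > "$dir/$s.yaml" 2>/dev/null; then
      echo "saved $s"
    else
      # Remove the empty file left behind by the failed redirect
      rm -f "$dir/$s.yaml"
      echo "skipped $s (not found)"
    fi
  done
}

# Usage (requires cluster access):
#   backup_awx_secrets awx ./awx-secret-backup
```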