SnarfCode/awx-deployment-troubleshooting.md
2026-05-21 12:56:57 -04:00

# AWX Operator Deployment Troubleshooting Guide
## Environment
- **AWX Operator Version:** 2.19.1
- **AWX Version:** 24.6.1
- **Platform:** k3s
- **Storage Provisioner:** Longhorn
---
## Issue 1: Database Migration Check Fails
### Symptom
The operator fails at the `Check for pending migrations` task with:
```
ValueError: invalid literal for int() with base 10: 'error executing command in container:
failed to exec in container: failed to create exec ...: task ...'
```
The `awx-task` deployment shows `unavailableReplicas: 1`.
### Root Cause
The operator attempts to `kubectl exec` into the `awx-task` container to run `awx-manage showmigrations`, but the container isn't running. The `init-database` init container is stuck because it cannot connect to PostgreSQL.
### Resolution
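Fix the underlying PostgreSQL issue (see Issues 2-4 below). Once PostgreSQL is healthy, the `init-database` init container can finish, the `awx-task` container starts, and the migration check succeeds on the operator's next reconciliation loop.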
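
To confirm that a blocked `init-database` init container is what keeps `awx-task` unavailable, its logs can be inspected directly. The following is a minimal sketch, assuming the namespace, deployment, and container names used in this guide; the helper name `check_awx_task` is hypothetical. Calling `check_awx_task awx` prints the last log lines of the init container when replicas are unavailable.

```bash
# Hypothetical helper: show why awx-task is unavailable.
# Assumes deployment "awx-task" and init container "init-database".
check_awx_task() {
  ns="${1:-awx}"
  # Empty output means the deployment reports no unavailable replicas.
  unavailable=$(kubectl get deployment awx-task -n "$ns" \
    -o jsonpath='{.status.unavailableReplicas}')
  if [ -z "$unavailable" ] || [ "$unavailable" = "0" ]; then
    echo "awx-task is available"
    return 0
  fi
  echo "awx-task has $unavailable unavailable replica(s); init-database logs follow"
  # Logs of the init container that waits for PostgreSQL.
  kubectl logs -n "$ns" deployment/awx-task -c init-database --tail=20
  return 1
}
```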

---
## Issue 2: PostgreSQL Pod Not Created (Missing StatefulSet)
### Symptom
No postgres StatefulSet or pod exists in the `awx` namespace. The operator doesn't attempt to create one.
### Root Cause
The `awx-postgres-configuration` secret existed but had an empty/unset `host` value. The operator saw the secret, assumed an external database was configured, and skipped creating the managed PostgreSQL StatefulSet.
### Resolution
Delete the broken secret and let the operator recreate it with correct managed database values:
```bash
kubectl delete secret -n awx awx-postgres-configuration
# Use a changing value so that repeated runs still modify the CR
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```
The operator will regenerate the secret with `host: awx-postgres-15` and create the StatefulSet.
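
Before deleting the secret, it may be worth checking whether its `host` value is really the problem. The sketch below is one possible check, not part of the operator: the helper names `classify_pg_host` and `check_pg_secret` are hypothetical, and the managed host name `awx-postgres-15` is taken from this guide. Running `check_pg_secret` prints `empty` for the broken state described above, `managed` for the operator-generated value, and `external` for anything else.

```bash
# Hypothetical helper: classify a base64-encoded "host" value.
classify_pg_host() {
  encoded="$1"
  if [ -z "$encoded" ]; then
    echo "empty"
    return
  fi
  host=$(printf '%s' "$encoded" | base64 -d)
  case "$host" in
    "")              echo "empty" ;;
    awx-postgres-15) echo "managed" ;;
    *)               echo "external" ;;
  esac
}

# Read the host key from the secret and classify it.
check_pg_secret() {
  encoded=$(kubectl get secret -n awx awx-postgres-configuration \
    -o jsonpath='{.data.host}')
  classify_pg_host "$encoded"
}
```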

---
## Issue 3: Orphaned PVC Blocking Operator Progress
### Symptom
The operator's reconciliation loop fails or hangs after the PostgreSQL PVC was deleted manually.
### Root Cause
The operator-managed StatefulSet depends on this PVC. Deleting the PVC by hand leaves the StatefulSet in a state the operator does not expect, and the operator may not recover automatically.
### Resolution
Clean up all related resources and let the operator rebuild:
```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
# Use a changing value so that repeated runs still modify the CR
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```
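Before running the cleanup above, it can help to see what state the PVC is actually in. A PVC that was deleted while a pod still mounts it normally stays in `Terminating` (its `deletionTimestamp` is set) until that pod is gone. The sketch below assumes the PVC name used in this guide; the helper name `pvc_state` is hypothetical. It prints `missing`, `terminating`, `bound`, or `other: <phase>`.

```bash
# Hypothetical helper: report the state of the PostgreSQL PVC.
pvc_state() {
  ns="${1:-awx}"
  pvc="${2:-postgres-15-awx-postgres-15-0}"
  # Prints "<phase>|<deletionTimestamp>"; no output means the PVC does not exist.
  info=$(kubectl get pvc "$pvc" -n "$ns" --ignore-not-found \
    -o jsonpath='{.status.phase}|{.metadata.deletionTimestamp}')
  case "$info" in
    "")       echo "missing" ;;
    *"|"?*)   echo "terminating" ;;
    "Bound|") echo "bound" ;;
    *)        echo "other: ${info%|}" ;;
  esac
}
```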
---
## Issue 4: PostgreSQL Permission Denied on Data Directory
### Symptom
The postgres pod fails to start with:
```
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
```
### Root Cause
Longhorn provisions the volume mounted as root with restrictive permissions. With `fsGroupChangePolicy: OnRootMismatch`, Kubernetes changes ownership recursively only when the volume's root directory does not already match the expected ownership. Here the root directory appears correctly owned, so the recursive chown is skipped, yet the postgres user (UID 26) still cannot create the `userdata` subdirectory.
### Resolution
**Option A — Fix fsGroupChangePolicy (try first):**
In the AWX CR, set `fsGroupChangePolicy: Always` to force Kubernetes to recursively apply ownership before the container starts:
```yaml
spec:
  postgres_storage_class: longhorn
  postgres_security_context:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0
    fsGroupChangePolicy: Always
```
Then delete the StatefulSet and its PVC, re-apply the CR, and let the operator recreate both:
```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl apply -f awx.yaml
```
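Once the operator has recreated the StatefulSet, it is worth confirming that the setting actually reached the pod template. `fsGroupChangePolicy` is a pod-level field, so if the operator applied it, it should appear under `.spec.template.spec.securityContext`. This is a minimal sketch under that assumption; the helper names `pg_fsgroup_policy` and `verify_fsgroup_policy` are hypothetical.

```bash
# Hypothetical helper: read the policy from the StatefulSet pod template.
pg_fsgroup_policy() {
  ns="${1:-awx}"
  kubectl get statefulset awx-postgres-15 -n "$ns" \
    -o jsonpath='{.spec.template.spec.securityContext.fsGroupChangePolicy}'
}

# Compare the live value against the value set in the AWX CR.
verify_fsgroup_policy() {
  policy=$(pg_fsgroup_policy "$@")
  if [ "$policy" = "Always" ]; then
    echo "ok: fsGroupChangePolicy=Always"
  else
    echo "mismatch: fsGroupChangePolicy='${policy}'"
    return 1
  fi
}
```

If `verify_fsgroup_policy` reports a mismatch with an empty value, the CR field was probably not picked up, and the operator logs are the next place to look.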
**Option B — Patch StatefulSet with init container (if Option A fails):**
After the operator creates the StatefulSet, patch it to add a permissions-fixing init container:
```bash
kubectl patch statefulset awx-postgres-15 -n awx --type=json \
-p='[{"op":"add","path":"/spec/template/spec/initContainers","value":[{"name":"fix-perms","image":"busybox","command":["sh","-c","chown -R 26:26 /var/lib/pgsql/data && chmod 700 /var/lib/pgsql/data"],"volumeMounts":[{"name":"postgres-15","mountPath":"/var/lib/pgsql/data"}],"securityContext":{"runAsUser":0}}]}]'
```
Then restart the postgres pod:
```bash
kubectl delete pod -n awx -l app.kubernetes.io/name=awx-postgres-15
```
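To see whether the patch is still in place after the operator's next reconciliation, the init container names on the StatefulSet can be listed. A small sketch, assuming the `fix-perms` name from the patch above; the helper name `has_fix_perms` is hypothetical.

```bash
# Hypothetical helper: succeed only if the fix-perms init container is present.
has_fix_perms() {
  ns="${1:-awx}"
  names=$(kubectl get statefulset awx-postgres-15 -n "$ns" \
    -o jsonpath='{.spec.template.spec.initContainers[*].name}')
  # jsonpath prints the matching names separated by spaces.
  for name in $names; do
    [ "$name" = "fix-perms" ] && return 0
  done
  return 1
}
```

For example, `has_fix_perms || echo "patch was reverted"` could be run a few minutes after patching.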
> **Note:** The operator may revert this patch on the next reconciliation. If so, Option A or switching to a StorageClass that respects fsGroup natively is the long-term fix.
---
## Key Differences: security_context_settings vs postgres_security_context
| CR Field | Applies To |
|----------|-----------|
| `security_context_settings` | AWX web and task pods |
| `postgres_security_context` | Managed PostgreSQL pod |

These are independent. Setting one does not affect the other.

---
## Useful Diagnostic Commands
```bash
# Check all AWX resources
kubectl get all -n awx
# Check PVC status
kubectl get pvc -n awx
# Check postgres secret configuration
kubectl get secret -n awx awx-postgres-configuration -o jsonpath="{.data.host}" | base64 -d
# Watch operator logs
kubectl logs -n awx deployment/awx-operator-controller-manager -f --tail=50
# Check postgres pod logs
kubectl logs -n awx -l app.kubernetes.io/name=awx-postgres-15
# Force operator re-reconciliation
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```
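After forcing a reconciliation, the rollout of the managed workloads can be awaited instead of polling by hand. The sketch below uses `kubectl rollout status` with a timeout; the workload names are the ones used in this guide, and the helper name `wait_for_awx` and the default timeout of `300s` are assumptions. It stops at the first workload that does not become ready.

```bash
# Hypothetical helper: wait for PostgreSQL, then the AWX task and web deployments.
wait_for_awx() {
  ns="${1:-awx}"
  timeout="${2:-300s}"
  for target in statefulset/awx-postgres-15 deployment/awx-task deployment/awx-web; do
    kubectl rollout status "$target" -n "$ns" --timeout="$timeout" || {
      echo "not ready: $target"
      return 1
    }
  done
  echo "all workloads ready"
}
```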
---
## Full Recovery Procedure (Nuclear Option)
If the deployment is in a completely broken state, reset everything and let the operator rebuild from scratch:
```bash
# Delete all managed resources
kubectl delete deployment -n awx awx-task awx-web
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl delete secret -n awx awx-app-credentials
kubectl delete secret -n awx awx-admin-password
kubectl delete secret -n awx awx-broadcast-websocket
kubectl delete secret -n awx awx-receptor-ca
kubectl delete secret -n awx awx-receptor-work-signing
# Restart the operator
kubectl rollout restart deployment -n awx awx-operator-controller-manager
# The operator will recreate everything from the AWX CR
```
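Because the secrets cannot be recovered once deleted, one option is to export them to local files before running the commands above. This is a minimal sketch, not a full backup: it covers only the secrets listed in this procedure and does not save any database data. The helper name `backup_awx_secrets` and the default directory `awx-secret-backup` are hypothetical.

```bash
# Hypothetical helper: export the secrets that the reset procedure deletes.
backup_awx_secrets() {
  ns="${1:-awx}"
  dir="${2:-awx-secret-backup}"
  mkdir -p "$dir" || return 1
  for s in awx-postgres-configuration awx-app-credentials awx-admin-password \
           awx-broadcast-websocket awx-receptor-ca awx-receptor-work-signing; do
    # --ignore-not-found leaves an empty file for secrets that are already gone.
    kubectl get secret "$s" -n "$ns" --ignore-not-found -o yaml > "$dir/$s.yaml" || return 1
  done
  echo "saved secrets to $dir"
}
```

The exported files contain credentials and should be stored accordingly. They also carry server-set metadata such as `resourceVersion` and `uid`, which would likely need to be removed before the files could be re-applied.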
> **Warning:** This deletes all AWX state including admin passwords and database data. Only use if you have no data to preserve or have a backup.