Added AWX fixes
195
awx-deployment-troubleshooting.md
Normal file
@@ -0,0 +1,195 @@
# AWX Operator Deployment Troubleshooting Guide

## Environment

- **AWX Operator Version:** 2.19.1
- **AWX Version:** 24.6.1
- **Platform:** k3s
- **Storage Provisioner:** Longhorn

---

## Issue 1: Database Migration Check Fails

### Symptom

The operator fails at the `Check for pending migrations` task with:

```
ValueError: invalid literal for int() with base 10: 'error executing command in container:
failed to exec in container: failed to create exec ...: task ...'
```

The `awx-task` deployment shows `unavailableReplicas: 1`.

### Root Cause

The operator attempts to `kubectl exec` into the `awx-task` container to run `awx-manage showmigrations`, but the container isn't running. The exec error message comes back where a number is expected, which produces the `ValueError`. The `init-database` init container is stuck because it cannot connect to PostgreSQL.

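To confirm this diagnosis, it can help to read the init container's logs directly. The following is a minimal sketch, assuming the `awx` namespace and the `awx-task` deployment named above; the function name `awx_init_db_logs` is hypothetical and not part of AWX or kubectl.

```bash
# Sketch: show recent logs from the init-database init container of awx-task.
# The namespace and deployment name follow this guide; the function name is
# illustrative only.
awx_init_db_logs() {
  # $1 = namespace (default: awx), $2 = number of log lines (default: 20)
  kubectl logs -n "${1:-awx}" deployment/awx-task \
    -c init-database --tail="${2:-20}"
}

# Usage: awx_init_db_logs awx 50
```

Repeated connection errors in this output point at PostgreSQL rather than at the `awx-task` container itself.
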
### Resolution

Fix the underlying PostgreSQL issue (see Issues 2-4 below). Once postgres is healthy, the operator will succeed on its next reconciliation loop.

---

## Issue 2: PostgreSQL Pod Not Created (Missing StatefulSet)

### Symptom

No postgres StatefulSet or pod exists in the `awx` namespace. The operator doesn't attempt to create one.

### Root Cause

The `awx-postgres-configuration` secret existed but had an empty/unset `host` value. The operator saw the secret, assumed an external database was configured, and skipped creating the managed PostgreSQL StatefulSet.

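A quick way to tell this case apart from a missing secret is to decode the `host` key and check whether it is empty. The sketch below assumes the secret name and namespace used in this guide; the function name `awx_pg_secret_host` and the messages it prints are hypothetical.

```bash
# Sketch: report whether awx-postgres-configuration carries a usable host.
# The secret name and namespace follow this guide; the function name and its
# messages are illustrative only.
awx_pg_secret_host() {
  ns="${1:-awx}"
  # A missing secret is fine here: the operator creates one for a managed DB.
  if ! kubectl get secret -n "$ns" awx-postgres-configuration >/dev/null 2>&1; then
    echo "secret not found"
    return 0
  fi
  encoded=$(kubectl get secret -n "$ns" awx-postgres-configuration \
    -o jsonpath='{.data.host}') || encoded=""
  host=$(printf '%s' "$encoded" | base64 -d 2>/dev/null) || host=""
  if [ -z "$host" ]; then
    # This is the broken state described above.
    echo "host is empty"
    return 1
  fi
  echo "host=$host"
}
```

A non-zero exit status corresponds to the situation in this issue: the secret exists, but its `host` value is empty.
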
### Resolution

Delete the broken secret and let the operator recreate it with correct managed database values:

```bash
kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```

The operator will regenerate the secret with `host: awx-postgres-15` and create the StatefulSet.

---

## Issue 3: Orphaned PVC Blocking Operator Progress

### Symptom

The operator's reconciliation loop fails or hangs after a PVC was deleted manually, leaving the operator in a bad state.

### Root Cause

Deleting a PVC that the operator's managed StatefulSet depends on breaks the expected state. The operator may not recover automatically.

### Resolution

Clean up all related resources and let the operator rebuild:

```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```

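After the cleanup, the operator needs a reconciliation pass before the StatefulSet reappears. As a rough sketch, the wait could be scripted as follows; the function name, the retry count, and the `AWX_POLL_SECONDS` variable are hypothetical, and the 300s rollout timeout is an arbitrary choice.

```bash
# Sketch: wait for the operator to recreate the postgres StatefulSet, then
# wait for its rollout. Resource names follow this guide; the function name,
# retry count, and AWX_POLL_SECONDS are illustrative only.
awx_wait_for_postgres() {
  ns="${1:-awx}"
  tries="${2:-30}"
  # The StatefulSet may be absent for a while after cleanup, so poll first.
  while ! kubectl get statefulset -n "$ns" awx-postgres-15 >/dev/null 2>&1; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      echo "StatefulSet awx-postgres-15 was not recreated" >&2
      return 1
    fi
    sleep "${AWX_POLL_SECONDS:-10}"
  done
  kubectl rollout status statefulset/awx-postgres-15 -n "$ns" --timeout=300s
}

# Usage: awx_wait_for_postgres awx 30
```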
---

## Issue 4: PostgreSQL Permission Denied on Data Directory

### Symptom

The postgres pod fails to start with:

```
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
```

### Root Cause

Longhorn provisions volumes mounted as root with restrictive permissions. The `fsGroupChangePolicy: OnRootMismatch` setting doesn't trigger a recursive chown because the volume root directory appears correctly owned, but subdirectory creation by the postgres user (UID 26) still fails.

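Before changing the CR, it may be worth checking which `fsGroup` and `fsGroupChangePolicy` values actually reached the pod template. The sketch below assumes the settings land in the pod-level `securityContext` of the `awx-postgres-15` StatefulSet; the function name `awx_pg_security_context` is hypothetical.

```bash
# Sketch: print the pod-level securityContext of the postgres StatefulSet.
# Resource names follow this guide; the function name is illustrative only,
# and the pod-level location of the settings is an assumption.
awx_pg_security_context() {
  # $1 = namespace (default: awx)
  kubectl get statefulset -n "${1:-awx}" awx-postgres-15 \
    -o jsonpath='{.spec.template.spec.securityContext}'
}

# Usage: awx_pg_security_context awx
```

Empty output would suggest that no pod-level security context was applied at all.
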
### Resolution

**Option A - Fix fsGroupChangePolicy (try first):**

In the AWX CR, set `fsGroupChangePolicy: Always` to force Kubernetes to recursively apply ownership before the container starts:

```yaml
spec:
  postgres_storage_class: longhorn
  postgres_security_context:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0
    fsGroupChangePolicy: Always
```

Then delete the StatefulSet and PVC and re-apply the CR so the operator recreates them:

```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl apply -f awx.yaml
```

**Option B - Patch StatefulSet with init container (if Option A fails):**

After the operator creates the StatefulSet, patch it to add a permissions-fixing init container:

```bash
kubectl patch statefulset awx-postgres-15 -n awx --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/initContainers","value":[{"name":"fix-perms","image":"busybox","command":["sh","-c","chown -R 26:26 /var/lib/pgsql/data && chmod 700 /var/lib/pgsql/data"],"volumeMounts":[{"name":"postgres-15","mountPath":"/var/lib/pgsql/data"}],"securityContext":{"runAsUser":0}}]}]'
```

Then restart the postgres pod:

```bash
kubectl delete pod -n awx -l app.kubernetes.io/name=awx-postgres-15
```

> **Note:** The operator may revert this patch on the next reconciliation. If so, Option A or switching to a StorageClass that respects fsGroup natively is the long-term fix.

---

## Key Differences: security_context_settings vs postgres_security_context

| CR Field | Applies To |
|----------|-----------|
| `security_context_settings` | AWX web and task pods |
| `postgres_security_context` | Managed PostgreSQL pod |

These are independent. Setting one does not affect the other.

---

## Useful Diagnostic Commands

```bash
# Check all AWX resources
kubectl get all -n awx

# Check PVC status
kubectl get pvc -n awx

# Check postgres secret configuration
kubectl get secret -n awx awx-postgres-configuration -o jsonpath="{.data.host}" | base64 -d

# Watch operator logs
kubectl logs -n awx deployment/awx-operator-controller-manager -f --tail=50

# Check postgres pod logs
kubectl logs -n awx -l app.kubernetes.io/name=awx-postgres-15

# Force operator re-reconciliation
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```

---

## Full Recovery Procedure (Nuclear Option)

If the deployment is in a completely broken state, reset everything and let the operator rebuild from scratch:

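Because the procedure below deletes several secrets, it may be worth exporting them first. The following is a minimal sketch: the secret names are the ones listed in this section, while the function name `awx_backup_secrets` and the default output directory are hypothetical. It covers only the secrets, not the database contents on the PVC, and the exported files contain credentials.

```bash
# Sketch: export the secrets that the recovery procedure deletes.
# Secret names follow this guide; the function name and the default output
# directory are illustrative only.
awx_backup_secrets() {
  ns="${1:-awx}"
  out_dir="${2:-./awx-secret-backup}"
  mkdir -p "$out_dir" || return 1
  for name in awx-postgres-configuration awx-app-credentials \
      awx-admin-password awx-broadcast-websocket awx-receptor-ca \
      awx-receptor-work-signing; do
    if kubectl get secret -n "$ns" "$name" -o yaml > "$out_dir/$name.yaml"; then
      echo "saved $name"
    else
      # Remove the empty file left behind by a failed export.
      rm -f "$out_dir/$name.yaml"
      echo "skipped $name" >&2
    fi
  done
}

# Usage: awx_backup_secrets awx ./awx-secret-backup
```

Secrets that do not exist are skipped, so the function can be run against a partially broken deployment.
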
```bash
# Delete all managed resources
kubectl delete deployment -n awx awx-task awx-web
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl delete secret -n awx awx-app-credentials
kubectl delete secret -n awx awx-admin-password
kubectl delete secret -n awx awx-broadcast-websocket
kubectl delete secret -n awx awx-receptor-ca
kubectl delete secret -n awx awx-receptor-work-signing

# Restart the operator
kubectl rollout restart deployment -n awx awx-operator-controller-manager

# The operator will recreate everything from the AWX CR
```

> **Warning:** This deletes all AWX state including admin passwords and database data. Only use if you have no data to preserve or have a backup.