# AWX Operator Deployment Troubleshooting Guide

## Environment

- **AWX Operator Version:** 2.19.1
- **AWX Version:** 24.6.1
- **Platform:** k3s
- **Storage Provisioner:** Longhorn

---

## Issue 1: Database Migration Check Fails

### Symptom

The operator fails at the `Check for pending migrations` task with:

```
ValueError: invalid literal for int() with base 10: 'error executing command in container:
failed to exec in container: failed to create exec ...: task ...'
```

The `awx-task` deployment shows `unavailableReplicas: 1`.

### Root Cause

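The operator attempts to `kubectl exec` into the `awx-task` container to run `awx-manage showmigrations` and parses the result as an integer. Because the container isn't running, the exec fails and the operator tries to convert the error message instead, which produces the `ValueError`. The container isn't running because the `init-database` init container is stuck: it cannot connect to PostgreSQL.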
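
To confirm this state before moving on, the pod listing can be filtered for pods that are still in their init phase. The following is a minimal sketch, not part of the operator: `is_stuck_in_init` and `report_init_pods` are hypothetical helper names, and it assumes `kubectl get pods` reports such pods with a STATUS that begins with `Init:`.

```bash
# Hypothetical helper: succeed if a STATUS value from `kubectl get pods`
# shows the pod is still in an init container (for example "Init:0/1").
is_stuck_in_init() {
  case "$1" in
    Init:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: list pods in a namespace (default: awx) that are
# still waiting on an init container.
report_init_pods() {
  local ns="${1:-awx}"
  kubectl get pods -n "$ns" --no-headers |
    while read -r name _ready status _rest; do
      if is_stuck_in_init "$status"; then
        echo "$name is waiting on an init container ($status)"
      fi
    done
}
```

Running `report_init_pods awx` should name the `awx-task` pod if it is blocked. `kubectl logs -n awx deployment/awx-task -c init-database --tail=20` then shows what the init container itself is reporting.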

### Resolution

Fix the underlying PostgreSQL issue (see Issues 2-4 below). Once PostgreSQL is healthy, the `init-database` container can complete, `awx-task` starts, and the migration check succeeds on the operator's next reconciliation loop.

---

## Issue 2: PostgreSQL Pod Not Created (Missing StatefulSet)

### Symptom

No postgres StatefulSet or pod exists in the `awx` namespace. The operator doesn't attempt to create one.

### Root Cause

The `awx-postgres-configuration` secret existed but had an empty/unset `host` value. The operator saw the secret, assumed an external database was configured, and skipped creating the managed PostgreSQL StatefulSet.
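
A quick way to check for this condition is to decode the `host` field and interpret it. This is a sketch under assumptions: `pg_secret_host` and `describe_pg_host` are hypothetical helper names, and `awx-postgres-15` is taken to be the managed host name because that is the value the operator regenerates in the resolution below.

```bash
# Hypothetical helper: print the decoded `host` field of the secret.
pg_secret_host() {
  local ns="${1:-awx}" secret="${2:-awx-postgres-configuration}"
  kubectl get secret -n "$ns" "$secret" -o jsonpath='{.data.host}' | base64 -d
}

# Hypothetical helper: interpret a decoded host value.
# Pass it the output of pg_secret_host; it does not check that the secret exists.
describe_pg_host() {
  local host="$1"
  if [ -z "$host" ]; then
    echo "host is empty: secret exists but points nowhere"
  elif [ "$host" = "awx-postgres-15" ]; then
    echo "host is the managed database service"
  else
    echo "host is set to: $host"
  fi
}
```

For example, `describe_pg_host "$(pg_secret_host awx)"` prints the first message when the secret is in the broken state described above.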

### Resolution

Delete the broken secret and let the operator recreate it with correct managed database values:

```bash
kubectl delete secret -n awx awx-postgres-configuration
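# Use a changing value: re-applying an identical annotation changes nothing on the object
kubectl annotate awx -n awx awx --overwrite restartedAt="$(date +%s)"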
```

The operator will regenerate the secret with `host: awx-postgres-15` and create the StatefulSet.
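
Reconciliation is not instant, so it can be useful to poll until the StatefulSet appears. The helper below is an illustrative sketch; `wait_for_resource` is a hypothetical name, and the attempt count and delay are arbitrary defaults rather than values recommended by the operator.

```bash
# Hypothetical helper: poll until a resource exists, or give up.
# Arguments: namespace, kind, name, attempts (default 30), delay in seconds (default 10).
wait_for_resource() {
  local ns="$1" kind="$2" name="$3" tries="${4:-30}" delay="${5:-10}" i=1
  while [ "$i" -le "$tries" ]; do
    if kubectl get "$kind" -n "$ns" "$name" >/dev/null 2>&1; then
      echo "$kind/$name exists"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$kind/$name not found after $tries attempts" >&2
  return 1
}
```

Usage: `wait_for_resource awx statefulset awx-postgres-15`. The same helper works for the regenerated secret with `wait_for_resource awx secret awx-postgres-configuration`.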

---

## Issue 3: Orphaned PVC Blocking Operator Progress

### Symptom

The operator reconciliation loop fails or hangs. A previously deleted PVC left the operator in a bad state.

### Root Cause

Manually deleting the PVC that backs the operator-managed StatefulSet leaves the cluster out of sync with the state the operator expects. The operator may not recover from this on its own.

### Resolution

Clean up all related resources and let the operator rebuild:

```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
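# Use a changing value: re-applying an identical annotation changes nothing on the object
kubectl annotate awx -n awx awx --overwrite restartedAt="$(date +%s)"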
```
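
It can help to confirm that the old PVC is really gone before expecting the operator to rebuild. Kubernetes keeps a deleted PVC in `Terminating` for as long as a pod still uses it, so a PVC that lingers after `kubectl delete` is one state worth ruling out. The sketch below is not something the operator provides; `pvc_state` is a hypothetical helper name.

```bash
# Hypothetical helper: report whether a PVC is absent, present, or terminating.
# Any kubectl failure (including lack of cluster access) is reported as "absent".
pvc_state() {
  local ns="${1:-awx}" pvc="${2:-postgres-15-awx-postgres-15-0}" ts
  if ! ts="$(kubectl get pvc -n "$ns" "$pvc" \
      -o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null)"; then
    echo "absent"
  elif [ -n "$ts" ]; then
    echo "terminating since $ts"
  else
    echo "present"
  fi
}
```

If `pvc_state awx` keeps reporting `terminating`, a pod that still mounts the volume is the likely holder, and `kubectl get pods -n awx` shows which pods remain.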

---

## Issue 4: PostgreSQL Permission Denied on Data Directory

### Symptom

The postgres pod fails to start with:

```
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
```

### Root Cause

Longhorn provisions volumes mounted as root with restrictive permissions. With `fsGroupChangePolicy: OnRootMismatch`, Kubernetes skips the recursive ownership change whenever the volume root directory already appears correctly owned, so the postgres user (UID 26) still cannot create subdirectories.

### Resolution

**Option A — Fix fsGroupChangePolicy (try first):**

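In the AWX CR, set `fsGroupChangePolicy: Always` under `postgres_security_context` so that Kubernetes re-applies ownership and permissions to the volume contents on every mount, before the container starts. The example below also sets `runAsUser`, `runAsGroup`, and `fsGroup` to 0, so the pod runs as root: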

```yaml
spec:
  postgres_storage_class: longhorn
  postgres_security_context:
    runAsUser: 0
    runAsGroup: 0
    fsGroup: 0
    fsGroupChangePolicy: Always
```

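Then delete the StatefulSet and its PVC, and re-apply the CR so the operator recreates both. Deleting the PVC discards any existing database data: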

```bash
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl apply -f awx.yaml
```
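
Once the operator has recreated the StatefulSet, it is worth checking that the setting actually reached the pod template. The sketch below assumes the operator renders the CR field into the pod-level `securityContext` of the StatefulSet; `pg_fsgroup_policy` and `check_fsgroup_policy` are hypothetical helper names.

```bash
# Hypothetical helper: print the pod-level fsGroupChangePolicy of the StatefulSet.
pg_fsgroup_policy() {
  local ns="${1:-awx}" sts="${2:-awx-postgres-15}"
  kubectl get statefulset -n "$ns" "$sts" \
    -o jsonpath='{.spec.template.spec.securityContext.fsGroupChangePolicy}'
}

# Hypothetical helper: succeed only if the policy is "Always".
check_fsgroup_policy() {
  local policy
  policy="$(pg_fsgroup_policy "$@")"
  if [ "$policy" = "Always" ]; then
    echo "fsGroupChangePolicy is Always"
  else
    echo "fsGroupChangePolicy is '${policy:-unset}', expected 'Always'"
    return 1
  fi
}
```

If `check_fsgroup_policy awx` reports `unset`, the CR change did not propagate, and the operator logs are the next place to look.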

**Option B — Patch StatefulSet with init container (if Option A fails):**

After the operator creates the StatefulSet, patch it to add a permissions-fixing init container:

```bash
kubectl patch statefulset awx-postgres-15 -n awx --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/initContainers","value":[{"name":"fix-perms","image":"busybox","command":["sh","-c","chown -R 26:26 /var/lib/pgsql/data && chmod 700 /var/lib/pgsql/data"],"volumeMounts":[{"name":"postgres-15","mountPath":"/var/lib/pgsql/data"}],"securityContext":{"runAsUser":0}}]}]'
```

Then restart the postgres pod:

```bash
kubectl delete pod -n awx -l app.kubernetes.io/name=awx-postgres-15
```

> **Note:** The operator may revert this patch on the next reconciliation. If so, Option A or switching to a StorageClass that respects fsGroup natively is the long-term fix.
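
To see whether the patch has been reverted, the init container names on the StatefulSet can be inspected after a reconciliation. This is a minimal sketch; `has_fix_perms_init` is a hypothetical helper name, and `fix-perms` is the container name used in the patch above.

```bash
# Hypothetical helper: succeed if the StatefulSet still has the fix-perms
# init container from the Option B patch.
has_fix_perms_init() {
  local ns="${1:-awx}" sts="${2:-awx-postgres-15}" names
  names="$(kubectl get statefulset -n "$ns" "$sts" \
    -o jsonpath='{.spec.template.spec.initContainers[*].name}')"
  # Pad with spaces so only a whole-name match counts.
  case " $names " in
    *" fix-perms "*)
      echo "fix-perms init container present"
      ;;
    *)
      echo "fix-perms init container missing"
      return 1
      ;;
  esac
}
```

A `missing` result after the patch was applied suggests the operator has overwritten the StatefulSet.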

---

## Key Differences: security_context_settings vs postgres_security_context

| CR Field | Applies To |
|----------|-----------|
| `security_context_settings` | AWX web and task pods |
| `postgres_security_context` | Managed PostgreSQL pod |

These are independent. Setting one does not affect the other.

---

## Useful Diagnostic Commands

```bash
# Check all AWX resources
kubectl get all -n awx

# Check PVC status
kubectl get pvc -n awx

# Check postgres secret configuration
kubectl get secret -n awx awx-postgres-configuration -o jsonpath="{.data.host}" | base64 -d

# Watch operator logs
kubectl logs -n awx deployment/awx-operator-controller-manager -f --tail=50

# Check postgres pod logs
kubectl logs -n awx -l app.kubernetes.io/name=awx-postgres-15

# Force operator re-reconciliation
kubectl annotate awx -n awx awx --overwrite restartedAt=$(date +%s)
```

---

## Full Recovery Procedure (Nuclear Option)

If the deployment is in a completely broken state, reset everything and let the operator rebuild from scratch:

```bash
# Delete all managed resources
kubectl delete deployment -n awx awx-task awx-web
kubectl delete statefulset -n awx awx-postgres-15
kubectl delete pvc -n awx postgres-15-awx-postgres-15-0
kubectl delete secret -n awx awx-postgres-configuration
kubectl delete secret -n awx awx-app-credentials
kubectl delete secret -n awx awx-admin-password
kubectl delete secret -n awx awx-broadcast-websocket
kubectl delete secret -n awx awx-receptor-ca
kubectl delete secret -n awx awx-receptor-work-signing

# Restart the operator
kubectl rollout restart deployment -n awx awx-operator-controller-manager

# The operator will recreate everything from the AWX CR
```

> **Warning:** This deletes all AWX state including admin passwords and database data. Only use if you have no data to preserve or have a backup.
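
Because the procedure is destructive, copying the secrets to disk first keeps values such as the admin password recoverable. The helper below is an illustrative sketch only: `backup_secrets` is a hypothetical name, and it does not back up the database itself.

```bash
# Hypothetical helper: save each named secret as YAML into a directory.
# Arguments: namespace, output directory, then one or more secret names.
backup_secrets() {
  local ns="$1" dir="$2" name
  shift 2
  mkdir -p "$dir" || return 1
  for name in "$@"; do
    if kubectl get secret -n "$ns" "$name" -o yaml >"$dir/$name.yaml" 2>/dev/null; then
      echo "saved $name"
    else
      # Remove the empty file left behind by the failed redirect.
      rm -f "$dir/$name.yaml"
      echo "skipped $name (not found)" >&2
    fi
  done
}
```

Usage: `backup_secrets awx ./awx-secrets-backup awx-admin-password awx-postgres-configuration awx-app-credentials`. The saved files contain base64-encoded, not encrypted, secret data, so they should be stored accordingly.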