Elastic Kubernetes Service
- Compute choices:
  - Managed node groups
  - Fargate (serverless pods)
  - Karpenter (or EKS Auto Mode)
- Network choices:
  - VPC CNI
  - Ingress/Load balancers
  - Cilium on EKS
- Identity & access:
  - EKS Pod Identity: associate IAM roles to service accounts
  - Native
- Storage:
  - EBS CSI driver
  - EFS CSI driver
  - S3 CSI driver
- Observability:
  - Control plane logs: CloudWatch + Fluent Bit
  - Metrics:
    - Amazon Managed Grafana
    - Amazon Managed Prometheus
- Security:
  - Network
  - Policy
EKSCTL
- Refer Here for official docs
- eksctl can be used with commands and YAML files
- eksctl config YAML schema
- Installation
- Creating an ingress controller
- EKS supports Karpenter natively
- Important annotations
| Annotation Key | Purpose / Usage | Scope / Context |
|---|---|---|
| `service.beta.kubernetes.io/aws-load-balancer-type` | Specifies the type of load balancer (e.g., `external`) | Kubernetes Service of `type: LoadBalancer` |
| `service.beta.kubernetes.io/aws-load-balancer-nlb-target-type` | Determines NLB target mode: `instance` or `ip` | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-name` | Sets a custom name for the AWS load balancer | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-internal` (deprecated) | Marks the load balancer as internal; now replaced by `aws-load-balancer-scheme` | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-scheme` | Defines the load balancer scheme: `internal` or `internet-facing` | Kubernetes Service |
| `service.beta.kubernetes.io/load-balancer-source-ranges` (deprecated) | Restricts access by CIDR ranges; use `.spec.loadBalancerSourceRanges` instead | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol` | Sets the health-check protocol: `TCP`, `HTTP`, etc. | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-healthcheck-path` | Specifies the HTTP path for health checks (e.g., `/healthz`) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-healthcheck-port` | Defines the port for health checks (`traffic-port` or an explicit port) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-subnets` | Indicates which subnets (by ID or name) to place the load balancer in | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-ssl-cert` | ACM certificate ARN for HTTPS/TLS termination | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-ssl-ports` | Specifies which ports to serve over SSL/TLS (`*` or a list) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-target-group-attributes` | Sets target group attributes (e.g., stickiness) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-target-node-labels` | Filters targets via node labels | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-proxy-protocol` | Enables PROXY protocol support (value: `"*"`) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-attributes` | Defines advanced load balancer attributes (e.g., `deletion_protection.enabled=true`) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-alpn-policy` | Sets the ALPN policy (e.g., `HTTP2Optional`) | Kubernetes Service |
| `alb.ingress.kubernetes.io/load-balancer-name` | Custom name for an ALB created via Ingress | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/group.name` | Defines an IngressGroup name for grouping ALBs | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/group.order` | Sets Ingress ordering within a group | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/tags` | Applies tags to the load balancer in `key=value` format | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/ip-address-type` | Chooses the IP type: `ipv4` or `dualstack` | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/scheme` | Marks the ALB as `internal` or `internet-facing` | AWS Load Balancer Controller |
| `rbac.authorization.kubernetes.io/autoupdate` | Used on RBAC ClusterRole/ClusterRoleBinding for automatic updates | General Kubernetes RBAC (used in EKS) |
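To make the table concrete, a Service of `type: LoadBalancer` combining several of these annotations might look like the sketch below; the app label, ports, and health-check path are illustrative assumptions, not values from the notes above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    # "external" hands provisioning to the AWS Load Balancer Controller
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
spec:
  type: LoadBalancer
  selector:
    app: web          # hypothetical label; must match your pods
  ports:
    - port: 80
      targetPort: 8080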
GitOps
- This is typically managed by an automated agent like Argo CD or Flux running in your environment
- Git is the single source of truth:
  - Define the desired state (manifests/Helm charts) in Git
  - Commit & merge the PR
  - Automatic reconciliation happens
- Installing Argo CD: Refer Here
- CRD YAML for an Argo CD Application:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nginx
spec:
  project: default
  destination:
    namespace: ''
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/asquarezone/KubernetesZone.git
    path: Jun25/k8s/argo/deployment
    targetRevision: HEAD
  syncPolicy:
    automated:
      enabled: true
      prune: false
      selfHeal: false
```
K8s troubleshooting guide

```mermaid
flowchart TD
    A[Pod unhealthy / feature broken] --> B{Status?}
    B -->|Pending| SCHED[Check scheduling]
    B -->|ContainerCreating| CC[Check image, volumes, CNI]
    B -->|CrashLoopBackOff| CRASH[Check app logs & probes]
    B -->|ImagePullBackOff/ErrImagePull| IMG[Check image & pull secret]
    B -->|Running but failing| NET[Check Service/Endpoints/DNS]
    B -->|Terminating/Evicted| NODE[Node pressure or finalizers]
    B -->|Completed stuck| JOB[Job/CronJob policy/events]
    B -->|Forbidden/Unauthorized| RBAC[RBAC / tokens / PSS]
    B -->|LB/Ingress broken| ING[Controller, class, rules, TLS]
```
1) Pod lifecycle & frequent errors
| Pod status / error | What it means | Confirm with | Common fixes |
|---|---|---|---|
| `Pending` with event `0/X nodes are available (Insufficient cpu/memory/ephemeral-storage)` | Scheduler can't place it | `kubectl describe pod` (check events); `kubectl get nodes -o wide`; `kubectl get resourcequota -A` | Reduce requests/limits, free node resources, scale the cluster, remove blocking taints, adjust nodeSelector/affinity, fix PDBs |
| `Pending` with `pod has unbound immediate PersistentVolumeClaims` | PVC can't bind | `kubectl get pvc -o wide`; `kubectl describe pvc` | Create a matching PV or fix the StorageClass; set proper accessModes, storage size, topology/zone |
| `ContainerCreating` | Kubelet preparing the sandbox | `kubectl describe pod`; node kubelet logs | Fix CNI install, image pull timeouts, mount errors, permission/SELinux issues (use `securityContext` / `fsGroup`) |
| `ImagePullBackOff` / `ErrImagePull` | Image cannot be pulled | `kubectl describe pod` events | Fix image name/tag/registry, network/DNS to the registry, create `imagePullSecrets`, update `imagePullPolicy` |
| `CrashLoopBackOff` | Process exits repeatedly | `kubectl logs POD -c CNT --previous` | Fix command/args, config/secret, missing files/permissions, dependency endpoints; relax probe timings if too aggressive |
| `CreateContainerConfigError` | Bad pod spec/config | `kubectl describe pod` | Missing ConfigMap/Secret, wrong keys, bad env var refs, invalid volumeMount path |
| `RunContainerError` | Runtime couldn't start the container | `kubectl describe pod`; node containerd/dockerd logs | Port in use, invalid capabilities, seccomp/AppArmor/SELinux denials |
| `OOMKilled` | Exceeded memory limit | `kubectl get pod -o wide`, `kubectl logs` | Increase the memory limit or reduce usage; add heap controls; right-size requests/limits |
| `Evicted` (DiskPressure/MemoryPressure/PIDPressure) | Node under pressure | `kubectl describe node` | Free disk (`/var/lib/containerd`), tune image GC, increase node size, fix noisy neighbors |
| Terminating forever | Finalizer or volume stuck | `kubectl get pod -o yaml` | Remove stale finalizers (carefully), unmount volumes, restart kubelet if needed |
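For the `ImagePullBackOff` row, pulling from a private registry usually needs a docker-registry Secret referenced from the pod. A minimal sketch; the registry, image, and secret names are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
    - name: regcred   # e.g., created with: kubectl create secret docker-registry regcred ...
  containers:
    - name: app
      image: registry.example.com/team/app:1.0.0   # hypothetical private image
      imagePullPolicy: IfNotPresent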
2) Scheduling & node issues
- Events show `node(s) had taint {...} that the pod didn't tolerate` → add matching tolerations or remove the taint from the node.
- Affinity/anti-affinity / topology spread prevent placement → verify the labels on nodes and the rules in the pod.
- ResourceQuota/LimitRange violations → `kubectl describe quota,limitrange -n <ns>`; adjust requests/limits or quotas.
- Node `NotReady` → `kubectl describe node`; fix kubelet, CNI, time skew, certificate expiry, disk full.
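To illustrate the taint case above: a pod tolerating a hypothetical `dedicated=gpu:NoSchedule` taint would carry a matching toleration (taint key/value and image are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  tolerations:            # must match the node taint's key, value, and effect
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: app
      image: nginx:stable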
3) Networking & Service/Ingress gotchas
| Symptom | Checks | Fixes |
|---|---|---|
| Service has no endpoints | `kubectl get endpoints <svc> -o wide` | Ensure pod labels match the Service selector; pod readiness must be passing |
| DNS failing | From a debug pod: `nslookup kubernetes.default` | Check CoreDNS logs, node DNS, kube-proxy/iptables; restart CoreDNS if wedged |
| NodePort/LoadBalancer unreachable | Cloud LB / firewall / security groups | Open the firewall, ensure correct `externalTrafficPolicy` and health checks; for local clusters use Ingress/port-forward |
| Ingress not working | `kubectl get ingressclass`, controller pods/logs | Install an Ingress controller; set the proper `ingressClassName`, host/path rules, TLS secret names |
| HostPort conflict | Node logs show the port in use | Remove `hostPort` or change its value; avoid `hostPort` unless necessary |
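The "Service has no endpoints" row almost always comes down to a selector/label mismatch. A sketch of a matching pair (names and labels are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # must match the pod template labels exactly
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web      # a typo here leaves the Service with zero endpoints
    spec:
      containers:
        - name: web
          image: nginx:stable
          ports:
            - containerPort: 80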
4) Storage & volume errors
- PVC stuck `Pending`: no PV/StorageClass match, wrong `accessModes` (e.g., `ReadWriteOnce` on multi-node), zone mismatch, or a missing default StorageClass.
- Mount errors (`timeout waiting for device`, `formatting failed`) → check the CSI driver/node plugin logs; validate `fsGroup`, `volumeMode` (Block vs Filesystem), and app user permissions.
- Multi-attach errors: some volumes are single-writer only; use an appropriate access mode or switch to an RWX-capable storage class (e.g., NFS/FSx/File/Filestore).
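A sketch of a StorageClass and PVC pair for EKS, assuming the EBS CSI driver is installed; the class name `gp3` and sizes are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # delays binding until scheduling, avoiding zone mismatches
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: gp3
  accessModes:
    - ReadWriteOnce   # EBS is single-node; use EFS for ReadWriteMany
  resources:
    requests:
      storage: 10Gi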
5) RBAC / Auth / Policy
- `Forbidden: user ... cannot get/list ...` → `kubectl auth can-i get pods --as <user> -n <ns>`; create a proper Role/ClusterRole and RoleBinding/ClusterRoleBinding to the ServiceAccount.
- ServiceAccount token issues: ensure the pod's `serviceAccountName` exists; projected tokens in older clusters may need `automountServiceAccountToken`.
- Pod Security (PSS) / Gatekeeper / Kyverno: errors like `violates PodSecurity "restricted:latest"` → add a compliant `securityContext` (drop caps, no hostPath, runAsNonRoot), or adjust namespace labels/policies.
- Admission webhooks failing: timeouts or TLS errors → check `ValidatingWebhookConfiguration`/`MutatingWebhookConfiguration` and the webhooks' Service/DNS/certs.
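A minimal Role/RoleBinding pair for the `Forbidden` case above, granting read-only pod access to a ServiceAccount; the namespace and names are placeholders:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]          # "" is the core API group (pods, services, ...)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
  - kind: ServiceAccount
    name: app-sa             # hypothetical ServiceAccount
    namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Verify with `kubectl auth can-i list pods --as system:serviceaccount:dev:app-sa -n dev`.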
6) Probes & app readiness
- Readiness/liveness failing → `kubectl describe pod` events; `kubectl logs`. Common fixes: increase `initialDelaySeconds`, widen `timeoutSeconds`, ensure the app listens on the same port/path as the probe, and make the probe idempotent.
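A sketch of sane probe settings on a pod; the path and timings are assumptions to tune for your app, not universal values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: nginx:stable
      ports:
        - containerPort: 80
      readinessProbe:
        httpGet:
          path: /           # must be a path the app actually serves
          port: 80
        initialDelaySeconds: 10
        timeoutSeconds: 3
      livenessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 30   # give the app time to start before restarts kick in
        periodSeconds: 10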
7) Jobs / CronJobs
- `BackoffLimitExceeded`: the Job crashed too many times → fix the container command or resource limits; raise `backoffLimit` if appropriate.
- CronJobs not running: check `suspend: false`, `concurrencyPolicy`, the timezone, and controller logs; verify the schedule format (`*/5 * * * *`, etc.).
- Missed schedules (controller downtime) → check `.status.lastScheduleTime` and events; adjust `startingDeadlineSeconds`.
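The CronJob fields named above fit together as in this sketch (the job name, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup
spec:
  schedule: "*/5 * * * *"          # every 5 minutes
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  startingDeadlineSeconds: 120     # tolerate brief controller downtime
  jobTemplate:
    spec:
      backoffLimit: 3              # retries before BackoffLimitExceeded
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: busybox:stable
              command: ["sh", "-c", "echo cleaning up"]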
8) Autoscaling & metrics
- HPA shows `missing request for cpu` → set CPU (and/or memory) requests on the target containers.
- HPA says `unable to get metrics for resource` → install/repair metrics-server and confirm TLS/aggregation settings.
- Cluster Autoscaler won't scale: the unschedulable reason is not fixable by adding nodes (e.g., a PVC zone constraint), or scaling is disabled for the node group.
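A sketch of an `autoscaling/v2` HPA targeting a hypothetical Deployment named `web`; it only works if that Deployment's containers set CPU requests, per the first bullet above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # target's containers must declare resources.requests.cpu
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70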
9) Control-plane & etcd (self-managed clusters)
- `etcdserver: mvcc: database space exceeded` → compact & defragment etcd, prune events; increase the quota.
- API 429 / rate limiting → add client backoff, tighten watches, reduce controller churn.
- Cert expiry / clock skew → rotate certs, enable NTP.
10) Kubeconfig / context / client
- `x509: certificate signed by unknown authority` or `Unauthorized` → wrong context/cluster CA; refresh the kubeconfig (e.g., `aws eks update-kubeconfig`, `gcloud container clusters get-credentials`).
- `i/o timeout` → the network path to the API server is blocked; check proxies, VPN, firewall.
Core commands you’ll use every time

```shell
# Pods & events
kubectl get pods -A -o wide
kubectl describe pod <pod> -n <ns>
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50

# Logs (current / previous crash) for a specific container
kubectl logs <pod> -n <ns> -c <container>
kubectl logs <pod> -n <ns> -c <container> --previous

# Services / endpoints / ingress
kubectl get svc,ep,ing -A -o wide

# PVC/PV/SC
kubectl get pvc,pv,storageclass -A
kubectl describe pvc <pvc> -n <ns>

# RBAC quick test
kubectl auth can-i create pods --as system:serviceaccount:<ns>:<sa> -n <ns>

# Node health
kubectl get nodes -o wide
kubectl describe node <node>
```
A tiny “debug pod” you can drop anywhere

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: net-debug
spec:
  containers:
    - name: dbg
      image: busybox:stable
      command: ["sh", "-c", "sleep 1d"]
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534    # busybox defaults to root; runAsNonRoot alone would fail with CreateContainerConfigError
  restartPolicy: Never
```

Then `kubectl exec -it net-debug -- sh` to run `nslookup`, `wget`, `ping`, etc. (busybox does not ship `curl`).
