Elastic Kubernetes Service
- Compute choices:
  - Managed node groups
  - Fargate (serverless pods)
  - Karpenter (or EKS Auto Mode)
- Network choices:
  - VPC CNI
  - Ingress/Load balancers
  - Cilium on EKS
- Identity & access:
  - EKS Pod Identity: associate IAM roles to service accounts
  - Native
- Storage:
  - EBS CSI driver
  - EFS CSI driver
  - S3 CSI driver
- Observability:
  - Control plane logs: CloudWatch + Fluent Bit
  - Metrics:
    - Amazon Managed Grafana
    - Amazon Managed Prometheus
- Security:
  - Network
  - Policy
EKSCTL
- Refer Here for official docs
- eksctl can be used with commands and YAML files
- eksctl config YAML schema
- Installation
- Creating an ingress controller
- EKS supports Karpenter natively
- Important annotations
| Annotation Key | Purpose / Usage | Scope / Context |
|---|---|---|
| `service.beta.kubernetes.io/aws-load-balancer-type` | Specifies the type of load balancer (e.g., `external`) | Kubernetes Service of `type: LoadBalancer` |
| `service.beta.kubernetes.io/aws-load-balancer-nlb-target-type` | Determines NLB target mode: `instance` or `ip` | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-name` | Sets a custom name for the AWS load balancer | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-internal` (deprecated) | Marks the load balancer as internal; now replaced by `aws-load-balancer-scheme` | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-scheme` | Defines the load balancer scheme: `internal` or `internet-facing` | Kubernetes Service |
| `service.beta.kubernetes.io/load-balancer-source-ranges` (deprecated) | Restricts access by CIDR ranges; use `.spec.loadBalancerSourceRanges` instead | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol` | Sets the health-check protocol: `TCP`, `HTTP`, etc. | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-healthcheck-path` | Specifies the HTTP path for health checks (e.g., `/healthz`) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-healthcheck-port` | Defines the port for health checks (`traffic-port` or an explicit port) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-subnets` | Indicates which subnets (by ID or name) to place the load balancer in | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-ssl-cert` | ACM certificate ARN for HTTPS/TLS termination | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-ssl-ports` | Specifies which ports to serve over SSL/TLS (`*` or a list) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-target-group-attributes` | Sets target group attributes (e.g., stickiness) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-target-node-labels` | Filters targets via node labels | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-proxy-protocol` | Enables PROXY protocol support (value: `"*"`) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-attributes` | Defines advanced load balancer attributes (e.g., `deletion_protection.enabled=true`) | Kubernetes Service |
| `service.beta.kubernetes.io/aws-load-balancer-alpn-policy` | Sets the ALPN policy (e.g., `HTTP2Optional`) | Kubernetes Service |
| `alb.ingress.kubernetes.io/load-balancer-name` | Custom name for an ALB created via Ingress | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/group.name` | Defines an IngressGroup name for grouping ALBs | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/group.order` | Sets Ingress ordering within a group | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/tags` | Applies tags to the load balancer in `key=value` format | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/ip-address-type` | Chooses the IP type: `ipv4` or `dualstack` | AWS Load Balancer Controller |
| `alb.ingress.kubernetes.io/scheme` | Marks the ALB as `internal` or `internet-facing` | AWS Load Balancer Controller |
| `rbac.authorization.kubernetes.io/autoupdate` | Used on RBAC ClusterRole/ClusterRoleBinding for automatic updates | General Kubernetes RBAC (used in EKS) |
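To make the table concrete, a Service of `type: LoadBalancer` combining several of these annotations might look like the sketch below; the app label, ports, and health-check path are illustrative assumptions, not values from the notes above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    # "external" hands provisioning to the AWS Load Balancer Controller
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
spec:
  type: LoadBalancer
  selector:
    app: web          # hypothetical label; must match your pods
  ports:
    - port: 80
      targetPort: 8080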
GitOps
- This is typically managed by an automated agent like Argo CD or Flux running in your environment
- Git is the single source of truth:
  - Define the desired state (manifests/Helm charts) in Git
  - Commit & merge the PR
  - Automatic reconciliation happens
- Installing Argo CD: Refer Here
- CRD YAML for an Argo CD Application:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nginx
spec:
  project: default
  destination:
    namespace: ''
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/asquarezone/KubernetesZone.git
    path: Jun25/k8s/argo/deployment
    targetRevision: HEAD
  syncPolicy:
    automated:
      enabled: true
      prune: false
      selfHeal: false
```
K8s troubleshooting guide

```mermaid
flowchart TD
    A[Pod unhealthy / feature broken] --> B{Status?}
    B -->|Pending| SCHED[Check scheduling]
    B -->|ContainerCreating| CC[Check image, volumes, CNI]
    B -->|CrashLoopBackOff| CRASH[Check app logs & probes]
    B -->|ImagePullBackOff/ErrImagePull| IMG[Check image & pull secret]
    B -->|Running but failing| NET[Check Service/Endpoints/DNS]
    B -->|Terminating/Evicted| NODE[Node pressure or finalizers]
    B -->|Completed stuck| JOB[Job/CronJob policy/events]
    B -->|Forbidden/Unauthorized| RBAC[RBAC / tokens / PSS]
    B -->|LB/Ingress broken| ING[Controller, class, rules, TLS]
```
1) Pod lifecycle & frequent errors
| Pod status / error | What it means | Confirm with | Common fixes |
|---|---|---|---|
| `Pending` with event `0/X nodes are available (Insufficient cpu/memory/ephemeral-storage)` | Scheduler can't place it | `kubectl describe pod` (check events); `kubectl get nodes -o wide`; `kubectl get resourcequota -A` | Reduce requests/limits, free node resources, scale the cluster, remove blocking taints, adjust nodeSelector/affinity, fix PDBs |
| `Pending` with `pod has unbound immediate PersistentVolumeClaims` | PVC can't bind | `kubectl get pvc -o wide`; `kubectl describe pvc` | Create a matching PV or fix the StorageClass; set proper accessModes, storage size, topology/zone |
| `ContainerCreating` | Kubelet preparing the sandbox | `kubectl describe pod`; node kubelet logs | Fix CNI install, image pull timeouts, mount errors, permission/SELinux issues (use `securityContext` / `fsGroup`) |
| `ImagePullBackOff` / `ErrImagePull` | Image cannot be pulled | `kubectl describe pod` events | Fix image name/tag/registry, network/DNS to the registry, create `imagePullSecrets`, update `imagePullPolicy` |
| `CrashLoopBackOff` | Process exits repeatedly | `kubectl logs POD -c CNT --previous` | Fix command/args, config/secret, missing files/permissions, dependency endpoints; relax probe timings if too aggressive |
| `CreateContainerConfigError` | Bad pod spec/config | `kubectl describe pod` | Missing ConfigMap/Secret, wrong keys, bad env var refs, invalid volumeMount path |
| `RunContainerError` | Runtime couldn't start the container | `kubectl describe pod`; node containerd/dockerd logs | Port in use, invalid capabilities, seccomp/AppArmor/SELinux denials |
| `OOMKilled` | Exceeded memory limit | `kubectl get pod -o wide`, `kubectl logs` | Increase the memory limit or reduce usage; add heap controls; right-size requests/limits |
| `Evicted` (DiskPressure/MemoryPressure/PIDPressure) | Node under pressure | `kubectl describe node` | Free disk (`/var/lib/containerd`), tune image GC, increase node size, fix noisy neighbors |
| Terminating forever | Finalizer or volume stuck | `kubectl get pod -o yaml` | Remove stale finalizers (carefully), unmount volumes, restart kubelet if needed |
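For the `ImagePullBackOff` row, pulling from a private registry usually needs a docker-registry Secret referenced from the pod. A minimal sketch; the registry, image, and secret names are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
    - name: regcred   # e.g., created with: kubectl create secret docker-registry regcred ...
  containers:
    - name: app
      image: registry.example.com/team/app:1.0.0   # hypothetical private image
      imagePullPolicy: IfNotPresent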
2) Scheduling & node issues
- Events show `node(s) had taint {...} that the pod didn't tolerate` → add matching tolerations or remove the taint from the node.
- Affinity/anti-affinity / topology spread prevent placement → verify the labels on nodes and the rules in the pod.
- ResourceQuota/LimitRange violations → `kubectl describe quota,limitrange -n <ns>`; adjust requests/limits or quotas.
- Node `NotReady` → `kubectl describe node`; fix kubelet, CNI, time skew, certificate expiry, disk full.
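To illustrate the taint case above: a pod tolerating a hypothetical `dedicated=gpu:NoSchedule` taint would carry a matching toleration (taint key/value and image are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod
spec:
  tolerations:            # must match the node taint's key, value, and effect
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: app
      image: nginx:stable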
3) Networking & Service/Ingress gotchas
| Symptom | Checks | Fixes |
|---|---|---|
| Service has no endpoints | `kubectl get endpoints <svc> -o wide` | Ensure pod labels match the Service selector; pod readiness must be passing |
| DNS failing | From a debug pod: `nslookup kubernetes.default` | Check CoreDNS logs, node DNS, kube-proxy/iptables; restart CoreDNS if wedged |
| NodePort/LoadBalancer unreachable | Cloud LB / firewall / security groups | Open the firewall, ensure correct `externalTrafficPolicy` and health checks; for local clusters use Ingress/port-forward |
| Ingress not working | `kubectl get ingressclass`, controller pods/logs | Install an Ingress controller; set the proper `ingressClassName`, host/path rules, TLS secret names |
| HostPort conflict | Node logs show the port in use | Remove `hostPort` or change its value; avoid `hostPort` unless necessary |
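The "Service has no endpoints" row almost always comes down to a selector/label mismatch. A sketch of a matching pair (names and labels are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web          # must match the pod template labels exactly
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web      # a typo here leaves the Service with zero endpoints
    spec:
      containers:
        - name: web
          image: nginx:stable
          ports:
            - containerPort: 80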
4) Storage & volume errors
- PVC stuck `Pending`: no PV/StorageClass match, wrong `accessModes` (e.g., `ReadWriteOnce` on multi-node), zone mismatch, or a missing default StorageClass.
- Mount errors (`timeout waiting for device`, `formatting failed`) → check the CSI driver/node plugin logs; validate `fsGroup`, `volumeMode` (Block vs Filesystem), and app user permissions.
- Multi-attach errors: some volumes are single-writer only; use an appropriate access mode or switch to an RWX-capable storage class (e.g., NFS/FSx/File/Filestore).
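A sketch of a StorageClass and PVC pair for EKS, assuming the EBS CSI driver is installed; the class name `gp3` and sizes are illustrative:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # delays binding until scheduling, avoiding zone mismatches
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: gp3
  accessModes:
    - ReadWriteOnce   # EBS is single-node; use EFS for ReadWriteMany
  resources:
    requests:
      storage: 10Gi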
5) RBAC / Auth / Policy
- `Forbidden: user ... cannot get/list ...` → `kubectl auth can-i get pods --as <user> -n <ns>`; create a proper Role/ClusterRole and RoleBinding/ClusterRoleBinding to the ServiceAccount.
- ServiceAccount token issues: ensure the pod's `serviceAccountName` exists; projected tokens in older clusters may need `automountServiceAccountToken`.
- Pod Security (PSS) / Gatekeeper / Kyverno: errors like `violates PodSecurity "restricted:latest"` → add a compliant `securityContext` (drop caps, no hostPath, runAsNonRoot), or adjust namespace labels/policies.
- Admission webhooks failing: timeouts or TLS errors → check `ValidatingWebhookConfiguration`/`MutatingWebhookConfiguration` and the webhooks' Service/DNS/certs.
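A minimal Role/RoleBinding pair for the `Forbidden` case above, granting read-only pod access to a ServiceAccount; the namespace and names are placeholders:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]          # "" is the core API group (pods, services, ...)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
  - kind: ServiceAccount
    name: app-sa             # hypothetical ServiceAccount
    namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Verify with `kubectl auth can-i list pods --as system:serviceaccount:dev:app-sa -n dev`.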
6) Probes & app readiness
- Readiness/liveness failing → `kubectl describe pod` events; `kubectl logs`. Common fixes: increase `initialDelaySeconds`, widen `timeoutSeconds`, ensure the app listens on the same port/path as the probe, and make the probe idempotent.
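A sketch of sane probe settings on a pod; the path and timings are assumptions to tune for your app, not universal values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: nginx:stable
      ports:
        - containerPort: 80
      readinessProbe:
        httpGet:
          path: /           # must be a path the app actually serves
          port: 80
        initialDelaySeconds: 10
        timeoutSeconds: 3
      livenessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 30   # give the app time to start before restarts kick in
        periodSeconds: 10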
7) Jobs / CronJobs
- `BackoffLimitExceeded`: the Job crashed too many times → fix the container command or resource limits; raise `backoffLimit` if appropriate.
- CronJobs not running: check `suspend: false`, `concurrencyPolicy`, the timezone, and controller logs; verify the schedule format (`*/5 * * * *`, etc.).
- Missed schedules (controller downtime) → check `.status.lastScheduleTime` and events; adjust `startingDeadlineSeconds`.
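The CronJob fields named above fit together as in this sketch (the job name, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup
spec:
  schedule: "*/5 * * * *"          # every 5 minutes
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  startingDeadlineSeconds: 120     # tolerate brief controller downtime
  jobTemplate:
    spec:
      backoffLimit: 3              # retries before BackoffLimitExceeded
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: busybox:stable
              command: ["sh", "-c", "echo cleaning up"]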
8) Autoscaling & metrics
- HPA shows `missing request for cpu` → set CPU (and/or memory) requests on the target containers.
- HPA says `unable to get metrics for resource` → install/repair metrics-server and confirm TLS/aggregation settings.
- Cluster Autoscaler won't scale: the unschedulable reason is not fixable by adding nodes (e.g., a PVC zone constraint), or scaling is disabled for the node group.
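A sketch of an `autoscaling/v2` HPA targeting a hypothetical Deployment named `web`; it only works if that Deployment's containers set CPU requests, per the first bullet above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # target's containers must declare resources.requests.cpu
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70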
9) Control-plane & etcd (self-managed clusters)
- `etcdserver: mvcc: database space exceeded` → compact & defragment etcd, prune events; increase the quota.
- API 429 / rate limiting → add client backoff, tighten watches, reduce controller churn.
- Cert expiry / clock skew → rotate certs, enable NTP.
10) Kubeconfig / context / client
- `x509: certificate signed by unknown authority` or `Unauthorized` → wrong context/cluster CA; refresh the kubeconfig (e.g., `aws eks update-kubeconfig`, `gcloud container clusters get-credentials`).
- `i/o timeout` → the network path to the API server is blocked; check proxies, VPN, firewall.
Core commands you’ll use every time

```shell
# Pods & events
kubectl get pods -A -o wide
kubectl describe pod <pod> -n <ns>
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50

# Logs (current / previous crash) for a specific container
kubectl logs <pod> -n <ns> -c <container>
kubectl logs <pod> -n <ns> -c <container> --previous

# Services / endpoints / ingress
kubectl get svc,ep,ing -A -o wide

# PVC/PV/SC
kubectl get pvc,pv,storageclass -A
kubectl describe pvc <pvc> -n <ns>

# RBAC quick test
kubectl auth can-i create pods --as system:serviceaccount:<ns>:<sa> -n <ns>

# Node health
kubectl get nodes -o wide
kubectl describe node <node>
```
A tiny “debug pod” you can drop anywhere

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: net-debug
spec:
  containers:
    - name: dbg
      image: busybox:stable
      command: ["sh", "-c", "sleep 1d"]
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534    # busybox defaults to root; runAsNonRoot alone would fail with CreateContainerConfigError
  restartPolicy: Never
```

Then `kubectl exec -it net-debug -- sh` to run `nslookup`, `wget`, `ping`, etc. (busybox does not ship `curl`).
