Kubernetes Classroom notes 24/Aug/2025

Elastic Kubernetes Service (EKS)

  • Compute choices:
    • Managed node groups
    • Fargate (serverless pods)
    • Karpenter (or EKS Auto Mode)
  • Network choices:
    • VPC CNI
    • Ingress/Load balancers
    • Cilium on EKS
  • Identity & access:
    • EKS Pod Identity: associate IAM roles with service accounts
    • Native
  • Storage:
    • EBS CSI driver
    • EFS CSI Driver
    • S3 CSI Driver
  • Observability:
    • Control plane logs: CloudWatch + Fluent Bit
    • Metrics:
      • Amazon Managed Grafana
      • Amazon Managed Service for Prometheus
  • Security:
    • Network
    • Policy

EKSCTL
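
As a rough illustration of how the compute choices above map onto eksctl, here is a minimal ClusterConfig sketch. The cluster name, region, instance type, sizes, and the serverless namespace are all hypothetical placeholders, not values from these notes.

```yaml
# Minimal eksctl config sketch; every name and size below is an assumption.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster        # hypothetical cluster name
  region: us-east-1         # hypothetical region
managedNodeGroups:          # "Managed node groups" compute choice
  - name: general
    instanceType: t3.medium
    minSize: 1
    maxSize: 3
    desiredCapacity: 2
fargateProfiles:            # "Fargate" compute choice
  - name: fp-serverless
    selectors:
      - namespace: serverless   # pods in this namespace run on Fargate
```

Assuming the file is saved as cluster.yaml, it would be applied with eksctl create cluster -f cluster.yaml.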

| Annotation Key | Purpose / Usage | Scope / Context |
| --- | --- | --- |
| service.beta.kubernetes.io/aws-load-balancer-type | Specifies the type of load balancer (e.g., external) | Kubernetes Service of type: LoadBalancer (Kubernetes SIGs, Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-nlb-target-type | Determines NLB target mode: instance or ip | Kubernetes Service (AWS Documentation, Kubernetes SIGs) |
| service.beta.kubernetes.io/aws-load-balancer-name | Sets a custom name for the AWS Load Balancer | Kubernetes Service (Kubernetes SIGs) |
| service.beta.kubernetes.io/aws-load-balancer-internal (deprecated) | Marks the Load Balancer as internal; now replaced by aws-load-balancer-scheme | Kubernetes Service (Kubernetes SIGs) |
| service.beta.kubernetes.io/aws-load-balancer-scheme | Defines Load Balancer scheme: internal or internet-facing | Kubernetes Service (AWS Documentation, Kubernetes SIGs) |
| service.beta.kubernetes.io/load-balancer-source-ranges (deprecated) | Restricts access by CIDR ranges; use .spec.loadBalancerSourceRanges instead | Kubernetes Service (AWS Documentation, Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol | Sets health-check protocol: TCP, HTTP, etc. | Kubernetes Service (AWS Documentation) |
| service.beta.kubernetes.io/aws-load-balancer-healthcheck-path | Specifies HTTP path for health checks (e.g., /healthz) | Kubernetes Service (AWS Documentation) |
| service.beta.kubernetes.io/aws-load-balancer-healthcheck-port | Defines the port for health checks (traffic-port or explicit port) | Kubernetes Service (AWS Documentation) |
| service.beta.kubernetes.io/aws-load-balancer-subnets | Indicates the subnets (by IDs or names) in which to place the Load Balancer | Kubernetes Service (AWS Documentation, Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-ssl-cert | ARN of the ACM certificate used for HTTPS/TLS termination | Kubernetes Service (AWS Documentation, Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-ssl-ports | Specifies which ports to serve over SSL/TLS (* or list) | Kubernetes Service (AWS Documentation, Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-target-group-attributes | Sets target group attributes (e.g., stickiness) | Kubernetes Service (Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-target-node-labels | Filters targets via node labels | Kubernetes Service (Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-proxy-protocol | Enables PROXY protocol support (value: "*") | Kubernetes Service (Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-attributes | Defines advanced load balancer attributes (e.g., "deletion_protection.enabled=true") | Kubernetes Service (Kubernetes) |
| service.beta.kubernetes.io/aws-load-balancer-alpn-policy | Sets ALPN policy (e.g., HTTP2Optional) | Kubernetes Service (Kubernetes) |
| alb.ingress.kubernetes.io/load-balancer-name | Custom name for ALB created via Ingress | Used by AWS Load Balancer Controller (Kubernetes SIGs) |
| alb.ingress.kubernetes.io/group.name | Defines the IngressGroup name; Ingresses in the same group share one ALB | AWS Load Balancer Controller (Kubernetes SIGs) |
| alb.ingress.kubernetes.io/group.order | Sets Ingress ordering within a group | AWS Load Balancer Controller (Kubernetes SIGs) |
| alb.ingress.kubernetes.io/tags | Applies tags to the Load Balancer in key=value format | AWS Load Balancer Controller (Kubernetes SIGs) |
| alb.ingress.kubernetes.io/ip-address-type | Chooses IP type: ipv4 or dualstack | AWS Load Balancer Controller (Kubernetes SIGs) |
| alb.ingress.kubernetes.io/scheme | Marks ALB as internal or internet-facing | AWS Load Balancer Controller (Kubernetes SIGs) |
| rbac.authorization.kubernetes.io/autoupdate | Controls whether the API server auto-reconciles a default ClusterRole or ClusterRoleBinding at startup | General Kubernetes RBAC (used in EKS) (AWS Documentation) |
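
To show how a few of these annotations combine in practice, here is a hedged sketch of an NLB-backed Service and an ALB-backed Ingress. It assumes the AWS Load Balancer Controller is installed; the resource names, the app: web label, the ports, and the shared group name are hypothetical. The alb.ingress.kubernetes.io/target-type annotation is not in the table above and is included only so the Ingress can route to pod IPs.

```yaml
# Sketch only: names, labels, and ports are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: web-nlb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: HTTP
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /healthz
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
spec:
  type: LoadBalancer
  selector:
    app: web                 # hypothetical pod label
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/group.name: shared      # hypothetical group
    alb.ingress.kubernetes.io/ip-address-type: ipv4
    alb.ingress.kubernetes.io/target-type: ip         # not listed in the table
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web    # hypothetical ClusterIP Service
                port:
                  number: 80
```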

GitOps

  • Cluster state is typically managed by an automated agent, such as Argo CD or Flux, running in your environment
  • Git is the single source of truth
    • Define the desired state (manifests/Helm charts) in Git
    • Commit and merge the PR
    • Automatic reconciliation happens
  • Installing Argo CD: Refer Here
  • Application custom resource YAML for an Argo CD deployment:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nginx
spec:
  destination:
    namespace: ''
    server: https://kubernetes.default.svc
  source:
    path: Jun25/k8s/argo/deployment
    repoURL: https://github.com/asquarezone/KubernetesZone.git
    targetRevision: HEAD
  sources: []
  project: default
  syncPolicy:
    automated:
      prune: false
      selfHeal: false
      enabled: true
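
The manifest above leaves prune and selfHeal off, so Argo CD syncs new commits but does not delete removed resources or revert manual changes. A hedged variant that turns on full automatic reconciliation might look like the following; the argocd and default namespaces are assumptions about the installation, not something stated in these notes.

```yaml
# Variant sketch: same repo and path as above, with drift correction enabled.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nginx
  namespace: argocd            # assumption: Argo CD runs in this namespace
spec:
  project: default
  source:
    repoURL: https://github.com/asquarezone/KubernetesZone.git
    path: Jun25/k8s/argo/deployment
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: default         # assumption: target namespace
  syncPolicy:
    automated:
      prune: true              # delete resources that were removed from Git
      selfHeal: true           # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true   # create the target namespace if missing
```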

Kubernetes troubleshooting guide

flowchart TD
A[Pod unhealthy / feature broken] --> B{Status?}
B -->|Pending| SCHED[Check scheduling]
B -->|ContainerCreating| CC[Check image, volumes, CNI]
B -->|CrashLoopBackOff| CRASH[Check app logs & probes]
B -->|ImagePullBackOff/ErrImagePull| IMG[Check image & pull secret]
B -->|Running but failing| NET[Check Service/Endpoints/DNS]
B -->|Terminating/Evicted| NODE[Node pressure or finalizers]
B -->|Completed stuck| JOB[Job/CronJob policy/events]
B -->|Forbidden/Unauthorized| RBAC[RBAC / tokens / PSS]
B -->|LB/Ingress broken| ING[Controller, class, rules, TLS]

1) Pod lifecycle & frequent errors

| Pod status / error | What it means | Confirm with | Common fixes |
| --- | --- | --- | --- |
| Pending with event 0/X nodes are available (Insufficient cpu/memory/ephemeral-storage) | Scheduler can’t place it | kubectl describe pod, check events; kubectl get nodes -o wide; kubectl get resourcequota -A | Reduce requests, free node resources, scale cluster, remove blocking taints, adjust nodeSelector/affinity, fix PDB. |
| Pending with event pod has unbound immediate PersistentVolumeClaims | PVC can’t bind | kubectl get pvc -o wide; kubectl describe pvc | Create a matching PV or fix StorageClass, set proper accessModes, storage size, topology/zone. |
| ContainerCreating | Kubelet preparing sandbox | kubectl describe pod; node kubelet logs | Fix CNI install, image pull timeouts, mount errors, permission/SELinux (use securityContext / fsGroup). |
| ImagePullBackOff / ErrImagePull | Image cannot be pulled | kubectl describe pod events | Fix image name/tag/registry, network/DNS to registry, create imagePullSecrets, update imagePullPolicy. |
| CrashLoopBackOff | Process exits repeatedly | kubectl logs POD -c CNT --previous | Fix command/args, config/secret, missing files/permissions, dependency endpoints; relax probe timings if too aggressive. |
| CreateContainerConfigError | Bad pod spec/config | kubectl describe pod | Missing ConfigMap/Secret, wrong keys, bad env var refs, invalid volumeMount path. |
| RunContainerError | Runtime couldn’t start container | kubectl describe pod; node containerd/dockerd logs | Port in use, invalid capabilities, seccomp/AppArmor/SELinux denials. |
| OOMKilled | Exceeded memory limit | kubectl describe pod (Last State: Terminated, Reason: OOMKilled) | Increase memory limit or reduce usage; add heap controls; right-size requests/limits. |
| Evicted (DiskPressure/MemoryPressure/PIDPressure) | Node under pressure | kubectl describe node | Free disk (/var/lib/containerd), tune image GC, increase node size, fix noisy neighbors. |
| Terminating forever | Finalizer or volume stuck | kubectl get pod -o yaml | Remove stale finalizers (carefully), unmount volumes, restart kubelet if needed. |
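
Several fixes in the table come down to a few fields in the pod spec. The sketch below shows where requests, the memory limit, and imagePullSecrets live; the image reference, the regcred Secret name, and the resource values are hypothetical and need to be sized for the real workload.

```yaml
# Sketch only: image, Secret name, and sizes are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  imagePullSecrets:
    - name: regcred            # hypothetical docker-registry Secret (ImagePullBackOff fix)
  containers:
    - name: app
      image: registry.example.com/team/web:1.4.2   # hypothetical private image
      resources:
        requests:              # what the scheduler reserves (Pending / Insufficient cpu)
          cpu: 250m
          memory: 256Mi
        limits:
          memory: 512Mi        # exceeding this gets the container OOMKilled
```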

2) Scheduling & node issues

  • Events show: node(s) had taint {...} that the pod didn't tolerate → add matching tolerations or remove taint from node.
  • Affinity/anti-affinity / topology spread prevent placement → verify labels on nodes and the rules in the pod.
  • ResourceQuota/LimitRange violations → kubectl describe quota/limitrange -n <ns>; adjust requests/limits or quotas.
  • Node NotReady → kubectl describe node; fix kubelet, CNI, time skew, certificate expiry, disk full.
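
For the taint and label cases above, a pod spec along these lines is one way to make placement succeed. It assumes a node was tainted with kubectl taint nodes <node> dedicated=batch:NoSchedule and labelled workload=batch; both the taint and the label are hypothetical.

```yaml
# Sketch only: the taint key/value and node label are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  nodeSelector:
    workload: batch            # node must carry this label
  tolerations:
    - key: dedicated           # must match the taint key on the node
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: worker
      image: busybox:stable
      command: ["sh", "-c", "sleep 3600"]
```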

3) Networking & Service/Ingress gotchas

| Symptom | Checks | Fixes |
| --- | --- | --- |
| Service has no endpoints | kubectl get endpoints <svc> -o wide | Ensure pod labels match service selector; pod readiness must be passing. |
| DNS failing | From a debug pod: nslookup kubernetes.default | Check coredns logs, node DNS, kube-proxy/iptables; restart coredns if wedged. |
| NodePort/LoadBalancer unreachable | Cloud LB / firewall / security groups | Open firewall, ensure correct externalTrafficPolicy, health checks; for local clusters use Ingress/port-forward. |
| Ingress not working | kubectl get ingressclass, controller pods/logs | Install an Ingress controller, set proper ingressClassName, host/path rules, TLS secret names. |
| HostPort conflict | Node logs show port in use | Remove hostPort or change value; avoid hostPort unless necessary. |
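
The "no endpoints" row is usually a label mismatch. In the sketch below, the Service selector and the pod template labels both use app: web, and the readiness probe must pass before the pod is listed as an endpoint. The names, image, and port are hypothetical.

```yaml
# Sketch only: names, image, and port are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web               # must match the Service selector below
    spec:
      containers:
        - name: web
          image: nginx:stable
          ports:
            - containerPort: 80
          readinessProbe:      # only Ready pods become endpoints
            httpGet:
              path: /
              port: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                   # same label as the pod template
  ports:
    - port: 80
      targetPort: 80
```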

4) Storage & volume errors

  • PVC stuck Pending: No PV/SC match, wrong accessModes (e.g., ReadWriteOnce on multi-node), zone mismatch, or missing default StorageClass.
  • Mount errors (timeout waiting for device, formatting failed) → check driver CSIDriver/node plugin logs; validate fsGroup, volumeMode (Block vs Filesystem), and app user permissions.
  • Multi-attach errors: Some volumes are single-writer only—use appropriate access mode or switch to a RWX-capable storage class (e.g., NFS/FSx/File/Filestore).
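
As a sketch of the PVC points above, the following pairs a StorageClass with a matching claim. It assumes the EBS CSI driver is installed; the class name, claim name, and size are hypothetical. WaitForFirstConsumer delays volume creation until the pod is scheduled, which is one common way to avoid zone mismatches.

```yaml
# Sketch only: assumes the EBS CSI driver; names and size are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wffc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # bind in the zone where the pod lands
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: gp3-wffc   # must name an existing StorageClass
  accessModes:
    - ReadWriteOnce            # EBS volumes attach to a single node
  resources:
    requests:
      storage: 10Gi
```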

5) RBAC / Auth / Policy

  • Forbidden: user ... cannot get/list ... → kubectl auth can-i get pods --as <user> -n <ns>; create a proper Role/ClusterRole and bind it to the user or ServiceAccount with a RoleBinding/ClusterRoleBinding.
  • ServiceAccount token issues: Ensure the pod's serviceAccountName refers to an existing ServiceAccount, and that automountServiceAccountToken is not set to false if the pod needs to call the API.
  • Pod Security (PSS) / Gatekeeper / Kyverno: Errors like violates PodSecurity "restricted:latest" → add compliant securityContext (drop caps, no hostPath, runAsNonRoot), or adjust namespace labels/policies.
  • Admission webhooks failing: Timeouts or TLS errors → check ValidatingWebhookConfiguration / MutatingWebhookConfiguration and the webhooks’ Service/DNS/certs.
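
For the Forbidden case, a minimal Role and RoleBinding might look like this. The demo namespace, the app-sa ServiceAccount, and the object names are hypothetical.

```yaml
# Sketch only: namespace, ServiceAccount, and names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: demo
rules:
  - apiGroups: [""]            # "" is the core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: demo
subjects:
  - kind: ServiceAccount
    name: app-sa               # hypothetical ServiceAccount
    namespace: demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```

With these names, kubectl auth can-i list pods --as system:serviceaccount:demo:app-sa -n demo is the quick check that the binding took effect.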

6) Probes & app readiness

  • Readiness/Liveness failing → kubectl describe pod events; kubectl logs. Common fixes: increase initialDelaySeconds, widen timeoutSeconds, ensure the app listens on the same port/path as the probe, and make the probe handler idempotent.
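
The probe settings mentioned above sit on the container spec. The timings below are illustrative placeholders rather than recommendations, and the image, path, and port are assumptions; the values should follow the app's real startup and response times.

```yaml
# Sketch only: image, path, port, and timings are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: probed-app
spec:
  containers:
    - name: app
      image: nginx:stable
      ports:
        - containerPort: 80
      readinessProbe:          # gates traffic from Services
        httpGet:
          path: /
          port: 80             # must be the port the app listens on
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3
      livenessProbe:           # restarts the container when it fails
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 15
        periodSeconds: 20
        timeoutSeconds: 3
        failureThreshold: 3
```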

7) Jobs / CronJobs

  • BackoffLimitExceeded: Job crashed too many times → fix container command or resource limits; raise backoffLimit if appropriate.
  • CronJobs not running: Check suspend: false, concurrencyPolicy, timeZone, and controller logs; verify schedule format (*/5 * * * * etc.).
  • Missed schedules (controller downtime) → check .status.lastScheduleTime and events; adjust startingDeadlineSeconds.
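
The CronJob fields named above fit together roughly as follows. The job name, image, command, and all values are hypothetical; the timeZone field needs a reasonably recent Kubernetes version (it is stable from 1.27).

```yaml
# Sketch only: name, image, command, and values are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report
spec:
  schedule: "*/5 * * * *"      # every 5 minutes
  timeZone: "Etc/UTC"          # stable from Kubernetes 1.27
  suspend: false               # true would pause all future runs
  concurrencyPolicy: Forbid    # skip a run if the previous one is still active
  startingDeadlineSeconds: 120 # how late a missed run may still start
  jobTemplate:
    spec:
      backoffLimit: 3          # retries before BackoffLimitExceeded
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: busybox:stable
              command: ["sh", "-c", "date; echo generating report"]
```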

8) Autoscaling & metrics

  • HPA shows missing request for cpu → set CPU (and/or memory) requests on target containers.
  • HPA says unable to get metrics for resource → install/repair metrics-server and confirm TLS/aggregator.
  • Cluster-autoscaler won’t scale: the pod's unschedulable reason cannot be fixed by adding nodes (e.g., a PVC zone constraint), or scaling is disabled for the node group.
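
For reference, a CPU-based HPA could be sketched like this. It assumes metrics-server is running and that the target Deployment, hypothetically named web, sets CPU requests on its containers; without those requests the utilization target cannot be computed.

```yaml
# Sketch only: target name and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # hypothetical Deployment with CPU requests set
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
```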

9) Control-plane & etcd (self-managed clusters)

  • etcdserver: mvcc: database space exceeded → compact & defrag etcd, prune events; increase quota.
  • API 429 / rate limit → client backoff, tighten watches, reduce controller churn.
  • Cert expiry / clock skew → rotate certs, enable NTP.

10) Kubeconfig / context / client

  • x509: certificate signed by unknown authority or Unauthorized → wrong context/cluster CA; refresh kubeconfig (e.g., aws eks update-kubeconfig, gcloud container clusters get-credentials).
  • i/o timeout → network path to API blocked; check proxies, VPN, firewall.

Core commands you’ll use every time

# Pods & events
kubectl get pods -A -o wide
kubectl describe pod <pod> -n <ns>
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50

# Logs (current / previous crash) + a specific container
kubectl logs <pod> -n <ns> -c <container>
kubectl logs <pod> -n <ns> -c <container> --previous

# Services / endpoints / ingress
kubectl get svc,ep,ing -A -o wide

# PVC/PV/SC
kubectl get pvc,pv,storageclass -A
kubectl describe pvc <pvc> -n <ns>

# RBAC quick test
kubectl auth can-i create pods --as system:serviceaccount:<ns>:<sa> -n <ns>

# Node health
kubectl get nodes -o wide
kubectl describe node <node>

A tiny “debug pod” you can drop anywhere

apiVersion: v1
kind: Pod
metadata:
  name: net-debug
spec:
  containers:
  - name: dbg
    image: busybox:stable
    command: ["sh", "-c", "sleep 1d"]
    # busybox defaults to root, so runAsNonRoot needs an explicit non-root UID
    securityContext: {runAsNonRoot: true, runAsUser: 1000}
  restartPolicy: Never

Then kubectl exec -it net-debug -- sh to run nslookup, wget, etc. (busybox does not ship curl; use an image that includes it if you need curl).


By continuous learner
