Kubernetes Refresher
Comprehensive quick-reference for Kubernetes — pods, deployments, services, networking, storage, RBAC, and operational patterns
0. Setup & Environment
Get a local Kubernetes cluster running on macOS so every example in this guide works immediately.
Prerequisites
Install Docker Desktop — it ships with a built-in single-node cluster:
brew install --cask docker
Enable Kubernetes in Docker Desktop (simplest)
Docker Desktop → Settings → Kubernetes → Enable Kubernetes → Apply & Restart. Wait for the Kubernetes status indicator in the bottom-left to turn green, then verify:
kubectl cluster-info
kubectl get nodes
Alternative: minikube (more control)
minikube gives you a closer-to-production setup and lets you tweak resources, add-ons, and CNI plugins:
brew install minikube
minikube start --driver=docker
minikube status
- minikube dashboard — opens the Kubernetes web UI in your browser
- minikube stop / minikube delete — pause or tear down the cluster
Install kubectl (if not already available)
Docker Desktop includes kubectl, but installing via Homebrew keeps it updated independently:
brew install kubectl
kubectl version --client
Run kubectl version --client to confirm the binary is on your $PATH. If you have both the Docker Desktop kubectl and a Homebrew one, whichever appears first in $PATH wins.
Essential kubectl Config
# See which cluster kubectl is currently talking to
kubectl config current-context
# Switch context (useful when you have multiple clusters)
kubectl config use-context docker-desktop
# Set a default namespace so you don't need -n on every command
kubectl config set-context --current --namespace=default
Helpful Tools
- brew install k9s — terminal UI for Kubernetes; navigate pods, stream logs, and open shells without typing long kubectl commands
- brew install kubectx — adds kubectx (switch clusters) and kubens (switch namespaces) with tab completion
k9s is htop for Kubernetes — a real-time dashboard of pods, deployments, services, and events. Press ? inside k9s for keybindings.
Quick Verify
Deploy a pod, inspect it, and clean up — confirms the full local stack is working:
# Run a test pod
kubectl run hello --image=nginx --port=80
# Confirm it reaches Running status
kubectl get pods
# Forward a local port to the pod
kubectl port-forward pod/hello 8080:80
# Visit http://localhost:8080 — should see the nginx welcome page
# Cleanup
kubectl delete pod hello
Architecture Overview
Kubernetes is a distributed system for automating deployment, scaling, and management of containerized workloads. The cluster is split into the control plane (brain) and worker nodes (muscle).
Control Plane Components
| Component | Role |
|---|---|
| kube-apiserver | The single entry point for all cluster operations. Validates and processes REST requests, writes to etcd. Horizontally scalable. |
| etcd | Distributed key-value store. The only stateful component — all cluster state lives here. Run with 3 or 5 nodes for HA (odd number for quorum). |
| kube-scheduler | Watches for unscheduled pods and assigns them to nodes. Considers resource requests, affinity/taints, topology. Pluggable. |
| kube-controller-manager | Runs control loops: node controller, replication controller, endpoints controller, service account controller, etc. All in one process. |
| cloud-controller-manager | Cloud-specific control loops (load balancer provisioning, node lifecycle, routes). Separates cloud logic from core k8s. |
Worker Node Components
| Component | Role |
|---|---|
| kubelet | Agent on every node. Watches PodSpecs assigned to this node, ensures containers are running and healthy. Reports node/pod status to API server. |
| kube-proxy | Maintains network rules (iptables or IPVS) for Service abstraction. Routes traffic to the correct pod. Can be replaced by CNI-level proxying (Cilium). |
| Container runtime | Executes containers. containerd (default), CRI-O. Dockershim was removed in k8s 1.24, so Docker is no longer a supported runtime (Docker itself runs on containerd underneath). |
Cluster Networking Model
Kubernetes mandates a flat network model:
- Every pod gets its own IP address — no NAT between pods on the same cluster.
- Pods on any node can communicate with pods on any other node without NAT.
- Agents (kubelet) on a node can communicate with all pods on that node.
- CNI plugins (Calico, Cilium, Flannel, Weave) implement this model.
API Request Flow
# What happens when you run: kubectl apply -f deployment.yaml
# 1. kubectl reads kubeconfig (~/.kube/config), finds server URL + credentials
# 2. kubectl serializes the manifest to JSON, sends HTTP PATCH/POST to API server
# 3. API server authenticates (cert, token, webhook) → authorizes (RBAC) → admission controllers
# 4. API server validates the object schema
# 5. API server writes to etcd (two-phase: propose, commit)
# 6. Controllers watch for changes via informers (long-poll on /watch)
# 7. Deployment controller creates/updates ReplicaSet
# 8. ReplicaSet controller creates Pod objects
# 9. Scheduler watches unbound pods, selects node, writes nodeName to Pod spec
# 10. kubelet on chosen node watches for pods assigned to it
# 11. kubelet calls CRI (containerd) to pull image and start container
# 12. kubelet updates Pod status (Running, IP address, etc.)
Core Concepts
Pods
A pod is the smallest deployable unit — a group of one or more containers sharing network (same IP, port space) and storage. Containers in a pod communicate via localhost.
# Minimal pod — rarely created directly in production
apiVersion: v1
kind: Pod
metadata:
name: myapp
namespace: default
labels:
app: myapp
version: "1.0"
spec:
containers:
- name: main
image: nginx:1.25
ports:
- containerPort: 80
resources:
requests:
cpu: "100m" # 0.1 CPU cores
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
---
# Multi-container pod: main app + sidecar
apiVersion: v1
kind: Pod
metadata:
name: app-with-sidecar
spec:
# Init containers run to completion before main containers start
initContainers:
- name: db-migration
image: myapp:migrate
command: ["./migrate", "--up"]
envFrom:
- secretRef:
name: db-credentials
containers:
- name: app
image: myapp:1.0
ports:
- containerPort: 8080
- name: log-shipper # Sidecar: shares filesystem with main
image: fluentbit:2.2
volumeMounts:
- name: app-logs
mountPath: /var/log/app
volumes:
- name: app-logs
emptyDir: {}
Pod Lifecycle Phases
| Phase | Meaning |
|---|---|
| Pending | Pod accepted, but containers not yet running. Scheduling or image pull in progress. |
| Running | At least one container is running, starting, or restarting. |
| Succeeded | All containers exited with status 0 and won't restart (Job pods). |
| Failed | All containers terminated, at least one with non-zero exit code. |
| Unknown | Pod state can't be determined (node communication lost). |
Labels, Selectors, and Annotations
metadata:
labels:
app: myapp # Identifies the application
env: production # Environment
version: "2.1.0" # Release version
tier: frontend # Logical tier
team: payments # Owning team
annotations:
# Annotations: non-identifying metadata, no selector support
# Can hold larger values (URLs, JSON, multi-line strings)
kubernetes.io/change-cause: "Bumped image to fix CVE-2024-1234"
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
deployment.kubernetes.io/revision: "3"
# Label selector syntax
kubectl get pods -l app=myapp
kubectl get pods -l 'env in (production,staging)'
kubectl get pods -l 'version notin (1.0,1.1)'
kubectl get pods -l '!canary' # Does NOT have 'canary' label
kubectl get pods -l app=myapp,env=prod # AND logic
# Set-based selectors (used in Job, Deployment, etc.)
# matchLabels: { app: myapp } # equality
# matchExpressions:
# - { key: env, operator: In, values: [production, staging] }
# - { key: canary, operator: DoesNotExist }
Namespaces
# Built-in namespaces
# default — resources without a namespace
# kube-system — k8s system components
# kube-public — publicly readable, used for cluster info
# kube-node-lease — node heartbeat leases (performance)
kubectl create namespace production
kubectl get namespaces
# Set default namespace for current context
kubectl config set-context --current --namespace=production
# Cross-namespace DNS: <service-name>.<namespace>.svc.cluster.local
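A sketch of cross-namespace discovery, assuming a postgres Service in a production namespace reached from a pod in a different namespace (names are illustrative):

```yaml
# Pod env fragment: the short name "postgres" only resolves inside the
# production namespace; the FQDN resolves from anywhere in the cluster.
env:
  - name: DATABASE_HOST
    value: postgres.production.svc.cluster.local
  - name: DATABASE_PORT
    value: "5432"
```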
Resource Quotas and LimitRanges
# ResourceQuota: caps total resource consumption in a namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
pods: "50"
requests.cpu: "20"
requests.memory: "40Gi"
limits.cpu: "40"
limits.memory: "80Gi"
persistentvolumeclaims: "20"
services.loadbalancers: "5"
---
# LimitRange: sets default/min/max per pod or container
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- type: Container
default: # Applied if container doesn't specify limits
cpu: "500m"
memory: "256Mi"
defaultRequest: # Applied if container doesn't specify requests
cpu: "100m"
memory: "128Mi"
max:
cpu: "4"
memory: "8Gi"
min:
cpu: "50m"
memory: "64Mi"
Workloads
Deployments
Deployments manage stateless applications. They own a ReplicaSet, which owns pods. Rolling updates replace pods one batch at a time.
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp
namespace: production
annotations:
kubernetes.io/change-cause: "Release v2.1 — add payment retry"
spec:
replicas: 3
selector:
matchLabels:
app: myapp # Must match pod template labels
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Extra pods above desired during update
maxUnavailable: 0 # No downtime: always maintain 3 pods
template:
metadata:
labels:
app: myapp
spec:
terminationGracePeriodSeconds: 60
containers:
- name: app
image: myapp:2.1
ports:
- containerPort: 8080
resources:
requests:
cpu: "200m"
memory: "256Mi"
limits:
cpu: "1"
memory: "512Mi"
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
# Rollout commands
kubectl rollout status deployment/myapp
kubectl rollout history deployment/myapp
kubectl rollout history deployment/myapp --revision=3
# Rollback to previous revision
kubectl rollout undo deployment/myapp
kubectl rollout undo deployment/myapp --to-revision=2
# Pause/resume (useful for canary-style manual gates)
kubectl rollout pause deployment/myapp
kubectl rollout resume deployment/myapp
# Scale
kubectl scale deployment myapp --replicas=5
StatefulSets
StatefulSets provide stable network identities (pod-0, pod-1, ...), stable persistent storage (each pod keeps its PVC on reschedule), and ordered deployment/scaling. Use for databases, message queues, distributed caches.
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
serviceName: "postgres" # Headless service name — required for DNS
replicas: 3
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:16
ports:
- containerPort: 5432
name: postgres
env:
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates: # Each pod gets its own PVC
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "fast-ssd"
resources:
requests:
storage: 20Gi
Each StatefulSet pod gets a stable DNS name: pod-name.service-name.namespace.svc.cluster.local. For a StatefulSet named postgres with headless service postgres in namespace default: postgres-0.postgres.default.svc.cluster.local.
DaemonSets
DaemonSets ensure one pod per node (or per selected nodes). Used for log collectors, monitoring agents, CNI plugins, node-level security tooling.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentbit
namespace: logging
spec:
selector:
matchLabels:
app: fluentbit
updateStrategy:
type: RollingUpdate # or OnDelete (manual)
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
app: fluentbit
spec:
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule # Run on control plane nodes too
containers:
- name: fluentbit
image: fluent/fluent-bit:2.2
volumeMounts:
- name: varlog
mountPath: /var/log
readOnly: true
- name: docker-containers
mountPath: /var/lib/docker/containers
readOnly: true
volumes:
- name: varlog
hostPath:
path: /var/log
- name: docker-containers
hostPath:
path: /var/lib/docker/containers
Jobs and CronJobs
# Job: run to completion
apiVersion: batch/v1
kind: Job
metadata:
name: db-migration
spec:
completions: 1 # Number of successful pod completions needed
parallelism: 1 # Number of pods to run in parallel
backoffLimit: 3 # Retry up to 3 times on failure
activeDeadlineSeconds: 600 # Kill job if not done in 10 min
ttlSecondsAfterFinished: 3600 # Auto-clean up 1h after finish
template:
spec:
restartPolicy: Never # OnFailure or Never for Jobs (not Always)
containers:
- name: migrate
image: myapp:2.1
command: ["./migrate", "--up"]
envFrom:
- secretRef:
name: db-credentials
---
# CronJob: scheduled jobs
apiVersion: batch/v1
kind: CronJob
metadata:
name: report-generator
spec:
schedule: "0 2 * * *" # At 02:00 every day (cron syntax)
timeZone: "America/New_York" # k8s 1.27+
concurrencyPolicy: Forbid # Allow, Forbid, or Replace
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1
startingDeadlineSeconds: 300 # Skip if missed by 5 minutes
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
containers:
- name: reporter
image: reporter:1.0
Pod Disruption Budgets
# PDB: limits voluntary disruptions (node drain, cluster upgrades)
# Ensures minimum availability during maintenance
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: myapp-pdb
spec:
minAvailable: 2 # At least 2 pods must be available
# OR: maxUnavailable: 1 # At most 1 pod unavailable at once
selector:
matchLabels:
app: myapp
Gotcha: if minAvailable equals the replica count, no voluntary disruption is ever permitted and node drains will block forever. Always leave headroom.
Services & Networking
Service Types
| Type | Accessibility | Use Case |
|---|---|---|
| ClusterIP | Cluster-internal only | Default. Pod-to-pod communication within cluster. |
| NodePort | External via node IP:port (30000–32767) | Development, on-prem without load balancer. Exposes on every node. |
| LoadBalancer | External via cloud LB | Production cloud deployments. Creates cloud load balancer automatically. |
| ExternalName | DNS CNAME alias | Map service name to external DNS (e.g., external database FQDN). |
# ClusterIP (default)
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
type: ClusterIP # Omit = ClusterIP
selector:
app: myapp # Routes to pods with this label
ports:
- name: http
port: 80 # Service port (what clients connect to)
targetPort: 8080 # Container port (can be named: targetPort: http)
protocol: TCP
---
# LoadBalancer with annotations (AWS EKS example)
apiVersion: v1
kind: Service
metadata:
name: myapp-lb
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
spec:
type: LoadBalancer
selector:
app: myapp
ports:
- port: 443
targetPort: 8080
---
# ExternalName: routes to external service
apiVersion: v1
kind: Service
metadata:
name: prod-database
spec:
type: ExternalName
externalName: db.prod.example.com
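The Service Types table lists NodePort, but the examples above skip it; a minimal sketch:

```yaml
# NodePort: exposes the Service on every node's IP at a static port.
# nodePort is optional; if omitted, k8s picks one from 30000-32767.
apiVersion: v1
kind: Service
metadata:
  name: myapp-nodeport
spec:
  type: NodePort
  selector:
    app: myapp
  ports:
    - port: 80          # ClusterIP port (still created internally)
      targetPort: 8080  # Container port
      nodePort: 30080   # Chosen here for illustration
```

Reach it at http://<any-node-ip>:30080.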
Headless Services
# Headless: clusterIP: None — no VIP, DNS returns pod IPs directly
# Required for StatefulSets; enables direct pod addressing
apiVersion: v1
kind: Service
metadata:
name: postgres-headless
spec:
clusterIP: None # This makes it headless
selector:
app: postgres
ports:
- port: 5432
targetPort: 5432
Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: myapp-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
nginx.ingress.kubernetes.io/ssl-redirect: "true"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
ingressClassName: nginx # Which IngressClass to use
tls:
- hosts:
- api.example.com
secretName: api-tls-cert # cert-manager populates this
rules:
- host: api.example.com
http:
paths:
- path: /v1
pathType: Prefix
backend:
service:
name: api-v1
port:
number: 80
- path: /v2
pathType: Prefix
backend:
service:
name: api-v2
port:
number: 80
- host: admin.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: admin-ui
port:
number: 80
Gateway API
Gateway API (GA in k8s 1.28) is the successor to Ingress. It separates infrastructure (GatewayClass, Gateway) from routing (HTTPRoute) concerns.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: myapp-route
spec:
parentRefs:
- name: prod-gateway # Reference to a Gateway object
namespace: infra
hostnames:
- "api.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /api
backendRefs:
- name: myapp
port: 80
weight: 90 # Canary: 90% to stable
- name: myapp-canary
port: 80
weight: 10
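An HTTPRoute attaches to a Gateway that must exist separately. A hedged sketch of that infrastructure side (the prod-gateway name, infra namespace, nginx GatewayClass, and cert name are assumptions):

```yaml
# Gateway: the listener infrastructure that HTTPRoutes attach to.
# Typically owned by a platform team in its own namespace.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: prod-gateway
  namespace: infra
spec:
  gatewayClassName: nginx          # Which controller implements this Gateway
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "*.example.com"
      tls:
        certificateRefs:
          - name: wildcard-tls-cert
      allowedRoutes:
        namespaces:
          from: All                # Let HTTPRoutes in any namespace attach
```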
Network Policies
Network policies control ingress/egress traffic between pods. Default: all traffic allowed — policies are additive whitelists. Requires a CNI that enforces policies (Calico, Cilium, Weave).
# Default-deny all ingress in a namespace, then allow selectively
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-ingress
namespace: production
spec:
podSelector: {} # Matches all pods in namespace
policyTypes:
- Ingress
---
# Allow ingress only from pods with specific labels, and from a namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-api-ingress
namespace: production
spec:
podSelector:
matchLabels:
app: database
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: api # From pods labeled app=api
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: monitoring # From monitoring NS
ports:
- protocol: TCP
port: 5432
egress:
- to:
- ipBlock:
cidr: 0.0.0.0/0
except:
- 169.254.169.254/32 # Block AWS metadata service
ports:
- protocol: TCP
port: 443
Gotchas: podSelector: {} matches ALL pods in the namespace. With Ingress in policyTypes, an absent or empty ingress list denies all inbound traffic to the selected pods, while a rule with an empty from: matches all sources. A direction not listed in policyTypes is untouched by the policy (still allowed unless another policy restricts it).
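A practical consequence of these semantics: as soon as any egress policy selects a pod, DNS is blocked too unless explicitly allowed. A common companion policy (a sketch; the kube-system label and port 53 are the conventional cluster-DNS defaults):

```yaml
# Allow all pods in the namespace to reach cluster DNS.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```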
Configuration
ConfigMaps
# Create from literal values
kubectl create configmap app-config \
--from-literal=LOG_LEVEL=info \
--from-literal=MAX_CONNECTIONS=100
# Create from a file (key = filename)
kubectl create configmap nginx-config --from-file=nginx.conf
# Create from env file (dotenv format)
kubectl create configmap app-env --from-env-file=.env.production
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
LOG_LEVEL: "info"
MAX_CONNECTIONS: "100"
# Multi-line values: pipe preserves newlines
config.yaml: |
server:
port: 8080
timeout: 30s
database:
pool_size: 10
---
# Using ConfigMap in a pod
spec:
containers:
- name: app
image: myapp:1.0
# Option 1: Inject all keys as env vars
envFrom:
- configMapRef:
name: app-config
# Option 2: Inject specific keys
env:
- name: LOG_LEVEL
valueFrom:
configMapKeyRef:
name: app-config
key: LOG_LEVEL
# Option 3: Mount as a volume (files)
volumeMounts:
- name: config-volume
mountPath: /etc/config
readOnly: true
volumes:
- name: config-volume
configMap:
name: app-config
Secrets
Secrets are only base64-encoded, not encrypted: anyone with API read access (or access to etcd) can decode them. Enable encryption at rest with an EncryptionConfiguration, or use external secret managers (Vault, AWS Secrets Manager via External Secrets Operator).
# Create generic secret
kubectl create secret generic db-creds \
--from-literal=username=admin \
--from-literal=password='s3cur3P@ss'
# Create TLS secret
kubectl create secret tls my-tls \
--cert=tls.crt \
--key=tls.key
# Create docker-registry secret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=myuser \
--docker-password=mypass
apiVersion: v1
kind: Secret
metadata:
name: db-credentials
type: Opaque # Generic; also: kubernetes.io/tls, kubernetes.io/dockerconfigjson
data:
username: YWRtaW4= # base64 encoded "admin"
password: czNjdXIzUEBzcw==
# Use stringData for human-readable (auto-encoded on apply)
stringData:
connection-string: "postgresql://admin:s3cur3P@ss@db:5432/mydb"
---
# Mount secret as volume (files are updated automatically when secret changes)
spec:
containers:
- name: app
volumeMounts:
- name: db-secret
mountPath: /etc/secrets
readOnly: true
volumes:
- name: db-secret
secret:
secretName: db-credentials
defaultMode: 0400 # File permissions
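The base64 values in a Secret's data field are encoding, not encryption; anyone who can read the object can decode them with standard tools:

```shell
# Encode a value the way kubectl does when creating a Secret
printf '%s' 'admin' | base64            # YWRtaW4=

# Decode a value copied out of a manifest
printf '%s' 'YWRtaW4=' | base64 -d      # admin

# Decode straight from a running cluster (requires cluster access):
# kubectl get secret db-credentials -o jsonpath='{.data.username}' | base64 -d
```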
Immutable ConfigMaps and Secrets
# Immutable = cannot be changed after creation (must delete + recreate)
# Improves performance: kubelet doesn't need to watch for changes
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config-v2
immutable: true
data:
VERSION: "2.0"
Storage
Volumes
| Volume Type | Lifetime | Use Case |
|---|---|---|
| emptyDir | Pod lifetime | Scratch space, inter-container file sharing. Lost on pod restart. |
| hostPath | Node lifetime | Access node filesystem. Avoid in production — not portable, security risk. |
| configMap | Until ConfigMap deleted | Mount config files into pods. |
| secret | Until Secret deleted | Mount credentials as files. |
| projected | Mixed | Combine multiple sources (secret + configmap + serviceAccountToken) into one mount. |
| persistentVolumeClaim | Claim lifetime | Durable storage — survives pod restarts and rescheduling. |
| nfs | External | NFS server. ReadWriteMany access for shared data. |
spec:
volumes:
- name: tmp
emptyDir:
medium: Memory # RAM-backed tmpfs; "" = disk
sizeLimit: 512Mi
- name: projected-vol
projected:
sources:
- secret:
name: db-credentials
- configMap:
name: app-config
- serviceAccountToken:
path: token
expirationSeconds: 3600
audience: my-service
PersistentVolumes and PersistentVolumeClaims
# PersistentVolume: cluster-scoped, provisioned by admin or dynamically
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv-example
spec:
capacity:
storage: 100Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce # RWO: single node read-write
# - ReadOnlyMany # ROX: multiple nodes read-only
# - ReadWriteMany # RWX: multiple nodes read-write
# - ReadWriteOncePod # RWOP: k8s 1.22+, single pod only
persistentVolumeReclaimPolicy: Retain # Retain, Recycle (deprecated), Delete
storageClassName: fast-ssd
csi:
driver: ebs.csi.aws.com
volumeHandle: vol-0a1b2c3d4e5f
---
# PersistentVolumeClaim: namespace-scoped, user requests storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: database-storage
namespace: production
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd # Match a StorageClass for dynamic provisioning
resources:
requests:
storage: 50Gi
---
# Use PVC in a pod
spec:
volumes:
- name: data
persistentVolumeClaim:
claimName: database-storage
containers:
- name: app
volumeMounts:
- name: data
mountPath: /data
StorageClasses
# StorageClass: defines how to dynamically provision storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
annotations:
storageclass.kubernetes.io/is-default-class: "true" # Default SC
provisioner: ebs.csi.aws.com # CSI driver
parameters:
type: gp3
iops: "3000"
throughput: "125"
encrypted: "true"
kmsKeyId: "arn:aws:kms:..."
reclaimPolicy: Delete # Delete PV when PVC deleted
volumeBindingMode: WaitForFirstConsumer # Delay binding until pod scheduled
allowVolumeExpansion: true
Use volumeBindingMode: WaitForFirstConsumer for zonal storage (EBS, GCP PD). It delays PV creation until the pod is scheduled, so the volume lands in the same AZ as the pod, preventing unschedulable pods due to AZ mismatch.
Scheduling
Node Selectors and Node Affinity
spec:
# Simple: nodeSelector (equality only)
nodeSelector:
kubernetes.io/os: linux
node.kubernetes.io/instance-type: m5.xlarge
affinity:
nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution: HARD rule
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values: [us-east-1a, us-east-1b]
- key: node.kubernetes.io/instance-type
operator: NotIn
values: [t3.nano, t3.micro]
# preferredDuringSchedulingIgnoredDuringExecution: SOFT rule
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100 # Higher weight = stronger preference
preference:
matchExpressions:
- key: cloud.google.com/gke-nodepool
operator: In
values: [high-memory]
Pod Affinity and Anti-Affinity
affinity:
# Co-locate pods on the same node as pods with label app=cache
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
podAffinityTerm:
labelSelector:
matchLabels:
app: cache
topologyKey: kubernetes.io/hostname
# Spread replicas across different nodes (anti-affinity for HA)
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: myapp
topologyKey: kubernetes.io/hostname # One replica per node
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: myapp
topologyKey: topology.kubernetes.io/zone # Prefer different AZs
Taints and Tolerations
Taints repel pods from nodes. Tolerations allow pods to be scheduled on tainted nodes.
# Taint a node: key=value:effect
kubectl taint nodes node1 dedicated=gpu:NoSchedule
kubectl taint nodes node1 maintenance=true:NoExecute # Evicts running pods too
# Effects:
# NoSchedule: Don't schedule new pods without toleration
# PreferNoSchedule: Soft NoSchedule
# NoExecute: Don't schedule + evict existing pods without toleration
# Remove a taint (append -)
kubectl taint nodes node1 dedicated:NoSchedule-
spec:
tolerations:
- key: "dedicated"
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
# Tolerate any taint with this key regardless of value
- key: "node.kubernetes.io/not-ready"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 300 # Evict after 5 min if node stays not-ready
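A toleration only permits scheduling onto tainted nodes; it does not require it. To truly dedicate nodes (e.g., the gpu taint above), pair the taint with a node label and a nodeSelector — the dedicated=gpu label here is an assumption matching that taint:

```yaml
spec:
  nodeSelector:
    dedicated: gpu            # Pin the pod to the labeled nodes...
  tolerations:
    - key: "dedicated"        # ...and tolerate their taint
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
```

Without the nodeSelector, the pod may still land on untainted nodes.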
Topology Spread Constraints
# Distribute pods evenly across zones and nodes
spec:
topologySpreadConstraints:
- maxSkew: 1 # Max difference between zones
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway
labelSelector:
matchLabels:
app: myapp
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: myapp
Priority Classes
# PriorityClass: higher value = higher priority, can preempt lower-priority pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority # or Never
description: "Critical production services"
---
# Use in pod spec
spec:
priorityClassName: high-priority
Scaling
Horizontal Pod Autoscaler (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: myapp-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 2
maxReplicas: 20
metrics:
# CPU: scale when avg CPU utilization exceeds 70% of requests
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Memory
- type: Resource
resource:
name: memory
target:
type: AverageValue
averageValue: 400Mi
# Custom metric from Prometheus Adapter
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "100"
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Percent
value: 25
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Percent
value: 100
periodSeconds: 15
HPA resource metrics require the target pods to have resources.requests set — without requests, utilization % can't be calculated.
Vertical Pod Autoscaler (VPA)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: myapp-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
updatePolicy:
updateMode: "Auto" # Off | Initial | Recreate | Auto
# Auto: evicts pods to apply recommendations (downtime!)
# Initial: only set on new pods
# Off: only recommend, never apply
resourcePolicy:
containerPolicies:
- containerName: app
minAllowed:
cpu: 50m
memory: 64Mi
maxAllowed:
cpu: "4"
memory: 8Gi
KEDA (Event-Driven Autoscaling)
KEDA extends HPA with 50+ scalers: Kafka lag, SQS queue depth, Redis lists, Prometheus queries, cron schedules, and more. It can also scale to zero.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: worker-scaler
spec:
scaleTargetRef:
name: worker-deployment
minReplicaCount: 0 # Scale to zero when idle
maxReplicaCount: 50
triggers:
- type: kafka
metadata:
bootstrapServers: kafka:9092
consumerGroup: my-consumer-group
topic: events
lagThreshold: "100" # 1 replica per 100 messages of lag
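Triggers can also come from a metrics system. A hedged sketch of a Prometheus-based trigger (the server address and query are assumptions for illustration):

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      query: sum(rate(http_requests_total{app="myapp"}[2m]))
      threshold: "100"        # Target value per replica
```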
Health Checks & Lifecycle
Probes
| Probe | Failure Action | Use Case |
|---|---|---|
| livenessProbe | Restart container | Detect deadlock. If this fails, the container is restarted. Set initialDelaySeconds generously. |
| readinessProbe | Remove from Service endpoints | Detect when app is ready to receive traffic. Fails during startup and brief overload. |
| startupProbe | Restart container | Slow-starting apps. Disables liveness/readiness during startup. Once it succeeds, normal probes take over. |
spec:
containers:
- name: app
image: myapp:1.0
# Startup probe: allow up to 5 min (30 * 10s) to start
startupProbe:
httpGet:
path: /healthz
port: 8080
httpHeaders:
- name: Custom-Header
value: startup-check
failureThreshold: 30
periodSeconds: 10
# Liveness: restart if unhealthy
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 0 # Startup probe handles the delay
periodSeconds: 15
timeoutSeconds: 5
failureThreshold: 3 # Restart after 3 consecutive failures
successThreshold: 1
# Readiness: only receive traffic when ready
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
successThreshold: 1
# Exec probe: run command inside container
# livenessProbe:
# exec:
# command: ["redis-cli", "ping"]
# TCP probe: check port is open
# livenessProbe:
# tcpSocket:
# port: 5432
# gRPC probe (k8s 1.24+)
# livenessProbe:
# grpc:
# port: 50051
# service: "grpc.health.v1.Health"
Lifecycle Hooks and Graceful Shutdown
spec:
terminationGracePeriodSeconds: 60 # Time allowed for graceful shutdown
containers:
- name: app
lifecycle:
postStart:
# Runs immediately after container starts (async, no guarantee before traffic)
exec:
command: ["/bin/sh", "-c", "echo started > /tmp/started"]
preStop:
# Runs before SIGTERM — use to delay shutdown or drain connections
# Critical: add a sleep to allow Service endpoint removal propagation
exec:
command: ["/bin/sh", "-c", "sleep 5 && nginx -s quit"]
# OR httpGet preStop
# httpGet:
# path: /shutdown
# port: 8080
Always add a sleep 5 (or similar) in preStop to allow Service endpoint removal to propagate before your app stops accepting new connections. Without this, you get request errors during rolling updates.
RBAC & Security
ServiceAccounts
# ServiceAccount: identity for pod processes to call the k8s API
apiVersion: v1
kind: ServiceAccount
metadata:
name: myapp-sa
namespace: production
annotations:
# EKS: IRSA — map to IAM role
eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/myapp-role"
# GKE: Workload Identity
# iam.gke.io/gcp-service-account: "[email protected]"
automountServiceAccountToken: false # Disable auto-mount if not needed
---
spec:
serviceAccountName: myapp-sa
automountServiceAccountToken: false # Also overridable at pod level
Roles and ClusterRoles
# Role: namespace-scoped permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-reader
namespace: production
rules:
- apiGroups: [""] # "" = core API group
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "update", "patch"]
---
# ClusterRole: cluster-wide OR reusable across namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: secret-reader
rules:
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list"]
# Restrict to specific resource names:
# resourceNames: ["allowed-secret-name"]
---
# RoleBinding: bind Role or ClusterRole to subjects in a namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: read-pods
namespace: production
subjects:
- kind: ServiceAccount
name: myapp-sa
namespace: production
- kind: User
name: "[email protected]"
apiGroup: rbac.authorization.k8s.io
- kind: Group
name: "dev-team"
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: Role # ClusterRole can also be referenced here
name: pod-reader
apiGroup: rbac.authorization.k8s.io
---
# ClusterRoleBinding: cluster-wide binding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: cluster-admin-binding
subjects:
- kind: User
name: [email protected]
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: cluster-admin
apiGroup: rbac.authorization.k8s.io
# Test RBAC permissions
kubectl auth can-i create deployments --namespace production
kubectl auth can-i create deployments --as system:serviceaccount:production:myapp-sa
kubectl auth can-i --list --namespace production # List all permissions
Security Contexts
spec:
# Pod-level security context
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 3000
fsGroup: 2000 # Volume files owned by this GID
seccompProfile:
type: RuntimeDefault # Apply seccomp filter
containers:
- name: app
# Container-level overrides pod-level
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true # Immutable container filesystem
capabilities:
drop: ["ALL"] # Drop all Linux capabilities
add: ["NET_BIND_SERVICE"] # Add only what's needed
Pod Security Standards (replaces PodSecurityPolicy)
| Level | Restrictions | Use Case |
|---|---|---|
| `privileged` | Unrestricted | System/infrastructure pods only |
| `baseline` | Minimal restrictions; prevents known privilege escalations | General workloads |
| `restricted` | Heavily restricted; follows pod hardening best practices | Security-sensitive workloads |
# Apply Pod Security Standards to a namespace via label
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
kubectl Essentials
Core Commands
# Get resources
kubectl get pods # All pods in current namespace
kubectl get pods -n kube-system # Specific namespace
kubectl get pods -A # All namespaces
kubectl get pods -o wide # Show node, IP
kubectl get pods -o yaml # Full YAML output
kubectl get pods --watch # Stream changes
# Describe: human-readable detail + events
kubectl describe pod myapp-7d9f8b-xyz
kubectl describe node worker-1
# Logs
kubectl logs myapp-7d9f8b-xyz
kubectl logs myapp-7d9f8b-xyz -c sidecar # Specific container
kubectl logs myapp-7d9f8b-xyz --previous # Previous container (after crash)
kubectl logs -l app=myapp --all-containers=true # All pods with label
kubectl logs myapp-7d9f8b-xyz --tail=100 -f # Follow last 100 lines
# Exec into container
kubectl exec -it myapp-7d9f8b-xyz -- /bin/bash
kubectl exec -it myapp-7d9f8b-xyz -c sidecar -- sh
# Port forwarding (local debugging without exposing service)
kubectl port-forward pod/myapp-7d9f8b-xyz 8080:8080
kubectl port-forward svc/myapp 8080:80
kubectl port-forward deployment/myapp 8080:8080
Apply vs Create
# apply: declarative, idempotent, tracks changes via annotation
# Use for ongoing management of resources
kubectl apply -f deployment.yaml
kubectl apply -f ./k8s/ # All files in directory
kubectl apply -k ./overlays/production/ # Kustomize directory
# create: imperative, fails if resource exists
# Use for one-time creation
kubectl create -f deployment.yaml
# diff: preview changes before applying
kubectl diff -f deployment.yaml
# dry-run: validate without applying
kubectl apply -f deployment.yaml --dry-run=client # Local validation only
kubectl apply -f deployment.yaml --dry-run=server # Sends to API server (full validation)
# delete
kubectl delete -f deployment.yaml
kubectl delete pod myapp-7d9f8b-xyz
kubectl delete pod myapp-7d9f8b-xyz --grace-period=0 --force # Emergency only
Output Formatting
# Custom columns
kubectl get pods -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.phase,\
NODE:.spec.nodeName,\
IP:.status.podIP
# JSONPath: extract specific fields
kubectl get pods -o jsonpath='{.items[*].metadata.name}'
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
# Get all container images in cluster
kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | sort -u
# Resource usage (requires metrics-server)
kubectl top pods
kubectl top pods --sort-by=cpu
kubectl top nodes
Context and Namespace Management
# Contexts
kubectl config get-contexts
kubectl config use-context my-cluster-prod
kubectl config current-context
# Set namespace for current context
kubectl config set-context --current --namespace=production
# Useful tools (install separately)
# kubectx — fast context switching: kubectx prod
# kubens — fast namespace switching: kubens production
# Useful aliases
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgs='kubectl get svc'
alias kgd='kubectl get deployments'
alias kns='kubectl config set-context --current --namespace'
# Shell completion
source <(kubectl completion zsh) # or bash
Editing and Patching
# Edit live resource (opens $EDITOR)
kubectl edit deployment myapp
# Patch: targeted update without full manifest
# JSON merge patch
kubectl patch deployment myapp -p '{"spec":{"replicas":5}}'
# Strategic merge patch (for arrays: lists by name key)
kubectl patch deployment myapp --type=strategic -p \
'{"spec":{"template":{"spec":{"containers":[{"name":"app","image":"myapp:2.0"}]}}}}'
# JSON patch (RFC 6902: explicit operations)
kubectl patch deployment myapp --type=json -p \
'[{"op":"replace","path":"/spec/replicas","value":5}]'
# Label and annotate
kubectl label pod myapp-xyz tier=frontend
kubectl annotate deployment myapp description="Main API service"
kubectl label node worker-1 node-type=gpu # Add label to node
Debugging & Troubleshooting
Common Pod Problems
| State | Cause | Fix |
|---|---|---|
| `ImagePullBackOff` | Image not found, wrong tag, no registry credentials | Check image name/tag. Create an imagePullSecret. Verify registry access. |
| `CrashLoopBackOff` | Container exits (crash or OOM); k8s restarts with exponential backoff | `kubectl logs --previous`. Check for OOMKilled in events. |
| `Pending` | No node satisfies scheduling constraints (resources, taints, affinity) | `kubectl describe pod` → Events. Check node capacity. |
| `OOMKilled` | Container exceeded memory limit | Increase `limits.memory`. Check for a memory leak. Use VPA recommendations. |
| `ContainerCreating` | Volume mount pending (PVC not bound, secret not found) | Check PVC status. Verify secret names. |
| `Terminating` (stuck) | preStop hook hanging, finalizer not cleared | Force delete: `--grace-period=0 --force`. Check finalizers. |
# Diagnostic workflow
kubectl get pods # See state
kubectl describe pod myapp-7d9f8b-xyz # Events + conditions
kubectl logs myapp-7d9f8b-xyz --previous # Logs from crashed container
kubectl get events --sort-by=.lastTimestamp # Cluster-wide event stream
kubectl get events -n production --field-selector reason=OOMKilling
# Check resource pressure on nodes
kubectl describe node worker-1 | grep -A5 "Conditions:"
kubectl describe node worker-1 | grep -A20 "Allocated resources:"
# DNS debugging: run a debug pod with network tools
kubectl run debug --image=nicolaka/netshoot -it --rm -- bash
# Inside: nslookup myapp.production.svc.cluster.local
# Inside: curl http://myapp.production.svc.cluster.local/healthz
# Inside: dig @10.96.0.10 myapp.production.svc.cluster.local # CoreDNS IP
# Ephemeral debug container (k8s 1.23+, doesn't modify running pod)
kubectl debug -it myapp-7d9f8b-xyz \
--image=busybox \
--target=app # Share process namespace with 'app' container
# Copy a pod spec for debugging (adds debug container, changes command)
kubectl debug myapp-7d9f8b-xyz -it --copy-to=debug-pod --image=busybox
Network Debugging
# Check service endpoints (are pods being selected?)
kubectl get endpoints myapp
kubectl describe svc myapp # Check selector matches pod labels
# Test connectivity from within cluster
kubectl run curl-test --image=curlimages/curl -it --rm -- \
curl http://myapp.default.svc.cluster.local/healthz
# Check DNS resolution
kubectl run dns-test --image=busybox -it --rm -- \
nslookup kubernetes.default.svc.cluster.local
# List all network policies in namespace
kubectl get networkpolicies -n production
# Check kube-proxy iptables rules (on node)
iptables -t nat -L KUBE-SERVICES | grep myapp
Resource Exhaustion
# Find resource-hungry pods
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20
# Check for OOMKilled pods
kubectl get pods -A -o json | \
jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason=="OOMKilled") | .metadata.name'
# Find pods without resource requests (scheduling/QoS risk)
kubectl get pods -A -o json | \
jq '.items[] | select(.spec.containers[].resources.requests == null) | .metadata.name'
# Check PVC status
kubectl get pvc -A
kubectl describe pvc database-storage # If Pending: check StorageClass, provisioner
Helm
Helm is the package manager for Kubernetes. A chart is a package of pre-configured Kubernetes resources. A release is a running instance of a chart in a cluster.
Chart Structure
mychart/
├── Chart.yaml # Chart metadata (name, version, dependencies)
├── values.yaml # Default configuration values
├── values.schema.json # Optional JSON schema for values validation
├── charts/ # Chart dependencies (subcharts)
├── templates/ # Kubernetes manifest templates
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── configmap.yaml
│ ├── _helpers.tpl # Named templates (partials), not rendered directly
│ ├── NOTES.txt # Post-install instructions
│ └── tests/
│ └── test-connection.yaml
└── .helmignore
# Chart.yaml
apiVersion: v2
name: myapp
description: My application Helm chart
type: application # or library
version: 1.2.3 # Chart version (semver)
appVersion: "2.1.0" # App version (informational)
dependencies:
- name: postgresql
version: "12.x.x"
repository: "https://charts.bitnami.com/bitnami"
condition: postgresql.enabled
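The `condition` field above toggles the subchart from your values. A values.yaml sketch (keys under `postgresql:` are passed to the subchart; `auth.database` follows the bitnami chart's conventions and is illustrative):

```yaml
# values.yaml (sketch)
replicaCount: 3
image:
  repository: myapp
  tag: "2.1.0"
postgresql:
  enabled: true          # flips the dependency's condition on/off
  auth:
    database: myapp      # subchart values nest under the subchart name
```

Setting `postgresql.enabled: false` (e.g. via `--set`) skips rendering the subchart entirely, which is useful when pointing at an external database.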
Core Commands
# Repository management
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo postgresql
helm search hub redis # Search Artifact Hub
# Install / Upgrade
helm install myapp ./mychart --namespace production --create-namespace
helm install myapp ./mychart -f values.yaml -f values.production.yaml
helm install myapp bitnami/postgresql --set primary.persistence.size=20Gi
# upgrade --install: idempotent (install or upgrade)
helm upgrade --install myapp ./mychart \
--namespace production \
--values values.yaml \
--set image.tag="2.1.0" \
--atomic \
--timeout 5m # --atomic rolls back automatically on failure
# Dry run
helm upgrade --install myapp ./mychart --dry-run
# Rollback
helm history myapp -n production
helm rollback myapp 3 -n production # Roll back to revision 3
# Status and debugging
helm list -n production
helm status myapp -n production
helm get values myapp -n production # Show applied values
helm get manifest myapp -n production # Show rendered manifests
# Uninstall
helm uninstall myapp -n production
helm uninstall myapp -n production --keep-history # Keep release history
Templates
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "mychart.fullname" . }}
labels:
{{- include "mychart.labels" . | nindent 4 }}
spec:
replicas: {{ .Values.replicaCount }}
selector:
matchLabels:
{{- include "mychart.selectorLabels" . | nindent 6 }}
template:
metadata:
labels:
{{- include "mychart.selectorLabels" . | nindent 8 }}
spec:
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
{{- if .Values.resources }}
resources:
{{- toYaml .Values.resources | nindent 10 }}
{{- end }}
env:
{{- range .Values.env }}
- name: {{ .name }}
value: {{ .value | quote }}
{{- end }}
{{- with .Values.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
# templates/_helpers.tpl
{{/*
Expand the name of the chart.
*/}}
{{- define "mychart.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels
*/}}
{{- define "mychart.labels" -}}
helm.sh/chart: {{ include "mychart.chart" . }}
{{ include "mychart.selectorLabels" . }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
Hooks
# Hook: run at specific lifecycle event
apiVersion: batch/v1
kind: Job
metadata:
name: "{{ .Release.Name }}-migrate"
annotations:
"helm.sh/hook": pre-upgrade,pre-install # When to run
"helm.sh/hook-weight": "-5" # Order (lower runs first)
"helm.sh/hook-delete-policy": hook-succeeded # Cleanup after
spec:
template:
spec:
restartPolicy: Never
containers:
- name: migrate
image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
command: ["./migrate", "--up"]
# Hook annotations:
# helm.sh/hook: pre-install, post-install, pre-delete, post-delete,
# pre-upgrade, post-upgrade, pre-rollback, post-rollback, test
Deployment Strategies
Rolling Update (Default)
spec:
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Can temporarily have 1 extra pod
maxUnavailable: 0 # Zero-downtime: never go below desired count
# maxSurge and maxUnavailable can also be percentages: "25%"
Blue-Green Deployment
Run two identical environments. Switch traffic instantaneously. Enables instant rollback.
# Deploy "green" alongside existing "blue"
# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-blue
spec:
replicas: 3
selector:
matchLabels:
app: myapp
slot: blue
template:
metadata:
labels:
app: myapp
slot: blue
spec:
containers:
- name: app
image: myapp:1.0
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
slot: green
template:
metadata:
labels:
app: myapp
slot: green
spec:
containers:
- name: app
image: myapp:2.0
---
# Service: switch by patching selector
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
selector:
app: myapp
slot: blue # Change to "green" to cut over
# Switch traffic from blue to green
kubectl patch svc myapp -p '{"spec":{"selector":{"app":"myapp","slot":"green"}}}'
# Rollback: switch back to blue
kubectl patch svc myapp -p '{"spec":{"selector":{"app":"myapp","slot":"blue"}}}'
Canary Deployment
# Canary: route small % of traffic to new version
# Stable: 9 replicas, Canary: 1 replica → roughly 10% of traffic hits the canary
# Stable deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-stable
spec:
replicas: 9
selector:
matchLabels:
app: myapp
track: stable
template:
metadata:
labels:
app: myapp
track: stable
spec:
containers:
- name: app
image: myapp:1.0
---
# Canary deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: myapp-canary
spec:
replicas: 1
selector:
matchLabels:
app: myapp
track: canary
template:
metadata:
labels:
app: myapp
track: canary
spec:
containers:
- name: app
image: myapp:2.0
---
# Service selects BOTH (only on shared label)
apiVersion: v1
kind: Service
metadata:
name: myapp
spec:
selector:
app: myapp # Matches both stable and canary pods
Pod-count canaries only give coarse ratios. For finer control, use a service mesh (Istio, Linkerd) or an ingress controller with traffic weight fields. This decouples traffic percentage from pod count.
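As a sketch, the NGINX ingress controller supports weighted canaries via annotations on a second Ingress that points at a canary Service (hostnames and Service names here are illustrative; verify annotation support against your controller version):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # send ~10% of traffic here
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary   # separate Service selecting track: canary
            port:
              number: 80
```

With this approach the canary can run a single replica regardless of the traffic share, and you adjust the percentage by patching the annotation rather than scaling pods.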
GitOps
ArgoCD and Flux implement GitOps: the desired state lives in Git, and the operator continuously reconciles the cluster to match. No manual kubectl apply in production.
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/myorg/gitops-repo
targetRevision: HEAD
path: apps/myapp/overlays/production
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Fix manual changes (drift correction)
syncOptions:
- CreateNamespace=true
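The Flux equivalent pairs a GitRepository source with a Kustomization that applies a path from it. A sketch (API versions correspond to Flux v2; repo URL and paths mirror the ArgoCD example above):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gitops-repo
  namespace: flux-system
spec:
  interval: 1m                 # how often to poll Git
  url: https://github.com/myorg/gitops-repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 10m                # reconciliation loop frequency
  sourceRef:
    kind: GitRepository
    name: gitops-repo
  path: ./apps/myapp/overlays/production
  prune: true                  # same semantics as ArgoCD's prune
  targetNamespace: production
```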
Monitoring & Observability
QoS Classes
Kubernetes assigns a QoS class based on resource specifications. This determines eviction priority when a node is under memory pressure.
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | Every container has equal requests == limits for CPU and memory | Last to evict |
| Burstable | At least one container has requests or limits set (but not equal) | Middle priority |
| BestEffort | No container has any requests or limits | First to evict |
Set requests == limits for CPU and memory on critical pods (databases, payment services) to get the Guaranteed QoS class. This makes scheduling deterministic and prevents memory-pressure eviction. For batch workloads, BestEffort or Burstable is fine.
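A minimal sketch of a container that lands in the Guaranteed class (requests exactly equal limits for both resources; image and values are illustrative):

```yaml
spec:
  containers:
  - name: app
    image: myapp:2.1.0
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"       # equal to requests → Guaranteed QoS
        memory: "512Mi"
```

Verify the assigned class with `kubectl get pod <pod> -o jsonpath='{.status.qosClass}'`.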
Prometheus + Grafana Stack
# Expose metrics for Prometheus scraping
apiVersion: v1
kind: Service
metadata:
name: myapp
annotations:
prometheus.io/scrape: "true"
prometheus.io/path: "/metrics"
prometheus.io/port: "8080"
---
# PodMonitor (kube-prometheus-stack / Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: myapp
namespace: monitoring
spec:
selector:
matchLabels:
app: myapp
namespaceSelector:
matchNames:
- production
podMetricsEndpoints:
- port: metrics
path: /metrics
interval: 30s
# Install kube-prometheus-stack via Helm (Prometheus + Grafana + AlertManager)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set grafana.adminPassword=changeme
Logging Architecture
Kubernetes containers log to stdout/stderr. The container runtime writes those streams to files under /var/log/pods/ on each node, with symlinks in /var/log/containers/. Three patterns:
- Node-level agent (DaemonSet): Fluentbit/Fluentd on every node, ships to Elasticsearch/Loki/Splunk. Low overhead, no app changes.
- Sidecar: Log shipper as a sidecar container. More flexible, higher resource cost.
- Direct push: App pushes logs to a centralized system. Language-specific SDKs.
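A skeleton of the node-level agent pattern: Fluent Bit as a DaemonSet reading container logs from the host. This is a sketch only (namespace, image tag, and the omitted output/parser config are assumptions; a real deployment also needs a ConfigMap and RBAC for metadata enrichment):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.2   # pin to a tested version
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true               # agent only reads node logs
      volumes:
      - name: varlog
        hostPath:
          path: /var/log               # where container log files live
      tolerations:
      - operator: Exists               # run on every node, tainted or not
```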
# Access logs via kubectl (kubelet's log endpoint)
kubectl logs myapp-xyz --tail=200 -f
kubectl logs myapp-xyz --since=1h
kubectl logs myapp-xyz --since-time="2024-01-15T10:00:00Z"
# Stern: multi-pod log tailing (install separately)
stern myapp --namespace production --tail=50
# kubetail: tail multiple pods by label
kubetail -l app=myapp -n production
Operators & CRDs
Custom Resource Definitions
CRDs extend the Kubernetes API with domain-specific resource types. Once installed, you can kubectl get, apply, and watch them like built-in resources.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: databases.db.example.com # plural.group
spec:
group: db.example.com
versions:
- name: v1
served: true
storage: true # Only one version can be the storage version
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
required: [engine, version, storage]
properties:
engine:
type: string
enum: [postgres, mysql]
version:
type: string
storage:
type: string
pattern: '^[0-9]+Gi$'
subresources:
status: {} # Enables .status subresource
additionalPrinterColumns:
- name: Engine
type: string
jsonPath: .spec.engine
- name: Age
type: date
jsonPath: .metadata.creationTimestamp
scope: Namespaced # or Cluster
names:
plural: databases
singular: database
kind: Database
shortNames: [db]
# Using the custom resource
apiVersion: db.example.com/v1
kind: Database
metadata:
name: production-db
spec:
engine: postgres
version: "16"
storage: 100Gi
Operator Pattern
An operator is a controller that watches custom resources and reconciles cluster state. It encodes operational knowledge (provisioning, scaling, backup, failover) in code.
# Popular operators in production
# cert-manager: Automates TLS certificate issuance/renewal (Let's Encrypt, ACME)
# external-dns: Sync Kubernetes Services/Ingresses to DNS providers (Route53, Cloudflare)
# prometheus-operator: Manage Prometheus/Alertmanager instances via CRDs
# external-secrets: Sync secrets from Vault, AWS SSM, GCP Secret Manager
# crossplane: Provision cloud resources (RDS, S3, etc.) via Kubernetes CRDs
# Install cert-manager
helm repo add jetstack https://charts.jetstack.io
helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace \
--set installCRDs=true
# cert-manager: automatic TLS for Ingress
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: letsencrypt-prod
spec:
acme:
server: https://acme-v02.api.letsencrypt.org/directory
email: [email protected]
privateKeySecretRef:
name: letsencrypt-prod
solvers:
- http01:
ingress:
class: nginx
Best Practices
Resource Management
- Always set resource requests and limits. Without requests, the scheduler can't make informed placement decisions. Without limits, a single runaway pod can starve the node.
- Start with VPA recommendations, then set static requests/limits for production workloads.
- CPU is compressible (excess usage is throttled, not killed); memory is not (exceeding the limit means OOMKilled). Size memory limits from observed usage plus headroom.
- Prefer Guaranteed QoS for stateful or latency-sensitive services (requests == limits).
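To get VPA recommendations without letting it evict pods, run it in `"Off"` mode and read the suggestions from its status. A sketch (requires the VPA components to be installed in the cluster; the Deployment name is illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  updatePolicy:
    updateMode: "Off"   # recommend only; never restart or resize pods
```

Read the recommendations with `kubectl describe vpa myapp-vpa`, then copy sensible values into static requests/limits.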
High Availability Patterns
spec:
replicas: 3 # Minimum 2 for HA, ideally odd for quorum-aware apps
# Spread across nodes — hard requirement
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: myapp
topologyKey: kubernetes.io/hostname
# Spread across AZs — soft preference
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: myapp
Image Best Practices
- Never use `:latest` in production. It's mutable and prevents reproducibility. Pin to a specific tag or digest: `myapp@sha256:abc123...`
- Use distroless or Alpine base images. Smaller attack surface, faster pulls.
- Set `imagePullPolicy: IfNotPresent` (the default for tagged images). `Always` adds latency to every pod start.
- Scan images in CI with Trivy, Grype, or Snyk before pushing.
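Digest pinning in a container spec looks like this (registry, name, and digest are illustrative placeholders, not a real image):

```yaml
spec:
  containers:
  - name: app
    # Digest reference: immutable, survives tag re-pushes in the registry
    image: registry.example.com/myapp@sha256:4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945
    imagePullPolicy: IfNotPresent
```

Find an image's digest after pushing with `docker inspect --format='{{index .RepoDigests 0}}' myapp:2.1.0`.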
Namespace Organization
| Pattern | Namespaces | Best For |
|---|---|---|
| Per environment | dev, staging, production | Small teams, simple topology |
| Per team | payments, auth, platform | Multi-team clusters with isolation needs |
| Per environment + team | payments-prod, payments-staging | Large orgs with hard isolation requirements |
Label Conventions
# Recommended Kubernetes labels (app.kubernetes.io/)
metadata:
labels:
app.kubernetes.io/name: myapp # App name
app.kubernetes.io/instance: myapp-prod # Unique instance
app.kubernetes.io/version: "2.1.0" # Current version
app.kubernetes.io/component: api # Component role
app.kubernetes.io/part-of: platform # Larger system
app.kubernetes.io/managed-by: helm # What manages this
app.kubernetes.io/created-by: ci-system # Who created it
Security Hardening Checklist
- Enable RBAC (default in modern k8s). Audit ClusterRoleBindings regularly.
- Use dedicated ServiceAccounts per workload. Set `automountServiceAccountToken: false` when not needed.
- Apply Pod Security Standards (`restricted` or `baseline`) to all namespaces.
- Set `readOnlyRootFilesystem: true`, `runAsNonRoot: true`, and drop all capabilities.
- Encrypt secrets at rest (`EncryptionConfiguration`) or use the External Secrets Operator.
- Apply Network Policies: default-deny, then allow-list.
- Scan images before deployment. Use admission controllers (OPA Gatekeeper, Kyverno) to enforce policies.
- Enable audit logging on the API server.
- Rotate credentials and certificates regularly.
- Limit etcd access — only API server should communicate with etcd.
Common Pitfalls & Gotchas
OOMKilled — Memory Limits Too Low
# Diagnose OOMKilled
kubectl describe pod myapp-xyz | grep -A3 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Check actual memory usage vs limits
kubectl top pod myapp-xyz --containers
kubectl get events --field-selector reason=OOMKilling
# Fix: increase limits, or find the memory leak
# kubectl set resources deployment/myapp --limits=memory=1Gi
Pending Pods
# Check why pod is Pending
kubectl describe pod myapp-xyz
# Common reasons in Events:
# "Insufficient cpu" / "Insufficient memory" — no node has capacity
# "didn't match Pod's node affinity/selector" — no matching node
# "had taint {key:value} that the pod didn't tolerate"
# "pod has unbound immediate PersistentVolumeClaims"
# Check node capacity
kubectl describe nodes | grep -A10 "Allocated resources"
# Add more nodes or reduce requests
# For cluster autoscaler: check if it's scaling (look at CA events)
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50
DNS Resolution Issues
Pod /etc/resolv.conf defaults to `ndots:5`, so any name with fewer than five dots is tried against every search domain before the absolute lookup. Lower `ndots` in the pod spec to reduce search domain lookups.
# Reduce unnecessary DNS queries: set ndots to 2
spec:
dnsConfig:
options:
- name: ndots
value: "2" # Default is 5 — causes 5 searches before absolute lookup
- name: single-request-reopen # Avoid race condition in some resolvers
PVC Stuck in Pending
# Check PVC status
kubectl describe pvc database-storage
# Causes:
# "no persistent volumes available" — no matching PV, no StorageClass provisioner
# "waiting for first consumer to be created" — WaitForFirstConsumer binding mode
# StorageClass not found
# CSI driver not running (check pods in kube-system)
# Check storageclass
kubectl get storageclass
kubectl describe storageclass fast-ssd | grep Provisioner
# Check CSI driver pods
kubectl get pods -n kube-system | grep csi
Ingress Not Routing
# Common causes:
# 1. Ingress controller not installed
kubectl get pods -n ingress-nginx | grep controller
# 2. Wrong ingressClassName
kubectl get ingressclass
# 3. Service/port mismatch in backend spec
kubectl describe ingress myapp-ingress
# 4. TLS cert not ready
kubectl describe certificate myapp-tls -n production # cert-manager
# 5. Check ingress controller logs
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | tail -50
Secret Management Anti-Patterns
- Don't commit raw Secret manifests to Git; base64 is encoding, not encryption. Use External Secrets Operator, Sealed Secrets, or SOPS.
- Don't use environment variables for secrets in long-lived processes; they appear in `/proc/PID/environ`, crash dumps, and `kubectl describe`. Mount as files instead.
- Don't share ServiceAccount tokens across deployments. Each workload should have a minimal-privilege SA.
- Rotate secrets regularly. Tooling with auto-mount updates (e.g., the Vault Agent Injector) handles this transparently.
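Mounting a Secret as files instead of env vars looks like this (a sketch; the Secret `myapp-credentials` is assumed to already exist):

```yaml
spec:
  containers:
  - name: app
    volumeMounts:
    - name: credentials
      mountPath: /etc/secrets   # each key becomes a file here
      readOnly: true
  volumes:
  - name: credentials
    secret:
      secretName: myapp-credentials
      defaultMode: 0400         # owner-read-only files
```

A side benefit: the kubelet eventually refreshes mounted Secret files when the Secret changes, whereas env vars require a pod restart to pick up new values.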
Resource Leaks and etcd Performance
# Orphaned resources accumulate over time and slow etcd
# Find completed/failed jobs (clean up regularly)
kubectl get jobs -A --field-selector status.successful=1
kubectl delete jobs -A --field-selector status.successful=1
# Find evicted pods
kubectl get pods -A | grep Evicted
kubectl get pods -A --field-selector status.phase=Failed
# Delete all evicted pods in a namespace
kubectl get pods -n production --field-selector status.phase=Failed \
-o name | xargs kubectl delete -n production
# Set TTL on jobs to auto-clean
# spec.ttlSecondsAfterFinished: 3600 # In Job spec
# Monitor etcd object count (>100k objects degrades performance)
kubectl get --raw /metrics | grep apiserver_storage_objects # etcd_object_counts on older clusters (pre-1.22)
Rolling Update Gotchas
With `replicas: 1` and `maxUnavailable: 0`, rolling updates require an extra pod slot (`maxSurge` must be > 0). If the node has no capacity for the surge pod, the deployment will stall. For single-replica services, you need either spare capacity or must accept brief downtime.
# Check rollout status (detects stalled deploys)
kubectl rollout status deployment/myapp --timeout=5m
# Watch pod transitions during rollout
kubectl get pods -l app=myapp -w
# Check if HPA is fighting the rollout (scaling down your new pods)
kubectl get hpa myapp
# Pause HPA during rollout if needed (set min/max to desired count)
kubectl apply vs replace
kubectl replace does a full object replacement — it deletes fields not in your manifest, including server-set fields. Use kubectl apply (strategic merge) or kubectl patch. replace can delete status, finalizers, and controller-managed fields.
# Safe update pattern
kubectl diff -f deployment.yaml # Always review before applying
kubectl apply -f deployment.yaml # Apply changes
kubectl rollout status deployment/myapp # Verify success
# If you must force-replace (e.g., immutable fields changed):
kubectl apply -f deployment.yaml --force # Delete + recreate — causes downtime
# Or: delete + apply separately with controlled timing