Setup & Environment

These tools cover the entire debugging surface from DNS resolution to packet-level inspection. Most are available on macOS out of the box.

Built-in macOS Tools

# DNS lookup (both tools available by default)
dig google.com          # full DNS response with query time and server
nslookup google.com     # simpler output, good for quick checks

# Trace the network path (hop-by-hop)
traceroute google.com   # ICMP/UDP based, 30 hops max

# Network connections and listening ports
netstat -an | grep LISTEN      # all listening sockets
netstat -rn                    # routing table

# HTTP — already on macOS
curl -I https://httpbin.org/get     # HEAD request, show response headers
curl -v https://httpbin.org/get     # verbose: shows TLS handshake + headers

Install Extras

brew install httpie mtr nmap wireshark

# httpie — human-friendly HTTP client
http GET https://httpbin.org/get
http POST https://httpbin.org/post name=alice age:=30

# mtr — combines traceroute + ping, live updating
mtr google.com

# nmap — port scanner and service fingerprinter
nmap -sT localhost               # TCP connect scan, localhost
nmap -p 80,443,8080 example.com  # specific ports
nmap -sV example.com             # version detection

# wireshark — GUI packet capture (use tcpdump for CLI)
sudo tcpdump -i en0 port 443 -w capture.pcap
wireshark capture.pcap

Docker for Local Practice

# Create an isolated bridge network to simulate microservice networking
docker network create test-net

# Spin up two containers on the same network
docker run -d --name service-a --network test-net nginx
docker run -d --name service-b --network test-net nginx

# service-a can reach service-b by DNS name (Docker's embedded DNS)
docker exec service-a curl http://service-b

# Inspect the network (see IP assignments, subnet)
docker network inspect test-net

# Cleanup
docker rm -f service-a service-b
docker network rm test-net

httpie vs curl
httpie (http command) is a developer-friendly alternative to curl. It auto-formats JSON responses, adds syntax highlighting, and has a cleaner CLI syntax for setting headers and bodies. Use curl in scripts (it is always available), and http for interactive exploration.

OSI & TCP/IP Model

The OSI model provides a conceptual framework; TCP/IP is what actually runs on the internet. Knowing both is essential for mapping symptoms to layers during debugging.

Layer Mapping

OSI Layer | TCP/IP Layer | Protocols & Technologies | Unit
7 Application | Application | HTTP, HTTPS, gRPC, WebSocket, DNS, SMTP, FTP, SSH | Data / Message
6 Presentation | (Application) | TLS/SSL, encoding (JSON, Protobuf), compression | Data
5 Session | (Application) | Session management, RPC, WebSocket sessions | Data
4 Transport | Transport | TCP, UDP, QUIC | Segment / Datagram
3 Network | Internet | IP (IPv4/IPv6), ICMP, BGP, OSPF | Packet
2 Data Link | Network Access | Ethernet, Wi-Fi (802.11), ARP, VLANs | Frame
1 Physical | (Network Access) | Copper, fiber, radio waves, hubs | Bits

Layer-Based Debugging

When something is broken, think in layers from the bottom up:

  • L1-L2: Is the interface up at all? (ifconfig / ip link)
  • L3: Can you reach the host? (ping, traceroute)
  • L4: Is the port open and accepting connections? (nc, lsof, curl -v)
  • L7: Does the application respond correctly? (curl for HTTP, dig for DNS)

Interview Tip
When asked "what happens when you type a URL", walk the layers: DNS resolution (L7) → TCP connection (L4) → IP routing (L3) → TLS handshake (L6/L5) → HTTP request (L7). This shows you understand the full stack.
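The layer walk above can be expressed in code. A minimal sketch using only the standard library — `fetch_url_steps` is an illustrative name, not part of this guide's tooling:

```python
import socket
import ssl

def fetch_url_steps(host: str, port: int = 443, use_tls: bool = True) -> str:
    """Walk the layers for one request; returns the HTTP status line."""
    # 1. DNS resolution (application-layer query via the OS resolver)
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]

    # 2. TCP connection (the L4 three-way handshake happens inside connect)
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(5)
    sock.connect(addr)

    # 3. TLS handshake (L6/L5) — skipped for plain-HTTP targets
    if use_tls:
        ctx = ssl.create_default_context()
        sock = ctx.wrap_socket(sock, server_hostname=host)

    # 4. HTTP request (L7)
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    sock.sendall(request.encode())
    status_line = sock.recv(1024).split(b"\r\n")[0].decode()
    sock.close()
    return status_line
```

Against a live HTTPS host, `fetch_url_steps("example.com")` should return a status line such as `HTTP/1.1 200 OK`.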

TCP vs UDP

TCP Three-Way Handshake

Client                              Server
  |                                   |
  |--- SYN (seq=x) ------------------>|   Client initiates connection
  |                                   |
  |<-- SYN-ACK (seq=y, ack=x+1) ------|   Server acknowledges + sends its seq
  |                                   |
  |--- ACK (ack=y+1) ---------------->|   Client confirms
  |                                   |
  |====== DATA TRANSFER ==============|   Connection established
  |                                   |
  |--- FIN -------------------------->|   Client initiates teardown
  |<-- ACK ---------------------------|
  |<-- FIN ---------------------------|   Server initiates its half-close
  |--- ACK -------------------------->|   Four-way teardown complete

TCP Flags

Flag | Meaning | Common Scenario
SYN | Synchronize sequence numbers | Connection initiation
ACK | Acknowledgment field valid | Every packet after handshake
FIN | No more data from sender | Graceful connection close
RST | Reset the connection immediately | Port unreachable, connection aborted
PSH | Push buffered data to application | Interactive sessions (SSH)
URG | Urgent pointer field significant | Rare in practice

Flow Control & Congestion Control

Flow control prevents the sender from overwhelming the receiver. The receiver advertises a receive window (rwnd) — the number of bytes it can buffer. The sender never sends more than rwnd unacknowledged bytes.

Congestion control prevents the sender from overwhelming the network:

  • Slow start: the congestion window (cwnd) starts small and doubles each RTT
  • Congestion avoidance: past a threshold, cwnd grows linearly (AIMD)
  • Fast retransmit/recovery: 3 duplicate ACKs trigger a resend without waiting for a timeout
  • On packet loss, cwnd shrinks — loss is treated as a congestion signal

UDP: Fire and Forget

UDP has no connection establishment, no ordering guarantees, and no retransmission. The header is only 8 bytes (vs TCP's 20+). Use it when latency matters more than delivery guarantees — DNS queries, live video, game state — or when the application layer handles reliability itself (as QUIC does).

Property | TCP | UDP
Connection | Connection-oriented (3-way handshake) | Connectionless
Reliability | Guaranteed delivery, ordering, dedup | Best-effort, no guarantees
Overhead | 20+ byte header, RTT for setup | 8 byte header, no setup
Flow control | Yes (sliding window) | No
Use cases | HTTP, SSH, databases, file transfer | DNS, video, gaming, QUIC
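The contrast above is visible with raw sockets. A hedged local sketch: the UDP sender below performs no handshake and receives no delivery confirmation — if the datagram were dropped, it would never know.

```python
import socket

# UDP needs no handshake: bind a receiver, then send a datagram immediately
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))              # port 0 → OS picks a free port
receiver.settimeout(2)
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"ping", ("127.0.0.1", port))  # fire and forget — no connect(), no ACK

data, addr = receiver.recvfrom(1024)
# Had this datagram been lost, the sender would never know — that is UDP's contract
sender.close()
receiver.close()
```

A TCP version of the same exchange would require `listen()`/`accept()` on the receiver and `connect()` on the sender — the three-way handshake — before any byte moves.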

DNS

Resolution Flow

Browser cache miss → OS cache miss → Recursive resolver (ISP or 8.8.8.8)
                                          |
                  ┌───────────────────────┼───────────────────────┐
                  v                       v                       v
            Root NS ( . )           TLD NS (.com)       Auth NS (example.com)
          "Who owns .com?"      "Try ns1.example.com"      "93.184.216.34"
                                          |
                       Recursive resolver caches result (TTL)
                                          |
                              Returns A record to OS
                                          |
                        OS caches in /etc/hosts-like store
                                          |
                             Browser connects to IP

Record Types

Record | Purpose | Example
A | IPv4 address | api.example.com → 93.184.216.34
AAAA | IPv6 address | api.example.com → 2606:2800::1
CNAME | Alias to another name | www → example.com
MX | Mail server with priority | 10 mail.example.com
TXT | Arbitrary text (SPF, DKIM, verification) | "v=spf1 include:..."
SRV | Service location (host + port) | _http._tcp → 0 5 80 api.example.com
NS | Authoritative nameservers for zone | ns1.cloudflare.com
PTR | Reverse DNS (IP → name) | Used by mail servers for spam checks

dig & nslookup Examples

# Basic A record lookup
dig google.com

# Specific record type
dig google.com MX
dig google.com TXT
dig google.com NS

# Query a specific resolver (bypass system resolver)
dig @8.8.8.8 google.com

# Short output (just the answer)
dig +short google.com

# Trace the full resolution path
dig +trace google.com

# Reverse lookup (IP to hostname)
dig -x 8.8.8.8

# Watch TTL count down (repeat the query against a caching resolver)
dig +noall +answer google.com

# nslookup equivalents (run with no arguments for interactive mode)
nslookup google.com
nslookup -type=MX google.com
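For programmatic lookups, the standard library covers A/AAAA resolution via the same OS resolver path nslookup uses by default. A small sketch — `resolve_a_records` is an illustrative helper; for other record types (MX, TXT, NS) you would reach for a resolver library such as dnspython:

```python
import socket

def resolve_a_records(hostname: str) -> list[str]:
    # Ask the OS resolver for IPv4 addresses (roughly `dig +short`)
    infos = socket.getaddrinfo(hostname, None,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Deduplicate: getaddrinfo may return one entry per socket type/protocol
    return sorted({info[4][0] for info in infos})
```

Note this consults the OS cache and /etc/hosts first, just like a browser would — it does not bypass local caching the way `dig @8.8.8.8` does.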

Internal DNS & Service Discovery

In Kubernetes, every Service gets a DNS entry automatically:

# Format: {service}.{namespace}.svc.cluster.local
# From within a pod:
curl http://payment-service.billing.svc.cluster.local:8080/charge

# Short form (within same namespace):
curl http://payment-service:8080/charge

# Kubernetes DNS server: CoreDNS (runs as pods in kube-system)
kubectl get pods -n kube-system | grep coredns

# Headless service (no cluster IP) — returns pod IPs directly
# Useful for StatefulSets (kafka, cassandra): each pod gets a stable DNS name
# kafka-0.kafka.default.svc.cluster.local

TTL and Caching Pitfalls
DNS changes don't propagate instantly. Old IPs can be cached at multiple levels: browser (often ignores TTL, caches 60s), OS resolver, corporate DNS caches. When doing a failover or IP change, lower TTL to 60s at least 24h in advance. After the change, wait for old TTL to expire before raising it back.

HTTP/1.1, HTTP/2, HTTP/3

Request & Response Anatomy

# HTTP/1.1 request (text-based, human-readable)
GET /api/users/42 HTTP/1.1
Host: api.example.com
Accept: application/json
Authorization: Bearer eyJhbGc...
User-Agent: MyApp/1.0

# HTTP response
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 142
Cache-Control: max-age=60

{"id": 42, "name": "Alice"}
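The stdlib `http.client` module emits exactly the wire format shown above (request line, headers, blank line) and parses the response into status, headers, and body. A brief sketch — `get_with_headers` is a hypothetical helper name:

```python
import http.client

def get_with_headers(host: str, port: int, path: str = "/"):
    """Send a GET and return (status, headers, body)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    # This writes: "GET {path} HTTP/1.1", "Host: ...", "Accept: ...", blank line
    conn.request("GET", path, headers={"Accept": "application/json"})
    resp = conn.getresponse()
    status, headers, body = resp.status, dict(resp.getheaders()), resp.read()
    conn.close()
    return status, headers, body
```

Running it against a real API and printing `headers` shows the same response anatomy as the example above (Content-Type, Content-Length, Cache-Control, ...).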

Version Comparison

Feature | HTTP/1.1 | HTTP/2 | HTTP/3
Transport | TCP | TCP | QUIC (UDP)
Multiplexing | No (one request at a time per connection) | Yes (streams on one conn) | Yes (independent streams)
HOL blocking | Yes (request level) | Yes (TCP level) | No (per-stream)
Header compression | None (repeated verbatim) | HPACK (Huffman + table) | QPACK
Server push | No | Yes (rarely used) | Yes
Connection setup | 1 RTT + TLS | 1 RTT + TLS (or 0-RTT) | 0-RTT or 1-RTT
Binary | No (text) | Yes (framing layer) | Yes

HTTP/2 Key Features

  • Multiplexing: many concurrent streams over one TCP connection — no more domain sharding
  • HPACK header compression: repeated headers sent once, then referenced by table index
  • Binary framing: requests/responses split into typed frames (HEADERS, DATA)
  • Stream prioritization: clients can hint which resources matter most

HTTP/3 & QUIC

QUIC is a transport protocol built on UDP that reimplements TCP's reliability and TLS 1.3's security, but with independent stream delivery. Key advantages:

  • No TCP head-of-line blocking: a lost packet stalls only its own stream
  • Faster setup: transport and crypto handshakes are combined (0-RTT or 1-RTT)
  • Connection migration: connections survive IP changes (Wi-Fi → cellular) via connection IDs

Key Status Codes

Code | Meaning | Notes
200 OK | Success | GET, POST, PUT responses
201 Created | Resource created | POST that created a resource; include Location header
204 No Content | Success, no body | DELETE, PUT with no response body
301 Moved Permanently | Redirect (cached) | Browser caches indefinitely; hard to undo
302 Found | Redirect (temporary) | Not cached; use for auth redirects
304 Not Modified | Use cached version | ETag/If-None-Match or Last-Modified
400 Bad Request | Client error | Malformed JSON, missing required field
401 Unauthorized | Not authenticated | Missing or invalid token; include WWW-Authenticate
403 Forbidden | Not authorized | Authenticated but lacks permission
404 Not Found | Resource absent | Never leak whether resource exists to unauthorized callers
409 Conflict | State conflict | Duplicate create, optimistic lock failure
422 Unprocessable Entity | Validation error | Syntactically valid but semantically wrong
429 Too Many Requests | Rate limited | Include Retry-After header
500 Internal Server Error | Server crashed | Never expose stack traces
502 Bad Gateway | Upstream error | Reverse proxy got bad response from backend
503 Service Unavailable | Overloaded/down | Include Retry-After; used during deployments
504 Gateway Timeout | Upstream timeout | Backend took too long; check for slow queries

TLS & HTTPS

TLS 1.3 Handshake (1-RTT)

Client                                      Server
  |                                           |
  |--- ClientHello (supported ciphers, ------>|
  |    key_share, random)                     |
  |                                           |
  |<-- ServerHello (chosen cipher, -----------|
  |    key_share, Certificate,                |
  |    CertificateVerify, Finished)           |
  |                                           |
  |--- Finished (client auth if mTLS) ------->|
  |                                           |
  |===== Encrypted Application Data ==========|   (1 RTT total)

TLS 1.2 required 2 RTTs; TLS 1.3 reduces this to 1 RTT.
0-RTT (early data): the client can send data in its first flight using a session
ticket from a prior connection. Risk: replay attacks.

Certificates & CA Chain

A TLS certificate proves: "this public key belongs to this domain". The chain of trust works like this:

# Inspect a live TLS certificate
openssl s_client -connect google.com:443 -servername google.com < /dev/null \
  | openssl x509 -noout -text | grep -E "Subject:|Issuer:|Not After|DNS:"

# Check certificate chain
openssl s_client -connect google.com:443 -showcerts < /dev/null

# Verify expiration date
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -dates

# Decode a local cert
openssl x509 -in cert.pem -noout -text
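Expiry checks can also be scripted: Python's `ssl` module parses the fixed-format date strings that `getpeercert()` returns for `notBefore`/`notAfter`. A sketch — the date value below is made up for illustration:

```python
import ssl
from datetime import datetime, timezone

# getpeercert()["notAfter"] strings use this fixed GMT format; value is hypothetical
not_after = "Jun  1 12:00:00 2031 GMT"

expiry_ts = ssl.cert_time_to_seconds(not_after)           # epoch seconds (input is GMT)
expires = datetime.fromtimestamp(expiry_ts, tz=timezone.utc)
days_left = (expires - datetime.now(tz=timezone.utc)).days
# Alert well before expiry — e.g., warn when days_left < 30
```

In practice you would pull `not_after` from `ssl.SSLSocket.getpeercert()` after a live handshake, which is the programmatic equivalent of the `openssl x509 -noout -dates` check above.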

Let's Encrypt & ACME

# certbot (ACME client) issues free 90-day certificates
# HTTP-01 challenge: LE places a file at /.well-known/acme-challenge/{token}
# DNS-01 challenge: LE adds a TXT record to your DNS zone (needed for wildcards)

# Issue cert for a domain (nginx)
certbot --nginx -d example.com -d www.example.com

# Standalone (no web server running)
certbot certonly --standalone -d example.com

# Auto-renew (run in cron or systemd timer)
certbot renew --quiet

# In Kubernetes: cert-manager handles ACME automatically
# annotations on Ingress trigger Certificate object creation

mTLS: Mutual TLS for Service-to-Service

In standard TLS, only the server presents a certificate. In mTLS, both client and server present certificates. This provides cryptographic proof of identity on both sides — no passwords or API keys needed.

# Generate a private CA for internal services
openssl genrsa -out ca.key 4096
openssl req -new -x509 -days 3650 -key ca.key -out ca.crt -subj "/CN=MyInternalCA"

# Issue a cert for service-a
openssl genrsa -out service-a.key 2048
openssl req -new -key service-a.key -out service-a.csr -subj "/CN=service-a"
openssl x509 -req -days 365 -in service-a.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -out service-a.crt
# Note: modern clients validate subjectAltName, not CN — for real use, pass
# an -extfile containing "subjectAltName=DNS:service-a" when signing

# curl with mTLS
curl --cert service-a.crt --key service-a.key --cacert ca.crt \
  https://service-b.internal/api/data

Service Meshes Automate mTLS
Managing per-service certificates manually is operationally painful. Tools like Istio and Linkerd automatically provision and rotate mTLS certificates for every pod using SPIFFE/SPIRE identity standards. Certificate rotation happens transparently without restarting services.

REST API Design

HTTP Methods

Method | Semantics | Idempotent | Body
GET | Retrieve resource(s) | Yes | No
POST | Create resource or trigger action | No | Yes
PUT | Replace resource entirely | Yes | Yes
PATCH | Partial update | No (unless designed carefully) | Yes
DELETE | Remove resource | Yes | Optional
HEAD | Same as GET but no body | Yes | No
OPTIONS | What methods does this endpoint support? | Yes | No

URI Design Principles

# Resources are nouns, not verbs
GET  /users/42              # Good: noun
GET  /getUser/42            # Bad: verb

# Nested resources for relationships
GET  /users/42/orders       # Orders belonging to user 42
POST /users/42/orders       # Create an order for user 42
GET  /users/42/orders/7     # Specific order

# Actions that don't fit CRUD: use sub-resources or POST
POST /orders/7/cancel       # Cancel an order
POST /users/42/password-reset

# Collections vs singletons
GET  /users                 # List all users
GET  /users/42              # Specific user
DELETE /users/42            # Delete user 42

# Filtering, sorting, pagination as query params (never in path)
GET /users?role=admin&sort=created_at&order=desc&limit=20&cursor=abc123

Versioning Strategies

Strategy | Example | Pros | Cons
URL path | /v1/users | Visible, easy to test in browser | Breaks REST purity (same resource, different URL)
Accept header | Accept: application/vnd.myapi.v2+json | RESTfully correct | Hard to test, invisible in browser
Custom header | API-Version: 2 | Clean URLs | Non-standard; CDN caches must Vary on it
Query param | /users?version=2 | Simple | Pollutes query string

Recommendation: URL path versioning (/v1/, /v2/) is the most pragmatic for public APIs. It is explicit, observable in logs, and easy for consumers to manage.

Pagination

// Offset-based (simple but inefficient at scale)
GET /users?offset=20&limit=10
{
  "data": [...],
  "total": 1542,
  "offset": 20,
  "limit": 10
}

// Cursor-based (preferred: stable, works with real-time data)
GET /users?cursor=eyJ1c2VyX2lkIjogNDJ9&limit=10
{
  "data": [...],
  "next_cursor": "eyJ1c2VyX2lkIjogNTJ9",
  "has_more": true
}
// Cursor is typically an opaque base64-encoded bookmark
// (e.g., encoded {id: 42, created_at: "2024-01-15T..."})
// Stable even when records are inserted/deleted between pages
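Encoding and decoding such an opaque cursor takes only a few lines. A sketch, assuming the bookmark is plain JSON — the helper names are illustrative:

```python
import base64
import json

def encode_cursor(bookmark: dict) -> str:
    # Opaque to clients: base64url-encoded JSON, padding stripped for URL cleanliness
    raw = json.dumps(bookmark, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str) -> dict:
    padded = cursor + "=" * (-len(cursor) % 4)    # restore the stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))
```

Server-side, the decoded bookmark becomes a WHERE clause (e.g., `WHERE (created_at, id) > (:created_at, :id)`), which stays stable even as rows are inserted or deleted between pages.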

gRPC

gRPC is a high-performance RPC framework developed by Google. It uses Protocol Buffers for serialization and HTTP/2 for transport. It is the dominant choice for internal service-to-service communication in polyglot microservice environments.

Protocol Buffers

# Define service in .proto file
# payment.proto
syntax = "proto3";
package payment;

service PaymentService {
  rpc Charge(ChargeRequest) returns (ChargeResponse);
  rpc StreamTransactions(TransactionFilter)
      returns (stream Transaction);       // server streaming
  rpc BatchCharge(stream ChargeRequest)
      returns (BatchResult);             // client streaming
  rpc Chat(stream Message)
      returns (stream Message);          // bidirectional streaming
}

message ChargeRequest {
  string user_id    = 1;
  int64  amount_cents = 2;
  string currency   = 3;
}

message ChargeResponse {
  string transaction_id = 1;
  string status         = 2;
}

# Generate client/server code
protoc --go_out=. --go-grpc_out=. payment.proto   # Go
python -m grpc_tools.protoc -I. --python_out=. \
  --grpc_python_out=. payment.proto               # Python

gRPC vs REST Comparison

Dimension | gRPC | REST/JSON
Protocol | HTTP/2 (binary) | HTTP/1.1 or HTTP/2 (text)
Serialization | Protocol Buffers (~5x smaller, ~7x faster) | JSON (human-readable, widely supported)
Streaming | Built-in (4 modes) | SSE or WebSockets (not REST)
Code generation | Strong: type-safe clients from .proto | Optional (OpenAPI/Swagger)
Browser support | Requires gRPC-Web proxy | Native
Debugging | Binary (need grpcurl/BloomRPC) | curl/Postman readable
Best for | Internal service calls, low-latency, streaming | External/public APIs, browser clients

# grpcurl — curl for gRPC
# List services
grpcurl -plaintext localhost:50051 list

# Describe a service
grpcurl -plaintext localhost:50051 describe payment.PaymentService

# Call an RPC
grpcurl -plaintext -d '{"user_id": "u42", "amount_cents": 1000, "currency": "USD"}' \
  localhost:50051 payment.PaymentService/Charge

gRPC vs REST Decision Rule
Use gRPC for internal service calls where you control both client and server — the type safety and performance are worth it. Use REST/JSON for anything consumed by external developers, mobile apps, or browsers — the tooling and discoverability are far better.

Exposing APIs: Internal Services

Key Section

Internal APIs are service-to-service calls within your infrastructure. The challenges are: service discovery (how does service A find service B?), load balancing, mutual authentication, and resilience.

Full Request Flow: Service A Calls Service B

Service A (Pod)
    |
    | 1. DNS lookup: "payment-service.billing.svc.cluster.local"
    v
CoreDNS (K8s)
    |
    | Returns: ClusterIP 10.96.45.12
    v
K8s Service (ClusterIP: 10.96.45.12:8080)
    |
    | 2. kube-proxy rewrites destination to a healthy pod
    v
iptables / IPVS (node kernel)
    |
    | Selects: pod 10.244.2.7:8080 (round-robin or random)
    v
Service B Pod (payment-service)
    |
    | 3. (Optional) Sidecar proxy intercepts (Envoy/Istio)
    |    - mTLS termination
    |    - Circuit breaker check
    |    - Retry logic
    |    - Telemetry
    v
Application container
    |
    v
Response travels back the same path

Service Mesh (Sidecar Proxy Pattern)

A service mesh injects a proxy sidecar (typically Envoy) into every pod. All traffic passes through the sidecar, giving you observability, security, and reliability without changing application code.

# Istio: enable mTLS for an entire namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: billing
spec:
  mtls:
    mode: STRICT   # Reject plaintext connections

---
# Istio: circuit breaker via DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5      # open circuit after 5 errors
      interval: 30s                # check window
      baseEjectionTime: 30s        # eject for 30s
      maxEjectionPercent: 50       # never eject more than 50% of pods

Service Discovery

How does service A find service B's address? Common approaches:

  • DNS-based: Kubernetes Services + CoreDNS (shown above) — simplest, built in
  • Registry-based: services register themselves in Consul/etcd/Eureka; clients query the registry
  • Client-side LB: the client picks an instance from the registry (e.g., gRPC client-side load balancing)
  • Server-side LB: the client calls a stable virtual IP and a proxy picks the instance (ClusterIP model)

Circuit Breaker Pattern

┌──────────────────────────────────────┐
│        Circuit Breaker States        │
└──────────────────────────────────────┘

CLOSED (healthy)                  OPEN (failing)
─────────────────                 ──────────────
Requests pass through ──────────> Requests fail fast
                                  (no call to downstream)
Count failures                    Wait for timeout (e.g. 30s)
If failures > threshold           Then move to HALF-OPEN
                                         │
                                         v
                              HALF-OPEN (probing)
                              ─────────────────────
                              Let 1 request through
                              If succeeds → CLOSED
                              If fails → OPEN again
# Python retries with tenacity (exponential backoff)
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10)
)
async def call_payment_service(payload: dict) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://payment-service:8080/charge",
            json=payload,
            timeout=5.0
        )
        response.raise_for_status()
        return response.json()
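The retry example handles transient failures; the circuit breaker state machine in the diagram can be sketched as a small class. This is a minimal illustration with made-up defaults — production systems usually rely on a maintained library or mesh-level breaking (as in the Istio DestinationRule above):

```python
import time

class CircuitBreaker:
    """Minimal sketch of the CLOSED → OPEN → HALF-OPEN machine from the diagram."""
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"       # timeout elapsed: allow one probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"            # trip (or re-trip after failed probe)
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                      # success resets the count
        self.state = "CLOSED"
        return result
```

Failing fast while OPEN is the whole point: the caller gets an immediate error instead of tying up threads and timeouts on a downstream that is known to be unhealthy.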

Rate Limiting Between Services

Internal rate limiting protects downstream services from being overwhelmed by a misbehaving upstream caller. Implement at the proxy layer (Envoy/Istio) rather than in application code:

# Envoy rate limit filter configuration
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: billing-service
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_service

Exposing APIs: External Users / Frontend

Key Section

External APIs are consumed by mobile apps, browsers, third-party developers, and webhooks. The concerns shift to authentication, rate limiting, CORS, TLS termination, and protection from malicious traffic.

Full Request Flow: Mobile App to Backend

Mobile App / Browser
    |
    | HTTPS request to api.example.com
    v
DNS resolution → Cloudflare/AWS (CDN edge)
    |
    | Static assets served from edge cache (CDN HIT)
    | Dynamic API requests forwarded to origin
    v
AWS ALB / API Gateway / Nginx (TLS termination here)
    |
    | Strips TLS, forwards plain HTTP internally (or re-encrypts)
    v
API Gateway (Kong / AWS API Gateway)
    |
    | 1. Authentication middleware (validate JWT/API key)
    | 2. Rate limiting (token bucket per user/IP)
    | 3. Request logging + tracing (add X-Request-ID)
    | 4. Route to upstream service
    v
Backend Service (your API)
    |
    v
Response ← same path in reverse

Authentication Patterns

Method | How It Works | Best For | Pitfall
API Key | Static secret in header (X-API-Key: ...) | Server-to-server, developer APIs | Long-lived; rotate carefully
JWT (Bearer) | Signed token in Authorization: Bearer ... | User sessions, microservices | Can't revoke without blacklist
OAuth2 | Delegated authorization (access token + refresh token) | Third-party app access | Complex flow; use a library
Session Cookie | Server-side session, cookie with session ID | Browser-only web apps | CSRF vulnerability; need SameSite
mTLS | Client certificate presented during TLS | B2B, high-security APIs | Certificate management overhead

# JWT anatomy
# Header.Payload.Signature (base64url encoded, dot-separated)
# Header: {"alg": "RS256", "typ": "JWT"}
# Payload: {"sub": "user42", "iat": 1706000000, "exp": 1706003600, "roles": ["admin"]}
# Signature: RS256(base64(header) + "." + base64(payload), private_key)

# Decode a JWT (never send sensitive data in payload — it's not encrypted)
jwt_token="eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9..."
echo $jwt_token | cut -d. -f2 | base64 -d 2>/dev/null | python3 -m json.tool

# Validate with public key (do this server-side, never client-side)
# Verify: signature, expiration (exp), issuer (iss), audience (aud)
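The shell decode above can choke on base64url padding; a small Python equivalent handles it cleanly. For inspection only — `decode_jwt_unverified` is an illustrative helper and deliberately skips signature verification:

```python
import base64
import json

def decode_jwt_unverified(token: str) -> dict:
    """Decode header and payload WITHOUT checking the signature.
    For debugging only — never trust claims from an unverified token."""
    def b64url(part: str) -> bytes:
        # JWTs use base64url with padding stripped; restore it before decoding
        return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))
    header_b64, payload_b64, _signature = token.split(".")
    return {"header": json.loads(b64url(header_b64)),
            "payload": json.loads(b64url(payload_b64))}
```

Server-side validation should instead use a JWT library that verifies the signature and the exp/iss/aud claims before reading anything else.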

Rate Limiting Algorithms

Algorithm | How It Works | Pros | Cons
Token Bucket | Bucket holds N tokens. Each request consumes 1 token. Tokens refill at fixed rate. | Allows bursts up to bucket size | Burst can overwhelm if bucket is large
Leaky Bucket | Requests enter a queue (bucket). Processed at a fixed rate. Excess dropped. | Smooth output rate | Strict queue; high latency under load
Fixed Window | Count requests in fixed time window (e.g., 100/min). Reset at boundary. | Simple to implement | Burst at window boundary (2x rate)
Sliding Window | Rolling count over last N seconds using timestamps or Redis sorted sets. | No boundary burst | More memory/compute
Sliding Log | Store timestamp of each request. Count in window. Expire old ones. | Most accurate | High memory at scale

# Sliding window rate limit with Redis (sorted set of request timestamps)
import time
import uuid

import redis

r = redis.Redis()

def is_rate_limited(user_id: str, limit: int, window_seconds: int) -> bool:
    key = f"rate:{user_id}"
    now = time.time()
    window_start = now - window_seconds

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)    # drop entries outside the window
    member = f"{now}:{uuid.uuid4().hex}"           # unique member avoids collisions
    pipe.zadd(key, {member: now})                  # add current request
    pipe.zcard(key)                                # count requests in window
    pipe.expire(key, window_seconds)               # auto-expire idle keys
    results = pipe.execute()

    request_count = results[2]
    return request_count > limit
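For contrast, the token bucket from the table fits in a few lines when per-process state is enough (no Redis needed). A sketch with illustrative parameters:

```python
import time

class TokenBucket:
    """In-process token bucket: refills at `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity             # start full — permits an initial burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

This is per-process: behind a load balancer each instance has its own bucket, which is why distributed limits (like the Redis version above) are enforced centrally.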

CORS (Cross-Origin Resource Sharing)

Browsers enforce the same-origin policy: JavaScript on https://app.example.com cannot call https://api.example.com unless the API explicitly allows it. CORS headers tell the browser which cross-origin requests are permitted.

# Simple CORS request (GET/POST with simple headers)
# Browser automatically adds:
Origin: https://app.example.com

# Server must respond with:
Access-Control-Allow-Origin: https://app.example.com  # or * (never for credentialed)
Access-Control-Allow-Credentials: true                 # if sending cookies/auth

# Preflight (OPTIONS) — triggered by:
# - Non-simple methods (PUT, DELETE, PATCH)
# - Non-simple headers (Authorization, Content-Type: application/json)
OPTIONS /api/users HTTP/1.1
Origin: https://app.example.com
Access-Control-Request-Method: DELETE
Access-Control-Request-Headers: Authorization

# Server must respond to OPTIONS with:
HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://app.example.com
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
Access-Control-Allow-Headers: Authorization, Content-Type
Access-Control-Max-Age: 86400          # Cache preflight for 24h
Access-Control-Allow-Credentials: true

CORS Pitfall: Wildcard + Credentials
Access-Control-Allow-Origin: * (wildcard) cannot be used with Access-Control-Allow-Credentials: true. If you need to send cookies or Authorization headers cross-origin, you must specify the exact origin. Never use * for authenticated APIs.

TLS Termination

TLS termination is where the encrypted HTTPS connection is decrypted. Two approaches:

  • Terminate at the edge (LB/CDN): traffic inside your network is plain HTTP — simple and fast, but assumes a trusted internal network
  • End-to-end encryption: the edge re-encrypts to the backend (or passes the TCP stream through untouched) — needed for compliance and zero-trust architectures

CDN & API Caching

# Control what CDNs (and browsers) cache via Cache-Control header
# Public GET endpoints
Cache-Control: public, max-age=300, stale-while-revalidate=60

# Private user data — never cache at CDN
Cache-Control: private, no-cache

# Immutable assets (content-hashed filenames)
Cache-Control: public, max-age=31536000, immutable

# CDN cache purge (Cloudflare API)
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache" \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{"purge_everything": true}'

# Vary header: CDN must cache separate responses per Accept-Encoding
Vary: Accept-Encoding

Webhook Patterns

Webhooks are outbound API calls — your server pushes events to a customer's endpoint when something happens. Design considerations:

  • Retries with exponential backoff — the receiver may be down
  • HMAC signatures so receivers can verify authenticity
  • Idempotency: include an event ID so duplicate deliveries are harmless
  • Timeouts: never let a slow receiver block your delivery workers

# Webhook delivery: retry with exponential backoff
import hashlib
import hmac
import json
import time

import httpx
def deliver_webhook(url: str, payload: dict, secret: str) -> bool:
    """
    Deliver webhook with HMAC signature for verification.
    Receiver should validate: hmac.compare_digest(expected_sig, received_sig)
    """
    body = json.dumps(payload, separators=(',', ':')).encode()

    # Signature so receivers can verify authenticity
    signature = hmac.new(
        secret.encode(),
        body,
        hashlib.sha256
    ).hexdigest()

    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Signature": f"sha256={signature}",
        "X-Webhook-Timestamp": str(int(time.time())),
    }

    for attempt in range(5):
        try:
            response = httpx.post(url, content=body, headers=headers, timeout=10)
            if response.status_code < 300:
                return True              # 2xx: delivered
            if response.status_code < 500:
                return False             # 4xx: client error — don't retry
        except httpx.RequestError:
            pass
        time.sleep(2 ** attempt)         # 1s, 2s, 4s, 8s, 16s
    return False
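The receiving side of this scheme verifies the signature before trusting the payload. A sketch pairing with the delivery code above — the timestamp check is an extra replay guard; note that production schemes (e.g., Stripe's) typically sign the timestamp together with the body:

```python
import hashlib
import hmac
import time

def verify_webhook(body: bytes, signature_header: str, timestamp_header: str,
                   secret: str, max_age_seconds: int = 300) -> bool:
    # Reject stale deliveries to limit replay of captured requests
    if abs(time.time() - int(timestamp_header)) > max_age_seconds:
        return False
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison defends against timing attacks
    return hmac.compare_digest(expected, signature_header)
```

Always compare with `hmac.compare_digest`, never `==` — a naive string comparison leaks how many leading bytes matched via timing.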

Load Balancing

L4 vs L7 Load Balancers

Layer | Sees | Can Route By | Examples | Use Case
L4 (Transport) | TCP/UDP packets | IP, port | AWS NLB, HAProxy (TCP mode) | Low latency, TLS passthrough, non-HTTP
L7 (Application) | HTTP headers, URL, body | Path, host, header, cookie | AWS ALB, Nginx, Envoy, Caddy | HTTP routing, SSL termination, A/B testing

                        Internet
                           |
               ┌───────────────────────┐
               │   L7 Load Balancer    │  (Nginx / ALB)
               │  Path-based routing   │
               └───────────────────────┘
           /api/*     |    /static/*   |   Host-based
              ↓       ↓                ↓
          API Pods   API Pods       S3 / CDN
         (backend)  (v2 canary)     (assets)

    ┌───────────────────────────────────┐
    │   L4 Load Balancer (AWS NLB)      │
    │   TCP passthrough, ultra-low lat  │
    └───────────────────────────────────┘
            ↓                  ↓
      gRPC Service       Database Proxy
                          (PgBouncer)

Load Balancing Algorithms

Algorithm | How It Works | Best For
Round Robin | Cycle through servers in order | Homogeneous servers, short-lived requests
Weighted Round Robin | Servers get proportional share (e.g., 70/30) | Heterogeneous capacity, canary deployments
Least Connections | Send to server with fewest active connections | Long-lived connections (WebSockets, gRPC streams)
IP Hash | Hash client IP to always route to same server | Simple session affinity (no shared session store)
Consistent Hashing | Hash to a ring; minimize remapping on server add/remove | Cache clusters, distributed storage
Random (power of 2) | Pick 2 random servers, choose less loaded | Large fleets, avoids round-robin thundering herd
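Consistent hashing is worth seeing concretely: keys and servers hash onto the same ring, and each key belongs to the first server clockwise from it. A minimal sketch (class and parameter names are illustrative; real systems often use better hash functions than MD5):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes; removing a server remaps only its own keys."""
    def __init__(self, servers, vnodes: int = 100):
        self.ring = []                        # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):           # virtual nodes smooth the distribution
                h = self._hash(f"{server}#{i}")
                self.ring.append((h, server))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash
        idx = bisect.bisect_left(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]
```

The payoff vs a naive `hash(key) % len(servers)`: dropping one server out of N remaps roughly 1/N of the keys instead of nearly all of them — exactly what cache clusters need to avoid a stampede of misses.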

Health Checks & Session Affinity

# Nginx upstream with health checks and sticky sessions
upstream api_backend {
    least_conn;                          # algorithm

    server api-1.internal:8080;
    server api-2.internal:8080;
    server api-3.internal:8080;

    # Sticky sessions via cookie (nginx plus / commercial)
    sticky cookie srv_id expires=1h domain=.example.com path=/;

    keepalive 32;                        # reuse connections to backend
}

server {
    location /api/ {
        proxy_pass http://api_backend;
        health_check interval=5s fails=3 passes=2;   # nginx plus
    }
}

# AWS ALB target group health check equivalent
# In Terraform:
resource "aws_lb_target_group" "api" {
  health_check {
    path                = "/healthz"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
  }
}

WebSockets & Server-Sent Events

WebSocket Upgrade Handshake

WebSockets start as an HTTP request and are "upgraded" to a persistent bidirectional TCP connection. The HTTP connection is reused — no new TCP connection is needed.

# WebSocket upgrade request
GET /ws/feed HTTP/1.1
Host: api.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==   # random base64
Sec-WebSocket-Version: 13

# Server accepts upgrade
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=  # SHA1 of key + GUID

# After upgrade: full-duplex binary framing
# Ping/pong frames for keepalive
# Close frame for graceful shutdown
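The Sec-WebSocket-Accept value above is deterministic: SHA-1 of the client key concatenated with a fixed GUID, base64-encoded. A few lines reproduce it (the sample key/accept pair shown in the handshake is the well-known RFC 6455 example):

```python
import base64
import hashlib

# Fixed GUID from RFC 6455 — every WebSocket server appends this exact string
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()
```

This round-trip proves to the client that the server actually understood the WebSocket upgrade rather than blindly echoing headers.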

Server-Sent Events (SSE)

SSE is one-way server push over a single HTTP connection. Simpler than WebSockets when you only need server → client streaming.

# FastAPI SSE endpoint
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def event_generator():
    while True:
        # get_next_event() is app-specific (e.g., read from a queue or pub/sub)
        data = await get_next_event()
        yield f"id: {data['id']}\n"
        yield f"event: {data['type']}\n"
        yield f"data: {json.dumps(data['payload'])}\n\n"
        await asyncio.sleep(0.1)

@app.get("/events")
async def stream_events():
    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no"   # disable Nginx buffering
        }
    )

Comparison: Real-Time Options

Method | Direction | Protocol | Best For | Notes
WebSocket | Bidirectional | WS (over TCP) | Chat, games, collaborative editing | Needs sticky sessions at LB
SSE | Server → Client only | HTTP | Live feeds, notifications, logs | Auto-reconnect built-in; simpler
Long Polling | Server → Client | HTTP | Fallback, fire-and-forget push | High overhead; legacy approach
HTTP/2 Server Push | Server → Client | HTTP/2 | Preloading assets | Deprecated/removed in Chrome; avoid

Load Balancing WebSockets

WebSockets maintain a persistent connection to a specific backend instance. This creates challenges for load balancers:

  • Sticky sessions: reconnects should land on a backend that knows the session, or session state must be externalized (e.g., Redis pub/sub)
  • Connection draining: deploys must wait for, or gracefully close, long-lived connections
  • Idle timeouts: LBs kill idle connections (AWS ALB defaults to 60s) — send ping/pong keepalives
  • Uneven load: least-connections beats round-robin when connections are long-lived

Network Debugging

curl Advanced Usage

# Verbose: show TLS handshake, request/response headers
curl -v https://api.example.com/users

# Show only response headers
curl -I https://api.example.com/users

# Custom headers
curl -H "Authorization: Bearer eyJhbGc..." \
     -H "Content-Type: application/json" \
     https://api.example.com/users

# POST with JSON body
curl -X POST https://api.example.com/users \
     -H "Content-Type: application/json" \
     -d '{"name": "Alice", "email": "[email protected]"}'

# Override DNS (test a specific IP without changing /etc/hosts)
curl --resolve api.example.com:443:93.184.216.34 https://api.example.com/users

# Follow redirects, show final URL
curl -L -w "%{url_effective}\n" https://short.url/abc

# Measure timing breakdown
curl -w "\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirst byte: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
     -o /dev/null -s https://api.example.com/users

# Set timeout
curl --connect-timeout 5 --max-time 30 https://api.example.com/users

tcpdump Basics

# Capture HTTP traffic on port 80 to a file
sudo tcpdump -i en0 port 80 -w capture.pcap

# Live capture: show HTTP requests (ASCII)
sudo tcpdump -i en0 -A port 80

# Filter by host and port
sudo tcpdump -i en0 host api.example.com and port 443

# Capture DNS queries
sudo tcpdump -i en0 port 53

# Read captured file in Wireshark
wireshark capture.pcap

# Note: TLS traffic is encrypted in captures.
# Decrypt with SSLKEYLOGFILE env var (Chrome/Firefox/curl support it):
SSLKEYLOGFILE=~/ssl-keys.log curl https://api.example.com
# Then load ssl-keys.log in Wireshark under TLS preferences

Other Diagnostic Tools

# mtr: continuous traceroute (shows packet loss at each hop)
mtr --report --report-cycles 10 google.com

# ss: socket statistics (modern netstat replacement)
ss -tlnp              # TCP listening sockets with process names
ss -tnp               # established TCP connections
ss -s                 # summary statistics

# lsof: what process owns a port
lsof -i :8080         # what's on port 8080
lsof -i TCP:443       # all TCP connections on 443
lsof -i -n -P | grep LISTEN   # all listening ports

# Check if a port is open (no nmap needed)
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/api.example.com/443'
echo $?  # 0 = open; 1 = closed/refused; 124 = timed out (timeout's exit code)

Common Debugging Scenarios

Connection refused (ECONNREFUSED)

The target host is reachable but actively rejected the connection (TCP RST): nothing is listening on that port. Check:

  • lsof -i :PORT — is anything listening on that port?
  • docker ps — is the container running? Did it crash?
  • Is the service binding to 127.0.0.1 (loopback only) instead of 0.0.0.0?
  • Is a firewall (iptables, security group) blocking the port?
lsof -i :8080
# If nothing shows, the service isn't running or crashed at startup
# Check service logs: docker logs container-name, journalctl -u service-name
Connection timeout (no response)

Unlike connection refused (an active RST reply), a timeout means packets are being dropped silently with no reply at all:

  • Firewall is blocking and dropping (not rejecting) — check security group rules
  • Host is unreachable — check routing table, VPC peering, VPN
  • Wrong IP — the DNS resolved to the wrong address
traceroute api.example.com    # Where does the path stop?
dig api.example.com            # Is DNS resolving to the expected IP?
nmap -p 443 api.example.com   # Is the port responding?
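The refused-vs-timeout distinction maps directly onto socket errors: a closed port (or firewall REJECT) raises `ConnectionRefusedError` almost instantly, while a silent DROP surfaces only as a timeout. A sketch using just the standard library:

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connect attempt:
    'open'    — handshake completed
    'refused' — RST received: port closed or firewall REJECT
    'timeout' — no reply at all: packets dropped silently (firewall DROP,
                black-holed route, host down)"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as e:
        return f"error: {e}"   # e.g. network unreachable, DNS failure
```

Usage: `probe("api.example.com", 443)`. A fast "refused" points at the service or its bind address; a slow "timeout" points at the network path or a firewall.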
DNS failure (NXDOMAIN or SERVFAIL)
# Test with a public resolver (bypass local cache)
dig @8.8.8.8 api.example.com
dig @1.1.1.1 api.example.com

# Check if it's a local cache issue
sudo killall -HUP mDNSResponder   # flush macOS DNS cache
# or: sudo dscacheutil -flushcache

# SERVFAIL: upstream resolver error — try a different resolver
# NXDOMAIN: the name truly doesn't exist — check DNS zone, check for typos
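The same resolution check can be scripted: `socket.getaddrinfo` raises `socket.gaierror` for both NXDOMAIN and temporary resolver failures (the error code distinguishes them on most platforms). A sketch:

```python
import socket

def resolve(name: str) -> list[str]:
    """Return the unique IP addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror as e:
        # EAI_NONAME ~ NXDOMAIN; EAI_AGAIN ~ temporary failure (SERVFAIL-ish)
        print(f"resolution failed: {e}")
        return []
    return sorted({info[4][0] for info in infos})

print(resolve("localhost"))   # loopback addresses, e.g. ['127.0.0.1', '::1']
```

Note this uses the system resolver (including /etc/hosts and any local cache), so it tests what your applications actually see — unlike `dig @8.8.8.8`, which bypasses it.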
Certificate errors (SSL handshake failed)
# Check certificate details
openssl s_client -connect api.example.com:443 -servername api.example.com

# Common errors:
# "certificate has expired"
openssl s_client -connect api.example.com:443 2>/dev/null | openssl x509 -noout -dates

# "hostname mismatch" — cert's CN/SANs don't match the hostname
openssl s_client -connect api.example.com:443 2>/dev/null | openssl x509 -noout -text | grep DNS

# "certificate signed by unknown authority" — custom CA not in trust store
curl --cacert /path/to/custom-ca.crt https://internal.example.com
# or add to system trust store:
# macOS: sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain custom-ca.crt
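The `notAfter` date printed by the `-dates` check above can be turned into a days-to-expiry number for monitoring; `ssl.cert_time_to_seconds` in the standard library parses exactly the format OpenSSL prints:

```python
import ssl
import time

def days_until_expiry(not_after: str) -> float:
    """not_after: the notAfter= value from `openssl x509 -noout -dates`,
    e.g. 'Jun  1 12:00:00 2030 GMT'. Negative result = already expired."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - time.time()) / 86400

# Example: alert when a cert is inside a 30-day renewal window
if days_until_expiry("Jun  1 12:00:00 2030 GMT") < 30:
    print("certificate expiring soon!")
```

Wiring this to a cron job that shells out to the openssl one-liner above is a common lightweight expiry monitor.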

Network Security

Firewalls & Security Groups

Firewalls filter traffic based on source/destination IP, port, and protocol. In cloud environments, security groups are stateful firewalls at the instance/ENI level.

# iptables: Linux kernel firewall (legacy, still common)
# Allow established connections (stateful)
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow incoming on port 443
iptables -A INPUT -p tcp --dport 443 -j ACCEPT

# Drop everything else
iptables -A INPUT -j DROP

# List current rules
iptables -L -n -v

# nftables: modern replacement for iptables
nft list ruleset

# AWS Security Group (Terraform)
resource "aws_security_group" "api" {
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]     # public HTTPS
  }
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.lb.id]  # only from LB
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Network Segmentation (VPC)

VPC (10.0.0.0/16)
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Public Subnets (10.0.1.0/24, 10.0.2.0/24)              │
│  ┌──────────────┐      ┌──────────────┐                 │
│  │  ALB / NAT   │      │   Bastion    │ ← only SSH      │
│  │  Gateway     │      │   Host       │   entry         │
│  └──────────────┘      └──────────────┘                 │
│         │                      │                        │
│         ▼                      ▼ (SSH jump)             │
│  Private Subnets (10.0.3.0/24, 10.0.4.0/24)             │
│  ┌──────────────┐      ┌──────────────┐                 │
│  │ App Servers  │      │  Databases   │ ← no internet   │
│  │ (EKS nodes)  │      │ (RDS, Elast.)│   access        │
│  └──────────────┘      └──────────────┘                 │
│                                                         │
│  No direct internet route to private subnets            │
│  Outbound: private subnet → NAT Gateway → Internet      │
└─────────────────────────────────────────────────────────┘
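The CIDR layout above can be sanity-checked with Python's ipaddress module — for instance, that each /24 subnet actually sits inside the /16 VPC range, and how many hosts it holds:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = [ipaddress.ip_network(c) for c in
           ("10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24", "10.0.4.0/24")]

for s in subnets:
    assert s.subnet_of(vpc)   # every subnet must fit inside the VPC range
    # A /24 spans 256 addresses; AWS reserves 5 per subnet, leaving 251 usable
    print(s, "usable hosts (AWS):", s.num_addresses - 5)

# Check which subnet a concrete instance IP lands in:
ip = ipaddress.ip_address("10.0.3.17")
print([str(s) for s in subnets if ip in s])   # ['10.0.3.0/24']
```

This kind of check is handy in IaC tests to catch overlapping or out-of-range subnets before `terraform apply` does.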

Common Attacks & Mitigations

| Attack | How It Works | Mitigation |
|---|---|---|
| SYN Flood (DDoS) | Flood server with SYN packets, exhaust connection table (half-open connections) | SYN cookies (stateless), rate limit SYN at edge, Cloudflare/AWS Shield |
| DNS Spoofing / Cache Poisoning | Inject false DNS records into resolver cache, redirect traffic to attacker | DNSSEC (signed records), randomize source port + query ID |
| MITM (Man-in-the-Middle) | Attacker intercepts traffic between client and server | TLS everywhere, certificate pinning, HSTS |
| BGP Hijacking | Malicious AS announces prefixes it doesn't own, reroutes internet traffic | RPKI (Route Origin Authorization), BGP filtering |
| Amplification DDoS | Use UDP services (DNS, NTP) to amplify small requests into large floods | BCP38 ingress filtering, rate-limit UDP reflection, disable open resolvers |
| SSL Stripping | Downgrade HTTPS to HTTP in transit (requires MITM position) | HSTS + preload list (browser refuses to connect over HTTP) |

Zero-Trust Architecture

Traditional perimeter security assumes: "if you're inside the network, you can be trusted." Zero-trust assumes: "never trust, always verify" — regardless of network location.

# Kubernetes NetworkPolicy: restrict what can talk to the database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-allow-api-only
  namespace: production
spec:
  podSelector:
    matchLabels:
      role: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: api      # only pods labeled role=api can connect
      ports:
        - protocol: TCP
          port: 5432
# All other ingress to database pods is dropped by default

HSTS & Certificate Pinning

# HSTS: tell browsers to always use HTTPS (never downgrade)
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

# max-age=31536000 = 1 year
# includeSubDomains: applies to all subdomains
# preload: submit to browser HSTS preload list (hardcoded in Chrome/Firefox)
# WARNING: preload is nearly irreversible — only add if you're committed to HTTPS

# Certificate pinning (mobile apps)
# Pin the public key hash of your certificate or CA
# If the cert changes without updating the pin, all requests fail
# Risk: if the pinned key is lost or must be rotated, old app versions can
# no longer connect at all; always pin a backup key alongside the active one
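A small parser for the HSTS directives discussed above makes the header's structure explicit (`parse_hsts` is a hypothetical helper written for illustration):

```python
def parse_hsts(header: str) -> dict:
    """Parse a Strict-Transport-Security header value into its directives."""
    result = {"max_age": None, "include_subdomains": False, "preload": False}
    for directive in header.split(";"):
        directive = directive.strip()
        if directive.lower().startswith("max-age="):
            result["max_age"] = int(directive.split("=", 1)[1])
        elif directive.lower() == "includesubdomains":
            result["include_subdomains"] = True
        elif directive.lower() == "preload":
            result["preload"] = True
    return result

hsts = parse_hsts("max-age=31536000; includeSubDomains; preload")
print(hsts["max_age"] / 86400)   # 365.0 — max-age expressed in days
```

A parser like this is useful in a security-headers audit script that fetches your endpoints and asserts max-age is at least a year.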