Setup & Environment

Before working with AWS, you need the CLI configured and ideally a local sandbox. LocalStack lets you iterate fast without incurring costs.

Install & Configure AWS CLI

# Install via Homebrew (macOS)
brew install awscli

# Verify installation
aws --version
# aws-cli/2.x.x Python/3.x.x ...

# Interactive configuration wizard
aws configure
# AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Default region name [None]: us-east-1
# Default output format [None]: json

# View stored config
cat ~/.aws/config
cat ~/.aws/credentials

# Use named profiles for multiple accounts
aws configure --profile staging
aws s3 ls --profile staging

# Set environment variables (useful in CI/CD)
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/...
export AWS_DEFAULT_REGION=us-east-1

LocalStack for Local Development

LocalStack saves money
LocalStack emulates most AWS services locally on port 4566. You can create S3 buckets, queues, Lambda functions, and more without touching a real AWS account. Essential for fast TDD against AWS APIs.
# Run LocalStack via Docker
docker run -d \
  --name localstack \
  -p 4566:4566 \
  -e SERVICES=s3,sqs,sns,lambda,dynamodb,iam \
  localstack/localstack

# Verify it's running
docker ps | grep localstack
curl http://localhost:4566/_localstack/health

# Point AWS CLI at LocalStack with --endpoint-url
aws --endpoint-url=http://localhost:4566 s3 ls
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-test-bucket
aws --endpoint-url=http://localhost:4566 s3 ls

# Create an alias so you don't repeat the flag
alias awslocal='aws --endpoint-url=http://localhost:4566'
awslocal sqs create-queue --queue-name my-queue
awslocal dynamodb list-tables
# docker-compose.yml for persistent LocalStack setup
version: '3.8'
services:
  localstack:
    image: localstack/localstack:latest
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3,sqs,sns,lambda,dynamodb,secretsmanager
      - DEBUG=1
      - DATA_DIR=/tmp/localstack/data
    volumes:
      - "./localstack-data:/tmp/localstack/data"
      - "/var/run/docker.sock:/var/run/docker.sock"
Never commit real credentials
Keep AWS credentials in ~/.aws/credentials or environment variables. Add .env, *.pem, and credentials to .gitignore. Use IAM roles (not access keys) for production workloads running on AWS.

Core Concepts

AWS organizes its infrastructure around geographic and logical boundaries. Understanding these concepts is prerequisite knowledge for every other service.

Regions, Availability Zones, and Edge Locations

Concept | What it is | Examples
Region | Geographically isolated cluster of data centers. Each region is independent and contains multiple AZs. | us-east-1 (N. Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore)
Availability Zone (AZ) | One or more discrete data centers within a region, connected by low-latency links. Each AZ has independent power, cooling, and networking. | us-east-1a, us-east-1b, us-east-1c
Edge Location | Mini data centers used by CloudFront CDN and Route 53 DNS to serve content closer to end users. Not full regions. | 200+ locations globally (NYC, London, Tokyo...)
Local Zone | Extension of a region placed in a metro area for single-digit-millisecond latency to a specific city. | us-east-1-bos-1 (Boston)

Global vs. Regional Services

Scope | Services | Why this scope?
Global | IAM, Route 53, CloudFront, WAF, Organizations | Identity and DNS must be consistent everywhere
Regional | EC2, S3, RDS, Lambda, VPC, SQS, SNS, ECS, EKS | Data residency, fault isolation, latency optimization
AZ-scoped | EC2 instances, EBS volumes, subnets | Physical hardware tied to specific data centers

ARNs — Amazon Resource Names

Every AWS resource has a unique ARN. Understanding the format matters when writing IAM policies and CloudFormation templates.

# ARN format
arn:partition:service:region:account-id:resource-type/resource-id

# Examples
arn:aws:s3:::my-bucket                          # S3 bucket (global, no region/account)
arn:aws:s3:::my-bucket/path/to/object           # S3 object
arn:aws:iam::123456789012:user/alice             # IAM user (global, no region)
arn:aws:iam::123456789012:role/MyRole            # IAM role
arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0
arn:aws:lambda:us-east-1:123456789012:function:my-function
arn:aws:sqs:us-east-1:123456789012:my-queue
arn:aws:dynamodb:us-east-1:123456789012:table/Users
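
The colon-delimited format above can be split mechanically. A minimal illustrative parser (the `Arn` tuple and `parse_arn` name are inventions for this sketch, not an AWS SDK API) shows why S3 and IAM ARNs have empty region/account fields:

```python
from typing import NamedTuple

class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account: str
    resource: str

def parse_arn(arn: str) -> Arn:
    # Split on the first five colons only: the resource part may itself
    # contain colons (e.g., Lambda's "function:my-function").
    prefix, partition, service, region, account, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError(f"not an ARN: {arn}")
    return Arn(partition, service, region, account, resource)

print(parse_arn("arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0"))
# S3 bucket ARNs leave region and account empty:
print(parse_arn("arn:aws:s3:::my-bucket/path/to/object"))
```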

Shared Responsibility Model

AWS and the customer share security responsibilities. Knowing the boundary prevents misconfigurations.

AWS responsible for | You responsible for
Physical hardware, data centers, networking | Data encryption at rest and in transit
Hypervisor and host OS patching | Guest OS patching (EC2 instances)
Managed service patching (RDS, Lambda runtime) | Application-level security, IAM policies
Global infrastructure availability | Network configuration, security groups, NACLs
Compliance certifications (SOC 2, PCI DSS) | Enabling compliance for your workloads on top

IAM — Identity and Access Management

IAM is the access control system for all of AWS. It is global (not region-scoped). Mistakes here are the most common source of both security breaches and confusing permission errors.

Principals: Users, Groups, and Roles

Principal | Purpose | When to use
IAM User | Long-term credentials (password + access keys) for a person or service | Human developers, legacy automation. Prefer roles for EC2/Lambda.
IAM Group | Collection of users; attach policies to groups rather than individual users | Team-level permissions (Developers, ReadOnly, Admins)
IAM Role | Temporary credentials assumed by a service, user, or external identity | EC2 instance profiles, Lambda execution, cross-account access
Service Principal | AWS service identity (e.g., lambda.amazonaws.com) | Trust policies: allows a service to assume a role

Policy Structure

Policies are JSON documents. Every policy statement contains: Effect, Action, Resource, and optionally Condition.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ReadOnMyBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    },
    {
      "Sid": "DenyDeleteFromProd",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::prod-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    }
  ]
}
Policy evaluation order
AWS evaluates policies in this order: (1) an explicit Deny always wins, (2) an explicit Allow is required to grant access, (3) everything else is implicitly denied by default. An explicit Deny in any applicable policy overrides every Allow — even Allows in other policies attached to the same principal.
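
The evaluation order in the note above can be sketched as a toy function. Statements are reduced to Effect/Action dicts here; real IAM evaluation also matches resources, principals, and conditions:

```python
def evaluate(statements, action):
    """Toy model: explicit Deny > explicit Allow > implicit Deny."""
    decision = "ImplicitDeny"  # the default when nothing matches
    for stmt in statements:
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if action in actions or "*" in actions:
            if stmt["Effect"] == "Deny":
                return "ExplicitDeny"  # Deny wins no matter what else matched
            decision = "Allow"
    return decision

policy = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"]},
    {"Effect": "Deny", "Action": "s3:DeleteObject"},
]
print(evaluate(policy, "s3:GetObject"))     # Allow
print(evaluate(policy, "s3:DeleteObject"))  # ExplicitDeny
print(evaluate(policy, "s3:PutObject"))     # ImplicitDeny
```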

Assume Role

Roles are assumed via STS (Security Token Service), which returns temporary credentials valid for 15 minutes to 12 hours.

# Assume a role from the CLI
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/DeployRole \
  --role-session-name deploy-session

# Returns: AccessKeyId, SecretAccessKey, SessionToken
# Export them to use in subsequent commands
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...

# Verify which identity you're using
aws sts get-caller-identity
// Trust policy — allows EC2 to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

// Cross-account trust — allows account 987654321098 to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::987654321098:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        }
      }
    }
  ]
}

Instance Profiles

An instance profile is a container for an IAM role that gets attached to an EC2 instance. The instance automatically retrieves temporary credentials from the instance metadata endpoint.

# From inside an EC2 instance, credentials come from the instance metadata
# service. With IMDSv2 (the recommended default), fetch a session token first:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Returns the role name, then:
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRoleName

# SDKs and the CLI use instance profile credentials automatically — no config needed
# This is why you should NEVER put long-term access keys on EC2 instances
IAM Best Practices Checklist
  • Least privilege: start with no permissions, add only what's needed
  • Enable MFA on the root account and all human IAM users
  • Never use root for day-to-day work — create an admin IAM user instead
  • No long-term access keys on EC2/Lambda — use instance profiles and execution roles
  • Rotate access keys regularly; delete unused ones
  • Use IAM groups to manage permissions at scale, not individual users
  • Prefer managed policies (AWS-maintained) over inline policies where possible
  • Use conditions to restrict by IP, MFA, time, or source VPC
  • Enable CloudTrail to audit all IAM and API calls
  • Review IAM Access Analyzer to find external access to resources

Compute: EC2

EC2 (Elastic Compute Cloud) provides resizable virtual machines. It is the foundation of most AWS compute architectures, even when you are using higher-level services that run on top of it.

Instance Type Families

Family | Optimized for | Common types | Use case
t3 / t4g | Burstable general purpose | t3.micro, t3.small, t3.medium | Dev/test, low-traffic web servers
m5 / m6i | Balanced compute/memory | m5.large, m5.xlarge, m5.4xlarge | Web servers, app servers, small databases
c5 / c6i | Compute-intensive | c5.large, c5.2xlarge, c5.9xlarge | Batch processing, ML inference, video encoding
r5 / r6i | Memory-intensive | r5.large, r5.4xlarge, r5.24xlarge | In-memory databases, large caches, analytics
g4dn / g5 | GPU accelerated | g4dn.xlarge, g5.2xlarge | ML training, GPU rendering, gaming
i3 / i4i | Storage-optimized (NVMe) | i3.large, i3.2xlarge | NoSQL databases, data warehousing

Pricing Models

Model | Description | Savings vs on-demand | Best for
On-Demand | Pay per second/hour, no commitment | Baseline | Unpredictable workloads, short-term
Reserved Instances | 1- or 3-year commitment to a specific instance type | Up to 72% | Stable, predictable baseline load
Savings Plans | Flexible commitment to spend $/hr; applies across instance types | Up to 66% | Predictable spend, flexible instance types
Spot Instances | Spare capacity; AWS can reclaim with a 2-minute notice | Up to 90% | Fault-tolerant batch jobs, ML training
Dedicated Hosts | Physical server dedicated to your account | Varies | Compliance, license requirements
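
A quick back-of-envelope comparison of the models above. The $0.096/hr on-demand rate and 730 hours/month are assumptions for illustration, not current AWS list prices:

```python
# Hypothetical on-demand hourly rate; the discounts come from the table above.
on_demand_hr = 0.096
hours_per_month = 730

scenarios = {
    "on-demand": on_demand_hr,
    "reserved (72% off)": on_demand_hr * (1 - 0.72),
    "spot (90% off)": on_demand_hr * (1 - 0.90),
}
for name, rate in scenarios.items():
    print(f"{name:20s} ${rate * hours_per_month:7.2f}/month")
```

The spread is why spot plus checkpointing is the default answer for interruptible batch work.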

Launching an Instance (CLI)

# Find the latest Amazon Linux 2023 AMI
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=al2023-ami-*" "Name=architecture,Values=x86_64" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
  --output text

# Launch an instance
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --key-name my-key-pair \
  --security-group-ids sg-12345678 \
  --subnet-id subnet-12345678 \
  --iam-instance-profile Name=MyInstanceProfile \
  --user-data file://bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server}]'

# Check instance status
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=web-server" \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PublicIpAddress]' \
  --output table

# Push a temporary SSH public key via EC2 Instance Connect (valid ~60 seconds),
# then ssh as usual
aws ec2-instance-connect send-ssh-public-key \
  --instance-id i-1234567890abcdef0 \
  --instance-os-user ec2-user \
  --ssh-public-key file://~/.ssh/id_rsa.pub
ssh ec2-user@<instance-public-ip>

User Data Script

User data runs once at first boot as root. Use it to install software, configure the instance, and start services.

#!/bin/bash
# bootstrap.sh — runs at first launch as root
set -e
yum update -y

# Install Docker
yum install -y docker
systemctl enable docker
systemctl start docker
usermod -aG docker ec2-user

# Install application
yum install -y git
git clone https://github.com/myorg/myapp /opt/myapp
cd /opt/myapp

# Start with systemd
cat > /etc/systemd/system/myapp.service <<EOF
[Unit]
Description=My Application
After=network.target

[Service]
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/python3 app.py
Restart=always
User=ec2-user

[Install]
WantedBy=multi-user.target
EOF

systemctl enable myapp
systemctl start myapp
Production Use Cases
Use Case | Why This Service
Stateful workloads (databases, caches) | Needs persistent local NVMe storage and a consistent network identity across restarts; Lambda's ephemeral execution environment and stateless model make this impossible.
GPU / ML training (p4d, g5 instances) | Fargate has no GPU support; EC2 gives you direct PCIe access to A100/A10G GPUs and lets you tune CUDA drivers. For one-off training runs, spot instances cut costs 60–90%.
Legacy app migration (lift-and-shift) | You control the OS, runtime, and network stack — zero application refactoring required. Use this as a stepping stone; don't treat it as a destination.
Fault-tolerant batch processing on spot | Spot Fleet + checkpointing to S3 delivers 60–90% cost savings. The key insight: if your job can resume from a checkpoint, preemption is cheap.
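
The checkpointing idea in the last row can be sketched in a few lines. This is a pure simulation: a dict stands in for the S3 checkpoint object, and `process_batch` and the job layout are invented for illustration:

```python
checkpoint_store = {}  # stand-in for an S3 checkpoint object

def process_batch(items, job_id, interrupt_at=None):
    """Process items, resuming from the last checkpoint if one exists."""
    start = checkpoint_store.get(job_id, 0)
    done = []
    for i in range(start, len(items)):
        if interrupt_at is not None and i == interrupt_at:
            return done  # simulated 2-minute spot reclaim notice
        done.append(items[i] * 2)          # the "work"
        checkpoint_store[job_id] = i + 1   # persist progress after each item
    return done

items = list(range(10))
first = process_batch(items, "job-1", interrupt_at=6)   # preempted at item 6
second = process_batch(items, "job-1")                  # new instance resumes
print(len(first), len(second))  # 6 4
```

Because progress is persisted outside the instance, the preemption costs only the in-flight item, not the whole job.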

Compute: Lambda

Lambda is AWS's serverless compute service. You upload code, define a handler function, and AWS manages everything else: servers, OS, scaling, high availability. You pay only for compute time consumed — down to 1ms granularity.

The Serverless Model

Concept | Description
Function | Your deployment unit — code + dependencies + configuration
Handler | Entry point: module.function_name (e.g., handler.lambda_handler)
Event | JSON payload delivered to the handler — shape varies by trigger
Context | Runtime info: function name, remaining time, log stream, request ID
Execution environment | Micro-VM (Firecracker) — frozen between invocations, reused when warm
Cold start | First invocation after idle: environment initialization adds 100ms–2s latency
Concurrency | Each simultaneous invocation gets its own environment; default limit 1,000/region
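
Environment reuse is the detail that matters most in practice: module-level init runs once per environment, while the handler runs once per invocation. A toy simulation (all names here are invented for the sketch):

```python
init_count = 0
invoke_count = 0

def _module_init():
    global init_count
    init_count += 1  # stands in for expensive setup: SDK clients, config

_module_init()  # runs once, at cold start

def lambda_handler(event, context=None):
    global invoke_count
    invoke_count += 1
    return {"inits": init_count, "invocations": invoke_count}

for _ in range(3):  # three warm invocations reuse the same environment
    result = lambda_handler({})
print(result)  # {'inits': 1, 'invocations': 3}
```

This is why the handler example that follows creates its boto3 clients at module scope.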

Python Handler Example

import json
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize outside handler — reused across warm invocations
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Users')


def lambda_handler(event, context):
    """
    Processes an API Gateway proxy event.

    Args:
        event: API Gateway event dict with httpMethod, path, body, headers, etc.
        context: Lambda context with function_name, aws_request_id, etc.

    Returns:
        API Gateway response dict with statusCode, headers, body.
    """
    logger.info("Request ID: %s", context.aws_request_id)
    logger.info("Event: %s", json.dumps(event))

    http_method = event.get('httpMethod', 'GET')
    path_params = event.get('pathParameters') or {}
    user_id = path_params.get('userId')

    if not user_id:
        return _response(400, {'error': 'userId path parameter is required'})

    try:
        if http_method == 'GET':
            result = table.get_item(Key={'userId': user_id})
            user = result.get('Item')
            if not user:
                return _response(404, {'error': 'User not found'})
            return _response(200, user)

        elif http_method == 'DELETE':
            table.delete_item(Key={'userId': user_id})
            return _response(204, {})

        else:
            return _response(405, {'error': f'Method {http_method} not allowed'})

    except ClientError as e:
        error_code = e.response['Error']['Code']
        logger.error("DynamoDB error: %s", error_code)
        return _response(500, {'error': 'Internal server error'})


def _response(status_code: int, body: dict) -> dict:
    return {
        'statusCode': status_code,
        'headers': {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(body),
    }

Common Triggers

Trigger | Invocation type | Retry behavior
API Gateway / Function URL | Synchronous | Client retries — no automatic retry
S3 (object created/deleted) | Asynchronous | 2 retries, then dead-letter queue
SQS queue | Poll-based | Message returns to queue on failure; DLQ after maxReceiveCount
SNS topic | Asynchronous | 2 retries, then DLQ
DynamoDB Streams | Poll-based | Retries until the record expires (24h) or DLQ
EventBridge (CloudWatch Events) | Asynchronous | 2 retries
CloudWatch Logs subscription | Asynchronous | 2 retries

Key Configuration

# Deploy a function
aws lambda create-function \
  --function-name my-api \
  --runtime python3.12 \
  --handler handler.lambda_handler \
  --role arn:aws:iam::123456789012:role/LambdaExecRole \
  --zip-file fileb://function.zip \
  --timeout 30 \
  --memory-size 512 \
  --environment Variables='{TABLE_NAME=Users,LOG_LEVEL=INFO}'

# Update code
aws lambda update-function-code \
  --function-name my-api \
  --zip-file fileb://function.zip

# Set concurrency limit (protect downstream services)
aws lambda put-function-concurrency \
  --function-name my-api \
  --reserved-concurrent-executions 100

# Enable provisioned concurrency (eliminate cold starts)
aws lambda put-provisioned-concurrency-config \
  --function-name my-api \
  --qualifier LIVE \
  --provisioned-concurrent-executions 10

# Invoke synchronously
aws lambda invoke \
  --function-name my-api \
  --payload '{"httpMethod":"GET","pathParameters":{"userId":"abc123"}}' \
  --cli-binary-format raw-in-base64-out \
  response.json
cat response.json
Production Use Cases
Use Case | Why This Service
API backend (API Gateway + Lambda) | Zero infra management, per-request billing, and automatic scaling to thousands of concurrent requests. Choose over EC2 when traffic is spiky or unpredictable — idle EC2 burns money, idle Lambda costs nothing.
Event-driven file processing (S3 → Lambda) | Triggered on upload, process-and-forget — no idle compute cost. Canonical example: thumbnail generation or CSV parsing triggered by object creation (S3 delivers notifications at least once, so make handlers idempotent).
Scheduled tasks (EventBridge → Lambda) | Replaces cron on EC2 with no server to maintain and built-in retry on failure. The EC2 cron approach requires keeping an instance alive 24/7 for a job that runs for seconds.
Stream processing (Kinesis / DynamoDB Streams → Lambda) | Real-time processing with built-in batching and checkpointing. Simpler and cheaper than running Flink or Spark Streaming for low-to-medium volume streams where you don't need complex windowing.
Service glue (SQS → DynamoDB, SNS → Slack) | Short-lived, stateless transformations between services are Lambda's sweet spot. Adding EC2 here is engineering overhead with no benefit — Lambda scales to zero between bursts automatically.
Lambda Cold Start Mitigation Strategies
  • Provisioned Concurrency: Pre-warms N environments. Eliminates cold starts for that capacity. Costs extra.
  • Keep functions warm: CloudWatch Events rule that pings the function every 5 minutes. Free but only works for low-concurrency functions.
  • Minimize package size: Smaller zip = faster initialization. Use Lambda layers for large shared dependencies.
  • Choose faster runtimes: Python and Node have faster cold starts than Java and .NET. Go compiles to a binary (very fast).
  • Init code outside handler: SDK clients, DB connections, config loading — do this once at module load, reuse across invocations.
  • SnapStart (Java): snapshots the initialized environment and restores from it on invocation — typically cuts cold starts from several seconds to sub-second for Java 11+.

Storage: S3

S3 (Simple Storage Service) is object storage with 11 nines (99.999999999%) of durability. Objects are stored in buckets, are addressed by key, and can range from 0 bytes to 5 TB.

Storage Classes

Class | Use case | Retrieval | Min storage duration
Standard | Frequently accessed data | Milliseconds | None
Standard-IA | Infrequently accessed, needs fast retrieval | Milliseconds | 30 days
One Zone-IA | Infrequent access, single AZ (cheaper) | Milliseconds | 30 days
Intelligent-Tiering | Unknown or changing access patterns | Milliseconds | None
Glacier Instant Retrieval | Archive, quarterly access | Milliseconds | 90 days
Glacier Flexible Retrieval | Archive, occasional access | Minutes (expedited) to 12 hours (bulk) | 90 days
Glacier Deep Archive | Long-term archive, once-a-year access | Up to 48 hours | 180 days
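
Rough monthly cost math across classes, using the per-GB prices quoted in the use-case table later in this section ($0.023 Standard, $0.00099 Deep Archive); the Standard-IA figure is an assumption for illustration:

```python
gb = 1024  # 1 TB
prices_per_gb = {
    "Standard": 0.023,
    "Standard-IA": 0.0125,           # assumed for illustration
    "Glacier Deep Archive": 0.00099,
}
for cls, per_gb in prices_per_gb.items():
    print(f"{cls:22s} ${gb * per_gb:7.2f}/month")

# Deep Archive vs Standard, the "23x cheaper" figure cited below:
print(round(0.023 / 0.00099), "x cheaper")
```

The minimum-storage-duration column is the catch: objects deleted before 90/180 days still incur the full minimum charge.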

Common S3 CLI Commands

# Create a bucket (bucket names are globally unique)
aws s3 mb s3://my-unique-bucket-name --region us-east-1

# Upload a file
aws s3 cp ./local-file.txt s3://my-bucket/remote-path/file.txt

# Upload with specific storage class
aws s3 cp ./archive.zip s3://my-bucket/archives/ \
  --storage-class GLACIER

# Sync a directory (only uploads changed/new files)
aws s3 sync ./build/ s3://my-bucket/static/ \
  --delete \
  --cache-control "max-age=86400"

# Download
aws s3 cp s3://my-bucket/file.txt ./downloaded.txt
aws s3 sync s3://my-bucket/data/ ./local-data/

# List objects
aws s3 ls s3://my-bucket/
aws s3 ls s3://my-bucket/ --recursive --human-readable --summarize

# Delete
aws s3 rm s3://my-bucket/old-file.txt
aws s3 rm s3://my-bucket/old-prefix/ --recursive

# Generate a presigned URL (valid for 1 hour)
aws s3 presign s3://my-bucket/private-file.pdf --expires-in 3600

Bucket Policies vs ACLs

Prefer bucket policies over ACLs
ACLs are a legacy access control mechanism. AWS now recommends disabling ACLs (set Object Ownership to "Bucket owner enforced") and using bucket policies or IAM policies for all access control.
// Bucket policy: allow public read of all objects (for static site hosting)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-static-site/*"
    }
  ]
}

// Bucket policy: enforce HTTPS only
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonHTTPS",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}

Lifecycle Policies & Versioning

// Lifecycle policy: transition to cheaper storage, then expire
{
  "Rules": [
    {
      "ID": "archive-and-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 30
      }
    }
  ]
}
# Enable versioning
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

# List all versions of an object
aws s3api list-object-versions --bucket my-bucket --prefix my-file.txt

# Restore a specific version
aws s3api get-object \
  --bucket my-bucket \
  --key my-file.txt \
  --version-id abc123def456 \
  restored.txt
Production Use Cases
Use Case | Why This Service
Data lake foundation | Unlimited storage at $0.023/GB with native integration to Athena, Spark, and Redshift Spectrum. Parquet + partitioned prefixes give you columnar scan performance without running a warehouse — query only the partitions you need.
Static website hosting (S3 + CloudFront) | Global CDN with TLS for pennies per GB served. No servers, no OS patches, 11-nines durability for your assets. The gap between this and a running EC2 instance is both cost and operational burden.
Backup and archive | Lifecycle rules auto-tier objects to Glacier Deep Archive at $0.00099/GB — 23x cheaper than S3 Standard. Object Lock enforces WORM compliance for regulatory retention requirements without custom logic.
ML training data versioning | S3 versioning gives you dataset snapshots with zero overhead; SageMaker reads directly from S3. Compare to maintaining a separate data versioning system — S3 versioning is already there.
Event-driven pipeline triggers | S3 event notifications → Lambda/SQS decouple data ingestion from processing. No polling loop required; AWS delivers the notification within seconds of the PUT.

Databases

AWS offers multiple managed database services. Choosing the right one is a critical architectural decision driven by data model, access patterns, and consistency requirements.

RDS — Relational Database Service

RDS manages common relational databases: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora. AWS handles backups, patching, replication, and failover.

# Create a PostgreSQL RDS instance
aws rds create-db-instance \
  --db-instance-identifier prod-postgres \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --engine-version 16.1 \
  --master-username admin \
  --master-user-password supersecret \
  --db-name myapp \
  --allocated-storage 100 \
  --storage-type gp3 \
  --multi-az \
  --backup-retention-period 7 \
  --no-publicly-accessible \
  --vpc-security-group-ids sg-12345678

# Create a read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-postgres-read \
  --source-db-instance-identifier prod-postgres

DynamoDB — Key-Value / Document Store

DynamoDB is a fully managed NoSQL database with single-digit millisecond performance at any scale. The data model centers on a partition key and optional sort key.

Concept | Description
Partition key | Required. Determines which partition the item lives in. Must uniquely identify items (when no sort key exists).
Sort key | Optional. Together with the partition key, forms a composite primary key. Enables range queries.
GSI | Global Secondary Index — alternate access pattern with a different partition/sort key. Eventually consistent.
LSI | Local Secondary Index — same partition key, different sort key. Can only be created at table creation time.
On-demand mode | Pay per request. Scales instantly. Good for unpredictable workloads.
Provisioned mode | Set RCU/WCU. Can use auto-scaling. Cheaper at predictable load.
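
Why the partition key matters can be illustrated with a stand-in hash. DynamoDB's real partitioning hash is internal; `md5` and `NUM_PARTITIONS` here are illustrative only:

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical; DynamoDB manages this for you

def partition_for(partition_key: str) -> int:
    # Hash the key, map it onto a partition. All items sharing a partition
    # key land on the same partition, which is what makes Query cheap and
    # a hot key a throughput problem.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for uid in ["user#abc", "user#def", "user#ghi"]:
    print(uid, "-> partition", partition_for(uid))
```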
# Create a table with composite key
aws dynamodb create-table \
  --table-name Orders \
  --attribute-definitions \
    AttributeName=userId,AttributeType=S \
    AttributeName=orderId,AttributeType=S \
    AttributeName=createdAt,AttributeType=S \
  --key-schema \
    AttributeName=userId,KeyType=HASH \
    AttributeName=orderId,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST \
  --global-secondary-indexes '[
    {
      "IndexName": "CreatedAtIndex",
      "KeySchema": [
        {"AttributeName": "userId", "KeyType": "HASH"},
        {"AttributeName": "createdAt", "KeyType": "RANGE"}
      ],
      "Projection": {"ProjectionType": "ALL"}
    }
  ]'

# Put an item
aws dynamodb put-item \
  --table-name Orders \
  --item '{
    "userId": {"S": "user#abc"},
    "orderId": {"S": "order#001"},
    "createdAt": {"S": "2026-02-23T10:00:00Z"},
    "status": {"S": "pending"},
    "total": {"N": "49.99"}
  }'

# Query items by partition key + sort key condition
aws dynamodb query \
  --table-name Orders \
  --key-condition-expression "userId = :uid AND begins_with(orderId, :prefix)" \
  --expression-attribute-values '{
    ":uid": {"S": "user#abc"},
    ":prefix": {"S": "order#"}
  }'

ElastiCache

Managed in-memory caching. Two engines: Redis (data structures, persistence, pub/sub, clustering) and Memcached (simple key-value, multi-threaded, no persistence).

# Create a Redis cluster
aws elasticache create-replication-group \
  --replication-group-id my-redis \
  --replication-group-description "App cache" \
  --engine redis \
  --engine-version 7.0 \
  --cache-node-type cache.t3.micro \
  --num-cache-clusters 2 \
  --automatic-failover-enabled \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled
Production Use Cases
Use Case | Why This Service
RDS Multi-AZ for production OLTP | Automated failover, point-in-time backups, and OS patching with zero DBA work. Choose over self-managed EC2 Postgres when you don't need exotic extensions — the operational savings outweigh the 20–30% cost premium.
DynamoDB for session stores, user profiles, gaming leaderboards | Single-digit millisecond latency at any scale with no index tuning. Choose when access patterns are known and simple (key-value or key-range) — the moment you need ad-hoc queries, reach for a relational database instead.
DynamoDB Streams + Lambda for change data capture | React to data mutations in real time without polling. Cheaper and simpler than running Debezium + Kafka for moderate change volumes where exactly-once CDC semantics aren't required.
ElastiCache Redis for rate limiting and real-time leaderboards | Sub-millisecond latency with sorted sets, atomic counters, and pub/sub that DynamoDB can't match. Choose Redis over DynamoDB when you need complex data structures or an extremely tight latency budget.
Aurora for high-throughput OLTP | 5x MySQL / 3x Postgres throughput on the same hardware, with storage auto-scaling to 128 TB. Reach for Aurora when standard RDS hits its IOPS ceiling — the architecture separates compute from storage, removing the bottleneck.

Networking: VPC

A VPC (Virtual Private Cloud) is your isolated network within AWS. Every EC2 instance, RDS database, and Lambda (when VPC-attached) lives inside a VPC. Understanding VPC design is critical for security and connectivity.

Core VPC Components

Component | Purpose
VPC | Isolated virtual network. Defined by a CIDR block (e.g., 10.0.0.0/16 = 65,536 IPs).
Subnet | A subdivision of the VPC tied to one AZ. Public subnets route to the IGW; private subnets route to a NAT gateway or nowhere.
Internet Gateway (IGW) | Allows public subnets to reach the internet. Attached to the VPC.
NAT Gateway | Allows private-subnet instances to initiate outbound internet connections (but blocks inbound). Lives in a public subnet.
Route Table | Defines where traffic is directed. Every subnet is associated with exactly one route table.
Security Group | Stateful virtual firewall at the instance/ENI level. Allow rules only; no explicit deny.
NACL | Stateless firewall at the subnet level. Supports both allow and deny rules. Rules evaluated by number (lowest first).
VPC Peering | Private connectivity between two VPCs (same or different account/region). Not transitive.
VPC Endpoint | Private connection from a VPC to AWS services without traversing the internet.
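
The CIDR arithmetic in the table checks out with the stdlib ipaddress module, and the same tool is handy when planning the subnet layout shown below:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)          # 65536

# Carving /24 subnets out of the /16 yields 256 of them
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets), subnets[1])   # 256 10.0.1.0/24

# AWS reserves 5 addresses per subnet (first 4 + last),
# so a /24 leaves 251 usable hosts
print(subnets[0].num_addresses - 5)  # 251
```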

Three-Tier VPC Design

# Typical 3-tier VPC: public / private app / private data
# CIDR: 10.0.0.0/16
#
# Public subnets (load balancers, NAT GW, bastion host)
#   10.0.1.0/24  us-east-1a
#   10.0.2.0/24  us-east-1b
#
# Private app subnets (EC2, ECS, Lambda)
#   10.0.10.0/24  us-east-1a
#   10.0.11.0/24  us-east-1b
#
# Private data subnets (RDS, ElastiCache — no outbound internet needed)
#   10.0.20.0/24  us-east-1a
#   10.0.21.0/24  us-east-1b

# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16 --tag-specifications \
  'ResourceType=vpc,Tags=[{Key=Name,Value=my-vpc}]'

# Create public subnet
aws ec2 create-subnet \
  --vpc-id vpc-12345678 \
  --cidr-block 10.0.1.0/24 \
  --availability-zone us-east-1a

# Enable auto-assign public IP for public subnet
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-12345678 \
  --map-public-ip-on-launch

# Create and attach Internet Gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway \
  --vpc-id vpc-12345678 \
  --internet-gateway-id igw-12345678

# Add route to IGW in public route table
aws ec2 create-route \
  --route-table-id rtb-12345678 \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id igw-12345678

Security Groups vs NACLs

Feature          | Security Group                       | NACL
Level            | Instance / ENI                       | Subnet
Statefulness     | Stateful (return traffic automatic)  | Stateless (must allow inbound AND outbound)
Rules            | Allow only                           | Allow and Deny
Rule evaluation  | All rules evaluated                  | Evaluated in number order, first match wins
Default behavior | Deny all inbound, allow all outbound | Allow all (default NACL)
Best use         | Per-instance access control          | Subnet-level block lists (e.g., block an IP range)
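The "first match wins" evaluation order is what makes NACL deny rules useful: a low-numbered deny beats a broad allow. A minimal sketch of that semantics (single-port rules only, an assumption for brevity; real NACL rules cover port ranges and protocols):

```python
import ipaddress

def evaluate_nacl(rules: list[dict], src_ip: str, port: int) -> str:
    """Rules are checked in ascending number order; the first matching
    rule wins; nothing matching means the implicit '*' deny applies."""
    ip = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        if ip in ipaddress.ip_network(rule["cidr"]) and port == rule["port"]:
            return rule["action"]   # first match wins; later rules never seen
    return "deny"                   # implicit catch-all deny

rules = [
    # Block one IP range, then allow everyone else on 443:
    {"number": 100, "cidr": "203.0.113.0/24", "port": 443, "action": "deny"},
    {"number": 200, "cidr": "0.0.0.0/0",      "port": 443, "action": "allow"},
]
print(evaluate_nacl(rules, "203.0.113.9", 443))   # deny
print(evaluate_nacl(rules, "198.51.100.7", 443))  # allow
```

Swapping the rule numbers (allow at 100, deny at 200) would make the deny rule unreachable, which is the classic NACL misconfiguration.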

Production Use Cases

Multi-tier isolation (public / private / isolated subnets)
  The ALB lives in the public subnet, app servers in private, databases in an isolated subnet with no route to the internet. NACLs add a second layer of defense — stateless deny rules that security groups can't express.
VPC peering for cross-account shared services
  Connect a shared logging or monitoring account to production without traversing the internet. Traffic stays on the AWS backbone — lower latency and no egress costs compared to routing via an internet gateway.
PrivateLink for SaaS integration
  Access third-party APIs (Datadog, Snowflake) without the traffic ever leaving the AWS network. Required for PCI-DSS and HIPAA workloads where data must not traverse the public internet.
Transit Gateway for hub-and-spoke multi-VPC routing
  At 10+ VPCs, full-mesh peering becomes O(n²) routes to manage. Transit Gateway centralizes routing through a single attachment — one place to audit, one place to update.

Messaging & Queues

Asynchronous messaging decouples producers from consumers, enabling fault tolerance, load leveling, and fan-out patterns.

SQS — Simple Queue Service

Feature    | Standard Queue                      | FIFO Queue
Throughput | Unlimited TPS                       | 300 TPS (3,000 with batching)
Ordering   | Best-effort (not guaranteed)        | Strict FIFO per message group
Delivery   | At least once (duplicates possible) | Exactly once
Use case   | High-throughput, order not critical | Financial transactions, user actions
# Create a queue with dead-letter queue
aws sqs create-queue --queue-name my-dlq
aws sqs get-queue-attributes --queue-url ... --attribute-names QueueArn

aws sqs create-queue \
  --queue-name my-queue \
  --attributes '{
    "VisibilityTimeout": "30",
    "MessageRetentionPeriod": "86400",
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\",\"maxReceiveCount\":\"3\"}"
  }'

# Send a message
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --message-body '{"orderId": "abc123", "action": "process"}'

# Receive and process messages
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --max-number-of-messages 10 \
  --wait-time-seconds 20  # Long polling — reduce empty receives

# Delete after processing
aws sqs delete-message \
  --queue-url ... \
  --receipt-handle "AQEBwJ..."
Visibility Timeout
When a consumer receives a message, it becomes invisible to other consumers for the visibility timeout period (default 30s). If the consumer doesn't delete the message within that window, it becomes visible again and another consumer can pick it up. Set this slightly longer than your processing time.
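The receive/timeout/redrive lifecycle described above can be modeled in a few lines. A toy in-memory sketch (not the SQS API) showing a message that is never deleted landing in the DLQ after `maxReceiveCount` receives:

```python
# Toy model of SQS visibility timeout + redrive semantics. A message that is
# received but never deleted reappears after the timeout; after
# max_receive_count receives it is moved to the dead-letter queue.

class ToyQueue:
    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visible, self.inflight, self.dlq = [], {}, []
        self.timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.receive_counts = {}

    def send(self, body):
        self.visible.append(body)

    def receive(self, now):
        self._expire(now)
        if not self.visible:
            return None
        body = self.visible.pop(0)
        self.receive_counts[body] = self.receive_counts.get(body, 0) + 1
        self.inflight[body] = now + self.timeout   # invisible until this time
        return body

    def delete(self, body):
        self.inflight.pop(body, None)              # ack: gone for good

    def _expire(self, now):
        for body, deadline in list(self.inflight.items()):
            if now >= deadline:                    # consumer never deleted it
                del self.inflight[body]
                if self.receive_counts[body] >= self.max_receive_count:
                    self.dlq.append(body)          # redrive to dead-letter queue
                else:
                    self.visible.append(body)      # becomes visible again

q = ToyQueue()
q.send("order-1")
for t in (0, 31, 62, 93):   # receive repeatedly, never delete
    q.receive(t)
print(q.dlq)                # ['order-1']
```

The same model explains the sizing advice: if processing takes longer than the visibility timeout, a healthy consumer's message reappears and gets processed twice.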

SNS — Simple Notification Service

SNS is a pub/sub service. Publishers send to a topic; subscribers (SQS, Lambda, HTTP, email, SMS) receive a copy. This enables fan-out: one event triggers many parallel consumers.

# Create a topic
aws sns create-topic --name order-events

# Subscribe an SQS queue to the topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-1:123456789012:order-processing

# Subscribe a Lambda function
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:send-email

# Publish a message
aws sns publish \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --message '{"orderId": "abc123", "status": "placed"}' \
  --subject "Order Placed"
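The fan-out semantics are simple to model: one publish puts an independent copy of the message on every subscription. A minimal in-memory sketch (not the SNS API):

```python
from collections import defaultdict

class ToyTopic:
    def __init__(self):
        self.queues = defaultdict(list)   # queue name -> delivered messages

    def subscribe(self, queue_name):
        self.queues[queue_name]           # touching the key creates the queue

    def publish(self, message):
        for q in self.queues.values():
            q.append(message)             # every subscriber gets its own copy

topic = ToyTopic()
for name in ("order-fulfillment", "order-email", "order-analytics"):
    topic.subscribe(name)

topic.publish({"orderId": "abc123", "status": "placed"})

print({name: len(q) for name, q in topic.queues.items()})
# {'order-fulfillment': 1, 'order-email': 1, 'order-analytics': 1}
```

Because each subscriber owns its copy, a slow or failing consumer never affects the others, which is the core argument for the SNS-to-SQS pattern later in this section.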

EventBridge

EventBridge is a serverless event bus. More powerful than SNS for routing: supports content-based routing via rules, schema registry, event replays, and cross-account event buses.

# Create a rule to trigger Lambda when an EC2 instance stops
aws events put-rule \
  --name ec2-stopped \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]}
  }' \
  --state ENABLED

# Add Lambda as target
aws events put-targets \
  --rule ec2-stopped \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:notify-ops'
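Content-based routing means the rule's event pattern is matched against each event's payload. A simplified sketch of the matching semantics, covering only exact-value lists and nested fields (the real pattern language also supports prefix, numeric, and anything-but matchers):

```python
def matches(pattern: dict, event: dict) -> bool:
    """Each pattern field holds a list of acceptable values;
    nested dicts recurse. Missing fields never match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:   # expected is a list of allowed values
            return False
    return True

# The rule from the CLI example above:
rule = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"state": "stopped", "instance-id": "i-123"},
}
print(matches(rule, event))                                      # True
print(matches(rule, {**event, "detail": {"state": "running"}}))  # False
```

Note the asymmetry: extra fields in the event (like `instance-id`) are ignored, but every field in the pattern must match.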

Production Use Cases

SQS: Decoupling microservices
  Producer writes at its own pace; consumer processes at its own pace — SQS absorbs the spike. Use Standard queue for maximum throughput, FIFO when order matters (order processing, financial transactions).
SQS + DLQ: Poison message handling
  Failed messages are quarantined for inspection rather than silently dropped or blocking the queue. In payment and order processing, you cannot afford to lose a message — the DLQ gives you a durable holding area to debug and replay.
SNS: Fan-out to multiple consumers simultaneously
  One published event triggers email notification + analytics pipeline + audit log in parallel. Doing this with SQS alone requires each consumer to poll independently — SNS pushes to all subscribers in one API call.
EventBridge: Cross-account event routing with content-based filtering
  Route events to different targets based on payload content without writing routing code. Choose over SNS when you need schema registry for contract enforcement, event replay for debugging, or third-party SaaS integration (Stripe, Auth0 webhooks).
EventBridge Scheduler: Cron replacement
  One-time or recurring triggers with no infrastructure to maintain. Replaces the pattern of keeping an EC2 instance alive 24/7 just to run a cron job that executes for a few seconds.

Containers on AWS

AWS offers multiple layers for running containers: ECR for image storage, ECS for container orchestration (AWS-native), and EKS for Kubernetes.

ECR — Elastic Container Registry

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789012.dkr.ecr.us-east-1.amazonaws.com

# Create a repository
aws ecr create-repository --repository-name my-app

# Build, tag, and push
docker build -t my-app .
docker tag my-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

# Enable image scanning on push (checks for CVEs)
aws ecr put-image-scanning-configuration \
  --repository-name my-app \
  --image-scanning-configuration scanOnPush=true

ECS — Elastic Container Service

ECS has two key concepts: Task Definitions (what to run — image, CPU, memory, environment, ports) and Services (how many copies, load balancer integration, auto-scaling).

Launch Type | Description                                                               | When to use
Fargate     | Serverless — AWS manages the underlying compute; pay per task CPU/memory. | Most workloads; no cluster management overhead.
EC2         | You manage an EC2 cluster; ECS places tasks on your instances.            | GPU workloads, specific instance types, cost optimization at scale.
// ECS Task Definition (simplified)
{
  "family": "my-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/myAppTaskRole",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "ENV", "value": "production" }
      ],
      "secrets": [
        { "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
# Register task definition
aws ecs register-task-definition --cli-input-json file://task-def.json

# Create a service (runs 2 tasks behind a load balancer)
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-app-svc \
  --task-definition my-app:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={
    subnets=[subnet-aaa,subnet-bbb],
    securityGroups=[sg-12345678],
    assignPublicIp=DISABLED
  }' \
  --load-balancers 'targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=my-app,containerPort=8080'

# Force new deployment (rolling update)
aws ecs update-service \
  --cluster my-cluster \
  --service my-app-svc \
  --force-new-deployment

EKS — Elastic Kubernetes Service

EKS is managed Kubernetes. AWS runs the control plane (API server, etcd, scheduler). You manage worker nodes (or use Fargate for pods).

# Create a cluster (takes ~10 min)
eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --nodegroup-name workers \
  --node-type m5.large \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 5 \
  --managed

# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name my-cluster

# Verify
kubectl get nodes
kubectl get pods --all-namespaces

Production Use Cases

ECS Fargate: Stateless microservices
  No cluster management, no AMI patching, pay-per-vCPU-second. Choose over EC2 launch type when your workload is stateless and traffic is variable — Fargate's higher unit cost is offset by eliminating idle EC2 capacity.
ECS EC2: GPU workloads and cost-optimized steady-state
  Fargate doesn't support GPUs, and at high, predictable throughput EC2 Reserved Instances are 2–3x cheaper per unit than Fargate. Use EC2 launch type when you've right-sized the fleet and can commit to reserved capacity.
ECR: Private image registry
  IAM-integrated authentication means no separate registry credentials to rotate. Built-in vulnerability scanning catches CVEs before deployment, and lifecycle policies automatically prune untagged images to control storage costs.
EKS: Multi-cloud portability and complex orchestration
  If your team already operates Kubernetes, EKS avoids retraining and lets you reuse Helm charts, operators, and tooling across clouds. Choose EKS over ECS when you need custom controllers, a service mesh (Istio/Linkerd), or a genuine multi-cloud strategy.

Monitoring & Logging

Observability on AWS centers on three tools: CloudWatch for metrics and logs, CloudTrail for API audit trails, and X-Ray for distributed tracing.

CloudWatch

Metrics: Time-series data points. EC2 (CPU, network, disk), Lambda (duration, errors, throttles), and RDS (connections, latency) all publish metrics automatically.
Alarms: Trigger actions (SNS notification, Auto Scaling, EC2 action) when a metric breaches a threshold.
Logs: Log groups and log streams. Lambda writes here automatically; EC2/ECS needs the CloudWatch agent.
Logs Insights: SQL-like query language over log data. Useful for ad-hoc debugging.
Dashboards: Custom real-time metric visualizations across services and regions.
Synthetics: Canary scripts that monitor endpoints and APIs on a schedule.
# Create an alarm: fire when a Lambda function records more than 5 errors
# per 5-minute period, for 2 consecutive periods
aws cloudwatch put-metric-alarm \
  --alarm-name lambda-errors-high \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=my-api \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --treat-missing-data notBreaching
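The alarm above transitions to ALARM only when the metric breaches the threshold for `evaluation-periods` consecutive periods. A simplified sketch of that evaluation (ignoring datapoints-to-alarm and missing-data handling, which real alarms also consider):

```python
def alarm_state(datapoints: list[float], threshold: float,
                evaluation_periods: int) -> str:
    """Datapoints are period statistics (here, Sum of Errors per 5 min),
    oldest first. ALARM only if the last `evaluation_periods` datapoints
    all breach the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) == evaluation_periods and all(v > threshold for v in recent):
        return "ALARM"
    return "OK"

print(alarm_state([0, 2, 7, 9], threshold=5, evaluation_periods=2))  # ALARM
print(alarm_state([0, 9, 3],    threshold=5, evaluation_periods=2))  # OK
```

The consecutive-periods requirement is what keeps a single transient spike from paging anyone; trade it against detection latency when tuning.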

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace MyApp \
  --metric-name OrdersProcessed \
  --value 42 \
  --unit Count

# Query logs with Logs Insights
aws logs start-query \
  --log-group-name /aws/lambda/my-api \
  --start-time $(date -v-1H +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 20'

CloudTrail

CloudTrail records every AWS API call (who, what, when, from where) across your account. It is the primary tool for security investigations and compliance auditing.
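CloudTrail delivers gzipped JSON objects with a top-level `Records` array; answering "who did what, when, from where" means pulling a few fields from each record. A sketch using the standard record field names with an invented sample record:

```python
import json

# Invented sample in the shape of a CloudTrail log file:
record_json = """{
  "Records": [{
    "eventTime": "2024-05-01T12:34:56Z",
    "eventName": "DeleteObject",
    "eventSource": "s3.amazonaws.com",
    "sourceIPAddress": "198.51.100.7",
    "userIdentity": {"type": "IAMUser", "userName": "alice"}
  }]
}"""

def summarize(record: dict) -> str:
    """One line per API call: when, who, what, where from."""
    who = record["userIdentity"].get("userName", record["userIdentity"]["type"])
    return (f"{record['eventTime']}: {who} called "
            f"{record['eventName']} on {record['eventSource']} "
            f"from {record['sourceIPAddress']}")

for rec in json.loads(record_json)["Records"]:
    print(summarize(rec))
# 2024-05-01T12:34:56Z: alice called DeleteObject on s3.amazonaws.com from 198.51.100.7
```

For investigations at scale you would run this shape of query in Athena over the S3 trail bucket rather than iterating files by hand.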

# Create a trail that writes to S3 (best practice: multi-region trail)
aws cloudtrail create-trail \
  --name my-audit-trail \
  --s3-bucket-name my-audit-logs-bucket \
  --include-global-service-events \
  --is-multi-region-trail \
  --enable-log-file-validation

aws cloudtrail start-logging --name my-audit-trail

# Look up recent events for a specific user
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=alice \
  --max-results 10

# Find who deleted an S3 object
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteObject

X-Ray — Distributed Tracing

X-Ray traces requests as they flow through your application, across Lambda, EC2, ECS, API Gateway, and more. It produces service maps and identifies latency bottlenecks.

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Instrument all boto3 clients automatically
patch_all()

@xray_recorder.capture('process_order')
def process_order(order_id: str) -> dict:
    # This creates a subsegment in the trace
    with xray_recorder.in_subsegment('validate') as subsegment:
        subsegment.put_annotation('orderId', order_id)
        result = validate_order(order_id)

    with xray_recorder.in_subsegment('persist'):
        save_to_dynamodb(result)

    return result

Production Use Cases

CloudWatch Alarms + SNS: Operational alerting
  CPU above 80%, 5xx error rate above 1%, SQS queue depth growing — each alarm can trigger an SNS notification or an Auto Scaling policy. The tight feedback loop between metric → alarm → action is the foundation of self-healing infrastructure.
CloudWatch Logs Insights: Ad-hoc log analysis
  Query across Lambda, ECS, and API Gateway logs in a single pane without exporting data. Significantly cheaper than Splunk or Datadog for basic log search — pay only for the bytes scanned, not a per-host license.
X-Ray: Distributed tracing across services
  Visualizes the full request path from API Gateway through Lambda to DynamoDB, with per-segment timing. Without distributed tracing, debugging latency regressions in microservices means correlating timestamps across multiple log streams — X-Ray does it automatically.
CloudTrail: Security audit and compliance
  Every AWS API call is logged — who did what, from which IP, and when. Required for SOC 2 and HIPAA compliance, and the first tool you reach for when investigating unauthorized resource changes or privilege escalation.

Infrastructure as Code

IaC treats infrastructure definitions as source code: version-controlled, repeatable, reviewable. On AWS, the native tool is CloudFormation; Terraform is the most popular third-party alternative.

CloudFormation

CloudFormation templates describe a stack — a collection of AWS resources. CloudFormation provisions, updates, and deletes them as a unit.

# cloudformation/api-stack.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Serverless API stack with Lambda, API Gateway, and DynamoDB

Parameters:
  Environment:
    Type: String
    Default: dev
    AllowedValues: [dev, staging, prod]
    Description: Deployment environment
  LambdaMemory:
    Type: Number
    Default: 512
    MinValue: 128
    MaxValue: 10240

Conditions:
  IsProd: !Equals [!Ref Environment, prod]

Resources:
  # DynamoDB Table
  UsersTable:
    Type: AWS::DynamoDB::Table
    DeletionPolicy: Retain
    Properties:
      TableName: !Sub '${Environment}-Users'
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: userId
          AttributeType: S
      KeySchema:
        - AttributeName: userId
          KeyType: HASH
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: !If [IsProd, true, false]
      Tags:
        - Key: Environment
          Value: !Ref Environment

  # IAM Execution Role
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: DynamoDBAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:GetItem
                  - dynamodb:PutItem
                  - dynamodb:DeleteItem
                  - dynamodb:Query
                Resource: !GetAtt UsersTable.Arn

  # Lambda Function
  ApiFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub '${Environment}-users-api'
      Runtime: python3.12
      Handler: handler.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn
      MemorySize: !Ref LambdaMemory
      Timeout: 30
      Environment:
        Variables:
          TABLE_NAME: !Ref UsersTable
          ENVIRONMENT: !Ref Environment
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'ok'}

  # API Gateway
  ApiGateway:
    Type: AWS::ApiGateway::RestApi
    Properties:
      Name: !Sub '${Environment}-users-api'
      EndpointConfiguration:
        Types: [REGIONAL]

Outputs:
  ApiEndpoint:
    Description: API Gateway endpoint URL
    Value: !Sub 'https://${ApiGateway}.execute-api.${AWS::Region}.amazonaws.com/prod'
    Export:
      Name: !Sub '${AWS::StackName}-ApiEndpoint'

  UsersTableArn:
    Description: DynamoDB table ARN
    Value: !GetAtt UsersTable.Arn
    Export:
      Name: !Sub '${AWS::StackName}-UsersTableArn'
# Validate template syntax
aws cloudformation validate-template --template-body file://api-stack.yaml

# Create/update stack (create-or-update)
aws cloudformation deploy \
  --template-file api-stack.yaml \
  --stack-name my-api-dev \
  --parameter-overrides Environment=dev LambdaMemory=256 \
  --capabilities CAPABILITY_IAM \
  --no-fail-on-empty-changeset

# Preview changes before applying (changeset)
aws cloudformation create-change-set \
  --stack-name my-api-dev \
  --change-set-name preview \
  --template-body file://api-stack.yaml \
  --capabilities CAPABILITY_IAM

aws cloudformation describe-change-set \
  --stack-name my-api-dev \
  --change-set-name preview

# Describe stack events (debug failed deployments)
aws cloudformation describe-stack-events \
  --stack-name my-api-dev \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'

# Delete stack
aws cloudformation delete-stack --stack-name my-api-dev

SAM — Serverless Application Model

SAM extends CloudFormation with shorthand for Lambda, API Gateway, and DynamoDB — saving 80% of the boilerplate for serverless apps.

# template.yaml (SAM)
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Runtime: python3.12
    Timeout: 30
    Environment:
      Variables:
        TABLE_NAME: !Ref UsersTable

Resources:
  UsersFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handler.lambda_handler
      CodeUri: src/
      MemorySize: 512
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref UsersTable
      Events:
        GetUser:
          Type: Api
          Properties:
            Path: /users/{userId}
            Method: GET
        CreateUser:
          Type: Api
          Properties:
            Path: /users
            Method: POST

  UsersTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      PrimaryKey:
        Name: userId
        Type: String
# Install SAM CLI
brew tap aws/tap
brew install aws-sam-cli

# Build and run locally
sam build
sam local start-api          # Starts local API Gateway
sam local invoke UsersFunction --event event.json

# Deploy
sam deploy --guided         # First time: creates samconfig.toml
sam deploy                  # Subsequent deploys

CloudFormation vs Terraform — When to use which

Dimension        | CloudFormation                        | Terraform
Multi-cloud      | AWS only                              | AWS, GCP, Azure, + 1000s of providers
State management | AWS manages state in the stack        | You manage the state file (local or S3 backend)
Preview changes  | Change sets                           | terraform plan
Drift detection  | Built-in                              | terraform plan -refresh-only
AWS service lag  | Zero — new services available day 1   | Depends on provider; usually days to weeks
Language         | JSON/YAML                             | HCL (more expressive; supports loops/conditionals well)
Rollback         | Automatic on failure                  | Manual; no automatic rollback
Best for         | AWS-only shops, tight AWS integration | Multi-cloud, teams preferring HCL expressiveness

In practice: greenfield AWS-only projects often use CloudFormation (or CDK which compiles to CloudFormation). Teams with multi-cloud needs or existing Terraform expertise reach for Terraform.

Production Use Cases

CloudFormation: AWS-native single-account infrastructure
  Deep integration with every AWS service on day one, built-in drift detection, and no external state file to lose or corrupt. Choose over Terraform when you're AWS-only and want automatic rollback on stack failures.
Terraform: Multi-cloud and multi-provider management
  Manage AWS + Datadog + PagerDuty + GitHub in a single codebase with HCL's expressive loops and conditionals. The community module registry is unmatched. Choose when your infrastructure spans providers or your team has existing Terraform expertise.
SAM: Serverless application development
  CloudFormation superset that collapses Lambda + API Gateway + DynamoDB boilerplate by 80%. The killer feature is sam local invoke — run your Lambda locally against a real event payload before deploying, closing the feedback loop dramatically.
CDK: Complex infrastructure requiring programmatic logic
  Loops, conditionals, inheritance, and type safety in TypeScript or Python — expressing dynamic infrastructure (e.g., deploying N identical services from a list) in YAML becomes unmaintainable quickly. CDK compiles to CloudFormation, so you keep AWS's native rollback and drift detection.

Common Architecture Patterns

These are the building blocks that appear repeatedly in production AWS architectures.

Three-Tier Web Application

# Layer 1: Public-facing (Load Balancer in public subnet)
#   ALB → health checks, SSL termination, routing rules
#
# Layer 2: Application (ECS/EC2 in private app subnet)
#   Auto Scaling Group or ECS Service
#   No public IPs — only ALB can reach them via security group rule
#
# Layer 3: Data (RDS + ElastiCache in private data subnet)
#   Only application layer security group can connect
#
# Traffic flow:
#   Internet → Route 53 → CloudFront (optional CDN) → ALB
#             → ECS tasks → RDS (reads from replica, writes to primary)
#                         → ElastiCache (cache layer)
#
# Outbound from private subnets:
#   App/Data subnet → NAT Gateway (public subnet) → Internet Gateway → Internet

# Key security group rules:
# ALB SG: inbound 443 from 0.0.0.0/0
# App SG: inbound 8080 from ALB SG only
# DB SG:  inbound 5432 from App SG only

Serverless API Pattern

# API Gateway + Lambda + DynamoDB
#
# Request flow:
#   Client → API Gateway → Lambda → DynamoDB
#
# API Gateway handles:
#   - TLS termination
#   - Request throttling (e.g., 10,000 RPS)
#   - API key management
#   - Request/response transformation
#   - Caching (optional)
#
# Lambda handles:
#   - Business logic
#   - Input validation
#   - Auth (via Lambda authorizer or Cognito)
#
# DynamoDB handles:
#   - Data persistence
#   - Single-digit ms read/write at any scale
#
# Cost characteristics:
#   - Zero cost at zero traffic (pay per invocation)
#   - Auto-scales to millions of RPS without config
#   - No servers to manage or patch

Event-Driven Fan-Out (SNS + SQS)

# Pattern: one event triggers multiple independent consumers
#
# SNS Topic: order-events
#   ├── SQS Queue: order-fulfillment   → Lambda: reserve inventory
#   ├── SQS Queue: order-email         → Lambda: send confirmation email
#   └── SQS Queue: order-analytics     → Lambda: update business metrics
#
# Why SNS → SQS (not SNS → Lambda directly)?
#   - SQS acts as a buffer: Lambda invocations are throttled by SQS batch size
#   - Dead-letter queues on SQS catch failures without losing events
#   - Each consumer scales independently
#   - Consumer can be paused (stop polling) without losing messages
#

# Producer publishes ONE message to SNS
aws sns publish \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --message '{"orderId":"abc123","userId":"user456","total":99.99}'

# All three SQS queues receive a copy simultaneously
# Each Lambda processes independently, at its own pace

Static Site (S3 + CloudFront)

# Architecture:
#   GitHub Actions → Build (npm run build) → S3 upload → CloudFront invalidation
#
# S3: hosts built static files (HTML/CSS/JS)
# CloudFront: CDN — caches at edge locations globally, serves HTTPS

# 1. Create S3 bucket for static hosting
aws s3 mb s3://my-static-site --region us-east-1
aws s3 website s3://my-static-site \
  --index-document index.html \
  --error-document 404.html

# 2. Create CloudFront distribution pointing to S3 origin
# (typically done via console or CloudFormation — CLI is verbose)

# 3. Deploy: sync build output, then invalidate CDN cache
aws s3 sync ./dist/ s3://my-static-site/ \
  --delete \
  --cache-control "public, max-age=31536000, immutable"

# Invalidate CloudFront cache (HTML files should not be cached long)
aws cloudfront create-invalidation \
  --distribution-id ABCDEFGHIJKLMN \
  --paths "/*"

# 4. Custom domain: Route 53 alias record → CloudFront distribution
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE123 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z2FDTNDATAQYW2",
          "DNSName": "d111111abcdef8.cloudfront.net",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
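One refinement to the sync step above: fingerprinted assets (hashed filenames like `app.3f9c2b.js`) can be cached forever, while HTML should stay short-lived so deploys propagate quickly. A sketch of a per-extension Cache-Control policy (the extension split is an assumption; adapt it to your build output):

```python
from pathlib import PurePosixPath

def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header per file type."""
    suffix = PurePosixPath(path).suffix
    if suffix in {".js", ".css", ".woff2", ".png", ".svg"}:
        return "public, max-age=31536000, immutable"   # fingerprinted: cache forever
    if suffix in {".html", ""}:
        return "public, max-age=60"                    # HTML: revalidate quickly
    return "public, max-age=86400"                     # everything else: one day

print(cache_control_for("assets/app.3f9c2b.js"))
# public, max-age=31536000, immutable
print(cache_control_for("index.html"))
# public, max-age=60
```

With a split like this you can run one `aws s3 sync` per policy (using `--exclude`/`--include` filters) and limit CloudFront invalidations to the HTML paths instead of invalidating `/*` on every deploy.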
Additional Patterns: Blue/Green & Canary Deployments

Blue/Green Deployment

Run two identical environments (blue = current, green = new). Switch traffic instantly at the load balancer or Route 53 level. Rollback is instant — point traffic back to blue.

  • ECS: CodeDeploy manages blue/green at the target group level
  • Lambda: Use aliases and weighted routing between two function versions
  • Elastic Beanstalk: Swap environment URLs

Canary Deployment

Gradually shift traffic to the new version. Start at 1%, watch error rates, then increase to 10%, 50%, 100%.
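Weighted routing is commonly implemented by hashing a stable request attribute into [0, 1) and comparing it against the canary weight, which keeps a given caller pinned to one version for the whole rollout. A deterministic sketch (the version labels v4/v5 are placeholders matching the Lambda example below):

```python
import hashlib

def pick_version(request_id: str, canary_weight: float,
                 stable: str = "v4", canary: str = "v5") -> str:
    """Hash the request ID to a uniform bucket in [0, 1); route the
    fraction below canary_weight to the new version."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return canary if bucket < canary_weight else stable

# Roughly 10% of distinct request IDs land on the canary:
hits = sum(pick_version(f"req-{i}", 0.10) == "v5" for i in range(10_000))
print(hits / 10_000)   # close to 0.10
```

Hashing a user or session ID instead of a request ID gives sticky canaries, so one user never flip-flops between versions mid-session.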

# Lambda canary via alias weighted routing: the alias stays on the old
# version (4) and shifts 10% of traffic to the new version (5)
aws lambda update-alias \
  --function-name my-api \
  --name LIVE \
  --function-version 4 \
  --routing-config 'AdditionalVersionWeights={"5"=0.1}'
# 90% to version 4 (old), 10% to version 5 (new)

# After validating, promote and clear the weighted routing:
aws lambda update-alias \
  --function-name my-api \
  --name LIVE \
  --function-version 5 \
  --routing-config 'AdditionalVersionWeights={}'
# 100% to version 5
Practice with LocalStack + SAM
The fastest way to internalize these patterns is to build them. Use LocalStack for the data services (S3, DynamoDB, SQS) and sam local for Lambda + API Gateway. You can run a fully functional serverless API on your laptop without touching a real AWS account.