Setup & Environment

Before working with AWS, you need the CLI configured and ideally a local sandbox. LocalStack lets you iterate fast without incurring costs.

Install & Configure AWS CLI

# Install via Homebrew (macOS)
brew install awscli

# Verify installation
aws --version
# aws-cli/2.x.x Python/3.x.x ...

# Interactive configuration wizard
aws configure
# AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Default region name [None]: us-east-1
# Default output format [None]: json

# View stored config
cat ~/.aws/config
cat ~/.aws/credentials

# Use named profiles for multiple accounts
aws configure --profile staging
aws s3 ls --profile staging

# Set environment variables (useful in CI/CD)
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/...
export AWS_DEFAULT_REGION=us-east-1

LocalStack for Local Development

LocalStack saves money
LocalStack emulates most AWS services locally on port 4566. You can create S3 buckets, queues, Lambda functions, and more without touching a real AWS account. Essential for fast TDD against AWS APIs.
# Run LocalStack via Docker
docker run -d \
  --name localstack \
  -p 4566:4566 \
  -e SERVICES=s3,sqs,sns,lambda,dynamodb,iam \
  localstack/localstack

# Verify it's running
docker ps | grep localstack
curl http://localhost:4566/_localstack/health

# Point AWS CLI at LocalStack with --endpoint-url
aws --endpoint-url=http://localhost:4566 s3 ls
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-test-bucket
aws --endpoint-url=http://localhost:4566 s3 ls

# Create an alias so you don't repeat the flag
alias awslocal='aws --endpoint-url=http://localhost:4566'
awslocal sqs create-queue --queue-name my-queue
awslocal dynamodb list-tables
# docker-compose.yml for persistent LocalStack setup
version: '3.8'
services:
  localstack:
    image: localstack/localstack:latest
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3,sqs,sns,lambda,dynamodb,secretsmanager
      - DEBUG=1
      - DATA_DIR=/tmp/localstack/data
    volumes:
      - "./localstack-data:/tmp/localstack/data"
      - "/var/run/docker.sock:/var/run/docker.sock"
Never commit real credentials
Keep AWS credentials in ~/.aws/credentials or environment variables. Add .env, *.pem, and credentials to .gitignore. Use IAM roles (not access keys) for production workloads running on AWS.

Core Concepts

AWS organizes its infrastructure around geographic and logical boundaries. Understanding these concepts is prerequisite knowledge for every other service.

Regions, Availability Zones, and Edge Locations

Concept | What it is | Examples
Region | Geographically isolated cluster of data centers. Each region is independent and contains multiple AZs. | us-east-1 (N. Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore)
Availability Zone (AZ) | One or more discrete data centers within a region, connected by low-latency links. Each AZ has independent power, cooling, and networking. | us-east-1a, us-east-1b, us-east-1c
Edge Location | Mini data centers used by CloudFront CDN and Route 53 DNS to serve content closer to end users. Not full regions. | 200+ locations globally (NYC, London, Tokyo...)
Local Zone | Extension of a region placed in a metro area for single-digit-millisecond latency to a specific city. | us-east-1-bos-1 (Boston)

Global vs. Regional Services

Scope | Services | Why this scope?
Global | IAM, Route 53, CloudFront, WAF, Organizations | Identity and DNS must be consistent everywhere
Regional | EC2, S3, RDS, Lambda, VPC, SQS, SNS, ECS, EKS | Data residency, fault isolation, latency optimization
AZ-scoped | EC2 instances, EBS volumes, subnets | Physical hardware tied to specific data centers

ARNs — Amazon Resource Names

Every AWS resource has a unique ARN. Understanding the format matters when writing IAM policies and CloudFormation templates.

# ARN format
arn:partition:service:region:account-id:resource-type/resource-id

# Examples
arn:aws:s3:::my-bucket                          # S3 bucket (global, no region/account)
arn:aws:s3:::my-bucket/path/to/object           # S3 object
arn:aws:iam::123456789012:user/alice             # IAM user (global, no region)
arn:aws:iam::123456789012:role/MyRole            # IAM role
arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0
arn:aws:lambda:us-east-1:123456789012:function:my-function
arn:aws:sqs:us-east-1:123456789012:my-queue
arn:aws:dynamodb:us-east-1:123456789012:table/Users
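
The colon-delimited format above can be split mechanically. A minimal illustrative parser (the `Arn` tuple and `parse_arn` name are inventions for this sketch, not an AWS SDK API) shows why S3 and IAM ARNs have empty region/account fields:

```python
from typing import NamedTuple

class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account: str
    resource: str

def parse_arn(arn: str) -> Arn:
    # Split on the first five colons only: the resource part may itself
    # contain colons (e.g., Lambda's "function:my-function").
    prefix, partition, service, region, account, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError(f"not an ARN: {arn}")
    return Arn(partition, service, region, account, resource)

print(parse_arn("arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0"))
# S3 bucket ARNs leave region and account empty:
print(parse_arn("arn:aws:s3:::my-bucket/path/to/object"))
```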

Shared Responsibility Model

AWS and the customer share security responsibilities. Knowing the boundary prevents misconfigurations.

AWS responsible for | You responsible for
Physical hardware, data centers, networking | Data encryption at rest and in transit
Hypervisor and host OS patching | Guest OS patching (EC2 instances)
Managed service patching (RDS, Lambda runtime) | Application-level security, IAM policies
Global infrastructure availability | Network configuration, security groups, NACLs
Compliance certifications (SOC 2, PCI DSS) | Enabling compliance for your workloads on top

IAM — Identity and Access Management

IAM is the access control system for all of AWS. It is global (not region-scoped). Mistakes here are the most common source of both security breaches and confusing permission errors.

Principals: Users, Groups, and Roles

Principal | Purpose | When to use
IAM User | Long-term credentials (password + access keys) for a person or service | Human developers, legacy automation. Prefer roles for EC2/Lambda.
IAM Group | Collection of users; attach policies to groups rather than individual users | Team-level permissions (Developers, ReadOnly, Admins)
IAM Role | Temporary credentials assumed by a service, user, or external identity | EC2 instance profiles, Lambda execution, cross-account access
Service Principal | AWS service identity (e.g., lambda.amazonaws.com) | Trust policies: allows a service to assume a role

Policy Structure

Policies are JSON documents. Every policy statement contains: Effect, Action, Resource, and optionally Condition.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ReadOnMyBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    },
    {
      "Sid": "DenyDeleteFromProd",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::prod-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    }
  ]
}
Policy evaluation order
AWS evaluates policies in this order: (1) an explicit Deny always wins, (2) an explicit Allow is required to grant access, (3) everything else is implicitly denied by default. An explicit Deny in any applicable policy overrides every Allow — even Allows in other policies attached to the same principal.
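
The evaluation order in the note above can be sketched as a toy function. Statements are reduced to Effect/Action dicts here; real IAM evaluation also matches resources, principals, and conditions:

```python
def evaluate(statements, action):
    """Toy model: explicit Deny > explicit Allow > implicit Deny."""
    decision = "ImplicitDeny"  # the default when nothing matches
    for stmt in statements:
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if action in actions or "*" in actions:
            if stmt["Effect"] == "Deny":
                return "ExplicitDeny"  # Deny wins no matter what else matched
            decision = "Allow"
    return decision

policy = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"]},
    {"Effect": "Deny", "Action": "s3:DeleteObject"},
]
print(evaluate(policy, "s3:GetObject"))     # Allow
print(evaluate(policy, "s3:DeleteObject"))  # ExplicitDeny
print(evaluate(policy, "s3:PutObject"))     # ImplicitDeny
```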

Assume Role

Roles are assumed via STS (Security Token Service), which returns temporary credentials valid for 15 minutes to 12 hours.

# Assume a role from the CLI
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/DeployRole \
  --role-session-name deploy-session

# Returns: AccessKeyId, SecretAccessKey, SessionToken
# Export them to use in subsequent commands
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...

# Verify which identity you're using
aws sts get-caller-identity
// Trust policy — allows EC2 to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

// Cross-account trust — allows account 987654321098 to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::987654321098:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        }
      }
    }
  ]
}

Instance Profiles

An instance profile is a container for an IAM role that gets attached to an EC2 instance. The instance automatically retrieves temporary credentials from the instance metadata endpoint.

# From inside an EC2 instance, credentials come from the instance metadata
# service. With IMDSv2 (the recommended default), fetch a session token first:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Returns the role name, then:
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRoleName

# SDKs and the CLI use instance profile credentials automatically — no config needed
# This is why you should NEVER put long-term access keys on EC2 instances
IAM Best Practices Checklist
  • Least privilege: start with no permissions, add only what's needed
  • Enable MFA on the root account and all human IAM users
  • Never use root for day-to-day work — create an admin IAM user instead
  • No long-term access keys on EC2/Lambda — use instance profiles and execution roles
  • Rotate access keys regularly; delete unused ones
  • Use IAM groups to manage permissions at scale, not individual users
  • Prefer managed policies (AWS-maintained) over inline policies where possible
  • Use conditions to restrict by IP, MFA, time, or source VPC
  • Enable CloudTrail to audit all IAM and API calls
  • Review IAM Access Analyzer to find external access to resources

Compute: EC2

EC2 (Elastic Compute Cloud) provides resizable virtual machines. It is the foundation of most AWS compute architectures, even when you are using higher-level services that run on top of it.

Instance Type Families

Family | Optimized for | Common types | Use case
t3 / t4g | Burstable general purpose | t3.micro, t3.small, t3.medium | Dev/test, low-traffic web servers
m5 / m6i | Balanced compute/memory | m5.large, m5.xlarge, m5.4xlarge | Web servers, app servers, small databases
c5 / c6i | Compute-intensive | c5.large, c5.2xlarge, c5.9xlarge | Batch processing, ML inference, video encoding
r5 / r6i | Memory-intensive | r5.large, r5.4xlarge, r5.24xlarge | In-memory databases, large caches, analytics
g4dn / g5 | GPU accelerated | g4dn.xlarge, g5.2xlarge | ML training, GPU rendering, gaming
i3 / i4i | Storage-optimized (NVMe) | i3.large, i3.2xlarge | NoSQL databases, data warehousing

Pricing Models

Model | Description | Savings vs on-demand | Best for
On-Demand | Pay per second/hour, no commitment | Baseline | Unpredictable workloads, short-term
Reserved Instances | 1- or 3-year commitment to a specific instance type | Up to 72% | Stable, predictable baseline load
Savings Plans | Flexible commitment to spend $/hr; applies across instance types | Up to 66% | Predictable spend, flexible instance types
Spot Instances | Spare capacity; AWS can reclaim with a 2-minute notice | Up to 90% | Fault-tolerant batch jobs, ML training
Dedicated Hosts | Physical server dedicated to your account | Varies | Compliance, license requirements
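
A quick back-of-envelope comparison of the models above. The $0.096/hr on-demand rate and 730 hours/month are assumptions for illustration, not current AWS list prices:

```python
# Hypothetical on-demand hourly rate; the discounts come from the table above.
on_demand_hr = 0.096
hours_per_month = 730

scenarios = {
    "on-demand": on_demand_hr,
    "reserved (72% off)": on_demand_hr * (1 - 0.72),
    "spot (90% off)": on_demand_hr * (1 - 0.90),
}
for name, rate in scenarios.items():
    print(f"{name:20s} ${rate * hours_per_month:7.2f}/month")
```

The spread is why spot plus checkpointing is the default answer for interruptible batch work.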

Launching an Instance (CLI)

# Find the latest Amazon Linux 2023 AMI
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=al2023-ami-*" "Name=architecture,Values=x86_64" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
  --output text

# Launch an instance
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --key-name my-key-pair \
  --security-group-ids sg-12345678 \
  --subnet-id subnet-12345678 \
  --iam-instance-profile Name=MyInstanceProfile \
  --user-data file://bootstrap.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server}]'

# Check instance status
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=web-server" \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PublicIpAddress]' \
  --output table

# Push a temporary SSH public key via EC2 Instance Connect (valid ~60 seconds),
# then ssh as usual
aws ec2-instance-connect send-ssh-public-key \
  --instance-id i-1234567890abcdef0 \
  --instance-os-user ec2-user \
  --ssh-public-key file://~/.ssh/id_rsa.pub
ssh ec2-user@<instance-public-ip>

User Data Script

User data runs once at first boot as root. Use it to install software, configure the instance, and start services.

#!/bin/bash
# bootstrap.sh — runs at first launch as root
set -e
yum update -y

# Install Docker
yum install -y docker
systemctl enable docker
systemctl start docker
usermod -aG docker ec2-user

# Install application
yum install -y git
git clone https://github.com/myorg/myapp /opt/myapp
cd /opt/myapp

# Start with systemd
cat > /etc/systemd/system/myapp.service <<EOF
[Unit]
Description=My Application
After=network.target

[Service]
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/python3 app.py
Restart=always
User=ec2-user

[Install]
WantedBy=multi-user.target
EOF

systemctl enable myapp
systemctl start myapp
Production Use Cases
Use Case | Why This Service
Stateful workloads (databases, caches) | Needs persistent local NVMe storage and a consistent network identity across restarts; Lambda's ephemeral execution environment and stateless model make this impossible.
GPU / ML training (p4d, g5 instances) | Fargate has no GPU support; EC2 gives you direct PCIe access to A100/A10G GPUs and lets you tune CUDA drivers. For one-off training runs, spot instances cut costs 60–90%.
Legacy app migration (lift-and-shift) | You control the OS, runtime, and network stack — zero application refactoring required. Use this as a stepping stone; don't treat it as a destination.
Fault-tolerant batch processing on spot | Spot Fleet + checkpointing to S3 delivers 60–90% cost savings. The key insight: if your job can resume from a checkpoint, preemption is cheap.
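
The checkpointing idea in the last row can be sketched in a few lines. This is a pure simulation: a dict stands in for the S3 checkpoint object, and `process_batch` and the job layout are invented for illustration:

```python
checkpoint_store = {}  # stand-in for an S3 checkpoint object

def process_batch(items, job_id, interrupt_at=None):
    """Process items, resuming from the last checkpoint if one exists."""
    start = checkpoint_store.get(job_id, 0)
    done = []
    for i in range(start, len(items)):
        if interrupt_at is not None and i == interrupt_at:
            return done  # simulated 2-minute spot reclaim notice
        done.append(items[i] * 2)          # the "work"
        checkpoint_store[job_id] = i + 1   # persist progress after each item
    return done

items = list(range(10))
first = process_batch(items, "job-1", interrupt_at=6)   # preempted at item 6
second = process_batch(items, "job-1")                  # new instance resumes
print(len(first), len(second))  # 6 4
```

Because progress is persisted outside the instance, the preemption costs only the in-flight item, not the whole job.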

Compute: Lambda

Lambda is AWS's serverless compute service. You upload code, define a handler function, and AWS manages everything else: servers, OS, scaling, high availability. You pay only for compute time consumed — down to 1ms granularity.

The Serverless Model

Concept | Description
Function | Your deployment unit — code + dependencies + configuration
Handler | Entry point: module.function_name (e.g., handler.lambda_handler)
Event | JSON payload delivered to the handler — shape varies by trigger
Context | Runtime info: function name, remaining time, log stream, request ID
Execution environment | Micro-VM (Firecracker) — frozen between invocations, reused when warm
Cold start | First invocation after idle: environment initialization adds 100ms–2s latency
Concurrency | Each simultaneous invocation gets its own environment; default limit 1,000/region
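
Environment reuse is the detail that matters most in practice: module-level init runs once per environment, while the handler runs once per invocation. A toy simulation (all names here are invented for the sketch):

```python
init_count = 0
invoke_count = 0

def _module_init():
    global init_count
    init_count += 1  # stands in for expensive setup: SDK clients, config

_module_init()  # runs once, at cold start

def lambda_handler(event, context=None):
    global invoke_count
    invoke_count += 1
    return {"inits": init_count, "invocations": invoke_count}

for _ in range(3):  # three warm invocations reuse the same environment
    result = lambda_handler({})
print(result)  # {'inits': 1, 'invocations': 3}
```

This is why the handler example that follows creates its boto3 clients at module scope.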

Python Handler Example

import json
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize outside handler — reused across warm invocations
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Users')


def lambda_handler(event, context):
    """
    Processes an API Gateway proxy event.

    Args:
        event: API Gateway event dict with httpMethod, path, body, headers, etc.
        context: Lambda context with function_name, aws_request_id, etc.

    Returns:
        API Gateway response dict with statusCode, headers, body.
    """
    logger.info("Request ID: %s", context.aws_request_id)
    logger.info("Event: %s", json.dumps(event))

    http_method = event.get('httpMethod', 'GET')
    path_params = event.get('pathParameters') or {}
    user_id = path_params.get('userId')

    if not user_id:
        return _response(400, {'error': 'userId path parameter is required'})

    try:
        if http_method == 'GET':
            result = table.get_item(Key={'userId': user_id})
            user = result.get('Item')
            if not user:
                return _response(404, {'error': 'User not found'})
            return _response(200, user)

        elif http_method == 'DELETE':
            table.delete_item(Key={'userId': user_id})
            return _response(204, {})

        else:
            return _response(405, {'error': f'Method {http_method} not allowed'})

    except ClientError as e:
        error_code = e.response['Error']['Code']
        logger.error("DynamoDB error: %s", error_code)
        return _response(500, {'error': 'Internal server error'})


def _response(status_code: int, body: dict) -> dict:
    return {
        'statusCode': status_code,
        'headers': {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(body),
    }

Common Triggers

Trigger | Invocation type | Retry behavior
API Gateway / Function URL | Synchronous | Client retries — no automatic retry
S3 (object created/deleted) | Asynchronous | 2 retries, then dead-letter queue
SQS queue | Poll-based | Message returns to queue on failure; DLQ after maxReceiveCount
SNS topic | Asynchronous | 2 retries, then DLQ
DynamoDB Streams | Poll-based | Retries until the record expires (24h) or DLQ
EventBridge (CloudWatch Events) | Asynchronous | 2 retries
CloudWatch Logs subscription | Asynchronous | 2 retries

Key Configuration

# Deploy a function
aws lambda create-function \
  --function-name my-api \
  --runtime python3.12 \
  --handler handler.lambda_handler \
  --role arn:aws:iam::123456789012:role/LambdaExecRole \
  --zip-file fileb://function.zip \
  --timeout 30 \
  --memory-size 512 \
  --environment Variables='{TABLE_NAME=Users,LOG_LEVEL=INFO}'

# Update code
aws lambda update-function-code \
  --function-name my-api \
  --zip-file fileb://function.zip

# Set concurrency limit (protect downstream services)
aws lambda put-function-concurrency \
  --function-name my-api \
  --reserved-concurrent-executions 100

# Enable provisioned concurrency (eliminate cold starts)
aws lambda put-provisioned-concurrency-config \
  --function-name my-api \
  --qualifier LIVE \
  --provisioned-concurrent-executions 10

# Invoke synchronously
aws lambda invoke \
  --function-name my-api \
  --payload '{"httpMethod":"GET","pathParameters":{"userId":"abc123"}}' \
  --cli-binary-format raw-in-base64-out \
  response.json
cat response.json
Production Use Cases
Use Case | Why This Service
API backend (API Gateway + Lambda) | Zero infra management, per-request billing, and automatic scaling to thousands of concurrent requests. Choose over EC2 when traffic is spiky or unpredictable — idle EC2 burns money, idle Lambda costs nothing.
Event-driven file processing (S3 → Lambda) | Triggered on upload, process-and-forget — no idle compute cost. Canonical example: thumbnail generation or CSV parsing triggered by object creation (S3 delivers notifications at least once, so make handlers idempotent).
Scheduled tasks (EventBridge → Lambda) | Replaces cron on EC2 with no server to maintain and built-in retry on failure. The EC2 cron approach requires keeping an instance alive 24/7 for a job that runs for seconds.
Stream processing (Kinesis / DynamoDB Streams → Lambda) | Real-time processing with built-in batching and checkpointing. Simpler and cheaper than running Flink or Spark Streaming for low-to-medium volume streams where you don't need complex windowing.
Service glue (SQS → DynamoDB, SNS → Slack) | Short-lived, stateless transformations between services are Lambda's sweet spot. Adding EC2 here is engineering overhead with no benefit — Lambda scales to zero between bursts automatically.
Lambda Cold Start Mitigation Strategies
  • Provisioned Concurrency: Pre-warms N environments. Eliminates cold starts for that capacity. Costs extra.
  • Keep functions warm: CloudWatch Events rule that pings the function every 5 minutes. Free but only works for low-concurrency functions.
  • Minimize package size: Smaller zip = faster initialization. Use Lambda layers for large shared dependencies.
  • Choose faster runtimes: Python and Node have faster cold starts than Java and .NET. Go compiles to a binary (very fast).
  • Init code outside handler: SDK clients, DB connections, config loading — do this once at module load, reuse across invocations.
  • SnapStart (Java): snapshots the initialized environment and restores from it on invocation — typically cuts cold starts from several seconds to sub-second for Java 11+.

Storage: S3

S3 (Simple Storage Service) is object storage with 11 nines (99.999999999%) of durability. Objects are stored in buckets, are addressed by key, and can range from 0 bytes to 5 TB.

Storage Classes

Class | Use case | Retrieval | Min storage duration
Standard | Frequently accessed data | Milliseconds | None
Standard-IA | Infrequently accessed, needs fast retrieval | Milliseconds | 30 days
One Zone-IA | Infrequent access, single AZ (cheaper) | Milliseconds | 30 days
Intelligent-Tiering | Unknown or changing access patterns | Milliseconds | None
Glacier Instant Retrieval | Archive, quarterly access | Milliseconds | 90 days
Glacier Flexible Retrieval | Archive, occasional access | Minutes (expedited) to 12 hours (bulk) | 90 days
Glacier Deep Archive | Long-term archive, once-a-year access | Up to 48 hours | 180 days
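
Rough monthly cost math across classes, using the per-GB prices quoted in the use-case table later in this section ($0.023 Standard, $0.00099 Deep Archive); the Standard-IA figure is an assumption for illustration:

```python
gb = 1024  # 1 TB
prices_per_gb = {
    "Standard": 0.023,
    "Standard-IA": 0.0125,           # assumed for illustration
    "Glacier Deep Archive": 0.00099,
}
for cls, per_gb in prices_per_gb.items():
    print(f"{cls:22s} ${gb * per_gb:7.2f}/month")

# Deep Archive vs Standard, the "23x cheaper" figure cited below:
print(round(0.023 / 0.00099), "x cheaper")
```

The minimum-storage-duration column is the catch: objects deleted before 90/180 days still incur the full minimum charge.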

Common S3 CLI Commands

# Create a bucket (bucket names are globally unique)
aws s3 mb s3://my-unique-bucket-name --region us-east-1

# Upload a file
aws s3 cp ./local-file.txt s3://my-bucket/remote-path/file.txt

# Upload with specific storage class
aws s3 cp ./archive.zip s3://my-bucket/archives/ \
  --storage-class GLACIER

# Sync a directory (only uploads changed/new files)
aws s3 sync ./build/ s3://my-bucket/static/ \
  --delete \
  --cache-control "max-age=86400"

# Download
aws s3 cp s3://my-bucket/file.txt ./downloaded.txt
aws s3 sync s3://my-bucket/data/ ./local-data/

# List objects
aws s3 ls s3://my-bucket/
aws s3 ls s3://my-bucket/ --recursive --human-readable --summarize

# Delete
aws s3 rm s3://my-bucket/old-file.txt
aws s3 rm s3://my-bucket/old-prefix/ --recursive

# Generate a presigned URL (valid for 1 hour)
aws s3 presign s3://my-bucket/private-file.pdf --expires-in 3600

Bucket Policies vs ACLs

Prefer bucket policies over ACLs
ACLs are a legacy access control mechanism. AWS now recommends disabling ACLs (set Object Ownership to "Bucket owner enforced") and using bucket policies or IAM policies for all access control.
// Bucket policy: allow public read of all objects (for static site hosting)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-static-site/*"
    }
  ]
}

// Bucket policy: enforce HTTPS only
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonHTTPS",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}

Lifecycle Policies & Versioning

// Lifecycle policy: transition to cheaper storage, then expire
{
  "Rules": [
    {
      "ID": "archive-and-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 30
      }
    }
  ]
}
# Enable versioning
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

# List all versions of an object
aws s3api list-object-versions --bucket my-bucket --prefix my-file.txt

# Restore a specific version
aws s3api get-object \
  --bucket my-bucket \
  --key my-file.txt \
  --version-id abc123def456 \
  restored.txt
Production Use Cases
Use Case | Why This Service
Data lake foundation | Unlimited storage at $0.023/GB with native integration to Athena, Spark, and Redshift Spectrum. Parquet + partitioned prefixes give you columnar scan performance without running a warehouse — query only the partitions you need.
Static website hosting (S3 + CloudFront) | Global CDN with TLS for pennies per GB served. No servers, no OS patches, 11-nines durability for your assets. The gap between this and a running EC2 instance is both cost and operational burden.
Backup and archive | Lifecycle rules auto-tier objects to Glacier Deep Archive at $0.00099/GB — 23x cheaper than S3 Standard. Object Lock enforces WORM compliance for regulatory retention requirements without custom logic.
ML training data versioning | S3 versioning gives you dataset snapshots with zero overhead; SageMaker reads directly from S3. Compare to maintaining a separate data versioning system — S3 versioning is already there.
Event-driven pipeline triggers | S3 event notifications → Lambda/SQS decouple data ingestion from processing. No polling loop required; AWS delivers the notification within seconds of the PUT.

Databases

AWS offers multiple managed database services. Choosing the right one is a critical architectural decision driven by data model, access patterns, and consistency requirements.

RDS — Relational Database Service

RDS manages common relational databases: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora. AWS handles backups, patching, replication, and failover.

# Create a PostgreSQL RDS instance
aws rds create-db-instance \
  --db-instance-identifier prod-postgres \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --engine-version 16.1 \
  --master-username admin \
  --master-user-password supersecret \
  --db-name myapp \
  --allocated-storage 100 \
  --storage-type gp3 \
  --multi-az \
  --backup-retention-period 7 \
  --no-publicly-accessible \
  --vpc-security-group-ids sg-12345678

# Create a read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-postgres-read \
  --source-db-instance-identifier prod-postgres

DynamoDB — Key-Value / Document Store

DynamoDB is a fully managed NoSQL database with single-digit millisecond performance at any scale. The data model centers on a partition key and optional sort key.

Concept | Description
Partition key | Required. Determines which partition the item lives in. Must uniquely identify items (when no sort key exists).
Sort key | Optional. Together with the partition key, forms a composite primary key. Enables range queries.
GSI | Global Secondary Index — alternate access pattern with a different partition/sort key. Eventually consistent.
LSI | Local Secondary Index — same partition key, different sort key. Can only be created at table creation time.
On-demand mode | Pay per request. Scales instantly. Good for unpredictable workloads.
Provisioned mode | Set RCU/WCU. Can use auto-scaling. Cheaper at predictable load.
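
Why the partition key matters can be illustrated with a stand-in hash. DynamoDB's real partitioning hash is internal; `md5` and `NUM_PARTITIONS` here are illustrative only:

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical; DynamoDB manages this for you

def partition_for(partition_key: str) -> int:
    # Hash the key, map it onto a partition. All items sharing a partition
    # key land on the same partition, which is what makes Query cheap and
    # a hot key a throughput problem.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for uid in ["user#abc", "user#def", "user#ghi"]:
    print(uid, "-> partition", partition_for(uid))
```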
# Create a table with composite key
aws dynamodb create-table \
  --table-name Orders \
  --attribute-definitions \
    AttributeName=userId,AttributeType=S \
    AttributeName=orderId,AttributeType=S \
    AttributeName=createdAt,AttributeType=S \
  --key-schema \
    AttributeName=userId,KeyType=HASH \
    AttributeName=orderId,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST \
  --global-secondary-indexes '[
    {
      "IndexName": "CreatedAtIndex",
      "KeySchema": [
        {"AttributeName": "userId", "KeyType": "HASH"},
        {"AttributeName": "createdAt", "KeyType": "RANGE"}
      ],
      "Projection": {"ProjectionType": "ALL"}
    }
  ]'

# Put an item
aws dynamodb put-item \
  --table-name Orders \
  --item '{
    "userId": {"S": "user#abc"},
    "orderId": {"S": "order#001"},
    "createdAt": {"S": "2026-02-23T10:00:00Z"},
    "status": {"S": "pending"},
    "total": {"N": "49.99"}
  }'

# Query items by partition key + sort key condition
aws dynamodb query \
  --table-name Orders \
  --key-condition-expression "userId = :uid AND begins_with(orderId, :prefix)" \
  --expression-attribute-values '{
    ":uid": {"S": "user#abc"},
    ":prefix": {"S": "order#"}
  }'

ElastiCache

Managed in-memory caching. Two engines: Redis (data structures, persistence, pub/sub, clustering) and Memcached (simple key-value, multi-threaded, no persistence).

# Create a Redis cluster
aws elasticache create-replication-group \
  --replication-group-id my-redis \
  --replication-group-description "App cache" \
  --engine redis \
  --engine-version 7.0 \
  --cache-node-type cache.t3.micro \
  --num-cache-clusters 2 \
  --automatic-failover-enabled \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled
Production Use Cases
Use Case | Why This Service
RDS Multi-AZ for production OLTP | Automated failover, point-in-time backups, and OS patching with zero DBA work. Choose over self-managed EC2 Postgres when you don't need exotic extensions — the operational savings outweigh the 20–30% cost premium.
DynamoDB for session stores, user profiles, gaming leaderboards | Single-digit millisecond latency at any scale with no index tuning. Choose when access patterns are known and simple (key-value or key-range) — the moment you need ad-hoc queries, reach for a relational database instead.
DynamoDB Streams + Lambda for change data capture | React to data mutations in real time without polling. Cheaper and simpler than running Debezium + Kafka for moderate change volumes where exactly-once CDC semantics aren't required.
ElastiCache Redis for rate limiting and real-time leaderboards | Sub-millisecond latency with sorted sets, atomic counters, and pub/sub that DynamoDB can't match. Choose Redis over DynamoDB when you need complex data structures or an extremely tight latency budget.
Aurora for high-throughput OLTP | 5x MySQL / 3x Postgres throughput on the same hardware, with storage auto-scaling to 128 TB. Reach for Aurora when standard RDS hits its IOPS ceiling — the architecture separates compute from storage, removing the bottleneck.

Networking: VPC

A VPC (Virtual Private Cloud) is your isolated network within AWS. Every EC2 instance, RDS database, and Lambda (when VPC-attached) lives inside a VPC. Understanding VPC design is critical for security and connectivity.

Core VPC Components

Component | Purpose
VPC | Isolated virtual network. Defined by a CIDR block (e.g., 10.0.0.0/16 = 65,536 IPs).
Subnet | A subdivision of the VPC tied to one AZ. Public subnets route to the IGW; private subnets route to a NAT gateway or nowhere.
Internet Gateway (IGW) | Allows public subnets to reach the internet. Attached to the VPC.
NAT Gateway | Allows private-subnet instances to initiate outbound internet connections (but blocks inbound). Lives in a public subnet.
Route Table | Defines where traffic is directed. Every subnet is associated with exactly one route table.
Security Group | Stateful virtual firewall at the instance/ENI level. Allow rules only; no explicit deny.
NACL | Stateless firewall at the subnet level. Supports both allow and deny rules. Rules evaluated by number (lowest first).
VPC Peering | Private connectivity between two VPCs (same or different account/region). Not transitive.
VPC Endpoint | Private connection from a VPC to AWS services without traversing the internet.
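
The CIDR arithmetic in the table checks out with the stdlib ipaddress module, and the same tool is handy when planning the subnet layout shown below:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)          # 65536

# Carving /24 subnets out of the /16 yields 256 of them
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets), subnets[1])   # 256 10.0.1.0/24

# AWS reserves 5 addresses per subnet (first 4 + last),
# so a /24 leaves 251 usable hosts
print(subnets[0].num_addresses - 5)  # 251
```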

Three-Tier VPC Design

# Typical 3-tier VPC: public / private app / private data
# CIDR: 10.0.0.0/16
#
# Public subnets (load balancers, NAT GW, bastion host)
#   10.0.1.0/24  us-east-1a
#   10.0.2.0/24  us-east-1b
#
# Private app subnets (EC2, ECS, Lambda)
#   10.0.10.0/24  us-east-1a
#   10.0.11.0/24  us-east-1b
#
# Private data subnets (RDS, ElastiCache — no outbound internet needed)
#   10.0.20.0/24  us-east-1a
#   10.0.21.0/24  us-east-1b

# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16 --tag-specifications \
  'ResourceType=vpc,Tags=[{Key=Name,Value=my-vpc}]'

# Create public subnet
aws ec2 create-subnet \
  --vpc-id vpc-12345678 \
  --cidr-block 10.0.1.0/24 \
  --availability-zone us-east-1a

# Enable auto-assign public IP for public subnet
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-12345678 \
  --map-public-ip-on-launch

# Create and attach Internet Gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway \
  --vpc-id vpc-12345678 \
  --internet-gateway-id igw-12345678

# Add route to IGW in public route table
aws ec2 create-route \
  --route-table-id rtb-12345678 \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id igw-12345678

Security Groups vs NACLs

Feature          | Security Group                       | NACL
Level            | Instance / ENI                       | Subnet
Statefulness     | Stateful (return traffic automatic)  | Stateless (must allow inbound AND outbound)
Rules            | Allow only                           | Allow and Deny
Rule evaluation  | All rules evaluated                  | Evaluated in number order, first match wins
Default behavior | Deny all inbound, allow all outbound | Allow all (default NACL)
Best use         | Per-instance access control          | Subnet-level block lists (e.g., block an IP range)
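The "first match wins" evaluation order is what makes NACL deny rules useful: a low-numbered deny beats a broad allow. A minimal sketch of that semantics (single-port rules only, an assumption for brevity; real NACL rules cover port ranges and protocols):

```python
import ipaddress

def evaluate_nacl(rules: list[dict], src_ip: str, port: int) -> str:
    """Rules are checked in ascending number order; the first matching
    rule wins; nothing matching means the implicit '*' deny applies."""
    ip = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        if ip in ipaddress.ip_network(rule["cidr"]) and port == rule["port"]:
            return rule["action"]   # first match wins; later rules never seen
    return "deny"                   # implicit catch-all deny

rules = [
    # Block one IP range, then allow everyone else on 443:
    {"number": 100, "cidr": "203.0.113.0/24", "port": 443, "action": "deny"},
    {"number": 200, "cidr": "0.0.0.0/0",      "port": 443, "action": "allow"},
]
print(evaluate_nacl(rules, "203.0.113.9", 443))   # deny
print(evaluate_nacl(rules, "198.51.100.7", 443))  # allow
```

Swapping the rule numbers (allow at 100, deny at 200) would make the deny rule unreachable, which is the classic NACL misconfiguration.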

Production Use Cases

Multi-tier isolation (public / private / isolated subnets)
  The ALB lives in the public subnet, app servers in private, databases in an isolated subnet with no route to the internet. NACLs add a second layer of defense — stateless deny rules that security groups can't express.
VPC peering for cross-account shared services
  Connect a shared logging or monitoring account to production without traversing the internet. Traffic stays on the AWS backbone — lower latency and no egress costs compared to routing via an internet gateway.
PrivateLink for SaaS integration
  Access third-party APIs (Datadog, Snowflake) without the traffic ever leaving the AWS network. Required for PCI-DSS and HIPAA workloads where data must not traverse the public internet.
Transit Gateway for hub-and-spoke multi-VPC routing
  At 10+ VPCs, full-mesh peering becomes O(n²) routes to manage. Transit Gateway centralizes routing through a single attachment — one place to audit, one place to update.

Messaging & Queues

Asynchronous messaging decouples producers from consumers, enabling fault tolerance, load leveling, and fan-out patterns.

SQS — Simple Queue Service

Feature    | Standard Queue                      | FIFO Queue
Throughput | Unlimited TPS                       | 300 TPS (3,000 with batching)
Ordering   | Best-effort (not guaranteed)        | Strict FIFO per message group
Delivery   | At least once (duplicates possible) | Exactly once
Use case   | High-throughput, order not critical | Financial transactions, user actions
# Create a queue with dead-letter queue
aws sqs create-queue --queue-name my-dlq
aws sqs get-queue-attributes --queue-url ... --attribute-names QueueArn

aws sqs create-queue \
  --queue-name my-queue \
  --attributes '{
    "VisibilityTimeout": "30",
    "MessageRetentionPeriod": "86400",
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\",\"maxReceiveCount\":\"3\"}"
  }'

# Send a message
aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --message-body '{"orderId": "abc123", "action": "process"}'

# Receive and process messages
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --max-number-of-messages 10 \
  --wait-time-seconds 20  # Long polling — reduce empty receives

# Delete after processing
aws sqs delete-message \
  --queue-url ... \
  --receipt-handle "AQEBwJ..."
Visibility Timeout
When a consumer receives a message, it becomes invisible to other consumers for the visibility timeout period (default 30s). If the consumer doesn't delete the message within that window, it becomes visible again and another consumer can pick it up. Set this slightly longer than your processing time.
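The receive/timeout/redrive lifecycle described above can be modeled in a few lines. A toy in-memory sketch (not the SQS API) showing a message that is never deleted landing in the DLQ after `maxReceiveCount` receives:

```python
# Toy model of SQS visibility timeout + redrive semantics. A message that is
# received but never deleted reappears after the timeout; after
# max_receive_count receives it is moved to the dead-letter queue.

class ToyQueue:
    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visible, self.inflight, self.dlq = [], {}, []
        self.timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.receive_counts = {}

    def send(self, body):
        self.visible.append(body)

    def receive(self, now):
        self._expire(now)
        if not self.visible:
            return None
        body = self.visible.pop(0)
        self.receive_counts[body] = self.receive_counts.get(body, 0) + 1
        self.inflight[body] = now + self.timeout   # invisible until this time
        return body

    def delete(self, body):
        self.inflight.pop(body, None)              # ack: gone for good

    def _expire(self, now):
        for body, deadline in list(self.inflight.items()):
            if now >= deadline:                    # consumer never deleted it
                del self.inflight[body]
                if self.receive_counts[body] >= self.max_receive_count:
                    self.dlq.append(body)          # redrive to dead-letter queue
                else:
                    self.visible.append(body)      # becomes visible again

q = ToyQueue()
q.send("order-1")
for t in (0, 31, 62, 93):   # receive repeatedly, never delete
    q.receive(t)
print(q.dlq)                # ['order-1']
```

The same model explains the sizing advice: if processing takes longer than the visibility timeout, a healthy consumer's message reappears and gets processed twice.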

SNS — Simple Notification Service

SNS is a pub/sub service. Publishers send to a topic; subscribers (SQS, Lambda, HTTP, email, SMS) receive a copy. This enables fan-out: one event triggers many parallel consumers.

# Create a topic
aws sns create-topic --name order-events

# Subscribe an SQS queue to the topic
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-1:123456789012:order-processing

# Subscribe a Lambda function
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:send-email

# Publish a message
aws sns publish \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --message '{"orderId": "abc123", "status": "placed"}' \
  --subject "Order Placed"
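The fan-out semantics are simple to model: one publish puts an independent copy of the message on every subscription. A minimal in-memory sketch (not the SNS API):

```python
from collections import defaultdict

class ToyTopic:
    def __init__(self):
        self.queues = defaultdict(list)   # queue name -> delivered messages

    def subscribe(self, queue_name):
        self.queues[queue_name]           # touching the key creates the queue

    def publish(self, message):
        for q in self.queues.values():
            q.append(message)             # every subscriber gets its own copy

topic = ToyTopic()
for name in ("order-fulfillment", "order-email", "order-analytics"):
    topic.subscribe(name)

topic.publish({"orderId": "abc123", "status": "placed"})

print({name: len(q) for name, q in topic.queues.items()})
# {'order-fulfillment': 1, 'order-email': 1, 'order-analytics': 1}
```

Because each subscriber owns its copy, a slow or failing consumer never affects the others, which is the core argument for the SNS-to-SQS pattern later in this section.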

EventBridge

EventBridge is a serverless event bus. More powerful than SNS for routing: supports content-based routing via rules, schema registry, event replays, and cross-account event buses.

# Create a rule to trigger Lambda when an EC2 instance stops
aws events put-rule \
  --name ec2-stopped \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]}
  }' \
  --state ENABLED

# Add Lambda as target
aws events put-targets \
  --rule ec2-stopped \
  --targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:notify-ops'
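Content-based routing means the rule's event pattern is matched against each event's payload. A simplified sketch of the matching semantics, covering only exact-value lists and nested fields (the real pattern language also supports prefix, numeric, and anything-but matchers):

```python
def matches(pattern: dict, event: dict) -> bool:
    """Each pattern field holds a list of acceptable values;
    nested dicts recurse. Missing fields never match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:   # expected is a list of allowed values
            return False
    return True

# The rule from the CLI example above:
rule = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"state": "stopped", "instance-id": "i-123"},
}
print(matches(rule, event))                                      # True
print(matches(rule, {**event, "detail": {"state": "running"}}))  # False
```

Note the asymmetry: extra fields in the event (like `instance-id`) are ignored, but every field in the pattern must match.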

Production Use Cases

SQS: Decoupling microservices
  Producer writes at its own pace; consumer processes at its own pace — SQS absorbs the spike. Use Standard queue for maximum throughput, FIFO when order matters (order processing, financial transactions).
SQS + DLQ: Poison message handling
  Failed messages are quarantined for inspection rather than silently dropped or blocking the queue. In payment and order processing, you cannot afford to lose a message — the DLQ gives you a durable holding area to debug and replay.
SNS: Fan-out to multiple consumers simultaneously
  One published event triggers email notification + analytics pipeline + audit log in parallel. Doing this with SQS alone requires each consumer to poll independently — SNS pushes to all subscribers in one API call.
EventBridge: Cross-account event routing with content-based filtering
  Route events to different targets based on payload content without writing routing code. Choose over SNS when you need schema registry for contract enforcement, event replay for debugging, or third-party SaaS integration (Stripe, Auth0 webhooks).
EventBridge Scheduler: Cron replacement
  One-time or recurring triggers with no infrastructure to maintain. Replaces the pattern of keeping an EC2 instance alive 24/7 just to run a cron job that executes for a few seconds.

Containers on AWS

AWS offers multiple layers for running containers: ECR for image storage, ECS for container orchestration (AWS-native), and EKS for Kubernetes.

ECR — Elastic Container Registry

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789012.dkr.ecr.us-east-1.amazonaws.com

# Create a repository
aws ecr create-repository --repository-name my-app

# Build, tag, and push
docker build -t my-app .
docker tag my-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

# Enable image scanning on push (checks for CVEs)
aws ecr put-image-scanning-configuration \
  --repository-name my-app \
  --image-scanning-configuration scanOnPush=true

ECS — Elastic Container Service

ECS has two key concepts: Task Definitions (what to run — image, CPU, memory, environment, ports) and Services (how many copies, load balancer integration, auto-scaling).

Launch Type | Description                                                               | When to use
Fargate     | Serverless — AWS manages the underlying compute; pay per task CPU/memory. | Most workloads; no cluster management overhead.
EC2         | You manage an EC2 cluster; ECS places tasks on your instances.            | GPU workloads, specific instance types, cost optimization at scale.
// ECS Task Definition (simplified)
{
  "family": "my-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/myAppTaskRole",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "ENV", "value": "production" }
      ],
      "secrets": [
        { "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password" }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/my-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
# Register task definition
aws ecs register-task-definition --cli-input-json file://task-def.json

# Create a service (runs 2 tasks behind a load balancer)
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-app-svc \
  --task-definition my-app:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={
    subnets=[subnet-aaa,subnet-bbb],
    securityGroups=[sg-12345678],
    assignPublicIp=DISABLED
  }' \
  --load-balancers 'targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=my-app,containerPort=8080'

# Force new deployment (rolling update)
aws ecs update-service \
  --cluster my-cluster \
  --service my-app-svc \
  --force-new-deployment

EKS — Elastic Kubernetes Service

EKS is managed Kubernetes. AWS runs the control plane (API server, etcd, scheduler). You manage worker nodes (or use Fargate for pods).

# Create a cluster (takes ~10 min)
eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --nodegroup-name workers \
  --node-type m5.large \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 5 \
  --managed

# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name my-cluster

# Verify
kubectl get nodes
kubectl get pods --all-namespaces

Production Use Cases

ECS Fargate: Stateless microservices
  No cluster management, no AMI patching, pay-per-vCPU-second. Choose over EC2 launch type when your workload is stateless and traffic is variable — Fargate's higher unit cost is offset by eliminating idle EC2 capacity.
ECS EC2: GPU workloads and cost-optimized steady-state
  Fargate doesn't support GPUs, and at high, predictable throughput EC2 Reserved Instances are 2–3x cheaper per unit than Fargate. Use EC2 launch type when you've right-sized the fleet and can commit to reserved capacity.
ECR: Private image registry
  IAM-integrated authentication means no separate registry credentials to rotate. Built-in vulnerability scanning catches CVEs before deployment, and lifecycle policies automatically prune untagged images to control storage costs.
EKS: Multi-cloud portability and complex orchestration
  If your team already operates Kubernetes, EKS avoids retraining and lets you reuse Helm charts, operators, and tooling across clouds. Choose EKS over ECS when you need custom controllers, a service mesh (Istio/Linkerd), or a genuine multi-cloud strategy.

Monitoring & Logging

Observability on AWS centers on three tools: CloudWatch for metrics and logs, CloudTrail for API audit trails, and X-Ray for distributed tracing.

CloudWatch

Metrics: Time-series data points. EC2 (CPU, network, disk), Lambda (duration, errors, throttles), and RDS (connections, latency) all publish metrics automatically.
Alarms: Trigger actions (SNS notification, Auto Scaling, EC2 action) when a metric breaches a threshold.
Logs: Log groups and log streams. Lambda writes here automatically; EC2/ECS needs the CloudWatch agent.
Logs Insights: SQL-like query language over log data. Useful for ad-hoc debugging.
Dashboards: Custom real-time metric visualizations across services and regions.
Synthetics: Canary scripts that monitor endpoints and APIs on a schedule.
# Create an alarm: fire when a Lambda function records more than 5 errors
# per 5-minute period, for 2 consecutive periods
aws cloudwatch put-metric-alarm \
  --alarm-name lambda-errors-high \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --dimensions Name=FunctionName,Value=my-api \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --treat-missing-data notBreaching
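The alarm above transitions to ALARM only when the metric breaches the threshold for `evaluation-periods` consecutive periods. A simplified sketch of that evaluation (ignoring datapoints-to-alarm and missing-data handling, which real alarms also consider):

```python
def alarm_state(datapoints: list[float], threshold: float,
                evaluation_periods: int) -> str:
    """Datapoints are period statistics (here, Sum of Errors per 5 min),
    oldest first. ALARM only if the last `evaluation_periods` datapoints
    all breach the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) == evaluation_periods and all(v > threshold for v in recent):
        return "ALARM"
    return "OK"

print(alarm_state([0, 2, 7, 9], threshold=5, evaluation_periods=2))  # ALARM
print(alarm_state([0, 9, 3],    threshold=5, evaluation_periods=2))  # OK
```

The consecutive-periods requirement is what keeps a single transient spike from paging anyone; trade it against detection latency when tuning.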

# Publish a custom metric
aws cloudwatch put-metric-data \
  --namespace MyApp \
  --metric-name OrdersProcessed \
  --value 42 \
  --unit Count

# Query logs with Logs Insights
aws logs start-query \
  --log-group-name /aws/lambda/my-api \
  --start-time $(date -v-1H +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 20'

CloudTrail

CloudTrail records every AWS API call (who, what, when, from where) across your account. It is the primary tool for security investigations and compliance auditing.
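CloudTrail delivers gzipped JSON objects with a top-level `Records` array; answering "who did what, when, from where" means pulling a few fields from each record. A sketch using the standard record field names with an invented sample record:

```python
import json

# Invented sample in the shape of a CloudTrail log file:
record_json = """{
  "Records": [{
    "eventTime": "2024-05-01T12:34:56Z",
    "eventName": "DeleteObject",
    "eventSource": "s3.amazonaws.com",
    "sourceIPAddress": "198.51.100.7",
    "userIdentity": {"type": "IAMUser", "userName": "alice"}
  }]
}"""

def summarize(record: dict) -> str:
    """One line per API call: when, who, what, where from."""
    who = record["userIdentity"].get("userName", record["userIdentity"]["type"])
    return (f"{record['eventTime']}: {who} called "
            f"{record['eventName']} on {record['eventSource']} "
            f"from {record['sourceIPAddress']}")

for rec in json.loads(record_json)["Records"]:
    print(summarize(rec))
# 2024-05-01T12:34:56Z: alice called DeleteObject on s3.amazonaws.com from 198.51.100.7
```

For investigations at scale you would run this shape of query in Athena over the S3 trail bucket rather than iterating files by hand.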

# Create a trail that writes to S3 (best practice: multi-region trail)
aws cloudtrail create-trail \
  --name my-audit-trail \
  --s3-bucket-name my-audit-logs-bucket \
  --include-global-service-events \
  --is-multi-region-trail \
  --enable-log-file-validation

aws cloudtrail start-logging --name my-audit-trail

# Look up recent events for a specific user
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=alice \
  --max-results 10

# Find who deleted an S3 object
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteObject

X-Ray — Distributed Tracing

X-Ray traces requests as they flow through your application, across Lambda, EC2, ECS, API Gateway, and more. It produces service maps and identifies latency bottlenecks.

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Instrument all boto3 clients automatically
patch_all()

@xray_recorder.capture('process_order')
def process_order(order_id: str) -> dict:
    # This creates a subsegment in the trace
    with xray_recorder.in_subsegment('validate') as subsegment:
        subsegment.put_annotation('orderId', order_id)
        result = validate_order(order_id)

    with xray_recorder.in_subsegment('persist'):
        save_to_dynamodb(result)

    return result

Production Use Cases

CloudWatch Alarms + SNS: Operational alerting
  CPU above 80%, 5xx error rate above 1%, SQS queue depth growing — each alarm can trigger an SNS notification or an Auto Scaling policy. The tight feedback loop between metric → alarm → action is the foundation of self-healing infrastructure.
CloudWatch Logs Insights: Ad-hoc log analysis
  Query across Lambda, ECS, and API Gateway logs in a single pane without exporting data. Significantly cheaper than Splunk or Datadog for basic log search — pay only for the bytes scanned, not a per-host license.
X-Ray: Distributed tracing across services
  Visualizes the full request path from API Gateway through Lambda to DynamoDB, with per-segment timing. Without distributed tracing, debugging latency regressions in microservices means correlating timestamps across multiple log streams — X-Ray does it automatically.
CloudTrail: Security audit and compliance
  Every AWS API call is logged — who did what, from which IP, and when. Required for SOC 2 and HIPAA compliance, and the first tool you reach for when investigating unauthorized resource changes or privilege escalation.

Infrastructure as Code

IaC treats infrastructure definitions as source code: version-controlled, repeatable, reviewable. On AWS, the native tool is CloudFormation; Terraform is the most popular third-party alternative.

CloudFormation

CloudFormation templates describe a stack — a collection of AWS resources. CloudFormation provisions, updates, and deletes them as a unit.

# cloudformation/api-stack.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Serverless API stack with Lambda, API Gateway, and DynamoDB

Parameters:
  Environment:
    Type: String
    Default: dev
    AllowedValues: [dev, staging, prod]
    Description: Deployment environment
  LambdaMemory:
    Type: Number
    Default: 512
    MinValue: 128
    MaxValue: 10240

Conditions:
  IsProd: !Equals [!Ref Environment, prod]

Resources:
  # DynamoDB Table
  UsersTable:
    Type: AWS::DynamoDB::Table
    DeletionPolicy: Retain
    Properties:
      TableName: !Sub '${Environment}-Users'
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: userId
          AttributeType: S
      KeySchema:
        - AttributeName: userId
          KeyType: HASH
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: !If [IsProd, true, false]
      Tags:
        - Key: Environment
          Value: !Ref Environment

  # IAM Execution Role
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: DynamoDBAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - dynamodb:GetItem
                  - dynamodb:PutItem
                  - dynamodb:DeleteItem
                  - dynamodb:Query
                Resource: !GetAtt UsersTable.Arn

  # Lambda Function
  ApiFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub '${Environment}-users-api'
      Runtime: python3.12
      Handler: handler.lambda_handler
      Role: !GetAtt LambdaExecutionRole.Arn
      MemorySize: !Ref LambdaMemory
      Timeout: 30
      Environment:
        Variables:
          TABLE_NAME: !Ref UsersTable
          ENVIRONMENT: !Ref Environment
      Code:
        ZipFile: |
          def lambda_handler(event, context):
              return {'statusCode': 200, 'body': 'ok'}

  # API Gateway
  ApiGateway:
    Type: AWS::ApiGateway::RestApi
    Properties:
      Name: !Sub '${Environment}-users-api'
      EndpointConfiguration:
        Types: [REGIONAL]

Outputs:
  ApiEndpoint:
    Description: API Gateway endpoint URL
    Value: !Sub 'https://${ApiGateway}.execute-api.${AWS::Region}.amazonaws.com/prod'
    Export:
      Name: !Sub '${AWS::StackName}-ApiEndpoint'

  UsersTableArn:
    Description: DynamoDB table ARN
    Value: !GetAtt UsersTable.Arn
    Export:
      Name: !Sub '${AWS::StackName}-UsersTableArn'
# Validate template syntax
aws cloudformation validate-template --template-body file://api-stack.yaml

# Create/update stack (create-or-update)
aws cloudformation deploy \
  --template-file api-stack.yaml \
  --stack-name my-api-dev \
  --parameter-overrides Environment=dev LambdaMemory=256 \
  --capabilities CAPABILITY_IAM \
  --no-fail-on-empty-changeset

# Preview changes before applying (changeset)
aws cloudformation create-change-set \
  --stack-name my-api-dev \
  --change-set-name preview \
  --template-body file://api-stack.yaml \
  --capabilities CAPABILITY_IAM

aws cloudformation describe-change-set \
  --stack-name my-api-dev \
  --change-set-name preview

# Describe stack events (debug failed deployments)
aws cloudformation describe-stack-events \
  --stack-name my-api-dev \
  --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'

# Delete stack
aws cloudformation delete-stack --stack-name my-api-dev

SAM — Serverless Application Model

SAM extends CloudFormation with shorthand for Lambda, API Gateway, and DynamoDB — saving 80% of the boilerplate for serverless apps.

# template.yaml (SAM)
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Globals:
  Function:
    Runtime: python3.12
    Timeout: 30
    Environment:
      Variables:
        TABLE_NAME: !Ref UsersTable

Resources:
  UsersFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handler.lambda_handler
      CodeUri: src/
      MemorySize: 512
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref UsersTable
      Events:
        GetUser:
          Type: Api
          Properties:
            Path: /users/{userId}
            Method: GET
        CreateUser:
          Type: Api
          Properties:
            Path: /users
            Method: POST

  UsersTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      PrimaryKey:
        Name: userId
        Type: String
# Install SAM CLI
brew tap aws/tap
brew install aws-sam-cli

# Build and run locally
sam build
sam local start-api          # Starts local API Gateway
sam local invoke UsersFunction --event event.json

# Deploy
sam deploy --guided         # First time: creates samconfig.toml
sam deploy                  # Subsequent deploys

CloudFormation vs Terraform — When to use which

Dimension        | CloudFormation                        | Terraform
Multi-cloud      | AWS only                              | AWS, GCP, Azure, + 1000s of providers
State management | AWS manages state in the stack        | You manage the state file (local or S3 backend)
Preview changes  | Change sets                           | terraform plan
Drift detection  | Built-in                              | terraform plan -refresh-only
AWS service lag  | Zero — new services available day 1   | Depends on provider; usually days to weeks
Language         | JSON/YAML                             | HCL (more expressive; supports loops/conditionals well)
Rollback         | Automatic on failure                  | Manual; no automatic rollback
Best for         | AWS-only shops, tight AWS integration | Multi-cloud, teams preferring HCL expressiveness

In practice: greenfield AWS-only projects often use CloudFormation (or CDK which compiles to CloudFormation). Teams with multi-cloud needs or existing Terraform expertise reach for Terraform.

Production Use Cases

CloudFormation: AWS-native single-account infrastructure
  Deep integration with every AWS service on day one, built-in drift detection, and no external state file to lose or corrupt. Choose over Terraform when you're AWS-only and want automatic rollback on stack failures.
Terraform: Multi-cloud and multi-provider management
  Manage AWS + Datadog + PagerDuty + GitHub in a single codebase with HCL's expressive loops and conditionals. The community module registry is unmatched. Choose when your infrastructure spans providers or your team has existing Terraform expertise.
SAM: Serverless application development
  CloudFormation superset that collapses Lambda + API Gateway + DynamoDB boilerplate by 80%. The killer feature is sam local invoke — run your Lambda locally against a real event payload before deploying, closing the feedback loop dramatically.
CDK: Complex infrastructure requiring programmatic logic
  Loops, conditionals, inheritance, and type safety in TypeScript or Python — expressing dynamic infrastructure (e.g., deploying N identical services from a list) in YAML becomes unmaintainable quickly. CDK compiles to CloudFormation, so you keep AWS's native rollback and drift detection.

Common Architecture Patterns

These are the building blocks that appear repeatedly in production AWS architectures.

Three-Tier Web Application

# Layer 1: Public-facing (Load Balancer in public subnet)
#   ALB → health checks, SSL termination, routing rules
#
# Layer 2: Application (ECS/EC2 in private app subnet)
#   Auto Scaling Group or ECS Service
#   No public IPs — only ALB can reach them via security group rule
#
# Layer 3: Data (RDS + ElastiCache in private data subnet)
#   Only application layer security group can connect
#
# Traffic flow:
#   Internet → Route 53 → CloudFront (optional CDN) → ALB
#             → ECS tasks → RDS (reads from replica, writes to primary)
#                         → ElastiCache (cache layer)
#
# Outbound from private subnets:
#   App/Data subnet → NAT Gateway (public subnet) → Internet Gateway → Internet

# Key security group rules:
# ALB SG: inbound 443 from 0.0.0.0/0
# App SG: inbound 8080 from ALB SG only
# DB SG:  inbound 5432 from App SG only

Serverless API Pattern

# API Gateway + Lambda + DynamoDB
#
# Request flow:
#   Client → API Gateway → Lambda → DynamoDB
#
# API Gateway handles:
#   - TLS termination
#   - Request throttling (e.g., 10,000 RPS)
#   - API key management
#   - Request/response transformation
#   - Caching (optional)
#
# Lambda handles:
#   - Business logic
#   - Input validation
#   - Auth (via Lambda authorizer or Cognito)
#
# DynamoDB handles:
#   - Data persistence
#   - Single-digit ms read/write at any scale
#
# Cost characteristics:
#   - Zero cost at zero traffic (pay per invocation)
#   - Auto-scales to millions of RPS without config
#   - No servers to manage or patch

Event-Driven Fan-Out (SNS + SQS)

# Pattern: one event triggers multiple independent consumers
#
# SNS Topic: order-events
#   ├── SQS Queue: order-fulfillment   → Lambda: reserve inventory
#   ├── SQS Queue: order-email         → Lambda: send confirmation email
#   └── SQS Queue: order-analytics     → Lambda: update business metrics
#
# Why SNS → SQS (not SNS → Lambda directly)?
#   - SQS acts as a buffer: Lambda invocations are throttled by SQS batch size
#   - Dead-letter queues on SQS catch failures without losing events
#   - Each consumer scales independently
#   - Consumer can be paused (stop polling) without losing messages
#

# Producer publishes ONE message to SNS
aws sns publish \
  --topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
  --message '{"orderId":"abc123","userId":"user456","total":99.99}'

# All three SQS queues receive a copy simultaneously
# Each Lambda processes independently, at its own pace

Static Site (S3 + CloudFront)

# Architecture:
#   GitHub Actions → Build (npm run build) → S3 upload → CloudFront invalidation
#
# S3: hosts built static files (HTML/CSS/JS)
# CloudFront: CDN — caches at edge locations globally, serves HTTPS

# 1. Create S3 bucket for static hosting
aws s3 mb s3://my-static-site --region us-east-1
aws s3 website s3://my-static-site \
  --index-document index.html \
  --error-document 404.html

# 2. Create CloudFront distribution pointing to S3 origin
# (typically done via console or CloudFormation — CLI is verbose)

# 3. Deploy: sync build output, then invalidate CDN cache
aws s3 sync ./dist/ s3://my-static-site/ \
  --delete \
  --cache-control "public, max-age=31536000, immutable"

# Invalidate CloudFront cache (HTML files should not be cached long)
aws cloudfront create-invalidation \
  --distribution-id ABCDEFGHIJKLMN \
  --paths "/*"

# 4. Custom domain: Route 53 alias record → CloudFront distribution
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE123 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z2FDTNDATAQYW2",
          "DNSName": "d111111abcdef8.cloudfront.net",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
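One refinement to the sync step above: fingerprinted assets (hashed filenames like `app.3f9c2b.js`) can be cached forever, while HTML should stay short-lived so deploys propagate quickly. A sketch of a per-extension Cache-Control policy (the extension split is an assumption; adapt it to your build output):

```python
from pathlib import PurePosixPath

def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header per file type."""
    suffix = PurePosixPath(path).suffix
    if suffix in {".js", ".css", ".woff2", ".png", ".svg"}:
        return "public, max-age=31536000, immutable"   # fingerprinted: cache forever
    if suffix in {".html", ""}:
        return "public, max-age=60"                    # HTML: revalidate quickly
    return "public, max-age=86400"                     # everything else: one day

print(cache_control_for("assets/app.3f9c2b.js"))
# public, max-age=31536000, immutable
print(cache_control_for("index.html"))
# public, max-age=60
```

With a split like this you can run one `aws s3 sync` per policy (using `--exclude`/`--include` filters) and limit CloudFront invalidations to the HTML paths instead of invalidating `/*` on every deploy.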
Additional Patterns: Blue/Green & Canary Deployments

Blue/Green Deployment

Run two identical environments (blue = current, green = new). Switch traffic instantly at the load balancer or Route 53 level. Rollback is instant — point traffic back to blue.

  • ECS: CodeDeploy manages blue/green at the target group level
  • Lambda: Use aliases and weighted routing between two function versions
  • Elastic Beanstalk: Swap environment URLs

Canary Deployment

Gradually shift traffic to the new version. Start at 1%, watch error rates, then increase to 10%, 50%, 100%.
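Weighted routing is commonly implemented by hashing a stable request attribute into [0, 1) and comparing it against the canary weight, which keeps a given caller pinned to one version for the whole rollout. A deterministic sketch (the version labels v4/v5 are placeholders matching the Lambda example below):

```python
import hashlib

def pick_version(request_id: str, canary_weight: float,
                 stable: str = "v4", canary: str = "v5") -> str:
    """Hash the request ID to a uniform bucket in [0, 1); route the
    fraction below canary_weight to the new version."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return canary if bucket < canary_weight else stable

# Roughly 10% of distinct request IDs land on the canary:
hits = sum(pick_version(f"req-{i}", 0.10) == "v5" for i in range(10_000))
print(hits / 10_000)   # close to 0.10
```

Hashing a user or session ID instead of a request ID gives sticky canaries, so one user never flip-flops between versions mid-session.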

# Lambda canary via alias weighted routing: the alias stays on the old
# version (4) and shifts 10% of traffic to the new version (5)
aws lambda update-alias \
  --function-name my-api \
  --name LIVE \
  --function-version 4 \
  --routing-config 'AdditionalVersionWeights={"5"=0.1}'
# 90% to version 4 (old), 10% to version 5 (new)

# After validating, promote and clear the weighted routing:
aws lambda update-alias \
  --function-name my-api \
  --name LIVE \
  --function-version 5 \
  --routing-config 'AdditionalVersionWeights={}'
# 100% to version 5
Practice with LocalStack + SAM
The fastest way to internalize these patterns is to build them. Use LocalStack for the data services (S3, DynamoDB, SQS) and sam local for Lambda + API Gateway. You can run a fully functional serverless API on your laptop without touching a real AWS account.