AWS Refresher
Core services, IAM, networking, and infrastructure-as-code essentials
Setup & Environment
Before working with AWS, you need the CLI configured and ideally a local sandbox. LocalStack lets you iterate fast without incurring costs.
Install & Configure AWS CLI
# Install via Homebrew (macOS)
brew install awscli
# Verify installation
aws --version
# aws-cli/2.x.x Python/3.x.x ...
# Interactive configuration wizard
aws configure
# AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
# AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Default region name [None]: us-east-1
# Default output format [None]: json
# View stored config
cat ~/.aws/config
cat ~/.aws/credentials
# Use named profiles for multiple accounts
aws configure --profile staging
aws s3 ls --profile staging
# Set environment variables (useful in CI/CD)
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/...
export AWS_DEFAULT_REGION=us-east-1
LocalStack for Local Development
# Run LocalStack via Docker
docker run -d \
--name localstack \
-p 4566:4566 \
-e SERVICES=s3,sqs,sns,lambda,dynamodb,iam \
localstack/localstack
# Verify it's running
docker ps | grep localstack
curl http://localhost:4566/_localstack/health
# Point AWS CLI at LocalStack with --endpoint-url
aws --endpoint-url=http://localhost:4566 s3 ls
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-test-bucket
aws --endpoint-url=http://localhost:4566 s3 ls
# Create an alias so you don't repeat the flag
alias awslocal='aws --endpoint-url=http://localhost:4566'
awslocal sqs create-queue --queue-name my-queue
awslocal dynamodb list-tables
# docker-compose.yml for persistent LocalStack setup
version: '3.8'
services:
  localstack:
    image: localstack/localstack:latest
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3,sqs,sns,lambda,dynamodb,secretsmanager
      - DEBUG=1
      - DATA_DIR=/tmp/localstack/data
    volumes:
      - "./localstack-data:/tmp/localstack/data"
      - "/var/run/docker.sock:/var/run/docker.sock"
Never commit credentials to version control. Keep secrets in ~/.aws/credentials or environment variables. Add .env, *.pem, and credentials to .gitignore. Use IAM roles (not access keys) for production workloads running on AWS.
Core Concepts
AWS organizes its infrastructure around geographic and logical boundaries. Understanding these concepts is prerequisite knowledge for every other service.
Regions, Availability Zones, and Edge Locations
| Concept | What it is | Examples |
|---|---|---|
| Region | Geographically isolated cluster of data centers. Each region is independent and contains multiple AZs. | us-east-1 (N. Virginia), eu-west-1 (Ireland), ap-southeast-1 (Singapore) |
| Availability Zone (AZ) | One or more discrete data centers within a region, connected by low-latency links. Each AZ has independent power, cooling, and networking. | us-east-1a, us-east-1b, us-east-1c |
| Edge Location | Mini data centers used by CloudFront CDN and Route 53 DNS to serve content closer to end users. Not full regions. | 200+ locations globally (NYC, London, Tokyo...) |
| Local Zone | Extensions of a region placed in metro areas for single-digit millisecond latency to specific cities. | us-east-1-bos-1 (Boston) |
Global vs. Regional Services
| Scope | Services | Why global? |
|---|---|---|
| Global | IAM, Route 53, CloudFront, WAF, Organizations | Identity and DNS must be consistent everywhere |
| Regional | EC2, S3, RDS, Lambda, VPC, SQS, SNS, ECS, EKS | Data residency, fault isolation, latency optimization |
| AZ-scoped | EC2 instances, EBS volumes, subnets | Physical hardware tied to specific data centers |
ARNs — Amazon Resource Names
Every AWS resource has a unique ARN. Understanding the format matters when writing IAM policies and CloudFormation templates.
# ARN format
arn:partition:service:region:account-id:resource-type/resource-id
# Examples
arn:aws:s3:::my-bucket # S3 bucket (global, no region/account)
arn:aws:s3:::my-bucket/path/to/object # S3 object
arn:aws:iam::123456789012:user/alice # IAM user (global, no region)
arn:aws:iam::123456789012:role/MyRole # IAM role
arn:aws:ec2:us-east-1:123456789012:instance/i-1234567890abcdef0
arn:aws:lambda:us-east-1:123456789012:function:my-function
arn:aws:sqs:us-east-1:123456789012:my-queue
arn:aws:dynamodb:us-east-1:123456789012:table/Users
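Since ARNs are colon-delimited with a fixed field order, they can be split mechanically. A minimal sketch (parse_arn is an illustrative helper, not part of any AWS SDK):

```python
def parse_arn(arn: str) -> dict:
    """Split an ARN into its six colon-delimited fields. Note that S3 ARNs
    leave region and account empty, IAM ARNs leave region empty, and the
    resource field may itself contain ':' or '/' separators."""
    parts = arn.split(":", 5)  # at most 6 fields; the resource may contain ':'
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn}")
    _, partition, service, region, account, resource = parts
    return {
        "partition": partition,
        "service": service,
        "region": region or None,    # empty string -> None for global services
        "account": account or None,
        "resource": resource,
    }
```

This is handy when writing IAM policies: checking which field a wildcard lands in catches a surprising number of policy bugs.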
Shared Responsibility Model
AWS and the customer share security responsibilities. Knowing the boundary prevents misconfigurations.
| AWS Responsible For | You Responsible For |
|---|---|
| Physical hardware, data centers, networking | Data encryption at rest and in transit |
| Hypervisor and host OS patching | Guest OS patching (EC2 instances) |
| Managed service patching (RDS, Lambda runtime) | Application-level security, IAM policies |
| Global infrastructure availability | Network configuration, security groups, NACLs |
| Compliance certifications (SOC 2, PCI DSS) | Enabling compliance for your workloads on top |
IAM — Identity and Access Management
IAM is the access control system for all of AWS. It is global (not region-scoped). Mistakes here are the most common source of both security breaches and confusing permission errors.
Principals: Users, Groups, and Roles
| Principal | Purpose | When to use |
|---|---|---|
| IAM User | Long-term credentials (password + access keys) for a person or service | Human developers, legacy automation. Prefer roles for EC2/Lambda. |
| IAM Group | Collection of users; attach policies to groups rather than individual users | Team-level permissions (Developers, ReadOnly, Admins) |
| IAM Role | Temporary credentials assumed by a service, user, or external identity | EC2 instance profiles, Lambda execution, cross-account access |
| Service Principal | AWS service identity (e.g., lambda.amazonaws.com) | Trust policies: allow a service to assume a role |
Policy Structure
Policies are JSON documents. Every policy statement contains: Effect, Action, Resource, and optionally Condition.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ReadOnMyBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    },
    {
      "Sid": "DenyDeleteOutsideUsEast1",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::prod-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": "us-east-1"
        }
      }
    }
  ]
}
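The evaluation logic behind these statements follows a fixed order: an explicit Deny beats any Allow, and anything unmatched is implicitly denied. A toy sketch under that simplification (evaluate is a made-up helper; real IAM evaluation also involves Conditions, Principals, and multiple policy types):

```python
import fnmatch

def evaluate(statements: list, action: str, resource: str) -> str:
    """Toy IAM evaluation: an explicit Deny always wins, a matching Allow
    grants access, and anything unmatched is implicitly denied.
    Wildcards in Action/Resource are matched with fnmatch; Conditions,
    Principals, and policy types are ignored."""
    decision = "ImplicitDeny"
    for stmt in statements:
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        if not (any(fnmatch.fnmatch(action, a) for a in actions)
                and any(fnmatch.fnmatch(resource, r) for r in resources)):
            continue
        if stmt["Effect"] == "Deny":
            return "Deny"  # explicit deny short-circuits: nothing can override it
        decision = "Allow"
    return decision
```

The short-circuit on Deny is the property to remember when debugging permission errors: adding a broader Allow never fixes a request blocked by an explicit Deny.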
Assume Role
Roles are assumed via STS (Security Token Service), which returns temporary credentials valid for 15 minutes to 12 hours.
# Assume a role from the CLI
aws sts assume-role \
--role-arn arn:aws:iam::123456789012:role/DeployRole \
--role-session-name deploy-session
# Returns: AccessKeyId, SecretAccessKey, SessionToken
# Export them to use in subsequent commands
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
# Verify which identity you're using
aws sts get-caller-identity
// Trust policy — allows EC2 to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

// Cross-account trust — allows account 987654321098 to assume this role
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::987654321098:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        }
      }
    }
  ]
}
Instance Profiles
An instance profile is a container for an IAM role that gets attached to an EC2 instance. The instance automatically retrieves temporary credentials from the instance metadata endpoint.
# From inside an EC2 instance, credentials come from the instance metadata service.
# With IMDSv2 (recommended, and default on newer instances), request a session token first:
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Returns the role name, then:
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/security-credentials/MyRoleName
# SDKs and CLI automatically use instance profile credentials — no config needed
# This is why you should NEVER put access keys on EC2 instances
IAM Best Practices Checklist
- Least privilege: start with no permissions, add only what's needed
- Enable MFA on the root account and all human IAM users
- Never use root for day-to-day work — create an admin IAM user instead
- No long-term access keys on EC2/Lambda — use instance profiles and execution roles
- Rotate access keys regularly; delete unused ones
- Use IAM groups to manage permissions at scale, not individual users
- Prefer managed policies (AWS- or customer-managed) over inline policies where possible: they are versioned and reusable across principals
- Use conditions to restrict by IP, MFA, time, or source VPC
- Enable CloudTrail to audit all IAM and API calls
- Review IAM Access Analyzer to find external access to resources
Compute: EC2
EC2 (Elastic Compute Cloud) provides resizable virtual machines. It is the foundation of most AWS compute architectures, even when you are using higher-level services that run on top of it.
Instance Type Families
| Family | Optimized for | Common types | Use case |
|---|---|---|---|
| t3 / t4g | Burstable general purpose | t3.micro, t3.small, t3.medium | Dev/test, low-traffic web servers |
| m5 / m6i | Balanced compute/memory | m5.large, m5.xlarge, m5.4xlarge | Web servers, app servers, small databases |
| c5 / c6i | Compute-intensive | c5.large, c5.2xlarge, c5.9xlarge | Batch processing, ML inference, video encoding |
| r5 / r6i | Memory-intensive | r5.large, r5.4xlarge, r5.24xlarge | In-memory databases, large caches, analytics |
| g4dn / g5 | GPU accelerated | g4dn.xlarge, g5.2xlarge | ML training, GPU rendering, gaming |
| i3 / i4i | Storage-optimized NVMe | i3.large, i3.2xlarge | NoSQL databases, data warehousing |
Pricing Models
| Model | Description | Savings vs on-demand | Best for |
|---|---|---|---|
| On-Demand | Pay per second/hour, no commitment | Baseline | Unpredictable workloads, short-term |
| Reserved Instances | 1- or 3-year commitment to a specific instance type | Up to 72% | Stable, predictable baseline load |
| Savings Plans | Flexible commitment to spend $/hr; applies across instance types | Up to 66% | Predictable spend, flexible instance types |
| Spot Instances | Spare capacity at up to 90% discount; AWS can reclaim with 2-min notice | Up to 90% | Fault-tolerant batch jobs, ML training |
| Dedicated Hosts | Physical server dedicated to your account | Varies | Compliance, license requirements |
Launching an Instance (CLI)
# Find the latest Amazon Linux 2023 AMI
aws ec2 describe-images \
--owners amazon \
--filters "Name=name,Values=al2023-ami-*" "Name=architecture,Values=x86_64" \
--query 'sort_by(Images, &CreationDate)[-1].ImageId' \
--output text
# Launch an instance
aws ec2 run-instances \
--image-id ami-0abcdef1234567890 \
--instance-type t3.micro \
--key-name my-key-pair \
--security-group-ids sg-12345678 \
--subnet-id subnet-12345678 \
--iam-instance-profile Name=MyInstanceProfile \
--user-data file://bootstrap.sh \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server}]'
# Check instance status
aws ec2 describe-instances \
--filters "Name=tag:Name,Values=web-server" \
--query 'Reservations[*].Instances[*].[InstanceId,State.Name,PublicIpAddress]' \
--output table
# Push a temporary SSH public key via EC2 Instance Connect (valid ~60s), then ssh as usual
aws ec2-instance-connect send-ssh-public-key \
--instance-id i-1234567890abcdef0 \
--instance-os-user ec2-user \
--ssh-public-key file://~/.ssh/id_rsa.pub
User Data Script
User data runs once at first boot as root. Use it to install software, configure the instance, and start services.
#!/bin/bash
# bootstrap.sh — runs at first launch as root
set -e
yum update -y
# Install Docker
yum install -y docker
systemctl enable docker
systemctl start docker
usermod -aG docker ec2-user
# Install application
yum install -y git
git clone https://github.com/myorg/myapp /opt/myapp
cd /opt/myapp
# Start with systemd
cat > /etc/systemd/system/myapp.service <<EOF
[Unit]
Description=My Application
After=network.target
[Service]
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/python3 app.py
Restart=always
User=ec2-user
[Install]
WantedBy=multi-user.target
EOF
systemctl enable myapp
systemctl start myapp
Production Use Cases
| Use Case | Why This Service |
|---|---|
| Stateful workloads (databases, caches) | Needs persistent local NVMe storage and a consistent network identity across restarts; Lambda's ephemeral execution environment and stateless model make this impossible. |
| GPU / ML training (p4d, g5 instances) | Fargate has no GPU support; EC2 gives you direct PCIe access to A100/A10G GPUs and lets you tune CUDA drivers. For one-off training runs, spot instances cut costs 60–90%. |
| Legacy app migration (lift-and-shift) | You control the OS, runtime, and network stack — zero application refactoring required. Use this as a stepping stone; don't treat it as a destination. |
| Fault-tolerant batch processing on spot | Spot Fleet + checkpointing to S3 delivers 60–90% cost savings. The key insight: if your job can resume from a checkpoint, preemption is cheap. |
Compute: Lambda
Lambda is AWS's serverless compute service. You upload code, define a handler function, and AWS manages everything else: servers, OS, scaling, high availability. You pay only for compute time consumed — down to 1ms granularity.
The Serverless Model
| Concept | Description |
|---|---|
| Function | Your deployment unit — code + dependencies + configuration |
| Handler | Entry point: module.function_name (e.g., handler.lambda_handler) |
| Event | JSON payload delivered to the handler — shape varies by trigger |
| Context | Runtime info: function name, remaining time, log stream, request ID |
| Execution environment | Micro-VM (Firecracker) — frozen between invocations, reused when warm |
| Cold start | First invocation after idle: environment initialization adds 100ms–2s latency |
| Concurrency | Each simultaneous invocation gets its own environment; default limit 1000/region |
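Billing is a request charge plus a duration charge in GB-seconds, metered at 1 ms granularity. A back-of-envelope estimator; the default prices are us-east-1 x86 list prices at the time of writing (verify against current pricing), and lambda_monthly_cost is an illustrative helper, not an AWS API:

```python
def lambda_monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int,
                        price_per_gb_second: float = 0.0000166667,
                        price_per_million_requests: float = 0.20) -> float:
    """Estimate monthly Lambda cost: request charge + duration charge.
    Duration is billed in GB-seconds; the free tier is not modeled."""
    gb_seconds = invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)
    duration_charge = gb_seconds * price_per_gb_second
    request_charge = (invocations / 1_000_000) * price_per_million_requests
    return duration_charge + request_charge

# 1M requests/month at 100 ms and 512 MB comes out on the order of a dollar
cost = lambda_monthly_cost(1_000_000, 100, 512)
```

Doubling memory doubles the duration charge but often more than halves the duration (more CPU is allocated with memory), so the cheapest setting is found by measuring, not guessing.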
Python Handler Example
import json
import logging
import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize outside handler — reused across warm invocations
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Users')

def lambda_handler(event, context):
    """
    Processes an API Gateway proxy event.

    Args:
        event: API Gateway event dict with httpMethod, path, body, headers, etc.
        context: Lambda context with function_name, aws_request_id, etc.

    Returns:
        API Gateway response dict with statusCode, headers, body.
    """
    logger.info("Request ID: %s", context.aws_request_id)
    logger.info("Event: %s", json.dumps(event))

    http_method = event.get('httpMethod', 'GET')
    path_params = event.get('pathParameters') or {}
    user_id = path_params.get('userId')

    if not user_id:
        return _response(400, {'error': 'userId path parameter is required'})

    try:
        if http_method == 'GET':
            result = table.get_item(Key={'userId': user_id})
            user = result.get('Item')
            if not user:
                return _response(404, {'error': 'User not found'})
            return _response(200, user)
        elif http_method == 'DELETE':
            table.delete_item(Key={'userId': user_id})
            # 204 No Content must not carry a response body
            return {'statusCode': 204, 'headers': {'Access-Control-Allow-Origin': '*'}, 'body': ''}
        else:
            return _response(405, {'error': f'Method {http_method} not allowed'})
    except ClientError as e:
        error_code = e.response['Error']['Code']
        logger.error("DynamoDB error: %s", error_code)
        return _response(500, {'error': 'Internal server error'})

def _response(status_code: int, body: dict) -> dict:
    return {
        'statusCode': status_code,
        'headers': {
            'Content-Type': 'application/json',
            'Access-Control-Allow-Origin': '*',
        },
        'body': json.dumps(body),
    }
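The handler above fails fast on any ClientError; for throttling errors specifically, retrying with exponential backoff and full jitter is the usual pattern. A self-contained sketch (call_with_retries and the .code attribute are illustrative stand-ins for boto3's e.response['Error']['Code']; in practice botocore's built-in retry modes, configured via botocore.config.Config, cover this):

```python
import random

RETRYABLE = {"ThrottlingException", "ProvisionedThroughputExceededException"}

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng=random.random) -> list:
    """Full-jitter exponential backoff: delay_i = uniform(0, min(cap, base * 2**i))."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]

def call_with_retries(fn, attempts: int = 4, sleep=lambda s: None):
    """Retry fn() when it raises an error whose .code is retryable; anything
    else, or an exhausted attempt budget, re-raises. sleep is injectable so
    the sketch is testable without real delays."""
    delays = backoff_delays(attempts)
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if getattr(exc, "code", None) not in RETRYABLE or i == attempts - 1:
                raise
            sleep(delays[i])
```

Jitter matters: without it, every throttled client retries at the same instant and re-creates the spike that caused the throttling.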
Common Triggers
| Trigger | Invocation type | Retry behavior |
|---|---|---|
| API Gateway / Function URL | Synchronous | Client retries — no automatic retry |
| S3 (object created/deleted) | Asynchronous | 2 retries, then dead-letter queue |
| SQS queue | Poll-based | Message returns to queue on failure; DLQ after maxReceiveCount |
| SNS topic | Asynchronous | 2 retries, then DLQ |
| DynamoDB Streams | Poll-based | Retries until record expires (24h) or DLQ |
| EventBridge (CloudWatch Events) | Asynchronous | 2 retries |
| CloudWatch Logs subscription | Asynchronous | 2 retries |
Key Configuration
# Deploy a function
aws lambda create-function \
--function-name my-api \
--runtime python3.12 \
--handler handler.lambda_handler \
--role arn:aws:iam::123456789012:role/LambdaExecRole \
--zip-file fileb://function.zip \
--timeout 30 \
--memory-size 512 \
--environment Variables='{TABLE_NAME=Users,LOG_LEVEL=INFO}'
# Update code
aws lambda update-function-code \
--function-name my-api \
--zip-file fileb://function.zip
# Set concurrency limit (protect downstream services)
aws lambda put-function-concurrency \
--function-name my-api \
--reserved-concurrent-executions 100
# Enable provisioned concurrency (eliminate cold starts)
aws lambda put-provisioned-concurrency-config \
--function-name my-api \
--qualifier LIVE \
--provisioned-concurrent-executions 10
# Invoke synchronously
aws lambda invoke \
--function-name my-api \
--payload '{"httpMethod":"GET","pathParameters":{"userId":"abc123"}}' \
--cli-binary-format raw-in-base64-out \
response.json
cat response.json
Production Use Cases
| Use Case | Why This Service |
|---|---|
| API backend (API Gateway + Lambda) | Zero infra management, per-request billing, and automatic scaling to thousands of concurrent requests. Choose over EC2 when traffic is spiky or unpredictable — idle EC2 burns money, idle Lambda costs nothing. |
| Event-driven file processing (S3 → Lambda) | Triggered on upload, process-and-forget — no idle compute cost. Canonical example: thumbnail generation or CSV parsing where you want exactly-once semantics tied to object creation. |
| Scheduled tasks (EventBridge → Lambda) | Replaces cron on EC2 with no server to maintain and built-in retry on failure. The EC2 cron approach requires keeping an instance alive 24/7 for a job that runs for seconds. |
| Stream processing (Kinesis / DynamoDB Streams → Lambda) | Real-time processing with built-in batching and checkpointing. Simpler and cheaper than running Flink or Spark Streaming for low-to-medium volume streams where you don't need complex windowing. |
| Service glue (SQS → DynamoDB, SNS → Slack) | Short-lived, stateless transformations between services are Lambda's sweet spot. Adding EC2 here is engineering overhead with no benefit — Lambda scales to zero between bursts automatically. |
Lambda Cold Start Mitigation Strategies
- Provisioned Concurrency: Pre-warms N environments. Eliminates cold starts for that capacity. Costs extra.
- Keep functions warm: CloudWatch Events rule that pings the function every 5 minutes. Free but only works for low-concurrency functions.
- Minimize package size: Smaller zip = faster initialization. Use Lambda layers for large shared dependencies.
- Choose faster runtimes: Python and Node have faster cold starts than Java and .NET. Go compiles to a binary (very fast).
- Init code outside handler: SDK clients, DB connections, config loading — do this once at module load, reuse across invocations.
- SnapStart (Java): snapshots the initialized execution environment and restores it on invoke; available for Java 11+ (and, more recently, some other runtimes). Cuts cold starts from several seconds to sub-second.
Storage: S3
S3 (Simple Storage Service) is object storage with 11 nines (99.999999999%) of durability. Objects are stored in buckets, are addressed by key, and can range from 0 bytes to 5 TB.
Storage Classes
| Class | Use case | Retrieval | Min storage duration |
|---|---|---|---|
| Standard | Frequently accessed data | Milliseconds | None |
| Standard-IA | Infrequently accessed, needs fast retrieval | Milliseconds | 30 days |
| One Zone-IA | Infrequent access, single AZ (cheaper) | Milliseconds | 30 days |
| Intelligent-Tiering | Unknown or changing access patterns | Milliseconds | None |
| Glacier Instant Retrieval | Archive, quarterly access | Milliseconds | 90 days |
| Glacier Flexible Retrieval | Archive, occasional access | 1–12 hours | 90 days |
| Glacier Deep Archive | Long-term archive, once-a-year access | Up to 48 hours | 180 days |
Common S3 CLI Commands
# Create a bucket (bucket names are globally unique)
aws s3 mb s3://my-unique-bucket-name --region us-east-1
# Upload a file
aws s3 cp ./local-file.txt s3://my-bucket/remote-path/file.txt
# Upload with specific storage class
aws s3 cp ./archive.zip s3://my-bucket/archives/ \
--storage-class GLACIER
# Sync a directory (only uploads changed/new files)
aws s3 sync ./build/ s3://my-bucket/static/ \
--delete \
--cache-control "max-age=86400"
# Download
aws s3 cp s3://my-bucket/file.txt ./downloaded.txt
aws s3 sync s3://my-bucket/data/ ./local-data/
# List objects
aws s3 ls s3://my-bucket/
aws s3 ls s3://my-bucket/ --recursive --human-readable --summarize
# Delete
aws s3 rm s3://my-bucket/old-file.txt
aws s3 rm s3://my-bucket/old-prefix/ --recursive
# Generate a presigned URL (valid for 1 hour)
aws s3 presign s3://my-bucket/private-file.pdf --expires-in 3600
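Presigned URLs work because the SigV4 signature embedded in the query string can be computed by anyone holding the secret key. The signing key itself is derived by a chain of HMAC-SHA256 steps, which this stdlib-only sketch reproduces (a full request signer also needs the canonical request and string-to-sign; the key shown is AWS's documentation example key, not a real credential):

```python
import hashlib
import hmac

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key: chained HMAC-SHA256 over the date
    (YYYYMMDD), region, service, and the literal string 'aws4_request'."""
    def sign(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, "aws4_request")

# Each derived key is scoped to a single day, region, and service,
# which bounds the damage if a signature (not the secret) leaks.
key = sigv4_signing_key("wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", "20240101", "us-east-1", "s3")
```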
Bucket Policies vs ACLs
// Bucket policy: allow public read of all objects (for static site hosting)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-static-site/*"
    }
  ]
}

// Bucket policy: enforce HTTPS only
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonHTTPS",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}
Lifecycle Policies & Versioning
// Lifecycle policy: transition to cheaper storage, then expire
{
  "Rules": [
    {
      "ID": "archive-and-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      },
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 30
      }
    }
  ]
}
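The rule above is effectively a step function from object age to storage class. A sketch that mirrors it (storage_class_for_age is illustrative, not an AWS API):

```python
def storage_class_for_age(age_days: int, transitions: list, expire_after: int):
    """Map an object's age to the storage class a lifecycle rule would put it
    in, or None once the Expiration action deletes it."""
    if age_days >= expire_after:
        return None
    current = "STANDARD"
    # apply transitions in ascending Days order; the last one reached wins
    for t in sorted(transitions, key=lambda t: t["Days"]):
        if age_days >= t["Days"]:
            current = t["StorageClass"]
    return current

transitions = [
    {"Days": 30, "StorageClass": "STANDARD_IA"},
    {"Days": 90, "StorageClass": "GLACIER"},
]
```

Walking a few ages through this mapping is a quick sanity check that a rule's transitions respect the minimum-storage-duration constraints from the storage class table.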
# Enable versioning
aws s3api put-bucket-versioning \
--bucket my-bucket \
--versioning-configuration Status=Enabled
# List all versions of an object
aws s3api list-object-versions --bucket my-bucket --prefix my-file.txt
# Restore a specific version
aws s3api get-object \
--bucket my-bucket \
--key my-file.txt \
--version-id abc123def456 \
restored.txt
Production Use Cases
| Use Case | Why This Service |
|---|---|
| Data lake foundation | Unlimited storage at $0.023/GB with native integration to Athena, Spark, and Redshift Spectrum. Parquet + partitioned prefixes give you columnar scan performance without running a warehouse — query only the partitions you need. |
| Static website hosting (S3 + CloudFront) | Global CDN with TLS for pennies per GB served. No servers, no OS patches, 11-nines durability for your assets. The gap between this and a running EC2 instance is both cost and operational burden. |
| Backup and archive | Lifecycle rules auto-tier objects to Glacier Deep Archive at $0.00099/GB — 23x cheaper than S3 Standard. Object Lock enforces WORM compliance for regulatory retention requirements without custom logic. |
| ML training data versioning | S3 versioning gives you dataset snapshots with zero overhead; SageMaker reads directly from S3. Compare to maintaining a separate data versioning system — S3 versioning is already there. |
| Event-driven pipeline triggers | S3 event notifications → Lambda/SQS decouple data ingestion from processing. No polling loop required; AWS delivers the notification within seconds of the PUT. |
Databases
AWS offers multiple managed database services. Choosing the right one is a critical architectural decision driven by data model, access patterns, and consistency requirements.
RDS — Relational Database Service
RDS manages common relational databases: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora. AWS handles backups, patching, replication, and failover.
# Create a PostgreSQL RDS instance
aws rds create-db-instance \
--db-instance-identifier prod-postgres \
--db-instance-class db.t3.medium \
--engine postgres \
--engine-version 16.1 \
--master-username admin \
--master-user-password supersecret \
--db-name myapp \
--allocated-storage 100 \
--storage-type gp3 \
--multi-az \
--backup-retention-period 7 \
--no-publicly-accessible \
--vpc-security-group-ids sg-12345678
# Create a read replica
aws rds create-db-instance-read-replica \
--db-instance-identifier prod-postgres-read \
--source-db-instance-identifier prod-postgres
DynamoDB — Key-Value / Document Store
DynamoDB is a fully managed NoSQL database with single-digit millisecond performance at any scale. The data model centers on a partition key and optional sort key.
| Concept | Description |
|---|---|
| Partition key | Required. Determines which partition the item lives in. Must uniquely identify items (when no sort key exists). |
| Sort key | Optional. Together with partition key, forms a composite primary key. Enables range queries. |
| GSI | Global Secondary Index — alternate access pattern with different partition/sort key. Eventual consistency. |
| LSI | Local Secondary Index — same partition key, different sort key. Created at table creation time only. |
| On-demand mode | Pay per request. Scales instantly. Good for unpredictable workloads. |
| Provisioned mode | Set RCU/WCU. Can use auto-scaling. Cheaper at predictable load. |
# Create a table with composite key
aws dynamodb create-table \
--table-name Orders \
--attribute-definitions \
AttributeName=userId,AttributeType=S \
AttributeName=orderId,AttributeType=S \
AttributeName=createdAt,AttributeType=S \
--key-schema \
AttributeName=userId,KeyType=HASH \
AttributeName=orderId,KeyType=RANGE \
--billing-mode PAY_PER_REQUEST \
--global-secondary-indexes '[
{
"IndexName": "CreatedAtIndex",
"KeySchema": [
{"AttributeName": "userId", "KeyType": "HASH"},
{"AttributeName": "createdAt", "KeyType": "RANGE"}
],
"Projection": {"ProjectionType": "ALL"}
}
]'
# Put an item
aws dynamodb put-item \
--table-name Orders \
--item '{
"userId": {"S": "user#abc"},
"orderId": {"S": "order#001"},
"createdAt": {"S": "2026-02-23T10:00:00Z"},
"status": {"S": "pending"},
"total": {"N": "49.99"}
}'
# Query items by partition key + sort key condition
aws dynamodb query \
--table-name Orders \
--key-condition-expression "userId = :uid AND begins_with(orderId, :prefix)" \
--expression-attribute-values '{
":uid": {"S": "user#abc"},
":prefix": {"S": "order#"}
}'
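The user#abc / order#001 values follow the common single-table convention of entity-prefixed keys, which is what makes the begins_with condition useful. A small sketch of helpers for composing keys in the low-level attribute-value format (the helper names are hypothetical):

```python
def order_key(user_id: str, order_id: str) -> dict:
    """Compose the Orders table primary key in DynamoDB's low-level
    attribute-value format, using the entity#id prefix convention above."""
    return {
        "userId": {"S": f"user#{user_id}"},
        "orderId": {"S": f"order#{order_id}"},
    }

def filter_begins_with(items: list, sort_attr: str, prefix: str) -> list:
    """Client-side mirror of the begins_with(sortKey, :prefix) condition,
    useful for reasoning about what a query would return."""
    return [item for item in items if item[sort_attr]["S"].startswith(prefix)]

items = [
    order_key("abc", "001"),
    {"userId": {"S": "user#abc"}, "orderId": {"S": "cart#9"}},
]
orders = filter_begins_with(items, "orderId", "order#")
```

Because items with different prefixes (order#, cart#) share a partition key, one query per user fetches either slice cheaply; that is the core idea of single-table design.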
ElastiCache
Managed in-memory caching. Two engines: Redis (data structures, persistence, pub/sub, clustering) and Memcached (simple key-value, multi-threaded, no persistence).
# Create a Redis cluster
aws elasticache create-replication-group \
--replication-group-id my-redis \
--replication-group-description "App cache" \
--engine redis \
--engine-version 7.0 \
--cache-node-type cache.t3.micro \
--num-cache-clusters 2 \
--automatic-failover-enabled \
--at-rest-encryption-enabled \
--transit-encryption-enabled
Production Use Cases
| Use Case | Why This Service |
|---|---|
| RDS Multi-AZ for production OLTP | Automated failover, point-in-time backups, and OS patching with zero DBA work. Choose over self-managed EC2 Postgres when you don't need exotic extensions — the operational savings outweigh the 20–30% cost premium. |
| DynamoDB for session stores, user profiles, gaming leaderboards | Single-digit millisecond latency at any scale with no index tuning. Choose when access patterns are known and simple (key-value or key-range) — the moment you need ad-hoc queries, reach for a relational database instead. |
| DynamoDB Streams + Lambda for change data capture | React to data mutations in real-time without polling. Cheaper and simpler than running Debezium + Kafka for moderate change volumes where exactly-once CDC semantics aren't required. |
| ElastiCache Redis for rate limiting and real-time leaderboards | Sub-millisecond latency with sorted sets, atomic counters, and pub/sub that DynamoDB can't match. Choose Redis over DynamoDB when you need complex data structures or need to read/write within a single microsecond budget. |
| Aurora for high-throughput OLTP | 5x MySQL / 3x Postgres throughput on the same hardware, with storage auto-scaling to 128TB. Reach for Aurora when standard RDS hits IOPS ceiling — the architecture separates compute from storage, removing the bottleneck. |
Networking: VPC
A VPC (Virtual Private Cloud) is your isolated network within AWS. Every EC2 instance, RDS database, and Lambda (when VPC-attached) lives inside a VPC. Understanding VPC design is critical for security and connectivity.
Core VPC Components
| Component | Purpose |
|---|---|
| VPC | Isolated virtual network. Defined by a CIDR block (e.g., 10.0.0.0/16 = 65,536 IPs). |
| Subnet | A subdivision of the VPC tied to one AZ. Public subnets route to IGW; private subnets route to NAT GW or nowhere. |
| Internet Gateway (IGW) | Allows public subnets to reach the internet. Attached to the VPC. |
| NAT Gateway | Allows private subnet instances to initiate outbound internet connections (but blocks inbound). Lives in a public subnet. |
| Route Table | Defines where traffic is directed. Every subnet is associated with exactly one route table. |
| Security Group | Stateful virtual firewall at the instance/ENI level. Allow rules only; no explicit deny. |
| NACL | Stateless firewall at the subnet level. Supports both allow and deny rules. Rules evaluated by number (lowest first). |
| VPC Peering | Private connectivity between two VPCs (same or different account/region). Not transitive. |
| VPC Endpoint | Private connection from VPC to AWS services without traversing the internet. |
Three-Tier VPC Design
# Typical 3-tier VPC: public / private app / private data
# CIDR: 10.0.0.0/16
#
# Public subnets (load balancers, NAT GW, bastion host)
# 10.0.1.0/24 us-east-1a
# 10.0.2.0/24 us-east-1b
#
# Private app subnets (EC2, ECS, Lambda)
# 10.0.10.0/24 us-east-1a
# 10.0.11.0/24 us-east-1b
#
# Private data subnets (RDS, ElastiCache — no outbound internet needed)
# 10.0.20.0/24 us-east-1a
# 10.0.21.0/24 us-east-1b
# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16 --tag-specifications \
'ResourceType=vpc,Tags=[{Key=Name,Value=my-vpc}]'
# Create public subnet
aws ec2 create-subnet \
--vpc-id vpc-12345678 \
--cidr-block 10.0.1.0/24 \
--availability-zone us-east-1a
# Enable auto-assign public IP for public subnet
aws ec2 modify-subnet-attribute \
--subnet-id subnet-12345678 \
--map-public-ip-on-launch
# Create and attach Internet Gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway \
--vpc-id vpc-12345678 \
--internet-gateway-id igw-12345678
# Add route to IGW in public route table
aws ec2 create-route \
--route-table-id rtb-12345678 \
--destination-cidr-block 0.0.0.0/0 \
--gateway-id igw-12345678
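The subnet plan in the comments above can be sanity-checked with Python's stdlib ipaddress module: each /24 must nest inside the /16, no two subnets may overlap, and AWS reserves 5 addresses per subnet, leaving 251 usable in a /24.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
tiers = {
    "public-1a": "10.0.1.0/24",
    "public-1b": "10.0.2.0/24",
    "app-1a": "10.0.10.0/24",
    "app-1b": "10.0.11.0/24",
    "data-1a": "10.0.20.0/24",
    "data-1b": "10.0.21.0/24",
}
subnets = {name: ipaddress.ip_network(cidr) for name, cidr in tiers.items()}

for name, net in subnets.items():
    assert net.subnet_of(vpc), f"{name} falls outside the VPC CIDR"

# No two subnets may overlap
nets = list(subnets.values())
assert not any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])

# AWS reserves 5 addresses per subnet (network, VPC router, DNS, future use, broadcast)
usable = {name: net.num_addresses - 5 for name, net in subnets.items()}
```

Running this kind of check before `aws ec2 create-subnet` catches overlapping CIDRs early; the API rejects them too, but only one at a time.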
Security Groups vs NACLs
| Feature | Security Group | NACL |
|---|---|---|
| Level | Instance / ENI | Subnet |
| Statefulness | Stateful (return traffic automatic) | Stateless (must allow inbound AND outbound) |
| Rules | Allow only | Allow and Deny |
| Rule evaluation | All rules evaluated | Evaluated in number order, first match wins |
| Default behavior | Deny all inbound, allow all outbound | Allow all (default NACL) |
| Best use | Per-instance access control | Subnet-level block lists (e.g., block an IP range) |
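The NACL semantics from the table (numbered rules, first match wins, implicit deny at the end) fit in a few lines. An illustrative simulation that matches on destination port only, ignoring protocol and CIDR:

```python
def nacl_decision(rules: list, port: int) -> str:
    """Evaluate NACL rules in ascending rule-number order; the first rule
    whose port range matches wins. Unmatched traffic hits the implicit
    deny (the '*' rule at the end of every NACL)."""
    for rule in sorted(rules, key=lambda r: r["number"]):
        lo, hi = rule["port_range"]
        if lo <= port <= hi:
            return rule["action"]
    return "deny"

inbound = [
    {"number": 100, "port_range": (443, 443), "action": "allow"},
    {"number": 200, "port_range": (0, 65535), "action": "deny"},
]
```

Because evaluation stops at the first match, a low-numbered deny overrides broader allows that follow it, which is exactly the subnet-level block-list use case from the table.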
Production Use Cases
| Use Case | Why This Service |
|---|---|
| Multi-tier isolation (public / private / isolated subnets) | The ALB lives in the public subnet, app servers in private, databases in an isolated subnet with no route to the internet. NACLs add a second layer of defense — stateless deny rules that security groups can't express. |
| VPC peering for cross-account shared services | Connect a shared logging or monitoring account to production without traversing the internet. Traffic stays on the AWS backbone — lower latency and no egress costs compared to routing via an internet gateway. |
| PrivateLink for SaaS integration | Access third-party APIs (Datadog, Snowflake) without the traffic ever leaving the AWS network. Required for PCI-DSS and HIPAA workloads where data must not traverse the public internet. |
| Transit Gateway for hub-and-spoke multi-VPC routing | At 10+ VPCs, full-mesh peering becomes O(n²) routes to manage. Transit Gateway centralizes routing through a single attachment — one place to audit, one place to update. |
Messaging & Queues
Asynchronous messaging decouples producers from consumers, enabling fault tolerance, load leveling, and fan-out patterns.
SQS — Simple Queue Service
| Feature | Standard Queue | FIFO Queue |
|---|---|---|
| Throughput | Nearly unlimited TPS | 300 TPS per API action (3,000 with batching; more with high-throughput mode) |
| Ordering | Best-effort (not guaranteed) | Strict FIFO per message group |
| Delivery | At least once (duplicates possible) | Exactly-once processing (deduplication within a 5-minute window) |
| Use case | High-throughput, order not critical | Financial transactions, user actions |
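The FIFO guarantees in the table can be sketched with a toy model: strict arrival order within a message group, and sends carrying a deduplication ID already seen inside the dedup window (5 minutes in SQS) are dropped. The class and IDs here are hypothetical:

```python
import time

class FifoQueue:
    """Toy FIFO model: strict order per message group; duplicates
    (same dedup ID within the window) are silently dropped."""
    DEDUP_WINDOW = 300  # seconds; SQS uses a 5-minute interval

    def __init__(self):
        self.groups = {}  # group_id -> messages, in arrival order
        self.seen = {}    # dedup_id -> time first seen

    def send(self, group_id, dedup_id, body, now=None):
        now = time.time() if now is None else now
        first_seen = self.seen.get(dedup_id)
        if first_seen is not None and now - first_seen < self.DEDUP_WINDOW:
            return False  # duplicate inside the window: dropped
        self.seen[dedup_id] = now
        self.groups.setdefault(group_id, []).append(body)
        return True

q = FifoQueue()
q.send("user-1", "m1", "debit $10")
q.send("user-1", "m1", "debit $10")  # same dedup ID: dropped
q.send("user-1", "m2", "credit $5")
print(q.groups["user-1"])  # ['debit $10', 'credit $5']
```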
# Create a queue with dead-letter queue
aws sqs create-queue --queue-name my-dlq
aws sqs get-queue-attributes --queue-url ... --attribute-names QueueArn
aws sqs create-queue \
--queue-name my-queue \
--attributes '{
"VisibilityTimeout": "30",
"MessageRetentionPeriod": "86400",
"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\",\"maxReceiveCount\":\"3\"}"
}'
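The RedrivePolicy above means: after maxReceiveCount failed receives, SQS moves the message to the DLQ instead of redelivering it forever. A toy simulation of that lifecycle (pure Python, no AWS calls; the class is made up):

```python
class Queue:
    """Toy model of SQS redrive: after max_receive_count failed
    processing attempts, the message is moved to the DLQ."""
    def __init__(self, max_receive_count=3, dlq=None):
        self.messages = []
        self.max_receive_count = max_receive_count
        self.dlq = dlq

    def send(self, body):
        self.messages.append({"body": body, "receive_count": 0})

    def receive_and_fail(self):
        """Simulate a consumer receiving a message and failing to delete it
        (visibility timeout expires, message returns to the queue)."""
        msg = self.messages.pop(0)
        msg["receive_count"] += 1
        if msg["receive_count"] >= self.max_receive_count:
            self.dlq.send(msg["body"])  # quarantined for inspection
        else:
            self.messages.append(msg)   # back on the queue for retry

dlq = Queue()
q = Queue(max_receive_count=3, dlq=dlq)
q.send("poison-message")
for _ in range(3):
    q.receive_and_fail()
print(len(q.messages), len(dlq.messages))  # 0 1
```

After the third failed receive the poison message sits in the DLQ, ready to be inspected and replayed once the bug is fixed.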
# Send a message
aws sqs send-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
--message-body '{"orderId": "abc123", "action": "process"}'
# Receive and process messages
aws sqs receive-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
--max-number-of-messages 10 \
--wait-time-seconds 20 # Long polling — reduce empty receives
# Delete after processing
aws sqs delete-message \
--queue-url ... \
--receipt-handle "AQEBwJ..."
SNS — Simple Notification Service
SNS is a pub/sub service. Publishers send to a topic; subscribers (SQS, Lambda, HTTP, email, SMS) receive a copy. This enables fan-out: one event triggers many parallel consumers.
# Create a topic
aws sns create-topic --name order-events
# Subscribe an SQS queue to the topic
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
--protocol sqs \
--notification-endpoint arn:aws:sqs:us-east-1:123456789012:order-processing
# Subscribe a Lambda function
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
--protocol lambda \
--notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:send-email
# Publish a message
aws sns publish \
--topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
--message '{"orderId": "abc123", "status": "placed"}' \
--subject "Order Placed"
EventBridge
EventBridge is a serverless event bus. More powerful than SNS for routing: supports content-based routing via rules, schema registry, event replays, and cross-account event buses.
# Create a rule to trigger Lambda when an EC2 instance stops
aws events put-rule \
--name ec2-stopped \
--event-pattern '{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {"state": ["stopped"]}
}' \
--state ENABLED
# Add Lambda as target
aws events put-targets \
--rule ec2-stopped \
--targets 'Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:notify-ops'
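Pattern matching is what makes EventBridge rules powerful: an event matches when, for every field in the pattern, the event's value is one of the listed values, recursing into nested objects like detail. A simplified matcher for the rule above (real patterns also support prefix, numeric, and anything-but operators):

```python
def matches(pattern, event):
    """Simplified EventBridge matching: every pattern key must exist in
    the event, and the event's value must be one of the listed pattern
    values. Nested dicts (like "detail") recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"state": "stopped", "instance-id": "i-1234567890abcdef0"},
}
print(matches(pattern, event))                                      # True
print(matches(pattern, {**event, "detail": {"state": "running"}}))  # False
```

Extra event fields (like instance-id here) are ignored; only the keys named in the pattern are constrained.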
Production Use Cases
| Use Case | Why This Service |
|---|---|
| SQS: Decoupling microservices | Producer writes at its own pace; consumer processes at its own pace — SQS absorbs the spike. Use Standard queue for maximum throughput, FIFO when order matters (order processing, financial transactions). |
| SQS + DLQ: Poison message handling | Failed messages are quarantined for inspection rather than silently dropped or blocking the queue. In payment and order processing, you cannot afford to lose a message — the DLQ gives you a durable holding area to debug and replay. |
| SNS: Fan-out to multiple consumers simultaneously | One published event triggers email notification + analytics pipeline + audit log in parallel. Doing this with SQS alone requires each consumer to poll independently — SNS pushes to all subscribers in one API call. |
| EventBridge: Cross-account event routing with content-based filtering | Route events to different targets based on payload content without writing routing code. Choose over SNS when you need schema registry for contract enforcement, event replay for debugging, or third-party SaaS integration (Stripe, Auth0 webhooks). |
| EventBridge Scheduler: Cron replacement | One-time or recurring triggers with no infrastructure to maintain. Replaces the pattern of keeping an EC2 instance alive 24/7 just to run a cron job that executes for a few seconds. |
Containers on AWS
AWS offers multiple layers for running containers: ECR for image storage, ECS for container orchestration (AWS-native), and EKS for Kubernetes.
ECR — Elastic Container Registry
# Authenticate Docker to your ECR registry
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin \
123456789012.dkr.ecr.us-east-1.amazonaws.com
# Create a repository
aws ecr create-repository --repository-name my-app
# Build, tag, and push
docker build -t my-app .
docker tag my-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
# Enable image scanning on push (checks for CVEs)
aws ecr put-image-scanning-configuration \
--repository-name my-app \
--image-scanning-configuration scanOnPush=true
ECS — Elastic Container Service
ECS has two key concepts: Task Definitions (what to run — image, CPU, memory, environment, ports) and Services (how many copies, load balancer integration, auto-scaling).
| Launch Type | Description | When to use |
|---|---|---|
| Fargate | Serverless — AWS manages the underlying EC2. Pay per task CPU/memory. | Most workloads. No cluster management overhead. |
| EC2 | You manage an EC2 cluster. ECS places tasks on your instances. | GPU workloads, specific instance types, cost optimization at scale. |
// ECS Task Definition (simplified)
{
"family": "my-app",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/myAppTaskRole",
"containerDefinitions": [
{
"name": "my-app",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
"portMappings": [
{ "containerPort": 8080, "protocol": "tcp" }
],
"environment": [
{ "name": "ENV", "value": "production" }
],
"secrets": [
{ "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:db-password" }
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/my-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3
}
}
]
}
# Register task definition
aws ecs register-task-definition --cli-input-json file://task-def.json
# Create a service (runs 2 tasks behind a load balancer)
aws ecs create-service \
--cluster my-cluster \
--service-name my-app-svc \
--task-definition my-app:1 \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration 'awsvpcConfiguration={
subnets=[subnet-aaa,subnet-bbb],
securityGroups=[sg-12345678],
assignPublicIp=DISABLED
}' \
--load-balancers 'targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=my-app,containerPort=8080'
# Force new deployment (rolling update)
aws ecs update-service \
--cluster my-cluster \
--service my-app-svc \
--force-new-deployment
EKS — Elastic Kubernetes Service
EKS is managed Kubernetes. AWS runs the control plane (API server, etcd, scheduler). You manage worker nodes (or use Fargate for pods).
# Create a cluster (takes ~10 min)
eksctl create cluster \
--name my-cluster \
--region us-east-1 \
--nodegroup-name workers \
--node-type m5.large \
--nodes 3 \
--nodes-min 1 \
--nodes-max 5 \
--managed
# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name my-cluster
# Verify
kubectl get nodes
kubectl get pods --all-namespaces
Production Use Cases
| Use Case | Why This Service |
|---|---|
| ECS Fargate: Stateless microservices | No cluster management, no AMI patching, pay-per-vCPU-second. Choose over EC2 launch type when your workload is stateless and traffic is variable — Fargate's higher unit cost is offset by eliminating idle EC2 capacity. |
| ECS EC2: GPU workloads and cost-optimized steady-state | Fargate doesn't support GPUs, and at high, predictable throughput EC2 Reserved Instances are 2–3x cheaper per unit than Fargate. Use EC2 launch type when you've right-sized the fleet and can commit to reserved capacity. |
| ECR: Private image registry | IAM-integrated authentication means no separate registry credentials to rotate. Built-in vulnerability scanning catches CVEs before deployment, and lifecycle policies automatically prune untagged images to control storage costs. |
| EKS: Multi-cloud portability and complex orchestration | If your team already operates Kubernetes, EKS avoids retraining and lets you reuse Helm charts, operators, and tooling across clouds. Choose EKS over ECS when you need custom controllers, a service mesh (Istio/Linkerd), or a genuine multi-cloud strategy. |
Monitoring & Logging
Observability on AWS centers on three tools: CloudWatch for metrics and logs, CloudTrail for API audit trails, and X-Ray for distributed tracing.
CloudWatch
| Feature | Description |
|---|---|
| Metrics | Time-series data points. EC2 (CPU, network, disk), Lambda (duration, errors, throttles), RDS (connections, latency) all publish metrics automatically. |
| Alarms | Trigger actions (SNS notification, Auto Scaling, EC2 action) when a metric breaches a threshold. |
| Logs | Log groups and log streams. Lambda writes here automatically. EC2/ECS needs the CloudWatch agent. |
| Logs Insights | SQL-like query language over log data. Useful for ad-hoc debugging. |
| Dashboards | Custom real-time metric visualizations across services and regions. |
| Synthetics | Canary scripts that monitor endpoints and APIs on a schedule. |
# Create an alarm: alert when Lambda error rate > 1%
aws cloudwatch put-metric-alarm \
--alarm-name lambda-errors-high \
--metric-name Errors \
--namespace AWS/Lambda \
--dimensions Name=FunctionName,Value=my-api \
--statistic Sum \
--period 300 \
--evaluation-periods 2 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
--treat-missing-data notBreaching
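The alarm above transitions to ALARM only when the threshold is breached for two consecutive 5-minute periods, which filters out transient spikes. A simplified sketch of that evaluation (real alarms also support missing-data handling and M-out-of-N datapoints):

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """ALARM when the most recent `evaluation_periods` datapoints all
    breach the threshold (GreaterThanThreshold semantics), else OK."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(v > threshold for v in recent) else "OK"

# Sum of Lambda errors per 5-minute period
print(alarm_state([0, 2, 9], threshold=5, evaluation_periods=2))  # OK (one breach)
print(alarm_state([0, 7, 9], threshold=5, evaluation_periods=2))  # ALARM (two in a row)
```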
# Publish a custom metric
aws cloudwatch put-metric-data \
--namespace MyApp \
--metric-name OrdersProcessed \
--value 42 \
--unit Count
# Query logs with Logs Insights (date -v-1H is BSD/macOS syntax; use date -d '-1 hour' +%s on GNU/Linux)
aws logs start-query \
--log-group-name /aws/lambda/my-api \
--start-time $(date -v-1H +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20'
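The Insights query above is essentially filter, sort, limit. Its effect can be sketched in plain Python over hypothetical log lines:

```python
# Hypothetical (timestamp, message) pairs as they might appear in a log group
logs = [
    ("2024-05-01T10:00:01", "START RequestId: abc"),
    ("2024-05-01T10:00:02", "ERROR Timeout calling DynamoDB"),
    ("2024-05-01T10:00:03", "END RequestId: abc"),
    ("2024-05-01T10:00:04", "ERROR Unhandled exception"),
]

# filter @message like /ERROR/ | sort @timestamp desc | limit 20
results = sorted(
    (entry for entry in logs if "ERROR" in entry[1]),
    key=lambda entry: entry[0],
    reverse=True,
)[:20]
for ts, msg in results:
    print(ts, msg)
# 2024-05-01T10:00:04 ERROR Unhandled exception
# 2024-05-01T10:00:02 ERROR Timeout calling DynamoDB
```

The real service runs this server-side and bills per byte of log data scanned, so narrowing the time range is the main cost lever.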
CloudTrail
CloudTrail records every AWS API call (who, what, when, from where) across your account. It is the primary tool for security investigations and compliance auditing.
# Create a trail that writes to S3 (best practice: multi-region trail)
aws cloudtrail create-trail \
--name my-audit-trail \
--s3-bucket-name my-audit-logs-bucket \
--include-global-service-events \
--is-multi-region-trail \
--enable-log-file-validation
aws cloudtrail start-logging --name my-audit-trail
# Look up recent events for a specific user
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=Username,AttributeValue=alice \
--max-results 10
# Find who deleted an S3 bucket (a management event)
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=DeleteBucket
# Note: object-level actions like DeleteObject are data events; lookup-events
# returns only management events, so enable data event logging on the trail and
# query the delivered S3 logs (e.g., with Athena) to trace object deletions.
X-Ray — Distributed Tracing
X-Ray traces requests as they flow through your application, across Lambda, EC2, ECS, API Gateway, and more. It produces service maps and identifies latency bottlenecks.
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all
# Instrument all boto3 clients automatically
patch_all()
@xray_recorder.capture('process_order')
def process_order(order_id: str) -> dict:
    # in_subsegment creates a nested subsegment in the trace;
    # annotations are indexed, so you can filter traces by orderId
    with xray_recorder.in_subsegment('validate') as subsegment:
        subsegment.put_annotation('orderId', order_id)
        result = validate_order(order_id)  # app-defined helper
    with xray_recorder.in_subsegment('persist'):
        save_to_dynamodb(result)  # app-defined helper
    return result
Production Use Cases
| Use Case | Why This Service |
|---|---|
| CloudWatch Alarms + SNS: Operational alerting | CPU above 80%, 5xx error rate above 1%, SQS queue depth growing — each alarm can trigger an SNS notification or an Auto Scaling policy. The tight feedback loop between metric → alarm → action is the foundation of self-healing infrastructure. |
| CloudWatch Logs Insights: Ad-hoc log analysis | Query across Lambda, ECS, and API Gateway logs in a single pane without exporting data. Significantly cheaper than Splunk or Datadog for basic log search — pay only for the bytes scanned, not a per-host license. |
| X-Ray: Distributed tracing across services | Visualizes the full request path from API Gateway through Lambda to DynamoDB, with per-segment timing. Without distributed tracing, debugging latency regressions in microservices means correlating timestamps across multiple log streams — X-Ray does it automatically. |
| CloudTrail: Security audit and compliance | Every AWS API call is logged — who did what, from which IP, and when. Required for SOC 2 and HIPAA compliance, and the first tool you reach for when investigating unauthorized resource changes or privilege escalation. |
Infrastructure as Code
IaC treats infrastructure definitions as source code: version-controlled, repeatable, reviewable. On AWS, the native tool is CloudFormation; Terraform is the most popular third-party alternative.
CloudFormation
CloudFormation templates describe a stack — a collection of AWS resources. CloudFormation provisions, updates, and deletes them as a unit.
# cloudformation/api-stack.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Serverless API stack with Lambda, API Gateway, and DynamoDB
Parameters:
Environment:
Type: String
Default: dev
AllowedValues: [dev, staging, prod]
Description: Deployment environment
LambdaMemory:
Type: Number
Default: 512
MinValue: 128
MaxValue: 10240
Conditions:
IsProd: !Equals [!Ref Environment, prod]
Resources:
# DynamoDB Table
UsersTable:
Type: AWS::DynamoDB::Table
DeletionPolicy: Retain
Properties:
TableName: !Sub '${Environment}-Users'
BillingMode: PAY_PER_REQUEST
AttributeDefinitions:
- AttributeName: userId
AttributeType: S
KeySchema:
- AttributeName: userId
KeyType: HASH
PointInTimeRecoverySpecification:
PointInTimeRecoveryEnabled: !If [IsProd, true, false]
Tags:
- Key: Environment
Value: !Ref Environment
# IAM Execution Role
LambdaExecutionRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: lambda.amazonaws.com
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Policies:
- PolicyName: DynamoDBAccess
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- dynamodb:GetItem
- dynamodb:PutItem
- dynamodb:DeleteItem
- dynamodb:Query
Resource: !GetAtt UsersTable.Arn
# Lambda Function
ApiFunction:
Type: AWS::Lambda::Function
Properties:
FunctionName: !Sub '${Environment}-users-api'
Runtime: python3.12
Handler: handler.lambda_handler
Role: !GetAtt LambdaExecutionRole.Arn
MemorySize: !Ref LambdaMemory
Timeout: 30
Environment:
Variables:
TABLE_NAME: !Ref UsersTable
ENVIRONMENT: !Ref Environment
Code:
ZipFile: |
def lambda_handler(event, context):
return {'statusCode': 200, 'body': 'ok'}
# API Gateway
ApiGateway:
Type: AWS::ApiGateway::RestApi
Properties:
Name: !Sub '${Environment}-users-api'
EndpointConfiguration:
Types: [REGIONAL]
Outputs:
ApiEndpoint:
Description: API Gateway endpoint URL
Value: !Sub 'https://${ApiGateway}.execute-api.${AWS::Region}.amazonaws.com/prod'
Export:
Name: !Sub '${AWS::StackName}-ApiEndpoint'
UsersTableArn:
Description: DynamoDB table ARN
Value: !GetAtt UsersTable.Arn
Export:
Name: !Sub '${AWS::StackName}-UsersTableArn'
# Validate template syntax
aws cloudformation validate-template --template-body file://api-stack.yaml
# Create/update stack (create-or-update)
aws cloudformation deploy \
--template-file api-stack.yaml \
--stack-name my-api-dev \
--parameter-overrides Environment=dev LambdaMemory=256 \
--capabilities CAPABILITY_IAM \
--no-fail-on-empty-changeset
# Preview changes before applying (changeset)
aws cloudformation create-change-set \
--stack-name my-api-dev \
--change-set-name preview \
--template-body file://api-stack.yaml \
--capabilities CAPABILITY_IAM
aws cloudformation describe-change-set \
--stack-name my-api-dev \
--change-set-name preview
# Describe stack events (debug failed deployments)
aws cloudformation describe-stack-events \
--stack-name my-api-dev \
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
# Delete stack
aws cloudformation delete-stack --stack-name my-api-dev
SAM — Serverless Application Model
SAM extends CloudFormation with shorthand for Lambda, API Gateway, and DynamoDB, cutting away most of the boilerplate for serverless apps.
# template.yaml (SAM)
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Globals:
Function:
Runtime: python3.12
Timeout: 30
Environment:
Variables:
TABLE_NAME: !Ref UsersTable
Resources:
UsersFunction:
Type: AWS::Serverless::Function
Properties:
Handler: handler.lambda_handler
CodeUri: src/
MemorySize: 512
Policies:
- DynamoDBCrudPolicy:
TableName: !Ref UsersTable
Events:
GetUser:
Type: Api
Properties:
Path: /users/{userId}
Method: GET
CreateUser:
Type: Api
Properties:
Path: /users
Method: POST
UsersTable:
Type: AWS::Serverless::SimpleTable
Properties:
PrimaryKey:
Name: userId
Type: String
# Install SAM CLI
brew tap aws/tap
brew install aws-sam-cli
# Build and run locally
sam build
sam local start-api # Starts local API Gateway
sam local invoke UsersFunction --event event.json
# Deploy
sam deploy --guided # First time: creates samconfig.toml
sam deploy # Subsequent deploys
CloudFormation vs Terraform — When to use which
| Dimension | CloudFormation | Terraform |
|---|---|---|
| Multi-cloud | AWS only | AWS, GCP, Azure, + 1000s of providers |
| State management | AWS manages state in the stack | You manage state file (local or S3 backend) |
| Preview changes | Change sets | terraform plan |
| Drift detection | Built-in (detect-stack-drift) | Surfaced by terraform plan (compares state against real infrastructure) |
| AWS service lag | Usually small, though new services can lag at launch | Depends on provider; typically days to weeks |
| Language | JSON/YAML | HCL (more expressive, supports loops/conditionals well) |
| Rollback | Automatic on failure | Manual; no automatic rollback |
| Best for | AWS-only shops, tight AWS integration | Multi-cloud, teams preferring HCL expressiveness |
In practice: greenfield AWS-only projects often use CloudFormation (or CDK which compiles to CloudFormation). Teams with multi-cloud needs or existing Terraform expertise reach for Terraform.
Production Use Cases
| Use Case | Why This Service |
|---|---|
| CloudFormation: AWS-native single-account infrastructure | Deep integration with every AWS service on day one, built-in drift detection, and no external state file to lose or corrupt. Choose over Terraform when you're AWS-only and want automatic rollback on stack failures. |
| Terraform: Multi-cloud and multi-provider management | Manage AWS + Datadog + PagerDuty + GitHub in a single codebase with HCL's expressive loops and conditionals. The community module registry is unmatched. Choose when your infrastructure spans providers or your team has existing Terraform expertise. |
| SAM: Serverless application development | CloudFormation superset that collapses Lambda + API Gateway + DynamoDB boilerplate to a fraction of its size. The killer feature is sam local invoke — run your Lambda locally against a real event payload before deploying, closing the feedback loop dramatically. |
| CDK: Complex infrastructure requiring programmatic logic | Loops, conditionals, inheritance, and type safety in TypeScript or Python — expressing dynamic infrastructure (e.g., deploying N identical services from a list) in YAML becomes unmaintainable quickly. CDK compiles to CloudFormation, so you keep AWS's native rollback and drift detection. |
Common Architecture Patterns
These are the building blocks that appear repeatedly in production AWS architectures.
Three-Tier Web Application
# Layer 1: Public-facing (Load Balancer in public subnet)
# ALB → health checks, SSL termination, routing rules
#
# Layer 2: Application (ECS/EC2 in private app subnet)
# Auto Scaling Group or ECS Service
# No public IPs — only ALB can reach them via security group rule
#
# Layer 3: Data (RDS + ElastiCache in private data subnet)
# Only application layer security group can connect
#
# Traffic flow:
# Internet → Route 53 → CloudFront (optional CDN) → ALB
# → ECS tasks → RDS (reads from replica, writes to primary)
# → ElastiCache (cache layer)
#
# Outbound from private subnets:
# App/Data subnet → NAT Gateway (public subnet) → Internet Gateway → Internet
# Key security group rules:
# ALB SG: inbound 443 from 0.0.0.0/0
# App SG: inbound 8080 from ALB SG only
# DB SG: inbound 5432 from App SG only
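Those chained rules form a reachability graph: each tier's security group admits traffic only from the tier in front of it. A toy model of the three rules listed (SG names and ports are the hypothetical ones above):

```python
# Inbound rules: target SG -> set of (source, port) pairs allowed in.
# A source is either a CIDR or another security group's ID.
rules = {
    "alb-sg": {("0.0.0.0/0", 443)},
    "app-sg": {("alb-sg", 8080)},
    "db-sg": {("app-sg", 5432)},
}

def can_reach(source, target_sg, port):
    """True if `source` (a CIDR or another SG) may open `port` on target_sg."""
    return (source, port) in rules.get(target_sg, set())

print(can_reach("0.0.0.0/0", "alb-sg", 443))   # True: internet hits the ALB
print(can_reach("alb-sg", "app-sg", 8080))     # True: ALB forwards to app
print(can_reach("0.0.0.0/0", "app-sg", 8080))  # False: internet can't skip the ALB
print(can_reach("app-sg", "db-sg", 5432))      # True: only the app reaches Postgres
```

The value of SG-to-SG references is that they survive scaling events: a new ECS task in app-sg is reachable from the ALB the moment it launches, with no IP-based rule updates.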
Serverless API Pattern
# API Gateway + Lambda + DynamoDB
#
# Request flow:
# Client → API Gateway → Lambda → DynamoDB
#
# API Gateway handles:
# - TLS termination
# - Request throttling (e.g., 10,000 RPS)
# - API key management
# - Request/response transformation
# - Caching (optional)
#
# Lambda handles:
# - Business logic
# - Input validation
# - Auth (via Lambda authorizer or Cognito)
#
# DynamoDB handles:
# - Data persistence
# - Single-digit ms read/write at any scale
#
# Cost characteristics:
# - Zero cost at zero traffic (pay per invocation)
# - Auto-scales to millions of RPS without config
# - No servers to manage or patch
Event-Driven Fan-Out (SNS + SQS)
# Pattern: one event triggers multiple independent consumers
#
# SNS Topic: order-events
# ├── SQS Queue: order-fulfillment → Lambda: reserve inventory
# ├── SQS Queue: order-email → Lambda: send confirmation email
# └── SQS Queue: order-analytics → Lambda: update business metrics
#
# Why SNS → SQS (not SNS → Lambda directly)?
# - SQS acts as a buffer: the Lambda event source mapping controls batch size and concurrency
# - Dead-letter queues on SQS catch failures without losing events
# - Each consumer scales independently
# - Consumer can be paused (stop polling) without losing messages
#
# CloudFormation snippet for the fan-out:
# Producer publishes ONE message to SNS
aws sns publish \
--topic-arn arn:aws:sns:us-east-1:123456789012:order-events \
--message '{"orderId":"abc123","userId":"user456","total":99.99}'
# All three SQS queues receive a copy simultaneously
# Each Lambda processes independently, at its own pace
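The fan-out above boils down to: one publish, one independent copy per subscribed queue. A toy model (the class and queue names are made up):

```python
class Topic:
    """Toy SNS topic: publish delivers a copy to every subscribed queue."""
    def __init__(self):
        self.subscriptions = []

    def subscribe(self, queue):
        self.subscriptions.append(queue)

    def publish(self, message):
        for queue in self.subscriptions:
            queue.append(dict(message))  # each consumer gets its own copy

order_events = Topic()
fulfillment, email, analytics = [], [], []
for queue in (fulfillment, email, analytics):
    order_events.subscribe(queue)

order_events.publish({"orderId": "abc123", "total": 99.99})
print(len(fulfillment), len(email), len(analytics))  # 1 1 1
```

Because each queue holds its own copy, one consumer failing or falling behind has no effect on the others, which is the point of the pattern.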
Static Site (S3 + CloudFront)
# Architecture:
# GitHub Actions → Build (npm run build) → S3 upload → CloudFront invalidation
#
# S3: hosts built static files (HTML/CSS/JS)
# CloudFront: CDN — caches at edge locations globally, serves HTTPS
# 1. Create S3 bucket for static hosting
aws s3 mb s3://my-static-site --region us-east-1
aws s3 website s3://my-static-site \
--index-document index.html \
--error-document 404.html
# 2. Create CloudFront distribution pointing to S3 origin
# (typically done via console or CloudFormation — CLI is verbose)
# 3. Deploy: sync build output, then invalidate CDN cache
aws s3 sync ./dist/ s3://my-static-site/ \
--delete \
--cache-control "public, max-age=31536000, immutable"
# Invalidate CloudFront cache (HTML files should not be cached long)
aws cloudfront create-invalidation \
--distribution-id ABCDEFGHIJKLMN \
--paths "/*"
# 4. Custom domain: Route 53 alias record → CloudFront distribution
aws route53 change-resource-record-sets \
--hosted-zone-id ZONE123 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "www.example.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "Z2FDTNDATAQYW2",
"DNSName": "d111111abcdef8.cloudfront.net",
"EvaluateTargetHealth": false
}
}
}]
}'
Additional Patterns: Blue/Green & Canary Deployments
Blue/Green Deployment
Run two identical environments (blue = current, green = new). Switch traffic instantly at the load balancer or Route 53 level. Rollback is instant — point traffic back to blue.
- ECS: CodeDeploy manages blue/green at the target group level
- Lambda: Use aliases and weighted routing between two function versions
- Elastic Beanstalk: Swap environment URLs
Canary Deployment
Gradually shift traffic to the new version. Start at 1%, watch error rates, then increase to 10%, 50%, 100%.
# Lambda canary via alias weighted routing:
# the alias stays on the old version; shift 10% of traffic to the new one
aws lambda update-alias \
--function-name my-api \
--name LIVE \
--function-version 4 \
--routing-config '{"AdditionalVersionWeights": {"5": 0.1}}'
# 90% to version 4 (old), 10% to version 5 (new)
# After validating error rates, promote the new version:
aws lambda update-alias \
--function-name my-api \
--name LIVE \
--function-version 5 \
--routing-config '{"AdditionalVersionWeights": {}}'
# 100% to version 5
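Weighted alias routing is effectively a biased pick per invocation: the additional version receives its configured fraction of traffic and the alias's primary version gets the rest. A simulation of a 10% canary (versions and weights are illustrative):

```python
import random

def route(weights, primary, rng):
    """Pick a version: each entry in `weights` gets its fraction of
    traffic; the primary version receives the remainder."""
    r = rng.random()
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight
        if r < cumulative:
            return version
    return primary

rng = random.Random(42)
weights = {"5": 0.10}  # canary: 10% of traffic to new version 5
counts = {"4": 0, "5": 0}
for _ in range(10_000):
    counts[route(weights, primary="4", rng=rng)] += 1
print(counts["5"] / 10_000)  # roughly 0.10
```

In production you would watch the canary version's error and latency metrics (a CloudWatch alarm per version works well) before raising the weight.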
For local testing, sam local emulates Lambda and API Gateway: you can run a fully functional serverless API on your laptop without touching a real AWS account.