Choosing between ECS Fargate and EKS is one of the harder infrastructure calls for a team building on AWS. Pick well and you trade away problems you don't want to own. Pick badly and you end up paying for Kubernetes expertise you don't have, or hitting AWS-specific ceilings on a workload that needed to be portable.
This is the decision framework, cost math, and implementation patterns I lean on when making the call. Either choice pairs well with AWS Cost Anomaly Detection, so runaway containers don't sneak onto the bill.
The three paths AWS gives you
AWS has three reasonable ways to run containers, and the tradeoffs aren't subtle.
ECS (Elastic Container Service)
ECS is Amazon's proprietary orchestrator. It's AWS-native, simpler than Kubernetes, and has first-class hooks into the rest of the platform. The mental model is small: a cluster, services, and task definitions.
What you get:
- An AWS-native API with built-in integration for ALB, CloudWatch, IAM, and Secrets Manager
- Task definitions in JSON that describe containers, resources, networking, and volumes
- A service controller that maintains desired count, runs health checks, and plugs into service discovery
- No control plane charge, so you only pay for the compute your tasks consume
- A Fargate launch type that removes infrastructure management entirely
EKS (Elastic Kubernetes Service)
EKS is managed Kubernetes. You get standard upstream Kubernetes with AWS-specific glue. The upside is portability and the entire CNCF ecosystem. The downside is that you now own Kubernetes.
What you get:
- Full compatibility with the Kubernetes API, kubectl, Helm, operators, the works
- A managed control plane (AWS handles upgrades, patching, HA of the masters)
- Compute flexibility across EC2, Fargate, and hybrid on-prem nodes via Outposts
- Access to the broader Kubernetes tooling community
- Portability across clouds and on-prem, which matters for some org policies
EC2 launch type vs Fargate
Both ECS and EKS let you pick where the workload actually runs:
EC2 launch type:
- You own the instances, including patching, scaling, and capacity headroom
- Per-task cost drops when utilization is high
- You can pick instance types, AMIs, and custom configurations
- Capacity planning and cluster management are now your problem
Fargate launch type:
- AWS owns the infrastructure, no servers to manage
- You pay for the vCPU and memory each task requests, billed per second with a one-minute minimum
- Scaling happens without capacity planning
- Best fit for variable workloads and small teams
The cost math, honestly
Compute is the easy line item. The real cost includes management overhead and the engineering time you'll spend keeping the cluster healthy.
What ECS costs
ECS itself is free. You pay for the AWS resources underneath.
Fargate pricing (us-east-1, Linux/x86, on-demand; check the current price list for your region):
vCPU: $0.04048 per vCPU per hour
Memory: $0.004445 per GB per hour
Example: 1 vCPU, 2GB RAM task running 24/7
Monthly cost: (0.04048 * 1 + 0.004445 * 2) * 730 hours
= (0.04048 + 0.00889) * 730
= $36.04 per month per task
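The same arithmetic is easy to script when you want to try other task sizes. This is a minimal sketch that reuses the rates quoted above; the function name and the 730-hour month are my own conventions, and it assumes tasks run 24/7 at on-demand rates.

```python
# Rates copied from the figures above; they vary by region and change over time.
VCPU_PER_HOUR = 0.04048
GB_PER_HOUR = 0.004445
HOURS_PER_MONTH = 730


def fargate_monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1) -> float:
    """Monthly on-demand cost of `tasks` always-on Fargate tasks of one size."""
    hourly = vcpu * VCPU_PER_HOUR + memory_gb * GB_PER_HOUR
    return round(hourly * HOURS_PER_MONTH * tasks, 2)


print(fargate_monthly_cost(1, 2))      # a single 1 vCPU / 2 GB task
print(fargate_monthly_cost(1, 2, 10))  # ten of the same task
```

Swapping in a different vCPU and memory pair shows quickly how much of the bill is CPU and how much is memory.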
EC2 launch type (you manage instances):
Example: 3x t3.large instances (2 vCPU, 8GB each)
Cost: $0.0832/hour * 3 * 730 hours = $182.21/month
Plus: EBS volumes, data transfer, load balancers
Can run many containers per instance if utilization is high
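A quick way to sanity-check "many containers per instance" is to ask how many Fargate tasks it takes to cost more than the fleet. The sketch below uses only the two example figures from this section; it ignores EBS, data transfer, and the headroom the OS and ECS agent need, so treat the result as a rough lower bound rather than a sizing rule.

```python
import math

FARGATE_TASK_MONTHLY = 36.04  # 1 vCPU / 2 GB task, from the Fargate example
EC2_FLEET_MONTHLY = 182.21    # 3x t3.large, from the EC2 example


def break_even_tasks(fleet_cost: float, task_cost: float) -> int:
    """Smallest task count at which Fargate costs more than the EC2 fleet."""
    return math.floor(fleet_cost / task_cost) + 1


print(break_even_tasks(EC2_FLEET_MONTHLY, FARGATE_TASK_MONTHLY))
```

Below that count Fargate is cheaper on compute alone; above it, EC2 wins only if the instances can actually hold the tasks.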
What EKS costs
EKS charges for the control plane and the compute separately.
Control plane:
$0.10 per cluster per hour
= $73 per month per cluster (flat, regardless of cluster size)
Compute options:
Option 1, Fargate (same pricing as ECS Fargate):
1 vCPU, 2GB task: $36.04/month per pod
No EC2 management, higher per-unit cost
Option 2, EC2 managed node groups:
Same EC2 costs as ECS EC2 launch type
Plus $73/month for the EKS control plane
You own capacity, patching, and scaling
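Because the control plane fee is flat, its weight depends entirely on how much compute sits underneath it. A small sketch, using the $0.10/hour figure above and compute totals I picked as round illustrations, makes that visible:

```python
HOURS_PER_MONTH = 730
EKS_CONTROL_PLANE_PER_HOUR = 0.10  # per cluster, from the figure above


def eks_control_plane_monthly(clusters: int = 1) -> float:
    """Flat monthly control plane fee for the given number of clusters."""
    return round(EKS_CONTROL_PLANE_PER_HOUR * HOURS_PER_MONTH * clusters, 2)


def control_plane_share(compute_monthly: float, clusters: int = 1) -> float:
    """Fraction of the total EKS bill that is control plane fee."""
    fee = eks_control_plane_monthly(clusters)
    return fee / (fee + compute_monthly)


for compute in (360, 1800, 7200):  # illustrative monthly compute bills in USD
    print(f"${compute}: {control_plane_share(compute):.1%} of the bill is control plane")
```

The fee is noticeable on a small bill and close to noise on a large one. Note that it multiplies with every extra cluster, so separate dev, staging, and production clusters each pay it.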
Real cost comparisons
Scenario 1, small application (10 containers, 1 vCPU, 2GB each):
| Option | Monthly Cost | Management Overhead |
|---|---|---|
| ECS Fargate | $360 | Minimal (hours/month) |
| EKS + Fargate | $433 ($360 + $73) | Low (2-5 hours/month) |
| ECS + EC2 | $200-250 | Medium (10-15 hours/month) |
| EKS + EC2 | $270-320 | High (15-25 hours/month) |
Winner: ECS Fargate, lowest total cost of ownership.
Scenario 2, medium application (50 containers, varying sizes):
| Option | Monthly Cost | Management Overhead |
|---|---|---|
| ECS Fargate | $1,800 | Minimal |
| EKS + Fargate | $1,873 | Low |
| ECS + EC2 | $600-800 | Medium |
| EKS + EC2 | $670-870 | High |
Winner: ECS on EC2 if the team has DevOps depth, ECS Fargate for smaller teams.
Scenario 3, large application (200+ containers, high utilization):
| Option | Monthly Cost | Management Overhead |
|---|---|---|
| ECS Fargate | $7,200+ | Minimal |
| EKS + Fargate | $7,273+ | Low-Medium |
| ECS + EC2 | $2,000-3,000 | Medium-High |
| EKS + EC2 | $2,070-3,070 | High |
Winner: EC2 launch type for cost efficiency, assuming you have a dedicated DevOps team.
The hidden costs people forget:
- EKS adds a real Kubernetes hiring and training bill on top of the AWS bill
- EC2 means you own patching, security updates, capacity planning, and monitoring tooling
- Third-party tools in the Kubernetes ecosystem often have their own licensing (Datadog, New Relic, etc.)
- Networking costs (data transfer, NAT gateway, load balancers) come out roughly the same across options
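To fold engineering time into the comparison, I price the overhead hours and add them to the AWS bill. The sketch below uses the midpoints of the Scenario 1 ranges. The hourly rate is a hypothetical placeholder, and the 2 hours for ECS Fargate is my own assumption, since the table only says "hours/month".

```python
HOURLY_RATE = 100.0  # hypothetical loaded engineering cost in USD; use your own

# name -> (monthly AWS bill, engineering hours per month), Scenario 1 midpoints
SCENARIO_1 = {
    "ECS Fargate": (360.0, 2.0),    # hours are an assumption, see above
    "EKS + Fargate": (433.0, 3.5),
    "ECS + EC2": (225.0, 12.5),
    "EKS + EC2": (295.0, 20.0),
}


def total_cost_of_ownership(aws_bill: float, hours: float,
                            hourly_rate: float = HOURLY_RATE) -> float:
    """AWS bill plus the priced engineering time spent running the platform."""
    return aws_bill + hours * hourly_rate


tco = {name: total_cost_of_ownership(bill, hours)
       for name, (bill, hours) in SCENARIO_1.items()}

for name, cost in sorted(tco.items(), key=lambda item: item[1]):
    print(f"{name:<14} ${cost:,.0f}/month")
```

With these inputs the ordering matches the Scenario 1 verdict; the point of the exercise is to rerun it with your own rate and hours.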
When ECS Fargate is the right call
A few patterns where Fargate clearly wins.
1. AWS-native applications
If your architecture leans hard on AWS services, ECS plugs in more naturally:
# ECS Task Definition with Native AWS Integration
{
"family": "web-app",
"taskRoleArn": "arn:aws:iam::account:role/ecsTaskRole",
"executionRoleArn": "arn:aws:iam::account:role/ecsExecutionRole",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "app",
"image": "account.dkr.ecr.region.amazonaws.com/app:latest",
"cpu": 512,
"memory": 1024,
"essential": true,
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/web-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"secrets": [
{
"name": "DATABASE_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:region:account:secret:db-password"
}
],
"environment": [
{
"name": "AWS_REGION",
"value": "us-east-1"
}
]
}
],
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024"
}
Good fit when:
- Applications use RDS, DynamoDB, S3, SQS, or SNS heavily
- The team already knows CloudFormation and the AWS APIs
- Microservices have straightforward deployment patterns
- There's no multi-cloud requirement on the roadmap
2. Small teams without Kubernetes expertise
ECS has a much shallower learning curve:
# Deploy a service in ECS (simple AWS CLI)
aws ecs create-service \
  --cluster production \
  --service-name web-app \
  --task-definition web-app:1 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123,subnet-def456],securityGroups=[sg-xyz789],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,containerName=app,containerPort=8080"
# No need to understand Kubernetes concepts like:
# - Pods, ReplicaSets, Deployments
# - ConfigMaps, Secrets (different from AWS Secrets)
# - Ingress Controllers, Service Mesh
# - RBAC, Pod Security Standards (PodSecurityPolicy was removed in Kubernetes 1.25)
# - CRDs, Operators, Helm charts
Good fit when:
- The team is 1 to 5 developers
- A startup is prioritizing speed over flexibility
- There's no dedicated DevOps or platform team
- Container deployments are straightforward
3. Variable or unpredictable workloads
Fargate's serverless model is a natural fit for traffic that's hard to predict:
# Auto-scaling configuration for ECS Fargate
import boto3
ecs = boto3.client('ecs')
autoscaling = boto3.client('application-autoscaling')
# Register scalable target
autoscaling.register_scalable_target(
ServiceNamespace='ecs',
ResourceId='service/production/web-app',
ScalableDimension='ecs:service:DesiredCount',
MinCapacity=2,
MaxCapacity=50, # Scale from 2 to 50 tasks automatically
RoleARN='arn:aws:iam::account:role/ecsAutoscaleRole'
)
# Target tracking based on CPU
autoscaling.put_scaling_policy(
PolicyName='cpu-scaling',
ServiceNamespace='ecs',
ResourceId='service/production/web-app',
ScalableDimension='ecs:service:DesiredCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 70.0, # Target 70% CPU
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'ECSServiceAverageCPUUtilization'
},
'ScaleInCooldown': 300,
'ScaleOutCooldown': 60
}
)
What you get out of it:
- No capacity planning, no over-provisioning
- You pay for actual usage during spikes
- Automatic scale-down during quiet periods
- No idle instances burning money
Good fit when:
- APIs have unpredictable traffic
- Batch processing volumes vary widely
- Dev and staging environments are intermittent
- Event-driven architectures fire sporadically
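For intuition about what a 70% CPU target does to the task count, here is a rough model. AWS does not publish the exact target tracking algorithm, so the proportional rule below is an assumption borrowed from the formula Kubernetes documents for its Horizontal Pod Autoscaler. The function name is hypothetical, and cooldowns are ignored.

```python
import math


def desired_task_count(current_tasks: int, current_metric: float,
                       target_metric: float, min_capacity: int,
                       max_capacity: int) -> int:
    """Scale the task count in proportion to how far the metric is from target."""
    proposed = math.ceil(current_tasks * current_metric / target_metric)
    # Clamp to the bounds registered on the scalable target.
    return max(min_capacity, min(max_capacity, proposed))


# 4 tasks averaging 90% CPU against a 70% target, bounds 2 to 50
print(desired_task_count(4, 90.0, 70.0, 2, 50))
```

The clamp is the part worth remembering: a MaxCapacity set too low quietly caps the response to a real spike.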
4. Speed to production matters
ECS gets to production faster:
// Infrastructure as Code with AWS CDK for ECS
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
export class WebAppStack extends cdk.Stack {
constructor(scope: cdk.App, id: string) {
super(scope, id);
// Create Fargate service with ALB in ~50 lines
const loadBalancedService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'WebApp', {
taskImageOptions: {
image: ecs.ContainerImage.fromRegistry('nginx'),
containerPort: 80,
environment: {
ENVIRONMENT: 'production'
},
},
cpu: 512,
memoryLimitMiB: 1024,
desiredCount: 3,
publicLoadBalancer: true
});
// Auto-scaling based on requests
const scaling = loadBalancedService.service.autoScaleTaskCount({
minCapacity: 2,
maxCapacity: 10
});
scaling.scaleOnRequestCount('RequestScaling', {
requestsPerTarget: 1000,
targetGroup: loadBalancedService.targetGroup
});
}
}
Time from zero to production:
- ECS Fargate: 15 to 30 minutes
- EKS: 2 to 4 hours (cluster creation plus configuration plus deployment)
When EKS is the right call
A few scenarios where Kubernetes earns the complexity tax.
1. Multi-cloud or hybrid strategy
Kubernetes manifests are portable across providers:
# Standard Kubernetes Deployment (works on EKS, GKE, AKS, on-prem)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
        version: v1.2.0
    spec:
      containers:
        - name: app
          image: myapp:1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
  namespace: production  # Must match the Deployment's namespace for the selector to find its pods
spec:
  type: LoadBalancer
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 8080
What you get:
- The same manifests work on any Kubernetes cluster
- Less lock-in to AWS-specific APIs
- Easier migration between providers
- Hybrid deployments with on-prem components are tractable
Good fit when:
- The organization has a multi-cloud policy
- A future cloud migration is on the table
- Hybrid architectures already have on-prem components
- Cloud portability is a contract negotiation lever
2. Complex microservices with advanced networking
The Kubernetes ecosystem has sophisticated networking primitives:
# Service Mesh with Istio on EKS
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-app-routes
spec:
  hosts:
    - web-app.example.com
  http:
    # Premium users always get v2
    - match:
        - headers:
            x-user-type:
              exact: premium
      route:
        - destination:
            host: web-app
            subset: v2
          weight: 100
    # Everyone else: 90/10 canary split
    - route:
        - destination:
            host: web-app
            subset: v1
          weight: 90
        - destination:
            host: web-app
            subset: v2
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-app-circuit-breaker
spec:
  host: web-app
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:
      consecutive5xxErrors: 5  # Replaces the deprecated consecutiveErrors field
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
What you get:
- A service mesh (Istio, Linkerd) for mTLS, traffic shaping, and observability
- Mature ingress controllers (NGINX, Traefik, Ambassador)
- Network policies for fine-grained security
- Cross-cluster service discovery and federation
Good fit when:
- There are 20+ microservices with complex inter-service communication
- Canary deployments, A/B testing, and traffic splitting are part of the model (the patterns from zero-downtime deployments live here)
- The security model needs mTLS between services
- Deep observability and tracing are non-negotiable
3. Existing Kubernetes investment
If the team already knows Kubernetes, EKS is the familiar option:
# Standard kubectl commands work identically
kubectl get pods -n production
kubectl logs -f deployment/web-app
kubectl exec -it web-app-pod-xyz -- /bin/bash
kubectl port-forward svc/web-app 8080:80
kubectl apply -f manifests/
kubectl rollout status deployment/web-app
kubectl rollout undo deployment/web-app
# Existing Helm charts work without modification
helm install my-app ./charts/web-app \
--namespace production \
--values production-values.yaml
# CI/CD pipelines require minimal changes
# All existing Kubernetes tools, scripts, and automation work
Good fit when:
- The team has Certified Kubernetes Administrators (CKA) on staff
- There are existing Kubernetes tooling investments worth preserving
- A migration from self-managed Kubernetes is in flight
- Projects rely on complex Helm charts and operators
4. Batch processing and ML workloads
Kubernetes is strong at complex job scheduling and GPU workloads:
# Batch processing with Kubernetes Jobs
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  parallelism: 10          # Run 10 pods in parallel
  completions: 100         # Process 100 tasks total
  completionMode: Indexed  # Give each pod a completion index from 0 to 99
  template:
    spec:
      containers:
        - name: processor
          image: data-processor:1.0
          resources:
            requests:
              cpu: 2
              memory: 4Gi
              nvidia.com/gpu: 1  # Request GPU
            limits:
              cpu: 4
              memory: 8Gi
              nvidia.com/gpu: 1
          env:
            # Expose the completion index, not the pod name, as the task index
            - name: TASK_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
      restartPolicy: OnFailure
      nodeSelector:
        workload-type: compute-intensive
---
# CronJob for scheduled tasks
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"  # Run at 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: report-generator
              image: report-tool:latest
              command: ["python", "generate_report.py"]
          restartPolicy: OnFailure
Good fit when:
- ML training pipelines run on Kubeflow or MLflow
- Data processing jobs use Apache Spark on Kubernetes
- Scheduled batch workloads (ETL, reports, data sync) need orchestration
- GPU workloads require specialized instance types
Migration paths
From ECS to EKS
If a migration is on the table, work through it in phases:
# Phase 1: Create EKS cluster
import boto3
eks = boto3.client('eks')
# Create EKS cluster
cluster = eks.create_cluster(
name='production-eks',
version='1.28',
roleArn='arn:aws:iam::account:role/eks-cluster-role',
resourcesVpcConfig={
'subnetIds': ['subnet-abc123', 'subnet-def456', 'subnet-ghi789'],
'securityGroupIds': ['sg-xyz789'],
'endpointPublicAccess': False,
'endpointPrivateAccess': True
},
logging={
'clusterLogging': [
{
'types': ['api', 'audit', 'authenticator', 'controllerManager', 'scheduler'],
'enabled': True
}
]
}
)
print(f"Cluster creating: {cluster['cluster']['name']}")
# Phase 2: Convert ECS task definition to Kubernetes Deployment
# ECS Task Definition (original)
# {
# "family": "web-app",
# "cpu": "512",
# "memory": "1024",
# "containerDefinitions": [{
# "name": "app",
# "image": "myapp:1.0",
# "portMappings": [{"containerPort": 8080}]
# }]
# }
# Kubernetes Deployment (equivalent)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: app
          image: myapp:1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 500m       # 512 CPU units = 0.5 vCPU
              memory: 1024Mi  # 1024 MB
            limits:
              cpu: 500m
              memory: 1024Mi
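The resource mapping in that conversion is mechanical, so it is worth scripting once there are more than a handful of services. Below is a minimal sketch, assuming single-container tasks and task-level `cpu` and `memory` values. The helper names are hypothetical, and it deliberately skips secrets, IAM roles, logging, and health checks, which need manual translation.

```python
def ecs_cpu_to_millicores(cpu_units: int) -> str:
    """ECS counts 1024 units per vCPU; Kubernetes counts 1000 millicores."""
    return f"{cpu_units * 1000 // 1024}m"


def task_def_to_deployment(task_def: dict, replicas: int = 3) -> dict:
    """Build a Kubernetes Deployment dict from a single-container ECS task definition."""
    name = task_def["family"]
    container = task_def["containerDefinitions"][0]
    resources = {
        "cpu": ecs_cpu_to_millicores(int(task_def["cpu"])),
        "memory": f'{task_def["memory"]}Mi',
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": container["name"],
                        "image": container["image"],
                        "ports": [{"containerPort": p["containerPort"]}
                                  for p in container.get("portMappings", [])],
                        # Equal requests and limits mirror Fargate's fixed task size.
                        "resources": {"requests": dict(resources),
                                      "limits": dict(resources)},
                    }],
                },
            },
        },
    }


TASK_DEF = {
    "family": "web-app",
    "cpu": "512",
    "memory": "1024",
    "containerDefinitions": [{
        "name": "app",
        "image": "myapp:1.0",
        "portMappings": [{"containerPort": 8080}],
    }],
}

deployment = task_def_to_deployment(TASK_DEF)
print(deployment["spec"]["template"]["spec"]["containers"][0]["resources"])
```

Serialize the result with whatever YAML or JSON tooling the pipeline already uses; `kubectl apply -f` accepts JSON manifests as well as YAML.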
Migration phases:
- Parallel run for 1 to 2 months, both ECS and EKS with traffic split
- Service-by-service migration, non-critical services first
- Data layer sync, so databases and caches work with both sides
- Gradual traffic shift using ALB weighted targets or Route 53
- ECS decommission once EKS has been stable in production for a while
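For the gradual traffic shift, I find it helps to write the weight schedule down before touching the load balancer. The sketch below only produces the pairs of weights; the step percentages are hypothetical, and applying them to ALB target groups or Route 53 records is left to whatever tooling you already use.

```python
from typing import Iterator, Sequence, Tuple


def shift_schedule(eks_percentages: Sequence[int] = (5, 25, 50, 100)
                   ) -> Iterator[Tuple[int, int]]:
    """Yield (ecs_weight, eks_weight) pairs that always sum to 100."""
    for pct in eks_percentages:
        if not 0 <= pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
        yield 100 - pct, pct


for ecs_weight, eks_weight in shift_schedule():
    print(f"ECS {ecs_weight:>3}%  ->  EKS {eks_weight:>3}%")
```

Holding each step long enough to see a full traffic cycle matters more than the exact percentages.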
Starting fresh: a decision tree
For new projects, walk through this:
START
|
v
Do you need multi-cloud portability?
|
+-- YES --> Choose EKS
|
+-- NO
|
v
Do you have Kubernetes expertise on team?
|
+-- YES --> Choose EKS
|
+-- NO
|
v
Do you have 20+ microservices with complex networking?
|
+-- YES --> Invest in Kubernetes, choose EKS
|
+-- NO
|
v
Do you need batch/ML workloads with GPU?
|
+-- YES --> Choose EKS
|
+-- NO
|
v
Is your team < 10 developers?
|
+-- YES --> Choose ECS Fargate
|
+-- NO
|
v
Is cost optimization critical (high utilization)?
|
+-- YES --> Choose ECS with EC2 launch type
|
+-- NO --> Choose ECS Fargate
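The tree is linear enough to encode as a function, which is handy if you want the rationale checked into a repo next to the architecture decision record. This is a sketch of the tree exactly as drawn; the `Project` fields are names I made up for the questions.

```python
from dataclasses import dataclass


@dataclass
class Project:
    needs_multi_cloud: bool = False
    has_k8s_expertise: bool = False
    complex_microservices: bool = False  # 20+ services with complex networking
    needs_gpu_batch: bool = False        # batch/ML workloads with GPU
    team_size: int = 5
    cost_critical_high_utilization: bool = False


def recommend(p: Project) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if (p.needs_multi_cloud or p.has_k8s_expertise
            or p.complex_microservices or p.needs_gpu_batch):
        return "EKS"
    if p.team_size < 10:
        return "ECS Fargate"
    if p.cost_critical_high_utilization:
        return "ECS with EC2 launch type"
    return "ECS Fargate"


print(recommend(Project(team_size=4)))
print(recommend(Project(team_size=40, cost_critical_high_utilization=True)))
```

Order matters here: any of the four EKS questions short-circuits the rest, which is the same bias the tree has.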
Deployment patterns that work
ECS Fargate deployment
# CloudFormation template for ECS Fargate
AWSTemplateFormatVersion: '2010-09-09'
Description: 'ECS Fargate Service with ALB'
Resources:
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: production
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: web-app
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      Cpu: '512'
      Memory: '1024'
      ExecutionRoleArn: !GetAtt ExecutionRole.Arn
      TaskRoleArn: !GetAtt TaskRole.Arn
      ContainerDefinitions:
        - Name: app
          Image: !Sub '${AWS::AccountId}.dkr.ecr.${AWS::Region}.amazonaws.com/web-app:latest'
          PortMappings:
            - ContainerPort: 8080
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: ecs
          Environment:
            - Name: ENVIRONMENT
              Value: production
          Secrets:
            - Name: DB_PASSWORD
              ValueFrom: !Sub 'arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:db-password'
  Service:
    Type: AWS::ECS::Service
    DependsOn: LoadBalancerListener
    Properties:
      ServiceName: web-app
      Cluster: !Ref Cluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          SecurityGroups:
            - !Ref ServiceSecurityGroup
          Subnets:
            - !Ref PrivateSubnet1
            - !Ref PrivateSubnet2
          AssignPublicIp: DISABLED
      LoadBalancers:
        - ContainerName: app
          ContainerPort: 8080
          TargetGroupArn: !Ref TargetGroup
      HealthCheckGracePeriodSeconds: 60
  AutoScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 10
      MinCapacity: 2
      ResourceId: !Sub 'service/${Cluster}/${Service.Name}'
      RoleARN: !GetAtt AutoScalingRole.Arn
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
  AutoScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-scaling
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref AutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 70.0
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 300
        ScaleOutCooldown: 60
EKS deployment with Terraform
# Terraform configuration for EKS cluster
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.0"
cluster_name = "production-eks"
cluster_version = "1.28"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
# Enable IRSA (IAM Roles for Service Accounts)
enable_irsa = true
# Managed node groups
eks_managed_node_groups = {
general = {
desired_size = 3
min_size = 2
max_size = 10
instance_types = ["t3.large"]
capacity_type = "ON_DEMAND"
labels = {
workload-type = "general"
}
taints = []
}
compute_intensive = {
desired_size = 1
min_size = 0
max_size = 5
instance_types = ["c5.2xlarge"]
capacity_type = "SPOT"
labels = {
workload-type = "compute-intensive"
}
taints = [{
key = "workload-type"
value = "compute-intensive"
effect = "NO_SCHEDULE"
}]
}
}
# Fargate profiles for serverless pods
fargate_profiles = {
default = {
name = "default"
selectors = [
{
namespace = "kube-system"
labels = {
k8s-app = "kube-dns"
}
},
{
namespace = "staging"
}
]
}
}
# Cluster addons
cluster_addons = {
coredns = {
most_recent = true
}
kube-proxy = {
most_recent = true
}
vpc-cni = {
most_recent = true
}
aws-ebs-csi-driver = {
most_recent = true
}
}
# CloudWatch logging
cluster_enabled_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
tags = {
Environment = "production"
Terraform = "true"
}
}
# Install AWS Load Balancer Controller
resource "helm_release" "aws_load_balancer_controller" {
name = "aws-load-balancer-controller"
repository = "https://aws.github.io/eks-charts"
chart = "aws-load-balancer-controller"
namespace = "kube-system"
set {
name = "clusterName"
value = module.eks.cluster_name
}
set {
name = "serviceAccount.create"
value = "true"
}
set {
name = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
value = aws_iam_role.aws_load_balancer_controller.arn
}
}
# Cluster Autoscaler
resource "helm_release" "cluster_autoscaler" {
name = "cluster-autoscaler"
repository = "https://kubernetes.github.io/autoscaler"
chart = "cluster-autoscaler"
namespace = "kube-system"
set {
name = "autoDiscovery.clusterName"
value = module.eks.cluster_name
}
set {
name = "awsRegion"
value = var.aws_region
}
set {
name = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
value = aws_iam_role.cluster_autoscaler.arn
}
}
Monitoring and observability
ECS monitoring
# CloudWatch monitoring for ECS with boto3
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

# Get ECS service metrics for the last hour (timezone-aware; utcnow() is deprecated)
now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/ECS',
    MetricName='CPUUtilization',
    Dimensions=[
        {'Name': 'ServiceName', 'Value': 'web-app'},
        {'Name': 'ClusterName', 'Value': 'production'}
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5 minutes
    Statistics=['Average', 'Maximum']
)

# Datapoints are not returned in time order, so sort before printing
for datapoint in sorted(response['Datapoints'], key=lambda d: d['Timestamp']):
    print(f"Time: {datapoint['Timestamp']}, Avg CPU: {datapoint['Average']:.2f}%")
# Create CloudWatch alarm
cloudwatch.put_metric_alarm(
AlarmName='ecs-high-cpu',
ComparisonOperator='GreaterThanThreshold',
EvaluationPeriods=2,
MetricName='CPUUtilization',
Namespace='AWS/ECS',
Period=300,
Statistic='Average',
Threshold=80.0,
ActionsEnabled=True,
AlarmActions=['arn:aws:sns:region:account:ops-alerts'],
AlarmDescription='Alert when ECS service CPU exceeds 80%',
Dimensions=[
{'Name': 'ServiceName', 'Value': 'web-app'},
{'Name': 'ClusterName', 'Value': 'production'}
]
)
EKS monitoring
# Prometheus monitoring for EKS
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      # Kubernetes API server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Kubernetes nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      # Kubernetes pods
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
      # Application metrics
      # Assumes the web-app Service in the production namespace, which exposes
      # port 80 and forwards to the container's 8080.
      - job_name: 'web-app'
        static_configs:
          - targets: ['web-app-service.production.svc:80']
        metrics_path: '/metrics'
---
# Grafana dashboard for EKS
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-eks
  namespace: monitoring
data:
  eks-cluster.json: |
    {
      "dashboard": {
        "title": "EKS Cluster Overview",
        "panels": [
          {
            "title": "Pod CPU Usage",
            "targets": [{
              "expr": "sum(rate(container_cpu_usage_seconds_total{pod!=\"\"}[5m])) by (pod)"
            }]
          },
          {
            "title": "Pod Memory Usage",
            "targets": [{
              "expr": "sum(container_memory_working_set_bytes{pod!=\"\"}) by (pod) / 1024 / 1024"
            }]
          },
          {
            "title": "Network I/O",
            "targets": [
              {
                "expr": "sum(rate(container_network_receive_bytes_total[5m])) by (pod)"
              },
              {
                "expr": "sum(rate(container_network_transmit_bytes_total[5m])) by (pod)"
              }
            ]
          }
        ]
      }
    }
How I actually decide
Picking between ECS Fargate and EKS is rarely about which is "better." It's about matching the tool to the workload, the team, and the constraints. ECS Fargate is the fastest path to production with the least operational overhead, and that matters more than most architectural arguments admit. EKS gives you portability and the full Kubernetes ecosystem, which is genuinely useful when the workload calls for it and painful when it doesn't.
For most teams starting fresh on AWS without a multi-cloud mandate, ECS Fargate has the lowest total cost of ownership once you factor in engineering hours. Teams that already run Kubernetes, or that need advanced networking primitives, should go with EKS. Large-scale, high-utilization workloads do better on EC2 launch types in either system, because Fargate's per-task premium adds up at scale.
If I had to compress it: team expertise, architectural complexity, and portability requirements decide it. Use the framework above. Skip the vendor talk tracks.
A reasonable next move
- Assess requirements against the decision tree above
- Calculate total cost of ownership including engineering time, not just compute
- Start with a pilot, ideally a non-critical service, to validate the choice
- Track deployment frequency, time to production, and operational overhead
- Document the rationale somewhere durable so the next architect doesn't redo the work
- Plan for evolution; nothing says you can't migrate later
- Invest in training for whichever platform you pick