Observability: CloudWatch, X-Ray & CloudTrail

CloudWatch metrics & alarms

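A metric is a time series of datapoints, identified by namespace + metric name + dimensions, where each datapoint carries a timestamp and a value. AWS services publish metrics automatically; your applications publish custom metrics. Alarms watch metrics and, when a threshold is breached, trigger actions such as SNS notifications or Auto Scaling policies; every alarm state change is also emitted as an EventBridge event.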

Namespaces and dimensions

Every metric lives in a namespace — a logical grouping like AWS/EC2, AWS/ECS, AWS/ApplicationELB, or your own OrderService. Within a namespace, metrics are distinguished by dimensions — key/value pairs that slice the data.

Namespace	Example metric	Typical dimensions
AWS/ECS	CPUUtilization	ClusterName, ServiceName
AWS/ApplicationELB	TargetResponseTime	LoadBalancer, TargetGroup
AWS/Lambda	Duration	FunctionName
OrderService (custom)	OrdersPlaced	Environment, Region

Dimensions are how you avoid averaging away signal. An alarm on CPUUtilization without a ServiceName dimension averages every ECS service in the cluster — useless for paging on a single failing service.

Standard vs detailed monitoring

Standard monitoring (free for most services) publishes at 1-minute intervals; EC2 defaults to 5-minute unless you enable detailed monitoring (paid, 1-minute). Detailed monitoring matters when Auto Scaling must react within a minute. RDS Enhanced Monitoring is separate — OS-level metrics at sub-minute granularity via a CloudWatch Logs channel.

Custom metrics and cost

Publishing custom metrics via PutMetricData costs per metric per month (the free tier covers only 10 custom metrics; the first 10,000 billed metrics sit in the most expensive pricing tier). Each unique combination of metric name + dimension values counts as a separate metric. High-cardinality dimensions (user IDs, request IDs, order IDs) explode your bill and push you into the per-Region PutMetricData request-rate quota (check Service Quotas for the current limit).

Low cardinality only: environment, region, service name, HTTP status class (4xx, 5xx)
Aggregate before publish: count errors in-process, publish one counter per minute
Prefer EMF (see Advanced section) — embed metrics in log lines, no per-call API charge
High-resolution metrics (1-second granularity) are priced like any other custom metric, but alarms on them cost roughly 3× a standard-resolution alarm. Use them only for burst detection

💰 Cost

Custom metrics at $0.30/metric/month add up fast: 500 unique dimension combinations × $0.30 × 12 months = $1,800/year for metrics alone. A team publishing latency_ms per endpoint (20 endpoints) × environment (3) × region (2) = 120 metrics, about $36/month: manageable. Add customer_id and you're bankrupt.
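
The cardinality arithmetic above can be sketched as a small calculator. This is only an illustration: the $0.30 price is the figure quoted in this section (real pricing is tiered and regional), and the customer count is a made-up number.

```typescript
// Price quoted in the text; real CloudWatch pricing is tiered and varies by region.
const PRICE_PER_METRIC_USD = 0.3;

// Every unique combination of dimension values is billed as its own metric,
// so the metric count is the product of the per-dimension cardinalities.
function metricCount(cardinalities: Record<string, number>): number {
  return Object.values(cardinalities).reduce((product, n) => product * n, 1);
}

function monthlyCostUsd(cardinalities: Record<string, number>): number {
  return metricCount(cardinalities) * PRICE_PER_METRIC_USD;
}

const lowCardinality = { endpoint: 20, environment: 3, region: 2 };
// 50000 customers is a hypothetical figure, for illustration only.
const withCustomerId = { ...lowCardinality, customer_id: 50000 };

console.log(metricCount(lowCardinality), monthlyCostUsd(lowCardinality).toFixed(2));
console.log(metricCount(withCustomerId), monthlyCostUsd(withCustomerId).toFixed(2));
```

One extra dimension multiplies the whole product, which is why a single high-cardinality key dominates the bill.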

Alarm types and evaluation

Alarm type	What it watches	When to use
Static threshold	Metric >, ≥, <, or ≤ a fixed value for M of N periods	CPU > 80%, ALB 5xx > 10/min: simple SLO breaches
Anomaly detection	ML band around expected range (no manual threshold)	Traffic patterns that vary by day-of-week; seasonal workloads
Composite alarm	Boolean logic across multiple alarms (AND/OR/NOT)	Page only when both error rate is high and latency is high — reduces noise
Math expression	Derived metric: m1/m2*100 error percentage	Calculate error rate from RequestCount and HTTPCode_Target_5XX_Count
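
How a static-threshold alarm decides its state can be sketched as an "M of N" check. This is a simplified model, not CloudWatch's actual evaluator: it assumes one datapoint per period, supports only a greater-than comparison, and ignores missing-data handling. The names are illustrative.

```typescript
type AlarmState = "OK" | "ALARM";

interface StaticThresholdAlarm {
  threshold: number;
  evaluationPeriods: number; // N: how many recent periods are inspected
  datapointsToAlarm: number; // M: how many of those must breach
}

// Simplified: datapoints are ordered oldest first, one per period.
function evaluate(alarm: StaticThresholdAlarm, datapoints: number[]): AlarmState {
  const recent = datapoints.slice(-alarm.evaluationPeriods);
  const breaching = recent.filter((value) => value > alarm.threshold).length;
  return breaching >= alarm.datapointsToAlarm ? "ALARM" : "OK";
}

const high5xx: StaticThresholdAlarm = { threshold: 10, evaluationPeriods: 3, datapointsToAlarm: 2 };

console.log(evaluate(high5xx, [2, 14, 3]));  // one breach in the last three periods
console.log(evaluate(high5xx, [14, 3, 22])); // two breaches in the last three periods
```

Requiring 2 of 3 rather than 3 of 3 keeps the alarm responsive while still ignoring a single noisy minute.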

Composite alarms — reduce pager fatigue

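Individual alarms on CPU, memory, and 5xx rate each fire independently during a deploy or traffic spike. A composite alarm combines child alarms: e.g. ALARM("High5xx") AND ALARM("HighLatency") pages on-call, while ALARM("HighCPU") OR ALARM("HighMemory") goes to a low-urgency channel. Composite alarms can notify SNS (and open Systems Manager OpsItems or incidents) but cannot run Auto Scaling or EC2 actions, so scaling policies stay attached to the child metric alarms. Each composite alarm is billed as its own alarm on top of its children, which is usually cheap next to the pages it suppresses.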
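
The boolean logic of a composite rule can be sketched with plain predicates. This models only the AND/OR combination of child states; the helper names are hypothetical and are not the CloudWatch or CDK API.

```typescript
type ChildStates = Record<string, "OK" | "ALARM" | "INSUFFICIENT_DATA">;
type Rule = (states: ChildStates) => boolean;

// ALARM("name") is true only when that child alarm is currently in ALARM.
const alarm = (alarmName: string): Rule => (states) => states[alarmName] === "ALARM";
const allOf = (...rules: Rule[]): Rule => (states) => rules.every((rule) => rule(states));
const anyOf = (...rules: Rule[]): Rule => (states) => rules.some((rule) => rule(states));

const customerImpact = allOf(alarm("High5xx"), alarm("HighLatency"));
const capacityPressure = anyOf(alarm("HighCPU"), alarm("HighMemory"));

// A deploy that briefly spikes 5xx and CPU, while latency and memory stay healthy.
const duringDeploy: ChildStates = {
  High5xx: "ALARM",
  HighLatency: "OK",
  HighCPU: "ALARM",
  HighMemory: "OK",
};

console.log(customerImpact(duringDeploy));   // false: nobody is paged
console.log(capacityPressure(duringDeploy)); // true: low-urgency signal only
```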

Create an ALB 5xx composite alarm

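# Child alarm: 5xx count > 10 in 2 of the last 3 one-minute periods
# (ALB emits no 5xx datapoint when there are no errors, so missing data is treated as not breaching)
aws cloudwatch put-metric-alarm \
  --alarm-name order-service-high-5xx \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/order-alb/abc123 \
  --statistic Sum --period 60 \
  --evaluation-periods 3 --datapoints-to-alarm 2 \
  --treat-missing-data notBreaching \
  --threshold 10 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts

# Composite: page only when 5xx AND latency both breach
# (assumes a second child alarm, order-service-high-latency, already exists)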
aws cloudwatch put-composite-alarm \
  --alarm-name order-service-customer-impact \
  --alarm-rule 'ALARM(order-service-high-5xx) AND ALARM(order-service-high-latency)' \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:pagerduty

resource "aws_cloudwatch_metric_alarm" "high_5xx" {
  alarm_name          = "order-service-high-5xx"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  datapoints_to_alarm = 2 # alarm on 2 of 3 periods
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
  dimensions = {
    LoadBalancer = aws_lb.app.arn_suffix
  }
}

# Assumes aws_cloudwatch_metric_alarm.high_latency is defined alongside high_5xx
resource "aws_cloudwatch_composite_alarm" "customer_impact" {
  alarm_name        = "order-service-customer-impact"
  alarm_description = "Page when errors AND latency both elevated"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.high_5xx.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_latency.alarm_name})"
  alarm_actions     = [aws_sns_topic.pagerduty.arn]
}

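// Assumes alb, highLatency (a second cloudwatch.Alarm), and pagerDutyTopic are defined elsewhere in the stack
const high5xx = new cloudwatch.Alarm(this, 'High5xx', {
  metric: alb.metricHttpCodeTarget(elbv2.HttpCodeTarget.TARGET_5XX_COUNT),
  threshold: 10, evaluationPeriods: 3, datapointsToAlarm: 2, // alarm on 2 of 3 periods
});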
new cloudwatch.CompositeAlarm(this, 'CustomerImpact', {
  alarmRule: cloudwatch.AlarmRule.allOf(high5xx, highLatency),
}).addAlarmAction(new cloudwatchActions.SnsAction(pagerDutyTopic));

⚠️ Pitfall

Alarms on Average statistic hide outliers. ALB TargetResponseTime with Average looks fine while p99 is 8 seconds. Use p99 or p99.9 for latency SLOs; use Sum for error counts.
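
A quick numeric sketch of why Average hides the tail. The latency samples are invented for illustration, and the percentile uses the simple nearest-rank definition, which may differ slightly from how CloudWatch computes percentile statistics.

```typescript
function average(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[Math.max(rank, 1) - 1];
  if (value === undefined) throw new Error("percentile of an empty sample set");
  return value;
}

// 98 requests at 100 ms and 2 requests at 8 seconds (made-up data).
const latenciesMs: number[] = new Array<number>(98).fill(100).concat([8000, 8000]);

console.log(average(latenciesMs));        // 258
console.log(percentile(latenciesMs, 50)); // 100
console.log(percentile(latenciesMs, 99)); // 8000
```

An Average alarm at 500 ms never fires on this data, while a p99 alarm sees the 8-second requests immediately.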

CloudWatch Logs

Logs are the narrative of what happened inside your application. CloudWatch Logs stores them in log groups (the container) and log streams (per-source sequences). Ingestion, storage, and Insights queries are all billed — retention policy is your first cost lever.

Log groups and log streams

Concept	What it is	Naming convention
Log group	Container for related streams; holds retention, KMS key, subscription filters	/ecs/order-service, /aws/lambda/checkout
Log stream	Sequence of events from one source; auto-created	ECS: ecs/app/abc123def (prefix/container name/task ID)
Log event	Single line or JSON blob + timestamp (max 256 KB/event)	Structured JSON preferred for Insights queries

ECS Fargate tasks ship stdout/stderr to CloudWatch via the awslogs log driver configured in the task definition. Each task gets its own log stream — when the task restarts, a new stream appears. Spring Boot apps should log JSON (Logback + logstash encoder) so Insights can parse fields without regex hell.
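
What a structured log line looks like is independent of the logging framework. The sketch below builds one in TypeScript purely to show the shape; the field names mirror the ones used in the Insights queries in this document (level, traceId, http.status, http.path, duration_ms), and the function name is hypothetical.

```typescript
interface LogFields {
  traceId?: string;
  [key: string]: unknown;
}

// One JSON object per line: Insights discovers the fields without any regex parsing.
function formatLogLine(
  level: "INFO" | "WARN" | "ERROR",
  message: string,
  fields: LogFields = {},
): string {
  return JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...fields });
}

const logLine = formatLogLine("ERROR", "payment declined", {
  traceId: "1-5759e988-bd862e3fe1be46a994272793",
  http: { status: 502, path: "/orders" },
  duration_ms: 187,
});

// Written to stdout, this is the line the awslogs driver ships to the task's log stream.
console.log(logLine);
```

Nested objects such as http are addressed in Insights with dot notation (http.status), which is why the queries in this section can filter on them directly.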

Retention — set it on day one

Default retention is "Never expire", the most expensive mistake in observability. Production guidance: 30 days hot in CloudWatch Logs, then archive to S3 via subscription filter + Firehose if compliance requires longer. Dev/staging: 7–14 days.

Ingestion: ~$0.50/GB (varies by region)
Storage: ~$0.03/GB/month
Insights queries: $0.005/GB scanned — scoping by time range and fields saves money
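
A rough sketch of how retention and query scope move the bill, using the approximate prices listed above. The 5 GB/day volume is illustrative, and the model ignores that stored logs are compressed, so treat the output as an upper-bound estimate rather than a quote.

```typescript
// Approximate prices from the list above; check current regional pricing.
const STORAGE_USD_PER_GB_MONTH = 0.03;
const INSIGHTS_USD_PER_GB_SCANNED = 0.005;

// Stored volume levels off at dailyGb * retentionDays once the retention window is full.
function steadyStateStorageUsd(dailyGb: number, retentionDays: number): number {
  return dailyGb * retentionDays * STORAGE_USD_PER_GB_MONTH;
}

// Insights scans every byte in the selected time range, so cost scales with the window.
function insightsQueryUsd(dailyGb: number, daysScanned: number): number {
  return dailyGb * daysScanned * INSIGHTS_USD_PER_GB_SCANNED;
}

console.log(steadyStateStorageUsd(5, 30).toFixed(2));  // monthly storage with 30-day retention
console.log(steadyStateStorageUsd(5, 365).toFixed(2)); // monthly storage after a year without expiry
console.log(insightsQueryUsd(5, 1).toFixed(3));        // one query over the last day
console.log(insightsQueryUsd(5, 30).toFixed(3));       // one query over the last 30 days
```

With no retention policy the storage line keeps growing every month; with a policy it flattens at the window size.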

CloudWatch Logs Insights queries

Insights is a query language over log events in a log group (or multiple groups). It does not index like Elasticsearch — it scans. Narrow time windows and filter early with | filter before | stats.

Find 5xx errors with request context

fields @timestamp, @message, level, traceId, http.status, http.path, exception.message
| filter level = "ERROR" or http.status >= 500
| sort @timestamp desc
| limit 100

p99 latency by endpoint (structured JSON logs)

fields http.path, duration_ms
| filter ispresent(duration_ms) and http.status = 200
| stats pct(duration_ms, 50) as p50,
        pct(duration_ms, 95) as p95,
        pct(duration_ms, 99) as p99,
        count(*) as requests
  by http.path
| sort p99 desc
| limit 20

Subscription filters

Subscription filters forward matching log events in near-real-time to Lambda, Kinesis Data Streams, Amazon Data Firehose (S3, OpenSearch, Splunk), or a cross-account destination for centralization in a logging account. Each log group supports only a small number of subscription filters (two at the time of writing), so fan out downstream rather than stacking filters.

Metric filters — logs to alarms without custom code

Metric filters extract numeric values or count pattern matches from log events and publish CloudWatch metrics. Classic pattern: count ERROR lines → alarm when error rate spikes. The metric appears in a namespace you choose (often the log group name).
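
What a counting metric filter does can be modelled in a few lines. This is a rough local simulation of the JSON pattern { $.level = "ERROR" } with a metric value of 1, not the CloudWatch matching engine; the function name and sample lines are made up.

```typescript
// Count log lines whose top-level JSON field equals the expected string.
// Lines that are not JSON objects never match a JSON pattern, so they are skipped.
function countMatches(logLines: string[], field: string, expected: string): number {
  let matches = 0;
  for (const rawLine of logLines) {
    try {
      const parsed = JSON.parse(rawLine) as Record<string, unknown>;
      if (parsed[field] === expected) matches += 1;
    } catch {
      // Not valid JSON (or not an object): ignored, like a non-matching event.
    }
  }
  return matches;
}

const sampleLines = [
  '{"level":"INFO","message":"order placed"}',
  '{"level":"ERROR","message":"payment declined"}',
  "plain text line that is not JSON",
  '{"level":"ERROR","message":"inventory timeout"}',
];

// The value a metric filter would publish for this batch (one datapoint of 2).
console.log(countMatches(sampleLines, "level", "ERROR"));
```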

Provision log group with retention and error metric filter

aws logs create-log-group --log-group-name /ecs/order-service

aws logs put-retention-policy \
  --log-group-name /ecs/order-service \
  --retention-in-days 30

aws logs put-metric-filter \
  --log-group-name /ecs/order-service \
  --filter-name ErrorCount \
  --filter-pattern '{ $.level = "ERROR" }' \
  --metric-transformations \
    metricName=ApplicationErrors,metricNamespace=OrderService,metricValue=1,defaultValue=0

resource "aws_cloudwatch_log_group" "order_service" {
  name              = "/ecs/order-service"
  retention_in_days = 30
  kms_key_id        = aws_kms_key.logs.arn
}

resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "ErrorCount"
  log_group_name = aws_cloudwatch_log_group.order_service.name
  pattern        = "{ $.level = \"ERROR\" }"

  metric_transformation {
    name          = "ApplicationErrors"
    namespace     = "OrderService"
    value         = "1"
    default_value = "0"
  }
}

import * as logs from 'aws-cdk-lib/aws-logs';

const logGroup = new logs.LogGroup(this, 'OrderServiceLogs', {
  logGroupName: '/ecs/order-service',
  retention: logs.RetentionDays.ONE_MONTH,
  encryptionKey: logsKey,
});

new logs.MetricFilter(this, 'ErrorFilter', {
  logGroup,
  metricNamespace: 'OrderService',
  metricName: 'ApplicationErrors',
  filterPattern: logs.FilterPattern.stringValue('$.level', '=', 'ERROR'),
  metricValue: '1',
  defaultValue: 0, // number in the CDK API, unlike metricValue which is a string
});

💡 Pro Tip

Save Insights queries in the console as query definitions and pin them to dashboards. For on-call runbooks, link directly to a pre-filled Insights query with the incident time window; it saves 5 minutes of fumbling during an outage.

Advanced CloudWatch

Beyond basic metrics and logs: the CloudWatch Agent for host-level telemetry, Container Insights for ECS/EKS, Synthetics for black-box probing, and Embedded Metric Format for high-throughput custom metrics without API throttling.

CloudWatch Agent

Installed on EC2 or on-prem via SSM, the agent collects host metrics (disk, memory, swap), custom metrics (StatsD, collectd), and arbitrary log files. Config lives in SSM Parameter Store. On Fargate, skip the agent — use awslogs + EMF instead.

Container Insights

Turn on at the ECS cluster or EKS cluster level. CloudWatch creates additional log groups (/aws/ecs/containerinsights/{cluster}/performance) and publishes enhanced metrics: per-task CPU/memory, network, storage, and service-level aggregates. Uses more log ingestion volume — budget for it. Replaces much of what you'd otherwise build with custom cAdvisor scraping.

CloudWatch Synthetics (canaries)

Scheduled scripts (for example Node.js with Puppeteer or Python with Selenium, driving headless Chrome) that hit your endpoints from AWS-managed locations. Detects DNS failures, SSL expiry, latency regressions, and broken deploys before customers do. Canaries publish metrics to the CloudWatchSynthetics namespace and can trigger alarms. Common pattern: canary logs in → API Gateway → Lambda health check every 5 minutes.

Embedded Metric Format (EMF)

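Write a JSON structure to stdout/stderr (or any log destination) and CloudWatch automatically extracts metrics from the log event, with no PutMetricData API calls. The extracted metrics are still billed as custom metrics, on top of log ingestion, so low cardinality matters here too. AWS publishes EMF client libraries (aws-embedded-metrics for Java, Node.js, Python, and others); note that the stock Micrometer CloudWatch registry publishes through PutMetricData rather than EMF.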

{
  "_aws": {
    "Timestamp": 1718000000000,
    "CloudWatchMetrics": [{
      "Namespace": "OrderService",
      "Dimensions": [["ServiceName", "Environment"]],
      "Metrics": [
        { "Name": "OrdersPlaced", "Unit": "Count" },
        { "Name": "CheckoutLatency", "Unit": "Milliseconds" }
      ]
    }]
  },
  "ServiceName": "order-service",
  "Environment": "production",
  "OrdersPlaced": 42,
  "CheckoutLatency": 187
}
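
A minimal sketch of building such a document by hand, to show which parts are metadata and which are values. The helper name and signature are hypothetical; in production the aws-embedded-metrics libraries handle this, including validation and flushing.

```typescript
type EmfUnit = "Count" | "Milliseconds";

interface MetricSample {
  name: string;
  unit: EmfUnit;
  value: number;
}

// Build one EMF log line: the _aws block declares metrics and dimensions,
// while the actual values and dimension values live as top-level members.
function buildEmf(
  namespace: string,
  dimensions: Record<string, string>,
  metrics: MetricSample[],
  timestampMs: number = Date.now(),
): string {
  const values: Record<string, number> = {};
  for (const metric of metrics) values[metric.name] = metric.value;
  return JSON.stringify({
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)],
          Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
    ...dimensions,
    ...values,
  });
}

const emfLine = buildEmf(
  "OrderService",
  { ServiceName: "order-service", Environment: "production" },
  [
    { name: "OrdersPlaced", unit: "Count", value: 42 },
    { name: "CheckoutLatency", unit: "Milliseconds", value: 187 },
  ],
  1718000000000,
);

console.log(emfLine); // print to stdout; the log pipeline delivers it to CloudWatch
```

Only keys listed under Dimensions become metric dimensions; any other top-level member is just a searchable log field, which is a cheap way to attach high-cardinality context such as an order ID.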

⚖️ Trade-off

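PutMetricData vs EMF: PutMetricData gives immediate metric availability but is subject to API request quotas and per-request charges. EMF piggybacks on log ingestion: no API calls to throttle, a slight delay, an ingestion charge for every metric line, and metrics disappear if log delivery fails. Both are billed per custom metric, and both support high-resolution (1s) metrics. For Spring Boot on ECS, EMF (for example via the aws-embedded-metrics library) is a common production default for high-throughput metrics.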

X-Ray distributed tracing

Metrics tell you something is wrong; traces tell you where in a request path. AWS X-Ray collects trace data from SDKs, agents, and integrated services (API Gateway, Lambda) and renders a service map plus per-request timelines. ALB takes part by adding the X-Amzn-Trace-Id header, but it does not send segments itself.

Traces, segments, and subsegments

Concept	Definition	Example
Trace	End-to-end path of one request across services	Client → ALB → order-service → payment-service → RDS
Segment	Work done by a single service; has timing, HTTP metadata, errors	order-service handling POST /orders
Subsegment	Finer-grained work within a segment	DynamoDB PutItem, downstream HTTP call, SQL query
Trace ID	Shared across all segments; propagates via HTTP header	X-Amzn-Trace-Id or W3C traceparent

Service map

X-Ray aggregates segments into a directed graph: nodes are services, edges are call volume and latency. Red nodes have elevated error rates; thicker edges show higher traffic. The service map is how you discover unexpected dependencies — "why is catalog-service calling legacy-billing?"

Sampling rules

Tracing every request at 10,000 RPS is expensive and hits X-Ray quotas. Sampling decides which requests get recorded. Default: first request per second + 5% of additional requests (reservoir + fixed rate). Custom sampling rules match on service name, HTTP method, URL path, or attributes.

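Always sample errors: not possible with a head-based rule, because the decision is made when the request starts and the outcome is not yet known; use tail-based sampling in an OpenTelemetry collector, or a priority-1 rule with a higher fixed rate on critical paths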
Sample 1% of health checks: reservoir 0, rate 0.01 on /actuator/health
Dev environment: 100% sampling; production: 5–10% usually sufficient
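
The reservoir + fixed rate behaviour described above can be sketched as a small head sampler. This is a simplified single-process model, not the X-Ray SDK's implementation (which coordinates reservoir quotas centrally); the random source is injectable so the behaviour can be tested deterministically.

```typescript
// Returns a function that decides, at request start, whether to record a trace.
function createHeadSampler(
  reservoirPerSecond: number,
  fixedRate: number,
  random: () => number = Math.random,
): (nowMs: number) => boolean {
  let currentSecond = -1;
  let takenThisSecond = 0;
  return (nowMs) => {
    const second = Math.floor(nowMs / 1000);
    if (second !== currentSecond) {
      currentSecond = second; // a new second refills the reservoir
      takenThisSecond = 0;
    }
    if (takenThisSecond < reservoirPerSecond) {
      takenThisSecond += 1;
      return true; // reservoir: guaranteed traces each second
    }
    return random() < fixedRate; // fixed rate applies to everything beyond the reservoir
  };
}

// Default-style rule: 1 trace per second, then 5% of the remainder.
const defaultRule = createHeadSampler(1, 0.05);

let sampledCount = 0;
for (let ms = 0; ms < 10000; ms++) {
  if (defaultRule(ms)) sampledCount += 1; // 10,000 requests spread over 10 seconds
}
console.log(`sampled ${sampledCount} of 10000 requests`);
```

The reservoir guarantees visibility for low-traffic services, while the fixed rate caps cost for busy ones. Note the decision uses nothing about the response, which is why errors cannot be targeted here.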

💰 Cost

X-Ray charges per trace recorded and retrieved. At 5% sampling on 1M requests/day: ~50K traces/day (about 200K segments at 4 segments each). Monitor trace volume in Cost Explorer under X-Ray. OpenTelemetry with tail-based sampling (sample only slow/error traces) can reduce cost further than head-based X-Ray defaults.

ADOT and OpenTelemetry

AWS Distro for OpenTelemetry (ADOT) is the AWS-supported OpenTelemetry distribution. Deploy the ADOT Collector as a sidecar on ECS/EKS to receive OTLP from your app and export to X-Ray, CloudWatch, or Prometheus. This is the migration path away from the legacy X-Ray SDK — OpenTelemetry is vendor-neutral; the collector handles backend routing.

App → OTLP → ADOT Collector sidecar → X-Ray (traces) + CloudWatch/EMF (metrics). Lambda uses the ADOT layer.

Java SDK — Spring Boot on ECS

For legacy X-Ray (pre-OTel migration), add the X-Ray SDK and servlet filter. The X-Ray daemon must run alongside the app — on ECS EC2 launch type as a daemon task; on Fargate as a sidecar container listening on UDP 2000.

// Custom subsegment around a downstream call
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

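public class PaymentClient {
  // restClient is an injected HTTP client wrapper (assumed here; its declaration is not shown)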
  public ChargeResult charge(Order order) {
    Subsegment sub = AWSXRay.beginSubsegment("payment-service");
    try {
      sub.putMetadata("orderId", order.getId());
      return restClient.post("/charge", order, ChargeResult.class);
    } catch (Exception e) {
      sub.addException(e);
      throw e;
    } finally {
      AWSXRay.endSubsegment();
    }
  }
}

⚠️ Pitfall

Enabling tracing at the edge (API Gateway tracing, or relying on the X-Amzn-Trace-Id header that ALB injects) but not propagating trace headers in your Spring RestTemplate/WebClient calls creates broken traces: orphaned segments that don't link. Use OpenTelemetry auto-instrumentation or ensure X-Amzn-Trace-Id is forwarded on every outbound HTTP call.
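
Header propagation can be sketched as two small functions: parse the incoming X-Amzn-Trace-Id value, then build the outbound one that keeps the same Root and points Parent at the current segment. The function names and segment ID are illustrative; the instrumentation libraries normally do this for you.

```typescript
interface TraceHeader {
  root: string;
  parent: string | undefined;
  sampled: string | undefined;
}

// Parse "Root=...;Parent=...;Sampled=1" into its parts; unknown keys are ignored.
function parseTraceHeader(headerValue: string): TraceHeader | undefined {
  const parts = new Map<string, string>();
  for (const pair of headerValue.split(";")) {
    const index = pair.indexOf("=");
    if (index > 0) parts.set(pair.slice(0, index).trim(), pair.slice(index + 1).trim());
  }
  const root = parts.get("Root");
  if (!root) return undefined; // without a Root there is no trace to continue
  return { root, parent: parts.get("Parent"), sampled: parts.get("Sampled") };
}

// Outbound header: same Root (trace ID), Parent set to the segment making the call.
function outboundTraceHeader(incoming: TraceHeader, currentSegmentId: string): string {
  const sampledPart = incoming.sampled !== undefined ? `;Sampled=${incoming.sampled}` : "";
  return `Root=${incoming.root};Parent=${currentSegmentId}${sampledPart}`;
}

const incomingTrace = parseTraceHeader(
  "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1",
);
if (incomingTrace) {
  // "aaaaaaaaaaaaaaaa" stands in for the ID of the segment handling this request.
  console.log(outboundTraceHeader(incomingTrace, "aaaaaaaaaaaaaaaa"));
}
```

If the outbound call drops this header, the downstream service starts a brand-new Root and the two halves of the request never join in the service map.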

🎯 Exam Tip

X-Ray integrates natively with Lambda, API Gateway, ECS, and Elastic Beanstalk; ALB only adds the X-Amzn-Trace-Id header and emits no segments. It does not auto-trace RDS queries: you need SDK subsegments. X-Ray vs CloudWatch: X-Ray is request-scoped tracing; CloudWatch is time-series metrics and log storage. Use both together, not either/or.

CloudTrail audit logging

CloudTrail records who did what API call, when, from where, and whether it succeeded. It is not application logging — it is the AWS control-plane audit trail. Required for compliance, forensics, and least-privilege policy refinement.

Management events vs data events

Event type	What is logged	Cost
Management events	Control plane: RunInstances, CreateBucket, AttachRolePolicy	First trail free; additional copies charged
Data events	Data plane: S3 object-level, Lambda invoke, DynamoDB item operations	Per 100,000 events — expensive at scale; enable selectively
Insights events	Detect unusual API activity (rate spikes)	Additional charge; useful for security monitoring

Organization trails

Create a trail in the management (payer) account that logs all member accounts in the organization. Deliver logs to a centralized S3 bucket in a security/audit account via cross-account policy. Member accounts cannot disable or delete the org trail — prevents an attacker with admin creds from turning off auditing before exfiltration.

Enable multi-region delivery, log file validation, KMS encryption, and Athena/Lake for SQL queries. Deliver to a security-account S3 bucket with a restrictive bucket policy.

Log file validation

When enabled, CloudTrail delivers a Digest file alongside log files. The digest contains SHA-256 hashes of each log file in the hour. If someone tampers with a log file in S3, validation fails. Run aws cloudtrail validate-logs during incident response to prove chain of custody.

CloudTrail vs CloudWatch vs X-Ray

Dimension	CloudTrail	CloudWatch	X-Ray
Purpose	API audit — who changed infrastructure	Metrics, logs, alarms — how systems behave	Distributed tracing — request path latency
Data source	AWS API calls (management + optional data)	Service metrics, app logs, custom metrics	Instrumented SDKs/OTel, API GW, Lambda
Retention	S3 (years) + optional Lake; 90-day console view	Configurable per log group; metrics 15 months	30 days (fixed); keep longer only by exporting traces yourself
Query	Athena, Lake SQL, CloudTrail event history	Logs Insights, Metric Math, dashboards	Trace map, trace detail, Analytics (filter expressions)
Compliance	SOC, PCI, HIPAA audit trail — primary tool	Operational monitoring, not audit-grade	Performance debugging, not audit-grade
Spring Boot app logs	No — unless app calls AWS APIs	Yes — stdout via ECS log driver	Yes — if SDK/OTel instrumented

Create an organization trail with validation

aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name my-org-cloudtrail-logs \
  --is-organization-trail \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --kms-key-id alias/cloudtrail

aws cloudtrail start-logging --name org-audit-trail

# Verify validation
aws cloudtrail validate-logs \
  --trail-arn arn:aws:cloudtrail:us-east-1:123456789012:trail/org-audit-trail \
  --start-time 2026-06-01T00:00:00Z \
  --end-time 2026-06-02T00:00:00Z

resource "aws_cloudtrail" "org" {
  name                          = "org-audit-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  is_organization_trail         = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true
  kms_key_id                    = aws_kms_key.cloudtrail.arn
  include_global_service_events = true

  event_selector {
    read_write_type           = "All"
    include_management_events = true
  }
}

import * as cloudtrail from 'aws-cdk-lib/aws-cloudtrail';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as kms from 'aws-cdk-lib/aws-kms';

const trail = new cloudtrail.Trail(this, 'OrgAuditTrail', {
  trailName: 'org-audit-trail',
  bucket: trailBucket,
  encryptionKey: trailKey,
  isMultiRegionTrail: true,
  enableFileValidation: true,
  includeGlobalServiceEvents: true,
  isOrganizationTrail: true,
});

🔒 Security

SCP-deny cloudtrail:StopLogging, cloudtrail:DeleteTrail, and cloudtrail:UpdateTrail for all accounts except the security admin role. Pair with EventBridge rule on StopLogging → immediate SNS page — someone is trying to hide their tracks.
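
The guardrail above can be sketched as a policy document generated in code. This is one plausible shape, assuming the exempt principal is an IAM role matched by name in every account; the role name security-admin is a placeholder, so adapt the condition to how your security role is actually named.

```typescript
// Build an SCP that denies CloudTrail tampering for every principal except one role.
function buildCloudTrailGuardrail(exemptRoleName: string) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "ProtectCloudTrail",
        Effect: "Deny",
        Action: ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail", "cloudtrail:UpdateTrail"],
        Resource: "*",
        Condition: {
          // The deny applies unless the caller's ARN matches the exempt role.
          ArnNotLike: { "aws:PrincipalArn": `arn:aws:iam::*:role/${exemptRoleName}` },
        },
      },
    ],
  };
}

// "security-admin" is a hypothetical role name.
const guardrail = buildCloudTrailGuardrail("security-admin");
console.log(JSON.stringify(guardrail, null, 2));
```

Attach the resulting JSON to the organization root or the relevant OUs; SCPs never apply to the management account itself, so protect that account separately.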

Production observability patterns

Putting it together: the three pillars on AWS, keeping the CloudWatch bill predictable, and wiring Spring Boot Micrometer metrics into CloudWatch on ECS Fargate without running a sidecar metric agent.

Three pillars on AWS

Metrics

CloudWatch service metrics + custom/EMF app metrics. Alarms → SNS → PagerDuty. Dashboards per team. SLO burn-rate alarms on error budget.

Logs

Structured JSON to CloudWatch Logs. Insights for investigation. Subscription to S3 for archive. Trace ID in every log line for correlation.

Traces

X-Ray or ADOT/OpenTelemetry. 5–10% head sampling in prod; keeping every error trace requires tail-based sampling in the collector. Service map for dependency discovery. Link traces to logs via shared trace ID.

Audit (+4th pillar)

CloudTrail org trail to security account. Not operational telemetry: compliance and forensics. Athena queries for "who deleted that security group."

Keeping the CloudWatch bill sane

Retention: 30 days prod, 7 days dev — never "never expire" by default
Log level: INFO in prod; DEBUG only on flagged requests
EMF over PutMetricData with low-cardinality dimensions only
Insights: always scope time range; Container Insights in staging first
CloudTrail data events on sensitive buckets only; audit alarm count quarterly

💰 Cost

A mid-size ECS fleet (20 services, 50 tasks) logging 5 GB/day = ~$75/month ingestion + storage. Add Container Insights (often 2× log volume), 100 custom metrics, and 50 alarms and you're at $200+/month before anyone runs a wide-open Insights query. Set AWS Budgets alerts on the CloudWatch line item.
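
The estimate above can be reproduced with a back-of-the-envelope model. The prices are the approximate figures used in this document, plus an assumed $0.10 per standard-resolution alarm; "2× log volume" is read as Container Insights doubling ingested logs, and Container Insights' own metric charges are not modelled.

```typescript
// Approximate unit prices; verify against current regional pricing.
const PRICES = {
  ingestUsdPerGb: 0.5,
  storageUsdPerGbMonth: 0.03,
  customMetricUsd: 0.3,
  standardAlarmUsd: 0.1, // assumption, not quoted in the text
};

interface Fleet {
  logGbPerDay: number;
  retentionDays: number;
  containerInsightsLogMultiplier: number; // 1 = off, 2 = log volume doubles
  customMetrics: number;
  alarms: number;
}

function monthlyBillUsd(fleet: Fleet): number {
  const dailyGb = fleet.logGbPerDay * fleet.containerInsightsLogMultiplier;
  const ingestion = dailyGb * 30 * PRICES.ingestUsdPerGb;
  const storage = dailyGb * fleet.retentionDays * PRICES.storageUsdPerGbMonth;
  const metrics = fleet.customMetrics * PRICES.customMetricUsd;
  const alarms = fleet.alarms * PRICES.standardAlarmUsd;
  return ingestion + storage + metrics + alarms;
}

const logsOnly: Fleet = {
  logGbPerDay: 5,
  retentionDays: 30,
  containerInsightsLogMultiplier: 1,
  customMetrics: 0,
  alarms: 0,
};
const fullStack: Fleet = { ...logsOnly, containerInsightsLogMultiplier: 2, customMetrics: 100, alarms: 50 };

console.log(monthlyBillUsd(logsOnly).toFixed(2));  // 79.50
console.log(monthlyBillUsd(fullStack).toFixed(2)); // 194.00
```

Ingestion dominates both totals, which is why log level and log volume are better levers than trimming alarms.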

Spring Boot Micrometer → CloudWatch

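Micrometer is the metrics facade in Spring Boot 3. The stock CloudWatch registry (micrometer-registry-cloudwatch2) publishes batches through the PutMetricData API, so the task role needs cloudwatch:PutMetricData, but no agent or sidecar is involved. To get the EMF path instead (stdout → ECS awslogs driver → CloudWatch extracts metrics automatically), emit EMF with the aws-embedded-metrics library. Either way, no X-Ray daemon is needed for metrics (tracing is separate).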

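# pom.xml: io.micrometer:micrometer-registry-cloudwatch2 (publishes via PutMetricData)
# application-production.yml
# Property prefix below is the Spring Cloud AWS 2.x form; 3.x uses management.cloudwatch.metrics.export.* (check your version)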
management:
  metrics:
    export:
      cloudwatch:
        namespace: OrderService
        batch-size: 20
        enabled: true
        step: 1m
    tags:
      application: ${spring.application.name}
      environment: production
    enable:
      jvm: true
      http.server.requests: true
      system: false  # reduce noise on Fargate — no host-level system metrics

ECS task definition — EMF requires no sidecar for metrics

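# No metrics sidecar in the task. With an EMF library, metrics travel via stdout and the task role needs no
# cloudwatch:PutMetricData; with micrometer-registry-cloudwatch2 the task role must allow it.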
aws ecs register-task-definition --family order-service \
  --requires-compatibilities FARGATE --network-mode awsvpc \
  --cpu 512 --memory 1024 \
  --task-role-arn arn:aws:iam::123456789012:role/order-service-task \
  --execution-role-arn arn:aws:iam::123456789012:role/ecs-execution \
  --container-definitions '[{
    "name":"app",
    "image":"123456789012.dkr.ecr.eu-west-1.amazonaws.com/order-service:latest",
    "environment":[
      {"name":"SPRING_PROFILES_ACTIVE","value":"production"},
      {"name":"MANAGEMENT_METRICS_EXPORT_CLOUDWATCH_ENABLED","value":"true"}
    ],
    "logConfiguration":{"logDriver":"awslogs","options":{
      "awslogs-group":"/ecs/order-service",
      "awslogs-region":"eu-west-1",
      "awslogs-stream-prefix":"ecs"
    }}
  }]'

resource "aws_ecs_task_definition" "order_service" {
  family                   = "order-service"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  task_role_arn            = aws_iam_role.task.arn
  execution_role_arn       = aws_iam_role.execution.arn

  container_definitions = jsonencode([{
    name  = "app"
    image = "${aws_ecr_repository.app.repository_url}:latest"
    environment = [
      { name = "SPRING_PROFILES_ACTIVE", value = "production" },
      { name = "MANAGEMENT_METRICS_EXPORT_CLOUDWATCH_ENABLED", value = "true" }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        awslogs-group         = aws_cloudwatch_log_group.order_service.name
        awslogs-region        = var.region
        awslogs-stream-prefix = "ecs"
      }
    }
  }])
}

import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as logs from 'aws-cdk-lib/aws-logs';

const taskDef = new ecs.FargateTaskDefinition(this, 'OrderServiceTask', {
  taskRole,
  executionRole: ecsExecutionRole,
});

const container = taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'),
  environment: {
    SPRING_PROFILES_ACTIVE: 'production',
    MANAGEMENT_METRICS_EXPORT_CLOUDWATCH_ENABLED: 'true',
  },
  logging: ecs.LogDrivers.awsLogs({
    streamPrefix: 'ecs',
    logGroup: orderServiceLogGroup,
  }),
});

💡 Pro Tip

Put the same traceId in structured logs, HTTP response headers (for support), and X-Ray segments. During an incident, start from the customer's trace ID → X-Ray trace → jump to Logs Insights with that ID → find the stack trace. One ID, three pillars linked.

🎯 Exam Tip

When the exam asks "detect unauthorized API activity across all accounts" → CloudTrail organization trail. "Real-time application performance bottleneck across microservices" → X-Ray. "Alert when EC2 CPU exceeds threshold" → CloudWatch alarm. "Query application logs for error patterns" → CloudWatch Logs Insights. Know which tool answers which question.
