Observability: CloudWatch, X-Ray & CloudTrail

Production systems fail in ways you cannot predict from unit tests alone. Observability on AWS means knowing what broke (metrics), why it broke (logs and traces), and who changed what (audit). CloudWatch is the default telemetry plane for almost every AWS service; X-Ray adds distributed tracing across microservices; CloudTrail is the immutable API audit log. This chapter covers how to instrument a Spring Boot fleet on ECS without drowning in log storage bills or drowning on-call in false-positive alarms.


CloudWatch metrics & alarms

A metric is a time series of datapoints identified by namespace + metric name + dimensions; each datapoint carries a timestamp + value. AWS services publish metrics automatically; your applications publish custom metrics. Alarms watch metrics and trigger SNS, Auto Scaling, or EventBridge actions when thresholds breach.

Namespaces and dimensions

Every metric lives in a namespace — a logical grouping like AWS/EC2, AWS/ECS, AWS/ApplicationELB, or your own OrderService. Within a namespace, metrics are distinguished by dimensions — key/value pairs that slice the data.

| Namespace | Example metric | Typical dimensions |
| --- | --- | --- |
| AWS/ECS | CPUUtilization | ClusterName, ServiceName |
| AWS/ApplicationELB | TargetResponseTime | LoadBalancer, TargetGroup |
| AWS/Lambda | Duration | FunctionName |
| OrderService (custom) | OrdersPlaced | Environment, Region |

Dimensions are how you avoid averaging away signal. An alarm on CPUUtilization without a ServiceName dimension averages every ECS service in the cluster — useless for paging on a single failing service.

Standard vs detailed monitoring

Standard monitoring (free for most services) publishes at 1-minute intervals; EC2 defaults to 5-minute unless you enable detailed monitoring (paid, 1-minute). Detailed monitoring matters when Auto Scaling must react within a minute. RDS Enhanced Monitoring is separate — OS-level metrics at sub-minute granularity via a CloudWatch Logs channel.

Custom metrics and cost

Publishing custom metrics via PutMetricData is billed per metric per month (the free tier covers only the first 10 custom metrics). Each unique combination of metric name + dimension values counts as a separate metric. High-cardinality dimensions (user IDs, request IDs, order IDs) explode your bill and push you into PutMetricData throttling, which is enforced as a per-account, per-Region request-rate quota.

  • Low cardinality only: environment, region, service name, HTTP status class (4xx, 5xx)
  • Aggregate before publish: count errors in-process, publish one counter per minute
  • Prefer EMF (see Advanced section) — embed metrics in log lines, no per-call API charge
  • High-resolution metrics (1-second granularity) only pay off with high-resolution alarms, which are priced at roughly 3× a standard alarm; use them only for burst detection
💰 Cost

Custom metrics at $0.30/metric/month add up fast: 500 unique dimension combinations × 12 months = $1,800/year for metrics alone. A team publishing latency_ms per endpoint (20 endpoints) × environment (3) × region (2) = 120 metrics — manageable. Add customer_id and you're bankrupt.
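The arithmetic above is easy to check before adding a dimension. The sketch below is illustrative only: it assumes a flat $0.30 per metric (real pricing is tiered and varies by Region), and the `customerId` cardinality of 50,000 is a made-up figure.

```typescript
// Each unique combination of dimension values is billed as its own metric.
// Flat price taken from the callout above; real pricing is tiered.
const PRICE_PER_METRIC_USD = 0.3;

/** Number of distinct metrics one metric name produces across its dimensions. */
function metricCount(cardinalities: Record<string, number>): number {
  return Object.values(cardinalities).reduce((product, n) => product * n, 1);
}

/** Monthly cost in USD for that many custom metrics at the flat price. */
function monthlyCostUsd(cardinalities: Record<string, number>): number {
  return metricCount(cardinalities) * PRICE_PER_METRIC_USD;
}

const base = { endpoint: 20, environment: 3, region: 2 };
// Hypothetical customer count, used only to show the blow-up.
const withCustomer = { ...base, customerId: 50000 };

console.log(metricCount(base));                        // 120
console.log(monthlyCostUsd(base).toFixed(2));          // "36.00"
console.log(metricCount(withCustomer));                // 6000000
console.log(monthlyCostUsd(withCustomer).toFixed(2));  // "1800000.00"
```

Running this kind of check in code review is a cheap way to catch a high-cardinality tag before it ships.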

Alarm types and evaluation

| Alarm type | What it watches | When to use |
| --- | --- | --- |
| Static threshold | Metric > / < / = fixed value for N periods | CPU > 80%, ALB 5xx > 10/min: simple SLO breaches |
| Anomaly detection | ML band around expected range (no manual threshold) | Traffic patterns that vary by day-of-week; seasonal workloads |
| Composite alarm | Boolean logic across multiple alarms (AND/OR/NOT) | Page only when both error rate is high and latency is high; reduces noise |
| Math expression | Derived metric: m1/m2*100 error percentage | Calculate error rate from RequestCount and HTTPCode_Target_5XX_Count |
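The math-expression row boils down to one division. As a minimal sketch of what m1/m2*100 computes, with an explicit guard for periods without traffic (the function name is illustrative, not a CloudWatch API):

```typescript
/** Error percentage, mirroring the metric math expression m1 / m2 * 100,
 *  where m1 = HTTPCode_Target_5XX_Count and m2 = RequestCount. */
function errorRatePercent(errorCount: number, requestCount: number): number {
  // With no traffic there is no meaningful rate; report 0 instead of NaN.
  if (requestCount === 0) return 0;
  return (errorCount * 100) / requestCount;
}

console.log(errorRatePercent(25, 1000)); // 2.5
console.log(errorRatePercent(0, 0));     // 0
```

Alarming on the percentage rather than the raw count keeps the threshold meaningful at both 100 and 100,000 requests per minute.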

Composite alarms — reduce pager fatigue

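Individual alarms on CPU, memory, and 5xx rate each fire independently during a deploy or traffic spike. A composite alarm combines child alarms: e.g. ALARM("High5xx") AND ALARM("HighLatency") pages on-call, while ALARM("HighCPU") OR ALARM("HighMemory") can feed a lower-urgency notification. Composite alarms notify (SNS and similar targets) but cannot run EC2 or Auto Scaling actions directly, so keep scaling policies attached to the underlying metric alarms. Each composite alarm is billed as an alarm in its own right, on top of its child alarms.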
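To make the evaluation logic concrete, here is a simplified sketch of "M out of N" child-alarm evaluation followed by an AND composite. It ignores missing-data handling and the INSUFFICIENT_DATA state, and the sample datapoints are invented for illustration.

```typescript
type AlarmState = "OK" | "ALARM";

/** M-out-of-N evaluation: ALARM when at least `datapointsToAlarm` of the
 *  most recent `evaluationPeriods` datapoints exceed the threshold. */
function evaluate(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number,
  datapointsToAlarm: number,
): AlarmState {
  const recent = datapoints.slice(-evaluationPeriods);
  const breaching = recent.filter((value) => value > threshold).length;
  return breaching >= datapointsToAlarm ? "ALARM" : "OK";
}

// Hypothetical per-minute samples during a deploy.
const high5xx = evaluate([2, 14, 3, 18], 10, 3, 2);           // 5xx count per minute
const highLatency = evaluate([0.4, 0.5, 0.6, 0.5], 2, 3, 2);  // p99 in seconds

// Equivalent of ALARM(high5xx) AND ALARM(highLatency).
const customerImpact: AlarmState =
  high5xx === "ALARM" && highLatency === "ALARM" ? "ALARM" : "OK";

console.log(high5xx, highLatency, customerImpact); // ALARM OK OK
```

The 5xx alarm fires on its own, but nobody is paged because latency stayed healthy, which is exactly the noise reduction a composite alarm is for.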

Create an ALB 5xx composite alarm

bash
# Child alarm: 5xx count > 10 in 2 of 3 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name order-service-high-5xx \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/order-alb/abc123 \
  --statistic Sum --period 60 --evaluation-periods 3 --datapoints-to-alarm 2 \
  --threshold 10 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:ops-alerts

# Composite: page only when 5xx AND latency both breach
aws cloudwatch put-composite-alarm \
  --alarm-name order-service-customer-impact \
  --alarm-rule 'ALARM(order-service-high-5xx) AND ALARM(order-service-high-latency)' \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:pagerduty
hcl
resource "aws_cloudwatch_metric_alarm" "high_5xx" {
  alarm_name          = "order-service-high-5xx"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  datapoints_to_alarm = 2 # alarm when 2 of the 3 periods breach
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
  dimensions = {
    LoadBalancer = aws_lb.app.arn_suffix
  }
}

resource "aws_cloudwatch_composite_alarm" "customer_impact" {
  alarm_name        = "order-service-customer-impact"
  alarm_description = "Page when errors AND latency both elevated"
  alarm_rule        = "ALARM(${aws_cloudwatch_metric_alarm.high_5xx.alarm_name}) AND ALARM(${aws_cloudwatch_metric_alarm.high_latency.alarm_name})"
  alarm_actions     = [aws_sns_topic.pagerduty.arn]
}
typescript
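import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatchActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

// alb, highLatency and pagerDutyTopic are assumed to be defined elsewhere in the stack.
const high5xx = new cloudwatch.Alarm(this, 'High5xx', {
  metric: alb.metricHttpCodeTarget(elbv2.HttpCodeTarget.TARGET_5XX_COUNT),
  threshold: 10,
  evaluationPeriods: 3,
  datapointsToAlarm: 2, // alarm when 2 of the 3 periods breach
});
new cloudwatch.CompositeAlarm(this, 'CustomerImpact', {
  alarmRule: cloudwatch.AlarmRule.allOf(high5xx, highLatency),
}).addAlarmAction(new cloudwatchActions.SnsAction(pagerDutyTopic));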
⚠️ Pitfall

Alarms on Average statistic hide outliers. ALB TargetResponseTime with Average looks fine while p99 is 8 seconds. Use p99 or p99.9 for latency SLOs; use Sum for error counts.
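A small worked example shows how far apart the two statistics can sit. This sketch uses a nearest-rank percentile, which is an assumption made for simplicity; CloudWatch's own percentile calculation may interpolate differently.

```typescript
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Nearest-rank percentile: the smallest sample with at least p% of values at or below it. */
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 98 requests at 100 ms and 2 stuck requests at 8000 ms.
const latenciesMs: number[] = [];
for (let i = 0; i < 98; i++) latenciesMs.push(100);
for (let i = 0; i < 2; i++) latenciesMs.push(8000);

console.log(mean(latenciesMs));           // 258
console.log(percentile(latenciesMs, 50)); // 100
console.log(percentile(latenciesMs, 99)); // 8000
```

An Average alarm with a 500 ms threshold stays green here while 2% of customers wait eight seconds.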

CloudWatch Logs

Logs are the narrative of what happened inside your application. CloudWatch Logs stores them in log groups (the container) and log streams (per-source sequences). Ingestion, storage, and Insights queries are all billed — retention policy is your first cost lever.

Log groups and log streams

| Concept | What it is | Naming convention |
| --- | --- | --- |
| Log group | Container for related streams; holds retention, KMS key, subscription filters | /ecs/order-service, /aws/lambda/checkout |
| Log stream | Sequence of events from one source; auto-created | ECS: ecs/order-service/abc123def (stream prefix/container name/task ID) |
| Log event | Single line or JSON blob + timestamp (max 256 KB/event) | Structured JSON preferred for Insights queries |

ECS Fargate tasks ship stdout/stderr to CloudWatch via the awslogs log driver configured in the task definition. Each task gets its own log stream — when the task restarts, a new stream appears. Spring Boot apps should log JSON (Logback + logstash encoder) so Insights can parse fields without regex hell.

Retention — set it on day one

Default retention is never expire — the most expensive mistake in observability. Production guidance: 30 days hot in CloudWatch Logs, then archive to S3 via subscription filter + Firehose if compliance requires longer. Dev/staging: 7–14 days.

  • Ingestion: ~$0.50/GB (varies by region)
  • Storage: ~$0.03/GB/month
  • Insights queries: $0.005/GB scanned — scoping by time range and fields saves money

CloudWatch Logs Insights queries

Insights is a query language over log events in a log group (or multiple groups). It does not index like Elasticsearch — it scans. Narrow time windows and filter early with | filter before | stats.

Find 5xx errors with request context

sql
fields @timestamp, @message, level, traceId, http.status, http.path, exception.message
| filter level = "ERROR" or http.status >= 500
| sort @timestamp desc
| limit 100

p99 latency by endpoint (structured JSON logs)

sql
fields http.path, duration_ms
| filter ispresent(duration_ms) and http.status = 200
| stats pct(duration_ms, 50) as p50,
        pct(duration_ms, 95) as p95,
        pct(duration_ms, 99) as p99,
        count(*) as requests
  by http.path
| sort p99 desc
| limit 20

Subscription filters

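Subscription filters forward matching log events in near-real-time to Lambda, Kinesis Data Streams, Kinesis Data Firehose (S3, OpenSearch, Splunk), or OpenSearch Service. For cross-account centralization, point the filter at a CloudWatch Logs destination in the central account, which fronts a Kinesis stream or Firehose there. The quota is two subscription filters per log group, so fan out downstream rather than adding filters.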

Metric filters — logs to alarms without custom code

Metric filters extract numeric values or count pattern matches from log events and publish CloudWatch metrics. Classic pattern: count ERROR lines → alarm when error rate spikes. The metric appears in a namespace you choose (often the log group name).
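As a rough mental model of what a JSON metric filter does, the sketch below counts events whose `level` field equals "ERROR", which is approximately what the pattern { $.level = "ERROR" } publishes with a metric value of 1. It is a local simulation for intuition, not the CloudWatch matching engine, and the sample log lines are invented.

```typescript
/** Counts log lines whose JSON `level` field equals "ERROR". */
function countErrors(logLines: string[]): number {
  let count = 0;
  for (const line of logLines) {
    try {
      const event: unknown = JSON.parse(line);
      if (
        typeof event === "object" &&
        event !== null &&
        (event as { level?: unknown }).level === "ERROR"
      ) {
        count += 1;
      }
    } catch {
      // Non-JSON lines never match a JSON filter pattern.
    }
  }
  return count;
}

const sampleLines = [
  '{"level":"INFO","message":"order accepted"}',
  '{"level":"ERROR","message":"payment declined"}',
  "plain text ERROR line",
  '{"level":"ERROR","message":"timeout calling payment-service"}',
];
console.log(countErrors(sampleLines)); // 2
```

Note that the plain-text line containing "ERROR" is not counted, one more reason to log structured JSON consistently.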

Provision log group with retention and error metric filter

bash
aws logs create-log-group --log-group-name /ecs/order-service

aws logs put-retention-policy \
  --log-group-name /ecs/order-service \
  --retention-in-days 30

aws logs put-metric-filter \
  --log-group-name /ecs/order-service \
  --filter-name ErrorCount \
  --filter-pattern '{ $.level = "ERROR" }' \
  --metric-transformations \
    metricName=ApplicationErrors,metricNamespace=OrderService,metricValue=1,defaultValue=0
hcl
resource "aws_cloudwatch_log_group" "order_service" {
  name              = "/ecs/order-service"
  retention_in_days = 30
  kms_key_id        = aws_kms_key.logs.arn
}

resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "ErrorCount"
  log_group_name = aws_cloudwatch_log_group.order_service.name
  pattern        = "{ $.level = \"ERROR\" }"

  metric_transformation {
    name      = "ApplicationErrors"
    namespace = "OrderService"
    value     = "1"
    default_value = "0"
  }
}
typescript
import * as logs from 'aws-cdk-lib/aws-logs';

const logGroup = new logs.LogGroup(this, 'OrderServiceLogs', {
  logGroupName: '/ecs/order-service',
  retention: logs.RetentionDays.ONE_MONTH,
  encryptionKey: logsKey,
});

new logs.MetricFilter(this, 'ErrorFilter', {
  logGroup,
  metricNamespace: 'OrderService',
  metricName: 'ApplicationErrors',
  filterPattern: logs.FilterPattern.stringValue('$.level', '=', 'ERROR'),
  metricValue: '1',
  defaultValue: 0, // CDK expects a number here; metricValue stays a string
});
💡 Pro Tip

Save Insights queries in the console as query definitions and pin them to dashboards. For on-call runbooks, link directly to a pre-filled Insights query with the incident time window; it saves 5 minutes of fumbling during an outage.

Advanced CloudWatch

Beyond basic metrics and logs: the CloudWatch Agent for host-level telemetry, Container Insights for ECS/EKS, Synthetics for black-box probing, and Embedded Metric Format for high-throughput custom metrics without API throttling.

CloudWatch Agent

Installed on EC2 or on-prem via SSM, the agent collects host metrics (disk, memory, swap), custom metrics (StatsD, collectd), and arbitrary log files. Config lives in SSM Parameter Store. On Fargate, skip the agent — use awslogs + EMF instead.

Container Insights

Turn on at the ECS cluster or EKS cluster level. CloudWatch creates additional log groups (/aws/ecs/containerinsights/{cluster}/performance) and publishes enhanced metrics: per-task CPU/memory, network, storage, and service-level aggregates. Uses more log ingestion volume — budget for it. Replaces much of what you'd otherwise build with custom cAdvisor scraping.

CloudWatch Synthetics (canaries)

Scheduled Node.js or Python scripts, optionally driving headless Chrome, that hit your endpoints on a timer from the Regions where you deploy them. They detect DNS failures, SSL expiry, latency regressions, and broken deploys before customers do. Canaries publish metrics to the CloudWatchSynthetics namespace and can trigger alarms. Common pattern: canary logs in → API Gateway → Lambda health check every 5 minutes.

Embedded Metric Format (EMF)

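Write a JSON structure to stdout/stderr (or any log destination) and CloudWatch automatically extracts metrics from the log event: no PutMetricData API calls and no API request charges. You still pay for log ingestion, and each extracted metric is billed as a custom metric, so the low-cardinality rules from the custom metrics section still apply. The aws-embedded-metrics client libraries (including one for Java), Lambda Powertools, and the ADOT Collector's awsemf exporter can emit EMF; the Micrometer CloudWatch registry publishes through PutMetricData instead.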

json
{
  "_aws": {
    "Timestamp": 1718000000000,
    "CloudWatchMetrics": [{
      "Namespace": "OrderService",
      "Dimensions": [["ServiceName", "Environment"]],
      "Metrics": [
        { "Name": "OrdersPlaced", "Unit": "Count" },
        { "Name": "CheckoutLatency", "Unit": "Milliseconds" }
      ]
    }]
  },
  "ServiceName": "order-service",
  "Environment": "production",
  "OrdersPlaced": 42,
  "CheckoutLatency": 187
}
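The JSON above can be produced by a small helper. This is a minimal sketch of the structure only: the `emfLine` function and `MetricSample` type are illustrative names, only two units are modeled, and a production service would more likely use an EMF client library.

```typescript
type EmfUnit = "Count" | "Milliseconds";

interface MetricSample {
  name: string;
  unit: EmfUnit;
  value: number;
}

/** Builds one EMF log line: metric definitions under `_aws`, values as top-level fields. */
function emfLine(
  namespace: string,
  dimensions: Record<string, string>,
  metrics: MetricSample[],
  timestampMs: number = Date.now(),
): string {
  const event: Record<string, unknown> = {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)], // one dimension set
          Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
    ...dimensions,
  };
  for (const m of metrics) event[m.name] = m.value;
  return JSON.stringify(event);
}

// One line on stdout per flush; the awslogs driver carries it to CloudWatch Logs.
const line = emfLine(
  "OrderService",
  { ServiceName: "order-service", Environment: "production" },
  [
    { name: "OrdersPlaced", unit: "Count", value: 42 },
    { name: "CheckoutLatency", unit: "Milliseconds", value: 187 },
  ],
  1718000000000,
);
console.log(line);
```

Aggregating in-process and emitting one line per interval keeps log volume proportional to time rather than to request count.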
⚖️ Trade-off

PutMetricData vs EMF: PutMetricData gives immediate metric availability but adds API request charges and runs into API rate limits. EMF piggybacks on log ingestion: cheaper at scale, slight delay, and metrics disappear if log delivery fails. Both are billed per custom metric. For Spring Boot on ECS, EMF is a sensible production default when an EMF library or the ADOT Collector produces it.

X-Ray distributed tracing

Metrics tell you something is wrong; traces tell you where in a request path. AWS X-Ray collects trace data from SDKs, agents, and integrated services (API Gateway, Lambda) and renders a service map plus per-request timelines. An ALB adds the X-Amzn-Trace-Id header to incoming requests but does not send segments itself.

Traces, segments, and subsegments

| Concept | Definition | Example |
| --- | --- | --- |
| Trace | End-to-end path of one request across services | Client → ALB → order-service → payment-service → RDS |
| Segment | Work done by a single service; has timing, HTTP metadata, errors | order-service handling POST /orders |
| Subsegment | Finer-grained work within a segment | DynamoDB PutItem, downstream HTTP call, SQL query |
| Trace ID | Shared across all segments; propagates via HTTP header | X-Amzn-Trace-Id or W3C traceparent |
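For reference, the X-Amzn-Trace-Id header is a semicolon-separated list of key=value fields. The parser below is a minimal sketch that reads only the Root, Parent, and Sampled fields; the `TraceHeader` shape is an illustrative type, not part of any SDK.

```typescript
interface TraceHeader {
  root?: string;
  parent?: string;
  sampled?: boolean;
}

/** Parses an X-Amzn-Trace-Id value such as
 *  "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1". */
function parseTraceHeader(value: string): TraceHeader {
  const header: TraceHeader = {};
  for (const part of value.split(";")) {
    const separator = part.indexOf("=");
    if (separator < 0) continue; // skip malformed fields
    const key = part.slice(0, separator).trim();
    const fieldValue = part.slice(separator + 1).trim();
    if (key === "Root") header.root = fieldValue;
    else if (key === "Parent") header.parent = fieldValue;
    else if (key === "Sampled") header.sampled = fieldValue === "1";
  }
  return header;
}

const parsed = parseTraceHeader(
  "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1",
);
console.log(parsed.root);    // 1-5759e988-bd862e3fe1be46a994272793
console.log(parsed.sampled); // true
```

The Root value is the trace ID worth writing into every log line for that request.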

Service map

X-Ray aggregates segments into a directed graph: nodes are services, edges are call volume and latency. Red nodes have elevated error rates; thicker edges show higher traffic. The service map is how you discover unexpected dependencies — "why is catalog-service calling legacy-billing?"

Sampling rules

Tracing every request at 10,000 RPS is expensive and hits X-Ray quotas. Sampling decides which requests get recorded. Default: first request per second + 5% of additional requests (reservoir + fixed rate). Custom sampling rules match on service name, HTTP method, URL path, or attributes.

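  • Errors cannot be targeted by a sampling rule: the decision is made when the request starts, before its outcome is known. To keep every error trace, use tail-based sampling in an OpenTelemetry collector, or raise the fixed rate on critical paths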
  • Sample 1% of health checks: reservoir 0, rate 0.01 on /actuator/health
  • Dev environment: 100% sampling; production: 5–10% usually sufficient
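The reservoir + fixed rate behaviour described above can be sketched in a few lines. This is a single-process simplification: real X-Ray sampling coordinates the reservoir across instances through the service, and the `Sampler` class here is purely illustrative.

```typescript
/** Reservoir + fixed rate: the first `reservoir` requests in each second are
 *  always traced; later requests in that second are traced with probability `rate`. */
class Sampler {
  private readonly reservoir: number;
  private readonly rate: number;
  private readonly random: () => number;
  private currentSecond = -1;
  private usedThisSecond = 0;

  constructor(reservoir: number, rate: number, random: () => number = Math.random) {
    this.reservoir = reservoir;
    this.rate = rate;
    this.random = random;
  }

  shouldSample(nowMs: number): boolean {
    const second = Math.floor(nowMs / 1000);
    if (second !== this.currentSecond) {
      // New second: refill the reservoir.
      this.currentSecond = second;
      this.usedThisSecond = 0;
    }
    if (this.usedThisSecond < this.reservoir) {
      this.usedThisSecond += 1;
      return true;
    }
    return this.random() < this.rate;
  }
}

// Default-like rule: reservoir 1, rate 5%. A fixed "random" value keeps the demo deterministic.
const sampler = new Sampler(1, 0.05, () => 0.5);
const decisions = [sampler.shouldSample(1000), sampler.shouldSample(1200), sampler.shouldSample(2000)];
console.log(decisions); // [ true, false, true ]
```

The reservoir guarantees at least some traces on quiet services, while the fixed rate caps cost on busy ones.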
💰 Cost

X-Ray charges per trace recorded and retrieved. At 5% sampling on 1M requests/day: ~50K traces/day, or ~200K segments/day at 4 segments each. Monitor trace volume in Cost Explorer under X-Ray. OpenTelemetry with tail-based sampling (sample only slow/error traces) can reduce cost further than head-based X-Ray defaults.

ADOT and OpenTelemetry

AWS Distro for OpenTelemetry (ADOT) is the AWS-supported OpenTelemetry distribution. Deploy the ADOT Collector as a sidecar on ECS/EKS to receive OTLP from your app and export to X-Ray, CloudWatch, or Prometheus. This is the migration path away from the legacy X-Ray SDK — OpenTelemetry is vendor-neutral; the collector handles backend routing.

App → OTLP → ADOT Collector sidecar → X-Ray (traces) + CloudWatch/EMF (metrics). Lambda uses the ADOT layer.

Java SDK — Spring Boot on ECS

For legacy X-Ray (pre-OTel migration), add the X-Ray SDK and servlet filter. The X-Ray daemon must run alongside the app — on ECS EC2 launch type as a daemon task; on Fargate as a sidecar container listening on UDP 2000.

java
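// Custom subsegment around a downstream call
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

public class PaymentClient {
  // restClient (an injected HTTP client wrapper), Order and ChargeResult are defined elsewhere.

  public ChargeResult charge(Order order) {
    // Requires an open segment, e.g. one started by the X-Ray servlet filter.
    Subsegment sub = AWSXRay.beginSubsegment("payment-service");
    try {
      sub.putMetadata("orderId", order.getId());
      return restClient.post("/charge", order, ChargeResult.class);
    } catch (RuntimeException e) {
      sub.addException(e); // record the failure on the subsegment
      throw e;
    } finally {
      AWSXRay.endSubsegment(); // always close, even on failure
    }
  }
}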
⚠️ Pitfall

Relying on the trace header that ALB or API Gateway adds, but not propagating it in your Spring RestTemplate/WebClient calls, creates broken traces: orphaned segments that don't link. Use OpenTelemetry auto-instrumentation or ensure X-Amzn-Trace-Id is forwarded on every outbound HTTP call.

🎯 Exam Tip

X-Ray integrates natively with Lambda, API Gateway, and Elastic Beanstalk, and runs on ECS via the daemon or an ADOT sidecar; ALB only adds the trace header. It does not auto-trace RDS queries, so you need SDK subsegments. X-Ray vs CloudWatch: X-Ray is request-scoped tracing; CloudWatch is time-series metrics and log storage. Use both together, not either/or.

CloudTrail audit logging

CloudTrail records who did what API call, when, from where, and whether it succeeded. It is not application logging — it is the AWS control-plane audit trail. Required for compliance, forensics, and least-privilege policy refinement.

Management events vs data events

| Event type | What is logged | Cost |
| --- | --- | --- |
| Management events | Control plane: RunInstances, CreateBucket, AttachRolePolicy | First trail free; additional copies charged |
| Data events | Data plane: S3 object-level, Lambda invoke, DynamoDB item operations | Charged per 100,000 events; expensive at scale, so enable selectively |
| Insights events | Detect unusual API activity (rate spikes) | Additional charge; useful for security monitoring |

Organization trails

Create a trail in the management (payer) account that logs all member accounts in the organization. Deliver logs to a centralized S3 bucket in a security/audit account via cross-account policy. Member accounts cannot disable or delete the org trail — prevents an attacker with admin creds from turning off auditing before exfiltration.

Enable multi-region delivery, log file validation, KMS encryption, and Athena/Lake for SQL queries. Deliver to a security-account S3 bucket with a restrictive bucket policy.

Log file validation

When enabled, CloudTrail delivers a Digest file alongside log files. The digest contains SHA-256 hashes of each log file in the hour. If someone tampers with a log file in S3, validation fails. Run aws cloudtrail validate-logs during incident response to prove chain of custody.

CloudTrail vs CloudWatch vs X-Ray

| Dimension | CloudTrail | CloudWatch | X-Ray |
| --- | --- | --- | --- |
| Purpose | API audit: who changed infrastructure | Metrics, logs, alarms: how systems behave | Distributed tracing: request path latency |
| Data source | AWS API calls (management + optional data) | Service metrics, app logs, custom metrics | Instrumented SDKs/OTel, API GW, Lambda |
| Retention | S3 (years) + optional Lake; 90-day console view | Configurable per log group; metrics 15 months | 30 days (fixed); export via the API for longer |
| Query | Athena, Lake SQL, CloudTrail event history | Logs Insights, Metric Math, dashboards | Trace map, trace detail, Analytics (filter expressions) |
| Compliance | SOC, PCI, HIPAA audit trail (primary tool) | Operational monitoring, not audit-grade | Performance debugging, not audit-grade |
| Spring Boot app logs | No, unless the app calls AWS APIs | Yes: stdout via ECS log driver | Yes, if SDK/OTel instrumented |

Create an organization trail with validation

bash
aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name my-org-cloudtrail-logs \
  --is-organization-trail \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --kms-key-id alias/cloudtrail

aws cloudtrail start-logging --name org-audit-trail

# Verify validation
aws cloudtrail validate-logs \
  --trail-arn arn:aws:cloudtrail:us-east-1:123456789012:trail/org-audit-trail \
  --start-time 2026-06-01T00:00:00Z \
  --end-time 2026-06-02T00:00:00Z
hcl
resource "aws_cloudtrail" "org" {
  name                          = "org-audit-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail.id
  is_organization_trail         = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true
  kms_key_id                    = aws_kms_key.cloudtrail.arn
  include_global_service_events = true

  event_selector {
    read_write_type           = "All"
    include_management_events = true
  }
}
typescript
import * as cloudtrail from 'aws-cdk-lib/aws-cloudtrail';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as kms from 'aws-cdk-lib/aws-kms';

const trail = new cloudtrail.Trail(this, 'OrgAuditTrail', {
  trailName: 'org-audit-trail',
  bucket: trailBucket,
  encryptionKey: trailKey,
  isMultiRegionTrail: true,
  enableFileValidation: true,
  includeGlobalServiceEvents: true,
  isOrganizationTrail: true,
});
🔒 Security

SCP-deny cloudtrail:StopLogging, cloudtrail:DeleteTrail, and cloudtrail:UpdateTrail for all accounts except the security admin role. Pair with EventBridge rule on StopLogging → immediate SNS page — someone is trying to hide their tracks.
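As a sketch of what those two controls can look like, the objects below hold an SCP document and an EventBridge event pattern ready to be serialized to JSON. The `security-admin` role name is a placeholder assumption; substitute the role your organization actually uses.

```typescript
// Hypothetical role name; replace with your organization's security admin role.
const SECURITY_ADMIN_ROLE_PATTERN = "arn:aws:iam::*:role/security-admin";

/** SCP: deny CloudTrail tampering for every principal except the security admin role. */
const protectCloudTrailScp = {
  Version: "2012-10-17",
  Statement: [
    {
      Sid: "DenyCloudTrailTampering",
      Effect: "Deny",
      Action: ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail", "cloudtrail:UpdateTrail"],
      Resource: "*",
      Condition: {
        ArnNotLike: { "aws:PrincipalARN": SECURITY_ADMIN_ROLE_PATTERN },
      },
    },
  ],
};

/** EventBridge pattern matching StopLogging calls recorded by CloudTrail. */
const stopLoggingPattern = {
  source: ["aws.cloudtrail"],
  "detail-type": ["AWS API Call via CloudTrail"],
  detail: {
    eventSource: ["cloudtrail.amazonaws.com"],
    eventName: ["StopLogging"],
  },
};

console.log(JSON.stringify(protectCloudTrailScp, null, 2));
console.log(JSON.stringify(stopLoggingPattern, null, 2));
```

The SCP is the prevention layer and the EventBridge rule is the detection layer; keep both, since SCPs do not apply to the management account.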

Production observability patterns

Putting it together: the three pillars on AWS, keeping the CloudWatch bill predictable, and wiring Spring Boot Micrometer metrics into CloudWatch on ECS Fargate without running a sidecar metric agent.

Three pillars on AWS

  • Metrics

    CloudWatch service metrics + custom/EMF app metrics. Alarms → SNS → PagerDuty. Dashboards per team. SLO burn-rate alarms on error budget.

  • Logs

    Structured JSON to CloudWatch Logs. Insights for investigation. Subscription to S3 for archive. Trace ID in every log line for correlation.

  • Traces

    X-Ray or ADOT/OpenTelemetry. 5–10% head sampling in prod; keeping 100% of errors requires tail-based sampling. Service map for dependency discovery. Link traces to logs via shared trace ID.

  • Audit (+4th pillar)

    CloudTrail org trail to security account. Not operational telemetry — compliance and forensics. Athena queries for "who deleted that security group."

Keeping the CloudWatch bill sane

  • Retention: 30 days prod, 7 days dev — never "never expire" by default
  • Log level: INFO in prod; DEBUG only on flagged requests
  • EMF over PutMetricData with low-cardinality dimensions only
  • Insights: always scope time range; Container Insights in staging first
  • CloudTrail data events on sensitive buckets only; audit alarm count quarterly
💰 Cost

A mid-size ECS fleet (20 services, 50 tasks) logging 5 GB/day is about 150 GB/month: ~$75/month ingestion plus a few dollars of storage at 30-day retention. Add Container Insights (often 2× log volume), 100 custom metrics, and 50 alarms and you're at $200+/month before anyone runs a wide-open Insights query. Set AWS Budgets alerts on the CloudWatch line item.
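The log portion of that estimate is simple enough to script. The sketch below uses the approximate prices quoted earlier in this chapter and ignores compression, free tier, and Insights scans, so treat the output as a rough planning figure rather than a bill.

```typescript
// Approximate prices from the retention section; check current Regional pricing.
const INGEST_USD_PER_GB = 0.5;
const STORAGE_USD_PER_GB_MONTH = 0.03;

interface LogCost {
  ingestion: number;
  storage: number;
  total: number;
}

/** Steady-state monthly cost: one month of ingestion plus storage for
 *  whatever the retention window keeps. */
function monthlyLogCostUsd(gbPerDay: number, retentionDays: number, daysInMonth = 30): LogCost {
  const ingestion = gbPerDay * daysInMonth * INGEST_USD_PER_GB;
  const storage = gbPerDay * retentionDays * STORAGE_USD_PER_GB_MONTH;
  return { ingestion, storage, total: ingestion + storage };
}

const fleet = monthlyLogCostUsd(5, 30);
console.log(fleet.ingestion.toFixed(2)); // "75.00"
console.log(fleet.storage.toFixed(2));   // "4.50"
console.log(fleet.total.toFixed(2));     // "79.50"
```

Ingestion dominates, which is why lowering log volume (log level, sampling debug output) saves more than shortening retention.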

Spring Boot Micrometer → CloudWatch

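Micrometer is the metrics facade in Spring Boot 3. Add the CloudWatch registry (micrometer-registry-cloudwatch2) and it batches meters inside the app and publishes them with PutMetricData on each step, so no sidecar or agent is required; the task role needs cloudwatch:PutMetricData. To route metrics through stdout → ECS awslogs driver → CloudWatch as EMF instead, emit them with an EMF library rather than this registry. No X-Ray daemon needed for metrics (tracing is separate).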

yaml
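# pom.xml: io.awspring.cloud:spring-cloud-aws-starter-metrics (brings in micrometer-registry-cloudwatch2)
# application-production.yml
# Prefix shown is for Spring Boot 3 / Spring Cloud AWS 3.x;
# 2.x used management.metrics.export.cloudwatch instead.
management:
  cloudwatch:
    metrics:
      export:
        namespace: OrderService
        batch-size: 20
        enabled: true
        step: 1m
  metrics:
    tags:
      application: ${spring.application.name}
      environment: production
    enable:
      jvm: true
      http.server.requests: true
      system: false  # reduce noise on Fargate: no host-level system metrics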

ECS task definition — EMF requires no sidecar for metrics

bash
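# No sidecar: EMF lines go out via stdout and the awslogs driver. The Micrometer CloudWatch
# registry enabled below calls PutMetricData itself, so the task role must allow cloudwatch:PutMetricData.
aws ecs register-task-definition --family order-service \
  --requires-compatibilities FARGATE --network-mode awsvpc \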
  --cpu 512 --memory 1024 \
  --task-role-arn arn:aws:iam::123456789012:role/order-service-task \
  --execution-role-arn arn:aws:iam::123456789012:role/ecs-execution \
  --container-definitions '[{
    "name":"app",
    "image":"123456789012.dkr.ecr.eu-west-1.amazonaws.com/order-service:latest",
    "environment":[
      {"name":"SPRING_PROFILES_ACTIVE","value":"production"},
      {"name":"MANAGEMENT_CLOUDWATCH_METRICS_EXPORT_ENABLED","value":"true"}
    ],
    "logConfiguration":{"logDriver":"awslogs","options":{
      "awslogs-group":"/ecs/order-service",
      "awslogs-region":"eu-west-1",
      "awslogs-stream-prefix":"ecs"
    }}
  }]'
hcl
resource "aws_ecs_task_definition" "order_service" {
  family                   = "order-service"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  task_role_arn            = aws_iam_role.task.arn
  execution_role_arn       = aws_iam_role.execution.arn

  container_definitions = jsonencode([{
    name  = "app"
    image = "${aws_ecr_repository.app.repository_url}:latest"
    environment = [
      { name = "SPRING_PROFILES_ACTIVE", value = "production" },
      { name = "MANAGEMENT_CLOUDWATCH_METRICS_EXPORT_ENABLED", value = "true" }
    ]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        awslogs-group         = aws_cloudwatch_log_group.order_service.name
        awslogs-region        = var.region
        awslogs-stream-prefix = "ecs"
      }
    }
  }])
}
typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as logs from 'aws-cdk-lib/aws-logs';

const taskDef = new ecs.FargateTaskDefinition(this, 'OrderServiceTask', {
  taskRole,
  executionRole: ecsExecutionRole,
});

const container = taskDef.addContainer('app', {
  image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'),
  environment: {
    SPRING_PROFILES_ACTIVE: 'production',
    MANAGEMENT_CLOUDWATCH_METRICS_EXPORT_ENABLED: 'true',
  },
  logging: ecs.LogDrivers.awsLogs({
    streamPrefix: 'ecs',
    logGroup: orderServiceLogGroup,
  }),
});
💡 Pro Tip

Put the same traceId in structured logs, HTTP response headers (for support), and X-Ray segments. During an incident, start from the customer's trace ID → X-Ray trace → jump to Logs Insights with that ID → find the stack trace. One ID, three pillars linked.
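The "jump to Logs Insights" step can be templated for runbooks. This sketch builds the query text for a given trace ID, assuming logs carry a top-level `traceId` field as in the queries earlier in this chapter; the function name and default limit are illustrative.

```typescript
/** Builds a Logs Insights query that returns every log event for one trace ID. */
function insightsQueryForTrace(traceId: string, limit = 200): string {
  // Allow only characters that occur in X-Ray and W3C trace IDs so the value
  // cannot break out of the quoted string.
  if (!/^[0-9A-Za-z-]+$/.test(traceId)) {
    throw new Error(`unexpected trace ID format: ${traceId}`);
  }
  return [
    "fields @timestamp, level, @message",
    `| filter traceId = "${traceId}"`,
    "| sort @timestamp asc",
    `| limit ${limit}`,
  ].join("\n");
}

console.log(insightsQueryForTrace("1-5759e988-bd862e3fe1be46a994272793"));
```

Pair the generated query with the incident time window so the scan stays small and cheap.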

🎯 Exam Tip

When the exam asks "detect unauthorized API activity across all accounts" → CloudTrail organization trail. "Real-time application performance bottleneck across microservices" → X-Ray. "Alert when EC2 CPU exceeds threshold" → CloudWatch alarm. "Query application logs for error patterns" → CloudWatch Logs Insights. Know which tool answers which question.