← Back to Project 12 min read

Building Snapshot-Sleuth: A Serverless Forensic Automation System

How event-driven architecture and Infrastructure as Code delivered 71% faster incident response

Cloud Security Engineering · January 21, 2025

When a security incident strikes, time is the enemy. Every minute spent on manual forensic analysis is a minute an attacker could be deepening their foothold. For security teams handling cloud workloads, the challenge is amplified: how do you analyze potentially terabytes of disk data quickly, consistently, and without specialized forensic expertise?

This is the story of Snapshot-Sleuth, a serverless forensic automation system that transforms EBS snapshot analysis from a manual, hours-long process into an automated workflow completing in minutes. Built on AWS serverless primitives and orchestrated through event-driven patterns, the system achieved a 71% reduction in processing time while handling over 100 forensic analyses across 50+ security incidents.

The architecture spans 16 CDK stacks, 30+ Lambda functions, and ~25,000 lines of code across TypeScript, Python, and React. More importantly, it demonstrates how thoughtful architectural decisions compound into measurable operational impact.

The Challenge: Manual Forensics at Cloud Scale

Before automation, forensic analysis of EBS snapshots followed a familiar pattern: an analyst would receive a snapshot, manually copy it to a forensic workstation, mount the volumes, run various scanning tools, collect artifacts, and compile findings. A typical 100GB snapshot took 2-3 hours of analyst time. Larger volumes—common in enterprise workloads—could consume an entire day.

The process suffered from several compounding problems:

Serial execution bottlenecks. Each step waited on the previous one. Copying a snapshot blocked scanning. Scanning blocked analysis. Nothing ran in parallel.

Expertise concentration. Only a handful of team members understood the full forensic toolkit. When they were unavailable, incidents queued.

Inconsistent coverage. Manual processes meant variable depth of analysis. Some analysts ran comprehensive scans; others took shortcuts under time pressure.

Scale limitations. The legacy system struggled with snapshots over 5TB. Large enterprise volumes required workarounds or went unanalyzed.

Regional constraints. Data sovereignty requirements meant evidence couldn’t cross certain boundaries. Analysts needed to work within specific regions, complicating tooling setup.

The core insight was that most forensic triage follows predictable patterns. The same tools run in the same order, producing similar artifact types. What varied was only the input data. This made the problem amenable to automation—not replacing analyst judgment, but eliminating repetitive manual steps.

Architecture Evolution: From Monolithic to Event-Driven

The redesign centered on three principles: event-driven processing, serverless-first with hybrid compute, and Infrastructure as Code for consistency.

Event-Driven Ingestion

The workflow begins when an analyst shares an EBS snapshot with the forensic system’s AWS account. This simple action—standard AWS snapshot sharing—triggers everything that follows.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4a90a4', 'secondaryColor': '#f4f4f4', 'tertiaryColor': '#fff', 'primaryTextColor': '#333', 'lineColor': '#666'}}}%%
flowchart TB
    subgraph Source["Evidence Source"]
        A[("EBS Snapshot<br/>Shared by Analyst")]
    end

    subgraph EventDriven["Event-Driven Ingestion"]
        BEventBridge<br/>Rule
        C[["SQS Queue"]]
        DEventBridge<br/>Pipes
    end

    subgraph Orchestration["Serverless Orchestration"]
        E["Step Functions<br/>State Machine"]
        F["Validation<br/>Lambda"]
        G["Copy Snapshot<br/>AWS API"]
    end

    subgraph ForensicAnalysis["Forensic Analysis (EC2)"]
        H["EC2 Instance<br/>(Graviton ARM64)"]
        I["Mount &<br/>Analyze"]
        J["YARA<br/>Scanner"]
        K["ClamAV<br/>Scanner"]
        L["Artifact<br/>Collector"]
        M["Timeline<br/>Generator"]
    end

    subgraph Storage["Evidence Storage"]
        N[("S3 Evidence<br/>Bucket")]
        O[("DynamoDB<br/>State Table")]
    end

    subgraph Notifications["Notifications & Integration"]
        P["Slack<br/>Notification"]
        Q["Incident<br/>Management"]
        R["Dashboard<br/>Updates"]
    end

    A -->|"Snapshot Shared"| B
    B -->|"Event Detected"| C
    C -->|"Queue Processing"| D
    D -->|"Invoke"| E
    E -->|"Validate"| F
    F -->|"Parallel"| G
    F -->|"Launch"| H
    G -->|"Snapshot Ready"| H
    H --> I
    I --> J
    I --> K
    I --> L
    I --> M
    J -->|"Upload"| N
    K -->|"Upload"| N
    L -->|"Upload"| N
    M -->|"Upload"| N
    H -->|"Callback"| E
    E -->|"Update State"| O
    E -->|"Complete"| P
    E -->|"Update"| Q
    O -->|"Metrics"| R

    classDef source fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef event fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    classDef orchestration fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
    classDef forensic fill:#fce4ec,stroke:#e91e63,stroke-width:2px
    classDef storage fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef notify fill:#e0f7fa,stroke:#00bcd4,stroke-width:2px

    class A source
    class B,C,D event
    class E,F,G orchestration
    class H,I,J,K,L,M forensic
    class N,O storage
    class P,Q,R notify

EventBridge detects the snapshot sharing event and routes it to an SQS queue. EventBridge Pipes (in supported regions) or a fallback Lambda invokes the Step Functions state machine. This decoupled ingestion provides natural backpressure handling and ensures no events are lost during system updates.

Serverless Orchestration with Hybrid Compute

Step Functions orchestrates the workflow, but the actual forensic analysis runs on EC2. This hybrid approach reflects a key architectural decision: use Lambda for coordination (fast, scalable, cost-efficient) but EC2 for compute-intensive forensics (predictable performance, larger storage, longer runtime).

The connection uses Step Functions’ Task Token callback pattern. When the state machine needs to run forensic analysis, it launches an EC2 instance with a task token embedded in the userdata. The EC2 instance performs analysis—which can run for hours on large snapshots—and calls back to Step Functions when complete:

# EC2 instance callback after forensic analysis
sfn_client.send_task_success(
    taskToken=task_token,
    output=json.dumps({
        "status": "completed",
        "evidence_location": f"s3://{bucket}/{case_id}/",
        "findings_summary": summary
    })
)

This pattern gives us the best of both worlds: Step Functions handles state management, retries, and timeouts, while EC2 provides the compute resources forensic tools need.

16-Stack Architecture

The infrastructure is organized into 16 specialized CDK stacks, each with clear responsibilities:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4a90a4'}}}%%
flowchart TB
    subgraph Foundation["Foundation Layer"]
        SR["ServiceRolesStack<br/><i>15+ IAM Roles</i>"]
        LOG["LoggingStack<br/><i>CloudTrail Audit</i>"]
        DB["DatabaseStack<br/><i>Global DynamoDB</i>"]
    end

    subgraph Core["Core Processing Layer"]
        COL["CollectorStack<br/><i>Step Functions + 8 Lambdas</i>"]
        IMG["ImageBuilderStack<br/><i>AMI Pipeline</i>"]
        MET["MetricsStack<br/><i>EventBridge Metrics</i>"]
    end

    subgraph Observability["Observability Layer"]
        MON["MonitoringStack<br/><i>8 Dashboards</i>"]
        ALERT["AlertingStack<br/><i>Security Events</i>"]
    end

    subgraph Lifecycle["Lifecycle Management"]
        CLEAN["CleanupStack<br/><i>Resource Lifecycle</i>"]
    end

    subgraph Analytics["Analytics Layer"]
        ANA["AnalyticsStack<br/><i>Glue + Athena</i>"]
        QS["QuickSightStack<br/><i>BI Dashboards</i>"]
    end

    subgraph Intelligence["Intelligence Layer"]
        AI["IntelligenceStack<br/><i>Bedrock AI</i>"]
    end

    subgraph Testing["Testing Layer"]
        TEST["TestingStack<br/><i>Integration Tests</i>"]
    end

    subgraph Frontend["Frontend Layer"]
        WEB["WebStack<br/><i>React + CloudFront</i>"]
        WIKI["WikiStack<br/><i>Documentation</i>"]
    end

    SR --> LOG
    SR --> DB
    SR --> COL
    LOG --> COL
    DB --> COL
    SR --> IMG
    COL --> MET
    SR --> MON
    MET --> MON
    SR --> ALERT
    COL --> CLEAN
    DB --> ANA
    ANA --> QS
    SR --> AI
    SR --> TEST
    SR --> WEB
    MON --> WIKI

    classDef foundation fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
    classDef core fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
    classDef observe fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    classDef lifecycle fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef analytics fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef intel fill:#fce4ec,stroke:#e91e63,stroke-width:2px
    classDef testing fill:#e0f7fa,stroke:#00bcd4,stroke-width:2px
    classDef frontend fill:#fafafa,stroke:#607d8b,stroke-width:2px

    class SR,LOG,DB foundation
    class COL,IMG,MET core
    class MON,ALERT observe
    class CLEAN lifecycle
    class ANA,QS analytics
    class AI intel
    class TEST testing
    class WEB,WIKI frontend

This separation provides several benefits. Stacks deploy independently, reducing blast radius during updates. IAM roles are centralized in ServiceRolesStack, preventing permission sprawl. The observability layer can evolve without touching core processing.

Technical Implementation Deep Dive

Management Plane: Lambda Orchestration

The TypeScript Lambda functions handle workflow coordination. Each function uses AWS Lambda Powertools for consistent logging, metrics, and tracing:

import { Logger } from '@aws-lambda-powertools/logger';
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
import { Tracer } from '@aws-lambda-powertools/tracer';

const logger = new Logger({ serviceName: 'snapshot-sleuth-validator' });
const metrics = new Metrics({ namespace: 'Sleuth/Workflow' });
const tracer = new Tracer({ serviceName: 'snapshot-sleuth-validator' });

export const handler = async (event: SnapshotEvent) => {
  logger.info('Processing snapshot', { snapshotId: event.snapshotId });

  const subsegment = tracer.getSegment()?.addNewSubsegment('ValidateSnapshot');

  try {
    const isValid = await validateSnapshotAccess(event.snapshotId);
    metrics.addMetric('ValidationSucceeded', MetricUnits.Count, 1);
    subsegment?.addAnnotation('validated', true);
    return { ...event, validated: true };
  } finally {
    subsegment?.close();
  }
};

The key Lambda functions include:

Validator: Confirms snapshot accessibility and extracts metadata
InstanceLauncher: Provisions EC2 with appropriate configuration based on snapshot size
IncidentUpdater: Synchronizes status with incident management systems
NotificationSender: Delivers completion alerts via Slack webhooks

Data Plane: Python Forensic Orchestrator

The EC2-based forensic analysis runs a Python orchestrator that coordinates four specialized tools:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4a90a4'}}}%%
flowchart LR
    subgraph Download["Image Acquisition"]
        A["ColdSnap<br/>Download"]
        B["Raw Disk<br/>Image"]
    end

    subgraph Mount["Volume Mounting"]
        C["Partition<br/>Detection"]
        D["Filesystem<br/>Identification"]
        E["Mount<br/>Volumes"]
    end

    subgraph Scan["Parallel Scanning"]
        F["YARA<br/>Pattern Scan"]
        G["ClamAV<br/>Malware Scan"]
        H["Artifact<br/>Collection"]
        I["Timeline<br/>Generation"]
    end

    subgraph Collect["Evidence Collection"]
        J["JSON<br/>Reports"]
        K["Markdown<br/>Summary"]
        L["Timeline<br/>CSV"]
    end

    subgraph Upload["Evidence Upload"]
        M["S3 Upload<br/>w/ Encryption"]
        N["Metadata<br/>to DynamoDB"]
    end

    A -->|"Stream"| B
    B --> C
    C --> D
    D -->|"ext4/xfs/NTFS/LVM"| E
    E --> F
    E --> G
    E --> H
    E --> I
    F --> J
    G --> J
    H --> K
    I --> L
    J --> M
    K --> M
    L --> M
    M --> N

    classDef download fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
    classDef mount fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    classDef scan fill:#fce4ec,stroke:#e91e63,stroke-width:2px
    classDef collect fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef upload fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px

    class A,B download
    class C,D,E mount
    class F,G,H,I scan
    class J,K,L collect
    class M,N upload

Each scanner implements a common interface with timeout protection—critical when scanning unknown filesystems:

class ForensicScanner(ABC):
    def __init__(self, timeout_seconds: int = 3600):
        self.timeout_seconds = timeout_seconds

    def execute_with_timeout(self, target_path: str) -> ScanResult:
        signal.signal(signal.SIGALRM, timeout_handler)
        signal.alarm(self.timeout_seconds)

        try:
            return self.scan(target_path)
        except TimeoutError:
            return ScanResult(
                tool_name=self.__class__.__name__,
                success=False,
                error_message="Execution timed out"
            )
        finally:
            signal.alarm(0)

The factory pattern makes it straightforward to add new scanners without modifying orchestration code:

class ScannerFactory:
    _scanners = {
        "yara": YaraScanner,
        "clamav": ClamAVScanner,
        "artifacts": ArtifactCollector,
        "timeline": TimelineGenerator,
    }

    @classmethod
    def create(cls, scanner_type: str, **kwargs) -> ForensicScanner:
        return cls._scanners[scanner_type](**kwargs)

Frontend: React Dashboard

The React frontend provides real-time visibility into forensic workflows. Built with Cloudscape Design System and React Query, it offers automatic refresh and efficient caching:

export const useEngagements = (filters?: EngagementFilters) => {
  return useQuery<Engagement[]>(
    ['engagements', filters],
    async () => {
      const response = await fetch('/api/engagements', {
        credentials: 'include',  // SSO authentication
      });
      return response.json();
    },
    {
      refetchInterval: 30000,  // Refresh every 30 seconds
      staleTime: 10000,
    }
  );
};

Forensic Tools Integration

YARA Pattern Detection

YARA scans the entire mounted filesystem looking for patterns defined in rule files. These rules can detect malware signatures, suspicious file contents, or known indicators of compromise. The scanner emits detailed metrics:

@xray_recorder.capture("yara_scan")
def scan(self, target_path: str) -> ScanResult:
    rules = yara.compile(filepath=self.rules_path)

    for root, dirs, files in os.walk(target_path):
        for filename in files:
            matches = rules.match(filepath, timeout=60)
            if matches:
                detections.append({
                    "file": filepath,
                    "rules": [m.rule for m in matches],
                    "tags": [tag for m in matches for tag in m.tags]
                })

    metrics.add_metric("DetectionCount", "Count", len(detections))
    return ScanResult(tool_name="YARA", detections=detections)

Multi-Filesystem Support

Cloud workloads run on various operating systems and configurations. The partition analyzer handles:

Linux filesystems: ext2, ext3, ext4, xfs, btrfs
Windows filesystems: NTFS
Logical volumes: LVM detection and activation
Encrypted volumes: LUKS container identification

Each filesystem type has a specialized handler implementing a common mounting interface. The system automatically detects partition types and applies appropriate mounting options.

Performance & Scale

The automation delivered measurable improvements across all snapshot sizes:

Snapshot Size	Legacy Manual	Automated System	Improvement
8 GB	~35 minutes	~10 minutes	71%
100 GB	~2.5 hours	~45 minutes	70%
1 TB	~10 hours	~2.5 hours	75%
5 TB	~24+ hours	~8 hours	67%

Average improvement: 71%

Beyond raw speed, the system delivers:

150+ hours of cumulative analyst time recovered
90% adoption rate among the security team
65% of cases where automation identified root cause
100+ snapshots processed across 50+ incidents

Several architectural decisions enable this performance:

Parallel tool execution. All four scanners run simultaneously rather than sequentially. A 1TB snapshot that previously took 10 hours of serial scanning now completes in 2.5 hours of parallel work.

Right-sized compute. The instance launcher analyzes snapshot size and provisions appropriate EC2 instances. Small snapshots get m7g.xlarge; larger volumes get m7g.4xlarge with additional EBS throughput.

Graviton instances. ARM64 Graviton processors provide better price-performance than x86 equivalents, reducing per-analysis costs by approximately 40%.

Regional processing. Evidence stays within its originating region, eliminating cross-region data transfer delays and maintaining compliance with data sovereignty requirements.

Operational Excellence

Observability Architecture

The system evolved through three observability phases:

Phase 1: Basic logging. CloudWatch Logs for all Lambda functions and EC2 instances. Structured logging with consistent fields (case ID, snapshot ID, region) enables correlation.

Phase 2: Distributed tracing. X-Ray integration across Lambda and EC2 provides end-to-end visibility. Custom decorators enable graceful degradation when X-Ray isn’t available:

def trace(name: Optional[str] = None, namespace: str = "Sleuth"):
    def decorator(func: Callable):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if not XRAY_ENABLED:
                return func(*args, **kwargs)

            with xray_recorder.in_subsegment(name or func.__name__) as subsegment:
                subsegment.put_annotation("namespace", namespace)
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    subsegment.add_exception(e, stack=True)
                    raise
        return wrapper
    return decorator

Phase 3: Modular dashboards. Eight specialized CloudWatch dashboards, each a reusable CDK construct:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4a90a4'}}}%%
flowchart TB
    subgraph Sources["Metric Sources"]
        SF["Step Functions<br/>Executions"]
        EC2["EC2 Forensic<br/>Instances"]
        LAM["Lambda<br/>Functions"]
        CAN["Canary<br/>Tests"]
    end

    subgraph Collection["EventBridge Collection"]
        EBEventBridge<br/>Rules
        PROC["Metrics<br/>Processor<br/>Lambda"]
    end

    subgraph Namespaces["CloudWatch Namespaces"]
        NS1["Sleuth/Workflow<br/><i>Execution metrics</i>"]
        NS2["Sleuth/Tools<br/><i>Tool performance</i>"]
        NS3["Sleuth/Evidence<br/><i>Collection metrics</i>"]
        NS4["Sleuth/Service<br/><i>Health status</i>"]
        NS5["Sleuth/Canary<br/><i>Test results</i>"]
    end

    subgraph Dashboards["8 Specialized Dashboards"]
        D1["Workflow<br/>Dashboard"]
        D2["Tools<br/>Dashboard"]
        D3["Evidence<br/>Dashboard"]
        D4["EC2 Perf<br/>Dashboard"]
        D5["Global Health<br/>Dashboard"]
        D6["Regional<br/>Dashboard"]
        D7["Alerts<br/>Dashboard"]
        D8["Operations<br/>Dashboard"]
    end

    subgraph Alerts["Alerting"]
        ALM["CloudWatch<br/>Alarms"]
        GATE["Pipeline<br/>Gate"]
        NOTIFY["Notifications"]
    end

    SF -->|"State Changes"| EB
    EC2 -->|"Tool Metrics"| EB
    LAM -->|"Function Metrics"| EB
    CAN -->|"Test Results"| EB

    EB --> PROC
    PROC --> NS1
    PROC --> NS2
    PROC --> NS3
    PROC --> NS4
    PROC --> NS5

    NS1 --> D1
    NS2 --> D2
    NS3 --> D3
    NS1 --> D4
    NS4 --> D5
    NS1 --> D6
    NS4 --> D7
    NS4 --> D8

    NS4 --> ALM
    ALM --> GATE
    ALM --> NOTIFY

    classDef source fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
    classDef collect fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    classDef namespace fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    classDef dashboard fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    classDef alert fill:#fce4ec,stroke:#e91e63,stroke-width:2px

    class SF,EC2,LAM,CAN source
    class EB,PROC collect
    class NS1,NS2,NS3,NS4,NS5 namespace
    class D1,D2,D3,D4,D5,D6,D7,D8 dashboard
    class ALM,GATE,NOTIFY alert

The metrics processor captures Step Functions state changes via EventBridge and emits CloudWatch metrics, enabling near-real-time dashboard updates without polling.

Testing Strategy

Testing follows a hierarchical approach, balancing speed with thoroughness:

Integration tests (< 5 min): Validate resource existence and IAM permissions. Run on every deployment.
Component tests (< 10 min): Verify EventBridge triggers, SQS message flow, Step Functions invocation.
Workflow tests (< 15 min): End-to-end snapshot processing with synthetic data.
Canary tests (hourly): Continuous validation in production environments.

All test resources are tagged with a standard identifier (TEST-0404), enabling automated daily cleanup that removes orphaned resources without affecting real incidents.

Security Hardening

Forensic systems handle sensitive data by definition. Security measures include:

VPC isolation: Forensic EC2 instances run in private subnets with no internet gateway. All AWS API access goes through VPC endpoints.
Encryption everywhere: S3 buckets use KMS encryption. DynamoDB, SQS, and SNS all enable encryption at rest. EFS volumes are encrypted.
Least privilege IAM: 15+ specialized roles with minimal permissions. Cross-account access uses explicit role assumptions.
Audit logging: Multi-region CloudTrail with 10-year retention captures all API activity.

Lessons Learned

What Worked Well

Event-driven architecture simplifies scaling. The EventBridge → SQS → Step Functions chain handles traffic spikes gracefully. During an incident involving multiple compromised systems, the queue absorbed bursts while processing continued at sustainable rates.

Modular dashboards beat monolithic monitoring. The original monitoring code exceeded 1,000 lines in a single file. Refactoring into eight specialized constructs (~200-500 lines each) made maintenance tractable and enabled dashboard reuse across environments.

Task token callbacks are powerful. The Step Functions callback pattern elegantly bridges serverless orchestration with long-running EC2 workloads. State persists in Step Functions while compute runs wherever it needs to.

CDK constructs compound value. Reusable constructs for queues, topics, and Lambda configurations ensured consistency across 16 stacks. When security requirements changed, updates propagated automatically.

Challenges Overcome

Regional variations require abstraction. Not all AWS regions support all services equally. EventBridge Pipes isn’t available everywhere. Some regions lack specific VPC endpoints. Configuration files map region capabilities, and deployment logic adapts accordingly.

Large snapshot handling needed iteration. Initial designs struggled with snapshots over 2TB. The solution: EFS for working storage, streaming downloads via ColdSnap, and parallel partition mounting.

Dashboard performance requires selectivity. Early dashboards pulled too many metrics, causing slow load times. Focusing each dashboard on a specific domain (workflow, tools, evidence) improved both performance and usability.

Future Directions

The architecture supports several planned enhancements:

Additional forensic tools: Memory analysis capabilities, registry deep-dive for Windows volumes
ML-based anomaly detection: Pattern recognition for unusual file distributions or access patterns
Automated remediation workflows: Integration with response orchestration for common scenarios
Cross-cloud support: Extending snapshot analysis to Azure and GCP disk images

Conclusion

Snapshot-Sleuth demonstrates that thoughtful architecture choices compound into meaningful operational improvements. The 71% reduction in processing time isn’t magic—it’s the aggregate effect of parallel execution, right-sized compute, event-driven coordination, and consistent automation.

For teams considering similar systems, the key takeaways are:

Automate the predictable. If analysts run the same tools in the same order, that’s a candidate for automation.
Use hybrid compute strategically. Lambda for orchestration, EC2 for heavy lifting. Don’t force serverless where it doesn’t fit.
Invest in observability early. The dashboard evolution from basic logs to distributed tracing to specialized monitoring paid dividends in debugging and optimization.
Design for regional variation. Cloud infrastructure varies by region. Abstract these differences rather than fighting them.
Make Infrastructure as Code non-negotiable. 16 stacks deploying consistently across 20+ regions would be impossible without CDK.

The best forensic analysis still requires human judgment. What automation provides is the time for analysts to exercise that judgment—minutes instead of hours, insight instead of data extraction. That’s the difference between responding to incidents and actually resolving them.

This case study describes architectural patterns and approaches from a production forensic automation system. Specific tool names, metrics, and implementation details have been generalized for public presentation.