Disaster Recovery and Backup Strategies in Cloud-Native Microservices System Design

Introduction

Disaster Recovery (DR) and Backup Strategies are critical components of system design to ensure business continuity, data protection, and rapid recovery from catastrophic events such as hardware failures, cyberattacks, natural disasters, or human errors in distributed systems. In cloud-native microservices architectures, which aim for high scalability (e.g., 1M req/s), high availability (e.g., 99.999% uptime), and compliance with standards like GDPR, HIPAA, SOC2, and PCI-DSS, DR and backup strategies mitigate risks and minimize downtime. This comprehensive analysis details the principles, tools, implementation approaches, advantages, limitations, and trade-offs of disaster recovery and backup strategies, with C# code examples as per your preference. It integrates foundational distributed systems concepts from your prior queries, including CAP Theorem, consistency models, consistent hashing, idempotency, unique IDs (e.g., Snowflake), heartbeats, failure handling, single points of failure (SPOFs), checksums, GeoHashing, rate limiting, Change Data Capture (CDC), load balancing, quorum consensus, multi-region deployments, capacity planning, backpressure handling, exactly-once vs. at-least-once semantics, event-driven architecture (EDA), microservices design, inter-service communication, data consistency, deployment strategies, testing strategies, Domain-Driven Design (DDD), API Gateway, Saga Pattern, Strangler Fig Pattern, Sidecar/Ambassador/Adapter Patterns, Resiliency Patterns, Service Mesh, Micro Frontends, API Versioning, Cloud-Native Design, Cloud Service Models, Containers vs. VMs, Kubernetes Architecture & Scaling, Serverless Architecture, 12-Factor App Principles, CI/CD Pipelines, Infrastructure as Code (IaC), Cloud Security Basics (IAM, Secrets, Key Management), Cost Optimization, Observability (Metrics, Tracing, Logging), Authentication & Authorization (OAuth2, OpenID Connect), Encryption in Transit and at Rest, Securing APIs (Rate Limits, Throttling, HMAC, JWT), Security Considerations in Microservices, Monitoring & Logging Strategies, Distributed Tracing (Jaeger, Zipkin, OpenTelemetry), Zero Trust Architecture, Chaos Engineering, and Auditing & Compliance. Leveraging your interest in e-commerce integrations, API scalability, resilient systems, cost efficiency, observability, authentication, encryption, API security, microservices security, monitoring, tracing, zero trust, chaos engineering, and compliance, this guide provides a structured framework for designing robust, recoverable, and compliant cloud-native systems.

Core Principles of Disaster Recovery and Backup Strategies

Disaster recovery and backup strategies focus on minimizing downtime (Mean Time to Recovery, MTTR < 15s), ensuring data integrity, and maintaining compliance during failures. These principles align with Resiliency Patterns, Observability, Cloud Security, and Chaos Engineering from your prior queries.

  • Key Principles:
    • Recovery Time Objective (RTO): The maximum acceptable downtime after a failure (e.g., < 15s for critical systems).
    • Recovery Point Objective (RPO): The maximum acceptable window of data loss (e.g., < 1min of data).
    • Data Protection: Encrypt backups with KMS, as per your Encryption query.
    • Redundancy: Use multi-region deployments for failover, as per your multi-region query.
    • Automation: Automate DR with CI/CD Pipelines and IaC, as per your CI/CD and IaC queries.
    • Monitoring: Track recovery with metrics, tracing, logging, as per your Observability and Distributed Tracing queries.
    • Security: Secure backups with IAM, mTLS, and Zero Trust, as per your Cloud Security and Zero Trust queries.
    • Testing: Validate DR with Chaos Engineering, as per your Chaos Engineering query.
    • Compliance: Ensure backups meet GDPR, HIPAA, SOC2, PCI-DSS, as per your Auditing & Compliance query.
    • Cost Efficiency: Optimize backup storage and DR costs, as per your Cost Optimization query.
  • Mathematical Foundation (illustrated in the C# sketch after this list):
    • RTO: RTO = detection_time + failover_time, e.g., 5s detection + 10s failover = 15s.
    • RPO and data loss: worst-case loss = RPO × data_rate, e.g., 1min × 1MB/s = 60MB of potential data loss.
    • Backup Storage Cost: Cost = data_size × storage_cost_per_GB, e.g., 1TB × $0.02/GB = $20.48/month.
    • Availability: Availability = 1 − (downtime ÷ total_time), e.g., ≈ 99.994% with 5s downtime/day (99.999% allows only ≈ 0.9s/day).
    • Failover Latency: Latency = replication_time + switchover_time, e.g., 50ms + 10ms = 60ms.
  • Integration with Prior Concepts:
    • CAP Theorem: Prioritizes AP for DR operations, CP for backup integrity, as per your CAP query.
    • Consistency Models: Uses eventual consistency for backups, strong consistency for critical data, as per your data consistency query.
    • Consistent Hashing: Routes backup traffic, as per your load balancing query.
    • Idempotency: Ensures safe DR retries, as per your idempotency query.
    • Failure Handling: Uses retries, timeouts, circuit breakers, as per your Resiliency Patterns query.
    • Heartbeats: Monitors DR services (< 5s), as per your heartbeats query.
    • SPOFs: Avoids via multi-region backups, as per your SPOFs query.
    • Checksums: Verifies backup integrity, as per your checksums query.
    • GeoHashing: Routes DR traffic by region, as per your GeoHashing query.
    • Rate Limiting: Caps DR API access, as per your rate limiting and Securing APIs queries.
    • CDC: Syncs backup data, as per your data consistency query.
    • Load Balancing: Distributes DR traffic, as per your load balancing query.
    • Multi-Region: Enables failover (< 50ms), as per your multi-region query.
    • Backpressure: Manages DR load, as per your backpressure query.
    • EDA: Triggers DR events via Kafka, as per your EDA query.
    • Saga Pattern: Coordinates DR workflows, as per your Saga query.
    • DDD: Aligns DR with Bounded Contexts, as per your DDD query.
    • API Gateway: Routes DR APIs, as per your API Gateway query.
    • Strangler Fig: Migrates legacy DR systems, as per your Strangler Fig query.
    • Service Mesh: Secures DR communication with mTLS, as per your Service Mesh query.
    • Micro Frontends: Supports DR for UI, as per your Micro Frontends query.
    • API Versioning: Tracks DR API versions, as per your API Versioning query.
    • Cloud-Native Design: Core to DR, as per your Cloud-Native Design query.
    • Cloud Service Models: Secures IaaS/PaaS/FaaS DR, as per your Cloud Service Models query.
    • Containers vs. VMs: Uses containers for DR, as per your Containers vs. VMs query.
    • Kubernetes: Uses KubeDR for DR, as per your Kubernetes query.
    • Serverless: Uses Lambda for DR tasks, as per your Serverless query.
    • 12-Factor App: Logs DR events to stdout, as per your 12-Factor query.
    • CI/CD Pipelines: Automates DR deployment, as per your CI/CD query.
    • IaC: Provisions DR infrastructure, as per your IaC query.
    • Cloud Security: Uses IAM, KMS for DR, as per your Cloud Security query.
    • Cost Optimization: Reduces DR costs, as per your Cost Optimization query.
    • Observability: Monitors DR with metrics/tracing/logs, as per your Observability query.
    • Authentication & Authorization: Secures DR with OAuth2/OIDC, as per your Authentication query.
    • Encryption: Protects backups, as per your Encryption query.
    • Securing APIs: Uses rate limiting, HMAC, JWT for DR APIs, as per your Securing APIs query.
    • Security Considerations: Aligns DR with security, as per your Security Considerations query.
    • Monitoring & Logging: Tracks DR metrics/logs, as per your Monitoring & Logging query.
    • Distributed Tracing: Traces DR actions with Jaeger/OpenTelemetry, as per your Distributed Tracing query.
    • Zero Trust: Enforces strict DR access, as per your Zero Trust query.
    • Chaos Engineering: Tests DR resilience, as per your Chaos Engineering query.
    • Auditing & Compliance: Ensures DR meets GDPR/HIPAA/SOC2/PCI-DSS, as per your Auditing & Compliance query.
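
The formulas above can be sanity-checked with a few lines of code. The following is a minimal C# sketch using the illustrative numbers from the Mathematical Foundation list; it is not tied to any service in this guide.

// DR Math Sanity Check (C#)
using System;

public static class DrMath
{
    public static void Main()
    {
        // RTO = detection_time + failover_time
        double rtoSeconds = 5 + 10;                      // 15 s

        // Worst-case data loss = RPO (backup interval) × data rate
        double dataLossMb = 60 * 1;                      // 1 min × 1 MB/s = 60 MB

        // Backup storage cost = data_size × storage_cost_per_GB
        double monthlyCostUsd = 1024 * 0.02;             // 1 TB × $0.02/GB = $20.48

        // Availability = 1 − (downtime ÷ total_time)
        double availability = 1 - (5.0 / 86_400);        // ≈ 99.994% with 5 s downtime/day

        Console.WriteLine($"RTO={rtoSeconds}s, loss={dataLossMb}MB, cost=${monthlyCostUsd}/month, availability={availability:P3}");
    }
}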

Disaster Recovery and Backup Tools

1. AWS Backup

  • Overview: A managed service for centralized backup and recovery across AWS services (e.g., S3, RDS, EBS).
  • Mechanisms:
    • Automates backups with schedules (e.g., daily at 2 AM).
    • Supports cross-region replication for DR.
    • Integrates with CloudTrail for audit logs, as per your Auditing & Compliance query.
  • Implementation (see the C# sketch after this list):
    • Configure backup plans for S3 and DynamoDB.
    • Enable cross-region replication (e.g., us-east-1 to us-west-2).
    • Monitor with CloudWatch and OpenTelemetry, as per your Monitoring & Logging and Distributed Tracing queries.
  • Applications:
    • E-commerce: Backup order data.
    • Financial Systems: Backup transaction logs.
  • Key Features:
    • RPO < 1min, RTO < 15s for critical data.
    • Integrates with KMS for encryption, as per your Encryption query.
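
Beyond scheduled backup plans, an on-demand backup job can also be started programmatically. The following is a minimal C# sketch assuming the AWSSDK.Backup NuGet package; the vault name, bucket ARN, and the BACKUP_ROLE_ARN environment variable are illustrative assumptions, not values defined elsewhere in this guide.

// On-Demand Backup Job via AWS Backup SDK (C#)
using Amazon.Backup;
using Amazon.Backup.Model;
using System;
using System.Threading.Tasks;

public static class OnDemandBackup
{
    public static async Task<string> StartAsync()
    {
        // Credentials come from the ambient IAM role (e.g., the ECS task role)
        var client = new AmazonBackupClient();

        var response = await client.StartBackupJobAsync(new StartBackupJobRequest
        {
            BackupVaultName = "ecommerce-backup-vault",        // assumed vault name
            ResourceArn = "arn:aws:s3:::ecommerce-bucket",     // assumed resource ARN
            IamRoleArn = Environment.GetEnvironmentVariable("BACKUP_ROLE_ARN"),
            IdempotencyToken = Guid.NewGuid().ToString(),      // makes retries safe
            StartWindowMinutes = 60,
            CompleteWindowMinutes = 180
        });

        // The returned job ID can be polled (DescribeBackupJob) to verify RPO/RTO targets are met
        return response.BackupJobId;
    }
}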

2. Veeam

  • Overview: A third-party backup and DR solution for cloud, on-premises, and hybrid environments.
  • Mechanisms:
    • Provides incremental backups and replication.
    • Supports Kubernetes and VMs, as per your Containers vs. VMs query.
    • Integrates with Jaeger for tracing, as per your Distributed Tracing query.
  • Implementation:
    • Deploy Veeam agents on ECS or Kubernetes.
    • Configure daily backups with 7-year retention (GDPR).
    • Test recovery with Chaos Engineering, as per your Chaos Engineering query.
  • Applications:
    • Healthcare: Backup PHI for HIPAA.
    • IoT: Backup sensor data.
  • Key Features:
    • RPO < 5min, RTO < 30s.
    • Supports multi-region replication, as per your multi-region query.

3. Velero

  • Overview: An open-source tool for Kubernetes backup and DR, supporting pod, PVC, and cluster resource backups.
  • Mechanisms:
    • Backs up Kubernetes objects to S3 or MinIO.
    • Supports scheduled backups and restores.
    • Integrates with Service Mesh for secure communication, as per your Service Mesh query.
  • Implementation:
    • Deploy Velero in Kubernetes clusters.
    • Backup to S3 with KMS encryption, as per your Encryption query.
    • Monitor with Prometheus and Jaeger, as per your Observability and Distributed Tracing queries.
  • Applications:
    • E-commerce: Backup microservices state.
    • Financial Systems: Backup Kubernetes workloads.
  • Key Features:
    • RPO < 1min, RTO < 60s.
    • Open-source, cost-effective for startups.

Detailed Analysis

Advantages

  • Resilience: Reduces MTTR by 90% (e.g., from 60s to 6s), as per your Resiliency Patterns query.
  • Data Protection: Near-zero data loss with frequent backups (RPO < 1min).
  • Compliance: Meets GDPR (7-year logs), HIPAA (6-year logs), SOC2, PCI-DSS, as per your Auditing & Compliance query.
  • Automation: CI/CD and IaC reduce setup errors by 90%, as per your CI/CD and IaC queries.
  • Scalability: Handles 1M req/s, as per your API scalability interest.
  • Observability: Tracks DR with metrics, tracing, logging, as per your Observability and Distributed Tracing queries.

Limitations

  • Complexity: Multi-region DR increases design and operational effort.
  • Cost: Backup storage (e.g., S3: $0.02/GB/month) and replication add expenses.
  • Overhead: Backup and failover add latency (e.g., 10ms for replication).
  • Maintenance: Requires regular DR testing and updates.
  • Data Consistency: Eventual consistency in backups may cause temporary discrepancies, as per your data consistency query.

Trade-Offs

  1. RTO vs. Cost:
    • Trade-Off: Active-active DR minimizes RTO but increases costs.
    • Decision: Use active-passive for non-critical services, active-active for critical.
    • Interview Strategy: Propose active-active for finance, active-passive for analytics.
  2. RPO vs. Storage:
    • Trade-Off: Frequent backups reduce RPO but increase storage costs.
    • Decision: Use frequent backups for PCI-DSS, hourly for SOC2.
    • Interview Strategy: Justify frequent backups for banking, hourly for e-commerce.
  3. Open-Source vs. Managed:
    • Trade-Off: Velero is cost-effective but requires management; AWS Backup is simpler but vendor-specific.
    • Decision: Use Velero for Kubernetes, AWS Backup for AWS ecosystems.
    • Interview Strategy: Highlight Velero for startups, AWS Backup for enterprises.
  4. Consistency vs. Availability:
    • Trade-Off: Strong consistency for backups reduces availability, as per your CAP query.
    • Decision: Use eventual consistency for backups, strong for critical data.
    • Interview Strategy: Propose CDC for backup sync, DynamoDB for critical data.

Integration with Prior Concepts

  • CAP Theorem: Prioritizes AP for DR operations, CP for backups, as per your CAP query.
  • Consistency Models: Eventual consistency for backups, strong for critical data, as per your data consistency query.
  • Consistent Hashing: Routes backup traffic, as per your load balancing query.
  • Idempotency: Ensures safe DR retries, as per your idempotency query.
  • Failure Handling: Uses retries, timeouts, circuit breakers, as per your Resiliency Patterns query.
  • Heartbeats: Monitors DR services (< 5s), as per your heartbeats query.
  • SPOFs: Avoids via multi-region backups, as per your SPOFs query.
  • Checksums: Verifies backup integrity, as per your checksums query.
  • GeoHashing: Routes DR traffic, as per your GeoHashing query.
  • Rate Limiting: Caps DR API access, as per your rate limiting query.
  • CDC: Syncs backup data, as per your data consistency query.
  • Load Balancing: Distributes DR traffic, as per your load balancing query.
  • Multi-Region: Enables failover, as per your multi-region query.
  • Backpressure: Manages DR load, as per your backpressure query.
  • EDA: Triggers DR events, as per your EDA query.
  • Saga Pattern: Coordinates DR workflows, as per your Saga query.
  • DDD: Aligns DR with Bounded Contexts, as per your DDD query.
  • API Gateway: Routes DR APIs, as per your API Gateway query.
  • Strangler Fig: Migrates legacy DR systems, as per your Strangler Fig query.
  • Service Mesh: Secures DR communication, as per your Service Mesh query.
  • Micro Frontends: Supports DR for UI, as per your Micro Frontends query.
  • API Versioning: Tracks DR APIs, as per your API Versioning query.
  • Cloud-Native Design: Core to DR, as per your Cloud-Native Design query.
  • Cloud Service Models: Secures IaaS/PaaS/FaaS DR, as per your Cloud Service Models query.
  • Containers vs. VMs: Uses containers for DR, as per your Containers vs. VMs query.
  • Kubernetes: Uses Velero for DR, as per your Kubernetes query.
  • Serverless: Uses Lambda for DR tasks, as per your Serverless query.
  • 12-Factor App: Logs DR events, as per your 12-Factor query.
  • CI/CD Pipelines: Automates DR deployment, as per your CI/CD query.
  • IaC: Provisions DR infrastructure, as per your IaC query.
  • Cloud Security: Uses IAM, KMS for DR, as per your Cloud Security query.
  • Cost Optimization: Reduces DR costs, as per your Cost Optimization query.
  • Observability: Monitors DR, as per your Observability query.
  • Authentication & Authorization: Secures DR with OAuth2/OIDC, as per your Authentication query.
  • Encryption: Protects backups, as per your Encryption query.
  • Securing APIs: Secures DR APIs, as per your Securing APIs query.
  • Security Considerations: Aligns DR with security, as per your Security Considerations query.
  • Monitoring & Logging: Tracks DR logs, as per your Monitoring & Logging query.
  • Distributed Tracing: Traces DR actions, as per your Distributed Tracing query.
  • Zero Trust: Enforces strict DR access, as per your Zero Trust query.
  • Chaos Engineering: Tests DR resilience, as per your Chaos Engineering query.
  • Auditing & Compliance: Ensures DR meets compliance, as per your Auditing & Compliance query.

Real-World Use Cases

1. E-Commerce Platform

  • Context: An e-commerce platform (e.g., Shopify integration, as per your query) processes 100,000 orders/day, needing GDPR/PCI-DSS compliance.
  • Implementation:
    • Tool: AWS Backup with S3 cross-region replication.
    • Backup: Daily S3 backups, encrypted with KMS, as per your Encryption query.
    • DR: Active-passive failover to us-west-2, as per your multi-region query.
    • Security: IAM roles, mTLS via Service Mesh, as per your Cloud Security and Service Mesh queries.
    • Monitoring: Jaeger and CloudWatch for DR metrics, as per your Distributed Tracing and Observability queries.
    • EDA: Kafka for DR events, CDC for backup sync, as per your EDA and data consistency queries.
    • CI/CD: Terraform and GitHub Actions, as per your CI/CD and IaC queries.
    • Chaos Engineering: AWS FIS for DR testing, as per your Chaos Engineering query.
    • Metrics: RTO < 15s, RPO < 1min, 100,000 req/s, 99.999% uptime.
  • Trade-Off: RTO vs. cost.
  • Strategic Value: Ensures order data recovery, GDPR compliance.

2. Healthcare System

  • Context: A telemedicine platform processes 10,000 patient records/day, requiring HIPAA compliance.
  • Implementation:
    • Tool: Veeam for Azure Blob Storage backups.
    • Backup: Hourly PHI backups, encrypted with Key Vault, as per your Encryption query.
    • DR: Active-active failover across Azure regions, as per your multi-region query.
    • Security: Azure AD with RBAC, as per your Authentication query.
    • Monitoring: OpenTelemetry and Azure Monitor, as per your Distributed Tracing query.
    • EDA: Service Bus for DR events, as per your EDA query.
    • Chaos Engineering: Pod failure tests, as per your Chaos Engineering query.
    • Metrics: RTO < 30s, RPO < 5min, 1,000 req/s, 99.99% uptime.
  • Trade-Off: RPO vs. storage cost.
  • Strategic Value: Protects PHI, meets HIPAA requirements.

3. Financial Transaction System

  • Context: A banking system processes 500,000 transactions/day, needing PCI-DSS/SOC2 compliance, as per your tagging system query.
  • Implementation:
    • Tool: Velero for Kubernetes backups to S3.
    • Backup: Incremental transaction backups, encrypted with KMS, as per your Encryption query.
    • DR: Active-active failover with GeoHashing, as per your multi-region and GeoHashing queries.
    • Security: JWT and IAM, as per your Authentication query.
    • Monitoring: Jaeger and Prometheus, as per your Distributed Tracing query.
    • Chaos Engineering: AWS FIS for transaction failures, as per your Chaos Engineering query.
    • Metrics: RTO < 10s, RPO < 1min, 10,000 tx/s, 99.99% uptime.
  • Trade-Off: Consistency vs. availability.
  • Strategic Value: Ensures transaction recovery, PCI-DSS compliance.

Implementation Guide

// Order Service with Disaster Recovery (C#)
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;
using Amazon.CloudWatchLogs;
using Amazon.XRay.Recorder.Core;
using Amazon.XRay.Recorder.Handlers.AwsSdk;
using Amazon.KeyManagementService;
using Amazon.KeyManagementService.Model;
using Amazon.S3;
using Amazon.S3.Model;
using Confluent.Kafka;
using Microsoft.AspNetCore.Mvc;
using Microsoft.IdentityModel.Tokens;
using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using Polly;
using Polly.Extensions.Http;
using Serilog;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IdentityModel.Tokens.Jwt;
using System.IO;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

namespace OrderContext
{
    [ApiController]
    [Route("v1/orders")]
    public class OrderController : ControllerBase
    {
        private readonly IHttpClientFactory _clientFactory;
        private readonly IProducer<Null, string> _kafkaProducer;
        private readonly IAsyncPolicy<HttpResponseMessage> _resiliencyPolicy;
        private readonly Tracer _tracer;
        private readonly AmazonCloudWatchClient _cloudWatchClient;
        private readonly AmazonKeyManagementServiceClient _kmsClient;
        private readonly AmazonS3Client _s3Client;

        public OrderController(IHttpClientFactory clientFactory, IProducer<Null, string> kafkaProducer)
        {
            _clientFactory = clientFactory;
            _kafkaProducer = kafkaProducer;

            // Initialize AWS clients with IAM role
            _cloudWatchClient = new AmazonCloudWatchClient();
            _kmsClient = new AmazonKeyManagementServiceClient();
            _s3Client = new AmazonS3Client();

            // Initialize X-Ray for Distributed Tracing
            AWSSDKHandler.RegisterXRayForAllServices();

            // Initialize OpenTelemetry for Tracing
            _tracer = Sdk.CreateTracerProviderBuilder()
                .AddSource("OrderService")
                .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("OrderService"))
                // X-Ray export is typically handled by the AWS Distro for OpenTelemetry (ADOT) collector rather than an in-process exporter
                .AddJaegerExporter(options =>
                {
                    options.AgentHost = Environment.GetEnvironmentVariable("JAEGER_AGENT_HOST");
                    options.AgentPort = 6831;
                })
                .Build()
                .GetTracer("OrderService");

            // Resiliency: Circuit Breaker, Retry, Timeout
            _resiliencyPolicy = Policy.WrapAsync(
                HttpPolicyExtensions
                    .HandleTransientHttpError()
                    .Or<Exception>(ex => ex.Message.Contains("ChaosTest"))
                    .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30)),
                HttpPolicyExtensions
                    .HandleTransientHttpError()
                    .Or<Exception>(ex => ex.Message.Contains("ChaosTest"))
                    .WaitAndRetryAsync(3, retryAttempt => TimeSpan.FromMilliseconds(100 * Math.Pow(2, retryAttempt))),
                Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromMilliseconds(500))
            );

            // Serilog with CloudWatch sink (12-Factor Logs, GDPR)
            Log.Logger = new LoggerConfiguration()
                .WriteTo.Console()
                .WriteTo.AmazonCloudWatch(
                    logGroup: "/ecs/order-service",
                    logStreamPrefix: "ecs",
                    cloudWatchClient: new AmazonCloudWatchLogsClient()) // requires a CloudWatch Logs client, not the metrics client
                .CreateLogger();
        }

        [HttpPost]
        public async Task<IActionResult> CreateOrder([FromBody] Order order, [FromHeader(Name = "Authorization")] string authHeader, [FromHeader(Name = "X-HMAC-Signature")] string hmacSignature, [FromHeader(Name = "X-Request-Timestamp")] string timestamp, [FromHeader(Name = "X-Device-ID")] string deviceId, [FromHeader(Name = "X-Consent-Token")] string consentToken)
        {
            using var span = _tracer.StartActiveSpan("CreateOrder");
            span.SetAttribute("orderId", order.OrderId);
            span.SetAttribute("userId", order.UserId);
            span.SetAttribute("deviceId", deviceId);
            span.SetAttribute("consentToken", consentToken);

            // Start X-Ray segment (SOC2/PCI-DSS)
            AWSXRayRecorder.Instance.BeginSegment("OrderService", order.OrderId);

            // Simulate chaos for DR testing
            if (Environment.GetEnvironmentVariable("CHAOS_ENABLED") == "true" && new Random().Next(0, 100) < 10)
            {
                Log.Error("Chaos test: Simulating DR failure for Order {OrderId}", order.OrderId);
                span.RecordException(new Exception("ChaosTest: Simulated DR failure"));
                span.SetStatus(Status.Error);
                await LogMetricAsync("DRFailure", 1);
                throw new Exception("ChaosTest: Simulated DR failure");
            }

            // Verify Consent (GDPR)
            using var consentSpan = _tracer.StartSpan("VerifyConsent");
            if (!await VerifyConsentAsync(consentToken, order.UserId))
            {
                Log.Error("Invalid consent for User {UserId}, Order {OrderId}", order.UserId, order.OrderId);
                span.RecordException(new Exception("Invalid consent"));
                span.SetStatus(Status.Error);
                await LogMetricAsync("ConsentVerificationFailed", 1);
                return BadRequest("Invalid consent");
            }
            consentSpan.End();

            // Verify Device (Zero Trust, PCI-DSS)
            using var deviceSpan = _tracer.StartSpan("VerifyDevice");
            if (!await VerifyDeviceAsync(deviceId))
            {
                Log.Error("Invalid device {DeviceId} for Order {OrderId}", deviceId, order.OrderId);
                span.RecordException(new Exception("Invalid device"));
                span.SetStatus(Status.Error);
                await LogMetricAsync("DeviceVerificationFailed", 1);
                return Unauthorized("Invalid device");
            }
            deviceSpan.End();

            // Rate Limiting (Zero Trust, PCI-DSS)
            using var rateLimitSpan = _tracer.StartSpan("CheckRateLimit");
            if (!await CheckRateLimitAsync(order.UserId, deviceId))
            {
                Log.Error("Rate limit exceeded for User {UserId}, Device {DeviceId}", order.UserId, deviceId);
                span.RecordException(new Exception("Rate limit exceeded"));
                span.SetStatus(Status.Error);
                await LogMetricAsync("RateLimitExceeded", 1);
                return StatusCode(429, "Too Many Requests");
            }
            rateLimitSpan.End();

            // Validate JWT (Zero Trust, HIPAA/SOC2)
            using var jwtSpan = _tracer.StartSpan("ValidateJwt");
            if (!await ValidateJwtAsync(authHeader))
            {
                Log.Error("Invalid or missing JWT for Order {OrderId}", order.OrderId);
                span.RecordException(new Exception("Invalid JWT"));
                span.SetStatus(Status.Error);
                await LogMetricAsync("JwtValidationFailed", 1);
                return Unauthorized();
            }
            jwtSpan.End();

            // Validate HMAC-SHA256 (Zero Trust, PCI-DSS)
            using var hmacSpan = _tracer.StartSpan("ValidateHmac");
            if (!await ValidateHmacAsync(order, hmacSignature, timestamp))
            {
                Log.Error("Invalid HMAC for Order {OrderId}", order.OrderId);
                span.RecordException(new Exception("Invalid HMAC"));
                span.SetStatus(Status.Error);
                await LogMetricAsync("HmacValidationFailed", 1);
                return BadRequest("Invalid HMAC signature");
            }
            hmacSpan.End();

            // Idempotency check with Snowflake ID (PCI-DSS)
            var requestId = Guid.NewGuid().ToString(); // GUID placeholder for a Snowflake-style ID (see the generator sketch after the controller class)
            using var idempotencySpan = _tracer.StartSpan("CheckIdempotency");
            if (await IsProcessedAsync(requestId))
            {
                Log.Information("Order {OrderId} already processed", order.OrderId);
                span.SetAttribute("idempotent", true);
                await LogMetricAsync("IdempotentRequest", 1);
                return Ok("Order already processed");
            }
            idempotencySpan.End();

            // Encrypt order amount with AWS KMS (GDPR/HIPAA/PCI-DSS)
            using var encryptionSpan = _tracer.StartSpan("EncryptOrder");
            var encryptResponse = await _kmsClient.EncryptAsync(new EncryptRequest
            {
                KeyId = Environment.GetEnvironmentVariable("KMS_KEY_ARN"),
                Plaintext = new MemoryStream(Encoding.UTF8.GetBytes(order.Amount.ToString())) // KMS blob fields are streams
            });
            var encryptedAmount = Convert.ToBase64String(encryptResponse.CiphertextBlob.ToArray());
            encryptionSpan.End();

            // Compute SHA-256 checksum (PCI-DSS)
            using var checksumSpan = _tracer.StartSpan("ComputeChecksum");
            var checksum = ComputeChecksum(encryptedAmount);
            checksumSpan.End();

            // Store encrypted data in S3 with backup metadata
            using var storageSpan = _tracer.StartSpan("StoreOrder");
            var putRequest = new PutObjectRequest
            {
                BucketName = Environment.GetEnvironmentVariable("S3_BUCKET"),
                Key = $"orders/{requestId}",
                ContentBody = System.Text.Json.JsonSerializer.Serialize(new { order.OrderId, encryptedAmount, checksum, consentToken, backupRegion = Environment.GetEnvironmentVariable("AWS_REGION") }),
                ServerSideEncryptionMethod = ServerSideEncryptionMethod.AWSKMS,
                ServerSideEncryptionKeyManagementServiceKeyId = Environment.GetEnvironmentVariable("KMS_KEY_ARN")
            };
            try
            {
                await _s3Client.PutObjectAsync(putRequest);
            }
            catch (AmazonS3Exception ex)
            {
                Log.Error("S3 storage failed for Order {OrderId}: {Error}", order.OrderId, ex.Message);
                span.RecordException(ex);
                span.SetStatus(Status.Error);
                await LogMetricAsync("S3StorageFailed", 1);
                throw;
            }
            storageSpan.End();

            // Replicate to secondary region for DR
            using var replicationSpan = _tracer.StartSpan("ReplicateOrder");
            var secondaryBucket = Environment.GetEnvironmentVariable("S3_SECONDARY_BUCKET");
            var replicateRequest = new CopyObjectRequest
            {
                SourceBucket = Environment.GetEnvironmentVariable("S3_BUCKET"),
                SourceKey = $"orders/{requestId}",
                DestinationBucket = secondaryBucket,
                DestinationKey = $"orders/{requestId}",
                ServerSideEncryptionMethod = ServerSideEncryptionMethod.AWSKMS,
                ServerSideEncryptionKeyManagementServiceKeyId = Environment.GetEnvironmentVariable("KMS_KEY_ARN")
            };
            try
            {
                await _s3Client.CopyObjectAsync(replicateRequest);
            }
            catch (AmazonS3Exception ex)
            {
                Log.Error("S3 replication failed for Order {OrderId}: {Error}", order.OrderId, ex.Message);
                span.RecordException(ex);
                span.SetStatus(Status.Error);
                await LogMetricAsync("S3ReplicationFailed", 1);
                throw;
            }
            replicationSpan.End();

            // Call Payment Service via Service Mesh (mTLS, PCI-DSS)
            using var paymentSpan = _tracer.StartSpan("CallPaymentService");
            var client = _clientFactory.CreateClient("PaymentService");
            var payload = System.Text.Json.JsonSerializer.Serialize(new
            {
                order_id = order.OrderId,
                encrypted_amount = encryptedAmount,
                checksum = checksum
            });
            var response = await _resiliencyPolicy.ExecuteAsync(async () =>
            {
                var request = new HttpRequestMessage(HttpMethod.Post, Environment.GetEnvironmentVariable("PAYMENT_SERVICE_URL"))
                {
                    Content = new StringContent(payload, Encoding.UTF8, "application/json"),
                    Headers = { { "Authorization", authHeader }, { "X-HMAC-Signature", hmacSignature }, { "X-Request-Timestamp", timestamp }, { "X-Device-ID", deviceId }, { "X-Consent-Token", consentToken } }
                };
                var result = await client.SendAsync(request);
                result.EnsureSuccessStatusCode();
                return result;
            });
            paymentSpan.End();

            // Publish DR event for EDA/CDC (GDPR/SOC2)
            using var eventSpan = _tracer.StartSpan("PublishEvent");
            var @event = new OrderCreatedEvent
            {
                EventId = requestId,
                OrderId = order.OrderId,
                EncryptedAmount = encryptedAmount,
                Checksum = checksum,
                ConsentToken = consentToken,
                BackupRegion = Environment.GetEnvironmentVariable("AWS_REGION")
            };
            try
            {
                await _kafkaProducer.ProduceAsync(Environment.GetEnvironmentVariable("KAFKA_TOPIC"), new Message<Null, string>
                {
                    Value = System.Text.Json.JsonSerializer.Serialize(@event)
                });
            }
            catch (ProduceException<Null, string> ex)
            {
                Log.Error("Kafka publish failed for Order {OrderId}: {Error}", order.OrderId, ex.Message);
                span.RecordException(ex);
                span.SetStatus(Status.Error);
                await LogMetricAsync("KafkaPublishFailed", 1);
                throw;
            }
            eventSpan.End();

            // Log metrics (SOC2)
            await LogMetricAsync("OrderProcessed", 1);

            Log.Information("Order {OrderId} processed and backed up successfully for Device {DeviceId} with Consent {ConsentToken}", order.OrderId, deviceId, consentToken);
            AWSXRayRecorder.Instance.EndSegment();
            return Ok(order);
        }

        [HttpPost("restore/{orderId}")]
        public async Task<IActionResult> RestoreOrder(string orderId, [FromHeader(Name = "Authorization")] string authHeader)
        {
            using var span = _tracer.StartActiveSpan("RestoreOrder");
            span.SetAttribute("orderId", orderId);

            // Validate JWT (Zero Trust, HIPAA/SOC2)
            if (!await ValidateJwtAsync(authHeader))
            {
                Log.Error("Invalid JWT for RestoreOrder {OrderId}", orderId);
                span.RecordException(new Exception("Invalid JWT"));
                span.SetStatus(Status.Error);
                await LogMetricAsync("JwtValidationFailed", 1);
                return Unauthorized();
            }

            // Attempt restore from primary region
            using var restoreSpan = _tracer.StartSpan("RestoreFromPrimary");
            var getRequest = new GetObjectRequest
            {
                BucketName = Environment.GetEnvironmentVariable("S3_BUCKET"),
                Key = $"orders/{orderId}"
            };
            try
            {
                var response = await _s3Client.GetObjectAsync(getRequest);
                var data = System.Text.Json.JsonSerializer.Deserialize<OrderBackupData>(response.ResponseStream);
                Log.Information("Order {OrderId} restored from primary region", orderId);
                await LogMetricAsync("RestoreSuccess", 1);
                return Ok(data);
            }
            catch (AmazonS3Exception ex)
            {
                Log.Warning("Primary restore failed for Order {OrderId}: {Error}", orderId, ex.Message);
                span.RecordException(ex);
                await LogMetricAsync("PrimaryRestoreFailed", 1);
            }
            restoreSpan.End();

            // Failover to secondary region
            using var failoverSpan = _tracer.StartSpan("RestoreFromSecondary");
            getRequest.BucketName = Environment.GetEnvironmentVariable("S3_SECONDARY_BUCKET");
            try
            {
                var response = await _s3Client.GetObjectAsync(getRequest);
                var data = System.Text.Json.JsonSerializer.Deserialize<OrderBackupData>(response.ResponseStream);
                Log.Information("Order {OrderId} restored from secondary region", orderId);
                await LogMetricAsync("SecondaryRestoreSuccess", 1);
                return Ok(data);
            }
            catch (AmazonS3Exception ex)
            {
                Log.Error("Secondary restore failed for Order {OrderId}: {Error}", orderId, ex.Message);
                span.RecordException(ex);
                span.SetStatus(Status.Error);
                await LogMetricAsync("SecondaryRestoreFailed", 1);
                throw;
            }
            failoverSpan.End();
        }

        private async Task<bool> VerifyConsentAsync(string consentToken, string userId)
        {
            // Simulated consent verification
            return await Task.FromResult(!string.IsNullOrEmpty(consentToken));
        }

        private async Task<bool> VerifyDeviceAsync(string deviceId)
        {
            // Simulated device verification
            return await Task.FromResult(!string.IsNullOrEmpty(deviceId));
        }

        private async Task<bool> CheckRateLimitAsync(string userId, string deviceId)
        {
            // Simulated Redis-based rate limiting (token bucket, 1,000 req/s)
            return await Task.FromResult(true);
        }

        private async Task<bool> ValidateJwtAsync(string authHeader)
        {
            if (string.IsNullOrEmpty(authHeader) || !authHeader.StartsWith("Bearer "))
                return false;

            var token = authHeader.Substring("Bearer ".Length).Trim();
            var handler = new JwtSecurityTokenHandler();
            try
            {
                var jwt = handler.ReadJwtToken(token);
                var issuer = Environment.GetEnvironmentVariable("COGNITO_ISSUER");
                var jwksUrl = $"{issuer}/.well-known/jwks.json";

                var jwks = await GetJwksAsync(jwksUrl);
                var validationParameters = new TokenValidationParameters
                {
                    IssuerSigningKeys = jwks.Keys,
                    ValidIssuer = issuer,
                    ValidAudience = Environment.GetEnvironmentVariable("COGNITO_CLIENT_ID"),
                    ValidateIssuer = true,
                    ValidateAudience = true,
                    ValidateLifetime = true
                };

                handler.ValidateToken(token, validationParameters, out var validatedToken);
                await LogMetricAsync("JwtValidationSuccess", 1);
                return true;
            }
            catch (Exception ex)
            {
                Log.Error("JWT validation failed: {Error}", ex.Message);
                return false;
            }
        }

        private async Task<bool> ValidateHmacAsync(Order order, string hmacSignature, string timestamp)
        {
            var secret = Environment.GetEnvironmentVariable("API_SECRET");
            var payload = $"{order.OrderId}:{order.Amount}:{timestamp}";
            var computedHmac = ComputeHmac(payload, secret);
            var isValid = hmacSignature == computedHmac;

            if (isValid)
                await LogMetricAsync("HmacValidationSuccess", 1);
            return await Task.FromResult(isValid);
        }

        private async Task<JsonWebKeySet> GetJwksAsync(string jwksUrl)
        {
            var client = _clientFactory.CreateClient();
            var response = await client.GetStringAsync(jwksUrl);
            return new JsonWebKeySet(response);
        }

        private async Task<bool> IsProcessedAsync(string requestId)
        {
            // Simulated idempotency check (e.g., Redis)
            return await Task.FromResult(false);
        }

        private async Task LogMetricAsync(string metricName, double value)
        {
            var request = new PutMetricDataRequest
            {
                Namespace = "Ecommerce/OrderService",
                MetricData = new List<MetricDatum>
                {
                    new MetricDatum
                    {
                        MetricName = metricName,
                        Value = value,
                        Unit = StandardUnit.Count,
                        Timestamp = DateTime.UtcNow
                    }
                }
            };
            try
            {
                await _cloudWatchClient.PutMetricDataAsync(request);
            }
            catch (AmazonCloudWatchException ex)
            {
                Log.Error("Failed to log metric {MetricName}: {Error}", metricName, ex.Message);
            }
        }

        private string ComputeHmac(string data, string secret)
        {
            using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secret));
            var bytes = Encoding.UTF8.GetBytes(data);
            var hash = hmac.ComputeHash(bytes);
            return Convert.ToBase64String(hash);
        }

        private string ComputeChecksum(string data)
        {
            using var sha256 = SHA256.Create();
            var bytes = Encoding.UTF8.GetBytes(data);
            var hash = sha256.ComputeHash(bytes);
            return Convert.ToBase64String(hash);
        }
    }
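
    // Snowflake-style unique ID generator: a minimal, hypothetical sketch. The
    // controller above uses a GUID as a placeholder; a 64-bit time-ordered ID
    // (timestamp, worker ID, per-millisecond sequence) is closer to the Snowflake
    // scheme referenced in this guide. Simplified for illustration.
    public class SnowflakeIdGenerator
    {
        private static readonly DateTime Epoch = new DateTime(2020, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        private readonly object _lock = new object();
        private readonly long _workerId;   // 10 bits: 0..1023
        private long _lastTimestamp = -1L;
        private long _sequence;            // 12 bits: 0..4095

        public SnowflakeIdGenerator(long workerId) => _workerId = workerId & 0x3FF;

        public long NextId()
        {
            lock (_lock)
            {
                var timestamp = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
                if (timestamp == _lastTimestamp)
                {
                    _sequence = (_sequence + 1) & 0xFFF;
                    if (_sequence == 0)
                    {
                        // Sequence exhausted for this millisecond; wait for the next one
                        while (timestamp <= _lastTimestamp)
                            timestamp = (long)(DateTime.UtcNow - Epoch).TotalMilliseconds;
                    }
                }
                else
                {
                    _sequence = 0;
                }
                _lastTimestamp = timestamp;
                return (timestamp << 22) | (_workerId << 12) | _sequence;
            }
        }
    }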

    public class Order
    {
        public string OrderId { get; set; }
        public double Amount { get; set; }
        public string UserId { get; set; }
    }

    public class OrderCreatedEvent
    {
        public string EventId { get; set; }
        public string OrderId { get; set; }
        public string EncryptedAmount { get; set; }
        public string Checksum { get; set; }
        public string ConsentToken { get; set; }
        public string BackupRegion { get; set; }
    }

    public class OrderBackupData
    {
        public string OrderId { get; set; }
        public string EncryptedAmount { get; set; }
        public string Checksum { get; set; }
        public string ConsentToken { get; set; }
        public string BackupRegion { get; set; }
    }
}

Terraform: Disaster Recovery Infrastructure

# main.tf
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}

resource "aws_vpc" "ecommerce_vpc" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_vpc" "ecommerce_vpc_secondary" {
  provider             = aws.secondary
  cidr_block           = "10.1.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_subnet" "subnet_a" {
  vpc_id            = aws_vpc.ecommerce_vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
}

resource "aws_subnet" "subnet_b" {
  vpc_id            = aws_vpc.ecommerce_vpc.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-east-1b"
}

resource "aws_subnet" "subnet_a_secondary" {
  provider          = aws.secondary
  vpc_id            = aws_vpc.ecommerce_vpc_secondary.id
  cidr_block        = "10.1.1.0/24"
  availability_zone = "us-west-2a"
}

resource "aws_subnet" "subnet_b_secondary" {
  provider          = aws.secondary
  vpc_id            = aws_vpc.ecommerce_vpc_secondary.id
  cidr_block        = "10.1.2.0/24"
  availability_zone = "us-west-2b"
}

resource "aws_security_group" "ecommerce_sg" {
  vpc_id = aws_vpc.ecommerce_vpc.id
  ingress {
    protocol    = "tcp"
    from_port   = 443
    to_port     = 443
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    protocol    = "udp"
    from_port   = 6831
    to_port     = 6831
    cidr_blocks = ["10.0.0.0/16"]
  }
}

resource "aws_security_group" "ecommerce_sg_secondary" {
  provider = aws.secondary
  vpc_id   = aws_vpc.ecommerce_vpc_secondary.id
  ingress {
    protocol    = "tcp"
    from_port   = 443
    to_port     = 443
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    protocol    = "udp"
    from_port   = 6831
    to_port     = 6831
    cidr_blocks = ["10.1.0.0/16"]
  }
}

resource "aws_iam_role" "order_service_role" {
  name = "order-service-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "order_service_policy" {
  name = "order-service-policy"
  role = aws_iam_role.order_service_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "cloudwatch:PutMetricData",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "cognito-idp:AdminInitiateAuth",
          "kms:Encrypt",
          "kms:Decrypt",
          "s3:PutObject",
          "s3:GetObject",
          "s3:CopyObject",
          "sqs:SendMessage",
          "xray:PutTraceSegments",
          "xray:PutTelemetryRecords",
          "securityhub:BatchImportFindings",
          "sns:Publish",
          "backup:StartBackupJob",
          "backup:StartRestoreJob"
        ],
        Resource = [
          "arn:aws:cloudwatch:*:123456789012:metric/*",
          "arn:aws:logs:*:123456789012:log-group:/ecs/order-service:*",
          "arn:aws:cognito-idp:*:123456789012:userpool/*",
          "arn:aws:kms:*:123456789012:key/*",
          "arn:aws:s3:::ecommerce-bucket/*",
          "arn:aws:s3:::ecommerce-bucket-secondary/*",
          "arn:aws:sqs:*:123456789012:dead-letter-queue",
          "arn:aws:xray:*:123456789012:*",
          "arn:aws:securityhub:*:123456789012:*",
          "arn:aws:sns:*:123456789012:ecommerce-alerts",
          "arn:aws:backup:*:123456789012:backup-vault/*"
        ]
      }
    ]
  })
}

resource "aws_kms_key" "kms_key" {
  description = "KMS key for ecommerce encryption"
  enable_key_rotation = true
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = { AWS = aws_iam_role.order_service_role.arn }
        Action = ["kms:Encrypt", "kms:Decrypt"]
        Resource = "*"
      }
    ]
  })
}

resource "aws_kms_key" "kms_key_secondary" {
  provider = aws.secondary
  description = "KMS key for ecommerce secondary region"
  enable_key_rotation = true
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = { AWS = aws_iam_role.order_service_role.arn }
        Action = ["kms:Encrypt", "kms:Decrypt"]
        Resource = "*"
      }
    ]
  })
}

resource "aws_s3_bucket" "ecommerce_bucket" {
  bucket = "ecommerce-bucket"
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        kms_master_key_id = aws_kms_key.kms_key.arn
        sse_algorithm     = "aws:kms"
      }
    }
  }
}

resource "aws_s3_bucket" "ecommerce_bucket_secondary" {
  provider = aws.secondary
  bucket = "ecommerce-bucket-secondary"
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        kms_master_key_id = aws_kms_key.kms_key_secondary.arn
        sse_algorithm     = "aws:kms"
      }
    }
  }
}
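
# Cross-region replication requires versioning to be enabled on both buckets.
# A minimal sketch, assuming AWS provider v4+ resource names:
resource "aws_s3_bucket_versioning" "ecommerce_bucket_versioning" {
  bucket = aws_s3_bucket.ecommerce_bucket.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_versioning" "ecommerce_bucket_versioning_secondary" {
  provider = aws.secondary
  bucket   = aws_s3_bucket.ecommerce_bucket_secondary.id
  versioning_configuration {
    status = "Enabled"
  }
}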

resource "aws_s3_bucket_replication_configuration" "replication" {
  bucket = aws_s3_bucket.ecommerce_bucket.bucket
  role   = aws_iam_role.order_service_role.arn
  rule {
    id     = "replication-rule"
    status = "Enabled"
    destination {
      bucket        = aws_s3_bucket.ecommerce_bucket_secondary.arn
      storage_class = "STANDARD"
      encryption_configuration {
        replica_kms_key_id = aws_kms_key.kms_key_secondary.arn
      }
    }
    source_selection_criteria {
      sse_kms_encrypted_objects {
        status = "Enabled"
      }
    }
  }
}

resource "aws_backup_vault" "backup_vault" {
  name        = "ecommerce-backup-vault"
  kms_key_arn = aws_kms_key.kms_key.arn
}

resource "aws_backup_vault" "backup_vault_secondary" {
  provider    = aws.secondary
  name        = "ecommerce-backup-vault-secondary"
  kms_key_arn = aws_kms_key.kms_key_secondary.arn
}

resource "aws_backup_plan" "backup_plan" {
  name = "ecommerce-backup-plan"
  rule {
    rule_name         = "daily-backup"
    target_vault_name = aws_backup_vault.backup_vault.name
    schedule          = "cron(0 2 * * ? *)" # Daily at 2 AM
    lifecycle {
      delete_after = 2555 # 7 years for GDPR
    }
    copy_action {
      destination_vault_arn = aws_backup_vault.backup_vault_secondary.arn
      lifecycle {
        delete_after = 2555
      }
    }
  }
}

resource "aws_backup_selection" "backup_selection" {
  name         = "ecommerce-backup-selection"
  plan_id      = aws_backup_plan.backup_plan.id
  iam_role_arn = aws_iam_role.order_service_role.arn
  # AWS Backup supports S3 (requires bucket versioning), RDS, DynamoDB, EBS, EFS, etc.; ECS clusters are not a supported resource type.
  resources = [
    aws_s3_bucket.ecommerce_bucket.arn
  ]
}

resource "aws_cognito_user_pool" "ecommerce_user_pool" {
  name = "ecommerce-user-pool"
  password_policy {
    minimum_length = 8
    require_numbers = true
    require_symbols = true
    require_uppercase = true
  }
  mfa_configuration = "REQUIRED"
  software_token_mfa_configuration {
    enabled = true
  }
}

resource "aws_cognito_user_pool_client" "ecommerce_client" {
  name                = "ecommerce-client"
  user_pool_id        = aws_cognito_user_pool.ecommerce_user_pool.id
  allowed_oauth_flows = ["code"]
  allowed_oauth_scopes = ["orders/read", "orders/write", "profile"]
  callback_urls       = ["https://ecommerce.example.com/callback"]
  supported_identity_providers = ["COGNITO"]
}

resource "aws_api_gateway_rest_api" "ecommerce_api" {
  name = "ecommerce-api"
}

resource "aws_api_gateway_resource" "orders_resource" {
  rest_api_id = aws_api_gateway_rest_api.ecommerce_api.id
  parent_id   = aws_api_gateway_rest_api.ecommerce_api.root_resource_id
  path_part   = "orders"
}

resource "aws_api_gateway_method" "orders_post" {
  rest_api_id   = aws_api_gateway_rest_api.ecommerce_api.id
  resource_id   = aws_api_gateway_resource.orders_resource.id
  http_method   = "POST"
  authorization = "COGNITO_USER_POOLS"
  authorizer_id = aws_api_gateway_authorizer.cognito_authorizer.id
}

resource "aws_api_gateway_method" "orders_restore" {
  rest_api_id   = aws_api_gateway_rest_api.ecommerce_api.id
  resource_id   = aws_api_gateway_resource.orders_resource.id
  http_method   = "POST"
  path_part     = "restore/{orderId}"
  authorization = "COGNITO_USER_POOLS"
  authorizer_id = aws_api_gateway_authorizer.cognito_authorizer.id
}

resource "aws_api_gateway_authorizer" "cognito_authorizer" {
  name                   = "cognito-authorizer"
  rest_api_id            = aws_api_gateway_rest_api.ecommerce_api.id
  type                   = "COGNITO_USER_POOLS"
  provider_arns          = [aws_cognito_user_pool.ecommerce_user_pool.arn]
}

resource "aws_api_gateway_method_settings" "orders_settings" {
  rest_api_id = aws_api_gateway_rest_api.ecommerce_api.id
  stage_name  = "prod"
  method_path = "${aws_api_gateway_resource.orders_resource.path_part}/*"
  settings {
    throttling_rate_limit  = 1000
    throttling_burst_limit = 10000
    metrics_enabled       = true
    logging_level         = "INFO"
  }
}

resource "aws_api_gateway_deployment" "ecommerce_deployment" {
  rest_api_id = aws_api_gateway_rest_api.ecommerce_api.id
  stage_name  = "prod"
  depends_on  = [aws_api_gateway_method.orders_post, aws_api_gateway_method.orders_restore]
}

resource "aws_ecs_cluster" "ecommerce_cluster" {
  name = "ecommerce-cluster"
}

resource "aws_ecs_service" "order_service" {
  name            = "order-service"
  cluster         = aws_ecs_cluster.ecommerce_cluster.id
  task_definition = aws_ecs_task_definition.order_task.arn
  desired_count   = 5
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = [aws_subnet.subnet_a.id, aws_subnet.subnet_b.id]
    security_groups = [aws_security_group.ecommerce_sg.id]
  }
}

resource "aws_ecs_task_definition" "order_task" {
  family                   = "order-service"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.order_service_role.arn
  container_definitions = jsonencode([
    {
      name  = "order-service"
      image = "<your-ecr-repo>:latest"
      essential = true
      portMappings = [
        {
          containerPort = 443
          hostPort      = 443
        }
      ]
      environment = [
        { name = "KAFKA_BOOTSTRAP_SERVERS", value = "kafka:9092" },
        { name = "KAFKA_TOPIC", value = "orders" },
        { name = "PAYMENT_SERVICE_URL", value = "https://payment-service:8080/v1/payments" },
        { name = "JAEGER_AGENT_HOST", value = "jaeger-agent" },
        { name = "COGNITO_ISSUER", value = aws_cognito_user_pool.ecommerce_user_pool.endpoint },
        { name = "COGNITO_CLIENT_ID", value = aws_cognito_user_pool_client.ecommerce_client.id },
        { name = "KMS_KEY_ARN", value = aws_kms_key.kms_key.arn },
        { name = "S3_BUCKET", value = aws_s3_bucket.ecommerce_bucket.bucket },
        { name = "S3_SECONDARY_BUCKET", value = aws_s3_bucket.ecommerce_bucket_secondary.bucket },
        { name = "API_SECRET", value = "<your-api-secret>" },
        { name = "CHAOS_ENABLED", value = "true" }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/order-service"
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "ecs"
        }
      }
    },
    {
      name  = "istio-proxy"
      image = "istio/proxyv2:latest"
      essential = true
      environment = [
        { name = "ISTIO_META_WORKLOAD_NAME", value = "order-service" }
      ]
    }
  ])
}

resource "aws_sqs_queue" "dead_letter_queue" {
  name = "dead-letter-queue"
}

resource "aws_lb" "ecommerce_alb" {
  name               = "ecommerce-alb"
  load_balancer_type = "application"
  subnets            = [aws_subnet.subnet_a.id, aws_subnet.subnet_b.id]
  security_groups    = [aws_security_group.ecommerce_sg.id]
  enable_http2       = true
}

resource "aws_lb" "ecommerce_alb_secondary" {
  provider           = aws.secondary
  name               = "ecommerce-alb-secondary"
  load_balancer_type = "application"
  subnets            = [aws_subnet.subnet_a_secondary.id, aws_subnet.subnet_b_secondary.id]
  security_groups    = [aws_security_group.ecommerce_sg_secondary.id]
  enable_http2       = true
}

resource "aws_lb_target_group" "order_tg" {
  name        = "order-tg"
  port        = 443
  protocol    = "HTTPS"
  vpc_id      = aws_vpc.ecommerce_vpc.id
  health_check {
    path     = "/health"
    interval = 5
    timeout  = 3
    protocol = "HTTPS"
  }
}

resource "aws_lb_target_group" "order_tg_secondary" {
  provider    = aws.secondary
  name        = "order-tg-secondary"
  port        = 443
  protocol    = "HTTPS"
  vpc_id      = aws_vpc.ecommerce_vpc_secondary.id
  health_check {
    path     = "/health"
    interval = 5
    timeout  = 3
    protocol = "HTTPS"
  }
}

resource "aws_lb_listener" "order_listener" {
  load_balancer_arn = aws_lb.ecommerce_alb.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = "<your-acm-certificate-arn>"
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.order_tg.arn
  }
}

resource "aws_lb_listener" "order_listener_secondary" {
  provider          = aws.secondary
  load_balancer_arn = aws_lb.ecommerce_alb_secondary.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = "<your-acm-certificate-arn-secondary>"
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.order_tg_secondary.arn
  }
}
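
# DNS-based active-passive failover between the two ALBs is one way to realize
# the multi-region DR described above. A minimal sketch; var.hosted_zone_id and
# the api.ecommerce.example.com record name are assumptions not defined elsewhere
# in this file:
resource "aws_route53_health_check" "primary_health" {
  fqdn              = aws_lb.ecommerce_alb.dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "api_primary" {
  zone_id         = var.hosted_zone_id
  name            = "api.ecommerce.example.com"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary_health.id
  failover_routing_policy {
    type = "PRIMARY"
  }
  alias {
    name                   = aws_lb.ecommerce_alb.dns_name
    zone_id                = aws_lb.ecommerce_alb.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_secondary" {
  zone_id        = var.hosted_zone_id
  name           = "api.ecommerce.example.com"
  type           = "A"
  set_identifier = "secondary"
  failover_routing_policy {
    type = "SECONDARY"
  }
  alias {
    name                   = aws_lb.ecommerce_alb_secondary.dns_name
    zone_id                = aws_lb.ecommerce_alb_secondary.zone_id
    evaluate_target_health = true
  }
}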

resource "aws_cloudwatch_log_group" "order_log_group" {
  name              = "/ecs/order-service"
  retention_in_days = 2555 # 7 years for GDPR
}

resource "aws_cloudwatch_metric_alarm" "dr_failure_alarm" {
  alarm_name          = "DRFailure"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DRFailure"
  namespace           = "Ecommerce/OrderService"
  period              = 60
  statistic           = "Sum"
  threshold           = 1
  alarm_description   = "Triggers when DR failures are detected"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

resource "aws_sns_topic" "alerts" {
  name = "ecommerce-alerts"
}

resource "aws_xray_group" "ecommerce_xray_group" {
  group_name = "ecommerce-xray-group"
  filter_expression = "service(order-service)"
}

resource "aws_ecs_service" "jaeger_service" {
  name            = "jaeger-service"
  cluster         = aws_ecs_cluster.ecommerce_cluster.id
  task_definition = aws_ecs_task_definition.jaeger_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = [aws_subnet.subnet_a.id, aws_subnet.subnet_b.id]
    security_groups = [aws_security_group.ecommerce_sg.id]
  }
}

resource "aws_ecs_task_definition" "jaeger_task" {
  family                   = "jaeger-service"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"
  execution_role_arn       = aws_iam_role.order_service_role.arn
  container_definitions = jsonencode([
    {
      name  = "jaeger-agent"
      image = "jaegertracing/all-in-one:latest"
      essential = true
      portMappings = [
        {
          containerPort = 6831
          hostPort      = 6831
          protocol      = "udp"
        },
        {
          containerPort = 16686
          hostPort      = 16686
        }
      ]
      environment = [
        { name = "COLLECTOR_ZIPKIN_HTTP_PORT", value = "9411" }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/jaeger-service"
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])
}

resource "aws_cloudwatch_log_group" "jaeger_log_group" {
  name              = "/ecs/jaeger-service"
  retention_in_days = 2557 # ~7 years for GDPR (2557 is the closest valid CloudWatch retention value)
}

resource "aws_fis_experiment_template" "ecs_task_termination" {
  description = "Simulate ECS task termination for DR testing"
  role_arn    = aws_iam_role.order_service_role.arn
  stop_conditions {
    source = "aws:cloudwatch:metric"
    value  = "aws:cloudwatch:metric:DRFailure>1"
  }
  action {
    name        = "terminate-task"
    action_id   = "aws:ecs:terminate-task"
    target {
      key   = "Cluster"
      value = aws_ecs_cluster.ecommerce_cluster.name
    }
    parameters = {
      taskCount = "1"
    }
  }
  target {
    resource_type = "aws:ecs:task"
    resource_tag {
      key   = "aws:ecs:service-name"
      value = aws_ecs_service.order_service.name
    }
    selection_mode = "COUNT(1)"
  }
}

output "alb_endpoint" {
  value = aws_lb.ecommerce_alb.dns_name
}

output "alb_endpoint_secondary" {
  value = aws_lb.ecommerce_alb_secondary.dns_name
}

output "api_gateway_endpoint" {
  value = aws_api_gateway_deployment.ecommerce_deployment.invoke_url
}

output "kms_key_arn" {
  value = aws_kms_key.kms_key.arn
}

output "s3_bucket_name" {
  value = aws_s3_bucket.ecommerce_bucket.bucket
}

output "s3_bucket_name_secondary" {
  value = aws_s3_bucket.ecommerce_bucket_secondary.bucket
}

output "jaeger_endpoint" {
  value = "http://jaeger-service:16686"
}
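
The DRFailure alarm defined above evaluates a custom CloudWatch metric in the Ecommerce/OrderService namespace, so the order service must publish that metric whenever a failover or restore step fails. Below is a minimal C# sketch, assuming the AWS SDK for .NET (AWSSDK.CloudWatch); the DrMetricsPublisher class and EmitDrFailureAsync method are illustrative names, not part of the Terraform configuration above.

// Minimal sketch, assuming AWSSDK.CloudWatch; DrMetricsPublisher is a hypothetical helper.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

public class DrMetricsPublisher
{
    private readonly IAmazonCloudWatch _cloudWatch = new AmazonCloudWatchClient();

    // Publishes one data point to the DRFailure metric that the
    // "DRFailure" CloudWatch alarm above evaluates (no dimensions,
    // so it matches the alarm definition exactly).
    public Task EmitDrFailureAsync()
    {
        var request = new PutMetricDataRequest
        {
            Namespace = "Ecommerce/OrderService",
            MetricData = new List<MetricDatum>
            {
                new MetricDatum
                {
                    MetricName  = "DRFailure",
                    Value       = 1,
                    Unit        = StandardUnit.Count,
                    TimestampUtc = DateTime.UtcNow
                }
            }
        };
        return _cloudWatch.PutMetricDataAsync(request);
    }
}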

GitHub Actions Workflow for Disaster Recovery

# .github/workflows/dr.yml
name: Disaster Recovery Pipeline
on:
  schedule:
    - cron: "0 0 * * 0" # Weekly DR tests
  workflow_dispatch:
jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: 1.3.0
    - name: Terraform Init
      run: terraform init
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    - name: Terraform Plan
      run: terraform plan
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    - name: Terraform Apply
      if: github.event_name == 'workflow_dispatch'
      run: terraform apply -auto-approve
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    - name: Check Terraform Formatting
      run: terraform fmt -check -recursive
  backup:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Run AWS Backup Job
      run: |
        # start-backup-job needs the resource ARN and an IAM role AWS Backup can assume
        aws backup start-backup-job \
          --backup-vault-name ecommerce-backup-vault \
          --resource-arn arn:aws:s3:::$(aws s3api list-buckets --query 'Buckets[?Name==`ecommerce-bucket`].Name' --output text) \
          --iam-role-arn <your-backup-role-arn>
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  dr_test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Run AWS FIS Experiment
      run: |
        aws fis start-experiment --experiment-template-id $(aws fis list-experiment-templates --query 'experimentTemplates[?description==`Simulate ECS task termination for DR testing`].id' --output text)
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    - name: Verify DR Metrics
      run: |
        aws cloudwatch get-metric-statistics --namespace Ecommerce/OrderService --metric-name DRFailure --start-time $(date -u -Iseconds -d '-5 minutes') --end-time $(date -u -Iseconds) --period 60 --statistics Sum
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  container_scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Run Trivy Scanner
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: "<your-ecr-repo>:latest"
        format: "table"
        exit-code: "1"
        severity: "CRITICAL,HIGH"
  security_scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Run AWS Security Hub Scan
      run: aws securityhub batch-import-findings --findings file://security-findings.json
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Implementation Details

  • Backup:
    • AWS Backup for daily S3 backups, encrypted with KMS, as per your Encryption query.
    • Cross-region replication to us-west-2 for DR.
    • Checksums for backup integrity, as per your checksums query.
  • Disaster Recovery:
    • Active-passive failover with ALB in secondary region, as per your multi-region query.
    • Restore endpoint to recover from primary or secondary region.
    • Chaos Engineering with AWS FIS for DR testing, as per your Chaos Engineering query.
  • Security:
    • IAM roles, mTLS via Service Mesh, as per your Cloud Security and Service Mesh queries.
    • JWT and HMAC validation, as per your Securing APIs query.
    • Zero Trust with Cognito MFA, as per your Zero Trust query.
  • Resilience:
    • Polly for circuit breakers (5 failures, 30s cooldown), retries (3 attempts), and timeouts (500ms); see the C# sketch after this list.
    • DLQs for failed events, as per your Resiliency Patterns query.
    • Heartbeats (< 5s) for DR health, as per your heartbeats query.
  • Monitoring:
    • CloudWatch alarms for DR failures (> 1/min).
    • Jaeger traces DR actions, as per your Distributed Tracing query.
  • EDA: Kafka for DR events, CDC for backup sync, as per your EDA and data consistency queries.
  • CI/CD:
    • Terraform and GitHub Actions deploy DR infrastructure, as per your CI/CD and IaC queries.
    • Weekly DR tests with AWS FIS.
    • Trivy scans containers, as per your Containers vs. VMs query.
  • Compliance:
    • 7-year backup retention for GDPR, as per your Auditing & Compliance query.
    • CloudTrail for audit logs, as per your Monitoring & Logging query.
  • Metrics: RTO < 15s, RPO < 1min, 100,000 req/s, 99.999% uptime.
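
The resilience bullet above maps directly onto Polly policies. Here is a minimal C# sketch, assuming Polly v7 and a hypothetical HTTPS call to the restore endpoint; the thresholds mirror the values listed above (5 failures, 30s cooldown, 3 retries, 500ms timeout).

// Minimal sketch of the resiliency policies described above, assuming Polly v7.
using System;
using System.Net.Http;
using Polly;
using Polly.Timeout;

public static class DrResiliencePolicies
{
    // Circuit breaker: open after 5 consecutive failures, stay open for 30s.
    public static readonly IAsyncPolicy<HttpResponseMessage> CircuitBreaker =
        Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .OrResult(r => !r.IsSuccessStatusCode)
            .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

    // Retry: 3 attempts with exponential backoff (200ms, 400ms, 800ms).
    public static readonly IAsyncPolicy<HttpResponseMessage> Retry =
        Policy<HttpResponseMessage>
            .Handle<HttpRequestException>()
            .OrResult(r => !r.IsSuccessStatusCode)
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1)));

    // Timeout: fail fast after 500ms so failover is not blocked by a hung call.
    public static readonly IAsyncPolicy<HttpResponseMessage> Timeout =
        Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromMilliseconds(500), TimeoutStrategy.Pessimistic);

    // Outermost to innermost: retry -> circuit breaker -> timeout.
    public static readonly IAsyncPolicy<HttpResponseMessage> Combined =
        Policy.WrapAsync(Retry, CircuitBreaker, Timeout);
}

// Usage (hypothetical restore call against the secondary region):
// var response = await DrResiliencePolicies.Combined.ExecuteAsync(
//     () => httpClient.PostAsync("https://dr.example.com/api/v1/restore", content));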

Advanced Implementation Considerations

  • Performance Optimization:
    • Cache JWT validations to reduce latency (< 3ms).
    • Use regional KMS/S3 endpoints for low latency (< 50ms).
    • Compress backups to reduce storage costs (see the C# sketch after this list).
  • Scalability:
    • Scale ECS and S3 for 1M req/s.
    • Use Serverless (Lambda) for DR tasks, as per your Serverless query.
  • Resilience:
    • Implement retries, timeouts, circuit breakers for DR operations.
    • Deploy HA DR services (multi-AZ).
    • Monitor with heartbeats (< 5s), as per your heartbeats query.
  • Observability:
    • Track SLIs: RTO (< 15s), RPO (< 1min), throughput (> 100,000 req/s).
    • Use Jaeger/OpenTelemetry for tracing, CloudWatch for metrics, as per your Distributed Tracing and Observability queries.
  • Security:
    • Rotate KMS keys every 30 days, as per your Encryption query.
    • Use AWS Security Hub for compliance checks, as per your Auditing & Compliance query.
    • Scan containers with Trivy, as per your Containers vs. VMs query.
  • Testing:
    • Validate DR with Terratest and Chaos Engineering, as per your Chaos Engineering query.
    • Simulate regional outages for failover testing.
  • Multi-Region:
    • Deploy DR per region for low latency (< 50ms).
    • Use GeoHashing for DR routing, as per your GeoHashing query.
  • Cost Optimization:
    • Optimize S3 storage ($0.02/GB/month), AWS Backup costs, as per your Cost Optimization query.
    • Use infrequent access storage for non-critical backups.
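
As a concrete illustration of the compression, checksum, and infrequent-access points above, here is a minimal C# sketch, assuming the AWS SDK for .NET (AWSSDK.S3); the bucket name, object key layout, and BackupUploader class are illustrative placeholders rather than the platform's actual implementation.

// Minimal sketch, assuming AWSSDK.S3; bucket name and KMS key ID are placeholders.
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;

public class BackupUploader
{
    private readonly IAmazonS3 _s3 = new AmazonS3Client();

    public async Task CompressAndUploadBackupAsync(string filePath, string kmsKeyId)
    {
        // Compress the backup to reduce storage cost.
        using var compressed = new MemoryStream();
        await using (var source = File.OpenRead(filePath))
        await using (var gzip = new GZipStream(compressed, CompressionLevel.Optimal, leaveOpen: true))
        {
            await source.CopyToAsync(gzip);
        }
        compressed.Position = 0;

        // SHA-256 checksum for backup integrity verification on restore.
        string checksum = Convert.ToHexString(SHA256.HashData(compressed.ToArray()));
        compressed.Position = 0;

        var request = new PutObjectRequest
        {
            BucketName  = "ecommerce-bucket", // placeholder
            Key         = $"backups/{DateTime.UtcNow:yyyyMMdd}/{Path.GetFileName(filePath)}.gz",
            InputStream = compressed,
            StorageClass = S3StorageClass.StandardInfrequentAccess, // cheaper tier for non-critical backups
            ServerSideEncryptionMethod = ServerSideEncryptionMethod.AWSKMS,
            ServerSideEncryptionKeyManagementServiceKeyId = kmsKeyId
        };
        request.Metadata.Add("sha256", checksum); // stored as x-amz-meta-sha256

        await _s3.PutObjectAsync(request);
    }
}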

Discussing in System Design Interviews

  1. Clarify Requirements:
    • Ask: “What’s the RTO/RPO target? Compliance needs (GDPR/PCI-DSS)? Data volume (1TB/day)?”
    • Example: Confirm that e-commerce needs RTO < 15s, while healthcare requires RPO < 5min.
  2. Propose Strategy:
    • Suggest AWS Backup for S3, Velero for Kubernetes, active-passive DR, integrated with Jaeger, KMS, and Service Mesh, as per your Distributed Tracing, Encryption, and Service Mesh queries.
    • Example: “Use AWS Backup for e-commerce, Veeam for healthcare.”
  3. Address Trade-Offs:
    • Explain: “Active-active DR minimizes RTO but increases costs; frequent backups reduce RPO but raise storage costs.”
    • Example: “Use active-active for finance, active-passive for e-commerce.”
  4. Optimize and Monitor:
    • Propose: “Optimize with compressed backups, monitor with CloudWatch/Jaeger.”
    • Example: “Track RTO (< 15s) and RPO (< 1min).”
  5. Handle Edge Cases:
    • Discuss: “Use DLQs for failed DR events, encrypt backups, test with Chaos Engineering.”
    • Example: “Ensure GDPR-compliant backups for e-commerce.”
  6. Iterate Based on Feedback:
    • Adapt: “If cost is a concern, use Velero; if simplicity is the priority, use AWS Backup.”
    • Example: “Use Velero for startups, AWS Backup for enterprises.”

Conclusion

Disaster recovery and backup strategies ensure business continuity and data protection through automated backups, multi-region failover, and compliance with GDPR, HIPAA, SOC2, and PCI-DSS. By integrating EDA, Saga Pattern, DDD, API Gateway, Strangler Fig, Service Mesh, Micro Frontends, API Versioning, Cloud-Native Design, Kubernetes, Serverless, 12-Factor App, CI/CD, IaC, Cloud Security, Cost Optimization, Observability, Authentication, Encryption, Securing APIs, Security Considerations, Monitoring & Logging, Distributed Tracing, Zero Trust, Chaos Engineering, and Auditing & Compliance, DR achieves scalability (1M req/s), resilience (RTO < 15s, RPO < 1min), and regulatory adherence. The C# implementation and Terraform configuration demonstrate DR for an e-commerce platform using AWS Backup, S3 replication, Jaeger, KMS, and Istio, with checksums, heartbeats, rate limiting, and Chaos Engineering. Architects can leverage these strategies to build robust e-commerce, healthcare, and financial systems, balancing resilience, performance, and cost.

Uma Mahesh

The author works as an Architect at a reputed software company and has more than 21 years of experience in web development using Microsoft Technologies.
