Pattern: Dynamic Infrastructure Provisioning

Create and destroy isolated CloudFormation stacks per unit of work at runtime, enabling full resource isolation, independent scaling, and guaranteed cleanup for batch processing workloads.

When to Use

Workload requires resource isolation per case/batch (separate queues, network rules, scaling policies)
Database connection pools have hard limits requiring precise accounting across active workloads
Processing is long-running (minutes to hours) with per-batch lifecycle
Clean teardown is critical — no orphaned resources after processing completes
Different batches may have vastly different scaling profiles

When NOT to Use

Short-lived processing (< 15 min) — use Lambda instead
Shared infrastructure is acceptable — use static CDK stacks
Stack creation latency (3-5 min) is unacceptable for the use case
Simple queue isolation suffices — use per-batch SQS queues without full stacks

Architecture

                    ┌──────────────────────────┐
                    │   Orchestrator Lambda    │
                    │   (global, always-on)    │
                    └─────┬───────────┬────────┘
                          │           │
              JOB_STARTED │           │ LOADER_FINISHED
                          ▼           ▼
                  ┌──────────┐  ┌──────────┐
                  │ Create   │  │ Delete   │
                  │ CF Stack │  │ CF Stack │
                  └─────┬────┘  └─────┬────┘
                        │             │
                        ▼             ▼
              ┌────────────────────────────────────┐
              │  Per-Import Stack                  │
              │  UploaderStack-{caseId}-{batchId}  │
              │                                    │
              │  ├── SQS Queues (main + DLQ)       │
              │  ├── ECS Tasks (auto-scaled)       │
              │  ├── Security Groups               │
              │  └── Event Source Mappings         │
              └────────────────────────────────────┘

Key Components

1. Pre-Synthesized CloudFormation Template

The per-import CDK app is synthesized at deploy time and the resulting template is uploaded to S3. At runtime the orchestrator points CloudFormation at that template through its S3 URL (TemplateURL) and passes the case- and batch-specific parameters:

// Per-import CDK app (synthesized → S3 at deploy time)
const app = new cdk.App();
new UploaderEcsStack(app, 'UploaderStack', {
  // CaseId and BatchId are CloudFormation Parameters
  // resolved at runtime by the orchestrator
});

// Orchestrator creates stack at runtime
await cloudFormation.createStack({
  StackName: `UploaderStack-${caseId}-${batchId}`,
  TemplateURL: `https://s3.amazonaws.com/${bucket}/${template}`,
  Parameters: [
    { ParameterKey: 'CaseId', ParameterValue: caseId },
    { ParameterKey: 'BatchId', ParameterValue: batchId },
  ],
});
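
Because the stack name is built from runtime identifiers, it helps to validate it before calling createStack: CloudFormation stack names must start with a letter, may contain only letters, digits, and hyphens, and are limited to 128 characters. The following is a minimal sketch under those rules. `buildCreateStackInput` and the `CreateStackInput` shape are hypothetical helpers introduced for illustration; only the field names mirror the request shown above.

```typescript
// Hypothetical helper, not part of the original orchestrator.
interface StackParameter {
  ParameterKey: string;
  ParameterValue: string;
}

interface CreateStackInput {
  StackName: string;
  TemplateURL: string;
  Parameters: StackParameter[];
}

// Starts with a letter, then letters, digits, or hyphens; 128 characters at most.
const STACK_NAME_PATTERN = /^[A-Za-z][A-Za-z0-9-]{0,127}$/;

function buildCreateStackInput(
  caseId: string,
  batchId: string,
  templateUrl: string,
): CreateStackInput {
  const stackName = `UploaderStack-${caseId}-${batchId}`;
  if (!STACK_NAME_PATTERN.test(stackName)) {
    // Fail before the API call so a malformed ID never reaches CloudFormation.
    throw new Error(`Invalid CloudFormation stack name: ${stackName}`);
  }
  return {
    StackName: stackName,
    TemplateURL: templateUrl,
    Parameters: [
      { ParameterKey: 'CaseId', ParameterValue: caseId },
      { ParameterKey: 'BatchId', ParameterValue: batchId },
    ],
  };
}
```

The returned object can be passed straight to the createStack call. If the real case or batch IDs can contain underscores or other characters, they would need to be normalized first; that mapping is not covered here.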

2. Self-Referencing Event Polling

Instead of Step Functions wait states or polling loops, the orchestrator uses SNS → SQS → self as an asynchronous status-check mechanism:

// After creating a stack, publish a status-check event
await publishEvent({
  eventType: 'UPLOADER_CREATE_INITIATED',
  caseId, batchId,
  retryCount: 0,
});

// SNS routes this back to the orchestrator's own SQS queue
// On next invocation, orchestrator checks CF stack status:
const status = await describeStack(stackName);

if (status === 'CREATE_IN_PROGRESS') {
  // Re-publish to poll again (SQS visibility timeout = natural delay)
  await publishEvent({
    eventType: 'UPLOADER_CREATE_INITIATED',
    caseId, batchId, // carry the identifiers so the next poll finds the same stack
    retryCount: retryCount + 1,
  });
} else if (status === 'CREATE_COMPLETE') {
  await publishEvent({ eventType: 'UPLOADER_STARTED', caseId, batchId });
} else if (status === 'ROLLBACK_COMPLETE' && retryCount < MAX_RETRIES) {
  // A rolled-back stack can only be deleted, so remove it before retrying
  await deleteStack(stackName);
  await publishEvent({
    eventType: 'UPLOADER_CREATE_INITIATED',
    caseId, batchId,
    retryCount: retryCount + 1,
  });
}

Advantages over Step Functions:

- No additional cost (SQS is already in use)
- Natural backoff via SQS visibility timeout
- Same event-driven model as the rest of the system
- Retry count preserved in the event payload
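
The branching in the polling handler can be isolated as a pure function, which makes it testable without AWS calls. This is a sketch, not the orchestrator's actual code: `decideAfterCreatePoll`, the `PollDecision` type, and the `FAILED` outcome for exhausted retries or unexpected statuses are assumptions added for illustration. The status strings are standard CloudFormation stack statuses.

```typescript
// Hypothetical decision helper; action names are illustrative only.
type PollDecision =
  | { action: 'REPOLL'; retryCount: number }
  | { action: 'STARTED' }
  | { action: 'RECREATE'; retryCount: number }
  | { action: 'FAILED'; reason: string };

function decideAfterCreatePoll(
  status: string,
  retryCount: number,
  maxRetries: number,
): PollDecision {
  switch (status) {
    case 'CREATE_IN_PROGRESS':
    case 'ROLLBACK_IN_PROGRESS':
      // Not terminal yet: check again on the next delivery.
      return { action: 'REPOLL', retryCount: retryCount + 1 };
    case 'CREATE_COMPLETE':
      return { action: 'STARTED' };
    case 'ROLLBACK_COMPLETE':
      // Retry while budget remains, otherwise surface a failure (assumed behavior).
      return retryCount < maxRetries
        ? { action: 'RECREATE', retryCount: retryCount + 1 }
        : { action: 'FAILED', reason: `retry limit reached (retryCount=${retryCount})` };
    default:
      return { action: 'FAILED', reason: `unexpected stack status: ${status}` };
  }
}
```

The handler would then map each action to the matching publishEvent or deleteStack call. Like the snippet above, this sketch increments the counter on every poll; tracking poll attempts and recreate attempts separately may be preferable if stack creation routinely takes many polls.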

3. Global Throttling via Resource Accounting

Before creating a new stack, the orchestrator checks total resource usage across all active stacks:

const runningTasks = await getRunningTaskCount(clusterName);
const activeConnections = runningTasks * PER_TASK_CONNECTION;

if (activeConnections > MAX_ALLOWED_CONNECTIONS) {
  // Throttle: re-enqueue with delay
  await publishEvent({
    eventType: 'UPLOADER_THROTTLED',
    delaySeconds: RE_ENQUEUE_DELAY,
    detail: { activeConnections, maxAllowed: MAX_ALLOWED_CONNECTIONS },
  });
  return;
}

// Safe to create — proceed with stack provisioning
await createStack(caseId, batchId);
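
The check above compares current usage against the limit. One possible refinement, sketched here as an assumption rather than the orchestrator's actual logic, also counts the connections the new stack would add at its maximum task count, so that admitting a batch cannot push the pool past its hard limit. `checkConnectionCapacity` and `newStackMaxTasks` are hypothetical names.

```typescript
// Hypothetical capacity check; parameter names are illustrative only.
interface CapacityCheck {
  allowed: boolean;
  activeConnections: number;
  projectedConnections: number;
}

function checkConnectionCapacity(
  runningTasks: number,
  perTaskConnections: number,
  newStackMaxTasks: number,
  maxAllowedConnections: number,
): CapacityCheck {
  // Connections held by tasks that are already running.
  const activeConnections = runningTasks * perTaskConnections;
  // Worst case once the new stack has scaled out fully.
  const projectedConnections =
    activeConnections + newStackMaxTasks * perTaskConnections;
  return {
    allowed: projectedConnections <= maxAllowedConnections,
    activeConnections,
    projectedConnections,
  };
}
```

When `allowed` is false, the caller would publish the throttle event exactly as shown above and include both figures in the event detail.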

4. State Tracking via SSM Parameter Store

Track per-stack state to prevent duplicate creation and manage lifecycle:

// On throttle, record state in SSM
const paramName = `/nge/docUploader/UploaderStack-${caseId}-${batchId}`;
await ssm.putParameter({
  Name: paramName,
  Value: JSON.stringify({ UPLOADER_THROTTLED: true }),
  Type: 'String',
  Overwrite: true,
});

// Before creating, check if already throttled.
// getParameter nests the value under `Parameter` and throws
// ParameterNotFound when no state has been recorded yet.
const state = await ssm.getParameter({ Name: paramName });
if (JSON.parse(state.Parameter?.Value ?? '{}').UPLOADER_THROTTLED) {
  // Already being handled: skip duplicate creation
  return;
}

Why SSM over DynamoDB? Simple key-value state (not queries), survives Lambda cold starts, explicit cleanup on stack deletion, no table provisioning needed.
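
Since the stored value is free-form JSON, the duplicate check is safer when a missing or malformed parameter is treated as "no state" instead of throwing. A minimal sketch follows; `stateParameterName`, `parseUploaderState`, and `isAlreadyThrottled` are hypothetical helpers, and the caller is assumed to pass `undefined` when the parameter does not exist.

```typescript
// Hypothetical helpers for reading the per-stack state value.
interface UploaderState {
  UPLOADER_THROTTLED?: boolean;
}

function stateParameterName(caseId: string, batchId: string): string {
  return `/nge/docUploader/UploaderStack-${caseId}-${batchId}`;
}

function parseUploaderState(raw: string | undefined): UploaderState {
  if (raw === undefined) return {};
  try {
    const parsed: unknown = JSON.parse(raw);
    // Accept plain objects only; arrays, null, and primitives count as empty state.
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      return parsed as UploaderState;
    }
  } catch (_e) {
    // Malformed JSON falls through to the empty state below.
  }
  return {};
}

function isAlreadyThrottled(raw: string | undefined): boolean {
  return parseUploaderState(raw).UPLOADER_THROTTLED === true;
}
```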

Stack Lifecycle State Machine

JOB_STARTED
  │
  ├── Connections OK ──→ CreateStack ──→ CREATE_INITIATED (poll)
  │                                         │
  │                         CREATE_IN_PROGRESS → re-poll
  │                         CREATE_COMPLETE → UPLOADER_STARTED
  │                         ROLLBACK_COMPLETE → retry or FAILED
  │
  └── Throttled ──→ UPLOADER_THROTTLED (delay 900s → retry)

LOADER_FINISHED
  │
  ├── Queues empty ──→ DeleteStack ──→ DELETE_INITIATED (poll)
  │                                       │
  │                       DELETE_IN_PROGRESS → re-poll
  │                       DELETE_COMPLETE → UPLOADER_FINISHED
  │
  └── Queues not empty ──→ re-enqueue (check again later)

IMPORT_CANCELLED
  │
  └── DeleteStack ──→ CANCEL_INITIATED (poll) ──→ UPLOADER_CANCELLED
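
The teardown branches of the diagram reduce to a small decision: a cancelled import deletes the stack unconditionally, while a finished loader deletes it only once every queue has drained. The sketch below assumes the caller supplies one message count per queue (ideally visible plus in-flight messages); `decideTeardown` and its type names are hypothetical.

```typescript
// Hypothetical teardown decision; names are illustrative only.
type TeardownEvent = 'LOADER_FINISHED' | 'IMPORT_CANCELLED';
type TeardownAction = 'DELETE_STACK' | 'RE_ENQUEUE';

function decideTeardown(
  event: TeardownEvent,
  queueDepths: number[],
): TeardownAction {
  if (event === 'IMPORT_CANCELLED') {
    // Cancellation tears the stack down regardless of remaining messages.
    return 'DELETE_STACK';
  }
  // LOADER_FINISHED: delete only when every queue reports zero messages.
  const allEmpty = queueDepths.every((depth) => depth === 0);
  return allEmpty ? 'DELETE_STACK' : 'RE_ENQUEUE';
}
```

SQS message counts are approximate, so a real check may want to see empty queues on more than one consecutive poll before deleting.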

Design Considerations

CF creation latency (3-5 min) — acceptable for batch imports running hours; use self-referencing polling to avoid blocking.
Stack naming collisions — use deterministic names (UploaderStack-{caseId}-{batchId}) to prevent duplicates and enable idempotent creation checks.
Nested stacks for modularity — split per-import resources into nested stacks (queues, task definitions, ECS service) for independent updates.
Template versioning — store synthesized templates in S3 with version prefixes; orchestrator always uses the latest deployed template.
Orphan protection — if orchestrator fails mid-lifecycle, re-processing the same event checks existing stack status rather than creating a duplicate.
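
The orphan-protection and deterministic-naming points can be combined into one idempotency check: when an event is re-processed, the current status of the deterministically named stack decides what to do. The sketch below is an assumption about how such a check might look. `decideOnReprocessedEvent` and its action names are hypothetical, and `undefined` stands for "no stack with that name exists".

```typescript
// Hypothetical idempotency check for a re-processed JOB_STARTED event.
type CreateAction =
  | 'CREATE'
  | 'POLL_EXISTING'
  | 'ALREADY_RUNNING'
  | 'DELETE_THEN_RECREATE'
  | 'NEEDS_ATTENTION';

function decideOnReprocessedEvent(
  existingStatus: string | undefined,
): CreateAction {
  switch (existingStatus) {
    case undefined:
    case 'DELETE_COMPLETE':
      // No live stack under this name: safe to create.
      return 'CREATE';
    case 'CREATE_IN_PROGRESS':
      // A previous invocation already started creation: resume polling.
      return 'POLL_EXISTING';
    case 'CREATE_COMPLETE':
      return 'ALREADY_RUNNING';
    case 'ROLLBACK_COMPLETE':
      // A rolled-back stack can only be deleted before the name is reused.
      return 'DELETE_THEN_RECREATE';
    default:
      // Any other status is left for an operator or a dedicated handler.
      return 'NEEDS_ATTENTION';
  }
}
```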

Real-World Usage

documentuploader — creates per-case/batch ECS Fargate stacks with 3-container tasks (Poller, DocEngine, Redirection Service), SQS queues, and DLQs. See reference-implementations/documentuploader.md.
