#Infrastructure
You start with one AWS account. You write some Terraform. Life is good.
Then ops says “we need a separate prod, staging, and dev”. Then security says “and logging, audit, and shared-services.” Suddenly you’re staring at 6+ AWS accounts and a Terraform repo that hardcodes provider "aws" blocks with literal account IDs everywhere.
This post is the layout I converged on after running multi-account infra at multiple clients — the one that scales from 3 accounts to 30 without rewriting the world.
The two patterns I see most often, and what’s wrong with them:
Anti-pattern 1: one giant repo, one giant state file
provider "aws" { alias = "prod", region = "eu-west-1", profile = "prod" }
provider "aws" { alias = "staging", region = "eu-west-1", profile = "staging" }
provider "aws" { alias = "dev", region = "eu-west-1", profile = "dev" }
module "vpc_prod" { source = "./vpc" providers = { aws = aws.prod } }
module "vpc_staging" { source = "./vpc" providers = { aws = aws.staging } }
Every terraform plan touches all three accounts. A typo in the dev config can break or change the plan for prod. Blast radius: comically large.
Anti-pattern 2: copy-paste the repo per environment
infra-prod/, infra-staging/, infra-dev/ — three repos, three nearly-identical sets of modules, and a forever drift problem the first time someone “fixes” something in prod without backporting.
infra/
├── modules/ # generic, reusable, no environment-specific code
│ ├── vpc/
│ ├── eks/
│ ├── rds/
│ └── monitoring/
├── live/ # one folder per (account, region, stack)
│ ├── prod-eu-west-1/
│ │ ├── network/
│ │ │ ├── main.tf # `module "vpc" { source = "../../../modules/vpc" }`
│ │ │ ├── backend.tf # s3 backend, key = "prod/eu-west-1/network.tfstate"
│ │ │ └── terraform.tfvars
│ │ ├── platform/ # eks, ingress, addons
│ │ └── data/ # rds, elasticache
│ ├── staging-eu-west-1/
│ └── shared-services/
└── bootstrap/ # the one-time setup: state buckets, IAM roles
Each leaf folder gets its own state file. Blast radius per apply: one stack, one account, one region. A staging plan can’t touch prod because the providers, the credentials, and the state file are all different.
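As a rough sketch of what one leaf stack might contain, here is a possible live/prod-eu-west-1/network/main.tf. The module inputs (name, cidr_block) and the vpc_cidr variable are hypothetical, since the real interface of modules/vpc isn't shown here; swap in whatever your module actually accepts.

```hcl
# live/prod-eu-west-1/network/main.tf
# Sketch only: the input names below are assumptions, not the real interface of modules/vpc.

variable "vpc_cidr" {
  type        = string
  description = "CIDR range for this account's VPC (hypothetical input)."
}

module "vpc" {
  source = "../../../modules/vpc"

  name       = "prod-network" # hypothetical input
  cidr_block = var.vpc_cidr   # hypothetical input, set in terraform.tfvars
}
```

The point of the shape is that the leaf folder holds only wiring and values. Anything reusable stays in modules/, so a fix there reaches every environment through the same code path.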
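You don’t want long-lived IAM users per account. The model that scales: sign in once through AWS SSO (IAM Identity Center), then assume a dedicated Terraform role in each target account.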
In ~/.aws/config:
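[sso-session company]
# Placeholder values: substitute your own IAM Identity Center start URL and region.
sso_start_url = https://example.awsapps.com/start
sso_region = eu-west-1
[profile sso]
sso_session = company
sso_account_id = 111111111111
sso_role_name = DeveloperAccess
region = eu-west-1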
[profile prod]
source_profile = sso
role_arn = arn:aws:iam::222222222222:role/TerraformAdmin
region = eu-west-1
[profile staging]
source_profile = sso
role_arn = arn:aws:iam::333333333333:role/TerraformAdmin
region = eu-west-1
In each leaf stack’s provider.tf:
provider "aws" {
  region  = var.region
  profile = var.profile # e.g. "prod"

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Repo        = "infra"
      Stack       = "network"
    }
  }
}
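The provider block above reads three variables that each leaf stack has to declare. A minimal sketch of a matching variables.tf follows; the descriptions and the allowed environment list are my assumptions for illustration, not part of the original layout.

```hcl
# variables.tf (sketch): inputs consumed by provider.tf in each leaf stack.

variable "region" {
  type        = string
  description = "AWS region this stack deploys into, e.g. \"eu-west-1\"."
}

variable "profile" {
  type        = string
  description = "Named profile from ~/.aws/config, e.g. \"prod\"."
}

variable "environment" {
  type        = string
  description = "Environment name used in default_tags."

  validation {
    # The allowed list is an assumption; extend it to match your accounts.
    condition     = contains(["prod", "staging", "dev"], var.environment)
    error_message = "The environment value must be one of: prod, staging, dev."
  }
}
```

The values would then live in the stack's terraform.tfvars, for example:

```hcl
# live/prod-eu-west-1/network/terraform.tfvars
region      = "eu-west-1"
profile     = "prod"
environment = "prod"
```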
Now run aws sso login --sso-session company once a day, and Terraform inherits the credentials. No static keys, full audit trail via CloudTrail.
You can keep state in one bucket per account or in a single shared bucket. Both work. The pragmatic answer: one bucket in the shared-services account, with a key prefix per account. Lower ops cost, a single backup/replication policy, a single KMS key to rotate.
I default to shared. Bucket layout:
s3://acme-tfstate/
├── prod/eu-west-1/network.tfstate
├── prod/eu-west-1/platform.tfstate
├── staging/eu-west-1/network.tfstate
└── shared-services/global/iam.tfstate
backend.tf in each stack:
terraform {
  backend "s3" {
    bucket         = "acme-tfstate"
    key            = "prod/eu-west-1/network.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "acme-tfstate-locks"
    encrypt        = true
  }
}
The DynamoDB lock table is non-negotiable. Two engineers applying the same stack at the same time without it will corrupt your state in spectacular ways.
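For reference, here is a sketch of what the bootstrap/ folder might create to back that backend.tf. The bucket and table names match the ones used above; the resource labels, the encryption choice, and the file name are assumptions. Because this stack creates its own state bucket, it usually has to be applied once with local state before anything else can use the S3 backend.

```hcl
# bootstrap/state.tf (sketch): shared state bucket and lock table.
# Names match backend.tf above; everything else is an assumption.

resource "aws_s3_bucket" "tfstate" {
  bucket = "acme-tfstate"

  lifecycle {
    prevent_destroy = true # guard against deleting every stack's state by accident
  }
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled" # keeps old state versions so a bad write can be rolled back
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # uses the AWS-managed key unless kms_master_key_id is set
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tfstate_locks" {
  name         = "acme-tfstate-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects a string key named exactly LockID

  attribute {
    name = "LockID"
    type = "S"
  }
}
```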
terraform_remote_state
The platform stack needs the VPC ID from the network stack. The data stack needs the EKS OIDC URL from the platform stack. Don’t hand-copy these; read them:
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "acme-tfstate"
    key    = "prod/eu-west-1/network.tfstate"
    region = "eu-west-1"
  }
}

module "eks" {
  source = "../../../modules/eks"

  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
Two rules: outputs are the contract between stacks (don’t break them lightly), and never read state from a different ownership boundary without the other team’s blessing — you’ve just created a non-obvious dependency they can’t see.
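On the producing side, the contract is the set of root-level outputs, since terraform_remote_state only exposes outputs declared in the root module of the other stack. A minimal sketch of the network stack's outputs.tf, assuming modules/vpc itself exposes outputs named vpc_id and private_subnet_ids:

```hcl
# live/prod-eu-west-1/network/outputs.tf (sketch)
# Assumes modules/vpc exposes outputs named vpc_id and private_subnet_ids.

output "vpc_id" {
  description = "ID of the VPC; read by the platform stack."
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "Private subnet IDs; read by the platform stack."
  value       = module.vpc.private_subnet_ids
}
```

Renaming or removing either output breaks every stack that reads it, which is why the descriptions name the consumers.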
In GitLab CI:
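.tf-plan:
  image:
    name: hashicorp/terraform:1.7
    entrypoint: [""] # the image's entrypoint is the terraform binary; reset it so GitLab can run shell commands
  before_script:
    # Authenticate with OIDC or another non-interactive method here.
    # `aws sso login` needs a human to approve the sign-in, and the AWS CLI is not in this image.
    - cd $STACK_PATH
    - terraform init -input=false
  script:
    - terraform plan -out=tfplan -input=false
  artifacts:
    paths: [$STACK_PATH/tfplan]
    expire_in: 1 day

plan:prod-network:
  extends: .tf-plan
  variables:
    STACK_PATH: live/prod-eu-west-1/network
  rules:
    - changes: ["modules/vpc/**", "live/prod-eu-west-1/network/**"]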
A change to modules/vpc/ triggers a plan in every stack that depends on it. A change to one leaf folder triggers only that stack. The rules: changes block is doing the heavy lifting — without it, every merge runs every plan.
For prod, gate the apply job behind a manual approval. Always.
required_providers + version pins go in every leaf stack. Not just at the root. State files are versioned; you don’t want a fresh init to pull provider 6.0 against state written by 5.2.
Use moved blocks for refactors. Renaming a resource without a moved block makes Terraform delete and recreate it. With a 5-line moved block, it’s a no-op.
Never run terraform destroy from a laptop on a prod stack. Make the only path to destruction a CI job with manual approval. The “I had the wrong shell open” incidents this prevents more than pay for the setup.
The trick to Terraform at scale isn’t a magic module or a special CLI flag. It’s boring boundaries: one state per (account, region, stack), one role-assumption chain, one source of truth for shared values via remote state. Everything else is mechanical.
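For the version-pin and moved-block habits, a sketch of what they might look like in a leaf stack. The version constraints and the resource names are illustrative assumptions; pick constraints that match what your state was actually written with.

```hcl
# versions.tf (sketch): the constraints below are illustrative assumptions.
terraform {
  required_version = "~> 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40" # allows later 5.x releases, never 6.0
    }
  }
}

# Refactor example with hypothetical resource names: tells Terraform the
# existing object now lives at a new address instead of planning a
# destroy and create.
moved {
  from = aws_security_group.web
  to   = aws_security_group.ingress
}
```

Committing the generated .terraform.lock.hcl alongside the constraint pins the exact provider build, so a fresh init in CI resolves the same version as on your laptop.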
— Abdel