#Infrastructure

Terraform across multiple AWS accounts without losing your mind

Abdelfattah Hilmi · Dec 8, 2025

#Terraform #AWS #IaC #DevOps

The problem

You start with one AWS account. You write some Terraform. Life is good.

Then ops says “we need separate prod, staging, and dev accounts.” Then security says “and logging, audit, and shared-services.” Suddenly you’re staring at 6+ AWS accounts and a Terraform repo that hardcodes provider "aws" blocks with literal account IDs everywhere.

This post is the layout I converged on after running multi-account infra at multiple clients — the one that scales from 3 accounts to 30 without rewriting the world.

Don’t do this

The two patterns I see most often, and what’s wrong with them:

Anti-pattern 1: one giant repo, one giant state file

# HCL does not allow several arguments in a single-line block, so each block is spelled out.
provider "aws" {
  alias   = "prod"
  region  = "eu-west-1"
  profile = "prod"
}

provider "aws" {
  alias   = "staging"
  region  = "eu-west-1"
  profile = "staging"
}

provider "aws" {
  alias   = "dev"
  region  = "eu-west-1"
  profile = "dev"
}

module "vpc_prod" {
  source    = "./vpc"
  providers = { aws = aws.prod }
}

module "vpc_staging" {
  source    = "./vpc"
  providers = { aws = aws.staging }
}

Every terraform plan touches all three accounts, so a typo in the dev configuration can fail or change the plan for prod. Blast radius: comically large.

Anti-pattern 2: copy-paste the repo per environment

infra-prod/, infra-staging/, infra-dev/ — three repos, three nearly-identical sets of modules, and a forever drift problem the first time someone “fixes” something in prod without backporting.

The layout I use

infra/
├── modules/                  # generic, reusable, no environment-specific code
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── monitoring/
├── live/                     # one folder per (account, region, stack)
│   ├── prod-eu-west-1/
│   │   ├── network/
│   │   │   ├── main.tf       # `module "vpc" { source = "../../../modules/vpc" }`
│   │   │   ├── backend.tf    # s3 backend, key = "prod/eu-west-1/network.tfstate"
│   │   │   └── terraform.tfvars
│   │   ├── platform/         # eks, ingress, addons
│   │   └── data/             # rds, elasticache
│   ├── staging-eu-west-1/
│   └── shared-services/
└── bootstrap/                # the one-time setup: state buckets, IAM roles

Each leaf folder gets its own state file. Blast radius per apply: one stack, one account, one region. A staging plan can’t touch prod because the providers, the credentials, and the state file are all different.
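
As a rough sketch of what one leaf stack's main.tf might contain under this layout: the module path comes from the tree above, but the input names (name, cidr_block) and the vpc_cidr variable are hypothetical and depend on what your modules/vpc actually declares.

```hcl
# live/prod-eu-west-1/network/main.tf (sketch)

variable "vpc_cidr" {
  description = "CIDR range for this environment's VPC (hypothetical input)."
  type        = string
}

module "vpc" {
  source = "../../../modules/vpc"

  # Hypothetical module inputs; use whatever modules/vpc really exposes.
  name       = "prod-eu-west-1"
  cidr_block = var.vpc_cidr
}
```

The leaf stack stays thin on purpose: it only wires environment-specific values into a generic module, so the staging folder can be a near copy with a different terraform.tfvars.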

Authenticating: the assume-role chain

You don’t want long-lived IAM users per account. The model that scales:

1. You authenticate once into a “users” or SSO account.
2. Terraform assumes a role into the target account.
3. (Optional) Terraform assumes a role from there into a sub-account.

In ~/.aws/config:

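# Assumes a matching [sso-session company] section (sso_start_url, sso_region)
# already exists in this file, e.g. as written by `aws configure sso`.
[profile sso]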
sso_session = company
sso_account_id = 111111111111
sso_role_name  = DeveloperAccess
region         = eu-west-1

[profile prod]
source_profile = sso
role_arn       = arn:aws:iam::222222222222:role/TerraformAdmin
region         = eu-west-1

[profile staging]
source_profile = sso
role_arn       = arn:aws:iam::333333333333:role/TerraformAdmin
region         = eu-west-1
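
For those role_arn entries to work, the TerraformAdmin role in each target account has to trust the identity you sign in with. A minimal sketch of that role as it might live in bootstrap/, applied in the prod account, is below. The AdministratorAccess attachment and the exact shape of the SSO role path are assumptions, not something the config above dictates; the path shown assumes IAM Identity Center runs in a region other than us-east-1.

```hcl
# bootstrap/ sketch, applied in the prod account (222222222222)

data "aws_iam_policy_document" "terraform_admin_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    # Trust the SSO account as a whole...
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111111111111:root"]
    }

    # ...then narrow it to the role IAM Identity Center provisions for the
    # DeveloperAccess permission set (path pattern is an assumption).
    condition {
      test     = "ArnLike"
      variable = "aws:PrincipalArn"
      values   = ["arn:aws:iam::111111111111:role/aws-reserved/sso.amazonaws.com/*/AWSReservedSSO_DeveloperAccess_*"]
    }
  }
}

resource "aws_iam_role" "terraform_admin" {
  name               = "TerraformAdmin"
  assume_role_policy = data.aws_iam_policy_document.terraform_admin_trust.json
}

# Assumption: full admin. Scope this down if your stacks allow it.
resource "aws_iam_role_policy_attachment" "terraform_admin" {
  role       = aws_iam_role.terraform_admin.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}
```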

In each leaf stack’s provider.tf:

provider "aws" {
  region  = var.region
  profile = var.profile           # e.g. "prod"

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Repo        = "infra"
      Stack       = "network"
    }
  }
}
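
The provider block above reads three variables. One way to declare them in the same stack is sketched here; the validation rule and its list of allowed values are my assumption, so adjust or drop it.

```hcl
# variables.tf (sketch)

variable "region" {
  description = "AWS region for this stack, e.g. eu-west-1."
  type        = string
}

variable "profile" {
  description = "Named profile from ~/.aws/config, e.g. prod."
  type        = string
}

variable "environment" {
  description = "Value for the Environment default tag."
  type        = string

  # Hypothetical guard against typos in terraform.tfvars.
  validation {
    condition     = contains(["prod", "staging", "dev", "shared-services"], var.environment)
    error_message = "The environment must be one of prod, staging, dev, or shared-services."
  }
}
```

The values themselves go in each stack's terraform.tfvars, which is the only file that should differ meaningfully between prod and staging.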

Now run aws sso login --sso-session company once a day, and Terraform picks up the credentials through the named profile. No static keys, and every role assumption is recorded in CloudTrail.

State: one bucket per account, or one shared?

Both work. The pragmatic answer:

One shared state bucket in a shared-services account, with a key prefix per account. Lower ops cost, single backup/replication policy, single KMS key to rotate.
One bucket per account if your compliance regime requires data isolation per environment.

I default to shared. Bucket layout:

s3://acme-tfstate/
├── prod/eu-west-1/network.tfstate
├── prod/eu-west-1/platform.tfstate
├── staging/eu-west-1/network.tfstate
└── shared-services/global/iam.tfstate

backend.tf in each stack:

terraform {
  backend "s3" {
    bucket         = "acme-tfstate"
    key            = "prod/eu-west-1/network.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "acme-tfstate-locks"
    encrypt        = true
  }
}

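The DynamoDB lock table is non-negotiable. Without it, two engineers running apply on the same stack at the same time can corrupt your state in spectacular ways. (On Terraform 1.10 and later, the S3 backend's use_lockfile option is an alternative to the DynamoDB table.)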
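
The bucket and lock table have to exist before any backend block can point at them, which is what bootstrap/ is for. A minimal sketch, assuming the names from the backend block above; versioning, the public access block, and the dedicated KMS key are my additions, and the cross-account bucket and key policies are deliberately left out.

```hcl
# bootstrap/ sketch, applied in the shared-services account

resource "aws_kms_key" "tfstate" {
  description         = "Encrypts Terraform state objects"
  enable_key_rotation = true
}

resource "aws_s3_bucket" "tfstate" {
  bucket = "acme-tfstate"
}

# Versioning lets you roll back to an earlier state object.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.tfstate.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string hash key named exactly "LockID".
resource "aws_dynamodb_table" "tfstate_locks" {
  name         = "acme-tfstate-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

This stack is the one place where state starts out local: apply it once, then migrate its own state into the bucket if you want it tracked like everything else.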

Cross-account references with `terraform_remote_state`

The platform stack needs the VPC ID from the network stack. The data stack needs the EKS OIDC URL from the platform stack. Don’t hand-copy these — read them:

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-tfstate"
    key    = "prod/eu-west-1/network.tfstate"
    region = "eu-west-1"
  }
}

module "eks" {
  source     = "../../../modules/eks"
  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}

Two rules: outputs are the contract between stacks (don’t break them lightly), and never read state from a different ownership boundary without the other team’s blessing — you’ve just created a non-obvious dependency they can’t see.
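
On the producing side, that contract is just an outputs.tf in the network stack. A minimal sketch, assuming modules/vpc exposes outputs with these same names (it may not in your repo):

```hcl
# live/prod-eu-west-1/network/outputs.tf (sketch)

output "vpc_id" {
  description = "ID of the VPC; read by the platform stack via remote state."
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "Private subnet IDs; read by the platform stack via remote state."
  value       = module.vpc.private_subnet_ids
}
```

Only root-level outputs of a stack are visible through terraform_remote_state, so anything a downstream stack needs has to be re-exported here even if the module already outputs it.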

CI: one pipeline, parameterized by stack

In GitLab CI:

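.tf-plan:
  image:
    name: hashicorp/terraform:1.7
    entrypoint: [""]              # the image's default entrypoint is the terraform binary
  before_script:
    # Credentials must come from the CI environment (OIDC or CI/CD variables);
    # an interactive `aws sso login` cannot complete inside a pipeline.
    - cd "$STACK_PATH"
    - terraform init -input=false
  script:
    - terraform plan -out=tfplan -input=false
  artifacts:
    paths: [$STACK_PATH/tfplan]
    expire_in: 1 day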

plan:prod-network:
  extends: .tf-plan
  variables:
    STACK_PATH: live/prod-eu-west-1/network
  rules:
    - changes: ["modules/vpc/**/*", "live/prod-eu-west-1/network/**/*"]

A change to modules/vpc/ triggers a plan in every stack that depends on it. A change to one leaf folder triggers only that stack. The rules: changes block is doing the heavy lifting — without it, every merge runs every plan.

For prod, gate the apply job behind a manual approval. Always.

A few things I wish I’d known sooner

required_providers and version pins go in every leaf stack, not just at the root. State files are versioned; you don’t want a fresh init to pull provider 6.0 against state written by 5.2.
Use moved blocks for refactors. Renaming a resource without a moved block makes Terraform plan a delete and a recreate. With a short moved block, it’s a no-op.
Never terraform destroy from a laptop on a prod stack. Make the only path to destruction a CI job with manual approval. The number of “I had the wrong shell open” incidents this prevents pays for itself.
Separate state for things that change at different cadences. Network changes once a quarter. Apps change daily. Keep them in different state files so a bad app deploy can’t block or break a network plan.
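
The first two tips above can be sketched in a few lines. The version constraints match the numbers used in this post but are otherwise an assumption, and the resource addresses in the moved block (aws_vpc.main, aws_vpc.this) are hypothetical.

```hcl
# versions.tf in every leaf stack (sketch)
terraform {
  required_version = "~> 1.7.0" # any 1.7.x release

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.2" # >= 5.2 and < 6.0
    }
  }
}

# After renaming resource "aws_vpc" "main" to "this" in the module:
moved {
  from = aws_vpc.main
  to   = aws_vpc.this
}
```

With the moved block in place, the next plan should report the address change instead of a destroy and create. You can delete the block once every state that contained the old address has been applied.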

Closing thought

The trick to Terraform at scale isn’t a magic module or a special CLI flag. It’s boring boundaries: one state per (account, region, stack), one role-assumption chain, one source of truth for shared values via remote state. Everything else is mechanical.

— Abdel