
#Infrastructure

Terraform across multiple AWS accounts without losing your mind

Abdelfattah Hilmi
#Terraform #AWS #IaC #DevOps

The problem

You start with one AWS account. You write some Terraform. Life is good.

Then ops says “we need separate prod, staging, and dev accounts.” Then security says “and logging, audit, and shared-services.” Suddenly you’re staring at 6+ AWS accounts and a Terraform repo full of provider "aws" blocks with literal account IDs hardcoded everywhere.

This post is the layout I converged on after running multi-account infra at multiple clients — the one that scales from 3 accounts to 30 without rewriting the world.

Don’t do this

The two patterns I see most often, and what’s wrong with them:

Anti-pattern 1: one giant repo, one giant state file

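# Condensed to one line per block for readability; real HCL needs one argument per line and no commas.
provider "aws" { alias = "prod",    region = "eu-west-1", profile = "prod" }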
provider "aws" { alias = "staging", region = "eu-west-1", profile = "staging" }
provider "aws" { alias = "dev",     region = "eu-west-1", profile = "dev" }

module "vpc_prod"    { source = "./vpc" providers = { aws = aws.prod    } }
module "vpc_staging" { source = "./vpc" providers = { aws = aws.staging } }

Every terraform plan touches all three accounts, so a typo in the dev config breaks or alters a plan that also covers prod. Blast radius: comically large.

Anti-pattern 2: copy-paste the repo per environment

infra-prod/, infra-staging/, infra-dev/ — three repos, three nearly-identical sets of modules, and a forever drift problem the first time someone “fixes” something in prod without backporting.

The layout I use

infra/
├── modules/                  # generic, reusable, no environment-specific code
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── monitoring/
├── live/                     # one folder per (account, region, stack)
│   ├── prod-eu-west-1/
│   │   ├── network/
│   │   │   ├── main.tf       # `module "vpc" { source = "../../../modules/vpc" }`
│   │   │   ├── backend.tf    # s3 backend, key = "prod/eu-west-1/network.tfstate"
│   │   │   └── terraform.tfvars
│   │   ├── platform/         # eks, ingress, addons
│   │   └── data/             # rds, elasticache
│   ├── staging-eu-west-1/
│   └── shared-services/
└── bootstrap/                # the one-time setup: state buckets, IAM roles

Each leaf folder gets its own state file. Blast radius per apply: one stack, one account, one region. A staging plan can’t touch prod because the providers, the credentials, and the state file are all different.

Authenticating: the assume-role chain

You don’t want long-lived IAM users per account. The model that scales:

  1. You authenticate once into a “users” or SSO account.
  2. Terraform assumes a role into the target account.
  3. (Optional) Terraform assumes a role from there into a sub-account.

In ~/.aws/config:

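# The sso-session section that the profile below refers to.
# The start URL and SSO region here are placeholders; use your own IAM Identity Center values.
[sso-session company]
sso_start_url = https://example.awsapps.com/start
sso_region    = eu-west-1

[profile sso]
sso_session    = company
sso_account_id = 111111111111
sso_role_name  = DeveloperAccess
region         = eu-west-1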

[profile prod]
source_profile = sso
role_arn       = arn:aws:iam::222222222222:role/TerraformAdmin
region         = eu-west-1

[profile staging]
source_profile = sso
role_arn       = arn:aws:iam::333333333333:role/TerraformAdmin
region         = eu-west-1

In each leaf stack’s provider.tf:

provider "aws" {
  region  = var.region
  profile = var.profile           # e.g. "prod"

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Repo        = "infra"
      Stack       = "network"     # set per leaf stack
    }
  }
}
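
The provider block above reads var.region, var.profile, and var.environment, so every leaf stack has to declare them. A minimal sketch, assuming the declarations live in a variables.tf next to provider.tf (the file name and the validation rule are my assumptions, not part of the layout above):

```hcl
# variables.tf (hypothetical file name) in live/prod-eu-west-1/network/

variable "region" {
  description = "AWS region this stack deploys into"
  type        = string
}

variable "profile" {
  description = "Named profile from ~/.aws/config used to reach the target account"
  type        = string
}

variable "environment" {
  description = "Environment name, used for the Environment default tag"
  type        = string

  validation {
    # Assumed list of environments; adjust it to your own accounts.
    condition     = contains(["prod", "staging", "dev", "shared-services"], var.environment)
    error_message = "Environment must be one of: prod, staging, dev, shared-services."
  }
}
```

With that in place, terraform.tfvars for the prod network stack would set region = "eu-west-1", profile = "prod", and environment = "prod". Keeping the values in tfvars means provider.tf can be identical in every leaf folder.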

Now run aws sso login --sso-session company once a day, and Terraform picks up the credentials through the profile. No static keys, and every role assumption shows up in CloudTrail.
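
Step 3 of the chain (hopping from the target account into a sub-account) is not covered by the profiles above. One way to sketch it is the AWS provider's assume_role block layered on top of the profile. The sub-account ID 444444444444 and its role name are placeholders I made up for illustration, and the variables are assumed to be declared in the stack:

```hcl
provider "aws" {
  region  = var.region
  profile = var.profile # e.g. "prod": SSO login, then TerraformAdmin in the prod account

  # Second hop: from the prod role into a sub-account.
  # Placeholder account ID and role name, not values from this setup.
  assume_role {
    role_arn     = "arn:aws:iam::444444444444:role/TerraformAdmin"
    session_name = "terraform-${var.environment}"
  }
}
```

For this to work, the trust policy on the sub-account role has to allow the prod TerraformAdmin role to assume it. The session name is optional, but it makes the hop easier to spot in CloudTrail.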

State: one bucket per account, or one shared?

Both work. The pragmatic answer: I default to one shared bucket, with a key prefix per account and region. Bucket layout:

s3://acme-tfstate/
├── prod/eu-west-1/network.tfstate
├── prod/eu-west-1/platform.tfstate
├── staging/eu-west-1/network.tfstate
└── shared-services/global/iam.tfstate

backend.tf in each stack:

terraform {
  backend "s3" {
    bucket         = "acme-tfstate"
    key            = "prod/eu-west-1/network.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "acme-tfstate-locks"
    encrypt        = true
  }
}

The DynamoDB lock table is non-negotiable. Without it, two engineers applying the same stack at the same time can overwrite each other’s writes and corrupt your state in spectacular ways.
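
The bucket and lock table named in backend.tf have to exist before the first terraform init, which is what the bootstrap/ folder is for. A minimal sketch of that one-time stack, assuming AWS provider v4 or later (where bucket versioning is its own resource) and the acme-tfstate names used above:

```hcl
# bootstrap/main.tf (hypothetical file name): applied once, with local state,
# from the account that owns the shared state bucket.

resource "aws_s3_bucket" "tfstate" {
  bucket = "acme-tfstate"
}

# Versioning lets you roll back to an earlier state object after a bad apply.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Block every form of public access to the state bucket.
resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects the lock table's hash key to be a string named "LockID".
resource "aws_dynamodb_table" "tfstate_locks" {
  name         = "acme-tfstate-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

With a shared bucket, the Terraform roles in the other accounts also need permission to read and write this bucket and table (a bucket policy or cross-account IAM grants). That part is left out of the sketch.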

Cross-account references with terraform_remote_state

The platform stack needs the VPC ID from the network stack. The data stack needs the EKS OIDC URL from the platform stack. Don’t hand-copy these — read them:

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-tfstate"
    key    = "prod/eu-west-1/network.tfstate"
    region = "eu-west-1"
  }
}

module "eks" {
  source     = "../../../modules/eks"
  vpc_id     = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}

Two rules: outputs are the contract between stacks (don’t break them lightly), and never read state from a different ownership boundary without the other team’s blessing — you’ve just created a non-obvious dependency they can’t see.
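
For the remote-state lookup above to resolve, the network stack has to export those two values. A sketch of its outputs, assuming the vpc module itself exposes outputs named vpc_id and private_subnet_ids (the module's internals are not shown in this post, so treat those names as assumptions):

```hcl
# live/prod-eu-west-1/network/outputs.tf (hypothetical file name)

output "vpc_id" {
  description = "VPC ID, read by downstream stacks via terraform_remote_state"
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "Private subnet IDs, read by the platform stack for EKS"
  value       = module.vpc.private_subnet_ids
}
```

Renaming or removing either output breaks the next plan in every stack that reads it, which is why outputs deserve the same care as a public API.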

CI: one pipeline, parameterized by stack

In GitLab CI:

.tf-plan:
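  image:
    name: hashicorp/terraform:1.7
    entrypoint: [""]            # the image's entrypoint is terraform itself; GitLab needs a shell
  before_script:
    # AWS credentials must already be in the job environment, e.g. via GitLab OIDC.
    # aws sso login needs a human to approve the login, so it cannot run unattended in CI.
    - cd "$STACK_PATH"
    - terraform init -input=false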
  script:
    - terraform plan -out=tfplan -input=false
  artifacts:
    paths: [$STACK_PATH/tfplan]
    expire_in: 1 day

plan:prod-network:
  extends: .tf-plan
  variables:
    STACK_PATH: live/prod-eu-west-1/network
  rules:
    - changes: ["modules/vpc/**/*", "live/prod-eu-west-1/network/**/*"]

A change to modules/vpc/ triggers a plan in every stack that depends on it. A change to one leaf folder triggers only that stack. The rules: changes block is doing the heavy lifting — without it, every merge runs every plan.

For prod, gate the apply job behind a manual approval. Always.

A few things I wish I’d known sooner

Closing thought

The trick to Terraform at scale isn’t a magic module or a special CLI flag. It’s boring boundaries: one state per (account, region, stack), one role-assumption chain, one source of truth for shared values via remote state. Everything else is mechanical.

Abdel
