AWS Deployment

Deploy the FailZero agent on AWS using EC2 or ECS/Fargate.

EC2

1. Create IAM Role

# Create trust policy
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create role
aws iam create-role \
  --role-name failzero-agent \
  --assume-role-policy-document file://trust-policy.json

2. Attach IAM Policies

# Create policy for DR operations
cat > failzero-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "rds:PromoteReadReplica",
        "rds:DescribeDBInstances",
        "rds:ModifyDBInstance"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListHostedZones",
        "route53:GetHostedZone"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:UpdateAutoScalingGroup",
        "autoscaling:DescribeAutoScalingGroups"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:*:*:secret:failzero-*"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name failzero-agent \
  --policy-name failzero-dr-operations \
  --policy-document file://failzero-policy.json

Note: These examples use wildcard resources for simplicity. In production, scope each statement to the specific resource ARNs your DR plan touches.
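
As a rough sketch of what scoping could look like, the following writes a narrower policy file. The account ID, region, instance identifier my-replica, hosted zone ID, and the file name failzero-policy-scoped.json are hypothetical placeholders, and the split of actions per resource is an assumption to check against your own DR plan (for example, rds:ModifyDBInstance can need further resource types depending on which settings it changes).

```shell
# Hypothetical values: replace with your own account, region, and resources.
ACCOUNT_ID="123456789012"
REGION="us-east-1"
DB_INSTANCE="my-replica"          # hypothetical RDS instance identifier
HOSTED_ZONE_ID="Z0000000EXAMPLE"  # hypothetical Route 53 hosted zone ID

# Unquoted EOF so the variables above are expanded into the ARNs.
cat > failzero-policy-scoped.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["rds:PromoteReadReplica", "rds:ModifyDBInstance"],
      "Resource": "arn:aws:rds:${REGION}:${ACCOUNT_ID}:db:${DB_INSTANCE}"
    },
    {
      "Effect": "Allow",
      "Action": "rds:DescribeDBInstances",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "route53:ChangeResourceRecordSets",
      "Resource": "arn:aws:route53:::hostedzone/${HOSTED_ZONE_ID}"
    }
  ]
}
EOF
```

The file can then be passed to aws iam put-role-policy in place of failzero-policy.json.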

3. Create Instance Profile

aws iam create-instance-profile \
  --instance-profile-name failzero-agent

aws iam add-role-to-instance-profile \
  --instance-profile-name failzero-agent \
  --role-name failzero-agent

4. Launch EC2 Instance

# Store token in Secrets Manager
aws secretsmanager create-secret \
  --name failzero-agent-token \
  --secret-string "fzat_your_token"

# Create user data script
cat > user-data.sh << 'EOF'
#!/bin/bash
yum update -y
yum install -y docker
systemctl start docker
systemctl enable docker

# Fetch token from Secrets Manager
# (pass --region explicitly rather than relying on a configured default)
TOKEN=$(aws secretsmanager get-secret-value \
  --region us-east-1 \
  --secret-id failzero-agent-token \
  --query SecretString \
  --output text)

docker run -d \
  --name failzero-agent \
  --restart unless-stopped \
  -e FAILZERO_AGENT_TOKEN="$TOKEN" \
  -e FAILZERO_API_URL=https://api.failzero.io \
  -e PROVIDER_TYPE=aws \
  -e AWS_ACCOUNT_ID=123456789012 \
  -e AWS_REGION=us-east-1 \
  failzero/agent:latest
EOF

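# Launch instance
# (AMI IDs are region-specific; replace the example ID with a current Amazon Linux AMI for your region)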
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type t3.small \
  --iam-instance-profile Name=failzero-agent \
  --user-data file://user-data.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=failzero-agent}]'
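
Because AMI IDs differ by region and change over time, one option is to look up a current ID instead of hard-coding it. The sketch below reads the public SSM parameter that AWS publishes for Amazon Linux 2023; the function name latest_al2023_ami is hypothetical, and the choice of Amazon Linux 2023 on x86_64 is an assumption, so confirm that the user data script behaves as expected on that image.

```shell
# Print the current Amazon Linux 2023 x86_64 AMI ID for a region.
# Reads the public SSM parameter that AWS publishes for Amazon Linux AMIs.
latest_al2023_ami() {
  local region="${1:-us-east-1}"
  aws ssm get-parameter \
    --region "$region" \
    --name /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
    --query 'Parameter.Value' \
    --output text
}

# Usage (the caller's credentials need ssm:GetParameter):
#   AMI_ID=$(latest_al2023_ami us-east-1)
#   then pass --image-id "$AMI_ID" to the run-instances command above
```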

ECS/Fargate

1. Create Task Execution Role

# Trust policy for ECS
cat > ecs-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create execution role
aws iam create-role \
  --role-name failzero-agent-execution \
  --assume-role-policy-document file://ecs-trust-policy.json

# Attach managed policy
aws iam attach-role-policy \
  --role-name failzero-agent-execution \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

# Create task role (for DR operations)
aws iam create-role \
  --role-name failzero-agent-task \
  --assume-role-policy-document file://ecs-trust-policy.json

# Attach DR policy (failzero-policy.json from EC2 step 2)
aws iam put-role-policy \
  --role-name failzero-agent-task \
  --policy-name failzero-dr-operations \
  --policy-document file://failzero-policy.json

2. Create Task Definition

{
  "family": "failzero-agent",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/failzero-agent-execution",
  "taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/failzero-agent-task",
  "containerDefinitions": [
    {
      "name": "failzero-agent",
      "image": "failzero/agent:latest",
      "essential": true,
      "environment": [
        {"name": "FAILZERO_API_URL", "value": "https://api.failzero.io"},
        {"name": "PROVIDER_TYPE", "value": "aws"},
        {"name": "AWS_ACCOUNT_ID", "value": "123456789012"},
        {"name": "AWS_REGION", "value": "us-east-1"}
      ],
      "secrets": [
        {
          "name": "FAILZERO_AGENT_TOKEN",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:failzero-agent-token"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/failzero-agent",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
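
Save the JSON above as task-definition.json, replacing ACCOUNT_ID and the example account ID 123456789012 with your AWS account ID, then register the task definition:
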
aws ecs register-task-definition \
  --cli-input-json file://task-definition.json
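
If you prefer to fill in the ACCOUNT_ID placeholder with a script, a minimal sketch follows. The function name render_task_definition and the output file name are hypothetical. The pattern matches ":ACCOUNT_ID:" so that only the ARNs change and the AWS_ACCOUNT_ID variable name stays intact.

```shell
# Replace the ACCOUNT_ID placeholder inside ARNs only.
# Matching ":ACCOUNT_ID:" leaves the AWS_ACCOUNT_ID variable name untouched.
render_task_definition() {
  local account_id="$1" template="$2" output="$3"
  sed "s/:ACCOUNT_ID:/:${account_id}:/g" "$template" > "$output"
}

# Usage:
#   render_task_definition 111122223333 task-definition.json task-definition.rendered.json
#   aws ecs register-task-definition --cli-input-json file://task-definition.rendered.json
```

The example value of the AWS_ACCOUNT_ID environment variable (123456789012) is not touched by this substitution and still has to be updated separately.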

3. Create ECS Service

# Create cluster (if needed)
aws ecs create-cluster --cluster-name failzero

# Create log group
aws logs create-log-group --log-group-name /ecs/failzero-agent

# Create service
# (failzero-agent:1 pins revision 1; omit the ":1" suffix to use the latest active revision)
aws ecs create-service \
  --cluster failzero \
  --service-name failzero-agent \
  --task-definition failzero-agent:1 \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-xxx],assignPublicIp=ENABLED}"
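
Replace subnet-xxx and sg-xxx with your VPC subnet and security group IDs. The security group must allow outbound HTTPS (port 443). assignPublicIp=ENABLED assumes a public subnet; in a private subnet, set it to DISABLED and route outbound traffic through a NAT gateway.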
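
To avoid hand-editing the shorthand string, the value can be assembled from variables. This is only a sketch: the function name network_config and the example IDs are hypothetical.

```shell
# Build the --network-configuration value for aws ecs create-service.
# $1: comma-separated subnet IDs, $2: security group ID,
# $3: ENABLED for public subnets, DISABLED for private subnets behind NAT.
network_config() {
  local subnets="$1" security_group="$2" public_ip="${3:-ENABLED}"
  printf 'awsvpcConfiguration={subnets=[%s],securityGroups=[%s],assignPublicIp=%s}\n' \
    "$subnets" "$security_group" "$public_ip"
}

# Usage:
#   aws ecs create-service ... \
#     --network-configuration "$(network_config subnet-aaa,subnet-bbb sg-ccc DISABLED)"
```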

IAM Permissions

Minimum Required

Service | Actions | Purpose
RDS | rds:PromoteReadReplica, rds:DescribeDBInstances | Promote replicas
Route53 | route53:ChangeResourceRecordSets | Update DNS records

Optional (Based on DR Plan)

Service | Actions | Purpose
Auto Scaling | autoscaling:UpdateAutoScalingGroup | Scale compute
ECS | ecs:UpdateService | Scale containers
Secrets Manager | secretsmanager:GetSecretValue | Read secrets
S3 | s3:GetObject, s3:PutObject | Backup operations
SNS | sns:Publish | Notifications

Verify Deployment

EC2

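# Connect to instance
# (Session Manager needs the SSM Agent on the instance and SSM permissions on the
# instance role, for example the AmazonSSMManagedInstanceCore managed policy)
aws ssm start-session --target i-xxxxxxxxxxxxx

# Check Docker logs (use sudo if your session user is not in the docker group)
sudo docker logs failzero-agent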

ECS

# List tasks
aws ecs list-tasks --cluster failzero --service-name failzero-agent

# View logs
aws logs tail /ecs/failzero-agent --follow

Expected output:

[Agent] Starting FailZero Agent...
[Agent] Registering with FailZero API...
[Agent] Registered successfully for organization: your-org
[Agent] Agent started successfully
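
For a scripted check, the logs can be searched for the registration line. This sketch assumes the log format matches the expected output above; the function name agent_registered is hypothetical.

```shell
# Read agent log text on stdin; succeed only if the registration line is present.
agent_registered() {
  grep -q 'Registered successfully for organization'
}

# Usage on EC2:
#   sudo docker logs failzero-agent 2>&1 | agent_registered && echo "agent registered"
# Usage on ECS (without --follow, so the command terminates):
#   aws logs tail /ecs/failzero-agent | agent_registered && echo "agent registered"
```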

Troubleshooting

Permission denied errors:
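  • Verify that the IAM role and policy are attached correctly
  • On ECS, check that the task role (not the execution role) has the DR permissions
  • Ensure the role that retrieves the token has secretsmanager:GetSecretValue on the secret: the instance role on EC2, the task execution role on ECS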
Cannot reach API:
  • Verify that the security group allows outbound HTTPS (port 443)
  • Check the NAT gateway if running in a private subnet
  • Ensure that VPC endpoints or an internet gateway are configured
Task keeps restarting:
  • Check CloudWatch logs for error messages
  • Verify Secrets Manager secret exists and is accessible
  • Confirm AWS_ACCOUNT_ID and AWS_REGION are correct
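
On ECS, a common cause of both permission errors and restarting tasks is the token secret: ECS resolves the "secrets" entries of a task definition with the task execution role, and the AmazonECSTaskExecutionRolePolicy managed policy does not grant access to Secrets Manager. A minimal sketch of adding that access follows; the file name, policy name, and function name are hypothetical, and the resource pattern simply reuses the one from the DR policy.

```shell
# Hypothetical file name; the Resource pattern mirrors the DR policy.
cat > failzero-execution-secrets.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:*:*:secret:failzero-*"
    }
  ]
}
EOF

# Attach the inline policy to the execution role created in the ECS section.
grant_execution_role_secret_access() {
  aws iam put-role-policy \
    --role-name failzero-agent-execution \
    --policy-name failzero-read-agent-token \
    --policy-document file://failzero-execution-secrets.json
}

# Usage:
#   grant_execution_role_secret_access
```

After the policy is attached, newly started tasks should be able to read the token.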

Next Steps