Self-host on AWS

This guide shows you how to deploy the Braintrust data plane in your AWS account using the Braintrust Terraform module. This is the recommended way to self-host Braintrust on AWS.

Set up the data plane

Deploy the Braintrust data plane infrastructure in your AWS account.

Braintrust recommends deploying in a dedicated AWS account. AWS enforces account-level Lambda concurrency limits, and since Braintrust’s API runs on Lambda, sharing an account with other workloads can lead to throttling and service disruptions. A dedicated account also aligns with AWS best practices for workload isolation and security.

Configure the Terraform module

The Braintrust Terraform module contains all the necessary resources for a self-hosted Braintrust data plane.

Copy the entire contents of the examples/braintrust-data-plane directory from the terraform-aws-braintrust-data-plane repository into your own repository.
In provider.tf, configure your AWS account and region. Supported regions: us-east-1, us-east-2, us-west-2, eu-west-1, ca-central-1, and ap-southeast-2. If you require support for a different region, contact Braintrust.
In terraform.tf, set up your remote backend (typically S3 and DynamoDB).
In main.tf, customize the Braintrust deployment settings. The defaults are suitable for a large production-sized deployment. Adjust them based on your needs, but keep in mind the hardware requirements.
Each deployment must have a unique deployment_name within the same AWS account (max 18 characters). The default is "braintrust"; change it if you have multiple deployments. Resource names (IAM roles, RDS instances, S3 buckets) are prefixed with this value and will collide if duplicated.
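
As a minimal sketch of the provider.tf and terraform.tf steps above, the following shows a region setting and an S3 backend with DynamoDB state locking. The bucket, key, and table names are placeholders and do not come from the Braintrust example; substitute resources you have already created.

```hcl
# provider.tf
# The region must be one of the supported regions listed above.
provider "aws" {
  region = "us-east-1"
}

# terraform.tf
# Remote state in S3 with DynamoDB state locking.
# The bucket, key, and table names below are hypothetical placeholders.
terraform {
  backend "s3" {
    bucket         = "example-org-terraform-state"
    key            = "braintrust/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks"
    encrypt        = true
  }
}
```

The state bucket and lock table must exist before you run terraform init, and the DynamoDB table needs a string partition key named LockID. If terraform.tf already contains a terraform block, add the backend block inside it.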

Brainstore instances require instance types with local NVMe storage for caching (e.g., c8gd, c5d, m5d, i3, i4i families). Generic instance types without local storage (t3, m5, c5) are not supported and will fail at plan time.

Initialize AWS account

Braintrust recommends a dedicated AWS account for your Braintrust deployment. If you’re using a new AWS account, run the create-service-linked-roles.sh script to create all necessary IAM service-linked roles for the deployment:

./scripts/create-service-linked-roles.sh

Configure Brainstore license

Your deployment includes Brainstore, a high-performance query engine for real-time trace ingestion. Brainstore requires a license key.

In the Braintrust UI, go to Settings > Data plane.
Copy your Brainstore license.
If you don’t see your data plane configuration, contact Braintrust to enable self-hosting.
Pass the key to Terraform. The recommended approach is to store the license key in AWS Secrets Manager and reference it using a Terraform data source:
```
data "aws_secretsmanager_secret_version" "brainstore_license" {
  secret_id = "braintrust/brainstore-license-key"
}
```
Then pass data.aws_secretsmanager_secret_version.brainstore_license.secret_string as the brainstore_license_key value in the module. Alternatively, you can pass the key without storing it in Secrets Manager:
- Set TF_VAR_brainstore_license_key=your-key in your environment.
- Pass it via command line: terraform apply -var 'brainstore_license_key=your-key'.
- Add it to an uncommitted terraform.tfvars or .auto.tfvars file.
Do not commit the license key to your git repository.
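
A minimal sketch of the recommended approach, wiring the Secrets Manager data source shown above into the module. It assumes the secret is stored as a plain string rather than a JSON key/value pair; the deployment_name value is the default and is shown only for context.

```hcl
module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  # Must be unique within the AWS account (max 18 characters).
  deployment_name = "braintrust"

  # References the aws_secretsmanager_secret_version data source declared above.
  # Assumes the secret value is the raw license key, not a JSON document.
  brainstore_license_key = data.aws_secretsmanager_secret_version.brainstore_license.secret_string

  # ... other configuration ...
}
```

Values read through a data source are still written to Terraform state, so restrict access to the state backend. If the secret is stored as JSON, the value would need to be extracted with jsondecode before being passed to the module.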

Deploy the module

Initialize and apply the Terraform configuration:

terraform init
terraform apply

The first terraform apply may fail with transient errors such as ASG health check timeouts (while instances are still booting) or Lambda rate limits. Re-running terraform apply resolves these.

This will create all necessary AWS resources including:

Two isolated VPCs:
- Main VPC: Hosts Braintrust services (API, database, Redis, Brainstore)
- Quarantine VPC: Runs user-defined functions (scorers, tools) in network isolation. This creates ~30 Lambda functions across multiple runtimes. This is required for most production use cases.
Lambda functions for the Braintrust API
Public CloudFront endpoint and API Gateway
EC2 Auto Scaling group for Brainstore
PostgreSQL database, Redis cache, and S3 buckets
KMS key for encryption

Get your API URL

After the deployment completes, get your API URL from the Terraform outputs:

terraform output

You should see output similar to:

api_url = "https://dx6atff6gocr6.cloudfront.net"

Save this URL. You’ll need it to configure your Braintrust organization.

Configure your organization

Connect your Braintrust organization to your newly deployed data plane.

Changing your live organization’s API URL can disrupt access for existing users. If you are testing, create a new Braintrust organization for your data plane instead of updating your live environment.

Point your organization to your data plane

In the Braintrust UI, go to Settings > Data plane.
In the API URL area, select Edit.
Enter the API URL from the Terraform output in the previous section.
Leave the other fields blank.
If your deployment is accessed through a VPN or is otherwise on a private network (not accessible from the public internet), enable Data plane is on a private network. This enables Chrome’s Local Network Access permission handling, which is required for browser access to private network resources. When enabled, Chrome will prompt users to grant permission for the Braintrust UI to access your self-hosted data plane. See Grant browser permissions for details.
Select Save.

Verify the connection

The UI will automatically test the connection to your new data plane. Verify that the ping to each endpoint is successful.

Update the deployment

Run terraform apply to update your deployment. This will apply any infrastructure changes and update the Lambda functions while preserving your data.

terraform apply

Carefully review the output of terraform plan before applying any changes to your deployment. If you see something unexpected, like deletion of a database or S3 bucket, contact Braintrust for help.

To pin the Terraform module to a specific version, append ?ref=<version> to the module source:

module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane?ref=vX.Y.Z"

  # ... other configuration ...
}

Terraform module releases are listed on the repository’s GitHub Releases page.

Debug issues

If you encounter issues, you can use the dump-logs.sh script to collect logs:

./scripts/dump-logs.sh <deployment_name> [--minutes N] [--service <svc1,svc2,...|all>]

For example, to dump 60 minutes of logs for the bt-sandbox deployment, run:

./scripts/dump-logs.sh bt-sandbox

This will save logs for all services to a logs-<deployment_name> directory, which you can share with the Braintrust team for debugging.

Customize the deployment

Use an existing VPC

To deploy into an existing VPC instead of creating a new one, set create_vpc = false and provide your VPC and subnet IDs:

module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  create_vpc = false

  existing_vpc_id              = "vpc-xxxxxxxxx"
  existing_private_subnet_1_id = "subnet-xxxxxxxxx"
  existing_private_subnet_2_id = "subnet-xxxxxxxxx"
  existing_private_subnet_3_id = "subnet-xxxxxxxxx"
  existing_public_subnet_1_id  = "subnet-xxxxxxxxx"

  # ... other configuration ...
}

Your existing VPC must have:

At least 3 private subnets across different availability zones
At least 1 public subnet
Internet and NAT gateways with properly configured route tables

The module manages its own security groups. To also use an existing quarantine VPC, set existing_quarantine_vpc_id and the corresponding existing_quarantine_private_subnet_*_id variables.
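
Instead of hardcoding IDs, the VPC and subnets can be looked up with AWS provider data sources. The sketch below assumes your VPC carries a Name tag and your subnets carry a Tier tag; both tag keys and values are hypothetical, so adjust them to your own tagging scheme.

```hcl
# Look up the existing VPC by its Name tag (tag value is a placeholder).
data "aws_vpc" "existing" {
  tags = {
    Name = "example-shared-vpc"
  }
}

# Private subnets in that VPC, selected by a hypothetical Tier tag.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.existing.id]
  }
  tags = {
    Tier = "private"
  }
}

# Public subnets in that VPC, selected the same way.
data "aws_subnets" "public" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.existing.id]
  }
  tags = {
    Tier = "public"
  }
}

locals {
  # Sort so the subnet-to-variable assignment is stable between runs.
  private_subnet_ids = sort(data.aws_subnets.private.ids)
  public_subnet_ids  = sort(data.aws_subnets.public.ids)
}

module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  create_vpc = false

  existing_vpc_id              = data.aws_vpc.existing.id
  existing_private_subnet_1_id = local.private_subnet_ids[0]
  existing_private_subnet_2_id = local.private_subnet_ids[1]
  existing_private_subnet_3_id = local.private_subnet_ids[2]
  existing_public_subnet_1_id  = local.public_subnet_ids[0]

  # ... other configuration ...
}
```

This lookup does not verify that the three private subnets sit in different availability zones, so confirm that separately. The plan fails with an index error if fewer than three private subnets or no public subnet match the tags.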

Use custom tags

To apply custom tags to all resources, pass the custom_tags parameter to the Braintrust module:

module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  custom_tags = {
    Environment = "production"
    Team        = "ml-platform"
    CostCenter  = "engineering"
  }

  # ... other configuration ...
}

These tags are applied to all resources, including Brainstore EC2 instances, volumes, and ENIs. The deployment_name variable automatically prefixes resource names and applies a BraintrustDeploymentName tag across all resources.

Use the custom_tags parameter instead of the AWS provider’s default_tags configuration. Due to a Terraform limitation, default_tags are not applied to resources that use launch templates, such as Brainstore instances.

Redis instance sizing

Important for AWS: Avoid using burstable Redis instances (t-family instances like cache.t4g.micro) in production. These instances use CPU credits that can be exhausted during high-load periods, leading to performance throttling. Instead, use non-burstable instances like cache.r7g.large, cache.r6g.large, or cache.r5.large for predictable performance. Even if these instances seem oversized initially, they provide consistent performance without the risk of CPU credit exhaustion.
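
As a sketch, the node type would be set on the module alongside the other deployment settings. The input name redis_instance_type is an assumption and is not confirmed by this guide; check the module's variables.tf for the actual name before using it.

```hcl
module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  # ASSUMPTION: redis_instance_type is a hypothetical input name.
  # Verify it against the module's variables.tf.
  # Non-burstable node type, as recommended above.
  redis_instance_type = "cache.r7g.large"

  # ... other configuration ...
}
```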

VPC connectivity

To connect Braintrust’s VPC to other internal resources (like an LLM gateway), use one of the following approaches:

Create a VPC Endpoint Service for your internal resource, then create a VPC Interface Endpoint inside the Braintrust “Quarantine” VPC
Set up VPC peering with the Braintrust “Quarantine” VPC
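
The first approach can be sketched in Terraform as follows. All three variables are hypothetical inputs that you would supply yourself, and the sketch assumes the internal resource is served over HTTPS on port 443 through an endpoint service you have already created.

```hcl
# Hypothetical inputs: supply the Quarantine VPC and subnet IDs from your
# deployment and the service name of your own VPC Endpoint Service.
variable "quarantine_vpc_id" { type = string }
variable "quarantine_subnet_ids" { type = list(string) }
variable "gateway_endpoint_service_name" { type = string }

data "aws_vpc" "quarantine" {
  id = var.quarantine_vpc_id
}

# Allow HTTPS to the endpoint from inside the Quarantine VPC only.
resource "aws_security_group" "llm_gateway_endpoint" {
  name_prefix = "llm-gateway-endpoint-"
  vpc_id      = var.quarantine_vpc_id

  ingress {
    description = "HTTPS from inside the Quarantine VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.quarantine.cidr_block]
  }
}

# Interface endpoint that exposes the internal resource inside the Quarantine VPC.
resource "aws_vpc_endpoint" "llm_gateway" {
  vpc_id             = var.quarantine_vpc_id
  service_name       = var.gateway_endpoint_service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = var.quarantine_subnet_ids
  security_group_ids = [aws_security_group.llm_gateway_endpoint.id]
}
```

If the endpoint service requires acceptance, the connection stays pending until the service owner approves it.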
