Release notes

Week of 2023-11-20

  • Added support for new Perplexity models (e.g. pplx-7b-online) to the playground
  • Released the AI proxy: access many LLMs through a single API, with built-in caching (see the sketch after this list)
  • Added load balancing endpoints to the AI proxy
  • Updated org-level view to show projects and prompt playground sessions
  • Added ability to batch delete experiments
  • Added support for Claude 2.1 in playground
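
For reference, here is a minimal sketch of calling the proxy through the OpenAI Python client. The base URL and model name are illustrative assumptions rather than confirmed values; see the proxy docs for the exact endpoint.

from openai import OpenAI

# Point the standard OpenAI client at the proxy (hypothetical URL).
# Repeated identical requests can be served from the proxy's cache.
client = OpenAI(
    base_url="https://your-proxy-host/v1",  # assumption: substitute the real proxy URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the proxy routes this to the matching provider
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)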

Week of 2023-11-13

  • Made resized experiment column widths persistent
  • Updated our libraries, including Autoevals, to work with OpenAI's new SDK releases
  • Added support for function calling and tools in the prompt playground
  • Added tabs on a project page for datasets, experiments, etc.

Week of 2023-11-06

  • Improved selectors for diffing and comparison modes on experiment view
  • Added support for new OpenAI models (GPT-4 Turbo preview, gpt-3.5-turbo-1106) in the playground
  • Added support for open-source models (Mistral, Code Llama, Llama 2, etc.) in the playground using Perplexity's APIs

Week of 2023-10-30

  • Improved experiment sidebar to be fully responsive and resizable
  • Improved tooltips within the web UI
  • Multiple performance optimizations and bug fixes

Week of 2023-10-23

  • Multiple performance optimizations and bug fixes

Week of 2023-10-16

  • Launched new tracing feature: log and visualize complex LLM chains and executions (see the sketch after this list)
  • Added a new “text-block” prompt type in the playground that returns a string or variable without making an LLM call (useful for chaining prompts and debugging)
  • Increased the default number of rows per page from 10 to 100 for experiments
  • UI fixes and improvements for the side panel and tooltips
  • The experiment dashboard can be customized to show the most relevant charts
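
Below is a minimal tracing sketch using the Python SDK's @traced decorator. The decorator and init_logger call reflect the SDK's current surface and may differ from the API as it existed at release time, so treat the names here as illustrative assumptions.

from braintrust import init_logger, traced

logger = init_logger(project="my-project")  # hypothetical project name

@traced
def retrieve(query: str) -> str:
    # Stand-in for a retrieval step; shows up as a child span in the trace
    return "some relevant context"

@traced
def answer(query: str) -> str:
    # Stand-in for an LLM call; nested @traced calls form the chain
    context = retrieve(query)
    return f"Answer based on: {context}"

answer("What is tracing?")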

Week of 2023-10-09

  • Performance improvements related to user sessions

Week of 2023-10-02

  • All experiment loading HTTP requests are 100-200ms faster
  • The prompt playground now supports autocomplete
  • Dataset versions are now displayed on the datasets page


Week of 2023-09-18

  • The Eval framework is now supported in Python! See the updated evals guide for more information:
from braintrust import Eval
from autoevals import LevenshteinScorer

Eval(
    "Say Hi Bot",
    data=lambda: [
        {
            "input": "Foo",
            "expected": "Hi Foo",
        },
        {
            "input": "Bar",
            "expected": "Hello Bar",
        },
    ],  # Replace with your eval dataset
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[LevenshteinScorer],  # Compares output against expected
)
  • Added an onboarding and signup flow for new users
  • Switched the product font to Inter

Week of 2023-09-11

  • Big performance improvements for registering experiments (down from ~5s to <1s). Update the SDK to take advantage of these improvements.

  • New graph shows aggregate accuracy between experiments for each score.

  • The prompt playground now throws an error if you reference an invalid variable.

  • A major backend database change that significantly improves performance while reducing costs. Please contact us if you have not already heard from us about upgrading your deployment.

  • Removed record size constraints (previously, strings could be at most 64KB long).

  • New autoevals for numeric diff and JSON diff

Week of 2023-09-05

  • You can duplicate prompt sessions, prompts, and dataset rows in the prompt playground.
  • You can download prompt sessions as JSON files (including the prompt templates, prompts, and completions).
  • You can adjust model parameters (e.g. temperature) in the prompt playground.
  • You can publicly share experiments (e.g. Alpaca Evals).
  • Datasets now support editing, deleting, adding, and copying rows in the UI.
  • There is no longer a 64KB limit on strings.

Week of 2023-08-28

  • The prompt playground is now live! We're excited to get your feedback as we continue to build this feature out. See the docs for more information.


Week of 2023-08-21

  • A new chart shows experiment progress per score over time.


  • The eval CLI now supports --watch, which will automatically re-run your evaluation when you make changes to your code (see the example after this list).
  • You can now edit datasets in the UI.
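
For example, a watch-mode run might look like the following; the exact CLI entry point is an assumption here, so check the evals docs for your SDK version:

npx braintrust eval --watch my-evals.eval.ts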


Week of 2023-08-14

  • Introducing datasets! You can now upload datasets to Braintrust and use them in your experiments. Datasets are versioned, and you can use them in multiple experiments. You can also use datasets to compare your model's performance against a baseline. Learn more about how to create and use datasets in the docs.
  • Fixed several performance issues in the SDK and UI.

Week of 2023-08-07

  • Complex data is now substantially more performant in the UI. Prior to this change, we ran schema inference over the entire input, output, expected, and metadata fields, which could result in complex structures that were slow and difficult to work with. Now, we simply treat these fields as JSON types.
  • The UI updates in real-time as new records are logged to experiments.
  • Ergonomic improvements to the SDK and CLI:
    • The JS library is now isomorphic and supports both Node.js and the browser.
    • The Evals CLI warns you when no files match the .eval.[ts|js] pattern.

Week of 2023-07-31

  • You can now break down scores by metadata fields.

  • Improved performance for experiment loading (especially complex experiments). Prior to this change, experiments could occasionally take 30s+ to load, or even fail. To enable this, you'll need to update your CloudFormation.

  • Support for renaming and deleting experiments.

  • When you expand a cell in detail view, the row is now highlighted.

Week of 2023-07-24

  • A new framework for expressing evaluations in a much simpler way:
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Evaluation", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
      meta: { type: "question" },
    },
  ],
  task: (input) => callModel(input), // Replace with your LLM call
  scores: [Factuality],
});

Besides being much easier than the logging SDK, this framework sets the foundation for evaluations that can be run automatically as your code changes, built and run in the cloud, and more. We are very excited about the use cases it will open up!

  • inputs is now input in the SDK (>= 0.0.23) and UI. You do not need to make any code changes, although you should gradually start using the input field instead of inputs in your SDK calls, as inputs is now deprecated and will eventually be removed (see the example after this list).
  • Improved diffing behavior for nested arrays.
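
As a concrete, illustrative example (assuming an experiment handle returned by braintrust.init):

experiment.log(inputs={"question": "..."}, output="...", expected="...")  # deprecated
experiment.log(input={"question": "..."}, output="...", expected="...")   # preferred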

Week of 2023-07-17

  • A couple of SDK updates (>= v0.0.21) allow you to update an existing experiment with init(..., update=True) and specify an id when logging with log(..., id='my-custom-id'). These tools are useful for running an experiment across multiple processes, tasks, or machines, and for idempotently logging the same record (identified by its id); see the sketch after this list.
  • Tables with lots and lots of columns are now visually more compact in the UI.
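
A minimal sketch of the idempotent pattern, with hypothetical project, experiment, and record values:

import braintrust

# update=True attaches to an existing experiment instead of creating a new one,
# so multiple processes or machines can log into the same experiment.
experiment = braintrust.init(
    project="my-project", experiment="my-experiment", update=True
)

# Re-logging with the same id overwrites the record instead of duplicating it.
experiment.log(
    input="Which country has the highest population?",
    output="China",
    scores={"accuracy": 1},
    id="my-custom-id",
)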



Week of 2023-07-10


If you have Braintrust installed in your own cloud environment, make sure to update the CloudFormation template to include some functionality the Node.js SDK relies on.

You can do this in the AWS console, or by running the following command (with the braintrust command included in the Python SDK).

braintrust install api <YOUR_CLOUDFORMATION_STACK_NAME> --update-template

  • You can now swap the primary and comparison experiment with a single click.


  • You can now compare output vs. expected within an experiment.


  • Version 0.0.19 of the SDK is out. It is an important update that throws an error if your payload is larger than 64KB.

Week of 2023-07-03

  • Support for real-time updates, using Redis. Prior to this, Braintrust would wait for your data warehouse to sync up with Kafka before you could view an experiment, often leading to a minute or two of time before a page loads. Now, we cache experiment records as your experiment is running, making experiments load instantly. To enable this, you'll need to update your CloudFormation.

  • New settings page that consolidates team, installation, and API key settings. You can now invite team members to your Braintrust account from the "Team" page.


  • The experiment page now shows commit information for experiments run inside of a git repository.


Week of 2023-06-26

  • Experiments track their git metadata and automatically find a "base" experiment to compare against, using your repository's base branch.
  • The Python SDK's summarize() method now returns an ExperimentSummary object with score differences against the base experiment (v0.0.10); see the sketch after this list.
  • Organizations can now be "multi-tenant", i.e. you do not need to install in your cloud account. If you start with a multi-tenant account to try out Braintrust, and decide to move it into your own account, Braintrust can migrate it for you.
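
A short sketch of summarize() in use (project and experiment names are illustrative):

import braintrust

experiment = braintrust.init(project="my-project", experiment="my-experiment")
# ... log records to the experiment ...
summary = experiment.summarize()  # returns an ExperimentSummary
print(summary)  # includes per-score differences vs. the auto-detected base experiment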

Week of 2023-06-19

Week of 2023-06-12

  • New braintrust install CLI for installing the CloudFormation
  • Improved performance for event logging in the SDK
  • Auto-merge experiment fields with different types (e.g. number and string)

Week of 2023-06-05

  • Tutorial guide + notebook

  • Automatically refresh Cognito tokens in the Python client

  • New filter and sort operators on the experiments table:

    • Filter experiments by changes to scores (e.g. only examples with a lower score than another experiment)
    • Custom SQL filters
    • Filter and sort bubbles to visualize/clear current operations
  • [Alpha] SQL query explorer to run arbitrary queries against one or more experiments
