Overview

Understanding how TokenBill AI Gateway manages model routing and usage tracking.

How it all works together

Merchants set up providers, endpoints, and routing in the console. API users call the gateway with an access key. When the VE budget is exhausted, the gateway returns an error that includes a checkout link for top-up.

Merchant: Console setup in TokenBill
1. Configure Providers
Add Providers and their Physical Models on the Providers page. Note their retail pricing.
2. Create a Virtual Endpoint
Create a Virtual Endpoint (VE), set its credit budget, and generate access keys. We recommend creating a VE with commonly used models as a template first, then quickly copying its config to new VEs via the ve_template_id parameter.
3. Map Logical Models
Inside the VE, define Logical Models (e.g. 'gpt-4o'). Map each to one or more Physical Models from your providers and assign weights for traffic split and failover.
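The weight-based traffic split described in this step can be sketched as a weighted random choice. This is an illustrative sketch with hypothetical names; the gateway's actual selection logic is internal to TokenBill:

```python
import random

def pick_physical_model(mappings):
    """Pick one Physical Model from a Logical Model's mappings,
    proportionally to its weight (e.g. a 70/30 split)."""
    total = sum(weight for _, weight in mappings)
    r = random.uniform(0, total)
    for pm, weight in mappings:
        r -= weight
        if r <= 0:
            return pm
    return mappings[-1][0]  # floating-point edge case

# Example mapping for the logical model 'gpt-4o'
gpt4o = [("OpenAI/gpt-4o", 70), ("Azure/gpt-4o", 30)]
```

Over many requests, roughly 70% resolve to `OpenAI/gpt-4o` and 30% to `Azure/gpt-4o`.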
API user: Calls your Virtual Endpoint with an access key
4. Call the Gateway
Point your client at the VE Base URL and send the Logical Model name in the request body. The Smart Proxy routes to the configured Physical Model.
# VE_ID = Virtual Endpoint id in the URL path; LOGICAL_MODEL_NAME = Logical Model name in the JSON "model" field
curl -X POST 'https://<VE_ID>.ve.test.tokenbill.io/v1/chat/completions' \
  -H 'Authorization: Bearer <ACCESS_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{"model":"<LOGICAL_MODEL_NAME>","messages":[{"role":"user","content":"Hello"}]}'
Budget exhausted: Gateway response includes a checkout URL
5. Checkout when quota is exceeded
When Credit spend reaches the VE budget, the gateway returns an error (by default HTTP 429 with type insufficient_quota). The error message includes a checkout URL pointing to the TokenBill frontend (e.g. /checkout?ve_id=…). The end user opens that link to complete the top-up; the checkout session is created automatically when they land on the page. Merchants can still use the Checkout API for custom integrations.
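A client can detect the quota error and surface the checkout link to the end user. A minimal sketch, assuming the error body follows the OpenAI-style shape with the URL embedded in the message (the exact field layout is an assumption):

```python
import json
import re

def extract_checkout_url(response_body: str):
    """If the gateway returned an insufficient_quota error,
    pull the checkout URL out of the error message."""
    body = json.loads(response_body)
    err = body.get("error", {})
    if err.get("type") != "insufficient_quota":
        return None
    m = re.search(r"https://\S+/checkout\?\S+", err.get("message", ""))
    return m.group(0) if m else None

# Illustrative error body only; check your gateway's actual response
sample = ('{"error": {"type": "insufficient_quota", "message": '
          '"Budget exhausted. Top up: '
          'https://tokenbill.io/checkout?ve_id=ve_abc123"}}')
```

The client can then open the extracted URL (or show it to the user) instead of retrying the request.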
Topology

Three-Tier Architecture

TokenBill uses a unified 3-tier architecture to provide high availability, load balancing, and clear billing separation.

Topology diagram, summarized:

Virtual Endpoint (VE) — Auth & Keys, Budget
  Logical Models (LM) — the model names your client uses:
    gpt-4o → OpenAI / gpt-4o (weight 70) + Azure / gpt-4o (weight 30), real-time distribution 70/30
      ↓ Cascading fallback (shuffle array): Tier 1 Anthropic / claude-3-5, Tier 2 Azure / gpt-4o-standby
    dall-e-3 → OpenAI / dall-e-3
  routes to Physical Models (PM) in the Upstream Provider Pool:
    OpenAI (gpt-4o, dall-e-3), Azure (gpt-4o), Anthropic (claude-3.5-sonnet)
Physical Model (PM)
Provided by Providers
The lowest level. Represents a real AI model provided by an upstream service (like OpenAI, AWS, Azure, Anthropic). It has physical limits (Unit Limit & Credit Limit) and a retail price per unit of consumption (Credit per Unit).
Virtual Endpoint (VE)
TokenBill API Gateway
The middle tier. Represents your unified API gateway endpoint and cost center. You create VEs to manage strict financial budgets (Credit Budget) for different applications or teams. All physical usage is losslessly converted into unified Credits and deducted centrally at the VE level.
Logical Model (LM)
The Model Name You Use
The highest level. This is the model name you use in your API requests (e.g., 'gpt-4o'). Inside a VE, you map each Logical Model to one or more equivalent Physical Models for automatic fallback and high-availability load balancing.
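The three tiers can be pictured as a small data model. This is a sketch with hypothetical field names, not TokenBill's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalModel:
    """Lowest tier: a real model at an upstream provider."""
    provider: str            # e.g. "OpenAI", "Azure"
    name: str                # e.g. "gpt-4o"
    credit_per_unit: float   # retail price per unit of consumption

@dataclass
class LogicalModel:
    """Highest tier: the model name clients send in requests."""
    name: str                                    # e.g. "gpt-4o"
    targets: list = field(default_factory=list)  # [(PhysicalModel, weight, tier)]

@dataclass
class VirtualEndpoint:
    """Middle tier: the gateway endpoint and cost center."""
    ve_id: str
    credit_budget: float
    credit_spent: float = 0.0
    logical_models: dict = field(default_factory=dict)  # name -> LogicalModel
```

A request for 'gpt-4o' looks up the LogicalModel inside the VE, and billing deducts from the VE's single Credit budget regardless of which PhysicalModel served it.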
Traffic Engine

Request Flow

Every API call to a Virtual Endpoint goes through the TokenBill Smart Proxy, which resolves the Logical Model, selects the best Physical Model by priority and weight, and streams the response back to the client.

Client App → TokenBill Smart Proxy (resolves LM → PM) → Upstream Providers → response streamed back to Client App

Dynamic Smart Routing

The Smart Proxy performs real-time, in-memory traffic distribution. It uses a local L1 cache to compute weighted routes in microseconds, without polling a remote store.

Cascading Fallback Trees
Assign priority levels (tiers). If the primary model fails, the remaining siblings in its tier are shuffled and appended to the fallback pipeline automatically, followed by lower-priority tiers.
Real-time Load Balancing
Within the same priority level, traffic is distributed by weight on every single request. Unlike cache-pinned routing, this prevents load hotspots and keeps traffic shares accurate.
Full Streaming Support
The proxy supports SSE (Server-Sent Events) streaming with zero-buffering pass-through. Responses are streamed chunk-by-chunk from the upstream provider directly to your client.
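The cascading fallback described above can be sketched as building an ordered try-list per request: a weighted pick for the primary, its shuffled tier siblings behind it, then lower tiers in priority order. Names and structure here are illustrative assumptions, not TokenBill's internal implementation:

```python
import random

def build_pipeline(targets):
    """targets: list of (model, weight, tier) tuples.
    Returns the ordered list of models to try for one request."""
    tiers = {}
    for model, weight, tier in targets:
        tiers.setdefault(tier, []).append((model, weight))

    pipeline = []
    for tier in sorted(tiers):
        group = tiers[tier]
        if not pipeline:
            # Primary tier: weighted pick first, shuffle the rest behind it
            total = sum(w for _, w in group)
            r = random.uniform(0, total)
            for i, (_, w) in enumerate(group):
                r -= w
                if r <= 0:
                    primary = group.pop(i)
                    break
            else:
                primary = group.pop()
            random.shuffle(group)
            pipeline.append(primary[0])
            pipeline.extend(m for m, _ in group)
        else:
            # Lower tiers: appended in priority order as fallbacks
            pipeline.extend(m for m, _ in group)
    return pipeline
```

For the gpt-4o example in the topology above, this yields one of the two weighted primaries first, the other primary sibling second, then the Tier 1 and Tier 2 fallbacks.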
Magic Command

Magic Command: /tokenbill

No extra REST API is needed. Simply type /tokenbill in any chat message; the gateway intercepts it and returns your VE's balance, budget, and available models. The response never contains api_key or upstream credentials.

See the full checkout top-up flow →
Example conversation:

User: Write me a poem about spring
Assistant: Spring breeze brushes the willows long, peach blossoms reflect the sun, filling the courtyard...
User: /tokenbill
Gateway (TokenBill VE Info):
{ "ve_id": "ve_abc123", "name": "User Alice", "credit_remaining": 998800, "credit_budget": 1000000, "logical_models": ["gpt-4", "gpt-3.5"] }
User: Write another poem about summer
Gateway (Budget Exhausted): Your Credit balance is insufficient. Please top up: https://tokenbill.io/checkout?ve_id=ve_abc123

The checkout page then shows the VE (ve_abc123, exhausted: 0 Credits remaining, with 1.2M tokens used across 8,401 requests) and top-up options of $10, $50, or $100 (e.g. $50.00 buys 5,000,000 Credits).
Tip: Use the exact same client, base_url, and api_key as normal chat. Just type /tokenbill in the message content. The gateway intercepts locally, consuming no credits and making no upstream calls.
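A chat client could recognize the VE-info reply and render the balance for the user. A sketch assuming the assistant message content is exactly the JSON shown in the example conversation above:

```python
import json

def parse_ve_info(content: str):
    """Parse the /tokenbill reply and report the remaining budget fraction."""
    info = json.loads(content)
    remaining = info["credit_remaining"] / info["credit_budget"]
    return info["ve_id"], remaining

# Sample reply content, copied from the example above
reply = ('{ "ve_id": "ve_abc123", "name": "User Alice", '
         '"credit_remaining": 998800, "credit_budget": 1000000, '
         '"logical_models": ["gpt-4", "gpt-3.5"] }')
```

Here `parse_ve_info(reply)` returns the VE id and roughly 0.9988, i.e. 99.88% of the budget remaining.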
Financial Engine

Two-Layer Billing Architecture

TokenBill decouples physical model consumption from financial accounting using a standardized Credit and Unit system.

Physical Layer: Metering & Pricing
Unit & Credit per Unit
Unit is the universal metric for API consumption (Tokens, Images, Seconds). Credit per Unit standardizes the physical cost into the system's abstract currency.
Text Model
Unit: Tokens
e.g. 0.05 Credit / 1K Unit
Image Model
Unit: Images
e.g. 2.00 Credit / 1 Unit
Audio Model
Unit: Seconds
e.g. 0.15 Credit / 1 Unit
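The metering rule is a straight multiplication of consumed Units by the model's Credit per Unit. A sketch using the example prices above (note the text price is quoted per 1K units):

```python
def credits_for(units: float, credit_per_unit: float) -> float:
    """Convert raw consumption (Units) into Credits."""
    return units * credit_per_unit

# Example prices from above
TEXT_PER_TOKEN = 0.05 / 1000   # 0.05 Credit per 1K tokens
IMAGE = 2.00                   # 2.00 Credit per image
AUDIO_PER_SECOND = 0.15        # 0.15 Credit per second

# A mixed workload: 12,000 tokens + 2 images + 30 seconds of audio
total = (credits_for(12_000, TEXT_PER_TOKEN)
         + credits_for(2, IMAGE)
         + credits_for(30, AUDIO_PER_SECOND))
# 0.6 + 4.0 + 4.5 = 9.1 Credits
```

All three modalities land in the same Credit currency, which is what lets the VE deduct them from a single budget.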
Virtual Layer: Budget & Risk Control
Credit, Budget & Limits
Virtual Endpoints operate purely on Credits. Budgets are allocated and deducted uniformly across all modalities. Once spent Credits reach the Budget limit, traffic is blocked at the gateway.
Example endpoint wallet (status: Active): unified wallet currency CREDIT; spend / budget: 1,250 / 5,000.
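The budget shield amounts to a pre-flight check before each charge. A sketch of the idea (the real gateway logic is internal; the error here stands in for the HTTP 429 insufficient_quota response):

```python
class QuotaExceeded(Exception):
    """Stands in for the gateway's HTTP 429 insufficient_quota error."""

def charge(wallet: dict, cost: float) -> None:
    """Deduct Credits from a VE wallet, refusing once the Budget is hit."""
    if wallet["spent"] + cost > wallet["budget"]:
        raise QuotaExceeded(f"Budget exhausted for {wallet['ve_id']}")
    wallet["spent"] += cost

# The example wallet from above: 1,250 spent of a 5,000 budget
wallet = {"ve_id": "ve_abc123", "budget": 5000, "spent": 1250}
```

A charge that fits the remaining budget succeeds and updates the spend; one that would exceed it raises, which is the point where the gateway returns the checkout link instead of forwarding the request.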

Multi-Format Upstream Support

Each Provider is a custom upstream account — your own deployment, a third-party API, or a cloud service. TokenBill normalizes different API formats behind a single OpenAI-compatible interface.

OpenAI Compatible
Standard format for most providers: OpenAI, DeepSeek, Qwen, self-hosted vLLM / Ollama, etc.
Azure OpenAI
Supports Azure deployment names, API versions, and region routing
Anthropic
Native Anthropic Messages API format for Claude models
AWS Bedrock
IAM-based authentication for AWS-managed foundation models

API Reference
TokenBill supports all standard OpenAI-compatible endpoints with automated routing and usage tracking.

Checkout Integration
How to integrate TokenBill payment and checkout sessions in a web page or WeChat Mini Program.