Overview

Understanding how TokenBill AI Gateway manages model routing and usage tracking.

How it all works together

Merchants set up providers, endpoints, and routing in the console. API users call the gateway with an access key. When the VE budget is exhausted, the gateway returns an error that includes a checkout link for top-up.

Merchant: Console setup in TokenBill
1. Configure Providers
Add Providers and their Physical Models on the Providers page. Note their retail pricing.
2. Create a Virtual Endpoint
Create a Virtual Endpoint (VE), set its credit budget, and generate access keys. We recommend creating a VE with commonly used models as a template first, then quickly copying its config to new VEs via the ve_template_id parameter.
3. Map Logical Models
Inside the VE, define Logical Models (e.g. 'gpt-4o'). Map each to one or more Physical Models from your providers and assign weights for traffic split and failover.
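The weight-based traffic split described in this step can be sketched as a weighted random choice. This is an illustrative sketch with hypothetical names; the gateway's actual selection logic is internal to TokenBill:

```python
import random

def pick_physical_model(mappings):
    """Pick one Physical Model from a Logical Model's mappings,
    proportionally to its weight (e.g. a 70/30 split)."""
    total = sum(weight for _, weight in mappings)
    r = random.uniform(0, total)
    for pm, weight in mappings:
        r -= weight
        if r <= 0:
            return pm
    return mappings[-1][0]  # floating-point edge case

# Example mapping for the logical model 'gpt-4o'
gpt4o = [("OpenAI/gpt-4o", 70), ("Azure/gpt-4o", 30)]
```

Over many requests, roughly 70% resolve to `OpenAI/gpt-4o` and 30% to `Azure/gpt-4o`.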
API user: Calls your Virtual Endpoint with an access key
4. Call the Gateway
Point your client at the VE Base URL and send the Logical Model name in the request body. The Smart Proxy routes to the configured Physical Model.
# VE_ID = Virtual Endpoint id in the URL path; LOGICAL_MODEL_NAME = Logical Model name in the JSON "model" field
curl -X POST 'https://<VE_ID>.ve.test.tokenbill.io/v1/chat/completions' \
  -H 'Authorization: Bearer <ACCESS_KEY>' \
  -H 'Content-Type: application/json' \
  -d '{"model":"<LOGICAL_MODEL_NAME>","messages":[{"role":"user","content":"Hello"}]}'
Budget exhausted: Gateway response includes a checkout URL
5. Checkout when quota is exceeded
When Credit spend reaches the VE budget, the gateway returns an error (by default HTTP 429 with type insufficient_quota). The error message includes a checkout URL pointing to the TokenBill frontend (e.g. /checkout?ve_id=…). The end user opens that link to complete the top-up; the checkout session is created automatically when they land on the page. Merchants can still use the Checkout API for custom integrations.
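A client can detect the quota error and surface the checkout link to the end user. A minimal sketch, assuming the error body follows the OpenAI-style shape with the URL embedded in the message (the exact field layout is an assumption):

```python
import json
import re

def extract_checkout_url(response_body: str):
    """If the gateway returned an insufficient_quota error,
    pull the checkout URL out of the error message."""
    body = json.loads(response_body)
    err = body.get("error", {})
    if err.get("type") != "insufficient_quota":
        return None
    m = re.search(r"https://\S+/checkout\?\S+", err.get("message", ""))
    return m.group(0) if m else None

# Illustrative error body only; check your gateway's actual response
sample = ('{"error": {"type": "insufficient_quota", "message": '
          '"Budget exhausted. Top up: '
          'https://tokenbill.io/checkout?ve_id=ve_abc123"}}')
```

The client can then open the extracted URL (or show it to the user) instead of retrying the request.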
Topology

Three-Tier Architecture

TokenBill uses a unified 3-tier architecture to provide high availability, load balancing, and clear billing separation.

Topology diagram, summarized:

Virtual Endpoint (VE) — Auth & Keys, Budget
  Logical Models (LM) — the model names your client uses:
    gpt-4o → OpenAI / gpt-4o (weight 70) + Azure / gpt-4o (weight 30), real-time distribution 70/30
      ↓ Cascading fallback (shuffle array): Tier 1 Anthropic / claude-3-5, Tier 2 Azure / gpt-4o-standby
    dall-e-3 → OpenAI / dall-e-3
  routes to Physical Models (PM) in the Upstream Provider Pool:
    OpenAI (gpt-4o, dall-e-3), Azure (gpt-4o), Anthropic (claude-3.5-sonnet)
Physical Model (PM)
Provided by Providers
The lowest level. Represents a real AI model provided by an upstream service (like OpenAI, AWS, Azure, Anthropic). It has physical limits (Unit Limit & Credit Limit) and a retail price per unit of consumption (Credit per Unit).
Virtual Endpoint (VE)
TokenBill API Gateway
The middle tier. Represents your unified API gateway endpoint and cost center. You create VEs to manage strict financial budgets (Credit Budget) for different applications or teams. All physical usage is losslessly converted into unified Credits and deducted centrally at the VE level.
Logical Model (LM)
The Model Name You Use
The highest level. This is the model name you use in your API requests (e.g., 'gpt-4o'). Inside a VE, you map each Logical Model to one or more equivalent Physical Models for automatic fallback and high-availability load balancing.
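The three tiers can be pictured as a small data model. This is a sketch with hypothetical field names, not TokenBill's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalModel:
    """Lowest tier: a real model at an upstream provider."""
    provider: str            # e.g. "OpenAI", "Azure"
    name: str                # e.g. "gpt-4o"
    credit_per_unit: float   # retail price per unit of consumption

@dataclass
class LogicalModel:
    """Highest tier: the model name clients send in requests."""
    name: str                                    # e.g. "gpt-4o"
    targets: list = field(default_factory=list)  # [(PhysicalModel, weight, tier)]

@dataclass
class VirtualEndpoint:
    """Middle tier: the gateway endpoint and cost center."""
    ve_id: str
    credit_budget: float
    credit_spent: float = 0.0
    logical_models: dict = field(default_factory=dict)  # name -> LogicalModel
```

A request for 'gpt-4o' looks up the LogicalModel inside the VE, and billing deducts from the VE's single Credit budget regardless of which PhysicalModel served it.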
Traffic Engine

Request Flow

Every API call to a Virtual Endpoint goes through the TokenBill Smart Proxy, which resolves the Logical Model, selects the best Physical Model by priority and weight, and streams the response back to the client.

Client App → TokenBill Smart Proxy (resolves LM → PM) → Upstream Providers → response streamed back to Client App

Dynamic Smart Routing

The Smart Proxy performs real-time, in-memory traffic distribution. It uses a local L1 cache to compute weighted routes in microseconds, without polling a remote store.

Cascading Fallback Trees
Assign priority levels (tiers). If the primary model fails, the remaining siblings in its tier are shuffled and appended to the fallback pipeline automatically, followed by lower-priority tiers.
Real-time Load Balancing
Within the same priority level, traffic is distributed by weight on every single request. Unlike cache-pinned routing, this prevents load hotspots and keeps traffic shares accurate.
Full Streaming Support
The proxy supports SSE (Server-Sent Events) streaming with zero-buffering pass-through. Responses are streamed chunk-by-chunk from the upstream provider directly to your client.
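The cascading fallback described above can be sketched as building an ordered try-list per request: a weighted pick for the primary, its shuffled tier siblings behind it, then lower tiers in priority order. Names and structure here are illustrative assumptions, not TokenBill's internal implementation:

```python
import random

def build_pipeline(targets):
    """targets: list of (model, weight, tier) tuples.
    Returns the ordered list of models to try for one request."""
    tiers = {}
    for model, weight, tier in targets:
        tiers.setdefault(tier, []).append((model, weight))

    pipeline = []
    for tier in sorted(tiers):
        group = tiers[tier]
        if not pipeline:
            # Primary tier: weighted pick first, shuffle the rest behind it
            total = sum(w for _, w in group)
            r = random.uniform(0, total)
            for i, (_, w) in enumerate(group):
                r -= w
                if r <= 0:
                    primary = group.pop(i)
                    break
            else:
                primary = group.pop()
            random.shuffle(group)
            pipeline.append(primary[0])
            pipeline.extend(m for m, _ in group)
        else:
            # Lower tiers: appended in priority order as fallbacks
            pipeline.extend(m for m, _ in group)
    return pipeline
```

For the gpt-4o example in the topology above, this yields one of the two weighted primaries first, the other primary sibling second, then the Tier 1 and Tier 2 fallbacks.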
Magic Command

Magic Command: /tokenbill

No extra REST API is needed. Simply type /tokenbill in any chat message; the gateway intercepts it and returns your VE's balance, budget, and available models. The response never contains api_key or upstream credentials.

See the full checkout top-up flow →
Example conversation:

User: Write me a poem about spring
Assistant: Spring breeze brushes the willows long, peach blossoms reflect the sun, filling the courtyard...
User: /tokenbill
Gateway (TokenBill VE Info):
{ "ve_id": "ve_abc123", "name": "User Alice", "credit_remaining": 998800, "credit_budget": 1000000, "logical_models": ["gpt-4", "gpt-3.5"] }
User: Write another poem about summer
Gateway (Budget Exhausted): Your Credit balance is insufficient. Please top up: https://tokenbill.io/checkout?ve_id=ve_abc123

The checkout page then shows the VE (ve_abc123, exhausted: 0 Credits remaining, with 1.2M tokens used across 8,401 requests) and top-up options of $10, $50, or $100 (e.g. $50.00 buys 5,000,000 Credits).
Tip: Use the exact same client, base_url, and api_key as normal chat. Just type /tokenbill in the message content. The gateway intercepts locally, consuming no credits and making no upstream calls.
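A chat client could recognize the VE-info reply and render the balance for the user. A sketch assuming the assistant message content is exactly the JSON shown in the example conversation above:

```python
import json

def parse_ve_info(content: str):
    """Parse the /tokenbill reply and report the remaining budget fraction."""
    info = json.loads(content)
    remaining = info["credit_remaining"] / info["credit_budget"]
    return info["ve_id"], remaining

# Sample reply content, copied from the example above
reply = ('{ "ve_id": "ve_abc123", "name": "User Alice", '
         '"credit_remaining": 998800, "credit_budget": 1000000, '
         '"logical_models": ["gpt-4", "gpt-3.5"] }')
```

Here `parse_ve_info(reply)` returns the VE id and roughly 0.9988, i.e. 99.88% of the budget remaining.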
Financial Engine

Two-Layer Billing Architecture

TokenBill decouples physical model consumption from financial accounting using a standardized Credit and Unit system.

Physical Layer: Metering & Pricing
Unit & Credit per Unit
Unit is the universal metric for API consumption (Tokens, Images, Seconds). Credit per Unit standardizes the physical cost into the system's abstract currency.
Text Model
Unit: Tokens
e.g. 0.05 Credit / 1K Unit
Image Model
Unit: Images
e.g. 2.00 Credit / 1 Unit
Audio Model
Unit: Seconds
e.g. 0.15 Credit / 1 Unit
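The metering rule is a straight multiplication of consumed Units by the model's Credit per Unit. A sketch using the example prices above (note the text price is quoted per 1K units):

```python
def credits_for(units: float, credit_per_unit: float) -> float:
    """Convert raw consumption (Units) into Credits."""
    return units * credit_per_unit

# Example prices from above
TEXT_PER_TOKEN = 0.05 / 1000   # 0.05 Credit per 1K tokens
IMAGE = 2.00                   # 2.00 Credit per image
AUDIO_PER_SECOND = 0.15        # 0.15 Credit per second

# A mixed workload: 12,000 tokens + 2 images + 30 seconds of audio
total = (credits_for(12_000, TEXT_PER_TOKEN)
         + credits_for(2, IMAGE)
         + credits_for(30, AUDIO_PER_SECOND))
# 0.6 + 4.0 + 4.5 = 9.1 Credits
```

All three modalities land in the same Credit currency, which is what lets the VE deduct them from a single budget.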
Virtual Layer: Budget & Risk Control
Credit, Budget & Limits
Virtual Endpoints operate purely on Credits. Budgets are allocated and deducted uniformly across all modalities. Once spent Credits reach the Budget limit, traffic is blocked at the gateway.
Example endpoint wallet (status: Active): unified wallet currency CREDIT; spend / budget: 1,250 / 5,000.
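The budget shield amounts to a pre-flight check before each charge. A sketch of the idea (the real gateway logic is internal; the error here stands in for the HTTP 429 insufficient_quota response):

```python
class QuotaExceeded(Exception):
    """Stands in for the gateway's HTTP 429 insufficient_quota error."""

def charge(wallet: dict, cost: float) -> None:
    """Deduct Credits from a VE wallet, refusing once the Budget is hit."""
    if wallet["spent"] + cost > wallet["budget"]:
        raise QuotaExceeded(f"Budget exhausted for {wallet['ve_id']}")
    wallet["spent"] += cost

# The example wallet from above: 1,250 spent of a 5,000 budget
wallet = {"ve_id": "ve_abc123", "budget": 5000, "spent": 1250}
```

A charge that fits the remaining budget succeeds and updates the spend; one that would exceed it raises, which is the point where the gateway returns the checkout link instead of forwarding the request.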

Multi-Format Upstream Support

Each Provider is a custom upstream account — your own deployment, a third-party API, or a cloud service. TokenBill normalizes different API formats behind a single OpenAI-compatible interface.

OpenAI Compatible
Standard format for most providers: OpenAI, DeepSeek, Qwen, self-hosted vLLM / Ollama, etc.
Azure OpenAI
Supports Azure deployment names, API versions, and region routing
Anthropic
Native Anthropic Messages API format for Claude models
AWS Bedrock
IAM-based authentication for AWS-managed foundation models

API Reference
TokenBill supports all standard OpenAI-compatible endpoints with automated routing and usage tracking.

Checkout Integration
How to integrate TokenBill payment and checkout sessions in a web page or WeChat Mini Program.