Rate Limiting Done Right — Beyond Simple Token Buckets
Every engineer who's built a public API has implemented rate limiting. Most of them have implemented it the same way: a token bucket per API key, reset every minute, return 429 when the bucket empties. It works until you start looking at your logs and notice that your most expensive endpoints are burning through your infrastructure budget while your cheapest endpoints sit idle.
Token buckets count requests. Production rate limiting should count cost.
The Problem With Counting Requests
Treating a GET /status request the same as a POST /bulk-import that processes 10,000 records is a modeling error. Both consume one token from the same bucket. But one generates 1ms of CPU and 50 bytes of DB reads; the other generates 2 seconds of CPU, 50MB of DB writes, and probably a background job.
When you rate limit by request count, sophisticated clients learn to batch operations and fire large payloads instead of many small ones. Your rate limiter becomes irrelevant for the traffic patterns that actually stress your system. You end up capacity-planning for burst behavior that your rate limiter was supposed to prevent.
The alternative is cost-based limiting. Assign each endpoint (or each operation type) a cost multiplier. A lightweight read might cost 1 unit. A complex search might cost 10. A bulk operation might cost 1 unit per record, with a per-request minimum of 20. Your token bucket runs on cost units instead of request count. Clients get the same bucket size but burn through it faster when they're doing expensive work.
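A minimal in-memory sketch of that idea in Python (the `CostBucket` class, the endpoint costs, and the refill rate are illustrative assumptions, not recommendations; real costs come from profiling):

```python
import time

# Illustrative per-endpoint costs; real values come from profiling.
ENDPOINT_COSTS = {
    "GET /status": 1,    # lightweight read
    "GET /search": 10,   # complex search
}

class CostBucket:
    """Token bucket that drains in cost units rather than request counts."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def try_consume(self, cost: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def cost_for(endpoint: str, record_count: int = 1) -> int:
    # Bulk operations: 1 unit per record, with a per-request minimum of 20.
    if endpoint == "POST /bulk-import":
        return max(20, record_count)
    return ENDPOINT_COSTS.get(endpoint, 1)
```

The only change from a classic token bucket is that `try_consume` takes a variable cost instead of always deducting 1.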
Sliding Window vs Fixed Window
Fixed window rate limiting resets at the same time for everyone — say, the top of each minute. This creates a predictable burst opportunity: a client who hits their limit at 12:00:55 knows they'll have a fresh bucket in five seconds. They can queue up 60 requests and fire them all at 12:01:00. The theoretical limit is 2x your per-minute cap delivered in a few seconds at window boundaries.
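To make the boundary concrete, here's a minimal fixed-window counter (in-memory Python sketch; a real deployment would use a shared store with TTL-based eviction of stale windows):

```python
import time

WINDOW_SECONDS = 60
# (api_key, window_id) -> request count. Stale windows are never evicted
# in this sketch; a real key-value store would expire them with a TTL.
counters: dict[tuple[str, int], int] = {}

def allow_fixed(api_key: str, limit: int = 60) -> bool:
    # Every key shares the same wall-clock boundary -- the burst opportunity.
    window_id = int(time.time()) // WINDOW_SECONDS
    key = (api_key, window_id)
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= limit
```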
Sliding window fixes this by tracking requests within a rolling time window rather than a fixed one. If the window is 60 seconds, then at any given moment, the system looks back 60 seconds and counts what's there. There's no "reset moment" to exploit. Burst at the boundary is no longer possible because the boundary doesn't exist.
The implementation cost is higher. Fixed windows need a counter and a reset time per key — two integer fields in a fast key-value store. Sliding windows need a sorted set of timestamps per key, which consumes more memory and requires a range query to count. At scale with many API keys, that memory cost adds up. Most production systems use a leaky bucket or a sliding-window counter approximation (weighting the previous fixed window's count by its overlap with the rolling window) that trades a small amount of accuracy for much lower memory overhead.
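Here's the exact sliding-window log as an in-memory sketch (Python; illustrative limit, no persistence), which makes the per-request memory cost visible, one timestamp per request per key:

```python
import time
from collections import deque

WINDOW_SECONDS = 60.0
request_logs: dict[str, deque] = {}  # api_key -> timestamps, oldest first

def allow_sliding(api_key: str, limit: int = 60) -> bool:
    now = time.monotonic()
    log = request_logs.setdefault(api_key, deque())
    # Evict timestamps that have slid out of the rolling window.
    while log and log[0] <= now - WINDOW_SECONDS:
        log.popleft()
    if len(log) >= limit:
        return False  # no reset moment to exploit; the window just rolls
    log.append(now)
    return True
```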
For most APIs, a fixed window with a reasonable burst buffer (allowing up to 150% of the per-minute limit in a 10-second burst) is pragmatic and sufficient. Pure sliding windows are worth implementing if you're building an API where abuse potential is high and accuracy matters.
Multi-Tier Rate Limiting
Single-tier rate limiting — one bucket per API key — misses two important control dimensions: per-endpoint limits and global limits.
Per-endpoint limits let you protect expensive operations independently of overall request volume. You might allow 1,000 requests/minute globally but only 10 requests/minute against your data export endpoint. Without per-endpoint limits, a client who discovers the expensive endpoint and hammers it will consume all their quota on operations that disproportionately stress your system.
Global limits let you protect against a single client consuming so much capacity that it degrades service for others. Even on paid plans, you usually want some absolute ceiling per client. What that ceiling is depends on your infrastructure capacity, but it should exist. "Unlimited" plans that have no hard ceiling are a support and infrastructure risk.
The implementation: maintain three buckets per client — global, per-endpoint, and (optionally) per-IP within a client's account. Check all three. Return 429 when any bucket is empty. Include in the 429 response which limit was hit and when it resets. That last part is critical for developer experience.
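A sketch of that check, reusing the `CostBucket` from the earlier sketch (the refund-on-deny behavior is one possible policy, not the only one):

```python
def check_tiers(tiers: list, cost: float):
    """tiers: ordered (name, CostBucket) pairs, e.g. global / endpoint / ip."""
    consumed = []
    for name, bucket in tiers:
        if bucket.try_consume(cost):
            consumed.append(bucket)
        else:
            # Refund tiers already charged so a denied request doesn't
            # silently drain quota, and report which tier tripped.
            for b in consumed:
                b.tokens = min(b.capacity, b.tokens + cost)
            return False, name
    return True, None
```

The returned tier name is exactly what feeds the 429 response discussed next.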
The 429 Response Matters More Than You Think
A 429 with no useful headers forces the client to guess. They don't know if they hit a per-minute limit or a per-day limit or a per-endpoint limit. They don't know when to retry. They don't know how much capacity they have left. So they implement exponential backoff starting at 1 second, which is often either too aggressive (hammering you during the backoff period) or too conservative (waiting five minutes when they could retry in one second).
Useful headers on a 429:
- Retry-After: seconds until the client should try again
- X-RateLimit-Limit: the limit that was hit
- X-RateLimit-Remaining: units remaining in the current window
- X-RateLimit-Reset: Unix timestamp when the window resets
- X-RateLimit-Limit-Type: which tier triggered the limit (global, endpoint, burst)
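As a sketch, these can be built from whatever state the limiter already tracks (the `rate_limit_headers` helper and its arguments are illustrative, not a standard API):

```python
import time

def rate_limit_headers(limit: int, remaining: int, reset_at: int, limit_type: str) -> dict:
    # reset_at is a Unix timestamp; Retry-After is derived from it.
    return {
        "Retry-After": str(max(0, reset_at - int(time.time()))),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_at),
        "X-RateLimit-Limit-Type": limit_type,  # "global", "endpoint", "burst"
    }
```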
Include these headers on every response, not just on 429s. Clients who track remaining quota proactively can throttle themselves before hitting limits. This reduces your 429 rate and reduces the support burden of developers asking "why did I get a 429?"
Abuse Detection Beyond Request Volume
Rate limiting is a blunt instrument for abuse prevention. A determined attacker who stays under your rate limits can still cause significant damage through targeted credential stuffing, scraping, or data exfiltration. Rate limiting buys you time; it's not a security boundary.
For actual abuse patterns, you need behavioral signals: unusual access patterns to sensitive endpoints, requests distributed across many IP addresses from a single key, requests for resources the client has never previously accessed, timing patterns that look machine-generated rather than user-driven. None of these are caught by token buckets.
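One of those signals, a single key fanning out across many source IPs, is cheap to sketch (the threshold and window below are illustrative, not recommendations):

```python
import time
from collections import defaultdict

FAN_OUT_WINDOW = 300.0   # seconds; illustrative
IP_THRESHOLD = 25        # distinct IPs per key per window; illustrative

key_events: dict[str, list] = defaultdict(list)  # api_key -> [(time, ip), ...]

def flag_ip_fan_out(api_key: str, ip: str) -> bool:
    now = time.monotonic()
    # Keep only events inside the window, then add the current one.
    events = [(t, i) for t, i in key_events[api_key] if t > now - FAN_OUT_WINDOW]
    events.append((now, ip))
    key_events[api_key] = events
    # Flag when one key is spread across unusually many source IPs.
    return len({i for _, i in events}) >= IP_THRESHOLD
```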
The practical advice: rate limiting is infrastructure protection. Abuse detection is a separate concern that requires behavioral analysis. Conflating the two leads to over-aggressive rate limits (false positives on legitimate users) and under-detection of sophisticated abuse. Build them separately, even if they share the same enforcement mechanism (blocking a key or IP).
Monitor your API rate limit behavior in production
APIForge shows you which clients are hitting rate limits, which endpoints are getting hammered, and what your actual quota utilization looks like across your customer base. No guesswork, just data.
Start Free