Scaling NLP Systems
An architecture for handling millions of requests.
With 1 user, the API works. With 100 users, it is still fine. At 1 million requests a day? The server crashes, latency climbs to 10 seconds, and the GPU bill reaches $10K/month. Scaling an NLP system is not just a matter of buying a bigger server; it takes smart architecture. Batching, caching, model optimization, and load balancing are combined to handle millions of requests.
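To make the 1 million daily requests figure concrete, the rough capacity estimate below converts it into requests per second and a replica count. It is a back-of-the-envelope sketch: the peak factor of 5 and the 20 requests per second per replica are assumed illustrative values, not measurements, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: int) -> float:
    """Average requests per second if traffic were perfectly even."""
    return requests_per_day / SECONDS_PER_DAY


def replicas_needed(requests_per_day: int, peak_factor: float,
                    per_replica_rps: float) -> int:
    """Replicas required to absorb peak traffic.

    peak_factor and per_replica_rps are assumptions to be replaced
    with numbers from your own traffic logs and load tests.
    """
    peak_rps = average_rps(requests_per_day) * peak_factor
    return math.ceil(peak_rps / per_replica_rps)


avg = average_rps(1_000_000)
print(f"average load: {avg:.1f} req/s")
print("replicas:", replicas_needed(1_000_000, peak_factor=5.0, per_replica_rps=20.0))
```

The average is only about a dozen requests per second; it is the peak, not the daily total, that determines how many replicas are needed.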
Scaling strategies: (1) Horizontal scaling: multiple replicas behind a load balancer, (2) Vertical scaling: a bigger GPU/CPU, (3) Dynamic batching: process multiple requests together (Triton, vLLM), (4) Model optimization: quantization (INT8), distillation, ONNX, TensorRT, (5) Caching: cache the responses to frequent queries (Redis), (6) Async queue: hand long tasks to background workers (Celery, RQ).
Think of a restaurant at peak hour. Solutions: (1) open more branches (horizontal), (2) build a bigger kitchen (vertical), (3) cook the same order for ten people at once (batching), (4) make the chef's recipe more efficient (model optimization), (5) keep popular dishes ready in advance (cache), (6) prepare catering orders in the background (async queue).
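Of the strategies listed above, the async queue (item 6) can be sketched with the standard library alone. The example below is only an in-process stand-in for a real task queue such as Celery or RQ: `jobs`, `submit`, `worker`, and `slow_task` are hypothetical names, and the sleep is a placeholder for a long-running model call.

```python
import asyncio
import uuid

# job_id -> {"status": ..., "result": ...}; a real system would keep this
# in a shared store so any replica can answer status queries.
jobs: dict = {}


async def slow_task(text: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for a long-running model call
    return text.upper()


async def worker(queue: asyncio.Queue) -> None:
    """Pull jobs off the queue one at a time and record their outcome."""
    while True:
        job_id, text = await queue.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = await slow_task(text)
            jobs[job_id]["status"] = "done"
        except Exception as exc:
            jobs[job_id]["status"] = "failed"
            jobs[job_id]["error"] = str(exc)
        finally:
            queue.task_done()


def submit(queue: asyncio.Queue, text: str) -> str:
    """Enqueue a job and return its id immediately, without waiting."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    queue.put_nowait((job_id, text))
    return job_id


async def demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue))
    job_id = submit(queue, "summarize this long document")
    await queue.join()  # in an API, the client would poll by job_id instead
    worker_task.cancel()
    return jobs[job_id]


print(asyncio.run(demo()))
```

The key design point is that `submit` returns right away, so the request handler stays fast while the slow work happens elsewhere.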
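import asyncio
import hashlib

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")

# In-process cache for the demo: unbounded and local to this replica.
cache = {}
batch_queue: list = []
batch_lock = asyncio.Lock()
BATCH_SIZE = 8
BATCH_WAIT_MS = 50


class Req(BaseModel):
    text: str


def cache_key(text: str) -> str:
    # MD5 serves only as a compact cache key here, not as a security measure.
    return hashlib.md5(text.encode()).hexdigest()


async def process_batch():
    # Hold the lock only while taking items off the queue.
    async with batch_lock:
        items = batch_queue[:BATCH_SIZE]
        del batch_queue[:BATCH_SIZE]
    if not items:
        return
    texts = [it["text"] for it in items]
    try:
        # The model call is blocking, so run it in a thread to keep
        # the event loop free to accept new requests.
        results = await asyncio.to_thread(classifier, texts)
    except Exception as exc:
        # Fail the waiting requests instead of leaving them hanging.
        for it in items:
            if not it["future"].done():
                it["future"].set_exception(exc)
        return
    for it, res in zip(items, results):
        cache[cache_key(it["text"])] = res
        # A future may already be cancelled if the client disconnected.
        if not it["future"].done():
            it["future"].set_result(res)


async def batch_worker():
    while True:
        await asyncio.sleep(BATCH_WAIT_MS / 1000)
        # Drain everything that arrived during the wait, BATCH_SIZE at a time.
        while batch_queue:
            await process_batch()


# on_event still works, but newer FastAPI versions prefer a lifespan handler.
@app.on_event("startup")
async def startup():
    # Keep a reference so the worker task is not garbage-collected.
    app.state.batch_worker = asyncio.create_task(batch_worker())


@app.post("/predict")
async def predict(req: Req):
    key = cache_key(req.text)
    if key in cache:
        return {"cached": True, "result": cache[key]}
    fut = asyncio.get_running_loop().create_future()
    async with batch_lock:
        batch_queue.append({"text": req.text, "future": fut})
    result = await fut
    return {"cached": False, "result": result}

The cache layer (Redis in production) avoids recomputing duplicate requests. Dynamic batching: incoming requests go into a queue, and every 50 ms the worker processes them in batches of up to 8. On a GPU, one request and a batch of 8 often take a similar amount of time, so throughput can improve by up to roughly 8x. Each request awaits its own async future to receive its individual response.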
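The plain dict used as a cache in the server code grows without limit and never expires entries. Before moving to Redis, a bounded cache with a time-to-live gives the same get/set shape in-process. The sketch below is a minimal illustration under that assumption; `TTLCache`, `max_items`, and `ttl_seconds` are hypothetical names, and the optional `now` argument exists only to make the behavior easy to demonstrate.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Least-recently-used cache with per-entry expiry (illustrative only)."""

    def __init__(self, max_items: int = 10_000, ttl_seconds: float = 300.0):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if now >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (now + self.ttl_seconds, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict least recently used


demo_cache = TTLCache(max_items=2, ttl_seconds=60)
demo_cache.set("a", 1, now=0)
demo_cache.set("b", 2, now=0)
demo_cache.get("a", now=1)     # touching "a" makes "b" the eviction candidate
demo_cache.set("c", 3, now=1)  # cache is full, so "b" is evicted
print(demo_cache.get("b", now=2), demo_cache.get("a", now=2))
```

An in-process cache like this is still per-replica; a shared store such as Redis is what lets every replica behind the load balancer benefit from the same cached results.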
Build an inference server that combines dynamic batching, a Redis cache, and an ONNX quantized model. Run a load test and report p50/p95/p99 latency.
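For the latency report, one possible starting point is the sketch below, which fires concurrent requests and computes nearest-rank percentiles. It assumes a hypothetical `fake_request` coroutine that merely sleeps; in a real test it would be replaced by an HTTP call to the `/predict` endpoint, and the totals and concurrency are arbitrary illustrative values.

```python
import asyncio
import math
import time


def percentile(sorted_values, pct: int):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


async def fake_request(i: int) -> float:
    """Hypothetical stand-in for one HTTP call; returns latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.001 * (1 + i % 5))
    return time.perf_counter() - start


async def load_test(total: int, concurrency: int) -> list:
    """Send `total` requests with at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(i: int) -> float:
        async with sem:
            return await fake_request(i)

    return await asyncio.gather(*(one(i) for i in range(total)))


def report(latencies) -> dict:
    ordered = sorted(latencies)
    return {f"p{p}": percentile(ordered, p) for p in (50, 95, 99)}


latencies = asyncio.run(load_test(total=50, concurrency=10))
for name, seconds in report(latencies).items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Reporting p95 and p99 alongside p50 matters because batching and queueing tend to hurt the slowest requests first, which an average would hide.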