PHASE 8 · Chapter 39

Scaling NLP Systems

An architecture for handling millions of requests.

Introduction

With one user, the API works. With 100 users, it is still fine. At 1 million requests a day, servers crash, latency climbs to 10 seconds, and the GPU bill reaches $10K/month. Scaling an NLP system is not just a matter of buying a bigger server; it takes a smarter architecture. Batching, caching, model optimization, and load balancing are combined to handle millions of requests.

Concept

Scaling strategies: (1) Horizontal scaling: multiple replicas behind a load balancer, (2) Vertical scaling: a bigger GPU/CPU, (3) Dynamic batching: processing multiple requests together (Triton, vLLM), (4) Model optimization: quantization (INT8), distillation, ONNX, TensorRT, (5) Caching: storing responses to frequent queries (Redis), (6) Async queue: handing long tasks to background workers (Celery, RQ).
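Strategy (6) is the least obvious of the six, so here is a minimal sketch of the submit-then-poll pattern. It uses only the standard library: `queue.Queue` stands in for a Celery/RQ broker, and `slow_task`, `submit`, and the `jobs` dictionary are hypothetical names chosen for illustration, not part of either library.

```python
import queue
import threading
import uuid

jobs = {}                    # job_id -> {"status": ..., "result": ...}
job_queue = queue.Queue()    # assumption: stands in for a Celery/RQ broker


def slow_task(text: str) -> int:
    """Hypothetical long-running task; here it only counts words."""
    return len(text.split())


def submit(text: str) -> str:
    """Enqueue a job and return its id immediately, without waiting."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    job_queue.put((job_id, text))
    return job_id


def worker() -> None:
    """Background worker: pull jobs off the queue and record the outcome."""
    while True:
        job_id, text = job_queue.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = slow_task(text)
            jobs[job_id]["status"] = "done"
        except Exception as exc:
            jobs[job_id]["result"] = repr(exc)
            jobs[job_id]["status"] = "failed"
        finally:
            job_queue.task_done()


threading.Thread(target=worker, daemon=True).start()

job_id = submit("summarize this very long document")
job_queue.join()             # demo only: a real client would poll by job id
print(jobs[job_id])
```

The API handler returns the job id right away, and the client asks for the status later, so a slow task never holds a request open.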

Simple explanation

Picture a restaurant at peak hour. The solutions: (1) open more branches (horizontal), (2) build a bigger kitchen (vertical), (3) cook the same order for ten people at once (batching), (4) make the chef's recipe more efficient (model optimization), (5) keep popular dishes ready in advance (cache), (6) prepare catering orders in the background (async queue).

Real-world use

vLLM / TensorRT-LLM: high-throughput LLM serving engines.
Triton Inference Server: NVIDIA's inference server with built-in dynamic batching.
LLM chat services: KV caching and speculative decoding to speed up generation.
Redis: caching responses to common queries.
Kubernetes HPA (Horizontal Pod Autoscaler): automatic replica scaling.

Step-by-step analysis

Step 1: Profile the bottleneck

Is the service CPU-bound, GPU-bound, or I/O-bound? Measure before optimizing, because each answer calls for a different fix.
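A first pass at profiling can be as simple as timing each stage of a request. The sketch below is one way to do that with the standard library. The stage names and the `handle_request` function are hypothetical, and `time.sleep` is a placeholder for the real model call. It shows where wall-clock time goes, not whether that time is CPU, GPU, or I/O; answering that needs a system-level profiler.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)   # stage name -> list of durations in seconds


@contextmanager
def timed(stage: str):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)


def handle_request(text: str) -> int:
    """Hypothetical request handler split into three measurable stages."""
    with timed("preprocess"):
        tokens = text.lower().split()
    with timed("inference"):
        time.sleep(0.01)          # placeholder for the model call
        score = len(tokens)
    with timed("postprocess"):
        result = score % 2
    return result


def slowest_stage() -> str:
    """Return the stage with the largest total time."""
    totals = {stage: sum(values) for stage, values in timings.items()}
    return max(totals, key=totals.get)


for _ in range(20):
    handle_request("scaling nlp systems starts with finding the bottleneck")

for stage, values in timings.items():
    mean_ms = 1000 * sum(values) / len(values)
    print(f"{stage:12s} mean = {mean_ms:.3f} ms")
print("bottleneck:", slowest_stage())
```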

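Step 2: Quantize the model

FP32 → INT8: weights become about 4x smaller. Inference is usually faster too, but the speedup depends on hardware and kernel support and is often less than 4x. Check accuracy after quantizing.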
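To make the size claim concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It is a teaching illustration, not how a production toolkit implements it; the weight values and the 66-million-parameter model size are made-up examples.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


def model_size_mb(num_params, bytes_per_param):
    """Storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6


weights = [0.12, -0.5, 0.33, 0.9, -0.07]      # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("max round-trip error:", max_error)      # bounded by scale / 2

# Hypothetical 66-million-parameter model: 4 bytes (FP32) vs 1 byte (INT8).
print("FP32 size (MB):", model_size_mb(66_000_000, 4))
print("INT8 size (MB):", model_size_mb(66_000_000, 1))
```

The 4x reduction comes directly from storing 1 byte per weight instead of 4. The rounding error per weight is at most half the scale, which is why accuracy usually drops only slightly but should still be measured.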

Step 3: Dynamic batching

Queue incoming requests, wait up to a 50 ms window, then process them as one batch.

Step 4: Cache layer

Store responses for frequent inputs in Redis.
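The sketch below shows the two policies a response cache usually needs, expiry and a size limit, using an in-process class as a stand-in for Redis. `TTLCache` and its parameters are hypothetical names for illustration. The injectable `clock` exists only so the demo can advance time without sleeping.

```python
import time
from collections import OrderedDict


class TTLCache:
    """In-process cache with expiry and LRU eviction (a stand-in for Redis)."""

    def __init__(self, max_items=1000, ttl_seconds=300.0, clock=time.monotonic):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()        # key -> (expires_at, value)

    def get(self, key):
        """Return the cached value, or None on a miss or an expired entry."""
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]           # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)       # mark as most recently used
        return value

    def set(self, key, value):
        """Store a value and evict the least recently used entries if full."""
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)


now = [0.0]                               # fake clock for the demo
demo = TTLCache(max_items=2, ttl_seconds=10, clock=lambda: now[0])
demo.set("a", "positive")
demo.set("b", "negative")
demo.get("a")                             # touch "a" so "b" is least recent
demo.set("c", "neutral")                  # over capacity: "b" is evicted
print(demo.get("b"))                      # miss: evicted
now[0] = 11.0                             # move past the 10-second TTL
print(demo.get("a"))                      # miss: expired
```

With Redis, the expiry half of this is handled by the server: `SET` accepts an `EX` option that gives the key a time-to-live in seconds.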

Step 5: Horizontal scaling

Kubernetes HPA adjusts the replica count based on metrics such as CPU utilization or QPS (QPS requires a custom metrics pipeline).
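The scaling rule documented for the Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function below reproduces that arithmetic as a sketch. It ignores stabilization windows and other controller behavior, and the metric values in the examples are invented.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count the HPA formula would ask for, clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas           # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas averaging 90% CPU against a 60% target -> scale up
print(desired_replicas(4, current_metric=90, target_metric=60))   # 6
# 62% against 60% is inside the tolerance -> stay at 4
print(desired_replicas(4, current_metric=62, target_metric=60))   # 4
# 20% against 60% -> scale down
print(desired_replicas(4, current_metric=20, target_metric=60))   # 2
```

A target set too low keeps extra replicas running (over-provisioning). A target set too high delays scale-up until latency has already degraded.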

Step 6: CDN + edge

Serve static assets from a CDN and consider edge inference (e.g., Cloudflare Workers AI).

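Python code

import asyncio
import hashlib

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")  # loaded once per process

# In-process stand-ins. In production the cache would live in Redis
# and would need a size limit and an expiry policy.
cache: dict = {}
batch_queue: list = []
batch_lock = asyncio.Lock()
BATCH_SIZE = 8
BATCH_WAIT_MS = 50

class Req(BaseModel):
    text: str

def cache_key(text: str) -> str:
    # MD5 is used only as a compact lookup key, not for security.
    return hashlib.md5(text.encode()).hexdigest()

async def process_batch():
    # Take up to BATCH_SIZE waiting requests off the queue.
    async with batch_lock:
        if not batch_queue:
            return
        items = batch_queue[:BATCH_SIZE]
        del batch_queue[:BATCH_SIZE]

    texts = [it["text"] for it in items]
    try:
        # Run the blocking model call in a worker thread (Python 3.9+)
        # so the event loop stays free to accept requests.
        results = await asyncio.to_thread(classifier, texts)
    except Exception as exc:
        # Fail every waiting request instead of leaving it hanging.
        for it in items:
            if not it["future"].done():
                it["future"].set_exception(exc)
        return
    for it, res in zip(items, results):
        cache[cache_key(it["text"])] = res
        if not it["future"].done():  # skip requests the client cancelled
            it["future"].set_result(res)

async def batch_worker():
    while True:
        await asyncio.sleep(BATCH_WAIT_MS / 1000)
        # Drain everything that arrived during the window, one batch at a time.
        while batch_queue:
            await process_batch()

@app.on_event("startup")  # newer FastAPI versions prefer a lifespan handler
async def startup():
    # Keep a reference so the task is not garbage-collected.
    app.state.batch_task = asyncio.create_task(batch_worker())

@app.post("/predict")
async def predict(req: Req):
    key = cache_key(req.text)
    if key in cache:
        return {"cached": True, "result": cache[key]}

    # Each request waits on its own future, resolved by the batch worker.
    fut = asyncio.get_running_loop().create_future()
    async with batch_lock:
        batch_queue.append({"text": req.text, "future": fut})

    result = await fut
    return {"cached": False, "result": result}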

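Explanation

The cache layer (Redis in production) avoids recomputing duplicate requests. Dynamic batching: incoming requests go into a queue, and every 50 ms a worker processes them as a batch. A GPU often takes nearly the same time for 8 requests as for 1, so throughput can rise by up to about 8x, at the cost of up to 50 ms of extra waiting per request. Each request awaits its own asyncio future to receive its individual response.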
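How close batching gets to the ideal 8x depends on how much of a forward pass is fixed overhead. The toy cost model below assumes batch time = fixed overhead + per-item cost × batch size. The 20 ms and 1 ms figures are invented for illustration and should be replaced by measurements from your own model.

```python
def batch_time_ms(batch_size, overhead_ms=20.0, per_item_ms=1.0):
    """Assumed cost model: a fixed overhead plus a linear per-item cost."""
    return overhead_ms + per_item_ms * batch_size


def throughput_rps(batch_size, **cost):
    """Requests per second if batches of this size run back to back."""
    return 1000.0 * batch_size / batch_time_ms(batch_size, **cost)


def worst_case_latency_ms(batch_size, wait_ms=50.0, **cost):
    """A request that just missed a batch waits a full window, then runs."""
    return wait_ms + batch_time_ms(batch_size, **cost)


for size in (1, 8, 32):
    speedup = throughput_rps(size) / throughput_rps(1)
    print(f"batch={size:2d}  throughput={throughput_rps(size):7.1f} rps  "
          f"speedup={speedup:5.2f}x  "
          f"worst-case latency={worst_case_latency_ms(size):5.1f} ms")
```

Under these assumed numbers a batch of 8 gives a 6x speedup rather than 8x. The full 8x appears only when the per-item cost is negligible next to the fixed overhead. The last column shows the price paid: larger batches and longer windows raise worst-case latency.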

Common mistakes

No cache invalidation strategy: stale data gets served.
Batch size too large: latency increases.
Calling a synchronous model inside an async function: the event loop blocks.
Wrong auto-scaling thresholds: over-provisioning or under-provisioning.
Scaling without monitoring: you cannot tell where the bottleneck is.

Exercises

Run a load test with Locust or k6 and work toward handling 1000 RPS.
Convert the model to ONNX and compare latency.
Integrate a Redis cache.
Set up Kubernetes HPA and demonstrate auto-scaling.

Mini project

Scalable Inference Server

Build an inference server that combines dynamic batching, a Redis cache, and an ONNX quantized model. Run a load test and report p50/p95/p99 latency.
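For the latency report, the percentiles can be computed from the raw per-request latencies. The sketch below uses the nearest-rank definition, which is one of several common percentile conventions, so load-testing tools may report slightly different values. The sample latencies are invented.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900,
                14, 15, 13, 12, 17, 14, 13, 15, 12, 14]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50 = 14 ms, p95 = 250 ms, p99 = 900 ms
```

The mean of these samples is about 70 ms, which describes none of the requests well. That is why tail percentiles are reported instead.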

Summary

Scaling = horizontal + vertical + smart optimization.
Batching, caching, and quantization are throughput multipliers.
Use an async queue to offload long tasks.
Use Kubernetes with an autoscaler for demand-based replicas.
Make architecture decisions data-driven: profile first.
