Monitoring & Logging
Tracking the health of a model in production.
The model has been deployed to production, but a few days later users complain that the answers are wrong. Why? Data drift? A degraded model? A slow API? Without monitoring and logging you are guessing blind. Observability is at the core of every serious ML system.
ML observability rests on three pillars: (1) Logs: which request came in, what response went out, and any error trace; (2) Metrics: latency (p50/p95/p99), throughput (RPS), error rate, GPU utilization; (3) Traces: the flow of a request across distributed services. Tools: Prometheus + Grafana (metrics), ELK/Loki (logs), OpenTelemetry (traces), Sentry (error tracking), Evidently AI / WhyLabs (data drift, model drift).
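To make the metrics pillar concrete, here is a minimal sketch of how p50/p95/p99 latency, throughput, and error rate can be computed from raw samples. The function names (`percentile`, `summarize`) and the sample numbers are made up for illustration, and the nearest-rank percentile is only one of several common definitions; Prometheus estimates quantiles from histogram buckets instead.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]

def summarize(latencies_ms, errors, window_seconds):
    """Summarize one observation window into the usual service metrics."""
    total = len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "rps": total / window_seconds,
        "error_rate": errors / total,
    }

# Hypothetical window: 10 requests in 5 seconds, one of them slow and failed.
latencies = [20, 22, 25, 21, 23, 24, 26, 27, 30, 900]
summary = summarize(latencies, errors=1, window_seconds=5)
print(summary)
```

In this sample the median stays at 24 ms while p95 and p99 jump to 900 ms. This is why tail percentiles are tracked alongside the median: a single slow request is invisible in p50.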
Think of an ICU patient in a hospital: heart rate, BP, oxygen, and EKG are monitored continuously, and any anomaly triggers an alarm. An ML model in production is like that patient: rising latency, dropping accuracy, or a GPU memory leak turns into a disaster if it is not detected early. Logging = patient diary, metrics = vital signs, traces = full body scan.
import json
import logging
import time

from fastapi import FastAPI, Request
from fastapi.responses import Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from pydantic import BaseModel
from transformers import pipeline

# Emit only the message so that each log line is a single JSON object.
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s',
)
logger = logging.getLogger("nlp-api")

# Prometheus metrics: request count by outcome, and latency distribution per endpoint.
REQUEST_COUNT = Counter("nlp_requests_total", "Total predict requests", ["endpoint", "status"])
REQUEST_LATENCY = Histogram("nlp_request_latency_seconds", "Latency", ["endpoint"])

app = FastAPI()
classifier = pipeline("sentiment-analysis")


class Req(BaseModel):
    text: str


@app.middleware("http")
async def log_requests(request: Request, call_next):
    # perf_counter is monotonic, so it is safer than time.time() for measuring durations.
    start = time.perf_counter()
    response = await call_next(request)
    elapsed = time.perf_counter() - start
    log = {
        "method": request.method,
        "path": request.url.path,
        "status": response.status_code,
        "latency_ms": round(elapsed * 1000, 2),
    }
    logger.info(json.dumps(log))
    return response


@app.post("/predict")
async def predict(req: Req):
    # The histogram timer records the duration of the block, including failures.
    with REQUEST_LATENCY.labels(endpoint="/predict").time():
        try:
            result = classifier(req.text)[0]
            REQUEST_COUNT.labels(endpoint="/predict", status="success").inc()
            return {"label": result["label"], "score": result["score"]}
        except Exception as e:
            REQUEST_COUNT.labels(endpoint="/predict", status="error").inc()
            # Log only a truncated input; it may still contain PII.
            logger.error(json.dumps({"error": str(e), "input": req.text[:100]}))
            raise


@app.get("/metrics")
def metrics():
    # Expose all registered metrics in the Prometheus text format.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

The structured JSON logging middleware logs the method, path, status, and latency of every request. A Prometheus Counter and Histogram record the metrics. Prometheus scrapes the /metrics endpoint, and Grafana builds dashboards from that data. In the error case only the first 100 characters of the input are logged (be careful with PII).
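Truncating the input to 100 characters limits how much is logged, but an email address or phone number can still fit inside that window. One possible approach, sketched below, is to mask obvious patterns before truncating. The `redact` and `error_log_line` helpers and the two regular expressions are illustrative assumptions, not part of the app above, and simple regexes like these will not catch every kind of PII.

```python
import json
import re

# Deliberately simple, illustrative patterns; real PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER_RE = re.compile(r"\d{6,}")  # phone numbers, account numbers, etc.

def redact(text, max_chars=100):
    """Mask emails and long digit runs, then truncate.

    Masking happens before truncation so a cut-off email cannot leak half of itself.
    """
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = LONG_NUMBER_RE.sub("[NUMBER]", text)
    return text[:max_chars]

def error_log_line(error, user_input):
    """Build the JSON error log line with a redacted copy of the input."""
    return json.dumps({"error": str(error), "input": redact(user_input)})

line = error_log_line(
    ValueError("bad input"),
    "Contact me at jane.doe@example.com or 01712345678",
)
print(line)
```

In the `/predict` handler, the `req.text[:100]` expression could be swapped for `redact(req.text)` under these assumptions.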
FastAPI NLP app + Prometheus + Grafana + Loki, all running together in docker-compose. The dashboard shows live latency, throughput, and error rate. When latency exceeds 500ms, an alert is sent to Slack through a webhook.
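The alerting rule can be sketched in plain Python as follows. This is only an illustration of the threshold logic: in a real setup the check would more likely live in Grafana alerting or Prometheus Alertmanager. The function names and message wording are assumptions, the webhook URL must be your own, and the only Slack detail relied on is that an incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request

LATENCY_THRESHOLD_MS = 500  # threshold taken from the text above

def build_alert(latency_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return a Slack message payload if latency exceeds the threshold, else None."""
    if latency_ms <= threshold_ms:
        return None
    return {
        "text": f"nlp-api latency alert: {latency_ms:.0f} ms (threshold {threshold_ms} ms)"
    }

def send_to_slack(payload, webhook_url):
    """POST the payload as JSON to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status

# Hypothetical measurement; nothing is sent over the network here.
alert = build_alert(742.0)
print(alert)
# To deliver it: send_to_slack(alert, webhook_url) with your own webhook URL.
```

Keeping the decision (`build_alert`) separate from the delivery (`send_to_slack`) makes the threshold logic testable without any network access.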