PHASE 3 · Chapter 16

Spam Detection

Filtering spam out of email or SMS.

Introduction

How much spam lands in your inbox every day? How does Gmail filter out a reported 99.9% of it? The technique behind this has been in use for decades, and it is the most famous use case of NLP classification: spam detection.

Concept

Spam detection is a binary text classification problem: every message is labeled either 'spam' or 'ham' (not spam). The classical approach is Naive Bayes with Bag of Words. Modern approaches combine deep learning with metadata (sender, links, attachments). Spammers evolve constantly, so the model has to be retrained regularly.

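Simple Explanation

Certain words show up in spam again and again: 'free', 'winner', 'lottery', 'click here', 'urgent', 'congratulations'. Naive Bayes learns the probability of each word in each class, that is, how often a word appears in spam versus ham. For a new email, it combines the probabilities of the words it contains with each class's prior probability and picks the more likely class. Simple, but surprisingly effective.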
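
To make the word-probability idea concrete, here is a minimal from-scratch sketch of the calculation. The toy documents and the helper names (word_counts, log_likelihood, score, classify) are made up for illustration, and add-one (Laplace) smoothing is assumed so that unseen words do not produce a zero probability.

```python
import math
from collections import Counter

# Toy training data (hypothetical, for illustration only).
spam_docs = ["free prize winner", "free lottery click"]
ham_docs = ["meeting at noon", "free for lunch"]

def word_counts(docs):
    """Count how often each word appears across a list of documents."""
    return Counter(word for doc in docs for word in doc.split())

spam_counts = word_counts(spam_docs)
ham_counts = word_counts(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(word, counts):
    """log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return math.log((counts[word] + 1) / (total + len(vocab)))

def score(message, counts, prior):
    """log P(class) plus the sum of log P(word | class) over known words."""
    words = [w for w in message.split() if w in vocab]
    return math.log(prior) + sum(log_likelihood(w, counts) for w in words)

def classify(message):
    """Return the class with the higher log score."""
    n_spam, n_ham = len(spam_docs), len(ham_docs)
    total = n_spam + n_ham
    spam_score = score(message, spam_counts, n_spam / total)
    ham_score = score(message, ham_counts, n_ham / total)
    return "spam" if spam_score > ham_score else "ham"

print(classify("free lottery"))   # 'free' and 'lottery' are more frequent in spam
print(classify("lunch meeting"))  # both words appear only in ham
```

Working in log space avoids multiplying many small probabilities together, which is roughly what library implementations such as MultinomialNB do internally.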

Real-World Uses

Spam filters in Gmail/Outlook.
SMS spam blockers: telecom companies.
Comment spam detection: blogs, YouTube comments.
Phishing email detection: banks, corporate security.
Fake review detection: Amazon, TripAdvisor.

Step-by-Step Breakdown

Step 1 — Dataset

A collection of SMS messages or emails labeled spam/ham (e.g. the SMS Spam Collection).

Step 2 — Clean

Lowercase the text, remove URLs and numbers, and filter out stopwords.
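
The cleaning step could look roughly like the sketch below. The function name clean, the regular expressions, and the very short stopword list are all assumptions made for illustration; a real project would likely use a fuller stopword list.

```python
import re

# A small illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "the", "to", "at", "is", "are", "you", "your", "of", "and"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # rough URL pattern
NUM_RE = re.compile(r"\d+")                    # any run of digits
WORD_RE = re.compile(r"[a-z]+")                # lowercase alphabetic tokens

def clean(text):
    """Lowercase, drop URLs and numbers, then remove stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NUM_RE.sub(" ", text)
    tokens = WORD_RE.findall(text)
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean("WINNER!! Claim your $1000 prize at http://example.com/win now"))
# -> winner claim prize now
```

A function like this can also be passed to CountVectorizer through its preprocessor argument so that cleaning happens inside the pipeline.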

Step 3 — Vectorize

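Turn each message into a numeric vector with CountVectorizer (raw word counts) or TF-IDF (counts down-weighted for words that are common across messages).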

Step 4 — Naive Bayes train

MultinomialNB: the classic choice for spam detection.

Step 5 — Threshold tune

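Reducing false positives (ham classified as spam) is critical, so consider labeling a message spam only when its predicted spam probability exceeds a threshold higher than the default 0.5.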
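
One way to tune the threshold is to read the spam probability from predict_proba and apply a custom cutoff. In the sketch below, the training messages, the classify helper, and the threshold values are hypothetical; in practice the threshold would be chosen on a held-out validation set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set, for illustration only.
train_msgs = [
    "Free prize, claim now",
    "You are a winner, click here",
    "Are we meeting at noon?",
    "Please send the project file",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_msgs, train_labels)

# predict_proba columns follow model.classes_, so look up the spam column.
spam_col = list(model.classes_).index("spam")

def classify(message, threshold=0.9):
    """Label a message spam only if P(spam) reaches the threshold."""
    p_spam = model.predict_proba([message])[0][spam_col]
    return ("spam" if p_spam >= threshold else "ham"), p_spam

# A stricter threshold trades missed spam for fewer false positives.
for t in (0.5, 0.9, 0.99):
    label, p = classify("Claim your free prize", threshold=t)
    print(f"threshold={t}: {label} (P(spam)={p:.2f})")
```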

Python Code

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

messages = [
    "Congratulations! You won a free iPhone. Click here to claim now!",
    "Hey, are we still meeting at 5 pm today?",
    "URGENT: Your account will be suspended. Verify now!",
    "Mom, please buy some milk on your way home.",
    "WINNER!! You have been selected for a $1000 gift card.",
    "The meeting notes are attached. Let me know your thoughts.",
    "Free entry into our weekly prize draw! Text WIN to 80085.",
    "Can you send me the project file before noon?",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

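# Chain vectorization and classification into a single estimator
pipe = Pipeline([
    ("vec", CountVectorizer()),  # text -> word-count vector
    ("clf", MultinomialNB()),    # word counts -> spam/ham
])

# fit() fits the vectorizer first, then trains the classifier on its output
pipe.fit(messages, labels)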

test = [
    "You won a lottery! Claim your prize now",
    "Lunch at 1 pm?",
]
for msg in test:
    pred = pipe.predict([msg])[0]            # predicted label
    proba = pipe.predict_proba([msg]).max()  # probability of that label
    print(f"'{msg}' -> {pred} ({proba:.2%})")

Explanation

Pipeline wraps two steps in a single object: CountVectorizer turns text into a count vector, then MultinomialNB classifies it. fit() trains both together. predict_proba gives a confidence score. In production, this pipeline is typically pickled and saved to disk.
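
Saving and reloading a fitted pipeline might look like the sketch below. It assumes joblib is installed (it ships as a scikit-learn dependency); the file name spam_pipeline.joblib and the tiny training set are made up for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hypothetical training set, for illustration only.
msgs = ["Free prize, claim now", "Are we meeting at noon?"]
labels = ["spam", "ham"]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])
pipe.fit(msgs, labels)

# Persist the whole pipeline (vectorizer vocabulary + model) in one file.
path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.joblib")
joblib.dump(pipe, path)

# Later, or in another process: load it and predict without retraining.
restored = joblib.load(path)
print(restored.predict(["Claim your free prize now"])[0])
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.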

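Common Mistakes

False positives (important email marked as spam): users will be annoyed, so avoiding them is the priority.
When spammers start using new words, the model fails: regular retraining is essential.
Looking only at the text: sender reputation and link analysis are needed too.
Imbalanced data (ham >> spam): MultinomialNB has no class_weight parameter, so adjust its class priors or try ComplementNB; class_weight='balanced' applies to models such as LogisticRegression or LinearSVC.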
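
For the imbalance point, note that MultinomialNB does not accept class_weight. The sketch below shows three options that scikit-learn does provide: a uniform class prior via fit_prior=False, ComplementNB, and LogisticRegression with class_weight='balanced'. The messages are made up, and which option works best on real data is something to measure, not assume.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical imbalanced toy set: six ham messages, two spam messages.
msgs = [
    "Are we meeting at noon?",
    "Please send the project file",
    "Lunch tomorrow?",
    "Call me when you get home",
    "The report is attached",
    "See you at the station",
    "Free prize, claim now",
    "You are a winner, click here",
]
labels = ["ham"] * 6 + ["spam"] * 2

candidates = {
    # Ignore the skewed class frequencies and use a uniform prior instead.
    "MultinomialNB (uniform prior)": make_pipeline(
        CountVectorizer(), MultinomialNB(fit_prior=False)),
    # A Naive Bayes variant designed with imbalanced text data in mind.
    "ComplementNB": make_pipeline(CountVectorizer(), ComplementNB()),
    # Linear model that reweights classes inversely to their frequency.
    "LogisticRegression (balanced)": make_pipeline(
        CountVectorizer(),
        LogisticRegression(class_weight="balanced", max_iter=1000)),
}

for name, model in candidates.items():
    model.fit(msgs, labels)
    print(name, "->", model.predict(["Free prize, claim now"])[0])
```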

Exercises

Download the UCI SMS Spam Collection dataset and build the full pipeline.
Compare Naive Bayes, Logistic Regression, and SVM.
Find the top 20 'spammy' words by inspecting coef_ (Logistic Regression, linear SVM) or feature_log_prob_ (Naive Bayes).
Test the model on real spam from your own email folder.

Mini Project

SMS Spam Filter CLI

A command-line tool that trains on the SMS Spam Collection dataset, saves the model (joblib), and, given user input, returns a 'SPAM' / 'HAM' label along with a confidence score. It logs misclassifications for later retraining.

Summary

Spam detection = binary text classification.
Naive Bayes + BoW = a legendary baseline that is still relevant.
False positive cost > false negative cost.
A Pipeline handles preprocessing and classification together.
Spammers evolve, so the model has to be retrained regularly.
