Spam Detection
Filtering spam out of email and SMS messages.
How much spam lands in your inbox every day? How does Gmail manage to filter out 99.9% of it? The technique that has been doing this work for decades is also the most famous use case of NLP classification: spam detection.
Spam detection is a binary text classification problem: each message is labeled either 'spam' or 'ham' (not spam). The classical approach is Naive Bayes with a Bag of Words representation. The modern approach combines deep learning with metadata (sender, links, attachments). Spammers constantly evolve, so the model also has to be retrained regularly.
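The metadata signals mentioned above (sender, links, attachments) can be turned into numeric features that sit alongside the text features. The following is a minimal sketch, not a production feature set: the `Message` fields, the feature names, and the URL regular expression are all illustrative assumptions.

```python
import re
from dataclasses import dataclass, field


# Hypothetical message container; the field names are assumptions.
@dataclass
class Message:
    sender: str
    body: str
    attachments: list = field(default_factory=list)


# Rough pattern for http/https links; real URL detection is more involved.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def metadata_features(msg: Message, known_senders: set) -> dict:
    """Return a few simple numeric metadata features for one message."""
    letters = [c for c in msg.body if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return {
        "num_links": len(URL_RE.findall(msg.body)),
        "num_attachments": len(msg.attachments),
        "sender_known": int(msg.sender.lower() in known_senders),
        "upper_ratio": round(upper_ratio, 3),
        "num_exclamations": msg.body.count("!"),
    }


example = Message(
    sender="promo@example.com",
    body="WINNER!! Claim at http://example.com/prize now!",
)
features = metadata_features(example, known_senders={"mom@example.com"})
print(features)
```

Features like these could be concatenated with the text vector before training, though which signals actually help depends on the data.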
Certain words show up again and again in spam: 'free', 'winner', 'lottery', 'click here', 'urgent', 'congratulations'. Naive Bayes learns the probabilities of these words, that is, how often each word appears in spam versus ham. For a new email, it combines the evidence from the words present and makes a decision. Simple, but surprisingly effective.
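To make the word-probability idea concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The toy corpus, the whitespace tokenizer, and the helper names are assumptions for illustration; a library implementation such as scikit-learn's differs in its details.

```python
import math
from collections import Counter

# Toy training data (hypothetical, for illustration only).
train = [
    ("free lottery winner click here", "spam"),
    ("urgent free prize click now", "spam"),
    ("are we meeting at noon", "ham"),
    ("please send the project file", "ham"),
]


def tokenize(text):
    return text.lower().split()


# Count word occurrences per class and documents per class.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1

vocab = set(word_counts["spam"]) | set(word_counts["ham"])


def log_likelihood(word, label, alpha=1.0):
    """log P(word | label) with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))


def predict(text):
    """Return (label, posterior probability) for the most likely class."""
    scores = {}
    for label in word_counts:
        score = math.log(doc_counts[label] / sum(doc_counts.values()))  # log prior
        for word in tokenize(text):
            if word in vocab:  # skip words never seen in training
                score += log_likelihood(word, label)
        scores[label] = score
    # Turn log scores into probabilities (log-sum-exp for numerical stability).
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    best = max(scores, key=scores.get)
    return best, math.exp(scores[best] - norm)


print(predict("free prize click here"))
print(predict("send the file at noon"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.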
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Evaluation helpers (not used in this minimal snippet)
from sklearn.metrics import confusion_matrix, classification_report

# Tiny hand-labeled training set
messages = [
    "Congratulations! You won a free iPhone. Click here to claim now!",
    "Hey, are we still meeting at 5 pm today?",
    "URGENT: Your account will be suspended. Verify now!",
    "Mom, please buy some milk on your way home.",
    "WINNER!! You have been selected for a $1000 gift card.",
    "The meeting notes are attached. Let me know your thoughts.",
    "Free entry into our weekly prize draw! Text WIN to 80085.",
    "Can you send me the project file before noon?",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Vectorizer and classifier chained into a single estimator
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", MultinomialNB()),
])
pipe.fit(messages, labels)

test = [
    "You won a lottery! Claim your prize now",
    "Lunch at 1 pm?",
]
for msg in test:
    pred = pipe.predict([msg])[0]
    # Probability of the most likely class
    proba = pipe.predict_proba([msg]).max()
    print(f"'{msg}' -> {pred} ({proba:.2%})")

Pipeline wraps the two steps into a single object: CountVectorizer turns text into count vectors, then MultinomialNB classifies them. fit() trains both steps together. predict_proba returns a confidence score. In production, this pipeline is pickled and saved.
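The pipeline above can be evaluated and persisted along the following lines. This is a sketch under stated assumptions: the held-out messages are made up and far too few for a meaningful evaluation, and the file name `spam_pipeline.pkl` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

messages = [
    "Congratulations! You won a free iPhone. Click here to claim now!",
    "Hey, are we still meeting at 5 pm today?",
    "URGENT: Your account will be suspended. Verify now!",
    "Mom, please buy some milk on your way home.",
    "WINNER!! You have been selected for a $1000 gift card.",
    "The meeting notes are attached. Let me know your thoughts.",
    "Free entry into our weekly prize draw! Text WIN to 80085.",
    "Can you send me the project file before noon?",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])
pipe.fit(messages, labels)

# Hypothetical held-out messages (illustration only).
test_messages = [
    "Claim your free prize now!",
    "Are we meeting today?",
    "You have been selected as a winner",
    "Please send me the notes",
]
test_labels = ["spam", "ham", "spam", "ham"]

predictions = pipe.predict(test_messages)

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(test_labels, predictions, labels=["ham", "spam"])
print(cm)
print(classification_report(test_labels, predictions, zero_division=0))

# Save the fitted pipeline to disk and load it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "spam_pipeline.pkl"  # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump(pipe, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

restored_predictions = restored.predict(test_messages)
print(list(restored_predictions) == list(predictions))
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code. It is also safest to load the model with the same scikit-learn version that saved it.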
A command-line tool that trains on the SMS Spam Collection dataset, saves the model (joblib), and, when given user input, returns a 'SPAM' / 'HAM' label along with a confidence score. It logs misclassifications for retraining.
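One possible shape for such a tool is sketched below. It assumes the dataset is a tab-separated file with one `label<TAB>message` pair per line. The subcommand names, default file names, and log format are hypothetical choices, not a prescribed design.

```python
import argparse
import csv

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def load_dataset(path):
    """Read tab-separated `label<TAB>message` lines into two lists."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, sep, text = line.rstrip("\n").partition("\t")
            if sep and text:  # skip malformed or empty lines
                labels.append(label)
                texts.append(text)
    return texts, labels


def train(data_path, model_path):
    """Fit the pipeline on the dataset and save it with joblib."""
    texts, labels = load_dataset(data_path)
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])
    pipe.fit(texts, labels)
    joblib.dump(pipe, model_path)
    return pipe


def classify(pipe, text):
    """Return (upper-cased label, confidence) for one message."""
    proba = pipe.predict_proba([text])[0]
    best = proba.argmax()
    return str(pipe.classes_[best]).upper(), float(proba[best])


def log_misclassification(log_path, text, predicted, correct):
    """Append a corrected example to a TSV log for later retraining."""
    with open(log_path, "a", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([correct, predicted, text])


def main(argv=None):
    parser = argparse.ArgumentParser(description="SMS spam classifier (sketch)")
    sub = parser.add_subparsers(dest="command")
    t = sub.add_parser("train")
    t.add_argument("data")
    t.add_argument("--model", default="spam_model.joblib")
    p = sub.add_parser("predict")
    p.add_argument("--model", default="spam_model.joblib")
    p.add_argument("--log", default="misclassified.tsv")
    args = parser.parse_args(argv)

    if args.command == "train":
        train(args.data, args.model)
        print(f"Saved model to {args.model}")
    elif args.command == "predict":
        pipe = joblib.load(args.model)
        while True:
            text = input("message (blank to quit)> ").strip()
            if not text:
                break
            label, conf = classify(pipe, text)
            print(f"{label} ({conf:.2%})")
            # Ask the user whether the prediction was right; log it if not.
            if input("correct? [Y/n] ").strip().lower() == "n":
                correct = "HAM" if label == "SPAM" else "SPAM"
                log_misclassification(args.log, text, label, correct)
    else:
        parser.print_help()


if __name__ == "__main__":
    main()
```

Saved as, say, `spam_cli.py` (a hypothetical name), it could be run as `python spam_cli.py train SMSSpamCollection` followed by `python spam_cli.py predict`. The logged rows can later be appended to the training data before refitting.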