PHASE 0 · Chapter 1

What is NLP & Why It Matters

How humans speak and how machines understand: the start of that journey.

Introduction

Imagine you ask ChatGPT, "What's the weather like today?" How did it understand the question? How did it write a reply? The science behind this 'understanding' and 'speaking' is called NLP: Natural Language Processing.

Concept

Natural Language Processing (NLP) is a branch of Artificial Intelligence that teaches computers to understand, interpret, and generate human language, both text and speech. Put simply, it is the bridge between human language and machine language.

Simple Explanation

A computer works only with numbers (ultimately 0s and 1s). We, however, speak in Bengali or English, where the same word can have several meanings, the grammar is complex, and context changes everything. NLP's job is to convert this 'fuzzy' human language into numbers that a machine can process, without losing the meaning along the way.
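
To make "a computer works only with numbers" concrete, here is a minimal sketch using only the Python standard library. The sample string is an arbitrary choice for illustration; it shows the integers and bits that sit behind three characters.

```python
# Show the numbers a computer actually stores for a short piece of text.
text = "NLP"

# Unicode code points: one integer per character.
code_points = [ord(ch) for ch in text]

# UTF-8 bytes: the raw values written to memory or disk.
utf8_bytes = list(text.encode("utf-8"))

# The same bytes written out as bits (0s and 1s).
bits = [format(b, "08b") for b in utf8_bytes]

print("Code points:", code_points)
print("UTF-8 bytes:", utf8_bytes)
print("Bits:", bits)
```

These numbers only identify characters. They carry no information about meaning, which is why the later steps of an NLP pipeline are needed.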

Real-World Uses

ChatGPT, Gemini, Claude: NLP is at the core of every LLM.
Google Search understanding your query: NLP.
Gmail's spam filter: NLP classification.
Google Translate: NLP-based sequence-to-sequence.
Siri, Alexa, Google Assistant: NLP applied after speech-to-text.
Content moderation on Facebook and YouTube: sentiment analysis + classification.

Step-by-Step Breakdown

Step 1: Text Collection

First, raw text data is collected: articles, tweets, messages, documents.

Step 2: Preprocessing

Clean the text: lowercasing, punctuation removal, and stop-word removal.
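
The three cleaning operations named above can be sketched in a few lines of standard-library Python. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical and exist only for this illustration; real projects usually rely on a library-provided stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "and", "in"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    # str.translate with a deletion table removes every ASCII punctuation mark.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace split is a stand-in for a real tokenizer here.
    words = no_punct.split()
    return [w for w in words if w not in STOP_WORDS]

cleaned = preprocess("The weather is NICE today, and the sky is clear!")
print(cleaned)
```

Note that `string.punctuation` covers ASCII symbols only, so marks such as the Bengali danda would need to be added to the deletion table separately.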

Step 3: Tokenization

Break a sentence into small tokens (words or subwords).
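
As a rough illustration of what "breaking into tokens" means, the sketch below splits a sentence with a regular expression. This is an assumed, simplified approach and not how any particular library tokenizes; `simple_word_tokenize` is a hypothetical helper name.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one non-space symbol such as "!" or ",".
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Machines read language, one token at a time!"
word_tokens = simple_word_tokenize(sentence)

# The other extreme: every character is its own token.
char_tokens = list("token")

print("Word tokens:", word_tokens)
print("Character tokens:", char_tokens)
```

Subword tokenization sits between these two extremes. The regex here is tuned for English and may split scripts that use combining marks, such as Bengali, incorrectly.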

Step 4: Representation

Convert each token into numbers (a vector). This is called an embedding.
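
A minimal sketch of the token-to-vector idea, under loud assumptions: the token list is made up, `EMBED_DIM` is an arbitrary size, and the vectors are random placeholders. In a trained model the values in the embedding table are learned from data.

```python
import random

tokens = ["machines", "read", "language", "machines", "learn"]

# Build a vocabulary: each unique token gets an integer ID, in first-seen order.
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

token_ids = [vocab[tok] for tok in tokens]

# Hypothetical embedding table: one small random vector per vocabulary entry.
# These values are placeholders, not learned embeddings.
EMBED_DIM = 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]

# Look up the vector for each token by its ID.
vectors = [embedding_table[i] for i in token_ids]

print("Vocabulary:", vocab)
print("Token IDs:", token_ids)
print("Vector for 'machines':", vectors[0])
```

The same token always maps to the same ID and therefore the same vector, so both occurrences of "machines" share one row of the table.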

Step 5: Model Processing

Extract meaning with a neural network (RNN, Transformer).

Step 6: Output

Produce whatever output is needed: classification, generation, or translation.

Python Code

# Your very first NLP program
# We'll use NLTK — the classic NLP library

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time download
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

text = "Hello! Welcome to NLP. Let's understand how machines read language."

# Step 1: split into sentences
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Step 2: split into words (tokens)
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Step 3: simple stats
print("Total sentences:", len(sentences))
print("Total tokens:", len(tokens))
print("Unique tokens:", len(set(tokens)))

Explanation

In the code above we import the NLTK library. sent_tokenize() splits a paragraph into sentences, and word_tokenize() splits text into word-level tokens, with punctuation marks returned as separate tokens. This is the first and most fundamental step in NLP: breaking text down into structured units.

Common Mistakes

Thinking NLP means only ChatGPT: no, NLP is a vast field in which LLMs are just the newest chapter.
Thinking NLP works only for English: a misconception. NLP is possible in every language, including Bengali and Hindi.
Starting with Transformers right away: first build a solid foundation in Tokenization, BoW, and TF-IDF.

Exercises

Take a Bengali paragraph of your choice and write out on paper what its tokens could be.
Find 5 real-world NLP applications that you use every day.
Run the Python code above on your own machine and try it with your own text.

Small Project

Mini Project: My First Text Analyzer

Write a simple Python script that takes a paragraph from the user and prints: the total number of sentences, the total number of words, the number of unique words, and the average sentence length (words per sentence).

Summary

NLP = the science of making computers understand human language.
Pipeline: Text → Preprocess → Tokenize → Vectorize → Model → Output.
Without NLP, none of ChatGPT, Search, Translate, or voice assistants would be possible.
Python + NLTK is enough to get started; the rest comes step by step.

Next · Chapter 2

The Core Foundations of Text Processing