PHASE 0 · Chapter 3

Python for NLP

Strings, regex, and file handling: the NLP engineer's daily toolkit.

Introduction

What a knife is to a chef, Python is to an NLP engineer. How well you can do NLP depends largely on how fluently you can manipulate text with Python.

Concept

Three core Python skills matter most for NLP: (1) String manipulation: slicing, searching, and replacing text. (2) Regular expressions (regex): using patterns to find specific structures in text. (3) File handling: reading text data from disk and writing results back.

Simple Explanation

Strings are your raw material. Regex is your microscope, a magnifying glass for spotting patterns. File handling is your warehouse: the place data comes from and the place results go. Together, these three make up the daily NLP workflow.

Real-World Uses

Extracting phone numbers and email addresses from emails: regex.
Finding error patterns in log files: regex + file handling.
Cleaning HTML during web scraping to keep only the text: string manipulation.
Processing 10,000 news articles in one run: a loop over file handling.
Every ChatGPT input and output ultimately rests on string operations.
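As a minimal sketch of the log-file use case above, the snippet below scans a few lines for errors. The log format, the sample lines, and the names `log_lines` and `error_pattern` are assumptions made for illustration; a real log would be read from disk and may be formatted differently.

```python
import re

# Hypothetical log lines; a real log would be read from a file instead.
log_lines = [
    "2025-01-10 09:15:02 INFO  Server started",
    "2025-01-10 09:15:07 ERROR Database connection failed",
    "2025-01-10 09:16:30 WARN  Slow response",
    "2025-01-10 09:17:45 ERROR Timeout while reading input",
]

# Capture the timestamp and the message of every ERROR line.
error_pattern = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ERROR\s+(.*)$"
)

errors = []
for line in log_lines:
    match = error_pattern.match(line)
    if match:
        errors.append((match.group(1), match.group(2)))

for timestamp, message in errors:
    print(timestamp, "->", message)
```

Compiling the pattern once and reusing it inside the loop keeps the code readable when the same pattern is applied to many lines.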

Step-by-Step Breakdown

Step 1: String Basics

lower(), upper(), strip(), replace(), split(), join(): the everyday essentials.
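The six methods named above can be sketched on one sample string. The variable names are illustrative only; the point is that each call returns a new string (or list) and leaves the original untouched.

```python
raw = "  Bangla NLP is Amazing!  "

trimmed = raw.strip()                 # drop leading/trailing whitespace
lowered = trimmed.lower()             # 'bangla nlp is amazing!'
shouted = trimmed.upper()             # 'BANGLA NLP IS AMAZING!'
swapped = trimmed.replace("Amazing", "Powerful")
words = lowered.split()               # split on runs of whitespace
rejoined = "_".join(words)            # glue the pieces back with '_'

print(words)     # ['bangla', 'nlp', 'is', 'amazing!']
print(rejoined)  # bangla_nlp_is_amazing!
```

Note that split() and join() are inverses of a sort: split() breaks a string into a list, and join() assembles a list back into a string.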

Step 2: String Slicing

text[0:5], text[-3:]: extracting substrings by index.

Step 3: Regex Patterns

Pattern matching with the re module: \d, \w, +, *, [a-z].
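To make the pattern tokens above concrete, here is a small sketch that applies each one to a made-up sample sentence. The sample text and variable names are assumptions chosen for illustration.

```python
import re

text = "Order 66 shipped to user_42 on 2025-01-10"

digits = re.findall(r"\d+", text)          # runs of one or more digits
words = re.findall(r"\w+", text)           # letters, digits, and underscore
lower_runs = re.findall(r"[a-z]+", text)   # runs of lowercase ASCII letters
star = re.findall(r"ab*", "a ab abbb b")   # 'a' followed by zero or more 'b'

print(digits)      # ['66', '42', '2025', '01', '10']
print(lower_runs)  # ['rder', 'shipped', 'to', 'user', 'on']
print(star)        # ['a', 'ab', 'abbb']
```

The `lower_runs` result shows a common surprise: `[a-z]` skips the capital "O" in "Order", so only "rder" is matched.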

Step 4: re.findall / re.sub

Find every occurrence of a pattern, or replace it.
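A short sketch of both functions on a made-up social media post: re.findall collects every match into a list, while re.sub returns a new string with matches replaced. The sample post and the hashtag and mention patterns are illustrative assumptions.

```python
import re

post = "Loving   #NLP and #Python!   Follow @sadiq   for more."

hashtags = re.findall(r"#\w+", post)           # every hashtag in the post
no_mentions = re.sub(r"@\w+", "", post)        # delete @mentions
normalized = re.sub(r"\s+", " ", no_mentions).strip()  # collapse whitespace

print(hashtags)    # ['#NLP', '#Python']
print(normalized)  # Loving #NLP and #Python! Follow for more.
```

Removing text usually leaves gaps behind, which is why a whitespace-collapsing re.sub is often the last step of a cleaning pass.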

Step 5: File Read/Write

Safe file handling with the with open() context manager.
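The round trip can be sketched as follows: write a few lines, then read them back one line at a time. The file name `sentences.txt` and the sample sentences are hypothetical, and a temporary directory is used only so the sketch leaves no files behind.

```python
import os
import tempfile

sentences = ["আমি বাংলায় গান গাই।", "NLP is fun."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentences.txt")  # hypothetical file name

    # Write one sentence per line.
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")

    # Iterating over the file object yields one line at a time,
    # which avoids loading a large file into memory all at once.
    with open(path, "r", encoding="utf-8") as f:
        loaded = [line.rstrip("\n") for line in f]

print(len(loaded), "lines read back")
```

For small files f.read() is fine; iterating line by line is the safer habit once files grow large.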

Python Code

import re

# 1. String operations
text = "  Bangla NLP is Amazing!  "
print(text.strip())                  # remove leading/trailing whitespace
print(text.lower())                  # lowercase
print(text.replace("Amazing", "Powerful"))  # substitute a substring
print(text.split())                  # split on whitespace (naive tokenization)

# 2. String slicing
sentence = "Natural Language Processing"
print(sentence[:7])                  # 'Natural'
print(sentence[-10:])                # 'Processing'

# 3. Regex: find all emails in a text
sample = "Contact: sadiq@example.com or admin@nlp.io for help."
emails = re.findall(r"[\w\.-]+@[\w\.-]+", sample)
print("Emails:", emails)

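# 4. Regex: clean text, keeping only ASCII letters and whitespace
dirty = "Hello!!! 1234 World??? #NLP @AI"
clean = re.sub(r"[^a-zA-Z\s]", "", dirty)
clean = re.sub(r"\s+", " ", clean)   # collapse the gaps left by removed characters
print("Clean:", clean.strip())       # Clean: Hello World NLP AI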

# 5. File handling (read a text file safely)
# with open("data.txt", "r", encoding="utf-8") as f:
#     content = f.read()
#     print("Total chars:", len(content))

# Writing a file
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("My first NLP output\n")
    f.write("Bangla NLP is the future.")

Explanation

This code shows all three skills together. String operations clean up the text. Regex (the re module) finds email patterns and removes unwanted characters. For file handling, always use `with open()`, because it closes the file automatically, even when an error occurs. `encoding="utf-8"` is essential, especially for Bangla text; without it Python falls back to the platform's default encoding, which can produce garbage characters.
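The claim that `with open()` closes the file even on error can be checked with a small sketch. The file name `demo.txt` and the simulated failure are assumptions for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write("partial output\n")
            raise ValueError("simulated failure mid-write")
    except ValueError:
        pass

    # The with block has already closed the file, despite the exception.
    closed_after_error = f.closed

print("File closed after error:", closed_after_error)
```

Without the context manager, the same guarantee would need an explicit try/finally around f.close().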

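Common Mistakes

Not specifying an encoding when reading a Bangla file: Python then uses the platform default, which can raise UnicodeDecodeError or produce garbled text.
Not using a raw string (r"...") for regex patterns: backslashes get interpreted by Python before the regex engine sees them.
Forgetting file.close(): using `with open()` removes this problem.
Trying to memorize everything when learning regex: knowing 5 to 7 basic patterns covers about 90% of the work.
Replacing text in a hand-written loop instead of chaining str.replace() calls: the loop is slower and harder to read.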
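Two of the mistakes above are easy to reproduce. The sketch below uses made-up sample strings: the first part shows what happens to `\b` without a raw string, and the second decodes UTF-8 Bangla bytes with the wrong codec (ASCII is used here as a stand-in for a mismatched platform default).

```python
import re

text = "cat concat cat"

# Without the r prefix, "\b" is a backspace character, so nothing matches.
no_raw = re.findall("\bcat\b", text)
# With a raw string, \b reaches the regex engine as a word boundary.
with_raw = re.findall(r"\bcat\b", text)

print(no_raw)    # []
print(with_raw)  # ['cat', 'cat']

# Bangla text encoded as UTF-8, then decoded with the wrong codec.
data = "বাংলা".encode("utf-8")
try:
    data.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print("Decoding failed:", decode_failed)
```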

Exercises

Take the string 'Bangla AI 2025 #NLP @Sadiq'. Keep only the letters and remove everything else.
Use regex to extract every phone number (11 digits) from a text.
Create a .txt file and write 10 Bangla sentences in it. Read it with Python and print each sentence on its own line.
Memorize these patterns: \d (digit), \w (word character), \s (whitespace), + (one or more), * (zero or more), [abc] (character set).

Mini Project

Mini Project: Text Cleaner Utility

Write a reusable function `clean_text(text)` that: (1) lowercases the text, (2) removes extra whitespace, (3) removes URLs (using regex), (4) removes email addresses, (5) removes special characters, and (6) returns the cleaned text. Then use it to clean every line of a .txt file and save the result to a new file. This function will come in handy in every later NLP project.
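One possible starting point for the function is sketched below; it is not the only reasonable design. The URL and email patterns are deliberately simple assumptions, and "special character" is taken here to mean anything other than ASCII letters, digits, and whitespace, so Bangla text would need its own Unicode range (for example \u0980-\u09FF) added to the allowed set. Whitespace is collapsed last because the earlier removals leave gaps behind.

```python
import re

# Simple illustrative patterns; real-world URLs and emails can be messier.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EMAIL_PATTERN = re.compile(r"[\w.-]+@[\w.-]+")
SPECIAL_PATTERN = re.compile(r"[^a-z0-9\s]")   # assumes lowercased ASCII text
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Return a lowercased copy of text without URLs, emails, or symbols."""
    text = text.lower()                       # (1) lowercase
    text = URL_PATTERN.sub(" ", text)         # (3) remove URLs
    text = EMAIL_PATTERN.sub(" ", text)       # (4) remove email addresses
    text = SPECIAL_PATTERN.sub(" ", text)     # (5) remove special characters
    text = WHITESPACE_PATTERN.sub(" ", text)  # (2) collapse extra whitespace
    return text.strip()                       # (6) return the cleaned text


sample = "  Visit https://example.com NOW!!  Mail: admin@nlp.io  #NLP  "
print(clean_text(sample))  # visit now mail nlp
```

URLs and emails are removed before special characters on purpose: stripping punctuation first would break them into fragments that the patterns could no longer recognize.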

Summary

Python's three NLP-essential skills: strings, regex, and file handling.
Regex is the superpower for searching and replacing text by pattern.
For file handling, always use `with open()` together with `encoding="utf-8"`.
Write and keep small utility functions (such as clean_text); they save a lot of time in future projects.
Phase 0 complete! Next we move on to real NLP techniques, starting with tokenization.
