Implementing natural language processing (NLP) in Python typically means combining a few specialized libraries and tools. Below are the key steps with example code:
1. Install common libraries
pip install nltk spacy scikit-learn pandas numpy
python -m spacy download en_core_web_sm  # English model
# For Chinese: python -m spacy download zh_core_web_sm
2. Basic text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import spacy
# Download required resources (first run only)
nltk.download('punkt')
nltk.download('punkt_tab')  # also needed by word_tokenize on newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
# Example text
text = "Natural Language Processing is amazing! Let's learn how to use it."
# Tokenization (English)
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Filtered:", filtered_tokens)
# Lemmatization (spaCy is more efficient than NLTK's WordNetLemmatizer imported above)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
lemmatized = [token.lemma_ for token in doc]
print("Lemmatized:", lemmatized)
3. Feature extraction
Bag-of-Words / TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
"I love NLP.",
"NLP makes machines understand text.",
"Python is great for NLP."
]
# Bag-of-Words model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Bag-of-Words:\n", X.toarray())
# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF:\n", X_tfidf.toarray())
Word embeddings
# Use a pretrained model (spaCy)
# Note: en_core_web_sm ships no real pretrained word vectors; prefer
# en_core_web_md or en_core_web_lg when vector quality matters.
nlp = spacy.load("en_core_web_sm")
doc = nlp("apple orange banana")
for token in doc:
    print(token.text, "-> Vector shape:", token.vector.shape)
# Train Word2Vec with Gensim
from gensim.models import Word2Vec
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print("Cat vector:", model.wv["cat"])
4. Model building
Text classification (traditional approach)
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
# Example data (labels: 0 = negative, 1 = positive)
texts = ["bad movie", "great film", "terrible acting", "awesome story"]
labels = [0, 1, 0, 1]
# Feature extraction
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
# Train the classifier (toy data: with 4 samples, the test set holds a single example)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
clf = MultinomialNB()
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
Deep learning (LSTM / Transformers)
import tensorflow as tf
from transformers import pipeline
# Use a Hugging Face pretrained model (sentiment analysis)
classifier = pipeline("sentiment-analysis")
result = classifier("I love this tutorial!")
print(result)  # Output: [{'label': 'POSITIVE', 'score': 0.9998}]
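# Note: pipeline() with no model argument downloads a default checkpoint on
# first run; pinning a model keeps results reproducible. The name below is
# believed to be the English sentiment default (verify for your transformers version):
# classifier = pipeline("sentiment-analysis",
#                       model="distilbert-base-uncased-finetuned-sst-2-english")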
# Custom LSTM model (example)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
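The model above compiles but is untrained. A minimal smoke test with random data (the shapes are assumptions for illustration; real inputs would come from a tokenizer plus padding):
import numpy as np
# Fake batch: 32 sequences of 50 token IDs drawn from the 10,000-word vocabulary
X_dummy = np.random.randint(0, 10000, size=(32, 50))
y_dummy = np.random.randint(0, 2, size=(32,))
model.fit(X_dummy, y_dummy, epochs=1, verbose=0)
model.summary()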
5. Typical application scenarios
- Sentiment analysis: determine the emotional polarity of a text.
- Text classification: news categorization, spam detection.
- Named entity recognition (NER):
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking to buy a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g. Apple -> ORG, U.K. -> GPE, $1 billion -> MONEY
Key libraries
- NLTK: classic NLP toolkit (well suited to teaching and prototyping).
- spaCy: fast, production-grade NLP library.
- Transformers (Hugging Face): pretrained models (BERT, GPT, etc.).
- Gensim: topic modeling and word embeddings.
- Scikit-learn: traditional machine-learning models.
Depending on task complexity, you can pick anything from traditional methods up to deep learning models. For Chinese text, use jieba or THULAC for word segmentation and swap in the corresponding Chinese pretrained models, as in the sketch below.
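A minimal segmentation sketch, assuming jieba is installed (pip install jieba); the sample sentence is illustrative:
import jieba

text_zh = "自然语言处理让机器理解人类语言。"  # "NLP lets machines understand human language."
tokens_zh = jieba.lcut(text_zh)  # lcut returns a plain Python list of segmented words
print(tokens_zh)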