【NLP】主题模型/关键词提取

任务：给出一段文字，提取其关键字。

基于 TF-IDF 的关键词提取

一个很直观的判断：一段文字中，哪个词越多，这个词就是关键词。

问题：高频词不等于关键词

很可能 “的” 是高频的
既是去除这些无意义词，也可能有问题。例如某个体育网站在奥运会期间，可能所有的文章中，“奥运会”都是高频词。

以上两个问题用 TF-IDF 解决。

TextRank 关键词提取

解决问题：我们没有海量语料如何办？

我们知道 PageRank，根据链接，形成一个图。然后根据这个图，计算每个页面的重要程度。
文本复用类似的思路。一个词是一个“页面”，一个 2-gram 构成一个 “链接”。然后得到一个 Graph，进而计算得到每个词的重要程度。

代码

LDA 主题模型（效果不很好）

LDA（Latent Dirichlet Allocation，潜在狄利克雷分配）是一个 2002 年提出的主题提取的模型。区别于另一个LDA(Linear Discrimination Analysis) 是一种有监督降维模型。

算法

假设有 D 篇文章，一共涉及了 K 个主题
每篇文章（长度为 $N_D$ ）都有各自的主题分布，主题分布是多项分布，该多项分布的参数服从 Dirichlet 分布，该 Dirichlet 分布的参数为 $\alpha$
每个主题都有各自的词分布，词分布为多项分布，该多项分布的参数服从 Dirichlet 分布，该 Dirichlet 分布的参数为 $\beta$
对于某篇文章中的第 n 个词，首先从该文章的主题分布中采样一个主题，然后在这个主题对应的词分布中采样一个词。不断重复这个随机生成过程，直到 D 篇文章全部完成上述过程。

缺点

主题个数 K 是个超参数，需要事先指定
使用 bag-of-words 表示文本，忽略了词的顺序和深层次的语义，表征能力有限
stopwords/stemming/lemmatization 都要处理，这也是 bag-of-words 带来的缺点

step1：载入数据 + 分词 + 清洗

# %% 载入停用词
with open('data/hit_stopwords.txt') as f:
    stopwords = {line.strip() for line in f.readlines()} | {'\n'}
with open('THUCNews/data/train.txt', 'r') as f:
    content = f.readlines()

documents_raw = [i.split('\t')[0] for i in content]
Y_true = [int(i.split('\t')[1].strip()) for i in content]

# %% 清洗
import jieba
import re

documents = [jieba.lcut(document, cut_all=False) for document in documents_raw]  # 拆词
regex = re.compile(u'[\u4e00-\u9fa5]')  # 仅保留中文，(剔除 数字、英文)
documents = [[word for word in document if regex.match(word) is not None] for document in documents]
documents = [[word for word in document if word not in stopwords] for document in documents]  # 停词
documents = [' '.join(document) for document in documents]

step2:做 CountVectorizer

# %% CountVectorizer
from sklearn.feature_extraction import text

count_vectorizer = text.CountVectorizer(max_df=0.95, min_df=2, token_pattern=r'\b\S+\b')
X_cnt_vec = count_vectorizer.fit_transform(documents)

step2:做 LDA

from sklearn.decomposition import LatentDirichletAllocation

# 手动定义超参数：划分为多少个主题
n_components = 10

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(X_cnt_vec)
Y_prob = lda.transform(X_cnt_vec)

# lda.get_feature_names_out()

step3:看效果（打印 top words）

def print_top_words(model, feature_names, n_top_words):
    msg_top_words = ''

    for topic_idx, topic in enumerate(model.components_):
        msg_top_words += "\nTopic #{}:\n".format(topic_idx)
        top_idx = topic.argsort()[:-n_top_words - 1:-1]
        sum_ = topic.sum()
        msg_top_words += ' + '.join("{} * {:.3f}".format(feature_names[i], topic[i] / sum_) for i in top_idx)

    return msg_top_words


tf_feature_names = count_vectorizer.get_feature_names_out()
msg_top_words = print_top_words(lda, tf_feature_names, n_top_words=8)

print(msg_top_words)

打印结果类似：

Topic #0:
月 * 0.035 + 基金 * 0.034 + 日 * 0.030 + 市场 * 0.016 + 公司 * 0.012 + 亿 * 0.010 + 中国 * 0.010 + 投资 * 0.009
Topic #1:
龟缩 * 0.021 + 龟 * 0.019 + 龚金晶 * 0.009 + 龚蓓 * 0.008 + 龚松林 * 0.007 + 龚方雄 * 0.007 + 龚峤 * 0.007 + 龚如心 * 0.007
Topic #2:
年 * 0.053 + 新 * 0.025 + 高考 * 0.016 + 考研 * 0.012 + 招生 * 0.009 + 考生 * 0.009 + 网游 * 0.009 + 世界 * 0.008
Topic #3:
获 * 0.010 + 基金 * 0.009 + 曝 * 0.009 + 快讯 * 0.008 + 有望 * 0.007 + 发行 * 0.006 + 行情 * 0.006 + 指数 * 0.006
Topic #4:
图 * 0.037 + 元 * 0.022 + 月 * 0.017 + 均价 * 0.015 + 居 * 0.015 + 开盘 * 0.014 + 折 * 0.014 + 平起 * 0.013
Topic #5:
图 * 0.013 + 前 * 0.011 + 称 * 0.010 + 英国 * 0.008 + 做 * 0.008 + 调查 * 0.008 + 欲 * 0.007 + 宣布 * 0.007
Topic #6:
中 * 0.020 + 元 * 0.014 + 图文 * 0.009 + 万元 * 0.009 + 索尼 * 0.007 + 网上 * 0.007 + 佳能 * 0.006 + 价格 * 0.006
Topic #7:
男子 * 0.023 + 图 * 0.020 + 组图 * 0.019 + 公布 * 0.015 + 岁 * 0.013 + 沪 * 0.012 + 名 * 0.010 + 死亡 * 0.010
Topic #8:
称 * 0.025 + 美国 * 0.015 + 手机 * 0.010 + 计划 * 0.010 + 指 * 0.008 + 高校 * 0.007 + 亿美元 * 0.006 + 达 * 0.006
Topic #9:
中国 * 0.021 + 万 * 0.017 + 图 * 0.016 + 组图 * 0.012 + 日本 * 0.011 + 美 * 0.009 + 上市 * 0.009 + 反弹 * 0.009

指标检验

from sklearn import metrics

# 由于语料是带标签的，因此可以用 ARI 检验“聚类结果”
print('ARI:', metrics.adjusted_rand_score(Y_true, Y_prob.argmax(axis=1)))

# 还可以看一看混淆矩阵
print(metrics.confusion_matrix(Y_true, Y_prob.argmax(axis=1)))

一些参数

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(X_cnt_vec)


lda.components_array # array, [n_components, n_features]，意义：components_[i, j]值得是 topic i和 word j 的关系强度。

n_components: 主题个数
doc_topic_prior:先验Dirichlet分布的参数$\theta$，默认1/n_components
topic_word_prior:先验Dirichlet分布的参数$\beta$，默认1/n_components
learning_method: 即LDA的求解算法。有 ‘batch’ 和 ‘online’两种选择。 ‘batch’即我们在原理篇讲的变分推断EM算法，而”online”即在线变分推断EM算法，在”batch”的基础上引入了分步训练，将训练样本分批，逐步一批批的用样本更新主题词分布的算法。默认是”batch”。如果数据量少，用”batch”，需要调参少。如果数据量多，用 “online” ，速度较快。
learning_decay：仅仅在算法使用”online”时有意义，取值最好在(0.5, 1.0]
learning_offset：仅仅在算法使用”online”时有意义，取值要大于1。用来减小前面训练样本批次对最终模型的影响。
max_iter ：EM算法的最大迭代次数。
batch_size: 仅仅在算法使用”online”时有意义，即每次EM算法迭代时使用的文档样本的数量。 evaluate_every

代码官方文档。 https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

理论篇 https://zhuanlan.zhihu.com/p/31470216

Top2Vec

《Top2vec: Distributed representations of topics》

算法步骤

使用 doc2vec，把文档映射到一个向量空间。类似 word2vec 中的 skip-gram，是比较浅的网络
降维：使用 UMAP
- 降维的目的是避免维度灾难，提高后续聚类算法的效果和效率
- UMAP 18年提出，是 非线性流形降维。由于 embedding 空间通常有着各向异性的问题，用 PCA 这种线性降维效果是不适合的，而 UMAP 相对于 t-SNE，能更好保留原空间的局部和全局结构，且计算效率更好
聚类：使用 HDBSCAN
- 密度聚类优势能天然挖掘异常点。从业务上看，有一些文档就是不属于某一个主题的，强行划分会影响效果
- 同样是因为 embedding 空间的各向异性，并不适合用 K-means 这种 centroid-based 的聚类方案
- HDBSCAN 相比于 DBSCAN，不用定义领域半径 eps 和密度阈值 MinPt 这两个刺手的参数
- 关于 HDBSCAN 的深入理解，可以看这篇 How HDBSCAN Works
聚类得到的每个簇为一个主题，对同一个簇的所有文档向量求平均，得到该主题的向量
每个主题离该主题向量最近的 N 个词向量对应的词来作为主题表示
- 不需要像 LDA 一样做停用词，常用词停用词通常分布在不同的主题向量中间，不会跟某一个具体的主题很近

优点

通过 HDBSCAN 自动发现主题数
不需要停用词表
不需要 stemming/lemmatization，因为这里是用 embedding 来表示一个文档，而不是用 bags-of-words
相比于 bags-of-words，用 embedding 无疑对文本的表征能力更强
把词、文档、主题都用同一个 embedding 空间表示，能实现一些实用的功能。如搜索一个词的相似词、一个文档的相似文档、输入主题搜索最相关的文档、输入词搜于文档等，由于有向量，这些搜索都能相当丝滑

缺点

聚类选择的是 HDBSCAN 这种 density-based 的聚类方法，但是在求主题向量时，是从 centroid-based 的角度，即是通过同一簇下的向量求平均得到主题向量，这会导致得到的主题向量是不准确的，从而造成主题表示是不准确的。
实际上，作者在论文里，有实验过几种方法，但 average embedding 是最简单和有效的。
项目引用了一些比较小众的代码库，部署时有风险

代码

项目地址：https://github.com/ddangelov/Top2Vec

import jieba


def jieba_cut_remove_stopwords(text):
    """结巴分词，并去除停用词
    """
    words = jieba.lcut(text)
    results = []
    for word in words:
        if word not in stopwords:
            results.append(word)
    return results


# %%
from top2vec import Top2Vec

model = Top2Vec(documents=documents_raw,
                embedding_model='doc2vec',
                speed="learn",
                workers=8,
                tokenizer=jieba_cut_remove_stopwords)

查看模型情况

# %%查看主题数，查看数量最高的前N个主题的主题词，主题词得分（是通过向量相似度求得的）
print(f"主题数：{model.get_num_topics()}")

# 文档数量最多的 N 个主题的主题词，主题词得分，主题下的文档数
topic_words, word_scores, topic_nums = model.get_topics(2)
print(f"前2主题的主题词：{topic_words}")
print(f"前2主题的主题词分布：{word_scores}")

# %%查看离某个主题最近的N篇文档
documents, document_scores, document_ids = model.search_documents_by_topic(
    topic_num=5, num_docs=4)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

使用模型

# %%返回最接近的2个主题
sentence = '阿朗与中国移动运营商签署7.5亿欧元合作协议'
keywords = list(set(jieba_cut_remove_stopwords(sentence)) & set(model.vocab))
model.search_topics(
    keywords
    , num_topics=2)

# %% 返回最接近的5个文档

documents, document_scores, document_ids = model.search_documents_by_keywords(
    keywords=keywords, num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

保存和加载模型

# 保存
model.save('top2vec.model')

# 加载：
model = Top2Vec.load('top2vec.model')

BertTopic

为了克服 Top2Vec 的缺点，BertTopic 并不是把文档和词都嵌入到同一个空间，而是单独对文档进行 embedding 编码，然后同样过降维和聚类，得到不同的主题。但在寻找主题表示时，是把同一个主题下的所有文档看成一个大文档，然后通过 c-TF-IDF 最高的 N 个词作为该主题表示。简单点说，BerTopic 寻找主题表示时用的是 bags-of-words。

算法流程

把文档映射到 embedding 空间
- 这里可以用预训练模型，像 BERT、RoBERTa，或者经过 fine tuning 的预训练模型进行文档编码，相对于 Top2Vec 用的Doc2Vec，预训练模型无疑表征能力更强
降维。同 Top2Vec，用的 UMAP
聚类。同 Top2Vec，用的 HDBSCAN
主题表示。如上面所提，把属于同一个 topic 的所有文档看成一个大文档。然后算 c-TF-IDF

优点

用上了预训练模型，表征能力强
把文档 embedding 编码跟主题表示分开来，更加灵活。embedding 这步是不需要去除停用词表这些操作的，而主题表示这一步由于用的是 bags-of-words，还是需要处理 stopwords/stemming/lemmatization 的，而且可以试验 n-gram 等方式，由于跟文档编码分开了，所以不需要重新计算 embedding

劣势

每个文档只分到一个topic，但实际情况下，部分文档涉及多个主题
主题表示那步用的 bags-of-words，可能会导致很多冗余词，想想假如某个主题表示是【好，棒，优秀，真好…】这些词，这就是冗余。这个问题其实 Top2Vec 也会出现，但论文里也提到了一种 maximal marginal relevance 的算法能减少主题表示的冗余度

总结

但在具体场景，BerTopic 未必就是最好，例如真实场景遇到的数据，跟用的预训练模型不搭，最后效果就会大打折扣，这时候还不如用 LDA 这种方式来得更好。

https://maartengr.github.io/BERTopic/index.html

参考资料

https://zhuanlan.zhihu.com/p/587025298：LDA、Top2Vec、BerTopic