Natural Language Processing (NLP): Text Sentiment Analysis with LSTM

Introduction to Sentiment Analysis

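Text sentiment analysis is a common application of natural language processing (NLP) and an interesting basic task, in particular as a classification problem whose goal is to extract the emotional content of a text. It is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional coloring.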

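This article covers sentiment polarity (orientation) analysis, that is, judging whether a text is positive, negative, or neutral. In most application scenarios only two classes are used. For example, the words “喜爱” (like) and “厌恶” (dislike) carry opposite sentiment orientations.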

The rest of the article shows in detail how to implement text sentiment analysis with an LSTM, a deep learning model.

Dataset and Corpus Analysis

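The corpus (corpus.csv) consists of the reviews of a single product on an e-commerce website. The dataset can be downloaded from: /renjunxiang/Text-Classification/blob/master/TextClassification/data/data_single.csv . It contains 4310 reviews in total, and each review is labelled with one of two sentiment classes, “正面” (positive) or “负面” (negative). The first few rows of the dataset are: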

evaluation,label
用了一段时间，感觉还不错，可以,正面
电视非常好，已经是家里的第二台了。第一天下单，第二天就到本地了，可是物流的人说车坏了，一直催，客服也帮着催，到第三天下午5点才送过来。父母年纪大了，买个大电视画面清晰，趁着耳朵还好使，享受几年。,正面
电视比想象中的大好多，画面也很清晰，系统很智能，更多功能还在摸索中,正面
不错,正面
用了这么多天了，感觉还不错。夏普的牌子还是比较可靠。希望以后比较耐用，现在是考量质量的时候。,正面
物流速度很快，非常棒，今天就看了电视，非常清晰，非常流畅，一次非常完美的购物体验,正面
非常好，客服还特意打电话做回访,正面
物流小哥不错，辛苦了，东西还没用,正面
送货速度快，质量有保障，活动价格挺好的。希望用的久，不出问题。,正面
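As a quick check on the file layout shown above, the following minimal sketch parses a few of the sample rows with Python's standard csv module and counts the labels. The inline SAMPLE string and the helper read_reviews are assumptions made for illustration so that the snippet is self-contained; for the real data one would pass an open handle on data_single.csv instead.

```python
import csv
import io
from collections import Counter

# A few rows copied from the sample above; this stands in for data_single.csv.
SAMPLE = """evaluation,label
用了一段时间，感觉还不错，可以,正面
不错,正面
非常好，客服还特意打电话做回访,正面
"""


def read_reviews(handle):
    """Return a list of (evaluation, label) tuples read from a CSV file handle."""
    reader = csv.DictReader(handle)
    return [(row["evaluation"], row["label"]) for row in reader]


rows = read_reviews(io.StringIO(SAMPLE))
label_counts = Counter(label for _, label in rows)
print(len(rows), dict(label_counts))
```

In the rows used here, the commas inside the review text are full-width (，), so they are not confused with the ASCII comma that separates the two columns.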

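Next, we run a simple analysis of the corpus, covering:

the distribution of sentiment labels in the dataset;
the distribution of review lengths (measured in characters).

The following Python script computes both the label distribution and the distribution of review lengths.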

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager
from itertools import accumulate

# Font used by matplotlib to render the Chinese labels (Windows path; adjust on other systems)
my_font = font_manager.FontProperties(fname=r'C:\Windows\Fonts\simfang.ttf')

# Count how often each sentence length occurs
df = pd.read_csv('./data_single.csv')
print(df.groupby('label')['label'].count())

df['length'] = df['evaluation'].apply(lambda x: len(x))
len_df = df.groupby('length').count()
sent_length = len_df.index.tolist()
sent_freq = len_df['evaluation'].tolist()

# Bar chart of sentence length vs. frequency
plt.bar(sent_length, sent_freq)
plt.title("句子长度及出现频数统计图", fontproperties=my_font)
plt.xlabel("句子长度", fontproperties=my_font)
plt.ylabel("句子长度出现的频数", fontproperties=my_font)
plt.savefig("./句子长度及出现频数统计图.png")
plt.close()

# Cumulative distribution function (CDF) of the sentence length
total = sum(sent_freq)
sent_pentage_list = [(count / total) for count in accumulate(sent_freq)]

# Find the sentence length at the given quantile.
# ">=" is used instead of "==" so that a length is always found,
# even if no cumulative frequency rounds to exactly the quantile.
quantile = 0.91
for length, per in zip(sent_length, sent_pentage_list):
    if round(per, 2) >= quantile:
        index = length
        break
print('\n分位点为%s的句子长度：%d' % (quantile, index))

# Plot the CDF and mark the quantile with dashed lines
plt.plot(sent_length, sent_pentage_list)
plt.hlines(quantile, 0, index, colors="c", linestyles="dashed")
plt.vlines(index, 0, quantile, colors="c", linestyles="dashed")
plt.text(0, quantile, str(quantile))
plt.text(index, 0, str(index))
plt.title("句子长度累积分布函数图", fontproperties=my_font)
plt.xlabel("句子长度", fontproperties=my_font)
plt.ylabel("句子长度累积频率", fontproperties=my_font)
plt.savefig("./句子长度累积分布函数图.png")
plt.close()

Output:

label
正面    1908
负面    2375
Name: label, dtype: int64

分位点为0.91的句子长度：183

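As the output shows, the positive and negative classes are roughly balanced. The chart of sentence length versus frequency is shown below: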

The cumulative distribution function of the sentence length is shown below:

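As the chart shows, most samples have a length between 1 and 200 characters; at the 0.91 quantile of the cumulative frequency, the sentence length is about 183.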
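The quantile lookup described above can also be written as a small standalone function. The sketch below is a variant rather than the script's exact logic: it returns the smallest length whose cumulative frequency reaches the quantile, using >= instead of comparing rounded values for equality. The name length_at_quantile and the toy list of lengths are assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate


def length_at_quantile(lengths, quantile):
    """Return the smallest length whose cumulative frequency reaches `quantile`."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    freq = Counter(lengths)
    total = len(lengths)
    sorted_lengths = sorted(freq)
    # Running count of samples with length <= the current length.
    cumulative = accumulate(freq[n] for n in sorted_lengths)
    for length, running in zip(sorted_lengths, cumulative):
        if running / total >= quantile:
            return length
    return sorted_lengths[-1]


# Toy data, not taken from the corpus.
toy_lengths = [2, 3, 3, 5, 8, 8, 8, 13, 21, 40]
print(length_at_quantile(toy_lengths, 0.9))
```

A length chosen this way is a natural candidate for the padded input length of the model, since only the small fraction of reviews above the quantile would need truncation.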

Using the LSTM Model

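Next, we use an LSTM, a deep learning model, to perform sentiment analysis on the dataset above. The architecture of the model I implemented is as follows: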

The complete Python code is as follows:

import pickle
import numpy as np
import pandas as pd
from keras.utils import to_categorical, plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Dense, Embedding, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Load the data.
# In the file, the feature column is `evaluation` and the class column is `label`.
def load_data(filepath, input_shape=20):
    df = pd.read_csv(filepath)

    # Labels and the distinct review texts
    labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

    # Build character-level features: the vocabulary is the set of all characters.
    # Sorting keeps the character indices stable from one run to the next.
    vocabulary = sorted(set(''.join(vocabulary)))

    # Character/index and label/index dictionaries (index 0 is reserved for padding)
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    with open('word_dict.pk', 'wb') as f:
        pickle.dump(word_dictionary, f)
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i for i, label in enumerate(labels)}
    with open('label_dict.pk', 'wb') as f:
        pickle.dump(label_dictionary, f)
    output_dictionary = {i: label for i, label in enumerate(labels)}

    vocab_size = len(word_dictionary)    # vocabulary size
    label_size = len(label_dictionary)   # number of classes

    # Pad every sequence to input_shape; shorter sequences are filled with 0
    x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = to_categorical([label_dictionary[label] for label in df['label']],
                       num_classes=label_size)

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary


# Build the deep learning model: Embedding + LSTM + Softmax.
def create_LSTM(n_units, input_shape, output_dim, filepath):
    # Pass input_shape through so the data is padded to the same length as in training
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = \
        load_data(filepath, input_shape)
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(Dropout(0.2))
    model.add(Dense(label_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    plot_model(model, to_file='./model_lstm.png', show_shapes=True)
    model.summary()

    return model


# Train the model
def model_train(input_shape, filepath, model_save_path):
    # Split the data into training and test sets with a 9:1 ratio
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = \
        load_data(filepath, input_shape)
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.1, random_state=42)

    # Model hyperparameters; tune them as needed
    n_units = 100
    batch_size = 32
    epochs = 5
    output_dim = 20

    # Training
    lstm_model = create_LSTM(n_units, input_shape, output_dim, filepath)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # Save the model
    lstm_model.save(model_save_path)

    N = test_x.shape[0]  # number of test samples
    predict = []
    label = []
    for start in range(N):
        end = start + 1
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        label_predict = output_dictionary[np.argmax(y_predict[0])]
        label_true = output_dictionary[np.argmax(test_y[start])]
        print(''.join(sentence), label_true, label_predict)  # print the prediction
        predict.append(label_predict)
        label.append(label_true)

    acc = accuracy_score(label, predict)  # accuracy on the test set
    print('模型在测试集上的准确率为: %s.' % acc)


if __name__ == '__main__':
    filepath = './data_single.csv'
    input_shape = 180
    model_save_path = './corpus_model.h5'
    model_train(input_shape, filepath, model_save_path)
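To make the character-level encoding and padding step easier to follow, here is a minimal pure-Python sketch of it. It is an imitation written for illustration, not the Keras function: pad_post mimics pad_sequences with padding='post' and the default truncating='pre', and build_char_index is a hypothetical helper that sorts the characters so the mapping is reproducible. Both names are assumptions.

```python
def build_char_index(texts):
    """Map every distinct character to an integer id, starting at 1 (0 is padding)."""
    chars = sorted(set("".join(texts)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def pad_post(sequences, maxlen, value=0):
    """Pad at the end; sequences longer than maxlen keep only their LAST maxlen items."""
    if maxlen < 1:
        raise ValueError("maxlen must be at least 1")
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncation removes items from the front
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


texts = ["不错", "非常好"]
char_index = build_char_index(texts)
encoded = [[char_index[ch] for ch in text] for text in texts]
print(pad_post(encoded, maxlen=4))
print(pad_post(encoded, maxlen=2))
```

Because truncation drops characters from the front, a review longer than the input length is fed to the model without its beginning, and when the test sentences are decoded back from their ids those reviews are printed starting in mid-sentence.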

The model above was trained for 5 epochs with a 9:1 split between training and test sets. The output is:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 180, 20)           43100
_________________________________________________________________
lstm (LSTM)                  (None, 100)               48400
_________________________________________________________________
dropout (Dropout)            (None, 100)               0
_________________________________________________________________
dense (Dense)                (None, 2)                 202
=================================================================
Total params: 91,702
Trainable params: 91,702
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
121/121 [==============================] - 15s 94ms/step - loss: 0.5719 - accuracy: 0.6683
Epoch 2/5
121/121 [==============================] - 8s 65ms/step - loss: 0.2164 - accuracy: 0.9286
Epoch 3/5
121/121 [==============================] - 7s 58ms/step - loss: 0.1884 - accuracy: 0.9385
Epoch 4/5
121/121 [==============================] - 7s 57ms/step - loss: 0.1435 - accuracy: 0.9590
Epoch 5/5
121/121 [==============================] - 7s 57ms/step - loss: 0.1161 - accuracy: 0.9646
硬件一般，但是软件很棒， 负面 负面
客服态度好。电视还没有开始用，还不知道效果。用了再评价 正面 负面
对的起这样的价钱支持京东想要下单的亲可以放心下单加油语音遥控器没有希望京东送一个谢谢 负面 正面
非常差，8月8日1元预购的电视礼包，说不发就不发了，真是非常差劲，真后悔在这家店买东西，大家不要再来了。 负面 负面
京东物流慢了些，本来应该昨天送到的，结果今天才送到。电视还可以，稍微有点延迟，性价比很高。 负面 负面
后还选择了这个创维一些国产的品牌，但是仔细参考参数之后还是做出了一个大胆的选择，选择的微鲸，使用了，之后的感觉非常的不错，老婆也非常的喜欢，感觉比那个乐视的话还是有一定的优势。可惜的话就是会员不够多，然后的话价格稍微贵了一点，现在可能那个平板液晶平板这一块也涨价了，所以说这个也情有可原。喜欢的可以大胆下手了，不会失望的，至少比创维，海信要好多了…性价比更高！ 负面 负面
卧室用的，画面挺清晰，但是不能离近看，否则颜色会很诡异，安装师傅挺好的 正面 正面
............
电视机一般，低端机不要求那么高咯。 负面 负面
很好，两点下单上午就到了，服务很好。 正面 正面
帮朋友买的，好好好好好好好好 正面 正面
模型在测试集上的准确率为: 0.9230769230769231.
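The numbers in the model summary and the training log can be checked by hand. The sketch below recomputes the parameter count of each layer and the number of batches per epoch. Some inputs are inferred rather than stated in the article: the embedding input size 2155 is derived from 43100 / 20, and the row count 4283 is the sum of the two label counts printed earlier. The split arithmetic assumes the test set size is rounded up. All function names are hypothetical.

```python
import math


def embedding_params(input_dim, output_dim):
    """One output_dim-sized vector per index."""
    return input_dim * output_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


def split_and_steps(n_rows, test_size, batch_size):
    """Return (train rows, test rows, batches per epoch)."""
    n_test = math.ceil(n_rows * test_size)  # assumption: test split is rounded up
    n_train = n_rows - n_test
    return n_train, n_test, math.ceil(n_train / batch_size)


total = embedding_params(2155, 20) + lstm_params(20, 100) + dense_params(100, 2)
print(total)
print(split_and_steps(1908 + 2375, 0.1, 32))
```

Under these assumptions the total comes to 91,702 parameters, and 4283 rows give 3854 training rows, 429 test rows, and 121 batches per epoch, which agrees with the 121/121 shown in the log. The reported accuracy is also consistent with 396 correct predictions out of 429 test rows.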

Model Prediction

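Next, we use the model we just trained to make predictions on new data. Here we arbitrarily modify one of the reviews from the samples above and then predict its sentiment. The Python code for sentiment prediction is as follows: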

# Import the necessary modules
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# Load the dictionaries saved during training
with open('word_dict.pk', 'rb') as f:
    word_dictionary = pickle.load(f)
with open('label_dict.pk', 'rb') as f:
    output_dictionary = pickle.load(f)  # maps label -> index

try:
    # Preprocess the input sentence
    input_shape = 180
    sent = "很满意，电视非常好。护眼模式，很好，也很清晰。"
    x = [[word_dictionary[word] for word in sent]]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)

    # Load the trained model
    model_save_path = './corpus_model.h5'
    lstm_model = load_model(model_save_path)

    # Predict
    y_predict = lstm_model.predict(x)
    label_dict = {v: k for k, v in output_dictionary.items()}  # maps index -> label
    print('输入语句: %s' % sent)
    print('情感预测结果: %s' % label_dict[np.argmax(y_predict)])

except KeyError as err:
    # A character in the input sentence is not in the vocabulary
    print("您输入的句子有汉字不在词汇表中，请重新输入！")
    print("不在词汇表中的单词为：%s." % err)

Output:

输入语句: 很满意，电视非常好。护眼模式，很好，也很清晰。
情感预测结果: 正面
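The prediction script stops with a KeyError as soon as the input contains a character that never appeared in the training corpus. One possible alternative, sketched below, is to skip unknown characters and report them, so that a prediction can still be made from the remaining text. This is a design choice of the sketch rather than what the script above does; encode_sentence and the toy dictionary are hypothetical.

```python
def encode_sentence(sentence, word_dictionary):
    """Map characters to ids, collecting the characters missing from the vocabulary."""
    ids, unknown = [], []
    for ch in sentence:
        if ch in word_dictionary:
            ids.append(word_dictionary[ch])
        else:
            unknown.append(ch)  # skipped instead of raising KeyError
    return ids, unknown


# Toy vocabulary for illustration; the real one is loaded from word_dict.pk.
toy_dictionary = {"很": 1, "好": 2, "。": 3}
ids, unknown = encode_sentence("很好呀。", toy_dictionary)
print(ids, unknown)
```

Dropping characters changes the input the model sees, so it is worth reporting the skipped characters to the user, as the unknown list allows.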