
[NLP Text Classification] Sentiment Analysis of IMDB Movie Reviews


Overview

Classifying movie reviews from IMDB is a binary classification problem, an important and widely applicable kind of machine learning problem.

Data

The IMDB dataset contains 50,000 movie reviews: 25,000 for training and 25,000 for evaluation, each split containing an equal number of positive and negative reviews.

Downloading the IMDB data

The IMDB data comes already preprocessed as sequences of integers, where each integer stands for a specific word; the IMDB word index can be used to translate them back. (/text-datasets/imdb.npz) If you cannot reach the original download, you can get it from /s/1pNDbE3VMdYJiiXyaN2roaw, extraction code: 0wnn.

Loading the data

import tensorflow as tf
from tensorflow import keras
import numpy as np

imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    '/home/kesci/input/idmb2286/imdb.npz', num_words=10000)

Change the path in load_data to wherever imdb.npz is located. num_words=10000 keeps only the 10,000 most frequently occurring words; rarer words are discarded to keep the data a manageable size.
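As a quick sanity check (a sketch of ours, not part of the original walkthrough), we can print the largest word index that appears in the training data; with the num_words cap applied it should stay below 10000, though the exact value depends on how the archive was preprocessed:

print(max(max(sequence) for sequence in train_data))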

Exploring the data

Before processing the data, we should first understand it. After preprocessing, each example is a sequence of integers representing the words of the review, and each integer corresponds to one word in the dictionary. The labels are 0 and 1, marking the two classes (negative and positive).
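As a small check of our own (not in the original), we can count the labels to confirm that the two classes are balanced:

print(np.bincount(train_labels))  # expect roughly [12500 12500]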

print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

Training entries: 25000, labels: 25000

Let's take a look at the first review:

print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

Each review has a different number of words, but the inputs to a neural network must all have the same length. We will solve this problem below.

len(train_data[0]), len(train_data[1])

(218, 219)

Converting the integers back to text

# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

We can use decode_review to convert an integer sequence back into text:

decode_review(train_data[20])

"<START> shown in australia as <UNK> this incredibly bad movie is so bad that you become <UNK> and have to watch it to the end just to see if it could get any worse and it does the storyline is so predictable it seems written by a high school <UNK> class the sets are pathetic but marginally better than the <UNK> and the acting is wooden br br the infant <UNK> seems to have been stolen from the props <UNK> of <UNK> <UNK> there didn't seem to be a single original idea in the whole movie br br i found this movie to be so bad that i laughed most of the way through br br malcolm mcdowell should hang his head in shame he obviously needed the money"

Preparing the data

The reviews must be converted to tensors before being fed into the neural network.

There are two options: convert the reviews to one-hot (multi-hot) vectors, or pad the arrays so they all have the same length and build an integer tensor of shape num_examples * max_length, which can then serve as input to the first layer of the network.
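For reference, here is a minimal sketch of the first option; the helper name multi_hot is our own illustration and does not appear in the original:

def multi_hot(sequences, dimension=10000):
    # one row per review; entry i is set to 1.0 if word index i occurs in that review
    results = np.zeros((len(sequences), dimension))
    for row, seq in enumerate(sequences):
        results[row, [i for i in seq if i < dimension]] = 1.0
    return results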

Here we use the second approach:

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)

This makes every review 256 elements long; reviews that are too short are padded at the end with zeros.
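To make the padding behavior concrete, a toy example of our own:

toy = keras.preprocessing.sequence.pad_sequences(
    [[5, 7], [9, 9, 9, 9, 9]], value=0, padding='post', maxlen=4)
print(toy)  # [[5 7 0 0] [9 9 9 9]] -- short rows padded at the end, long rows truncated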

len(train_data[0]), len(train_data[1])

(256, 256)

Let's look at the data after processing:

print(train_data[0])

[1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941 4 173 36 256
 5 25 100 43 838 112 50 670 2 9 35 480 284 5 150 4 172 112
 167 2 336 385 39 4 172 4536 1111 17 546 38 13 447 4 192 50 16
 6 147 2025 19 14 22 4 1920 4613 469 4 22 71 87 12 16 43 530
 38 76 15 13 1247 4 22 17 515 17 12 16 626 18 2 5 62 386
 12 8 316 8 106 5 4 2223 5244 16 480 66 3785 33 4 130 12 16
 38 619 5 25 124 51 36 135 48 25 1415 33 6 22 12 215 28 77
 52 5 14 407 16 82 10311 8 4 107 117 5952 15 256 4 2 7 3766
 5 723 36 71 43 530 476 26 400 317 46 7 4 12118 1029 13 104 88
 4 381 15 297 98 32 2071 56 26 141 6 194 7486 18 4 226 22 21
 134 476 26 480 5 144 30 5535 18 51 36 28 224 92 25 104 4 226
 65 16 38 1334 88 12 16 283 5 16 4472 113 103 32 15 16 5345 19
 178 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0]

Building the model

# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
# model.add(keras.layers.GlobalMaxPooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()

Loss function

The model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we use the binary_crossentropy loss function.
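For a single example with true label y in {0, 1} and predicted probability p, binary cross-entropy is

loss = -[y * log(p) + (1 - y) * log(1 - p)]

so confident wrong predictions are penalized heavily.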

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

Creating a validation set

The first 10,000 examples form the validation set; the examples after 10,000 are used for training.

x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

Training the model

# compute F1, precision, and recall on the validation set at the end of every epoch
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

class Metrics(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs={}):
        # self.validation_data is populated by older (TF1-era) Keras when
        # validation_data is passed to model.fit
        val_predict = np.asarray(self.model.predict(self.validation_data[0])).round()
        val_targ = self.validation_data[1]
        _val_f1 = f1_score(val_targ, val_predict, average='weighted')
        _val_recall = recall_score(val_targ, val_predict, average='weighted')
        _val_precision = precision_score(val_targ, val_predict, average='weighted')
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print(' — val_f1: %f — val_precision: %f — val_recall %f'
              % (_val_f1, _val_precision, _val_recall))

metrics = Metrics()

earlystopping = keras.callbacks.EarlyStopping(monitor='val_acc', patience=8,
                                              verbose=0, mode='max')
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=90,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[metrics, earlystopping],
                    verbose=1)
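history.history holds the per-epoch training and validation metrics. Below is a minimal plotting sketch of our own (assuming matplotlib is installed; the accuracy key is 'acc' in older Keras versions and 'accuracy' in newer ones):

import matplotlib.pyplot as plt

acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key], label='training accuracy')
plt.plot(history.history['val_' + acc_key], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()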

Evaluating the model

results = model.evaluate(test_data, test_labels)
print(results)

25000/25000 [==============================] - 2s 61us/step

[0.31110355438232423, 0.87736]

We can see a loss of about 0.31 and an accuracy of about 0.88.
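As a final usage sketch of our own (not part of the original walkthrough), the trained model outputs a probability that a review is positive:

prob = model.predict(test_data[:1])[0][0]
print('positive' if prob > 0.5 else 'negative', prob)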
