Advantages of LSTM training for sentiment analysis: simple sentiment analysis of movie reviews with an LSTM

Date: 2019-10-28 19:26:50

Today I tried using an LSTM to do simple sentiment analysis on movie reviews.

The npy files used in the code:

The code uses the IMDB dataset; cloud-drive link:

First, load the pre-built word vector model

import numpy as np

# There are two tables here: one maps IDs to words, the other maps IDs to word vectors

wordsList = np.load('./npy/wordsList.npy')

wordsList = wordsList.tolist()

wordsList = [word.decode('UTF-8') for word in wordsList]

wordVectors = np.load('./npy/wordVectors.npy')

You can print the length and dimensions here

# Print the length and shape

print(len(wordsList))

print(wordVectors.shape)

400000

(400000, 50)

Here is a simple example first:

We first find the word's index, then use that index to look up the word's vector (50 dimensions)

nameIndex = wordsList.index('name')

wordVectors[nameIndex]

array([ 0.20957 , 0.75197 , -0.48559 , 0.1302 , 0.60071 , 0.43273 ,

-0.95424 , -0.19335 , -0.66756 , -0.25893 , 0.66367 , 1.0509 ,

0.10627 , -0.75438 , 0.45617 , 0.37878 , -0.40237 , 0.1821 ,

-0.028768, 0.24349 , -0.35723 , -0.55817 , 0.14103 , 0.58807 ,

0.076804, -1.972 , -1.4459 , 0.081884, -0.29207 , -0.65623 ,

2.718 , -0.96886 , -0.33354 , -0.19526 , 0.33918 , -0.24307 ,

0.29058 , -0.37178 , -0.38133 , -0.20901 , 0.48504 , 0.20702 ,

-0.5754 , -0.32403 , -0.19267 , -0.043298, -0.57702 , -0.4727 ,

0.42171 , -0.14112 ], dtype=float32)

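This gives us a vector representation for every word. A sentence simply adds one more dimension on top: a sentence of 20 words becomes a matrix of shape (20, 50). This is only a small example; for actual training we also need to fix a sentence length (truncating longer sentences and padding shorter ones) and account for batch_size.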
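As a minimal sketch of the shape reasoning above, the following pure-Python snippet pads or truncates a sentence to a fixed length and looks up one vector per word. The vocabulary `toy_words`, the 2-dimensional `toy_vectors`, and the helper `sentence_to_matrix` are made up for illustration and are not part of the original code.

```python
# Toy stand-ins for wordsList / wordVectors (hypothetical values, 2-D instead of 50-D).
toy_words = ["<pad>", "good", "movie"]
toy_vectors = [[0.0, 0.0], [0.5, 1.0], [-1.0, 0.25]]

def sentence_to_matrix(sentence, max_len):
    """Map a sentence to a max_len x dim matrix: truncate long input, pad short input."""
    ids = [toy_words.index(w) for w in sentence.split()][:max_len]
    ids += [0] * (max_len - len(ids))  # ID 0 is the padding row in this toy table
    return [toy_vectors[i] for i in ids]

matrix = sentence_to_matrix("good movie", max_len=4)
print(len(matrix), len(matrix[0]))  # rows = fixed sentence length, columns = vector size
```

With the real tables the same idea would give a (250, 50) matrix per review once the maximum length is fixed.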

First, read the training dataset (12,500 positive and 12,500 negative reviews, 25,000 in total)

from os import listdir

from os.path import isfile, join

# Set the dataset location; the files have to be read one by one

positiveFiles = ['lmdb/train/pos/' + f for f in listdir('lmdb/train/pos/') if isfile(join('lmdb/train/pos/', f))]

negativeFiles = ['lmdb/train/neg/' + f for f in listdir('lmdb/train/neg/') if isfile(join('lmdb/train/neg/', f))]

numWords = []

# Count the words in the positive and negative reviews separately
for pf in positiveFiles:
    with open(pf, "r", encoding='utf-8') as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)

# print('Positive reviews loaded')

for pf in negativeFiles:
    with open(pf, "r", encoding='utf-8') as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)

# print('Negative reviews loaded')

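numFiles = len(numWords)

print('Total number of sequences', numFiles)

print('Total number of words', sum(numWords))

print('Average number of words per review', sum(numWords)/len(numWords))

Total number of sequences 25000

Total number of words 5844680

Average number of words per review 233.7872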

We need to decide on a maximum sequence length, so let's plot the distribution first:

import matplotlib.pyplot as plt

%matplotlib inline

plt.hist(numWords, 50)

plt.xlabel('Sequence Length')

plt.ylabel('Frequency')

plt.axis([0, 1200, 0, 8000])

plt.show()

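[Figure: histogram of review lengths; x-axis: Sequence Length, y-axis: Frequency]

The histogram shows that most reviews are around 200 words long, so the maximum length can be set to 250.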

maxSeqLength = 250
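One way to sanity-check a cutoff like this, beyond eyeballing the histogram, is to compute what fraction of reviews fit within it. The sketch below is an illustration only: the `coverage` helper and the `sample_lengths` values are hypothetical, and in the article's setting the list passed in would be numWords.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Hypothetical lengths for illustration; the real input would be numWords.
sample_lengths = [120, 180, 200, 230, 260, 900]
print(coverage(sample_lengths, 250))
```

A higher cutoff covers more reviews without truncation but makes every batch larger, so the choice is a trade-off rather than a single correct value.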

Next, the text sequences need to be converted into an index matrix. First, use a regular expression for some simple cleanup

import re

strip_special_chars = re.compile("[^A-Za-z0-9 ]+")

# Filter out HTML line breaks and special characters
def cleanSentences(string):
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string.lower())
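To see what this kind of cleanup does to a review, here is a small self-contained sketch. It assumes the raw reviews contain HTML `<br />` tags; the function name `clean_sentence` and the sample string are illustrative, not taken from the dataset.

```python
import re

# Keep only letters, digits, and spaces.
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")

def clean_sentence(text):
    """Lowercase, turn <br /> tags into spaces, then drop all other special characters."""
    text = text.lower().replace("<br />", " ")
    return strip_special_chars.sub("", text)

print(clean_sentence("Great film!<br />Loved it, 10/10."))
# prints: great film loved it 1010
```

Note that punctuation is deleted rather than replaced by a space, so "10/10" collapses into "1010".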

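Next, every one of the 25,000 sequences is mapped from words to IDs, giving a 25000×250 matrix. This takes a long time to compute, so we load a precomputed index matrix file directly. The word-to-ID mapping code is as follows: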

# ids = np.zeros((numFiles, maxSeqLength), dtype='int32')
# fileCounter = 0
# for pf in positiveFiles:
#     with open(pf, "r") as f:
#         indexCounter = 0
#         line = f.readline()
#         cleanedLine = cleanSentences(line)
#         split = cleanedLine.split()
#         for word in split:
#             try:
#                 ids[fileCounter][indexCounter] = wordsList.index(word)
#             except ValueError:
#                 # unknown word: use the last valid row of the 400000-row table
#                 ids[fileCounter][indexCounter] = 399999
#             indexCounter = indexCounter + 1
#             if indexCounter >= maxSeqLength:
#                 break
#     fileCounter = fileCounter + 1

# for nf in negativeFiles:
#     with open(nf, "r") as f:
#         indexCounter = 0
#         line = f.readline()
#         cleanedLine = cleanSentences(line)
#         split = cleanedLine.split()
#         for word in split:
#             try:
#                 ids[fileCounter][indexCounter] = wordsList.index(word)
#             except ValueError:
#                 # unknown word: use the last valid row of the 400000-row table
#                 ids[fileCounter][indexCounter] = 399999
#             indexCounter = indexCounter + 1
#             if indexCounter >= maxSeqLength:
#                 break
#     fileCounter = fileCounter + 1

# np.save('idsMatrix', ids)

ids = np.load('npy/idsMatrix.npy')
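Much of the cost of the mapping step comes from calling `list.index` for every word, which scans the vocabulary list each time. A possible speed-up, shown here only as a sketch, is to build a dictionary once and look words up in it. The tiny vocabulary `toy_words`, the `UNKNOWN_ID` convention, and the `encode` helper are assumptions made for this example.

```python
# Hypothetical miniature vocabulary; the article's wordsList has 400000 entries.
toy_words = ["the", "movie", "was", "great", "<unk>"]
UNKNOWN_ID = len(toy_words) - 1  # assumption: the last row stands for unknown words

# Build the word -> ID table once; each lookup is then a hash access, not a list scan.
word_to_id = {word: i for i, word in enumerate(toy_words)}

def encode(tokens, max_len):
    """Return a fixed-length list of IDs, truncated to max_len and padded with 0."""
    ids = [word_to_id.get(tok, UNKNOWN_ID) for tok in tokens[:max_len]]
    return ids + [0] * (max_len - len(ids))

print(encode("the movie was awful".split(), max_len=6))
# prints: [0, 1, 2, 4, 0, 0]
```

As in the article's matrix, unused positions stay at 0, so padding shares an ID with the first vocabulary entry.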

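Now build the model using a TensorFlow graph. First define some hyperparameters: the batch size, the number of LSTM units, the number of classes, and the number of training iterations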

batchSize = 24

lstmUnits = 64

numClasses = 2

numDimensions = 50

iterations = 50000

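The input data should have shape batchSize × 250 (maximum sequence length) × 50 (word vector dimension)

The output data should have shape batchSize × 2 (number of classes)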

import tensorflow as tf

tf.reset_default_graph()

labels = tf.placeholder(tf.float32, [batchSize, numClasses])

input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])  # intermediate result: word IDs, not yet converted to word vectors

data = tf.Variable(tf.zeros([batchSize, maxSeqLength, numDimensions]), dtype=tf.float32)

data = tf.nn.embedding_lookup(wordVectors, input_data)

You can print data here to check its shape

print(data)

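Next, build the model: create the LSTM cell (tf.contrib.rnn.BasicLSTMCell in the code below), wrap it with dropout to reduce overfitting, and pass it to tf.nn.dynamic_rnn to unroll the whole network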

lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)

lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)

value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)

# Print the returned tensor to inspect it

print(value)

# Initialize the weights and bias of the output layer

weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))

bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))

value = tf.transpose(value, [1, 0, 2])

# Take the output of the final time step

last = tf.gather(value, int(value.get_shape()[0])-1)

prediction = (tf.matmul(last, weight) + bias)

print(prediction)
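The transpose-then-gather step above can be hard to picture, so here is a rough pure-Python sketch of the same indexing on nested lists. The numbers and the small shape (batch 2, time 3, units 2) are invented for illustration; no TensorFlow is involved.

```python
# Toy RNN output with shape [batch=2, time=3, units=2] (made-up numbers).
value = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]

# Reorder to time-major [time, batch, units], the layout a [1, 0, 2] transpose produces.
time_major = [list(step) for step in zip(*value)]

# The last entry along the time axis holds one output vector per batch element.
last = time_major[-1]
print(last)
# prints: [[0.5, 0.6], [1.5, 1.6]]
```

Each row of `last` is what gets multiplied by the weight matrix to produce the two class scores for that review.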

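Then define the correct-prediction op and the accuracy metric. A prediction counts as correct when the index of the largest output value matches the index of the 1 in the one-hot label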

correctPred = tf.equal(tf.argmax(prediction, 1), tf.argmax(labels, 1))

accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
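The same accuracy computation can be written out in plain Python as a sketch, which may make the argmax comparison easier to follow. The helper names `argmax` and `batch_accuracy` and the sample scores are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def batch_accuracy(predictions, one_hot_labels):
    """Share of rows where the predicted class equals the labelled class."""
    hits = [argmax(p) == argmax(y) for p, y in zip(predictions, one_hot_labels)]
    return sum(hits) / len(hits)

# Made-up class scores for three reviews and their one-hot labels.
sample_scores = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.7]]
sample_labels = [[1, 0], [0, 1], [1, 0]]
print(batch_accuracy(sample_scores, sample_labels))
```

Here the first two rows match and the third does not, so two of the three predictions count as correct.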

Finally, use softmax cross-entropy as the loss. The optimizer is Adam with its default learning rate:

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))

optimizer = tf.train.AdamOptimizer().minimize(loss)
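For readers who want to see what the loss computes, the following is a minimal pure-Python sketch of softmax cross-entropy averaged over a batch. It illustrates the formula only and is not TensorFlow's implementation; the function name and the sample logits are assumptions.

```python
import math

def softmax_cross_entropy(logits, one_hot):
    """Cross-entropy between softmax(logits) and a one-hot label, for one example."""
    shift = max(logits)  # subtract the maximum for numerical stability
    log_sum = shift + math.log(sum(math.exp(z - shift) for z in logits))
    # log softmax of class k is logits[k] - log_sum; pick it out with the one-hot label
    return -sum(y * (z - log_sum) for z, y in zip(logits, one_hot))

# Made-up logits: a confident correct prediction and an undecided one.
batch_logits = [[2.0, 0.0], [0.0, 0.0]]
batch_labels = [[1, 0], [0, 1]]
losses = [softmax_cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
print(sum(losses) / len(losses))  # mean over the batch, as reduce_mean does
```

The undecided example (equal logits) costs log 2, while the confident correct one costs much less, which is the behaviour the optimizer exploits.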

Training requires a couple of helper functions:

from random import randint

# Build batches; the training and test sets are defined by index ranges in the dataset.
# Each training batch is half positive and half negative, with matching labels.
def getTrainBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        if (i % 2 == 0):
            # even positions: a positive review
            num = randint(1, 11499)
            labels.append([1,0])
        else:
            # odd positions: a negative review
            num = randint(13499, 24999)
            labels.append([0,1])
        arr[i] = ids[num-1:num]
    return arr, labels

def getTestBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        # the held-out range straddles the positive/negative boundary
        num = randint(11499, 13499)
        if (num <= 12499):
            labels.append([1,0])
        else:
            labels.append([0,1])
        arr[i] = ids[num-1:num]
    return arr, labels
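The sampling logic in the training helper can be isolated from NumPy as a small sketch. The function `balanced_batch` and its parameters are hypothetical, and the row ranges passed in below are simply the zero-based rows that the helper above reads (`num-1` for each drawn `num`). A seeded `random.Random` is used so the draw is repeatable.

```python
import random

def balanced_batch(batch_size, pos_rows, neg_rows, rng):
    """Return (row_indices, labels), alternating positive and negative samples."""
    indices, labels = [], []
    for i in range(batch_size):
        if i % 2 == 0:
            indices.append(rng.randint(*pos_rows))  # randint is inclusive at both ends
            labels.append([1, 0])
        else:
            indices.append(rng.randint(*neg_rows))
            labels.append([0, 1])
    return indices, labels

rng = random.Random(0)  # fixed seed, only to make the example repeatable
rows, row_labels = balanced_batch(4, (0, 11498), (13498, 24998), rng)
print(rows, row_labels)
```

Alternating by position guarantees that every batch is exactly half positive and half negative, whatever rows happen to be drawn.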

Training

sess = tf.InteractiveSession()

saver = tf.train.Saver()

sess.run(tf.global_variables_initializer())

for i in range(iterations):
    # Fetch the next batch via the helper function
    nextBatch, nextBatchLabels = getTrainBatch()
    sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})

    # Print the current results every 100 iterations
    if (i % 100 == 0 and i != 0):
        loss_ = sess.run(loss, {input_data: nextBatch, labels: nextBatchLabels})
        accuracy_ = sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})
        print("iteration {}/{}...".format(i+1, iterations),
              "loss {}...".format(loss_),
              "accuracy {}...".format(accuracy_))

    # Save the current model every 10,000 iterations
    if (i % 10000 == 0 and i != 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt", global_step=i)
        print("saved to %s" % save_path)

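That is roughly all there is to training. My laptop is slow and a full run takes too long, so I print progress every 100 iterations. If you have better hardware, try running it to completion.

Here is the code for evaluating on the test set: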

sess = tf.InteractiveSession()

saver = tf.train.Saver()

saver.restore(sess, tf.train.latest_checkpoint('models'))

Then load batches from the test set and evaluate

test_iterations = 10

for i in range(test_iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    print("Accuracy for this batch:", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)
