1 Viewing the text

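Whenever I get new data, I like to open it first and look at the fields and what the data looks like. However, my computer has only 16 GB of RAM. If a text file larger than 4 GB is opened directly in Notepad, Notepad++, or a similar text editor, the editor tries to load the whole file into memory at once, so it opens very slowly or not at all.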

EmEditor makes reading large files convenient. It is not free and requires registration: EmEditor (Text Editor) – Text Editor for Windows supporting large files and Unicode!
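When a dedicated editor is not at hand, a few lines of Python can show the first rows without loading the whole file, because a file object is read lazily, line by line. A minimal sketch; the helper name `peek` and the default of 5 lines are my own choices for illustration, not part of any library:

```python
from itertools import islice

def peek(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest."""
    # errors="replace" keeps the preview going even if the encoding guess is wrong
    with open(path, "r", encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

# Example with a hypothetical path:
# for line in peek(r"G:\data.txt"):
#     print(line)
```

Only the lines actually requested are read from disk, so this stays fast no matter how large the file is.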

2 Reading the text

2.1 Reading text in chunks

import pandas as pd

table = pd.read_csv(
    r"G:\data.txt",
    sep='\t',             # tab-delimited
    header=None,          # this data has no header row
    encoding='utf-8',
    on_bad_lines='warn',  # skip malformed rows and warn (pandas >= 1.3; replaces error_bad_lines / warn_bad_lines)
    iterator=True,        # return an iterator instead of a DataFrame
    chunksize=10000,      # read 10,000 rows per chunk
)

path = r"G:\test"
i = 0
for item in table:
    i += 1
    print("Processing file {}".format(i))
    item.to_csv(path + "_test_" + str(i) + ".csv", index=False, encoding='utf-8')
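Writing every chunk to disk is one option. If only a summary is needed, the result can instead be accumulated chunk by chunk, so that no more than one chunk is in memory at a time. A minimal sketch on a small in-memory sample standing in for the real file; the sample data and the function name `count_rows_and_sum` are assumptions for illustration:

```python
import io
import pandas as pd

def count_rows_and_sum(source, column=1, chunksize=2):
    """Stream a tab-separated source and accumulate totals chunk by chunk."""
    n_rows = 0
    total = 0
    reader = pd.read_csv(source, sep="\t", header=None, chunksize=chunksize)
    for chunk in reader:
        n_rows += len(chunk)           # rows in this chunk
        total += chunk[column].sum()   # running sum of one numeric column
    return n_rows, int(total)

# Tiny made-up sample: a label column (0) and a numeric column (1)
sample = io.StringIO("a\t1\nb\t2\nc\t3\nd\t4\ne\t5\n")
print(count_rows_and_sum(sample))
```

With `header=None`, pandas names the columns 0, 1, 2, ..., which is why the column is addressed by integer here.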

2.2 Detecting the encoding of Chinese text

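When reading Chinese text with pandas' read_csv, you first need to know the file's encoding and pass it in the encoding parameter. Otherwise the data comes out garbled, or the read fails with a decoding error. EmEditor can show both the encoding and the delimiter type of a file directly.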
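To see why the `encoding` argument matters, the same bytes can be decoded with different codecs. A small illustration using only the standard library; the sample string is made up:

```python
raw = "中文数据".encode("gbk")    # bytes as they would sit in a GBK-encoded file

good = raw.decode("gbk")          # correct codec: the original text comes back
bad = raw.decode("latin-1")       # wrong codec that never raises: silent garbage

try:
    raw.decode("utf-8")           # wrong codec that does raise
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(good == "中文数据", bad == "中文数据", utf8_ok)
```

The silent case is the dangerous one: single-byte codecs such as latin-1 accept any byte sequence, so a wrong guess produces garbled columns rather than an error.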

You can also use Python's chardet package to detect the encoding.

#Method 1
import chardet

def get_encoding(filename):
    """Return the file's encoding. The whole file is read into memory, so this suits small files."""
    with open(filename, 'rb') as f:
        return chardet.detect(f.read())['encoding']

original_file = r"G:\data.txt"
print(get_encoding(original_file))

#Method 2
from chardet.universaldetector import UniversalDetector

original_file = r"G:\data.txt"
detector = UniversalDetector()
with open(original_file, 'rb') as usock:
    for line in usock:        # feed line by line instead of loading every line up front
        detector.feed(line)
        if detector.done:     # stop as soon as the detector is confident
            break
detector.close()
print(detector.result)
#chardet cannot always guess correctly. If you need to process the samples correctly, you really need to know their encoding.
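Before falling back on statistical guessing, it can be worth checking for a byte order mark (BOM), since a BOM identifies the encoding exactly when it is present. A minimal sketch using the BOM constants from the standard `codecs` module; the function name `sniff_bom` is hypothetical, and files without a BOM (common for UTF-8 and GBK) simply return `None`:

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(path):
    """Return a codec name if the file starts with a known BOM, else None."""
    with open(path, "rb") as f:
        head = f.read(4)          # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

The returned names are Python codecs that consume the BOM on decoding, so they can be passed straight to `encoding=`. A `None` result says nothing about the encoding and leaves the guessing to a tool such as chardet.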

2.3 Converting the encoding of Chinese text

EmEditor can convert encodings, and so can the code below, which converts a file to "utf-8".

def handleEncoding(original_file, newfile):
    # newfile = original_file[0:original_file.rfind('.')] + '_copy.csv'
    # Read the file as bytes (not str) to determine its encoding.
    # Note: this loads the whole file into memory.
    with open(original_file, 'rb') as f:
        content = f.read()

    ##### Determine the encoding: the first candidate that decodes cleanly wins.
    # gb2312 is a subset of gbk and cp936 is an alias of gbk in Python,
    # so those two entries are effectively redundant.
    for source_encoding in ('utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936'):
        try:
            content.decode(source_encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("could not determine the encoding of " + original_file)

    ##### Re-read with the detected encoding, block by block, and save as utf-8.
    block_size = 10000
    with open(original_file, 'r', encoding=source_encoding, newline='') as f:
        with open(newfile, 'w', encoding='utf-8', newline='') as f2:
            while True:
                content = f.read(block_size)
                if not content:
                    break
                f2.write(content)

original_file = r"G:\data.txt"
newfile = r"G:\data_new.txt"
handleEncoding(original_file, newfile)
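The conversion step reads the source through a decoding file object rather than decoding raw byte blocks by hand. The reason is that a multi-byte character may straddle a block boundary; an incremental decoder holds back the dangling byte until the rest arrives. A small standard-library sketch of that effect, with made-up sample text:

```python
import codecs

data = "中文数据".encode("gbk")    # 8 bytes, 2 per character
first, rest = data[:3], data[3:]   # the cut falls in the middle of the 2nd character

# Decoding a block on its own fails on the incomplete character
try:
    first.decode("gbk")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder buffers the dangling byte until the next block arrives
decoder = codecs.getincrementaldecoder("gbk")()
text = decoder.decode(first) + decoder.decode(rest, final=True)

print(naive_ok, text == "中文数据")
```

Text-mode file objects do this buffering internally, which is why reading in blocks of characters is safe while slicing the byte string at arbitrary offsets is not.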

2.4 Parallel text processing

To run the function Fuction_test(x) on all the chunks at the same time, consider parallel processing.

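#Parallel processing with the dask package (dask itself schedules tasks across CPU cores or a cluster; running on a GPU needs additional libraries)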

Still looking into this...

#CPU parallelism: the Parallel function from the joblib package

Still looking into this...

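from joblib import Parallel, delayed

def Fuction_test(x):
    y = x + 10
    return y

# `table` is the chunk iterator from section 2.1. It can only be consumed once,
# so re-create it with pd.read_csv if it has already been looped over.
# n_jobs=-1 uses all available CPU cores; results come back as a list, in chunk order.
results = Parallel(n_jobs=-1)(delayed(Fuction_test)(item) for item in table)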
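For comparison, the same map-over-chunks pattern can be written with only the standard library. This is a swapped-in technique, not joblib: the sketch uses `concurrent.futures`, the names `add_ten` and `run_on_chunks` are hypothetical, and plain integers stand in for the DataFrame chunks. Threads are the default here only because they need no special setup; they do not give true CPU parallelism for pure-Python work.

```python
from concurrent.futures import ThreadPoolExecutor

def add_ten(x):
    # Stand-in for the per-chunk function
    return x + 10

def run_on_chunks(func, chunks, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Apply func to every chunk concurrently; results keep the input order."""
    # Materialize first: a chunk iterator can only be consumed once.
    # This holds all chunks in memory, so it assumes they fit.
    chunks = list(chunks)
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(func, chunks))

print(run_on_chunks(add_ten, [1, 2, 3]))
```

For CPU-bound work, passing `executor_cls=ProcessPoolExecutor` (also from `concurrent.futures`) runs the function in separate processes. In that case the function must be defined at module level so it can be pickled, and on Windows the call belongs under an `if __name__ == "__main__":` guard.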