Import the required packages, numpy and pandas:
import numpy as np
import pandas as pd
Create a table:
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
    # start date is an assumption; the source shows only "102"
    "date": pd.date_range("20130102", periods=6),
    "city": ["Beijing ", "SH", "guangzhou ", "Shenzhen", "shanghai", "Beijing "],
    "age": [23, 44, 54, 32, 34, 32],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
    "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
    columns=["id", "date", "city", "category", "age", "price"])
This produces the following table:
Handling duplicate data in Python
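As a minimal sketch of the duplicate handling covered in this section, the example below compares the default behavior of drop_duplicates() with keep="last". The Series `city` and its values are hypothetical and chosen only for illustration; they are not the table built earlier.

```python
import pandas as pd

# Hypothetical city values; "Beijing" appears at index 0 and again at index 4.
city = pd.Series(["Beijing", "SH", "Guangzhou", "Shenzhen", "Beijing"])

# Default (keep="first"): the first occurrence stays, the later duplicate is dropped.
first_kept = city.drop_duplicates()

# keep="last": the last occurrence stays, the earlier duplicate is dropped.
last_kept = city.drop_duplicates(keep="last")

print(first_kept.index.tolist())  # [0, 1, 2, 3]
print(last_kept.index.tolist())   # [1, 2, 3, 4]
```

Both calls return the same set of distinct values. Only the surviving row labels differ, which matters when the index is later used to align with other columns.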
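The drop_duplicates function removes duplicate values. Take the city column as an example: it contains duplicate values. By default, drop_duplicates() keeps the first occurrence of each value and drops any duplicates that appear later. With the keep='last' parameter it does the opposite: earlier occurrences are dropped and the last one is kept. The code and a comparison of the results follow.
df["city"].drop_duplicates()保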