Hotel Booking Demand Analysis

Dataset download link:

/jessemostipak/hotel-booking-demand

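The dataset contains booking records for one city hotel and one resort hotel in Portugal. It covers bookings with arrival dates from July 1, 2015 to August 31, 2017.

It holds booking information for both hotels, including when the booking was made, the length of stay, the number of adults, children, and babies, and the number of available parking spaces.

Use cases: social science, travel, hospitality, and user behavior. The data carries no strong industry-specific markers, so it suits general user behavior analysis.

Data size: 32 columns and about 120,000 rows.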

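Questions that can be explored:

1) Overview: compare booking demand and occupancy between the city hotel and the resort hotel;

2) User behavior: lead time, length of stay, booking intervals, and meal bookings;

3) The best time of year to book a hotel;

4) Predict hotel bookings with logistic regression.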

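Meaning of selected columns:

hotel: hotel name; City Hotel 66%, Resort Hotel 34%
is_canceled: 0 or 1; 1 = booking canceled, 0 = booking not canceled
lead_time: number of days between the date the booking was entered and the arrival date
arrival_date_year: year of arrival
arrival_date_month: month of arrival
arrival_date_week_number
arrival_date_day_of_month
stays_in_weekend_nights: number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay
stays_in_week_nights: number of weekday nights stayed
adults: number of adults
children: number of children
babies
meal: type of meal booked
country: country
market_segment: market segment name
distribution_channel: booking channel (travel agency, direct booking, corporate booking)
is_repeated_guest: repeat guest? 1 (yes), 0 (no)
previous_cancellations: number of earlier bookings the customer canceled before the current booking
previous_bookings_not_canceled
reserved_room_type
assigned_room_type
booking_changes
deposit_type: whether the customer paid a deposit to guarantee the booking
agent: agent
company
days_in_waiting_list: number of days the booking spent on the waiting list before it was confirmed to the customer
customer_type
adr: average daily rate, defined as the sum of all lodging transactions divided by the number of staying nights (that is, the average spend per night)
required_car_parking_spaces: number of parking spaces the customer requested
total_of_special_requests: number of special requests the customer made (for example twin beds or a high floor)
reservation_status: final status of the booking; Canceled (the customer canceled the booking) or Check-Out (the customer checked out)
reservation_status_date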

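Contents

Data mining steps
1. Define the goal
2. Data exploration
3. Data preprocessing
Visual analysis
Occupancy by hotel type
Effect of deposits
Effect of lead time on occupancy
Effect of booking channel
Cancellation rate by month
Effect of month on number of guests
Month and average room rate
Factors that influence cancellation
A quick analysis tool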

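Data mining steps

1. Define the goal to discover

2. Collect the data

3. Explore the data

4. Preprocess the data

5. Mine the data (model selection)

6. Evaluate the patterns

1. Define the goal

Predict whether a customer will cancel a booking.

2. Data exploration

This step mainly covers basic descriptive statistics of the features and the similarity or dissimilarity between features. Visualization can be used to present the features.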

# Load the dataset
import pandas as pd

hotel_data = pd.read_csv('./hotel_bookings.csv')
hotel_data.shape

(119390, 32)

# Show the first 5 rows
hotel_data.head()

Variable analysis: which variables are outputs and which are inputs?

Here is_canceled is clearly the classification target.

hotel_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype
---  ------                          --------------   -----
 0   hotel                           119390 non-null  object
 1   is_canceled                     119390 non-null  int64
 2   lead_time                       119390 non-null  int64
 3   arrival_date_year               119390 non-null  int64
 4   arrival_date_month              119390 non-null  object
 5   arrival_date_week_number        119390 non-null  int64
 6   arrival_date_day_of_month       119390 non-null  int64
 7   stays_in_weekend_nights         119390 non-null  int64
 8   stays_in_week_nights            119390 non-null  int64
 9   adults                          119390 non-null  int64
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64
 12  meal                            119390 non-null  object
 13  country                         118902 non-null  object
 14  market_segment                  119390 non-null  object
 15  distribution_channel            119390 non-null  object
 16  is_repeated_guest               119390 non-null  int64
 17  previous_cancellations          119390 non-null  int64
 18  previous_bookings_not_canceled  119390 non-null  int64
 19  reserved_room_type              119390 non-null  object
 20  assigned_room_type              119390 non-null  object
 21  booking_changes                 119390 non-null  int64
 22  deposit_type                    119390 non-null  object
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64
 26  customer_type                   119390 non-null  object
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64
 29  total_of_special_requests       119390 non-null  int64
 30  reservation_status              119390 non-null  object
 31  reservation_status_date         119390 non-null  object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

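Some columns contain missing values.

Next, look at the descriptive statistics:

Most of the features turn out to be discrete numeric values or strings.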
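The descriptive statistics mentioned above could be produced roughly as follows. This is a minimal sketch on a made-up four-row frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from the real dataset); on the real data the same calls would be applied to `hotel_data`.

```python
import pandas as pd

# Made-up rows that only mimic a few columns of the dataset.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [10, 200, 35, 0],
    "children": [0.0, 1.0, None, 0.0],
    "adr": [95.0, 120.5, 80.0, 60.0],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()
# A text column: count, unique, top (most frequent value), freq.
text_summary = toy["hotel"].describe()
# Missing values per column.
missing_counts = toy.isnull().sum()

print(numeric_summary)
print(text_summary)
print(missing_counts[missing_counts != 0])
```

Splitting the summary this way makes it easier to see which columns are numeric counts and which are labels that will need encoding later.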

3. Data preprocessing

(1) Data cleaning

Handle missing values and outliers:

Check the columns with missing values

missing = hotel_data.isnull().sum()
missing[missing != 0]

children         4
country        488
agent        16340
company     112593
dtype: int64

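company has too many missing values, so the column is dropped.

children, country, and agent have relatively few missing values, so they are filled.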

data = hotel_data.copy(deep=True)  # work on a copy of the data
data.drop("company", axis=1, inplace=True)  # inplace=True drops the column from data itself

Look at the distribution of children

import matplotlib.pyplot as plt

# Number of bookings for each children count (0 to 10)
children_num = [(data['children'] == i).sum() for i in range(11)]
x = range(len(children_num))
plt.bar(x, children_num)
# Label each bar with its count
for a, b in zip(x, children_num):
    plt.text(a, b + 0.05, '%.0f' % b, ha='center', va='bottom')

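Almost all values are 0 and the data is discrete, so children is filled with the mode. The other two fields are handled similarly: country is filled with its mode and agent with 0:

data["agent"] = data["agent"].fillna(0)
data["children"] = data["children"].fillna(data["children"].mode()[0])
data["country"] = data["country"].fillna(data["country"].mode()[0])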

Outliers: none stand out at this point, so this is left for later.

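Visual analysis

Occupancy by hotel type

The city hotel has about twice as many bookings as the resort hotel, but its cancellation rate is also much higher: 42% of city hotel bookings end up canceled.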
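The booking counts and cancellation rates per hotel can be computed with a single groupby. The sketch below uses a made-up eight-row frame (an assumption for illustration), so its numbers do not match the real dataset; only the column names `hotel` and `is_canceled` come from the data.

```python
import pandas as pd

# Made-up bookings: 5 for the city hotel, 3 for the resort hotel.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 3,
    "is_canceled": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Because is_canceled is 0/1, its mean is the cancellation rate.
summary = toy.groupby("hotel")["is_canceled"].agg(
    bookings="count",
    cancel_rate="mean",
)
print(summary)
```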

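Effect of deposits

Look at the deposit types (No Deposit: no deposit; Non Refund: non-refundable deposit)

88% of bookings require no deposit and 12% carry a non-refundable deposit. Roughly 0.1% carry a refundable deposit.

Effect of deposit type on cancellation:

Surprisingly, 99% of the non-refundable bookings were canceled. This runs against expectations (at first glance one might read it as only 1% of non-refundable bookings being canceled). It may be due to data errors or some other incidental cause.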
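Both figures above, the share of each deposit type and the cancellation rate within each type, can be derived as sketched here. The ten rows are made up for illustration and chosen only to echo the pattern described; the real percentages come from the full dataset.

```python
import pandas as pd

# Made-up bookings; only the column names come from the dataset.
toy = pd.DataFrame({
    "deposit_type": ["No Deposit"] * 6 + ["Non Refund"] * 3 + ["Refundable"],
    "is_canceled": [0, 0, 1, 0, 0, 1, 1, 1, 1, 0],
})

# Share of bookings per deposit type.
share = toy["deposit_type"].value_counts(normalize=True)
# Cancellation rate within each deposit type.
cancel_rate = toy.groupby("deposit_type")["is_canceled"].mean()

print(share)
print(cancel_rate)
```

When a rate looks implausible, as the non-refundable one does, checking the group size next to the rate helps judge whether it rests on enough bookings to be trusted.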

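Effect of lead time on occupancy

The shorter the lead time, the less likely the booking is to be canceled.

However, since the two hotels have different occupancy rates, the relationship should also be examined per hotel type.

For the city hotel, the tendency for the cancellation rate to rise with lead time is much more pronounced. This may be related to the nature of the resort hotel: people plan holidays further in advance.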
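One way to compare lead time and cancellation per hotel is to bin `lead_time` and take the mean of `is_canceled` in each bin. The sketch below assumes made-up rows and hypothetical bin edges (`bins` and `labels` are illustrative choices, not values from the original analysis).

```python
import pandas as pd

# Made-up bookings for illustration.
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 6 + ["Resort Hotel"] * 6,
    "lead_time": [3, 20, 45, 100, 200, 400, 5, 25, 60, 120, 250, 365],
    "is_canceled": [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1],
})

# Hypothetical bin edges in days; the right edge of each bin is inclusive.
bins = [0, 30, 90, 180, 365, 10_000]
labels = ["0-30", "31-90", "91-180", "181-365", "365+"]
toy["lead_bin"] = pd.cut(toy["lead_time"], bins=bins, labels=labels,
                         include_lowest=True)

# Cancellation rate per lead-time bin, one column per hotel.
rate = (
    toy.groupby(["hotel", "lead_bin"], observed=True)["is_canceled"]
    .mean()
    .unstack("hotel")
)
print(rate)
```

Bins with no bookings for a hotel show up as NaN, which is a useful reminder that rates in sparse bins should be read with care.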

Effect of booking channel

Cancellation rate by month

Average monthly bookings and cancellation rate

import seaborn as sns
import matplotlib.pyplot as plt

ordered_months = ["January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November", "December"]

for hotel in ['City Hotel', 'Resort Hotel']:
    fig, ax1 = plt.subplots()
    ax2 = ax1.twinx()
    data_hotel = data[data.hotel == hotel]
    # Each month appears in two years of data, so halve the counts for a yearly average
    monthly = data_hotel.groupby('arrival_date_month').size() / 2
    # July and August appear in three years, so rescale them to a three-year average
    monthly.loc[['July', 'August']] = monthly.loc[['July', 'August']] * 2 / 3
    # Bars: average bookings per month
    sns.barplot(x=list(range(1, 13)), y=monthly[ordered_months].values, ax=ax1)
    # Line: cancellation rate per month on the second y-axis
    cancel_rate = data_hotel.groupby('arrival_date_month')['is_canceled'].mean()
    ax2.plot(range(12), cancel_rate[ordered_months].values, 'ro-')
    ax1.set_xlabel('Month')
    ax2.set_ylabel('Cancellation rate')
    ax1.set_title(hotel)

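Effect of month on number of guests

Effect of month on the number of bookings (some years, such as 2015, do not cover all twelve months, so the years are examined separately). Only bookings that were not canceled are counted:

By year:

Combined:

The number of guests is clearly higher around May (early summer) and October (autumn), and lower in winter.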
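The per-year and combined counts of non-canceled bookings could be built as sketched below. The six rows are made up for illustration; the ordered categorical is there so that months sort in calendar order rather than alphabetically.

```python
import pandas as pd

ordered_months = ["January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November",
                  "December"]

# Made-up bookings for illustration.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2015, 2016, 2016, 2016, 2017],
    "arrival_date_month": ["July", "August", "May", "May", "October", "May"],
    "is_canceled": [0, 1, 0, 0, 0, 1],
})

# Keep only bookings that were not canceled.
stayed = toy[toy["is_canceled"] == 0].copy()
stayed["arrival_date_month"] = pd.Categorical(
    stayed["arrival_date_month"], categories=ordered_months, ordered=True
)

# Rows: months in calendar order; columns: years.
per_year = (
    stayed.groupby(["arrival_date_year", "arrival_date_month"], observed=True)
    .size()
    .unstack("arrival_date_year", fill_value=0)
)
# Combined view across all years.
total = per_year.sum(axis=1)
print(per_year)
print(total)
```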

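Month and average room rate

Average daily rate over all bookings:

Average daily rate over bookings that were not canceled:

Bookings that actually led to a stay appear to have a slightly lower rate.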
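The two views of the average daily rate can be placed side by side as in this sketch. The five rows are made up for illustration, so the gap between the two columns here says nothing about the size of the gap in the real data.

```python
import pandas as pd

# Made-up bookings for illustration.
toy = pd.DataFrame({
    "arrival_date_month": ["July", "July", "July", "January", "January"],
    "adr": [150.0, 130.0, 110.0, 60.0, 40.0],
    "is_canceled": [1, 0, 0, 1, 0],
})

# Mean adr per month over all bookings.
adr_all = toy.groupby("arrival_date_month")["adr"].mean()
# Mean adr per month over bookings that were not canceled.
adr_stayed = (
    toy[toy["is_canceled"] == 0].groupby("arrival_date_month")["adr"].mean()
)

comparison = pd.DataFrame({"all_bookings": adr_all, "not_canceled": adr_stayed})
print(comparison)
```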

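Factors that influence cancellation

The raw data has more than thirty features. How should they be used? With this many features, not all of them are needed. Below we look for the features most strongly correlated with is_canceled:

pandas provides the following function for correlation analysis: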

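DataFrame.corr(method='pearson', min_periods=1)
'''
Parameters:
method: one of {'pearson', 'kendall', 'spearman'}
    pearson: the Pearson coefficient measures how closely two variables follow a straight line, i.e. linear correlation; it is misleading for nonlinear relationships.
    kendall: a rank correlation for ordinal (ranked) data, suitable for data that is not normally distributed
    spearman: a rank correlation for monotonic, possibly nonlinear relationships and data that is not normally distributed
min_periods: minimum number of observations required per pair of columns
Returns: a DataFrame of correlation coefficients between the columns.
'''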
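To see why the choice of method matters, consider a relationship that is perfectly monotonic but not linear. The sketch below uses a made-up frame with y = x³: Spearman, which works on ranks, reports a perfect correlation, while Pearson reports something lower.

```python
import pandas as pd

# Made-up data: y increases with x, but along a curve rather than a line.
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

Since many columns in this dataset are counts or encoded labels rather than linear measurements, a rank-based method is a reasonable default for the analysis that follows.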

Before that, the data types have to be converted. You can define the mapping yourself, for example mapping months to numbers:

Or, for the hotel column:
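A hand-written mapping of this kind might look like the sketch below. The dictionaries `month_to_number` and `hotel_to_number` are hypothetical names, and the codes 0 and 1 for the hotels are an arbitrary choice for illustration; the rows are made up.

```python
import pandas as pd

ordered_months = ["January", "February", "March", "April", "May", "June",
                  "July", "August", "September", "October", "November",
                  "December"]

# January -> 1, ..., December -> 12
month_to_number = {name: i for i, name in enumerate(ordered_months, start=1)}
# Arbitrary codes chosen for illustration.
hotel_to_number = {"City Hotel": 0, "Resort Hotel": 1}

# Made-up rows for illustration.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "arrival_date_month": ["July", "December", "January"],
})

toy["arrival_date_month"] = toy["arrival_date_month"].map(month_to_number)
toy["hotel"] = toy["hotel"].map(hotel_to_number)
print(toy)
```

An explicit mapping keeps the calendar order of the months, which an alphabetical encoding would lose.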

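Alternatively, LabelEncoder() can convert the values into consecutive integer codes, that is, it numbers non-consecutive numbers or text labels.

For example: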

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(['a', 'c', 'kk', 'ss', 'a'])

Output:

array([0, 1, 2, 3, 0], dtype=int64)

Next, convert the object columns:

List the columns that need converting:
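One way to list the non-numeric columns is `select_dtypes`. The sketch below runs on a made-up frame (an assumption for illustration); on the real data the same call would be applied to `data`.

```python
import pandas as pd

# Made-up rows that only mimic a few columns of the dataset.
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel"],
    "lead_time": [10, 200],
    "meal": ["BB", "HB"],
    "adr": [95.0, 120.5],
})

# Everything that is not numeric still needs to be encoded.
columns_to_convert = toy.select_dtypes(exclude="number").columns.tolist()
print(columns_to_convert)
```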

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data_copy = data.copy()
data_copy['agent'] = data_copy['agent'].astype(int)
data_copy['country'] = data_copy['country'].astype(str)

# Encode each column as integer labels; fit_transform refits the encoder for every column
columns_to_encode = [
    'hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',
    'distribution_channel', 'is_repeated_guest', 'reserved_room_type',
    'assigned_room_type', 'deposit_type', 'agent', 'customer_type',
    'reservation_status',
]
for column in columns_to_encode:
    data_copy[column] = le.fit_transform(data_copy[column])

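import numpy as np

# reservation_status_date is still a string column; recent pandas versions raise
# an error on non-numeric columns, so keep only the numeric ones
data_corr = data_copy.select_dtypes('number').corr(method='spearman')
np.abs(data_corr['is_canceled']).sort_values(ascending=False)  # descending order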

Result:

is_canceled                       1.000000
reservation_status                0.942691
deposit_type                      0.477061
lead_time                         0.316635
previous_cancellations            0.270233
country                           0.260023
total_of_special_requests         0.258520
required_car_parking_spaces       0.197397
assigned_room_type                0.188455
booking_changes                   0.185107
distribution_channel              0.173662
hotel                             0.136531
previous_bookings_not_canceled    0.115354
customer_type                     0.099269
days_in_waiting_list              0.098237
is_repeated_guest                 0.084793
reserved_room_type                0.067462
adults                            0.067027
adr                               0.050876
stays_in_week_nights              0.041418
babies                            0.034306
market_segment                    0.026324
agent                             0.024613
arrival_date_year                 0.018066
meal                              0.014453
arrival_date_week_number          0.007589
arrival_date_day_of_month         0.006142
stays_in_weekend_nights           0.004106
children                          0.002803
arrival_date_month                0.001408
Name: is_canceled, dtype: float64

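reservation_status is the final status of the booking, so it carries essentially the same information as is_canceled. Setting it aside, the most strongly correlated features are: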

deposit_type 0.477061

lead_time 0.316635

previous_cancellations 0.270233

country 0.260023

total_of_special_requests 0.258520

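The effect of deposits was discussed above and may reflect a data problem, since non-refundable bookings show a cancellation rate as high as 99%.

Lead time in days:

Number of previous cancellations:

Share of bookings by country: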

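With a prediction model in place, the hotel can learn in advance which customers are likely to cancel and act in time.

For example, it can contact customers with a high cancellation probability early and, through that conversation, get them to cancel as soon as possible, leaving the hotel more time to resell the room.

Alternatively, it can reach out to customers who are inclined to cancel, highlight the hotel's strengths, and offer incentives to stay, to retain customers who might otherwise be lost.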
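As a rough illustration of what such a prediction model could look like, here is a minimal logistic regression sketch with scikit-learn. Everything about the data is an assumption: the rows are randomly generated, the labelling rule is invented, and only the column names are borrowed from the features ranked above. It is not the model behind any figure in this article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic features; only the column names come from the dataset.
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 400, n),
    "previous_cancellations": rng.integers(0, 3, n),
    "total_of_special_requests": rng.integers(0, 4, n),
})

# Invented labelling rule: longer lead times and earlier cancellations raise
# the cancellation probability, special requests lower it.
score = (0.01 * toy["lead_time"]
         + 1.0 * toy["previous_cancellations"]
         - 0.8 * toy["total_of_special_requests"]
         - 1.5)
prob = 1 / (1 + np.exp(-score))
toy["is_canceled"] = (rng.random(n) < prob).astype(int)

X = toy.drop(columns="is_canceled")
y = toy["is_canceled"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
# Probability of cancellation for each held-out booking.
cancel_probability = model.predict_proba(X_test)[:, 1]
print(f"accuracy: {accuracy:.3f}")
```

The predicted probabilities, rather than the hard 0/1 labels, are what a hotel would sort by when deciding which customers to contact first.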

A quick analysis tool

If you want a shortcut, the few lines below generate a first-pass analysis in one go:

import seaborn as sns
import pandas as pd
import pandas_profiling as pp  # newer releases of this package are named ydata-profiling
import matplotlib.pyplot as plt

# data is the DataFrame loaded at the start
report = pp.ProfileReport(data)
report

Save it as an HTML file:

report.to_file('report.html')

Some of the results:

Heatmap: