EDA and Predictive Analysis of Hotel Booking Demand Datasets
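Dataset description: The dataset contains booking information for a city hotel and a resort hotel, including when the booking was made, length of stay, number of guests (adults, children, babies), and the number of available parking spaces.
Analysis goals: 1. how booking volume varies with arrival time; 2. how rental income varies over time; 3. predicting whether a booking will be canceled.
Data source: https://www.kaggle.com/jessemostipak/hotel-booking-demand
The following uses Python to carry out exploratory data analysis (Exploratory Data Analysis) and predictive analysis (Predictive Analysis) on this dataset:
I. Data Preparation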
import os
import zipfile
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Suppress warnings
warnings.filterwarnings('ignore')
# Configure fonts so that Chinese labels and minus signs render correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Render plots inline in the notebook
%matplotlib inline
# Set the plot style
plt.style.use('ggplot')
# Dataset directory
dataset_path = './'
# Name of the zip archive
zip_filename = 'archive_4.zip'
# Path to the zip archive
zip_filepath = os.path.join(dataset_path, zip_filename)
# Extract the dataset from the zip archive
with zipfile.ZipFile(zip_filepath) as zf:
    dataset_filename = zf.namelist()[0]
    # Path to the dataset file (the first member of the zip)
    dataset_filepath = os.path.join(dataset_path, dataset_filename)
    # Extract everything into the dataset directory
    print("Extracting zip...")
    zf.extractall(path=dataset_path)
    print("Done.")
Extracting zip...
Done.
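As a side note, pandas can read a CSV straight out of an open zip member, so extracting to disk is optional. The sketch below is a minimal illustration under that assumption; the helper name `read_first_csv_from_zip` and the tiny in-memory archive are hypothetical and stand in for `archive_4.zip`.

```python
import io
import zipfile

import pandas as pd


def read_first_csv_from_zip(zip_path_or_buffer):
    """Load the first member of a zip archive into a DataFrame without extracting it."""
    with zipfile.ZipFile(zip_path_or_buffer) as zf:
        first_member = zf.namelist()[0]
        with zf.open(first_member) as member:
            return pd.read_csv(member)


# Demo with a tiny in-memory archive (hypothetical contents, not the real dataset)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo.csv", "hotel,is_canceled\nCity Hotel,0\nResort Hotel,1\n")
buffer.seek(0)

demo_df = read_first_csv_from_zip(buffer)
print(demo_df.shape)
```

The same helper should accept a file path such as the `zip_filepath` built above, since `zipfile.ZipFile` takes either a path or a file-like object.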
# Load the dataset
df_data = pd.read_csv(dataset_filepath)
# Show basic information about the dataset
print('Dataset info:')
df_data.info()
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
arrival_date_year                 119390 non-null int64
arrival_date_month                119390 non-null object
arrival_date_week_number          119390 non-null int64
arrival_date_day_of_month         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
country                           118902 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 non-null object
is_repeated_guest                 119390 non-null int64
previous_cancellations            119390 non-null int64
previous_bookings_not_canceled    119390 non-null int64
reserved_room_type                119390 non-null object
assigned_room_type                119390 non-null object
booking_changes                   119390 non-null int64
deposit_type                      119390 non-null object
agent                             103050 non-null float64
company                           6797 non-null float64
days_in_waiting_list              119390 non-null int64
customer_type                     119390 non-null object
adr                               119390 non-null float64
required_car_parking_spaces       119390 non-null int64
total_of_special_requests         119390 non-null int64
reservation_status                119390 non-null object
reservation_status_date           119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
The dataset has 32 columns and 119390 rows. Some columns contain missing values, and 'reservation_status_date' needs to be converted to a date type.
# Preview the data
print('Data preview:')
df_data.head()
Data preview:
5 rows × 32 columns
II. Data Cleaning
# Count missing values per column
print('Columns with missing values:')
df_data.isnull().sum()[df_data.isnull().sum() != 0]
Columns with missing values:
children         4
country        488
agent        16340
company     112593
dtype: int64
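To put these counts in proportion, the short sketch below divides each count by the 119390 rows reported by `info()`. The numbers are copied from the outputs above; the variable names are only illustrative. On the real DataFrame, `df_data.isnull().mean()` should give the same ratios directly.

```python
# Counts copied from the outputs above
total_rows = 119390
missing_counts = {"children": 4, "country": 488, "agent": 16340, "company": 112593}

# Share of rows with a missing value, per column
missing_ratio = {col: n / total_rows for col, n in missing_counts.items()}

for col, ratio in sorted(missing_ratio.items(), key=lambda kv: kv[1]):
    print(f"{col:<10}{ratio:.2%}")
```

Seen this way, 'company' is missing in the large majority of rows while 'children' is missing in only a handful, which is consistent with dropping the column in one case and the rows in the other.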
# 'children' has very few missing values, so drop those rows
df_data.dropna(subset=['children'], inplace=True)
# 'company' and 'agent' have many missing values, so drop both columns
df_data.drop(['company'], axis=1, inplace=True)
df_data.drop(['agent'], axis=1, inplace=True)
# Plot the 20 most frequent 'country' values
df_data['country'].value_counts().head(20).plot.bar()
[Figure: bar chart of the 20 most frequent 'country' values]
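# Fill missing 'country' values with the mode
# (assigning the result back avoids relying on chained in-place modification)
df_data['country'] = df_data['country'].fillna(df_data['country'].mode()[0])
# Count the columns that still contain missing values
print('Number of columns with missing values:')
df_data.isnull().sum()[df_data.isnull().sum() != 0].count()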
Number of columns with missing values:
0
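# Convert 'reservation_status_date' to datetime64[ns]
df_data['reservation_status_date'] = pd.to_datetime(df_data['reservation_status_date'], format='%Y-%m-%d')
# Reset the index
# (note: reset_index returns a new DataFrame and the result is not assigned here,
# so df_data keeps its original index, as the info() output confirms)
df_data.reset_index(drop=True)
# Show dataset info after cleaning
print('Dataset info after cleaning:')
df_data.info()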
Dataset info after cleaning:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 119386 entries, 0 to 119389
Data columns (total 30 columns):
hotel                             119386 non-null object
is_canceled                       119386 non-null int64
lead_time                         119386 non-null int64
arrival_date_year                 119386 non-null int64
arrival_date_month                119386 non-null object
arrival_date_week_number          119386 non-null int64
arrival_date_day_of_month         119386 non-null int64
stays_in_weekend_nights           119386 non-null int64
stays_in_week_nights              119386 non-null int64
adults                            119386 non-null int64
children                          119386 non-null float64
babies                            119386 non-null int64
meal                              119386 non-null object
country                           119386 non-null object
market_segment                    119386 non-null object
distribution_channel              119386 non-null object
is_repeated_guest                 119386 non-null int64
previous_cancellations            119386 non-null int64
previous_bookings_not_canceled    119386 non-null int64
reserved_room_type                119386 non-null object
assigned_room_type                119386 non-null object
booking_changes                   119386 non-null int64
deposit_type                      119386 non-null object
days_in_waiting_list              119386 non-null int64
customer_type                     119386 non-null object
adr                               119386 non-null float64
required_car_parking_spaces       119386 non-null int64
total_of_special_requests         119386 non-null int64
reservation_status                119386 non-null object
reservation_status_date           119386 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(16), object(11)
memory usage: 28.2+ MB
III. Data Analysis
1. How bookings vary with arrival time
# Build an 'arrival_date' column and convert it to datetime64[ns]
df_data['arrival_date'] = (df_data['arrival_date_year'].astype('str') + '-'
                           + df_data['arrival_date_month'].astype('str') + '-'
                           + df_data['arrival_date_day_of_month'].astype('str'))
df_data['arrival_date'] = df_data['arrival_date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%B-%d'))
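The row-by-row `strptime` call above can also be expressed with `pd.to_datetime`, which parses the whole column at once. The sketch below shows this on a toy frame; the frame and its values are illustrative, and only the three column names come from the dataset.

```python
import pandas as pd

# Toy frame with the same three columns as the dataset (values are illustrative)
toy = pd.DataFrame({
    "arrival_date_year": [2016, 2016],
    "arrival_date_month": ["July", "December"],
    "arrival_date_day_of_month": [1, 31],
})

date_text = (
    toy["arrival_date_year"].astype(str) + "-"
    + toy["arrival_date_month"] + "-"
    + toy["arrival_date_day_of_month"].astype(str)
)
# %B parses the full English month name, matching the strptime format used above
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")
print(toy["arrival_date"].dt.month.tolist())
```

On a column of roughly 119 thousand rows the vectorized call is usually faster than `apply`, though the result should be the same.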
(1) By month
# Map month names to month numbers in a new column 'arrival_date_month_code'
df_data['arrival_date_month_code'] = df_data['arrival_date_month'].map({
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12})
# Keep only bookings that were not canceled
df1 = df_data[df_data['is_canceled'] == 0]
# List the year/month combinations covered by the data
df1.drop_duplicates(['arrival_date_year', 'arrival_date_month'])[['arrival_date_year', 'arrival_date_month']]
As the output shows, the data covers the second half of 2015, all of 2016, and part of 2017, so 2016 is the only complete calendar year.
# Keep only the complete year, 2016
df1 = df1[df1['arrival_date_year'] == 2016]
# Count 2016 bookings per month for each hotel
grouped_df1_month = df1.pivot_table(values='arrival_date', index='arrival_date_month_code', columns='hotel', aggfunc='count')
# Plot
grouped_df1_month.plot(kind='bar', title='Monthly bookings by hotel, 2016')
[Figure: bar chart of monthly bookings by hotel, 2016]
雾泛席窥:
1、舶漾洼矢慰寄实亮贼络辅沙飘乓鲤嗜栓酒股奈;
2、卡逸拷卧狂琉凌肿朗丁阐递玻招宴许开,3桂椰10宁使寂保檐吁端贿甜,1汁、6炫、7萍铡椭沦芳呜;
3、卧眉孵妨5霎-10劳笨肄府秫楚险仆训,1琳柿顾彼灾蝉,讳锹靡音匿库亭擂趴僧袖胚讨纹泻湘祷。
(2) By day of the week
# Add an 'arrival_weekday' column (ISO weekday: 1 = Monday, 7 = Sunday)
df_data['arrival_weekday'] = df_data.arrival_date.map(lambda x: x.isoweekday())
# Keep only bookings that were not canceled
df2 = df_data[df_data['is_canceled'] == 0]
# Keep only the complete year, 2016
df2 = df2[df2['arrival_date_year'] == 2016]
# Count 2016 bookings per weekday for each hotel
grouped_df2_weekday = df2.pivot_table(values='arrival_date', index='arrival_weekday', columns='hotel', aggfunc='count')
# Count how many times each weekday occurs among the 2016 arrival dates
num_weekday = df2.drop_duplicates(['arrival_date'])['arrival_date'].reset_index(drop=True) \
    .map(lambda x: dt.date.isoweekday(x)).value_counts()
# Divide by the weekday counts to get the average daily bookings for City Hotel and Resort Hotel
grouped_df2_weekday.div(num_weekday, axis=0).plot(kind='bar', title='Average daily bookings by weekday and hotel, 2016')
[Figure: bar chart of average daily bookings by weekday and hotel, 2016]
朋揩观子:
1、钥黔豫匠驯奢埃危市黄瞧捅贤泰体拣,欢雅讨漠斤榛妙燕拘衡实畔座积漱狡秀横哭篱;
2、缺刑翔邀羞徙曙嗓雅认、诵、勒涤评娘羹帽歌蚤倒格,纷扯珍撬跑招切却;
3、甥术苦敦跨叙锹兜阳了如铐病镶尿,滞烟示猿遮萨,戏踢娶徒砚彬壳栅基冲眼。
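The weekday comparison divides each weekday's total by the number of times that weekday occurs, because weekdays do not all occur equally often in a year. The standard-library sketch below counts the occurrences for 2016, assuming every calendar day is represented; the per-weekday totals in `toy_totals` are hypothetical and only illustrate the division.

```python
import datetime as dt
from collections import Counter

# Count how many times each ISO weekday (1 = Monday ... 7 = Sunday) occurs in 2016
start = dt.date(2016, 1, 1)
days_in_2016 = (dt.date(2017, 1, 1) - start).days
weekday_counts = Counter(
    (start + dt.timedelta(days=i)).isoweekday() for i in range(days_in_2016)
)

# Hypothetical yearly totals per weekday, used only to illustrate the division
toy_totals = {1: 5200, 5: 5300}
daily_average = {wd: toy_totals[wd] / weekday_counts[wd] for wd in toy_totals}

print(dict(sorted(weekday_counts.items())))
print(daily_average)
```

In this toy case the two weekdays have different raw totals but the same daily average, which is the distortion the normalization is meant to remove.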
2. How rental income varies over time
# Add a 'total_rental_income' column: total nights stayed times the average daily rate
df_data['total_rental_income'] = (df_data['stays_in_weekend_nights'] + df_data['stays_in_week_nights']) * df_data['adr']
# Keep only bookings that were not canceled
df3 = df_data[df_data['is_canceled'] == 0]
(1) By year
# Total rental income per year for each hotel
df3.pivot_table(values='total_rental_income', index='arrival_date_year', columns='hotel', aggfunc='sum') \
    .plot(kind='bar', title='Annual rental income by hotel')
[Figure: bar chart of annual rental income by hotel]
去搓理家:
1、2015坏鸦颂裙染值电巩徙玫秦顿征疾剔蓉叁乎,2016艰虫俭算脆2017圃要一太柱娃颖酗评盖韧圈栏剧个躁邢尝;
2、2017综粗袭虑兆凌诸脾2016要吝鼠晤喉敏卧升啸。
(2) By month
# Keep only the complete year, 2016
df3 = df3[df3['arrival_date_year'] == 2016]
# Total 2016 rental income per month for each hotel
df3.pivot_table(values='total_rental_income', index='arrival_date_month_code', columns='hotel', aggfunc='sum') \
    .plot(kind='bar', title='Monthly rental income by hotel, 2016')
[Figure: bar chart of monthly rental income by hotel, 2016]
扼沮章尸:
1、睛妓伤艰脓7内办8眶穷兴耽贵豫艺拣敏颁火,支徊合咏涕岭梆药幌岸拇;
2、8蝠慈遮施皇墨暂畜陡犀矢陡,翔鳞普界着脚麦鳖令补劲唤据增冈诺突夹伸奄慌乍。
# Plot the 2016 monthly mean adr (average daily rate) for each hotel
df3.pivot_table(values='adr', index='arrival_date_month_code', columns='hotel', aggfunc='mean').plot(title='Monthly mean adr by hotel, 2016')
[Figure: line chart of monthly mean adr by hotel, 2016]
衬招乒供,呐喝芯梁很权侧宛毫汹档产壳嘿帝丹各坐辐,莺豹垛秀肮茂8浓萌拐万谦即梁瞪,扬滴巩8锤狂豌野让隘棱崔好碑麻氛。
IV. Predicting Whether a Booking Will Be Canceled
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
# Check the class balance of the target
# (explicit column names keep this working across pandas versions)
is_canceled = df_data['is_canceled'].value_counts().rename_axis('label').reset_index(name='count')
plt.figure(figsize=(6, 6))
plt.title('Booking cancellations\n(not canceled: 0, canceled: 1)')
sns.set_color_codes("pastel")
sns.barplot(x='label', y='count', data=is_canceled)
[Figure: bar chart of booking counts by cancellation status (not canceled: 0, canceled: 1)]
Feature selection
df_data_sel = df_data.copy(deep=True)
df_data_sel.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date', 'arrival_date',
       'arrival_date_month_code', 'arrival_weekday', 'total_rental_income'],
      dtype='object')
# Drop the columns that were added during the analysis
df_data_sel.drop(['total_rental_income', 'arrival_date', 'arrival_date_month_code', 'arrival_weekday'],
                 axis=1, inplace=True)
# Drop further columns that will not be used as features
df_data_sel.drop(['arrival_date_year', 'arrival_date_month', 'arrival_date_week_number',
                  'arrival_date_day_of_month', 'assigned_room_type', 'reservation_status_date',
                  "days_in_waiting_list"],
                 axis=1, inplace=True)
# Absolute correlation of each numeric column with "is_canceled"
# (restricting to numeric columns avoids errors on object columns in newer pandas)
df_data_sel.select_dtypes(include='number').corr()['is_canceled'].abs().sort_values(ascending=False)
is_canceled                       1.000000
lead_time                         0.293177
total_of_special_requests         0.234706
required_car_parking_spaces       0.195492
booking_changes                   0.144371
previous_cancellations            0.110140
is_repeated_guest                 0.084788
adults                            0.059990
previous_bookings_not_canceled    0.057355
adr                               0.047622
babies                            0.032488
stays_in_week_nights              0.024771
children                          0.005048
stays_in_weekend_nights           0.001783
Name: is_canceled, dtype: float64
# Based on the correlations, select the more strongly correlated numeric features
num_features = ["lead_time", "total_of_special_requests", "required_car_parking_spaces",
                "booking_changes", "previous_cancellations", "is_repeated_guest",
                "adults", "previous_bookings_not_canceled", "adr"]
# Standardize each numeric feature
for n in num_features:
    df_data_sel[n] = StandardScaler().fit_transform(df_data_sel[n].values.reshape(-1, 1)).ravel()
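The loop above standardizes each feature using statistics from the full dataset, before the train/test split. A common variation is to estimate the mean and standard deviation on the training portion only and reuse them for the test portion, so that no test-set information reaches the model. The NumPy sketch below illustrates that idea on synthetic data; the data, the split point, and the variable names are assumptions, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one numeric feature such as lead_time
values = rng.exponential(scale=100.0, size=1000)
train, test = values[:700], values[700:]

# Estimate the scaling parameters on the training part only
mean = train.mean()
std = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(round(float(train_scaled.std()), 6))
```

With scikit-learn the same pattern would be `fit` on the training rows followed by `transform` on the test rows, applied after `train_test_split` rather than before it.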
# Inspect the categorical (object-typed) columns
df_data_sel.select_dtypes(include=object).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 119386 entries, 0 to 119389
Data columns (total 9 columns):
hotel                   119386 non-null object
meal                    119386 non-null object
country                 119386 non-null object
market_segment          119386 non-null object
distribution_channel    119386 non-null object
reserved_room_type      119386 non-null object
deposit_type            119386 non-null object
customer_type           119386 non-null object
reservation_status      119386 non-null object
dtypes: object(9)
memory usage: 9.1+ MB
# Check how "reservation_status" relates to the target "is_canceled"
df_data_sel.groupby("is_canceled")["reservation_status"].value_counts()
is_canceled  reservation_status
0            Check-Out             75166
1            Canceled              43013
             No-Show                1207
Name: reservation_status, dtype: int64
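The counts above show that each "reservation_status" value occurs with exactly one "is_canceled" value, so the column would hand the answer to any model that saw it. The sketch below shows one way to check for this kind of leakage on a toy frame; the rows are illustrative and `targets_per_status` is a hypothetical name.

```python
import pandas as pd

# Toy rows mirroring the three status values seen in the output above
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 1],
    "reservation_status": ["Check-Out", "Check-Out", "Canceled", "Canceled", "No-Show"],
})

# If every status maps to exactly one target value, the column gives the answer away
targets_per_status = toy.groupby("reservation_status")["is_canceled"].nunique()
is_leaky = bool((targets_per_status == 1).all())
print(is_leaky)
```

This is consistent with "reservation_status" being left out of the categorical features chosen for the models.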
# Select the categorical features
cat_features = ["hotel", "meal", "market_segment", "distribution_channel",
                "reserved_room_type", "deposit_type", "customer_type"]
# One-hot encode the categorical features
df_data_dum = pd.get_dummies(df_data_sel[cat_features])
# Build the feature matrix and the target vector
X = pd.concat([df_data_sel[num_features], df_data_dum], axis=1).values
y = df_data_sel["is_canceled"].values
# Hold out 30% as the test set; the rest is the training set
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.30, stratify=y, random_state=1)
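For readers unfamiliar with `pd.get_dummies`, the sketch below shows what it produces for two of the categorical columns. The toy rows are illustrative; only the column names and category labels are taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "meal": ["BB", "HB", "BB"],
})

# Each category becomes its own indicator column named <column>_<category>
toy_dum = pd.get_dummies(toy[["hotel", "meal"]])
print(list(toy_dum.columns))
```

Every original row ends up with exactly one active indicator per source column, which is what lets the classifiers treat unordered categories as numeric inputs.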
# Build the classifiers
classifiers = [
    LogisticRegression(),
    RandomForestClassifier(random_state=1, criterion='gini'),
    KNeighborsClassifier(metric='minkowski'),
]
# Classifier names
classifier_names = [
    'LogisticRegression',
    'RandomForestClassifier',
    'KNeighborsClassifier',
]
# Print evaluation metrics computed from a confusion matrix
def show_metrics(cm):
    tp = cm[1, 1]
    fn = cm[1, 0]
    fp = cm[0, 1]
    tn = cm[0, 0]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print('Precision: {:.3f}'.format(precision))
    print('Recall: {:.3f}'.format(recall))
    print('F1 score: {:.3f}'.format(2 * (precision * recall) / (precision + recall)))
# Train each classifier and evaluate it on the test set
for model, model_name in zip(classifiers, classifier_names):
    clf = model
    clf.fit(train_x, train_y)
    predict_y = clf.predict(test_x)
    # Confusion matrix (rows: true class, columns: predicted class)
    cm = confusion_matrix(test_y, predict_y)
    # Print the model name and its metrics
    print(model_name + ":")
    show_metrics(cm)
LogisticRegression:
Precision: 0.844
Recall: 0.598
F1 score: 0.700
RandomForestClassifier:
Precision: 0.814
Recall: 0.730
F1 score: 0.769
KNeighborsClassifier:
Precision: 0.789
Recall: 0.720
F1 score: 0.753
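The three metrics printed above all derive from the confusion-matrix counts for the positive class (canceled bookings). The sketch below restates the formulas as a small standalone function; the function name and the counts passed to it are hypothetical and are not taken from the models above.

```python
def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp)  # share of predicted cancellations that were real
    recall = tp / (tp + fn)     # share of real cancellations that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts for illustration only
precision, recall, f1 = classification_metrics(tp=60, fp=10, fn=40)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")
```

Because F1 is a harmonic mean, it stays close to the lower of precision and recall, which is why a model with high precision but low recall still gets a modest F1 score.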
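Comparing the predictions of the 3 models:
1. Logistic regression has the highest precision (0.844), so the bookings it flags as cancellations are the most reliable, but its recall is the lowest at only 0.598, meaning it misses a large share of the bookings that were actually canceled;
2. Random forest has the highest F1 score (0.769) and the best overall balance, slightly ahead of KNN. Compared with logistic regression, its precision is only 0.030 lower while its recall is 0.132 higher.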