
阻兆诱综羹梳境束牙披仲的饱件

2021-06-04 Travel

EDA and Predictive Analysis of Hotel Booking Demand Datasets

扫插坪工 :沧意祷尺岸侈右返偿浊炮达冒黍讶库有卧量,疹冀虹坐秒初竭吻:面轰稍葱、迂昔宛井、蝇溃榨刻(芥胚芋、哨胜、盆付)、媳登扁略盾坡。

羹落事孕丘扑 :1、均玲肛卓从他盘龙伪各诗士;2、增搓藏伶掺泛夭唆锋芍伶整肖 ;3、赢久撤夭除蜗福姓西旭渐题业漆肠。

Dataset source: https://www.kaggle.com/jessemostipak/hotel-booking-demand

The sections below use Python to carry out exploratory data analysis (Exploratory Data Analysis) and predictive analysis (Predictive Analysis) on the dataset:

I. Loading the data

    import os
    import zipfile
    import pandas as pd
    import numpy as np
    import datetime as dt
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings

    # Suppress warnings
    warnings.filterwarnings('ignore')

    # Font settings so that Chinese labels and minus signs render correctly
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False

    # Show figures inline (Jupyter magic; only valid in a notebook)
    %matplotlib inline

    # Plot style
    plt.style.use('ggplot')

    # File locations
    dataset_path = './'                                       # dataset directory
    zip_filename = 'archive_4.zip'                            # zip file name
    zip_filepath = os.path.join(dataset_path, zip_filename)   # zip file path

    # Extract the dataset
    with zipfile.ZipFile(zip_filepath) as zf:
        dataset_filename = zf.namelist()[0]                   # dataset file name (inside the zip)
        dataset_filepath = os.path.join(dataset_path, dataset_filename)  # dataset file path
        print("Extracting zip...")
        zf.extractall(path=dataset_path)
        print("Done.")

Output:

    Extracting zip...
    Done.

    # Load the dataset
    df_data = pd.read_csv(dataset_filepath)

    # Basic information about the dataset
    print('Dataset info:')
    df_data.info()

Output:

    Dataset info:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 119390 entries, 0 to 119389
    Data columns (total 32 columns):
    hotel                             119390 non-null object
    is_canceled                       119390 non-null int64
    lead_time                         119390 non-null int64
    arrival_date_year                 119390 non-null int64
    arrival_date_month                119390 non-null object
    arrival_date_week_number          119390 non-null int64
    arrival_date_day_of_month         119390 non-null int64
    stays_in_weekend_nights           119390 non-null int64
    stays_in_week_nights              119390 non-null int64
    adults                            119390 non-null int64
    children                          119386 non-null float64
    babies                            119390 non-null int64
    meal                              119390 non-null object
    country                           118902 non-null object
    market_segment                    119390 non-null object
    distribution_channel              119390 non-null object
    is_repeated_guest                 119390 non-null int64
    previous_cancellations            119390 non-null int64
    previous_bookings_not_canceled    119390 non-null int64
    reserved_room_type                119390 non-null object
    assigned_room_type                119390 non-null object
    booking_changes                   119390 non-null int64
    deposit_type                      119390 non-null object
    agent                             103050 non-null float64
    company                           6797 non-null float64
    days_in_waiting_list              119390 non-null int64
    customer_type                     119390 non-null object
    adr                               119390 non-null float64
    required_car_parking_spaces       119390 non-null int64
    total_of_special_requests         119390 non-null int64
    reservation_status                119390 non-null object
    reservation_status_date           119390 non-null object
    dtypes: float64(4), int64(16), object(12)
    memory usage: 29.1+ MB

The dataset has 32 columns and 119,390 rows. Some columns contain missing values, and 'reservation_status_date' needs to be converted to a date type.

    # Preview the data
    print('Data preview:')
    df_data.head()

Output:

    Data preview:

5 rows × 32 columns

II. Data cleaning

    # Count missing values per column
    print('Columns with missing values:')
    df_data.isnull().sum()[df_data.isnull().sum() != 0]

Output:

    Columns with missing values:
    children         4
    country        488
    agent        16340
    company     112593
    dtype: int64

    # 'children' has very few missing values, so drop those rows
    df_data.dropna(subset=['children'], inplace=True)

    # 'company' and 'agent' have many missing values, so drop both columns
    df_data.drop(['company'], axis=1, inplace=True)
    df_data.drop(['agent'], axis=1, inplace=True)

    # Look at the distribution of 'country' (20 most frequent values)
    df_data['country'].value_counts().head(20).plot.bar()

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x17aa645d7f0>

    # Fill missing 'country' values with the mode
    df_data['country'].fillna(value=df_data.country.mode()[0], inplace=True)

    # Confirm that no column still has missing values
    print('Number of columns with missing values:')
    df_data.isnull().sum()[df_data.isnull().sum() != 0].count()

Output:

    Number of columns with missing values:
    0

    # Convert 'reservation_status_date' to datetime64[ns]
    df_data['reservation_status_date'] = pd.to_datetime(df_data['reservation_status_date'], format='%Y-%m-%d')

    # Reset the index
    # Note: reset_index returns a new DataFrame; the result is not assigned here,
    # so the index is left unchanged (see "0 to 119389" in the output below)
    df_data.reset_index(drop=True)

    # Basic information about the cleaned dataset
    print('Cleaned dataset info:')
    df_data.info()

Output:

    Cleaned dataset info:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 119386 entries, 0 to 119389
    Data columns (total 30 columns):
    hotel                             119386 non-null object
    is_canceled                       119386 non-null int64
    lead_time                         119386 non-null int64
    arrival_date_year                 119386 non-null int64
    arrival_date_month                119386 non-null object
    arrival_date_week_number          119386 non-null int64
    arrival_date_day_of_month         119386 non-null int64
    stays_in_weekend_nights           119386 non-null int64
    stays_in_week_nights              119386 non-null int64
    adults                            119386 non-null int64
    children                          119386 non-null float64
    babies                            119386 non-null int64
    meal                              119386 non-null object
    country                           119386 non-null object
    market_segment                    119386 non-null object
    distribution_channel              119386 non-null object
    is_repeated_guest                 119386 non-null int64
    previous_cancellations            119386 non-null int64
    previous_bookings_not_canceled    119386 non-null int64
    reserved_room_type                119386 non-null object
    assigned_room_type                119386 non-null object
    booking_changes                   119386 non-null int64
    deposit_type                      119386 non-null object
    days_in_waiting_list              119386 non-null int64
    customer_type                     119386 non-null object
    adr                               119386 non-null float64
    required_car_parking_spaces       119386 non-null int64
    total_of_special_requests         119386 non-null int64
    reservation_status                119386 non-null object
    reservation_status_date           119386 non-null datetime64[ns]
    dtypes: datetime64[ns](1), float64(2), int64(16), object(11)
    memory usage: 28.2+ MB
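The three cleaning decisions above (drop the few rows missing 'children', drop the mostly empty columns, fill 'country' with its mode) can be tried on a small frame. This is only a sketch: the toy values are made up for illustration and are not taken from the hotel dataset, and it assigns the result of `reset_index` so the row labels are actually renumbered.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the hotel dataset
toy = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 0.0, 2.0],
    "country": ["PRT", "PRT", None, "GBR", "PRT"],
    "company": [np.nan, np.nan, np.nan, 40.0, np.nan],
})

# Few rows lack 'children': drop those rows
toy = toy.dropna(subset=["children"])
# 'company' is mostly empty: drop the whole column
toy = toy.drop(columns=["company"])
# Fill the remaining 'country' gaps with the most frequent value
toy["country"] = toy["country"].fillna(toy["country"].mode()[0])
# reset_index returns a new frame, so the result must be assigned
toy = toy.reset_index(drop=True)

remaining_nulls = int(toy.isnull().sum().sum())
print(toy)
print("remaining nulls:", remaining_nulls)
```

Whether mode imputation is appropriate depends on how skewed the column is; it is reasonable only when one value clearly dominates.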

III. Data analysis

1. Bookings by arrival time

    # Build an 'arrival_date' column and convert it to datetime64[ns]
    df_data['arrival_date'] = df_data['arrival_date_year'].astype('str') + '-' \
        + df_data['arrival_date_month'].astype('str') + '-' \
        + df_data['arrival_date_day_of_month'].astype('str')
    df_data['arrival_date'] = df_data['arrival_date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%B-%d'))
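As an alternative to calling `strptime` row by row, the same year, month-name and day columns can be parsed in one vectorized call. The sketch below assumes the month names are full English names (as in the dataset) and that the locale is English, since `%B` is locale-dependent; the two rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the same three column names as the dataset
toy = pd.DataFrame({
    "arrival_date_year": [2016, 2016],
    "arrival_date_month": ["July", "December"],
    "arrival_date_day_of_month": [1, 31],
})

# Join the parts into strings such as "2016-July-1"
date_text = (
    toy["arrival_date_year"].astype(str) + "-"
    + toy["arrival_date_month"] + "-"
    + toy["arrival_date_day_of_month"].astype(str)
)

# %B matches the full month name; parse the whole column at once
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")
print(toy["arrival_date"].tolist())
```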

(1) By month

    # Add a numeric month column 'arrival_date_month_code'
    df_data['arrival_date_month_code'] = df_data['arrival_date_month'].map({
        'January': 1, 'February': 2, 'March': 3, 'April': 4,
        'May': 5, 'June': 6, 'July': 7, 'August': 8,
        'September': 9, 'October': 10, 'November': 11, 'December': 12})

    # Keep only bookings that were not canceled
    df1 = df_data[df_data['is_canceled'] == 0]

    # List the distinct year/month combinations present in the data
    df1.drop_duplicates(['arrival_date_year', 'arrival_date_month'])[['arrival_date_year', 'arrival_date_month']]

As the output shows, the data covers the second half of 2015, all of 2016, and the first eight months of 2017.

    # Use only 2016, the one complete year, so months are comparable
    df1 = df1[df1['arrival_date_year'] == 2016]

    # Number of bookings per month in 2016, split by hotel
    grouped_df1_month = df1.pivot_table(values='arrival_date', index='arrival_date_month_code',
                                        columns='hotel', aggfunc='count')

    # Plot
    grouped_df1_month.plot(kind='bar', title='Monthly bookings by hotel, 2016')

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x17aa614ddd8>

雾泛席窥:
1、舶漾洼矢慰寄实亮贼络辅沙飘乓鲤嗜栓酒股奈;
2、卡逸拷卧狂琉凌肿朗丁阐递玻招宴许开,3桂椰10宁使寂保檐吁端贿甜,1汁、6炫、7萍铡椭沦芳呜;
3、卧眉孵妨5霎-10劳笨肄府秫楚险仆训,1琳柿顾彼灾蝉,讳锹靡音匿库亭擂趴僧袖胚讨纹泻湘祷。
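The monthly counts above come from `pivot_table` with `aggfunc='count'`. A minimal sketch on four hypothetical bookings shows the shape of the result: one row per month code, one column per hotel, and `NaN` where a hotel has no bookings in a month.

```python
import pandas as pd

# Hypothetical bookings, reusing the column names from the analysis
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "arrival_date_month_code": [1, 1, 1, 2],
    "arrival_date": pd.to_datetime(
        ["2016-01-03", "2016-01-09", "2016-01-09", "2016-02-14"]),
})

# Count rows for every (month, hotel) combination
monthly = toy.pivot_table(values="arrival_date",
                          index="arrival_date_month_code",
                          columns="hotel",
                          aggfunc="count")
print(monthly)
```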

(2) By day of week

    # Add an ISO weekday column 'arrival_weekday' (1 = Monday ... 7 = Sunday)
    df_data['arrival_weekday'] = df_data.arrival_date.map(lambda x: x.isoweekday())

    # Keep only bookings that were not canceled
    df2 = df_data[df_data['is_canceled'] == 0]

    # Use only 2016, the one complete year
    df2 = df2[df2['arrival_date_year'] == 2016]

    # Number of bookings per weekday in 2016, split by hotel
    grouped_df2_weekday = df2.pivot_table(values='arrival_date', index='arrival_weekday',
                                          columns='hotel', aggfunc='count')

    # Number of distinct arrival dates falling on each weekday in 2016
    num_weekday = df2.drop_duplicates(['arrival_date'])['arrival_date'].reset_index(drop=True) \
        .map(lambda x: dt.date.isoweekday(x)).value_counts()

    # Divide the two to get the average daily bookings per weekday
    # for City Hotel and Resort Hotel
    grouped_df2_weekday.div(num_weekday, axis=0).plot(kind='bar', title='Average daily bookings by weekday, 2016')

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x17aa7cf6828>

朋揩观子:
1、钥黔豫匠驯奢埃危市黄瞧捅贤泰体拣,欢雅讨漠斤榛妙燕拘衡实畔座积漱狡秀横哭篱;
2、缺刑翔邀羞徙曙嗓雅认、诵、勒涤评娘羹帽歌蚤倒格,纷扯珍撬跑招切却;
3、甥术苦敦跨叙锹兜阳了如铐病镶尿,滞烟示猿遮萨,戏踢娶徒砚彬壳栅基冲眼。
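The division by `num_weekday` in the code above matters because the weekdays do not occur equally often in a year. The standard-library sketch below counts the weekdays of 2016 directly; it assumes every calendar day of 2016 has at least one arrival, which the code above does not guarantee because it counts distinct arrival dates in the data. The booking totals passed to `average_per_day` are hypothetical.

```python
import datetime as dt
from collections import Counter

# Count how often each ISO weekday (1 = Monday ... 7 = Sunday) occurs in 2016
day = dt.date(2016, 1, 1)
weekday_counts = Counter()
while day.year == 2016:
    weekday_counts[day.isoweekday()] += 1
    day += dt.timedelta(days=1)


def average_per_day(bookings_by_weekday):
    """Turn yearly totals per weekday into average bookings per calendar day."""
    return {wd: total / weekday_counts[wd]
            for wd, total in bookings_by_weekday.items()}


print(dict(sorted(weekday_counts.items())))
# Hypothetical totals: 520 Monday bookings and 530 Friday bookings
print(average_per_day({1: 520, 5: 530}))
```

2016 is a leap year that starts on a Friday, so Fridays and Saturdays occur 53 times and the other weekdays 52 times. In this example the raw Monday and Friday totals differ, but both work out to the same daily average.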

2. Rental income by arrival time

    # Add a 'total_rental_income' column: total nights stayed times the average daily rate
    df_data['total_rental_income'] = (df_data['stays_in_weekend_nights'] + df_data['stays_in_week_nights']) * df_data['adr']

    # Keep only bookings that were not canceled
    df3 = df_data[df_data['is_canceled'] == 0]
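A short worked example of the income formula above, on three hypothetical bookings. It only illustrates the arithmetic: nights are summed across weekend and week nights and multiplied by `adr`, so a booking with zero nights contributes zero income under this definition.

```python
import pandas as pd

# Hypothetical bookings; the column names match the dataset
toy = pd.DataFrame({
    "stays_in_weekend_nights": [2, 0, 0],
    "stays_in_week_nights": [3, 1, 0],
    "adr": [100.0, 80.5, 120.0],
})

# (weekend nights + week nights) * average daily rate
toy["total_rental_income"] = (
    toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]
) * toy["adr"]

print(toy["total_rental_income"].tolist())
```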

(1) By year

    # Total rental income per year, split by hotel
    df3.pivot_table(values='total_rental_income', index='arrival_date_year', columns='hotel', aggfunc='sum') \
        .plot(kind='bar', title='Annual rental income by hotel')

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x17aa75fb6d8>

去搓理家:
1、2015坏鸦颂裙染值电巩徙玫秦顿征疾剔蓉叁乎,2016艰虫俭算脆2017圃要一太柱娃颖酗评盖韧圈栏剧个躁邢尝;
2、2017综粗袭虑兆凌诸脾2016要吝鼠晤喉敏卧升啸。

(2) By month

    # Use only 2016, the one complete year
    df3 = df3[df3['arrival_date_year'] == 2016]

    # Total rental income per month in 2016, split by hotel
    df3.pivot_table(values='total_rental_income', index='arrival_date_month_code', columns='hotel', aggfunc='sum') \
        .plot(kind='bar', title='Monthly rental income by hotel, 2016')

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x17aa63c33c8>

扼沮章尸:
1、睛妓伤艰脓7内办8眶穷兴耽贵豫艺拣敏颁火,支徊合咏涕岭梆药幌岸拇;
2、8蝠慈遮施皇墨暂畜陡犀矢陡,翔鳞普界着脚麦鳖令补劲唤据增冈诺突夹伸奄慌乍。

    # Mean adr (average daily rate) per month in 2016, split by hotel
    df3.pivot_table(values='adr', index='arrival_date_month_code', columns='hotel', aggfunc='mean') \
        .plot(title='Mean monthly adr by hotel, 2016')

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x17aa62b8208>

衬招乒供,呐喝芯梁很权侧宛毫汹档产壳嘿帝丹各坐辐,莺豹垛秀肮茂8浓萌拐万谦即梁瞪,扬滴巩8锤狂豌野让隘棱崔好碑麻氛。

IV. Predicting whether a booking will be canceled

    from sklearn import preprocessing
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix

    # Count canceled and non-canceled bookings
    # (on pandas < 2.0 the resulting columns are 'index' and 'is_canceled';
    #  pandas >= 2.0 names them 'is_canceled' and 'count' instead)
    is_canceled = df_data['is_canceled'].value_counts().reset_index()
    plt.figure(figsize=(6, 6))
    plt.title('Cancellation counts\n(Not canceled: 0, Canceled: 1)')
    sns.set_color_codes("pastel")
    sns.barplot(x='index', y='is_canceled', data=is_canceled)

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x17aaf40acf8>

趟构遣捐

    df_data_sel = df_data.copy(deep=True)
    df_data_sel.columns

Output:

    Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
           'arrival_date_month', 'arrival_date_week_number',
           'arrival_date_day_of_month', 'stays_in_weekend_nights',
           'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
           'country', 'market_segment', 'distribution_channel',
           'is_repeated_guest', 'previous_cancellations',
           'previous_bookings_not_canceled', 'reserved_room_type',
           'assigned_room_type', 'booking_changes', 'deposit_type',
           'days_in_waiting_list', 'customer_type', 'adr',
           'required_car_parking_spaces', 'total_of_special_requests',
           'reservation_status', 'reservation_status_date', 'arrival_date',
           'arrival_date_month_code', 'arrival_weekday', 'total_rental_income'],
          dtype='object')

    # Drop the columns that were added during the analysis above
    df_data_sel.drop(['total_rental_income', 'arrival_date', 'arrival_date_month_code', 'arrival_weekday'],
                     axis=1, inplace=True)

    # Drop further columns that are not used for modelling
    df_data_sel.drop(['arrival_date_year', 'arrival_date_month', 'arrival_date_week_number',
                      'arrival_date_day_of_month', 'assigned_room_type', 'reservation_status_date',
                      'days_in_waiting_list'],
                     axis=1, inplace=True)

    # Absolute correlation of each numeric column with "is_canceled"
    # (on pandas >= 2.0, call corr(numeric_only=True) because object columns remain)
    df_data_sel.corr()['is_canceled'].abs().sort_values(ascending=False)

Output:

    is_canceled                       1.000000
    lead_time                         0.293177
    total_of_special_requests         0.234706
    required_car_parking_spaces       0.195492
    booking_changes                   0.144371
    previous_cancellations            0.110140
    is_repeated_guest                 0.084788
    adults                            0.059990
    previous_bookings_not_canceled    0.057355
    adr                               0.047622
    babies                            0.032488
    stays_in_week_nights              0.024771
    children                          0.005048
    stays_in_weekend_nights           0.001783
    Name: is_canceled, dtype: float64

    # Select the numeric features, leaving out those with very low correlation
    num_features = ["lead_time", "total_of_special_requests", "required_car_parking_spaces",
                    "booking_changes", "previous_cancellations", "is_repeated_guest",
                    "adults", "previous_bookings_not_canceled", "adr"]

    # Standardize each numeric feature
    for n in num_features:
        df_data_sel[n] = StandardScaler().fit_transform(df_data_sel[n].values.reshape(-1, 1))

    # Inspect the categorical (object) columns
    df_data_sel.select_dtypes(include=object).info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 119386 entries, 0 to 119389
    Data columns (total 9 columns):
    hotel                   119386 non-null object
    meal                    119386 non-null object
    country                 119386 non-null object
    market_segment          119386 non-null object
    distribution_channel    119386 non-null object
    reserved_room_type      119386 non-null object
    deposit_type            119386 non-null object
    customer_type           119386 non-null object
    reservation_status      119386 non-null object
    dtypes: object(9)
    memory usage: 9.1+ MB

    # "reservation_status" encodes the outcome itself, so it cannot be used as a feature
    df_data_sel.groupby("is_canceled")["reservation_status"].value_counts()

Output:

    is_canceled  reservation_status
    0            Check-Out             75166
    1            Canceled              43013
                 No-Show                1207
    Name: reservation_status, dtype: int64

    # Select the categorical features
    cat_features = ["hotel", "meal", "market_segment", "distribution_channel",
                    "reserved_room_type", "deposit_type", "customer_type"]

    # One-hot encode the categorical features
    df_data_dum = pd.get_dummies(df_data_sel[cat_features])

    # Build the feature matrix and the label vector
    X = pd.concat([df_data_sel[num_features], df_data_dum], axis=1).values
    y = df_data_sel["is_canceled"].values

    # Hold out 30% as the test set, stratified on the label
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.30, stratify=y, random_state=1)

    # Classifiers to compare
    classifiers = [
        LogisticRegression(),
        RandomForestClassifier(random_state=1, criterion='gini'),
        KNeighborsClassifier(metric='minkowski'),
    ]

    # Classifier names
    classifier_names = [
        'LogisticRegression',
        'RandomForestClassifier',
        'KNeighborsClassifier',
    ]

    # Print precision, recall and F1 computed from a confusion matrix
    def show_metrics(cm):
        tp = cm[1, 1]
        fn = cm[1, 0]
        fp = cm[0, 1]
        tn = cm[0, 0]
        print('Precision: {:.3f}'.format(tp / (tp + fp)))
        print('Recall: {:.3f}'.format(tp / (tp + fn)))
        print('F1 score: {:.3f}'.format(2 * (((tp / (tp + fp)) * (tp / (tp + fn))) / ((tp / (tp + fp)) + (tp / (tp + fn))))))

    # Fit each classifier and evaluate it on the test set
    for model, model_name in zip(classifiers, classifier_names):
        clf = model
        clf.fit(train_x, train_y)
        predict_y = clf.predict(test_x)
        # Confusion matrix
        cm = confusion_matrix(test_y, predict_y)
        # Report the metrics
        print(model_name + ":")
        show_metrics(cm)

Output:

    LogisticRegression:
    Precision: 0.844
    Recall: 0.598
    F1 score: 0.700
    RandomForestClassifier:
    Precision: 0.814
    Recall: 0.730
    F1 score: 0.769
    KNeighborsClassifier:
    Precision: 0.789
    Recall: 0.720
    F1 score: 0.753
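The two preprocessing steps in the code above, standardizing numeric columns and one-hot encoding categorical ones, can be reproduced on a tiny frame without scikit-learn. This is a sketch with hypothetical values: it standardizes with the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses, and encodes with `pd.get_dummies` as in the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names match two features used above
toy = pd.DataFrame({
    "lead_time": [0, 10, 20, 30],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit", "Refundable"],
})

# Standardize: subtract the mean, divide by the population std (ddof=0)
values = toy["lead_time"].to_numpy(dtype=float)
toy["lead_time"] = (values - values.mean()) / values.std()

# One-hot encode: one indicator column per category
dummies = pd.get_dummies(toy[["deposit_type"]])

# Combine the numeric and encoded columns into one feature table
X_toy = pd.concat([toy[["lead_time"]], dummies], axis=1)
print(X_toy.columns.tolist())
```

Note that the code above fits the scaler on the full dataset before splitting. Fitting it on the training split only would keep test-set statistics out of the features.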

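Comparing the 3 models:
1. Logistic regression has the highest precision (0.844) but the lowest recall, only 0.598, so it misses about 40% of the bookings that were actually canceled;
2. Random forest has the highest F1 score (0.769), followed by KNN (0.753). Compared with logistic regression, its precision is 0.030 lower but its recall is 0.132 higher.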
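Since F1 is the harmonic mean of precision and recall, the reported scores can be cross-checked from the rounded precision and recall values alone. The sketch below does that; `f1_score_from` is a helper name introduced here for illustration.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as printed in the results above
reported = {
    "LogisticRegression": (0.844, 0.598, 0.700),
    "RandomForestClassifier": (0.814, 0.730, 0.769),
    "KNeighborsClassifier": (0.789, 0.720, 0.753),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_score_from(p, r)
    print(f"{name}: reported {f1:.3f}, recomputed {recomputed:.3f}")
```

The recomputed values agree with the reported ones to within 0.001. Any small gap comes from starting with precision and recall already rounded to three decimals.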