
阻兆誘綜羹梳境束牙披仲的飽件

2021-06-04 Travel

EDA and Predictive Analysis of Hotel Booking Demand Datasets

掃插坪工 :滄意禱尺岸侈右返償濁炮達冒黍訝庫有臥量,疹冀虹坐秒初竭吻:面轟稍蔥、迂昔宛井、蠅潰榨刻(芥胚芋、哨勝、盆付)、媳登扁略盾坡。

羹落事孕丘撲 :1、均玲肛卓從他盤龍偽各詩士;2、增搓藏伶摻泛夭唆鋒芍伶整肖 ;3、贏久撤夭除蝸福姓西旭漸題業漆腸。

Dataset source: https://www.kaggle.com/jessemostipak/hotel-booking-demand

The following uses Python to carry out an exploratory data analysis (Exploratory Data Analysis) and a predictive analysis (Predictive Analysis) of this dataset:

I. Importing the data

import os
import zipfile
import datetime as dt
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
warnings.filterwarnings('ignore')

# Font settings so that CJK labels and minus signs render correctly in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Show plots inline (Jupyter/IPython magic; not valid in a plain .py script)
%matplotlib inline

# Plot style
plt.style.use('ggplot')

dataset_path = './'                                      # dataset directory
zip_filename = 'archive_4.zip'                           # zip file name
zip_filepath = os.path.join(dataset_path, zip_filename)  # zip file path

# Extract the archive
with zipfile.ZipFile(zip_filepath) as zf:
    dataset_filename = zf.namelist()[0]                              # dataset file name (inside the zip)
    dataset_filepath = os.path.join(dataset_path, dataset_filename)  # dataset file path
    print("Extracting zip...")
    zf.extractall(path=dataset_path)
print("Done.")

Output:

Extracting zip...
Done.

# Load the data
df_data = pd.read_csv(dataset_filepath)

# Show basic information about the dataset
print('Dataset overview:')
df_data.info()

Output:

Dataset overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
arrival_date_year                 119390 non-null int64
arrival_date_month                119390 non-null object
arrival_date_week_number          119390 non-null int64
arrival_date_day_of_month         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
country                           118902 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 non-null object
is_repeated_guest                 119390 non-null int64
previous_cancellations            119390 non-null int64
previous_bookings_not_canceled    119390 non-null int64
reserved_room_type                119390 non-null object
assigned_room_type                119390 non-null object
booking_changes                   119390 non-null int64
deposit_type                      119390 non-null object
agent                             103050 non-null float64
company                           6797 non-null float64
days_in_waiting_list              119390 non-null int64
customer_type                     119390 non-null object
adr                               119390 non-null float64
required_car_parking_spaces       119390 non-null int64
total_of_special_requests         119390 non-null int64
reservation_status                119390 non-null object
reservation_status_date           119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

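The dataset has 32 columns and 119390 rows. Some columns contain missing values, and 'reservation_status_date' is stored as a string and needs to be converted to a date type.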

# Preview the first rows
print('Data preview:')
df_data.head()

Output:

Data preview:

5 rows × 32 columns

II. Data cleaning

# Count missing values per column
print('Columns with missing values:')
df_data.isnull().sum()[df_data.isnull().sum() != 0]

Output:

Columns with missing values:
children         4
country        488
agent        16340
company     112593
dtype: int64

# 'children' has very few missing values, so drop the affected rows
df_data.dropna(subset=['children'], inplace=True)

# 'company' and 'agent' have many missing values, so drop both columns
df_data.drop(['company'], axis=1, inplace=True)
df_data.drop(['agent'], axis=1, inplace=True)

# Plot the 20 most frequent values of 'country'
df_data['country'].value_counts().head(20).plot.bar()

[Figure: bar chart of the 20 most frequent 'country' values]

# Fill the missing 'country' values with the most frequent value (the mode)
df_data['country'].fillna(value=df_data.country.mode()[0], inplace=True)

# Count the columns that still contain missing values
print('Number of columns with missing values:')
df_data.isnull().sum()[df_data.isnull().sum() != 0].count()

Output:

Number of columns with missing values:
0

# Convert 'reservation_status_date' to datetime64[ns]
df_data['reservation_status_date'] = pd.to_datetime(df_data['reservation_status_date'], format='%Y-%m-%d')

# Reset the index
# (the result is not assigned back, so df_data keeps its original index, as the output below shows)
df_data.reset_index(drop=True)

# Show basic information about the cleaned dataset
print('Dataset overview after cleaning:')
df_data.info()

Output:

Dataset overview after cleaning:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 119386 entries, 0 to 119389
Data columns (total 30 columns):
hotel                             119386 non-null object
is_canceled                       119386 non-null int64
lead_time                         119386 non-null int64
arrival_date_year                 119386 non-null int64
arrival_date_month                119386 non-null object
arrival_date_week_number          119386 non-null int64
arrival_date_day_of_month         119386 non-null int64
stays_in_weekend_nights           119386 non-null int64
stays_in_week_nights              119386 non-null int64
adults                            119386 non-null int64
children                          119386 non-null float64
babies                            119386 non-null int64
meal                              119386 non-null object
country                           119386 non-null object
market_segment                    119386 non-null object
distribution_channel              119386 non-null object
is_repeated_guest                 119386 non-null int64
previous_cancellations            119386 non-null int64
previous_bookings_not_canceled    119386 non-null int64
reserved_room_type                119386 non-null object
assigned_room_type                119386 non-null object
booking_changes                   119386 non-null int64
deposit_type                      119386 non-null object
days_in_waiting_list              119386 non-null int64
customer_type                     119386 non-null object
adr                               119386 non-null float64
required_car_parking_spaces       119386 non-null int64
total_of_special_requests         119386 non-null int64
reservation_status                119386 non-null object
reservation_status_date           119386 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(16), object(11)
memory usage: 28.2+ MB
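The cleaning steps above can also be written without `inplace=True`, which makes each intermediate result explicit and assigns the reset index back. The following is a minimal sketch on an invented four-row table (the `toy` frame and its values are assumptions for illustration, not rows from the dataset). It may also be more robust on recent pandas versions, where calling `fillna(..., inplace=True)` on a selected column can trigger chained-assignment warnings.

```python
import pandas as pd

# Invented sample rows that mimic the missing-value pattern in the dataset.
toy = pd.DataFrame({
    "children": [0.0, None, 2.0, 1.0],
    "country": ["PRT", "GBR", None, "PRT"],
    "company": [None, None, 40.0, None],
})

cleaned = (
    toy.dropna(subset=["children"])   # drop rows with a missing 'children' value
       .drop(columns=["company"])     # drop a column that is mostly empty
       .reset_index(drop=True)        # the result is assigned, so the index really is reset
)

# Fill missing 'country' values with the mode, assigning the result back.
most_common_country = cleaned["country"].mode()[0]
cleaned["country"] = cleaned["country"].fillna(most_common_country)

print(cleaned)
```

Because every step returns a new object, the original `toy` frame is left untouched and the cleaned frame has a contiguous index starting at 0.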

III. Exploratory analysis

1. Number of bookings over time

# Build an 'arrival_date' column and convert it to datetime64[ns]
df_data['arrival_date'] = df_data['arrival_date_year'].astype('str') + '-' \
    + df_data['arrival_date_month'].astype('str') + '-' \
    + df_data['arrival_date_day_of_month'].astype('str')
df_data['arrival_date'] = df_data['arrival_date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%B-%d'))
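As a possible alternative to the row-by-row `strptime` call above, the same date string can be parsed in one vectorised call with `pd.to_datetime`. This is a minimal sketch on two invented rows (the `toy` frame is an assumption for illustration); it relies on English month names being recognised by `%B`, which holds in the default locale.

```python
import pandas as pd

# Invented rows using the dataset's three arrival-date columns.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
})

# Build "YYYY-MonthName-D" strings, then parse the whole column at once.
date_text = (
    toy["arrival_date_year"].astype(str) + "-"
    + toy["arrival_date_month"] + "-"
    + toy["arrival_date_day_of_month"].astype(str)
)
toy["arrival_date"] = pd.to_datetime(date_text, format="%Y-%B-%d")

# Once the column is a datetime, the numeric month can be read from it directly.
toy["arrival_date_month_code"] = toy["arrival_date"].dt.month

print(toy[["arrival_date", "arrival_date_month_code"]])
```

Reading the month from the parsed date gives the same codes as an explicit month-name dictionary, without having to type the twelve names.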

(1) By month

# Add a numeric month column 'arrival_date_month_code'
df_data['arrival_date_month_code'] = df_data['arrival_date_month'].map({
    'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6,
    'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12})

# Keep only bookings that were not cancelled
df1 = df_data[df_data['is_canceled'] == 0]

# List the year/month combinations present in the data
df1.drop_duplicates(['arrival_date_year', 'arrival_date_month'])[['arrival_date_year', 'arrival_date_month']]

The result shows that the arrival dates cover only part of 2015, all of 2016, and part of 2017.

# Restrict the analysis to 2016
df1 = df1[df1['arrival_date_year'] == 2016]

# Number of bookings per month in 2016, by hotel
grouped_df1_month = df1.pivot_table(values='arrival_date', index='arrival_date_month_code',
                                    columns='hotel', aggfunc='count')

# Plot
grouped_df1_month.plot(kind='bar', title='Bookings per month in 2016 by hotel')

[Figure: bar chart of bookings per month in 2016, by hotel]

霧泛席窺:
1、舶漾窪矢慰寄實亮賊絡輔沙飄乓鯉嗜栓酒股奈;
2、卡逸拷臥狂琉淩腫朗丁闡遞玻招宴許開,3桂椰10寧使寂保檐籲端賄甜,1汁、6炫、7萍鍘橢淪芳嗚;
3、臥眉孵妨5霎-10勞笨肄府秫楚險仆訓,1琳柿顧彼災蟬,諱鍬靡音匿庫亭擂趴僧袖胚討紋瀉湘禱。

(2) By day of the week

# Add a weekday column 'arrival_weekday' (1 = Monday, ..., 7 = Sunday)
df_data['arrival_weekday'] = df_data.arrival_date.map(lambda x: x.isoweekday())

# Keep only bookings that were not cancelled
df2 = df_data[df_data['is_canceled'] == 0]

# Restrict the analysis to 2016
df2 = df2[df2['arrival_date_year'] == 2016]

# Number of bookings per weekday in 2016, by hotel
grouped_df2_weekday = df2.pivot_table(values='arrival_date', index='arrival_weekday',
                                      columns='hotel', aggfunc='count')

# Number of times each weekday occurs among the 2016 arrival dates
num_weekday = df2.drop_duplicates(['arrival_date'])['arrival_date'].reset_index(drop=True) \
    .map(lambda x: dt.date.isoweekday(x)).value_counts()

# Divide the counts by the number of such weekdays to get the average
# number of bookings per day for City Hotel and Resort Hotel
grouped_df2_weekday.div(num_weekday, axis=0).plot(kind='bar', title='Average bookings per weekday in 2016 by hotel')

[Figure: bar chart of average bookings per weekday in 2016, by hotel]

朋揩觀子:
1、鑰黔豫匠馴奢埃危市黃瞧捅賢泰體揀,歡雅討漠斤榛妙燕拘衡實畔座積漱狡秀橫哭籬;
2、缺刑翔邀羞徙曙嗓雅認、誦、勒滌評娘羹帽歌蚤倒格,紛扯珍撬跑招切卻;
3、甥術苦敦跨敘鍬兜陽了如銬病鑲尿,滯煙示猿遮薩,戲踢娶徒硯彬殼柵基沖眼。
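The per-weekday averaging in the code above divides each weekday's booking count by the number of distinct dates that fall on that weekday. A minimal sketch on invented data (the `toy` frame is an assumption, and `groupby(...).size().unstack()` is used here in place of `pivot_table` to produce the same kind of count table) shows the arithmetic:

```python
import pandas as pd

# Invented bookings: two Mondays (4 and 11 January 2016) and one Tuesday (5 January 2016).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"],
    "arrival_date": pd.to_datetime(
        ["2016-01-04", "2016-01-04", "2016-01-04", "2016-01-11", "2016-01-05"]
    ),
})
# ISO weekday numbering: 1 = Monday, ..., 7 = Sunday.
toy["arrival_weekday"] = toy["arrival_date"].dt.dayofweek + 1

# Bookings per weekday and hotel.
counts = toy.groupby(["arrival_weekday", "hotel"]).size().unstack("hotel")

# How many distinct calendar dates fall on each weekday.
days_per_weekday = toy.drop_duplicates("arrival_date")["arrival_weekday"].value_counts()

# Average bookings per day for each weekday; rows are aligned on the weekday index.
average = counts.div(days_per_weekday, axis=0)
print(average)
```

Here City Hotel has 3 Monday bookings spread over 2 Mondays, so its Monday average is 1.5. A weekday with no bookings for a hotel stays as NaN rather than 0.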

2. Rental income over time

# Add a 'total_rental_income' column: total nights stayed multiplied by adr
df_data['total_rental_income'] = (df_data['stays_in_weekend_nights'] + df_data['stays_in_week_nights']) * df_data['adr']

# Keep only bookings that were not cancelled
df3 = df_data[df_data['is_canceled'] == 0]

(1) By year

# Total rental income per year, by hotel
df3.pivot_table(values='total_rental_income', index='arrival_date_year', columns='hotel', aggfunc='sum') \
    .plot(kind='bar', title='Total rental income per year by hotel')

[Figure: bar chart of total rental income per year, by hotel]

去搓理家:
1、2015壞鴉頌裙染值電鞏徙玫秦頓征疾剔蓉三乎,2016艱蟲儉算脆2017圃要一太柱娃穎酗評蓋韌圈欄劇個躁邢嘗;
2、2017綜粗襲慮兆淩諸脾2016要吝鼠晤喉敏臥升嘯。

(2) By month

# Restrict the analysis to 2016
df3 = df3[df3['arrival_date_year'] == 2016]

# Total rental income per month in 2016, by hotel
df3.pivot_table(values='total_rental_income', index='arrival_date_month_code', columns='hotel', aggfunc='sum') \
    .plot(kind='bar', title='Total rental income per month in 2016 by hotel')

[Figure: bar chart of total rental income per month in 2016, by hotel]

扼沮章屍:
1、睛妓傷艱膿7內辦8眶窮興耽貴豫藝揀敏頒火,支徊合詠涕嶺梆藥幌岸拇;
2、8蝠慈遮施皇墨暫畜陡犀矢陡,翔鱗普界著腳麥鱉令補勁喚據增岡諾突夾伸奄慌乍。

# Mean adr (average daily rate) per month in 2016, by hotel
df3.pivot_table(values='adr', index='arrival_date_month_code', columns='hotel', aggfunc='mean') \
    .plot(title='Mean adr per month in 2016 by hotel')

[Figure: line chart of mean adr per month in 2016, by hotel]

襯招乒供,吶喝芯梁很權側宛毫洶檔產殼嘿帝丹各坐輻,鶯豹垛秀骯茂8濃萌拐萬謙即梁瞪,揚滴鞏8錘狂豌野讓隘棱崔好碑麻氛。

IV. Predicting booking cancellations

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Count bookings by cancellation status
# (the column names 'index' and 'is_canceled' below are those produced by pandas 1.x;
#  pandas 2.x names them 'is_canceled' and 'count' instead)
is_canceled = df_data['is_canceled'].value_counts().reset_index()

plt.figure(figsize=(6, 6))
plt.title('Bookings by cancellation status\n(not cancelled: 0, cancelled: 1)')
sns.set_color_codes("pastel")
sns.barplot(x='index', y='is_canceled', data=is_canceled)

[Figure: bar chart of the number of bookings by cancellation status]

Feature selection

df_data_sel = df_data.copy(deep=True)
df_data_sel.columns

Output:

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type',
       'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date', 'arrival_date',
       'arrival_date_month_code', 'arrival_weekday', 'total_rental_income'],
      dtype='object')

# Drop the columns that were added during the exploratory analysis
df_data_sel.drop(['total_rental_income', 'arrival_date', 'arrival_date_month_code', 'arrival_weekday'],
                 axis=1, inplace=True)

# Drop further columns that are not used as features
df_data_sel.drop(['arrival_date_year', 'arrival_date_month', 'arrival_date_week_number',
                  'arrival_date_day_of_month', 'assigned_room_type', 'reservation_status_date',
                  'days_in_waiting_list'], axis=1, inplace=True)

# Absolute correlation of each numeric column with 'is_canceled'
# (pandas 2.x needs corr(numeric_only=True) here because non-numeric columns are present)
df_data_sel.corr()['is_canceled'].abs().sort_values(ascending=False)

Output:

is_canceled                       1.000000
lead_time                         0.293177
total_of_special_requests         0.234706
required_car_parking_spaces       0.195492
booking_changes                   0.144371
previous_cancellations            0.110140
is_repeated_guest                 0.084788
adults                            0.059990
previous_bookings_not_canceled    0.057355
adr                               0.047622
babies                            0.032488
stays_in_week_nights              0.024771
children                          0.005048
stays_in_weekend_nights           0.001783
Name: is_canceled, dtype: float64

# Numeric features: keep the nine columns most correlated with the target
num_features = ["lead_time", "total_of_special_requests", "required_car_parking_spaces",
                "booking_changes", "previous_cancellations", "is_repeated_guest",
                "adults", "previous_bookings_not_canceled", "adr"]

# Standardise each numeric feature
for n in num_features:
    df_data_sel[n] = StandardScaler().fit_transform(df_data_sel[n].values.reshape(-1, 1))

# Inspect the categorical (object) columns
df_data_sel.select_dtypes(include=object).info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119386 entries, 0 to 119389
Data columns (total 9 columns):
hotel                   119386 non-null object
meal                    119386 non-null object
country                 119386 non-null object
market_segment          119386 non-null object
distribution_channel    119386 non-null object
reserved_room_type      119386 non-null object
deposit_type            119386 non-null object
customer_type           119386 non-null object
reservation_status      119386 non-null object
dtypes: object(9)
memory usage: 9.1+ MB

# 'reservation_status' maps directly onto the cancellation outcome,
# so it cannot be used as a feature
df_data_sel.groupby("is_canceled")["reservation_status"].value_counts()

Output:

is_canceled  reservation_status
0            Check-Out             75166
1            Canceled              43013
             No-Show                1207
Name: reservation_status, dtype: int64

# Categorical features
cat_features = ["hotel", "meal", "market_segment", "distribution_channel",
                "reserved_room_type", "deposit_type", "customer_type"]

# One-hot encode the categorical features
df_data_dum = pd.get_dummies(df_data_sel[cat_features])

# Build the feature matrix and the target vector
X = pd.concat([df_data_sel[num_features], df_data_dum], axis=1).values
y = df_data_sel["is_canceled"].values

# Use 30% of the data as the test set and the rest as the training set
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.30, stratify=y, random_state=1)

# Classifiers to compare
classifiers = [
    LogisticRegression(),
    RandomForestClassifier(random_state=1, criterion='gini'),
    KNeighborsClassifier(metric='minkowski'),
]

# Classifier names
classifier_names = [
    'LogisticRegression',
    'RandomForestClassifier',
    'KNeighborsClassifier',
]

# Print precision, recall and F1 score computed from a confusion matrix
def show_metrics(cm):
    tp = cm[1, 1]
    fn = cm[1, 0]
    fp = cm[0, 1]
    tn = cm[0, 0]
    print('Precision: {:.3f}'.format(tp / (tp + fp)))
    print('Recall: {:.3f}'.format(tp / (tp + fn)))
    print('F1 score: {:.3f}'.format(2 * (((tp / (tp + fp)) * (tp / (tp + fn))) /
                                         ((tp / (tp + fp)) + (tp / (tp + fn))))))

# Train each classifier and evaluate it on the test set
for model, model_name in zip(classifiers, classifier_names):
    clf = model
    clf.fit(train_x, train_y)
    predict_y = clf.predict(test_x)
    # Confusion matrix
    cm = confusion_matrix(test_y, predict_y)
    # Print the evaluation metrics
    print(model_name + ":")
    show_metrics(cm)

Output:

LogisticRegression:
Precision: 0.844
Recall: 0.598
F1 score: 0.700
RandomForestClassifier:
Precision: 0.814
Recall: 0.730
F1 score: 0.769
KNeighborsClassifier:
Precision: 0.789
Recall: 0.720
F1 score: 0.753
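The three metrics printed above follow directly from the four cells of the confusion matrix. As a minimal, dependency-free sketch (the function name `metrics_from_confusion` and the example matrix are invented for illustration, not taken from the results above), the arithmetic looks like this:

```python
def metrics_from_confusion(cm):
    """Return (precision, recall, f1) for the positive class.

    `cm` is a 2x2 matrix laid out as [[tn, fp], [fn, tp]], which is the
    layout sklearn's confusion_matrix uses for labels 0 and 1.
    """
    tn, fp = cm[0]
    fn, tp = cm[1]
    precision = tp / (tp + fp)   # share of predicted cancellations that were real
    recall = tp / (tp + fn)      # share of real cancellations that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Invented example: 50 true negatives, 10 false positives,
# 20 false negatives, 40 true positives.
example_cm = [[50, 10], [20, 40]]
precision, recall, f1 = metrics_from_confusion(example_cm)
print('Precision: {:.3f}'.format(precision))
print('Recall: {:.3f}'.format(recall))
print('F1 score: {:.3f}'.format(f1))
```

Returning the values instead of printing them inside the function makes it easy to collect the metrics of several models into one table.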

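Comparing the results of the 3 models:
1. Logistic regression has the highest precision (0.844) but the lowest recall, 0.598, so it misses the largest share of bookings that were actually cancelled;
2. Random forest has the highest F1 score (0.769), followed by KNN. Compared with logistic regression, random forest's precision is lower by 0.030 while its recall is higher by 0.132.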
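One possible refinement of the workflow above: the numeric features were standardised on the full dataset before the train/test split, so the scaler has seen the test rows. A common way to avoid this is to put the preprocessing inside a scikit-learn `Pipeline`, so that it is fitted on the training split only. The following is a minimal sketch on synthetic data (the `toy` frame, its three feature columns, and the rule generating `is_canceled` are all invented for illustration and are not the dataset used above):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the booking data.
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 300, size=n),
    "adr": rng.uniform(40, 200, size=n),
    "deposit_type": rng.choice(["No Deposit", "Non Refund"], size=n),
})
# Invented target: bookings with long lead times are cancelled more often.
toy["is_canceled"] = (toy["lead_time"] + rng.normal(0, 50, size=n) > 150).astype(int)

# Scale numeric columns and one-hot encode categorical ones in a single step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["lead_time", "adr"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["deposit_type"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

train_x, test_x, train_y, test_y = train_test_split(
    toy.drop(columns="is_canceled"), toy["is_canceled"],
    test_size=0.30, stratify=toy["is_canceled"], random_state=1,
)

# fit() learns the scaling and encoding from the training rows only;
# predict() applies the same fitted transformations to the test rows.
model.fit(train_x, train_y)
predict_y = model.predict(test_x)
print(predict_y[:10])
```

The same `Pipeline` object can be reused with `RandomForestClassifier` or `KNeighborsClassifier` by swapping the final step.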