Introduction

I registered on Kaggle, a competition platform for machine learning and data analytics, and worked through the Titanic challenge. This post is a writeup of that challenge. The goal is to build a probabilistic model that predicts whether a passenger survived, based on gender, age, and socio-economic status. If you want to follow along with a Jupyter notebook, you can download it from here.

Coding

Kaggle provides the file train.csv, which contains the survival outcome of each passenger and is used to train the model. A second file, test.csv, lists passengers without the outcome; we predict their survival and upload the predictions to the platform.

Loading libraries

First, we load all the necessary libraries.

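import pandas as pd                                    # data loading and exploration
# Note: newer releases of pandas_profiling are published as ydata-profiling (import ydata_profiling)
from pandas_profiling import ProfileReport             # automated exploratory report
from sklearn.naive_bayes import GaussianNB             # the probabilistic classifier
from sklearn.model_selection import train_test_split   # hold-out split for validation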
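The last two imports are the modelling tools. As a rough preview of how they typically fit together, here is a minimal sketch on synthetic toy data (not the Titanic set). The array names, the 80/20 split, and the random seed are assumptions chosen for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Toy data, NOT the Titanic set: two numeric features, two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),   # class 0 centred at (0, 0)
    rng.normal(5.0, 1.0, size=(50, 2)),   # class 1 centred at (5, 5)
])
y = np.array([0] * 50 + [1] * 50)

# Hold back 20% of the rows to check the model on data it has not seen.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Gaussian naive Bayes fits one normal distribution per feature and class.
model = GaussianNB()
model.fit(X_train, y_train)

accuracy = model.score(X_val, y_val)   # fraction of correct predictions
proba = model.predict_proba(X_val)     # one probability per class and row
print(accuracy)
```

Because the model is probabilistic, predict_proba returns a probability for each class rather than only a hard label, which is what makes it a natural fit for a survival prediction.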

Loading data

Next, we load the data provided by Kaggle and explore it.

train = pd.read_csv('kaggle/train.csv')
test = pd.read_csv('kaggle/test.csv')

The columns of the training data are as follows:

  • PassengerId: Passenger identifier
  • Survived: 0 = No, 1 = Yes
  • Pclass: Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd (a proxy for socio-economic status)
  • Name: Passenger name
  • Sex: male or female
  • Age: Age in years
  • SibSp: Number of siblings / spouses aboard the Titanic
  • Parch: Number of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

train
     PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0    1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1    2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2    3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3    4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4    5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
...  ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...      ...    ...
886  887          0         2       Montvila, Rev. Juozas                              male    27.0  0      0      211536            13.0000  NaN    S
887  888          1         1       Graham, Miss. Margaret Edith                       female  19.0  0      0      112053            30.0000  B42    S
888  889          0         3       Johnston, Miss. Catherine Helen "Carrie"           female  NaN   1      2      W./C. 6607        23.4500  NaN    S
889  890          1         1       Behr, Mr. Karl Howell                              male    26.0  0      0      111369            30.0000  C148   C
890  891          0         3       Dooley, Mr. Patrick                                male    32.0  0      0      370376            7.7500   NaN    Q

891 rows × 12 columns

train.describe()
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200
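The count row of the describe() output hints at missing data: Age has 714 entries for 891 rows, so 177 ages are missing. A minimal sketch of checking this directly is shown below. It uses a tiny hand-typed sample built from four of the rows displayed earlier instead of the real file, and the names csv_text and sample are made up for illustration.

```python
import io
import pandas as pd

# Four rows copied from the displayed table, reduced to a few columns.
# Empty fields are read as NaN by pandas.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Cabin,Embarked
1,0,3,male,22.0,,S
2,1,1,female,38.0,C85,C
889,0,3,female,,,S
890,1,1,male,26.0,C148,C
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Number of missing values per column; show only columns that have any.
missing = sample.isna().sum()
print(missing[missing > 0])

# describe() counts only non-missing values, which explains a count below the row total.
age_count = sample["Age"].describe()["count"]
print(age_count, "of", len(sample), "ages present")
```

Running the same isna().sum() call on the full train DataFrame lists every column with gaps, which matters later because a model such as GaussianNB cannot be fitted on rows that contain NaN.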
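
# Build an exploratory report of the training data (distributions, missing values, correlations)
profile = ProfileReport(train, title="Pandas Profiling Report", explorative=True)
# Render the report inline in the notebook
profile.to_notebook_iframe()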