Introduction

I registered at Kaggle, a challenge platform for machine learning and data analytics, and worked through the Titanic Challenge. This is a writeup of that challenge: we build a probabilistic model that predicts the survival of passengers based on their gender, age, and socio-economic status. If you want to follow along with a Jupyter notebook, you can download it from here.

Coding

Kaggle provides us with the file train.csv for training our model. There is also a file test.csv of passengers whose survival we have to predict; those predictions are then uploaded to the platform.

Loading libraries

First, we load all the necessary libraries.

import pandas as pd
from pandas_profiling import ProfileReport
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

Loading data

Next, we load the data provided by Kaggle and explore it.

train = pd.read_csv('kaggle/train.csv')
test = pd.read_csv('kaggle/test.csv')

The various columns of the data are as follows:

  • survival: Survival (0 = No, 1 = Yes)
  • pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • sex: Sex
  • age: Age in years
  • sibsp: # of siblings / spouses aboard the Titanic
  • parch: # of parents / children aboard the Titanic
  • ticket: Ticket number
  • fare: Passenger fare
  • cabin: Cabin number
  • embarked: Port of Embarkation
train
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
profile = ProfileReport(train, title="Pandas Profiling Report", explorative = True)
profile.to_notebook_iframe()
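
The report flags missing values per column. If you do not want to generate the full profiling report, the same missing-value percentages can be checked with plain pandas (a quick alternative, not part of the original notebook):

train.isnull().mean().sort_values(ascending=False) * 100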

Cleaning Data

We can see that around 20% of the entries are missing the age, and 77.1% are missing the cabin number. I decided to simply remove these columns. Furthermore, columns such as the name and the ticket number are irrelevant for a prediction, so we delete them as well.

del train['Age']
del train['Cabin']
del train['Name']
del train['Ticket']

del test['Age']
del test['Cabin']
del test['Name']
del test['Ticket']
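
As an aside, the four del statements per DataFrame could also be written as a single drop call each (a stylistic alternative; run one version or the other, not both):

for df in (train, test):
    df.drop(columns=['Age', 'Cabin', 'Name', 'Ticket'], inplace=True)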

Additionally, the sex is given as “male” or “female”; for the classifier this should be 0 or 1. The same is true for the port of embarkation: passengers boarded the Titanic at Southampton (S), Cherbourg (C), or Queenstown (Q), and we want 0, 1, 2.

train['Sex']=train['Sex'].replace({'male':0,'female':1})
train['Embarked']=train['Embarked'].replace({'S':0,'C':1,'Q':2})

test['Sex']=test['Sex'].replace({'male':0,'female':1})
test['Embarked']=test['Embarked'].replace({'S':0,'C':1,'Q':2})

There are some more optimizations we could do, as sketched below. As an example, we could extract the titles from the names and cluster them into groups, or we could bin the fares into groups of cheap and expensive tickets. But this is good enough.
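
For illustration, here is a rough sketch of what those two ideas could look like. It assumes the Name and Fare columns are still present, so it would have to run before the deletions above, and the Title and FareBand column names are made up for this example:

# Work on a fresh copy that still has the Name and Fare columns.
raw = pd.read_csv('kaggle/train.csv')

# Extract the title ("Mr", "Mrs", "Miss", ...) from names like
# "Braund, Mr. Owen Harris" and collapse the rare ones into one bucket.
raw['Title'] = raw['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
rare = [t for t, n in raw['Title'].value_counts().items() if n < 10]
raw['Title'] = raw['Title'].replace(rare, 'Rare')

# Bin the fares into four quartile-based groups,
# 0 = cheapest quarter, ..., 3 = most expensive quarter.
raw['FareBand'] = pd.qcut(raw['Fare'], 4, labels=False)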

For the column ‘Embarked’, two values are missing. We could remove these rows entirely, but then we would also lose the information their other features, such as the gender, carry. So I decided to fill them in with the most common port.

freq_port = train.Embarked.dropna().mode()[0]
train['Embarked'] = train['Embarked'].fillna(freq_port)

For the testing data set, the fare is missing for one passenger. (Yeah, this took me a really long time to find.) Here we fill in the column mean, this time with different code, though.

# Fill the single missing fare (and any remaining numeric gaps) with the column mean
test.fillna(test.mean(), inplace=True)

Looking at the cleaned data

Next, we take a look at the data again and see that it is clean.

profile = ProfileReport(train, title="Pandas Profiling Report", explorative = True)
profile.to_notebook_iframe()

Training the model

We use a Gaussian naive Bayes classifier to classify the data. As scikit-learn takes care of all the difficult parts, we can essentially treat it like a black box.
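
Even as a black box, the idea behind it is simple: the classifier assumes that the features are independent given the class and normally distributed within each class, estimates a mean and variance per feature and class from the training data, and then predicts the class with the highest posterior probability:

ŷ = argmax_y P(y) · ∏_i N(x_i | μ_{y,i}, σ²_{y,i})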

features = train.drop('Survived', axis=1)
label = train['Survived']

We split off part of the training data so that we can evaluate our accuracy on examples the model has not seen, and go on to train the model.

X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=50)
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
y_pred = gaussian.predict(X_test)
# Note: this scores the model on the training split it was fitted on; for an
# unbiased estimate, score the held-out split with gaussian.score(X_test, y_test).
accuracy = gaussian.score(X_train, y_train) * 100
print(accuracy)
79.21348314606742
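
As an optional sanity check (not part of the original notebook), 5-fold cross-validation averages the accuracy over five different held-out folds and is more robust than a single split:

from sklearn.model_selection import cross_val_score

# Mean accuracy over five different train/validation splits.
scores = cross_val_score(GaussianNB(), features, label, cv=5)
print(scores.mean() * 100)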

Our accuracy on the training split is around 80%, which is very impressive considering that we did nothing clever and just used the defaults. All that is left is to run our model on the real testing data and submit the predictions to Kaggle.

gaussian.predict(test)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])
results = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": gaussian.predict(test)
    })

results.to_csv('submission.csv', index=False)

End

When uploading to Kaggle, this gives me an accuracy of 0.75358. Nice. No knowledge of math was necessary.


