Hacking The Titanic
Introduction
I registered at Kaggle, a platform for machine-learning and data-analytics programming challenges, and worked through the Titanic challenge. This is my writeup. In this challenge we build a probabilistic model that predicts the survival of passengers based on their gender, age, and socio-economic status. If you want to follow along with a Jupyter notebook, you can download it from here.
Coding
Kaggle provides us with the file train.csv for training our model. There is also a file test.csv of passengers whose survival we have to predict; the predictions are then uploaded to the platform.
Loading libraries
First, we load all the necessary libraries.
import pandas as pd
from pandas_profiling import ProfileReport
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
Loading data
Next, we load the data provided by Kaggle and explore it.
train = pd.read_csv('kaggle/train.csv')
test = pd.read_csv('kaggle/test.csv')
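As a quick sanity check (my addition, not part of the original notebook), we can look at the shapes: the standard Kaggle Titanic split has 891 labelled training rows and 418 unlabelled test rows.
print(train.shape)  # (891, 12)
print(test.shape)   # (418, 11) -- no 'Survived' column; that is what we predict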
The various columns of the data are as follows:
- survival: 0 = No, 1 = Yes
- pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- sex: Sex
- age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation
train
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
train.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
profile = ProfileReport(train, title="Pandas Profiling Report", explorative = True)
profile.to_notebook_iframe()
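The profiling report is interactive and will not show up in a static writeup, so here is a plain-pandas substitute (my addition) for the statistic we care about most, the share of missing values per column:
# Percentage of missing values per column
print(train.isna().mean().mul(100).round(1).sort_values(ascending=False))
# Cabin (~77.1%), Age (~19.9%) and Embarked (~0.2%) are the only columns with gaps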
Cleaning Data
We can see that around 20% of the entries are missing the age, and 77.1% are missing the cabin number. I decided to simply remove these columns. Further, there are columns, such as the name and the ticket number, that are irrelevant for prediction, so we drop them as well.
del train['Age']
del train['Cabin']
del train['Name']
del train['Ticket']
del test['Age']
del test['Cabin']
del test['Name']
del test['Ticket']
Additionally, the sex is given as "male" or "female"; the classifier needs numbers, so we encode it as 0 or 1. The same goes for the embarkation: passengers boarded the Titanic at Southampton (S), Cherbourg (C), or Queenstown (Q), which we map to 0, 1, and 2.
train['Sex']=train['Sex'].replace({'male':0,'female':1})
train['Embarked']=train['Embarked'].replace({'S':0,'C':1,'Q':2})
test['Sex']=test['Sex'].replace({'male':0,'female':1})
test['Embarked']=test['Embarked'].replace({'S':0,'C':1,'Q':2})
There are more optimizations we could do. For example, we could extract the titles from the passengers' names and cluster them into groups, or bin the fares into groups of cheap and expensive tickets. But this is good enough.
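For reference, here is a sketch of what those two ideas could look like. It is not part of my solution and would have to run before we dropped the Name column above; the regular expression and the group labels are my own choices:
# Pull the title (e.g. 'Mr', 'Mrs', 'Miss') out of names like 'Braund, Mr. Owen Harris'
titles = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
common = ['Mr', 'Mrs', 'Miss', 'Master']
train['Title'] = titles.where(titles.isin(common), 'Rare')  # collapse rare titles
# Bin the fares into four quantile-based price groups (0 = cheapest)
train['FareBand'] = pd.qcut(train['Fare'], 4, labels=False)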
For the column 'Embarked', two values are missing. We could remove these rows completely, but then we would also lose the information carried by their remaining features. So I decided to fill in the missing values with the most common port.
freq_port = train.Embarked.dropna().mode()[0]
train['Embarked'] = train['Embarked'].fillna(freq_port)
For the testing data set, the fare is missing for one passenger. (Yeah, this took me a really long time to find.) This time we insert the column mean, with different code.
test.fillna(test.mean(), inplace=True)
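To be sure nothing was missed, a quick assertion (my addition) confirms that both frames are now complete:
assert train.isna().sum().sum() == 0
assert test.isna().sum().sum() == 0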
Looking at the cleaned data
Next, we take a look at the data again and see that it is clean.
profile = ProfileReport(train, title="Pandas Profiling Report", explorative = True)
profile.to_notebook_iframe()
Training the model
We are using a Gaussian naive Bayes classifier to classify the data. As scikit-learn takes care of all the difficult parts, we can essentially treat it like a black box.
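For intuition about what happens inside that black box, here is a hand-rolled, single-feature version (my sketch, not the notebook's code): GaussianNB fits one normal distribution per feature and class, then scores a class by multiplying the class prior with the per-feature likelihoods.
import numpy as np

def gaussian_pdf(x, mean, var):
    # Likelihood of x under a normal distribution with the given mean and variance
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Unnormalized posterior P(class) * P(Sex=1 | class) for a female passenger
for cls in (0, 1):
    sex_in_class = train.loc[train['Survived'] == cls, 'Sex']
    prior = (train['Survived'] == cls).mean()
    print(cls, prior * gaussian_pdf(1, sex_in_class.mean(), sex_in_class.var()))
The class with the larger product wins; the real classifier simply does this across all features at once. With that intuition, we separate the features from the label: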
features = train.drop('Survived', axis=1)
label = train['Survived']
We split the data in order to evaluate our accuracy and go on to train the model.
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=50)
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
y_pred = gaussian.predict(X_test)
accuracy = (y_pred == y_test).mean() * 100  # accuracy on the held-out split
print(accuracy)
79.21348314606742
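A single 80/20 split can be noisy, so as an extra sanity check (my addition) we can cross-validate the same model:
from sklearn.model_selection import cross_val_score

# Accuracy across five folds of the full training data
scores = cross_val_score(GaussianNB(), features, label, cv=5)
print(scores.mean(), scores.std())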
Our accuracy is around 80%, which is quite impressive considering that we did nothing clever and just used the defaults. All that is left is to run our model on the real test data and submit the predictions to Kaggle.
gaussian.predict(test)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])
results = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": gaussian.predict(test)
})
results.to_csv('submission.csv', index=False)
End
Uploading this to Kaggle gives me an accuracy of 0.75358. Nice. No knowledge of math was necessary.