Hacking The Titanic
Introduction
I registered at Kaggle, a platform for machine-learning and data-analytics programming challenges, and worked through the Titanic challenge. This is my writeup. In this challenge we build a probabilistic model that predicts the survival of passengers based on their gender, age, and socio-economic status. If you want to follow along with a Jupyter notebook, you can download it from here.
Coding
Kaggle provides us with the file train.csv for training our model. There is also a file test.csv of passengers whose survival we have to predict; the predictions are then uploaded to the platform.
Loading libraries
First, we load all the necessary libraries.
import pandas as pd
from pandas_profiling import ProfileReport
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
Loading data
Next, we load the data provided by Kaggle and explore it.
train = pd.read_csv('kaggle/train.csv')
test = pd.read_csv('kaggle/test.csv')
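As a quick sanity check (my addition, not part of the original notebook), we can look at the shapes: the standard Kaggle Titanic split has 891 labelled training rows and 418 unlabelled test rows.
print(train.shape)  # (891, 12)
print(test.shape)   # (418, 11) -- no 'Survived' column; that is what we predict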
The various columns of the data are as follows:
- survival: 0 = No, 1 = Yes
- pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- sex: Sex
- age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation
train
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
train.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
profile = ProfileReport(train, title="Pandas Profiling Report", explorative = True)
profile.to_notebook_iframe()
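The profiling report is interactive and will not show up in a static writeup, so here is a plain-pandas substitute (my addition) for the statistic we care about most, the share of missing values per column:
# Percentage of missing values per column
print(train.isna().mean().mul(100).round(1).sort_values(ascending=False))
# Cabin (~77.1%), Age (~19.9%) and Embarked (~0.2%) are the only columns with gaps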
Cleaning Data
We can see that around 20% of the entries are missing the age, and 77.1% are missing the cabin number. I decided to simply remove these columns. Further, there are columns, such as the name and the ticket number, that are irrelevant for prediction, so we drop them as well.
del train['Age']
del train['Cabin']
del train['Name']
del train['Ticket']
del test['Age']
del test['Cabin']
del test['Name']
del test['Ticket']
Additionally, the sex is given as "male" or "female"; the classifier needs numbers, so we encode it as 0 or 1. The same goes for the embarkation: passengers boarded the Titanic at Southampton (S), Cherbourg (C), or Queenstown (Q), which we map to 0, 1, and 2.
train['Sex']=train['Sex'].replace({'male':0,'female':1})
train['Embarked']=train['Embarked'].replace({'S':0,'C':1,'Q':2})
test['Sex']=test['Sex'].replace({'male':0,'female':1})
test['Embarked']=test['Embarked'].replace({'S':0,'C':1,'Q':2})
There are more optimizations we could do. For example, we could extract the titles from the passengers' names and cluster them into groups, or bin the fares into groups of cheap and expensive tickets. But this is good enough.
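For reference, here is a sketch of what those two ideas could look like. It is not part of my solution and would have to run before we dropped the Name column above; the regular expression and the group labels are my own choices:
# Pull the title (e.g. 'Mr', 'Mrs', 'Miss') out of names like 'Braund, Mr. Owen Harris'
titles = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
common = ['Mr', 'Mrs', 'Miss', 'Master']
train['Title'] = titles.where(titles.isin(common), 'Rare')  # collapse rare titles
# Bin the fares into four quantile-based price groups (0 = cheapest)
train['FareBand'] = pd.qcut(train['Fare'], 4, labels=False)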
For the column 'Embarked', two values are missing. We could remove these rows completely, but then we would also lose the information carried by their remaining features. So I decided to fill in the missing values with the most common port.
freq_port = train.Embarked.dropna().mode()[0]
train['Embarked'] = train['Embarked'].fillna(freq_port)
For the testing data set, the fare is missing for one passenger. (Yeah, this took me a really long time to find.) This time we insert the column mean, with different code.
test.fillna(test.mean(), inplace=True)
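To be sure nothing was missed, a quick assertion (my addition) confirms that both frames are now complete:
assert train.isna().sum().sum() == 0
assert test.isna().sum().sum() == 0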
Looking at the cleaned data
Next, we take a look at the data again and see that it is clean.
profile = ProfileReport(train, title="Pandas Profiling Report", explorative = True)
profile.to_notebook_iframe()
Training the model
We are using a Gaussian naive Bayes classifier to classify the data. As scikit-learn takes care of all the difficult parts, we can essentially treat it like a black box.
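For intuition about what happens inside that black box, here is a hand-rolled, single-feature version (my sketch, not the notebook's code): GaussianNB fits one normal distribution per feature and class, then scores a class by multiplying the class prior with the per-feature likelihoods.
import numpy as np

def gaussian_pdf(x, mean, var):
    # Likelihood of x under a normal distribution with the given mean and variance
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Unnormalized posterior P(class) * P(Sex=1 | class) for a female passenger
for cls in (0, 1):
    sex_in_class = train.loc[train['Survived'] == cls, 'Sex']
    prior = (train['Survived'] == cls).mean()
    print(cls, prior * gaussian_pdf(1, sex_in_class.mean(), sex_in_class.var()))
The class with the larger product wins; the real classifier simply does this across all features at once. With that intuition, we separate the features from the label: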
features = train.drop('Survived', axis=1)
label = train['Survived']
We split the data in order to evaluate our accuracy and go on to train the model.
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=50)
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
y_pred = gaussian.predict(X_test)
accuracy = (y_pred == y_test).mean() * 100  # accuracy on the held-out split
print(accuracy)
79.21348314606742
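A single 80/20 split can be noisy, so as an extra sanity check (my addition) we can cross-validate the same model:
from sklearn.model_selection import cross_val_score

# Accuracy across five folds of the full training data
scores = cross_val_score(GaussianNB(), features, label, cv=5)
print(scores.mean(), scores.std())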
Our accuracy is around 80%, which is quite impressive considering that we did nothing clever and just used the defaults. All that is left is to run our model on the real test data and submit the predictions to Kaggle.
gaussian.predict(test)
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0,
1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])
results = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": gaussian.predict(test)
})
results.to_csv('submission.csv', index=False)
End
Uploading this to Kaggle gives me an accuracy of 0.75358. Nice. No knowledge of math was necessary.