{ "cells": [ { "cell_type": "markdown", "id": "76a37951", "metadata": {}, "source": [ "# Introduction\n", "I registered at Kaggle and did the Titanic challenge. Kaggle is a competition platform for machine learning and data analytics. This is a writeup of that challenge, in which we build a probabilistic model to predict the survival of passengers based on their gender, age, and socio-economic status.\n", "\n", "# Coding\n", "Kaggle provides us with the file train.csv for training our model. There is also a file test.csv of passengers whose survival we have to predict; the predictions are then uploaded to the platform.\n", "\n", "## Loading libraries\n", "First, we load all the necessary libraries." ] }, { "cell_type": "code", "execution_count": 1, "id": "f8cf915a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pandas_profiling import ProfileReport\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "a8fc8a86", "metadata": {}, "source": [ "## Loading data\n", "Next, we load the data provided by Kaggle and explore it." ] }, { "cell_type": "code", "execution_count": 2, "id": "d1238cb6", "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('kaggle/train.csv')\n", "test = pd.read_csv('kaggle/test.csv')" ] }, { "cell_type": "markdown", "id": "b315af35", "metadata": {}, "source": [ "The columns of the data are as follows:\n", "- Survived: 0 = No, 1 = Yes\n", "- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)\n", "- Sex: Sex\n", "- Age: Age in years\n", "- SibSp: # of siblings / spouses aboard the Titanic\n", "- Parch: # of parents / children aboard the Titanic\n", "- Ticket: Ticket number\n", "- Fare: Passenger fare\n", "- Cabin: Cabin number\n", "- Embarked: Port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown)" ] }, { "cell_type": "code", "execution_count": 3, "id": "5bf12433", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n",
"    <tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>886</th><td>887</td><td>0</td><td>2</td><td>Montvila, Rev. Juozas</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>211536</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>887</th><td>888</td><td>1</td><td>1</td><td>Graham, Miss. Margaret Edith</td><td>female</td><td>19.0</td><td>0</td><td>0</td><td>112053</td><td>30.0000</td><td>B42</td><td>S</td></tr>\n",
"    <tr><th>888</th><td>889</td><td>0</td><td>3</td><td>Johnston, Miss. Catherine Helen \"Carrie\"</td><td>female</td><td>NaN</td><td>1</td><td>2</td><td>W./C. 6607</td><td>23.4500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>889</th><td>890</td><td>1</td><td>1</td><td>Behr, Mr. Karl Howell</td><td>male</td><td>26.0</td><td>0</td><td>0</td><td>111369</td><td>30.0000</td><td>C148</td><td>C</td></tr>\n",
"    <tr><th>890</th><td>891</td><td>0</td><td>3</td><td>Dooley, Mr. Patrick</td><td>male</td><td>32.0</td><td>0</td><td>0</td><td>370376</td><td>7.7500</td><td>NaN</td><td>Q</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>891 rows × 12 columns</p>\n", "</div>
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas male 27.0 0 \n", "887 Graham, Miss. Margaret Edith female 19.0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n", "889 Behr, Mr. Karl Howell male 26.0 0 \n", "890 Dooley, Mr. Patrick male 32.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", ".. ... ... ... ... ... \n", "886 0 211536 13.0000 NaN S \n", "887 0 112053 30.0000 B42 S \n", "888 2 W./C. 6607 23.4500 NaN S \n", "889 0 111369 30.0000 C148 C \n", "890 0 370376 7.7500 NaN Q \n", "\n", "[891 rows x 12 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train" ] }, { "cell_type": "code", "execution_count": 4, "id": "732d1081", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n",
"    <tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>count</th><td>891.000000</td><td>891.000000</td><td>891.000000</td><td>714.000000</td><td>891.000000</td><td>891.000000</td><td>891.000000</td></tr>\n",
"    <tr><th>mean</th><td>446.000000</td><td>0.383838</td><td>2.308642</td><td>29.699118</td><td>0.523008</td><td>0.381594</td><td>32.204208</td></tr>\n",
"    <tr><th>std</th><td>257.353842</td><td>0.486592</td><td>0.836071</td><td>14.526497</td><td>1.102743</td><td>0.806057</td><td>49.693429</td></tr>\n",
"    <tr><th>min</th><td>1.000000</td><td>0.000000</td><td>1.000000</td><td>0.420000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"    <tr><th>25%</th><td>223.500000</td><td>0.000000</td><td>2.000000</td><td>20.125000</td><td>0.000000</td><td>0.000000</td><td>7.910400</td></tr>\n",
"    <tr><th>50%</th><td>446.000000</td><td>0.000000</td><td>3.000000</td><td>28.000000</td><td>0.000000</td><td>0.000000</td><td>14.454200</td></tr>\n",
"    <tr><th>75%</th><td>668.500000</td><td>1.000000</td><td>3.000000</td><td>38.000000</td><td>1.000000</td><td>0.000000</td><td>31.000000</td></tr>\n",
"    <tr><th>max</th><td>891.000000</td><td>1.000000</td><td>3.000000</td><td>80.000000</td><td>8.000000</td><td>6.000000</td><td>512.329200</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " PassengerId Survived Pclass Age SibSp \\\n", "count 891.000000 891.000000 891.000000 714.000000 891.000000 \n", "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", "std 257.353842 0.486592 0.836071 14.526497 1.102743 \n", "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", "25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n", "50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n", "75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n", "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", "\n", " Parch Fare \n", "count 891.000000 891.000000 \n", "mean 0.381594 32.204208 \n", "std 0.806057 49.693429 \n", "min 0.000000 0.000000 \n", "25% 0.000000 7.910400 \n", "50% 0.000000 14.454200 \n", "75% 0.000000 31.000000 \n", "max 6.000000 512.329200 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.describe()" ] }, { "cell_type": "code", "execution_count": 5, "id": "725418b9", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "53629b7a40d947c4a783e6800d335a55", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Summarize dataset: 0%| | 0/25 [00:00" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "profile = ProfileReport(train, title=\"Pandas Profiling Report\", explorative=True)\n", "profile.to_notebook_iframe()" ] }, { "cell_type": "markdown", "id": "f8537b4d", "metadata": {}, "source": [ "## Cleaning Data\n", "We can see that around 20% of entries are missing the age. Further, 77.1% are missing the cabin number. I decided to simply remove these columns.\n", "There are also values such as the name and the ticket number which are not directly useful for a prediction. We delete them as well." 
] }, { "cell_type": "code", "execution_count": 6, "id": "276462a1", "metadata": {}, "outputs": [], "source": [ "# Drop the columns we decided not to use, in both data sets\n", "del train['Age']\n", "del train['Cabin']\n", "del train['Name']\n", "del train['Ticket']\n", "\n", "del test['Age']\n", "del test['Cabin']\n", "del test['Name']\n", "del test['Ticket']" ] }, { "cell_type": "markdown", "id": "f5092cb2", "metadata": {}, "source": [ "Additionally, the sex is given as \"male\" or \"female\". The model needs numbers, so we encode it as 0 (male) or 1 (female).\n", "The same is true for the port of embarkation. Passengers boarded the Titanic at Southampton (S), Cherbourg (C) or Queenstown (Q). We encode these as 0, 1, 2." ] }, { "cell_type": "code", "execution_count": 7, "id": "78c8b5dc", "metadata": {}, "outputs": [], "source": [ "# Encode the categorical columns as integers\n", "train['Sex'] = train['Sex'].replace({'male': 0, 'female': 1})\n", "train['Embarked'] = train['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2})\n", "\n", "test['Sex'] = test['Sex'].replace({'male': 0, 'female': 1})\n", "test['Embarked'] = test['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2})" ] }, { "cell_type": "markdown", "id": "518bcf63", "metadata": {}, "source": [ "There are some more optimizations we could do. As an example, we could match some of the titles in the names and cluster them into groups. Or we could cluster the fares into groups of cheap and expensive tickets. But this is good enough for a first attempt." ] }, { "cell_type": "markdown", "id": "1278fd27", "metadata": {}, "source": [ "For the column 'Embarked' two values are missing. We could remove these rows completely, but then we would lose the other information they carry, such as the gender. So I decided to fill them in with the most common port." ] }, { "cell_type": "code", "execution_count": 8, "id": "2c0c2fdd", "metadata": {}, "outputs": [], "source": [ "# Most frequent port in the training data\n", "freq_port = train.Embarked.dropna().mode()[0]\n", "train['Embarked'] = train['Embarked'].fillna(freq_port)" ] }, { "cell_type": "markdown", "id": "7b74d3a6", "metadata": {}, "source": [ "For the testing data set, there is a missing value for the fare for one passenger. 
(Yeah, this took me a really long time to find.) This time we fill the gap with the mean of the column instead of the most common value, and with different code." ] }, { "cell_type": "code", "execution_count": 9, "id": "f16225b9", "metadata": {}, "outputs": [], "source": [ "# Fill every missing value in test with the mean of its column\n", "test.fillna(test.mean(), inplace=True)" ] }, { "cell_type": "markdown", "id": "9c7ebb6f", "metadata": {}, "source": [ "## Looking at the cleaned data\n", "Next, we take a look at the data again and see that it is clean." ] }, { "cell_type": "code", "execution_count": 10, "id": "b719865b", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d0861c76c2e34c289f424230e4f1ac76", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Summarize dataset: 0%| | 0/21 [00:00" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "profile = ProfileReport(train, title=\"Pandas Profiling Report\", explorative=True)\n", "profile.to_notebook_iframe()" ] }, { "cell_type": "markdown", "id": "03a0fde8", "metadata": {}, "source": [ "## Training the model\n", "We are using a Gaussian naive Bayes classifier for classifying the data. As scikit-learn takes care of all the difficult stuff, we can essentially treat it like a black box." ] }, { "cell_type": "code", "execution_count": 11, "id": "a3650b6a", "metadata": {}, "outputs": [], "source": [ "# Everything except the target column is used as a feature\n", "features = train.drop('Survived', axis=1)\n", "label = train['Survived']" ] }, { "cell_type": "markdown", "id": "3a5dc0e2", "metadata": {}, "source": [ "We split the data into a training part and a held-out part in order to evaluate our accuracy, and go on to train the model." 
] }, { "cell_type": "code", "execution_count": 12, "id": "2600f05f", "metadata": {}, "outputs": [], "source": [ "# Hold out 20% of the labelled data for evaluation\n", "X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=50)" ] }, { "cell_type": "code", "execution_count": 13, "id": "7a405db2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "79.21348314606742\n" ] } ], "source": [ "gaussian = GaussianNB()\n", "gaussian.fit(X_train, y_train)\n", "y_pred = gaussian.predict(X_test)\n", "# Note: this scores the training split; gaussian.score(X_test, y_test) would measure the held-out accuracy\n", "accuracy = gaussian.score(X_train, y_train) * 100\n", "print(accuracy)" ] }, { "cell_type": "markdown", "id": "58bbbdd1", "metadata": {}, "source": [ "Our accuracy is around 79%. Note that the code above scores the model on the training split, so this is an optimistic estimate; gaussian.score(X_test, y_test) would give the held-out accuracy. Still, this is quite good considering that we did nothing clever and just used the defaults. All that is left is to run our statistical model on the real testing data and submit it to Kaggle." ] }, { "cell_type": "code", "execution_count": 14, "id": "3aec5207", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0,\n", " 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,\n", " 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,\n", " 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,\n", " 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,\n", " 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,\n", " 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,\n", " 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,\n", " 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,\n", " 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,\n", " 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,\n", " 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,\n", " 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 
0, 0, 0, 1, 1, 1, 0,\n", " 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,\n", " 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,\n", " 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,\n", " 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,\n", " 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gaussian.predict(test)" ] }, { "cell_type": "code", "execution_count": 15, "id": "970d2002", "metadata": {}, "outputs": [], "source": [ "results = pd.DataFrame({\n", " \"PassengerId\": test[\"PassengerId\"],\n", " \"Survived\": gaussian.predict(test)\n", " })\n", "\n", "results.to_csv('submission.csv', index=False)" ] }, { "cell_type": "markdown", "id": "63bc187b", "metadata": {}, "source": [ "# End\n", "When uploading to Kaggle this gives me an accuracy of 0.75358. Nice. No knowledge of math was necessary." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }