Titanic Survival Prediction

Titanic Survival Prediction using a Machine Learning Algorithm

*****************************************************************

Problem Statement

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, the Titanic sank after colliding with an iceberg during her maiden voyage, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Further, there was some element of luck involved in surviving the sinking. Some groups of people were more likely to survive than others, such as women, children, and the upper class.

We have to predict whether a given passenger on the ship would have survived the sinking.

*****************************************************************

Solution

Titanic Survival Prediction is an example of a ‘classification’ problem, where the outcome is categorical. Our model needs to classify each passenger into one of two categories: ‘survived’ or ‘did not survive’ the sinking of the Titanic.

Following are the steps we are going to follow to predict which passengers survived the tragedy:

  1. Importing the Data through Kaggle
  2. Analyzing the Data
  3. Cleaning the Data
  4. Feature Selection
  5. Model Selection
  6. Training the Model and Testing the Model
  7. Conclusion

****************************************************************

1. Importing the Data through Kaggle

The Dataset comes in a .csv file.

We would be using the Python libraries Pandas and NumPy for analyzing the Dataset. Further, we would require the ‘sklearn.linear_model’ module during Model Selection. We would be loading the libraries through the following command.
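A minimal version of those import statements might look like this (the exact list can vary with your notebook):

    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LogisticRegression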

Warnings are provided to alert the developer to situations that aren’t necessarily exceptions. A ‘Warning’ in a program is distinct from an ‘Error’: a Python program terminates immediately if an error occurs, whereas a warning is not critical; a message is displayed, but the program keeps running. If we want to ignore the warnings Python raises, we can use the following command.
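In Python, that looks like:

    import warnings
    warnings.filterwarnings('ignore')  # suppress all warning messages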

Loading the Dataset
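A sketch of the load, assuming the Kaggle training file is saved as ‘train.csv’ in the working directory and the DataFrame is named ‘titanic’ (both names are assumptions used throughout the snippets below):

    # 'train.csv' is assumed to be the Kaggle Titanic training file
    titanic = pd.read_csv('train.csv')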

*****************************************************************

2. Analyzing the Dataset

  1. Observing the Dataset

Command: head()

The Function of the Command: Returns the first ‘n’ rows of the Dataset (by default, n = 5).

  2. Quick Analysis of the Dataset

Command: info()

The Function of the Command: Returns the list of Columns along with their Datatypes, the Number of Non-Null Values in each Column, and the Memory used by the Dataset.

  3. Statistical Summary of the Dataset

Command: describe()

The Function of the Command: Returns Statistical Data for each numeric column, such as the count, mean, standard deviation, minimum, percentiles, and maximum.

  4. Number of Rows and Columns

Command: shape

The Function of the Command: Returns a tuple containing the number of Rows and Columns of the Dataset.

  5. Observing the Columns in the Dataset

Command: columns

The Function of the Command: Returns the names of the columns in the Dataset. The column names are the features of the Dataset. (A combined snippet illustrating all five commands follows this list.)
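Putting the five commands together on our DataFrame (in a notebook, each line displays its result):

    titanic.head()      # first 5 rows of the Dataset
    titanic.info()      # dtypes, non-null counts, memory usage
    titanic.describe()  # count, mean, std, min, percentiles, max
    titanic.shape       # (number of rows, number of columns)
    titanic.columns     # the column (feature) names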

6. Data Description

  1. Survived: | 0 = Did not survive | 1 = Survived |

  2. Pclass: | 1 = First Class | 2 = Second Class | 3 = Third Class | This can also be seen as a proxy for socioeconomic status.

  3. Sex: | Male | Female |

  4. Age: Age in years. Fractional, if the Age is less than 1.

  5. SibSp: Number of siblings or spouses aboard the Titanic.

  6. Parch: Number of parents or children aboard the Titanic.

  7. Ticket: Passenger Ticket Number.

  8. Fare: Passenger Fare.

  9. Cabin: Cabin Number.

  10. Embarked: Point of Embarkation, where | C = Cherbourg | Q = Queenstown | S = Southampton |

*****************************************************************

3. Cleaning the Data

  1. Handling the Missing Data in the Dataset

We could observe the number of Missing Values for each Feature in the Dataset through the following code.
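For example:

    titanic.isnull().sum()  # number of missing values per column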

We understand,

In the Dataset, Cabin, Age, Fare and Embarked Columns have Null Values.

The number of Missing Values in the Cabin Column is more than 50 %. It would be better to simply drop the Cabin Column entirely than to fill that many gaps with essentially made-up values.

We would be replacing the Missing Values in the Age Column with the Mean of the data sample.

Similarly, we would be replacing the Missing Values in the Fare Column with the Mean of the data sample.

The Embarked Feature is a Categorical Feature that takes the value S, Q, or C. Only a tiny fraction of its values are missing (well under 1 %). So, we would replace the Missing Values in the Embarked Column with the Mode of the data sample.
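A sketch of these three replacements:

    # fill numeric gaps with the column mean
    titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
    titanic['Fare'] = titanic['Fare'].fillna(titanic['Fare'].mean())

    # fill the categorical gap with the most frequent value (the mode)
    titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])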

To ensure that no other Missing Values are present in the Dataset, we run the following command. We should get a column of zeros.
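That check is the same missing-value count as before:

    titanic.isnull().sum()  # every entry should now be 0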

2. Dropping Irrelevant Columns from the Dataset

Some features probably don’t affect the Survival rate. For example, ‘Name’: Rose didn’t survive because of her name, and Jack didn’t die because of his. Another such feature is ‘Ticket’: the number on the ticket wouldn’t affect the Survival Rate. The feature ‘Embarked’ wouldn’t affect the Survival Rate either. And as we figured out above, dropping the ‘Cabin’ Column is the best way to handle its missing values. So, we drop these columns from the Dataset.
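In code, that drop could look like:

    # drop the features judged irrelevant to survival
    titanic.drop(['Name', 'Ticket', 'Embarked', 'Cabin'], axis=1, inplace=True)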

3. Handling Categorical Data

ML models handle Numeric Data very well, but most of them cannot work with Categorical Data directly. Categorical Data uses labels to represent values, such as ‘male’ and ‘female’ in the Sex Column. Passing such columns to a model typically raises an error, so we need to convert our Categorical Data into Numeric Data.

The Function ‘get_dummies’ from the Pandas Library would be useful for this procedure.

The Features ‘Sex’ and ‘Pclass’ need to undergo this treatment.

The Feature ‘Sex’ is transformed into the Binary Columns Sex_male and Sex_female, which answer the question ‘is the passenger male or female?’. The Feature ‘Pclass’ is transformed into the Binary Columns Pclass_1, Pclass_2, and Pclass_3, which answer the question ‘which class did the passenger belong to?’.

Further, we drop the ‘Sex_female’ Column, because a single sex indicator is enough: Sex_male already carries the same information (a 0 there means female).
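A sketch of the encoding (pandas names the new indicator columns after the original values, e.g. Sex_male, Sex_female, Pclass_1, …):

    # one-hot encode Sex and Pclass into 0/1 indicator columns
    titanic = pd.get_dummies(titanic, columns=['Sex', 'Pclass'])

    # keep a single sex indicator: Sex_male = 0 already means female
    titanic.drop('Sex_female', axis=1, inplace=True)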

****************************************************************

4. Feature Selection

Keeping Relevant Features

Both the ‘SibSp’ and ‘Parch’ Features relate to traveling with family. For simplicity’s sake (and to avoid multicollinearity), let us combine the effect of these two features into a single feature that indicates whether or not an individual was traveling alone. We then drop the ‘SibSp’ and ‘Parch’ Features from the Dataset.
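One way to build such a feature with np.where (the column name ‘Alone’ is an assumption):

    # 1 if the passenger travelled alone, 0 otherwise
    titanic['Alone'] = np.where(titanic['SibSp'] + titanic['Parch'] > 0, 0, 1)

    # the combined feature replaces SibSp and Parch
    titanic.drop(['SibSp', 'Parch'], axis=1, inplace=True)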

*****************************************************************

5. Model Selection

There are many Classification Algorithms in Machine Learning, namely Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, etc.

I would be implementing the Logistic Regression algorithm.

Logistic Regression is a statistical model used to estimate the probability of an event occurring, given some previous data. The model works with Binary Data: either the event happens (1) or the event does not happen (0).
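Concretely, Logistic Regression passes a weighted sum of the features through the sigmoid (logistic) function, which squashes any real number into a probability between 0 and 1; predictions above a threshold (usually 0.5) are classified as ‘survived’. A minimal sketch of that mapping:

    def sigmoid(z):
        """Map a real-valued score z to a probability in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))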

*****************************************************************

6. Training the Model and Testing the Model

We split the Dataset into a Training Dataset and a Testing Dataset. The Training Dataset is used to train the model, while the Testing Dataset is used to test the trained model. To get a realistic estimate of how the model would perform in the real world, we must evaluate it on data it has never seen before, which is why this split is necessary.

We would be training the Logistic Regression model on the Training Dataset. Further, we would be testing the Logistic Regression model on the Testing Dataset.
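A sketch of the split and the training (the 80/20 split ratio and random_state are assumptions; ‘PassengerId’ is dropped here because an identifier carries no predictive signal):

    from sklearn.model_selection import train_test_split

    # features (X) and the target label (y)
    X = titanic.drop(['Survived', 'PassengerId'], axis=1)
    y = titanic['Survived']

    # hold out 20 % of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # train on the training set, then predict on the unseen test set
    model = LogisticRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)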

*****************************************************************

7. Conclusion

Let’s check the accuracy of the Model:
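For example, with scikit-learn’s accuracy_score:

    from sklearn.metrics import accuracy_score

    accuracy = accuracy_score(y_test, predictions)
    print(accuracy)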

So, the accuracy of our Logistic Regression model is 81 %.

*****************************************************************

Important commands I came across while completing this Project

fillna() Function used to fill the NaN values with a particular value.

isnull().sum() Function returns the number of missing values in each column.

drop() Rows or Columns could be dropped from the Dataset through this function.

‘axis’ The Parameter is used to indicate whether the columns or the rows are to be dropped. If we set ‘axis = 1’, the required columns would be dropped. If we set ‘axis = 0’, the required rows would be dropped.

‘inplace’ The Parameter indicates whether the changes are applied to the Dataset in place. If True, the Dataset itself is modified. If False (the default), the Dataset is left untouched and a new Dataset with the required changes is returned instead.

get_dummies() Function used for Data Manipulation. The function converts Categorical Data into Dummy/Indicator Variables.

np.where() Function syntax: np.where(condition, x, y). The function yields x where the condition evaluates to True, otherwise y.

GitHub Link

Happy Learning!

Everyone shines given the right Lighting!!