Exploring Titanic: Initial Insights from Passenger Data (HNG internship)

About this project

Technical Report: First Glance Analysis of the Titanic Dataset

Introduction

The dataset contains information about passengers who were aboard the Titanic, including details such as passenger identifiers, survival status, ticket class, names, sex, age, number of siblings/spouses aboard, number of parents/children aboard, ticket number, fare, cabin number, and port of embarkation. The purpose of this report is to provide a preliminary overview of the dataset and highlight initial insights without delving into deep analysis.

Dataset Familiarization

The Titanic dataset is divided into three parts:

gender_submission: Contains the predicted survival of passengers based on their gender.

test: Contains passenger information without survival information, used for making predictions.

train: Contains passenger information along with their survival status, used for training models.

Structure and Contents:

gender_submission: 2 columns - PassengerId, Survived

test: 11 columns - PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

train: 12 columns - PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
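The column lists above can be compared directly. The following is a minimal sketch that uses only the column names reported in this section (the variable names are illustrative) to confirm that Survived is the one column present in train and absent from test:

```python
# Column names as listed in this report for each file.
TEST_COLUMNS = ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
                "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
TRAIN_COLUMNS = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                 "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Columns present in train but absent from test, in train's order.
only_in_train = [c for c in TRAIN_COLUMNS if c not in TEST_COLUMNS]

print(len(TEST_COLUMNS), len(TRAIN_COLUMNS))  # 11 12
print(only_in_train)                          # ['Survived']
```

Survived is therefore the target a model would learn from train and predict for test.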

Initial Data Exploration

gender_submission

  • PassengerId: This is a unique identifier for each passenger.
  • Survived: Binary indicator (1 for survived, 0 for not survived).

Basic statistics:

Total entries: 418

No missing values.

Survived mean: 0.36, meaning the file predicts survival for 36% of passengers.

Outliers: none.
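Because Survived holds only 0s and 1s, its mean equals the share of passengers marked as survivors. A small sketch of that relationship, using a hypothetical five-value sample rather than the real file (the function name is illustrative):

```python
from statistics import mean

def survival_rate(survived):
    """Return the share of 1s in a list of 0/1 survival flags."""
    return mean(survived)

# Hypothetical five-passenger sample, not real data: two of five survive.
sample = [0, 1, 0, 0, 1]
print(survival_rate(sample))  # 0.4
```

Applied to the Survived column of gender_submission, the same calculation would give the 0.36 reported above.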

test

  • PassengerId: Unique identifier for each passenger.
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
  • Name: Passenger's name.
  • Sex: Passenger's gender.
  • Age: Passenger's age (86 missing values).
  • SibSp: Number of siblings/spouses aboard.
  • Parch: Number of parents/children aboard.
  • Ticket: Ticket number.
  • Fare: Passenger fare (1 missing value).
  • Cabin: Cabin number (327 missing values).
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

Basic statistics:

Total entries: 418

Missing values: Three columns had missing values: Age (86), Fare (1), and Cabin (327).
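Counts like these can be obtained by tallying the empty cells in each column. The sketch below shows one way to do this with Python's standard csv module. It assumes that a missing value appears as an empty cell, and it runs on a few made-up rows rather than the real file; the function name count_missing is illustrative.

```python
import csv
import io

def count_missing(csv_text):
    """Count empty cells per column in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # Treat empty or whitespace-only cells as missing (assumption).
            if (row[name] or "").strip() == "":
                counts[name] += 1
    return counts

# Made-up rows using a subset of the test columns, not real passengers.
SAMPLE = """PassengerId,Age,Fare,Cabin
1,34,7.25,
2,,8.05,
3,22,,C85
"""

print(count_missing(SAMPLE))
# {'PassengerId': 0, 'Age': 1, 'Fare': 1, 'Cabin': 2}
```

Passing the text of the downloaded test file in place of SAMPLE should produce the per-column counts listed above, provided the file encodes missing values as empty cells.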

train

  • PassengerId: Unique identifier for each passenger.
  • Survived: Binary indicator (1 for survived, 0 for not survived).
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
  • Name: Passenger's name.
  • Sex: Passenger's gender.
  • Age: Passenger's age (177 missing values).
  • SibSp: Number of siblings/spouses aboard.
  • Parch: Number of parents/children aboard.
  • Ticket: Ticket number.
  • Fare: Passenger fare.
  • Cabin: Cabin number (687 missing values).
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) (2 missing values).

Basic statistics:

  • Total entries: 891
  • Missing values: Three columns had missing values: Age (177), Cabin (687, the most of any column), and Embarked (2).
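The raw counts are easier to compare as a share of each file's rows. A short sketch using only the totals and missing-value counts reported in this section (the dictionary and function names are illustrative):

```python
# Row totals and missing-value counts as reported in this section.
ROWS = {"test": 418, "train": 891}
MISSING = {
    "test": {"Age": 86, "Fare": 1, "Cabin": 327},
    "train": {"Age": 177, "Cabin": 687, "Embarked": 2},
}

def missing_share(dataset, column):
    """Return the percentage of rows missing `column`, rounded to 1 decimal."""
    return round(100 * MISSING[dataset][column] / ROWS[dataset], 1)

for dataset, columns in MISSING.items():
    for column in columns:
        print(dataset, column, missing_share(dataset, column))
```

With these counts, Cabin is missing for roughly 78% of test rows and 77% of train rows, and Age for about 20% of rows in each file, while the Fare and Embarked gaps are well under 1%.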

Observations

From the initial data exploration, several key observations can be made:

  1. Gender and Survival: The gender_submission file, which predicts survival from gender alone, marks 36% of passengers as survivors. This is a predicted rate, not an observed one.
  2. Passenger Class: A significant number of passengers were in the 3rd class.
  3. Missing Data: There are substantial missing values in the Age and Cabin columns across both test and train.
  4. Fare and Embarkation: The fare varies widely, indicating a diverse passenger list in terms of ticket prices. The majority of the passengers embarked from Southampton.
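Observations 2 and 4 come down to counting how often each category appears in a column. A minimal sketch of that tally, run on hypothetical rows rather than the real data (the function name value_counts is illustrative):

```python
from collections import Counter

def value_counts(rows, column):
    """Count how often each value of `column` appears across dict rows."""
    return Counter(row[column] for row in rows)

# Hypothetical rows for illustration only, not taken from the dataset.
rows = [
    {"Pclass": 3, "Embarked": "S"},
    {"Pclass": 1, "Embarked": "C"},
    {"Pclass": 3, "Embarked": "S"},
    {"Pclass": 2, "Embarked": "Q"},
]

print(value_counts(rows, "Pclass").most_common(1))    # [(3, 2)]
print(value_counts(rows, "Embarked").most_common(1))  # [('S', 2)]
```

Running the same tally on the Pclass and Embarked columns of train and test would give the exact class and port counts behind these observations.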

Conclusion

The Titanic dataset provides a detailed snapshot of the passengers aboard the Titanic. Initial observations reveal a predicted survival rate of 36% in the gender-based gender_submission file, a majority of passengers in the 3rd class, and significant missing data in the Age and Cabin columns. These insights suggest areas for further analysis, such as the relationship between survival and passenger class, gender, age, and fare, as well as how to handle the missing data.

This project was the first stage of the HNG internship.

https://hng.tech/internship

https://hng.tech/hire
