Jamaley Hussain

4 years ago

Data Cleaning with Pandas Python : Missing Values

According to IBM Data Analytics you can expect to spend up to 80% of your time cleaning data.

Hi, Folks hope you all are doing awesome, In today’s tutorial, we are going to learn Data Cleaning Process with Python and Pandas framework.

Python programming language for data science is very popular and easy to learn. and in Data Science / Machine Learning, Data Cleaning is very important. Most of the time data is not in the form as you want, either its a paragraph, string, image, or any other format.

Suppose you want to create Chart, Bar Graph and you have data like this :

Name    Class    Year
Ram     five     2k17
Mohan   three    2k18
Neeti   Six      2Keighteen

in the above table, we have Name, Class, and Year respectively, and you want to differentiate classes on the basis of Year. But in the above table year is like 2k17,2k18 and 2keighteen, So it is impossible to define which is year 2017 or 2018. For this Situation we have an algorithm named: – Fuzzy logic, this algorithm helps you! for this, we’ll talk later in a separate post.

So here I’m taking an example of salary records, were some of the data are missing and we are going to handle.

# Importing libraries
import pandas as pd
import numpy as np

# Read csv file into a pandas dataframe
df = pd.read_csv("salary.csv")

# Take a look at the first few rows
print df.head()

Here Pandas framework and NumPy will he to clean the data.

df = pd.read_csv("salary.csv")

Reading the CSV (or we say that we are loading the data to clean). And the above code will give us tabular data that counted as row and column.

With the .head() method, we can easily see the first few rows

	Last Name	First Name	Status	        Salary
0	Williams	Mary	        Full Time	35000
1	Brown	        Robert	        NaN	        32000
2	Wilson	        Elizabeth	Part Time	12000
3	Moore	        Jennifer	Full Time	0
4	Brown	        Charles	        Full Time	39000
5	Price	        Lisa	        Part Time	14000
6	Wood	        Daniel	        Part Time	13750
7	Coleman	        Donald	        Full Time	37500
8	Perry	        George	        Part Time	0
9	Steele	        Donna	        Full Time	36750

Standard Missing Values: that Pandas can automatically detect.

Going back to our original dataset, let’s take a look at the “status” column. In the second row, there’s an empty cell.

Clearly these are both missing values. Let’s see how Pandas deals with these.

Table of Contents

1 Count Missing Value
2 Drop Column
3 Duplicates
4 Missing Values

Count Missing Value

Display the numbers of missing value in each column names/feature name which helps us to define what we want.

df.isnull().sum()

#output

Last Name     0
First Name    0
Status        4
Salary        5
dtype: int64

So, Missing value in the feature Salary is 5 and Status is 4

Drop Column

If we have any column in our datasets that is not relevant to our need or irrelevant information we drop that from our datasets. here we have 4 columns , lets assume we don’t need ‘Status’ column for now, So we are dropping that table. lets see how.

columns_to_delete = ['Status'] 
df.drop(columns_to_delete, inplace=True, axis=1)

df.head()

#Output
        Last Name	First Name	Salary
0	Williams	Mary	        35000.0
1	Brown	        Robert	        32000.0
2	Wilson	        Elizabeth	12000.0
3	Moore	        Jennifer	NaN
4	Brown	        Charles	        39000.0

Here, I made a list of the columns I wanted to delete and then passed it to df.drop().

A lot of pandas functions can be run in-place.

In pandas, axis 0 represents rows and axis 1 represents columns, so I’ve indicated axis=1 because I’m deleting columns.

Duplicates

If a user accidentally submits a form twice, or you scrape a website twice without realizing it, or in plenty of other scenarios, you could end up with duplicate data.

Pandas drop_duplicates()
You could have entire rows that are duplicates, or just duplicate values in a column when the column should be unique.

The drop_duplicates() function in pandas can be used in both of these cases.

Either the whole row can be a duplicate, or you can specify columns to check for duplicates with the subset parameter, which takes a list of column names.

The keep parameter designates whether or not to keep any of the duplicates. The default is first so it would keep the first duplicate and drop the rest of them.

Last will drop all duplicates except for the last occurrence. False will drop all of them.

df.drop_duplicates(inplace=True)

There is also the option to run it in-place.

Missing Values

Pandas recognizes a few things as missing values.

Empty cells
“NA”
If the data is the wrong type for the column — string data in a numeric column, for example.

their are many way to handle missing value :-

Delete

When a dataset has missing or null values, it’s important to decide what to do about them in the context of your project. “Delete rows or columns with missing values.”

You can delete missing/null values in pandas with dropna()

For Column – If you wanted to drop any column with missing values.

df.dropna(inplace=True,axis=1)

For Row – You could choose to delete any rows that have a missing value in any column. Or you could delete rows that have a missing value in a subset of columns.

If I want to analyze the violations on a per-building basis, I might just delete rows that do not have a building identification number.

df.dropna(subset=['Salary'],inplace=True

The subset parameter allows you to specify a list of columns

Replace

Replace missing values.

Pandas fillna(), Call fillna() on the DataFrame to fill in missing values.

If you wanted to fill in every missing value with a zero. df.fillna(0)

Or missing values can also be filled in by propagating the value that comes before or after it in the same column

Propagating values backward.

df.fillna(method='bfill')

Propagating values forward.

df.fillna(method='ffill')

There are many ways to determine what to replace missing data with

Median
Mean
Fixed value
Regression to predict correct value
Stochastic regression
Hot-deck imputation — Randomly pick another row that has similar values on other variables and use its value.

We discuss about all the above point in detail on later articles.

Check Python Python Libraries for Data Science and other articles

Top 4 Python Libraries for Data Science in 2018

Data Science – First Step with Python and Pandas (Read CSV File)

What is Data Analysis

I hope this post helped you to know Data Cleaning with Pandas Python : Missing Values. To get the latest news and updates follow us on Twitter & Facebook, subscribe to our YouTube channel. And If you have any query then please let us know by using the comment form.

Flatten list item from Column - Python [Data Cleaning] »

« What is Data Analysis

Categories: Pandas Python

Tags: PandasPythonPython Libraries

Jamaley Hussain: Hello, I am Jamaley. I did my graduation from StaffordShire University UK . Fortunately, I find myself quite passionate about Computers and Technology.

A Comprehensive Guide to Hugging Face Transformers
In the vast field of natural language processing (NLP), Hugging Face Transformers has emerged as…
Predicting House Prices using Machine Learning: A Step-by-Step Guide
As the real estate market continues to grow, there is an increasing demand for accurate…
Confusion matrix – Example, Scenario and Code
A confusion matrix is a table that is often used to describe the performance of…

Count Missing Value

Drop Column

Duplicates

Missing Values

Related posts: