Table of contents
- What is data cleaning?
- Why is data cleaning necessary?
- Introduction
- Step-by-step process for cleaning your data using Pandas
- How to load CSV data using Pandas in Python
- Remove duplicate data in Pandas Dataframe
- Remove emojis 😁 in Pandas Dataframe
- Change data into lowercase in Pandas Dataframe
- Remove multiple white spaces, tabs and new-lines in Pandas Dataframe
- Remove URLs (Uniform Resource Locators) in Pandas Dataframe
- Drop rows with empty data in Pandas Dataframe
- More data processing
- Conclusion
What is data cleaning?
Data cleaning is the process of removing, adding or modifying data to prepare it for analysis and other machine learning tasks. When data cleaning is needed, it is always done before any kind of analysis or machine learning task.
Clive Humby said, “Data is the new oil.” But we know data still needs to be refined.
Why is data cleaning necessary?
Data is considered one of the major assets of a company. Misleading or inaccurate data is risky and can be a reason for the fall of a company.
The data available to us is not always useful as-is; we often must perform many operations to make it useful. So, it is a good idea to remove unnecessary data and to format and modify the important data so that we can use it. In some scenarios, it is also necessary to add information by processing the available data, for example, adding a language column based on data that already exists, or generating a column with an average value computed from other columns' data.
Introduction
There are many steps involved in the data cleaning process, and not all of them are necessary for every dataset. To perform the data cleaning, we will use the Python programming language with the pandas library.
I have used Python because of its expressiveness and because it is easy to learn and understand. More importantly, Python is the choice of many experts for machine learning tasks because a person without a computer science background can pick it up easily. As for pandas: it is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, and it is one of the most popular data analysis and processing tools out there.
Knowing your data is very important before you start the data cleaning process, because which cleaning steps to perform depends entirely on what kind of data you have and what its nature is.
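A quick way to get to know a dataset is pandas' built-in inspection methods. A minimal sketch (the frame below is made-up sample data standing in for your real CSV):

```python
import pandas as pd

# A tiny example frame standing in for data loaded from a CSV
df = pd.DataFrame({
    'Name': ['Peas', 'Beans', 'Corn'],
    'Type': ['frozen', 'canned', 'frozen'],
    'Price': [3.5, 2.0, 4.25],
})

print(df.head())        # first rows, to eyeball the values
print(df.dtypes)        # column types, to spot mis-parsed columns
print(df.describe())    # basic statistics for numeric columns
print(df.isna().sum())  # count of missing values per column
```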
Step-by-step process for cleaning your data using Pandas
How to load CSV data using Pandas in Python
Before cleaning the data, it is important to load data properly. In this tutorial, I will show basic methods to load data from a CSV file. Find more options to read CSV here.
import pandas as pd
# 1. Read data from csv default way
df = pd.read_csv('my_file.csv')
# 2. Read data from csv using comma as delimiter
df = pd.read_csv('my_file.csv', delimiter=',')
# 3. Read data from csv using comma as delimiter and no headers
df = pd.read_csv('my_file.csv', delimiter=',', header=None)
# 4. Read data from csv using comma as delimiter and custom headers
my_headers = ['Id','Name', 'Type', 'Price']
df = pd.read_csv('my_file.csv', delimiter=',', header=0, names=my_headers)
Remove duplicate data in Pandas Dataframe
Certain steps need to be followed in every data cleaning process. One of those steps is the removal of duplicate data. Whether the data is textual or numeric, removing duplicates is very important, because a dataset with too many duplicates also takes longer to process.
# 1. Removes duplicate and returns a copy of dataframe
df = df.drop_duplicates()
# 2. Removes duplicates in place (inplace=True returns None, so don't reassign)
df.drop_duplicates(inplace=True)
# 3. Drops duplicates and keeps the last occurrence instead of the first
df.drop_duplicates(inplace=True, keep='last')
# 4. Considers only certain columns for identifying duplicates
df.drop_duplicates(subset=['Id', 'Price'], inplace=True, keep='last')
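To see the effect of `keep`, here is a small sketch on made-up data with one duplicated row:

```python
import pandas as pd

# Rows 0 and 2 are identical duplicates
df = pd.DataFrame({
    'Id': [1, 2, 1],
    'Name': ['Peas', 'Beans', 'Peas'],
    'Price': [3.5, 2.0, 3.5],
})

deduped = df.drop_duplicates()               # keeps the first occurrence (rows 0, 1)
last_kept = df.drop_duplicates(keep='last')  # keeps the last occurrence (rows 1, 2)
print(deduped)
print(last_kept)
```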
Remove emojis 😁 in Pandas Dataframe
There are many cases where we do not want emojis in our textual dataset. We can remove emojis using a single line of code. The code snippet shown below will remove emojis from a pandas dataframe, column by column. The snippet can be found on Stack Overflow.
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
This code snippet encodes all the data into ASCII (American Standard Code for Information Interchange), ignoring any characters that cannot be encoded, and then decodes it back. Since the emojis were dropped during the encoding step, the decoded data is emoji-free. Note that this removes every non-ASCII character, not just emojis.
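A quick demonstration on made-up data (note the trailing space left behind where the emoji was, which the whitespace-stripping step below will clean up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Peas 😁', 'Beans'],
                   'Type': ['frozen 🎉', 'canned']})

# Encode to ASCII, dropping anything that cannot be represented, then decode back
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
print(df)
```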
Change data into lowercase in Pandas Dataframe
You will often need to change the case of your text data. Here is a code snippet to change data to lowercase. More examples can be found here.
df['Type'] = df['Type'].str.lower()
df['Name'] = df['Name'].str.lower()
Remove multiple white spaces, tabs and new-lines in Pandas Dataframe
Most datasets contain unnecessary whitespace, tabs and newlines. Tabs and newlines are easy to spot, but extra spaces are not, and they can quietly affect the models we train.
df['Type'] = df['Type'].str.replace('\n', '')
df['Type'] = df['Type'].str.replace('\t', ' ')
df['Type'] = df['Type'].str.replace(' {2,}', ' ', regex=True)
df['Type'] = df['Type'].str.strip()
The first line replaces newlines with empty strings, and the second replaces tabs with single spaces. The third line finds runs of two or more spaces with a regular expression (regex) and replaces each run with a single space. Finally, the last line strips (removes) whitespace from both ends of the data.
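Applied in order to a made-up messy column, the four steps above look like this:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['  frozen\tpeas\n', 'green   beans']})

df['Type'] = df['Type'].str.replace('\n', '')               # drop newlines
df['Type'] = df['Type'].str.replace('\t', ' ')              # tabs -> single space
df['Type'] = df['Type'].str.replace(' {2,}', ' ', regex=True)  # collapse space runs
df['Type'] = df['Type'].str.strip()                         # trim both ends
print(df)
```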
Remove URLs (Uniform Resource Locators) in Pandas Dataframe
Many people use surveys to get data. Respondents tend to fill in random details, and sometimes this data contains URLs. I have used the regex pattern shown in the code snippet to remove URLs, though one can use any regex pattern that matches URLs. Here, I have replaced matching URL patterns with an empty string.
df['Type'] = df['Type'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
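A quick check of the pattern on made-up rows (the URLs are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'Type': ['frozen peas http://example.com/x',
                            'see www.example.com for beans']})

# Remove http(s)://... and www... tokens, then strip leftover edge whitespace
df['Type'] = (df['Type']
              .replace(r'http\S+', '', regex=True)
              .replace(r'www\S+', '', regex=True)
              .str.strip())
print(df)
```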
Drop rows with empty data in Pandas Dataframe
All of the cleaning steps above can leave empty values behind in some columns. We have to get rid of those rows, otherwise they introduce uncertainty into the trained model. To make sure we drop all rows with empty data, we combine two methods, shown below.
# Drop rows containing missing values (dropna returns a copy, so reassign)
df = df.dropna()
# Drop rows whose 'Type' column is an empty string ('' evaluates to False)
df = df[df['Type'].astype(bool)]
The first line removes all rows that contain np.nan, pd.NaT or None (note that dropna returns a new dataframe, so the result must be assigned back), whereas the second line removes rows whose column holds an empty string. The second method is fast, but it will not work if the column contains even a single whitespace character. This is another reason why we stripped our data earlier.
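Both methods together, on a made-up column with one missing value and one empty string:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['frozen', '', None, 'canned']})

df = df.dropna()                  # drops the None row
df = df[df['Type'].astype(bool)]  # drops the empty-string row
print(df)
```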
More data processing
Sometimes you need to drop some columns, create a new column from existing data or remove rows that do not contain specific data in them.
import numpy as np

# Drop the Id and Name columns
df = df.drop(['Id', 'Name'], axis=1)

# Keep only rows whose Type contains 'frozen' or 'green'
df = df[df['Type'].str.contains('frozen') | df['Type'].str.contains('green')]

def detect_price(row):
    if row['Price'] > 15.50:
        return 'High'
    elif row['Price'] > 5.50:
        return 'Medium'
    elif row['Price'] > 0.0:
        return 'Low'
    else:
        return np.nan

df['Range'] = df.apply(lambda row: detect_price(row), axis=1)
Here, the drop call removes the two columns named Id and Name and returns a copy of the new dataframe. The next line checks whether the 'Type' column contains the string frozen or green and keeps only the rows where it does. The detect_price function is used to create a new column named 'Range' based on the value of the 'Price' column: using a lambda, each row is passed to detect_price, and the value returned for its price is assigned to the new column for that row. The reason for returning np.nan on out-of-range prices is that we can then remove those rows afterwards using df.dropna().
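The same banding can also be done without a row-wise apply, using pandas' pd.cut with bin edges mirroring detect_price. A sketch on made-up prices (vectorized, so usually faster on large frames):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Price': [20.0, 10.0, 3.0, -1.0]})

# Bins mirror detect_price: (0, 5.5] -> Low, (5.5, 15.5] -> Medium, (15.5, inf) -> High
# Values outside the bins (e.g. negative prices) become NaN, as in detect_price
df['Range'] = pd.cut(df['Price'],
                     bins=[0.0, 5.5, 15.5, np.inf],
                     labels=['Low', 'Medium', 'High'])
print(df)
```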
Conclusion
The data cleaning process is one of many processes involved in data science, and having clean and accurate data is a blessing. Every project needs its data cleaned and processed differently. I have covered some of the most frequently used cleaning methods; you can build your own set of methods as well, or use any of the existing ones. I have attached the whole code here for reference.