Checklist for Performing Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA), or data exploration for short, is an essential step in any data analysis or machine learning project. It helps you understand the data and gives you a clear intuition about the appropriate measures to take as you work with it. The importance of this cannot be overemphasized. You may have heard the phrase "garbage in, garbage out"; that is exactly what happens when you build machine learning models on bad data.
So in this article, you will learn how to perform EDA step by step. Many times we know what to do but not how to do it, and that is what EDA feels like for many people. By the end of this article, however, you should have a clearer picture.
The article is intended to be non-technical so that as many people as possible can follow along. You can use any tool you like for these steps, whether Python, R, Excel, SPSS, Power BI, or something else. That is entirely up to you.
We will be focusing more on logic than the use of tools here. Before we jump into the crux of EDA, let’s understand the key components of exploratory data analysis.
Components of EDA
- An attempt to understand the variables involved.
- Understanding the relationship between the variables.
- Cleaning the data.
Every step you take in EDA serves at least one of these components. So, let's get into it.
What you must do in Exploratory Data Analysis
- Understand the basic information about your data.
You cannot work with data you do not understand. So the first thing you must do is understand the basic properties of the data. A good place to start is the number of rows and the number of columns. You also want to check what each column means and what data type it contains, as well as the basic statistics of each column: for example, the maximum, minimum, and average values. This gives you a good idea of how the values are distributed and whether outliers might be present.
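If Python happens to be your tool of choice, pandas covers most of these checks in a line each. The sketch below is only an illustration; the file name students.csv and its columns are made up, and any of the other tools mentioned earlier would work just as well.

```python
# A minimal sketch with pandas; "students.csv" is a hypothetical file.
import pandas as pd

df = pd.read_csv("students.csv")

print(df.shape)       # number of rows and columns
print(df.dtypes)      # data type of each column
print(df.head())      # a quick look at the first few rows
print(df.describe())  # min, max, mean and other basic statistics for numeric columns
```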
- Understand each column and the data types.
Understanding the columns was already touched on in the last point, but we need to emphasize understanding the data types. There are two major types of data:
- Discrete: values that take distinct, countable levels. Numeric examples include the number of wheels on a vehicle or the number of students in a class; these are whole numbers and cannot be fractions. Discrete columns are not limited to numbers, though: they can also be strings (textual data), booleans (true or false), dates, and so on.
- Continuous: values that can fall anywhere within a range, including fractions and negative numbers. Examples of continuous data include the price of a car, the temperature of a room, the distance between two points, and so on.
But why is it important to differentiate between the data types? Because some operations only make sense for certain types. For instance, computing the mean of a column of strings or booleans is meaningless, whereas it is a perfectly natural summary for a continuous column such as price.
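To make this concrete, here is a small, purely illustrative pandas snippet (continuing the made-up students.csv example) that separates numeric columns from string or boolean ones before summarizing them:

```python
# Separate columns by type before deciding which summaries make sense.
import pandas as pd

df = pd.read_csv("students.csv")  # hypothetical file

numeric_cols = df.select_dtypes(include="number").columns
other_cols = df.select_dtypes(exclude="number").columns

print(df[numeric_cols].mean())   # means are meaningful for numeric columns
print(df[other_cols].nunique())  # for strings/booleans, count distinct values instead
```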
- Check for outliers.
In simple terms, outliers are extremely low or extremely high values in a distribution. It is important to check whether they exist, because they can bias your data and distort any further analysis you perform on it. Imagine a dataset containing the incomes of Nigerians, and out of a sample of 1,000 people, Dangote, the richest Black man, is one of them. That stupendously high value would pull the average income far upward and render the result misleading. The bottom line is to deal with outliers, either by removing them or by capping or scaling them to a less extreme value.
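There are several ways to flag outliers; one common choice is the interquartile-range (IQR) rule. The sketch below is just that, a sketch, using a hypothetical incomes.csv file with an income column:

```python
# Flag outliers with the IQR rule, then either drop or cap them.
import pandas as pd

df = pd.read_csv("incomes.csv")  # hypothetical file

q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["income"] < lower) | (df["income"] > upper)]
print(len(outliers), "potential outliers")

# Option 1: drop them.  Option 2: cap ("clip") them to the fence values.
df["income_capped"] = df["income"].clip(lower, upper)
```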
- Check for missing values and deal with them.
Missing values can potentially mar the success of your work, so it is important to identify them and know the best way to deal with them. Many people advocate removing rows with missing values, but this is not a one-size-fits-all approach. When missing values are sparse, yes, you can afford to drop those rows. But when, say, 40% of rows contain one or more missing values, removing them is not advisable because you would lose a lot of information. In such cases, you can replace missing values with the column's mean, median, or mode. As a rule of thumb, use the median rather than the mean if outliers exist in your data. There are other ways of dealing with missing values; some people even use machine learning models to predict the most plausible values for the gaps.
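Sticking with the hypothetical students.csv example, the options above could look like this in pandas (the age and grade columns are made up):

```python
# Count missing values, then either drop or fill them.
import pandas as pd

df = pd.read_csv("students.csv")  # hypothetical file

print(df.isnull().sum())  # how many missing values per column

df_dropped = df.dropna()  # option 1: drop rows with missing values

# Option 2: fill with the median (robust to outliers) or the mode for categorical columns.
df["age"] = df["age"].fillna(df["age"].median())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])
```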
- Ask questions and get answers from data visualization.
This is perhaps where it gets interesting. You need to ask insightful questions and try to answer them through visualizations. For instance, if you have data showing the grades of students in a class, you could determine the percentage of students who got an A, B, C, or F by drawing a pie chart. If you want to know how the grades were distributed among the students, you could draw a histogram. If you want to know how a student's age relates to their grade, you could draw a scatter or correlation plot.
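If you were doing this in Python, the three plots above might look roughly like the following; again, the file and the grade, score, and age columns are made up for illustration.

```python
# Rough sketch of the plots mentioned above, using pandas and matplotlib.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("students.csv")  # hypothetical file

df["grade"].value_counts().plot.pie(autopct="%1.0f%%")  # share of A/B/C/F grades
plt.show()

df["score"].plot.hist(bins=10)  # distribution of scores
plt.show()

df.plot.scatter(x="age", y="score")  # relationship between age and score
plt.show()
```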
The most important thing is to find the approach that best helps you learn and master EDA, whether through self-learning or bootcamp training. Data Techcon provides in-depth visualization that is perfect for answering your questions.
EDA is not as difficult as people make it out to be. It is, however, an art that you get better at the more you practice it. As you perform more EDA, you build intuition about how best to understand the data and what to do in any given situation. This article has given you a rundown of the things you must do when performing EDA. Now get to it.