Exploratory Data Analysis

 It's been a long time since the last one... So we are back with another important topic: EDA. In this blog I will be covering:

  • What is EDA and why is it important
  • Loading datasets for EDA
  • Tools and libraries used
  • Types of EDA
  • Insights and Hypothesis generation
  • Mistakes to avoid
Let's begin...

As you know, Data Science is about making sense of data, and EDA plays a big role in making that happen. It helps us find patterns, insights and more in the data.

What is EDA?

  • EDA is an important step in DS and DA where we visualize the data to understand its main features, find patterns and discover how different parts of the data are connected.
  • So we build visual summaries (plots and charts) of the data to analyse the datasets.
"Okay, we got the idea of what EDA's main goal is, but what are we gonna do by finding patterns, draw a shape??" Well, EDA is very important in Data Science and model building, and here is why.


Why is it important?

  • It shows how the data is distributed and what each feature contains: data types, errors, outliers and patterns, all of which affect the model a lot
  • It finds hidden patterns, which helps us in model building
  • It plays a vital role in the DS pipeline as it facilitates hypothesis generation, allowing analysts to formulate initial questions about the data and the observed patterns
  • Before building models, it helps us understand and find outliers, which lets us create better models
Now we have some understanding of why EDA is important, so let's begin our EDA. For doing EDA we need something to work on (datasets), so we should load our datasets first.

Loading dataset for EDA

To begin, first import pandas into your code by typing

import pandas as pd
  • pd.read_csv, pd.read_excel : import data into your code as a DataFrame
  • Some instructions for inspecting the data :
  • df.shape : returns the number of rows and columns as a tuple (it is an attribute, so no parentheses)
  • df.head() : returns the first 5 rows
  • df.head(value) : returns the first "value" rows
  • df.tail() : returns the last 5 rows
  • df.columns : returns all the column names in our dataframe (also an attribute, no parentheses)
  • df.info() : shows a summary of your DataFrame including the number of rows, column names, datatypes, non-null counts and memory usage
  • df.describe() : gives a statistical summary of our numeric data like count, mean, standard deviation, min, max etc.
  • df['column name'].value_counts() : counts how many times each unique value appears in a column
  • df.isnull().sum() : returns the count of empty/null values in each column of our dataset
  • Small tip : for larger files, you can load a subset first by passing nrows=subset size to pd.read_csv.
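
If you want to try the inspection calls above without downloading anything, here is a minimal sketch. It assumes a tiny made-up table in the style of the Titanic data; the rows and the names csv_text and first_three are only for illustration and are not the real dataset.

```python
import io
import pandas as pd

# Hypothetical sample data, made up for illustration (not the real Titanic file)
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,26,7.92
4,1,1,female,35,53.10
5,0,3,male,,8.05
6,0,3,male,,8.46
"""

# read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

# nrows limits how many data rows are read, handy for large files
first_three = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(df.shape)             # attribute, no parentheses: (rows, columns)
print(df.columns.tolist())  # attribute as well
print(df.head(2))           # first 2 rows
print(df.tail(2))           # last 2 rows
df.info()                   # prints dtypes, non-null counts, memory usage
print(df.describe())        # summary statistics for numeric columns
print(df["Sex"].value_counts())
print(df.isnull().sum())    # missing values per column
```

With a real file you would pass its path to pd.read_csv instead of the io.StringIO wrapper.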

For your understanding, I've put screenshots of what the outputs look like... You can try them on your own. I have used the "Titanic Dataset" from Kaggle, but you can use any other one.




Tools and libraries used

Alright, we have our data loaded and inspected, so let's explore and visualize it. To make our life easier, there are many powerful tools and libraries:

  • Pandas – The backbone of data analysis in Python. Helps us load, inspect, filter, group, and aggregate data, and also handle missing values effortlessly.

  • NumPy – Essential for numerical computations and array operations. Many times, it’s used behind the scenes to prepare data before modeling.

  • Matplotlib – Our Ishaan Avasti ( I hope you remember my previous blog). Perfect for visualizing data through bar charts, line graphs, pies, and more.

  • Seaborn – Built on Matplotlib but designed for statistical visualizations. Great for correlation plots, categorical plots, and prettier graphs.

  • Power BI – A low-code tool for interactive visualizations. It is so easy to import data, clean it, and quickly get insights. It even has built-in AI visuals that can suggest patterns in your data. I like this application. It is also used at industry level (I got to know this in a workshop for Power BI held in our college).

Types of EDA

 EDA is all about how we explore our data, and the way we explore it depends on how many variables we look at. There are three types/levels:
1) Univariate Analysis
2) Bivariate Analysis
3) Multivariate Analysis

Univariate Analysis 

Analysis of a single variable in your dataset to understand its distribution (whether the values are clustered together or spread out), central tendency (the center value; common measures are mean, median and mode), spread (how much the values vary around the center; measures are standard deviation, variance and range), and outliers (values that are far away from most other values, "introverts in a party", literally me).

Types of Univariate Analysis

  • For Numerical Variables (Continuous numbers):

    • Histogram: Shows the frequency distribution of values.

    • Boxplot: Highlights median, quartiles, and outliers.

    • Density/KDE Plot: Smooth estimate of the distribution.

    • Summary Statistics: Mean, median, mode, range, variance, standard deviation.

  • For Categorical Variables (Categories/Labels):

    • Bar Chart / Count Plot: Shows how many times each category appears.

    • Pie Chart: Visualizes proportion of each category.

    • Frequency Table: Counts unique values in the variable.
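
Here is a minimal sketch of univariate analysis with pandas. The column names and values are made up for illustration. It computes the summary statistics for one numerical column, flags outliers with the 1.5 × IQR rule (the rule a standard boxplot uses for its whiskers), and builds a frequency table for one categorical column.

```python
import pandas as pd

# Hypothetical values, made up for illustration
df = pd.DataFrame({
    "fare": [7.25, 8.05, 7.92, 13.0, 26.0, 8.46, 512.33, 10.5],
    "pclass": ["3rd", "3rd", "3rd", "2nd", "1st", "3rd", "1st", "2nd"],
})

# Numerical variable: central tendency and spread
fare = df["fare"]
print("mean:", fare.mean(), "median:", fare.median(), "std:", fare.std())

# Outliers via the 1.5 * IQR rule
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
outliers = fare[(fare < q1 - 1.5 * iqr) | (fare > q3 + 1.5 * iqr)]
print(outliers)

# Categorical variable: frequency table and proportions
counts = df["pclass"].value_counts()
shares = df["pclass"].value_counts(normalize=True)
print(counts)
print(shares)
```

Notice how one extreme fare pulls the mean far above the median. If matplotlib is installed, fare.plot(kind="hist") and fare.plot(kind="box") should draw the matching histogram and boxplot.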

To explain in a simple way, let's imagine you went to a supermarket and bought some fruits. After finishing shopping you want to check what you have bought, so you look at one fruit at a time. Let's say the first fruit you think of is an orange. To understand the oranges you bought, you see how many of them are small/medium/large (distribution), what size is most common (central tendency), whether any giant or tiny ones exist, and whether there is a vegetable in the basket (outlier). Once we understand the oranges, we move on to the next fruit. This gives us a complete understanding.

Bivariate Analysis 

In bivariate analysis we analyse the relationship between two variables. If they are related, we describe the nature, strength and direction of that relationship. It helps in knowing about correlations (how one variable changes when there is a change in another) and associations between different factors in DA.
There are 3 types of bivariate analysis :

  • Numerical vs numerical : Both the dependent and independent variables are numerical, e.g. age and fare
  • Categorical vs categorical : Both are categorical, e.g. gender and survival
  • Numerical vs categorical : Usually the dependent variable is numerical and the independent variable is categorical, e.g. fare by passenger class
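
The three cases can be sketched with plain pandas like this. The data and column names are made up for illustration, and the method shown for each case is one common choice, not the only one.

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5, 6],
    "score":       [52, 55, 61, 70, 74, 83],
    "gender":      ["F", "M", "F", "M", "F", "M"],
    "passed":      ["no", "no", "yes", "yes", "yes", "yes"],
})

# Numerical vs numerical: Pearson correlation (between -1 and 1)
r = df["study_hours"].corr(df["score"])
print(r)

# Categorical vs categorical: contingency table of counts
table = pd.crosstab(df["gender"], df["passed"])
print(table)

# Numerical vs categorical: compare the numeric variable across groups
mean_by_group = df.groupby("passed")["score"].mean()
print(mean_by_group)
```

A value of r close to 1 means the two variables rise together, close to -1 means one falls as the other rises, and close to 0 means no linear relationship.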

Multivariate Analysis

As the name says, here we analyze more than two variables together. It helps us understand how multiple factors interact and influence each other at the same time. Instead of seeing things pair by pair (like in bivariate), multivariate gives us the bigger picture. It is very useful for real-world data, where outcomes are usually influenced by many factors, not just one or two. Here are some types of multivariate analysis :
  • Multivariate Regression: Predict one variable using several predictors.

  • Principal Component Analysis (PCA): Reduce dimensions while keeping important information.

  • Factor Analysis: Group related variables into underlying factors.

  • Cluster Analysis: Segment data points into similar groups.

  • Heatmaps & Pairplots: Visualize relationships across multiple variables at once.
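
Two of these can be sketched in a few lines. The student data below is made up for illustration. The correlation matrix holds the numbers that a heatmap would colour, and the PCA here is done by hand with NumPy's SVD on standardised columns; it is a sketch of the idea, not a replacement for a library implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical student data, made up for illustration
df = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10, 12],
    "sleep_hours": [8, 7, 7, 6, 6, 5],
    "attendance":  [60, 70, 75, 85, 90, 95],
    "score":       [50, 58, 66, 75, 82, 91],
})

# Pairwise correlations between all columns at once
corr = df.corr()
print(corr.round(2))

# PCA by hand: standardise each column, then take the SVD
z = (df - df.mean()) / df.std()
_, s, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)

# Share of the total variance captured by each principal component
explained = s**2 / np.sum(s**2)
print(explained.round(3))
```

If the first entry of explained is close to 1, most of the information in the four columns lies along a single direction, which is what "reduce dimensions while keeping important information" means. With seaborn installed, sns.heatmap(corr, annot=True) should draw the matrix as a heatmap.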

For example, a student’s performance can depend on study hours, sleep, diet, and class attendance combined, not just one factor alone. So in summary :
Univariate Analysis : Looking at one variable at a time.

Bivariate Analysis : Looking at two variables together.

Multivariate Analysis : Looking at more than two variables together.

Like, imagine you went to a supermarket and bought fruits, and you want to check what you have bought.
Using univariate analysis you focus on one fruit, let's say watermelon (I've chosen it because it's my favourite, you can choose whatever you like). So we analyse the characteristics of the watermelons, e.g. shape, size, color etc. In the technical sense we look at their distribution, central tendency and so on.
Now you change your mind and think of doing bivariate analysis. You look at two fruits, watermelon and mango, to see if they are related. For example, heavier watermelons might come with larger mangoes. If going pair by pair feels too slow, you switch to multivariate analysis. Here you look at all the fruits together, considering multiple characteristics like size, ripeness, color, and price, to spot complex patterns or interactions. Maybe baskets with large watermelons, medium mangoes, and ripe bananas tend to be the most valuable. In short, univariate focuses on one item, bivariate on pairs, and multivariate on the whole picture.

Insights & Hypothesis Generation

After all of this, you can generate insights (our main goal) that guide your next steps. EDA helps you formulate hypotheses about relationships between variables. For example, in a sales dataset, if you notice that online purchases are higher on weekends, your hypothesis could be: “Customers buy more online on weekends than on weekdays.” These insights can help in feature engineering, model building, or testing assumptions. Feature engineering will be covered in a later blog. We test hypotheses using methods like the F-test, chi-square test, t-test etc.
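
To make the weekend example concrete, here is a minimal sketch that computes a t statistic by hand. The order counts are made up for illustration, and I am assuming Welch's version of the t-test (which does not require equal variances in the two groups) as one reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical daily online order counts, made up for illustration
sales = pd.DataFrame({
    "day_type": ["weekday"] * 5 + ["weekend"] * 4,
    "orders":   [120, 115, 130, 125, 110, 160, 170, 155, 165],
})

# Per-group mean, sample variance and size
groups = sales.groupby("day_type")["orders"]
mean, var, n = groups.mean(), groups.var(), groups.count()

# Welch's t statistic: difference in means divided by its standard error
std_error = np.sqrt(var["weekend"] / n["weekend"] + var["weekday"] / n["weekday"])
t_stat = (mean["weekend"] - mean["weekday"]) / std_error

print(mean)
print(t_stat)
```

The further the statistic is from zero, the stronger the evidence that the two groups really differ. To also get a p-value, scipy.stats.ttest_ind with equal_var=False runs the same test if SciPy is installed.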

Mistakes to avoid in EDA

  • Misinterpreting correlation as causation : Just because two variables move together doesn’t mean one causes the other.
  • Ignoring outliers without context : Some extreme values may be errors, but others can be genuine insights, so check before dropping them.
  • Overplotting : Too many points or variables in one plot can hide patterns instead of showing them.
  • Not checking data types or missing values : This can lead to incorrect analysis or plots.
  • Focusing only on one type of analysis : If we rely on a single type of analysis, we will miss the bigger patterns.
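
The data type and missing value check takes only a few lines. A minimal sketch, assuming a made-up table where a stray "unknown" caused the age column to be read as text:

```python
import pandas as pd

# Hypothetical data: "age" is stored as text because of the stray "unknown"
df = pd.DataFrame({
    "age":  ["22", "38", "unknown", "35"],
    "fare": [7.25, 71.28, None, 53.1],
})

print(df.dtypes)  # age is stored as text, not as a number

# Convert to numbers; values that cannot be parsed become NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")

print(df.dtypes)          # age is now numeric
print(df.isnull().sum())  # missing values per column after conversion
print(df["age"].mean())   # the mean skips NaN by default
```

Without the conversion, numeric operations on the age column would either fail or quietly work on text, which is exactly the kind of incorrect analysis this mistake leads to.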


That concludes my blog. EDA plays an important role in understanding our data well; we go full Sherlock Holmes mode in this. My references for this blog are Wikipedia, GeeksForGeeks and YouTube. EDA reminds me of that one Michael Scott quote:

"Sometimes I just start exploring a dataset and I don't even know the insights, I just dig deeper and hope I find it along the way" - Michael Scott, if he was a Data Scientist, probably. I hope there are people who watch "The Office" among those who read my blogs.

Until the next blog, continue exploring data and gathering insights.
