Exploratory Data Analysis
It's been a long time since the last one, so we are back with another important topic: EDA. In this blog I will cover:
- What is EDA and why is it important
- Loading datasets for EDA
- Tools and libraries used
- Types of EDA
- Insights and Hypothesis generation
- Mistakes to avoid
What is EDA?
- EDA is an important step in data science (DS) and data analysis (DA) in which we visualize the data to understand its main features, find patterns, and discover how different parts of the data are connected.
- In practice, we build visual summaries of the datasets in order to analyse them.
Why is it important?
- It shows how the data is distributed and what each feature contains, including errors, outliers, and patterns that affect the model a lot.
- It uncovers hidden patterns, which helps us in model building.
- It plays a vital role in the DS pipeline because it facilitates hypothesis generation, allowing analysts to formulate initial questions about the data and the patterns they observe.
- Before we build models, it helps us understand the data and find outliers, which leads to better models.
Loading dataset for EDA
- pd.read_csv, pd.read_excel : import data into our code
- Some instructions for inspecting the data :
- df.shape : returns the number of rows and columns as a tuple (it is an attribute, so no parentheses)
- df.head() : returns the first 5 rows
- df.head(value) : returns the first n rows, where n is the given value
- df.tail() : returns the last 5 rows
- df.columns : returns all the column names in our DataFrame (also an attribute, so no parentheses)
- df.info() : shows a summary of your DataFrame, including the number of rows, column names, data types, non-null counts, and memory usage
- df.describe() : gives a statistical summary of our data, such as count, mean, standard deviation, min, and max
- df['column name'].value_counts() : shows each unique value in a column and how often it occurs
- df.isnull().sum() : returns the count of empty/null values in each column of our dataset
- Small tip : For larger files, you can load a subset first by passing nrows=<subset size> to pd.read_csv.
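To see the inspection calls above in one place, here is a minimal sketch. The CSV text, the column names, and the variable names (csv_text, preview) are made up for illustration; with a real file you would pass its path to pd.read_csv instead of the io.StringIO wrapper.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file on disk (illustrative only).
csv_text = """name,age,city
Asha,23,Pune
Ravi,31,Delhi
Meera,,Pune
John,45,Mumbai
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)              # attribute: (rows, columns)
print(df.columns.tolist())   # attribute: column names
print(df.head(2))            # first 2 rows
print(df.tail())             # last 5 rows (all 4 here)
df.info()                    # dtypes, non-null counts, memory usage
print(df.describe())         # summary statistics for numeric columns
print(df["city"].value_counts())  # how often each city appears
print(df.isnull().sum())     # missing values per column

# nrows limits how many data rows are read, which is handy for large files.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview.shape)
```

Note that shape and columns are used without parentheses, while the others are method calls.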
Tools and libraries used
Alright, we have our data loaded and inspected, so let's explore and visualize it. To make our lives easier, there are many powerful tools and libraries:
- Pandas : The backbone of data analysis in Python. Helps us load, inspect, filter, group, and aggregate data, and also handle missing values effortlessly.
- NumPy : Essential for numerical computations and array operations. It is often used behind the scenes to prepare data before modeling.
- Matplotlib : Our Ishaan Avasti (I hope you remember my previous blog). Perfect for visualizing data through bar charts, line graphs, pies, and more.
- Seaborn : Built on Matplotlib but designed for statistical visualizations. Great for correlation plots, categorical plots, and prettier graphs.
- Power BI : A low-code tool for interactive visualizations. It is very easy to import data, clean it, and quickly get insights. It even has built-in AI visuals that can suggest patterns in your data, and I like this application. It is also used at industry level (I learned this in a Power BI workshop held at our college).
Types of EDA
Univariate Analysis
Analysis of a single variable in your dataset to understand its distribution (whether the values are clustered together or spread out), central tendency (the center value; common measures are mean, median, and mode), spread (how much the values vary around the center; measures are standard deviation, variance, and range), and outliers (values that are far away from most other values, "introverts at a party, literally me").
Types of Univariate Analysis
For Numerical Variables (Continuous numbers):
- Histogram: Shows the frequency distribution of values.
- Boxplot: Highlights median, quartiles, and outliers.
- Density/KDE Plot: Smooth estimate of the distribution.
- Summary Statistics: Mean, median, mode, range, variance, standard deviation.
For Categorical Variables (Categories/Labels):
- Bar Chart / Count Plot: Shows how many times each category appears.
- Pie Chart: Visualizes proportion of each category.
- Frequency Table: Counts unique values in the variable.
To explain it in a simple way, imagine you go to a supermarket and buy some fruits. After finishing your shopping, you want to check what you have bought, so you look at one fruit at a time. Say the first fruit you think of is oranges. To understand the oranges you bought, you look at how many of them are small, medium, or large (distribution), which size is most common, whether any giant or tiny ones exist, and whether a vegetable has ended up in the basket (outlier). Once we understand the oranges, we move on to the next fruit. Going through every fruit this way gives us a complete understanding.
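Sticking with the oranges, here is a small sketch of univariate analysis in pandas. The weights and sizes are invented for illustration, and the 1.5 × IQR rule used to flag outliers is one common convention (it is the rule a boxplot usually draws), not the only possible choice.

```python
import pandas as pd

# Made-up basket of oranges (weights in grams, illustrative only).
oranges = pd.DataFrame({
    "weight_g": [120, 125, 130, 128, 122, 127, 131, 124, 126, 300],
    "size": ["small", "medium", "medium", "medium", "small",
             "medium", "large", "medium", "medium", "large"],
})

weights = oranges["weight_g"]

# Central tendency and spread of the numerical variable.
print("mean:", weights.mean())
print("median:", weights.median())
print("std:", weights.std())
print("range:", weights.max() - weights.min())

# Flag outliers with the 1.5 * IQR rule.
q1 = weights.quantile(0.25)
q3 = weights.quantile(0.75)
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]
print("outliers:", outliers.tolist())

# Frequency table for the categorical variable.
size_counts = oranges["size"].value_counts()
print(size_counts)
```

If Matplotlib and Seaborn are installed, weights.plot.hist() or sns.boxplot(x=weights) would draw the same information as a histogram or boxplot.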
Bivariate Analysis
In bivariate analysis we analyse the relationship between two variables. If they are related, we describe the nature, strength, and direction of that relationship. It helps us learn about correlations (how one variable changes when another changes) and associations between different factors in DA.
There are 3 types of Bivariate analysis :
- Numerical vs numerical : Both the dependent and independent variables are numerical
- Categorical vs categorical : Both variables are categorical, e.g., gender
- Numerical vs categorical : Usually the dependent variable is numerical and the independent variable is categorical
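Here is one way the three cases above might look in pandas. The dataset and its column names (hours_studied, exam_score, gender, passed) are invented for illustration; the calls used are Series.corr, pd.crosstab, and groupby.

```python
import pandas as pd

# Made-up student data (illustrative only).
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 68, 74, 80],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "passed": ["no", "no", "yes", "yes", "yes", "yes"],
})

# Numerical vs numerical: Pearson correlation (strength and direction).
r = df["hours_studied"].corr(df["exam_score"])
print("correlation:", round(r, 3))

# Categorical vs categorical: contingency table of counts.
table = pd.crosstab(df["gender"], df["passed"])
print(table)

# Numerical vs categorical: compare the numerical variable across groups.
group_means = df.groupby("gender")["exam_score"].mean()
print(group_means)
```

A scatter plot, a grouped bar chart, and a boxplot per group would be the usual visual counterparts of these three summaries.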
Multivariate Analysis
- Multivariate Regression: Predict one variable using several predictors.
- Principal Component Analysis (PCA): Reduce dimensions while keeping important information.
- Factor Analysis: Group related variables into underlying factors.
- Cluster Analysis: Segment data points into similar groups.
- Heatmaps & Pairplots: Visualize relationships across multiple variables at once.
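As a rough sketch of two items from this list, the snippet below builds a correlation matrix (the numbers behind a heatmap) and runs a bare-bones PCA using NumPy's SVD. The data is randomly generated for illustration, and doing PCA by hand with SVD is my choice to keep the example dependency-light; a library such as scikit-learn would be the more common route in practice.

```python
import numpy as np
import pandas as pd

# Randomly generated data (illustrative only); weight is built to depend on height.
rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * height + rng.normal(0, 5, n)
age = rng.normal(35, 8, n)

df = pd.DataFrame({"height": height, "weight": weight, "age": age})

# Correlation matrix: pairwise relationships across all variables at once.
corr = df.corr()
print(corr.round(2))

# PCA via SVD on standardized data.
X = ((df - df.mean()) / df.std()).to_numpy()
_, s, vt = np.linalg.svd(X, full_matrices=False)

# Share of total variance captured by each principal component.
explained = s**2 / np.sum(s**2)
print("explained variance ratio:", explained.round(3))

# Project the data onto the first two principal components.
components = X @ vt[:2].T
print(components.shape)
```

With Seaborn installed, sns.heatmap(corr, annot=True) and sns.pairplot(df) would turn the same DataFrame into the heatmap and pairplot mentioned above.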
Bivariate Analysis : Looking at two variables together.
Insights & Hypothesis Generation
After all of this, you can generate insights (our main goal) that guide your next steps. EDA helps you formulate hypotheses about relationships between variables. For example, in a sales dataset, if you notice that online purchases are higher on weekends, your hypothesis could be: “Customers buy more online on weekends than on weekdays.” These insights can help in feature engineering, model building, or testing assumptions. Feature engineering will be covered in upcoming blogs. We test hypotheses using methods like the F-test, chi-square test, t-test, etc.
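To make the weekend example concrete, here is a minimal sketch of a t-test using scipy.stats.ttest_ind. The daily order counts are invented for illustration, and setting equal_var=False (Welch's t-test) is an assumption on my part so that the two groups are not required to have equal variance.

```python
import numpy as np
from scipy import stats

# Made-up daily online order counts (illustrative only).
weekend = np.array([132, 141, 128, 150, 138, 145, 135, 142])
weekday = np.array([101, 98, 110, 105, 95, 108, 102, 99])

# Welch's t-test: compares the two group means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(weekend, weekday, equal_var=False)

print("t statistic:", round(t_stat, 2))
print("p-value:", p_value)

# A small p-value (0.05 is a common threshold) suggests the difference in
# means is unlikely to be due to chance alone.
if p_value < 0.05:
    print("The data supports a difference between weekend and weekday orders.")
else:
    print("Not enough evidence of a difference.")
```

This is a two-sided test, so it only says the means differ; the sign of the t statistic tells you which group is higher.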
Mistakes to avoid in EDA
- Misinterpreting correlation as causation : Just because two variables move together doesn’t mean one causes the other.
- Ignoring outliers without context : Some extreme values may be errors, but others can be genuine insights.
- Overplotting : Too many points or variables in one plot can hide patterns instead of showing them.
- Not checking data types or missing values : This can lead to incorrect analysis or plots.
- Focusing only on one type of analysis : If we rely on a single analysis, we will miss bigger patterns.
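As a small illustration of the data type mistake above, here is a sketch where a price column was read in as text. The column name and values are made up, and using pd.to_numeric with errors="coerce" is just one way to handle it.

```python
import pandas as pd

# Made-up column where numbers were stored as text (illustrative only).
df = pd.DataFrame({"price": ["10", "12", "n/a", "15"]})

# The column is not numeric, so a mean or a histogram would fail or mislead.
print(pd.api.types.is_numeric_dtype(df["price"]))

# Convert to numbers; values that cannot be parsed become NaN.
df["price_num"] = pd.to_numeric(df["price"], errors="coerce")

print(df["price_num"].isna().sum())  # how many values failed to convert
print(df["price_num"].mean())        # mean over the valid values only
```

Checking the count of values that failed to convert is worth doing before deciding whether to drop or fill them.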
"Sometimes I just start exploring a dataset and I don't even know the insights, I just dig deeper and hope I find it along the way" - Michael Scott, if he were a Data Scientist, probably. I hope there are people who watch "The Office" and read my blogs.