Libraries used for Data Science

 So, it's been quite a long time since I posted. In the last blog we introduced ourselves to the world of Data Science. I gave a small intro on what happens, and you might've seen that I already mentioned the libraries in that first post, but I only covered what I use as a beginner in one line. I think it's better to tell you a bit more about them, and also about the new libraries I use now. This post is going to be about that.

In this post we will see:

  • Why do we use libraries?
  • Different libraries and their functions
  • Core concepts 

Why do we even use libraries??

We all know Python is a language with a vast amount of libraries. These libraries help data scientists with:

  • Cleaning and manipulating data sets
  • Visualizing patterns
  • Building models
  • Processing images, text and Natural language
  • and more, all with a few lines of code
They reduce time, effort and the length of code, and they make the code much more readable.

Different libraries and their functions

Numpy

Numpy stands for Numerical Python. It is mainly used to work with arrays. You might have a doubt: "Tharun, we have lists that can be used as arrays in Python. Why would I use this library?" Well, you are not wrong, lists do serve the purpose of arrays, but what makes us use NumPy is speed. NumPy arrays are much faster than lists (up to 50 times). As we work with large datasets we have to handle huge amounts of data in array form, and lists become very slow at that scale. This is why we use NumPy rather than lists. NumPy is faster because its arrays are stored in a contiguous block of memory, unlike lists, so processes can access and manipulate them efficiently.
NumPy is written partially in Python, but most parts are written in C or C++.
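
Just to give you a feel for the speed difference, here is a rough little sketch comparing a plain Python list with a NumPy array (a toy benchmark; the exact numbers will vary on your machine):

import time
import numpy as np

n = 10_000_000
py_list = list(range(n))
np_array = np.arange(n)

# Doubling every element with a plain Python list comprehension
start = time.time()
doubled_list = [x * 2 for x in py_list]
print("list took:", time.time() - start, "seconds")

# Doubling every element with a single vectorized NumPy operation
start = time.time()
doubled_array = np_array * 2
print("numpy took:", time.time() - start, "seconds")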

  • For installing Numpy in our python we use the command : "pip install numpy"
  • For importing it to our code we use "import numpy" 

What other things can Numpy do?

Besides working with arrays Numpy can also do the following:
  • Matrix operations
  • Linear algebra
  • Fourier transformations
  • Random number generation
  • Statistical and mathematical functions

Core concepts

  • ndarray : N-dimensional array that allows fast and efficient storage of data and computation
  • Broadcasting : Allows operations between arrays of different shapes (NumPy stretches the smaller one automatically)
  • Vectorized operations : Perform element-wise operations on whole arrays at once, without writing Python loops
  • Linear algebra : Supports matrix operations, eigenvalues, etc.
  • Random module : Generate random numbers, useful for simulations and ML initializations
We use NumPy in data science because we work on datasets in the form of matrices and arrays, and NumPy makes working with such data much faster.
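
Here's a quick sketch of these concepts in action (just made-up toy numbers, nothing from a real dataset):

import numpy as np

# ndarray: a 2x3 array stored efficiently in contiguous memory
a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Vectorized, element-wise operation: no loops needed
print(a * 10)                      # every element multiplied by 10

# Broadcasting: the 1-D row [10, 20, 30] is "stretched" across both rows
print(a + np.array([10, 20, 30]))

# Linear algebra: matrix multiplication
b = np.array([[1, 0],
              [0, 1],
              [1, 1]])
print(a @ b)                       # (2x3) @ (3x2) -> a 2x2 matrix

# Random numbers: useful for simulations and ML weight initialization
print(np.random.rand(2, 2))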

Pandas

We saw that NumPy is used for working on numerical arrays. Pandas is used to work with structured data (like CSV files and Excel sheets). You might think "Well, I can use lists or dictionaries, why would I listen to you?" Well, you can, butttt we tend to use Pandas because it provides DataFrames (tables), and tables make it much easier to manipulate, filter and group the data.

  • For installing Pandas we use the command : "pip install pandas"
  • For importing in our code we use : "import pandas"

What other things can it do?

  • It can load data from CSVs, Excel, JSON, SQL and more.
  • Easier to sort, filter, group or merge data
  • Handles missing values (like a boss)
  • Easier to reshape data, do aggregate operations.
  • Works well with NumPy and Matplotlib (in data science these three are like "Nikhil, Pani and Venkat" from the movie Snehitudu).

Core Concepts:

  • Series : One dimensional labeled array(like column)
  • Data Frame : Two dimensional labeled data structure(Table)
  • Indexing and slicing : Select, filter or manipulate rows/columns using .loc[] or .iloc[] 
  • Missing data handling : Missing data can cause errors while computing, so we either remove it or fill it with a value like the mean, mode or median. This is handled with isnull(), fillna() and dropna()
  • Aggregating : For grouping and analyzing data we use GroupBy
So in summary we use Pandas to work with data in table form. As I told you before, cleaning the data takes 80% of a data scientist's time, so Pandas plays a key role in data science. When the dataset is too large for even Pandas to handle, we use Dask. Dask extends NumPy, Pandas and Scikit-learn to work on larger datasets. It uses lazy evaluation and parallel processing to handle billions of rows efficiently, and it is used when datasets are larger than RAM.
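
Here's a tiny sketch with made-up data showing a DataFrame, missing value handling, .loc[] filtering and groupby (the names and marks are just for illustration):

import pandas as pd
import numpy as np

# A small made-up table of students and marks
df = pd.DataFrame({
    "name":   ["Nikhil", "Pani", "Venkat", "Silencer"],
    "branch": ["CSE", "ECE", "CSE", "MECH"],
    "marks":  [85, 92, np.nan, 70],          # one missing value on purpose
})

print(df.isnull().sum())                      # count missing values per column
df["marks"] = df["marks"].fillna(df["marks"].mean())   # fill missing with the mean

# Indexing and slicing with .loc[]: rows where marks > 80, only two columns
print(df.loc[df["marks"] > 80, ["name", "marks"]])

# Aggregating with groupby: average marks per branch
print(df.groupby("branch")["marks"].mean())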

Matplotlib

This one is the artist of the trio. Matplotlib helps us visualize the data; it helps us find meaning (patterns) in the data by showing what the data looks like in the form of graphs, charts, plots, etc.
This is helpful because we tend to understand pictures better than raw numbers. Let's be honest, if you show me a page filled with numbers and ask me to make sense of it, I'd probably get a stroke just by looking at all those numbers.

  • For installing it we use : "pip install matplotlib"
  • For importing it we use : "import matplotlib.pyplot"

What other things can it do?

  • Line plots
  • Bar charts
  • Histograms
  • Scatter plots
  • Pie charts
  • Subplots (many plots in one figure)
  • We can even add customizations like colors, labels, legends etc.. 
There is another library named Seaborn. Seaborn is the stylish child of Matplotlib; it is built on top of it. It makes attractive and informative statistical graphics and quickly plots correlation heatmaps, distributions, etc.

Core concepts:

  • Figures and axes : A figure can contain multiple axes(plots)
  • Basic Plotting : plot(), scatter(), bar(), hist() etc....
  • Customization : Titles, labels, legends, grid etc..
  • Subplots : Create multiple plots in one figure
  • Saving figure : Export plots as png, pdf using savefig()
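
Here's a small sketch putting most of these together: a figure with two axes, some customization, and saving the result (matplotlib.pyplot is usually imported as plt):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# One figure containing two axes (subplots)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(x, np.sin(x), color="purple", label="sin(x)")   # line plot
ax1.set_title("Line plot")
ax1.set_xlabel("x")
ax1.set_ylabel("sin(x)")
ax1.legend()
ax1.grid(True)

ax2.hist(np.random.randn(1000), bins=30, color="orange") # histogram
ax2.set_title("Histogram")

fig.savefig("my_plots.png")   # export the whole figure as a PNG
plt.show()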

Scikit-learn

We talked about Nikhil, Pani and Venkat, but we forgot about one person: "Silencer". Scikit-learn is like Silencer (I have never heard his real name in the movie): it just memorises all the algorithms and doesn't understand data as deeply as the others. Scikit-learn (also written as sklearn) is our go-to for machine learning. It is built on top of NumPy, SciPy and Matplotlib, and it contains ready-made models and data preprocessing techniques. We need machine learning models to predict outcomes from data, and this library has the classic ML algorithms. It is great for tabular data.
  • For installing we use "pip install scikit-learn"
  • For importing we use "import sklearn"

What can it do?

  • Supervised learning 
  • Unsupervised learning
  • Model selection
  • Preprocessing
  • Ensemble methods
  • Feature selection
  • It even has built in data sets

Core concepts : 

  • Models : Prebuilt ML algos like linear regression, SVM, Decision Trees, Random Forests etc..
  • Model selection : It contains tools like GridSearchCV(used for hyperparameter tuning) and train_test_split (splitting our data into two sets). I will talk about these more when we introduce ML
  • Fit/Predict : We use .fit() to train the model and after that we use .predict() to make predictions based on what it was trained on
  • Model Evaluation : We can see how well our model is doing by using accuracy_score, confusion_matrix or cross_val_score. Think of them as the marks students get in an exam; they show how well our model understood the data.
  • Feature Scaling : StandardScaler and MinMaxScaler are used to scale the data. When features have very different ranges, some models learn slowly or give too much importance to the big numbers, so we scale the data so that models learn faster and treat features fairly.
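
Here's a rough sketch of the usual Scikit-learn workflow on one of its built-in datasets (iris); the exact model and settings are just one example, not the only way to do it:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Built-in dataset: iris flowers (tabular data)
X, y = load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling so every feature sits in a similar range
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit a prebuilt model and make predictions
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Model evaluation: the "exam marks" for the model
print("Accuracy:", accuracy_score(y_test, predictions))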

PyTorch

Scikit-learn is limited to classic machine learning algorithms and doesn't have deep learning algorithms. Deep learning algorithms are used when ML algos fail at complex tasks. PyTorch is a deep learning library known for its simplicity and flexibility. It is helpful for researchers and developers working on neural networks. PyTorch allows us to build and train neural networks, and we can perform tensor computations (like NumPy) with strong acceleration via GPU. Tensors are multi-dimensional arrays.
  • For installing it we use : "pip install torch"
  • For importing we use : "import torch"

What can it do?

  • Computer vision
  • NLP (Natural Language Processing)
  • Speech recognition and Sound classification
  • Reinforcement learning
  • Custom deep learning systems (building your own LLMs like ChatGPT)
Internally, PyTorch records everything we do in a dynamic computation graph.

Core concepts :

  • Tensors : These are like NumPy arrays, but they can also live on the GPU
  • Autograd : PyTorch tracks operations and can automatically compute gradients using autograd 
  • Neural Networks : We use torch.nn to build models. I will explain neural networks in the future; they are one of the most fascinating concepts I've learnt.
  • Loss functions : Loss shows how much our predictions differ from the actual values
  • Optimizers : They update the model's weights after the forward and backward passes so the model keeps learning from the data.
PyTorch also provides prebuilt models like AlexNet, Transformer, HuBERT, etc.
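
Here's a minimal sketch of the PyTorch workflow: tensors, a tiny torch.nn model, a loss function, autograd and an optimizer, trained on random made-up numbers just to show the loop:

import torch
import torch.nn as nn

# Tensors: like NumPy arrays, and they can be moved to a GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 3, device=device)          # 8 samples, 3 features
y = torch.randn(8, 1, device=device)          # 8 target values

# A tiny neural network built with torch.nn
model = nn.Sequential(
    nn.Linear(3, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
).to(device)

loss_fn = nn.MSELoss()                                     # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # optimizer

for step in range(100):
    optimizer.zero_grad()
    predictions = model(x)            # forward pass
    loss = loss_fn(predictions, y)    # how far off are the predictions?
    loss.backward()                   # autograd computes the gradients
    optimizer.step()                  # optimizer updates the weights

print("final loss:", loss.item())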


NLTK 

Natural Language Toolkit is used for classical NLP tasks. It is widely used in research and for prototyping NLP pipelines. NLTK is the go-to library for understanding NLP basics like tokenization, stemming, classification, etc. It provides tools for text processing, datasets and resources to learn NLP. Basically, we use it when we want to give words or sentences as data to the machine.

  • For installing we use the command : "pip install nltk"
  • For importing in our code we use : "import nltk"

What can it do ?

  • Text preprocessing
  • Linguistic analysis
  • Text classification
  • Work with Corpora and Lexicons
  • Language modeling and generation

Core concepts:

  • Tokenization - Splitting text into smaller parts like words or sentences (tokens)
  • Stop words removal - Removing words which are common in general English like "is", "the", etc.
  • Stemming and Lemmatization - Stemming chops words down to their root form, so "drinking" becomes "drink". Lemmatization reduces words to their dictionary form using context, so "better" can become "good"
  • POS tagging - Parts Of Speech tagging; the name itself tells us what it does, it labels each word as a noun, verb or adjective.
  • Corpora & Lexicons - NLTK gives us a huge amount of datasets. A lexicon is like a dictionary of words and a corpus is like a library of books (corpora is the plural).
  • N-Grams - It breaks the text into chunks of N words.
 We cannot just give words to a machine; it would be like a brainrot kid talking to an old man who only speaks in Morse code. This is where we use NLP. NLP converts human language to numbers and understands it. It isn't magic how words are turned into numbers and processed; the processing looks like this:
  1. Input : Our text is given as input
  2. Text processing (NLTK used here) : In this step the text is tokenized -> turned to lowercase -> stop words removed -> stemmed/lemmatized
  3. Convert words to numbers (Vectorization) : We convert words to numbers using techniques like Bag of Words, TF-IDF or word embeddings like BERT. Scikit-learn is used here.
  4. Feed the data to a model : After vectorization we feed the data to a machine learning or DL model and it processes it.
After these steps the old man (machine) understands what "ts pmo icl" means.
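
Here's a rough sketch of steps 2 and 3, using NLTK for the cleaning and Scikit-learn's TfidfVectorizer for turning words into numbers (the two sentences are made up, and depending on your NLTK version you may need an extra download or two):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads for the tokenizer models and the stop word list
nltk.download("punkt")
nltk.download("stopwords")

texts = ["I am drinking coffee and loving it",
         "The old man is learning morse code"]

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

cleaned = []
for text in texts:
    tokens = nltk.word_tokenize(text.lower())                             # tokenize + lowercase
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]   # stop word removal
    tokens = [stemmer.stem(t) for t in tokens]                            # stemming
    cleaned.append(" ".join(tokens))

# Vectorization: turn the cleaned text into TF-IDF numbers
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())
print(X.toarray())    # this matrix is what gets fed to an ML or DL model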

OpenCV

This one deals with unstructured visual data, like images and videos. It is very useful for building models for object detection, face detection and more. It is written in C/C++ but it has Python bindings. The Snapchat filters we spend hours playing with use this kind of computer vision: the app detects your face and adds the filter. OpenCV integrates easily with NumPy, PyTorch and Scikit-learn.

  • For installing we use : "pip install opencv-python"
  • For importing in code we use : "import cv2"

What can it do?

  • Image processing
  • Video processing
  • Computer vision (Face detection, Object detection, motion detection etc..)
  • AI/ML with CV 
  • Drawing and GUI

Core concepts 

  • Image representation - We represent images as NumPy arrays. We all know an image is a bunch of pixels, so an image can be thought of as a matrix of pixel values, and we can manipulate image data just like any other matrix.
  • Color spaces - We can convert between multiple color spaces; we change the color space based on the task, for example converting BGR to grayscale when we want to detect edges.
  • Image operations - Manipulating the pixels of images: operations like resize, flip, rotate, crop, etc. These are used for noise reduction and for fitting image sizes to neural networks.
  • Edge detection - Detect where sharp changes in intensity occur, i.e., the boundaries of objects
  • Contours and shape detection - Contours are the outlines of objects; OpenCV can find and draw them. Used for object tracking, face detection, etc.
  • Video and webcam processing - When we want to work on real-time input we use this. It is important for building real-time apps like surveillance, gesture control through the camera, Snapchat filters, etc.
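
Here's a small sketch of these ideas; "photo.jpg" is just a placeholder, so point it at any image you actually have on disk:

import cv2

# Read an image from disk; it comes back as a NumPy array in BGR order
image = cv2.imread("photo.jpg")
print(image.shape)                                  # (height, width, 3)

# Color space conversion: BGR -> grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Basic image operations: resize and blur (noise reduction)
small = cv2.resize(gray, (256, 256))
blurred = cv2.GaussianBlur(small, (5, 5), 0)

# Edge detection with the Canny algorithm
edges = cv2.Canny(blurred, 50, 150)

# Contours: outlines of the shapes found in the edge map
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
print("number of contours found:", len(contours))

cv2.imwrite("edges.png", edges)                     # save the result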


With this I'd like to wrap up the blog for today. I hope you learnt something. If you want to remember what each library does, I have the nicest trick for you:
  • Numpy :
          "Pani" - Just like how Pani simplifies complex concepts effortlessly, NumPy makes heavy computations on arrays and matrices simple and efficient.

  • Pandas :
           "Venkat" -  Just like how he organizes his thoughts and understands his passions , Pandas helps us to organize, analyze and understand data.

  • Matplotlib :
          "Nikhil / Ishaan Awasthi (Taare Zameen Par)" - Works well with NumPy and Pandas, and just like how Ishaan turns his thoughts into beautiful art, Matplotlib turns numbers into visual representations

  • Scikit-learn : 
                 "Silencer" - He memorizes everything and Scikit learn has prebuilt models just like memorizing. Doesn't understand deeply but gives results i.e., Sklearn doesn't do deep learning

  • PyTorch : 
         "Pani... again" - Learns by understanding concepts in deep not memorization i.e., deep learning models are built step by step , ideal for learning intuitively.

  • NLTK :
             "Chitti from Robo" - Just like how Chitti understands human language, interprets meaning and responds to it, NLTK enables machines to process and understand human language i.e., turning raw text to something a computer can make sense of.

  • OpenCV :
               "Ghajini" - Just like how Sanjay relies on visual data like images, tattoos and Polaroids to make sense of the world and act accordingly, OpenCV processes and understands visual input like images, video frames, etc. to make intelligent decisions in real time
