Exploratory Data Analysis (EDA) is the process of visualizing and analyzing data to extract insights from it. What we see at first glance does not always tell the whole story; it takes time to understand and analyze the data before the real picture emerges. In other words, EDA is the process of summarizing the important characteristics of a dataset in order to understand it better.
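As a rough illustration of "summarizing important characteristics", here is a minimal sketch that uses only Python's standard library. The `summarize` helper and the `ages` sample are made up for this example; a real EDA would more likely use a dataframe library and plots.

```python
import statistics

def summarize(values):
    """Return a few basic summary statistics for a list of numbers."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),
    }

# Hypothetical sample: ages with one suspicious entry (120)
ages = [23, 25, 31, 35, 29, 27, 120]
summary = summarize(ages)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean sits well above the median, which hints at an outlier that is easy to miss when just glancing at the raw numbers.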

The whole objective of EDA is to understand the data well, and that can turn out to be harder than expected once we start exploring. EDA is performed to make sure that the data is…


You may never have heard of the Apache Parquet file format. Like CSV, Parquet is a file format for storing data.

Parquet is a free and open-source file format that is available to any project in the Hadoop ecosystem. It is a columnar storage format, designed to be more efficient and performant than row-based formats such as CSV or TSV. It provides efficient data compression and encoding schemes, with enhanced performance for handling complex data in bulk. This approach works especially well for queries that need to read certain columns from a…
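To make the row-based versus columnar idea concrete, here is a conceptual sketch in plain Python. It does not read or write real Parquet files and says nothing about Parquet's actual on-disk layout; the records and both helper functions are invented purely for illustration.

```python
# Row-based layout (the CSV way): each record is stored together.
rows = [
    {"id": 1, "city": "Pune", "price": 10.0},
    {"id": 2, "city": "Oslo", "price": 12.5},
    {"id": 3, "city": "Lima", "price": 9.5},
]

# Column-based layout (the idea behind Parquet): each column is stored together.
columns = {
    "id": [1, 2, 3],
    "city": ["Pune", "Oslo", "Lima"],
    "price": [10.0, 12.5, 9.5],
}

def total_price_from_rows(rows):
    # Every full record is visited, even though only "price" is needed.
    return sum(record["price"] for record in rows)

def total_price_from_columns(columns):
    # Only the "price" column is touched; "id" and "city" are never read.
    return sum(columns["price"])

print(total_price_from_rows(rows))
print(total_price_from_columns(columns))
```

Both functions return the same total, but the columnar version only has to look at one column, which is the property that makes column-oriented formats attractive for queries over a few columns.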


For any machine learning model, it is considered best practice to test the model on an independent data set. A model is usually fitted on a known data set, which is also known as the training set. In a real-life scenario, the model's efficiency and accuracy will be tested on different, unseen data. In those circumstances we would want the model to perform about as well as it did on the training data. This kind of testing is called cross-validation. …
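One common way to do this is k-fold cross-validation, where every sample is used for testing exactly once. The teaser does not say which variant the post uses, so the following is only a minimal sketch of the index bookkeeping; `k_fold_indices` is a hypothetical helper, not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice you would fit the model on each `train` split, score it on the matching `test` split, and average the scores.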


Boosting is an ensemble meta-algorithm that is primarily used for reducing bias and variance in supervised learning. It combines many weak learners into a single strong learner in order to increase the accuracy of the model. Boosting is a sequential process.
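To see how weak learners can add up to a strong one, here is a small sketch of AdaBoost, which is one specific boosting algorithm (the teaser does not say which algorithms the post covers). The weak learners are one-threshold "stumps", and the ten-point toy data set and all function names are assumptions made for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    # A stump predicts `polarity` on one side of the threshold, the opposite on the other.
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with equal sample weights
    ensemble = []
    for _ in range(rounds):          # sequential: each round depends on the last
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_stump = adaboost(xs, ys, rounds=1)
three_stumps = adaboost(xs, ys, rounds=3)
print(sum(predict(one_stump, x) != y for x, y in zip(xs, ys)), "errors with 1 stump")
print(sum(predict(three_stumps, x) != y for x, y in zip(xs, ys)), "errors with 3 stumps")
```

Each stump on its own gets several points wrong, but the weighted vote of three stumps classifies this toy set correctly.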

In this post I will explain what boosting is, the types of boosting, and boosting algorithms.

Why is boosting used?

Let’s understand it with an example … Let’s say we are given a data set with cat and dog images and we are asked to build a machine learning model that can classify these images…


Linear regression is one of the best-known and best-understood algorithms in statistics and machine learning. It is a supervised learning algorithm. In its simplest form, it is a statistical model that describes the relationship between two variables with a linear equation.
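For the two-variable case, the line can be fitted with the ordinary least squares formulas. The sketch below is a plain-Python illustration; `fit_line` and the experience/salary numbers are hypothetical and were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept   # estimate for 6 years of experience
print(slope, intercept, predicted)
```

The slope tells you how much the second variable changes per unit of the first, and the intercept is the fitted value when the first variable is zero.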

When Do You Need Regression?

You need regression to answer whether and how some factors influence others, or how variables are related. For example, you can use it to determine to what extent experience or gender impacts salaries.

Regression is also useful when you want to forecast an outcome. For example, you could try to predict electricity…


For any model to perform well, its error needs to be reduced. Striking the right balance between bias and variance is important for building machine-learning models that produce accurate results. Both concepts apply to supervised machine learning, in which an algorithm learns from training data or a sample data set of known quantities. Bias and variance are the components of reducible error.
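The split into reducible and irreducible error is usually written as the standard bias-variance decomposition of the expected squared error. The notation is an assumption introduced here, not taken from the post: the true relationship is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model fitted on a random training set.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first two terms depend on the model and can be traded off against each other; the last term is noise that no model can remove.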

Bias

Bias is basically how far, on average, our predicted values are from the actual values. We can say that the bias is too high if our prediction is far off from the actual value…


Photo by Nick Kwan from Pexels

Random Forest

Random Forest is a model that evolves from decision trees. As the name suggests, the algorithm builds a forest out of a number of trees. It is a supervised learning algorithm that can be used for both classification and regression problems.

To understand Random Forest better, we must first know what a Decision Tree is and how it works.

Decision Tree

I am sure all of us have used the decision tree technique in our day-to-day lives, knowingly or unknowingly. We just don’t give a fancy name to that decision-making process. …


Decision trees and support vector machines are popular tools used in machine learning to make predictions. Both algorithms can be used for classification and regression problems. Without further delay, let’s have a short briefing on them…

Decision Tree Making

Decision Trees are a type of supervised machine learning model in which the data is repeatedly split according to a certain parameter. A tree can be explained by two entities: decision nodes and leaves. The decision nodes are where the data is split, and the leaves are the decisions or final outcomes.
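A tiny hand-built tree shows the two entities in code. This is only a sketch of how a finished tree is used for prediction, not of how a tree is learned from data; the features `rain_mm` and `temp_c`, the thresholds, and the outcomes are all invented for the example.

```python
# A decision node is a dict (feature, threshold, two branches); a leaf is a plain string.
tree = {
    "feature": "rain_mm",
    "threshold": 2.0,
    "left": {                      # little or no rain: look at the temperature
        "feature": "temp_c",
        "threshold": 15.0,
        "left": "stay in",         # leaf
        "right": "go out",         # leaf
    },
    "right": "stay in",            # heavy rain: leaf
}

def classify(node, sample):
    """Walk from the root to a leaf and return the leaf's outcome."""
    while isinstance(node, dict):  # still at a decision node
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node                    # reached a leaf

print(classify(tree, {"rain_mm": 0.0, "temp_c": 22.0}))
print(classify(tree, {"rain_mm": 5.0, "temp_c": 22.0}))
```

Every prediction is just a path of yes/no questions from the root down to one leaf.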

Swetha Dhanasekar

Data Scientist & Machine Learning Engineer
