
Data Preprocessing in Machine Learning

- Data Preparation in Data Science

Data preparation is the process of cleaning and transforming raw data before processing and analyzing it. It's also known as pre-processing. 

Data preparation techniques are typically used at the earliest stages of the machine learning (ML) and AI development pipeline to ensure accurate results. 

Data preparation is a key step in data analytics projects. It can involve many tasks, such as:

  • Collecting data
  • Cleaning data
  • Labeling data
  • Reformatting data
  • Making corrections to data
  • Combining datasets
  • Standardizing data formats
  • Enriching source data
  • Eliminating outliers


Other components of data preparation include: Data preprocessing, Profiling, Cleansing, Validation, Visualization.

Data preparation can help business analysts and data scientists trust, understand, and ask better questions of their data. This can make their analyses and modeling more accurate and meaningful.


- Data Preparation Techniques

Data preparation is a crucial step in the machine learning (ML) pipeline. It involves collecting, cleaning, and organizing data before using it to train a model. The quality of the data used to train a model significantly impacts the accuracy of its predictions. 

Here are some data preparation techniques:  

  • Data cleansing: An essential process for preparing raw data for ML. Raw data may contain numerous errors, which can affect the accuracy of ML models.
  • Feature engineering: Involves selecting, extracting, transforming, and creating new features from the available data to improve the performance of ML algorithms.
  • Hyperparameter tuning: An essential part of the ML process that involves optimizing the model's performance by fine-tuning its hyperparameters.
  • Transform data files: Transform all the data files into a common format.
  • Explore the dataset: Use a data preparation tool like Tableau, Python Pandas, etc. to explore the dataset.
  • Pick feature variables: Use feature selection methods to pick feature variables from the dataset.
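As a minimal sketch of the "explore the dataset" step using Pandas (the column names and values below are hypothetical, purely for illustration):

```python
import pandas as pd

# A small illustrative dataset (hypothetical values, for demonstration only)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "salary": [48000, 54000, 61000, 58000, 52000],
    "country": ["US", "UK", "US", "DE", "UK"],
})

# Basic exploration: shape, column types, and summary statistics
print(df.shape)       # number of rows and columns
print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
```

A graphical tool such as Tableau serves the same purpose interactively; in code, these three calls are usually the first look at any new dataset.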


 - The Stages of Data Science Process

Companies can use data from nearly endless sources - internal information, customer service interactions, and the entire Internet - to help them make choices and improve their businesses. 

But you can't simply take raw data and run it through ML and analytics programs right away. You first need to preprocess the data so that a machine can successfully "read" or understand it.  

Data preprocessing is a step in the data mining and data analysis process that takes raw data and converts it into a format that computers and machine learning models can understand and analyze. Raw real-world data in the form of text, images, videos, etc. is messy. Not only can it contain errors and inconsistencies, but it is often incomplete and lacks a regular, uniform design.

Machines love to process nice and neat information - they read data as 1s and 0s. Therefore, calculating structured data such as integers and percentages is easy. However, unstructured data in the form of text and images must first be cleaned and formatted before analysis.

The simple linear form of the data science process consists of the following five distinct activities (stages) that depend on each other: 

  • Stage 1: Acquire - To Obtain Data 
  • Stage 2: Prepare - To Scrub Data
  • Stage 3: Analyze - To Explore Data
  • Stage 4: Report - To Model Data
  • Stage 5: Act - To Interpret Models and Data


- Data Preparation - Turns Insights into Action

Big data and data science are only useful if the insights can be turned into action, and if the actions are carefully defined and evaluated. Interpreting data refers to presenting your findings to a non-technical audience.

Data preparation is the process of preparing raw data to make it suitable for further processing and analysis. Key steps include collecting, cleaning, and labelling raw data into a form suitable for machine learning (ML) algorithms, and then exploring and visualizing the data. Data preparation can take up to 80% of the time spent on an ML project. Using specialized data preparation tools is important to optimize this process.

Data preparation is a very important part of the data science process. In fact, this is where you will spend most of your time on any data science effort. It can be a tedious process, but it is a crucial step. Always remember: garbage in, garbage out. If you don't spend the time and effort to create good data for the analysis, you will not get good results, no matter how sophisticated the analysis technique you use.


- The Main Goals in Data Preprocessing

The raw data that you get directly from your sources are never in the format that you need to perform analysis on. There are two main goals in the data pre-processing step. 

The first is to clean the data to address data quality issues, and the second is to transform the raw data to make it suitable for analysis. A very important part of data preparation is to address quality issues in your data. Real-world data is messy. In order to address data quality issues effectively, knowledge about the application, such as how the data was collected, the user population, and the intended uses of the application, is important. This domain knowledge is essential to making informed decisions on how to handle incomplete or incorrect data. 

The second part of preparing data is to manipulate the clean data into the format needed for analysis. This step is known by many names: data manipulation, data preprocessing, data wrangling, and even data munging. Typical operations at this stage include scaling, transformation, feature selection, and dimensionality reduction. 


 - Data Preparation for ML

Data fuels machine learning. Leveraging this data to reshape your business, while challenging, is critical to staying relevant now and into the future. This is survival of the most informed: those who can use their data to make better, more informed decisions react faster to unexpected events and uncover new opportunities. Data preparation, though important and tedious, is a prerequisite for building accurate ML models and analyses, and it is the most time-consuming part of an ML project. To minimize the time investment, data scientists have access to tools that help automate data preparation in various ways.


- Steps in Data Preprocessing in ML

Data preprocessing in ML is a critical step to help improve data quality to facilitate the extraction of meaningful insights from data. 

Data preprocessing in ML refers to the techniques of preparing (cleaning and organizing) raw data to make it suitable for building and training ML models. In short, data preprocessing in ML is a data mining technique that transforms raw data into an understandable and readable format.

When creating an ML model, data preprocessing is the first step, marking the start of the process. Real-world data is often incomplete, inconsistent, inaccurate (containing errors or outliers), and lacking in specific attribute values or trends. 

This is where data preprocessing comes into the picture - it helps to clean, format and organize raw data so that it is ready for machine learning models.

Here are the seven important steps of data preprocessing in machine learning:


- Step 1: Acquire the Dataset

Acquiring a dataset is the first step in data preprocessing in ML. To build and develop ML models, you must first acquire relevant datasets. 

This dataset will consist of data collected from a number of different sources, then combined in an appropriate format to form a dataset. 

Dataset formats vary by use case. For example, a commercial dataset will be completely different from a medical dataset. Business datasets will contain relevant industry and business data, while medical datasets will contain healthcare-related data.

- Step 2: Import All the Key Libraries

Python is the most widely used language among data scientists around the world, and its predefined libraries can perform specific data preprocessing jobs. 

Importing all key libraries is an important step in data preprocessing for ML.

The three core Python libraries used for this data preprocessing in ML are:

  • NumPy - NumPy is the foundational package for scientific computing in Python. It is used to perform any kind of mathematical operation in the code, and it supports large multidimensional arrays and matrices.
  • Pandas - Pandas is an excellent open source Python library for data manipulation and analysis. It is widely used to import and manage datasets. It includes high-performance, easy-to-use data structures and data analysis tools for Python.
  • Matplotlib - Matplotlib is a Python 2D plotting library for drawing any kind of charts in Python. It provides publication-quality diagrams across platforms (IPython shell, Jupyter notebooks, web application servers, etc.) in a variety of hardcopy formats and interactive environments.
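A minimal sketch of importing the three libraries under their conventional aliases (the tiny array and DataFrame below are only a sanity check that the stack works together):

```python
# Conventional aliases for the three core data-preprocessing libraries
import numpy as np               # numerical arrays and mathematical operations
import pandas as pd              # tabular data import and manipulation
import matplotlib.pyplot as plt  # 2D plotting

# Build a 2x2 NumPy array and wrap it in a Pandas DataFrame
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(arr, columns=["x", "y"])
print(df["x"].sum())  # sum of the first column
```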

- Step 3: Importing the Dataset

In this step, you need to import the dataset collected for the ML project at hand. Importing datasets is one of the important steps in data preprocessing in ML.
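In code, importing a dataset is typically a single pd.read_csv call followed by separating the feature columns from the target column. A runnable sketch (an inline CSV string stands in for a real file here, and the column names are hypothetical):

```python
import io
import pandas as pd

# In practice you would call pd.read_csv("your_dataset.csv"); an inline CSV
# string (hypothetical data) stands in for the file so this sketch is runnable.
csv_data = io.StringIO(
    "age,salary,purchased\n"
    "25,48000,no\n"
    "32,54000,yes\n"
    "47,61000,yes\n"
)
dataset = pd.read_csv(csv_data)

# Separate the independent variables (features) from the dependent variable (target)
X = dataset.iloc[:, :-1].values  # all columns except the last
y = dataset.iloc[:, -1].values   # the last column
print(X.shape, y.shape)
```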


- Step 4: Identifying and Handling Missing Values

In data preprocessing, it is crucial to identify and properly handle missing values, otherwise, you may draw inaccurate and wrong conclusions and inferences from the data. Needless to say, this can get in the way of your ML projects. 

Basically, there are two ways of dealing with missing data:

  • Deleting Particular Rows – In this method, you delete specific rows with null values, or entire columns in which more than 75% of the values are missing. However, this method is not 100% effective, and it is recommended only when the dataset has enough samples. You also have to make sure that removing the data does not introduce bias.
  • Calculating the Mean - This method is useful for features with numeric data such as age, salary, or year. Here you calculate the mean, median, or mode of the feature that contains missing values and replace the missing values with the result. This approach preserves the size of the dataset and counteracts the data loss, so it usually yields better results than the first approach (omitting rows or columns). Missing values can also be approximated from neighboring values, which works best for linear data.
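Both approaches can be sketched with Pandas (the values below are hypothetical; scikit-learn's SimpleImputer offers the same mean-replacement behavior):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing salary values
df = pd.DataFrame({
    "age":    [25, 32, 47, 51],
    "salary": [48000.0, np.nan, 61000.0, np.nan],
})

# Option 1: delete every row that contains a missing value
dropped = df.dropna()

# Option 2: replace missing values with the column mean
filled = df.copy()
filled["salary"] = filled["salary"].fillna(filled["salary"].mean())

print(len(dropped), filled["salary"].tolist())
```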

- Step 5: Encoding the Categorical Data

Categorical data refers to information in a dataset that falls into specific categories, such as country or gender. Machine learning models are primarily based on mathematical equations. 

So you can intuitively understand that keeping categorical text in the equations causes problems, because the equations only work with numbers. Categorical variables therefore have to be encoded as numbers, for example with label encoding or one-hot encoding.
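A short sketch of one-hot encoding a categorical column with Pandas' get_dummies (the data is hypothetical):

```python
import pandas as pd

# Hypothetical dataset with a categorical "country" column
df = pd.DataFrame({
    "country": ["US", "UK", "US", "DE"],
    "salary":  [48000, 54000, 61000, 58000],
})

# One-hot encoding: each category becomes its own 0/1 indicator column,
# so the model sees only numbers and no artificial ordering is implied
encoded = pd.get_dummies(df, columns=["country"])
print(sorted(encoded.columns))
```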


- Step 6: Splitting the Dataset

Splitting datasets is the next step in data preprocessing in machine learning. Every dataset for a machine learning model must be split into two separate sets - a training set and a test set.

The training set represents the subset of the dataset used to train the machine learning model. Here, you already know the output. On the other hand, a test set is a subset of a dataset used to test a machine learning model. ML models use a test set to predict outcomes.

Typically, datasets are split in a 70:30 or 80:20 ratio. This means you use 70% or 80% of the data to train the model and hold out the remaining 30% or 20% for testing. The splitting process varies depending on the shape and size of the dataset in question.
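A sketch of an 80:20 split using scikit-learn's train_test_split (the ten toy samples are hypothetical):

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X (ten samples, one feature) and target vector y
X = [[i] for i in range(10)]
y = [0, 1] * 5

# An 80:20 split; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```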


- Step 7: Feature Scaling

Feature scaling marks the end of data preprocessing in machine learning. It is a method of standardizing the independent variables of a data set within a certain range. 

In other words, feature scaling limits the range of variables so that you can compare them on common ground.   
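Feature scaling can be sketched with scikit-learn's two common scalers (the age/salary values are hypothetical; note how the raw columns live on very different scales):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on different scales: age (tens) vs. salary (tens of thousands)
X = [[25.0, 48000.0], [32.0, 54000.0], [47.0, 61000.0], [51.0, 58000.0]]

# Min-max scaling squeezes each feature into the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization rescales each feature to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

print(X_minmax.min(), X_minmax.max())
```

After scaling, both columns can be compared on common ground, which matters for distance-based models and gradient-based optimizers.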


- Feature Engineering

Feature engineering is a machine learning (ML) technique that transforms raw data into features that better represent the underlying problem to ML models. It can improve the performance of an ML model by: 

  • Simplifying and speeding up data transformations
  • Enhancing model accuracy
  • Making certain algorithms converge faster
  • Leading to better model performance

Feature engineering includes four main steps: Feature creation, Transformation, Feature extraction, Feature selection. 

Some examples of feature engineering techniques include: 

  • Feature creation: Generating new features based on domain knowledge or by observing patterns in the data.
  • Imputation: Managing missing values, which is one of the most common problems when it comes to preparing data for ML.
  • Normalization: Bringing all the values on to the same scale so that the performance of the model will be improved.

Some best practices for performing feature engineering include:

  • Handling missing data in your input features
  • Using one-hot encoding for categorical data
  • Considering feature scaling
  • Creating interaction features where relevant
  • Removing irrelevant features
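A brief sketch of feature creation and an interaction feature using Pandas (the housing-style columns are hypothetical):

```python
import pandas as pd

# Hypothetical housing data with two raw columns
df = pd.DataFrame({"rooms": [3, 4, 2], "area_m2": [70.0, 120.0, 45.0]})

# Feature creation: average room size, derived from domain knowledge
df["m2_per_room"] = df["area_m2"] / df["rooms"]

# Interaction feature: the product of two existing features
df["rooms_x_area"] = df["rooms"] * df["area_m2"]

print(df["m2_per_room"].tolist())
```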



[More to come ...]


