
Data Preprocessing

[The Data Science Landscape - Towards Data Science]

Finding Solutions from Data -

Finding Climate Change Solutions Through Data


 - The Stages of Data Science Process

Companies can use data from nearly endless sources (internal information, customer service interactions, and the entire Internet) to help them make choices and improve their businesses. But you can't simply take raw data and run it through machine learning and analytics programs right away. You first need to preprocess the data so that a machine can successfully "read" or understand it.

Data preprocessing is a step in the data mining and data analysis process that takes raw data and converts it into a format that computers and machine learning algorithms can understand and analyze. Raw real-world data in the form of text, images, videos, etc. is messy. Not only can it contain errors and inconsistencies, but it is often incomplete and has no regular, uniform structure.

Machines work best with neat, well-organized information; at the lowest level they read data as 1s and 0s. Computing on structured data such as integers and percentages is therefore easy. However, unstructured data in the form of text and images must first be cleaned and formatted before analysis.
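As a rough illustration of what "cleaned and formatted" can mean for text, the sketch below turns a free-text sentence into a vector of word counts (a simple bag-of-words representation). The sample sentence and the helper names `clean_text` and `bag_of_words` are made up for this example; real projects usually rely on a dedicated text-processing library rather than hand-written code like this.

```python
import re
from collections import Counter


def clean_text(raw: str) -> list[str]:
    """Lowercase the text, strip punctuation, and split it into word tokens."""
    lowered = raw.lower()
    # Replace everything that is not a letter, digit, or whitespace with a space.
    letters_only = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return letters_only.split()


def bag_of_words(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical piece of unstructured input.
review = "Great product!!! Fast delivery, GREAT support."

tokens = clean_text(review)
vocabulary = sorted(set(tokens))
vector = bag_of_words(tokens, vocabulary)

print(vocabulary)  # ['delivery', 'fast', 'great', 'product', 'support']
print(vector)      # [1, 1, 2, 1, 1]
```

The messy sentence ends up as a fixed-length list of integers, which is the kind of structured input an analysis or ML program can work with directly.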

The simple linear form of the data science process consists of the following five distinct activities (stages), each of which depends on the one before it:

  • Stage 1: Acquire - To Obtain Data 
  • Stage 2: Prepare - To Scrub Data
  • Stage 3: Analyze - To Explore Data
  • Stage 4: Report - To Model Data
  • Stage 5: Act - To Interpret Models and Data


- Data Preparation - Turns Insights into Action

Big data and data science are only useful if the insights can be turned into action, and if the actions are carefully defined and evaluated. Interpreting data refers to presenting your findings to a non-technical audience.

Data preparation is the process of preparing raw data to make it suitable for further processing and analysis. Key steps include collecting, cleaning, and labelling raw data into a form suitable for machine learning (ML) algorithms, and then exploring and visualizing the data. Data preparation can take up to 80% of the time spent on an ML project. Using specialized data preparation tools is important to optimize this process.

Data preparation is a very important part of the data science process. In fact, this is where you will spend most of your time on any data science effort. It can be a tedious process, but it is a crucial step. Always remember: garbage in, garbage out. If you don't spend the time and effort to create good data for the analysis, you will not get good results, no matter how sophisticated your analysis technique is.


- The Main Goals in Data Preprocessing

The raw data that you get directly from your sources is rarely in the format that you need for analysis. There are two main goals in the data preprocessing step.

The first is to clean the data to address data quality issues, and the second is to transform the raw data to make it suitable for analysis. A very important part of data preparation is to address quality issues in your data. Real-world data is messy. To address data quality issues effectively, you need knowledge about the application, such as how the data was collected, the user population, and the intended uses of the application. This domain knowledge is essential to making informed decisions on how to handle incomplete or incorrect data.
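A minimal sketch of the cleaning goal, using only the Python standard library, might look like the following. The records, the field names, the 0 to 120 age range, and the choice to fill missing ages with the median are all assumptions made up for this example; in practice these decisions come from the domain knowledge described above.

```python
from statistics import median

# Hypothetical raw records with typical quality problems.
records = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 2, "age": None, "country": "us "},   # missing age, messy country code
    {"id": 2, "age": None, "country": "us "},   # exact duplicate of the row above
    {"id": 3, "age": -5, "country": "DE"},      # impossible age
    {"id": 4, "age": 51, "country": "de"},
]


def clean_records(rows):
    """Return a cleaned copy of rows; the input list is left unchanged."""
    # 1. Drop exact duplicates while preserving the original order.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    for row in unique:
        # 2. Normalize inconsistent formatting.
        row["country"] = row["country"].strip().upper()
        # 3. Treat out-of-range values as missing (assumed valid range: 0 to 120).
        if row["age"] is not None and not (0 <= row["age"] <= 120):
            row["age"] = None

    # 4. Fill missing ages with the median of the known ages (one possible policy).
    known_ages = [row["age"] for row in unique if row["age"] is not None]
    fill_value = median(known_ages)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill_value

    return unique


cleaned = clean_records(records)
for row in cleaned:
    print(row)
```

Whether an impossible value should be dropped, corrected, or imputed depends on how the data was collected and what it will be used for, which is why the rules above are written as explicit, easy-to-change steps.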

The second part of preparing data is to manipulate the clean data into the format needed for analysis. This step is known by many names: data manipulation, data preprocessing, data wrangling, and even data munging. Common operations in this step include scaling, transformation, feature selection, and dimensionality reduction.
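To make one of these operations concrete, here is a small sketch of scaling done two common ways: min-max scaling to the range 0 to 1, and standardization to zero mean and unit variance. The income figures and the function names are invented for illustration; ML libraries provide equivalent, more robust utilities.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and rescale values to zero mean and unit (population) std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature on a much larger scale than, say, an age column.
incomes = [20_000, 40_000, 60_000, 100_000]

scaled = min_max_scale(incomes)
z_scores = standardize(incomes)

print(scaled)  # [0.0, 0.25, 0.5, 1.0]
print([round(z, 3) for z in z_scores])
```

Scaling matters because many algorithms are sensitive to the magnitude of their inputs, so a feature measured in tens of thousands could otherwise dominate one measured in tens.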


 - Data Preparation for ML

Data fuels machine learning. Leveraging this data to reshape your business, while challenging, is critical to staying relevant now and into the future. It is survival of the most informed: those who can use their data to make better, more informed decisions can react faster to unexpected events and uncover new opportunities. Data preparation, important but tedious, is a prerequisite for building accurate ML models and analyses, and it is the most time-consuming part of an ML project. To minimize the time investment, data scientists can use tools that help automate data preparation in various ways.



[More to come ...]


