

In its simple linear form, the data science process consists of the following five distinct activities (stages), each of which depends on the ones before it: Acquire, Prepare, Analyze, Report, and Act. 

Stage 1: Acquire - To Obtain Data

- Overview

The first stage in the data science process is to acquire the data. You need to obtain the source material before analyzing or acting on it. The Acquire activity includes everything involved in retrieving data: finding, accessing, acquiring, and moving it.  

- To Determine What Data Is Available

The first step in acquiring data is to determine what data is available. You want to identify suitable data related to your problem and make use of all the data that is relevant to your analysis. Leaving out even a small amount of important data can lead to incorrect conclusions. Data comes from many places, local and remote, in many varieties, structured and unstructured, and with different velocities. 

- Technologies To Access Data

There are many techniques and technologies to access these different types of data:

  • A lot of data exists in conventional relational databases, such as structured data from organizations. The tool of choice for accessing data in databases is Structured Query Language, or SQL. Additionally, most database systems come with a graphical application environment that allows you to query and explore the data sets in the database.  
  • Data can also exist in files such as text files and Excel spreadsheets. Scripting languages are generally used to get data from files. Common scripting languages with support for processing files are JavaScript, Python, PHP, Perl, R, MATLAB, and many others.  
  • An increasingly popular way to get data is from websites. Web pages are written using a set of standards maintained by the World Wide Web Consortium, or W3C. This includes a variety of formats and services. One common format is the Extensible Markup Language, or XML, which uses markup symbols, or tags, to describe the contents of a webpage.  
  • Many websites also host web services, which provide programmatic access to their data. There are several types of web services. The most popular is REST, because it is easy to use. REST stands for Representational State Transfer, and it is an approach to implementing web services with performance, scalability, and maintainability in mind.  
  • WebSocket services are also becoming more popular, since they allow real-time notifications from websites.  
  • NoSQL storage systems are increasingly used to manage a variety of data types in big data. These data stores are databases that do not represent data in a table format with columns and rows, as conventional relational databases do. Examples of these data stores include Cassandra, MongoDB, and HBase. NoSQL data stores provide APIs that allow users to access the data, either directly or from an application. Additionally, most NoSQL systems provide data access via a web service interface, such as REST.  
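To make the database route above concrete, here is a minimal Python sketch using the standard-library sqlite3 module; the sales table and its rows are invented purely for illustration.

```python
import sqlite3

# Build a throwaway in-memory database; the table and rows are
# invented purely for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
)

# SQL is the tool of choice for pulling structured data out of
# a relational database.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 165.5), ('south', 80.0)]
conn.close()
```

The same query pattern applies to any relational database; only the connection setup changes per database system.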


In summary, big data comes from many places, and it is important to find and evaluate the data useful to your analytics before you start acquiring it. Depending on the source and structure of the data, there are alternative ways to access it. If you are looking to work on much bigger data sets, you will also need to learn how to access them using distributed storage and processing systems such as Apache Hadoop, Spark, or Flink.

Stage 2: Prepare - To Scrub Data

- Overview

The second stage is to prepare the data. We divide this data preparation activity into two steps based on the nature of the activity: exploring the data and pre-processing the data.

The first step in data preparation involves literally looking at the data to understand its nature, what it means, its quality, and its format. It often takes a preliminary analysis of the data, or of samples of it, to understand it. This step is called explore. Once we know more about the data through exploratory analysis, the next step is pre-processing the data for analysis. Pre-processing includes cleaning data, sub-setting or filtering data, and restructuring data so that programs can read and understand it, such as mapping raw data into a more defined data model or packaging it in a specific data format. If there are multiple data sets involved, this step also includes the integration of multiple data sources or streams.  

- Step 2-A: Exploring Data 
The first step after getting your data is to explore it. Exploring data is a part of the two-step data preparation process. You want to do some preliminary investigation in order to gain a better understanding of the specific characteristics of your data. In this step, you'll be looking for things like correlations, general trends, and outliers. Without this step, you will not be able to use the data effectively. 

Correlation graphs can be used to explore the dependencies between different variables in the data. Graphing the general trends of variables will show you whether there is a consistent direction in which the values of these variables are moving, like sales prices going up or down. In statistics, an outlier is a data point that is distant from the other data points. Plotting outliers will help you double-check for errors in the data due to measurement error. In some cases, outliers that are not errors might lead you to discover a rare event.

Additionally, summary statistics provide numerical values to describe your data. Summary statistics are quantities that capture various characteristics of a set of values with a single number or a small set of numbers. Some basic summary statistics that you should compute for your data set are mean, median, mode, range, and standard deviation. Mean and median are measures of the location of a set of values. Mode is the value that occurs most frequently in your data set. Range and standard deviation are measures of the spread in your data. Looking at these measures will give you an idea of the nature of your data, and they can tell you if there is something wrong with it. 
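The summary statistics above can be computed with Python's standard statistics module; the sample of daily sales figures below is invented for illustration.

```python
import statistics

# Invented sample of daily sales figures.
values = [12.0, 15.0, 15.0, 14.0, 90.0, 13.0]

mean = statistics.mean(values)             # measure of location
median = statistics.median(values)         # robust measure of location
mode = statistics.mode(values)             # most frequent value
value_range = max(values) - min(values)    # measure of spread
stdev = statistics.stdev(values)           # measure of spread

# A mean (26.5) far above the median (14.5) hints at an outlier: 90.0.
print(mean, median, mode, value_range)
```

Note how the outlier pulls the mean well away from the median; this is exactly the kind of signal exploration is meant to surface.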

In summary, what you get by exploring your data is a better understanding of the complexity of the data you have to work with. This, in turn, will guide the rest of your process.

- Step 2-B: Pre-processing Data 

Pre-processing manipulates the explored data into the format needed for analysis. Common operations include:

  • Scaling involves changing the range of values to lie within a specified range, such as from zero to one. This is done to keep features with large values from dominating the results. For example, when analyzing data with height and weight, the magnitude of the weight values is much greater than that of the height values, so scaling all values to be between zero and one will equalize the contributions from both the height and weight features. 
  • Various transformations can be performed on the data to reduce noise and variability. One such transformation is aggregation. Aggregated data generally has less variability, which may help with your analysis. For example, daily sales figures may show many large day-to-day changes; aggregating the values to weekly or monthly sales figures will result in smoother data. Other filtering techniques can also be used to remove variability in the data. Of course, this comes at the cost of less detailed data, so these factors must be weighed for the specific application. 
  • Feature selection can involve removing redundant or irrelevant features, combining features, and creating new features. During the exploring-data step, you might have discovered that two features are correlated. In that case, one of these features can be removed without negatively affecting the analysis results. For example, the purchase price of a product and the amount of sales tax paid are likely to be correlated, so eliminating the sales tax amount will be beneficial. Removing redundant or irrelevant features makes the subsequent analysis much simpler. In other cases, you may want to combine features or create new ones. For example, adding the applicant's education level as a feature to a loan approval application would make sense. There are also algorithms that automatically determine the most relevant features, based on various mathematical properties. 
  • Dimensionality reduction is useful when the data set has a large number of dimensions. It involves finding a smaller subset of dimensions that captures most of the variation in the data. This reduces the dimensionality of the data while eliminating irrelevant features, and makes analysis simpler. A technique commonly used for dimensionality reduction is called Principal Component Analysis, or PCA. 
  • Raw data often has to be manipulated into the correct format for analysis. For example, from samples recording daily changes in stock prices, we may want to capture price changes for particular market segments, like real estate or health care. This would require determining which stocks belong to which market segment, grouping them together, and perhaps computing the mean, range, and standard deviation for each group. 
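As a concrete illustration of the scaling operation described above, here is a minimal min-max scaler in plain Python; the height and weight values are invented for the example.

```python
# Min-max scaling: map each feature's values into [0, 1] so that
# large-magnitude features (weight) do not dominate small ones (height).
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [150.0, 160.0, 170.0, 180.0]   # cm, invented
weights = [50.0, 65.0, 80.0, 95.0]       # kg, invented

print(min_max_scale(heights))  # both features now span 0.0 to 1.0
print(min_max_scale(weights))
```

After scaling, height and weight contribute on an equal footing to any distance-based analysis, which is the point made in the bullet above.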

Stage 3: Analyze - To Explore Data

- Overview

Now that your data is nicely prepared, and right before you jump into AI and machine learning, the next stage is to analyze the data. The prepared data is passed on to the Analysis stage, which involves selecting the analytical techniques to use, building a model of the data, and analyzing the results. This stage can take a few iterations on its own, or it might require you to go back to stages one and two to get more data or to package the data in a different way. 

- Data Analysis and Analysis Techniques

Data analysis involves building a model from your data. The data used by the analysis technique to build the model is called the input data; what the model generates is the output data. There are different types of problems, and so there are different types of analysis techniques. The main categories of analysis techniques are classification, regression, clustering, association analysis, and graph analysis. 

  • Classification. In classification, the goal is to predict the category of the input data. An example is predicting the weather as being sunny, rainy, windy, or cloudy. Another example is classifying a tumor as either benign or malignant; since there are only two categories, this is referred to as binary classification. But you can have many categories as well, as in the weather prediction problem, which has four categories. Another example is identifying handwritten digits as being in one of the ten categories from zero to nine.  
  • Regression. When your model has to predict a numeric value instead of a category, then the task becomes a regression problem. An example of regression is to predict the price of a stock. The stock price is a numeric value, not a category. So this is a regression task instead of a classification task. Other examples of regression are estimating the weekly sales of a new product and predicting the score on a test.  
  • Clustering. In clustering, the goal is to organize similar items into groups. An example is grouping a company's customer base into distinct segments, like seniors, adults, and teenagers, for more effective targeted marketing. Another example is identifying areas of similar topography, like mountains, deserts, and plains, for land use applications. Yet another example is determining different groups of weather patterns, like rainy, cold, or snowy.  
  • Association Analysis. The goal in association analysis is to come up with a set of rules to capture associations within items or events. The rules are used to determine when items or events occur together. A common application of association analysis is known as market basket analysis, which is used to understand customer purchasing behavior. For example, association analysis can reveal that banking customers who have certificate of deposit accounts (or CDs), also tend to be interested in other investment vehicles, such as money market accounts. This information can be used for cross-selling. If you advertise money market accounts to your customers with CDs, they're likely to open such an account.  
  • Graph Analysis. When your data can be transformed into a graph representation with nodes and links, then you want to use graph analytics to analyze it. This kind of data comes about when you have many entities and connections between those entities, as in social networks. Some examples where graph analytics can be useful are exploring the spread of a disease or epidemic by analyzing hospitals' and doctors' records; identifying security threats by monitoring social media, email, and text data; and optimizing mobile telecommunications network traffic to ensure call quality and reduce dropped calls. 
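To make the classification category concrete, here is a minimal sketch of one simple technique, a nearest-centroid classifier, in plain Python. The two-feature tumor measurements are invented, and this method is just one of many possible classification techniques, not the one the text prescribes.

```python
import math

# Nearest-centroid classification: predict the category whose
# training-point average (centroid) is closest to the new sample.
def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def predict(x, centroids):
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

# Invented two-feature training samples per category.
train = {
    "benign":    [[1.0, 1.2], [1.1, 0.9], [0.9, 1.0]],
    "malignant": [[3.0, 3.2], [2.9, 3.1], [3.1, 2.8]],
}
centroids = {label: centroid(pts) for label, pts in train.items()}

print(predict([1.0, 1.1], centroids))  # benign
print(predict([3.0, 3.0], centroids))  # malignant
```

With only two categories this is a binary classifier, matching the benign/malignant example in the bullet list above.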

- Constructing the Model

Modeling starts with selecting the appropriate analysis technique from those listed above, depending on the type of problem you have. Then you construct the model using the data you have prepared. To validate the model, you apply it to new data samples; this evaluates how well the model does on data that was not used to construct it. The common practice is to divide the prepared data into a set for constructing the model and a reserved set for evaluating the model after it has been constructed. You can also use new data prepared the same way as the data that was used to construct the model. 
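The common practice of reserving part of the prepared data for evaluation can be sketched as follows; the 80/20 split ratio is a typical choice for illustration, not a fixed rule.

```python
import random

# Holdout validation: shuffle the prepared samples, then reserve a
# portion for evaluating the model after it has been constructed.
data = list(range(100))   # stand-in for 100 prepared samples
random.seed(42)           # reproducible shuffle, for the example only
random.shuffle(data)

split = int(len(data) * 0.8)
train, test = data[:split], data[split:]
print(len(train), len(test))  # 80 20
```

The shuffle matters: without it, any ordering in the prepared data (by date, by source) would leak into the split and bias the evaluation.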

- Evaluating the Model

Evaluating the model depends on the type of analysis technique you used. Let's briefly look at how to evaluate each technique. For classification and regression, you will have the correct output for each sample in your input data; comparing the correct output with the output predicted by the model provides a way to evaluate the model. For clustering, the groups resulting from clustering should be examined to see if they make sense for your application. For example, do the customer segments reflect your customer base? Are they helpful for use in your targeted marketing campaigns? For association analysis and graph analysis, some investigation will be needed to see if the results are correct. For example, network traffic delays need to be investigated to see whether what your model predicts is actually happening, and whether the sources of the delays are where the model predicts them to be in the real system. 
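For classification, comparing the correct outputs with the model's predictions yields a simple evaluation measure, accuracy; the weather labels below are invented for illustration.

```python
# Accuracy: the fraction of samples where the predicted category
# matches the correct one.
def accuracy(actual, predicted):
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = ["sunny", "rainy", "sunny", "cloudy", "rainy"]
predicted = ["sunny", "sunny", "sunny", "cloudy", "rainy"]
print(accuracy(actual, predicted))  # 0.8
```

For regression the same comparison is done on numeric values, typically with an error measure such as mean squared error rather than a match count.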

- Transforming Business Questions into Data Science Questions

After you have evaluated your model to get a sense of its performance on your data, you will be able to determine the next steps. Some questions to consider are: should the analysis be performed with more data in order to get better model performance? Would using different data types help? For example, in your clustering results, is it difficult to distinguish customers from distinct regions? Would adding zip code to your input data help to generate finer-grained customer segments? Do the analysis results suggest a more detailed look at some aspect of the problem? For example, predicting sunny weather gives very good results, but rainy weather predictions are just so-so. This suggests that you should take a closer look at your examples of rainy weather. Perhaps you just need more samples of rainy weather, or perhaps there are some anomalies in those samples. Or maybe there is some missing data that needs to be included in order to completely capture rainy weather. 
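Breaking accuracy down per category, as in the sunny-versus-rainy example above, is one way to spot where the model is weak; the labels below are invented for illustration.

```python
from collections import defaultdict

# Per-category accuracy: how often each true category is predicted
# correctly. A low score for one category (e.g. rainy) suggests
# taking a closer look at those samples.
def per_class_accuracy(actual, predicted):
    totals, correct = defaultdict(int), defaultdict(int)
    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a == p:
            correct[a] += 1
    return {c: correct[c] / totals[c] for c in totals}

actual    = ["sunny", "sunny", "sunny", "rainy", "rainy"]
predicted = ["sunny", "sunny", "sunny", "rainy", "sunny"]
print(per_class_accuracy(actual, predicted))  # {'sunny': 1.0, 'rainy': 0.5}
```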

- Summary

The ideal situation would be that your model performs very well with respect to the success criteria that were determined when you defined the problem at the beginning of the project. In that case, you are ready to move on to communicating and acting on the results that you obtained from your analysis. In summary, data analysis involves selecting the appropriate technique for your problem, building the model, and then evaluating the results. As there are different types of problems, there are also different types of analysis techniques.

Stage 4: Report - To Model Data

- Communicating Results

The fourth stage, communicating results, includes evaluating the analytical results, presenting them in a visual way, and creating reports that include an assessment of the results with respect to the success criteria. Activities in this stage are often referred to with terms like interpret, summarize, visualize, or post-process.  

Reporting the insights gained from your analysis is a very important step: it communicates your insights and makes the case for what actions should follow. The presentation can change shape based on your audience and should not be taken lightly. So how do you get started? The first thing to do is to look at your analysis results and decide what to present or report as the biggest value, or biggest set of values. 

In deciding what to present you should ask yourself these questions. What are the main results? What added value do these results provide or how can the model add to the application? How do the results compare to the success criteria determined at the beginning of the project? Answers to these questions are the items you need to include in your report or presentation. So make them the main topics and gather facts to back them up. Keep in mind that not all of your results may be rosy. Your analysis may show results that are counter to what you were hoping to find, or results that are inconclusive or puzzling. You need to show these results as well. Domain experts may find some of these results to be puzzling, and inconclusive findings may lead to additional analysis. Remember the point of reporting your findings is to determine what the next step should be. All findings must be presented so that informed decisions can be made. 

- Visualization Tools To Present the Results

Visualization is an important tool in presenting your results. Scatter plots, line graphs, heat maps, and other types of graphs are effective ways to present your results visually. This time you're not plotting the input data, but you're plotting the output data with similar tools. You should also have tables with details from your analysis as backups, if someone wants to take a deeper dive into the results. 

There are many visualization tools that are available. Some of the most popular open source ones are listed here. 

  • R is a software package for general data analysis. It has powerful visualization capabilities as well. 
  • Python is a general purpose programming language that also has a number of packages to support data analysis and graphics. 
  • D3 is a JavaScript library for producing interactive web based visualizations and data driven documents. 
  • Leaflet is a lightweight mobile friendly JavaScript library to create interactive maps. 
  • Tableau Public allows you to create visualizations in your public profile and share them, or embed them on a site or blog. 
  • Google Charts provides cross-browser compatibility and cross-platform portability to iPhones and Android. 
  • Timeline is a JavaScript library that allows you to create timelines. 

In summary, you want to report your findings by presenting your results and value add with graphs using visualization tools.

Stage 5: Act - To Interpret Models and Data

Act, turning insights into action. Now that you have evaluated the results from your analysis and generated reports on the potential value of the results, the next stage is to determine what action or actions should be taken, based on the insights gained. 

- To Find Actionable Insights

Remember why we started bringing together the data and analyzing it in the first place? To find actionable insights within all these data sets, to answer questions, or to improve business processes. For example, is there something in your process that should change to remove bottlenecks? Is there data that should be added to your application to make it more accurate? Should you segment your population into more well-defined groups for more effective targeted marketing? This is the first step in turning insights into action. 

- Figuring Out How To Implement The Action

Now that you've determined what action to take, the next step is figuring out how to implement the action. What is necessary to add this action into your process or application? How should it be automated? The stakeholders need to be identified and become involved in this change. Just as with any process improvement changes, we need to monitor and measure the impact of the action on the process or application. 


- Evaluating Results from The Implemented Action

Assessing the impact leads to an evaluation. Evaluating the results of the implemented action will determine your next steps. Is there additional analysis that needs to be performed in order to yield even better results? What data should be revisited? Are there additional opportunities that should be explored? For example, let's not forget what big data enables us to do: take real-time actions based on high-velocity streaming information. We need to define which parts of our business need real-time action to be able to influence operations or the interaction with the customer. Once we define these real-time actions, we need to make sure that there are automated systems or processes to perform such actions and to provide failure recovery in case of problems.  

- The Purpose

The last step brings us back to the very first reason we do data science: the purpose. Reporting insights from analysis and determining actions from those insights, based on the purpose you initially defined, is what we refer to as the Act step. In this process, technical skills alone are not sufficient. One essential skill you need is the ability to tell a clear and actionable story. If your presentation does not trigger actions in your audience, your communication was not effective. Remember that you will be presenting to an audience with no technical background, so the way you communicate the message is key.



In conclusion, big data and data science are only useful if the insights can be turned into action, and if the actions are carefully defined and evaluated. Interpreting data refers to presenting your results to a non-technical audience.

The raw data that you get directly from your sources is never in the format that you need to perform analysis on. There are two main goals in the data pre-processing step. 

The first is to clean the data to address data quality issues, and the second is to transform the raw data to make it suitable for analysis. A very important part of data preparation is to address quality issues in your data. Real-world data is messy. In order to address data quality issues effectively, knowledge about the application, such as how the data was collected, the user population, and the intended uses of the application, is important. This domain knowledge is essential to making informed decisions about how to handle incomplete or incorrect data. 
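One common way to address a data quality issue such as missing values is mean imputation, sketched below in plain Python. Whether imputation is appropriate at all depends on the domain knowledge the paragraph above stresses, and the sensor readings here are invented.

```python
import statistics

# Invented sensor readings; None marks a missing value.
readings = [21.5, None, 22.0, 23.5, None, 21.0]

# Mean imputation: fill each missing value with the mean of the
# values that are present.
known = [v for v in readings if v is not None]
fill = statistics.mean(known)
cleaned = [v if v is not None else fill for v in readings]
print(cleaned)
```

Alternatives include dropping incomplete records or filling with the median; the right choice depends on how the data was collected and how it will be used.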

The second part of preparing data is to manipulate the cleaned data into the format needed for analysis. This step is known by many names: data manipulation, data pre-processing, data wrangling, and even data munging. Common operations in this step include scaling, transformation, feature selection, and dimensionality reduction. 

In summary, data preparation is a very important part of the data science process. In fact, this is where you will spend most of your time on any data science effort. It can be a tedious process, but it is a crucial step. Always remember, garbage in, garbage out. If you don't spend the time and effort to create good data for the analysis, you will not get good results no matter how sophisticated the analysis technique you're using is.



[More to come ...]


