
Data Science Life Cycle

[Data Science Lifecycle - Microsoft]

 

New Data Economy: Turning Big Data into Smart Data

 

 

- A General Data Science Life Cycle

A data science life cycle is an iterative set of data science steps you take to deliver a project or analysis. Because every data science project and team is different, every specific data science life cycle is different. However, most data science projects tend to flow through the same general life cycle of data science steps.

Some data science life cycles narrowly focus on just the data, modeling, and assessment steps. Others are more comprehensive and start with business understanding and end with deployment. This life cycle has five steps:

  • Problem Definition: The problem statement stage is the first and most important step in solving an analytics problem; it can make or break the entire project. When a business approaches a data scientist with a problem it wants solved, it will usually define that problem in layman’s terms, which means the problem will not be clear enough, from an analytics point of view, to start solving right away. The problem needs to be well framed: as the data scientist, you need to restate it in mathematical terms, which is easier said than done but not impossible. A good data science problem should be relevant, specific, and unambiguous, and it should align with the business strategy.
  • Data Investigation and Cleaning: Your insights and analysis are only as good as the data you use; in essence, garbage data in is garbage analysis out. Data investigation and cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled, and if the data is incorrect, outcomes and algorithms are unreliable even though they may look correct. There is no single prescription for the exact cleaning steps, because the process varies from dataset to dataset, but it is crucial to establish a template for your cleaning process so you know you are doing it the right way every time (see the cleaning sketch after this list).
  • Minimum Viable Model: A Minimum Viable Model (MVM), and the process that surrounds it, aims to maximize our early understanding of an ML/AI problem domain in a real-world context while minimizing the investment of time and resources. The key phrases here are “understanding” and “real-world context.” When we begin a project there is a lot we do not know, so we make assumptions based on experience and intuition, but we cannot be certain those assumptions are accurate until they have been tested with real-world data, under real-world conditions (see the baseline sketch after this list).
  • Deployment and Enhancements: Put the model into production, integrate it with business processes, and iteratively improve it as new data and feedback arrive.
  • Data Science Ops: Monitor, maintain, and operate the deployed models and data pipelines so they keep performing reliably over time.
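
As a rough illustration of the cleaning step referenced above, the sketch below uses pandas on a small hypothetical table (the column names and cleaning rules are purely illustrative) to de-duplicate, coerce types, and drop rows that are missing key fields:

```python
import pandas as pd

# Hypothetical raw dataset; column names and values are illustrative only.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, None],
    "signup_date": ["2021-01-05", "2021-01-05", "2021-02-10", "bad-date", "2021-03-01"],
    "monthly_spend": ["100", "100", "250", "175", "90"],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """One possible cleaning template: de-duplicate, coerce types, drop unusable rows."""
    df = df.drop_duplicates()                                                # remove exact duplicates
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")   # bad dates become NaT
    df["monthly_spend"] = pd.to_numeric(df["monthly_spend"], errors="coerce")
    df = df.dropna(subset=["customer_id", "signup_date"])                    # require key fields
    return df.reset_index(drop=True)

print(clean(raw))
```

In the same spirit, a minimum viable model can be little more than a trivial baseline trained and evaluated end to end, so that assumptions are tested on real data before heavier investment. The sketch below uses scikit-learn on synthetic data that stands in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for real-world observations.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: always predict the majority class. Any real model must beat this.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Minimum viable model: the simplest plausible learner, trained end to end.
mvm = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("MVM accuracy:     ", accuracy_score(y_test, mvm.predict(X_test)))
```
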


These are not strictly linear steps. You will start with step one and proceed to step two, but from there you should flow among the steps as needed. Several small, iterative steps are better than a few large, comprehensive phases.

 

- Big Data Life Cycle

Big data is an emerging term for the process of managing huge amounts of data from different sources, such as DBMSs, log files, social media postings, and sensor data. Big data (text, numbers, images, etc.) can take different forms: structured, semi-structured, and unstructured. It can be further characterised by attributes such as velocity, volume, variety, value, and complexity. Emerging big data technologies also raise many security concerns and challenges.

Big data must pass through a series of steps before it generates value, namely data access, storage, cleaning, and analysis. One approach is to run each stage as a separate layer, use the tools that fit the problem at hand, and scale the analytical solution to the size of the data.
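
As a minimal sketch of that layered idea (the stage names and their toy implementations below are hypothetical), each stage can be written as its own function and the pipeline simply feeds the output of one layer into the next:

```python
from functools import reduce
from typing import Callable, Iterable, List

Stage = Callable[[List[dict]], List[dict]]

# Hypothetical layers; each one only knows its own job.
def access(_: List[dict]) -> List[dict]:
    # Stand-in for reading records from a source system.
    return [{"id": 1, "value": " 42 "}, {"id": 2, "value": None}]

def store(records: List[dict]) -> List[dict]:
    # Stand-in for persisting records to a storage layer.
    return list(records)

def clean(records: List[dict]) -> List[dict]:
    # Drop records with missing values.
    return [r for r in records if r["value"] is not None]

def analyse(records: List[dict]) -> List[dict]:
    # Normalise values so they can be analysed.
    return [{**r, "value": int(str(r["value"]).strip())} for r in records]

def run_pipeline(stages: Iterable[Stage]) -> List[dict]:
    """Run each stage as a separate layer, feeding its output to the next."""
    return reduce(lambda data, stage: stage(data), stages, [])

print(run_pipeline([access, store, clean, analyse]))  # -> [{'id': 1, 'value': 42}]
```
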

The big data life cycle consists of four stages: Data Acquisition, Data Awareness, Data Processing and Analytics, and Data Governance.

 

[Data Science Landscape]

- Data Acquisition

Data acquisition is the process of gathering, filtering, and cleaning data before it is put into a data warehouse or any other storage solution. The acquisition of big data is most commonly governed by four of the Vs: volume, velocity, variety, and value. Most acquisition scenarios assume high-volume, high-velocity, high-variety but low-value data, so it is important to have adaptable, time-efficient gathering, filtering, and cleaning algorithms that ensure only the high-value fragments of the data are actually processed by the data warehouse and downstream analysis.
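
As a minimal sketch of this “filter before you store” idea, assuming a stream of record dictionaries and a purely hypothetical is_high_value rule, a generator can keep only the fragments worth loading into the warehouse:

```python
from typing import Dict, Iterable, Iterator

Record = Dict[str, object]

def is_high_value(record: Record) -> bool:
    """Hypothetical value rule: keep complete records above a spend threshold."""
    return record.get("customer_id") is not None and float(record.get("spend", 0)) >= 50

def acquire(stream: Iterable[Record]) -> Iterator[Record]:
    """Gather, filter, and lightly clean records before they reach storage."""
    for record in stream:
        if is_high_value(record):
            # Minimal cleaning: normalise the field we filter on.
            record["spend"] = float(record["spend"])
            yield record

incoming = [
    {"customer_id": 1, "spend": "120.5"},
    {"customer_id": None, "spend": "75"},   # incomplete, dropped
    {"customer_id": 2, "spend": "10"},      # low value, dropped
]
print(list(acquire(incoming)))  # -> [{'customer_id': 1, 'spend': 120.5}]
```
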

 

- Data Awareness

Data Awareness is the task of creating a scheme of relationships within a set of data, so that different users of the data can determine a fluid yet valid context and utilise it for their desired tasks. It is a relatively new field in which most current work focuses on semantic structures that allow data to gain context in an interoperable format, in contrast to the current practice of giving data context through unique, model-specific constructs such as XML Schemas.
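
As a rough sketch of this idea only (the vocabulary URLs and field names below are hypothetical, loosely in the spirit of JSON-LD), a record can carry an interoperable context along with its data instead of relying on a model-specific schema:

```python
import json

# A record as it exists inside one system, with local, model-specific field names.
local_record = {"cust": 42, "amt": 99.95, "ts": "2021-11-20T14:03:00Z"}

# A shared context maps local names onto terms from a common vocabulary,
# so other consumers can interpret the data without knowing our schema.
context = {
    "cust": "https://example.org/vocab/customerId",   # hypothetical vocabulary terms
    "amt":  "https://example.org/vocab/orderAmount",
    "ts":   "https://example.org/vocab/timestamp",
}

def with_context(record: dict, ctx: dict) -> dict:
    """Publish the record together with the context that gives it meaning."""
    return {"@context": ctx, **record}

print(json.dumps(with_context(local_record, context), indent=2))
```
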

Prior to the Big Data revolution, organizations were inward-looking in terms of data. During this time, data-centric environments like data warehouses dealt only with data created within the enterprise. But with the advent of data science and predictive analytics, many organizations have come to the realization that enterprise data must be fused with external data to enable and scale a digital business transformation. 

This means that processes for identifying, sourcing, understanding, assessing and ingesting such data must be developed.

 

- Data Processing and Analytics

Data processing has three primary goals: a. determining whether the collected data is internally consistent; b. making the data meaningful to other systems or users, using metaphors or analogies they can understand; and c. (what many consider most important) providing predictions about future events and behaviours based on past data and trends. Because data analytics is a vast field whose underlying technologies change rapidly, this section concentrates on the most commonly used ones. Effective analytical processing requires four primary conditions to be met: fast data loading, fast query processing, efficient utilisation of storage, and adaptivity to dynamic workload patterns. The analytical model most commonly associated with meeting these criteria, and with big data in general, is MapReduce, sketched below.
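
To make the MapReduce model concrete, here is a minimal single-machine word-count sketch in Python; a real deployment would distribute the map and reduce phases across a cluster:

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs: Iterator[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group intermediate values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data needs big tools", "big data value"]
pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(shuffle(pairs)))  # -> {'big': 3, 'data': 2, ...}
```
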

 

- Data Governance

Data governance is a requirement in today’s fast-moving and highly competitive enterprise environment. Now that organizations have the opportunity to capture massive amounts of diverse internal and external data, they need a discipline to maximize the value of that data, manage risks, and reduce cost.

Data governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals. It establishes the processes and responsibilities that ensure the quality and security of the data used across a business or organization. Data governance defines who can take what action, upon what data, in what situations, using what methods. 
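
As a toy sketch of “who can take what action, upon what data” (the roles, datasets, and actions below are hypothetical), such a policy can be written down and checked directly:

```python
# Hypothetical governance policy: role -> dataset -> allowed actions.
POLICY = {
    "analyst":   {"sales": {"read"}, "patients": set()},
    "clinician": {"patients": {"read", "update"}},
    "admin":     {"sales": {"read", "update", "delete"},
                  "patients": {"read", "update", "delete"}},
}

def is_allowed(role: str, action: str, dataset: str) -> bool:
    """Return True only if the policy grants this action on this data."""
    return action in POLICY.get(role, {}).get(dataset, set())

print(is_allowed("analyst", "read", "sales"))      # True
print(is_allowed("analyst", "read", "patients"))   # False
```
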

A well-crafted data governance strategy is fundamental for any organization that works with big data, and will explain how your business benefits from consistent, common processes and responsibilities. Business drivers highlight what data needs to be carefully controlled in your data governance strategy and the benefits expected from this effort. This strategy will be the basis of your data governance framework. 

Data governance is the act of managing raw big data, as well as the processed information that arises from it, in order to meet legal, regulatory, and business-imposed requirements. While there is no standardized format for data governance, there have been increasing calls within various sectors (especially healthcare) to create one, to ensure reliable, secure, and consistent big data utilisation across the board.

For example, if a business driver for your data governance strategy is to ensure the privacy of healthcare-related data, patient data will need to be managed securely as it flows through your business. Retention requirements (e.g., a history of who changed what information and when) will be defined to ensure compliance with relevant government regulations, such as the GDPR.
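
A retention requirement such as keeping a history of who changed what and when can be sketched, under purely illustrative assumptions about the record shape, as an append-only audit log:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class AuditEntry:
    user: str            # who made the change
    record_id: str       # what was changed
    fields: List[str]    # which fields were touched
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))  # when

AUDIT_LOG: List[AuditEntry] = []   # append-only; entries are never edited or deleted

def record_change(user: str, record_id: str, changed_fields: List[str]) -> None:
    """Append who changed what, and when, to the retained audit history."""
    AUDIT_LOG.append(AuditEntry(user=user, record_id=record_id, fields=changed_fields))

record_change("dr_smith", "patient-1001", ["address"])
for entry in AUDIT_LOG:
    print(entry)
```
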

 
 

[More to come ...]

 

 



 
