
Data Science Life Cycle

[Data Science Lifecycle - Microsoft]


New Data Economy: Turning Big Data into Smart Data


- Overview

Big data is an emerging term for the practice of managing very large amounts of data from different sources, such as DBMSs, log files, social media posts, and sensor data. Big data (text, numbers, images, etc.) can be divided into three forms: structured, semi-structured, and unstructured. It can be further described by attributes such as velocity, volume, variety, value, and complexity. Emerging big data technologies also raise many security concerns and challenges.

Big data must pass through a series of steps before it generates value, namely data access, storage, cleaning, and analysis. One approach is to run each stage as a separate layer, use the available tools that best fit the problem at hand, and scale the analytical solutions to big data.

The big data life cycle consists of four stages, namely: Data Acquisition, Data Awareness, Data Analytics and Data Governance.


- Data Acquisition

Data acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution. The acquisition of big data is most commonly governed by four of the Vs: volume, velocity, variety, and value. Most data acquisition scenarios assume high-volume, high-velocity, high-variety, but low-value data. This makes it important to have adaptable and time-efficient gathering, filtering, and cleaning algorithms, so that only the high-value fragments of the data are actually processed by the data-warehouse analysis.
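As an illustration of the gather, filter, and clean steps described above, here is a minimal Python sketch. The record layout (`sensor`, `reading`), the sample data, and all function names are assumptions made for this example only. A real acquisition layer would read from streams, logs, or message queues rather than an in-memory list.

```python
from typing import Iterable, Iterator

# Hypothetical raw records (e.g. sensor readings); field names are illustrative.
RAW_RECORDS = [
    {"sensor": "a1", "reading": " 21.5 "},
    {"sensor": "a1", "reading": "n/a"},    # unparseable value
    {"sensor": "", "reading": "19.0"},     # missing sensor id
    {"sensor": "b2", "reading": "22.1"},
]


def gather(source: Iterable[dict]) -> Iterator[dict]:
    """Gather: stream records one at a time instead of loading them all."""
    for record in source:
        yield record


def filter_valid(records: Iterable[dict]) -> Iterator[dict]:
    """Filter: drop records that carry no sensor id."""
    for record in records:
        if record.get("sensor"):
            yield record


def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Clean: parse readings as floats and skip values that cannot be parsed."""
    for record in records:
        try:
            value = float(record["reading"].strip())
        except ValueError:
            continue
        yield {"sensor": record["sensor"], "reading": value}


def acquire(source: Iterable[dict]) -> list:
    """Chain the three steps so only usable records reach storage."""
    return list(clean(filter_valid(gather(source))))


cleaned = acquire(RAW_RECORDS)
```

Because each step is a generator, records flow through one at a time, which is one simple way to keep memory use flat when the volume is high.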


- Data Awareness

Data awareness is the task of creating a scheme of relationships within a set of data, so that different users of the data can determine a fluid yet valid context and utilise it for their desired tasks. It is a relatively new field in which most current work focuses on semantic structures that allow data to gain context in an interoperable format, in contrast to the current system where data is given context using unique, model-specific constructs (such as XML Schemas).

Prior to the Big Data revolution, organizations were inward-looking in terms of data. During this time, data-centric environments like data warehouses dealt only with data created within the enterprise. But with the advent of data science and predictive analytics, many organizations have come to the realization that enterprise data must be fused with external data to enable and scale a digital business transformation. 

This means that processes for identifying, sourcing, understanding, assessing and ingesting such data must be developed.


- Data Processing and Analytics

Data processing has three primary goals: (a) to determine whether the data collected is internally consistent; (b) to make the data meaningful to other systems or users, using metaphors or analogies they can understand; and (c) to provide predictions about future events and behaviours based on past data and trends, which many consider the most important goal. Because data analytics is a vast field whose technologies change rapidly, this section concentrates on the most commonly used ones. Data analytics requires four primary conditions to be met for effective processing: fast data loading, fast query processing, efficient utilisation of storage, and adaptivity to dynamic workload patterns. The analytical model most commonly associated with meeting these criteria, and with big data in general, is MapReduce.
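To make the MapReduce model more concrete, the following is a minimal single-process sketch in Python that counts words. It only illustrates the map, shuffle, and reduce phases. It is not how a distributed framework such as Hadoop is implemented, and the function names are assumptions chosen for this example.

```python
from collections import defaultdict


def map_phase(document: str) -> list:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: list) -> dict:
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict) -> dict:
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents: list) -> dict:
    """Run the three phases in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data", "smart data", "Big Data"])
```

In a real cluster the map calls run in parallel on separate data partitions and the shuffle moves data across the network. The sketch keeps everything in one process so the data flow is easy to follow.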


- Data Governance

Data governance is a requirement in today’s fast-moving and highly competitive enterprise environment. Now that organizations can capture massive amounts of diverse internal and external data, they need a discipline to maximize its value, manage risks, and reduce cost.

Data governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals. It establishes the processes and responsibilities that ensure the quality and security of the data used across a business or organization. Data governance defines who can take what action, upon what data, in what situations, using what methods. 
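The idea that governance defines who can take what action, upon what data, in what situations, using what methods can be sketched as a small rule table. This is a hypothetical illustration only: the roles, datasets, situations, and methods below are invented for the example, and real governance tooling is considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str        # who
    action: str      # what action
    dataset: str     # upon what data
    situation: str   # in what situation ("any" matches every situation)
    method: str      # using what method


# Hypothetical policy; every value here is illustrative.
POLICY = {
    Rule("analyst", "read", "sales", "business_hours", "sql_view"),
    Rule("steward", "update", "sales", "any", "admin_console"),
}


def is_allowed(role: str, action: str, dataset: str,
               situation: str, method: str) -> bool:
    """Return True if some rule in POLICY permits the requested operation."""
    for rule in POLICY:
        same_request = (rule.role, rule.action, rule.dataset, rule.method) == (
            role, action, dataset, method)
        if same_request and rule.situation in ("any", situation):
            return True
    return False


analyst_can_read = is_allowed(
    "analyst", "read", "sales", "business_hours", "sql_view")
analyst_can_update = is_allowed(
    "analyst", "update", "sales", "business_hours", "sql_view")
```

The sketch denies by default: any request that no rule matches is refused, which is a common starting point for access policies.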

A well-crafted data governance strategy is fundamental for any organization that works with big data, and will explain how your business benefits from consistent, common processes and responsibilities. Business drivers highlight what data needs to be carefully controlled in your data governance strategy and the benefits expected from this effort. This strategy will be the basis of your data governance framework. 

Data governance is the act of managing raw big data, as well as the processed information that arises from it, in order to meet legal, regulatory, and business-imposed requirements. While there is no standardized format for data governance, there have been increasing calls within various sectors (especially healthcare) to create such a format to ensure reliable, secure, and consistent big data utilisation across the board.

For example, if a business driver for your data governance strategy is to ensure the privacy of healthcare-related data, patient data will need to be managed securely as it flows through your business. Retention requirements (e.g. history of who changed what information and when) will be defined to ensure compliance with relevant government requirements, such as the GDP


[More to come ...]



