
Big Data Integration, Data Lakes, Data Warehouses and Mining


 

- Big Data vs Data Warehouse vs Data Mining

Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process the data with low latency. Big data has one or more of the following characteristics: high volume, high velocity or high variety. Artificial intelligence (AI), mobile, social and the Internet of Things (IoT) are driving data complexity through new forms and sources of data. For example, big data comes from sensors, devices, video/audio, networks, log files, transactional applications, web, and social media — much of it generated in real time and at a very large scale.

A data warehouse is a large, multi-faceted repository for data of many types, and it is a critical element of most Big Data strategies. Just as a warehouse is a large building for storing goods, a data warehouse is a repository where large amounts of data can be collected. Data warehousing and data mining are complementary: data warehousing is the process of collecting, pooling, and managing all the relevant data, whereas data mining is the process of analyzing large data sets to discover previously unknown patterns.
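The distinction between pooling data and mining it can be illustrated with a minimal Python sketch. The order records, the `pool` and `frequent_pairs` helpers, and the `min_count` threshold are all hypothetical and exist only for this illustration; real warehouses and mining tools are far more elaborate.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order records from two separate source systems.
web_orders = [
    {"order_id": 1, "items": ["bread", "butter"]},
    {"order_id": 2, "items": ["bread", "butter", "jam"]},
]
store_orders = [
    {"order_id": 3, "items": ["bread", "jam"]},
    {"order_id": 4, "items": ["butter", "bread"]},
]


def pool(*sources):
    """Warehousing step: collect the records of every source in one place."""
    pooled = []
    for source in sources:
        pooled.extend(source)
    return pooled


def frequent_pairs(orders, min_count):
    """Mining step: find item pairs that appear together in at least min_count orders."""
    counts = Counter()
    for order in orders:
        # Sort so that ("bread", "butter") and ("butter", "bread") count as one pair.
        for pair in combinations(sorted(set(order["items"])), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


warehouse = pool(web_orders, store_orders)
patterns = frequent_pairs(warehouse, min_count=2)
print(patterns)
```

The pooling step only gathers records, while the mining step surfaces a pattern (items frequently bought together) that no single record states explicitly.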

Data warehousing has been a common term for the last 10-20 years, whereas Big Data has been a hot trend for the last 5-10 years. Both hold large amounts of data, both are used for reporting, and both are managed on electronic storage. As a result, many people assume that big data will soon replace the older data warehouse. However, big data and data warehousing are not interchangeable, because they serve different purposes.

  

- Data Warehouses

A data warehouse (DW), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. The system’s logical design facilitates the integration of data sources and allows the generation of new, additional valuable data sources without significant structural adjustment.  
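As a rough sketch of what "integrated data from disparate sources" can mean in practice, the following Python example loads two differently shaped extracts into one in-memory SQLite database and runs a reporting query over the result. The source systems, table names, and column names are assumptions made for illustration, and SQLite merely stands in for a real warehouse engine.

```python
import sqlite3

# Hypothetical extracts from two disparate source systems.
crm_rows = [("C1", "Acme Corp"), ("C2", "Globex")]
billing_rows = [
    {"cust": "C1", "amount_cents": 12500},
    {"cust": "C1", "amount_cents": 7500},
    {"cust": "C2", "amount_cents": 4000},
]

# In-memory database standing in for the central repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE invoice (customer_id TEXT, amount REAL)")

# Load step: the billing extract is reshaped (renamed key, cents to currency
# units) so that both sources share one customer identifier.
conn.executemany("INSERT INTO customer VALUES (?, ?)", crm_rows)
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?)",
    [(row["cust"], row["amount_cents"] / 100) for row in billing_rows],
)

# Reporting step: total invoiced amount per customer across both sources.
report = conn.execute(
    """
    SELECT c.name, SUM(i.amount)
    FROM customer AS c JOIN invoice AS i USING (customer_id)
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(report)
```

Once both extracts share a key, a new source can be added as another table without restructuring the existing ones.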

Each organization has distinct operation practices and business models, which result in a variety of data generation platforms. Ultimately, a data warehouse should be larger than the sum of its data, and serve as an ongoing intelligent resource for use by multiple members of an organization, large or small. For that to happen, data warehouse technologies require data virtualization, processing, and transformation methods. 

There are several delivery models, including physical appliances such as dedicated traditional storage subsystems built to support analytics and business intelligence (BI). BI is an umbrella term that covers the applications, infrastructure, tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance. With the ongoing evolution of the cloud, cloud-based solutions, seen as agile and less capital-intensive, aim to simplify both the hosting and the analysis of data in an increasingly complicated environment. 

In addition to the explosive growth in the amount of data and data sources we’ve seen in recent years, another motivation for creating even more sophisticated data warehousing systems is the ever-increasing need for customizable business intelligence and analytics. 

 

- Data Lakes

A data lake is a centralized repository that stores, processes, and secures large amounts of data. It can store data in its native format and process any variety of it, without a fixed size limit. 

Data lakes are used for analytics applications, big data analytics, machine learning, reporting, visualization, and advanced analytics. 

Data lakes are different from traditional data warehouses, which store data in hierarchical dimensions and tables. Data lakes use a flat architecture to store data, primarily in files or object storage. 

Data lakes can include raw copies of source system data, sensor data, social data, and transformed data.

A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). Data lakes can be used to explore and analyze petabytes of data. One petabyte of data is equivalent to 1 million gigabytes. 
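The flat, native-format storage described above can be sketched with a toy in-memory store in Python. The `ToyDataLake` class, the object keys, and the suffix-to-category mapping are hypothetical and only mirror the categories named in the text. Structured data exported from relational databases is left out of the mapping for brevity, and a real data lake would sit on files or object storage rather than a dictionary.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Categories follow the text; the suffix-to-category mapping is an assumption.
CATEGORY_BY_SUFFIX = {
    ".csv": "semi-structured",
    ".log": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".eml": "unstructured",
    ".pdf": "unstructured",
    ".png": "binary",
    ".mp4": "binary",
}

GIGABYTES_PER_PETABYTE = 10**6  # decimal units, as stated in the text


class ToyDataLake:
    """Hypothetical flat store: one key per object, payload kept as raw bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, payload):
        # No parsing or schema is applied on write; the native format is preserved.
        self._objects[key] = payload

    def get(self, key):
        return self._objects[key]

    def inventory(self):
        """Group stored keys by data category, based on the key's file suffix."""
        groups = defaultdict(list)
        for key in sorted(self._objects):
            suffix = PurePosixPath(key).suffix.lower()
            groups[CATEGORY_BY_SUFFIX.get(suffix, "unknown")].append(key)
        return dict(groups)


def gigabytes_to_petabytes(gigabytes):
    """Convert gigabytes to petabytes using 1 PB = 1,000,000 GB."""
    return gigabytes / GIGABYTES_PER_PETABYTE


lake = ToyDataLake()
lake.put("sensors/2024-01-01.json", b'{"temp_c": 21.5}')
lake.put("crm/export.csv", b"id,name\n1,Acme\n")
lake.put("mail/ticket-17.eml", b"Subject: outage\n\nThe service is down.")
lake.put("media/logo.png", b"\x89PNG\r\n\x1a\n")

print(lake.inventory())
print(gigabytes_to_petabytes(2_500_000))
```

The slashes in the keys are only a naming convention: the store itself is a single flat namespace, in contrast to the dimensions and tables of a warehouse.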

 

[More to come ...]



   

 