Big Data Analytics Technologies and Tools
Big Data Science - The Future of Analytics
- Big Data Analytics
Data has become a primary resource for value generation. Today, big data falls under three categories of data sets: structured, unstructured, and semi-structured. Big data analytics software is widely used to provide meaningful analysis of large data sets, helping to identify current market trends, customer preferences, and other useful information.
Big data analytics is the often complex process of examining large and varied data sets, or big data, to uncover information - such as hidden patterns, unknown correlations, market trends and customer preferences - that can help organizations make informed business decisions. Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.
Driven by specialized analytics systems and software, as well as high-powered computing systems, big data analytics offers various business benefits, including cost reduction, faster and better decision making, new revenue opportunities, more effective marketing, better customer service, new products and services, and competitive advantages over rivals.
Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional BI (Business Intelligence) and analytics programs. This encompasses a mix of semi-structured and unstructured data - for example, Internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone records, and machine data captured by sensors connected to the Internet of things (IoT).
- Big Data Analytics - Technologies and Tools
- (Feature Model of Big Data Systems - ScienceDirect)
Big data analytics is the process of extracting useful information by analysing different types of big data sets. It is used to discover hidden patterns, market trends and consumer preferences, for the benefit of organizational decision making. There are several steps, technologies, and tools involved in big data analytics.
1. Data Acquisition
Data acquisition has two components: identification and collection of big data. Identification of big data is done by analyzing the two natural formats of data - born digital and born analogue.
- Born Digital Data
Information that is “born digital” is created, by a user or by a digital system, specifically for use by a computer or data-processing system. This is a vast and ever-expanding range of information, since systems keep collecting new kinds of data from users through digital media such as computers and smartphone apps. Born digital data is traceable and can provide both personal and demographic business insights. Examples include digital photographs, harvested web content, digital manuscripts, electronic records, static data sets, dynamic data, digital art, digital media publications, cookies, web analytics, and GPS tracking. All of this data can be tracked and tagged to users as well as aggregated to form a larger picture, massively increasing the scope of what may constitute the ‘data’ in big data.
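As a toy illustration of collecting born digital machine data, the following sketch parses a web server access log line (in the common log format) into a structured record that can later be aggregated; the sample line, regular expression, and field names are assumptions for illustration, not taken from any particular system.

```python
import re

# Common Log Format: host ident authuser [timestamp] "request" status bytes
# (hypothetical example; real logs vary by server configuration)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Turn one raw access-log line into a structured record, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

if __name__ == "__main__":
    sample = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products HTTP/1.1" 200 2326'
    print(parse_log_line(sample))
```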
- Born Analogue Data
Information is said to be “analogue” when it reflects characteristics of the physical world, such as images, video, or heartbeats. When information exists in the form of pictures, videos, and other formats that relate to physical elements of our world, it is termed analogue data. This data requires conversion into digital format by sensors and devices such as cameras, voice recorders, and digital assistants. The increasing reach of technology has also raised the rate at which traditionally analogue data is converted or captured through digital mediums. Some examples of information that is born analogue but collected via digital means are: voice and video content on devices; personal health data such as heartbeats, blood pressure, respiration, and velocity; and cameras on home appliances.
While not as vast a category as born digital data, the increasingly lower cost of technology and the ubiquity of digital, networked devices mean that traditionally analogue information is being captured digitally at a rapidly increasing rate.
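The minimal sketch below shows the basic idea of digitizing a born analogue signal by sampling and quantizing it; the waveform (a sine wave standing in for a heartbeat trace), the sample rate, and the bit depth are arbitrary assumptions chosen only to illustrate the conversion step.

```python
import math

def digitize(signal, duration_s=1.0, sample_rate_hz=100, bits=8):
    """Sample a continuous signal function at fixed intervals and quantize to integer levels."""
    levels = 2 ** bits
    n_samples = int(duration_s * sample_rate_hz)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate_hz
        value = signal(t)                       # analogue value assumed in [-1.0, 1.0]
        clipped = max(-1.0, min(1.0, value))
        quantized = round((clipped + 1.0) / 2.0 * (levels - 1))  # map to 0..levels-1
        samples.append(quantized)
    return samples

if __name__ == "__main__":
    # A 1.2 Hz sine wave as a crude stand-in for a heartbeat waveform.
    heartbeat = lambda t: math.sin(2 * math.pi * 1.2 * t)
    print(digitize(heartbeat)[:10])
```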
2. In-memory Data Fabric
This technology helps distribute large quantities of data across system resources such as dynamic RAM, flash storage, or solid-state drives, which in turn enables low-latency access and processing of big data on the connected nodes.
Apache Ignite is an in-memory computing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. It is an open-source distributed database, caching, and processing platform designed to store and compute on large volumes of data across a cluster of nodes.
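As a hedged sketch of how an application might read and write through such an in-memory layer, the snippet below uses Ignite's Python thin client (pyignite) against a locally running node on the default thin-client port; the cache name and key are placeholders.

```python
from pyignite import Client  # pip install pyignite

# Connect to a locally running Ignite node (default thin-client port assumed).
client = Client()
client.connect('127.0.0.1', 10800)

# Create (or open) a distributed cache and read/write through memory.
cache = client.get_or_create_cache('session_metrics')   # placeholder cache name
cache.put('user:42:clicks', 17)
print(cache.get('user:42:clicks'))  # -> 17

client.close()
```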
- Distributed Storage
To counter independent node failures and the loss or corruption of big data sources, distributed file stores hold replicated data. Sometimes the data is also replicated for low-latency, quick access across large computer networks. These stores are generally non-relational databases.
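The toy, in-process sketch below illustrates only the replication idea: a block is written to several "nodes", and a read still succeeds after one of them fails. Real distributed file stores such as HDFS do this across machines; the class, node names, and replication factor here are invented for illustration.

```python
import random

class TinyReplicatedStore:
    """Toy model: each block is copied to `replication_factor` nodes chosen at random."""

    def __init__(self, node_ids, replication_factor=3):
        self.nodes = {node_id: {} for node_id in node_ids}
        self.replication_factor = replication_factor

    def put(self, block_id, data):
        replicas = random.sample(list(self.nodes), self.replication_factor)
        for node_id in replicas:
            self.nodes[node_id][block_id] = data
        return replicas

    def fail_node(self, node_id):
        self.nodes[node_id] = None  # simulate a crashed node

    def get(self, block_id):
        for store in self.nodes.values():
            if store is not None and block_id in store:
                return store[block_id]
        raise KeyError(f"all replicas of {block_id} lost")

store = TinyReplicatedStore(node_ids=["n1", "n2", "n3", "n4", "n5"])
placed_on = store.put("block-0001", b"...sensor readings...")
store.fail_node(placed_on[0])          # one replica is lost
print(store.get("block-0001"))         # still readable from a surviving replica
```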
- The Apache Hadoop software library
The Apache Hadoop Software Library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
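A classic illustration of those "simple programming models" is a word count written for Hadoop Streaming, where the mapper and reducer are ordinary scripts that read stdin and write tab-separated key/value lines. The sketch below follows that convention; in practice the two functions live in separate mapper.py and reducer.py files passed to the streaming jar.

```python
#!/usr/bin/env python3
import sys

def run_mapper():
    """Mapper: emit one "word<TAB>1" line per word read from stdin."""
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def run_reducer():
    """Reducer: sum counts per word; Hadoop sorts mapper output by key before this runs."""
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    # In a real streaming job these live in two files, invoked roughly as:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
    #       -input /data/in -output /data/out
    run_mapper()
```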
3. Data Virtualization
Enterprise data comes in many forms and is stored in many locations. There is both structured and unstructured data, including rows and columns of data in a traditional database, and data in formats like logs, email, and social media content. Big Data in its many forms is stored in databases, log files, CRM, SaaS, and other apps. Data virtualization integrates data from disparate sources without copying or moving the data, thus giving users a single virtual layer that spans multiple applications, formats, and physical locations. This means faster, easier access to data.
Data virtualization is a logical data layer that integrates all enterprise data siloed across the disparate systems, manages the unified data for centralized security and governance, and delivers it to business users in real time.
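The deliberately simplified sketch below shows the "single virtual layer" idea: one query interface that fetches rows on demand from two different sources (a SQLite table and a CSV file) without copying them into a common store. The table, file names, and column names are made up for illustration; production data virtualization platforms do far more (query pushdown, caching, security).

```python
import csv
import sqlite3

class VirtualCustomerView:
    """Toy virtual layer: answers lookups by delegating to the underlying sources at query time."""

    def __init__(self, sqlite_path, csv_path):
        self.sqlite_path = sqlite_path   # e.g. a CRM database (hypothetical)
        self.csv_path = csv_path         # e.g. an exported web-analytics file (hypothetical)

    def customer(self, customer_id):
        # Pull the master record from the relational source...
        with sqlite3.connect(self.sqlite_path) as conn:
            row = conn.execute(
                "SELECT id, name, country FROM customers WHERE id = ?", (customer_id,)
            ).fetchone()
        # ...and enrich it with rows from the flat-file source, read on demand.
        with open(self.csv_path, newline="") as f:
            visits = [r for r in csv.DictReader(f) if r["customer_id"] == str(customer_id)]
        return {"profile": row, "web_visits": visits}

# Usage (assuming both sources exist):
# view = VirtualCustomerView("crm.db", "web_visits.csv")
# print(view.customer(42))
```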
4. Data Integration
Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process, and includes steps such as cleansing, ETL (Extract, Transform, Load) mapping, and transformation. Data integration ultimately enables analytics tools to produce effective, actionable business intelligence.
Big data integration takes traditional data, machine-generated data, social media, web data, and data from the Internet of Things (IoT), and combines it into a single framework to provide the most complete and up-to-date view of your business. It pulls the insights you need from potentially vast numbers of disparate data sources to boost performance, enabling data analysis that wouldn't otherwise be possible.
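A minimal extract-transform-load sketch of that combination step is shown below; the source files, the unified schema, and the SQLite "warehouse" are placeholder assumptions chosen to keep the example self-contained.

```python
import csv
import json
import sqlite3

# Extract: read records from two hypothetical sources (a CSV export and a JSON feed).
def extract(csv_path, json_path):
    with open(csv_path, newline="") as f:
        orders = list(csv.DictReader(f))
    with open(json_path) as f:
        sensor_events = json.load(f)
    return orders, sensor_events

# Transform: clean types and map both sources onto one unified schema.
def transform(orders, sensor_events):
    unified = []
    for o in orders:
        unified.append({"source": "orders", "key": o["order_id"], "value": float(o["amount"])})
    for e in sensor_events:
        unified.append({"source": "iot", "key": e["device_id"], "value": float(e["reading"])})
    return unified

# Load: write the unified rows into a single analytical table.
def load(rows, db_path="warehouse.db"):
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS unified (source TEXT, key TEXT, value REAL)")
        conn.executemany(
            "INSERT INTO unified (source, key, value) VALUES (:source, :key, :value)", rows
        )

if __name__ == "__main__":
    orders, events = extract("orders.csv", "sensor_events.json")
    load(transform(orders, events))
```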
5. Data Preprocessing
Data preprocessing is a fundamental stage to prepare the data in order to get more out of it. Data preprocessing is a data mining technique to turn the raw data gathered from diverse sources into cleaner information that’s more suitable for work. In other words, it’s a preliminary step that takes all of the available information to organize it, sort it, and merge it.
Raw data can have missing or inconsistent values and often carries a lot of redundant information. The most common problems with raw data fall into three groups: a) Missing data (or inaccurate data) - it often appears when there is a problem in the collection phase; b) Noisy data (or erroneous data) - noise comes from human mistakes, rare exceptions, mislabels, and other issues during data gathering; c) Inconsistent data - inconsistencies happen when similar data is kept in different formats and files; duplicates in different formats, mistakes in codes or names, and the absence of data constraints often lead to inconsistent data. If these issues are not taken care of, the final output will be plagued with faulty insights.
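The short pandas sketch below applies one common treatment to each of the three problem groups; the column names and the chosen fixes (median imputation, outlier clipping, format normalization, de-duplication) are illustrative assumptions rather than a universal recipe.

```python
import pandas as pd

# Illustrative raw data with missing, noisy, and inconsistent values.
raw = pd.DataFrame({
    "customer": ["Alice", "BOB", "bob ", None, "Carol"],
    "country":  ["NL", "nl", "NL", "US", "usa"],
    "spend":    [120.0, None, 35.0, 99999.0, 80.0],   # None = missing, 99999 = noisy outlier
})

clean = raw.copy()

# a) Missing data: drop rows missing the key field, impute numeric gaps with the median.
clean = clean.dropna(subset=["customer"])
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

# b) Noisy data: clip implausible values to a sane upper bound (domain-specific assumption).
clean["spend"] = clean["spend"].clip(upper=1000.0)

# c) Inconsistent data: normalize formats, then drop the duplicates they were hiding.
clean["customer"] = clean["customer"].str.strip().str.title()
clean["country"] = clean["country"].str.upper().replace({"USA": "US"})
clean = clean.drop_duplicates(subset=["customer", "country"])

print(clean)
```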
6. Data Quality
Data quality might be the single most important component of a data pipeline, since, without a level of confidence and reliability in your data, the dashboards and analyses generated from it are useless. The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct. Data that is correct today might not be correct in a month. Fortunately, there are some fundamental techniques and approaches that can be broadly applied when validating data quality. The general theme of data quality is finding outliers that do not meet specific requirements and record sets that violate business assumptions.
When working with moving data, data can be thought of in three separate layers: the ETL (Extract, Transform, Load) layer, the business layer, and the reporting layer. The ETL layer contains the code for data ingestion and data movement between a source system and a target system (for example, from the application database to the data warehouse). The business layer sits between your raw ingested data and your final data models. Finally, the reporting layer contains the dashboards that business users view and interact with.
Fundamentally, data quality validations should be automated as much as possible. Validations should be embedded into the data pipeline code, but in a manner that allows them to be changed easily. When designing a data pipeline, data quality should be a driving factor that heavily influences the development effort. Existing ETL tools, like Informatica, may have data quality check features already built in. However, it's still important to understand how to implement data quality from scratch, to ensure that the checks you implement make sense.
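A minimal sketch of such automated validations, run against a batch of ingested rows before they move downstream, might look like the following; the specific rules (non-null keys, a value range, a freshness window) are placeholder business assumptions.

```python
from datetime import datetime, timedelta, timezone

# Each check returns a list of human-readable violations for the given batch of rows.
def check_required_fields(rows, fields=("order_id", "amount")):
    return [f"row {i}: missing {f}" for i, r in enumerate(rows) for f in fields if r.get(f) is None]

def check_value_range(rows, field="amount", low=0.0, high=10_000.0):
    return [
        f"row {i}: {field}={r[field]} outside [{low}, {high}]"
        for i, r in enumerate(rows)
        if r.get(field) is not None and not (low <= r[field] <= high)
    ]

def check_freshness(rows, field="created_at", max_age=timedelta(days=1)):
    cutoff = datetime.now(timezone.utc) - max_age
    return [f"row {i}: stale record ({r[field]})" for i, r in enumerate(rows) if r[field] < cutoff]

def run_quality_checks(rows):
    violations = check_required_fields(rows) + check_value_range(rows) + check_freshness(rows)
    for v in violations:
        # In a real pipeline this would alert or halt the load rather than just print.
        print("DATA QUALITY:", v)
    return not violations

rows = [
    {"order_id": "A1", "amount": 120.0, "created_at": datetime.now(timezone.utc)},
    {"order_id": None, "amount": -5.0, "created_at": datetime.now(timezone.utc)},
]
print("batch ok:", run_quality_checks(rows))
```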
7. Predictive Analytics
Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling, and machine learning, that analyze current and historical facts to make predictions about future or otherwise unknown events. The science of predictive analytics can generate future insights with a significant degree of precision. With the help of sophisticated predictive analytics tools and models, any organization can now use past and current data to reliably forecast trends and behaviors milliseconds, days, or years into the future.
There are several types of predictive analytics methods available. For example, data mining involves the analysis of large tranches of data to detect patterns from it. Text analysis does the same, except for large blocks of text.
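A small scikit-learn sketch of the basic predictive-modelling loop is given below: fit a model on historical observations, check it on held-out data, then score a future scenario. The features and target (ad spend and store traffic predicting sales) are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical historical data: [ad_spend_k, store_traffic_k] -> monthly_sales_k
X = np.array([[10, 50], [12, 55], [8, 40], [15, 65], [11, 52], [9, 45], [14, 60], [13, 58]])
y = np.array([100, 112, 82, 140, 105, 90, 133, 124])

# Hold out part of the history to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))

# Forecast for a future period with planned spend/traffic (an assumed scenario).
print("predicted sales:", model.predict([[16, 70]]))
```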
8. NoSQL Databases
NoSQL is a non-relational database management system that does not require a fixed schema, avoids joins, and is easy to scale. NoSQL databases are used for distributed data stores with humongous data storage needs, and for big data and real-time web apps; for example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.
NoSQL stands for "Not Only SQL" or "Not SQL." A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL database system, by contrast, encompasses a wide range of database technologies that can store structured, semi-structured, unstructured, and polymorphic data.
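The brief sketch below uses the pymongo driver to show the schema-less style in practice: documents of different shapes live in the same collection and can be queried by any field. It assumes a MongoDB server running on localhost; the database and collection names are placeholders.

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB server
events = client["analytics_demo"]["events"]         # placeholder database and collection

# No fixed schema: these two documents have different fields in the same collection.
events.insert_one({"user": "u42", "type": "click", "page": "/pricing"})
events.insert_one({"user": "u42", "type": "purchase", "amount": 49.0, "items": ["basic-plan"]})

# Query by any field; the driver returns documents as plain dictionaries.
for doc in events.find({"user": "u42"}):
    print(doc)

client.close()
```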
9. Knowledge Discovery and Data Mining (KDD)
Knowledge discovery is the process of analyzing data for the purpose of understanding performance, reporting, predicting, and/or harvesting new knowledge.
Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet and the widespread use of databases have created an immense need for KDD methodologies. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing, to deliver advanced business intelligence and web discovery solutions.
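One small data-mining step from the KDD pipeline is sketched below: after selection and preprocessing, an unsupervised method such as k-means can surface customer groupings that were never labelled in advance. The synthetic features are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: [orders_per_year, average_basket_eur]
X = np.array([[2, 20], [3, 25], [2, 22], [40, 15], [45, 12], [38, 18], [5, 400], [4, 350]])

# Scale the features so neither dominates, then mine for natural groupings.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

for features, label in zip(X, labels):
    print(f"customer {features} -> segment {label}")
```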
10. Stream Analytics
Streaming analytics is the ability to continually calculate statistical analytics while working within the ongoing stream of data that is coming in. It allows people not only to manage and monitor the new data but also to analyze it in real time. Because streaming analytics happens immediately, companies must act on the new data quickly, before it loses its value.
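The compact, standard-library sketch below captures the core idea: statistics are recomputed continuously over a sliding window as events arrive, and action is taken immediately rather than after the data is at rest. The simulated sensor readings and the anomaly rule are assumptions.

```python
import random
import statistics
from collections import deque

def event_source(n=20):
    """Simulated stream of sensor readings arriving one at a time."""
    for _ in range(n):
        yield random.gauss(mu=100, sigma=10)

window = deque(maxlen=10)   # only the most recent 10 events are kept

for reading in event_source():
    window.append(reading)
    mean = statistics.fmean(window)
    stdev = statistics.pstdev(window)
    # Act on the result immediately, e.g. flag an anomaly while it still matters.
    if len(window) == window.maxlen and abs(reading - mean) > 2 * stdev:
        print(f"ALERT: reading {reading:.1f} deviates from rolling mean {mean:.1f}")
    else:
        print(f"reading {reading:.1f}  rolling mean {mean:.1f}")
```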
[More to come ...]