Big Data Platforms and Tools

(University of Michigan at Ann Arbor)

Big Data Platforms & Analytics: 

Leveraging Big Data, Analytics and Machine Learning



- Big Data Platforms

Apache Hadoop is one of the most widely used big data platforms. It's an open-source software platform that stores and processes big data in a distributed computing environment across hardware clusters. This distribution allows for faster data processing.
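The distributed processing Hadoop popularized follows the MapReduce pattern: map over input splits, shuffle by key, then reduce. Below is a minimal single-machine sketch of that pattern in plain Python (not Hadoop's actual API), counting words across two "splits" standing in for blocks spread over a cluster:

```python
# Illustrative sketch (not Hadoop's real API): the map -> shuffle -> reduce
# pattern Hadoop runs across a cluster, shown here on a single machine.
from collections import defaultdict

def map_phase(chunk):
    """Emit (word, 1) pairs for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two "splits" standing in for blocks distributed across the cluster
splits = ["big data platforms", "big data tools"]
mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle_phase(mapped))
print(result)  # {'big': 2, 'data': 2, 'platforms': 1, 'tools': 1}
```

In a real Hadoop job, each map and reduce task runs on a different node, and the framework handles the shuffle, fault tolerance, and data locality.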

Apache Spark is a unified analytics engine for batch processing, streaming data, machine learning, and graph processing. It is one of the most popular big data platforms used by companies in 2023. One of the key benefits that Apache Spark offers is speed.


- Big Data Analytics Platforms

Because the persistent flow of data from numerous sources keeps intensifying, many sophisticated, highly scalable big data analytics platforms, many of them cloud-based, have emerged to parse the ever-expanding mass of information.

Following are some big data analytics platforms to know:

  • Microsoft Azure 
  • Cloudera 
  • Sisense
  • Collibra
  • Tableau
  • MapR
  • Qualtrics
  • Oracle
  • MongoDB
  • Datameer 
  • Etc.


- Apache Hadoop Ecosystem

Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale on industry-standard hardware.

Apache Hadoop is an open-source framework intended to make working with big data easier. Hadoop has found its place in industries and companies that need to work on large, sensitive data sets that require efficient handling. Hadoop enables the processing of large data sets stored across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

The Hadoop ecosystem is a platform or suite that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools and solutions supplement or support these major elements. All of these tools work together to provide services such as the ingestion, analysis, storage, and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem: 

  • HDFS: Hadoop Distributed File System 
  • YARN: Yet Another Resource Negotiator 
  • MapReduce: Programming based Data Processing 
  • Spark: In-Memory data processing 
  • PIG, HIVE: Query based processing of data services 
  • HBase: NoSQL Database 
  • Mahout, Spark MLLib: Machine Learning algorithm libraries 
  • Solr, Lucene: Searching and indexing 
  • Zookeeper: Cluster management 
  • Oozie: Job Scheduling
  • Etc.


Note: Apart from the above-mentioned components, there are many other component tools that are part of the Hadoop ecosystem. 
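At the storage layer of this ecosystem, HDFS splits every file into fixed-size blocks and replicates each block across several nodes for fault tolerance. The toy sketch below (plain Python, not the real HDFS client API; block size and node names are illustrative) shows the core placement idea:

```python
# Illustrative sketch (not the real HDFS API): HDFS splits a file into
# fixed-size blocks and replicates each block on several distinct nodes.
BLOCK_SIZE = 8        # bytes here for illustration; HDFS defaults to 128 MB
REPLICATION = 3       # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        chosen = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        placement[i] = {"data": block, "nodes": chosen}
    return placement

blocks = split_into_blocks(b"hadoop distributed file system")
layout = place_replicas(blocks)
for idx, info in layout.items():
    print(idx, info["data"], info["nodes"])
```

Real HDFS placement is rack-aware (one replica local, two on a remote rack), but the principle is the same: losing any single node never loses a block.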


- Open Source Big Data Tools

Apache Hadoop is designed to support the processing of large data sets in a distributed computing environment. Hadoop can handle big batches of distributed information, but there is often a need for real-time processing of user-generated data such as Twitter or Facebook updates. Financial compliance monitoring is another area where real-time processing is needed, in particular to process market data. Social media and market data are two types of what we call high-velocity data. 

Apache Storm and Apache Spark are two other open-source frameworks that handle such real-time data generated at a fast rate. Both Storm and Spark can integrate with any database or data storage technology. 
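A core building block of such real-time systems is computing aggregates over a sliding window of recent events rather than over a stored batch. The following plain-Python sketch (not Storm's or Spark's actual API; the ticker symbols are made up) shows the idea on a simulated market-data stream:

```python
# Illustrative sketch of stream processing (not Storm's or Spark's API):
# count events per key over a sliding window of the most recent events,
# the core idea behind analytics on high-velocity data such as market feeds.
from collections import Counter, deque

class SlidingWindowCounter:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest events fall off

    def process(self, event):
        """Ingest one event from the stream."""
        self.window.append(event)

    def counts(self):
        """Current per-key counts over the window."""
        return Counter(self.window)

# Hypothetical ticker stream; in Storm this would arrive from a spout
stream = ["AAPL", "MSFT", "AAPL", "AAPL", "GOOG", "MSFT"]
counter = SlidingWindowCounter(window_size=4)
for tick in stream:
    counter.process(tick)
print(counter.counts())  # counts over the last 4 ticks only
```

Storm expresses this as a topology of spouts and bolts, and Spark as micro-batches or structured streams, but both reduce to maintaining state over a moving window of unbounded input.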


  • [Apache Hadoop]: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. 
  • [Apache Storm]: Apache Storm is a free and open-source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language.
  • [Apache Spark]: Apache Spark is a fast and general engine for large-scale data processing. Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010 as an Apache project.


- Apache Spark

Apache Spark is a fast and general processing engine compatible with Hadoop data. It is an open-source, cluster-computing big data framework built around speed, ease of use, and sophisticated analytics.

Spark can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning. 
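Much of Spark's speed comes from its evaluation model: transformations such as map and filter are lazy and only record a plan, which runs in memory when an action is called. The toy class below is a rough plain-Python sketch of that model (names like `ToyRDD` are invented for illustration; this is not the real PySpark API):

```python
# Illustrative sketch of Spark's lazy-evaluation model (not real PySpark):
# transformations only build a plan; nothing runs until an action such as
# collect() executes the whole pipeline in memory.
class ToyRDD:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []            # recorded transformations

    def map(self, fn):                    # transformation: lazy
        return ToyRDD(self.data, self.plan + [("map", fn)])

    def filter(self, pred):               # transformation: lazy
        return ToyRDD(self.data, self.plan + [("filter", pred)])

    def collect(self):                    # action: executes the plan
        items = self.data
        for kind, fn in self.plan:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the recorded plan is a DAG that the engine optimizes and distributes across executors, and intermediate results can be cached in memory between stages, which is what makes iterative workloads like machine learning fast.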

Today, thousands of organizations use Spark in production, many on clusters of thousands of nodes. The largest known cluster has 8,000 nodes. In terms of data size, Spark has been shown to work well up to petabytes.


[More to come ...]
