
Programming Models for Big Data

(The University of Chicago, Alvin Wei-Cheng Wong)


- Overview

A programming model is an abstraction of the underlying machinery or infrastructure. It is a set of abstract runtime libraries and programming languages that form a model of computation. The abstraction level can be low, as in machine language, or very high, as in high-level programming languages such as Java. So if the enabling infrastructure for big data analysis is distributed file systems, as mentioned, then the programming model for big data should make the operations within distributed file systems programmable. By this we mean being able to write computer programs that work efficiently on top of distributed file systems using big data, while making it easy to cope with all the potential issues.

The Big Data programming model represents a programming style that provides an interface paradigm for developers to write big data applications and programs. Programming models are often core features of big data frameworks, as they implicitly influence the execution model of big data processing engines and also drive the way users express and build big data applications and programs.


- Programming Languages for Big Data

Programming languages, just like spoken languages, have their own unique structures, formats, and flows. While spoken languages are typically determined by geography, the use of programming languages is determined more by the coder’s preference, IT culture, and business objectives.

There are many programming languages used today for a variety of purposes, but the four most prominent you'll see when it comes to big data are Python, R, Java, and Scala.

Some of these languages are better suited to large-scale analytical tasks, while others excel at operationalizing big data and the Internet of Things.

  • Computer programming language R: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows, and macOS.
  • Computer programming language Scala: Scala combines object-oriented and functional programming in one concise, high-level language. Scala's static types help avoid bugs in complex applications, and its JVM and JavaScript runtimes let you build high-performance systems with easy access to huge ecosystems of libraries.

- Big Data Programming Models

Today, data flows are abundant thanks to emerging technologies such as cloud computing, edge computing, and the Internet of Things (IoT). Many industries, including healthcare, government, media and entertainment, manufacturing, and the Internet of Things, generate vast amounts of data every day. 

These industries need analytical models to improve their operational efficiency. The data generated by these industries is called big data because it is not only large, but also fast and in various formats. Organizations such as McDonald's, Amazon, and Walmart are investing in big data applications to examine large data sets to reveal hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. 

In big data programming, data-driven parallel programs are written by users to execute in large-scale and distributed environments. Many programming models are available for big data with different focuses and advantages. Software developers use programming models to build applications. 

Regardless of the programming language and supported application programming interfaces (APIs), the programming model connects the underlying hardware architecture with the software. IoT applications, such as smart homes, wearables, and smart cities, generate massive amounts of data for processing. Analyzing such a huge amount of data is a significant challenge.


- The Requirements for Programming Models

Disruptive technologies such as cloud computing, blockchain, distributed machine learning, artificial intelligence, and deep learning are used in almost every application in today's computing systems. Together they generate large amounts of data with widely different computational requirements.

What are the requirements for the big data programming model?

  • First, such a big data programming model should support common big data operations, such as splitting large volumes of data. This means partitioning the data, moving it in and out of computer memory, and synchronizing the resulting partitions of the dataset later.
  • Access to data should be fast. The model should allow quick distribution of data to nodes within a rack (the data nodes to which we move computation), which means scheduling many parallel tasks at once.
  • It should also enable reliable, fault-tolerant computing. This means it should support programmable replication and file recovery when needed.
  • It should be easy to extend to distributed nodes of data. The model should also be able to add new resources to take advantage of additional distributed computers and scale to more, or faster, data without losing performance; this is called scaling out.
  • Finally, since there are many different types of data (documents, graphs, tables, key-value stores, etc.), the programming model should be able to operate on specific collections of these types. Not every type of data needs to be supported by a particular model, but each model should be optimized for at least one type.
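The first two requirements can be sketched in a few lines of plain Python. This is only an illustration, not a real big data system: the function names are made up, and a local thread pool stands in for a cluster scheduler that would place each task on the node holding its partition.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_parts):
    """Requirement 1: split a large dataset into roughly equal chunks."""
    size = (len(data) + num_parts - 1) // num_parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    """A stand-in analysis task: sum one partition."""
    return sum(chunk)

data = list(range(1_000))
chunks = partition(data, 4)

# Requirement 2: schedule many parallel tasks at once. A real system
# would move each task to the data node holding its chunk.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)  # synchronize the partial results
print(total)  # 499500
```

Replication and recovery (the fault-tolerance requirement) would sit below this layer, in the distributed file system itself.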


- Programming Models and Systems for Big Data Analysis

Big Data analysis refers to advanced and efficient data mining and machine learning techniques applied to large amounts of data. Research work and results in the area of Big Data analysis are continuously growing, and more and more new and efficient architectures, programming models, systems, and data mining algorithms are being proposed.


The most popular programming models for Big Data analysis today are MapReduce, Directed Acyclic Graph (DAG), Message Passing, Bulk Synchronous Parallel (BSP), Workflow, and SQL-like. Such systems can be compared using four classification criteria (level of abstraction, type of parallelism, infrastructure scale, and classes of applications) to help developers and users identify and select the best solution according to their skills, hardware availability, productivity, and application needs.
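Of these, the DAG model expresses a job as a set of tasks with dependencies and runs them in dependency order. A minimal sketch, assuming a hypothetical four-stage analysis pipeline (the stage names are invented for illustration; `graphlib` is in the Python standard library from 3.9):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it consumes.
dag = {
    "load": set(),
    "clean": {"load"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# A DAG engine executes tasks in a topological order, running
# independent tasks in parallel; here we only compute the order.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['load', 'clean', 'aggregate', 'report']
```

Real DAG engines add scheduling, data movement, and fault tolerance on top of exactly this ordering step.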


- Programming Models and Algorithms for Big Data

Efficient big data management is the grand vision of modern computing as it empowers millions of smart, connected devices that can communicate with each other and gradually control our world.

Big data analytics is not a single computing paradigm; rather, it serves as an enabling computing technology for various industries such as smart cities, transportation, intelligent systems, energy management systems, healthcare applications, and more. Technically, electronic health records (EHRs) and electronic health applications are considered a prime example of big data applications: they generate huge amounts of data every second which, when processed efficiently, can drive the entire functioning of electronic health services.

Therefore, in response to the growing demand for big data innovation, data scientists around the world should focus on advanced big data programming models and algorithms that can rapidly learn and automate big data analysis processes in real-time, data-intensive applications.

In addition, such models should also facilitate effective communication, forecasting, and decision-making. Although existing big data analysis methods perform reasonably well, the growing efficiency demands of current technology applications have greatly reduced the viability of traditional big data programming models, which now require improved security and assurance functions.

Exploring high-level programming models for big data is the only way to handle these massive amounts of data properly. Here are some themes of innovative programming models and algorithms for big data:

  • Energy-efficient programming and computing models for IoT-related big data applications
  • Programming model and algorithm progress of big data open platforms
  • Innovative big data programming models and algorithms beyond Hadoop/MapReduce
  • Efficient big data programming models and algorithms for big data search
  • Programming models and algorithms for big data visualization analysis and applications
  • Programming models for big-data-assisted linking and graph mining
  • Semantics-based big data mining programming models and algorithms
  • Secure big data analytics and algorithmic models for privacy-preserving data-intensive applications
  • Algorithms and efficient programming models for multimedia big data analysis and management processes
  • New and innovative big data computing models
  • High-performance/parallel-computing-assisted programming models for big data


- MapReduce

MapReduce refers to a programming model suitable for processing large amounts of data. It is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. For example, Hadoop is capable of executing MapReduce programs written in several programming languages, including Java, C++, Python, Ruby, and others.

MapReduce is designed to process a huge volume of data efficiently by connecting many commodity computers together to work in parallel. In doing so, MapReduce ties together smaller and more reasonably priced machines into a single cluster.
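The model is easiest to see in the classic word count example. The sketch below is plain Python, not Hadoop's actual API: `map_phase` emits key-value pairs from one input split, `shuffle` groups the pairs by key, and `reduce_phase` combines each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)

documents = ["big data big models", "data models"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 2, 'data': 2, 'models': 2}
```

In a real cluster, each map task runs on the node holding its split, and the framework performs the shuffle across the network; the user supplies only the map and reduce functions.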



[More to come ...]
