
Programming Models for Big Data

(Photo: The University of Chicago, Alvin Wei-Cheng Wong)

 

- Programming Languages for Big Data

Programming languages, just like spoken languages, have their own unique structures, formats, and flows. While spoken languages are typically determined by geography, the use of programming languages is determined more by the coder’s preference, IT culture, and business objectives.

There are many programming languages in use today for a variety of purposes, but the four most prominent you’ll see when it comes to big data are Python, R, Java, and Scala.

Some of these languages are better for large-scale analytical tasks while others excel at operationalizing big data and the internet of things.

 

  • Computer Programming Language R: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
  • Computer Programming Language Scala: Scala combines object-oriented and functional programming in one concise, high-level language. Scala's static types help avoid bugs in complex applications, and its JVM and JavaScript runtimes let you build high-performance systems with easy access to huge ecosystems of libraries.
 

- Big Data Programming Models

Big data programming models represent the style of programming and present the interface paradigm for developers to write big data applications and programs. Programming models are normally the core feature of big data frameworks, as they implicitly affect the execution model of big data processing engines and also drive the way users express and construct big data applications and programs.

A programming model is an abstraction over the underlying machinery or infrastructure: a set of abstract runtime libraries and programming languages that form a model of computation. This abstraction can be low-level, as in machine language, or very high-level, as in a high-level programming language such as Java. So if the enabling infrastructure for big data analysis is the distributed file system, as mentioned earlier, then a programming model for big data should enable the programmability of operations within distributed file systems. In other words, it should make it possible to write computer programs that work efficiently on top of distributed file systems, and make it easy to cope with all the potential issues.
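To make this concrete, the following minimal Python sketch (not any real framework; the directory name, the "part-*" file pattern, and the helper functions are hypothetical) illustrates what such a model abstracts away: the data lives in many partition files, the user supplies only the per-record logic, and the model locates the partitions and runs the work in parallel.

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def count_words(line):
    """Example of user-supplied per-record logic: count the words in one line."""
    return len(line.split())

def process_partition(path, record_fn):
    """Apply the user function to every record (line) in one partition file."""
    with open(path) as f:
        return [record_fn(line.rstrip("\n")) for line in f]

def run_on_partitioned_data(partition_dir, record_fn, workers=4):
    """The 'programming model' part: find all partition files and schedule one
    parallel task per partition, hiding the parallelism from the user."""
    paths = sorted(Path(partition_dir).glob("part-*"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_partition, p, record_fn) for p in paths]
        return [future.result() for future in futures]

if __name__ == "__main__":
    # "/data/logs" is a hypothetical stand-in for a directory of partitioned
    # files in a distributed file system.
    results = run_on_partitioned_data("/data/logs", count_words)
    print(results)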


MapReduce refers to a programming model that is suitable for processing large amounts of data. It is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. For example, Hadoop can execute a MapReduce program written in several programming languages, including Java, C++, Python, and Ruby. MapReduce is designed to process a huge volume of data efficiently by connecting many commodity computers together to work in parallel, tying smaller and more reasonably priced machines into a single cost-effective cluster.
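As an illustration of the model, the classic word-count example below is a single-machine Python sketch (not how Hadoop itself is invoked) that separates the map, shuffle, and reduce phases; a framework such as Hadoop would run these phases in parallel across many cluster nodes and move the intermediate data over the network.

from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) for every word in one input record."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all counts for one word."""
    return key, sum(values)

if __name__ == "__main__":
    documents = ["big data needs big clusters", "data drives big decisions"]
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(counts)  # e.g. {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1, 'drives': 1, 'decisions': 1}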

 

- The Requirements for Programming Models

Based on everything we have discussed so far, let's describe the requirements for big data programming models:

  • Support for common big data operations, such as splitting large volumes of data. This means partitioning and placing data in and out of computer memory, along with a model to synchronize the datasets later on.
  • Fast access to data. The model should allow fast distribution of data to nodes within a rack, and these are potentially the data nodes the computation is moved to. This means scheduling many parallel tasks at once.
  • Reliability and fault tolerance. The model should enable programmable replication and recovery of files when failures occur.
  • Scalability. The model should scale easily to the distributed nodes where the data is produced, and it should allow new resources to be added in order to take advantage of distributed computers and scale to more or faster data without losing performance. This is called scaling out.
  • Support for a variety of data types, such as documents, graphs, tables, and key-value pairs. Not every type of data may be supported by a particular model, but each model should be optimized for at least one of these types.
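The following minimal Python sketch (illustrative only; the node names, partition size, and replication factor are assumptions, not any particular framework's API) shows two of the requirements above: splitting a dataset into fixed-size partitions and placing each partition on several nodes so that a single node failure does not lose data.

def split_into_partitions(records, partition_size):
    """Split a large collection of records into fixed-size partitions."""
    return [records[i:i + partition_size] for i in range(0, len(records), partition_size)]

def place_with_replication(partitions, nodes, replication=3):
    """Assign each partition to `replication` distinct nodes (simple round-robin
    placement; replication must not exceed the number of nodes)."""
    placement = {}
    for idx, _ in enumerate(partitions):
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return placement

if __name__ == "__main__":
    records = list(range(10_000))                   # stands in for a large dataset
    parts = split_into_partitions(records, 1_000)   # 10 partitions
    nodes = ["node-1", "node-2", "node-3", "node-4"]
    print(place_with_replication(parts, nodes))
    # If node-2 fails, every partition it held still has two live replicas,
    # which is what allows the framework to recover files and reschedule tasks.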

 

- Programming Models and Systems for Big Data Analysis

Big Data analysis refers to advanced and efficient data mining and machine learning techniques applied to large amounts of data. Research work and results in the area of Big Data analysis are continuously growing, and more and more new and efficient architectures, programming models, systems, and data mining algorithms are being proposed. The most popular programming models for Big Data analysis today are MapReduce, directed acyclic graph (DAG), message passing, bulk synchronous parallel (BSP), workflow, and SQL-like models. Such systems can be compared using four classification criteria (level of abstraction, type of parallelism, infrastructure scale, and classes of applications) to help developers and users identify and select the best solution according to their skills, hardware availability, productivity, and application needs.

 

 

[More to come ...]


