Big Data Tools and Techniques



Big Data Collection Methods


- The General Steps To Collect Big Data

Today, many companies collect big data to analyze and interpret daily transactions and traffic data, aiming to keep track of operations, forecast needs or implement new programs. But how is big data actually collected? With so many data collection methods available, it is easy to feel confused. The following are the general steps to collect big data:

  • Step 1: Gather data according to different purposes. 
  • Step 2: Store the data, placing it in databases or storage services for further processing. 
  • Step 3: Clean up the data, which includes sorting, concatenating and merging it.
  • Step 4: Reorganize the data, turning unstructured or semi-structured formats into structured formats that can be loaded into systems such as Hadoop and HDFS. 
  • Step 5: Verify the data to make sure it is correct and makes sense. 


These are the general steps to collect big data. However, collecting the data, analyzing it and gleaning insights into markets is not as easy as it seems. 
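The five steps above can be sketched as a small pipeline. This is a minimal illustration with hypothetical in-memory data; in practice each stage would be backed by real databases and distributed storage.

```python
# A minimal sketch of the five collection steps; a Python list stands in
# for the database or object store, and the records are made up.
import json

def gather():
    # Step 1: gather raw records from (here, simulated) sources.
    return [' {"user": "a", "amount": "10"} ', None, ' {"user": "b", "amount": "x"} ']

def store(records, storage):
    # Step 2: put the raw records into a storage layer.
    storage.extend(records)

def clean(storage):
    # Step 3: drop empty entries and strip stray whitespace.
    return [r.strip() for r in storage if r]

def reorganize(records):
    # Step 4: turn semi-structured text (JSON strings) into structured rows.
    return [json.loads(r) for r in records]

def verify(rows):
    # Step 5: keep only rows whose fields make sense (numeric amounts here).
    return [row for row in rows if row["amount"].isdigit()]

storage = []
store(gather(), storage)
rows = verify(reorganize(clean(storage)))
print(rows)  # [{'user': 'a', 'amount': '10'}]
```

Note how the record with the non-numeric amount is dropped only at the final verification step, after cleaning and restructuring have already normalized the data.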


- Big Data Collection Tools

Thanks to great advances in technology and the Internet of Things (IoT), it is now easier than ever to collect, process and analyze data. Transactional records, web analytics, social media, maps and loyalty cards are all channels through which data can be collected. It’s all about personalization: businesses must be able to analyze the data collected and then use it to customize their marketing efforts to target specific customers and, in turn, run highly effective campaigns.

Data collection tools like Octoparse help make this process much easier. They allow users to gather clean, structured data automatically, so there is no need to clean it up or reorganize it afterwards. Octoparse is a tool for data extraction (web crawling and data scraping) that lets you turn web pages into a structured format. After the data is collected, it can be stored in cloud databases, which can be accessed anytime from anywhere.
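Octoparse itself is a point-and-click product, but the core idea of web data extraction, turning HTML into structured records, can be illustrated with the Python standard library. The HTML snippet and the `price` class name below are made up for the example.

```python
# A stripped-down illustration of web data extraction (not Octoparse itself):
# parse prices out of HTML into a structured list using only the stdlib.
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

html = ('<ul><li><span class="price">$9.99</span></li>'
        '<li><span class="price">$19.50</span></li></ul>')
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['$9.99', '$19.50']
```

Real crawlers add fetching, pagination, retries and scheduling on top of this extraction step, which is exactly the machinery that hosted tools package up.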

Here are the various techniques and methods to help businesses collect data about their customers:

  • Transactional Data - Transactional data includes multiple variables, such as what, how much, how and when customers purchased as well as what promotions or coupons they used.
  • Online Marketing Analytics - Every time a user browses a website, information is collected. For example, Google Analytics can provide a great deal of demographic insight on each visitor. This information is useful in building marketing campaigns, as well as in website performance analysis.
  • Social Media - In today’s day and age, most people use social media in one form or another, and nearly every aspect of our lives is affected by it. Social media is used frequently and in many ways: networking, procrastinating, gossiping, sharing, educating, gaming, etc.


Big Data Platforms and Tools


Big Data tools bring cost efficiency and better time management to data analytics tasks. 


- Open Source Big Data Tools

Apache Hadoop is designed to support the processing of large data sets in a distributed computing environment. Hadoop can handle big batches of distributed information, but there is often a need for real-time processing of user-generated data such as Twitter or Facebook updates. Financial compliance monitoring is another area where real-time processing is needed, in particular for market data. Social media and market data are two types of what we call high-velocity data. 

Apache Storm and Spark are two other open source frameworks that handle such real-time data generated at a fast rate. Both Storm and Spark can integrate with almost any database or data storage technology. 


  • [Apache Hadoop]: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. 
  • [Apache Storm]: Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language.
  • [Apache Spark]: Apache Spark is a fast and general engine for large-scale data processing. Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010 as an Apache project.
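The difference between batch and stream processing can be illustrated without either framework. The toy sketch below is a single-process, pure-Python stand-in for streaming word count, the canonical example used by both Storm (tuple-at-a-time) and Spark (micro-batches); real deployments distribute this work across a cluster.

```python
# A toy sketch of micro-batch stream processing: group an incoming feed
# into small batches and update a running aggregate per batch. The feed
# contents are simulated.
from collections import Counter

def micro_batches(stream, size):
    """Group an incoming stream of messages into fixed-size micro-batches."""
    batch = []
    for msg in stream:
        batch.append(msg)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

counts = Counter()
incoming = ["big data", "fast data", "big fast data"]  # simulated feed
for batch in micro_batches(incoming, size=2):
    for message in batch:
        counts.update(message.split())

print(counts["data"])  # 3
```

The key property is that `counts` is always up to date after each small batch, rather than only after one large nightly batch job, which is what "real-time" means in this context.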


Big Data Processing Methodology

- Overview

Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured, time sensitive or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach, called big data processing, which uses massive parallelism on readily available hardware.  


- The Traditional ETL Methodology

Traditional data was normally processed using the Extract, Transform, Load (ETL) methodology, which was used to collect the data from outside sources, modify the data to fit needs, and then upload the data into the data storage system for future use. Technologies such as spreadsheets, relational (RDBMS) databases and Structured Query Language (SQL) were all initially used to carry out these tasks.
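A compact ETL example can be written with the Python standard library alone. The CSV input, table name and the cents conversion below are illustrative, but the extract/transform/load split is exactly the traditional pattern.

```python
# A minimal ETL sketch: extract rows from a CSV source, transform them,
# and load them into a relational database (SQLite here).
import csv
import io
import sqlite3

raw = "name,price\nwidget,9.99\ngadget,19.50\n"  # simulated outside source

# Extract: read the source into rows.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: convert prices to integer cents so they fit an INTEGER column.
for row in rows:
    row["cents"] = int(round(float(row["price"]) * 100))

# Load: upload the shaped rows into the storage system.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, cents INTEGER)")
db.executemany("INSERT INTO products VALUES (:name, :cents)", rows)

total = db.execute("SELECT SUM(cents) FROM products").fetchone()[0]
print(total)  # 2949
```

The transform step runs before loading because the relational schema is fixed in advance; that rigidity is precisely what becomes a bottleneck for big data, motivating the MAD approach described next.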


- The MAD Process

However, for big data, the traditional methodology is both inefficient and insufficient to meet the demands of modern use. Therefore, the Magnetic, Agile, Deep (MAD) process is used to collect and store data. The needs and benefits of such a system are: attracting all data sources regardless of their quality (magnetic); logical and physical storage contents that adapt to the rapid evolution of big data (agile); and support for the complex algorithmic statistical analysis that big data requires, on very short notice (deep). 


- The Computing Power To Support the MAD

The technology used to perform data storage using the MAD process requires vast amounts of processing power, which is very difficult to assemble in a single physical space or unit for organizations that cannot afford supercomputers, i.e. most entities outside governments and research institutions. Therefore, most big data solutions rely on two major components to store and process data: distributed systems and Massively Parallel Processing (MPP), often running on non-relational (in-memory) database systems. 


Big Data Storage Architecture


Big data storage is a storage infrastructure that is designed specifically to store, manage and retrieve massive amounts of data, or big data. Big data storage enables the storage and sorting of big data in such a way that it can easily be accessed, used and processed by applications and services working on big data. Big data storage is also able to flexibly scale as required. 

Big data storage primarily supports storage and input/output operations on a very large number of data files and objects. A typical big data storage architecture is made up of a redundant and scalable supply of direct-attached storage (DAS) pools, scale-out or clustered network-attached storage (NAS), or an infrastructure based on an object storage format. The storage infrastructure is connected to computing server nodes that enable quick processing and retrieval of large quantities of data. Moreover, most big data storage architectures have native support for big data analytics solutions such as Hadoop, Cassandra and NoSQL databases.


- Direct Attached Storage (DAS) Pools

Direct-attached storage (DAS) is a type of storage that is attached directly to a computer without going through a network. The storage might be connected internally or externally. Only the host computer can access the data directly. Other devices must go through the host computer to work with the data. 

Most servers, desktops and laptops contain an internal hard disk drive (HDD) or solid-state drive (SSD). Each of these devices is a form of direct-attached storage. Some computers also use external DAS devices. In some cases, an enterprise server might connect directly to drives that are shared by other servers. 

A direct-attached storage device is not networked. There are no connections through Ethernet or Fibre Channel (FC) switches, as is the case for network-attached storage (NAS) or a storage area network (SAN). 

An external DAS device connects directly to a computer through an interface such as Small Computer System Interface (SCSI), Serial Advanced Technology Attachment (SATA), Serial-Attached SCSI (SAS), FC or Internet SCSI (iSCSI). The device attaches to a card plugged into an internal bus on the computer. 

Other types of storage, such as optical devices and tape, are technically DAS as they are directly attached to a system, either internally or externally. However, references to DAS are usually related to storage devices such as HDDs or SSDs.

- Network Attached Storage (NAS)

NAS systems are rapidly becoming popular with enterprises and small businesses in many industries as an effective, scalable, low-cost storage solution. 

A NAS system is a storage device connected to a network that allows storage and retrieval of data from a centralized location for authorized network users and heterogeneous clients. NAS systems are flexible and scale out, meaning that as you need additional storage, you can add on to what you have. NAS is like having a private cloud in the office: it is faster, less expensive and provides the benefits of a public cloud on site, giving you complete control.



[More to come ...]

