
Structured, Semi-Structured, and Unstructured Data

 

 

- Overview

Data is a set of facts such as descriptions, observations, and numbers used in decision-making. We can classify data as structured, unstructured, or semi-structured data.
 
  • Structured data: Data that is stored in a predefined format and is easy to organize. Structured data is often quantitative or numerical, and can be understood by both humans and machines.
  • Semi-structured data: Data that combines features of both structured and unstructured data. Semi-structured data is more flexible than structured data, but less flexible than unstructured data.
  • Unstructured data: Data that is stored in its native format and has no predefined organizational form or specific format. Unstructured data is often complex and qualitative, and can't be organized in a relational database.

 

- Unstructured Data 

Unstructured data can be thought of as data that is not actively managed in a transactional system; for example, data that does not exist in a relational database management system (RDBMS). 

Unstructured data is information that either is not organized in a predefined manner or does not have a predefined data model. It is typically text-heavy but may also contain data such as numbers, dates, and facts. Video, audio, and binary data files may lack a specific structure, so they are also classified as unstructured data.

Some common forms of unstructured data include text files, Word documents, PDF files, audio and video transcripts, PowerPoint presentations, SlideShare decks, and audio files.
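
To illustrate how text-heavy data can still carry numbers and dates, here is a minimal Python sketch that pulls them out of a free-text note with regular expressions. The note text and the function names `extract_dates` and `extract_numbers` are hypothetical and chosen only for this example; real documents would need far more robust parsing.

```python
import re

# Hypothetical free-text note; not taken from any real data set.
note = "Meeting on 2023-07-11: budget raised to 4500 USD, 12 attendees confirmed."

DATE_PATTERN = r"\b\d{4}-\d{2}-\d{2}\b"


def extract_dates(text: str) -> list[str]:
    """Return ISO-style dates (YYYY-MM-DD) found in free text."""
    return re.findall(DATE_PATTERN, text)


def extract_numbers(text: str) -> list[int]:
    """Return standalone integers, ignoring digits that belong to a date."""
    without_dates = re.sub(DATE_PATTERN, " ", text)
    return [int(n) for n in re.findall(r"\b\d+\b", without_dates)]


print(extract_dates(note))
print(extract_numbers(note))
```

The text itself has no schema; the structure is imposed by the patterns the reader chooses to look for.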


- Structured Data

Structured data is generally tabular data that is represented by columns and rows in a database. Databases that hold tables in this form are called relational databases. 

Structured data can be thought of as records (or transactions) in a database environment; for example, rows in a SQL database table. 
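
As a small illustration of rows in a SQL table, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table, its columns, and the sample rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table definition fixes the columns and their types in advance.
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer TEXT NOT NULL, "
    "amount REAL NOT NULL)"
)

# Each tuple becomes one row (record) in the table.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# Because every row has the same columns, aggregation is a one-line query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```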

 

- Semi-structured Data

Semi-structured data is information that does not conform to a relational (tabular) model but still has some structure to it, such as keys or tags that label each value. A common example is documents held in JavaScript Object Notation (JSON) format. Data held in key-value stores and graph databases is also commonly treated as semi-structured.
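
The following Python sketch shows what "some structure" can look like in practice: two JSON documents share some keys but not others. The documents and the helper `field_names` are hypothetical and exist only for illustration.

```python
import json

# Two hypothetical documents: both have "id" and "name",
# but only one has "email" and only the other has "tags".
raw = """[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Lin", "tags": ["admin", "ops"]}
]"""

docs = json.loads(raw)


def field_names(documents: list[dict]) -> list[str]:
    """Return every key that appears in at least one document."""
    return sorted({key for doc in documents for key in doc})


# Missing fields are handled at read time, here by falling back to None.
emails = [doc.get("email") for doc in docs]

print(field_names(docs))
print(emails)
```

Each document is self-describing through its keys, yet no fixed set of columns is enforced across documents.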

 

- Structured Data Vs. Unstructured Data

We store structured data in a predefined format (such as a table) because it is highly specific and easy to organize and analyze. 

Unstructured data is a compilation of many types of data, such as text, audio, and video, stored in a raw format, making it more difficult to organize and analyze. 

NoSQL databases are highly scalable and flexible database management systems that can store and process unstructured and semi-structured data.

Neither kind of data is inherently preferable; tools exist for accessing the information in both. Unstructured data often carries richer detail than structured data, although that detail is harder to organize and analyze.

Examples of unstructured data are:

  • Rich Media: media and entertainment data, surveillance data, geospatial data, audio, weather data
  • Document Collections: invoices, records, emails, productivity apps
  • Internet of Things (IoT): sensor data, market data
  • Analytics: data used for machine learning and artificial intelligence (AI)

Before the advent of object-based storage, most, if not all, unstructured data was stored in file-based systems.

 

- Database Schema and Database Schema Design

Database schema refers to the logical and visual configuration of the entire relational database. Database objects are usually grouped and displayed as tables, functions, and relationships. The schema describes the organization and storage of data in the database and defines the relationships between tables. 

A database schema represents the structure or the organization of data in a database management system. A database schema includes descriptive details of the database that can be described through a schema diagram.

Database schema design provides a blueprint for developing a database architecture so that large amounts of information can be stored systematically. It also refers to the strategies and best practices involved in constructing a database. Database schema design makes data easier to use, interpret, and retrieve by organizing data into separate entities and identifying relationships between organized entities.

 

- Data Modeling

Database designers create database schemas to help programmers interact effectively with the database. The process of designing a schema is called data modeling. To design a database schema, you collect information and arrange it into tables, rows, and columns. You need to organize information to make it easier to understand, relate, and use.
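
As a rough sketch of what such a design might produce, the Python example below defines two related tables with the built-in `sqlite3` module. The `customers` and `orders` tables, their columns, and the helper `try_orphan_order` are assumptions invented for illustration, not a recommended design.

```python
import sqlite3

# Hypothetical schema: two entities and the relationship between them.
SCHEMA = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

conn.execute("INSERT INTO customers (customer_id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 20.0)")


def try_orphan_order(connection: sqlite3.Connection) -> bool:
    """Try to insert an order for a customer that does not exist."""
    try:
        connection.execute(
            "INSERT INTO orders (customer_id, amount) VALUES (99, 5.0)"
        )
        return True
    except sqlite3.IntegrityError:
        # The relationship defined in the schema rejects the row.
        return False


joined = conn.execute(
    "SELECT c.name, o.amount FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id"
).fetchall()
print(joined)
print(try_orphan_order(conn))
```

Separating customers from orders keeps each fact in one place, and the declared relationship lets the database reject inconsistent rows.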

 

- Methods of Storing and Managing Data

"Schema-on-write" and "schema-on-read" are two different ways of storing and managing data. Schema-on-write means that the schema or structure of the data is defined when the data is written to the database. Schema-on-read means defining the schema when reading data from the database.

Structured data is usually stored using schema-on-write mode. This is because the schema is known in advance and can be used to optimize data storage and performance.

Unstructured data is typically stored using schema-on-read. This is because the schema is not known in advance and may need to change frequently. 

Which method is best for a particular application depends on the application's specific needs.
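
The contrast between the two approaches can be sketched in Python as follows. This is only an illustration under stated assumptions: the sensor records, the `readings` table, and the helpers `write_with_schema` and `read_with_schema` are hypothetical, and a plain list of JSON strings stands in for a raw data store.

```python
import json
import sqlite3

# Hypothetical incoming records; the second one lacks a "reading" field.
records = [
    {"sensor": "t1", "reading": 21.5},
    {"sensor": "t2"},
]

# Schema-on-write: the structure is enforced when data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor TEXT NOT NULL, reading REAL NOT NULL)"
)


def write_with_schema(connection: sqlite3.Connection, record: dict) -> bool:
    """Insert a record; return False if it violates the table schema."""
    try:
        connection.execute(
            "INSERT INTO readings (sensor, reading) VALUES (?, ?)",
            (record.get("sensor"), record.get("reading")),
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = [write_with_schema(conn, r) for r in records]

# Schema-on-read: everything is stored raw, and structure is applied later.
raw_store = [json.dumps(r) for r in records]


def read_with_schema(raw_lines: list[str], default_reading=None) -> list[tuple]:
    """Interpret raw JSON lines as (sensor, reading) rows at read time."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append((doc["sensor"], doc.get("reading", default_reading)))
    return rows


print(accepted)
print(read_with_schema(raw_store))
```

With schema-on-write the incomplete record is rejected up front; with schema-on-read it is kept, and the reader decides how to treat the missing field.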

 



 
