
Training Data, Labeled Data, Unlabeled Data

[Rice University]


- Overview

In machine learning, training data is the data you use to train a machine learning algorithm or model. Preparing training data requires some human involvement to analyze or process it for use in machine learning. How people get involved depends on the type of machine learning algorithm you use and the type of problem it is intended to solve.

  • With supervised learning, humans participate in selecting the data features to be used in the model. The training data must be labeled (that is, enriched or annotated) to teach the machine how to recognize the outcomes your model is designed to detect.
  • Unsupervised learning uses unlabeled data to find patterns in the data, for example by clustering similar data points or inferring other hidden structure.
  • Semi-supervised learning is a hybrid of supervised and unsupervised learning. The model has a relatively small dataset with labels and a larger dataset without labels. The goal is to learn relationships from the small amount of labeled data and then extend and refine those relationships using the unlabeled data.
  • Reinforcement learning differs from the previous approaches in that it does not require a prepared training dataset. Instead, the model learns by acting and receiving feedback through a reward system.
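
To make the difference between labeled and unlabeled training concrete, here is a minimal sketch using only the Python standard library. The toy data (a single numeric feature, such as an animal's weight) and the function names `fit_centroids`, `predict`, and `two_means` are illustrative assumptions, not part of any particular library. The supervised part learns one average value per label; the unsupervised part is a basic two-cluster k-means loop that never sees a label.

```python
from statistics import mean

# Hypothetical toy data: one numeric feature per sample (e.g. weight in kg).
labeled = [(4.0, "cat"), (5.0, "cat"), (30.0, "dog"), (35.0, "dog")]
unlabeled = [4.5, 5.5, 29.0, 33.0]


def fit_centroids(pairs):
    """Supervised: compute the mean feature value for each label."""
    by_label = {}
    for x, y in pairs:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split xs into two clusters with a basic k-means loop.

    Assumes xs contains at least two distinct values.
    """
    c0, c1 = min(xs), max(xs)
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return g0, g1


centroids = fit_centroids(labeled)   # needs labels
clusters = two_means(unlabeled)      # needs no labels
```

Note that the clusters found without labels have no names attached; a person still has to decide what each group means.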


- Data Labeling and Training Data

In supervised settings, training data is labeled data used to teach an AI model or ML algorithm to make correct decisions.

Data labeling in ML means annotating unlabeled data (such as photos, text files, or videos) with one or more informative labels that provide context, so that machine learning models can learn from it. A label might say, for example, whether a photo shows a bird or a car, which words were said in a recording, or whether a tumor is visible on an X-ray. Data labeling is required for many use cases, such as computer vision, natural language processing, and speech recognition.
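
As a rough illustration of what "adding labels to raw data" can look like in code, the sketch below pairs a reference to a raw item with a list of labels. The `Sample` class, the `annotate` helper, and the file path are hypothetical names chosen for this example; real labeling tools use their own formats.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    source: str                                        # path or ID of the raw item
    labels: List[str] = field(default_factory=list)    # empty means unlabeled

    @property
    def is_labeled(self) -> bool:
        return bool(self.labels)


def annotate(sample: Sample, *labels: str) -> Sample:
    """Return a new Sample with the given labels added; the input is not modified."""
    return Sample(sample.source, sample.labels + list(labels))


raw = Sample("photos/0001.jpg")      # hypothetical unlabeled photo
tagged = annotate(raw, "bird")       # the same photo after labeling
```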

For example, if you are building a model for a self-driving car, the training data will include images and videos labeled to identify cars, street signs, and people. If you're creating a customer service chatbot, the data could be all the different ways to ask "what's my account balance?", captured as text and audio and then translated into different languages.

Training data is critical to the success of any AI model or project. Think of it as garbage in, garbage out: if you train a model on poor-quality data, you cannot expect it to perform well.

You may have the best-suited algorithm, but if you train your machine on bad data, it will learn the wrong lessons, fail to live up to expectations, and not work as you (or your clients) intended. Your success depends almost entirely on your data.


- AI Training Data - Part of a Continuous Flywheel

The development process of AI is like a continuous flywheel, and data is what keeps the flywheel turning. Since everything starts with AI training data, that data has to be of high quality before you can proceed confidently with AI-based methods.

Whether you are examining what went right, what went wrong, or why your model behaved as it did, many of the issues you find will eventually trace back to the quality, quantity, and completeness of your AI training data.

Taking self-driving cars as an example, how can a model learn correctly if its training data does not distinguish a car from a street sign? It cannot reasonably be expected to.

So how does training data affect the other parts of the AI development flywheel? Once you have trained your model, you'll want to verify that it was trained correctly. You will need test data to see how it performs, and then you may need more training data to further tune the model in areas where it did not or could not make accurate predictions.

Once your model behaves the way you want it to, it becomes critical to update it regularly so that it keeps pace with changes in human behavior.
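
The verification step can be sketched as follows: hold back part of the labeled data as a test set, then count errors per label to see where more training data may be needed. This is a minimal sketch using the standard library; `train_test_split` and `errors_by_label` are illustrative helper names written for this example, and the "model" is assumed to be any function that maps an input to a predicted label.

```python
import random
from collections import Counter


def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of samples and hold back a fraction as test data."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def errors_by_label(model, test_set):
    """Count misclassified test samples, grouped by their true label."""
    misses = Counter()
    for x, y in test_set:
        if model(x) != y:
            misses[y] += 1
    return misses


# Hypothetical labeled data and a deliberately weak model for demonstration.
data = [(i, "even" if i % 2 == 0 else "odd") for i in range(20)]
train, test = train_test_split(data)
misses = errors_by_label(lambda x: "even", test)
```

Labels with many misses point to the areas where collecting or labeling more training data is most likely to help.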


- Comparing Labeled and Unlabeled Data

Labeled data is a set of samples that have each been tagged with one or more labels. Labeling typically takes an unlabeled set of data and augments each item in it with informative labels. For example, data labels might indicate whether a photo contains a horse or a cow, what words were said in a recording, the type of action performed in a video, what the topic of a news article was, what the overall mood of a tweet was, or whether a spot on an X-ray is a tumor.

Labels can be obtained by asking humans to make judgments given unlabeled data. Labeled data is much more expensive to acquire than raw unlabeled data.
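
When several people label the same item, their judgments have to be combined somehow. One common and simple option is a majority vote, sketched below. The function name `majority_label` and the `min_agreement` threshold are assumptions made for this example, not a prescribed method.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the most common label if its share of votes exceeds min_agreement.

    Returns None when there are no votes or when annotators disagree too much,
    which signals that the item needs another look.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


clear = majority_label(["horse", "horse", "cow"])
tied = majority_label(["horse", "cow"])
```

Collecting several votes per item is part of why labeled data costs more than raw data.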

Unlabeled data is data that has not been tagged with labels identifying its features, attributes, or categories. Unlabeled data is often used in various forms of machine learning.

  • Unsupervised learning uses unlabeled data while supervised learning uses labeled data.
  • Unlabeled data is easier to obtain and store than labeled data, and therefore cheaper and more convenient.
  • Compared to labeled data, unlabeled data has a more limited range of applications in providing actionable insights (e.g., predicting activity). Unsupervised learning techniques can help discover new data clusters and enable new labels.
  • To reduce the need for manually labeled data while still producing large annotated datasets, semi-supervised learning can combine a small labeled set with a larger unlabeled one.
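
One way to combine labeled and unlabeled data is self-training (also called pseudo-labeling): fit a simple model on the labeled set, label only the unlabeled points it is confident about, and repeat. The sketch below uses a nearest-centroid rule on a single numeric feature; the names `centroids` and `self_train` and the `max_distance` confidence threshold are assumptions made for this illustration, and practical systems use richer models and confidence measures.

```python
from statistics import mean


def centroids(pairs):
    """Mean feature value per label."""
    groups = {}
    for x, y in pairs:
        groups.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in groups.items()}


def self_train(labeled, unlabeled, max_distance=3.0, rounds=5):
    """Pseudo-label unlabeled points that lie within max_distance of a centroid."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        confident, rest = [], []
        for x in pool:
            y = min(cents, key=lambda c: abs(cents[c] - x))
            target = confident if abs(cents[y] - x) <= max_distance else rest
            target.append((x, y))
        if not confident:
            break                      # nothing new to learn from
        labeled += confident           # treat confident guesses as labels
        pool = [x for x, _ in rest]    # keep uncertain points unlabeled
    return labeled, pool


# Hypothetical data: two hand-labeled points and four unlabeled ones.
grown, leftover = self_train([(1.0, "low"), (10.0, "high")], [2.0, 3.5, 9.0, 50.0])
```

Points that never become confident stay in the leftover pool and remain candidates for manual labeling.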

Data labeling is an important step in creating high-performance ML models. Although labeling looks simple, it is not always easy to do well. Enterprises must therefore weigh various factors and strategies to choose the most suitable and effective labeling approach. Because each data labeling approach has advantages and disadvantages, a thorough assessment of the complexity of the task and of the size, scope, and duration of the project is recommended.



[More to come ...]



