Feature Extraction Techniques

Tsinghua University_071123E
[Tsinghua University, China]

- Overview

Feature extraction is a machine learning (ML) and data analysis process that derives relevant features from raw data. It is closely related to feature engineering and feature discovery, and the terms are sometimes used interchangeably.

Feature extraction transforms raw data into numerical features that can be processed while preserving as much of the original data set's information as possible. This often leads to better results than applying ML directly to the raw data.

Features extracted from raw data can be used to create a more informative dataset. This dataset can then be used for tasks such as classification, prediction, and clustering.

Feature extraction can be accomplished manually or automatically. It can lead to: 

  • A boost in training speed
  • An improvement in model accuracy
  • A reduction in risk of overfitting
  • A rise in model explainability
  • Better data visualization

A typical use case for feature extraction is image files. For example, data scientists can create new features suitable for ML applications by extracting the shape of an object or the redness value in images.
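
As a rough illustration of the "redness value" idea, the sketch below reduces an image to a single number. It assumes the image is already available as rows of (r, g, b) tuples with channel values from 0 to 255; the function name `mean_redness` and the sample image are hypothetical, and "share of red in total intensity" is only one of several reasonable definitions of redness.

```python
def mean_redness(pixels):
    """Return the mean share of red in each pixel's total intensity (0.0 to 1.0).

    `pixels` is a list of rows; each row is a list of (r, g, b) tuples
    with channel values in 0..255.
    """
    shares = []
    for row in pixels:
        for r, g, b in row:
            total = r + g + b
            # A pure black pixel carries no colour information; count it as 0.
            shares.append(r / total if total else 0.0)
    return sum(shares) / len(shares) if shares else 0.0


# Hypothetical 2x2 image: two pure-red pixels, one grey pixel, one black pixel.
image = [
    [(255, 0, 0), (255, 0, 0)],
    [(100, 100, 100), (0, 0, 0)],
]
redness = mean_redness(image)  # one numeric feature describing the whole image
```

A real pipeline would normally read pixels with an imaging library and might compute redness per region rather than per image.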

 

- Features

In machine learning (ML), variables extracted from raw data are called features. Features are essential to ML because they are the input to learning algorithms and are used to predict an outcome or target variable. 

Features can come in many forms, including:

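  • Continuous: Numeric values that can take any value within a range, such as a temperature or a weight
  • Discrete: Numeric values that are countable, such as the number of rooms in a house
  • Categorical: Labels drawn from a fixed set of categories that have no inherent order, such as a color
  • Binary: Values with exactly two possible states, such as true/false or 0/1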
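
Since most learning algorithms expect numeric input, each feature type above is usually mapped to numbers before training. The following is a minimal sketch of one common mapping, assuming a hypothetical record with fields `weight_kg`, `num_wheels`, `color`, and `is_electric`; the category list `COLORS` is likewise an assumption made for illustration.

```python
COLORS = ["red", "green", "blue"]  # hypothetical fixed set of categories


def encode(record):
    """Turn one mixed-type record into a flat list of floats."""
    # Categorical: one-hot encoding, so no artificial order is implied.
    one_hot = [1.0 if record["color"] == c else 0.0 for c in COLORS]
    return [
        float(record["weight_kg"]),             # continuous: used as-is
        float(record["num_wheels"]),            # discrete: used as-is
        *one_hot,                               # categorical: one slot per category
        1.0 if record["is_electric"] else 0.0,  # binary: 0 or 1
    ]


record = {"weight_kg": 12.5, "num_wheels": 2, "color": "green", "is_electric": True}
vector = encode(record)
```

One-hot encoding is used for the categorical field because assigning integers such as 0, 1, 2 would suggest a ranking that the categories do not have.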

Raw data is unprocessed data in the form in which it was originally collected, for example from sensors, logs, or large groups of people. It is often complex and large in volume. Raw data can be useful, but it often isn't in a format that ML algorithms can use.

Data preparation is the process of cleaning and organizing data so that it can be used by ML algorithms. 

Data preprocessing is the process of transforming raw data into a format that is more meaningful and suitable for analysis and model training. Data preprocessing can help improve the quality and efficiency of ML models by addressing issues like inconsistencies, missing values, noise, and outliers. 
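
The issues mentioned above (missing values and outliers in particular) can be handled in many ways. The sketch below shows one simple combination for a single numeric column: median imputation, clipping, and standardization. The function name `preprocess`, the `z_clip` parameter, and the sample values are assumptions for illustration, not a recommended recipe.

```python
import statistics


def preprocess(values, z_clip=3.0):
    """Impute missing values, clip outliers, and standardize one numeric column."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    # 1. Missing values: replace None with the median of the observed values.
    filled = [median if v is None else v for v in values]

    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]

    # 2. Outliers: clip values further than z_clip standard deviations from the mean.
    low, high = mean - z_clip * std, mean + z_clip * std
    clipped = [min(max(v, low), high) for v in filled]

    # 3. Scaling: standardize to zero mean and unit (population) standard deviation.
    mean_c = statistics.fmean(clipped)
    std_c = statistics.pstdev(clipped)
    return [(v - mean_c) / std_c for v in clipped]


# Hypothetical column with one missing value and one extreme reading.
cleaned = preprocess([1.0, 2.0, None, 3.0, 100.0], z_clip=1.5)
```

In practice the imputation value and scaling statistics should be computed on the training data only and then reused for new data, so that no information leaks from the test set.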

Here are some steps to construct a dataset for ML: 

  • Collect the raw data
  • Identify feature and label sources
  • Select a sampling strategy
  • Split the data
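
The last step above, splitting the data, can be sketched as a reproducible random split. This is a minimal version assuming the dataset fits in memory as a list of (features, label) pairs; the helper name `train_test_split` and the toy dataset are hypothetical, and other sampling strategies (for example stratified or time-based splits) may suit a given problem better.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle `rows` reproducibly and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded generator gives a repeatable order
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: each row pairs a feature vector with a label.
dataset = [([float(i), float(i % 3)], i % 2) for i in range(10)]
train, test = train_test_split(dataset, test_fraction=0.2, seed=42)
```

Fixing the seed keeps the split identical across runs, which makes results comparable when features or models change.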
 

- Feature Extraction Techniques

Feature extraction techniques in ML select, combine, and transform raw data to create relevant inputs for ML algorithms. These techniques can improve the performance, efficiency, and interpretability of AI systems. 

Here are some common feature extraction techniques:
  • Feature selection: Keeps the most relevant subset of the original features by removing irrelevant, redundant, or noisy ones. Unlike the other techniques listed here, it does not create new features.
  • Principal component analysis (PCA): Projects the data onto a small number of new, uncorrelated variables (principal components), each a linear combination of the original variables, ordered by how much variance they capture. It is useful for reducing high-dimensional data to a few dimensions while keeping most of the variation.
  • Bag of words (BoW): Represents each document as a vector of word counts over a fixed vocabulary, ignoring word order and grammar. It is a simple and widely used baseline in natural language processing (NLP).
  • Wavelet transform: Decomposes data into different frequency components to capture local and global patterns. It's often used in image analysis and signal processing.
  • Autoencoders: Neural networks trained without labels to compress the input into a lower-dimensional code and then reconstruct the input from that code. The learned code serves as a compact feature representation, and variants such as denoising autoencoders also reduce noise in the data.
  • Automated feature extraction: Uses deep networks or specialized algorithms to extract features from images or signals without human intervention. Wavelet scattering is an example of automated feature extraction.
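
Two of the techniques listed above, bag of words and PCA, are small enough to sketch with the standard library alone. The code below is illustrative only: the function names are hypothetical, tokenization is a plain whitespace split, and the PCA part estimates just the first principal component by power iteration instead of performing a full eigendecomposition as a numerical library would.

```python
import math
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors): one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


def first_principal_component(rows, iterations=100):
    """Estimate the unit direction of greatest variance by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d) of the centered data; needs n >= 2.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Assumption: the all-ones start vector is not orthogonal to the answer.
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0:
            raise ValueError("no variance along the starting direction")
        vec = [v / norm for v in vec]
    return vec


docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)

# Hypothetical 2-D points lying on the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction = first_principal_component(points)
```

Projecting each centered point onto `direction` would give a single extracted feature that, for these points, retains all of the variance in the original two columns.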

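Other related techniques include forward selection, backward elimination, select K best, and missing value ratio, which are feature selection methods, and t-SNE, a nonlinear dimensionality reduction method used mainly for visualization.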
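
To make "select K best" concrete, the sketch below scores each candidate feature by the absolute value of its Pearson correlation with the target and keeps the top k. The scoring function, the helper names, and the sample columns are assumptions chosen for illustration; other scoring functions (for example chi-squared or mutual information) are also commonly used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


def select_k_best(columns, target, k):
    """Return the names of the k columns with the highest |correlation| to target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature columns and target values.
columns = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]
best = select_k_best(columns, target, k=1)
```

Because each feature is scored on its own, this kind of filter can miss features that are only useful in combination, which is one reason forward selection and backward elimination evaluate subsets instead.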

The best technique to use depends on the task and on the type of data, such as images, text, or signals.

 
 
 

[More to come ...]