
Variables in Machine Learning

(Harvard University - Joyce Yang)


- Overview

Variables are characteristics that can be measured and can take on different values. They can be found in a research question or hypothesis. 

In mathematics, a variable (from Latin variabilis, "changeable") is a symbol that represents a mathematical object. A variable may represent a number, a vector, a matrix, a function, the argument of a function, a set, or an element of a set. 


- Types of Variables

Variables are categorized in a variety of ways, including: 

  • Independent variables: A variable that stands alone and is not changed by other variables. For example, a person's age.
  • Dependent variables: A variable that changes as a result of the independent variable. Also called response variables. For example, how much a dog eats.
  • Continuous variables: A variable that can take any value between two numbers. For example, the height of a group of basketball players.
  • Discrete variables: A variable that takes on distinct, countable values. For example, the number of students in a class.
  • Confounding variables: A factor other than the one being studied that is associated with both the dependent and independent variables. A confounding variable may distort or mask the effects of another variable.

Other types of variables include quantitative, qualitative, intervening, moderating, and extraneous variables.
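
As a minimal sketch of how these variable types might appear in practice (assuming the pandas library; the column names and values below are invented for illustration):

    import pandas as pd

    # Each column plays one of the roles described above.
    df = pd.DataFrame({
        "age_years": [2, 5, 7],               # independent, discrete variable
        "food_eaten_kg": [0.8, 1.4, 1.9],     # dependent, continuous variable
        "breed": ["beagle", "husky", "lab"],  # qualitative (categorical) variable
    })

    print(df.dtypes)  # int64 / float64 / object: numeric vs. label columns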

 

- Identifying Variables

Identifying variables before conducting an experiment is important for a few reasons:

  • Define and measure factors: Identifying variables helps researchers clearly define and measure the factors being studied. This improves the reliability and validity of the research findings.
  • Select appropriate methods: Understanding variables helps researchers select appropriate research methods and statistical analyses.
  • Know what to experiment on: Identifying variables helps researchers know which items to experiment on and which to measure and get results from.
  • Identify confounding variables: Identifying confounding variables helps ensure that the relationship being observed between independent and dependent variables is real, and that the results of a study are valid.
  • Control variables: Control variables help ensure that an experiment's results are fair and unskewed, and that any observed effect comes from the experimental manipulation rather than from extraneous factors. For example, using the same glassware for all experiments is a control variable.
  • Take variables into account: When scientists are aware of all variables, they can take them into account as they try to make sense of their results.  
 

- Types of Variables in Data Analysis

Variables are categorized in a variety of ways for data analysis in research and statistics:
  • Categorical: Represent names, qualities, and other labels that divide data into groups or classes. Categorical variables can be further classified as nominal or ordinal.
  • Numeric: Represent countable or measurable quantities. Numeric variables can be further classified as discrete or continuous.
  • Qualitative: Refer to groupings or qualities rather than amounts; categorical variables are qualitative.
  • Quantitative: Indicate amounts; numeric variables are quantitative.

Here are some examples of categorical and numeric variables:
  • Categorical: Binary variables have only two categories, such as male or female, or red or blue. Nominal variables can be organized into more than two categories that do not follow a particular order. Ordinal variables have three or more categories that can be ranked, although the intervals between ranks cannot be quantified.
  • Numeric: Discrete variables are numerical variables that can be counted, such as the number of patients in a group. Continuous variables are numerical variables that can take any value within a range and cannot be counted exhaustively, such as time or the weight of patients.
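
As a brief sketch of the nominal/ordinal distinction (assuming pandas; the category names are invented), both kinds of categorical variable can be represented explicitly:

    import pandas as pd

    # Nominal: categories with no inherent order.
    color = pd.Series(["red", "blue", "red"], dtype="category")

    # Ordinal: categories that can be ranked, without quantified intervals.
    severity = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
    pain = pd.Series(["low", "high", "medium"], dtype=severity)

    print(color.cat.ordered)   # False -> nominal
    print(pain.cat.ordered)    # True  -> ordinal
    print(pain.sort_values())  # sorts by the declared ranking, not alphabetically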
 

- Feature Variables in Machine Learning

In machine learning (ML), data variables, also known as feature variables, attributes, or predictors, are measurable pieces of data that are the basic building blocks of datasets. They are used as input for training and making predictions, and are the columns of the data matrix. 

Feature variables are independent variables that serve as inputs to machine learning (ML) algorithms. An ML model then maps these data inputs, or features, to a target variable (also called the response or label), which is the variable being predicted. The quality of the features in a dataset has a significant impact on the quality of the insights gained from ML.

There are different types of feature variables, each with a specific purpose:

  • Numerical variables: Provide information about the scale or magnitude of an attribute
  • Categorical variables: Classify data into distinct categories
  • Binary variables: Indicate the presence or absence of a particular attribute


For example, in a dataset about Titanic passengers, features might include age, name, sex, and fare. In house pricing data, features might include the number of bedrooms and the house size in square feet. In speech recognition, features might include the length of sounds, noise ratios, filter matches, and relative power.
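
A minimal sketch of this setup, assuming pandas and using invented house-pricing values modeled on the example above:

    import pandas as pd

    data = pd.DataFrame({
        "bedrooms": [2, 3, 4],                 # numerical feature
        "sqft": [850, 1200, 1800],             # numerical feature
        "has_garage": [0, 1, 1],               # binary feature
        "price": [200_000, 310_000, 450_000],  # target variable
    })

    X = data.drop(columns=["price"])  # feature variables: columns of the data matrix
    y = data["price"]                 # target variable to be predicted

Here X is the feature matrix whose columns are the feature variables, and y is the target the model learns to predict.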

 

- Why are Feature Variables Important?

Feature variables, also known as features, columns, or attributes, are the basic building blocks of datasets in machine learning and pattern recognition. They are measurable properties or characteristics of a phenomenon, and are usually numeric. The quality of the features in a dataset has a major impact on the quality of the insights you will gain when you use that dataset for machine learning (ML). Additionally, different business problems within the same industry do not necessarily require the same features, which is why it is important to have a strong understanding of the business goals of your data science project.

You can improve the quality of your dataset’s features with processes like feature selection and feature engineering, which are notoriously difficult and tedious. If these techniques are done well, the resulting optimal dataset will contain all of the essential features that might have bearing on your specific business problem, leading to the best possible model outcomes and the most beneficial insights.
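
As one illustration of feature selection (a sketch assuming scikit-learn and its bundled diabetes dataset; the scoring function and the choice of k = 4 are arbitrary):

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectKBest, f_regression

    X, y = load_diabetes(return_X_y=True, as_frame=True)

    # Keep the 4 features with the strongest univariate relationship
    # to the target.
    selector = SelectKBest(score_func=f_regression, k=4)
    selector.fit(X, y)

    print(X.columns[selector.get_support()].tolist())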

 

- Feature Importance

Feature importance is a step in building a machine learning model that calculates a score for each input feature, with higher scores indicating greater importance. This helps practitioners understand which features are most important and which are less important in contributing to the final prediction. 

Feature importance can be used for:

  • Data comprehension: Understanding the relationship between features and the target variable
  • Model improvement: Reducing the dimensionality of the model
  • Feature selection: Selecting a subset of relevant features to use in building a model
  • Model interpretability: Improving model interpretability


One way to test the importance of a feature is to remove it from the dataset, refit the model, and see how much the predictive accuracy suffers; this is sometimes called drop-column importance, and it can be computationally expensive because it requires refitting the model once per feature. A cheaper alternative, permutation importance, randomly shuffles the values of one feature at a time and measures the resulting drop in accuracy using the already-fitted model, so no refitting is required.
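
A minimal sketch of permutation importance, assuming scikit-learn (the dataset and model below are arbitrary choices for illustration):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle one feature at a time and measure the drop in held-out
    # accuracy; the already-fitted model is reused, so no refitting is needed.
    result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                    random_state=0)

    for i in result.importances_mean.argsort()[::-1][:5]:
        print(f"{X.columns[i]}: {result.importances_mean[i]:.3f}")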

 
 
 

[More to come ...]

 
