
Data Collection Layer

[Figure: Data Collection for ML - Yuji Roh]


- Overview

As new technologies unfold in an era of rapid innovation, collecting data is important for any organization. Data fuels analytical insights and artificial intelligence applications that would be difficult to achieve otherwise.

As a society, we are generating data at an unprecedented rate. This data can be numeric (temperature, loan amount, customer retention rate), categorical (gender, skin color, highest degree earned), or free text (for example, doctor's notes or opinion surveys). Data collection is the process of gathering and measuring information from many different sources. To develop practical artificial intelligence (AI) and machine learning solutions from this data, we must collect and store it in a way that makes sense for the business problem at hand.
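The three kinds of data described above can be sketched as a single typed record. This is only an illustration: the `LoanRecord` class, its field names, and the `DEGREES` category set are hypothetical and do not come from any particular system.

```python
from dataclasses import dataclass

# Hypothetical set of allowed categories, chosen only for illustration.
DEGREES = {"none", "high_school", "bachelor", "master", "doctorate"}


@dataclass
class LoanRecord:
    loan_amount: float      # numeric field
    highest_degree: str     # categorical field
    notes: str = ""         # free-text field

    def __post_init__(self):
        # Normalize each field at collection time so stored data is consistent.
        self.loan_amount = float(self.loan_amount)
        if self.loan_amount < 0:
            raise ValueError("loan_amount must be non-negative")
        self.highest_degree = self.highest_degree.strip().lower()
        if self.highest_degree not in DEGREES:
            raise ValueError(f"unknown degree: {self.highest_degree!r}")
        self.notes = self.notes.strip()


record = LoanRecord(
    loan_amount="12500",
    highest_degree=" Bachelor ",
    notes="  Applicant asked about early repayment. ",
)
```

Validating at the point of collection is one possible design choice; it keeps malformed values out of storage rather than leaving them for later cleanup.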

Collecting data allows you to capture records of past events so that you can use data analytics to find recurring patterns. Based on these patterns, you can use machine learning algorithms to build predictive models that identify trends and forecast future changes.
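As a minimal sketch of this idea, the example below fits a straight-line trend to a handful of past observations using ordinary least squares and extrapolates one step ahead. The `months` and `sales` values are made up for illustration, and a linear trend is only one of many possible models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative records of past events (hypothetical values).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)

# Use the recovered pattern to predict the next, unseen month.
forecast = slope * 6 + intercept
```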


- Data Collection for AI/ML/DL

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are two main reasons data collection has recently become a critical issue. First, as machine learning becomes more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs but may in return require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community, due to the importance of handling large amounts of data.

Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. The integration of machine learning and data management for data collection is part of a larger trend of big data and artificial intelligence (AI) integration, and it opens many opportunities for new research.


- Data Preparation for Automated Machine Learning

Predictive models are only as good as the data used to build them, so good data collection practices are critical to developing high-performance models. The data needs to be free of errors (garbage in, garbage out) and contain information relevant to the task at hand. For example, a loan default model would not benefit from tiger population size, but might benefit from natural gas prices over time.
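The two requirements above, error-free values and relevant fields, can be sketched as a simple cleaning pass. The field names in `RELEVANT_FIELDS` and the sample rows are hypothetical, and real pipelines usually need more nuanced handling than dropping every incomplete row.

```python
import math

# Hypothetical fields assumed relevant to a loan default model.
RELEVANT_FIELDS = ("loan_amount", "natural_gas_price")


def clean_rows(rows):
    """Keep only relevant fields; drop rows with missing or non-finite values."""
    cleaned = []
    for row in rows:
        try:
            values = {name: float(row[name]) for name in RELEVANT_FIELDS}
        except (KeyError, TypeError, ValueError):
            continue  # missing field or unparseable value
        if all(math.isfinite(v) for v in values.values()):
            cleaned.append(values)
    return cleaned


raw = [
    {"loan_amount": "12000", "natural_gas_price": "3.10", "tiger_population": "3900"},
    {"loan_amount": None, "natural_gas_price": "2.95", "tiger_population": "3900"},
    {"loan_amount": "8000", "natural_gas_price": "n/a"},
    {"loan_amount": "nan", "natural_gas_price": "3.40"},
]

cleaned = clean_rows(raw)
```

Only the first row survives here: the irrelevant `tiger_population` field is discarded, and the rows with a missing, unparseable, or NaN value are dropped.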

The quality of predictive output relies on the quality of the input -- if you put good data in, you’ll get good results out. That’s why proper data preparation is such a critical success factor for achieving optimal machine learning results. The iterative process of preparing data for automated machine learning is both an art and a science.



[More to come ...]
