Acquire
- Overview
Acquiring a dataset is the first step in data preprocessing for machine learning: before you can build and develop models, you need relevant data. A dataset typically combines data collected from several different sources into a single, consistent format.
Dataset formats vary by use case. For example, a commercial dataset will be completely different from a medical dataset. Business datasets will contain relevant industry and business data, while medical datasets will contain healthcare-related data.
Datasets can be downloaded from several online sources, such as the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). You can also create a dataset by collecting data through various Python APIs. Once the dataset is ready, save it in a file format such as CSV, HTML, or XLSX.
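As a minimal sketch of that last step, the snippet below writes a few collected records to CSV and reads them back using Python's standard csv module. The records, the column names, and the helper names write_csv and read_csv are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical records, e.g. gathered from an API or a manual survey.
records = [
    {"patient_id": "1", "age": "34", "diagnosis": "flu"},
    {"patient_id": "2", "age": "51", "diagnosis": "asthma"},
]


def write_csv(rows, fieldnames, handle):
    """Write a list of dicts to an open text handle in CSV format."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def read_csv(handle):
    """Read CSV text from an open handle into a list of dicts."""
    return list(csv.DictReader(handle))


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_csv(records, ["patient_id", "age", "diagnosis"], buffer)
buffer.seek(0)
loaded = read_csv(buffer)
```

Note that the csv module reads every value back as a string, so numeric columns such as age would still need converting before analysis.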
- Acquire - To Obtain Data
The first stage in the data science process is to acquire the data. You need to obtain the source material before analyzing or acting on it. The Acquire activity includes everything involved in retrieving data: finding, accessing, acquiring, and moving it.
There are four ways to acquire data: collect new data; convert/transform legacy data; share/exchange data; and procure data. These cover automated collection (e.g., sensor-derived data), manual recording of empirical observations, and acquisition of existing data from other sources.
There are many considerations when acquiring data. After data is collected or received, it must be reviewed to ensure that it meets standards and can be demonstrated to be acceptable to the organization for its intended use.
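One way to picture this review step is a small acceptance check that runs over incoming records. The sketch below is only an illustration: the function review_records, the field names, and the temperature range are assumptions rather than part of any particular standard.

```python
def review_records(records, required_fields, valid_ranges):
    """Split records into accepted ones and rejected ones with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = []
        # Every required field must be present and non-empty.
        for field in required_fields:
            if record.get(field) in (None, ""):
                problems.append(f"missing {field}")
        # Numeric fields must fall inside their allowed range.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value not in (None, "") and not (low <= value <= high):
                problems.append(f"{field} out of range")
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


# Hypothetical sensor readings received from a data provider.
readings = [
    {"sensor": "a1", "temperature_c": 21.5},
    {"sensor": "a2", "temperature_c": 480.0},
    {"sensor": "", "temperature_c": 19.0},
]
accepted, rejected = review_records(
    readings,
    required_fields=["sensor", "temperature_c"],
    valid_ranges={"temperature_c": (-50.0, 60.0)},
)
```

Keeping the rejection reasons alongside each record makes it easier to show why data was or was not accepted for its intended use.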
- To Determine What Data Is Available
The first stage in acquiring data is to determine what data is available. You want to identify the data that is relevant to your problem and use all of it in your analysis.
Leaving out even a small amount of important data can lead to incorrect conclusions. Data comes from many places, local and remote; in many varieties, structured and unstructured; and at different velocities.
Following are common data collection considerations:
- Business requirements: The first thing to consider is the business requirements. Why do you need this data, and what will it be used for?
- Business rules: Business rules determine the constraints of business operation. For example, where applicable, all geospatial data must have Federal Geographic Data Committee (FGDC) compliant metadata. These rules will influence your data collection decisions.
- Data Standards: Any applicable government, USGS, or industry standards need to be considered.
- Accuracy requirements: The most familiar accuracy requirement is the positional accuracy of the spatial data; but you may also need to consider other accuracy requirements.
- Cost: Cost is always a consideration. Sometimes buying is cheaper than collecting.
- Currency of data: For many types of jobs, data needs to be fairly up to date. For others, the data may need to cover a specified time period or a specific season. For example, if you're trying to determine vegetation cover, you might want photos taken in summer, when vegetation is at its peak. If you are looking at landforms, you may need winter photos.
- Time limit: You should determine how often you need the data.
- Format: Do you need spatial data, photos, flat files, Excel files, XML files, etc.? This may not apply, but you need to determine this for each project.
- Technologies To Access Data
Big data comes from many places. Finding and evaluating data that is useful to your big data analytics is important before you start acquiring it. Depending on the source and structure of the data, there are different ways to access it. If you want to work on projects with much larger datasets, or big data, you need to learn how to access data using distributed storage and processing frameworks such as Apache Hadoop, Spark, or Flink.
There are many techniques and technologies to access these different types of data:
- A lot of data exists in conventional relational databases, such as structured big data from organizations. The tool of choice for accessing data in databases is Structured Query Language (SQL). Additionally, most database systems come with a graphical application environment that allows you to query and explore the datasets in the database.
- Data can also exist in files such as text files and Excel spreadsheets. Scripting languages are generally used to get data from files. Common languages with support for processing files include JavaScript, Python, PHP, Perl, R, MATLAB, and many others.
- An increasingly popular way to get data is from websites. Web pages are written using a set of standards approved by the World Wide Web Consortium (W3C). These standards cover a variety of formats and services. One common format is the Extensible Markup Language (XML), which uses markup symbols, or tags, to describe the contents of a web page.
- Many websites also host web services that provide programmatic access to their data. There are several types of web services. The most popular is REST, largely because it is easy to use. REST stands for Representational State Transfer, an approach to implementing web services with performance, scalability, and maintainability in mind.
- WebSocket services are also becoming more popular because they allow real-time updates from websites.
- NoSQL storage systems are increasingly used to manage a variety of data types in big data. These data stores are databases that do not represent data in a table format with columns and rows, as conventional relational databases do. Examples include Cassandra, MongoDB, and HBase. NoSQL data stores provide APIs that allow users to access data. These APIs can be used directly or from within an application that needs to access the data. Additionally, most NoSQL systems provide data access via a web service interface, such as REST.
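Two of the access techniques listed above, SQL queries and XML parsing, can be sketched with Python's standard library. This is a rough illustration only: the in-memory SQLite database, the sales table, the XML snippet, and the function names are all made up for the example and do not come from any real system.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_database():
    """Create a throwaway in-memory database and query it with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [("north", 120.0), ("south", 80.0), ("north", 30.0)],
        )
        # Aggregate the total amount per region.
        rows = conn.execute(
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        ).fetchall()
    finally:
        conn.close()
    return rows


# Hypothetical XML document, as might be returned by a website.
XML_TEXT = """<stations>
  <station id="s1"><name>Harbor</name></station>
  <station id="s2"><name>Airport</name></station>
</stations>"""


def parse_stations(xml_text):
    """Extract (id, name) pairs from the tags of an XML document."""
    root = ET.fromstring(xml_text)
    return [(s.get("id"), s.findtext("name")) for s in root.findall("station")]


totals = query_database()
stations = parse_stations(XML_TEXT)
```

Against a production database or a live website, the connection and the source of the XML text would differ, but the query and parsing steps would look much the same.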

