Ebook

Chapter 2

Collecting Data


Training a deep learning network requires a large amount of high-quality labeled data. This chapter looks at different ways to access and collect this data.

Why Is So Much Training Data Necessary?

No matter what method you choose to design your classification algorithm, you need data. Even if you are building a rule-based algorithm, you have to understand your system and the inputs that it will see in order to be able to write those rules.

The difference between that set of data and what a deep learning system needs for training is mostly a question of quantity.

When you are designing a rule-based algorithm, you are bringing along years of experience and knowledge about the problem, which helps you to quickly dismiss certain approaches or ideas that are obviously not the solution. However, unless you are starting with a partially trained model, the deep neural network you are training has no experience or existing knowledge to draw from. It doesn’t know what is obvious.

Therefore, it takes many more examples of labeled data for the network to understand even basic concepts like rising edges in a signal, let alone combining those edges into the more abstract patterns that you are trying to classify.

So, in this way, deep learning uses more data to offset the experience and knowledge that humans would normally bring to the problem.

Acquiring Labeled Data

Your network will only be as good as the labeled training data that you provide, so it is important that you have access to data that covers the entire solution space. There are several methods for acquiring labeled data. You may choose one or a combination of the methods depending on the type of problem you’re solving.

Collect Your Own Data

You can build a database from scratch by collecting your own data from sensors. In some cases, like with autonomous vehicles, this works well: there are millions of vehicles on the road in every conceivable environment and driving condition, and you can simply record their sensor data. Then, over millions of driven miles and countless hours of labeling, a database is built up.

At first, collecting your own data seems like a straightforward and obvious approach to building up a data set; however, there are things you need to consider. 

For instance, you need to make sure that you collect data across the entire solution space. For example, if your network is supposed to identify words in human speech, then you need training data that covers not only every spoken word, but also the different ways people say the same word. If you only train on a subset of accents, you will fit your model to those accents, and it won’t learn the entire scope of the problem.

Once you’ve collected all of this data, you need to label it … which is no small task! 

Existing Databases

If you’re lucky, you might find all of your labeled data in an existing database. For example, if you’re designing a network that can recognize common objects in images, you might be able to use ImageNet, which has over 14 million labeled images.

There are also audio data sets that contain labeled examples of songs, speech, and other sounds, and new data sets covering other types of signals are being created all the time.

Augment Existing Data

If an existing database doesn’t contain all of the training data that you need, you can augment that data set by adding your own labeled data to fill in the gaps and by modifying the existing data to cover a larger solution space.

Two examples of modifying existing data are adjusting the pitch of human speech and rotating and scaling handwritten characters.

Speech recognition: The audio database may have a set of words spoken by a single speaker; by duplicating the data set and adjusting the frequency, you can train your network to identify speech that occurs at a different pitch than the original database contained.

Character recognition: An existing database of handwritten characters may only contain images that have been scaled and rotated so that each character is the same size and orientation. If you want your network to be able to recognize handwriting at different scales or written at a slant, you can augment the original database by duplicating the data set and adjusting the scale and orientation.
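Both kinds of augmentation described above can be sketched in a few lines. This is a minimal illustration in Python using NumPy and SciPy; the function names and parameter values are my own, not from any particular toolbox, and the pitch shift is a crude resampling (it changes duration as well as pitch, which is usually acceptable for augmentation).

```python
import numpy as np
from scipy import ndimage

# --- Audio: crude pitch shift by resampling ---
def shift_pitch(signal, factor):
    """Resample a 1-D signal; factor > 1 raises pitch, factor < 1 lowers it."""
    idx = np.arange(0, len(signal), factor)
    return np.interp(idx, np.arange(len(signal)), signal)

# --- Images: rotate and scale a handwritten character ---
def augment_image(img, angle_deg, scale):
    """Slant and resize a character image to cover more of the solution space."""
    rotated = ndimage.rotate(img, angle_deg, reshape=False, order=1)
    return ndimage.zoom(rotated, scale, order=1)

# Example: a 1 kHz tone shifted up in pitch, and a 28x28 "character"
# (a simple vertical stroke) slanted and enlarged
fs = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
higher = shift_pitch(tone, 1.2)          # ~1.2x higher pitch, shorter signal

char = np.zeros((28, 28))
char[4:24, 13:15] = 1.0                  # stand-in for a handwritten stroke
slanted = augment_image(char, angle_deg=15, scale=1.1)
```

Applying a handful of such transformations to every item in an existing database multiplies its size, with the original labels carried over unchanged.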

However, if your engineering problem is unique enough, augmenting or extending an existing database may be nearly as much work as creating your own database from scratch.

Synthesize Data

If you understand the physics of your problem well enough to build a simulation, you can use it to synthesize your training data. A benefit of synthesized data is that the label practically comes for free since you need the label to generate the data in the first place. 

Synthesized data can also be used in situations where collecting real data is too expensive or dangerous. For example, it may be cheaper to simulate a robot in many different situations where failure might damage the hardware than to set up a physical scenario to collect the data.
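The point that "the label practically comes for free" can be made concrete: because you pick the class first and then generate a signal of that class, every synthesized example arrives already labeled. This is a minimal sketch with made-up waveform classes (sine vs. square), not a model of any real sensor.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_example(fs=1000, duration=1.0):
    """Return one (signal, label) pair for a randomly chosen waveform class.

    The label is known up front because it drives the simulation.
    """
    t = np.arange(0, duration, 1 / fs)
    label = rng.choice(["sine", "square"])
    freq = rng.uniform(5, 50)                       # vary across the solution space
    base = np.sin(2 * np.pi * freq * t)
    signal = base if label == "sine" else np.sign(base)
    signal = signal + rng.normal(scale=0.1, size=t.shape)  # measurement noise
    return signal, label

# Build a small labeled data set -- no manual labeling step required
dataset = [synthesize_example() for _ in range(100)]
signals, labels = zip(*dataset)
```

Scaling this loop up (and making the simulation faithful to the real physics) is all it takes to produce an arbitrarily large labeled training set.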

On the other hand, if you want to build a network that can classify words in audio signals, it might not make sense to simulate people saying words, because that’s much harder than just collecting a lot of real audio. So, you have to decide whether synthesized data, real data, or a combination of both is right for your problem.

Example: Synthesizing Waveform Data

For an example of data synthesis, check out the MATLAB example Radar and Communications Waveform Classification Using Deep Learning. In this example, deep learning is used to train a CNN to recognize RF waveform modulation types.

One modulation type the network should classify is linear frequency modulation (LFM), which is shown in the figure below for a single carrier frequency, sweep bandwidth, pulse width, and sweep direction.
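An ideal LFM pulse like the one in the figure can be generated directly from those four parameters, since its instantaneous frequency sweeps linearly across the bandwidth over the pulse width. Here is a rough Python sketch; the parameter values are illustrative, not those used in the MATLAB example.

```python
import numpy as np

# Illustrative parameters (not the MATLAB example's values)
fs = 100e3          # sample rate (Hz)
pw = 1e-3           # pulse width (s)
f0 = 10e3           # starting frequency (Hz)
bw = 30e3           # sweep bandwidth (Hz); positive = up-sweep

t = np.arange(0, pw, 1 / fs)
k = bw / pw                          # chirp rate (Hz/s)

# Complex-baseband LFM pulse: phase is the integral of the linear frequency sweep
lfm = np.exp(1j * 2 * np.pi * (f0 * t + 0.5 * k * t**2))

# The instantaneous frequency ramps linearly from f0 toward f0 + bw
inst_freq = f0 + k * t
```

Flipping the sign of `bw` produces a down-sweep, and varying the four parameters generates the family of ideal LFM waveforms the classifier must recognize.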

This is the ideal pattern, but there are many things that can impair this signal: weather, hardware distortions from the radio electronics, and reflections off obstacles near the antenna, as well as many other sources of noise and errors.

Each of the signals below is a different noisy LFM waveform.

The waveform classifier needs to be able to understand the unique features of these signals that make them linear frequency modulations. Therefore, the solution space (the entire set of conditions and scenarios under which the classification algorithm needs to work) is massive, and your training data set needs to cover all of that. This is where deep learning and synthesized data are beneficial.

Since RF modulation schemes and the impairments that produce noise on them are so well known, they are a perfect candidate for synthesized training data. In this example, 10,000 frames are generated for each modulation type; the following plot shows an example frame for a few of the waveforms.
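The frame-generation step described above can be sketched as follows: synthesize an ideal frame with randomized parameters, then impair it with additive white Gaussian noise at a random SNR. This is a hedged Python illustration, not the MATLAB example's code; the parameter ranges and frame size are assumptions, and only AWGN is modeled (the example also applies other channel impairments).

```python
import numpy as np

rng = np.random.default_rng(1)

def lfm_frame(n=256, fs=100e3):
    """One ideal complex LFM frame with randomized sweep parameters."""
    t = np.arange(n) / fs
    f0 = rng.uniform(5e3, 15e3)                       # random start frequency
    k = rng.uniform(1e6, 2e7) * rng.choice([-1, 1])   # random rate and direction
    return np.exp(1j * 2 * np.pi * (f0 * t + 0.5 * k * t**2))

def add_awgn(frame, snr_db):
    """Impair an ideal frame with complex white Gaussian noise at a given SNR."""
    sig_power = np.mean(np.abs(frame) ** 2)
    noise_power = sig_power / 10 ** (snr_db / 10)
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(frame.shape) + 1j * rng.standard_normal(frame.shape)
    )
    return frame + noise

# A small labeled batch; the MATLAB example generates 10,000 frames per class
frames = np.stack([add_awgn(lfm_frame(), rng.uniform(0, 20)) for _ in range(32)])
labels = ["LFM"] * len(frames)
```

Repeating this for each modulation scheme yields a fully labeled training set in which every impairment is applied deliberately and reproducibly.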

This synthesized data is then used to train the network. The real test is how well the network can label real RF data. As with any model, you would want to validate and test this network on hardware using realistic scenarios.

Learn More About Using Synthesized Data