Behavioral Cloning for Autonomous Vehicles

robotcar.jpg

1 Introduction

An important milestone in Udacity's Self-Driving Car Engineer nanodegree program is to teach a toy model of a car the basics of how to drive. We can accomplish this using a technique called Behavioral Cloning. The Encyclopedia of Machine Learning says,

Behavioral cloning is a method by which human subcognitive skills can be captured and reproduced in a computer program. As the human subject performs the skill, his or her actions are recorded along with the situation that gave rise to the action. A log of these records is used as input to a learning program. The learning program outputs a set of rules that reproduce the skilled behavior. This method can be used to construct automatic control systems for complex tasks for which classical control theory is inadequate. It can also be used for training.

In this exercise, I outline the approach I took to behavioral cloning, discuss some of the tools I used and developed, and reflect on some of the lessons I learned along the way. This is not a step-by-step "How-To" guide, so much of the Python code below is redacted, though in some cases the results are shown. Nevertheless, perhaps this article can help students who stumble into this fascinating topic in the future. So, let's get started!

2 Method

The method is quite simple. We somehow acquire training data—perhaps by recording it ourselves—in a computerized driving simulator, which Udacity provides. Images from that simulator are shown below. The training data comprise images of the road as seen from cameras mounted on the simulated car, along with corresponding control inputs (in this case, just the steering angle). The training data are used to train a Deep Learning neural network model so that it recognizes road/car configurations and generates the appropriate steering angle. The model is then used to generate inputs (steering angles) in real-time for the simulation, unpiloted by a human driver. Here are some of the tools we use along the way.

  • Keras - Deep Learning toolkit for Python
  • TensorFlow - High-Performance numerical computation library for Python and backend for Keras.
  • Unix command-line tools - handy for data pre-processing
  • Emacs - indispensable coding and writing environment
  • Org mode - indispensable writing and publishing environment for Emacs

Keras and TensorFlow (really, TensorFlow) are tailor-made for modern high-performance parallel numerical computation using GPUs, which are easily obtained through cloud-computing services like Amazon AWS. Yet everything in this project was conducted on just the ancient laptop shown in Figure 2. While better hardware would almost certainly be essential for real Deep Learning applications and autonomous vehicles, for toy problems like this one it may not be necessary.

rig.jpg

Figure 2: Project Lab

3 Data Collection and Preparation

Behavioral cloning relies on training neural networks with data exhibiting the very behavior you wish to clone. We use the driving simulator provided by Udacity, which in its "training mode" can emit a stream of data samples as the user operates the car. Each sample consists of a triplet of images and a single floating point number in the interval [-1, 1], recording the view and the steering angle for the simulation and car at regular intervals. The three images are meant to be from three "cameras" mounted on the simulated car's left, center, and right, giving three different aspects of the scene and in principle providing stereoscopic depth information.

The driving simulator also has an "autonomous mode" in which the car interacts with a network server to exchange telemetry that guides the car. The simulator sends the network server camera images and the network server is expected to reply with steering angles. So, not only is the driving simulator critical for understanding the problem and helpful for obtaining training data, it is absolutely essential for evaluating the solution.

The data are recorded into a CSV "index file" and corresponding image files. Each line in the index file correlates images with the steering angle, throttle, brake, and speed of the car. The images are identified by the filenames of the three camera image files.

head driving_log.csv
wc -l driving_log.csv
center,left,right,steering,throttle,brake,speed
IMG/center_2016_12_01_13_30_48_287.jpg, IMG/left_2016_12_01_13_30_48_287.jpg, IMG/right_2016_12_01_13_30_48_287.jpg, 0, 0, 0, 22.14829
IMG/center_2016_12_01_13_30_48_404.jpg, IMG/left_2016_12_01_13_30_48_404.jpg, IMG/right_2016_12_01_13_30_48_404.jpg, 0, 0, 0, 21.87963
IMG/center_2016_12_01_13_31_12_937.jpg, IMG/left_2016_12_01_13_31_12_937.jpg, IMG/right_2016_12_01_13_31_12_937.jpg, 0, 0, 0, 1.453011
IMG/center_2016_12_01_13_31_13_037.jpg, IMG/left_2016_12_01_13_31_13_037.jpg, IMG/right_2016_12_01_13_31_13_037.jpg, 0, 0, 0, 1.438419
IMG/center_2016_12_01_13_31_13_177.jpg, IMG/left_2016_12_01_13_31_13_177.jpg, IMG/right_2016_12_01_13_31_13_177.jpg, 0, 0, 0, 1.418236
IMG/center_2016_12_01_13_31_13_279.jpg, IMG/left_2016_12_01_13_31_13_279.jpg, IMG/right_2016_12_01_13_31_13_279.jpg, 0, 0, 0, 1.403993
IMG/center_2016_12_01_13_31_13_381.jpg, IMG/left_2016_12_01_13_31_13_381.jpg, IMG/right_2016_12_01_13_31_13_381.jpg, 0, 0, 0, 1.389892
IMG/center_2016_12_01_13_31_13_482.jpg, IMG/left_2016_12_01_13_31_13_482.jpg, IMG/right_2016_12_01_13_31_13_482.jpg, 0, 0, 0, 1.375934
IMG/center_2016_12_01_13_31_13_584.jpg, IMG/left_2016_12_01_13_31_13_584.jpg, IMG/right_2016_12_01_13_31_13_584.jpg, 0, 0, 0, 1.362115
8037 driving_log.csv
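
For the Python side of the analysis below, the same index can be loaded into a DataFrame. This is only a sketch, assuming pandas is available and using the column names from the CSV header above; the redacted pipeline code does not necessarily do it this way.

import pandas as pd

# Load the index file; column names come from the CSV header shown above.
log = pd.read_csv("driving_log.csv")
# Strip the stray spaces that follow the commas in the image-path columns.
for col in ["center", "left", "right"]:
    log[col] = log[col].str.strip()
print(log.shape)  # (8036, 7): 8036 samples, 7 columns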

Deep Learning lore says that it is often prudent to randomize the data when possible and always prudent to split the data into training and validation sets, which we can do in just a few lines of shell code.

cat driving_log.csv | tail -n+2 | shuf > driving_log_all.csv
cat driving_log_all.csv | head -n7000 > driving_log_train.csv
cat driving_log_all.csv | tail -n+7001 > driving_log_validation.csv
wc -l driving_log_all.csv
wc -l driving_log_train.csv
wc -l driving_log_validation.csv
: 8036 driving_log_all.csv
: 7000 driving_log_train.csv
: 1036 driving_log_validation.csv

Now, Paul Heraty argues that it can be useful in the early stages of developing a solution to "overtrain" it on a small sample comprising disparate canonical examples. I can confirm that this was extremely good advice. One of the chief difficulties I encountered as a newcomer to Deep Learning and its community of tools was simply getting anything to work in the first place, independent of whether the model was actually any good. The most effective strategy I found for overcoming this difficulty was to "try to get a pulse": develop the basic machinery of the model and solution first, with little or no regard for its fidelity. Working through the inevitable blizzard of error messages one first encounters is no small task. Once it is cleared and you have confidence your tools are working, it becomes possible to iterate rapidly and converge to a good solution. Creating an "overtraining sample" is worthwhile because overtraining is a vivid expectation that can be realized quickly (especially with only 3 samples), and if overtraining does not occur you know you have deeper problems. With a little magic from Bash, Awk, etc., we can select three disparate samples: one with extreme left steering, one with extreme right steering, and one with neutral steering.

cat <(cat driving_log_all.csv | sort -k4 -n -t, | head -n1) \
    <(cat driving_log_all.csv | sort -k4 -nr -t, | head -n1) \
    <(cat driving_log_all.csv | awk -F, -vOFS=, '{print $1, $2, $3, sqrt($4*$4), $5, $6, $7}' | sort -k4 -n -t, | head -n1) \
    > driving_log_overtrain.csv
cat driving_log_overtrain.csv
: IMG/center_2016_12_01_13_39_28_024.jpg, IMG/left_2016_12_01_13_39_28_024.jpg, IMG/right_2016_12_01_13_39_28_024.jpg, -0.9426954, 0, 0, 28.11522
: IMG/center_2016_12_01_13_38_46_752.jpg, IMG/left_2016_12_01_13_38_46_752.jpg, IMG/right_2016_12_01_13_38_46_752.jpg, 1, 0, 0, 13.2427
: IMG/center_2016_12_01_13_30_48_287.jpg, IMG/left_2016_12_01_13_30_48_287.jpg, IMG/right_2016_12_01_13_30_48_287.jpg,0, 0, 0, 22.14829

4 Exploratory Analysis

It often pays to explore your data with relatively few constraints before diving in to build and train the actual model. One may gain insights that help guide you to better models and strategies, and avoid pitfalls and dead-ends.

To that end, first we just want to see what kind of input data we are dealing with. We know that they are RGB images, so we load a few of them for display. Here, we show the three frames taken from the driving_log_overtrain.csv file described above—center camera only—labeled by their corresponding steering angles. As you can see, the image with a large negative angle seems to have the car on the extreme right edge of the road. Perhaps the driver in this situation was executing a "recovery" maneuver, turning sharply to the left to veer away from the road's right edge and back to the centerline. Likewise, in the next figure, which has a large positive angle, the car appears to be on the extreme left edge of the road; perhaps the opposite recovery maneuver was in play. Finally, in the third and last image, which has a neutral steering angle (0.0), the car appears to be sailing right down the middle of the road, a circumstance that, absent extraneous factors (other cars, people, rodents), should not require corrective steering.

road1.png

Figure 3: Large Negative Steering Angle

road2.png

Figure 4: Large Positive Steering Angle

road3.png

Figure 5: Neutral Steering Angle

We can see that the images naturally divide roughly into "road" below the horizon and "sky" above the horizon, with background scenery (trees, mountains, etc.) superimposed onto the sky. While the sky (really, the scenery) might contain useful navigational information, it is plausible that it contains little or no useful information for the simpler task of maintaining an autonomous vehicle near the centerline of a track, a subject we shall return to later. Likewise, it is almost certain that the small amount of car "hood" superimposed onto the bottom of the images contains no useful information. Therefore, let us see what the images would look like with the hood cropped out on the bottom by 20 pixels, and the sky cropped out on the top by 60 pixels, 80 pixels, and 100 pixels.

road4.png

Figure 6: Hood Crop: 20, Sky Crop: 60

road5.png

Figure 7: Hood Crop: 20, Sky Crop: 80

road6.png

Figure 8: Hood Crop: 20, Sky Crop: 100
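
The crops above amount to nothing more than array slicing. As a purely illustrative sketch (the actual image-handling code is redacted), cropping 20 pixels of hood and 60 pixels of sky from one of the 160x320 frames might look like this:

import matplotlib.image as mpimg

HOOD_CROP = 20  # pixels removed from the bottom of the frame
SKY_CROP = 60   # pixels removed from the top of the frame

# One of the center-camera frames from the index above; shape (160, 320, 3).
image = mpimg.imread("IMG/center_2016_12_01_13_30_48_287.jpg")
cropped = image[SKY_CROP:image.shape[0] - HOOD_CROP, :, :]  # shape (80, 320, 3)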

Next we perform a very simple analysis of the target labels, which again are steering angles in the interval [-1, 1]. In fact, since they are real-valued outputs it may be a stretch to call them "labels," as this is not really a classification problem. Nevertheless, in the interest of time we will adopt the term.

# REDACTED
: 
: >>>
: OrderedDict([('nobs', 8036),
:		 ('minmax', (-0.94269539999999996, 1.0)),
:		 ('mean', 0.0040696440648332506),
:		 ('variance', 0.016599764281272529),
:		 ('skewness', -0.13028924577521908),
:		 ('kurtosis', 6.311554102057668)])
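
Although the code above is redacted, a summary like this one boils down to a single call to a standard statistics routine. A minimal sketch, assuming the angles live in the steering column of a DataFrame like the one loaded earlier:

from scipy import stats
import numpy as np

angles = log["steering"].values          # all 8036 steering angles
print(stats.describe(angles)._asdict())  # nobs, minmax, mean, variance, skewness, kurtosis

# The masked analysis that follows applies the same routine to the "turning" samples only.
turning = angles[np.abs(angles) > 0.01]
print(stats.describe(turning)._asdict())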

hist1.png

Figure 9: All Samples - No Reflection

The data have non-zero mean and skewness, perhaps arising from a bias toward left-hand turns when driving on a closed track.

The data are dominated by small steering angles because the car spends most of its time on the track in straightaways. The asymmetry in the data is more apparent if I mask out small angles and repeat the analysis. Steering angles occupy the interval [-1, 1], but the "straight" samples appear to be within the neighborhood [-0.01, 0.01].

We might consider masking out small angled samples from the actual training data as well, a subject we shall return to in the summary.

# REDACTED
: 
: >>> >>> >>>
: OrderedDict([('nobs', 3584),
:		 ('minmax', (-0.94269539999999996, 1.0)),
:		 ('mean', 0.0091718659514508933),
:		 ('variance', 0.037178302717086109),
:		 ('skewness', -0.166578259690152),
:		 ('kurtosis', 1.1768785967587396)])

hist2.png

Figure 10: abs(angle)>0.01 - No Reflection

A simple trick we can play to remove this asymmetry—if we wish—is to join the data with its reflection, effectively doubling our sample size in the process. For illustration purposes only, we shall again mask out small angle samples.

# REDACTED
: 
: >>> >>>
: OrderedDict([('nobs', 7168),
:		 ('minmax', (-1.0, 1.0)),
:		 ('mean', 0.0),
:		 ('variance', 0.03725725015081123),
:		 ('skewness', 0.0),
:		 ('kurtosis', 1.1400026599654964)])

hist3.png

Figure 11: abs(angle)>0.01 - Full Reflection

In one of the least-surprising outcomes of the year, after performing the reflection and joining operations, the data now are symmetrical, with mean and skewness identically 0.
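
For the labels alone, the reflection-and-join operation is a one-liner. A sketch, reusing the hypothetical log DataFrame from the earlier snippet (again, not the redacted analysis code):

import numpy as np
from scipy import stats

angles = log["steering"].values                  # steering column from the DataFrame above
angles_sym = np.concatenate([angles, -angles])   # join the data with its reflection
mask = np.abs(angles_sym) > 0.01                 # again mask out the "straight" samples
print(stats.describe(angles_sym[mask])._asdict())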

Of course, in this analysis I have only reflected the target labels. If I apply this strategy to the training data, naturally I need to mirror the corresponding input images left-to-right as well. In fact, that is the purpose of the Xflip, yflip, rmap, rflip, and sflip utility functions described elsewhere.
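
Those utilities are not reproduced here, but the core operation is easy to sketch. The function below is a hypothetical stand-in, not the actual sflip; it assumes a sample is an (image, angle) pair with the image held in a NumPy array.

import numpy as np

def flip_sample(image, angle):
    # Mirror the image left-to-right and negate the steering angle.
    return np.fliplr(image), -angle

Randomly applying such a flip to roughly half the samples, as discussed next, is then just a coin toss (np.random.rand() < 0.5) wrapped around this operation.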

It turns out there is another approach to dealing with the bias and asymmetry in the training data. In lieu of reflecting the data, which by definition imposes a 0 mean and 0 skewness, we can instead just randomly flip samples 50% of the time. While that will not yield a perfectly balanced and symmetric data distribution, given enough samples it should give us a crude approximation. Moreover, it saves us from having to store more images in memory, at the cost of some extra computation. Essentially, we are making the classic space-time trade-off between memory consumption and CPU usage.

# REDACTED
>>>
OrderedDict([('nobs', 16072),
	 ('minmax', (-1.0, 1.0)),
	 ('mean', 0.00043679202588352409),
	 ('variance', 0.037466760137091985),
	 ('skewness', 0.04798402427255847),
	 ('kurtosis', 1.1908627204420252)])

OrderedDict([('nobs', 32144),
	 ('minmax', (-1.0, 1.0)),
	 ('mean', -8.8215565268790185e-05),
	 ('variance', 0.03728948130040155),
	 ('skewness', -0.026881590212767648),
	 ('kurtosis', 1.141079050851383)])

OrderedDict([('nobs', 64288),
	 ('minmax', (-1.0, 1.0)),
	 ('mean', -9.0587991538078374e-05),
	 ('variance', 0.037257618046768089),
	 ('skewness', -0.017803991805898242),
	 ('kurtosis', 1.1379394743349964)])

Here, we see that as we increase the number of samples we draw from the underlying data set, while randomly flipping them, the mean tends to diminish. The skewness does not behave quite so well, though a coarser smoothing kernel (larger bin sizes for the histograms) may help. In any case, the following figure suggests that, with large enough sample sizes, randomly flipping the data does help balance it out.

hist7.png

Figure 12: abs(angle)>0.01 - Random Flipping

The sflip utility function flips not only the target labels—the steering angles—but also the images (as it must). We check that by again displaying the 3 samples from driving_log_overtrain.csv as above, but this time with each of them flipped.

road7.png

Figure 13: Large Negative Steering Angle Flipped

road8.png

Figure 14: Large Positive Steering Angle Flipped

road9.png

Figure 15: Neutral Steering Angle Flipped

If we compare these 3 figures, which depict the flipped samples from the driving_log_overtrain.csv set, with the original unflipped samples in the figures above, we can confirm the expected results. The images are indeed horizontally-reflected mirror images, and the corresponding steering angles indeed have their signs flipped (though, trivially for the neutral-steering case). Armed with some intuition about the data we can now turn to developing and training the model.

5 Modeling

There are many approaches to selecting or developing a model. The one I took was to assume that there is wisdom and experience already embedded in the Deep Learning models developed in the autonomous-vehicle community. I decided to choose one of those models as a starting point, and then build on it or adapt it as needed. There are many to choose from, but two well-known and often-used ones are a model from comma.ai and a model from NVIDIA. Comparing these models, we see some general shared characteristics.

  • stacks of scaling, convolutional, flattening, fully-connected, and readout layers
  • start with a non-trainable normalization layer that scales the input so that each pixel color channel is in [-1.0, 1.0] (sketched just after this list)
  • no sigmoid, softmax, or one-hot encoding
  • consider adding pooling and dropout layers
  • optimize MSE using the Adam optimizer, so that at least I do not have to worry about learning-rate parameters
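
The normalization layer mentioned above can be expressed directly in Keras, much as the comma.ai model does. The snippet is only a sketch; the scaling constants and layer name are assumptions, not necessarily what the final model uses.

from keras.models import Sequential
from keras.layers import Lambda

model = Sequential()
# Non-trainable scaling layer: maps each 8-bit color channel from [0, 255] to [-1.0, 1.0].
model.add(Lambda(lambda x: x / 127.5 - 1.0, input_shape=(160, 320, 3), name="Normalization"))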

An example model is laid out in Keras code below. It's not redacted because, while it is a real Keras model, it is not the one that I used and it is almost certainly far too simple for the task at hand. It's provided for illustration purposes only.

from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense
from keras.utils.visualize_util import plot

model = Sequential()
model.add(Convolution2D(24, 3, 3, subsample=(2,2), name="Convolution2D1", activation="relu", input_shape=(160,320,3)))
model.add(Flatten(name="Flatten"))
model.add(Dense(1, activation="relu", name="Readout"))
model.compile(loss="mse", optimizer="adam")
model.summary()
plot(model, to_file="model.png", show_shapes=True)
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
Convolution2D1 (Convolution2D)   (None, 79, 159, 24)   672         convolution2d_input_1[0][0]      
____________________________________________________________________________________________________
Flatten (Flatten)                (None, 301464)        0           Convolution2D1[0][0]             
____________________________________________________________________________________________________
Readout (Dense)                  (None, 1)             301465      Flatten[0][0]                    
====================================================================================================
Total params: 302,137
Trainable params: 302,137
Non-trainable params: 0
____________________________________________________________________________________________________

model.png

Figure 16: Neural-Net Architecture

6 Training

The data-processing pipeline leans heavily on the composability of Python generators. They are used in a Python function to assemble pipelines, both for training and for validation.

def pipeline(filename):
    # randomly cycle through cached, loaded samples (images + angles)
    samples = select(rcycle(fetch(select(split(feed(filename)), [0,3]))), [0,1])
    # group the samples
    groups = group(samples, theta.batch_size)
    # turn the groups into batches (NumPy arrays)
    batches = batch(transpose(groups))
    # return the batch generator
    return batches
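
The helper generators themselves (feed, split, select, fetch, rcycle, group, transpose, and batch) are described elsewhere and not reproduced here. To give a flavor of the style, here are hypothetical versions of the last three stages; these are simplified sketches, not the actual implementations.

import numpy as np
from itertools import islice

def group(samples, size):
    # Yield successive lists of `size` samples drawn from a sample generator.
    it = iter(samples)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def transpose(groups):
    # Turn each group of (image, angle) pairs into an (images, angles) pair of tuples.
    for g in groups:
        yield tuple(zip(*g))

def batch(pairs):
    # Convert each (images, angles) pair into (X, y) NumPy arrays for Keras.
    for images, angles in pairs:
        yield np.array(images), np.array(angles)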

As another sanity check, we conduct a small training (3 epochs, 30 samples per epoch, batch size 10) of the data in driving_log_overtrain.csv. This is just to "get our feet wet" and quickly to verify that the code written above even works. Note that we use the same file for the validation set. This is just a test, so it does not really matter what we use for the validation set.

INPUT_SHAPE = [160, 320, 3]
SAMPLES_PER_EPOCH = 30
VALID_SAMPLES_PER_EPOCH = 30
EPOCHS = 3
BATCH_SIZE = 10

traingen = pipeline("driving_log_overtrain.csv")
validgen = pipeline("driving_log_overtrain.csv")

history = model.fit_generator(
    traingen,
    SAMPLES_PER_EPOCH,
    EPOCHS,
    validation_data=validgen,
    verbose=2,
    nb_val_samples=VALID_SAMPLES_PER_EPOCH)

Next, we perform the actual training on the driving_log_train.csv file, validating against the driving_log_validation.csv file. After this training we actually save the model to model.json and the model weights to model.h5, files suitable for input into the network service in drive.py.

SAMPLES_PER_EPOCH = 7000
VALID_SAMPLES_PER_EPOCH = 1036
EPOCHS = 3
BATCH_SIZE = 100

traingen = pipeline("driving_log_train.csv")
validgen = pipeline("driving_log_validation.csv")

history = model.fit_generator(
    traingen,
    SAMPLES_PER_EPOCH,
    EPOCHS,
    validation_data=validgen,
    verbose=2,
    nb_val_samples=VALID_SAMPLES_PER_EPOCH)

Ta-da! We now have a trained model whose architecture and weights we can save in order to load into the network service that operates the simulator.

model.save_weights("model.h5")
open("model.json", "w").write(model.to_json())
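
On the receiving end, the network service has to reconstitute the model from these two files before it can serve predictions. A minimal sketch of that loading step, using the same Keras 1.x API as above (the actual contents of drive.py are not shown here):

from keras.models import model_from_json

# Rebuild the architecture from the JSON description, then restore the trained weights.
with open("model.json") as f:
    model = model_from_json(f.read())
model.load_weights("model.h5")
model.compile(loss="mse", optimizer="adam")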

7 Lessons

This was a whirlwind tour of applying behavioral cloning to a toy example for self-driving cars, and to protect the innocent a lot of the technical details had to be withheld. Still, we can summarize some key lessons learned in the exercise.

  • Think outside the box for tooling. Not everything need be done in just one language, such as Python.
  • Consider generating a tiny overtraining sample for testing your pipeline and model against an expected "non-result."
  • As in any scientific inquiry, think carefully and pose a question that can reasonably be answered with the resources you have.
  • If presented with a question—or a homework assignment—read the problem statement and its objectives carefully, so it can guide your investigation.
  • Embrace the notion of Getting a Pulse, using test data like your overtrain sample to help work through the mechanics of your data pipeline, your model, and your training code, irrespective of the accuracy of the predictions it generates.
  • Spend a considerable amount of time performing exploratory data analysis to gain intuition about the problem space.
  • Think carefully about what you already believe about the problem space and try to use it to your advantage. At the risk of overgeneralizing, simple operations like resizing and cropping images may save your model from having to learn things you already know.
  • Read the documentation! You may find some delightfully useful tools buried in your toolboxes.
  • When using tools like Keras and TensorFlow, which introduce an alternative programming model (much as using another language altogether would), consider the trade-offs among them.
  • For instance, you might consider performing pre-processing steps like normalization, resizing, and cropping right in the model. It simplifies your code and you get those operations for free wherever the model is used, at the possible expense of more numerical processing.
  • Likewise, try to keep in mind sound software engineering principles, like the classic space-time trade-off. For instance, you may be able to fit all your data in memory after all, and if you can, it may speed up training considerably. This will not always be the case, but pay attention to when it is.
  • Think carefully about your model and what it is—and isn't—capable of. For instance, if you have a neural network with no memory or anticipatory functions, you might downplay the importance of features within your data that contain information about the future as opposed to features that contain information about the present.
  • Consider starting simple and adding complexity only as needed, rather than the other way around. For instance, start with a small model and set aside pre-processing steps like augmentation until and unless they become necessary. Smaller models and less pre-processing translate to faster training and more rapid iteration.
  • Likewise, consider stripping away complexity that adds friction to the iteration cycle. If you have a small model, a small training set, and you find cloud-based GPUs unwieldy, run some initial experiments to see if you can iterate right on your local environment. You might not be able to, but if you can it is often worth it.
  • Finally, learn Python generators!