Self-driving car based on deep learning

Generalization: automated driving on a yet unknown complex track (compared to training tracks).
Note: “jumpy” steering reflects toy RC car limitations: it turns 45° to the left/right or drives straight ahead.
(Music: GoNotGently)

After struggling to make a neural net that would predict steering commands reliably for an autonomous toy RC car, only based on the current camera view (no history), I approached the problem systematically in a robot simulator, which allowed for faster experimentation, finally leading to success.

Training examples: manual driving with arrow keys to create a perfect left/right turn.
The purple “Trail” shows the driven path (geometrically clean after several tries).

With only two simple training tracks, one with a 90° left curve and the other one with a 90° right curve, I was able to teach reliable driving behavior. The neural net generalizes better than expected, such that the self-driving car stays on the “road”, even for tracks differing significantly from the training data.

Given more varied examples of successful steering, the driving behavior could become a lot smoother than the video shows. But interestingly, the convolutional neural network (CNN) seems to interpolate nicely between the provided training examples, and is able to handle unknown degrees of road bends.

It even manages to drive through road crossings (see after the break), if a little awkwardly, since crossings “look confusing” and were never trained. When positioned outside of the track facing it at a slight angle, the car also manages to steer in the “hinted” direction and aligns properly with the track!

Neural network architecture

After experimenting with various architectures, I settled on a transfer learning approach, to reduce training time and the number of training examples needed. As the predictions need to be made with low latency, such that the car can react in real time, the model should be relatively small.

MATLAB’s documentation has a comparison of the performance characteristics of pretrained models, including a nice plot comparing prediction accuracy vs. time (note that time there is relative to the fastest model, and not in seconds). I use PyTorch with the fastai wrapper library, but the documentation of pretrained models there focuses mostly on accuracy.

I ended up picking resnet18, which is a convolutional neural network (CNN) with a depth of 18 layers, and a native image input size of 224×224 pixels. It provides a good balance between accuracy and speed in my experiments (and with my hardware setup).

Collecting training data for the CNN

Training data is obtained by manually driving the car in a robot simulator (V-REP) with the keyboard (arrow keys). A Python program I wrote controls V-REP and reacts to key strokes. It also stores each vision sensor frame it receives from the simulator, and labels it with the direction the wheels are pointing in during that frame.

The examples above show what typical frames look like for the three possible categories: driving forward, driving left (and forward), driving right (and forward). There is no way to identify whether the car is driving backwards or is stopped from a single frame. Automated driving can be engaged and stopped manually by a key stroke.

Also note that the real toy RC car can only steer the wheels to one of three discrete angles. Accordingly, I designed the simulated car so it can only steer the wheels to +45° (left turn), -45° (right turn), or 0° (forward direction).
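The three steering categories and their wheel angles can be captured in a small lookup table. The following is only an illustrative sketch: the names `Steering`, `WHEEL_ANGLE_DEG`, and `label_for_keys` are hypothetical and not taken from the actual controller program, and the handling of simultaneous key presses is an assumption.

```python
from enum import Enum


class Steering(Enum):
    """The three categories the classifier distinguishes."""
    FORWARD = "forward"
    LEFT = "left"
    RIGHT = "right"


# Wheel angle per category, using the sign convention from the text:
# positive = left turn, negative = right turn.
WHEEL_ANGLE_DEG = {
    Steering.FORWARD: 0.0,
    Steering.LEFT: 45.0,
    Steering.RIGHT: -45.0,
}


def label_for_keys(left_pressed: bool, right_pressed: bool) -> Steering:
    """Map the arrow-key state during a frame to its training label.

    Assumption: pressing both keys (or neither) counts as driving straight.
    """
    if left_pressed and not right_pressed:
        return Steering.LEFT
    if right_pressed and not left_pressed:
        return Steering.RIGHT
    return Steering.FORWARD
```

Keeping the label and the wheel angle in one place makes it harder for the recorded labels and the simulated steering to drift apart.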

Each frame is an RGB image of 32×32 pixels (small size to reduce training time as well as prediction duration). The uniform colors (white for the sky, yellow for the road, brown/red for the surroundings) are the result of applying filtering to the original frames obtained from the vision sensor, which is mounted on the front of the simulated car. As can be seen in the introductory video, the unprocessed and higher resolution view is also available (see car_VisionSensorRealView), for comparison.
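The exact filter is not spelled out here, so the following is just one plausible way to get such uniform colors: snap every pixel to the nearest entry of a three-color palette. The RGB values in `PALETTE` and the function names are assumptions made for illustration.

```python
# Hypothetical palette; the exact colors used for the frames are not stated.
PALETTE = {
    "sky": (255, 255, 255),          # white
    "road": (255, 220, 0),           # yellow
    "surroundings": (150, 60, 40),   # brown/red
}


def nearest_palette_color(pixel):
    """Return the palette color closest to `pixel` (squared RGB distance)."""
    return min(
        PALETTE.values(),
        key=lambda color: sum((p - c) ** 2 for p, c in zip(pixel, color)),
    )


def quantize_frame(frame):
    """Snap every pixel of a frame (rows of RGB tuples) to the palette."""
    return [[nearest_palette_color(px) for px in row] for row in frame]
```

A frame processed this way contains at most three distinct colors, which removes shading and texture noise before training.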

The mentioned color preprocessing reduces noise in the training data, and ensures we can observe what kind of training input works in principle, without being led astray by irrelevant features. Originally, I had problems with training data obtained from the real toy RC car, but also in the simulation when using non-square image sizes (unknown effects of upscaling), too small a FOV for the vision sensor, “confusing” driving paths, and other (simulated) hardware issues related to timing and positioning. (More on this below.) So simplifying the input as much as possible, while retaining enough information, was a key goal.

More efficient training

Even with all those measures, training was tediously slow and didn’t allow for fast enough experimentation. This is why I ended up using Google Colab, which offers free computational resources for training neural networks, in the form of Jupyter notebooks. Since the fastai library is now available in Colab by default, and training data can be imported from Google Drive, the process is reasonably efficient.

Faster training and experimentation cycles would still be desirable, as deep learning still lacks the comfort common in traditional programming: fast develop-and-debug iterations with quick inspection of (intermediate) results. Maybe this limit can be mitigated with a smarter choice of training examples and more problem analysis, such that the right direction can be identified faster with simple models, and accuracy can be improved later with more data (and more complex models). There are some machine learning tools to identify problems, such as heatmaps, but explainable AI is still an active research topic.

Evaluation

The final training accuracy was 0.985337 (98.5%) with a training loss of 0.045329 and validation loss of 0.028133. The exact values vary a bit each time training is redone, but remain close in value.

As mentioned before, the learned automated driving generalized better than expected, even managing road crossings that were not in the training data set (see video below). While the car “hesitates” and wants to take a right turn, it does a left turn again when it comes close to the end of the crossing and drives through successfully.

The curvature of the bends also varies constantly, differing clearly from the 90° bend of the training tracks. Yet, most of the time, the car manages to approximate the curves by using the most appropriate steering command available.

Generalizing successfully: handling bends of varying curvature and road crossings not found in the training sets.
(Music: CoolRide)

Overall, I am rather pleased with the results. Even if the driving could be smoother, the car nicely reflects the categories it learned and applies them in unknown situations. To improve the driving behavior, it would be relatively easy to add more training data for bends of varying degrees of curvature. However, this may result in an overfitted and less accurate model, due to contradicting steering examples (see below).

Without adjustments external to the model, the prediction lag was still too high, such that the car sometimes failed to identify street bends in time. But adding synchronization points to the simulation solved this issue, effectively slowing the simulation down enough that PyTorch had time to finish predicting each frame.
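To judge whether a predictor keeps up with the simulation, it helps to measure how long one prediction takes and compare that with the duration of a simulation step. The sketch below assumes `predict` is any callable that takes a frame; the function names and the repeat count are hypothetical.

```python
import time


def mean_latency(predict, frame, repeats=20):
    """Average wall-clock seconds one call to `predict(frame)` takes."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict(frame)
    return (time.perf_counter() - start) / repeats


def keeps_up(latency_s, step_s):
    """True if a prediction finishes within one simulation time step."""
    return latency_s <= step_s
```

If `keeps_up` returns False for the chosen step duration, the predicted command arrives after the car has already moved on, which matches the missed bends described above.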

Using TensorFlow Lite or TensorFlow Lite Micro would likely reduce the prediction lag and might be an upcoming project. This might be necessary especially for the real toy RC car, since no time bending mechanism is known yet 😉 Embedded accelerators, like the Jetson Nano, could also help to improve the prediction latency.

Finally, to make videos of the autonomously driving car, it was important to enable real-time mode in V-REP to reduce jitter (car gets “stuck” briefly). Together with speeding up the video five times, the result is a reasonably fast and smooth recording.

Towards an accurate model

Before I finally made a reliable classifier, as demonstrated above, I struggled for a long time to make reliable steering predictions for the real-world toy RC car. The models showed overfitting or bad accuracy in actual driving, such as being very position sensitive (slight variations of the car’s position on the track caused erratic steering), or stubbornly preferring one direction.

Similar issues with unreliable autonomous steering, due to neural net training issues, don’t seem to be that rare (1, 2, 3, 4, 5). So I was curious to approach the problem in a systematic way and understand it more thoroughly.

The trained models were clearly not focusing on the right features, but it was not obvious if it was a neural network architecture problem, a problem in the training data, the limited field of view of the camera, the lag incurred by video and command streaming, the lag from the NN classifier, varying lighting and sharpness, or noise in the camera frames.

Since the real-world track/road had too many features that could be erroneously correlated with the steering commands (i.e., would not generalize), it was easy to overfit to these examples. Generating sufficiently many tracks with enough variation in the available room was not practical, especially due to the shallow steering angle of the toy car and the limited size of the room, which only allowed for tracks made of a curve bending in one direction.

Simulation allowed for testing larger and more varied tracks, since there were no constraints due to room size, and fewer limits for testing various hardware setups. Being able to avoid buying and evaluating various chassis (and their steering properties), cameras, power supplies, and not having to consider weight constraints, motor properties (fast acceleration), camera mounting options, etc., allowed for quicker experimentation.

Freed from these constraints, I thought about the problem in a fundamental way, to know what information could and could not be obtained from frames in principle, and what properties a model would need to have to predict reliably. Furthermore, I switched to simulation to have more control over timing (such as delays and synchronization).

Fundamental model limits

Thinking systematically about the problem, instead of following the common mantra to just throw more data at it, I identified some key elements that are necessary for any kind of model to be able to predict steering commands from camera / vision sensor frames.

We want to make a model that predicts the steering only based on the current camera frame, i.e., predictions are based on information from the current point in time only. That means we need to have the freshest information possible to predict the most accurate steering command, since we cannot plan ahead and compute a longer driving path based on past data, or data that arrives with a delay.

If we wanted to do that, we would need to model time somehow, and use the information about previous states and possible future ones.

Using a recurrent neural network (RNN) or long short-term memory (LSTM) model would allow for that. It could be beneficial to obtain smoother steering paths over time, and would also reduce the need to react in real time. This may be an option for future work, but is not my goal now.

The three critical factors I identified for successful autonomous driving are:

unambiguous training examples
- different steering only for clearly different camera frames
low-latency data collection and steering prediction
- delays between a camera frame and the corresponding steering command must be minimal
- the in-sequence delay between pairs of (camera frame, steering command) must be minimal, as well
a horizontally wide-angle view of the street ahead
- ideally, starting closely after the front wheels, and showing both horizontal limits of the street

In the following, I’ll elaborate on each of these points.

Unambiguous training examples

The simulated car can only turn left (at a fixed angle, which I set to 45°), or turn right (-45°) or drive straight ahead — this mimics the RC car behavior. These three categories should have completely independent training examples. More specifically, since our model does not consider timing information, frames that are close in time (and are therefore visually similar) should not have different (i.e., contradicting) steering information. Instead, each frame should stand on its own and be sufficient to predict the right steering command.

For example, assume a car is driving along a left curve on a street. If steering is done by approximating the left curve by quickly alternating between turning left (45°) and driving straight ahead (0°), this will create confusing information: we will have fundamentally differing steering commands (once with the wheels steered at a 45° angle and once at a 0° angle) for pairs of frames that only differ slightly visually. Both frames will show the street ahead from almost the same angle/perspective, because the rapidly alternating steering gives the car barely any time to travel far.

Therefore, for example driving that approximates a track by progressive steering adjustments, it is not enough to record frame/steering-angle pairs; we would also need to record time information of some sort.
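A rough way to spot such contradicting examples in a recorded run is to look for consecutive frames that are nearly identical but carry different labels. The sketch below is only an illustration: it assumes frames are flat lists of pixel values, and both the distance measure and the `max_distance` threshold are arbitrary choices.

```python
def frame_distance(a, b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def contradicting_pairs(samples, max_distance=5.0):
    """Indices i where samples i and i+1 look alike but are labeled differently.

    `samples` is a list of (pixels, label) tuples in recording order;
    `max_distance` is an arbitrary similarity threshold (an assumption).
    """
    flagged = []
    for i in range(len(samples) - 1):
        (pix_a, label_a), (pix_b, label_b) = samples[i], samples[i + 1]
        if label_a != label_b and frame_distance(pix_a, pix_b) <= max_distance:
            flagged.append(i)
    return flagged
```

A run driven with one steady steering command per road segment should produce few flagged pairs, mostly at the transitions between segments, whereas a run with rapidly alternating commands will produce many.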

Uncorrelated noise/features (imagine varying vegetation along the road) can be (ab)used to distinguish each frame and thereby deduce a kind of temporal order of the frames. However, those features will not generalize since they are irrelevant “artifacts” of the environment, e.g., it should not matter for steering if there is a tree with or without flowers next to the road. In the simulated environment, the artifacts will be much more subtle, such as interpolation errors of the 3D scene, or random pattern details of the floor tiles.

It makes sense that it will be hard(er) to fit such contradictory training examples without overfitting (i.e., fitting to distinctive features or noise that are not related to steering in general).

Without time information it cannot be decided if steering commands are part of small continuous corrections following a longer term trajectory, or if they are just random noise or otherwise faulty steering commands contradicting previous steering.

A model considering time, maybe like a PID controller with its integral and derivative components, could smooth out such frequent steering changes to estimate the general direction. A timeless model, however, cannot abstract from training data that essentially encodes information over time. As with a movie, some temporal information can be gained from single frames (e.g., by detecting motion blur), but some correlations only become visible when considering several frames (e.g., the motion path). Attempting to extrapolate from missing or too subtle information will result in overfitting.

Since we want to use a CNN to classify single images, such that at each point in time the current vision sensor frame alone is enough to decide the correct steering angle (effectively a timeless model), we have to make sure the training data is timeless as well, i.e., previous frames or time information are not needed as context when classifying the current frame.

Therefore a left curve should be driven with a steady left turn, steering left (45°) during the entire length of the curve, and without any corrections in between. The 45° steering capability limits the possible curvatures of left curves we can train on. Analogous constraints apply to training right curves.

Finally, we should also collect driving data for driving continuously along straight road segments only.

Compare the geometrically perfect driving trails above to the trails generated by approximating a path with frequent steering changes, as shown below.

Driving trail (purple) showing frequently changing steering to approximate a path’s curvature.

The only variations will be in the car’s starting position. But the car should always be parallel to the road, such that no steering corrections are necessary, and a single, continuously repeated steering command is enough to drive successfully along the chosen road segment.

Timing is critical

Given that, besides having the current camera frame, nothing else is available to us, and that we have no way to consider previous steering actions or any other state information, we need to get the steering as good as possible for each point in time. Later correction may not be possible anymore, since we may be in a position where the camera does not provide the necessary clues for correct steering.

For example, the wheels might still be on the street when inside of a street turn, but the front camera only shows the “nature” next to the street, and not the boundaries of the street. Therefore steering in time, before critical information is out of sight, is essential.

For driving the correct path, the time steps dt between steering commands need to have the same (or close to the same) duration. dt must remain steady for the recorded steering examples, but also when the car drives autonomously, as in both cases, the distance driven during each step will depend on the duration of the time step.
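At constant speed v, the distance covered per step is d = v · dt, so any variation in dt directly changes how far the car travels per steering command. A minimal sketch for checking recorded timestamps for such jitter follows; the function names are hypothetical and the timestamps are assumed to be in seconds.

```python
def step_durations(timestamps):
    """Durations dt between consecutive (frame, command) timestamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def max_jitter(timestamps):
    """Largest deviation of any dt from the mean dt, in seconds."""
    dts = step_durations(timestamps)
    mean_dt = sum(dts) / len(dts)
    return max(abs(dt - mean_dt) for dt in dts)


def distance_per_step(speed, dt):
    """Distance covered during one step at constant speed: d = v * dt."""
    return speed * dt
```

Running `max_jitter` over the timestamps of a recording session and of an autonomous run gives a quick indication of whether both were driven with comparable step durations.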

There is no way to verify the actual time step duration from camera frames, and any deviation would result in steering commands that do not do what the predictor expected. If the predictor takes too long for its computations, the car may also have driven out of a street curve already, and will lack any information to correct its course once the street is not visible anymore.

As a corollary, the car needs to drive at a constant speed, since there is no way to measure its current speed from a single camera frame. We only know the frame rate, but it is completely independent of the car’s speed.

At most, we could rely on the degree of motion blur, which however also requires a reasonably rich background / scenery, and nearby objects (far objects do not move much). The speed of change of the environment from one frame to another cannot be used, since the model will only consider one frame at a time.

For a reasonable estimate that merely uses in-frame information, however, we would need to provide a lot of training data that highlights speed changes. The few accelerations and decelerations are unlikely to be descriptive enough as training data for estimating speed.

A major issue with the RC toy car is that it takes quite a while to accelerate to its top speed, which is quite fast and makes it hard to drive in average-size rooms. Also, its steering angle is pretty shallow, which limits the complexity of tracks that can be made in a normal-size room.

I tried to slow it down with a low-value resistor, around 5 Ω, in series with the car’s motor. But it was still too fast, and choosing a higher resistance would make the car’s torque too weak to start. Maybe there is a sweet spot I missed, but I didn’t have any potentiometer or resistors with a high enough “resolution”.

Due to the mass of the car carrying the battery and the Raspberry Pi, it is unlikely I can reduce the long acceleration time. I would need a motor with a higher torque and slower speed. Another gear box might help, but doing such mechanical changes would be more complex than getting a different car. Another option might be to use a different motor driver to add in PWM control.

Adjusting such properties in the simulator is a lot easier. With only a few settings and some code in the Python car controller, I made sure that the car in V-REP has a brief acceleration and then drives at constant speed (by definition/design).

By adding synchronization points in the simulation, there is no or only minimal delay between the camera view and the recorded steering commands, during training data collection. During autonomous driving, the simulator pauses each time the steering predictor is working, ensuring that the effective delay between the arrival of the camera frame and the predicted steering command is minimal, as well.
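The idea behind such synchronization points can be sketched as a lockstep loop: the simulator advances by exactly one step when triggered, and the trigger is only sent after the predictor has answered. `FakeSimulator` below is a hypothetical stand-in used to keep the example self-contained; it is not the V-REP API, and the method names are made up for illustration.

```python
class FakeSimulator:
    """Stand-in for a simulator running in stepped (synchronous) mode."""

    def __init__(self, frames):
        self._frames = list(frames)
        self.step_count = 0
        self.applied = []

    def current_frame(self):
        return self._frames[self.step_count]

    def apply_steering(self, command):
        self.applied.append(command)

    def trigger_step(self):
        """Advance simulated time by exactly one step."""
        self.step_count += 1

    def finished(self):
        return self.step_count >= len(self._frames)


def drive_in_lockstep(sim, predict):
    """Simulated time stands still while `predict` runs."""
    while not sim.finished():
        frame = sim.current_frame()   # frame of the paused simulation
        command = predict(frame)      # may take arbitrarily long
        sim.apply_steering(command)   # command belongs to exactly this frame
        sim.trigger_step()            # only now does simulated time advance
    return sim.applied
```

The price of this design is that simulated time no longer matches wall-clock time, which is acceptable in a simulator but has no direct equivalent for a physical car.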

Complete view of street directly ahead

Finally, the car needs to be able to see obvious features of the street that allow for a clear prediction of a steering command.

The camera should capture as complete a view as possible of the street directly ahead of it. It should not point too far ahead, since the predictions are always about the very next time step. The less information the camera frame contains (about street shape, curvature, and free space left and right of it), the more likely it is that irrelevant features or random contextual artifacts are correlated with the given training steering commands.

These constraints can be met by picking a camera with a large field of view (to see a large horizontal section of the street), and positioning the camera at the very front of the car, pointing slightly down (to see the part of the street immediately ahead).
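The effect of field of view and camera tilt can be estimated with basic trigonometry. The sketch below assumes an ideal pinhole camera over flat ground; the function names and any heights or angles passed in are hypothetical, not measurements from the car.

```python
import math


def visible_width(distance, hfov_deg):
    """Width seen at `distance` along the optical axis of a pinhole camera."""
    return 2.0 * distance * math.tan(math.radians(hfov_deg) / 2.0)


def nearest_ground_distance(height, pitch_deg, vfov_deg):
    """Horizontal distance to the closest visible ground point.

    Assumes flat ground, the camera `height` above it, tilted down by
    `pitch_deg`, and a lower image edge that points less than 90 degrees down.
    """
    lower_edge = math.radians(pitch_deg + vfov_deg / 2.0)
    return height / math.tan(lower_edge)
```

Under these assumptions, a 120° horizontal field of view covers three times the width that a 60° one covers at the same distance, and tilting the camera further down moves the nearest visible ground point closer to the car.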

Summary

Training examples should be completely unambiguous
- During training, differing steering commands (a command can be one of the three: forward, forward-left, forward-right) should not be given for visually similar camera frames.
  - To that end, driving examples that generate training data should follow these rules:
    - strictly turn left in left curves (no corrections/approximations)
    - strictly turn right in right curves
    - strictly drive forward in straight street segments
  - Approximating curves with a quick succession of forward and forward-left or forward and forward-right commands is detrimental.
Timing is critical
- Only information from the current point in time can be used (= current camera frame)
- Delays or jitter of time steps dt (time delta between subsequent pairs of (camera frame, steering command)) will either distort the training data or the path the car drives along in autonomous mode.
  - Long prediction delays may cause the car to miss curves, or steer too late.
  - Car needs to drive at constant speed to ensure a constant dt. It is essential, since we do not measure speed, inferring it from one camera frame alone is not robust, and we have no way to detect a varying dt nor to account for it.
- Similarly, delays between camera frame and steering command should be minimal (when recording training data, and when predicting commands), such that the right correlations are captured and predicted.
Camera should provide complete view of the street immediately ahead
- The view directly in front of the car is necessary to compute the steering command for the next time step, not something further in the distance that might lack the relevant information (and force the model to correlate with random features / artifacts).
- The more information about the street’s shape, curvature, and free space to its left and right is visible, the more likely it is that a model will pick up relevant features instead of focusing on less robust and environment specific ones.
  - To that end, the camera should capture as much as possible of the horizontal section of the street immediately ahead.

Training data and prediction accuracy

During experimentation with the car simulation I made in V-REP, it indeed turned out that perfect driving examples result in significantly lower training and validation loss, as theorized above.

The purple path named “Trail” shows the perfect left turn the car took: starting with a straight line segment, followed by a perfect left arc, and ending in a straight line segment.
A much more unsteady left turn: a mix of short left arcs and straight forward driving segments.

The perfect left turn examples result in a training accuracy of 97% to 98%, with a training and validation loss of 0.087586 and 0.055434, respectively. In contrast, training with the unsteady left turn examples struggled to reach 91% accuracy, with a significantly higher training and validation loss of 0.271898 and 0.156824, respectively.

Other factors improved the self-driving behavior of the car (for both training sets above): switching to a square resolution of 32×32 pixels for the vision sensor, and increasing the vision sensor’s field of view / angle of view to 120°. The first change results in a more natural scaling to the native input size of resnet18, which is 224×224 (exactly 7 times 32 in each dimension). The second one increases the amount of the street that can be viewed at once, which ensures critical features of the street, such as its curvature and overall shape, remain visible even when far inside a street curve.
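Why a square 32×32 frame scales more naturally to 224×224 can be checked with a little arithmetic. The helper names below are hypothetical, and how a library actually resizes frames (squishing, cropping, or padding) depends on its settings, so this only compares the raw scale factors.

```python
def upscale_factors(width, height, target=224):
    """Horizontal and vertical factors needed to reach target x target."""
    return target / width, target / height


def scales_cleanly(width, height, target=224):
    """True if the frame is upscaled uniformly by a whole-number factor."""
    fx, fy = upscale_factors(width, height, target)
    return fx == fy and fx.is_integer()
```

For 32×32 both factors are exactly 7, whereas a non-square size such as 64×48 (a made-up example) would need factors of 3.5 and about 4.67, distorting the aspect ratio if the frame is simply stretched.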

Furthermore, adjusting the vision sensor’s height ensured a good view of the street immediately in front of the car, allowing for a more obvious prediction of the necessary corrective steering command. When viewing the street further ahead, the relationship is less straightforward, or even impossible to derive, because the street is not visible anymore (e.g., when in a street curve).

The neural net is then forced to correlate the recorded steering commands with random artifacts in the corresponding vision sensor frames, as no distinctive (and relevant) information can be “seen”. This makes the prediction dependent on the street’s environment or random features / variations, therefore reducing the ability to generalize and “encouraging” unreliable predictions.

Future work

Besides the obvious next step of transferring this successful approach to a physical RC car, with the given hardware requirements established above, there are further improvements possible.

Automatically generate perfect steering examples

Given known track paths, a mapping to the necessary steering commands could be computed automatically, using classical geometrical algorithms. Letting the car drive along those trajectories would provide us with the training data necessary to correlate camera images and steering commands using deep learning.
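One simple geometric approach, given here only as a sketch of the idea, is to look at the turning direction at each waypoint of the track path using the cross product of the incoming and outgoing directions. The function name, the coordinate convention, and the tolerance are assumptions, and the sketch ignores the car’s kinematics (the fixed 45° wheel angle and the resulting turning radius).

```python
import math


def steering_commands(points, straight_tolerance_deg=1.0):
    """Derive one command per interior waypoint of a 2D path.

    `points` are (x, y) tuples with x to the right and y up, so a
    counterclockwise heading change is a left turn. The tolerance below
    which the path counts as straight is an arbitrary assumption.
    """
    commands = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        ax, ay = x1 - x0, y1 - y0      # incoming direction
        bx, by = x2 - x1, y2 - y1      # outgoing direction
        cross = ax * by - ay * bx      # > 0: turning left
        dot = ax * bx + ay * by
        turn_deg = math.degrees(math.atan2(cross, dot))
        if turn_deg > straight_tolerance_deg:
            commands.append("left")
        elif turn_deg < -straight_tolerance_deg:
            commands.append("right")
        else:
            commands.append("forward")
    return commands
```

A practical generator would additionally have to make sure the arcs of the track match the curvature the car can actually drive with its fixed steering angle.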

Additionally, putting the car in environments that vary visually, but keep the track geometry the same, would help emphasize which features are relevant for steering and which are not, improving overall model quality.

Automation would greatly reduce the laborious collection of training data and simplify training/testing cycles. It can be quite difficult to manually drive a car perfectly along a given path, especially with the limited steering granularity (-45°/0°/+45°) our car provides, coupled with the need for precise timing. Yet ideal examples of driving along perfect left/right turns are needed to obtain an accurate model, as established earlier in this post.

It would also make it possible to introduce some slight variations / random noise in the tracks’ paths to further improve model robustness.

Non-visual training

Experimentation can be quite slow due to the need for manual data collection and training time. A non-visual approach, where plain geometrical paths (rather than perspective 3D renderings of them and their environment) are correlated to steering commands, would allow for fast iterations.

The learned mapping could be easily inspected for errors (compare car position/trajectory and track path), maybe even automatically using geometrical algorithms. Also, training and data collection would be much faster, due to the significant reduction of the required bandwidth.
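Such an automatic comparison of trajectory and track path could, for instance, measure how far the car strays from the track’s centerline. The sketch below uses the standard point-to-segment distance; representing the track as a polyline of (x, y) tuples and the function names are assumptions.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the segment, clamped to its end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def max_deviation(trajectory, track):
    """Largest distance of any car position from the track centerline."""
    segments = list(zip(track, track[1:]))
    return max(
        min(point_segment_distance(p, a, b) for a, b in segments)
        for p in trajectory
    )
```

A run could then be accepted or rejected automatically by comparing `max_deviation` with half the road width.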

This enables testing out various model parameters and learning models, picking the most promising one that can model the required mapping well, and later using camera images as training inputs, instead of path segments.

Timeless vs. time-based models

Sped-up training seems interesting for testing Q-learning, RNNs, LSTMs, or PID controllers and comparing their performance and capability to generalize, especially in comparison with timeless models such as the one used in this post.

Time-based models would make it possible to compensate for previous steering inaccuracies or to gradually approach a target. Human drivers usually plan ahead, and do not just react to current stimuli (= current camera frame), as the implementation shown in the videos does.

Even PID controllers, which are common and relatively basic, will gather information over time and smooth out control signals using derivative and integral terms.
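For reference, a textbook PID controller with a fixed time step can be sketched in a few lines. This is not part of the car’s implementation; what the error signal would be (for example, the lateral offset from the middle of the road) and the gains used are assumptions.

```python
class PID:
    """Textbook PID controller with a fixed time step."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.previous_error = None

    def update(self, error):
        """Return the control output for the current error sample."""
        self.integral += error * self.dt
        if self.previous_error is None:
            derivative = 0.0   # no history yet on the first sample
        else:
            derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The `integral` and `previous_error` attributes are exactly the kind of state over time that the single-frame classifier does not have.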

Finally, time-based models are necessary to detect situations where the car made a steering mistake and went off track. The currently available information will not be enough to detect this (unless other sensors that provide global positioning are available).
