**Vision for cone detection**

**Tyler Folsom November 2, 2012**

The Arduino processor is not really suitable for vision, but it might be possible. Imaging could be based on a webcam streaming video over USB to an Arduino Mega ADK. A 320×240 RGB image at 25 frames per second is about 5.8 million bytes per second, and the Arduino clock speed is only 16 MHz. The Arduino has 248 KB of available flash memory, but only 8 KB of on-chip SRAM. If only a 128×128 monochrome region of interest (ROI) is stored, storage required per image is 16.4 KB, so the ROI would have to be held in external memory or reduced further. Every second or third frame might be acquired, with acquisition alternating with processing.
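The figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes 3 bytes per RGB pixel and 1 byte per stored ROI pixel; the variable names are illustrative only.

```python
# Back-of-envelope numbers for the camera stream described above.
WIDTH, HEIGHT = 320, 240      # frame size in pixels
BYTES_PER_PIXEL = 3           # assumption: 8 bits each for R, G and B
FPS = 25                      # frames per second
CLOCK_HZ = 16_000_000         # Arduino clock speed

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
stream_bytes_per_s = frame_bytes * FPS
cycles_per_byte = CLOCK_HZ / stream_bytes_per_s   # clock cycles available per incoming byte
roi_bytes = 128 * 128                             # assumption: 1 byte per ROI pixel

print(frame_bytes, stream_bytes_per_s, round(cycles_per_byte, 2), roi_bytes)
```

At the full frame rate the processor has fewer than three clock cycles per incoming byte, which is why skipping frames and pixels matters so much in what follows.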

A TI Tiva, Digilent chipKIT Max32, Raspberry Pi, or smartphone may be a more sensible vision platform. The discussion below is about optimizing the vision system to run on a slow processor with limited memory. Arduino has no operating system, so it does not support OpenCV. Development can be done on a desktop, and the source code can then be adapted to run on the microcontroller.

**RGB to mono image transformation**

Transform from a color image to a monochrome image that is optimized to reveal cones. One way to do this is to transform from RGB to HSV (Hue, Saturation, Value) or the closely related HSI (Hue, Saturation, Intensity), keep the hue, and discard the saturation and value. A more sophisticated way is to analyze images of cones and find their color range.

Method: For a collection of images, manually find the trapezoid defining the cone in each. Calculate center, size, and orientation of cone.

Transform images to HSI. Write a program that reads an image and the cone trapezoid. The program should calculate the mean pixel value within the cone, the mean pixel value for the background, and the standard deviations. These should be computed for each of the three H, S and I planes.
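The statistics program described above might look like the following sketch. It assumes an upright, symmetric trapezoid given as (top row, bottom row, center column, top width, base width) and ignores cone orientation; the function names and the trapezoid encoding are hypothetical. It would be run once for each of the H, S and I planes.

```python
from statistics import mean, pstdev

def in_trapezoid(col, row, top_row, bottom_row, center_col, top_w, base_w):
    """True if pixel (col, row) lies inside an upright, symmetric trapezoid."""
    if not top_row <= row <= bottom_row:
        return False
    t = (row - top_row) / (bottom_row - top_row)      # 0 at the top, 1 at the base
    half_width = 0.5 * (top_w + t * (base_w - top_w))
    return abs(col - center_col) <= half_width

def region_stats(plane, trapezoid):
    """Return ((mean, std) of cone pixels, (mean, std) of background pixels)."""
    cone, background = [], []
    for row, line in enumerate(plane):
        for col, value in enumerate(line):
            target = cone if in_trapezoid(col, row, *trapezoid) else background
            target.append(value)
    return (mean(cone), pstdev(cone)), (mean(background), pstdev(background))

# Synthetic 8x8 plane: value 20 inside the trapezoid, 100 outside.
trap = (1, 6, 3.5, 2, 6)
plane = [[20 if in_trapezoid(c, r, *trap) else 100 for c in range(8)]
         for r in range(8)]
cone_stats, background_stats = region_stats(plane, trap)
print(cone_stats, background_stats)
```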

For each of H, S and I, compute the pixel range in which cone is more likely than background. Compute P(p | C) and P(p | B): the probability that a pixel value is within the determined range, given that the pixel is in the cone region, and the probability that it is within the determined range, given that the pixel is in the background region.

Bayes' formula says P(C | p) = P(p | C) P(C) / P(p)

P(p) = total number of pixels in the determined range / total number of pixels

P(C) = area of cone / area of ROI

Thus within each of H, S, and I we assess the ability of a value range to predict a cone. For example, we might have P(p_{H}) = P(10 < hue < 26) = 0.16

Chance of cone = P(C) = 0.12

P(10 < hue < 26 | C) = 0.78

P(C | 10 < hue < 26) = 0.78 * 0.12 / 0.16 = 0.585

Similarly, we might find

P(C | 58 < intensity < 183) = 0.36

P(C | 97 < saturation < 205) = 0.15

Since P(C | p_{S}) = 0.15 is only slightly bigger than the chance level P(C) = 0.12, saturation provides almost no information and can be discarded. Measured as gain over chance, hue has about twice the predictive power of intensity. Thus we might create a ConeTone monochrome image from (2*hue + intensity) / 3.
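The posterior can be computed directly from pixel counts. The sketch below uses Bayes' rule in its standard form, P(C | p) = P(p | C) P(C) / P(p); the counts are hypothetical values chosen only for illustration, and the function name is not from any library.

```python
def posterior_cone(n_total, n_cone, n_in_range, n_cone_in_range):
    """P(C | p) from pixel counts, via Bayes' rule."""
    p_c = n_cone / n_total                   # P(C): fraction of ROI covered by cone
    p_p = n_in_range / n_total               # P(p): fraction of pixels in the range
    p_p_given_c = n_cone_in_range / n_cone   # P(p | C)
    return p_p_given_c * p_c / p_p

# Hypothetical counts: 10000 ROI pixels, 1200 on the cone,
# 1600 with hue in range, 936 of those on the cone.
posterior = posterior_cone(10000, 1200, 1600, 936)
prior = 1200 / 10000
print(round(posterior, 3), round(posterior / prior, 2))   # posterior and its lift over chance
```

The same result follows from dividing the number of in-range cone pixels by the number of in-range pixels, which is a useful cross-check on the counts.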

We would then go through the same exercise and determine if there is a range of ConeTone that produces a better prediction than P(C | p_{H}). If not, stick with hue, or come up with a nonlinear way of combining hue and intensity. You might want to transform the images so that both p_{H} and p_{I} have the same range and mean.
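One way to put hue and intensity on a common footing before combining them is to standardize each plane to zero mean and unit standard deviation. The sketch below does this and then applies the 2:1 weighting; it treats hue as a linear quantity, which is only reasonable if cone hues do not straddle the 0/360 wraparound. The function names and the choice of standardization are assumptions, not part of the method above.

```python
from statistics import mean, pstdev

def standardize(values):
    """Shift and scale values to zero mean and unit standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]

def cone_tone(hue, intensity, w_hue=2.0, w_int=1.0):
    """Weighted combination of standardized hue and intensity planes (flattened)."""
    h, i = standardize(hue), standardize(intensity)
    return [(w_hue * a + w_int * b) / (w_hue + w_int) for a, b in zip(h, i)]

tone = cone_tone([10, 20, 30], [100, 200, 300])
print([round(t, 3) for t in tone])
```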

**Cone position prediction**

The position of the robot is known, and the position of the next cone is known. There is some inaccuracy in both estimated positions; the job of vision is to refine the estimate of cone location. We can arbitrarily set the altitude of the robot to zero. If we have map information that shows the expected difference in altitude at the cone location, we can set the Z cone location; otherwise set cone Z to 0 or its last estimated value.

The robot location and attitude may not match those of the camera. For simplicity, assume that the camera is located at the center of the robot and pointed the same direction. In the world frame of reference, let the camera be at (X1w, Y1w, Z1w) and the cone at (X2w, Y2w, Z2w) where distances are in meters, X is east, Y is north, and Z is up. Note that this differs from much of the computer graphics literature, which often assigns Z to depth.

Distance to the cone = d = Sqrt((X1w-X2w)^{2} + (Y1w-Y2w)^{2} + (Z1w-Z2w)^{2})

In the camera frame of reference, the center of the lens is located at (0,0,0) and the axes are aligned with the camera, with the vehicle pointing in the X direction. The camera axis is (x, 0, 0) and the image is projected to the yz plane with the y axis pointing left and z axis up. Let the rotation of the robot/camera relative to the world be (a, b, c) = (roll, pitch, yaw). When the world is rotated to align with the camera, a point on the camera axis goes to world points (X3w, Y3w, Z3w) by

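$$
\begin{bmatrix} X3w \\ Y3w \\ Z3w \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos a & -\sin a \\ 0 & \sin a & \cos a \end{bmatrix}
\begin{bmatrix} \cos b & 0 & \sin b \\ 0 & 1 & 0 \\ -\sin b & 0 & \cos b \end{bmatrix}
\begin{bmatrix} \cos c & -\sin c & 0 \\ \sin c & \cos c & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ 0 \\ 0 \end{bmatrix}
$$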

Thus the center of the image is the ray

(y, z) = (x[cos(a)sin(c) + sin(a)sin(b)cos(c)], x[sin(a)sin(c) – cos(a)sin(b)cos(c)])

If there is no roll or pitch, (y, z) = (x sin(c), 0).

The new distance to the object may decrease: X3w = x cos(b) cos(c)
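The rotation of a camera-axis point can be checked numerically against the closed-form expressions. This sketch applies the yaw, pitch and roll matrices in that order (rightmost matrix first); the helper names are illustrative and angles are in radians.

```python
from math import cos, sin

def rot_x(a):   # roll
    return [[1, 0, 0], [0, cos(a), -sin(a)], [0, sin(a), cos(a)]]

def rot_y(b):   # pitch
    return [[cos(b), 0, sin(b)], [0, 1, 0], [-sin(b), 0, cos(b)]]

def rot_z(c):   # yaw
    return [[cos(c), -sin(c), 0], [sin(c), cos(c), 0], [0, 0, 1]]

def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def camera_axis_to_world(x, a, b, c):
    """Compute Rx(a) * Ry(b) * Rz(c) * (x, 0, 0)."""
    v = [x, 0.0, 0.0]
    for rot in (rot_z(c), rot_y(b), rot_x(a)):   # rightmost matrix acts first
        v = mat_vec(rot, v)
    return v

x, a, b, c = 5.0, 0.1, 0.2, 0.3
X3w, Y3w, Z3w = camera_axis_to_world(x, a, b, c)
print(X3w, Y3w, Z3w)
```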

Now consider the camera focal plane, which is a distance f behind the lens (f is usually given in mm; convert it to meters to match the other units). Let the pixel size be wP x hP meters and the image size be WIDTH x HEIGHT pixels.

Any point on the camera axis will go to pixel (WIDTH/2, HEIGHT/2). If the cone is known to be L meters high and at a distance d, the image on the camera focal plane will be L * f / d meters high, or (L * f) / (d * hP) pixels high.

Now let’s rotate the camera by (-a, -b, -c) to match the world. We reverse the matrix equation above, being careful about the order of the three rotations.

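$$
\begin{bmatrix} X2c \\ Y2c \\ Z2c \end{bmatrix}
=
\begin{bmatrix} \cos c & \sin c & 0 \\ -\sin c & \cos c & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos b & 0 & -\sin b \\ 0 & 1 & 0 \\ \sin b & 0 & \cos b \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos a & \sin a \\ 0 & -\sin a & \cos a \end{bmatrix}
\begin{bmatrix} X2w - X1w \\ Y2w - Y1w \\ Z2w - Z1w \end{bmatrix}
$$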

The cone is at location (Y2c, Z2c) in the camera coordinate system. It will be imaged at pixel

(column, row) = (WIDTH/2 + Y2c * f/(d * wP), HEIGHT/2 – Z2c * f/(d * hP))
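The world-to-pixel prediction can be sketched as follows. The camera parameters (4 mm focal length, 6 micron square pixels, 640 by 480 image) are hypothetical values chosen only for illustration, and the function names are not from any library.

```python
from math import cos, sin, sqrt, radians

def rot_x(a):
    return [[1, 0, 0], [0, cos(a), -sin(a)], [0, sin(a), cos(a)]]

def rot_y(b):
    return [[cos(b), 0, sin(b)], [0, 1, 0], [-sin(b), 0, cos(b)]]

def rot_z(c):
    return [[cos(c), -sin(c), 0], [sin(c), cos(c), 0], [0, 0, 1]]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def world_to_camera(camera_w, cone_w, a, b, c):
    """Compute Rz(-c) * Ry(-b) * Rx(-a) * (cone - camera)."""
    v = [cone_w[i] - camera_w[i] for i in range(3)]
    for rot in (rot_x(-a), rot_y(-b), rot_z(-c)):   # rightmost matrix acts first
        v = mat_vec(rot, v)
    return v

def project(cone_c, f, wP, hP, width, height):
    """Map camera-frame coordinates (X2c, Y2c, Z2c) to (column, row)."""
    x2c, y2c, z2c = cone_c
    d = sqrt(x2c ** 2 + y2c ** 2 + z2c ** 2)        # rotation preserves distance
    return (width / 2 + y2c * f / (d * wP), height / 2 - z2c * f / (d * hP))

# Hypothetical camera: f = 4 mm, 6 micron square pixels, 640 x 480 image.
F, WP, HP, W, H = 0.004, 6e-6, 6e-6, 640, 480
yaw = radians(30)
on_axis = world_to_camera((0, 0, 0), (10 * cos(yaw), 10 * sin(yaw), 0), 0, 0, yaw)
center = project(on_axis, F, WP, HP, W, H)
offset = project(world_to_camera((0, 0, 0), (10, 1, 0), 0, 0, 0), F, WP, HP, W, H)
print(center, offset)
```

A cone lying on the camera axis lands at the image center, which is a quick sanity check on the order of the three rotations.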

**Cone detection algorithm**

Initialize default parameters: e.g. Cone location is (320, 240) out of 640 by 480. Initialize cone height to 64 pixels. Cone size is described by the height. At this scale, the cone is expected to be 6 pixels wide at the top and 25 at the base. Initialize probability of cone to 0. Construct an image of an ideal cone, centered on its center of gravity. Distort this image into a log-polar version and take its Fourier Transform. Call the magnitude of the transform TP.

Loop:

Read expected cone position and size as analog inputs.

If (Expected cone height < 16 pixels or (Probability of cone > 50% and expected cone height <= 32 pixels)) /* Use 64 by 64 region of interest centered on expected cone position. */

skip = 0;

Else if (Probability of cone > 50% and expected cone height > 32 and expected cone height <= 96) /* Reduce a 128 by 128 ROI to 64 by 64. */

skip = 1;

Else if (Probability of cone > 50% and expected cone height <= 196)

/* Reduce a 256 by 256 ROI to 64 by 64. */

skip = 3;

Else /* Reduce 512 by 512 image centered on expected cone position to 64 by 64. */

skip = 7; /* Keep every 8th row and column: 512 / 8 = 64. */
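The ROI selection logic can be written as a small function. This sketch takes the probability as a fraction in 0 to 1 and uses skip = 7 for the widest case so that 512 / (skip + 1) = 64, matching the pattern of the other branches; both choices, and the function names, are assumptions of the sketch.

```python
def choose_skip(expected_height, p_cone):
    """Number of rows/columns to skip between kept pixels."""
    confident = p_cone > 0.5
    if expected_height < 16 or (confident and expected_height <= 32):
        return 0    # use a 64 by 64 ROI directly
    if confident and expected_height <= 96:
        return 1    # reduce 128 by 128 to 64 by 64
    if confident and expected_height <= 196:
        return 3    # reduce 256 by 256 to 64 by 64
    return 7        # reduce 512 by 512 to 64 by 64

def roi_side(skip, reduced_side=64):
    """Side length of the source region that reduces to reduced_side."""
    return reduced_side * (skip + 1)

for height, p in [(10, 0.0), (64, 0.9), (150, 0.9), (64, 0.2)]:
    s = choose_skip(height, p)
    print(height, p, s, roi_side(s))
```

Note that a low-confidence estimate with a cone height of 16 pixels or more falls through to the widest search window.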

Read USB stream of RGB pixels from camera. Skip rows and columns as indicated. Produce a 64 by 64 image centered on expected position. Skip anything not falling into this ROI. For each accepted byte triplet, convert to HSI and save the hue; discard the intensity and saturation.

Theta = arccos( 0.5 * [(R-G) + (R-B)] / sqrt((R-G)^{2} + (R-B)*(G-B)) )

H = Theta if B <= G; otherwise H = 360 - Theta.
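The hue formula translates directly into code. This sketch returns the hue in degrees and returns None for grey pixels (R = G = B), where the denominator is zero and hue is undefined; that convention and the function name are choices of the sketch.

```python
from math import acos, degrees, sqrt

def hsi_hue(r, g, b):
    """HSI hue in degrees (0 to 360), or None when R = G = B."""
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = sqrt((r - g) ** 2 + (r - b) * (g - b))
    if denominator == 0:
        return None
    ratio = max(-1.0, min(1.0, numerator / denominator))   # guard against rounding
    theta = degrees(acos(ratio))
    return theta if b <= g else 360.0 - theta

for rgb in [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 128, 0), (50, 50, 50)]:
    print(rgb, hsi_hue(*rgb))
```

On a microcontroller the arccos and square root are expensive, so a lookup table or an integer approximation would likely be needed in practice.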

[Note: in some cases the algorithm will work better on I than H.] Construct an orangeness image (64 by 64, called S) whose value is 180 minus the angular difference of the hue from orange; e.g. a perfect orange cone color is 180 and the opposite hue, a blue, is 0.
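An orangeness value with this behavior can be computed from the circular distance between hues. The sketch assumes hue in degrees and a reference orange of 30 degrees; the reference value is a placeholder that should come from the cone color analysis, and the function name is hypothetical.

```python
def orangeness(hue_deg, orange_deg=30.0):
    """180 for the reference orange, 0 for the opposite hue."""
    diff = abs(hue_deg - orange_deg) % 360.0
    distance = min(diff, 360.0 - diff)    # circular distance, 0 to 180
    return 180.0 - distance

print(orangeness(30.0), orangeness(210.0), orangeness(350.0))
```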

Form a 64 by 64 image of a perfect cone (C) of the proper size. Take the Fourier transform of the cone image. Keep the magnitude of the transformed cone (call it TC) and discard the phase.

Take Fourier Transform of S. Keep the magnitude of the transformed scene (call it TS) and discard the phase.

Multiply TC by TS, pixel by pixel. Call the resulting 64 by 64 image the Convolution. The code is in Numerical Recipes in C, 2nd ed., p. 543, convlv(), or you can use OpenCV. Sum the columns of the convolution. Sum the rows of the convolution. Find the peak in the column sums and row sums. The (x, y) coordinates of the peaks give the position of the cone. Translate this position back to the 640 by 480 image.
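The peak search and the translation back to full-image coordinates might look like this. The sketch covers only those two steps and takes the 64 by 64 response image as given; it assumes the ROI was centered on the expected position and sampled every (skip + 1) pixels. The function names are illustrative.

```python
def peak_from_projections(response):
    """Return (x, y) of the peaks of the column sums and row sums."""
    col_sums = [sum(col) for col in zip(*response)]
    row_sums = [sum(row) for row in response]
    x = max(range(len(col_sums)), key=col_sums.__getitem__)
    y = max(range(len(row_sums)), key=row_sums.__getitem__)
    return x, y

def roi_to_full(x, y, skip, roi_center, roi_size=64):
    """Map ROI coordinates back to the full image (assumes a centered ROI)."""
    cx, cy = roi_center
    step = skip + 1
    return (cx + (x - roi_size // 2) * step, cy + (y - roi_size // 2) * step)

# Synthetic response with a single bright spot at column 40, row 20.
response = [[0] * 64 for _ in range(64)]
response[20][40] = 9
peak = peak_from_projections(response)
print(peak, roi_to_full(*peak, skip=1, roi_center=(320, 240)))
```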

Distort S into a log-polar image, SL, centered at the cone position, with blank areas filled with 0. Take the Fourier transform of SL. Keep the magnitude (call it TL) and discard the phase. Multiply TL by TP, pixel by pixel, to obtain the convolution, using convlv(). Sum the columns of the convolution. Sum the rows of the convolution. Find the peak in the row sums. This gives the scale of the cone. Exponentiate the scale to get the height of the cone in pixels.
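A log-polar resampling can be sketched with nearest-neighbor lookup. In this layout, which is an assumption of the sketch, rows index log-radius and columns index angle, so a change in object size becomes a shift along the rows, and exponentiating that shift recovers the scale factor. The parameter names are hypothetical.

```python
from math import cos, sin, pi, log, exp

def log_polar(image, center, n_rho=64, n_theta=64, r_min=1.0, r_max=31.0):
    """Resample image onto a log-polar grid; samples outside the image stay 0."""
    height, width = len(image), len(image[0])
    cx, cy = center
    k = log(r_max / r_min) / (n_rho - 1)      # log-radius step per row
    out = [[0] * n_theta for _ in range(n_rho)]
    for i in range(n_rho):
        r = r_min * exp(k * i)
        for j in range(n_theta):
            t = 2 * pi * j / n_theta
            x = int(round(cx + r * cos(t)))
            y = int(round(cy - r * sin(t)))   # image rows increase downward
            if 0 <= x < width and 0 <= y < height:
                out[i][j] = image[y][x]
    return out, k

def scale_from_shift(shift_rows, k):
    """Scale factor corresponding to a shift of shift_rows along the log-radius axis."""
    return exp(k * shift_rows)

flat = [[7] * 64 for _ in range(64)]
warped, k = log_polar(flat, center=(32, 32))
print(len(warped), len(warped[0]), round(scale_from_shift(63, k), 3))
```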

In the original image, find a rectangle of the proper size around the cone. Construct a trapezoid of the expected cone boundaries. Find the average value of all pixels inside the cone boundaries; find the average value of all pixels in the rectangle, but outside the cone boundaries. Their difference is a measure of whether there is any cone present.

Convert the four numbers (horizontal position)/640, (vertical position)/480, cone height in pixels, and probability of cone to four signals in the 0 to 255 range and send them out on analog lines.
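Scaling the four outputs to bytes can be sketched as below. Positions are scaled by the image width and height; the full-scale value for cone height (512 pixels here) is an assumption, since no scaling for it is specified above, and the function names are hypothetical.

```python
def to_byte(value, full_scale):
    """Scale value in [0, full_scale] to an integer in 0..255, clamping out-of-range input."""
    scaled = round(255 * value / full_scale)
    return max(0, min(255, scaled))

def encode_outputs(column, row, height_px, p_cone,
                   width=640, height=480, max_height_px=512):
    """Return the four output signals as bytes."""
    return (to_byte(column, width),
            to_byte(row, height),
            to_byte(height_px, max_height_px),   # assumed full scale
            to_byte(p_cone, 1.0))

print(encode_outputs(640, 0, 64, 1.0))
```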

**Update of cone position**

If the probability of cone detection is high, update the estimated cone position using a Kalman filter.

d = L * f / (hP * cone_pixel_height)

Y2c = (column – WIDTH/2) * d * wP / f

Z2c = -(row – HEIGHT/2) * d * hP / f

X2c = sqrt(d^{2} – Y2c^{2} – Z2c^{2})

Apply the rotation transform to (X2c, Y2c, Z2c) to find the updated world cone coordinates.
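The measurement step can be sketched end to end, from pixel position and height to world coordinates. It assumes that the world position is the rotated camera-frame vector plus the camera position, and it uses hypothetical camera parameters (4 mm focal length, 6 micron pixels, 0.5 m cone). The Kalman filter itself is not shown.

```python
from math import cos, sin, sqrt, pi

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def pixel_to_camera(column, row, cone_pixel_height, L, f, wP, hP, width, height):
    """Return distance d and camera-frame coordinates (X2c, Y2c, Z2c)."""
    d = L * f / (hP * cone_pixel_height)
    y2c = (column - width / 2) * d * wP / f
    z2c = -(row - height / 2) * d * hP / f
    x2c = sqrt(max(0.0, d * d - y2c * y2c - z2c * z2c))
    return d, (x2c, y2c, z2c)

def camera_to_world(cone_c, camera_w, a, b, c):
    """Compute Rx(a) * Ry(b) * Rz(c) * cone_c, then add the camera position."""
    rz = [[cos(c), -sin(c), 0], [sin(c), cos(c), 0], [0, 0, 1]]
    ry = [[cos(b), 0, sin(b)], [0, 1, 0], [-sin(b), 0, cos(b)]]
    rx = [[1, 0, 0], [0, cos(a), -sin(a)], [0, sin(a), cos(a)]]
    v = list(cone_c)
    for rot in (rz, ry, rx):                  # rightmost matrix acts first
        v = mat_vec(rot, v)
    return [v[i] + camera_w[i] for i in range(3)]

# Hypothetical values: 0.5 m cone, f = 4 mm, 6 micron pixels, 640 x 480 image.
L, F, WP, HP, W, H = 0.5, 0.004, 6e-6, 6e-6, 640, 480
pixel_height = L * F / (HP * 10.0)            # height seen at 10 m
d, cone_c = pixel_to_camera(320, 240, pixel_height, L, F, WP, HP, W, H)
cone_w = camera_to_world(cone_c, (1.0, 2.0, 0.0), 0.0, 0.0, pi / 2)
print(round(d, 3), [round(v, 3) for v in cone_w])
```

With the camera at (1, 2, 0) facing north (yaw of 90 degrees), a centered cone at 10 m lands at roughly (1, 12, 0) in world coordinates.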