Camera Model

Pinhole Camera Model

Fig: Pinhole Camera Model Intuition – Source: Wikipedia_Pinhole_Model

Fig: Pinhole Camera Model Intuition – Source: Nvidia_Docs

Coordinate System

X_w, Y_w, Z_w : World Coordinate Frame (Reference Frame)
X_c, Y_c, Z_c : Camera Coordinate Frame
u, v : Pixel Coordinate Frame

World Frame (Convention)

Typically this is the robot’s base_link or a map frame.

X_w: Front
Y_w: Left
Z_w: Up

Camera Frame (Convention)

With the camera lens pointing away from the viewer (i.e. looking from behind the camera):

Centre of lens: Origin
X_c: Left -> Right
Y_c: Top -> Bottom
Z_c: Into the image plane, along the optical axis (away from the viewer)
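
The two conventions above differ only by a fixed re-labelling of axes when the camera looks straight along the world's front axis. A minimal sketch of that conversion follows; the mounting (camera looking along +X_w, no roll or tilt, origins coinciding) and the names `R_WORLD_TO_CAM` and `world_to_camera_axes` are assumptions made for illustration.

```python
# Rows are the camera axes written in world coordinates
# (assumed mounting: camera looks along world +X, no roll or tilt).
R_WORLD_TO_CAM = [
    [0.0, -1.0, 0.0],   # X_c (right)   = -Y_w (left)
    [0.0, 0.0, -1.0],   # Y_c (down)    = -Z_w (up)
    [1.0, 0.0, 0.0],    # Z_c (forward) = +X_w (front)
]


def world_to_camera_axes(p_world):
    """Re-express a world-frame point in camera axes (shared origin assumed)."""
    return [sum(r * p for r, p in zip(row, p_world)) for row in R_WORLD_TO_CAM]


# A point 2 m in front, 0.5 m to the left and 1 m up in the world frame
p_cam = world_to_camera_axes([2.0, 0.5, 1.0])
print(p_cam)  # [-0.5, -1.0, 2.0]
```

The point ends up 0.5 to the left of the optical axis (negative X_c), 1 above it (negative Y_c) and 2 in front of the lens (positive Z_c).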

Pixel Frame (Convention)

Viewing the captured image with the camera lens pointing away from the viewer:

Top-left of the frame: Origin
u: Left -> Right
v: Top -> Bottom

Forward Projection

Forward projection maps world coordinates to pixel coordinates:

\[\mathbf{s} = \mathbf{K} \cdot [\mathbf{R} \mid \mathbf{t}] \cdot \mathbf{X}\]

where:

\(\mathbf{s} = \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\) (Homogeneous image coordinates; \(\lambda\) is a scale factor, equal to the depth \(Z_c\) for the \(\mathbf{K}\) given here)
\(\mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\) (Intrinsic matrix: Camera -> Pixel Coordinates)
\([\mathbf{R} \mid \mathbf{t}] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}\) (Extrinsic matrix: World -> Camera Coordinates)
\(\mathbf{X} = \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}\) (Homogeneous world coordinates)
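
The equation above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `project_point` and all numeric values are made up (not from a real calibration), and lens distortion is ignored.

```python
def project_point(K, R, t, X_w):
    """Project a 3D world point to pixel coordinates (u, v) via s = K [R|t] X."""
    # World -> camera: X_c = R @ X_w + t
    X_c = [sum(R[i][j] * X_w[j] for j in range(3)) + t[i] for i in range(3)]
    if X_c[2] <= 0:
        raise ValueError("point is not in front of the camera (Z_c <= 0)")
    # Camera -> homogeneous image coordinates: s = K @ X_c = lambda * [u, v, 1]
    s = [sum(K[i][j] * X_c[j] for j in range(3)) for i in range(3)]
    # Divide by lambda (equal to Z_c here) to obtain pixel coordinates
    return s[0] / s[2], s[1] / s[2]


# Illustrative values only
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # camera axes aligned with world axes
t = [0.0, 0.0, 0.0]                                       # camera at the world origin

u, v = project_point(K, R, t, [0.1, -0.2, 2.0])
print(u, v)  # approximately 345 190
```

The division by the third component is what makes the projection perspective: doubling the depth of the point halves its offset from (c_x, c_y).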

Role of each factor

Extrinsic Matrix – \([\mathbf{R}_{3 \times 3} \mid \mathbf{t}_{3 \times 1}]\)

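Because the extrinsic matrix maps World -> Camera coordinates, a world point is taken into the camera frame as \(\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \mathbf{R} \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + \mathbf{t}\). The three columns of the rotation matrix \(\mathbf{R}_{3 \times 3}\) are therefore the three basis vectors of the world frame expressed in the camera frame, and the translation vector \(\mathbf{t}_{3 \times 1}\) is the position of the world origin expressed in the camera frame. The pose of the camera in the world frame is the inverse transform: orientation \(\mathbf{R}^\top\) and position \(-\mathbf{R}^\top \mathbf{t}\).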
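
Assuming the World -> Camera convention of the forward projection equation (camera coordinates = R · world coordinates + t), the camera's own pose in the world frame follows by inverting the transform. A small sketch; the function name `camera_pose_in_world` and the example values are hypothetical.

```python
def camera_pose_in_world(R, t):
    """Invert a World -> Camera extrinsic: return (R^T, -R^T t)."""
    # The inverse of a rotation matrix is its transpose
    R_T = [[R[j][i] for j in range(3)] for i in range(3)]
    # Camera centre in world coordinates: C = -R^T t
    center = [-sum(R_T[i][j] * t[j] for j in range(3)) for i in range(3)]
    return R_T, center


# Example: rotation of 90 degrees about the Z axis plus a translation
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]

R_cam_in_world, C = camera_pose_in_world(R, t)
print(C)  # [-2.0, 1.0, -3.0]
```

As a sanity check, substituting C back into R · C + t gives the zero vector, i.e. the camera centre sits at the origin of the camera frame.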

Intrinsic Matrix – \(\mathbf{K}_{3 \times 3}\)

f_x and f_y lie on the diagonal and convert metric units on the image plane (e.g. mm) into pixels. The two values can differ because the sensor's pixels are not necessarily square, so the pixel size along the x and y axes may not match.

\[f_x = \frac{\text{focal length in mm}}{\text{pixel size in mm per pixel (x axis)}}\]

c_x and c_y are the pixel coordinates of the principal point, i.e. the point where the optical axis intersects the image plane, measured from the origin of the pixel coordinate frame. Ideally the principal point coincides with the geometric centre of the image, but because of manufacturing tolerances the optical axis usually passes close to that centre rather than exactly through it.
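
As a rough worked example, K can be assembled from datasheet-style numbers. The lens and sensor values below are hypothetical, the helper name `intrinsic_matrix` is made up, and placing the principal point at the exact image centre is the ideal-case assumption; a real calibration would estimate all four parameters.

```python
def intrinsic_matrix(focal_mm, pixel_w_mm, pixel_h_mm, width_px, height_px):
    """Build an idealised K from lens and sensor specifications."""
    f_x = focal_mm / pixel_w_mm   # focal length in pixels along x
    f_y = focal_mm / pixel_h_mm   # focal length in pixels along y
    c_x = width_px / 2.0          # ideal principal point: image centre
    c_y = height_px / 2.0
    return [[f_x, 0.0, c_x],
            [0.0, f_y, c_y],
            [0.0, 0.0, 1.0]]


# Hypothetical camera: 4 mm lens, 2 micrometre square pixels, 1920 x 1080 image
K = intrinsic_matrix(4.0, 0.002, 0.002, 1920, 1080)
print(K[0][0], K[0][2], K[1][2])  # f_x is about 2000 px; c_x = 960.0, c_y = 540.0
```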

Backward Projection

Backward projection recovers world coordinates from pixel coordinates. In most real-world scenarios this is what we need, rather than forward projection. Image formation projects the 3D scene onto a 2D plane, so one dimension of information is lost: a single pixel corresponds to a whole ray of possible 3D points. To obtain a unique solution (a single world coordinate) from a pixel, we therefore need the lost dimension, i.e. the depth value at that image coordinate.

For this use case we use depth cameras, which provide Z_c, i.e. the z-coordinate in the camera frame.
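
Given Z_c for a pixel, the forward projection can be inverted step by step. The sketch below assumes the zero-skew K and the World -> Camera extrinsic defined earlier, ignores lens distortion, and uses a made-up function name `backproject_pixel` with illustrative values.

```python
def backproject_pixel(K, R, t, u, v, depth):
    """Recover the world point seen at pixel (u, v) with camera-frame depth Z_c."""
    f_x, f_y = K[0][0], K[1][1]
    c_x, c_y = K[0][2], K[1][2]
    # Pixel + depth -> camera frame (inverse of K, zero skew assumed)
    X_c = [(u - c_x) * depth / f_x, (v - c_y) * depth / f_y, depth]
    # Camera -> world: X_w = R^T (X_c - t)
    d = [X_c[i] - t[i] for i in range(3)]
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]


# Illustrative values only
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

X_w = backproject_pixel(K, R, t, 345.0, 190.0, 2.0)
print(X_w)  # approximately [0.1, -0.2, 2.0]
```

Without the depth argument only the direction of the ray through the pixel is known, which is why a plain RGB image is not enough on its own.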

Distortion

Because of the curved (spherical) shape of the lens, captured images suffer from unwanted distortion: straight lines in the scene can appear bent, so we do not obtain the perfectly rectilinear image we expect. Such distortion can be corrected by calibrating the camera.

Fig: Common types of distortions in image – Source: GFG_Distortion_Examples
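
The text above does not name a distortion model. One commonly used choice, assumed here purely for illustration, is polynomial radial distortion applied to normalized camera coordinates (x = X_c / Z_c, y = Y_c / Z_c), with coefficients k1 and k2 that calibration would estimate. The function names and coefficient values in this sketch are hypothetical.

```python
def distort(x, y, k1, k2=0.0):
    """Apply radial distortion to normalized camera coordinates (x, y)."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * factor, y * factor


def undistort(x_d, y_d, k1, k2=0.0, iterations=20):
    """Invert the radial model by fixed-point iteration (suits mild distortion)."""
    x, y = x_d, y_d  # initial guess: the distorted point itself
    for _ in range(iterations):
        r2 = x * x + y * y
        factor = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = x_d / factor, y_d / factor
    return x, y


# Negative k1 pulls points towards the centre (barrel distortion)
x_d, y_d = distort(0.5, 0.0, k1=-0.2)
x_u, y_u = undistort(x_d, y_d, k1=-0.2)
print(x_d, x_u)  # about 0.475, then back to about 0.5
```

The radial model has no closed-form inverse in general, which is why the correction step iterates instead of solving directly.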