Camera Model
Pinhole Camera Model
Fig: Pinhole Camera Model Intuition – Source: Wikipedia_Pinhole_Model
Fig: Pinhole Camera Model Intuition – Source: Nvidia_Docs
Coordinate System
Xw, Yw, Zw : World Coordinate Frame (Reference Frame)
Xc, Yc, Zc : Camera Coordinate Frame
u, v : Pixel Coordinate Frame
World Frame (Convention)
Generally, it would be a robot’s base_link or a map frame
Xw: Front
Yw: Left
Zw: Up
Camera Frame (Convention)
With the camera lens pointing away from the viewer (i.e. looking from behind the camera):
Centre of lens: Origin
Xc: Left -> Right
Yc: Top -> Bottom
Zc: Into the image plane (along the optical axis, towards the scene)
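The two conventions above can be related by a fixed rotation. The following is a minimal sketch, assuming the camera looks straight along the world X axis (front) with no tilt or roll; the names `R_WORLD_TO_CAMERA` and `rotate` are illustrative only, and a real mount would add its own rotation and translation on top.

```python
# Rotation taking world-frame axes (X front, Y left, Z up) to camera-frame
# axes (X right, Y down, Z forward). Assumption: the camera looks straight
# along the world X axis with no tilt or roll.
R_WORLD_TO_CAMERA = [
    [0.0, -1.0, 0.0],   # Xc (right)   = -Yw (left)
    [0.0, 0.0, -1.0],   # Yc (down)    = -Zw (up)
    [1.0, 0.0, 0.0],    # Zc (forward) = +Xw (front)
]


def rotate(R, p):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]


# A point 2 m in front, 0.5 m to the left and 1 m up in the world frame.
p_world = [2.0, 0.5, 1.0]
p_camera = rotate(R_WORLD_TO_CAMERA, p_world)
print(p_camera)
```

In the camera frame the same point has a negative Xc (it is to the left), a negative Yc (it is above the optical axis) and a depth Zc of 2 m.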
Pixel Frame (Convention)
With the camera lens pointing away from the viewer (i.e. looking from behind the camera):
Top-left of the frame: Origin
u: Left -> Right
v: Top -> Bottom
Forward Projection
To get the pixel coordinates from the world coordinates:
\(\mathbf{s} = \mathbf{K} \, [\mathbf{R} \mid \mathbf{t}] \, \mathbf{X}\)
where:
\(\mathbf{s} = \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}\) (Homogeneous image coordinates)
\(\mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\) (Intrinsic matrix: Camera -> Pixel Coordinates)
\([\mathbf{R} \mid \mathbf{t}] = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}\) (Extrinsic matrix: World -> Camera Coordinates)
\(\mathbf{X} = \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}\) (Homogeneous world coordinates)
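Forward projection with these quantities can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function name `project_point` and all numeric values (focal lengths, principal point, test point) are assumptions chosen for the example, and lens distortion is ignored.

```python
def project_point(K, R, t, X_world):
    """Project a 3D world point to pixel coordinates (u, v).

    K and R are 3x3 matrices given as lists of rows; t and X_world
    are 3-element lists. Lens distortion is not modelled.
    """
    # World -> camera: X_c = R * X_w + t
    X_cam = [sum(R[i][j] * X_world[j] for j in range(3)) + t[i]
             for i in range(3)]
    # Camera -> homogeneous image coordinates: s = K * X_c
    s = [sum(K[i][j] * X_cam[j] for j in range(3)) for i in range(3)]
    lam = s[2]  # scale factor lambda, equal to the depth Z_c
    if lam <= 0:
        raise ValueError("point is behind the camera or on its plane")
    # Divide by lambda to get (u, v) from lambda * [u, v, 1]
    return s[0] / lam, s[1] / lam


# Hypothetical intrinsics for a 640x480 image.
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
# World frame coincides with the camera frame in this example.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

u, v = project_point(K, R, t, [0.5, -0.25, 2.0])
print(u, v)
```

The division by \(\lambda\) is what makes the mapping a perspective projection: points twice as far away land half as far from the principal point.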
Role of each factor
Extrinsic Matrix – \([\mathbf{R}_{3x3} \mid \mathbf{t}_{3x1}]\)
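With the world -> camera convention used above, a world point maps to the camera frame as \(\mathbf{X}_c = \mathbf{R}\,\mathbf{X}_w + \mathbf{t}\). The three columns of the rotation matrix \(\mathbf{R}_{3x3}\) are therefore the basis vectors of the world frame expressed in the camera frame, and the translation vector \(\mathbf{t}_{3x1}\) is the position of the world origin expressed in the camera frame. The pose of the camera in the world frame is the inverse transform: its axes are the columns of \(\mathbf{R}^T\) and its centre is at \(-\mathbf{R}^T\mathbf{t}\).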
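Because the extrinsic matrix maps world -> camera coordinates (\(\mathbf{X}_c = \mathbf{R}\,\mathbf{X}_w + \mathbf{t}\)), the position of the camera in the world frame is obtained by setting \(\mathbf{X}_c = 0\), which gives \(-\mathbf{R}^T\mathbf{t}\). A small sketch, with a hypothetical helper name `camera_centre_in_world` and made-up extrinsics:

```python
def camera_centre_in_world(R, t):
    """Return the camera centre C = -R^T * t in world coordinates.

    R (3x3, list of rows) and t (3-element list) are the world -> camera
    extrinsics, i.e. X_c = R * X_w + t.
    """
    # (R^T * t)[i] = sum_j R[j][i] * t[j]
    return [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]


# Hypothetical extrinsics: camera looking along the world X axis
# (X front, Y left, Z up), mounted at world position (1.0, 0.0, 0.5).
R = [[0.0, -1.0, 0.0],
     [0.0, 0.0, -1.0],
     [1.0, 0.0, 0.0]]
t = [0.0, 0.5, -1.0]

C = camera_centre_in_world(R, t)
print(C)
```

Note that \(\mathbf{t}\) itself is not the camera position: here the camera sits at (1.0, 0.0, 0.5) in the world frame while \(\mathbf{t}\) is (0.0, 0.5, -1.0).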
Intrinsic Matrix – \(\mathbf{K}_{3x3}\)
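fx and fy sit on the diagonal and are the focal length expressed in pixels, so they convert camera-frame coordinates from metric units (e.g. meters) to pixels. Each one equals the physical focal length divided by the pixel size along that axis. Because the pixels of a manufactured sensor may not be exactly square, fx and fy can have different values.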
cx and cy are the offsets, along the u and v axes of the pixel coordinate frame, of the principal point, i.e. the point where the optical axis intersects the image plane. Ideally the principal point would lie at the geometric centre of the image, but because of manufacturing tolerances it usually falls close to the centre rather than exactly on it.
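As a rough illustration of where fx and fy come from, the sketch below converts a physical focal length to pixels using the pixel size along each axis. The helper name `focal_length_in_pixels` and the sensor numbers are hypothetical; in practice these values are estimated by calibration rather than taken from a datasheet.

```python
def focal_length_in_pixels(focal_length_mm, pixel_pitch_um):
    """Convert a focal length in millimetres to pixels for one sensor axis.

    pixel_pitch_um is the size of one pixel along that axis, in micrometres.
    """
    return focal_length_mm * 1000.0 / pixel_pitch_um


# Hypothetical sensor: 4 mm lens, pixels 2.0 um wide and 2.5 um tall.
fx = focal_length_in_pixels(4.0, 2.0)   # horizontal axis
fy = focal_length_in_pixels(4.0, 2.5)   # vertical axis
print(fx, fy)
```

With non-square pixels the same lens yields two different values, which is why the intrinsic matrix carries fx and fy separately.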
Backward Projection
To get world coordinates from pixel coordinates. In most real-world scenarios, this is what we want to do rather than forward projection. During image formation, the 3D scene is projected onto a 2D plane, so the information along one dimension is lost. A pixel on its own only defines a ray through the camera centre, so for backward projection we need the lost dimension, i.e. the depth value at that pixel, to get a unique solution (a single world coordinate).
For such use cases, we use depth cameras, which provide Zc, i.e. the z-coordinate with respect to the camera frame.
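Given a pixel and its depth Zc, the forward projection can be inverted step by step: first undo the intrinsics to get the point in the camera frame, then undo the extrinsics to get it in the world frame. The sketch below assumes a distortion-free image and zero skew; the function name `back_project` and the numeric values are illustrative only.

```python
def back_project(K, R, t, u, v, depth):
    """Recover the world point seen at pixel (u, v) with camera-frame depth Z_c.

    K and R are 3x3 matrices (lists of rows), t is a 3-element list, and
    [R | t] maps world -> camera coordinates. Distortion and skew are ignored.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    # Pixel -> camera frame: invert u = fx * X_c / Z_c + cx (same for v)
    X_cam = [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
    # Camera -> world: X_w = R^T * (X_c - t)
    d = [X_cam[i] - t[i] for i in range(3)]
    return [sum(R[j][i] * d[j] for j in range(3)) for i in range(3)]


# Hypothetical intrinsics and extrinsics (world frame = camera frame).
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

X_world = back_project(K, R, t, u=445.0, v=177.5, depth=2.0)
print(X_world)
```

Changing only `depth` moves the recovered point along the same viewing ray, which is the ambiguity that the depth measurement resolves.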
Distortion
Due to the spherical structure of the lens, images captured by a camera suffer from unwanted distortions: straight lines in the scene can appear curved, so we do not obtain the perfectly rectangular image seen by our eyes. Such distortion can be corrected by calibrating the camera.
Fig: Common types of distortions in image – Source: GFG_Distortion_Examples
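One commonly used way to describe such distortion is a polynomial radial model applied to normalized camera coordinates (x = Xc/Zc, y = Yc/Zc). The sketch below is an assumption for illustration, not something specified in this article: the function name `distort_radial` and the coefficient values k1 and k2 are made up, and real calibration tools typically estimate additional terms as well.

```python
def distort_radial(x, y, k1, k2=0.0):
    """Apply a two-coefficient radial distortion to normalized coordinates.

    x, y are normalized camera coordinates (X_c / Z_c, Y_c / Z_c).
    k1, k2 are radial distortion coefficients (illustrative values only).
    """
    r2 = x * x + y * y                      # squared distance from the optical axis
    factor = 1.0 + k1 * r2 + k2 * r2 * r2   # radial scaling
    return x * factor, y * factor


# A negative k1 pulls points towards the centre (barrel-like distortion).
x_d, y_d = distort_radial(0.5, 0.0, k1=-0.2)
print(x_d, y_d)
```

Calibration estimates coefficients of this kind; correcting an image then amounts to inverting the mapping, which is usually done numerically.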