Understanding CNN

Updated: November 30, 2025

Motivation: A Quick Review of MLPs

Before diving into Convolutional Neural Networks (CNNs), let’s briefly review what we’ve learned about Perceptrons and Multi-Layer Perceptrons (MLPs).

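MLPs are versatile tools used for various tasks, from simple Boolean functions (NOT, AND, OR) to complex classifiers and regression models. A single Perceptron can only represent linearly separable functions, but an MLP with one sufficiently wide hidden layer can in theory approximate any continuous function (the Universal Approximation Theorem, UAT). In practice, stacking layers to increase depth is usually more efficient than widening a single layer, which is why deep MLPs became the standard.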

Regardless of the task, the ultimate goal of training an MLP is to find good parameters (weights and biases) by gradient-based optimization, with the gradients computed through Backpropagation. However, standard MLPs have a specific constraint: they require fixed-size vector inputs.

The Limitations of MLP in Vision

While MLPs are powerful, they struggle significantly when applied to image processing tasks. Here is why:

1. The Parameter Explosion (Size Issue)

Since MLPs require vector inputs, a 2D image must be flattened into a 1D vector. Consider a standard input image of size $224 \times 224$ pixels with 3 color channels (RGB). Flattening this results in a vector with $224 \times 224 \times 3 = 150{,}528$ dimensions, or approximately 150,000.

If we connect this input to a single hidden layer with another 150,000 nodes, the number of weights becomes:

\[ 150{,}000 \times 150{,}000 \approx 22 \text{ billion parameters} \]

Even before counting the computation, storing and training this many weights for a single layer is impractical.
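The arithmetic above can be checked with a few lines of Python. This is only a sketch: the 3 color channels are an assumption (they are what makes the flattened size come out near 150,000), and the 4 bytes per weight assumes 32-bit floats.

```python
# Flattened size of a 224 x 224 image, assuming 3 color channels (RGB).
height, width, channels = 224, 224, 3
input_dim = height * width * channels

# One hidden layer with as many units as there are inputs.
hidden_dim = input_dim
weights = input_dim * hidden_dim   # one weight per (input, hidden unit) pair
biases = hidden_dim                # one bias per hidden unit

print(input_dim)                   # 150528
print(weights)                     # 22658678784
print(weights + biases)            # 22658829312

# Memory needed just to store the weights, assuming 4 bytes each.
print(weights * 4 / 1e9, "GB")
```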

2. Loss of Spatial Structure (Inductive Bias)

In images, nearby pixels are statistically correlated: local structure matters. For example, in a photo of a bird, pixels surrounding the bird are likely to be sky, and pixels near a lake are likely to be trees. However, a Fully Connected (FC) layer has no built-in notion of which pixels are neighbors. Once the image is flattened, every pixel is just another input dimension, so this crucial spatial information is discarded.

Furthermore, MLPs lack invariance. If a bird in a picture moves slightly to the right, it is still the same bird. However, an FC layer treats this shifted input as a completely new pattern because the weights are tied to specific locations. To recognize the bird in a different position, the network has to “re-learn” the object from scratch.

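We describe this issue as the lack of a suitable Inductive Bias for vision tasks: nothing in the architecture encodes locality or translation symmetry, so both have to be learned from data.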
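A toy 1D example makes the location problem concrete. The weights below are hypothetical, chosen by hand to mimic an FC unit that has learned one pattern at one position; they are not taken from any trained model.

```python
def fc_unit(inputs, weights):
    """One fully connected unit: a separate weight per input position (bias omitted)."""
    return sum(x * w for x, w in zip(inputs, weights))

# Hypothetical weights that respond to the pattern [1, 3, 2, 3] at positions 0-3.
weights = [1, 3, 2, 3, 0, 0, 0, 0]

pattern_left = [1, 3, 2, 3, 0, 0, 0, 0]
pattern_right = [0, 0, 0, 0, 1, 3, 2, 3]   # the same pattern, shifted 4 steps right

print(fc_unit(pattern_left, weights))    # 23
print(fc_unit(pattern_right, weights))   # 0
```

The unit responds strongly to the pattern on the left and not at all to the identical pattern on the right, because each weight is tied to one input position.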

Image Properties and CNNs

To solve the problems above, we need a model that respects the fundamental properties of images.

Property 1: Translation Invariance

A classifier should recognize an object regardless of its position in the image. Whether the bird is on the left or the right, the output should remain “bird”. Mathematically, this is expressed as:

\[ f(t(x)) = f(x) \]

The function $f$ (classifier) is invariant to the transformation $t$ (shift/translation).

Property 2: Translation Equivariance

If the input shifts, the feature map (output) should shift in the same way.

\[ f(t(x)) = t(f(x)) \]

The function output transforms in the same way as the input.

The Solution: Convolutional Neural Networks (CNN)

Introduced by Yann LeCun et al. (See Paper), CNNs are designed to satisfy the properties mentioned above while solving the parameter explosion problem.

The core idea is Scanning.

Instead of processing the entire image at once with a giant MLP, we create a small, local MLP called a Window, Kernel, or Filter. We then slide (scan) this window across the image.

Key Concept: Weight Sharing

Crucially, as we slide the window, we use the same weights for every position. This means all sub-networks share parameters, drastically reducing the total number of weights compared to FC layers.

The Arithmetic of Convolution

Let’s break down how the calculation actually works with a simple 1D example.

Input: $[1, 3, 2, 3, 0, \dots]$ (A long vector)
Kernel: $[1, 3, 0, -1]$ (A fixed 4-size filter)

Step 1: Take the first 4 elements of the input $[1, 3, 2, 3]$ and perform a dot product with the kernel.

\[ (1 \times 1) + (3 \times 3) + (2 \times 0) + (3 \times -1) = 1 + 9 + 0 - 3 = \mathbf{7} \]

Step 2 (Slide): Shift one step to the right. Take input $[3, 2, 3, 0]$ and multiply with the same kernel.

\[ (3 \times 1) + (2 \times 3) + (3 \times 0) + (0 \times -1) = 3 + 6 + 0 + 0 = \mathbf{9} \]

This operation is simply a sliding dot product.
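The two steps above can be reproduced with a short function. This is a minimal sketch in plain Python, assuming no padding and a stride of 1; the name `sliding_dot_product` is made up for this example, and the input is truncated to the five values shown.

```python
def sliding_dot_product(signal, kernel):
    """Slide `kernel` over `signal` one step at a time (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

signal = [1, 3, 2, 3, 0]
kernel = [1, 3, 0, -1]

print(sliding_dot_product(signal, kernel))   # [7, 9]
```

With 5 inputs and a kernel of size 4, the window fits in only 2 positions, so the output has 2 values.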

Note on Equivariance: This calculation demonstrates equivariance. If the input pattern [1, 3, 2, 3] shifts to the right, the output value 7 will also appear shifted to the right in the output vector, because we are using the same shared weights everywhere.
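This property can be verified numerically. The sketch below pads the pattern with zeros on both sides (an assumption, so that nothing is pushed off the edge when shifting) and checks that shifting the input and then sliding the kernel gives the same result as sliding the kernel and then shifting the output. The helper names are made up for this example.

```python
def sliding_dot_product(signal, kernel):
    """Slide `kernel` over `signal` one step at a time (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def shift_right(values, steps):
    """Shift a list right by `steps`, filling the vacated slots with zeros."""
    return [0] * steps + values[:len(values) - steps]

kernel = [1, 3, 0, -1]
signal = [0, 0, 0, 1, 3, 2, 3, 0, 0, 0]   # pattern surrounded by zeros

out = sliding_dot_product(signal, kernel)
out_of_shifted = sliding_dot_product(shift_right(signal, 1), kernel)
shifted_out = shift_right(out, 1)

print(out)              # [-1, -3, 1, 7, 9, 11, 3]
print(out_of_shifted)   # [0, -1, -3, 1, 7, 9, 11]
print(out_of_shifted == shifted_out)   # True
```

The value 7 moves from index 3 to index 4, following the input. Near the borders of a finite input the equality can break, which is why the zero margin is used here.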

Key Hyperparameters

When designing a CNN, there are four major concepts to understand:

Zero-padding: If we compute convolution without padding, the output dimension shrinks because the kernel cannot be centered on the edge pixels. Zero-padding adds zeros around the border of the input to preserve the spatial dimensions.
Stride: How many pixels do we shift the window at each step? (e.g., Stride 1 vs. Stride 2).
Kernel Size: The size of the window (e.g., $3 \times 3$, $5 \times 5$).
Dilation: How wide should the kernel spread? Dilation introduces gaps between kernel elements, increasing the receptive field without adding more parameters.
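How these four settings interact is easiest to see through the output length they produce. The function below is a sketch of the usual convention for a 1D convolution (zero-padding applied symmetrically on both sides, output length rounded down); the function and parameter names are chosen for this example.

```python
def conv_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution over an input of `length` elements."""
    # A dilated kernel covers more input positions than it has weights.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (length + 2 * padding - effective_kernel) // stride + 1

print(conv_output_length(224, kernel_size=3))                        # 222: shrinks without padding
print(conv_output_length(224, kernel_size=3, padding=1))             # 224: padding preserves size
print(conv_output_length(224, kernel_size=3, padding=1, stride=2))   # 112: stride 2 halves it
print(conv_output_length(224, kernel_size=3, dilation=2))            # 220: kernel spans 5 inputs
```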

Parameter Efficiency: FC vs. CNN

Let’s compare the complexity of the two architectures.

Fully Connected Layer:

\[ h_i = \sigma(\sum_{j} w_{ij} x_j + b_i) \]

Dense connectivity: Every input is connected to every output.
Parameters required: $\approx D^2$ for $D$ inputs and $D$ outputs (Quadratic complexity).

Convolutional Layer:

\[ h_i = \sigma(\sum_{j} w_{j} x_{i+j} + b) \]

Sparse connectivity (Locality): Only connects to local neighbors.
Parameters required: Proportional to the Kernel Size, and independent of the input size.

Parameter Counting Formula (1D CNN)

To calculate the total number of learnable parameters in a 1D convolutional layer:

\[ \text{Total Params} = (C_{in} \times C_{out} \times K) + C_{out} \]

$C_{in}$: Number of Input Channels
$C_{out}$: Number of Output Channels (Number of filters)
$K$: Kernel Size
$+ C_{out}$: The bias term (one bias per output channel/filter).
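The formula translates directly into code. The sketch below also includes the FC count for comparison; the function names and the example sizes are chosen for illustration only.

```python
def conv1d_param_count(c_in, c_out, kernel_size):
    """Learnable parameters in a 1D conv layer: weights plus one bias per filter."""
    return c_in * c_out * kernel_size + c_out

def fc_param_count(d_in, d_out):
    """Learnable parameters in a fully connected layer: weights plus one bias per output."""
    return d_in * d_out + d_out

# 3 input channels, 16 filters, kernel size 5.
print(conv1d_param_count(3, 16, 5))   # 256

# The conv count does not depend on the input length; the FC count does.
print(fc_param_count(1000, 1000))     # 1001000
print(fc_param_count(2000, 2000))     # 4002000
```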

Advanced Concepts

Channels

A single filter can only extract a single type of feature (e.g., a vertical edge). If we only use one filter, we miss out on horizontal edges, diagonals, or color textures.

Solution: Use multiple filters in parallel. By applying $N$ different filters, we generate $N$ different Feature Maps. These stacked maps form the Channels of the next layer.
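As a small 1D illustration, the sketch below applies three filters to the same input and stacks the results. The filters are hand-picked for readability (in a real CNN they would be learned), and their names are made up for this example.

```python
def sliding_dot_product(signal, kernel):
    """Slide `kernel` over `signal` one step at a time (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hand-picked filters, each sensitive to a different local feature.
filters = {
    "rising_edge": [-1, 1],
    "falling_edge": [1, -1],
    "local_average": [0.5, 0.5],
}

signal = [0, 0, 1, 1, 0, 0]

# One feature map per filter; stacked, they form the channels of the next layer.
feature_maps = [sliding_dot_product(signal, k) for k in filters.values()]

for name, fmap in zip(filters, feature_maps):
    print(name, fmap)

print(len(feature_maps), "channels of length", len(feature_maps[0]))
```

Each filter highlights something different in the same input: the first responds where the signal steps up, the second where it steps down, and the third smooths it.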

Receptive Field

Why is depth important in CNNs? It relates to the Receptive Field.

The receptive field of a unit is the region of the input that can influence that unit's value, i.e., the part of the image the feature “looks at”.

FC Layer: The receptive field is the entire input image (Global).
CNN Layer:
- Lower Layers: Small receptive field. Detects simple, local features like lines and textures.
- Higher Layers: As we go deeper (stacking layers), the receptive field expands. The network begins to “see” larger regions, recognizing high-level concepts like objects or faces.
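The growth of the receptive field with depth can be computed with a simple recurrence. The sketch below is for 1D layers without dilation; each layer is described by a `(kernel_size, stride)` pair, a representation chosen just for this example.

```python
def receptive_field(layers):
    """Receptive field (in input positions) of one unit after the given layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1     # a single input position sees only itself
    jump = 1   # distance in the input between neighboring units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1)]))       # 3
print(receptive_field([(3, 1)] * 3))   # 7
print(receptive_field([(3, 2)] * 3))   # 15
```

Three stacked layers with kernel size 3 already see 7 input positions, and using a stride of 2 in each layer widens this to 15.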

Finding the Optimal Kernel

In traditional computer vision, humans manually designed kernels (e.g., Sobel filters for edge detection). In Deep Learning, we do not design kernels; we learn them.

Just as MLPs use backpropagation to find good weights, CNNs use it to find Kernel values suited to the task. Because weight sharing and locality leave CNNs with significantly fewer parameters, they tend to overfit less and generalize far better on visual tasks than FC networks of comparable capacity.
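To illustrate what “learning a kernel” means, the toy sketch below recovers a hidden kernel from input/output examples using plain gradient descent on a mean squared error. Everything here is an assumption made for illustration: the random input, the hidden `true_kernel`, the learning rate, and the step count. It is a single linear layer with a hand-derived gradient, not a full backpropagation implementation.

```python
import random

def sliding_dot_product(signal, kernel):
    """Slide `kernel` over `signal` one step at a time (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(200)]
true_kernel = [1.0, 3.0, 0.0, -1.0]   # hidden target, used only to generate the data
targets = sliding_dot_product(signal, true_kernel)

kernel = [0.0] * len(true_kernel)     # start from an all-zero kernel
learning_rate = 0.3

for _ in range(300):
    outputs = sliding_dot_product(signal, kernel)
    errors = [o - t for o, t in zip(outputs, targets)]
    # Gradient of the mean squared error with respect to kernel weight j:
    # (2 / N) * sum_i error_i * signal[i + j]
    grads = [
        2.0 / len(errors) * sum(e * signal[i + j] for i, e in enumerate(errors))
        for j in range(len(kernel))
    ]
    kernel = [w - learning_rate * g for w, g in zip(kernel, grads)]

print([round(w, 3) for w in kernel])  # should end up close to true_kernel
```

Because the same four weights are used at every position, every window of the input contributes to the gradient of each weight, so a short signal already provides many training examples for the kernel.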


Seungwoo Lim