In the realm of neural networks, one hyperparameter stands out for how strongly it influences the training process: the batch size.
Although it might seem like a minor detail, the batch size can drastically affect the training curves of your neural networks.
Want to learn more about how the batch size influences the training process?
Let's dive into the fascinating world of batch sizes and their impact on gradient descent algorithms.
The Role of Batch Size in Neural Network Training
Imagine training two identical neural networks using gradient descent.
They share the same architecture, loss, optimizer, learning rate, momentum, epochs, and training data. However, one crucial difference sets them apart - the batch size.
This single difference can lead to vastly different training curves, making the choice of batch size a critical decision in the optimization process.
Let's delve into the ramifications of this pivotal decision, exploring the distinct paths carved by different batch sizes.
Stochastic Gradient Descent (batch_size=1)
Using a single, randomly selected sample of data on every iteration is known as Stochastic Gradient Descent (SGD).
SGD has an unstable loss and requires a long time to train.
Moreover, running SGD multiple times will yield completely different results due to its stochastic nature. It tends to jump around, never settling on a good solution.
Advantages:
Simple to understand. Offers a straightforward approach to gradient descent.
Avoids getting stuck in local minima
Provides immediate feedback. Allows for swift adjustments.
Disadvantages:
Computationally inefficient overall, since updating one sample at a time forgoes vectorization across a batch
May not settle at the global minimum, with no guarantee of convergence
Noisy/unstable performance
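To make this concrete, here is a minimal PyTorch sketch of SGD with batch_size=1. It assumes a model, criterion, optimizer, and train_dataset have already been created (the same names used in the PyTorch example later in this article):

import torch

# Stochastic Gradient Descent: one randomly chosen sample per update
sgd_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,  # assumed to be defined elsewhere
    batch_size=1,           # a single sample per iteration
    shuffle=True)           # visit the samples in random order

for epoch in range(12):
    for x_sample, y_sample in sgd_loader:
        outputs = model(x_sample)            # forward pass on one sample
        loss = criterion(outputs, y_sample)  # noisy, per-sample loss
        optimizer.zero_grad()
        loss.backward()                      # gradient from a single sample
        optimizer.step()                     # one weight update per sample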
Batch Gradient Descent (batch_size=all_training_data)
Using all the training data at once is called Batch Gradient Descent (BGD).
The algorithm computes the gradient over the entire dataset and performs a single update per epoch.
BGD offers a very stable loss and fast training, but it comes at the cost of requiring significant memory to fit the entire dataset on every iteration.
It can also get stuck in local minima from time to time.
Advantages:
Computationally efficient
Stable performance (less noise)
Disadvantages:
Requires a lot of memory
May get stuck in local minima
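Under the same assumptions as the SGD sketch above (model, criterion, optimizer, and train_dataset already defined), a minimal PyTorch sketch of Batch Gradient Descent only changes the batch size to the size of the full dataset, so each epoch performs exactly one update:

import torch

# Batch Gradient Descent: the entire dataset as a single batch
bgd_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,          # assumed to be defined elsewhere
    batch_size=len(train_dataset),  # all samples at once
    shuffle=False)                  # shuffling is irrelevant for a full batch

for epoch in range(12):
    for x_all, y_all in bgd_loader:      # this inner loop runs exactly once per epoch
        outputs = model(x_all)           # forward pass over the full dataset
        loss = criterion(outputs, y_all)
        optimizer.zero_grad()
        loss.backward()                  # exact gradient of the training loss
        optimizer.step()                 # one weight update per epoch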
Mini-Batch Gradient Descent (batch_size=X)
Using a subset of data (more than one sample but fewer than the entire dataset) is called Mini-Batch Gradient Descent.
The algorithm works like Batch Gradient Descent, but with fewer samples.
It's fast and doesn't require much memory.
It might have some noise, but its performance is relatively stable, and there's a better chance of escaping local minima.
Advantages:
Avoids getting stuck in local minima
More computationally efficient than SGD
Requires less memory than BGD
Disadvantages:
Introduces a new hyperparameter (the batch size) to tune
Mini-Batch Gradient Descent in Keras
# Set the batch size
batch_size = 32

# Train the model using Mini-Batch Gradient Descent
# (model, x_train, y_train, x_test and y_test are assumed to be defined earlier)
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=12,
          verbose=1,
          validation_data=(x_test, y_test))
Mini-Batch Gradient Descent in PyTorch
import torch

# Set the batch size
batch_size = 32

# Create a DataLoader for the training dataset
# (train_dataset, model, criterion and optimizer are assumed to be defined earlier)
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True)

# Train the model using Mini-Batch Gradient Descent
num_epochs = 12
for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:
        outputs = model(x_batch)            # forward pass on one mini-batch
        loss = criterion(outputs, y_batch)  # mini-batch loss
        optimizer.zero_grad()               # clear gradients from the previous step
        loss.backward()                     # backpropagate
        optimizer.step()                    # update the weights
Optimal Batch Size: Finding the Sweet Spot
In practice, Batch Gradient Descent is rarely used, especially with large datasets. SGD isn't very popular either.
Mini-Batch Gradient Descent is the most commonly used method due to its balance between computational efficiency and stability.
There's extensive research on finding the optimal batch size.
Every problem is different, but empirical evidence suggests that smaller batches tend to perform better.
However, blindly increasing the batch size is not a solution either, as very large batches tend to generalize worse.
Striking the right balance is crucial. The paper linked below recommends 32 as a good default value.
Read more at Practical recommendations for gradient-based training of deep architectures (Bengio, 2012).
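If you want to see this trade-off on your own problem, one simple approach is to train the same architecture with several batch sizes and compare the resulting loss curves. Here is a minimal Keras sketch, assuming a hypothetical build_model() helper that returns a fresh, compiled model, plus the same x_train, y_train, x_test and y_test as in the earlier example:

# Train the same architecture with several batch sizes and keep the loss curves
histories = {}
for bs in [1, 32, 128, 512]:
    model = build_model()  # hypothetical helper returning a fresh, compiled model
    history = model.fit(x_train, y_train,
                        batch_size=bs,
                        epochs=12,
                        verbose=0,
                        validation_data=(x_test, y_test))
    histories[bs] = history.history["loss"]  # training-loss curve for this batch size

Plotting these curves side by side makes the differences in noise and convergence speed described above easy to see.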
Conclusion
When training a neural network with a batch size of 1, the loss becomes noisy, and the training process takes a long time. With every run, you'll get different results, as the model never settles on a good solution.
On the other hand, using all the training data as the batch size results in a stable loss and fast training. However, it demands a lot of memory and can get stuck in local minima.
The sweet spot lies in using a mini-batch size, such as 32. This approach offers a balance between speed, memory requirements, and stability, while also reducing the risk of getting stuck in local minima.
In the end, the choice of batch size significantly impacts the training process, and understanding its nuances can help you make informed decisions when optimizing neural networks.
Hope this article helps you get better training results.
If you like this article, share it with others ♻️
Would help a lot ❤️
And feel free to follow me for more articles like this.