Image Augmentation in Julia and Questioning my Life Choices

Categories: julia, augmentation, flux
Author: Paul
Published: December 19, 2024

Augmentation is an important part of deep learning, especially for computer vision tasks. I recently investigated the state of image augmentation in Julia, and DataAugmentation.jl stood out as a nicely designed package that is very easy to use. However, I wanted to test its speed, especially compared to the augmentation packages available in Python. In this post, I’ll run a basic augmentation pipeline with DataAugmentation.jl and compare it to Albumentations, a popular augmentation package for Python.

To start, let’s grab 2000 images from the ImageNet-1K validation dataset and load them into memory:

using FileIO, Images

files = readdir("./imagenet/"; join=true)
images = [RGB.(load(f)) for f in files]
println("Loaded $(length(images)) images")
Loaded 2000 images
using MosaicViews

mosaicview(images[1:6], npad=5, nrow=2)
Figure 1: Sample of loaded ImageNet-1k images.

For this comparison, we’ll use Albumentations’ PyTorch example as a typical augmentation pipeline. This pipeline scales, rotates, and crops the image, and applies random contrast and brightness adjustments. Finally, it normalizes the image and converts it to a tensor. Let’s set up the same pipeline with DataAugmentation.jl and see how fast it can augment images:

using BenchmarkTools, DataAugmentation

# Wrap the raw images in DataAugmentation's Image item type.
images = [Image(image) for image in images]

# The same steps as the Albumentations pipeline: resize, random zoom/rotate,
# crop, random contrast/brightness, then normalize and convert to a tensor.
augmentation = ScaleKeepAspect((256, 256)) |>
               Maybe(Zoom((0.95, 1.05)) |> Rotate(15), 0.5) |>
               RandomCrop((224, 224)) |>
               Maybe(AdjustContrast(0.2) |> AdjustBrightness(0.2), 0.5) |>
               PinOrigin() |>
               ImageToTensor() |>
               Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))

# Time a single-threaded pass over all 2000 images.
time = @benchmark for image in images
    apply(augmentation, image)
end
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  3.173 s …    3.601 s  ┊ GC (min … max): 0.21% … 2.38%
 Time  (median):     3.334 s               ┊ GC (median):    1.04%
 Time  (mean ± σ):   3.335 s ± 106.714 ms  ┊ GC (mean ± σ):  1.01% ± 0.66%

      ▆▆█▁      ▃   ▆   ▃▃▁    ▁ ▁  ▁   █ ▁                   
  ▄▇▇▇████▇▇▇▇▄▄█▁▁▁█▄▇▇███▄▇▄▇█▄█▄▁█▁▄▄█▄█▇▁▇▁▇▄▁▁▄▁▁▁▁▁▁▁▄ ▄
  3.17 s         Histogram: frequency by time          3.6 s <

 Memory estimate: 2.80 GiB, allocs estimate: 328618.
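As a quick sanity check on what one pass through the pipeline actually produces, we can apply it to a single image and inspect the result. This is a minimal sketch using the augmentation and wrapped images defined above; the shape noted in the comment is my assumption about how ImageToTensor lays out the channels.

item = apply(augmentation, images[1])   # images[1] is already wrapped in Image
tensor = itemdata(item)                 # unwrap the underlying array from the item
@show typeof(tensor) size(tensor)       # expected: roughly a 224×224×3 Float32 array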

Not bad: DataAugmentation.jl took an average of about 3.3 seconds to augment all 2000 images, roughly 1.7 ms per image. We can now do the same in Python with Albumentations. First we load the images, create the same augmentation pipeline, and time the image augmentation:

import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
import pathlib
import timeit

# Load every image, converting from OpenCV's default BGR ordering to RGB.
files = list(pathlib.Path("./imagenet/").iterdir())
images = [
    cv2.cvtColor(cv2.imread(str(file)), cv2.COLOR_BGR2RGB)
    for file in files
]

augmentation = A.Compose(
    [
        A.SmallestMaxSize(max_size=256),
        A.ShiftScaleRotate(shift_limit=0.0, scale_limit=0.05, rotate_limit=15, p=0.5),
        A.RandomCrop(height=224, width=224),
        A.RandomBrightnessContrast(p=0.5),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ToTensorV2(),
    ]
)
time = timeit.timeit(
    "for image in images: augmentation(image=image)",
    globals=globals(),
    number=10,
)
print(f"Python mean: {time/10:.4}s")
Python mean: 1.624s

This is an impressive result. On average, Albumentations tore through the 2000 images in 1.6 seconds, less than 1 ms per image and twice as fast as DataAugmentation.jl. That is a testament to the work and attention to performance the Albumentations team has put into their package. Well done.

But wait: up to this point we have only used one thread, and I have 6 cores and 12 hardware threads sitting here (AMD Ryzen 5 PRO 4650G). I am sure many of you have more, especially since we often run our deep learning jobs on monster machines. So how do things look in these more realistic circumstances? In Julia, parallelizing the augmentation is easy.
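One caveat worth flagging (how the session was launched is my assumption, not something shown in the post): Threads.@threads only parallelizes work if Julia was started with more than one thread, for example with julia --threads=12 or julia --threads=auto. A quick check:

# Confirm how many threads this Julia session actually has.
@show Threads.nthreads()

With the threads available, the parallel benchmark looks like this: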

time = @benchmark Threads.@threads for image in images
    apply(augmentation, image)
end
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  441.473 ms … 629.252 ms  ┊ GC (min … max): 0.00% … 15.97%
 Time  (median):     477.320 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   498.405 ms ±  51.302 ms  ┊ GC (mean ± σ):  3.42% ±  5.75%

  ▆▆▂█▆   █                ▄▂                                    
  ██████▄▄█▆▆█▆▄▁▆▁▁▁▁▁█▁█▆██▆▆▄▁▄▁▄▆▁▁▁▁▄▄▄▄▄▁▄▁▆▁▄▄▄▁▆▆▁▄▁█▁▄ ▄
  441 ms           Histogram: frequency by time          609 ms <

 Memory estimate: 2.80 GiB, allocs estimate: 325847.

Wow: Julia’s multi-threaded execution is nearly 7x faster than the single-threaded run on a 6-core, 12-thread CPU. Now we’ll try multi-threading in Python. Python programmers already know where this is going, don’t they? You almost don’t want to see what happens, but let’s get it over with and try multi-threading the augmentation in Python:

from concurrent.futures import ThreadPoolExecutor, wait

with ThreadPoolExecutor() as executor:
    time = timeit.timeit(
        "wait([executor.submit(augmentation, image=image) for image in images])",
        globals=globals(),
        number=10,
    )
print(f"Python mean: {time/10:.4}s")
Python mean: 1.808s

And there it is: multi-threading does not help at all, and the culprit is the GIL. Because of it, Python simply won’t run this CPU-bound work on multiple threads at once. In this example, multi-threading actually slowed the augmentation down slightly, leaving it almost 4x slower than the threaded Julia version. As a Python developer of almost two decades, I am starting to wonder why I put up with this when multi-threading is well supported in other languages. Years ago I worked in C++ and happily wrote multi-threaded algorithms; seeing how simple multi-threading is in Julia makes me question why I have lived with the GIL for so long.

Ok, ok. We all know the GIL is there and that multi-threading is a pain in Python. But what about multi-processing, and what about our deep learning use case, where multi-process data loaders are readily available and should alleviate this suffering? Let’s set up a data loader using Julia’s deep learning framework, Flux, and, for comparison, a similar data loader in Python with PyTorch. To keep things fair between Flux’s multi-threaded implementation and PyTorch’s multi-process implementation, we will load the images from disk inside the data loader instead of preloading them into memory.

First, the Julia version with Flux:

using Flux

# A minimal map-style dataset: length and getindex are all the Flux (MLUtils)
# DataLoader needs here.
struct Dataset
    files::Vector{String}
    augmentation::DataAugmentation.Transform
end

Base.length(d::Dataset) = length(d.files)

function Base.getindex(dataset::Dataset, index::Int)
    open(dataset.files[index]) do file
        # Load from disk, augment, and pair with a dummy label.
        image = RGB.(load(file))
        augmented = apply(dataset.augmentation, Image(image)) |> itemdata
        fakelabel = index % 2
        return (augmented, fakelabel)
    end
end

dataset = Dataset(files, augmentation)
loader = Flux.DataLoader(dataset, batchsize=64, parallel=true, collate=true)

time = @benchmark for d in loader
end
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  1.097 s …   1.369 s  ┊ GC (min … max): 0.00% … 1.01%
 Time  (median):     1.171 s              ┊ GC (median):    1.25%
 Time  (mean ± σ):   1.181 s ± 50.532 ms  ┊ GC (mean ± σ):  1.36% ± 1.12%

             ▃▁▃▁▁█ ▃▃ ▆                                     
  ▄▁▄▄▁▇▇▄▇▆▆██████▇██▆█▆▄▇▄▄▇▆▁▄▇▁▇▆▁▄▁▄▁▁▁▁▄▁▁▁▁▄▁▄▁▄▁▁▁▆ ▄
  1.1 s          Histogram: frequency by time        1.33 s <

 Memory estimate: 7.98 GiB, allocs estimate: 426715.
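Out of curiosity, we can also peek at one collated batch from the loader. This is a sketch, and the shapes in the comments are my assumptions about how the items above collate rather than something verified in the post:

x, y = first(loader)
@show size(x)   # likely (224, 224, 3, 64): images stacked along a new batch dimension
@show size(y)   # likely (64,): the fake labels collected into a vector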

Now, the Python version with PyTorch:

from torch.utils.data import DataLoader, Dataset

class TorchDataset(Dataset):
    def __init__(self, files, augmentation):
        self.files = files
        self.augmentation = augmentation

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        image = cv2.imread(str(self.files[index]))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = self.augmentation(image=image)["image"]
        fakelabel = index % 2
        return image, fakelabel

dataset = TorchDataset(files, augmentation)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=12)

time = timeit.timeit("for _ in loader: pass", globals=globals(), number=10) / 10
print(f"Python mean: {time:.4}s")
Python mean: 1.621s

DataAugmentation.jl with Flux is roughly 30% faster than Albumentations with PyTorch (about 1.2 s versus 1.6 s per pass over the dataset), despite Albumentations being 2x faster in single-threaded execution. Once real-world usage is considered (multi-CPU compute), Python’s advantage falls apart. This is even more impressive when you consider the lopsided amount of developer effort that has gone into PyTorch and Albumentations compared to Flux and DataAugmentation.jl. On GitHub, Albumentations has 150 contributors and 14 thousand stars; DataAugmentation.jl has 15 contributors and 42 stars. Granted, Albumentations offers a much larger library of augmentations, so the packages aren’t completely comparable.

Still, with far fewer resources, DataAugmentation.jl manages to match and arguably exceed the real-world performance of Albumentations, largely because it is written in Julia rather than Python. This is just one small data point, but it reinforces my sense that I should move on from Python’s limitations to tools better suited to my needs.