Author: Quan Hua
Last Updated: Fri, Sep 16, 2022

WebDataset is a PyTorch Dataset implementation for working with large-scale datasets efficiently. It provides sequential/streaming access to datasets stored as tar archives, either on local disk or in cloud object storage, so you can train without unpacking the data and can even stream it with no local storage at all. With WebDataset, the same code scales from local experiments to training on hundreds of GPUs. This article explains how to use WebDataset and PyTorch with tar archives stored on Vultr Object Storage.
By the end of this article, you will know how to create WebDataset tar archives, upload them to Vultr Object Storage, and stream them for training.
Install the webdataset and s3cmd packages using the Python package manager:
$ pip install webdataset s3cmd
Run the configuration and enter the details of your Vultr Object Storage:
$ s3cmd --configure
Here is an example setup:
Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.
Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
Access Key: WSZ3GHRPA189CVSGRKU6
Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
Default Region [US]:
Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
S3 Endpoint [s3.amazonaws.com]: sgp1.vultrobjects.com
Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
if the target S3 system supports dns based buckets.
DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: sgp1.vultrobjects.com
Encryption password is used to protect your files from reading
by unauthorized persons while in transfer to S3
Encryption password:
Path to GPG program [/usr/bin/gpg]:
When using secure HTTPS protocol all communication with Amazon S3
servers is protected from 3rd party eavesdropping. This method is
slower than plain HTTP, and can only be proxied with Python 2.7 or newer
Use HTTPS protocol [Yes]:
On some networks all internet access must go through a HTTP proxy.
Try setting it here if you can't connect to S3 directly
HTTP Proxy server name:
New settings:
Access Key: WSZ3GHRPA189CVSGRKU6
Secret Key: uKCwHbdmUEqTbjVno3k373rfORzSam0lWAXbaAm6
Default Region: US
S3 Endpoint: sgp1.vultrobjects.com
DNS-style bucket+hostname:port template for accessing a bucket: sgp1.vultrobjects.com
Encryption password:
Path to GPG program: /usr/bin/gpg
Use HTTPS protocol: True
HTTP Proxy server name:
HTTP Proxy server port: 0
Test access with supplied credentials? [Y/n]
Please wait, attempting to list all buckets...
Success. Your access key and secret key worked fine :-)
Now verifying that encryption works...
Not configured. Never mind.
Save settings? [y/N] y
Configuration saved to '/home/ubuntu/.s3cfg'
Make a bucket to store the dataset. Replace demo-bucket with your bucket name.
$ s3cmd mb s3://demo-bucket
This section shows how to create a tar archive for the CIFAR10 dataset, upload it to Vultr Object Storage, and load the data for training.
This section uses the torchvision library to simplify processing the dataset. However, you can use any method to load each sample into separate variables.
Create a file named create_cifar10.py as follows:
import torchvision
import webdataset as wds
import sys

dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)

for index, (input, output) in enumerate(dataset):
    if index < 3:
        print("input", type(input), input)
        print("output", type(output), output)
        print("")
    else:
        break
Run the script create_cifar10.py
$ python create_cifar10.py
Here is an example result. Each sample is a tuple of a Python Imaging Library (PIL) image and an integer category label.
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./temp/cifar-10-python.tar.gz
100.0%
Extracting ./temp/cifar-10-python.tar.gz to ./temp
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6B0>
output <class 'int'> 6
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA680>
output <class 'int'> 9
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7F373B6CA6E0>
output <class 'int'> 9
Change the file create_cifar10.py as follows. For each sample, an instance of wds.TarWriter saves a dictionary containing the input under the key ppm and the output under the key cls. The ppm key makes the writer save the image with an image encoder, and the cls key makes the writer save the label as an integer.
import torchvision
import webdataset as wds
import sys

dataset = torchvision.datasets.CIFAR10(root="./temp", download=True)

filename = "cifar10.tar"
sink = wds.TarWriter(filename)
for index, (input, output) in enumerate(dataset):
    if index == 0:
        print("input", type(input), input)
        print("output", type(output), output)
    sink.write({
        "__key__": "sample%06d" % index,
        "ppm": input,
        "cls": output,
    })
sink.close()
Run the script create_cifar10.py to create a tar archive named cifar10.tar.
$ python create_cifar10.py
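Inside the archive, TarWriter names each file from the sample's __key__ plus the dictionary key as the file extension, so the first sample above becomes sample000000.ppm and sample000000.cls. A minimal pure-Python sketch of this naming scheme (the member_names helper is hypothetical, for illustration only):

```python
def member_names(sample):
    """Map a WebDataset sample dict to the tar member names TarWriter produces."""
    key = sample["__key__"]
    return [f"{key}.{ext}" for ext in sample if ext != "__key__"]

# The first CIFAR10 sample written above:
print(member_names({"__key__": "sample000000", "ppm": "<image>", "cls": 6}))
# ['sample000000.ppm', 'sample000000.cls']
```

When loading, WebDataset reverses this mapping: it groups consecutive tar members that share the same key back into one sample dict.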
Run the following command to upload cifar10.tar to the Vultr Object Storage:
$ s3cmd put cifar10.tar s3://demo-bucket
Create a file named load_cifar10.py as follows. The call decode("pil") makes WebDataset decode the input image into a PIL image instead of raw bytes. The call to_tuple("ppm", "cls") makes the dataset return a tuple built from the keys "ppm" and "cls".
import torch
import webdataset as wds
from itertools import islice

url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
dataset = wds.WebDataset(url).decode("pil").to_tuple("ppm", "cls")

for sample in islice(dataset, 0, 3):
    input, output = sample
    print("input", type(input), input)
    print("output", type(output), output)
    print()
Run the script load_cifar10.py
$ python load_cifar10.py
Here is an example result. The input and output match the previous step.
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE18E20>
output <class 'int'> 6
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50ACDE230>
output <class 'int'> 9
input <class 'PIL.Image.Image'> <PIL.Image.Image image mode=RGB size=32x32 at 0x7FB50AE19AB0>
output <class 'int'> 9
(Optional) Change load_cifar10.py to perform data augmentation and normalization.
import torch
import webdataset as wds
from itertools import islice
from torchvision import transforms

def identity(x):
    return x

normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

url = "pipe:s3cmd -q get s3://demo-bucket/cifar10.tar -"
dataset = (
    wds.WebDataset(url)
    .shuffle(64)
    .decode("pil")
    .to_tuple("ppm", "cls")
    .map_tuple(preprocess, identity)
)

for sample in islice(dataset, 0, 3):
    input, output = sample
    print("input", type(input), input)
    print("output", type(output), output)
    print()
The dataset from WebDataset is a standard PyTorch IterableDataset instance, so it is fully compatible with the standard PyTorch DataLoader, which replicates the dataset instance across multiple worker processes to perform parallel data loading and preprocessing.
Here is an example of using the standard PyTorch DataLoader:
loader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=8)
batch = next(iter(loader))
batch[0].shape, batch[1].shape
The authors of WebDataset recommend explicitly batching in the dataset instance as follows:
batch_size = 20
dataloader = torch.utils.data.DataLoader(dataset.batched(batch_size), num_workers=4, batch_size=None)
images, targets = next(iter(dataloader))
images.shape
If you want to change the batch size dynamically, WebDataset provides WebLoader, a wrapper that adds a fluid interface to the standard PyTorch DataLoader. Here is an example of using WebLoader:
dataset = dataset.batched(16)
loader = wds.WebLoader(dataset, num_workers=4, batch_size=None)
loader = loader.unbatched().shuffle(1000).batched(12)
batch = next(iter(loader))
batch[0].shape, batch[1].shape
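Conceptually, the unbatched().batched(12) pair flattens the incoming batches of 16 back into a stream of individual samples and regroups them at the new size. A rough pure-Python sketch of that idea (rebatch is a hypothetical helper, not a WebDataset API):

```python
from itertools import islice

def rebatch(batches, new_size):
    """Flatten batches into a sample stream, then regroup at new_size."""
    def samples():
        for batch in batches:
            yield from batch

    stream = samples()
    while True:
        chunk = list(islice(stream, new_size))
        if not chunk:
            return
        yield chunk

# Two batches of 16 samples regrouped into batches of 12:
print([len(b) for b in rebatch([list(range(16)), list(range(16))], 12)])
# [12, 12, 8]
```

Note the last batch may be smaller than the requested size, just as with a standard DataLoader when drop_last is not set.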
This section shows how to create a tar archive for the LibriSpeech dataset. LibriSpeech is a corpus of about 1000 hours of 16 kHz read English speech. Each sample contains audio in the Free Lossless Audio Codec (FLAC) format, an English transcript, and some integer identifiers.
This section uses the torchaudio library to simplify processing the dataset. However, you can use any method to load each sample into separate variables.
Create a file named create_librispeech.py as follows:
import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
    download=True)

for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    else:
        break
Run the script create_librispeech.py
$ python create_librispeech.py
Here is an example result. Each sample is a tuple of a PyTorch Tensor, a string containing the English transcript, and some integer identifiers.
<class 'torch.Tensor'> tensor([[-0.0065, -0.0055, -0.0062, ..., 0.0033, 0.0005, -0.0095]])
<class 'int'> 16000
<class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 0
<class 'torch.Tensor'> tensor([[-0.0059, -0.0045, -0.0067, ..., 0.0007, 0.0034, 0.0047]])
<class 'int'> 16000
<class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 1
<class 'torch.Tensor'> tensor([[ 0.0052, 0.0074, 0.0113, ..., -0.0007, -0.0039, -0.0058]])
<class 'int'> 16000
<class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 2
Change the file create_librispeech.py as follows. For each sample, the key waveform.pth makes the writer save the audio waveform in the PyTorch Tensor format, and the key transcript.text makes the writer save the English transcript as text. The other keys end with .id so their values are saved as integers.
import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
    download=True)

filename = "LibriSpeech.tar"
sink = wds.TarWriter(filename)
for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    sink.write({
        "__key__": "sample%06d" % index,
        "waveform.pth": waveform,
        "sample_rate.id": sample_rate,
        "transcript.text": transcript,
        "speaker.id": speaker_id,
        "chapter.id": chapter_id,
        "utterance.id": utterance_id
    })
sink.close()
(Optional) Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a large tar archive with a size of 22 GB. This approach produces an archive larger than the original dataset (6 GB) because it decodes every audio file into a Tensor before saving it into the tar archive.
$ python create_librispeech.py
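The 22 GB figure is roughly what a back-of-envelope calculation predicts: train-clean-100 contains about 100 hours of 16 kHz audio, and each decoded waveform value is a 4-byte float32:

```python
hours = 100            # approximate length of train-clean-100
sample_rate = 16000    # audio samples per second
bytes_per_value = 4    # float32 Tensor elements

size_gb = hours * 3600 * sample_rate * bytes_per_value / 1e9
print(f"{size_gb:.1f} GB")  # 23.0 GB, close to the observed 22 GB archive
```

Keeping the audio compressed as FLAC, as the next step does, avoids this blow-up.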
Change create_librispeech.py as follows to keep the audio in FLAC format. This script reads the raw bytes from the .flac files and saves them under the key waveform.flac.
import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
    download=True)

filename = "LibriSpeech.tar"
root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
sink = wds.TarWriter(filename)
for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    # Read the raw FLAC bytes instead of the decoded waveform
    fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
    fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    with open(fpath, "rb") as file:
        sink.write({
            "__key__": "sample%06d" % index,
            "waveform.flac": file.read(),
            "sample_rate.id": sample_rate,
            "transcript.text": transcript,
            "speaker.id": speaker_id,
            "chapter.id": chapter_id,
            "utterance.id": utterance_id
        })
sink.close()
Run the script create_librispeech.py to create a tar archive named LibriSpeech.tar. The result is a tar archive with a size of 6 GB.
$ python create_librispeech.py
Run the following command to upload LibriSpeech.tar to the Vultr Object Storage:
$ s3cmd put LibriSpeech.tar s3://demo-bucket
Create a file named load_librispeech.py as follows. The call decode(wds.torch_audio) makes WebDataset use torchaudio to decode the audio.
import torch
from torch.utils.data import IterableDataset
from torchvision import transforms
import webdataset as wds
from itertools import islice

url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech.tar -"
dataset = wds.WebDataset(url).decode(wds.torch_audio).to_tuple(
    "waveform.flac",
    "sample_rate.id",
    "transcript.text",
    "speaker.id",
    "chapter.id",
    "utterance.id")

for sample in islice(dataset, 0, 3):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    for i, item in enumerate(sample):
        print(type(item), item)
    print()
Run the script load_librispeech.py
$ python load_librispeech.py
Here is an example result. The values match the previous step, except that torchaudio decodes each FLAC file into a tuple of the waveform Tensor and its sample rate.
<class 'tuple'> (tensor([[-0.0065, -0.0055, -0.0062, ..., 0.0033, 0.0005, -0.0095]]), 16000)
<class 'int'> 16000
<class 'str'> CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 0
<class 'tuple'> (tensor([[-0.0059, -0.0045, -0.0067, ..., 0.0007, 0.0034, 0.0047]]), 16000)
<class 'int'> 16000
<class 'str'> THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT PLACE IT WAS REPUTED TO BE AN INTRICATE HEADLONG BROOK IN ITS EARLIER COURSE THROUGH THOSE WOODS WITH DARK SECRETS OF POOL AND CASCADE BUT BY THE TIME IT REACHED LYNDE'S HOLLOW IT WAS A QUIET WELL CONDUCTED LITTLE STREAM
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 1
<class 'tuple'> (tensor([[ 0.0052, 0.0074, 0.0113, ..., -0.0007, -0.0039, -0.0058]]), 16000)
<class 'int'> 16000
<class 'str'> FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR WITHOUT DUE REGARD FOR DECENCY AND DECORUM IT PROBABLY WAS CONSCIOUS THAT MISSUS RACHEL WAS SITTING AT HER WINDOW KEEPING A SHARP EYE ON EVERYTHING THAT PASSED FROM BROOKS AND CHILDREN UP
<class 'int'> 103
<class 'int'> 1240
<class 'int'> 2
WebDataset supports sharding, splitting the dataset into many shards to achieve parallel I/O and to shuffle data at the shard level.
Here is an example of creating and loading the LibriSpeech dataset with sharding.
Create a file named create_librispeech_sharding.py as follows:
import torchaudio
import webdataset as wds
import sys
import os

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./temp", folder_in_archive="LibriSpeech", url="train-clean-100",
    download=True)

root_dir = os.path.join("./temp", "LibriSpeech", "train-clean-100")
filename = "LibriSpeech-%04d.tar"
sink = wds.ShardWriter(filename, maxsize=1e9)
for index, sample in enumerate(dataset):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    if index < 3:
        for i, item in enumerate(sample):
            print(type(item), item)
        print()
    # Read the raw FLAC bytes instead of the decoded waveform
    fname = f"{speaker_id}-{chapter_id}-{utterance_id:04d}.flac"
    fpath = os.path.join(root_dir, str(speaker_id), str(chapter_id), fname)
    with open(fpath, "rb") as file:
        sink.write({
            "__key__": "sample%06d" % index,
            "waveform.flac": file.read(),
            "sample_rate.id": sample_rate,
            "transcript.text": transcript,
            "speaker.id": speaker_id,
            "chapter.id": chapter_id,
            "utterance.id": utterance_id
        })
sink.close()
Run the script create_librispeech_sharding.py to create multiple shards of about 1 GB each.
$ python create_librispeech_sharding.py
Upload all the shards into the Vultr Object Storage
$ s3cmd put LibriSpeech-* s3://demo-bucket
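When loading, WebDataset brace-expands shard URL patterns such as LibriSpeech-{0000..0006}.tar into the full list of shard URLs (it uses the braceexpand package internally). A minimal sketch of the same numeric expansion, using a hypothetical expand helper:

```python
def expand(prefix, lo, hi, suffix, width=4):
    """Expand a numeric shard range the way {0000..0006} brace patterns do."""
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(lo, hi + 1)]

print(expand("LibriSpeech-", 0, 2, ".tar"))
# ['LibriSpeech-0000.tar', 'LibriSpeech-0001.tar', 'LibriSpeech-0002.tar']
```

The zero-padded width in the pattern must match the %04d format used by ShardWriter, otherwise the expanded names will not match the uploaded shards.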
Create a file named load_librispeech_sharding.py as follows. The argument shardshuffle=True makes WebDataset shuffle the dataset at the shard level, while the shuffle method shuffles the samples inline.
import torch
from torch.utils.data import IterableDataset
from torchvision import transforms
import webdataset as wds
from itertools import islice

url = "pipe:s3cmd -q get s3://demo-bucket/LibriSpeech-{0000..0006}.tar -"
dataset = wds.WebDataset(url, shardshuffle=True).shuffle(100).decode(wds.torch_audio).to_tuple(
    "waveform.flac",
    "sample_rate.id",
    "transcript.text",
    "speaker.id",
    "chapter.id",
    "utterance.id")

for sample in islice(dataset, 0, 3):
    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = sample
    for i, item in enumerate(sample):
        print(type(item), item)
    print()
Run the script load_librispeech_sharding.py
$ python load_librispeech_sharding.py