Datasets#

Overview#

When training a model, one must first collect a dataset and preprocess it so that text samples follow a consistent structure and format. This project collects several datasets and provides utilities for training language models on them.

See also

lmp.script.sample_dset

Dataset sampling script.

Import dataset module#

All dataset classes are gathered under the module lmp.dset. One can import the dataset module like any other Python module:

import lmp.dset

Create dataset instances#

After importing lmp.dset, one can create a dataset instance from any dataset class exposed as an attribute of lmp.dset. For example, one can create the demo dataset lmp.dset.DemoDset and the wiki-text-2 dataset lmp.dset.WikiText2Dset as follows:

import lmp.dset

# Create demo dataset instance.
demo_dataset = lmp.dset.DemoDset()

# Create wiki-text-2 dataset instance.
wiki_dataset = lmp.dset.WikiText2Dset()

A dataset can have many versions. One can access the class attribute vers of a dataset class to get all supported versions. For example, all supported versions of lmp.dset.DemoDset are test, train and valid:

import lmp.dset

# Supported versions of `lmp.dset.DemoDset`.
assert 'test' in lmp.dset.DemoDset.vers
assert 'train' in lmp.dset.DemoDset.vers
assert 'valid' in lmp.dset.DemoDset.vers

# Construct different versions.
test_dataset = lmp.dset.DemoDset(ver='test')
train_dataset = lmp.dset.DemoDset(ver='train')
valid_dataset = lmp.dset.DemoDset(ver='valid')

If the parameter ver is not passed to a dataset class’s constructor, the default version of that dataset class is used. The default version is defined by the class attribute df_ver. For example, the default version of lmp.dset.DemoDset is train:

import lmp.dset

# Get default version.
assert 'train' == lmp.dset.DemoDset.df_ver

# All of the following constructions are equivalent.
train_dataset = lmp.dset.DemoDset()
train_dataset = lmp.dset.DemoDset(ver=None)
train_dataset = lmp.dset.DemoDset(ver='train')
train_dataset = lmp.dset.DemoDset(ver=lmp.dset.DemoDset.df_ver)
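The fallback from ver=None to df_ver can be illustrated with a minimal, self-contained sketch. ToyDset below is a hypothetical class written only for this illustration; it is not part of lmp.dset, and raising ValueError for an unsupported version is an assumption about one reasonable behaviour rather than a description of the actual implementation.

```python
from typing import ClassVar, List, Optional


class ToyDset:
    """Hypothetical dataset class; not part of ``lmp.dset``."""

    # Default version, used when ``ver`` is omitted or ``None``.
    df_ver: ClassVar[str] = 'train'
    # All supported versions.
    vers: ClassVar[List[str]] = ['test', 'train', 'valid']

    def __init__(self, *, ver: Optional[str] = None):
        # Fall back to the class-level default version.
        if ver is None:
            ver = self.__class__.df_ver
        # Reject versions that the class does not declare (assumed behaviour).
        if ver not in self.__class__.vers:
            raise ValueError(f'Unsupported version: {ver!r}')
        self.ver = ver


# Omitting ``ver`` and passing ``None`` resolve to the same version.
assert ToyDset().ver == ToyDset(ver=None).ver == ToyDset.df_ver
```

Keeping vers and df_ver as class attributes means that both can be inspected without constructing an instance, which matches how they are accessed in the examples above.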

Sample from dataset#

One can access dataset samples through dataset instances. The only way to access a specific sample is by its index. For example, we can access the 0th and the 1st samples in the training set of lmp.dset.DemoDset as follows:

import lmp.dset

# Create dataset instance.
dataset = lmp.dset.DemoDset(ver='train')

# Access samples by indices.
sample_0 = dataset[0]
sample_1 = dataset[1]

One can use len to get the total number of samples in a dataset. For example, we can enumerate each sample in lmp.dset.DemoDset as follows:

import lmp.dset

# Use ``len`` to get dataset size.
dataset = lmp.dset.DemoDset(ver='train')
dataset_size = len(dataset)

# Access each sample in the dataset.
for index in range(dataset_size):
  print(dataset[index])

One can also enumerate samples by iterating over a dataset instance directly. For example, we can iterate through each sample in lmp.dset.DemoDset as follows:

import lmp.dset

# Iterate over the dataset directly.
for sample in lmp.dset.DemoDset(ver='train'):
  print(sample)
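The three access patterns above (indexing, len and iteration) correspond to three special methods in Python. The sketch below shows how a class could support all of them; ToySamples and its spls attribute are hypothetical names chosen for this illustration and do not describe how lmp.dset classes are actually implemented.

```python
from typing import Iterator, List


class ToySamples:
    """Hypothetical sample container; not part of ``lmp.dset``."""

    def __init__(self, spls: List[str]):
        self.spls = spls

    def __getitem__(self, idx: int) -> str:
        # Index-based access, as in ``dataset[0]``.
        return self.spls[idx]

    def __len__(self) -> int:
        # Total number of samples, as in ``len(dataset)``.
        return len(self.spls)

    def __iter__(self) -> Iterator[str]:
        # Iteration, as in ``for sample in dataset``.
        return iter(self.spls)


dataset = ToySamples(['first sample', 'second sample'])
by_index = [dataset[i] for i in range(len(dataset))]
by_iteration = list(dataset)
# Both access patterns visit the same samples in the same order.
assert by_index == by_iteration
```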

See also

lmp.script.sample_dset

Dataset sampling script.

Download dataset#

We provide download utilities so that datasets are downloaded automatically when they are not on your local machine. All downloaded files are placed under the project_root/data directory. For example, to download the training set of lmp.dset.WikiText2Dset, all you need to do is the following:

import lmp.dset

# Automatically download dataset if dataset is not on local machine.
dataset = lmp.dset.WikiText2Dset(ver='train')
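The "download only if missing" behaviour can be sketched in a few lines. This is an illustration under stated assumptions, not the project's download code: ensure_file, fake_fetch and the file name toy.train.txt are hypothetical, and the network request is replaced by a stand-in function so that the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_file(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Write ``fetch()`` to ``path`` if it is missing; return whether it fetched."""
    if path.exists():
        # File is already on the local machine; skip the download.
        return False
    # Create the data directory on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True


def fake_fetch() -> bytes:
    # Stand-in for a real network download (hypothetical content).
    return b'toy dataset content\n'


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'data' / 'toy.train.txt'
    first_call = ensure_file(target, fake_fetch)   # fetches and writes the file
    second_call = ensure_file(target, fake_fetch)  # reuses the local file
    content = target.read_bytes()
```

Passing the fetch step in as a callable keeps the caching logic separate from the transport, so the same check works whether the bytes come from a URL, an archive or a test stub.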

All available datasets#