Quick Start#

We provide installation instructions only for Ubuntu 20.04+.

Environment Prerequisites#

  1. We use Python 3.8+. You can install Python with

    apt install python3.8 python3.8-dev
    

    Note

    Currently (2022) the latest Python version supported by PyTorch is 3.8, which is why we install python3.8 instead of python3.10. You might need to use sudo to perform the installation.

  2. We use PyTorch 1.10+ and CUDA 11.2+. This only works if you have Nvidia GPUs. You can install the CUDA library through the Nvidia driver package with

    apt install nvidia-driver-470
    

    Note

    You might need to use sudo to perform the installation.

  3. We use pipenv to install Python dependencies. You can install pipenv with

    pip install pipenv
    

    Warning

    Do not use apt to install pipenv.

    Note

    You might want to set the environment variable PIPENV_VENV_IN_PROJECT=1 so that virtual environment folders are always created inside your Python projects. See the pipenv documentation for details.
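
    For example, you can enable this for your current shell session before running pipenv (an illustrative snippet; you may also add the line to your shell profile):

    export PIPENV_VENV_IN_PROJECT=1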

Installation#

  1. Clone the project from GitHub.

    git clone https://github.com/ProFatXuanAll/language-model-playground.git
    
  2. Change current directory to language-model-playground.

    cd language-model-playground
    
  3. Use pipenv to create a Python virtual environment and install the dependencies into it.

    pipenv install
    
  4. Launch the Python virtual environment created by pipenv.

    pipenv shell
    
  5. Now you can run any script provided by this project! For example, you can take a look at the Chinese poem dataset by running lmp.script.sample_dset:

    python -m lmp.script.sample_dset chinese-poem
    

Language Model Training Pipeline#

We now demonstrate a typical language model training pipeline.

Note

Throughout this tutorial you might see the symbol \ several times. It is used to break our CLI commands across multiple lines and avoid lengthy lines. All CLI commands can be written on a single line.
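
For example, the following two invocations are equivalent (reusing the sampling command shown earlier):

python -m lmp.script.sample_dset \
  chinese-poem

python -m lmp.script.sample_dset chinese-poem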

1. Choose a Dataset#

One has to choose a dataset on which to train a tokenizer and a language model. In this example we use the wiki-text-2 dataset WikiText2Dset as a demonstration.
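
You can preview a few samples from wiki-text-2 with the sampling script used earlier (assuming wiki-text-2 is accepted as a dataset name; see lmp.dset for the full list):

python -m lmp.script.sample_dset wiki-text-2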

See also

lmp.dset

All available datasets.

lmp.script.sample_dset

Dataset sampling script.

2. Train a Tokenizer#

The following example uses the whitespace tokenizer WsTknzr to train on the WikiText2Dset dataset, since samples in WikiText2Dset are English and words are separated by whitespace.

python -m lmp.script.train_tknzr whitespace \
  --dset_name wiki-text-2 \
  --exp_name my_tknzr_exp \
  --is_uncased \
  --max_vocab -1 \
  --min_count 0 \
  --ver train

See also

lmp.tknzr

All available tokenizers.

lmp.script.train_tknzr

Tokenizer training script.

3. Evaluate Tokenizer#

We can use the pre-trained tokenizer to tokenize arbitrary text. In the following example we tokenize the sentence hello world into the string list ['hello', 'world']:

python -m lmp.script.tknz_txt \
  --exp_name my_tknzr_exp \
  --txt "hello world"
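
Conceptually, the whitespace tokenizer behaves much like Python's built-in str.split. The snippet below is a simplified sketch for illustration only; the real WsTknzr additionally handles vocabulary limits and special tokens:

# Simplified illustration of whitespace tokenization (not the project's WsTknzr API).
text = "hello world"
tokens = text.lower().split()  # lower() roughly mimics the effect of --is_uncased
print(tokens)  # ['hello', 'world']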

See also

lmp.script.tknz_txt

Text tokenization script.

4. Train a Language Model#

Now we train our language model with the help of the pre-trained tokenizer. The following example trains an LSTM (2000 version) based language model LSTM2000:

python -m lmp.script.train_model LSTM-2000 \
  --batch_size 64 \
  --beta1 0.9 \
  --beta2 0.999 \
  --ckpt_step 1000 \
  --d_blk 1 \
  --d_emb 1000 \
  --dset_name wiki-text-2 \
  --eps 1e-8 \
  --exp_name my_model_exp \
  --init_fb 1.0 \
  --init_ib -1.0 \
  --init_lower -0.1 \
  --init_ob -1.0 \
  --init_upper 0.1 \
  --label_smoothing 0.0 \
  --log_step 200 \
  --lr 1e-4 \
  --max_norm 1 \
  --max_seq_len 32 \
  --n_blk 1000 \
  --n_lyr 1 \
  --p_emb 0.1 \
  --p_hid 0.1 \
  --stride 32 \
  --tknzr_exp_name my_tknzr_exp \
  --total_step 10000 \
  --ver train \
  --warmup_step 5000 \
  --weight_decay 1e-2
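
The optimizer-related flags above (--lr, --beta1, --beta2, --eps, --weight_decay) correspond to a typical AdamW configuration. The following Python sketch shows how such values map onto torch.optim.AdamW; it is an illustration under that assumption, not the project's exact training code:

import torch

model = torch.nn.Linear(8, 8)  # stand-in model, for illustration only
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,              # --lr
    betas=(0.9, 0.999),   # --beta1, --beta2
    eps=1e-8,             # --eps
    weight_decay=1e-2,    # --weight_decay
)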

We log the training process with tensorboard. You can launch tensorboard and open your browser at http://localhost:6006 to see the performance logs. Use the following script to launch tensorboard:

pipenv run tensorboard

See also

lmp.model

All available language models.

lmp.script.train_model

Language model training script.

5. Evaluate Language Model#

The following example uses the validation set of the wiki-text-2 dataset to evaluate the language model.

python -m lmp.script.eval_dset_ppl wiki-text-2 \
  --batch_size 32 \
  --first_ckpt 0 \
  --exp_name my_model_exp \
  --ver valid
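
Perplexity is the exponential of the average per-token cross-entropy. The following Python sketch illustrates the computation in general terms; it is not the project's evaluation code:

import torch

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits has shape (n_tokens, vocab_size); targets has shape (n_tokens,).
    avg_nll = torch.nn.functional.cross_entropy(logits, targets)
    return float(torch.exp(avg_nll))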

We log the evaluation process with tensorboard as well. As before, launch tensorboard and open your browser at http://localhost:6006 to see the performance logs:

pipenv run tensorboard

See also

lmp.script.eval_dset_ppl

Dataset perplexity evaluation script.

lmp.script.eval_txt_ppl

Text perplexity evaluation script.

6. Generate Continual Text#

We use the pre-trained language model to generate continual text conditioned on the given text segment A very good:

python -m lmp.script.gen_txt top-1 \
  --ckpt -1 \
  --exp_name my_model_exp \
  --max_seq_len 32 \
  --txt "A very good"
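
The top-1 inference method picks, at every step, the single most probable next token (greedy decoding). A minimal sketch of the idea in Python (illustration only, not the project's lmp.infer implementation):

import torch

def pick_next_token(logits: torch.Tensor) -> int:
    # Top-1 (greedy) decoding: choose the index with the highest score.
    return int(torch.argmax(logits, dim=-1))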

See also

lmp.infer

All available inference methods.

lmp.script.gen_txt

Continual text generation script.

7. Record Experiment Results#

Now you have finished an experiment. You can record your results and compare them with results obtained by others. See Experiment Results for others’ experiments and record yours!

Documents#

You can read the documents on this website or use the following steps to build them locally. We use Sphinx to build our documents.

  1. Install documentation dependencies.

    pipenv install --dev
    
  2. Build documents.

    pipenv run doc
    
  3. Open the root document in your browser.

    xdg-open doc/build/index.html
    

Testing#

This is for developers only.

  1. Install testing dependencies.

    pipenv install --dev
    
  2. Run test.

    pipenv run test
    
  3. Get test coverage report.

    pipenv run test-coverage