Quick Start
We provide installation instructions for Ubuntu 20.04+ only.
Environment Prerequisites
We use Python 3.8+. You can install Python with
apt install python3.8 python3.8-dev
We use PyTorch 1.10+ and CUDA 11.2+. CUDA works only if you have Nvidia GPUs. You can install the Nvidia driver with
apt install nvidia-driver-470
Note
You might need sudo to perform the installation.

We use pipenv to install Python dependencies. You can install pipenv with
pip install pipenv
Warning
Do not use apt to install pipenv.
Installation
Clone the project from GitHub.
git clone https://github.com/ProFatXuanAll/language-model-playground.git
Change the current directory to language-model-playground.
cd language-model-playground
Use pipenv to create a Python virtual environment and install dependencies into it.
pipenv install
Launch the Python virtual environment created by pipenv.
pipenv shell
Now you can run any script provided by this project! For example, you can take a look at the Chinese poem dataset by running lmp.script.sample_dset:
python -m lmp.script.sample_dset chinese-poem
Training Language Model Pipeline
We now demonstrate a typical language model training pipeline.
Note
Throughout this tutorial you might see the symbol \ several times. \ is used to format our CLI examples and avoid lengthy lines. Every CLI command can also fit on a single line.
1. Choose a Dataset
One has to choose a dataset to train a tokenizer and a language model on. In this example we use the wiki-text-2 dataset WikiText2Dset as a demonstration.
See also
- lmp.dset
All available datasets.
- lmp.script.sample_dset
Dataset sampling script.
2. Train a Tokenizer
The following example uses the whitespace tokenizer WsTknzr to train on the WikiText2Dset dataset, since samples in WikiText2Dset are English and words are separated by whitespace.
python -m lmp.script.train_tknzr whitespace \
--dset_name wiki-text-2 \
--exp_name my_tknzr_exp \
--is_uncased \
--max_vocab -1 \
--min_count 0 \
--ver train
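The flags above control vocabulary construction. As a rough sketch (plain Python, not the project's actual implementation), a whitespace tokenizer trained with --is_uncased, --min_count, and --max_vocab (where -1 means no size limit) might build its vocabulary like this:

```python
from collections import Counter

def build_vocab(samples, is_uncased=True, min_count=0, max_vocab=-1):
    """Build a whitespace-tokenizer vocabulary from text samples."""
    counter = Counter()
    for sample in samples:
        if is_uncased:
            sample = sample.lower()  # --is_uncased: fold case before counting
        counter.update(sample.split())
    # Keep tokens appearing at least `min_count` times, most frequent first.
    tokens = [tok for tok, cnt in counter.most_common() if cnt >= min_count]
    if max_vocab != -1:
        tokens = tokens[:max_vocab]  # trim to the --max_vocab most frequent
    return {tok: idx for idx, tok in enumerate(tokens)}

vocab = build_vocab(["Hello world", "hello there"])
# "hello" occurs twice, so it gets the smallest index.
```

Real tokenizers also reserve special tokens (padding, unknown, etc.); the sketch omits them.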
See also
- lmp.tknzr
All available tokenizers.
- lmp.script.train_tknzr
Tokenizer training script.
3. Evaluate Tokenizer
We can use the pre-trained tokenizer to tokenize arbitrary text. In the following example we tokenize the sentence hello world into the string list ['hello', 'world']:
python -m lmp.script.tknz_txt \
--exp_name my_tknzr_exp \
--txt "hello world"
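Conceptually, whitespace tokenization is close to Python's built-in str.split; a minimal sketch, ignoring special tokens and out-of-vocabulary handling:

```python
def ws_tokenize(txt, is_uncased=True):
    """Whitespace tokenization: lower-case (if uncased) and split on spaces."""
    if is_uncased:
        txt = txt.lower()
    return txt.split()

print(ws_tokenize("hello world"))  # ['hello', 'world']
```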
See also
- lmp.script.tknz_txt
Text tokenization script.
4. Train a Language Model
Now we train our language model with the help of the pre-trained tokenizer. The following example trains an LSTM (2000 version) based language model LSTM2000:
python -m lmp.script.train_model LSTM-2000 \
--batch_size 64 \
--beta1 0.9 \
--beta2 0.999 \
--ckpt_step 1000 \
--d_blk 1 \
--d_emb 1000 \
--dset_name wiki-text-2 \
--eps 1e-8 \
--exp_name my_model_exp \
--init_fb 1.0 \
--init_ib -1.0 \
--init_lower -0.1 \
--init_ob -1.0 \
--init_upper 0.1 \
--label_smoothing 0.0 \
--log_step 200 \
--lr 1e-4 \
--max_norm 1 \
--max_seq_len 32 \
--n_blk 1000 \
--n_lyr 1 \
--p_emb 0.1 \
--p_hid 0.1 \
--stride 32 \
--tknzr_exp_name my_tknzr_exp \
--total_step 10000 \
--ver train \
--warmup_step 5000 \
--weight_decay 1e-2
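Among the flags above, --max_seq_len and --stride together decide how each long sample is sliced into fixed-length training windows. A sketch of that sliding-window slicing, under the assumption that this is how the script chunks token sequences (with --stride 32 equal to --max_seq_len 32 the windows do not overlap):

```python
def slice_windows(token_ids, max_seq_len=32, stride=32):
    """Slice a token-id sequence into fixed-length training windows."""
    windows = []
    for start in range(0, len(token_ids), stride):
        window = token_ids[start:start + max_seq_len]
        if window:
            windows.append(window)
    return windows

ids = list(range(70))
wins = slice_windows(ids)  # 3 windows of lengths 32, 32, and 6
```

Choosing a stride smaller than max_seq_len would make consecutive windows overlap, trading more training examples for redundancy.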
We log the training process with TensorBoard. Launch TensorBoard with the command below, then open http://localhost:6006 in your browser to see the performance logs:
pipenv run tensorboard
See also
- lmp.model
All available language models.
- lmp.script.train_model
Language model training script.
5. Evaluate Language Model
The following example uses the validation set of the wiki-text-2 dataset to evaluate the language model.
python -m lmp.script.eval_dset_ppl wiki-text-2 \
--batch_size 32 \
--first_ckpt 0 \
--exp_name my_model_exp \
--ver valid
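The script reports perplexity, i.e. the exponential of the mean negative log-likelihood per token. A self-contained sketch of the definition (not the project's code):

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability of each target token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always assigns probability 0.25 has perplexity ~4:
# it is as uncertain as a uniform choice among 4 tokens.
ppl = perplexity([0.25, 0.25, 0.25])
```

Lower perplexity means the model assigns higher probability to the actual next tokens.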
As with training, the evaluation process is logged with TensorBoard (available at http://localhost:6006). Launch it with:
pipenv run tensorboard
See also
- lmp.script.eval_dset_ppl
Dataset perplexity evaluation script.
- lmp.script.eval_txt_ppl
Text perplexity evaluation script.
6. Generate Continual Text
We use the pre-trained language model to generate continual text conditioned on the given text segment A very good.
python -m lmp.script.gen_txt top-1 \
--ckpt -1 \
--exp_name my_model_exp \
--max_seq_len 32 \
--txt "A very good"
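Here top-1 means greedy decoding: at every step the single most probable next token is picked. A toy sketch, where next_token_probs is a hypothetical stand-in for the language model:

```python
def top1_generate(prompt_tokens, next_token_probs, max_seq_len=32, eos="<eos>"):
    """Greedy (top-1) decoding: always append the most probable next token."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_seq_len:
        probs = next_token_probs(tokens)  # dict: token -> probability
        best = max(probs, key=probs.get)  # the top-1 choice
        if best == eos:
            break
        tokens.append(best)
    return tokens

# Toy "model": predicts "day" after the prompt, then ends the sentence.
def toy_probs(tokens):
    if tokens[-1] == "day":
        return {"<eos>": 0.9, "day": 0.1}
    return {"day": 0.6, "<eos>": 0.4}

print(top1_generate(["A", "very", "good"], toy_probs))
# ['A', 'very', 'good', 'day']
```

Greedy decoding is deterministic; other inference methods in lmp.infer trade determinism for more varied output.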
See also
- lmp.infer
All available inference methods.
- lmp.script.gen_txt
Continual text generation script.
7. Record Experiment Results
Now you have finished an experiment. You can record your results and compare them with results from others. See Experiment Results for others' experiments and record yours!
Documents
You can read the documents on this website, or use the following steps to build them locally. We use Sphinx to build our documents.
Install documentation dependencies.
pipenv install --dev
Build documents.
pipenv run doc
Open the root document in your browser.
xdg-open doc/build/index.html
Testing
This is for developers only.
Install testing dependencies.
pipenv install --dev
Run test.
pipenv run test
Get test coverage report.
pipenv run test-coverage