Language Models#

Overview#

In this project, a language model is a deep learning model which predicts the next possible token conditioned on given tokens. Each language model has a loss function which can be optimized by the language model training script lmp.script.train_model. The optimization goal of a language model is to achieve low perplexity, which serves as an indication of performing well on next token prediction. A language model is paired with one and only one tokenizer. A language model always predicts tokens contained in the paired tokenizer’s vocabulary. When constructing a language model, one must first construct a tokenizer and then pass that tokenizer to the model constructor.
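
For reference, perplexity is the exponential of the average per-token cross-entropy loss (this is the standard definition; the symbols \(N\) and \(x_t\) are introduced only for this formula):

\[
\mathrm{PPL} = \exp\Bigl(-\frac{1}{N} \sum_{t=1}^{N} \log P(x_t \mid x_{<t})\Bigr),
\]

where \(N\) is the number of predicted tokens and \(x_t\) is the \(t\)-th token. Lowering the loss therefore lowers the perplexity.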

See also

lmp.tknzr

All available tokenizers.

lmp.script.eval_dset_ppl

Dataset perplexity evaluation script.

lmp.script.eval_txt_ppl

Text perplexity evaluation script.

lmp.script.gen_txt

Continual text generation script.

lmp.script.train_model

Language model training script.

Import language model module#

All language model classes are collectively gathered under the module lmp.model. One can import the language model module just like any other Python module:

import lmp.model

Create language model instances#

After importing lmp.model, one can create language model instances through the class attributes of lmp.model. For example, one can create an Elman-Net language model ElmanNet as follows:

import lmp.model
import lmp.tknzr

# Create tokenizer instance.
tokenizer = lmp.tknzr.CharTknzr()

# Create language model instance.
model = lmp.model.ElmanNet(tknzr=tokenizer)

Each language model is an instance of torch.nn.Module. Each language model is paired with one and only one tokenizer. In the example above we see that an Elman-Net language model can be paired with a character tokenizer.
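
Because a language model is a torch.nn.Module, the usual PyTorch utilities apply to it. The following minimal sketch checks the type and counts the model parameters (the printed count is only illustrative):

import lmp.model
import lmp.tknzr
import torch

# Create language model instance.
model = lmp.model.ElmanNet(tknzr=lmp.tknzr.CharTknzr())

# A language model is a torch.nn.Module ...
assert isinstance(model, torch.nn.Module)

# ... so standard PyTorch utilities apply, e.g. counting parameters.
n_params = sum(param.numel() for param in model.parameters())
print(n_params)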

Initialize language model parameters#

PyTorch provides built-in utilities for initializing model parameters. All initialization utilities are collectively gathered under the module torch.nn.init.

import lmp.model
import lmp.tknzr
import torch

# Create language model instance.
model = lmp.model.ElmanNet(tknzr=lmp.tknzr.CharTknzr())

# Initialize model parameters.
torch.nn.init.zeros_(model.fc_e2h.bias)
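
One can also write a custom initialization loop over all model parameters with the utilities in torch.nn.init. The scheme below (uniform weights, zero biases) is only an illustration, not the scheme used by this project:

import lmp.model
import lmp.tknzr
import torch

# Create language model instance.
model = lmp.model.ElmanNet(tknzr=lmp.tknzr.CharTknzr())

# Illustrative scheme only: uniform weight matrices, zero biases.
for param in model.parameters():
  if param.dim() > 1:
    torch.nn.init.uniform_(param, -0.1, 0.1)
  else:
    torch.nn.init.zeros_(param)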

If you cannot decide how to initialize a language model, we have provided a utility params_init for each language model to help you initialize model parameters.

import lmp.model
import lmp.tknzr

# Create language model instance.
model = lmp.model.ElmanNet(tknzr=lmp.tknzr.CharTknzr())

# Initialize model parameters.
model.params_init()

Calculate prediction loss#

One can calculate the mini-batch loss of a language model using the cal_loss method. For example,

import lmp.model
import lmp.tknzr
import torch

# Create tokenizer instance.
tokenizer = lmp.tknzr.CharTknzr()

# Build tokenizer vocabulary.
batch_txt = ['hello world', 'how are you']
tokenizer.build_vocab(batch_txt=batch_txt)

# Encode mini-batch.
batch_tkids = []
for txt in batch_txt:
  tkids = tokenizer.enc(txt=txt)
  tkids = tokenizer.pad_to_max(max_seq_len=20, tkids=tkids)
  batch_tkids.append(tkids)

# Convert mini-batch to tensor.
batch_tkids = torch.LongTensor(batch_tkids)

# Create language model instance.
model = lmp.model.ElmanNet(tknzr=tokenizer)

# Calculate mini-batch loss.
loss, batch_cur_states = model.cal_loss(
  batch_cur_tkids=batch_tkids[:, :-1],
  batch_next_tkids=batch_tkids[:, 1:],
  batch_prev_states=None,
)

The method cal_loss takes three inputs and returns a tuple. The batch_cur_tkids argument is the input token id list and the batch_next_tkids argument is the prediction target. Both batch_cur_tkids and batch_next_tkids are long tensors and have the same shape \((B, S)\), where \(B\) is the batch size and \(S\) is the input sequence length. We set batch_prev_states=None to use the initial hidden states. The first item in the returned tuple is a torch.Tensor which represents the mini-batch next token prediction loss. One can call the PyTorch built-in torch.Tensor.backward method on it to perform back-propagation. The second item in the returned tuple represents the current hidden states of the language model. The exact structure of the current hidden states depends on which language model is used. The current hidden states can be used as the initial hidden states of the next input. This is done by passing the current hidden states as batch_prev_states. This is needed since one can only process a limited sequence length at a time.
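
For instance, a single optimization step could look like the following sketch, continuing from the example above. The choice of optimizer and learning rate here is illustrative and is not the configuration used by lmp.script.train_model:

import torch

# Continue from the example above: back-propagate the mini-batch loss and
# update model parameters with a plain SGD optimizer (illustrative settings).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss.backward()
optimizer.step()
optimizer.zero_grad()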

Predict next token#

Next token prediction can be done with the pred method. The input of pred is almost the same as that of cal_loss, except that we do not pass the prediction target. This is because when performing evaluation one does not and cannot know the prediction target. One sets batch_prev_states=None to use the initial hidden states, just as in cal_loss. The returned tuple has two items. The first item in the returned tuple is a torch.Tensor which represents the next token id probability distribution. The probability distribution tensor has shape \((B, S, V)\), where \(B\) is the batch size, \(S\) is the input sequence length and \(V\) is the vocabulary size of the tokenizer paired with the language model. The second item in the returned tuple represents the current hidden states of the language model. One should compare this with cal_loss.

import lmp.model
import lmp.tknzr
import torch

# Create tokenizer instance.
tokenizer = lmp.tknzr.CharTknzr()

# Build tokenizer vocabulary.
batch_txt = ['hello world', 'how are you']
tokenizer.build_vocab(batch_txt=batch_txt)

# Encode mini-batch.
batch_tkids = []
for txt in batch_txt:
  tkids = tokenizer.enc(txt=txt)
  tkids = tokenizer.pad_to_max(max_seq_len=20, tkids=tkids)
  batch_tkids.append(tkids)

# Convert mini-batch to tensor.
batch_tkids = torch.LongTensor(batch_tkids)

# Create language model instance.
model = lmp.model.ElmanNet(tknzr=tokenizer)

# Calculate next token prediction.
pred, batch_cur_states = model.pred(
  batch_cur_tkids=batch_tkids,
  batch_prev_states=None,
)
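
The probability distribution can then be used to choose the next token. For example, a greedy choice takes the most probable token id at the last position of each sequence. This is only a minimal sketch; see lmp.script.gen_txt for the text generation provided by this project:

# Continue from the example above.  `pred` has shape (B, S, V); greedily pick
# the most probable next token id at the last position of each sequence.
next_tkids = pred[:, -1, :].argmax(dim=-1)
print(next_tkids)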

All available language models#