Language Model Playground 1.0.0 documentation

Elman Net: structure-related hyperparameters best possible settings

Abstract

This experiment aims to improve Elman Net performance over the baseline experiment. We increased d_emb, d_hid and n_lyr and recorded the resulting training loss, perplexity and accuracy. We found that

  • Increasing d_emb from 100 to 200 generally makes training loss and perplexity lower.

  • When d_emb = 100 or d_emb = 200, increasing n_lyr from 2 to 3 (or 4) generally makes training loss lower; when d_emb = 200, it generally makes perplexity lower as well.

  • Overfitting was observed.

  • \(100\%\) accuracy on the training set is possible.

  • Performance on the validation set is very poor. This might be the limit of Elman Net.

Environment setup

We ran the experiments on an Nvidia RTX 2070S. The CUDA version is 11.4 and the CUDA driver version is 470.129.06.

Experiment setup

We changed the values of d_emb, d_hid and n_lyr and recorded training loss and perplexity. The hyperparameters and their values are listed in the table below. Compare the value ranges with those of the baseline experiment.

| Name | Values |
| --- | --- |
| d_emb | \(\set{100, 150, 200}\) |
| d_hid | \(\set{100, 150, 200}\) |
| n_lyr | \(\set{2, 3, 4}\) |
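The three value ranges span a grid of 27 configurations. As a minimal sketch, the grid can be enumerated as follows; the helper `exp_name` is hypothetical and simply mirrors the naming pattern that appears in the accuracy script later on this page.

```python
from itertools import product

# Value ranges taken from the hyperparameter table.
D_EMB = (100, 150, 200)
D_HID = (100, 150, 200)
N_LYR = (2, 3, 4)


def exp_name(d_emb, d_hid, n_lyr):
  """Hypothetical helper: build an experiment name for one configuration."""
  return f'demo-d_emb-{d_emb}-d_hid-{d_hid}-n_lyr-{n_lyr}'


# Every combination of the three hyperparameters.
configs = list(product(D_EMB, D_HID, N_LYR))
names = [exp_name(*cfg) for cfg in configs]

print(len(configs))
print(names[0])
```

Each result table below has one row per entry of `configs`, in the same order.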

Tokenizer settings

We used lmp.script.train_tknzr to train a whitespace tokenizer WsTknzr. Compared to the baseline settings, using a whitespace tokenizer makes the vocabulary size larger. The script was executed as below:

python -m lmp.script.train_tknzr whitespace \
  --dset_name demo \
  --exp_name demo_tknzr \
  --is_uncased \
  --max_vocab -1 \
  --min_count 0 \
  --ver train
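To see why whitespace tokenization enlarges the vocabulary, consider the rough sketch below. It is not the WsTknzr implementation: the sample sentences are generated here for illustration (their shape follows the pattern used in the accuracy script), the number range is made up, special tokens are ignored, and the character-level split is included only as an assumed point of contrast.

```python
def ws_tokenize(txt, is_uncased=True):
  """Split on whitespace, lowercasing first when is_uncased is set."""
  if is_uncased:
    txt = txt.lower()
  return txt.split()


def char_tokenize(txt, is_uncased=True):
  """Split into single characters (assumed contrast, not from this page)."""
  if is_uncased:
    txt = txt.lower()
  return list(txt)


# Hypothetical samples; the real ones come from the demo dataset.
samples = [
  f'If you add {a} to {b} you get {a + b} .'
  for a in range(50)
  for b in range(50)
]

ws_vocab = {tk for spl in samples for tk in ws_tokenize(spl)}
char_vocab = {tk for spl in samples for tk in char_tokenize(spl)}

print(len(ws_vocab), len(char_vocab))
```

With whitespace splitting every distinct number becomes its own token, so the vocabulary grows with the range of numbers, while a character-level vocabulary stays bounded by the set of digits and letters.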

Model training settings

We trained the Elman Net language model ElmanNet with different model structure hyperparameters, using lmp.script.train_model. The script was executed as below, with D_EMB, D_HID, N_LYR and EXP_NAME replaced by the values of each configuration:

python -m lmp.script.train_model Elman-Net \
  --batch_size 32 \
  --beta1 0.9 \
  --beta2 0.999 \
  --ckpt_step 500 \
  --d_emb D_EMB \
  --d_hid D_HID \
  --dset_name demo \
  --eps 1e-8 \
  --exp_name EXP_NAME \
  --init_lower -0.1 \
  --init_upper 0.1 \
  --label_smoothing 0.0 \
  --log_step 100 \
  --lr 1e-3 \
  --max_norm 1 \
  --max_seq_len 35 \
  --n_lyr N_LYR \
  --p_emb 0.0 \
  --p_hid 0.0 \
  --seed 42 \
  --stride 35 \
  --tknzr_exp_name demo_tknzr \
  --total_step 40000 \
  --ver train \
  --warmup_step 10000 \
  --weight_decay 0.0
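The placeholders in the command above have to be filled in once per configuration. The sketch below shows one possible way to do that in Python; `TEMPLATE` is an abbreviated copy of the command that keeps only the placeholder flags, and `fill_template` is a hypothetical helper, not part of lmp.

```python
import shlex

# Abbreviated copy of the training command; only the placeholder flags are kept.
TEMPLATE = (
  'python -m lmp.script.train_model Elman-Net '
  '--d_emb D_EMB --d_hid D_HID --n_lyr N_LYR --exp_name EXP_NAME'
)


def fill_template(template, d_emb, d_hid, n_lyr):
  """Replace placeholder tokens with the values of one configuration."""
  values = {
    'D_EMB': str(d_emb),
    'D_HID': str(d_hid),
    'N_LYR': str(n_lyr),
    # Assumed naming pattern, mirroring the accuracy script on this page.
    'EXP_NAME': f'demo-d_emb-{d_emb}-d_hid-{d_hid}-n_lyr-{n_lyr}',
  }
  # Replace whole tokens only, so flags such as --d_emb are left untouched.
  return [values.get(tok, tok) for tok in shlex.split(template)]


cmd = fill_template(TEMPLATE, d_emb=100, d_hid=150, n_lyr=2)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run` once the remaining fixed flags are appended to the template.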

Model evaluation settings

We evaluated the language models using lmp.script.eval_dset_ppl. The script was executed as below, once for each dataset version VER (train, valid and test):

python -m lmp.script.eval_dset_ppl demo \
  --batch_size 512 \
  --exp_name EXP_NAME \
  --first_ckpt 0 \
  --last_ckpt -1 \
  --seed 42 \
  --ver VER
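For readers unfamiliar with the metric: perplexity is conventionally the exponential of the average per-token negative log-likelihood. The sketch below illustrates that textbook definition only; whether lmp.script.eval_dset_ppl uses exactly this base and averaging scheme is an assumption, and the function name is hypothetical.

```python
import math


def perplexity(token_probs):
  """exp of the mean negative log-likelihood of the correct tokens."""
  nll = [-math.log(p) for p in token_probs]
  return math.exp(sum(nll) / len(nll))


# A model that gives every correct token probability 0.5 is as uncertain
# as a fair coin flip per token, so its perplexity is close to 2.
print(perplexity([0.5, 0.5, 0.5, 0.5]))
```

Under this definition lower is better, and a perplexity of 1 would mean that every correct token is predicted with probability 1.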

Experiment results

All results were logged to TensorBoard. You can launch TensorBoard with the command

pipenv run tensorboard

Training loss

| d_emb | d_hid | n_lyr | 5k steps | 10k steps | 15k steps | 20k steps | 25k steps | 30k steps | 35k steps | 40k steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 100 | 2 | 1.043 | 0.9594 | 0.9187 | 0.8927 | 0.8647 | 0.8515 | 0.8371 | 0.8321 |
| 100 | 100 | 3 | 1.027 | 0.9519 | 0.9051 | 0.8775 | 0.855 | 0.8369 | 0.8175 | 0.8122 |
| 100 | 100 | 4 | 1.04 | 0.9851 | 0.9294 | 0.8947 | 0.8628 | 0.8543 | 0.8294 | 0.8223 |
| 100 | 150 | 2 | 1.036 | 0.96 | 0.9166 | 0.8774 | 0.8613 | 0.8378 | 0.8246 | 0.8189 |
| 100 | 150 | 3 | 1.017 | 0.9633 | 0.9202 | 0.9002 | 0.8678 | 0.8449 | 0.8257 | 0.8192 |
| 100 | 150 | 4 | 1.009 | 0.9833 | 0.9239 | 0.9004 | 0.8686 | 0.8287 | 0.816 | 0.81 |
| 100 | 200 | 2 | 1.026 | 0.9754 | 0.9341 | 0.8995 | 0.8743 | 0.8446 | 0.8331 | 0.8258 |
| 100 | 200 | 3 | 1.013 | 0.9676 | 0.9332 | 0.8963 | 0.8673 | 0.8452 | 0.8219 | 0.8163 |
| 100 | 200 | 4 | 1.019 | 0.9735 | 0.9311 | 0.8999 | 0.8698 | 0.843 | 0.8156 | 0.8088 |
| 150 | 100 | 2 | 1.032 | 0.947 | 0.9044 | 0.8719 | 0.8492 | 0.8284 | 0.8197 | 0.8127 |
| 150 | 100 | 3 | 1.027 | 0.9455 | 0.9033 | 0.876 | 0.8455 | 0.8224 | 0.815 | 0.8076 |
| 150 | 100 | 4 | 1.024 | 0.9553 | 0.9059 | 0.8767 | 0.8479 | 0.8153 | 0.8065 | 0.8009 |
| 150 | 150 | 2 | 1.008 | 0.9533 | 0.9095 | 0.8718 | 0.8398 | 0.8122 | 0.8026 | 0.797 |
| 150 | 150 | 3 | 1.006 | 0.9699 | 0.9125 | 0.8878 | 0.8527 | 0.82 | 0.8107 | 0.8046 |
| 150 | 150 | 4 | 1.01 | 0.9586 | 0.9154 | 0.8907 | 0.8576 | 0.8227 | 0.8057 | 0.7997 |
| 150 | 200 | 2 | 1.007 | 0.9572 | 0.9104 | 0.8758 | 0.8471 | 0.8183 | 0.8059 | 0.7998 |
| 150 | 200 | 3 | 1.012 | 0.965 | 0.9186 | 0.8866 | 0.8576 | 0.8296 | 0.8089 | 0.8023 |
| 150 | 200 | 4 | 1.01 | 0.975 | 0.9313 | 0.8979 | 0.8621 | 0.8305 | 0.808 | 0.801 |
| 200 | 100 | 2 | 1.014 | 0.9473 | 0.9065 | 0.8677 | 0.8453 | 0.8197 | 0.8095 | 0.8027 |
| 200 | 100 | 3 | 1.008 | 0.9393 | 0.8942 | 0.8656 | 0.8279 | 0.806 | 0.797 | 0.791 |
| 200 | 100 | 4 | 1.016 | 0.9672 | 0.9139 | 0.8786 | 0.85 | 0.8422 | 0.8063 | 0.7986 |
| 200 | 150 | 2 | 1.004 | 0.9612 | 0.9108 | 0.8885 | 0.844 | 0.8245 | 0.8047 | 0.799 |
| 200 | 150 | 3 | 0.9939 | 0.9445 | 0.8991 | 0.8701 | 0.8436 | 0.833 | 0.7979 | 0.7921 |
| 200 | 150 | 4 | 0.9971 | 0.9465 | 0.9113 | 0.88 | 0.8414 | 0.8129 | 0.7983 | 0.7892 |
| 200 | 200 | 2 | 0.9984 | 0.9661 | 0.9085 | 0.878 | 0.851 | 0.814 | 0.8032 | 0.7958 |
| 200 | 200 | 3 | 1.003 | 0.9727 | 0.9111 | 0.8805 | 0.8546 | 0.8162 | 0.8022 | 0.7956 |
| 200 | 200 | 4 | 0.9909 | 0.9617 | 0.9188 | 0.8797 | 0.8519 | 0.818 | 0.7969 | 0.7904 |

Observation 1: Increasing d_emb from 100 to 150 in general makes training loss smaller.

By fixing d_hid and n_lyr, we can compare training loss for d_emb = 100 and d_emb = 150. Each of the 9 (d_hid, n_lyr) pairs is compared at 8 checkpoints, which gives 72 comparisons. Most comparisons (\(\dfrac{67}{72}\)) show that training loss is smaller when d_emb is increased from 100 to 150.
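The counting behind these fractions can be sketched as follows. The two lists are the training-loss rows for d_hid = 100 and n_lyr = 2, copied from the training loss table; `count_smaller` is a hypothetical helper name.

```python
# Training loss at 5k, 10k, ..., 40k steps for d_hid = 100, n_lyr = 2.
loss_d_emb_100 = [1.043, 0.9594, 0.9187, 0.8927, 0.8647, 0.8515, 0.8371, 0.8321]
loss_d_emb_150 = [1.032, 0.947, 0.9044, 0.8719, 0.8492, 0.8284, 0.8197, 0.8127]


def count_smaller(before, after):
  """Count checkpoints where `after` is strictly smaller than `before`."""
  return sum(a < b for b, a in zip(before, after))


wins = count_smaller(loss_d_emb_100, loss_d_emb_150)
print(f'{wins}/{len(loss_d_emb_100)}')  # 8/8
```

Repeating this for all 9 (d_hid, n_lyr) pairs and summing the counts yields the numerator over 72. The same procedure with a different fixed pair of hyperparameters underlies the other observations.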

Observation 2: Increasing d_emb from 150 to 200 in general makes training loss smaller.

By fixing d_hid and n_lyr, we can compare training loss for d_emb = 150 and d_emb = 200. Most comparisons (\(\dfrac{52}{72}\)) show that training loss is smaller when d_emb is increased from 150 to 200.

Observation 3: Increasing d_hid from 100 to 150 in general makes training loss smaller.

By fixing d_emb and n_lyr, we can compare training loss for d_hid = 100 and d_hid = 150. Slightly more than half of the comparisons (\(\dfrac{39}{72}\)) show that training loss is smaller when d_hid is increased from 100 to 150.

Observation 4: Increasing d_hid from 150 to 200 in general makes training loss larger.

By fixing d_emb and n_lyr, we can compare training loss for d_hid = 150 and d_hid = 200. Most comparisons (\(\dfrac{43}{72}\)) show that training loss is larger when d_hid is increased from 150 to 200.

Observation 5: When d_emb = 100, increasing n_lyr from 2 to 3 in general makes training loss smaller.

By fixing d_emb = 100 and d_hid, we can compare training loss for n_lyr = 2 and n_lyr = 3. Most comparisons (\(\dfrac{17}{24}\)) show that training loss is smaller when n_lyr is increased from 2 to 3.

Observation 6: When d_emb = 100, increasing n_lyr from 2 to 4 in general makes training loss smaller.

By fixing d_emb = 100 and d_hid, we can compare training loss for n_lyr = 2 and n_lyr = 4. Slightly more than half of the comparisons (\(\dfrac{15}{24}\)) show that training loss is smaller when n_lyr is increased from 2 to 4.

Observation 7: When d_emb = 150, increasing n_lyr from 2 to 3 in general makes training loss larger.

By fixing d_emb = 150 and d_hid, we can compare training loss for n_lyr = 2 and n_lyr = 3. Two thirds of the comparisons (\(\dfrac{16}{24}\)) show that training loss is larger when n_lyr is increased from 2 to 3.

Observation 8: When d_emb = 150, increasing n_lyr from 2 to 4 in general makes training loss larger.

By fixing d_emb = 150 and d_hid, we can compare training loss for n_lyr = 2 and n_lyr = 4. Most comparisons (\(\dfrac{19}{24}\)) show that training loss is larger when n_lyr is increased from 2 to 4.

Observation 9: When d_emb = 200, increasing n_lyr from 2 to 3 in general makes training loss smaller.

By fixing d_emb = 200 and d_hid, we can compare training loss for n_lyr = 2 and n_lyr = 3. Most comparisons (\(\dfrac{17}{24}\)) show that training loss is smaller when n_lyr is increased from 2 to 3.

Observation 10: When d_emb = 200, increasing n_lyr from 2 to 4 in general makes training loss smaller.

By fixing d_emb = 200 and d_hid, we can compare training loss for n_lyr = 2 and n_lyr = 4. Slightly more than half of the comparisons (\(\dfrac{14}{24}\)) show that training loss is smaller when n_lyr is increased from 2 to 4.

Observation 11: Minimum loss is achieved when d_emb = 200, d_hid = 150 and n_lyr = 4.

Observation 12: Training loss is still decreasing in all configurations.

All comparisons (\(\dfrac{189}{189}\)) show that training loss is still decreasing no matter which configuration is used. This suggests that further training may be required.
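The figure of 189 is consistent with 27 configurations times 7 pairs of consecutive checkpoints. A minimal sketch of the per-configuration check, using the row for d_emb = 200, d_hid = 150 and n_lyr = 4 from the training loss table (`count_decreasing` is a hypothetical helper name):

```python
# Training loss at 5k, 10k, ..., 40k steps for d_emb = 200, d_hid = 150, n_lyr = 4.
losses = [0.9971, 0.9465, 0.9113, 0.88, 0.8414, 0.8129, 0.7983, 0.7892]


def count_decreasing(values):
  """Count consecutive checkpoint pairs where the later value is smaller."""
  return sum(later < earlier for earlier, later in zip(values, values[1:]))


decreasing = count_decreasing(losses)
# Relative drop over the last 5k steps, as a rough sign of remaining headroom.
last_drop = (losses[-2] - losses[-1]) / losses[-2]

print(f'{decreasing}/{len(losses) - 1}')  # 7/7
print(f'{last_drop:.2%}')
```

The loss in this row still falls by roughly one percent between the last two checkpoints, which is in line with the suggestion that training has not fully converged.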

Perplexity

Each cell lists perplexity as train / valid / test.

| d_emb | d_hid | n_lyr | 5k steps | 10k steps | 15k steps | 20k steps | 25k steps | 30k steps | 35k steps | 40k steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 100 | 2 | 2.588 / 4.489 / 2.986 | 2.396 / 6.753 / 2.755 | 2.315 / 12.3 / 2.673 | 2.27 / 21.63 / 2.652 | 2.203 / 26.53 / 2.573 | 2.178 / 29.93 / 2.547 | 2.149 / 30.92 / 2.509 | 2.142 / 30.5 / 2.499 |
| 100 | 100 | 3 | 2.57 / 6.25 / 2.909 | 2.362 / 17.83 / 2.792 | 2.3 / 27.96 / 2.689 | 2.224 / 40.18 / 2.626 | 2.191 / 44.71 / 2.528 | 2.131 / 56.2 / 2.586 | 2.114 / 58.28 / 2.556 | 2.106 / 59.4 / 2.545 |
| 100 | 100 | 4 | 2.579 / 4.701 / 2.925 | 2.421 / 23.84 / 2.847 | 2.32 / 68.85 / 2.609 | 2.278 / 119.4 / 2.615 | 2.247 / 154.6 / 2.63 | 2.17 / 156.5 / 2.494 | 2.137 / 168.6 / 2.438 | 2.127 / 175.2 / 2.453 |
| 100 | 150 | 2 | 2.588 / 4.999 / 2.974 | 2.403 / 11.97 / 2.715 | 2.328 / 19.11 / 2.729 | 2.244 / 24.6 / 2.615 | 2.184 / 29.94 / 2.552 | 2.164 / 33.04 / 2.562 | 2.126 / 34.04 / 2.52 | 2.118 / 34.64 / 2.523 |
| 100 | 150 | 3 | 2.538 / 4.23 / 2.878 | 2.438 / 11.23 / 2.808 | 2.309 / 19.04 / 2.625 | 2.26 / 26.82 / 2.583 | 2.201 / 32.99 / 2.579 | 2.166 / 38.65 / 2.55 | 2.127 / 39.76 / 2.483 | 2.119 / 40.07 / 2.469 |
| 100 | 150 | 4 | 2.518 / 4.412 / 2.838 | 2.436 / 13.16 / 2.817 | 2.328 / 30.12 / 2.736 | 2.29 / 46.5 / 2.611 | 2.205 / 48.3 / 2.548 | 2.129 / 52.22 / 2.429 | 2.109 / 59.41 / 2.409 | 2.101 / 59.05 / 2.413 |
| 100 | 200 | 2 | 2.545 / 4.805 / 2.873 | 2.464 / 15.89 / 2.841 | 2.342 / 30.28 / 2.726 | 2.277 / 39.29 / 2.681 | 2.227 / 46.19 / 2.616 | 2.162 / 48.54 / 2.569 | 2.141 / 48.05 / 2.51 | 2.133 / 49.23 / 2.504 |
| 100 | 200 | 3 | 2.512 / 5.707 / 2.881 | 2.405 / 20.45 / 2.761 | 2.331 / 40.46 / 2.695 | 2.271 / 55.97 / 2.656 | 2.221 / 58.88 / 2.547 | 2.167 / 68.22 / 2.519 | 2.12 / 68.44 / 2.458 | 2.111 / 68.52 / 2.455 |
| 100 | 200 | 4 | 2.555 / 6.489 / 3.034 | 2.402 / 27.98 / 2.809 | 2.319 / 35.38 / 2.663 | 2.262 / 43.32 / 2.601 | 2.207 / 51.82 / 2.581 | 2.157 / 56.78 / 2.516 | 2.108 / 61.49 / 2.479 | 2.099 / 62.23 / 2.462 |
| 150 | 100 | 2 | 2.558 / 5.168 / 2.926 | 2.354 / 14.35 / 2.727 | 2.287 / 23.78 / 2.659 | 2.215 / 31.73 / 2.629 | 2.176 / 33.97 / 2.574 | 2.132 / 36.96 / 2.495 | 2.115 / 40.21 / 2.504 | 2.108 / 40.35 / 2.482 |
| 150 | 100 | 3 | 2.542 / 6.571 / 2.919 | 2.354 / 15.73 / 2.702 | 2.274 / 22.72 / 2.559 | 2.222 / 28.45 / 2.586 | 2.17 / 35.1 / 2.484 | 2.122 / 40.48 / 2.48 | 2.106 / 44.3 / 2.485 | 2.098 / 45.63 / 2.467 |
| 150 | 100 | 4 | 2.547 / 10.76 / 3.055 | 2.365 / 15.5 / 2.741 | 2.266 / 35.47 / 2.647 | 2.216 / 56.28 / 2.539 | 2.176 / 71.85 / 2.51 | 2.109 / 79.58 / 2.44 | 2.091 / 88.16 / 2.438 | 2.084 / 90.33 / 2.422 |
| 150 | 150 | 2 | 2.514 / 7.944 / 2.923 | 2.361 / 23.62 / 2.732 | 2.272 / 39.04 / 2.676 | 2.21 / 50.69 / 2.561 | 2.151 / 60.86 / 2.52 | 2.1 / 71.3 / 2.481 | 2.083 / 72.28 / 2.455 | 2.077 / 73.39 / 2.452 |
| 150 | 150 | 3 | 2.494 / 8.508 / 2.865 | 2.43 / 38.41 / 2.779 | 2.297 / 61.11 / 2.605 | 2.257 / 90.4 / 2.625 | 2.173 / 115.7 / 2.51 | 2.114 / 135.6 / 2.462 | 2.097 / 148.8 / 2.452 | 2.09 / 147.4 / 2.438 |
| 150 | 150 | 4 | 2.504 / 7.715 / 2.829 | 2.382 / 33.2 / 2.814 | 2.327 / 56.41 / 2.693 | 2.245 / 74.8 / 2.602 | 2.19 / 88.55 / 2.555 | 2.122 / 98.17 / 2.474 | 2.089 / 108.8 / 2.448 | 2.081 / 109.2 / 2.433 |
| 150 | 200 | 2 | 2.505 / 5.688 / 2.822 | 2.405 / 39.71 / 2.796 | 2.27 / 71.41 / 2.618 | 2.221 / 80.56 / 2.576 | 2.166 / 99.65 / 2.561 | 2.113 / 109.2 / 2.482 | 2.088 / 114.6 / 2.453 | 2.081 / 114 / 2.446 |
| 150 | 200 | 3 | 2.535 / 6.452 / 2.912 | 2.446 / 63.95 / 2.809 | 2.307 / 163.4 / 2.657 | 2.244 / 220.2 / 2.579 | 2.18 / 230.6 / 2.539 | 2.128 / 279 / 2.501 | 2.094 / 291.9 / 2.454 | 2.086 / 301 / 2.445 |
| 150 | 200 | 4 | 2.477 / 7.073 / 2.822 | 2.445 / 30.17 / 2.816 | 2.32 / 43.03 / 2.732 | 2.278 / 53.86 / 2.608 | 2.208 / 67.19 / 2.546 | 2.132 / 76.35 / 2.501 | 2.092 / 78.57 / 2.455 | 2.084 / 80.15 / 2.444 |
| 200 | 100 | 2 | 2.518 / 6.878 / 2.853 | 2.368 / 41.77 / 2.817 | 2.266 / 124.7 / 2.659 | 2.2 / 233.4 / 2.602 | 2.153 / 331.7 / 2.537 | 2.112 / 450.7 / 2.478 | 2.095 / 544 / 2.516 | 2.089 / 558.5 / 2.497 |
| 200 | 100 | 3 | 2.507 / 9.783 / 2.864 | 2.344 / 24.58 / 2.717 | 2.266 / 38.58 / 2.698 | 2.193 / 44.55 / 2.582 | 2.13 / 55.65 / 2.542 | 2.088 / 59.09 / 2.472 | 2.07 / 61.16 / 2.459 | 2.064 / 62.02 / 2.467 |
| 200 | 100 | 4 | 2.516 / 8.239 / 2.857 | 2.405 / 20.88 / 2.77 | 2.299 / 29.06 / 2.668 | 2.234 / 41.72 / 2.574 | 2.197 / 51.4 / 2.562 | 2.175 / 59.59 / 2.575 | 2.088 / 64.57 / 2.455 | 2.08 / 67.06 / 2.444 |
| 200 | 150 | 2 | 2.52 / 5.719 / 2.851 | 2.402 / 24.45 / 2.805 | 2.28 / 50.64 / 2.638 | 2.241 / 84.59 / 2.645 | 2.164 / 107.5 / 2.571 | 2.122 / 116.8 / 2.517 | 2.087 / 122 / 2.461 | 2.08 / 126.3 / 2.46 |
| 200 | 150 | 3 | 2.468 / 7.356 / 2.898 | 2.393 / 18.42 / 2.763 | 2.28 / 27.93 / 2.663 | 2.218 / 37.08 / 2.565 | 2.147 / 46.77 / 2.546 | 2.122 / 49.58 / 2.495 | 2.073 / 52.52 / 2.45 | 2.067 / 52.9 / 2.443 |
| 200 | 150 | 4 | 2.48 / 7.631 / 2.849 | 2.374 / 21.66 / 2.639 | 2.273 / 45.17 / 2.623 | 2.214 / 58.63 / 2.587 | 2.136 / 68.66 / 2.501 | 2.129 / 87.26 / 2.519 | 2.069 / 89.91 / 2.436 | 2.062 / 89.39 / 2.429 |
| 200 | 200 | 2 | 2.485 / 6.539 / 2.872 | 2.379 / 35.74 / 2.747 | 2.281 / 61.56 / 2.705 | 2.231 / 73.16 / 2.565 | 2.169 / 81.68 / 2.572 | 2.102 / 89.24 / 2.49 | 2.083 / 92.18 / 2.481 | 2.075 / 92.33 / 2.47 |
| 200 | 200 | 3 | 2.487 / 8.765 / 2.862 | 2.379 / 26.74 / 2.678 | 2.287 / 48.8 / 2.638 | 2.227 / 57.39 / 2.613 | 2.19 / 71.3 / 2.561 | 2.112 / 82.03 / 2.535 | 2.08 / 85.65 / 2.458 | 2.073 / 87.17 / 2.459 |
| 200 | 200 | 4 | 2.452 / 7.022 / 2.802 | 2.379 / 42.21 / 2.695 | 2.324 / 75.96 / 2.685 | 2.223 / 85.98 / 2.566 | 2.176 / 98.35 / 2.563 | 2.111 / 110.2 / 2.526 | 2.07 / 116.7 / 2.466 | 2.063 / 120.3 / 2.465 |

Observation 1: Increasing d_emb from 100 to 150 in general makes perplexity smaller.

By fixing d_hid and n_lyr, we can compare perplexity for d_emb = 100 and d_emb = 150. Each of the 9 (d_hid, n_lyr) pairs is compared at 8 checkpoints on 3 dataset versions, which gives 216 comparisons. Most comparisons (\(\dfrac{138}{216}\)) show that perplexity is smaller when d_emb is increased from 100 to 150.

Observation 2: Increasing d_emb from 150 to 200 in general makes perplexity smaller.

By fixing d_hid and n_lyr, we can compare perplexity for d_emb = 150 and d_emb = 200. More than half of the comparisons (\(\dfrac{125}{216}\)) show that perplexity is smaller when d_emb is increased from 150 to 200.

Observation 3: Increasing d_hid from 100 to 150 in general makes perplexity smaller.

By fixing d_emb and n_lyr, we can compare perplexity for d_hid = 100 and d_hid = 150. Slightly more than half of the comparisons (\(\dfrac{114}{216}\)) show that perplexity is smaller when d_hid is increased from 100 to 150.

Observation 4: Increasing d_hid from 150 to 200 in general makes perplexity larger.

By fixing d_emb and n_lyr, we can compare perplexity for d_hid = 150 and d_hid = 200. Most comparisons (\(\dfrac{144}{216}\)) show that perplexity is larger when d_hid is increased from 150 to 200.

Observation 5: When d_emb = 100 and d_hid = 100, increasing n_lyr from 2 to 3 in general makes perplexity larger.

By fixing d_emb = 100 and d_hid = 100, we can compare perplexity for n_lyr = 2 and n_lyr = 3. Slightly more than half of the comparisons (\(\dfrac{13}{24}\)) show that perplexity is larger when n_lyr is increased from 2 to 3.

Observation 6: When d_emb = 100 and d_hid = 150, increasing n_lyr from 2 to 3 in general makes perplexity larger.

By fixing d_emb = 100 and d_hid = 150, we can compare perplexity for n_lyr = 2 and n_lyr = 3. Slightly more than half of the comparisons (\(\dfrac{13}{24}\)) show that perplexity is larger when n_lyr is increased from 2 to 3.

Observation 7: When d_emb = 100 and d_hid = 200, increasing n_lyr from 2 to 3 in general makes perplexity smaller.

By fixing d_emb = 100 and d_hid = 200, we can compare perplexity for n_lyr = 2 and n_lyr = 3. Fewer than half of the comparisons (\(\dfrac{10}{24}\)) show that perplexity is larger when n_lyr is increased from 2 to 3; in the remaining 14 comparisons perplexity is smaller.

Observation 8: When d_emb = 100 and d_hid = 100, increasing n_lyr from 2 to 4 in general makes perplexity larger.

By fixing d_emb = 100 and d_hid = 100, we can compare perplexity for n_lyr = 2 and n_lyr = 4. Slightly more than half of the comparisons (\(\dfrac{14}{24}\)) show that perplexity is larger when n_lyr is increased from 2 to 4.

Observation 9: When d_emb = 100 and d_hid = 150, increasing n_lyr from 2 to 4 shows no clear trend in perplexity.

By fixing d_emb = 100 and d_hid = 150, we can compare perplexity for n_lyr = 2 and n_lyr = 4. Half of the comparisons (\(\dfrac{12}{24}\)) show that perplexity is larger when n_lyr is increased from 2 to 4.

Observation 10: When d_emb = 100 and d_hid = 200, increasing n_lyr from 2 to 4 in general makes perplexity smaller.

By fixing d_emb = 100 and d_hid = 200, we can compare perplexity for n_lyr = 2 and n_lyr = 4. Fewer than half of the comparisons (\(\dfrac{10}{24}\)) show that perplexity is larger when n_lyr is increased from 2 to 4; in the remaining 14 comparisons perplexity is smaller.

Observation 11: When d_emb = 150, increasing n_lyr from 2 to 4 in general makes perplexity larger.

By fixing d_emb = 150 and d_hid, we can compare perplexity for n_lyr = 2 and n_lyr = 4. More than half of the comparisons (\(\dfrac{43}{72}\)) show that perplexity is larger when n_lyr is increased from 2 to 4.

Observation 12: When d_emb = 200, increasing n_lyr from 2 to 3 in general makes perplexity smaller.

By fixing d_emb = 200 and d_hid, we can compare perplexity for n_lyr = 2 and n_lyr = 3. Most comparisons (\(\dfrac{58}{72}\)) show that perplexity is smaller when n_lyr is increased from 2 to 3.

Observation 13: When d_emb = 200, increasing n_lyr from 2 to 4 in general makes perplexity smaller.

By fixing d_emb = 200 and d_hid, we can compare perplexity for n_lyr = 2 and n_lyr = 4. Most comparisons (\(\dfrac{46}{72}\)) show that perplexity is smaller when n_lyr is increased from 2 to 4.

Observation 14: Overfitting seems to happen.

On the test set, most comparisons (\(\dfrac{170}{189}\)) show that perplexity is still decreasing. However, on the validation set, most comparisons (\(\dfrac{183}{189}\)) show that perplexity is increasing. Most of the increase in validation perplexity occurs at the 10k or 15k step.
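One way to make this pattern concrete is to compare the training and validation curves of a single configuration. The sketch below uses the perplexity row for d_emb = 100, d_hid = 100 and n_lyr = 2 from the table; picking the checkpoint with the lowest validation perplexity is an early-stopping style heuristic added here for illustration, not something the experiment itself applied.

```python
steps = [5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000]
# Perplexity for d_emb = 100, d_hid = 100, n_lyr = 2 (copied from the table).
train = [2.588, 2.396, 2.315, 2.27, 2.203, 2.178, 2.149, 2.142]
valid = [4.489, 6.753, 12.3, 21.63, 26.53, 29.93, 30.92, 30.5]

# Training perplexity keeps falling at every checkpoint ...
train_decreasing = all(b < a for a, b in zip(train, train[1:]))
# ... while the gap between validation and training perplexity widens.
gap = [v - t for t, v in zip(train, valid)]
# Checkpoint that an early-stopping rule on validation perplexity would keep.
best_valid_step = steps[valid.index(min(valid))]

print(train_decreasing, best_valid_step)  # True 5000
```

For this row the best validation checkpoint is already the first one recorded, so the model starts to overfit before 10k steps.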

Observation 15: Minimum perplexity on the training set is achieved at the 40k step when d_emb = 200, d_hid = 150 and n_lyr = 4.

  • On the training set, minimum perplexity \(2.062\) is achieved at the 40k step when d_emb = 200, d_hid = 150 and n_lyr = 4.

  • On the validation set, minimum perplexity \(4.23\) is achieved at the 5k step when d_emb = 100, d_hid = 150 and n_lyr = 3.

  • On the test set, minimum perplexity \(2.409\) is achieved at the 35k step when d_emb = 100, d_hid = 150 and n_lyr = 4; the same configuration reaches \(2.413\) at the 40k step.

Observation 16: Only when setting d_emb = 200 and d_hid = 150 is perplexity lower than \(2.1\).

Later, in the accuracy experiments, we see that accuracy reaches \(100\%\) only when perplexity is lower than \(2.1\).

Accuracy

We used the following script to calculate accuracy on the demo dataset:

import re

import torch

import lmp.dset
import lmp.infer
import lmp.model
import lmp.script
import lmp.tknzr
import lmp.util.model
import lmp.util.tknzr

device = torch.device('cuda')
# The tokenizer is shared by every experiment, so it is loaded once.
tknzr = lmp.util.tknzr.load(exp_name='demo_tknzr')
# The trailing period is escaped so that it matches a literal '.'.
pattern = re.compile(r'If you add (\d+) to (\d+) you get (\d+) \.')

for d_emb in [100, 150, 200]:
  for d_hid in [100, 150, 200]:
    for n_lyr in [2, 3, 4]:
      exp_name = f'demo-d_emb-{d_emb}-d_hid-{d_hid}-n_lyr-{n_lyr}'
      for ckpt in [5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000]:
        # Load each checkpoint once and reuse it for every dataset version.
        model = lmp.util.model.load(exp_name=exp_name, ckpt=ckpt).to(device)
        infer = lmp.infer.Top1Infer(max_seq_len=35)

        for ver in lmp.dset.DemoDset.vers:
          dset = lmp.dset.DemoDset(ver=ver)

          correct = 0
          for spl in dset:
            match = pattern.match(spl)
            # Keep only the question as the prompt; the model must generate the answer.
            prompt = f'If you add {match.group(1)} to {match.group(2)} you get '

            output = infer.gen(model=model, tknzr=tknzr, txt=prompt)

            # A sample is correct only if prompt plus generated text reproduces it exactly.
            if prompt + output == spl:
              correct += 1

          print(f'{exp_name}, ckpt: {ckpt}, ver: {ver}, acc: {correct / len(dset) * 100 :.2f}%')
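The metric computed by the script is exact-match accuracy: a sample counts as correct only when the prompt plus the generated continuation equals the original sample. The sketch below isolates that logic so it can be run without lmp or a trained model. The functions `exact_match_accuracy`, `oracle` and `always_zero` are hypothetical stand-ins, and the two samples are made up for illustration.

```python
import re

PATTERN = re.compile(r'If you add (\d+) to (\d+) you get (\d+) \.')


def exact_match_accuracy(samples, generate):
  """Percentage of samples that `generate` reproduces exactly from the prompt."""
  correct = 0
  for spl in samples:
    match = PATTERN.match(spl)
    prompt = f'If you add {match.group(1)} to {match.group(2)} you get '
    if prompt + generate(prompt) == spl:
      correct += 1
  return correct / len(samples) * 100


def oracle(prompt):
  """Stand-in generator that actually performs the addition."""
  a, b = map(int, re.findall(r'\d+', prompt))
  return f'{a + b} .'


def always_zero(prompt):
  """Stand-in generator that ignores the prompt."""
  return '0 .'


samples = ['If you add 1 to 2 you get 3 .', 'If you add 0 to 0 you get 0 .']
print(exact_match_accuracy(samples, oracle))       # 100.0
print(exact_match_accuracy(samples, always_zero))  # 50.0
```

Because the comparison is on the full string, any extra or missing token in the generated text counts as an error, which makes this a strict metric.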

Each cell lists accuracy (%) as train / valid / test.

| d_emb | d_hid | n_lyr | 5k steps | 10k steps | 15k steps | 20k steps | 25k steps | 30k steps | 35k steps | 40k steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 100 | 2 | 23.45 / 8.1 / 18 | 31.39 / 9.07 / 22 | 45.19 / 7.21 / 24 | 54.08 / 5.58 / 41 | 81.23 / 7.31 / 56 | 85.6 / 6.3 / 65 | 98.22 / 7.56 / 84 | 98.46 / 8.1 / 88 |
| 100 | 100 | 3 | 11.43 / 3.92 / 8 | 35.15 / 5.82 / 17 | 39.45 / 6.32 / 33 | 70.16 / 7.64 / 53 | 79.66 / 8.02 / 78 | 98.83 / 7.45 / 83 | 99.7 / 8.48 / 92 | 99.6 / 8.4 / 92 |
| 100 | 100 | 4 | 20.44 / 8.44 / 17 | 23.9 / 3.21 / 10 | 40.1 / 5.8 / 40 | 46.75 / 4 / 31 | 55.11 / 4.75 / 54 | 89.17 / 5.74 / 72 | 98.06 / 6.24 / 85 | 99.81 / 6.91 / 94 |
| 100 | 150 | 2 | 13.35 / 8.38 / 8 | 31.86 / 6.3 / 24 | 35.41 / 4.89 / 22 | 64.14 / 6.71 / 51 | 88.30 / 6.51 / 66 | 88.38 / 5.33 / 61 | 99.35 / 6.22 / 88 | 99.6 / 6.26 / 88 |
| 100 | 150 | 3 | 17.47 / 11.54 / 15 | 21.88 / 4.91 / 20 | 47.07 / 4.14 / 26 | 56.14 / 2.85 / 29 | 76.53 / 3.92 / 54 | 88.34 / 3.41 / 64 | 99.07 / 3.84 / 87 | 99.58 / 4.04 / 88 |
| 100 | 150 | 4 | 19.62 / 8.59 / 13 | 18.81 / 2.28 / 7 | 34.53 / 2.89 / 18 | 44.65 / 3.8 / 38 | 69.98 / 3.52 / 49 | 99.13 / 4.06 / 82 | 99.9 / 4.34 / 92 | 99.92 / 4.42 / 95 |
| 100 | 200 | 2 | 26.38 / 10.16 / 12 | 20.42 / 4.1 / 13 | 38.59 / 3.07 / 26 | 54.28 / 3.72 / 29 | 67.47 / 2.95 / 52 | 93.89 / 3.39 / 71 | 96.16 / 3.43 / 80 | 97.82 / 3.62 / 85 |
| 100 | 200 | 3 | 26.71 / 7.05 / 17 | 27.03 / 3.78 / 22 | 38.14 / 3.68 / 30 | 49.29 / 2.79 / 28 | 68.4 / 2.69 / 54 | 85.6 / 2.63 / 56 | 99.78 / 3.21 / 86 | 99.78 / 2.91 / 86 |
| 100 | 200 | 4 | 12.59 / 3.49 / 3 | 28.65 / 2.59 / 14 | 43.94 / 2.69 / 27 | 57.37 / 1.52 / 37 | 73.15 / 2.22 / 51 | 90.38 / 2.32 / 67 | 99.88 / 2.4 / 77 | 99.9 / 2.38 / 79 |
| 150 | 100 | 2 | 23.01 / 7.25 / 16 | 40.83 / 4.99 / 27 | 48.55 / 4.36 / 34 | 71.92 / 4.97 / 48 | 85.23 / 5.43 / 51 | 96.3 / 7.47 / 80 | 98.87 / 6.51 / 81 | 99.29 / 6.99 / 86 |
| 150 | 100 | 3 | 23.8 / 5.03 / 14 | 36.12 / 6.2 / 21 | 51.52 / 6.89 / 31 | 60.91 / 5.94 / 55 | 82.65 / 5.94 / 62 | 98.87 / 6.81 / 85 | 99.54 / 6.85 / 90 | 99.6 / 7.07 / 89 |
| 150 | 100 | 4 | 22.65 / 3.52 / 14 | 34.57 / 5.25 / 27 | 54.52 / 4.16 / 36 | 64.22 / 4.24 / 46 | 73.21 / 4.89 / 58 | 99.39 / 5.58 / 90 | 99.74 / 5.27 / 88 | 99.8 / 5.47 / 93 |
| 150 | 150 | 2 | 20.95 / 5.35 / 13 | 33.9 / 5.52 / 33 | 46.65 / 4.65 / 34 | 67.92 / 3.9 / 42 | 86.22 / 3.13 / 65 | 99.6 / 3.03 / 87 | 99.8 / 2.93 / 89 | 99.8 / 3.05 / 89 |
| 150 | 150 | 3 | 22.79 / 6.93 / 18 | 23.07 / 2.79 / 20 | 45.31 / 4.24 / 34 | 51.37 / 3.66 / 33 | 80.06 / 4.08 / 68 | 99.25 / 4.12 / 84 | 99.92 / 4.26 / 92 | 99.92 / 4.46 / 94 |
| 150 | 150 | 4 | 20.4 / 7.17 / 9 | 27.52 / 2.34 / 13 | 36.87 / 2.1 / 28 | 53.62 / 1.8 / 27 | 68.89 / 2.48 / 51 | 95.9 / 2.4 / 74 | 99.76 / 2.57 / 88 | 99.84 / 2.73 / 91 |
| 150 | 200 | 2 | 20.08 / 7.88 / 12 | 27.58 / 3.11 / 18 | 52.02 / 2.67 / 33 | 63.86 / 2.81 / 44 | 84.28 / 2.02 / 56 | 96.91 / 2.69 / 72 | 99.86 / 2.63 / 87 | 99.88 / 2.63 / 90 |
| 150 | 200 | 3 | 14.57 / 8.59 / 16 | 20.67 / 3.29 / 15 | 41.39 / 2.4 / 34 | 55.39 / 1.92 / 44 | 79.19 / 2.67 / 60 | 95.88 / 2.2 / 71 | 99.94 / 2.44 / 87 | 99.96 / 2.46 / 86 |
| 150 | 200 | 4 | 24.85 / 6.2 / 14 | 19.6 / 2.57 / 13 | 38.53 / 2.59 / 32 | 46.06 / 1.86 / 39 | 67.37 / 2.44 / 48 | 93.33 / 2.2 / 73 | 99.94 / 2 / 80 | 99.96 / 2.04 / 82 |
| 200 | 100 | 2 | 22.32 / 7.74 / 19 | 31.09 / 3.94 / 25 | 51.68 / 4.57 / 35 | 73.17 / 4.51 / 38 | 86.22 / 6.46 / 78 | 98.77 / 6.32 / 84 | 98.95 / 6.42 / 85 | 99.27 / 6.61 / 90 |
| 200 | 100 | 3 | 22.57 / 2.79 / 16 | 38.44 / 4.06 / 20 | 46.83 / 2.69 / 24 | 70.57 / 4.3 / 53 | 90.38 / 4.44 / 68 | 99.05 / 4.65 / 81 | 99.84 / 4.65 / 88 | 99.88 / 5.09 / 89 |
| 200 | 100 | 4 | 19.72 / 4.24 / 17 | 25.11 / 8.44 / 23 | 41.8 / 6.12 / 44 | 55.47 / 4.97 / 44 | 69.39 / 5.01 / 51 | 77.11 / 5.35 / 57 | 9.47 / 6.67 / 89 | 99.66 / 6.81 / 90 |
| 200 | 150 | 2 | 16.1 / 5.82 / 13 | 23.23 / 4.53 / 19 | 45.58 / 3.64 / 32 | 52.97 / 2.57 / 39 | 79.35 / 2.61 / 56 | 94.75 / 2.73 / 71 | 99.62 / 3.66 / 88 | 99.8 / 3.74 / 89 |
| 200 | 150 | 3 | 25.43 / 4.85 / 12 | 25.37 / 2.81 / 18 | 45.62 / 2.97 / 27 | 59.9 / 2.38 / 39 | 85.76 / 3.05 / 58 | 89.72 / 3.41 / 72 | 99.94 / 3.8 / 89 | 99.96 / 3.8 / 92 |
| 200 | 150 | 4 | 21.94 / 6.28 / 14 | 28.69 / 3.52 / 27 | 50.32 / 2.63 / 33 | 61.86 / 2.61 / 44 | 89.31 / 2.38 / 63 | 85.09 / 2.32 / 70 | 100 / 2.67 / 87 | 100 / 2.95 / 92 |
| 200 | 200 | 2 | 23.64 / 7.07 / 10 | 28.63 / 4.2 / 22 | 47.43 / 1.78 / 30 | 58.77 / 2.69 / 50 | 78.48 / 2.93 / 59 | 97.58 / 2.57 / 72 | 9.21 / 2.48 / 83 | 99.54 / 2.53 / 85 |
| 200 | 200 | 3 | 20.75 / 4.69 / 13 | 27.88 / 3.15 / 24 | 46.28 / 2.61 / 32 | 59.17 / 1.66 / 41 | 72.51 / 2.04 / 56 | 95.52 / 2.1 / 67 | 99.98 / 2.12 / 79 | 99.98 / 2.04 / 82 |
| 200 | 200 | 4 | 27.17 / 7.88 / 22 | 29.72 / 2.4 / 26 | 37.29 / 1.68 / 24 | 60.87 / 1.64 / 47 | 75.03 / 1.33 / 49 | 94.12 / 1.52 / 72 | 99.96 / 1.7 / 76 | 99.96 / 1.64 / 83 |

Observation 1: \(100\%\) training accuracy is achieved.

\(100\%\) accuracy on the training set is achieved using d_emb = 200, d_hid = 150 and n_lyr = 4 at the 35k and 40k steps.

Observation 2: Models do not generalize.

No configuration reaches an accuracy higher than \(12\%\) on the validation set. This might be a problem with the dataset design.

Future work

Validation set performance does not improve when Elman Net becomes bigger and deeper. Since we can achieve \(100\%\) accuracy on the training set, the optimization process seems to be working. We therefore conclude that Elman Net itself might be the cause of the poor generalization, and we should consider changing models.

Copyright © 2022, ProFatXuanAll