AI/ML text generation – training a (biased) model

To train the GPT-2 model we need some data, and to investigate how bias can affect the output of text generation we will experiment with political manifestos, seeing how the generated text can be skewed to the political left or right.

Getting the raw data

The manifestos for the UK Conservative and Labour parties are in the public domain, so I have downloaded them into individual text files.

Building the data

To make the training process simpler, we will combine the data for each political party into two separate files, with the individual documents delimited by

<|endoftext|>

and then train the model using one of the data sets.

The easiest way to combine the data is with a script:

builddata.sh:

#!/usr/bin/env bash

if [ $# -eq 0 ]; then
  echo "usage:"
  echo " $0 <directory-of-text-files>"
  exit 1
fi

# Strip any trailing slash so the output file is named after the directory
dir="${1%/}"

# Append each file to the output, followed by the GPT-2 end-of-text delimiter
for f in "$dir"/*; do
  cat "$f" >> "$dir.txt"
  echo "<|endoftext|>" >> "$dir.txt"
done

To use the script, pass it a directory of text files and it will concatenate them, with delimiters, into a similarly named flat text file.
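For example, assuming the manifestos are in a directory named labour (the directory name here is just illustrative):

./builddata.sh labour

This produces labour.txt, containing each manifesto in turn with the <|endoftext|> delimiter after each one.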

Training the model

On consumer-grade GPUs, use the smaller 124M model, which you can download with:

python3 download_model.py 124M
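This fetches the pretrained weights and supporting files (vocabulary, encoder and hyperparameters) into the models/124M directory; you should end up with something like:

ls models/124M
checkpoint  encoder.json  hparams.json  model.ckpt.data-00000-of-00001  model.ckpt.index  model.ckpt.meta  vocab.bpe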

Then, set an environment variable so the scripts can find the repository's source modules:

export PYTHONPATH=src

Then, encode each of the data sets (this converts the raw text into the compressed token format the training script reads), e.g.:

python encode.py ../labour.txt ../labour.npz
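And likewise for the other data set:

python encode.py ../conservative.txt ../conservative.npz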

And then train the model on one of the data sets:

python train.py --dataset ../labour.npz
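Depending on which GPT-2 fine-tuning fork you are using, train.py may accept further flags; the widely used nshepperd fork, for example, lets you name the base model explicitly (treat this as an assumption and check train.py --help for your copy):

python train.py --dataset ../labour.npz --model_name 124M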

The output of the training run shows progress similar to:

[60 | 19.80] loss=3.11 avg=3.12

The first two numbers are fairly self-explanatory: the training step and the elapsed time in seconds.

The last two numbers are the cross-entropy loss, a measure of the model's performance: loss is the value for the most recent batch and avg a smoothed running average. As the training steps increase, the loss should decrease, which indicates the model is learning.

Using the new model

After about 1 hour of runtime the average had dropped to 0.13, so I stopped the training with Ctrl-C; the checkpoint files are written under checkpoint/run1.

To use the model, create a new sub-directory in the models directory and copy over some config:

mkdir models/labour
cp models/124M/encoder.json models/labour/
cp models/124M/hparams.json models/labour/
cp models/124M/vocab.bpe models/labour/

Then copy (rather than move, as you may wish to continue training at a later date) the new model data:

cp checkpoint/run1/* models/labour/
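The model directory should now contain both the config copied from 124M and the fine-tuned checkpoint; the exact file names depend on the fork you are using, but will look something like:

ls models/labour
checkpoint  encoder.json  hparams.json  vocab.bpe
model-<step>.data-00000-of-00001  model-<step>.index  model-<step>.meta

where <step> is the training step at which the checkpoint was saved.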

You can then generate a text sample with the new model using:

python3 src/generate_unconditional_samples.py --model_name labour
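The sampler also takes parameters that control the output; the flag names below are from the upstream GPT-2 repository, so verify them against your copy:

python3 src/generate_unconditional_samples.py --model_name labour --nsamples 2 --length 200 --temperature 0.8 --top_k 40

Lower temperatures make the sampling more conservative, and top_k restricts each choice to the k most likely tokens.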

To generate a new model using different data, first move the checkpoint directory, so that the next run starts fresh from the base model rather than resuming from the Labour checkpoint:

mv checkpoint/run1 checkpoint/labour

Then start a new training run with the other data set:

python train.py --dataset ../conservative.npz

Once the average has dropped to a similar value, create a new model directory with:

mkdir models/conservative
cp models/124M/encoder.json models/conservative/
cp models/124M/hparams.json models/conservative/
cp models/124M/vocab.bpe models/conservative/
cp checkpoint/run1/* models/conservative/
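As these packaging steps are the same for every model, they could be wrapped in a small script in the same style as builddata.sh (makemodel.sh and its interface are my own sketch, not part of the GPT-2 repository):

#!/usr/bin/env bash
# Usage: ./makemodel.sh <model-name> <checkpoint-dir>
# e.g.   ./makemodel.sh conservative checkpoint/run1

if [ $# -ne 2 ]; then
  echo "usage:"
  echo " $0 <model-name> <checkpoint-dir>"
  exit 1
fi

mkdir -p "models/$1"

# Re-use the config from the base model
cp models/124M/encoder.json models/124M/hparams.json models/124M/vocab.bpe "models/$1/"

# Copy (not move) the fine-tuned weights so training can continue later
cp "$2"/* "models/$1/"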

You can then generate output using the other model with:

python3 src/generate_unconditional_samples.py --model_name conservative

Bias

You can then compare the output of both and see some fairly obvious political bias.

If a small amount of training data (around 1 MB) and 1 hour of training can produce a 2.4 GB model with clear political leanings, then large quantities of extreme content could radically change the behaviour of a text-based model.

Training the model with public domain literature

On a lighter note, I then downloaded 35 interesting books from the top 100 on Project Gutenberg (around 29 MB) and trained a new model for 34 hours.

Once complete, the new model generates fairly readable paragraphs such as:

Very thoughtful and very disagreeable indeed it seems to be, lying in
the sombre and probably no other door to the empty house he had so
often occupied, sitting quietly

And,

In this world of solitude, solitude, and hardship, when solitude is embraced
of its own accord, there is always room for nobody

Conclusions

The output of these newly trained models shows how a small set of data can influence the tone and content of generated text. While using large amounts of published literature can produce paragraphs that ‘almost’ make sense, it is clear that there are a number of challenges to overcome, in both development and testing, before we can build AI systems and verify that they produce output that is unbiased and correct.

The main testing implications for AI systems appear to lie in the quality of the data used to train the model and in the non-deterministic nature of the output: given the same input, an AI model may produce different output on successive runs, yet that differing output may still be correct.
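The non-determinism is easy to demonstrate with the sampler: successive runs produce different text, and the upstream repository exposes a seed parameter if repeatable output is needed (again, verify the flag against your copy):

python3 src/generate_unconditional_samples.py --model_name labour
python3 src/generate_unconditional_samples.py --model_name labour
python3 src/generate_unconditional_samples.py --model_name labour --seed 42

The first two runs will produce different samples; fixing the seed makes the output repeatable.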