Use `make help` to show the available make targets.
```bash
make help
make docker-build
make init-data
# The docker images built above don't support CUDA.
# This is unfortunate, but necessary, as the GPT-2 model doesn't fit on my GTX 1080.
# Training on my CPU takes 4.5 hours.
make CONFIG=data/models/gpt2/gpt2.jsonc train
make CONFIG=data/models/gpt2/gpt2.jsonc generate
```
The `make docker-build` target builds two images: `notgnoshi/research` and `notgnoshi/api`. The research image is where the data preprocessing, model training, and exploratory analysis are performed. It's about 3.2 GB. The API image is a bit smaller, at 2 GB, and adds the necessary dependencies to `tiangolo/uvicorn-gunicorn-fastapi` to run the REST API for generation.
Unfortunately, I did not keep track of which haiku came from which source, or, for the sources that provided it, who the author was. If I were to start over, I would be sure to do so, despite the additional complexity of scraping the author from the haiku websites. I believe that having the author and source in the dataset would enable deeper analysis, and would let others use the same dataset for their own purposes.
I collected haiku from the following sources:
There were other sources, particularly PDF haiku anthologies, but I do not recall which ones specifically. Collecting the haiku corpus was particularly difficult because the sources rarely used consistent markup for the haiku. I spent many hours manually cleaning the scraped "haiku", and the corpus was therefore collected in fragments as I found new sources.
Among other artifacts, I had to clean stray `&` entities out of the scraped text.
I was able to scrape together 55,367 haiku, available here. The linked CSV file also contains the number of lines in each haiku, the number of syllables per line, the total number of syllables, and any colors referenced in the haiku.
The preprocessing step used to create `haiku.csv` from `haiku.txt` does the following:
Unfortunately, there is wide variability in both the number of lines and the number of syllables in the haiku. This is a consequence of using haiku enthusiast websites, many of which include other poetic forms and don't always strictly adhere to the 5-7-5 form traditionally associated with haiku. See Syllables for a discussion of this variability.
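To quantify that variability, a quick look at the CSV is enough. This is a minimal sketch, assuming the file lives at `data/haiku.csv` and that the line count and total syllable count are stored in columns named `lines` and `total_syllables`; the actual column names in the published CSV may differ.

```python
import pandas as pd

# Assumed path and column names; adjust to the actual schema of haiku.csv.
df = pd.read_csv("data/haiku.csv")

# How many haiku have 1, 2, 3, ... lines?
print(df["lines"].value_counts().sort_index())

# Distribution of total syllable counts; a strict 5-7-5 haiku totals 17.
print(df["total_syllables"].describe())
print("fraction with exactly 17 syllables:", (df["total_syllables"] == 17).mean())
```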
I developed three models for training. Use one of:

```bash
make CONFIG=data/models/gpt2/gpt2.jsonc train
make CONFIG=data/models/markov-word/markov-word.jsonc train
make CONFIG=data/models/markov-character/markov-character.jsonc train
```
The two Markov models were expected to perform poorly, and were used to flesh out some of the supporting glue code before I turned my attention to GPT-2. The makefile will default to using GPT-2 if the `CONFIG` argument is not given.
There are *many* options in `gpt2.jsonc` that you can tweak:
```jsonc
{
"name": "gpt2-default",
"type": "transformer",
// If type is "transformer", specify which model to use. Options: "gpt", "gpt2"
"model_type": "gpt2",
// A uint32_t random seed.
"seed": null,
// Batch size per GPU/CPU.
"batch_size": 4,
// Number of update steps to accumulate before performing a backward/update pass.
"gradient_accumulation_steps": 1,
"learning_rate": 5e-5,
"weight_decay": 0.0,
"adam_epsilon": 1e-8,
"max_gradient_norm": 1.0,
"num_train_epochs": 1,
// Maximum number of training steps to perform, if set. Overrides "num_train_epochs"
"max_steps": null,
// Use a linear warmup over n steps.
"warmup_steps": 0,
// Proportion of the training set to use as evaluation data.
"evaluation_proportion": 0.1,
// Run model evaluation at each logging step.
"evaluate_during_training": true,
// Log every n steps.
"logging_steps": 1000,
// Save a restorable checkpoint every n steps.
"checkpoint_steps": 1000,
// Restore and resume training from the given checkpoint, if given.
// Path to the checkpoint directory is relative to this file.
"resume_training_from": null,
// Number of checkpoints to save. Oldest deleted first.
"max_checkpoints": null,
// Path to save any generated haiku relative to this file.
"generated_path": "../../generated.csv",
// Maximum number of tokens to generate.
"max_tokens": 20,
// The prompt to use to generate haiku. If not given, a random prompt will be chosen.
"prompt": null,
// The number of haiku to generate with the above prompt.
"number": 10,
// The temperature to sample the next token probability distribution. Must be positive.
"temperature": 1.0,
// Repetition penalty parameter. Between 1.0 (no penalty) and infinity.
"repetition_penalty": 1.0,
// The number of highest probability vocabulary tokens to keep for top-k filtering. Between 1 and infinity; 0 disables top-k filtering.
"k": 0,
// The cumulative probability of the highest probability vocabulary tokens to keep for nucleus sampling. Between 0 and 1.
"p": 0.9
}
```
These options are used for both training and generation. Should I do this again, I think I would pick a different configuration mechanism; this one is fairly verbose and clunky. I wanted a single interface to multiple language models through the same script, which rules out plain commandline options without a lot of additional complexity, but I think that would be the way to proceed should I do something similar in the future.
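For a sense of what that single interface looks like in practice, here is a minimal sketch of parsing a `.jsonc` config and branching on its `type` field. The `load_jsonc` helper and the comment-stripping regex are my own illustration, not the project's actual code.

```python
import json
import re
from pathlib import Path

def load_jsonc(path: str) -> dict:
    """Parse a JSONC file by stripping // line comments before handing it to json."""
    text = Path(path).read_text()
    text = re.sub(r"^\s*//.*$", "", text, flags=re.MULTILINE)
    return json.loads(text)

config = load_jsonc("data/models/gpt2/gpt2.jsonc")

# A single training/generation script can then dispatch on the "type" field,
# e.g. to a transformer-backed model or to one of the Markov models.
print(config["type"], config["model_type"])
```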
Tune the generation parameters in `gpt2.jsonc` to your preferences, and run `make generate`. This will print the haiku to the screen, and append them to `data/generated.csv`.
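The generation options in the config map onto the standard Hugging Face `transformers` sampling parameters. The sketch below is a rough approximation of that step under assumptions of my own: it loads the stock pretrained `gpt2` weights and a hard-coded prompt rather than the fine-tuned checkpoint and config that `make generate` actually uses.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stock pretrained weights for illustration; the real script would load the
# fine-tuned checkpoint produced by `make train`.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "morning fog"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# These keyword arguments line up with the config options above:
# temperature, k (top_k), p (top_p), repetition_penalty, max_tokens, and number.
outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=input_ids.shape[1] + 20,
    temperature=1.0,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    num_return_sequences=10,
)

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```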
Run either `make api-dev` or `make api` to start the FastAPI endpoint. It was pleasantly straightforward to integrate this FastAPI container with an Nginx proxy so that I could host it on this website with minimal effort. `make api-dev` is almost the same as `make api`, except that it monitors `app/main.py` for modifications, and restarts the API service if changes are detected.
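For reference, the serving side of a project like this is only a few lines of FastAPI. The following is a minimal stand-in, not the actual `app/main.py`: the `/generate` route signature and the placeholder `generate_haiku()` helper are assumptions for illustration.

```python
from fastapi import FastAPI

app = FastAPI()

def generate_haiku(number: int) -> list[str]:
    # Placeholder; the real service would sample haiku from the trained model.
    return ["placeholder haiku"] * number

@app.get("/generate")
def generate(number: int = 1) -> dict:
    """Return freshly generated haiku as JSON."""
    return {"haiku": generate_haiku(number)}
```

Running a file like this locally with `uvicorn main:app --reload` gives a watch-and-restart workflow similar to what `make api-dev` provides.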
See https://agill.xyz/docs for documentation describing the API endpoint, and visit https://agill.xyz/generate to generate your own haiku.
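You can also hit the hosted endpoint from code. The snippet below assumes the endpoint accepts a plain GET and returns JSON, which is my reading of the page rather than documented behavior; check https://agill.xyz/docs for the actual request and response schema.

```python
import requests

# Assumed request shape; see https://agill.xyz/docs for the real parameters.
response = requests.get("https://agill.xyz/generate")
response.raise_for_status()
print(response.json())
```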