2. Methodology

2.a. Repository Setup

  1. Install Docker
  2. Perform the Linux post-install steps for running Docker as a non-root user, and starting Docker on boot. This will involve a reboot.
  3. Clone https://github.com/Notgnoshi/research
  4. Use the provided makefile to build the Docker images, perform data preprocessing, run a Jupyter server or the FastAPI generation API, train a language model, or use a trained language model to generate haiku.

Use make help to show the available make targets.

make help
make docker-build
make init-data
# The docker images built above don't support CUDA.
# This is unfortunate, but necessary, as the GPT-2 model doesn't fit on my GTX 1080.
# Training on my CPU takes 4.5 hours.
make CONFIG=data/models/gpt2/gpt2.jsonc train
make CONFIG=data/models/gpt2/gpt2.jsonc generate

The make docker-build target builds two images: notgnoshi/research and notgnoshi/api. The research image is where the data preprocessing, model training, and exploratory analysis are performed. It's about 3.2 GB. The API image is a bit smaller, at 2 GB, and adds the dependencies needed to run the generation REST API on top of tiangolo/uvicorn-gunicorn-fastapi.

2.b. Data Collection and Preprocessing

Unfortunately, I did not keep track of which haiku came from which source, or, for the sources that provided it, who the author was. If I were to start over, I would be sure to do so, despite the additional complexity of scraping the author from the haiku websites. I believe that having the author and source in the dataset would enable deeper analysis, and would allow others to use the same dataset for their own purposes.
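
As a sketch of what that bookkeeping might look like (the file name and columns here are hypothetical illustrations, not the format of the actual dataset):

import csv

# Hypothetical schema: an illustration of the provenance tracking I wish I
# had done, not the layout of the real haiku.csv.
with open("haiku_with_provenance.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["haiku", "author", "source"])
    writer.writeheader()
    writer.writerow({
        "haiku": "an old silent pond / a frog jumps into the pond / splash! silence again",
        "author": "Matsuo Bashō",
        "source": "https://example.com/haiku-anthology",
    })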

I collected haiku from the following sources:

There were other sources, particularly PDF haiku anthologies, but I do not recall which ones specifically. Collecting the haiku corpus was particularly difficult because there often was no consistent markup used for the haiku. I spent many hours manually cleaning the scraped "haiku", and the corpus was therefore collected in fragments as I found new sources.

I had to clean:

I was able to scrape together 55,367 haiku, available here. The linked CSV file also contains the number of lines in each haiku, the number of syllables per line, the total number of syllables, and any colors referenced in the haiku.
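
A quick way to poke at the dataset is to load it with pandas. The column name checked below is a guess based on the description above, not the CSV's authoritative schema:

import pandas as pd

# Inspect the schema before relying on any particular column name.
df = pd.read_csv("data/haiku.csv")
print(df.columns.tolist())

# For example, the distribution of total syllable counts, assuming such a column exists:
if "total_syllables" in df.columns:
    print(df["total_syllables"].describe())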

The preprocessing step used to create haiku.csv from haiku.txt does the following:

Unfortunately, there is wide variability in the number of lines and the number of syllables in the haiku. This is a consequence of using haiku enthusiast websites, many of which include other poetic forms and don't always strictly adhere to the 5-7-5 form traditionally associated with haiku. See Syllables for a discussion of this variability.
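
For reference, a naive vowel-group heuristic is enough to produce approximate syllable counts like the ones stored in the CSV. This is only a sketch of the idea, not the preprocessing script's actual implementation:

import re

def count_syllables(word: str) -> int:
    """Approximate English syllable count by counting vowel groups."""
    word = word.lower()
    # Treat a trailing silent 'e' as non-syllabic ("pond", "stone").
    word = re.sub(r"e$", "", word)
    groups = re.findall(r"[aeiouy]+", word)
    return max(1, len(groups))

def line_syllables(line: str) -> int:
    return sum(count_syllables(w) for w in re.findall(r"[a-zA-Z']+", line))

print([line_syllables(line) for line in
       ["an old silent pond", "a frog jumps into the pond", "splash! silence again"]])
# -> [5, 7, 5]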

2.c. Model Training

I developed three models for training. Use one of the following configurations:

The two Markov models were expected to perform poorly, and were used to flesh out some of the supporting glue code before I turned my attention to GPT-2. The makefile defaults to GPT-2 if the CONFIG argument is not given.
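
For context, a word-level Markov chain generator takes only a few lines. This is a generic illustration of the technique, not the repository's actual Markov model code:

import random
from collections import defaultdict

def build_chain(haiku_lines):
    """Map each word to the words that follow it in the training corpus."""
    chain = defaultdict(list)
    for line in haiku_lines:
        words = line.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, prompt, max_tokens=20):
    words = [prompt]
    for _ in range(max_tokens - 1):
        followers = chain.get(words[-1])
        if not followers:
            break
        words.append(random.choice(followers))
    return " ".join(words)

corpus = ["an old silent pond", "a frog jumps into the pond", "splash silence again"]
print(generate(build_chain(corpus), prompt="a"))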

There are many options in gpt2.jsonc that you can tweak.

{
    "name": "gpt2-default",
    "type": "transformer",
    // If type is "transformer", specify which model to use. Options: "gpt", "gpt2"
    "model_type": "gpt2",
    // A uint32_t random seed.
    "seed": null,
    // Batch size per GPU/CPU.
    "batch_size": 4,
    // Number of update steps to accumulate before performing a backward/update pass.
    "gradient_accumulation_steps": 1,
    "learning_rate": 5e-5,
    "weight_decay": 0.0,
    "adam_epsilon": 1e-8,
    "max_gradient_norm": 1.0,
    "num_train_epochs": 1,
    // Maximum number of training steps to perform, if set. Overrides "num_train_epochs"
    "max_steps": null,
    // Use a linear warmup over n steps.
    "warmup_steps": 0,
    // Proportion of the training set to use as evaluation data.
    "evaluation_proportion": 0.1,
    // Run model evaluation at each logging step.
    "evaluate_during_training": true,
    // Log every n steps.
    "logging_steps": 1000,
    // Save a restorable checkpoint every n steps.
    "checkpoint_steps": 1000,
    // Restore and resume training from the given checkpoint, if given.
    // Path to the checkpoint directory is relative to this file.
    "resume_training_from": null,
    // Number of checkpoints to save. Oldest deleted first.
    "max_checkpoints": null,
    // Path to save any generated haiku relative to this file.
    "generated_path": "../../generated.csv",
    // Maximum number of tokens to generate.
    "max_tokens": 20,
    // The prompt used to generate haiku. String; if not given, a random prompt will be chosen.
    "prompt": null,
    // The number of haiku to generate with the above prompt.
    "number": 10,
    // The temperature to sample the next token probability distribution. Must be positive.
    "temperature": 1.0,
    // Repetition penalty parameter. Between 1.0 (no penalty) and infinity.
    "repetition_penalty": 1.0,
    // The number of highest probability vocabulary tokens to keep for top-k filtering. Between 1 and infinity.
    "k": 0,
    // The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Between 0 and 1.
    "p": 0.9
}
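
Because the file is JSONC rather than plain JSON, the training script has to strip the // comments before parsing. A minimal way to do that (a sketch, not necessarily how the repository's scripts handle it):

import json
import re

def load_jsonc(path):
    """Parse a JSONC file by stripping whole-line // comments before json.loads."""
    with open(path) as f:
        text = f.read()
    # The config above only uses whole-line comments, so a line-anchored regex suffices.
    text = re.sub(r"^\s*//.*$", "", text, flags=re.MULTILINE)
    return json.loads(text)

config = load_jsonc("data/models/gpt2/gpt2.jsonc")
print(config["model_type"], config["learning_rate"])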

These options are used for both training and generation. Should I do this again, I think I would pick a different configuration mechanism, as this one is fairly verbose and clunky. I wanted a single interface to multiple language models from the same script, which precludes commandline options without a lot of additional complexity, but I think that would be the way to proceed should I do something similar in the future.
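
The single-interface idea boils down to something like the following abstract base class. This is a sketch of the design with hypothetical class and method names, not the repository's actual code:

from abc import ABC, abstractmethod
from typing import List

class LanguageModel(ABC):
    """Hypothetical common interface a config-driven script could target."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def train(self, corpus_path: str) -> None: ...

    @abstractmethod
    def generate(self, prompt: str, number: int) -> List[str]: ...

class MarkovModel(LanguageModel):
    def train(self, corpus_path: str) -> None:
        pass  # build the transition table here

    def generate(self, prompt: str, number: int) -> List[str]:
        return [prompt] * number  # placeholder

# Dispatch on the "type" field from the JSONC config.
MODEL_REGISTRY = {"markov": MarkovModel}

def build_model(config: dict) -> LanguageModel:
    return MODEL_REGISTRY[config["type"]](config)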

2.d. Commandline Haiku Generation

Tune the generation parameters in gpt2.jsonc to your preferences, and run make generate. This will print the haiku to the screen and append them to data/generated.csv.
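
To inspect the accumulated output afterwards (assuming data/generated.csv is a plain CSV, as the target's description suggests):

import csv

# Print the most recently appended haiku from `make generate`.
with open("data/generated.csv", newline="") as f:
    rows = list(csv.reader(f))

for row in rows[-10:]:
    print(row)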

2.e. RESTful API Haiku Generation

Run either make api-dev or make api to start the FastAPI endpoint. It was pleasantly straightforward to integrate the FastAPI container with an Nginx proxy so that I could host it on this website with minimal effort. make api-dev is almost the same as make api, except that it monitors app/main.py for modifications and restarts the API service when changes are detected.
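
For a sense of scale, a stripped-down app/main.py for such an endpoint looks roughly like this. The route name matches the /generate endpoint mentioned below, but the request fields are assumptions mirroring the gpt2.jsonc options, not the deployed code:

from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerationRequest(BaseModel):
    # Assumed fields; the real API's request schema may differ (see https://agill.xyz/docs).
    prompt: Optional[str] = None
    number: int = 10
    temperature: float = 1.0
    seed: Optional[int] = None

# A single model loaded once at startup and kept in memory.
MODEL = None  # e.g., load the trained GPT-2 checkpoint here

@app.post("/generate")
def generate(request: GenerationRequest):
    # Placeholder: the real handler samples haiku from the in-memory model.
    return {"haiku": ["..." for _ in range(request.number)]}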

See https://agill.xyz/docs for documentation describing the API endpoint, and visit https://agill.xyz/generate to generate your own haiku.
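
As a usage sketch (the request shape here is an assumption; the authoritative schema is at https://agill.xyz/docs):

import requests

# Hypothetical request body; consult https://agill.xyz/docs for the real parameters.
response = requests.post(
    "https://agill.xyz/generate",
    json={"prompt": "autumn rain", "number": 3},
    timeout=30,
)
response.raise_for_status()
print(response.json())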

You can use a random seed with both generation methods (commandline and REST API) to ensure reproducibility. However, the same seed will produce different results with each method. This is unfortunate but unavoidable: the commandline generation step has to deserialize the model from disk and is therefore in a fresh state on every generation, while the API keeps a single model in memory and thus has slightly different initialization semantics. I think this could probably be fixed, but I don't really care to.
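
The effect is easy to reproduce with PyTorch's global RNG: two processes seeded identically only produce identical samples if they consume the random stream identically before sampling, which a freshly deserialized model and a long-lived in-memory model generally do not.

import torch

torch.manual_seed(42)
fresh = torch.rand(3)       # what a freshly started process would sample

torch.manual_seed(42)
_ = torch.rand(1)           # a warm process has already consumed part of the stream
warm = torch.rand(3)        # so the "same" draw now differs

print(torch.equal(fresh, warm))  # False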