3.c. Syllables

As discussed in The Haiku Form, there is more to a haiku than its syllables. There's also substantial thematic content. However, since it's a question I keep getting, I wanted to do some analysis on the syllabic structure of the haiku in my dataset.

import collections
import operator

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns

from haikulib import data

sns.set(style="whitegrid")

First, we restrict our corpus to three-line haiku to make our analysis easier. The vast majority of the haiku are composed of three lines, with a few outliers on either side.

df = data.get_df()
# Consider only those haiku that consist of three lines.
df = df[df["lines"] == 3]
# Reindex, so that adding a syllable count column isn't borked.
df.reset_index(inplace=True, drop=True)
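
One plausible reading of the reindexing comment: after a boolean filter, the surviving rows keep their original index labels, so a new column built as a freshly indexed Series aligns on those labels and leaves gaps. The sketch below uses a made-up three-row frame (the `toy` frame and the `total` column are illustrative and not part of the corpus) to show the misalignment and how `reset_index` avoids it.

```python
import pandas as pd

# Hypothetical stand-in for the corpus: two three-line haiku and one two-line entry.
toy = pd.DataFrame(
    {
        "haiku": ["rain / silent / snow", "two / lines", "frog / pond / splash"],
        "lines": [3, 2, 3],
    }
)

# Filtering keeps the original labels, so the index is now [0, 2].
filtered = toy[toy["lines"] == 3]

# A freshly built Series is labelled [0, 1].
new_col = pd.Series([4, 3])

# Assignment aligns on labels: the row labelled 2 has no match and gets NaN.
misaligned = filtered.assign(total=new_col)

# After resetting the index, labels are [0, 1] and the values line up.
fixed = filtered.reset_index(drop=True).assign(total=new_col)

print(misaligned)
print(fixed)
```

Whether this is exactly the failure the original code guards against is an assumption; the point is only that label alignment makes the reset a cheap safeguard.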

As expected, the distribution of the total number of syllables is roughly normal.

sns.distplot(
    df["total_syllables"], bins=np.arange(5, 25), kde_kws={"bw": 0.4}, hist_kws={"align": "left"}
)
plt.title("Haiku total syllable count")
plt.xlabel("syllables")
plt.ylabel("density")
plt.show()

Figure: distribution of the total number of syllables. The total syllable distribution is roughly normal.

What I did not expect is where the distribution is centered. I expected it to be centered on 17 syllables, the traditional structure discussed in The Haiku Form. Instead, both the mean and the median sit at about 13 syllables.

df["total_syllables"].describe()
count    52371.000000
mean        13.316072
std          2.908676
min          3.000000
25%         11.000000
50%         13.000000
75%         15.000000
max         27.000000
Name: total_syllables, dtype: float64

df[df["total_syllables"] <= 4]
       haiku                    colors  lines  syllables  total_syllables
7284   rain / silent / snow     [snow]  3      (1, 2, 1)  4
10565  the / smell / of snow    [snow]  3      (1, 1, 2)  4
11868  star / crushed / sky     [sky]   3      (1, 2, 1)  4
13992  heat / lightning / bugs  []      3      (1, 2, 1)  4
23200  frog / pond / splash     []      3      (1, 1, 1)  3
28351  frost / her / scowl      []      3      (1, 1, 1)  3
28479  lies / dirty / snow      [snow]  3      (1, 2, 1)  4
32128  dusk / words / fail me   [dusk]  3      (1, 1, 2)  4
46650  mime / lifting / fog     []      3      (1, 2, 1)  4

df[df["total_syllables"] >= 26]
       haiku                                                                                                       colors   lines  syllables   total_syllables
11370  some idiot with a campfire up in the trees / but the smoke smells good / like a hundred childhood mornings  [smoke]  3      (12, 6, 8)  26
42005  can i reinvent you and me / to love until i become still / to worship until you become stone                [stone]  3      (7, 9, 11)  27

Many of the outliers on either side seem subjectively reasonable, if strict adherence to the traditional seventeen-syllable structure is abandoned.
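
The rows above are consistent with `total_syllables` being the sum of the per-line counts in `syllables`. Assuming that holds, the distance from the seventeen-syllable form can be quantified as a simple proportion. The sketch below runs on a handful of made-up rows (the `toy` frame is illustrative only), not on the real corpus.

```python
import pandas as pd

# Hypothetical rows; the per-line counts are chosen for illustration.
toy = pd.DataFrame(
    {"syllables": [(5, 7, 5), (3, 5, 4), (1, 1, 1), (4, 6, 4), (12, 6, 8)]}
)

# Assumption: the total is the sum of the per-line syllable counts.
toy["total_syllables"] = toy["syllables"].apply(sum)

# The mean of a boolean Series is the proportion of True values.
exactly_17 = (toy["total_syllables"] == 17).mean()
at_most_13 = (toy["total_syllables"] <= 13).mean()

print(f"exactly 17 syllables: {exactly_17:.1%}")
print(f"13 syllables or fewer: {at_most_13:.1%}")
```

The same two expressions applied to `df["total_syllables"]` would give the corresponding shares for the corpus.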

This outlier analysis revealed the presence of the following amusing haiku.

♡ ♡ ♡ ♡ ♡

♡ ♡ ♡ ♡ ♡ ♡ ♡

♡ ♡ ♡ ♡ ♡

This was treated as zero syllables, because the dataset preprocessing step converts all haiku to ASCII-encoded alphabetic characters, along with apostrophes and / line separators. This "haiku" was removed from the dataset.
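
To see why the hearts vanish, here is a minimal sketch of a cleaning step of the kind described above. The `normalize` function, its regular expression, and the lowercasing are my assumptions for illustration; this is not the actual preprocessing code.

```python
import re


def normalize(haiku: str) -> str:
    """Keep ASCII letters, apostrophes, and '/' separators; drop everything else.

    Hypothetical helper: lowercasing and whitespace collapsing are assumptions.
    """
    text = haiku.lower()
    # Replace every character that is not a-z, an apostrophe, a slash,
    # or whitespace with a space.
    text = re.sub(r"[^a-z'/\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


hearts = "♡ ♡ ♡ ♡ ♡ / ♡ ♡ ♡ ♡ ♡ ♡ ♡ / ♡ ♡ ♡ ♡ ♡"
cleaned = normalize(hearts)

# Only the line separators survive, so there are no words left to count.
words = [token for token in cleaned.split() if token != "/"]
print(repr(cleaned), words)
```

With no words remaining, any per-word syllable counter would report zero for every line.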

Next, we look at the syllable count of each line in the corpus (still restricted to three-line haiku).

one, two, three = zip(*df["syllables"])

bins = np.arange(1, 10)
# Set the KDE bandwidth so that the kernel density estimate is smooth
# and doesn't spike with every bin.
kde_kws = {"bw": 0.4}
hist_kws = {"align": "left"}

sns.distplot(one, label="first", bins=bins, kde_kws=kde_kws, hist_kws=hist_kws)
sns.distplot(two, label="second", bins=bins, kde_kws=kde_kws, hist_kws=hist_kws)
sns.distplot(three, label="third", bins=bins, kde_kws=kde_kws, hist_kws=hist_kws)

plt.title("Haiku syllables per line")
plt.legend()
plt.xlabel("syllables")
plt.ylabel("density")
plt.show()

We see that there is a clear distinction between the distributions of the middle and surrounding lines. This agrees with my expectations, but it's surprising to find that the middle distribution is centered on five syllables, not seven. It's also interesting to note that the distributions of the first and last lines are similar, but with the third line's syllable count skewed slightly higher.

Figure: syllables per line. There is a clear distinction between the middle and surrounding lines.
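
The per-line centers can also be read off numerically rather than from the plot. The sketch below reuses the `zip(*...)` unpacking from the plotting code on five structures taken from the frequency table in this section; treating them as an unweighted sample is purely for illustration and says nothing about the real per-line medians.

```python
import numpy as np

# Five structures from the frequency table, used here as an unweighted toy sample.
structures = [(5, 7, 5), (3, 5, 4), (3, 4, 4), (4, 5, 4), (3, 5, 3)]

# Transpose the list of tuples into one sequence per line position.
first, second, third = (np.array(column) for column in zip(*structures))

medians = {
    "first": float(np.median(first)),
    "second": float(np.median(second)),
    "third": float(np.median(third)),
}
means = {
    "first": float(first.mean()),
    "second": float(second.mean()),
    "third": float(third.mean()),
}

print(medians)
print(means)
```

On these five rows the medians are 3, 5, and 4 for the first, second, and third lines; running the same code on `df["syllables"]` would give the figures for the corpus.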

Again restricted to three-line haiku, we can look at the most common syllabic structures occurring in the corpus.

counts = collections.Counter(df["syllables"])
total = sum(counts.values())

rows = {
    "syllables": list(counts.keys()),
    "count": list(counts.values()),
    "proportion": [v / total for v in counts.values()],
}

syllables = pd.DataFrame(rows)
syllables.sort_values(by="count", inplace=True, ascending=False)
syllables.reset_index(inplace=True, drop=True)
syllables.head(10)

We see that the 5-7-5 structure is the most common, but that it accounts for only 2.9% of the corpus. This is surprising. I had expected the traditional form to be dominant over the others, with only a few outliers.

5-7-5 is the most common structure, but only accounts for 3% of the corpus.
   syllables  count  proportion
0  (5, 7, 5)  1513   0.028890
1  (3, 5, 4)  1089   0.020794
2  (3, 4, 4)  946    0.018063
3  (4, 5, 4)  925    0.017662
4  (3, 5, 3)  917    0.017510
5  (3, 4, 3)  835    0.015944
6  (4, 6, 4)  799    0.015257
7  (4, 4, 4)  779    0.014875
8  (4, 5, 3)  776    0.014817
9  (3, 6, 4)  771    0.014722

plt.plot(np.log(syllables["count"]))
plt.title("Distribution of syllabic structures in haiku")
# Raw string, so that "\l" is not parsed as an (invalid) escape sequence.
plt.ylabel(r"$\log(freq)$")
plt.xlabel("$rank$")
plt.show()

Figure: the syllabic structure distribution. The distribution of syllabic structures is roughly exponential with respect to rank.

With the exception of the most common structures, the frequency of syllabic structures in the corpus decays roughly exponentially with rank; that is, the log-frequency is close to linear in rank. Note that the stair-stepping at the bottom end is due to the discrete nature of the frequencies: a number of structures occur in only one haiku, a number occur in exactly two, and so on.
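
One way to check the "exponential with respect to rank" reading is to fit a straight line to the log-frequencies: if the frequency behaves like A * exp(-k * rank), then log(frequency) is linear in rank with slope -k. The sketch below does this on synthetic counts (the decay rate of 0.03 and the scale of 1000 are arbitrary assumptions, not estimates from the corpus). Rounding to whole counts also reproduces the stair-stepping.

```python
import numpy as np

# Synthetic rank-frequency data: exponential decay, rounded to whole counts.
ranks = np.arange(150)
counts = np.round(1000 * np.exp(-0.03 * ranks))

# Least-squares line through (rank, log(count)); the slope estimates -k.
slope, intercept = np.polyfit(ranks, np.log(counts), deg=1)

# Adjacent ranks that share the same integer count form the "stairs".
plateaus = int(np.sum(np.diff(counts) == 0))

print(f"fitted slope: {slope:.4f}")
print(f"adjacent ranks with equal counts: {plateaus}")
```

Applied to `syllables["count"]`, the same `np.polyfit` call would give a rough decay rate for the corpus, ideally after excluding the few most common structures that sit above the trend.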

In conclusion, the syllabic structure of the haiku in my corpus is empirically more varied than I expected. Earlier analyses of the vocabulary and thematic content of haiku met my expectations. The syllabic structure, however, is not so well-behaved.