3.c. Syllables

As discussed in The Haiku Form, there is more to a haiku than its syllables; there is also substantial thematic content. However, since it's a question I keep getting, I wanted to analyze the syllabic structure of the haiku in my dataset.

import collections
import operator

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns

from haikulib import data

sns.set(style="whitegrid")

First, we restrict our corpus to three-line haiku to make our analysis easier. The vast majority of the haiku are composed of three lines, with a few outliers on either side.

df = data.get_df()
# Consider only those haiku that consist of three lines.
df = df[df["lines"] == 3]
# Reindex, so that adding a syllable count column isn't borked.
df.reset_index(inplace=True, drop=True)

As expected, the distribution of the total number of syllables is roughly normal.

sns.distplot(
    df["total_syllables"], bins=np.arange(5, 25), kde_kws={"bw": 0.4}, hist_kws={"align": "left"}
)
plt.title("Haiku total syllable count")
plt.xlabel("syllables")
plt.ylabel("density")
plt.show()

What is unexpected is where the distribution is centered. I expected the center to fall at 17 syllables, the traditional structure discussed in The Haiku Form. Instead, both the mean and the median are about thirteen syllables.

df["total_syllables"].describe()

count    52371.000000
mean        13.316072
std          2.908676
min          3.000000
25%         11.000000
50%         13.000000
75%         15.000000
max         27.000000
Name: total_syllables, dtype: float64
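
To put that gap in perspective, the summary statistics above can be used to ask how far 17 syllables sits from the observed center. The following is a minimal sketch rather than part of the original analysis: it hard-codes the mean and standard deviation printed above, and the tail estimate assumes the distribution is close enough to normal, which is only roughly true for discrete syllable counts.

```python
import math

# Values copied from the describe() output above.
mean = 13.316072
std = 2.908676
traditional = 17

# Distance of the traditional count from the mean, in standard deviations.
z = (traditional - mean) / std

# Upper-tail probability under a normal approximation. This is an assumption:
# the real counts are discrete and only roughly normal.
tail = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z-score of 17 syllables: {z:.2f}")
print(f"approximate share at or above 17: {tail:.1%}")
```

For an exact figure, the same question can be answered directly on the corpus with `(df["total_syllables"] >= 17).mean()`.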

df[df["total_syllables"] <= 4]

#	haiku	colors	lines	syllables	total_syllables
7284	rain / silent / snow	[snow]	3	(1, 2, 1)	4
10565	the / smell / of snow	[snow]	3	(1, 1, 2)	4
11868	star / crushed / sky	[sky]	3	(1, 2, 1)	4
13992	heat / lightning / bugs	[]	3	(1, 2, 1)	4
23200	frog / pond / splash	[]	3	(1, 1, 1)	3
28351	frost / her / scowl	[]	3	(1, 1, 1)	3
28479	lies / dirty / snow	[snow]	3	(1, 2, 1)	4
32128	dusk / words / fail me	[dusk]	3	(1, 1, 2)	4
46650	mime / lifting / fog	[]	3	(1, 2, 1)	4

df[df["total_syllables"] >= 26]

#	haiku	colors	lines	syllables	total_syllables
11370	some idiot with a campfire up in the trees / but the smoke smells good / like a hundred childhood mornings	[smoke]	3	(12, 6, 8)	26
42005	can i reinvent you and me / to love until i become still / to worship until you become stone	[stone]	3	(7, 9, 11)	27

Many of the outliers on either side seem subjectively reasonable, if strict adherence to the traditional seventeen-syllable structure is abandoned.

This outlier analysis revealed the presence of the following amusing haiku.

♡ ♡ ♡ ♡ ♡

♡ ♡ ♡ ♡ ♡ ♡ ♡

♡ ♡ ♡ ♡ ♡

It was counted as having zero syllables because the dataset preprocessing step reduces every haiku to ASCII alphabetic characters, apostrophes, and / line separators. This "haiku" was removed from the dataset.
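
The effect can be illustrated with a small sketch. The `clean` function below is a hypothetical stand-in written for this example, not the actual preprocessing code in `haikulib`; it assumes that cleaning amounts to lowercasing the text and dropping every character other than ASCII letters, apostrophes, spaces, and `/`.

```python
import re


def clean(text: str) -> str:
    """Hypothetical stand-in for the preprocessing step described above."""
    # Keep only ASCII letters, apostrophes, spaces, and "/" separators.
    kept = re.sub(r"[^a-z'/ ]", "", text.lower())
    # Collapse the runs of whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", kept).strip()


hearts = "♡ ♡ ♡ ♡ ♡ / ♡ ♡ ♡ ♡ ♡ ♡ ♡ / ♡ ♡ ♡ ♡ ♡"
lines = [line.strip() for line in clean(hearts).split("/")]
print(lines)
```

Under these assumptions the cleaned text still has three lines, but every line is empty, so there are no words left for a syllable counter to count.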

Next, we look at the syllable count of each line in the corpus (again restricted to three-line haiku).

one, two, three = zip(*df["syllables"])

bins = np.arange(1, 10)
# Using the bandwidth kde kwarg to produce a smooth estimated kernel
# that doesn't spike with every bin.
kde_kws = {"bw": 0.4}
hist_kws = {"align": "left"}

sns.distplot(one, label="first", bins=bins, kde_kws=kde_kws, hist_kws=hist_kws)
sns.distplot(two, label="second", bins=bins, kde_kws=kde_kws, hist_kws=hist_kws)
sns.distplot(three, label="third", bins=bins, kde_kws=kde_kws, hist_kws=hist_kws)

plt.title("Haiku syllables per line")
plt.legend()
plt.xlabel("syllables")
plt.ylabel("density")
plt.show()

We see that there is a clear distinction between the distributions of the middle and surrounding lines. This agrees with my expectations, but it's surprising to find that the middle distribution is centered on five syllables, not seven. It's also interesting to note that the distributions of the first and last lines are similar, with the third line's syllable count skewed slightly higher.

Syllables per line — There is a clear distinction between the middle and surrounding lines
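
The visual comparison can be backed up with a numeric summary of each line position. The sketch below is a minimal example that reuses the same `zip(*...)` transposition as the plotting code; the tuples in `sample` are made up for illustration and are not drawn from the corpus.

```python
import statistics

# Made-up per-line syllable counts, in the same (first, second, third)
# tuple layout as the "syllables" column. Not drawn from the corpus.
sample = [(5, 7, 5), (3, 5, 4), (4, 5, 4), (3, 4, 3), (2, 6, 4)]

# Transpose the tuples into one sequence per line position.
one, two, three = zip(*sample)

# Mean and median syllable count for each line position.
summary = {
    label: (statistics.mean(values), statistics.median(values))
    for label, values in (("first", one), ("second", two), ("third", three))
}

for label, (mean, median) in summary.items():
    print(f"{label:>6}: mean={mean:.2f} median={median}")
```

On the real data, replacing `sample` with `df["syllables"]` should give the per-line centers that the plot suggests.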

Again restricted to three-line haiku, we can look at the most common syllabic structures occurring in the corpus.

counts = collections.Counter(df["syllables"])
total = sum(counts.values())

rows = {
    "syllables": list(counts.keys()),
    "count": list(counts.values()),
    "proportion": [v / total for v in counts.values()],
}

syllables = pd.DataFrame(rows)
syllables.sort_values(by="count", inplace=True, ascending=False)
syllables.reset_index(inplace=True, drop=True)
syllables.head(10)

We see that the 5-7-5 structure is the most common, but that it accounts for only 2.9% of the corpus. This is surprising. I had expected the traditional form to dominate the others, with only a few outliers.

5-7-5 is the most common structure, but only accounts for 3% of the corpus.
#	syllables	count	proportion
0	(5, 7, 5)	1513	0.028890
1	(3, 5, 4)	1089	0.020794
2	(3, 4, 4)	946	0.018063
3	(4, 5, 4)	925	0.017662
4	(3, 5, 3)	917	0.017510
5	(3, 4, 3)	835	0.015944
6	(4, 6, 4)	799	0.015257
7	(4, 4, 4)	779	0.014875
8	(4, 5, 3)	776	0.014817
9	(3, 6, 4)	771	0.014722
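
A quick way to see how spread out the structures are is to add up the shares of the ten rows above. This is a minimal sketch that hard-codes the counts from the table and takes the total from the `describe()` output earlier in this section; it is not part of the original notebook.

```python
# Counts copied from the table above.
top_ten = {
    (5, 7, 5): 1513,
    (3, 5, 4): 1089,
    (3, 4, 4): 946,
    (4, 5, 4): 925,
    (3, 5, 3): 917,
    (3, 4, 3): 835,
    (4, 6, 4): 799,
    (4, 4, 4): 779,
    (4, 5, 3): 776,
    (3, 6, 4): 771,
}
# Number of three-line haiku, from the describe() output earlier.
total = 52371

share_575 = top_ten[(5, 7, 5)] / total
share_top_ten = sum(top_ten.values()) / total

print(f"5-7-5 share: {share_575:.1%}")
print(f"top ten share: {share_top_ten:.1%}")
```

Even taken together, the ten most common structures cover less than a fifth of the three-line haiku.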

plt.plot(np.log(syllables["count"]))
plt.title("Distribution of syllabic structures in haiku")
# Raw string, so that the backslash in \log is not treated as an escape.
plt.ylabel(r"$\log(freq)$")
plt.xlabel("$rank$")
plt.show()

The syllabic structure distribution — The distribution of syllabic structures is roughly exponential with respect to rank.

With the exception of the most common structures, the distribution of syllabic structures in the corpus is roughly exponential with respect to rank, which shows up as an approximately straight line on the log scale. The stair-stepping at the bottom end is due to the discrete nature of the frequencies: a number of structures occur in only one haiku, a number occur in exactly two, and so on.
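
Saying that the counts are exponential in rank is the same as saying that log(count) is linear in rank, so one way to check the claim is to fit a straight line to the logged counts. The sketch below is a minimal illustration using an ordinary least-squares fit written with the standard library; `fit_log_linear` is a name invented for this example, and the data is synthetic rather than taken from the corpus.

```python
import math


def fit_log_linear(counts):
    """Least-squares fit of log(count) = intercept + slope * rank."""
    ranks = range(len(counts))
    logs = [math.log(c) for c in counts]
    n = len(counts)
    mean_x = sum(ranks) / n
    mean_y = sum(logs) / n
    # Standard closed-form estimates for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in ranks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ranks, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic counts that decay exponentially with rank; not corpus data.
synthetic = [1000 * math.exp(-0.05 * r) for r in range(100)]
slope, intercept = fit_log_linear(synthetic)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

On the real data, one would pass `syllables["count"]`, probably after dropping the first few ranks, and look at how far the logged counts stray from the fitted line.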

In conclusion, the syllabic structure of haiku (in my corpus) is empirically more varied than I expected. Previous analysis of the vocabulary and thematic content of haiku met my expectations, but the syllabic structure in my corpus is not so well-behaved.