When it comes to a certain class of data problem, my mind reaches for SQL… which is a problem when the data’s in Python. Obviously I could create an in-memory sqlite database just for the purpose of storing the data and then retrieving it with SQL. But that would be mild overkill. One such example is grouping data by, say, the first letter. I’ve known about the itertools.groupby function for a while but for some reason whenever I came to look at it, it never quite seemed to find my brain. Having now made the breakthrough I’m reminding myself here for future purposes:

SELECT
  LEFT (words.word, 1),
  COUNT (*)
FROM
(
  SELECT
    word = LOWER (w.word)
  FROM
    words AS w
) AS words
WHERE
  LEN (words.word) >= 2
GROUP BY
  LEFT (words.word, 1)

translates to

import os, sys
import itertools
import operator
import re

first_letter = operator.itemgetter (0)

text = open (os.path.join (sys.prefix, "LICENSE.txt")).read ()
words = set (w.lower () for w in re.findall (r"\w{2,}", text))
groups = itertools.groupby (sorted (words), first_letter)

for k, v in groups:
  print k, "=>", len (list (v))