When it comes to a certain class of data problem, my mind reaches for SQL… which is a problem when the data’s in Python. Obviously I could create an in-memory sqlite database just for the purpose of storing the data and then retrieving it with SQL. But that would be mild overkill. One such example is grouping data by, say, the first letter. I’ve known about the itertools.groupby function for a while but for some reason whenever I came to look at it, it never quite seemed to find my brain. Having now made the breakthrough I’m reminding myself here for future purposes:
SELECT LEFT (words.word, 1), COUNT (*) FROM ( SELECT word = LOWER (w.word) FROM words AS w ) AS words WHERE LEN (words.word) >= 2 GROUP BY LEFT (words.word, 1)
translates to
import os, sys import itertools import operator import re first_letter = operator.itemgetter (0) text = open (os.path.join (sys.prefix, "LICENSE.txt")).read () words = set (w.lower () for w in re.findall (r"\w{2,}", text)) groups = itertools.groupby (sorted (words), first_letter) for k, v in groups: print k, "=>", len (list (v))