# Python Generators

Python generators are a powerful language feature. Like several other
functional programming aspects Python has, generators enrich programs
with composability, expressiveness, and concision. They also allow
for a form of lazy loading, which can be useful in Machine Learning
settings where large data sets that do not fit into main memory are
the norm. For instance, Keras makes use of that property with its
`fit_generator` function, which expects input presented as generators
that recycle over the underlying data indefinitely.

First, a bit of advice is in order. Anyone working with Python
generators is urged to become acquainted with `itertools`, a standard
Python library of reusable, composable generators. For me the `cycle`
generator was a key find. As mentioned above, `fit_generator` needs
infinitely-recycling generators, which is exactly what
`itertools.cycle` provides. One wrinkle is that `cycle` accomplishes
this with an internal data cache, so if your data do *not* fit in
memory you may have to seek an alternative. However, if your data *do*
fit in memory this confers a very nice property for free: after the
first cycle through the data, all subsequent retrievals come from
memory, so performance improves dramatically after the first cycle.
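For example, `cycle` turns a finite iterable into an endless stream, with values after the first pass served from its internal cache:

```python
from itertools import cycle, islice

# 'cycle' replays [1, 2, 3] forever; 'islice' takes just the first seven values.
values = list(islice(cycle([1, 2, 3]), 7))
# values == [1, 2, 3, 1, 2, 3, 1]
```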

```python
from itertools import groupby, islice, zip_longest, cycle, filterfalse
```

One problem that may crop up when using `itertools.cycle` in Deep
Learning is that, according to Deep Learning lore, it is prudent to
randomize the data on every training epoch. To do that, we need to
rewrite `itertools.cycle` so that it shuffles the data upon every
recycle. That is easily done, as shown below. Note that the elements
of the iterable are essentially returned in batches, and that the
first batch is not shuffled. If you want to return only random
elements, then you must know the batch size, which will be the number
of elements in the underlying finite iterable, and you must discard
the first batch; the `itertools.islice` function can be helpful here.
In practice, this is often not a problem.

```python
import random

def rcycle(iterable):
    saved = []                 # In-memory cache
    for element in iterable:
        yield element
        saved.append(element)
    while saved:
        random.shuffle(saved)  # Shuffle every batch
        for element in saved:
            yield element
```

If we invoke `rcycle` on a sequence drawn from the interval [0,5),
taken in 3 batches for a total of 15 values, we can see this behavior.
The first 5 values are drawn in order, but the next 10 are drawn in
two batches, each batch shuffled independently.

```python
[x for x in islice(rcycle(range(5)), 15)]
```

```
[0, 1, 2, 3, 4, 1, 0, 4, 3, 2, 2, 1, 3, 4, 0]
```
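If the unshuffled first pass must be discarded, `islice` with a start index can skip it. This sketch assumes the batch size (here 5, the length of the underlying iterable) is known:

```python
import random
from itertools import islice

def rcycle(iterable):
    # Same rcycle as above, repeated here so the snippet runs standalone.
    saved = []
    for element in iterable:
        yield element
        saved.append(element)
    while saved:
        random.shuffle(saved)
        for element in saved:
            yield element

# Drop the first, unshuffled batch of 5 and keep the next 10 values:
# every subsequent run of 5 is then a random permutation of [0, 5).
shuffled = list(islice(rcycle(range(5)), 5, 15))
```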

The remaining utility functions that I wrote are quite
straightforward and, for brevity, are written as "one-liners."

- `feed`: generator that feeds lines from the file named by `filename`
- `split`: generator that splits lines into tuples based on a delimiter
- `select`: generator that selects out elements from tuples
- `load`: non-generator that reads an image file into a NumPy array
- `Xflip`: non-generator that flips an input image horizontally
- `yflip`: non-generator that flips a target label to its negative
- `sflip`: non-generator that flips an (image, label) sample, combining `Xflip` and `yflip`
- `rmap`: generator that randomly applies or does not apply a function with equal probability
- `rflip`: generator that flips samples 50% of the time
- `fetch`: generator that loads index-file entries into samples
- `group`: generator that groups input elements into lists
- `transpose`: generator that takes a generator of lists into a list of generators
- `batch`: generator that takes a list of generators into a list of NumPy array "batches"

The actual code for these utility functions is presented below.

```python
import random
from itertools import zip_longest

from PIL import Image
import numpy as np

feed = lambda filename: (l for l in open(filename))
split = lambda lines, delimiter=",": (line.split(delimiter) for line in lines)
select = lambda fields, indices: ([r[i] for i in indices] for r in fields)
load = lambda f: np.asarray(Image.open(f))
Xflip = lambda x: x[:, ::-1, :]
yflip = lambda x: -x
sflip = lambda s: (Xflip(s[0]), yflip(s[1]))
rmap = lambda f, g: (x if random.choice([True, False]) else f(x) for x in g)
rflip = lambda s: rmap(sflip, s)
fetch = lambda records, base: ([load(base + f.strip()) for f in record[:1]] + [float(v) for v in record[1:]] for record in records)
group = lambda items, n, fillvalue=None: zip_longest(*([iter(items)] * n), fillvalue=fillvalue)
transpose = lambda tuples: (list(map(list, zip(*g))) for g in tuples)
batch = lambda groups, indices=[0, 1]: ([np.asarray(t[i]) for i in indices] for t in groups)
```
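To see how these compose end to end, here is a sketch of a full pipeline. The definitions are repeated so the snippet runs standalone, `load` is replaced by a stub so no image files are needed, and the index-file contents and filenames are hypothetical:

```python
import os
import tempfile
import numpy as np
from itertools import zip_longest

# Definitions repeated from above; 'load' is a stub standing in for image I/O.
feed = lambda filename: (l for l in open(filename))
split = lambda lines, delimiter=",": (line.split(delimiter) for line in lines)
select = lambda fields, indices: ([r[i] for i in indices] for r in fields)
load = lambda f: len(f)  # stub: a real pipeline would read an image here
fetch = lambda records, base: ([load(base + f.strip()) for f in record[:1]] + [float(v) for v in record[1:]] for record in records)
group = lambda items, n, fillvalue=None: zip_longest(*([iter(items)] * n), fillvalue=fillvalue)
transpose = lambda tuples: (list(map(list, zip(*g))) for g in tuples)
batch = lambda groups, indices=[0, 1]: ([np.asarray(t[i]) for i in indices] for t in groups)

# A tiny stand-in for a real index file: image filename, target value.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a.png,0.1\nb.png,0.2\nc.png,0.3\nd.png,0.4\n")
    path = f.name

# feed -> split -> select -> fetch -> group -> transpose -> batch
pipeline = batch(transpose(group(fetch(select(split(feed(path)), [0, 1]), ""), 2)))
X, y = next(pipeline)  # first batch: 2 stubbed "images" and 2 float targets
os.unlink(path)
```

Note that nothing is read or computed until `next(pipeline)` is called; every stage is lazy, which is what makes the composition cheap.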

You may find these functions useful as written, but then again,
you may not. If you don't, by all means rewrite them! Or write
new ones. While I would be pleased if *someone* else found the
above code useful, I would be thrilled if anyone who reads this
took to heart the larger lesson. That is, that Python
generators let you quickly develop tiny, composable tools that
are tailored just for *your* needs.