# Python Generators

Python generators are a powerful language feature. Like several other functional programming aspects Python has, generators enrich programs with composability, expressiveness, and concision. They also allow for a form of lazy loading, which can be useful in Machine Learning settings where large data sets that do not fit into main memory are the norm. For instance, Keras makes use of that property with its fit_generator function, which expects input presented as generators that infinitely recycle over the underlying data.

First, a bit of advice is in order. Anyone working with Python generators is urged to become acquainted with itertools, a standard Python library of reusable, composable generators. For me the cycle generator was a key find. As mentioned above, fit_generator needs infinitely-recycling generators, which is exactly what itertools.cycle provides. One wrinkle is that cycle accomplishes this with an internal data cache, so if your data do not fit in memory you may have to seek an alternative. If your data do fit in memory, however, the cache confers a very nice property for free: after the first pass through the data, all subsequent retrievals come from memory, so performance improves dramatically from the second cycle onward.
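To illustrate, here is a minimal sketch of cycle's recycling behavior (the caching happens inside itertools; this only shows the repeated output):

```python
from itertools import cycle, islice

# cycle caches the elements of its argument internally, so after the
# first pass every value is served from memory.
data = [1, 2, 3]
print(list(islice(cycle(data), 7)))  # → [1, 2, 3, 1, 2, 3, 1]
```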

```python
from itertools import groupby, islice, zip_longest, cycle, filterfalse
```


One problem that may crop up when using itertools.cycle in Deep Learning is that, according to Deep Learning lore, it is prudent to reshuffle the data on every training epoch. To do that, we need a variant of itertools.cycle that shuffles the data on every recycle. That is easily done, as shown below. Note that the elements of the iterable are essentially returned in batches, and that the first batch is not shuffled. If you want only shuffled elements, you must know the batch size (the number of elements in the underlying finite iterable) and discard the first batch. The itertools.islice function can be helpful here. In practice, this is often not a problem.

```python
import random

def rcycle(iterable):
    saved = []                 # in-memory cache of elements seen so far
    for element in iterable:   # first pass: yield in original order
        yield element
        saved.append(element)
    while saved:               # subsequent passes: reshuffle each time
        random.shuffle(saved)
        for element in saved:
            yield element
```


If we invoke rcycle on a sequence drawn from the interval [0, 5), taking 15 values (3 batches of 5), we can see this behavior: the first 5 values are drawn in order, but the next 10 arrive in two batches, each shuffled independently.

```python
[x for x in islice(rcycle(range(5)), 15)]
# [0, 1, 2, 3, 4, 1, 0, 4, 3, 2, 2, 1, 3, 4, 0]
```
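As noted above, when only shuffled elements are wanted, islice can drop the unshuffled first batch; here is a sketch (rcycle is repeated so the snippet is self-contained):

```python
import random
from itertools import islice

def rcycle(iterable):
    saved = []
    for element in iterable:
        yield element
        saved.append(element)
    while saved:
        random.shuffle(saved)
        for element in saved:
            yield element

# Skip the first, unshuffled pass of 5 elements, then take one shuffled batch.
batch = list(islice(rcycle(range(5)), 5, 10))
print(sorted(batch))  # every batch is a permutation of [0, 1, 2, 3, 4]
```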


The remaining utility functions that I wrote are quite straightforward and for brevity are written as "one-liners."

- `feed`: generator that feeds lines from the file named by `filename`
- `split`: generator that splits lines into tuples based on a delimiter
- `select`: generator that selects out elements from tuples
- non-generator that reads an image file into a NumPy array
- `Xflip`: non-generator that flips an input image horizontally
- `yflip`: non-generator that flips a target label to its negative
- `rmap`: generator that randomly applies or does not apply a function with equal probability
- `rflip`: generator that flips samples 50% of the time
- `fetch`: generator that loads index file entries into samples
- `group`: generator that groups input elements into lists
- `transpose`: generator that takes a generator of lists into a list of generators
- `batch`: generator that takes a list of generators into a list of NumPy array "batches"
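A function like group, for instance, can be built from the standard "grouper" recipe on zip_longest (which appears among the imports above); the sketch below is one possible implementation, and details such as padding with a fill value are my assumptions, not necessarily the author's:

```python
from itertools import zip_longest

# grouper recipe: collect elements into fixed-length lists,
# padding the last group with fillvalue if the input does not divide evenly
def group(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return (list(t) for t in zip_longest(*args, fillvalue=fillvalue))

print(list(group(range(7), 3)))  # → [[0, 1, 2], [3, 4, 5], [6, None, None]]
```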

The actual code for these utility functions is presented below.

```python
from PIL import Image
import numpy as np

feed = lambda filename: (l for l in open(filename))
split = lambda lines, delimiter=",": (line.split(delimiter) for line in lines)
select = lambda fields, indices: ([r[i] for i in indices] for r in fields)
```
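A quick usage sketch chaining the three generators above over a throwaway file (the file contents are made up for illustration; note that feed does not strip newlines, so a line's last field keeps its trailing `\n`):

```python
import os, tempfile

feed = lambda filename: (l for l in open(filename))
split = lambda lines, delimiter=",": (line.split(delimiter) for line in lines)
select = lambda fields, indices: ([r[i] for i in indices] for r in fields)

# Write a small CSV-style index file for the demo.
path = tempfile.mkstemp()[1]
with open(path, "w") as f:
    f.write("img1.png,0.1,left\nimg2.png,0.2,right\n")

rows = list(select(split(feed(path)), [0, 1]))
print(rows)  # → [['img1.png', '0.1'], ['img2.png', '0.2']]
os.remove(path)
```

Because every stage is a generator, nothing is read from disk until the pipeline is consumed, which is what makes this style attractive for large data sets.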