import csv
def read_csv(filepath):
    """Read a CSV file and return a list of dictionaries."""
    with open(filepath, 'r') as f:
        return list(csv.DictReader(f))
players = read_csv('../../data/players.csv')
courses = read_csv('../../data/courses.csv')
holes = read_csv('../../data/holes.csv')
rounds = read_csv('../../data/rounds.csv')
shots = read_csv('../../data/shots.csv')
print(f'Players: {len(players)}')
print(f'Courses: {len(courses)}')
print(f'Holes: {len(holes)}')
print(f'Rounds: {len(rounds)}')
print(f'Shots: {len(shots)}')
Comprehensions and Generators
What You’ll Learn
- How comprehensions let you build lists, dicts, and sets in a single expression
- The connection between comprehensions and mathematical set-builder notation
- When comprehensions improve readability and when they hurt it
- How generator expressions and generator functions handle large datasets without loading everything into memory
- How to build data pipelines with yield
Concept
Declarative vs Imperative
So far, you have built lists by writing for loops that append items one at a time. That is the imperative style: you spell out how to construct the result step by step.
# Imperative: step-by-step instructions
sub_80 = []
for r in rounds:
    score = int(r['total_score'])
    if score < 80:
        sub_80.append(score)
Python also supports a declarative style where you describe what you want, and the language figures out how to build it:
# Declarative: describe the result
sub_80 = [int(r['total_score']) for r in rounds if int(r['total_score']) < 80]
Both produce the same list. The declarative version is called a comprehension.
The Set-Builder Notation Parallel
If you have taken a math course, comprehensions will look familiar. In set-builder notation you write:
\[S = \{x^2 \mid x \in \mathbb{N},\; x < 5\}\]
which reads: “the set of \(x^2\) for every natural number \(x\) less than 5.” The result is \(\{0, 1, 4, 9, 16\}\).
Python’s comprehension syntax mirrors this directly:
s = {x**2 for x in range(5)}
The structure is always:
[expression for variable in iterable if condition]
 ^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^
 what you   where the data           optional
 want       comes from               filter
Why Comprehensions Matter
- Concise. One line instead of four. Less code means fewer places for bugs to hide.
- Readable. Once you learn the pattern, comprehensions telegraph intent: “I am building a collection.”
- Pythonic. The Python community treats comprehensions as idiomatic style. You will encounter them constantly in libraries, documentation, and other people’s code.
- Slightly faster. In CPython, a comprehension appends each item with a dedicated bytecode instruction instead of looking up and calling list.append on every iteration. The difference is small, but it exists.
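One way to check the speed claim on your own machine is the standard-library timeit module. The sketch below uses made-up score strings instead of the golf CSVs, and the timings depend on your hardware and Python version, so treat the numbers as indicative only.

```python
import timeit

# Hypothetical data: 10,000 score strings standing in for r['total_score'] values
raw_scores = [str(70 + i % 30) for i in range(10_000)]

def with_loop():
    """Build the list step by step with append."""
    result = []
    for s in raw_scores:
        result.append(int(s) - 72)
    return result

def with_comprehension():
    """Build the same list in a single expression."""
    return [int(s) - 72 for s in raw_scores]

# Time 20 runs of each version
loop_time = timeit.timeit(with_loop, number=20)
comp_time = timeit.timeit(with_comprehension, number=20)
print(f'for loop:      {loop_time:.4f} s')
print(f'comprehension: {comp_time:.4f} s')
```

Both functions return the same list, so any gap you see comes from how the list is built, not from what it contains.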
When NOT to Use Comprehensions
Comprehensions can become unreadable when they are too long or too deeply nested. A good rule of thumb:
- One for clause with an optional if? Almost always fine as a comprehension.
- Two for clauses? Fine if the logic is straightforward, but consider whether a regular loop is clearer.
- Three or more for clauses, or complex conditional logic? Write a regular loop. You can always refactor later.
If you cannot read the comprehension aloud and understand it in one pass, it is too complex. Readability always wins.
Code
Setup: Loading the Golf Data
We will reuse the pattern from Topic 03: read each CSV into a list of dictionaries with csv.DictReader.
1. List Comprehensions
A list comprehension builds a new list by applying an expression to each item in an iterable, with an optional filter. Let’s start with the simplest case.
Basic transformation: scores to relative-to-par
Every course has a par (typically 72). Let’s convert raw scores into relative-to-par values so we can see how each round compared to par. We will use a par of 72 as a baseline.
First, the imperative way with a for loop:
# Imperative approach: for loop
par = 72
relative_scores = []
for r in rounds:
    score = int(r['total_score'])
    relative_scores.append(score - par)
print(relative_scores)
# Declarative approach: list comprehension
par = 72
relative_scores = [int(r['total_score']) - par for r in rounds]
print(relative_scores)
Both produce the same list. The comprehension says what we want (each score minus par) without spelling out the empty-list-and-append mechanics.
Adding a filter: sub-80 rounds
The if clause in a comprehension keeps only items that pass a condition.
# Imperative: filter with a for loop
sub_80 = []
for r in rounds:
    score = int(r['total_score'])
    if score < 80:
        sub_80.append(r)
for r in sub_80:
    print(f"Player {r['player_id']}: {r['total_score']} on {r['date']}")
# Declarative: list comprehension with filter
sub_80 = [r for r in rounds if int(r['total_score']) < 80]
for r in sub_80:
    print(f"Player {r['player_id']}: {r['total_score']} on {r['date']}")
Filtering shots: birdies only
A birdie happens when a player finishes a hole in one stroke fewer than par. We need to combine shot data with hole pars to find birdies. Let’s first build a par lookup, then count strokes per hole per round.
# Build a par lookup: (course_id, hole_number) -> par
par_lookup = {}
for h in holes:
    par_lookup[(h['course_id'], h['hole_number'])] = int(h['par'])
# Build a round-to-course lookup
round_course = {}
for r in rounds:
    round_course[r['round_id']] = r['course_id']
# Count strokes per (round_id, hole)
hole_scores = {}
for s in shots:
    key = (s['round_id'], s['hole'])
    hole_scores[key] = hole_scores.get(key, 0) + 1
# Now find birdies using a list comprehension
birdies = [
    (round_id, hole, strokes, par_lookup[(round_course[round_id], hole)])
    for (round_id, hole), strokes in hole_scores.items()
    if strokes == par_lookup[(round_course[round_id], hole)] - 1
]
print(f'Total birdies across all rounds: {len(birdies)}')
print()
for round_id, hole, strokes, par in birdies[:10]:
    print(f' Round {round_id}, Hole {hole}: {strokes} strokes (par {par})')
Conditional expressions: scoring labels
A conditional expression (also called a ternary) lets you choose between two values inside the comprehension. This is different from the if filter – it does not exclude items, it picks which value to output.
Syntax: value_if_true if condition else value_if_false
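The difference between the two uses of if is easiest to see side by side. Here is a minimal sketch with a made-up list of scores (not the golf CSV data):

```python
# Hypothetical scores for illustration
scores = [68, 72, 75, 81]
par = 72

# Filter: an `if` after the for clause drops items that fail the test
at_or_under = [s for s in scores if s <= par]

# Conditional expression: `if ... else` before the for clause keeps every
# item and chooses which value to emit for it
labels = ['ok' if s <= par else 'over' for s in scores]

print(at_or_under)  # [68, 72]
print(labels)       # ['ok', 'ok', 'over', 'over']
```

The filtered list is shorter than the input, while the labeled list always has one entry per input item.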
# Label each round's score relative to par
par = 72
labeled_scores = [
    (int(r['total_score']),
     'under par' if int(r['total_score']) < par
     else 'even par' if int(r['total_score']) == par
     else 'over par')
    for r in rounds
]
for score, label in labeled_scores:
    print(f' {score:>3d} {label}')
We can apply the same idea to individual holes. In golf, the number of strokes relative to par has a specific name:
def scoring_name(strokes, par):
    """Return the golf term for a hole score relative to par."""
    diff = strokes - par
    names = {-3: 'albatross', -2: 'eagle', -1: 'birdie',
             0: 'par', 1: 'bogey', 2: 'double bogey', 3: 'triple bogey'}
    return names.get(diff, f'+{diff}' if diff > 0 else f'{diff}')
# Build scoring labels for every hole played
hole_labels = [
    scoring_name(strokes, par_lookup[(round_course[round_id], hole)])
    for (round_id, hole), strokes in hole_scores.items()
]
# Count how often each label appears
label_counts = {}
for label in hole_labels:
    label_counts[label] = label_counts.get(label, 0) + 1
print('Score distribution across all rounds:')
for label, count in sorted(label_counts.items(), key=lambda x: x[1], reverse=True):
    print(f' {label:>15s}: {count}')
2. Dict Comprehensions
A dict comprehension builds a dictionary in one expression. The syntax is the same as a list comprehension but uses curly braces and a key: value pair:
{key_expr: value_expr for variable in iterable if condition}
Player lookup by ID
In Topic 03, we built lookup dictionaries with a for loop. Dict comprehensions make this a one-liner.
# Imperative: for loop
player_lookup = {}
for p in players:
    player_lookup[p['player_id']] = p['name']
print(player_lookup)
# Declarative: dict comprehension
player_lookup = {p['player_id']: p['name'] for p in players}
print(player_lookup)
Scoring stats per player
Let’s build a dictionary that maps each player’s name to their average score. This shows how a dict comprehension can include more complex expressions.
# Group scores by player_id first (using a regular loop)
scores_by_player = {}
for r in rounds:
    pid = r['player_id']
    if pid not in scores_by_player:
        scores_by_player[pid] = []
    scores_by_player[pid].append(int(r['total_score']))
# Now use a dict comprehension to calculate averages
avg_scores = {
    player_lookup[pid]: round(sum(scores) / len(scores), 1)
    for pid, scores in scores_by_player.items()
}
print('Average score per player:')
for name, avg in sorted(avg_scores.items(), key=lambda x: x[1]):
    print(f' {name:20s} {avg:.1f}')
Club distance averages from shots.csv
For each club, what is the average distance of a shot? We calculate distance as start_distance_to_pin - end_distance_to_pin.
# Group distances by club
club_distances = {}
for s in shots:
    club = s['club']
    distance = float(s['start_distance_to_pin']) - float(s['end_distance_to_pin'])
    if distance > 0:  # exclude putts where ball may go past the hole
        if club not in club_distances:
            club_distances[club] = []
        club_distances[club].append(distance)
# Dict comprehension: club -> average distance
avg_club_distance = {
    club: round(sum(dists) / len(dists), 1)
    for club, dists in club_distances.items()
}
print('Average distance by club:')
for club, avg_dist in sorted(avg_club_distance.items(), key=lambda x: x[1], reverse=True):
    print(f' {club:>10s}: {avg_dist:>6.1f} yards')
3. Set Comprehensions
A set comprehension builds a set – an unordered collection of unique values. The syntax uses curly braces like a dict comprehension, but without the key: value pair:
{expression for variable in iterable if condition}
Unique clubs used
# All unique clubs in the dataset
all_clubs = {s['club'] for s in shots}
print(f'Unique clubs ({len(all_clubs)}):')
print(sorted(all_clubs))
Unique courses each player has played
# Build a course name lookup
course_lookup = {c['course_id']: c['name'] for c in courses}
# For each player, find unique courses played
for p in players:
    courses_played = {course_lookup[r['course_id']] for r in rounds if r['player_id'] == p['player_id']}
    print(f"{p['name']:20s} played at: {', '.join(sorted(courses_played))}")
Set operations: clubs one player uses that another does not
Sets support mathematical operations: union (|), intersection (&), and difference (-). These are powerful for comparing groups.
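Before applying these operators to the golf data, here is a minimal sketch of all three on two made-up club sets (the names are invented for illustration):

```python
# Hypothetical club sets for two players
player_a = {'driver', '7-iron', 'putter', 'lob wedge'}
player_b = {'driver', '3-wood', '7-iron', 'putter'}

either = sorted(player_a | player_b)   # union: clubs at least one player uses
both = sorted(player_a & player_b)     # intersection: clubs both players use
only_a = sorted(player_a - player_b)   # difference: clubs only player A uses

print(either)  # ['3-wood', '7-iron', 'driver', 'lob wedge', 'putter']
print(both)    # ['7-iron', 'driver', 'putter']
print(only_a)  # ['lob wedge']
```

Note that difference is not symmetric: player_b - player_a would give ['3-wood'] instead.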
# Get rounds for Bear Woods (player_id=1) and Bobby Bogey (player_id=4)
bear_rounds = {r['round_id'] for r in rounds if r['player_id'] == '1'}
bobby_rounds = {r['round_id'] for r in rounds if r['player_id'] == '4'}
# Clubs used by each player
bear_clubs = {s['club'] for s in shots if s['round_id'] in bear_rounds}
bobby_clubs = {s['club'] for s in shots if s['round_id'] in bobby_rounds}
print(f"Bear Woods uses: {sorted(bear_clubs)}")
print(f"Bobby Bogey uses: {sorted(bobby_clubs)}")
print()
print(f"Clubs Bear uses that Bobby does not: {sorted(bear_clubs - bobby_clubs)}")
print(f"Clubs Bobby uses that Bear does not: {sorted(bobby_clubs - bear_clubs)}")
print(f"Clubs both use: {sorted(bear_clubs & bobby_clubs)}")
4. Nested Comprehensions
You can nest for clauses inside a comprehension. The clauses read left to right, just like nested for loops read top to bottom.
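The left-to-right rule can be checked on a tiny example. This is a minimal sketch with made-up values: the first for clause in the comprehension plays the role of the outer loop, and the second plays the role of the inner loop.

```python
# Comprehension: clauses written left to right
pairs = [(letter, number) for letter in 'AB' for number in (1, 2)]

# Equivalent nested loops: same clauses written top to bottom
expected = []
for letter in 'AB':
    for number in (1, 2):
        expected.append((letter, number))

print(pairs)  # [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
```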
Flattening data
Suppose we want a flat list of every (player, course) combination that was played. With nested loops:
# Imperative: nested for loops
player_course_pairs = []
for p in players:
    for c in courses:
        # Check if this player has a round at this course
        played = any(
            r['player_id'] == p['player_id'] and r['course_id'] == c['course_id']
            for r in rounds
        )
        if played:
            player_course_pairs.append((p['name'], c['name']))
for name, course in player_course_pairs:
    print(f' {name:20s} -> {course}')
# Declarative: nested comprehension
player_course_pairs = [
    (p['name'], c['name'])
    for p in players
    for c in courses
    if any(r['player_id'] == p['player_id'] and r['course_id'] == c['course_id']
           for r in rounds)
]
for name, course in player_course_pairs:
    print(f' {name:20s} -> {course}')
Here is a simpler example of flattening. Suppose we have hole pars grouped by course and we want a single flat list:
# Group pars by course
pars_by_course = {}
for h in holes:
    cid = h['course_id']
    if cid not in pars_by_course:
        pars_by_course[cid] = []
    pars_by_course[cid].append(int(h['par']))
# Flatten with a nested comprehension
all_pars = [par for course_pars in pars_by_course.values() for par in course_pars]
print(f'Total hole records: {len(all_pars)}')
print(f'All pars: {all_pars}')
Readability warning. Two nested for clauses are the practical limit for comprehensions. Anything deeper should be a regular loop or broken into helper functions. Compare:
# This is hard to read -- do NOT write comprehensions like this
result = [
    f(x, y, z)
    for x in xs
    for y in ys
    if g(x, y)
    for z in zs
    if h(y, z)
]
When you reach this level of complexity, nested loops with clear variable names will be far more maintainable.
5. Generator Expressions
A generator expression looks exactly like a list comprehension, except it uses parentheses instead of brackets:
# List comprehension -- builds the entire list in memory
scores = [int(r['total_score']) for r in rounds]
# Generator expression -- produces values one at a time
scores = (int(r['total_score']) for r in rounds)
The generator does not build a list in memory. Instead, it produces values lazily – one at a time, on demand. This matters when working with large datasets.
# List comprehension returns a list
scores_list = [int(r['total_score']) for r in rounds]
print(type(scores_list))
print(scores_list)
print()
# Generator expression returns a generator object
scores_gen = (int(r['total_score']) for r in rounds)
print(type(scores_gen))
print(scores_gen) # no values shown -- they haven't been produced yet
Memory efficiency
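With our small golf dataset, the difference is negligible. But let’s demonstrate the concept with sys.getsizeof. Keep in mind that sys.getsizeof reports only the size of the container object itself: for the list, that is its array of references, not the float objects those references point to, so the real gap is larger than the numbers suggest.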
import sys
# List comprehension: all values stored in memory
all_distances_list = [float(s['start_distance_to_pin']) for s in shots]
print(f'List: {sys.getsizeof(all_distances_list):>8,} bytes for {len(all_distances_list)} items')
# Generator expression: values produced on demand
all_distances_gen = (float(s['start_distance_to_pin']) for s in shots)
print(f'Generator: {sys.getsizeof(all_distances_gen):>5,} bytes (fixed overhead, regardless of data size)')
Using generators with built-in functions
Generator expressions pair naturally with sum(), max(), min(), any(), and all(). These functions consume the generator one item at a time, so no intermediate list of converted values is ever built.
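To see the one-item-at-a-time behaviour, we can record which items actually get examined. The sketch below uses made-up scores and a hypothetical helper, is_under_par, that logs every score it is asked about:

```python
checked = []  # every score the helper was asked about, in order

def is_under_par(score, par=72):
    """Hypothetical helper: record the score, then test it against par."""
    checked.append(score)
    return score < par

scores = [85, 79, 70, 90, 68]  # made-up round scores

# any() pulls items from the generator until one test returns True
found = any(is_under_par(s) for s in scores)

print(found)    # True
print(checked)  # [85, 79, 70]
```

The last two scores are never examined, because any() stops pulling from the generator as soon as it sees a True value.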
When you pass a generator to a function that takes a single argument, you can drop the extra parentheses:
sum(int(r['total_score']) for r in rounds) # no double parentheses needed
# sum() with a generator
total_strokes = sum(int(r['total_score']) for r in rounds)
avg_score = total_strokes / len(rounds)
print(f'Total strokes across all rounds: {total_strokes}')
print(f'Average score: {avg_score:.1f}')
# max() and min() with generators
best_score = min(int(r['total_score']) for r in rounds)
worst_score = max(int(r['total_score']) for r in rounds)
print(f'Best score: {best_score}')
print(f'Worst score: {worst_score}')
# any() -- did any round score under par (72)?
has_under_par = any(int(r['total_score']) < 72 for r in rounds)
print(f'Any round under par? {has_under_par}')
# all() -- did every round score under 110?
all_under_110 = all(int(r['total_score']) < 110 for r in rounds)
print(f'All rounds under 110? {all_under_110}')
# any() short-circuits: it stops as soon as it finds a True value
# all() short-circuits: it stops as soon as it finds a False value
# This means they are efficient even on huge datasets
6. Generator Functions with yield
A generator expression works for simple transformations. For more complex logic, you can write a generator function using the yield keyword.
A generator function looks like a regular function, but instead of returning a single result, it yields values one at a time. Each time the caller asks for the next value, the function resumes exactly where it left off.
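The pause-and-resume behaviour is easier to believe once you watch it happen. Here is a minimal sketch with a hypothetical generator function, count_up_to, that records what it is doing in a list called events:

```python
events = []  # a log of what the generator function has executed so far

def count_up_to(limit):
    """Yield 1, 2, ..., limit, logging each pause and resume."""
    n = 1
    while n <= limit:
        events.append(f'yielding {n}')
        yield n  # execution pauses here until the next value is requested
        events.append(f'resumed after {n}')
        n += 1

gen = count_up_to(2)              # creates the generator; no body code has run yet
events_before_next = list(events)

first = next(gen)                 # runs the body up to the first yield
second = next(gen)                # resumes after the first yield, runs to the second

print(events_before_next)  # []
print(first, second)       # 1 2
print(events)              # ['yielding 1', 'resumed after 1', 'yielding 2']
```

Calling count_up_to(2) runs none of the function body. The work only happens when next() asks for a value, and the local variable n keeps its value between requests.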
Streaming a large file line by line
Our read_csv() function loads the entire file into memory. For a 2,100-row shots file that is fine, but imagine a file with millions of rows. A generator function can stream it row by row.
def stream_csv(filepath):
    """Yield rows from a CSV file one at a time, without loading the full file."""
    with open(filepath, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row
# Use it -- only one row is in memory at a time
row_count = 0
for row in stream_csv('../../data/shots.csv'):
    row_count += 1
print(f'Streamed {row_count} rows from shots.csv')
The key insight: the with open() block stays open while the generator is suspended at yield. The file is read incrementally and closed automatically when the generator is exhausted (or garbage collected). This means you can process a file larger than your available memory.
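If the consumer stops early, the generator is still suspended inside the with block, so the file stays open until the generator is closed or garbage collected. Here is a minimal sketch of releasing it explicitly with the generator's close() method. The temporary file, its two columns, and the opened list are all made up for this demonstration, and this stream_csv is an instrumented copy of the function above.

```python
import csv
import os
import tempfile

opened = []  # keeps a reference to the file object so we can inspect it later

def stream_csv(filepath):
    """Yield rows from a CSV file one at a time (instrumented for the demo)."""
    with open(filepath, 'r', newline='') as f:
        opened.append(f)
        for row in csv.DictReader(f):
            yield row

# Hypothetical sample file standing in for shots.csv
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('round_id,club\n1,driver\n1,putter\n2,driver\n')
    path = tmp.name

rows = stream_csv(path)
first_row = next(rows)           # the file is open; the generator is paused at yield
open_while_paused = not opened[0].closed

rows.close()                     # raises GeneratorExit at the yield, so the with block exits
closed_after_close = opened[0].closed
remaining = list(rows)           # a closed generator produces nothing more

os.remove(path)
print(first_row)
print(open_while_paused, closed_after_close, remaining)
```

In everyday code a for loop that runs to completion makes this unnecessary. The explicit close() only matters when you stop partway through a long-lived generator.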
Building a data pipeline
Generator functions compose naturally into pipelines where each stage transforms or filters data before passing it to the next. Think of it as an assembly line: each station does one thing.
Let’s build a pipeline: read shots.csv, filter to a specific player, then calculate strokes gained stats.
# Stage 1: Stream raw shot data
def stream_shots(filepath):
    """Yield shot dictionaries from the CSV file."""
    with open(filepath, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row
# Stage 2: Filter to a specific player's rounds
def filter_by_player(shot_stream, player_id):
    """Yield only shots from rounds belonging to the given player."""
    player_rounds = {r['round_id'] for r in rounds if r['player_id'] == player_id}
    for shot in shot_stream:
        if shot['round_id'] in player_rounds:
            yield shot
# Stage 3: Calculate strokes gained per club
def strokes_gained_by_club(shot_stream):
    """Consume a shot stream and return strokes gained totals per club."""
    sg_totals = {}
    sg_counts = {}
    for shot in shot_stream:
        club = shot['club']
        sg = float(shot['strokes_gained'])
        sg_totals[club] = sg_totals.get(club, 0.0) + sg
        sg_counts[club] = sg_counts.get(club, 0) + 1
    return {club: (sg_totals[club], sg_counts[club]) for club in sg_totals}
# Connect the pipeline for Bear Woods (player_id='1')
raw_shots = stream_shots('../../data/shots.csv')
bear_shots = filter_by_player(raw_shots, '1')
bear_sg = strokes_gained_by_club(bear_shots)
print('Bear Woods -- Strokes Gained by Club:')
print(f'{"Club":>10s} {"Total SG":>10s} {"Shots":>6s} {"SG/Shot":>8s}')
print('-' * 40)
for club, (total, count) in sorted(bear_sg.items(), key=lambda x: x[1][0], reverse=True):
    print(f'{club:>10s} {total:>10.2f} {count:>6d} {total/count:>8.3f}')
Notice how each pipeline stage is independent and testable. You can swap stages, add new filters, or reuse them in different combinations. And because the first two stages yield one row at a time, the pipeline never loads the full shots file into memory; only the final stage keeps running totals per club.
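As a sketch of what adding a new filter could look like, here is a hypothetical extra stage, filter_by_club, which is not part of the pipeline above. It is fed from a small made-up list instead of shots.csv, which also shows how a single stage can be tested in isolation.

```python
def filter_by_club(shot_stream, club):
    """Hypothetical extra stage: yield only shots hit with the given club."""
    for shot in shot_stream:
        if shot['club'] == club:
            yield shot

# Made-up rows standing in for the streamed shots.csv
sample_shots = [
    {'round_id': '1', 'club': 'driver', 'strokes_gained': '0.25'},
    {'round_id': '1', 'club': 'putter', 'strokes_gained': '-0.10'},
    {'round_id': '2', 'club': 'driver', 'strokes_gained': '0.05'},
]

# Any iterable of dictionaries can feed a stage, not just a file stream
drivers = filter_by_club(iter(sample_shots), 'driver')
driver_sg = [float(shot['strokes_gained']) for shot in drivers]

print(driver_sg)  # [0.25, 0.05]
```

Because the stage only assumes an iterable of dictionaries with a club key, it could sit before or after a player filter without either stage changing.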
Let’s reuse the same pipeline for Bobby Bogey to compare:
# Same pipeline, different player
raw_shots = stream_shots('../../data/shots.csv')
bobby_shots = filter_by_player(raw_shots, '4')
bobby_sg = strokes_gained_by_club(bobby_shots)
print('Bobby Bogey -- Strokes Gained by Club:')
print(f'{"Club":>10s} {"Total SG":>10s} {"Shots":>6s} {"SG/Shot":>8s}')
print('-' * 40)
for club, (total, count) in sorted(bobby_sg.items(), key=lambda x: x[1][0], reverse=True):
    print(f'{club:>10s} {total:>10.2f} {count:>6d} {total/count:>8.3f}')
Brief mention: infinite sequences
Because generators produce values lazily, they can represent infinite sequences. You would never build an infinite list, but an infinite generator is perfectly fine as long as the consumer stops at some point.
Here is a quick example outside our golf data: generating round IDs for a hypothetical unlimited tournament.
def round_ids(start=1):
    """Generate round IDs forever."""
    n = start
    while True:
        yield f'R{n:04d}'
        n += 1
# Take only the first 10 IDs
id_gen = round_ids()
first_ten = [next(id_gen) for _ in range(10)]
print(first_ten)
The generator never terminates on its own, but that is fine because the consumer (next() called 10 times) controls how many values to pull. This pattern is useful for assigning unique IDs, generating test data, or simulating ongoing processes.
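Calling next() in a comprehension works, but the standard library's itertools.islice expresses "take the first n values" more directly. Here is a minimal sketch that repeats the round_ids generator so the example runs on its own:

```python
from itertools import islice

def round_ids(start=1):
    """Generate round IDs forever."""
    n = start
    while True:
        yield f'R{n:04d}'
        n += 1

# islice stops pulling values after the requested count
first_three = list(islice(round_ids(), 3))
later_two = list(islice(round_ids(start=98), 2))

print(first_three)  # ['R0001', 'R0002', 'R0003']
print(later_two)    # ['R0098', 'R0099']
```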
AI
Exercise 1: Refactor a For Loop into a Comprehension
Give an AI assistant the following code and ask it to refactor the loop into a comprehension:
import csv
with open('../../data/rounds.csv', 'r') as f:
    reader = csv.DictReader(f)
    rounds_data = list(reader)
windy_scores = []
for r in rounds_data:
    if r['weather'] == 'windy':
        score = int(r['total_score'])
        windy_scores.append(score)
avg_windy = sum(windy_scores) / len(windy_scores)
print(f'Average score in windy conditions: {avg_windy:.1f}')
Prompt to use:
> Refactor the windy_scores for loop into a list comprehension. Keep everything else the same.
Evaluate the AI’s response:
- Does the comprehension produce the same result as the original loop?
- Is the comprehension more readable than the loop, or roughly equivalent?
- Did the AI change anything it should not have (e.g., the file reading logic or the final calculation)?
# Paste the AI-generated code here and run it.
# Verify the output matches: the average score for windy rounds.
Exercise 2: Explain a Complex Nested Comprehension
Give an AI assistant the following comprehension and ask it to explain what it does:
sg_by_player_club = {
    player_lookup[pid]: {
        club: round(sum(sg) / len(sg), 3)
        for club, sg in club_sg.items()
    }
    for pid, club_sg in (
        (r['player_id'], {})
        for r in rounds
    )
}
Prompt to use:
> Explain step by step what this nested comprehension does. What would the output look like? Are there any bugs?
Evaluate the AI’s response:
- Does the AI correctly identify that this code has a bug? (The inner dict club_sg is always empty, so the inner comprehension produces an empty dict for every player.)
- Does the AI explain the intended purpose versus what the code actually does?
- Does the AI suggest a corrected version?
This exercise tests whether you can use AI to debug code, not just generate it. A good AI response should catch the logical error.
# Paste the AI's explanation and corrected code here.
# Try running the corrected version to verify it works.
Exercise 3: When to Use a Generator vs a List Comprehension
Prompt to use:
> When should I use a generator expression instead of a list comprehension in Python? Give concrete examples of when each is the better choice.
Evaluate the AI’s response. A good answer should mention:
- Memory efficiency – generators do not store all values in memory, making them better for large datasets.
- Lazy evaluation – generators produce values on demand, which means they can short-circuit (e.g., with any() or all()).
- Large or streaming data – when reading files line by line, processing API responses, or working with datasets that do not fit in memory.
- When you need to iterate only once – generators can only be consumed once. If you need to loop over the data multiple times, a list is required.
- When you need indexing or len() – lists support data[3] and len(data), generators do not.
If the AI misses any of these points, that is worth noting. Does it give practical examples, or only abstract explanations?
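The single-pass and no-len() points in the checklist are easy to verify yourself before judging the AI's answer. Here is a minimal sketch with made-up scores:

```python
scores = [71, 78, 85]  # hypothetical scores
gen = (s - 72 for s in scores)

first_pass = list(gen)   # consumes every value the generator has
second_pass = list(gen)  # the generator is already exhausted

# Generators have no length; asking for one raises TypeError
try:
    len(s - 72 for s in scores)
    supports_len = True
except TypeError:
    supports_len = False

print(first_pass)    # [-1, 6, 13]
print(second_pass)   # []
print(supports_len)  # False
```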
# Paste the AI's response here as a comment or markdown.
# Note which of the 5 evaluation points the AI covered.
Summary
| Concept | Syntax | When to Use |
|---|---|---|
| List comprehension | [expr for x in iterable if cond] | Building a new list by transforming and/or filtering an iterable |
| Dict comprehension | {k: v for x in iterable if cond} | Building a lookup dictionary or aggregating data by key |
| Set comprehension | {expr for x in iterable if cond} | Collecting unique values; then use set operations (-, &, \|) to compare groups |
| Nested comprehension | Multiple for clauses in one expression | Flattening nested data; limit to two for clauses for readability |
| Conditional expression | a if cond else b inside a comprehension | Choosing between two output values (not filtering) |
| Generator expression | (expr for x in iterable if cond) | Feeding data into sum(), max(), min(), any(), all() without building an intermediate list |
| Generator function | def f(): ... yield val | Streaming large files, building multi-stage data pipelines, producing infinite sequences |
Next up: Topic 05 – Classes and Data Modeling.