Boss 5, top-K
Welcome back. You've been using hash maps to answer yes-or-no questions and to group things. Today's problem is subtly different. You want the most frequent few items. Not all of them. Not a ranking of everything. Just the top K.
You'll learn a really fun trick for this: bucket sort. It's one of the few places where you can beat O(n log n) sorting and pay O(n) instead.
The story
Friday morning. The owner of the diner pulls out a printed log from the jukebox. Thousands of lines, each one the name of a song that got played this week.
... "Sweet Caroline", "Don't Stop Believin'", "Sweet Caroline", ...
She says: "I want to print a poster with the top 3 most played songs of the week. Bring me the answer by lunch."
You already know step one. Count the plays with a hash map.
from collections import Counter
counts = Counter(songs)That's the easy part. Now you need the top 3. How?
The naive way (which is also fine)
Sort the counts dictionary by value, descending, take the first 3.
def top_k_sort(songs, k):
counts = Counter(songs)
return [song for song, _ in counts.most_common(k)]most_common is Python being helpful. Under the hood it sorts the pairs by count. Time: O(n + m log m) where n is the number of plays and m is the number of unique songs. For most inputs, this is totally fine. Don't let anyone tell you otherwise.
But the problem setter wants you to do better than a full sort. So let's.
The weapon: bucket sort
Here's the mental shift. You don't need a sorted list of songs by count. You need the songs that appear the most, in any order. Fewer promises means more options.
The range of possible play counts is bounded. A song can be played zero times, or once, or twice, or up to N times (where N is the total number of plays). So you can use the play count itself as an index into an array.
Build an array of buckets, size N+1. Bucket i is a list of every song that was played exactly i times.
buckets[0] = []
buckets[1] = ["Hotel California"]
buckets[2] = []
buckets[3] = ["Thunderstruck", "Bohemian Rhapsody"]
...
buckets[47] = ["Sweet Caroline"]
Now walk the buckets from the back (highest count) forward. Collect songs until you have K of them. Done.
No sorting anywhere. Counting is O(n). Building the bucket array is O(m) where m is the number of unique songs. Walking the buckets backward is O(n) in the worst case because there are at most N buckets to scan. Total: O(n). Beats a sort.
The code
from collections import Counter
def top_k_frequent(songs: list[str], k: int) -> list[str]:
counts = Counter(songs)
buckets = [[] for _ in range(len(songs) + 1)]
for song, freq in counts.items():
buckets[freq].append(song)
result = []
for freq in range(len(buckets) - 1, -1, -1):
for song in buckets[freq]:
result.append(song)
if len(result) == k:
return result
return resultRead it top to bottom:
- Count plays with
Counter. - Create an empty bucket for each possible frequency (0 through N).
- Drop each song into the bucket for its frequency.
- Walk the buckets from highest frequency down, collecting songs into
result. - Stop as soon as
resulthas K items.
That's it. The "walk backward" part is the only slightly unusual bit, and Python's range(..., -1, -1) handles it cleanly.
Why is this O(n)?
This is the part worth understanding. You might look at the nested loop at the bottom and think "that's O(n × m), it can't be linear!". But count the total work done inside the loop body, not the structure.
- The outer loop visits each bucket once.
- The inner loop visits each song exactly once in its entire lifetime.
- Each song appears in exactly one bucket.
- Together, the inner loop runs a total of m times across all iterations, where m is the number of unique songs.
So the whole nested-loop thing sums to O(n + m), which is O(n). The nested appearance is a red herring.
This is the same reason hash map lookups are O(1) even though the hash function does work. You count amortized cost, not loop depth.
Bucket sort needs the value range to be bounded and small-ish. If you're sorting arbitrary floats or huge numbers, you can't just index an array by them. Bucket sort shines when the possible values are integers in a known, tight range. For top-K frequency problems, the range is "0 to n", which is always tight enough.
The comparison
most_common | Heap of size K | Bucket sort | |
|---|---|---|---|
| Time | O(n + m log m) | O(n + m log k) | O(n) |
| Space | O(m) | O(m + k) | O(n) |
| Code complexity | One line | Medium | Medium |
| Best when... | Easy, m small | K small, m huge | Tight integer range |
Three different tools for the same problem. Each has a use. Interviewers love bucket sort because it shows you see through the "obvious" sort-then-pick approach. In real life, the one-liner is usually enough and more maintainable.
Gotchas
1. Sizing the bucket array.
You need len(songs) + 1 buckets, not len(songs). If you have 10 plays and one song got all of them, its frequency is 10, which is index 10, which needs an array of length 11. Off-by-one errors are classic here.
2. Forgetting that ties exist.
If K=3 and you have songs with counts [5, 5, 5, 5, 2], there's more than one valid top-3. The problem usually says "any valid answer works". Don't stress over tie-breaking unless asked.
3. Walking the buckets wrong direction. High count first. If you go from zero upward, you'll collect the least frequent songs, which is the opposite of what the owner wants.
4. Forgetting that result is already full.
Once result has K items, return immediately. Without the if len(result) == k: return check you'll keep going and return all unique songs sorted by frequency. Still correct if you slice, but slower and wastes work.
What you really learned
Bucket sort is one of those patterns that feels like cheating. It dodges the sorting barrier by exploiting a property of the data. The property here was "frequencies are integers bounded by n". Whenever you see a problem where the values have a known bounded range, ask yourself if you can use them as indices directly. Sometimes you can. When you can, the speedup is dramatic.
Other places this trick shows up:
- Sorting an array of integers in a small range (counting sort)
- Finding the K most/least something
- Histogram problems
- "Given ages from 0 to 120, do something" kinds of problems
Boss down
Poster's printed. "Sweet Caroline", "Don't Stop Believin'", "Thunderstruck". Jukebox royalty.
Next up: Boss 6 — The Factory Line. A factory assembly line multiplies products as they move down a conveyor belt. The manager wants to know, for each station, what the final product would have been if that station had been skipped. Sounds impossible in one pass. You'll learn the prefix and suffix trick, which is one of the most clever array techniques out there.
Hard hats on. See you at the factory.