5. The ML System Design Interview

What is this interview actually testing?

The prompt sounds simple: "Design a fraud detection system." Forty-five minutes, a whiteboard, and an interviewer watching you think.

Here's the secret most candidates miss. The interviewer is not testing whether you know models. They assume you do. They're testing whether you can turn a vague business problem into a working ML system without someone holding your hand.

That means they're grading things that never appear in a course syllabus. Did you ask what "fraud" means before designing? Did you define a metric before picking a model? Did you mention what happens after deployment, or did your design end at "then we train the model"?

Backend engineers actually have an edge here. You already think in systems: APIs, latency budgets, failure modes. The ML system design interview is your system design interview with a model in the middle. You just need a repeatable way to structure the conversation.

So let me give you one.

The framework

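Every ML design question, whether it's fraud, credit scoring, recommendations, or forecasting, can be walked through the same six stages: clarify the problem, define the label and metric, data and features, baseline then model, serving, and monitoring.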

Notice the order. The model shows up at stage four, more than halfway through. That's deliberate. In real ML work, and in the interview, most of the value and most of the failure live in stages one, two, and six.

Announce the framework out loud at the start: "I'll clarify the problem, define the label and metric, then cover data, model, serving, and monitoring." The interviewer now knows you have a map. Half the battle is won in that first sentence.

Now let's run a full question through it.

Stage 1: Clarify before you design anything

"Design a fraud detection system for a payments platform."

Don't touch the whiteboard yet. Ask questions.

What kind of fraud, stolen cards or account takeover? What's the transaction volume, a thousand per second or ten? Do we block transactions in real time or flag them for review? What hurts more, missing fraud or blocking a good customer?

Say you learn: card-payment fraud, real-time decisions, about 2,000 transactions per second, and blocking a legitimate customer costs trust while missing fraud costs actual money. Now you're designing a specific system instead of reciting a generic one.

One minute of questions beats ten minutes of backtracking

Interviewers deliberately leave the prompt vague. Candidates who start drawing architecture immediately are answering a question nobody asked. Opening with two or three sharp clarifying questions is the strongest possible start.

Stage 2: What exactly are we predicting, and how do we score it?

This is where strong candidates separate from the pack. Define the label: for each transaction, fraud or not fraud, where ground truth comes from chargebacks and confirmed reports. Then immediately flag the catch: chargebacks arrive weeks late, so labels lag reality. Your freshest training data is always a little stale.
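To make the label lag concrete, here's a minimal sketch of one way to handle it: only train on transactions old enough that a chargeback would probably have arrived by now. The 45-day window, the field names, and the function name are all assumptions for illustration; the real window is something to agree with the fraud team.

```python
from datetime import datetime, timedelta

# Assumed maturity window: how long we wait before trusting "no chargeback".
LABEL_MATURITY = timedelta(days=45)


def mature_transactions(transactions, now):
    """Keep only transactions old enough for their labels to have settled."""
    cutoff = now - LABEL_MATURITY
    return [t for t in transactions if t["timestamp"] <= cutoff]


now = datetime(2024, 6, 30)
transactions = [
    {"id": "t1", "timestamp": datetime(2024, 4, 1), "chargeback": True},
    {"id": "t2", "timestamp": datetime(2024, 5, 10), "chargeback": False},
    # Too recent: "no chargeback yet" is not the same as "not fraud".
    {"id": "t3", "timestamp": datetime(2024, 6, 25), "chargeback": False},
]

training_rows = mature_transactions(transactions, now)
print([t["id"] for t in training_rows])
```

The recent transaction is held back rather than labeled "not fraud," which is exactly the staleness tradeoff worth saying out loud.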

Now the metric. Accuracy is a trap and you should say so out loud. If 0.2 percent of transactions are fraud, a model that approves everything is 99.8 percent accurate and completely useless.
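The arithmetic is worth seeing once. This small sketch scores a "model" that approves everything, using the 0.2 percent fraud rate from above; the one-million transaction count is just an assumed round number.

```python
def approve_all_metrics(n_transactions: int, fraud_rate: float) -> dict:
    """Metrics for a 'model' that approves every transaction."""
    n_fraud = round(n_transactions * fraud_rate)
    n_legit = n_transactions - n_fraud
    # Approving everything gets every legitimate transaction right
    # and misses every fraudulent one.
    accuracy = n_legit / n_transactions
    recall = 0.0  # no fraud is ever caught
    return {"fraud": n_fraud, "accuracy": accuracy, "recall": recall}


metrics = approve_all_metrics(1_000_000, 0.002)
print(metrics)
```

Out of a million transactions, 2,000 are fraud, the accuracy is 99.8 percent, and the recall is zero.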

Talk precision and recall instead. Precision: of the transactions we blocked, how many were actually fraud? Recall: of all the fraud, how much did we catch? They fight each other. Block aggressively and precision drops, so good customers suffer. Block cautiously and recall drops, so fraud slips through.
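Here's a small sketch of that tug-of-war. The scores and labels are made up for illustration, and "block" is assumed to mean "score at or above the threshold."

```python
def precision_recall(scored, threshold):
    """Precision and recall when we block every score >= threshold.

    `scored` is a list of (model_score, is_fraud) pairs.
    """
    tp = sum(1 for s, fraud in scored if s >= threshold and fraud)
    fp = sum(1 for s, fraud in scored if s >= threshold and not fraud)
    fn = sum(1 for s, fraud in scored if s < threshold and fraud)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scored transactions: (score, is_fraud)
scored = [
    (0.95, True), (0.80, True), (0.70, False), (0.60, True),
    (0.40, False), (0.30, False), (0.20, True), (0.10, False),
]

cautious = precision_recall(scored, threshold=0.75)    # block rarely
aggressive = precision_recall(scored, threshold=0.25)  # block often
print(cautious, aggressive)
```

On this toy data, the cautious threshold gives precision 1.0 and recall 0.5, while the aggressive one gives precision 0.5 and recall 0.75.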

Then tie it to money, because this is a business: "I'd work with the fraud team to price both errors. A missed fraud costs the transaction amount plus fees. A false block risks losing the customer. That ratio decides where we set the threshold."
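One way to turn that pricing into a threshold is sketched below. Everything numeric here is an assumption: the fee, the cost of a false block, the candidate thresholds, and the toy transactions. The point is only that changing the cost ratio moves the best threshold.

```python
CHARGEBACK_FEE = 15  # assumed flat fee added to each missed fraud


def total_cost(scored, threshold, false_block_cost):
    """Cost of blocking every score >= threshold.

    `scored` is a list of (model_score, is_fraud, amount) tuples.
    """
    cost = 0
    for score, is_fraud, amount in scored:
        blocked = score >= threshold
        if is_fraud and not blocked:
            cost += amount + CHARGEBACK_FEE  # missed fraud
        elif blocked and not is_fraud:
            cost += false_block_cost         # annoyed good customer
    return cost


def best_threshold(scored, candidates, false_block_cost):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(scored, t, false_block_cost))


scored = [
    (0.9, True, 500), (0.7, False, 40), (0.6, True, 200),
    (0.3, False, 25), (0.2, False, 60),
]
candidates = [0.25, 0.5, 0.8]

cheap_false_blocks = best_threshold(scored, candidates, false_block_cost=30)
costly_false_blocks = best_threshold(scored, candidates, false_block_cost=300)
print(cheap_false_blocks, costly_false_blocks)
```

When a false block is priced at 30, the middle threshold wins; price it at 300 and the most cautious threshold wins instead. Same model, different business decision.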

Stage 3: Data and features

Where does the signal come from? Transaction data: amount, merchant, time, currency. Account history: age of account, usual spending pattern, devices seen before. Behavioral aggregates: transactions in the last hour, distance from the previous transaction's location.

Two things to say here that earn real points.

First, class imbalance. Fraud is a fraction of a percent of the data. Mention how you'd handle it: weighting the rare class more heavily in training, or resampling, and evaluating with precision-recall rather than accuracy either way.
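Weighting the rare class can be as simple as inverse-frequency weights. The sketch below uses the same heuristic scikit-learn documents for class_weight="balanced" (n_samples / (n_classes * class_count)); the function name and the toy label counts are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy dataset at a 0.2 percent fraud rate: 998 legitimate, 2 fraud.
labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Each fraud example ends up weighted 250, against roughly 0.5 for each legitimate one, so the two classes contribute equally to the training loss.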

Second, leakage. Any feature that wouldn't exist at decision time is poison. "Was this transaction later disputed" is obviously leaked, but subtler leaks hide in aggregates computed over the full dataset. State the rule: every feature must be computable at the moment the transaction happens.
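Here's a minimal sketch of that rule applied to one aggregate, "transactions in the last hour." The function name and timestamps are hypothetical; what matters is the strict "before decision time" cutoff, which has to hold when building training data too.

```python
from datetime import datetime, timedelta


def txn_count_last_hour(history, decision_time):
    """Count this account's transactions in the hour before decision_time.

    Only events strictly earlier than the decision are visible, so the
    feature is computable at the moment the transaction happens.
    """
    window_start = decision_time - timedelta(hours=1)
    return sum(1 for ts in history if window_start <= ts < decision_time)


decision_time = datetime(2024, 3, 1, 12, 0)
history = [
    datetime(2024, 3, 1, 10, 30),  # older than one hour: outside the window
    datetime(2024, 3, 1, 11, 15),
    datetime(2024, 3, 1, 11, 50),
    datetime(2024, 3, 1, 12, 5),   # in the future: must not leak in
]

feature = txn_count_last_hour(history, decision_time)
print(feature)
```

A leaky version would count all four rows, because a full-dataset aggregate has no idea what "now" was.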

Stage 4: Model choice, and the baseline that comes first

Here's where nervous candidates blurt "deep learning" and sink themselves.

Start with a baseline: simple rules, or logistic regression on a handful of features. It's fast, explainable, and gives you a number to beat. If your fancy model can't beat rules like "block transactions over 5k from brand-new accounts," the fancy model doesn't deserve to exist.
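The rules baseline really can be this small. In the sketch below, the 5k limit comes from the rule above; the seven-day definition of "brand-new," the field names, and the toy transactions are assumptions.

```python
def rule_baseline(txn, amount_limit=5000, new_account_days=7):
    """Block large transactions from brand-new accounts."""
    return txn["amount"] > amount_limit and txn["account_age_days"] < new_account_days


def precision_recall(transactions, predict):
    """Score any block/approve function against known fraud labels."""
    tp = sum(1 for t in transactions if predict(t) and t["fraud"])
    fp = sum(1 for t in transactions if predict(t) and not t["fraud"])
    fn = sum(1 for t in transactions if not predict(t) and t["fraud"])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labeled transactions.
transactions = [
    {"amount": 8000, "account_age_days": 2, "fraud": True},
    {"amount": 6000, "account_age_days": 1, "fraud": False},
    {"amount": 120, "account_age_days": 3, "fraud": True},
    {"amount": 9000, "account_age_days": 400, "fraud": False},
    {"amount": 45, "account_age_days": 900, "fraud": False},
]

number_to_beat = precision_recall(transactions, rule_baseline)
print(number_to_beat)
```

Because the scoring function takes any predict callable, the same harness can score the fancier model later and compare like for like.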

Then step up to gradient boosted trees, which dominate fraud and tabular problems generally. They're strong with mixed feature types, hold up well on imbalanced data, fast enough for real time, and interpretable enough that fraud analysts can see why a transaction scored high.

Then say something like: "Deep learning earns its place if we later add sequence modeling over transaction histories, but I wouldn't start there. It's more infrastructure, slower iteration, and harder to explain, for uncertain gain on tabular data."

That sentence, tradeoffs stated out loud, is worth more than any architecture diagram.

Stage 5: Serving architecture and latency

Now the part backend engineers can genuinely enjoy. The decision must happen inside the payment flow, so you have a budget of maybe 100 to 200 milliseconds end to end, and the model gets a slice of that.

The key insight to voice: heavy features are precomputed. You can't scan 90 days of history during a live request, so aggregates like "spend in the last 24 hours" are maintained by a streaming job and served from a low-latency store. The online path just does lookups plus one model call.
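As a rough illustration of "maintained by a streaming job," here's an in-memory stand-in for a 24-hour spend aggregate. The class and method names are hypothetical, timestamps are plain seconds, and events are assumed to arrive in time order per account; a real system would keep this state in a stream processor and a low-latency store.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 3600


class RollingSpend:
    """Running 24-hour spend per account, updated one event at a time."""

    def __init__(self):
        self._events = defaultdict(deque)  # account_id -> (ts, amount) pairs
        self._totals = defaultdict(float)  # account_id -> spend in window

    def _evict(self, account_id, now):
        # Drop events that have aged out of the window.
        events = self._events[account_id]
        while events and events[0][0] <= now - WINDOW_SECONDS:
            _, amount = events.popleft()
            self._totals[account_id] -= amount

    def record(self, account_id, ts, amount):
        """Called by the streaming side for every completed transaction."""
        self._evict(account_id, ts)
        self._events[account_id].append((ts, amount))
        self._totals[account_id] += amount

    def spend_last_24h(self, account_id, now):
        """Called by the online path: a lookup, not a history scan."""
        self._evict(account_id, now)
        return self._totals[account_id]


store = RollingSpend()
store.record("acct-1", ts=0, amount=100.0)
store.record("acct-1", ts=3600, amount=50.0)

within_window = store.spend_last_24h("acct-1", now=7200)
after_expiry = store.spend_last_24h("acct-1", now=WINDOW_SECONDS + 60)
print(within_window, after_expiry)
```

The read path never touches raw history; it pays only for evicting whatever aged out since the last call.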

Also mention the fallback. If the model service times out, what happens? Probably fail open to rules-based checks, because failing closed means blocking all payments, and that's usually a worse incident than the fraud that slips through during the outage.
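A minimal sketch of that fallback, using a thread pool to put a deadline on the model call. The 50-millisecond timeout, the 0.9 threshold, the rule itself, and the slow_model and fast_model stand-ins are all assumptions; a real service would more likely set the timeout on its RPC client.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def rules_decision(txn):
    """Cheap rules-based check used when the model is unavailable."""
    risky = txn["amount"] > 5000 and txn["account_age_days"] < 7
    return "block" if risky else "approve"


def decide(txn, model_call, timeout_s=0.05, threshold=0.9):
    """Return (decision, source), falling back to rules on timeout or error."""
    future = _executor.submit(model_call, txn)
    try:
        score = future.result(timeout=timeout_s)
    except FutureTimeout:
        return rules_decision(txn), "rules_fallback"
    except Exception:
        # The model call itself failed; same fallback path.
        return rules_decision(txn), "rules_fallback"
    return ("block" if score >= threshold else "approve"), "model"


def slow_model(txn):
    time.sleep(0.3)  # simulates a model service that blows the budget
    return 0.99


def fast_model(txn):
    return 0.95


txn = {"amount": 120, "account_age_days": 400}
print(decide(txn, slow_model))
print(decide(txn, fast_model, timeout_s=1.0))
```

Returning the source alongside the decision matters: the share of decisions served by the fallback is itself a metric worth alerting on.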

Stage 6: Monitoring, retraining, and drift

Most candidates end at deployment. Don't. Fraud is adversarial: the moment your model works, fraudsters change tactics, and your model quietly rots. This is drift, and planning for it is what makes your design production-grade.

Monitor two layers. System health: latency, error rates, feature store freshness. Model health: score distributions shifting, precision and recall on freshly labeled data, alert volume changing week over week.
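For "score distributions shifting," one common choice is the population stability index (PSI), which compares binned score fractions between a baseline window and the current one. This is a sketch of that one technique, not the only way to monitor drift. It assumes scores lie in [0, 1] and both lists are non-empty; the bin count and the floor that avoids log(0) are arbitrary choices, and the alert threshold is left for you to tune.

```python
import math


def bin_fractions(scores, n_bins=10, floor=1e-6):
    """Fraction of scores in each equal-width bin over [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[idx] += 1
    total = len(scores)
    # Floor empty bins so the log ratio below is always defined.
    return [max(c / total, floor) for c in counts]


def psi(baseline_scores, current_scores, n_bins=10):
    """Population stability index: sum of (a - e) * ln(a / e) over bins."""
    expected = bin_fractions(baseline_scores, n_bins)
    actual = bin_fractions(current_scores, n_bins)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Toy data: the share of high scores jumps from 10 percent to 30 percent.
baseline = [0.05] * 90 + [0.95] * 10
current = [0.05] * 70 + [0.95] * 30

no_shift = psi(baseline, baseline)
shifted = psi(baseline, current)
print(no_shift, shifted)
```

Identical distributions give exactly zero, and the value grows as the two drift apart. It needs no labels, so it can fire weeks before the chargebacks arrive.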

Then close the loop: retrain on a schedule using fresh chargeback labels, and retrain early if drift alarms fire.

But how do you know a new model is safe to ship? This is where the distinction between offline and online metrics comes in. Offline, the new model beats the old one on historical data. That's necessary, not sufficient, because history can't tell you how the model behaves against live traffic and live adversaries.

So you roll out in stages. First, shadow deployment: the new model scores every live transaction, but its decisions go nowhere except the logs. You compare its would-be decisions against the current model on real traffic, at zero risk to customers. If shadow looks good, run an A/B rollout: give the new model a small slice of real decisions, maybe 5 percent, and watch the business metrics (fraud losses and false-block complaints), not just model scores. Only then ramp to 100 percent.
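Both stages come down to small pieces of plumbing, sketched here. The function names, the salt string, and the choice to bucket by account ID are assumptions; hashing is used so the same account always lands in the same arm of the experiment.

```python
import hashlib


def shadow_disagreement_rate(live_decisions, shadow_decisions):
    """Share of transactions where the shadow model would have decided differently."""
    pairs = list(zip(live_decisions, shadow_decisions))
    return sum(1 for live, shadow in pairs if live != shadow) / len(pairs)


def in_treatment(account_id, percent, salt="fraud-model-v2"):
    """Deterministically assign an account to the new model's traffic slice."""
    digest = hashlib.sha256(f"{salt}:{account_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < percent


# Shadow stage: compare logged would-be decisions with what actually happened.
live = ["approve", "block", "approve", "approve"]
shadow = ["approve", "block", "block", "approve"]
disagreement = shadow_disagreement_rate(live, shadow)

# A/B stage: roughly 5 percent of accounts get real decisions from the new model.
accounts = [f"acct-{i}" for i in range(10_000)]
treated_share = sum(in_treatment(a, percent=5) for a in accounts) / len(accounts)
print(disagreement, treated_share)
```

The disagreements are where to look first: each one is a transaction the new model would have handled differently, and the question is which model was right.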

The offline-online gap is where careers get humbled

Models that win offline lose online all the time, through leakage, drift, or feedback loops the historical data couldn't show. Shadow and A/B stages exist because offline evaluation is a rehearsal, not the performance.

The mistakes that fail candidates

Having watched this from both sides of the table, I see the same handful of mistakes come up over and over.

Mistake	What the interviewer concludes
Jumping straight to deep learning	Chases hype, hasn't shipped ML
No baseline mentioned	Can't tell if their model adds value
Ignoring class imbalance	Hasn't touched real fraud or risk data
Design ends at deployment	Has never owned a model in production
Silent tradeoffs	Can't collaborate on design decisions

That last one deserves a moment. Interviewers can't grade thoughts, only speech. Narrate your tradeoffs: "Precomputed features are cheaper at request time but can be minutes stale. For fraud velocity checks, staleness is dangerous, so I'd keep those in the streaming path." You just showed judgment, awareness of alternatives, and a decision. Every fork in your design is a chance to do this.

One phrase to keep within reach: "It depends on the constraint, and here's how I'd decide." That's the sound of a senior engineer.

The end of this road, and the start of the next one

That's the framework: clarify, define the label and metric, data and features, baseline then model, serving, monitoring. Run any ML design question through those six stages and you will sound like someone who has shipped this stuff, because you'll be thinking the way people who ship it think.

And that closes out Applied ML Systems. You've gone from fraud detection through credit risk, recommenders, and forecasting, all the way to defending a design under interview pressure. That's a genuinely strong foundation, and you should feel good about it.

Next in the ML Engineer path: Transformers and LLMs, where modern AI takes over. The attention mechanism, why it changed everything, and how the models behind today's AI boom actually work under the hood.