Vizly

Cloud AI: AWS, GCP & Azure for AI Engineers

April 6, 2026 · 11 min
Tags: AI, Cloud, AWS, GCP, Azure, SageMaker

SageMaker, Vertex AI, Azure OpenAI — navigating cloud AI services and knowing when to use managed endpoints vs self-hosting.

GPUs are expensive. Like, really expensive.

You want to train a model. Or maybe just run inference on a big one. You fire up your laptop and... it takes 47 hours. Or crashes. Or both.

Here's the thing: serious AI work needs serious hardware. A single NVIDIA A100 GPU costs roughly 10,000 dollars or more to buy, and you'd need several. Plus the cooling, the power, the maintenance.

That's why the cloud exists. You rent the hardware, use it, and stop paying when you're done.

But which cloud? There are three big players, and they each have their own AI ecosystem.


The big three at a glance

AWS, GCP, and Azure all offer AI services. They all have GPU instances, managed training, model hosting, and pre-built APIs. But they're not the same.

Think of it like choosing a car rental company. They all have cars. But the fleet is different, the pricing works differently, and the experience of picking up your keys varies wildly.

Here's what each one brings to the table.


AWS SageMaker: the Swiss Army knife

Amazon's AI platform is SageMaker. It does... everything. Training, hosting, data labeling, feature engineering, notebooks, pipelines. It's been around since 2017, and it shows — there's tooling for almost every ML workflow you can imagine.

SageMaker Studio is the IDE. Think Jupyter notebooks, but hosted in the cloud with built-in access to GPU instances. You can train a model, track experiments, and deploy — all without leaving the browser.

Training jobs are where it gets interesting. You define your training script, choose an instance type (like ml.p4d.24xlarge with 8 A100 GPUs), and SageMaker handles provisioning, running, and tearing down the infrastructure. You pay only for the time the training job runs.
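To make that concrete, here's a minimal sketch of what a training job request can look like. It only builds the dictionary that boto3's `create_training_job` call accepts; it never talks to AWS. The job name, image URI, role ARN, and S3 path are placeholders, and the 24-hour runtime cap and 200 GB volume are assumptions for illustration.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.p4d.24xlarge",
                               instance_count=1, max_runtime_hours=24):
    """Build keyword arguments shaped like boto3's SageMaker create_training_job.

    Nothing here calls AWS; every argument value below is a placeholder.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # container holding your training script
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,    # ml.p4d.24xlarge has 8 A100 GPUs
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 200,            # assumed scratch disk size
        },
        # Hard stop so a stuck job can't keep billing.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_hours * 3600},
    }


request = build_training_job_request(
    job_name="demo-training-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<image>:latest",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-role>",
    output_s3_uri="s3://<bucket>/output/",
)
print(request["ResourceConfig"]["InstanceType"])
```

With boto3 installed and credentials configured, a dictionary like this would be passed as `boto3.client("sagemaker").create_training_job(**request)`. A real job usually also declares its input data channels.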

Endpoints are for serving. You deploy a trained model, and SageMaker gives you a REST API. Auto-scaling, load balancing, model monitoring — it handles the infrastructure so you focus on the model.

Bedrock is newer. It's Amazon's managed foundation model service. Instead of hosting your own model, you call Claude, Llama, or Mistral through a single API. Pay per token. No infrastructure management.

When to pick AWS: you're already on AWS, you need maximum flexibility, or you want the deepest ecosystem of tools.


GCP Vertex AI: the ML-native platform

Google's approach feels different. Vertex AI is designed as an end-to-end ML platform where everything connects together.

Model Garden is the highlight. Hundreds of pre-trained models — Google's own (Gemini, PaLM) plus open-source ones (Llama, Mistral). Browse, test, and deploy from a catalog. It's like a model app store.

Pipelines use Kubeflow under the hood. You define your ML workflow as a graph — data prep, training, evaluation, deployment — and Vertex runs it. Great for reproducible experiments.
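The "workflow as a graph" idea is easy to see in plain Python. This is not the Kubeflow Pipelines SDK, just a sketch of the concept using the standard library's `graphlib` (Python 3.9+). The step names come from the paragraph above; the step functions are stand-ins.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
pipeline = {
    "data_prep": set(),
    "training": {"data_prep"},
    "evaluation": {"training"},
    "deployment": {"evaluation"},
}


def run_pipeline(graph, steps):
    """Run each step once, in an order that respects the dependencies."""
    executed = []
    for name in TopologicalSorter(graph).static_order():
        steps[name]()              # call the step's function
        executed.append(name)
    return executed


# Stand-in step functions; real ones would launch cloud jobs.
steps = {name: (lambda n=name: print(f"running {n}")) for name in pipeline}
order = run_pipeline(pipeline, steps)
```

A real pipeline engine adds what this sketch leaves out: passing artifacts between steps, caching, retries, and running independent branches in parallel.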

Predictions come in two flavors. Online predictions for real-time inference (like a chatbot). Batch predictions for processing large datasets overnight (like scoring a million customer records).

One thing Google does well: TPUs. These are Google's custom AI chips, and they're only available on GCP. For certain workloads — especially training large transformers — TPUs can be faster and cheaper than GPUs.

When to pick GCP: you want tight integration with Google services, access to TPUs, or the cleanest developer experience for ML.


Azure AI: the enterprise play

Microsoft's approach leans hard into enterprise. Azure ML integrates with Active Directory, compliance tools, and the broader Microsoft ecosystem.

Azure OpenAI Service is the big differentiator. It gives you private, enterprise-grade access to OpenAI models — GPT-4, DALL-E, Whisper. Same models, but deployed in your Azure region, with your data policies, your compliance requirements. For big companies worried about data governance, this is a huge deal.

Azure ML Studio provides the drag-and-drop pipeline builder plus notebooks. Similar to SageMaker in scope, but with a more visual interface.

Responsible AI dashboard is a nice touch — built-in tools for model fairness, interpretability, and error analysis. If you need to explain your model's decisions to regulators, Azure makes it easier.

When to pick Azure: you're an enterprise already on Microsoft, you need Azure OpenAI's data governance, or compliance is a top priority.


Quick comparison

Each cloud has strengths in different areas.

| Feature | AWS | GCP | Azure |
|---|---|---|---|
| ML Platform | SageMaker | Vertex AI | Azure ML |
| Foundation Models | Bedrock | Model Garden | Azure OpenAI |
| Custom Hardware | Inferentia, Trainium | TPU | Maia (coming) |
| Notebook IDE | SageMaker Studio | Colab Enterprise | ML Studio |
| Best For | Flexibility, ecosystem | ML-native experience | Enterprise, compliance |
| Pricing Model | Pay-per-use, complex | Pay-per-use, simpler | Pay-per-use, enterprise deals |

Honestly, all three are competent. If your company already uses one cloud, stay there. The migration cost of switching clouds for AI alone almost never makes sense.


Managed endpoints vs self-hosting

This is the real decision. Not "which cloud" but "how much control do you need?"

Managed endpoints (SageMaker Endpoints, Vertex AI Predictions) are the middle ground between calling someone else's API and running everything yourself. You bring the model. The cloud handles scaling, load balancing, health checks, and rollbacks. You get a URL. You call it. Models go up, models come down. You don't manage Kubernetes clusters at 2 AM.

The tradeoff? Less control over the runtime. You're locked into the cloud provider's container images and scaling policies. And the cost per request is higher than raw compute — you're paying for the convenience.

Self-hosting means renting GPU instances (EC2 p4d, GCE a2-ultragpu, etc.) and running the model yourself. Maybe with vLLM, TGI, or Triton Inference Server. You manage the containers, the scaling, the health checks.

More work. But full control. And often cheaper at scale because you're not paying the managed service markup.

API providers (OpenAI, Anthropic, Google) are the easiest path. No model to host. No GPU to manage. Just an API key and a credit card. Perfect for prototyping and many production use cases.


The GPU instance jungle

All three clouds offer GPU instances. The naming is confusing. Here's the cheat sheet.

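| GPU | AWS | GCP | Azure | Good For |
|---|---|---|---|---|
| T4 (16GB) | g4dn | n1 + T4 | NC T4 | Inference, small training |
| A10G / L4 / A10 (24GB) | g5 (A10G) | g2 (L4) | NV A10 (A10) | Medium inference |
| A100 (40GB / 80GB) | p4d (40GB), p4de (80GB) | a2-ultragpu (80GB) | NC A100 (80GB) | Large training, big models |
| H100 (80GB) | p5 | a3 | NC H100 | Frontier models, fast training |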

Spot instances are the hack. All three clouds sell unused GPU capacity at steep discounts, often 60-90% off the on-demand price. The catch: the instance can be reclaimed with little notice, typically somewhere between 30 seconds and 2 minutes of warning. Great for training (just checkpoint frequently). Terrible for serving.
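"Checkpoint frequently" is the whole trick, so here's a toy version of it. This sketch simulates a reclaim partway through a run and shows the next run resuming from the last saved step. The "model state" is just a step counter and the checkpoint is a local JSON file; both are stand-ins. A real job would save model and optimizer state to durable storage such as S3 or GCS.

```python
import json
import os
import tempfile


def train(total_steps, checkpoint_path, interrupt_at=None, checkpoint_every=10):
    """Toy training loop that resumes from the last checkpoint if one exists."""
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]        # resume where we left off

    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            return step                         # simulated spot reclaim
        step += 1                               # stand-in for one training step
        if step % checkpoint_every == 0 or step == total_steps:
            # Write to a temp file, then rename, so a reclaim mid-write
            # can't leave a half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"step": step}, f)
            os.replace(tmp_path, checkpoint_path)
    return step


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "checkpoint.json")
stopped_at = train(100, ckpt, interrupt_at=57)   # reclaimed after 57 steps
finished_at = train(100, ckpt)                   # new instance resumes from step 50
print(stopped_at, finished_at)
```

The reclaim at step 57 only costs the 7 steps since the last checkpoint. Checkpointing more often shrinks that loss but spends more time writing, so the interval is a tuning knob.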

On-demand instances guarantee availability but cost full price. For inference endpoints that need to be always-on, this is usually the way.


When to use API providers vs self-host

This is a surprisingly simple decision once you know the variables.

Use API providers when:

  • You're using a frontier model (GPT-4, Claude, Gemini) — you literally can't self-host these
  • Your traffic is bursty or unpredictable — pay-per-token scales naturally
  • You're prototyping or moving fast — zero infrastructure setup
  • Your volume is low to moderate — the per-token cost is worth the convenience

Self-host when:

  • You need a fine-tuned open-source model (Llama, Mistral) on your own data
  • Compliance requires data to never leave your infrastructure
  • You have consistent, high-volume traffic — the economics flip at scale
  • You need custom inference optimization (batching, quantization, speculative decoding)

The sweet spot for many teams? Start with API providers, measure your usage, and migrate high-volume endpoints to self-hosted as you scale.
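"The economics flip at scale" can be turned into a back-of-the-envelope number. The sketch below asks: at what monthly token volume does an always-on GPU cost the same as paying per token? The 4 dollars per hour and 10 dollars per million tokens are assumed midpoints of the rough ranges quoted in the cost section, not real quotes.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def break_even_tokens_per_month(gpu_dollars_per_hour,
                                api_dollars_per_million_tokens,
                                gpu_count=1):
    """Monthly token volume at which always-on GPUs cost the same as an API."""
    self_host_monthly = gpu_dollars_per_hour * HOURS_PER_MONTH * gpu_count
    return self_host_monthly / api_dollars_per_million_tokens * 1_000_000


tokens = break_even_tokens_per_month(gpu_dollars_per_hour=4.0,
                                     api_dollars_per_million_tokens=10.0)
print(f"{tokens / 1e6:.0f} million tokens per month")   # 292 million tokens per month
```

This is deliberately crude. It ignores whether one GPU can actually serve that many tokens, the engineering time to run it, and any quality gap between the open-source model and the API model. Treat it as a first filter, not a verdict.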


The hybrid approach

Most real-world AI systems don't pick just one option. They mix and match.

Your chatbot uses Claude via Anthropic's API — it's a frontier model and traffic is unpredictable. Your embedding model runs on a self-hosted GPU instance — it's high-volume and a small open-source model. Your batch classification job runs on SageMaker with spot instances — it's cost-sensitive and fault-tolerant.

This is normal. The "right" approach is the one that matches each workload to the best serving option.


The decision framework

When you're staring at a new AI workload, ask these questions in order:

  1. Can an API provider do this? If yes, start there. Fastest path to production.
  2. Does data governance prevent using external APIs? If yes, look at managed endpoints (Azure OpenAI, SageMaker, Vertex).
  3. Is cost a problem at current volume? If yes, consider self-hosting with open-source models.
  4. Do you need custom model modifications? If yes, self-host with your fine-tuned model.
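The four questions can be written as a tiny function. One interpretation is baked in and it's an assumption: a later "yes" overrides the choice made by an earlier question. The fallback when no API provider fits (bring your own model to a managed endpoint) is also an assumption.

```python
def choose_serving_option(api_can_do_it, governance_blocks_external_apis,
                          cost_is_a_problem, needs_custom_model):
    """Walk the four questions in order; later 'yes' answers override earlier ones."""
    # Assumed fallback when no API provider can do the job.
    choice = "API provider" if api_can_do_it else "managed endpoint"
    if governance_blocks_external_apis:
        choice = "managed endpoint"
    if cost_is_a_problem:
        choice = "self-host an open-source model"
    if needs_custom_model:
        choice = "self-host your fine-tuned model"
    return choice


# A prototype chatbot: an API works, no governance or cost pressure yet.
print(choose_serving_option(True, False, False, False))
# Same workload once regulated data is involved.
print(choose_serving_option(True, True, False, False))
```

Real decisions have more nuance than four booleans, but writing it down this way makes the default obvious: you stay on the API until something forces you off it.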

Most teams overthink this. Start simple. Optimize later.


Cost reality check

Cloud AI spending surprises people. Here's a rough feel for 2026 pricing:

  • API calls (GPT-4 class): roughly 5-15 dollars per million input tokens. Sounds cheap until you're processing 10 million documents.
  • Managed endpoints (SageMaker with a T4): around 0.50-1.00 dollars per hour. Always on (about 730 hours a month) = roughly 365-730 dollars per month.
  • GPU instances (A100 on-demand): around 3-5 dollars per hour. Spot can be under a dollar per hour, but availability varies.
  • Training a large model: hundreds to tens of thousands of dollars depending on model size, dataset size, and training time.

The most expensive mistake? Leaving GPU instances running when nobody's using them. Set up auto-shutdown policies. Seriously.
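The arithmetic behind these numbers is simple enough to keep in a scratch file. A minimal sketch follows. The hourly and per-token rates are the rough ranges from the list above, and the 1,000 tokens per document is purely an assumption to make the "10 million documents" example concrete.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def always_on_monthly_cost(dollars_per_hour):
    """Monthly bill for an instance that never shuts down."""
    return dollars_per_hour * HOURS_PER_MONTH


def api_cost(documents, tokens_per_document, dollars_per_million_tokens):
    """Input-token cost of pushing a document set through a per-token API."""
    total_tokens = documents * tokens_per_document
    return total_tokens / 1_000_000 * dollars_per_million_tokens


# T4 managed endpoint at 0.50 to 1.00 dollars per hour
print(always_on_monthly_cost(0.50), always_on_monthly_cost(1.00))        # 365.0 730.0

# 10 million documents, ASSUMING 1,000 input tokens each,
# at 5 to 15 dollars per million input tokens
print(api_cost(10_000_000, 1_000, 5), api_cost(10_000_000, 1_000, 15))   # 50000.0 150000.0
```

The always-on figures land in the same ballpark as the monthly range above. The document example shows why per-token pricing stops feeling cheap at volume.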


Reserved capacity and savings plans

If you know you'll be using GPU compute consistently, all three clouds offer reservation discounts.

AWS Savings Plans and Reserved Instances give you 30-60% off in exchange for a 1- or 3-year commitment. With a Savings Plan, you commit to a certain spend level per hour, and any usage up to that level is discounted.

GCP Committed Use Discounts work similarly. Commit to a specific instance type for 1 or 3 years, get significant savings.

Azure Reserved VM Instances follow the same pattern.

The math is straightforward. If you're running a GPU instance 24/7 for inference, the reserved price is almost always worth it. If your usage is sporadic, stick with on-demand and optimize for auto-scaling to zero when idle.
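Here's that math as a sketch. It assumes a simplified model: the reservation covers exactly one instance and bills every hour at the discounted rate, while on-demand bills only the hours you use. The 4 dollars per hour and 40% discount are illustrative, not quoted prices.

```python
def break_even_utilization(discount):
    """Fraction of hours an instance must run for a reservation to match on-demand.

    A reservation bills every hour at (1 - discount) of the on-demand rate;
    on-demand bills only the hours you actually use.
    """
    return 1.0 - discount


def cheaper_option(on_demand_rate, discount, hours_used, hours_in_period=730):
    """Compare one period of reserved billing against on-demand billing."""
    reserved_cost = on_demand_rate * (1.0 - discount) * hours_in_period
    on_demand_cost = on_demand_rate * hours_used
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"


print(cheaper_option(4.0, 0.40, hours_used=730))   # running 24/7 -> reserved
print(cheaper_option(4.0, 0.40, hours_used=200))   # sporadic use -> on-demand
```

With a 40% discount the break-even sits at 60% utilization. Run the instance more than that and the commitment pays off; run it less and on-demand wins.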

One more trick: serverless inference. Both SageMaker and Vertex offer serverless endpoints that scale to zero when there's no traffic. You pay nothing when idle, and the endpoint spins up when a request arrives. The downside is cold start latency — the first request after an idle period takes longer. Fine for batch-ish workloads. Not great for real-time chatbots.


What's next?

You've picked a cloud and figured out how to deploy. But how do you manage the whole lifecycle? How do you track experiments, version models, test deployments, and monitor drift? That's the world of MLOps — treating AI systems with the same engineering rigor as software. Next up: experiment tracking, model registries, and CI/CD for models.
