Your First AI App with HuggingFace · AI Engineer

The GitHub of AI

If you've been reading about AI models, you've probably seen "HuggingFace" mentioned everywhere. Every time someone releases a new model, it ends up on HuggingFace. Every research paper includes a HuggingFace link. It's become the default place where AI lives.

Think of it like GitHub, but for machine learning. GitHub hosts code. HuggingFace hosts models, datasets, and demo apps. And just like you can git clone a repo and start using it, you can load a HuggingFace model in three lines of Python and start running predictions.

That's what makes it special. You don't need to train anything from scratch. Thousands of pre-trained models are sitting there, ready to use.

The ecosystem

HuggingFace isn't just one thing. It's a collection of tools that work together.

The Hub is the central website. Over 500,000 models live there. You can browse, search, filter by task, and read model cards that explain what each model does and how well it performs.

The Transformers library is the Python package you install locally. It knows how to download models from the Hub and run them. Install it with pip install transformers, plus a backend such as PyTorch to do the actual computation, and you're set.

The Datasets library gives you access to thousands of curated datasets, from text classification to image captioning. Same idea — one line of code to download and start using.

Spaces lets you deploy small demo apps (usually Gradio or Streamlit) for free. You can try someone's model in a browser without installing anything.

Your first pipeline

Here's the fastest way to do something useful with HuggingFace. Install the library along with PyTorch, which it uses to run the models:

pip install transformers torch

Now run a text classification model:

from transformers import pipeline
 
classifier = pipeline("sentiment-analysis")
result = classifier("I love building AI apps!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

That's it. Three lines. The pipeline function is HuggingFace's one-liner interface. It handles downloading the model, loading the weights, tokenizing your input, running inference, and formatting the output. You just tell it what task you want.

Behind the scenes, it picked a default model for sentiment analysis (distilbert-base-uncased-finetuned-sst-2-english), downloaded it the first time, cached it locally, and ran your text through it.
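To make the "formatting the output" step more concrete, here is a minimal sketch of how raw model scores (logits) can be turned into the label-and-score dictionary shown above. The logits, the `ID2LABEL` mapping, and the `postprocess` helper are hypothetical illustrations, not the library's internal code; a real pipeline gets the logits from the model and the label names from the model's config.

```python
import math

# Hypothetical label mapping; a real model stores this in its config.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits):
    """Pick the most probable label and report its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

# Made-up logits for illustration only.
example = postprocess([-4.2, 4.3])
print(example)
```

The score is just the softmax probability of the winning class, which is why a confident prediction lands so close to 1.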

What tasks can pipelines do?

Pipelines cover a surprisingly wide range of tasks. Here are the most common ones:

Task	What It Does	Example Use
`sentiment-analysis`	Positive/negative classification	Product reviews
`summarization`	Condense long text	Meeting notes
`translation_en_to_fr`	Translate English to French	Localization
`text-generation`	Continue a prompt	Creative writing
`question-answering`	Answer from a context	FAQ bots
`fill-mask`	Predict missing words	Autocomplete
`zero-shot-classification`	Classify without training	Topic tagging
`text2text-generation`	General text transformation	Paraphrasing

Each one works the same way. Change the task string, and the pipeline picks an appropriate default model.

summarizer = pipeline("summarization")
text = "HuggingFace is a platform that hosts AI models..."
result = summarizer(text, max_length=50)
print(result[0]["summary_text"])

You can also specify which model you want instead of using the default:

generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)
print(result[0]["generated_text"])

This is useful when you know a specific model works better for your use case.

How tokenizers work

When you pass text to a model, it doesn't see words. Models work with numbers. So your text needs to be converted into a sequence of numbers first. That's what a tokenizer does.

Each model has its own tokenizer. A BERT tokenizer and a GPT-2 tokenizer split text differently and produce different token IDs. They learned different vocabularies during training. You always need to use the tokenizer that matches your model.

from transformers import AutoTokenizer
 
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("I love building apps")
print(tokens)
# {'input_ids': [101, 1045, 2293, 2311, 6078, 102],
#  'token_type_ids': [0, 0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1, 1]}

The input_ids are the numbers the model sees. The token_type_ids mark which segment each token belongs to, which only matters when you pass sentence pairs. The attention_mask tells the model which tokens are real content (1) versus padding (0). When you batch multiple inputs of different lengths, shorter ones get padded to match the longest.
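The padding behaviour described above can be sketched in plain Python. This is only an illustration of the idea, with made-up token IDs and a hypothetical `pad_batch` helper; in practice the tokenizer does this for you when you call it on a list of texts with `padding=True`.

```python
PAD_ID = 0  # assumed padding token ID (BERT's vocabulary uses 0 for [PAD])

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-ID lists to the longest length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * padding)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two inputs of different lengths (illustrative IDs only).
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence gains two padding IDs, and its mask ends in two zeros so the model ignores them.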

You can also see what tokens the text was split into:

tokenizer.tokenize("I love building apps")
# ['i', 'love', 'building', 'apps']

Common words stay whole. Uncommon words get broken into subword pieces. BERT does this with an algorithm called WordPiece, while GPT-2 uses byte-level BPE (Byte Pair Encoding). Either way, subword tokenization lets the model handle almost any text, including words it has never seen.
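Here is a toy sketch of how a word can be split into subword pieces. It uses greedy longest-match splitting in the style of WordPiece, which is a different algorithm from BPE's learned merges and was picked here only because it is short to write. The vocabulary and the `wordpiece` function are hypothetical; real tokenizers learn vocabularies with tens of thousands of entries.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"i", "love", "build", "##ing", "token", "##ization", "##s", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return ["[UNK]"]  # nothing fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("love"))
print(wordpiece("building"))
print(wordpiece("tokenization"))
```

A word that is in the vocabulary comes back whole, while a longer word is assembled from a known stem plus continuation pieces.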

Choosing the right model

The Hub has hundreds of thousands of models. How do you pick one?

Model cards are your best friend. Every good model on HuggingFace has a model card — a page that explains what the model was trained on, what tasks it's good at, its limitations, and benchmark scores. Read these before using any model.

Some practical filters:

By task. The Hub lets you filter models by task (text classification, summarization, etc.). Start here.

By downloads. Popular models are popular for a reason. They tend to be well-tested and reliable.

By size. Larger models are generally more capable but slower and need more memory. A 125M parameter model runs on any laptop. A 7B parameter model needs a good GPU. A 70B model needs multiple GPUs.

By license. Not all models are truly open. Check the license — some restrict commercial use, some require attribution, some are fully permissive (Apache 2.0, MIT).
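For the size filter, a rough rule of thumb is that the weights alone take about (number of parameters) × (bytes per parameter). The sketch below applies that arithmetic to the three sizes mentioned above. It is only an estimate under that assumption: it ignores activations, caches, and framework overhead, so real memory use is higher. The `weight_memory_gb` helper is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="float16"):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("125M", 125e6), ("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB of weights in float16")
```

At half precision this works out to roughly 0.25 GB, 14 GB, and 140 GB, which lines up with the laptop, single good GPU, and multiple GPU tiers.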

Start small

When prototyping, pick a small model first. DistilBERT (66M parameters) is often good enough for text classification. You can always upgrade to a larger model once you've validated your approach works.

Running inference: local vs API

You have two main options for running models.

Local inference means downloading the model and running it on your machine. This is what pipeline() does by default. Pros: free, private (your data never leaves your machine), no rate limits. Cons: needs hardware (especially GPU for large models), you manage everything yourself.

Inference API means sending your text to HuggingFace's servers and getting results back via HTTP. Pros: no hardware needed, works from any device. Cons: latency from network round-trips, rate limits on the free tier, your data goes to their servers.

# Local inference
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("This movie was great")
 
# API inference
import requests
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer YOUR_TOKEN"}
response = requests.post(API_URL, headers=headers, json={"inputs": "This movie was great"})
print(response.json())

For prototyping and learning, local is easier. For production apps where you don't want to manage GPU servers, the API (or the paid Inference Endpoints) makes more sense.
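If you do go the API route, it helps to keep the token out of your source code. The sketch below builds the same request as the API example above but reads the token from an environment variable. The `build_request` helper is hypothetical, `HF_TOKEN` is the variable name assumed here, and the base URL is copied from the example above rather than verified against the current service.

```python
import os

API_BASE = "https://api-inference.huggingface.co/models"  # base URL from the example above

def build_request(model_id, text, token=None):
    """Return the URL, headers, and JSON payload for one inference call."""
    token = token or os.environ.get("HF_TOKEN")  # HF_TOKEN is an assumed variable name
    if not token:
        raise RuntimeError("Set HF_TOKEN or pass a token explicitly")
    return {
        "url": f"{API_BASE}/{model_id}",
        "headers": {"Authorization": f"Bearer {token}"},
        "json": {"inputs": text},
    }

request = build_request(
    "distilbert-base-uncased-finetuned-sst-2-english",
    "This movie was great",
    token="YOUR_TOKEN",  # placeholder; omit this argument to read HF_TOKEN instead
)
print(request["url"])
```

To send it, pass the pieces to `requests.post(request["url"], headers=request["headers"], json=request["json"])`.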

Spaces: instant demos

Spaces is HuggingFace's free hosting for demo apps. Most Spaces use either Gradio (a Python library for building quick ML interfaces) or Streamlit.

You can create a Space by pushing a simple app.py file to a Git repo on HuggingFace. The platform builds and hosts it automatically.

# A complete Gradio app in a few lines
import gradio as gr
from transformers import pipeline
 
classifier = pipeline("sentiment-analysis")
 
def predict(text):
    return classifier(text)[0]
 
gr.Interface(fn=predict, inputs="text", outputs="json").launch()

This gives you a web interface where anyone can type text and see the sentiment prediction. No frontend code, no deployment headaches.

Spaces are great for sharing your work, demoing ideas to teammates, or just experimenting with models interactively.

Putting it together: a practical example

Here's a slightly more realistic example. You want to build a text summarizer that also classifies the sentiment of the summary.

from transformers import pipeline
 
# Load two models
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
classifier = pipeline("sentiment-analysis")
 
article = """
The company reported record quarterly earnings, driven by strong
demand for its AI products. Revenue grew 35% year-over-year,
exceeding analyst expectations. The CEO highlighted new partnerships
and an expanding customer base as key growth drivers. However,
operating costs also increased due to infrastructure investments.
"""
 
# Summarize
summary = summarizer(article, max_length=60, min_length=20)
summary_text = summary[0]["summary_text"]
print(f"Summary: {summary_text}")
 
# Classify the summary's sentiment
sentiment = classifier(summary_text)
print(f"Sentiment: {sentiment[0]['label']} ({sentiment[0]['score']:.2f})")

Two pipelines chained together. The output of one becomes the input of the next. This pattern — combining simple building blocks into useful workflows — is the foundation of everything you'll build with AI.
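One way to keep a chain like this easy to test is to wrap it in a function that accepts the two pipelines as arguments. The sketch below does that and runs it with stand-in functions that only mimic the pipelines' return shapes, so it works without downloading any model. The names `summarize_then_classify`, `fake_summarizer`, and `fake_classifier` are hypothetical.

```python
def summarize_then_classify(article, summarizer, classifier):
    """Chain two pipeline-like callables: summarize first, then classify the summary."""
    summary_text = summarizer(article)[0]["summary_text"]
    sentiment = classifier(summary_text)[0]
    return {
        "summary": summary_text,
        "label": sentiment["label"],
        "score": sentiment["score"],
    }

# Stand-ins that mimic the return shapes of the real pipelines (not real models).
def fake_summarizer(text):
    first_sentence = text.strip().split(".")[0] + "."
    return [{"summary_text": first_sentence}]

def fake_classifier(text):
    label = "POSITIVE" if "record" in text else "NEGATIVE"
    return [{"label": label, "score": 1.0}]

report = summarize_then_classify(
    "The company reported record earnings. Costs also rose.",
    fake_summarizer,
    fake_classifier,
)
print(report)
```

Swapping in the real `summarizer` and `classifier` objects from the example above should work the same way, since the function relies only on the shape of their return values.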

Common gotchas

A few things that trip people up early on:

Model downloads are big. The first time you use a model, it downloads the weights. DistilBERT is about 250MB. Larger models can be several gigabytes. They're cached in ~/.cache/huggingface/, so subsequent loads are fast.

GPU vs CPU. Pipelines run on CPU by default. For small models this is fine. For anything over ~1B parameters, you'll want a GPU. Add device=0 to use the first GPU:

pipe = pipeline("text-generation", model="gpt2", device=0)

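Tokenizer limits. Every model has a maximum context length. BERT handles 512 tokens. GPT-2 handles 1024. Longer inputs aren't handled gracefully for you: depending on the pipeline and its settings, the extra text is cut off or the model raises an error. You need to handle long documents by chunking them yourself.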

Model vs task mismatch. Don't use a summarization model for question answering. Each model was trained for specific tasks. The pipeline may still load and run, often with nothing more than a warning about newly initialized weights, but the output will be nonsense.
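For the tokenizer-limit gotcha, chunking can be sketched as a sliding window over the token IDs. This is a minimal illustration, assuming you already have the IDs as a list; `chunk_tokens` and its default values are hypothetical. The overlap keeps some shared context between neighbouring chunks.

```python
def chunk_tokens(token_ids, max_len=512, overlap=50):
    """Split a long list of token IDs into overlapping windows of at most max_len."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks

ids = list(range(1200))  # stand-in for real token IDs
chunks = chunk_tokens(ids)
print([len(c) for c in chunks])
```

With a real model, set max_len a little below the model's limit to leave room for special tokens, then run each chunk through the pipeline and combine the results.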

What's next?

You now know how to load pre-trained models, run them on your own data, and build simple applications by chaining pipelines together. That's a solid starting point.

But what happens when the pre-trained model isn't quite right for your use case? The sentiment classifier works, but it doesn't understand your company's internal jargon. The summarizer is good, but it misses the key points in your specific domain.

That's when you fine-tune. In the next article, we'll roll up our sleeves and learn how to customize a model for your exact needs using PyTorch — the training loop, dataloaders, optimizers, and all the practical details that tutorials usually skip.