AI Tools.

Updated daily from HuggingFace

Open-source AI models,
compared at a glance.

1,045 models · 41 pipelines · 2,339,634,092 total downloads tracked. Use cases, pros, cons, and alternatives for each.

Browse by pipeline

41 categories of AI models

text generation

234

Top: Qwen3-0.6B

Browse →

image text to text

129

Top: Qwen3-VL-2B-Instruct

Browse →

sentence similarity

77

Top: all-MiniLM-L6-v2

Browse →

feature extraction

62

Top: bge-small-en-v1.5

Browse →

automatic speech recognition

57

Top: speaker-diarization-3.1

Browse →

fill mask

43

Top: bert-base-uncased

Browse →

text classification

37

Top: bge-reranker-v2-m3

Browse →

image classification

35

Top: mobilenetv3_small_100.lamb_in1k

Browse →

time series forecasting

25

Top: chronos-2

Browse →

zero shot image classification

22

Top: clip-vit-large-patch14

Browse →

text ranking

20

Top: ms-marco-MiniLM-L6-v2

Browse →

token classification

17

Top: bert-base-NER

Browse →

translation

16

Top: t5-small

Browse →

text to image

16

Top: stable-diffusion-xl-base-1.0

Browse →

text to speech

13

Top: Kokoro-82M

Browse →

image to text

11

Top: GLM-OCR

Browse →

any to any

11

Top: gemma-4-E4B-it

Browse →

image feature extraction

11

Top: vit-base-patch16-224-in21k

Browse →

audio classification

9

Top: clap-htsat-fused

Browse →

image to video

7

Top: LTX-2.3

Browse →

object detection

6

Top: table-transformer-detection

Browse →

depth estimation

6

Top: Depth-Anything-V2-Small-hf

Browse →

image segmentation

6

Top: RMBG-1.4

Browse →

summarization

4

Top: bart-large-cnn

Browse →

zero shot object detection

4

Top: grounding-dino-base

Browse →

question answering

4

Top: electra_large_discriminator_squad2_512

Browse →

zero shot classification

3

Top: bart-large-mnli

Browse →

audio to audio

3

Top: bigvgan_v2_22khz_80band_256x

Browse →

video classification

3

Top: kandinsky-videomae-large-camera-motion

Browse →

audio text to text

3

Top: ultravox-v0_5-llama-3_2-1b

Browse →

voice activity detection

2

Top: segmentation-3.0

Browse →

mask generation

2

Top: sam3

Browse →

image to 3d

1

Top: TRELLIS-image-large

Browse →

keypoint detection

1

Top: vitpose-plus-base

Browse →

robotics

1

Top: openvla-7b

Browse →

text to audio

1

Top: musicgen-medium

Browse →

table question answering

1

Top: tapex-base-finetuned-wikisql

Browse →

tabular classification

1

Top: mitra-classifier

Browse →

image text to image

1

Top: Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

Browse →

image to image

1

Top: FLUX.2-klein-base-4B

Browse →

visual document retrieval

1

Top: jina-embeddings-v4

Browse →

Top by downloads

Most popular models across all pipelines

all-MiniLM-L6-v2

sentence-similarity

Distilled BERT model that encodes sentences into 384-dimensional vectors for measuring semantic similarity. Trained on over a billion sentence pairs spanning scientific papers, web QA, NLI datasets, and community forums. At 22M parameters and 6 transformer layers, it is fast enough for CPU inference while remaining competitive on standard sentence similarity benchmarks.

239,973,503 downloads · 4,754 likes
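The 384-dimensional vectors this model produces are typically compared with cosine similarity downstream. A minimal sketch of that comparison step, using made-up stand-in vectors rather than real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for sentence embeddings; a real pipeline would obtain
# 384-dimensional vectors from the model itself.
emb_a = np.array([0.1, 0.3, 0.5])
emb_b = np.array([0.1, 0.3, 0.5])
emb_c = np.array([-0.5, 0.2, -0.1])

same = cosine_similarity(emb_a, emb_b)  # identical vectors score 1.0
diff = cosine_similarity(emb_a, emb_c)  # dissimilar vectors score lower
```

Scores range from -1 to 1; in practice a threshold or top-k cutoff over these scores drives semantic search and deduplication.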

Qwen3-VL-2B-Instruct

image-text-to-text

Qwen3-VL-2B-Instruct is a 2-billion-parameter vision-language model from Alibaba Cloud that jointly processes images and text for visual question answering, captioning, and document understanding. Its 2B scale positions it as one of the smaller instruction-tuned VLMs capable of zero-shot visual reasoning. Apache 2.0 licensed.

186,904,434 downloads · 386 likes

bert-base-uncased

fill-mask

Google's original BERT base model in uncased form, pre-trained on BookCorpus and English Wikipedia via masked language modeling. Tokens are lowercased before processing, making it insensitive to capitalization. It remains a standard fine-tuning base for classification, NER, and extractive QA, though newer encoders outperform it on most benchmarks.

59,598,776 downloads · 2,641 likes

ELECTRA base discriminator from Google, pre-trained using replaced token detection rather than masked language modeling. A small generator produces candidate replacements; this model learns to identify which tokens were swapped — a task that uses every token for training signal, making pre-training more efficient than BERT per compute dollar. Intended as a fine-tuning base for classification and token-level tasks.

55,141,992 downloads · 102 likes

Multilingual sentence embedding model covering 50+ languages, built on a 12-layer distilled MiniLM architecture. Produces 384-dimensional vectors designed for semantic similarity and paraphrase detection across language boundaries. Trained on multilingual paraphrase data to align semantically equivalent sentences even when expressed in different languages.

44,875,889 downloads · 1,218 likes

ms-marco-MiniLM-L6-v2

text-ranking

Cross-encoder reranker trained on the MS MARCO passage retrieval dataset, designed to score query-document pairs jointly rather than encoding them independently. Distilled from a 12-layer cross-encoder into 6 layers to reduce latency while retaining re-ranking accuracy. Used as a second-stage ranker on top of fast first-stage retrieval (BM25 or bi-encoder).

40,186,774 downloads · 229 likes
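The two-stage pattern this card describes can be sketched as follows. The scoring functions here are toy stand-ins (term overlap instead of BM25, a count-based proxy instead of the actual cross-encoder), purely to show the control flow:

```python
import numpy as np

def first_stage(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Stand-in for fast first-stage retrieval (BM25 or a bi-encoder):
    # score each document by raw term overlap with the query.
    q_terms = set(query.lower().split())
    scores = [len(q_terms & set(d.lower().split())) for d in docs]
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def cross_encoder_score(query: str, doc: str) -> float:
    # Stand-in for the cross-encoder, which scores the (query, doc)
    # pair jointly; here a toy length-normalized term-count proxy.
    d_terms = doc.lower().split()
    return sum(d_terms.count(t) for t in query.lower().split()) / (1 + len(d_terms))

def rerank(query: str, docs: list[str], k: int = 3) -> list[str]:
    candidates = first_stage(query, docs, k)   # cheap recall stage
    return sorted(candidates,                  # precise re-ranking stage
                  key=lambda d: cross_encoder_score(query, d),
                  reverse=True)
```

The expensive second stage only ever sees the top-k candidates from the first, which is what keeps joint query-document scoring affordable at corpus scale.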

all-mpnet-base-v2

sentence-similarity

Sentence embedding model based on the MPNet architecture, producing 768-dimensional vectors. Trained on over a billion sentence pairs from MS MARCO, NLI datasets, and community QA forums, it is a frequent choice among English embedding models when accuracy matters more than inference speed. The MPNet backbone combines masked and permuted language modeling during pre-training for stronger representations.

36,513,639 downloads · 1,287 likes

bge-small-en-v1.5

feature-extraction

Small English dense embedding model from BAAI's BGE (BAAI General Embedding) series, producing 384-dimensional vectors; released under the MIT license. Optimized for MTEB retrieval benchmarks through a retrieval-focused training strategy, it achieves competitive scores relative to its parameter count. Suited for embedding workflows where throughput and cost matter more than peak accuracy.

34,386,222 downloads · 451 likes

clip-vit-large-patch14

zero-shot-image-classification

OpenAI's CLIP model using a ViT-L/14 image encoder, trained contrastively on 400 million image-text pairs from the internet. It aligns image and text in a shared embedding space, enabling zero-shot image classification by comparing image embeddings against text label embeddings. The ViT-L/14 variant offers higher accuracy than the smaller ViT-B/32 at greater compute cost.

25,187,308 downloads · 2,000 likes
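The zero-shot classification mechanism this card describes reduces to comparing one image embedding against one text embedding per candidate label in the shared space. A simplified sketch with made-up embeddings (real CLIP also scales the logits by a learned temperature, omitted here):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is closest to the image
    embedding: cosine similarity in the shared space, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img                             # cosine similarities
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over labels
    return labels[int(np.argmax(probs))], probs

# Made-up embeddings standing in for CLIP encoder outputs.
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([[1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
                      [0.0, 1.0, 0.0]])  # e.g. "a photo of a dog"
label, probs = zero_shot_classify(image_emb, text_embs, ["cat", "dog"])
```

Because the label set is just a list of text prompts, it can be changed at inference time without retraining anything.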

mobilenetv3_small_100.lamb_in1k

image-classification

MobileNetV3 small model at 100% width multiplier, trained on ImageNet-1k using the LAMB optimizer via the timm library. At under 3M parameters, it targets image classification on mobile and edge hardware where latency and memory are primary constraints. Part of timm's standardized pretrained model zoo with consistent preprocessing and inference APIs.

22,549,780 downloads · 66 likes

chronos-2

time-series-forecasting

Chronos-2 is Amazon's second-generation pretrained foundation model for zero-shot time-series forecasting. It frames forecasting as a language modeling problem over quantized time-series tokens using a T5 encoder-decoder architecture, enabling it to forecast across diverse domains without per-dataset training. Released under Apache 2.0.

21,797,667 downloads · 266 likes
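The "forecasting as language modeling" framing rests on quantizing real values into a discrete token vocabulary. A simplified sketch of that tokenization step; Chronos-style tokenizers likewise use mean scaling and fixed bins, but the exact boundaries and special tokens here are illustrative, not the model's actual scheme:

```python
import numpy as np

def tokenize_series(values, n_bins: int = 10):
    """Map a real-valued series to discrete bin ids ('tokens') so an
    encoder-decoder language model can consume it."""
    v = np.asarray(values, dtype=float)
    scale = np.abs(v).mean() or 1.0               # simple mean scaling
    edges = np.linspace(-3.0, 3.0, n_bins - 1)    # fixed bin boundaries
    tokens = np.digitize(v / scale, edges)        # one bin id per step
    return tokens, scale

tokens, scale = tokenize_series([10.0, 12.0, 11.0, 15.0])
```

Forecasting then becomes next-token prediction over these ids, and sampled tokens are mapped back through the bin centers and the stored scale to produce numeric forecasts.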

nsfw_image_detection

image-classification

Vision Transformer (ViT) fine-tuned for binary NSFW vs. safe image classification. Provides a single classifier for flagging potentially unsafe image content without category-level labeling. Built on ViT-base architecture and fine-tuned on a curated dataset of safe and unsafe images.

21,530,509 downloads · 1,065 likes

clip-vit-base-patch32

zero-shot-image-classification

OpenAI's CLIP model using a ViT-B/32 image encoder, the smaller of the two widely deployed CLIP variants. Trained contrastively on 400 million image-text pairs, it aligns image and text representations in a shared embedding space for zero-shot classification and retrieval. The B/32 variant sacrifices accuracy versus ViT-L/14 for faster inference.

21,261,234 downloads · 931 likes

bge-m3

sentence-similarity

BAAI's BGE-M3 embedding model supporting over 100 languages with a unified architecture capable of dense, sparse (lexical), and late-interaction (ColBERT-style) retrieval modes from a single checkpoint. Built on XLM-RoBERTa with large-scale multilingual training, it targets multi-lingual and cross-lingual retrieval where a single model must handle diverse language inputs.

20,983,869 downloads · 2,977 likes

Qwen3-0.6B

text-generation

Qwen3-0.6B is the 0.6-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen3 series, fine-tuned from Qwen3-0.6B-Base for conversational and task-following use. It targets deployment in environments where even a 1B model is too large: edge hardware, mobile devices, or ultra-low-latency services. Apache 2.0 licensed.

19,085,165 downloads · 1,224 likes

roberta-base

fill-mask

RoBERTa base from Facebook AI shares BERT base's architecture but was trained with longer schedules, larger batch sizes, dynamic masking, and substantially more data: BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories. MIT licensed with multi-framework support.

18,684,651 downloads · 595 likes

roberta-large

fill-mask

RoBERTa large, the 355M-parameter version of Facebook AI's robustly optimized BERT variant, with twice the depth (24 layers vs 12) and a larger hidden size (1024 vs 768) than RoBERTa base. It provides stronger NLU accuracy at roughly 4x the inference compute cost of the base variant. Used where task accuracy on complex English language understanding outweighs latency constraints.

18,627,609 downloads · 283 likes

xlm-roberta-base

fill-mask

XLM-RoBERTa base from Facebook AI, pre-trained on 2.5TB of filtered CommonCrawl text across 100 languages using the RoBERTa training procedure. Enables cross-lingual transfer — models fine-tuned on labeled English data can infer on other languages without parallel annotations. The standard starting point for multilingual classification and token-level tasks.

18,605,818 downloads · 822 likes

clap-htsat-fused

audio-classification

LAION's CLAP (Contrastive Language-Audio Pretraining) model using the HTSAT (Hierarchical Token-Semantic Audio Transformer) encoder, fused with a text encoder to align audio and text in a shared embedding space. Analogous to CLIP for images, it enables zero-shot audio classification and retrieval using natural language descriptions without task-specific labeled audio data.

18,153,697 downloads · 82 likes

gpt2

text-generation

OpenAI's original GPT-2 at 124M parameters, an autoregressive language model trained on WebText (over 8 million web documents filtered from Reddit outlinks). It generates English text continuation given a prompt using next-token prediction, trained without any instruction tuning or RLHF. MIT licensed and runnable on commodity CPU hardware.

15,630,303 downloads · 3,227 likes

ADetailer is a collection of Ultralytics YOLO-based face, body, and hand detection models distributed for use with the Stable Diffusion WebUI's ADetailer extension. The models detect regions of interest in generated images (faces, hands) to trigger targeted inpainting passes for quality improvement. Trained on WIDER FACE and anime segmentation datasets, covering both photorealistic and anime styles.

15,448,394 downloads · 689 likes

nomic-embed-text-v1.5

sentence-similarity

Nomic Embed Text v1.5 is a matryoshka-capable English embedding model from Nomic AI, built on a custom nomic-BERT architecture trained with contrastive learning on large-scale text pairs. Matryoshka Representation Learning allows truncating embeddings to shorter dimensions (e.g. 64, 128, 256) without retraining, enabling flexible precision-cost tradeoffs. The model is transformers.js-compatible for browser-side inference.

15,328,805 downloads · 812 likes
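Matryoshka truncation itself is simple to sketch: keep the leading dimensions of the embedding and re-normalize. The vector and dimensions below are illustrative stand-ins, not real model output:

```python
import numpy as np

def truncate_embedding(emb, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of a matryoshka-trained
    embedding and re-normalize; shorter prefixes trade a little
    accuracy for less storage and faster search."""
    v = np.asarray(emb, dtype=float)[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=768)  # stand-in embedding
small = truncate_embedding(full, 128)             # 128-dim version
```

Because the training objective makes every prefix useful on its own, the same stored full-length vectors can serve indexes at several precision tiers.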

bge-large-en-v1.5

feature-extraction

BGE-Large-EN-v1.5 is BAAI's highest-capacity English embedding model in the v1.5 series, producing 1024-dimensional vectors. It achieves top MTEB retrieval scores among its generation of English-only embedding models, at the cost of higher compute and storage than BGE-small or BGE-base. MIT licensed with ONNX export support.

14,929,062 downloads · 657 likes

ColBERTv2 is a late-interaction retrieval model from Stanford that encodes queries and passages as per-token embeddings rather than a single vector, allowing MaxSim matching at retrieval time. This token-level interaction yields higher accuracy than bi-encoders on many retrieval benchmarks while remaining more efficient than cross-encoders. The model is MIT licensed and implemented in PyTorch with ONNX support.

14,593,513 downloads · 335 likes
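The MaxSim matching described above fits in a few lines; the per-token embeddings here are toy stand-ins for real ColBERT encoder output:

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query token embedding, take its
    maximum similarity over all document token embeddings, then sum."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                  # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])     # toy per-token query embeddings
exact = maxsim_score(q, q)                  # doc matches every query token
partial = maxsim_score(q, np.array([[1.0, 0.0]]))  # matches only one
```

Storing per-token vectors costs more than one vector per passage, which is the price of this token-level interaction; document embeddings can still be precomputed and indexed, unlike with a cross-encoder.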