Google's original BERT base model in uncased form, pre-trained on BookCorpus and English Wikipedia via masked language modeling. Input text is lowercased during tokenization, so the model is insensitive to capitalization. It remains a standard fine-tuning base for classification, NER, and extractive QA, though newer encoders outperform it on most benchmarks.
59,598,776 ↓ · 2,641 ♡
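The masked-language-modeling objective mentioned above can be sketched in a few lines: 15% of positions are chosen for prediction, and of those, 80% are replaced with [MASK], 10% with a random vocabulary token, and 10% left unchanged. A minimal illustration in plain Python (function and variable names are ours, not from any library):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions, then apply 80/10/10."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok                     # model must predict the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_token          # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
            # else 10%: keep the original token unchanged
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
inputs, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab)
```

The loss is computed only at positions where `labels` is not `None`; all other positions are ignored.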
RoBERTa base from Facebook AI: the BERT base architecture trained with longer schedules, larger batch sizes, and dynamic masking, on substantially more data than the original BERT (BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories). MIT licensed with multi-framework support.
18,684,651 ↓ · 595 ♡
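The dynamic masking RoBERTa introduced differs from original BERT's static masking only in when the mask pattern is drawn: once at preprocessing time versus freshly on every pass over a sequence. A toy sketch of the difference (helper names and sizes are ours):

```python
import random

def sample_mask(n_tokens, rng, prob=0.15):
    """Return the positions selected for masking on one pass."""
    return [i for i in range(n_tokens) if rng.random() < prob]

# Static masking (original BERT): one pattern fixed at preprocessing
# time and reused on every epoch.
static = sample_mask(50, random.Random(0))

# Dynamic masking (RoBERTa): a fresh pattern each time the sequence is
# seen, so the model eventually predicts many different token subsets.
rng = random.Random(0)
epochs = [sample_mask(50, rng) for _ in range(3)]
```

With the same seed, the first dynamic pass reproduces the static pattern; later passes draw new ones.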
RoBERTa large, the 355M-parameter version of Facebook AI's robustly optimized BERT variant, with 24 layers (vs. 12), a 1024-dimensional hidden size (vs. 768), and 16 attention heads (vs. 12) relative to RoBERTa base. It provides stronger NLU accuracy at roughly 4x the inference compute cost of the base variant. Used where task accuracy on complex English language understanding outweighs latency constraints.
18,627,609 ↓ · 283 ♡
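The "roughly 4x" compute figure can be sanity-checked from the architecture: per-token encoder cost scales with layers times hidden size squared, and large doubles the depth (24 vs. 12 layers) while widening the hidden size (1024 vs. 768). A back-of-envelope estimate, ignoring biases, LayerNorm, and attention's sequence-length term:

```python
def encoder_cost(layers, hidden):
    # Per-token weight/FLOP proxy: each layer holds ~4h^2 attention
    # projection weights plus ~8h^2 feed-forward weights => ~12h^2.
    return layers * 12 * hidden ** 2

base = encoder_cost(12, 768)    # RoBERTa base
large = encoder_cost(24, 1024)  # RoBERTa large
ratio = large / base            # 2 * (1024/768)^2 = 32/9
```

The ratio comes out near 3.6, which rounds to the "roughly 4x" quoted above.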
XLM-RoBERTa base from Facebook AI, pre-trained on 2.5TB of filtered CommonCrawl text across 100 languages using the RoBERTa training procedure. Enables cross-lingual transfer: models fine-tuned on labeled English data can be applied to other languages without parallel annotations. The standard starting point for multilingual classification and token-level tasks.
18,605,818 ↓ · 822 ♡
DistilBERT-base-uncased is a distilled version of BERT-base-uncased, 40% smaller and 60% faster while retaining approximately 97% of BERT's language understanding performance on the GLUE benchmark. Trained via knowledge distillation from BERT using BookCorpus and Wikipedia. Commonly used when BERT's performance is needed but inference speed or resource constraints are limiting factors.
13,940,511 ↓ · 872 ♡
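Knowledge distillation of the kind used for DistilBERT trains the student against the teacher's temperature-softened output distribution. A schematic of the soft-target loss in plain Python (illustrative logits; DistilBERT's full objective also combines the hard-label MLM loss and a cosine embedding loss):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T > 1 flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between softened teacher and student distributions.

    Raising T exposes the teacher's relative confidence in wrong classes
    (the 'dark knowledge'); the T*T factor keeps gradient scale stable.
    """
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) * T * T

teacher = [4.0, 1.0, 0.2]   # hypothetical teacher logits for 3 classes
student = [3.5, 1.2, 0.1]   # hypothetical student logits
loss = distill_loss(teacher, student)
```

The loss is minimized when the student's softened distribution matches the teacher's exactly.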
XLM-RoBERTa Large, the 560-million-parameter multilingual encoder from Facebook AI, trained on 2.5TB of CommonCrawl data across 100 languages. It offers stronger multilingual language understanding than the base variant across classification, NER, and cross-lingual tasks, at roughly 4x the compute cost. MIT licensed with multi-framework support.
6,758,270 ↓ · 511 ♡
Google's BERT base model in cased form, pre-trained on BookCorpus and English Wikipedia with original case preserved. Unlike bert-base-uncased, this model maintains distinctions between 'bert' and 'BERT' — essential for tasks where capitalization carries semantic information, such as named entity recognition. Same architecture as bert-base-uncased but with case-sensitive tokenization.
4,439,406 ↓ · 357 ♡
BERT-base-multilingual-cased is Google's multilingual BERT trained on 104-language Wikipedia data with case preserved, making it better suited than the uncased variant for named entity recognition and tasks where capitalization carries semantic meaning. It shares the same 12-layer Transformer architecture and 768-dimensional embedding space as BERT-base-uncased. Despite its age, it remains a common transfer learning starting point for multilingual tasks.
4,221,839 ↓ · 587 ♡
BERT-base-multilingual-uncased is Google's multilingual BERT trained on Wikipedia text from 104 languages with all text lowercased before tokenization. Lowercasing simplifies processing but removes capitalization signals that help named entity recognition. It produces 768-dimensional embeddings shared across all supported languages.
3,878,218 ↓ · 156 ♡
Microsoft's DeBERTaV3 base, which replaces masked language modeling with ELECTRA-style replaced-token detection and adds gradient-disentangled embedding sharing on top of DeBERTa's disentangled attention. A strong default English encoder for classification and extractive QA, typically outperforming RoBERTa base at comparable size.
2,522,455 ↓ · 418 ♡
BERTimbau Large from NeuralMind, a 24-layer cased Portuguese BERT pre-trained on the brWaC web corpus. The large counterpart to bert-base-portuguese-cased, used for Portuguese NER, classification, and QA where accuracy outweighs inference cost.
2,271,959 ↓ · 72 ♡
Bio_ClinicalBERT, initialized from BioBERT and further pre-trained on clinical notes from the MIMIC-III database. A common base model for clinical NLP tasks such as concept extraction, de-identification, and note classification.
2,236,514 ↓ · 427 ♡
ESM-2 650M from Meta AI, a 33-layer protein language model trained with masked language modeling on UniRef50 sequence clusters (the name encodes layer count, parameter count, and training set). A popular middle size in the ESM-2 family, widely used for protein embeddings and property prediction.
1,840,998 ↓ · 78 ♡
ModernBERT-base from Answer.AI and LightOn, a modernized encoder with rotary position embeddings, alternating local/global attention, and a native 8,192-token context, trained on roughly 2 trillion tokens. Positioned as a drop-in BERT replacement with better speed and downstream accuracy.
1,499,008 ↓ · 1,035 ♡
mDeBERTa-v3 base, the multilingual variant of Microsoft's DeBERTaV3, pre-trained on the CC100 corpus covering roughly 100 languages with the same replaced-token-detection objective. A strong alternative to XLM-RoBERTa base for cross-lingual classification and token-level tasks.
1,474,787 ↓ · 220 ♡
Microsoft's BiomedBERT base (formerly PubMedBERT), pre-trained from scratch on PubMed abstracts with a domain-specific vocabulary rather than continuing from general-domain BERT. The from-scratch recipe gives strong results across biomedical NLP benchmarks.
1,433,984 ↓ · 91 ♡
A distilled version of bert-base-multilingual-cased covering the same 104 languages with 6 Transformer layers instead of 12. Retains most of mBERT's multilingual capability at a substantially smaller size and faster inference.
1,311,322 ↓ · 239 ♡
Google's BERT large in uncased form: 24 layers, a 1024-dimensional hidden size, 16 attention heads, and roughly 340M parameters, pre-trained on the same BookCorpus and English Wikipedia data as the base model. Chosen over bert-base-uncased when accuracy justifies roughly triple the parameter count.
1,159,848 ↓ · 147 ♡
Google's Chinese BERT base, pre-trained on Chinese Wikipedia with character-level tokenization. A long-standing default base model for Chinese classification, NER, and QA.
1,149,002 ↓ · 1,417 ♡
DistilRoBERTa base, a 6-layer, 82M-parameter distillation of roberta-base trained on OpenWebText. Roughly twice as fast as its teacher while retaining most of its accuracy; a frequent base for sentence-embedding and classification models.
1,099,593 ↓ · 177 ♡
The large variant of Microsoft's DeBERTaV3, with 24 layers and a 1024-dimensional hidden size. Among the strongest open encoders of its generation on GLUE-style benchmarks, used where accuracy matters more than inference cost.
1,004,903 ↓ · 277 ♡
The smallest ESM-2 checkpoint: 6 layers and 8M parameters, trained on UniRef50. Useful for fast prototyping of protein-sequence pipelines and as a lightweight embedding model where larger ESM-2 variants are too expensive.
859,624 ↓ · 34 ♡
CamemBERT base, a French RoBERTa-style encoder pre-trained on the French portion of the OSCAR corpus. The standard starting point for French classification, NER, and QA fine-tuning.
838,611 ↓ · 100 ♡
RoBERTa base, Facebook AI's 125M-parameter English encoder; this appears to be a second registry listing of the same model described earlier in this list.
799,186 ↓ · 48 ♡
The small DeBERTaV3 variant from Microsoft, a 6-layer encoder that retains much of the v3 recipe's accuracy at a fraction of the compute. Suited to latency-sensitive English classification.
737,237 ↓ · 77 ♡
ALBERT base v2 from Google, which shrinks BERT base's parameter count to roughly 12M via cross-layer parameter sharing and a factorized embedding (128-dimensional token embeddings projected up to a 768-dimensional hidden size). Memory-efficient, though inference is not proportionally faster since all 12 layers still execute.
723,642 ↓ · 141 ♡
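ALBERT's roughly-12M parameter count can be sanity-checked with a rough weight tally: the factorized embedding stores V·E + E·H weights instead of V·H, and one shared set of layer weights stands in for twelve. A back-of-envelope comparison with BERT base (weight matrices only; biases, LayerNorm, and pooler ignored):

```python
def layer_params(h):
    # One Transformer layer holds ~4h^2 attention-projection weights
    # plus ~8h^2 feed-forward weights => ~12h^2 in total.
    return 12 * h ** 2

H = 768                                           # hidden size shared by both models
bert = 30522 * H + 12 * layer_params(H)           # full V*H embedding + 12 distinct layers
albert = 30000 * 128 + 128 * H + layer_params(H)  # V*E + E*H factorization + 1 shared layer
```

This lands near 108M vs. 11M weights, matching the order-of-magnitude reduction ALBERT reports.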
The 36-layer, 3-billion-parameter ESM-2 model trained on UniRef50, and the backbone used by ESMFold for structure prediction. The strongest commonly used ESM-2 checkpoint short of the 15B model, at correspondingly high compute cost.
614,701 ↓ · 28 ♡
AraBERT v0.2 base from the AUB MIND Lab, an Arabic BERT pre-trained on large Arabic news and web corpora. A standard base model for Arabic classification, NER, and QA.
608,455 ↓ · 44 ♡
dummy-unknown, a tiny randomly initialized model that appears to be a CI/testing fixture rather than a usable encoder; the high download count alongside a single like suggests automated test traffic. Not intended for fine-tuning or inference.
599,280 ↓ · 1 ♡
BETO uncased, the Spanish BERT from the Universidad de Chile (dccuchile), pre-trained with whole-word masking on a large Spanish corpus. A long-standing default base model for Spanish NLP.
504,984 ↓ · 74 ♡
Clinical-Longformer, a Longformer further pre-trained on MIMIC-III clinical notes, supporting sequences up to 4,096 tokens. Suited to long clinical documents where standard 512-token encoders would truncate most of the text.
493,280 ↓ · 69 ♡
German BERT base (cased) from deepset, pre-trained on German text including Wikipedia, legal documents, and news. A standard base model for German classification and NER.
486,192 ↓ · 82 ♡
Bio_Discharge_Summary_BERT, a sibling of Bio_ClinicalBERT from the same authors, initialized from BioBERT but further pre-trained only on MIMIC-III discharge summaries. Intended for tasks centered on discharge documentation.
460,633 ↓ · 38 ♡
JuriBERT base, a BERT-style encoder pre-trained on French legal text for legal-domain NLP. Public details beyond the French legal focus are sparse.
417,574 ↓ · 0 ♡
ModernBERT-large, the roughly 400M-parameter sibling of ModernBERT-base, sharing its 8,192-token context and modernized architecture with higher downstream accuracy at greater compute cost.
403,262 ↓ · 467 ♡
KcBERT base, a Korean BERT pre-trained on user comments from Korean news portals, making it robust to the informal, noisy text of Korean social media. Commonly fine-tuned for Korean sentiment and toxicity classification.
390,214 ↓ · 30 ♡
ProtBert from Rostlab, a BERT-style protein language model pre-trained on UniRef100 with each amino acid treated as a token. Its embeddings are used for protein classification and residue-level prediction tasks.
352,050 ↓ · 132 ♡
GraphCodeBERT from Microsoft, a code encoder whose pre-training incorporates data-flow graphs so the model sees where variable values come from, not just token order. Used for code search, clone detection, and code refinement.
333,574 ↓ · 87 ♡
Japanese BERT base from Tohoku University, pre-trained on Japanese Wikipedia with MeCab morphological tokenization followed by WordPiece, using whole-word masking. A standard Japanese fine-tuning base.
321,796 ↓ · 76 ♡
mmBERT base, a recent massively multilingual encoder in the ModernBERT lineage, trained on text covering over 1,800 languages. Positioned as a modern multilingual alternative to XLM-RoBERTa.
318,854 ↓ · 205 ♡
BERTimbau Base from NeuralMind, a 12-layer cased Portuguese BERT pre-trained on the brWaC corpus. The standard starting point for Brazilian Portuguese classification, NER, and QA.
300,876 ↓ · 229 ♡
Chinese BERT with whole-word masking (extended data) from HFL, trained on corpora beyond Chinese Wikipedia. Whole-word masking masks every subword of a Chinese word together, improving downstream accuracy over bert-base-chinese.
298,910 ↓ · 193 ♡
The 12-layer, 35M-parameter ESM-2 checkpoint trained on UniRef50, sitting between the 8M and 150M variants. Used when protein-embedding throughput matters more than accuracy.
292,604 ↓ · 21 ♡