AI Tools.

fill mask

bert-base-multilingual-uncased

bert-base-multilingual-uncased is Google's multilingual BERT, pretrained on Wikipedia text from 104 languages with all text lowercased before tokenization. Lowercasing simplifies preprocessing but discards the capitalization signals that help named entity recognition. The model produces 768-dimensional embeddings shared across all supported languages.
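
A minimal way to see that shared embedding space is to load the checkpoint with the HuggingFace transformers library and pool the encoder output. The sketch below assumes transformers and torch are installed; the mean-pooling step is an illustrative choice, not something the checkpoint prescribes.

    from transformers import AutoTokenizer, AutoModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
    model = AutoModel.from_pretrained("bert-base-multilingual-uncased")

    # Two sentences in different languages share the same vocabulary and encoder.
    sentences = ["Where is the train station?", "¿Dónde está la estación de tren?"]
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool token vectors into one 768-dimensional vector per sentence
    # (an illustrative pooling choice, not mandated by the model).
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    print(embeddings.shape)  # torch.Size([2, 768])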

Use cases

  • Cross-lingual text classification with a single model
  • Zero-shot transfer to low-resource languages in the 104-language set
  • Multilingual masked language model pretraining baseline
  • NER and POS tagging in contexts where casing is unavailable or uninformative

Pros

  • Single model spans 104 languages with a shared multilingual vocabulary
  • Apache 2.0 license, widely integrated in community NLP pipelines
  • Well-understood baseline with extensive published benchmarks

Cons

  • Lowercasing removes signals critical for named entity recognition
  • Outperformed on most benchmarks by XLM-RoBERTa-base and larger models
  • Fixed 512-token context limit with no built-in sliding-window support (see the chunking sketch after this list)
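
As a workaround for the 512-token limit noted above, longer documents are commonly split into overlapping windows and the per-window vectors averaged. The sketch below is one such hypothetical chunking scheme, not a feature of the model itself; the window and stride sizes are arbitrary assumptions.

    from transformers import AutoTokenizer, AutoModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
    model = AutoModel.from_pretrained("bert-base-multilingual-uncased")

    def embed_long_text(text, window=510, stride=128):
        # Tokenize without special tokens, then walk overlapping windows
        # so no chunk exceeds the 512-token encoder limit.
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        step = window - stride
        vectors = []
        for start in range(0, max(len(ids), 1), step):
            chunk = [tokenizer.cls_token_id] + ids[start:start + window] + [tokenizer.sep_token_id]
            with torch.no_grad():
                out = model(input_ids=torch.tensor([chunk]))
            vectors.append(out.last_hidden_state.mean(dim=1))  # mean-pool each window
            if start + window >= len(ids):
                break
        return torch.cat(vectors).mean(dim=0)  # average the window vectors

    print(embed_long_text("a very long document " * 500).shape)  # torch.Size([768])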

FAQ

What is bert-base-multilingual-uncased used for?

It is typically used for cross-lingual text classification with a single model, zero-shot transfer to low-resource languages within its 104-language set, as a multilingual masked language model pretraining baseline, and for NER or POS tagging in contexts where casing is unavailable or uninformative.

Is bert-base-multilingual-uncased free to use?

Yes. bert-base-multilingual-uncased is released under the Apache 2.0 license, so it is free to use for research and commercial purposes. Confirm the current terms on the HuggingFace model card.

How do I run bert-base-multilingual-uncased locally?

The model loads through the HuggingFace transformers library, with PyTorch, TensorFlow, JAX, and safetensors weights available; the fill-mask pipeline is the quickest way to try it, as sketched below. See the model card for framework-specific instructions and hardware requirements.
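
A quick local smoke test using the transformers fill-mask pipeline; this sketch assumes transformers and a backend such as PyTorch are installed, and it downloads the checkpoint on first run.

    from transformers import pipeline

    # The uncased tokenizer lowercases input automatically, so the casing of the
    # prompt does not matter; [MASK] is the model's mask token.
    unmasker = pipeline("fill-mask", model="bert-base-multilingual-uncased")
    for prediction in unmasker("paris is the capital of [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))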

Tags

transformers, pytorch, tf, jax, safetensors, bert, fill-mask, multilingual, af, sq, ar, an, hy, ast, az, ba, eu, bar, be, bn