audio classification

clap-htsat-fused

LAION's CLAP (Contrastive Language-Audio Pretraining) model, which pairs an HTSAT (Hierarchical Token-Semantic Audio Transformer) audio encoder with a text encoder to align audio and text in a shared embedding space. The "fused" in the name refers to the feature fusion mechanism that lets the audio branch process inputs of variable length. Like CLIP for images, it enables zero-shot audio classification and retrieval from natural language descriptions, without task-specific labeled audio data.
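
To make the shared embedding space concrete, the sketch below scores one audio embedding against several label embeddings with cosine similarity, then applies a softmax. It is an illustration only: the three-dimensional vectors are toy values standing in for real model outputs, and the function names and the temperature of 0.07 are assumptions, not values taken from the model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_scores(audio_emb, label_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities, one score per label.

    The temperature is an illustrative assumption; a trained model
    would use its own learned scale.
    """
    sims = {label: cosine(audio_emb, emb) / temperature
            for label, emb in label_embs.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy vectors (hypothetical), standing in for audio and text embeddings.
audio_emb = [0.9, 0.1, 0.2]
label_embs = {
    "dog barking": [1.0, 0.0, 0.1],
    "vacuum cleaner": [0.0, 1.0, 0.3],
}

scores = zero_shot_scores(audio_emb, label_embs)
best = max(scores, key=scores.get)
print(best, scores)
```

Because the labels are plain text, changing the category set only means embedding new strings; no audio retraining is involved.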

Use cases

  • Zero-shot audio event classification using natural language labels
  • Audio-to-text retrieval in sound effect or music libraries
  • Environmental sound tagging without collecting labeled audio training data
  • Building natural language queries for acoustic search systems
  • Audio feature extraction backbone for downstream acoustic ML tasks
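
The retrieval and search use cases above come down to ranking precomputed clip embeddings against a query embedding. A minimal sketch follows, assuming the embeddings have already been extracted; the file names, vectors, and the rank_clips helper are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def rank_clips(query_emb, library, top_k=2):
    """Return the ids of the top_k clips most similar to the query embedding."""
    q = normalize(query_emb)
    scored = []
    for clip_id, emb in library.items():
        e = normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, e)), clip_id))
    scored.sort(reverse=True)  # highest similarity first
    return [clip_id for _, clip_id in scored[:top_k]]


# Hypothetical library of clip embeddings (toy 3-dimensional values).
library = {
    "rain_01.wav": [0.1, 0.9, 0.0],
    "door_slam.wav": [0.8, 0.1, 0.1],
    "thunder_02.wav": [0.2, 0.8, 0.3],
}
# Stands in for the text embedding of a query such as "heavy rain".
query_emb = [0.0, 1.0, 0.1]

results = rank_clips(query_emb, library)
print(results)
```

For a large library, the linear scan would normally be replaced by a vector index, but the ranking principle stays the same.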

Pros

  • Zero-shot audio classification without task-specific training data
  • Natural language label specification supports flexible, updateable categories
  • Feature fusion in the audio branch handles variable-length audio inputs
  • Apache 2.0 license
  • Supports audio event classification and retrieval in one model

Cons

  • Text conditioning is English-only
  • Accuracy degrades on fine-grained or highly domain-specific audio categories
  • Real-world recording quality and sample rate mismatches affect reliability
  • Less validated than image CLIP for generalization across diverse audio domains
  • Higher computational overhead vs. dedicated narrow-domain audio classifiers
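
The sample rate concern can be partly addressed by resampling before inference. The sketch below assumes the model expects 48 kHz input, which should be verified against the model's preprocessor configuration. It uses plain linear interpolation to stay dependency-free; the resample_linear helper is hypothetical, and a band-limited resampler from an audio library is a better choice in practice.

```python
TARGET_SR = 48_000  # assumed expected rate; verify against the model's preprocessor config


def resample_linear(samples, src_sr, dst_sr=TARGET_SR):
    """Resample a mono signal by linear interpolation (illustrative, not band-limited)."""
    if src_sr == dst_sr:
        return list(samples)
    n_out = int(round(len(samples) * dst_sr / src_sr))
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr           # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# One second of silence recorded at 16 kHz becomes one second at 48 kHz.
one_second = [0.0] * 16_000
resampled = resample_linear(one_second, src_sr=16_000)
print(len(resampled))
```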

FAQ

What is clap-htsat-fused used for?

It is used mainly for zero-shot audio event classification with natural language labels, audio-to-text retrieval in sound effect or music libraries, and environmental sound tagging without collecting labeled training data. It also supports natural language queries in acoustic search systems and can serve as an audio feature extraction backbone for downstream acoustic ML tasks.

Is clap-htsat-fused free to use?

clap-htsat-fused is an open-source model published on HuggingFace. License terms vary by model — check the model card for the specific license.

How do I run clap-htsat-fused locally?

The model is tagged for the transformers library with PyTorch and safetensors weights, so it can be loaded with transformers. See the model card for framework-specific instructions and hardware requirements.
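
As a rough sketch of local use, the transformers zero-shot audio classification pipeline can wrap the model. The model id laion/clap-htsat-fused, the file name, and the helper names below are assumptions for illustration; the model card remains the authoritative source.

```python
def classify_audio(audio_path, candidate_labels, model_id="laion/clap-htsat-fused"):
    """Score an audio file against free-text labels.

    The model id is an assumption; confirm it on the model card.
    Weights are downloaded on first use.
    """
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    classifier = pipeline(task="zero-shot-audio-classification", model=model_id)
    return classifier(audio_path, candidate_labels=candidate_labels)


def top_label(results):
    """Pick the best label from a list of {"score": ..., "label": ...} dicts."""
    return max(results, key=lambda r: r["score"])["label"]


# Example call (requires transformers, PyTorch, and an audio file on disk):
# results = classify_audio("dog_bark.wav", ["Sound of a dog", "Sound of a vacuum cleaner"])
# print(top_label(results))
```

Building the pipeline once and reusing it across files avoids reloading the weights on every call.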

Tags

transformers, pytorch, safetensors, clap, feature-extraction, zero-shot audio classification, zero-shot audio retrieval, audio-classification, en, arxiv:2211.06687, license:apache-2.0, endpoints_compatible, region:us