language:
  - zh
  - en
  - de
  - fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
  - embeddings
  - lora
  - sociology
  - retrieval
  - feature-extraction
  - sentence-transformers

THETA: Textual Hybrid Embedding–based Topic Analysis

Model Description

THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B). It is designed to generate dense vector representations for texts in the sociology and social science domain.

The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

Base Models:

Qwen3-Embedding-0.6B
Qwen3-Embedding-4B

Fine-tuning Methods:

Unsupervised: SimCSE (contrastive learning)
Supervised: Label-guided contrastive learning with LoRA
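The two objectives above can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the training code behind THETA: the card does not give the exact loss formulation, temperature, or batch construction, so the function names and the temperature of 0.05 are illustrative. In SimCSE the two views of a sentence typically come from encoding the same text twice with different dropout masks; here they are simply passed in as arrays.

```python
import numpy as np


def _log_softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def _l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale every row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def simcse_loss(view_a, view_b, temperature=0.05):
    """InfoNCE over two views of one batch: row i of view_a should match row i of view_b."""
    sims = _l2_normalize(view_a) @ _l2_normalize(view_b).T / temperature
    return float(-np.mean(np.diag(_log_softmax(sims))))


def label_guided_loss(embeddings, labels, temperature=0.05):
    """Supervised contrastive loss: samples that share a label are treated as positives."""
    z = _l2_normalize(embeddings)
    labels = np.asarray(labels)
    not_self = ~np.eye(len(labels), dtype=bool)
    # Mask self-similarity out of the softmax denominator.
    sims = np.where(not_self, z @ z.T / temperature, -np.inf)
    log_probs = _log_softmax(sims)
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_counts = positives.sum(axis=1)
    has_pos = pos_counts > 0  # anchors without any positive are skipped
    summed = np.where(positives, log_probs, 0.0).sum(axis=1)
    return float(-(summed[has_pos] / pos_counts[has_pos]).mean())
```

Both functions expect a batch of at least two rows. `label_guided_loss` also needs at least one label that occurs twice in the batch; otherwise no anchor has a positive and the mean is undefined.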

Intended Use

This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations.
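As a rough illustration of the similarity and retrieval uses, the sketch below ranks a small corpus against a query by cosine similarity. The helper name `cosine_top_k` and the toy 3-dimensional vectors are assumptions made for the example; real THETA embeddings would take their place.

```python
import numpy as np


def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k corpus rows most similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity, since both sides are unit length
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors standing in for real embeddings.
corpus = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([1.0, 0.1, 0.0])
print(cosine_top_k(query, corpus, k=2))
```

For larger corpora it is usually cheaper to normalize the corpus matrix once and reuse it across queries.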

It is not designed for text generation or decision-making in high-risk scenarios.

Model Architecture

Component	Detail
Base model	Qwen3-Embedding (0.6B / 4B)
Fine-tuning	LoRA (Low-Rank Adaptation)
Output dimension	1024 (0.6B) / 2560 (4B)
Framework	Transformers (PyTorch)
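For readers unfamiliar with LoRA, the sketch below shows the general idea: the base weight `W` stays frozen and only two small matrices `A` and `B` are trained, contributing a low-rank update scaled by `alpha / r`. The rank, scaling factor, and target modules used for THETA are not stated in this card, so every shape and value here is illustrative.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha):
    """y = x W^T + (alpha / r) * x A^T B^T, where W is frozen and A, B are trainable.

    Shapes: x (batch, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


def merge_lora(W, A, B, alpha):
    """Fold the low-rank update into the base weight for adapter-free inference."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)


def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters the adapter adds to one weight matrix."""
    return r * (d_in + d_out)
```

With `B` initialized to zeros, as in the original LoRA formulation, the adapted layer starts out identical to the base layer.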

Repository Structure

CodeSoulco/THETA/
├── 0.6B/
│   ├── supervised/
│   └── unsupervised/
├── 4B/
│   ├── supervised/
│   └── unsupervised/
└── logs/

Pre-computed embeddings are available in a separate dataset repo: CodeSoulco/THETA-embeddings

Training Details

Fine-tuning method: LoRA
Training domain: Sociology and social science texts
Datasets: germanCoal, FCPB, socialTwitter, hatespeech, mental_health
Objective: Improve domain-specific semantic representation
Hardware: Two NVIDIA GPUs
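The adapter path in the usage example, 0.6B/unsupervised/germanCoal, suggests a size/method/dataset layout inside the repository. The hypothetical helper below builds such a path and rejects unknown values. Two things are assumptions: that every size, method, and dataset combination actually exists in the repo, and the 4B base model ID, which is inferred by analogy with the 0.6B one.

```python
SIZES = ("0.6B", "4B")
METHODS = ("supervised", "unsupervised")
DATASETS = ("germanCoal", "FCPB", "socialTwitter", "hatespeech", "mental_health")

# The 0.6B ID appears in this card; the 4B ID is assumed by analogy.
BASE_MODELS = {
    "0.6B": "Qwen/Qwen3-Embedding-0.6B",
    "4B": "Qwen/Qwen3-Embedding-4B",
}


def adapter_subfolder(size: str, method: str, dataset: str) -> str:
    """Build the subfolder string to pass to PeftModel.from_pretrained."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}, expected one of {METHODS}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}, expected one of {DATASETS}")
    return f"{size}/{method}/{dataset}"


print(adapter_subfolder("0.6B", "unsupervised", "germanCoal"))
```

Pairing the returned subfolder with `BASE_MODELS[size]` keeps the adapter and its base model from getting mismatched.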

How to Use

from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load base model
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="0.6B/unsupervised/germanCoal"
)

# Generate embeddings
text = "Social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

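model.eval()  # disable dropout so repeated calls give the same embedding
with torch.no_grad():
    outputs = model(**inputs)

# Last-token pooling, as in the upstream Qwen3-Embedding examples.
# The base model is decoder-only with causal attention, so position 0 sees
# only the first token and does not act as a CLS-style summary.
# For a single unpadded input, the last token sits at index -1.
embeddings = outputs.last_hidden_state[:, -1, :]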
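The snippet above embeds one unpadded text, so a fixed index is enough to pick the pooled token. In a batch, padding changes where each sequence's real tokens sit, and the attention mask is needed to find them. The sketch below is a hypothetical helper, `pool_token`, that selects the first or last non-padding token per sequence and works with either padding side. This card does not state which pooling the THETA adapters were trained with; the upstream Qwen3-Embedding examples use last-token pooling.

```python
import numpy as np


def pool_token(hidden, attention_mask, position="last"):
    """Pick one token vector per sequence, skipping padding.

    hidden: (batch, seq_len, dim) array of hidden states.
    attention_mask: (batch, seq_len) array, 1 for real tokens and 0 for padding.
    """
    mask = np.asarray(attention_mask).astype(bool)
    seq_len = mask.shape[1]
    if position == "first":
        idx = mask.argmax(axis=1)  # index of the first real token
    elif position == "last":
        idx = seq_len - 1 - mask[:, ::-1].argmax(axis=1)  # index of the last real token
    else:
        raise ValueError("position must be 'first' or 'last'")
    return hidden[np.arange(hidden.shape[0]), idx]
```

Assuming CPU float32 tensors from the snippet above, it could be called as `pool_token(outputs.last_hidden_state.numpy(), inputs["attention_mask"].numpy())`.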

Limitations

Fine-tuned for sociology/social science domain; may not generalize well to unrelated topics.
Performance depends on input text length and quality; the usage example above truncates inputs to 512 tokens, so any text beyond that limit is ignored.
Does not generate text and should not be used for generative tasks.
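One common workaround for the length limitation is to split a long document into overlapping windows, embed each window, and average the resulting vectors. The card does not prescribe this, and the sketch below is only an assumption-laden illustration: the helper name `chunk_token_ids` and the overlap of 64 tokens are made up for the example, and any special tokens the tokenizer adds are not accounted for.

```python
def chunk_token_ids(token_ids, max_length=512, stride=64):
    """Split a token id list into windows of at most max_length ids.

    Consecutive windows overlap by `stride` ids so that text near a
    boundary appears in two windows.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the input
    return chunks


print(chunk_token_ids(list(range(10)), max_length=4, stride=1))
```

Whether averaged window embeddings work well for a given task is an empirical question and worth checking on held-out data.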

License

This model is released under the MIT License.

Citation

@misc{theta2026,
  title={THETA: Textual Hybrid Embedding--based Topic Analysis},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}