
	---
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- transformers

	---

	# GitHub Issues MPNet Sentence Transformer (10 Epochs)

	This is a [sentence-transformers](https://www.SBERT.net) model trained specifically on GitHub issue data.

	## Dataset

	For training, we used the [NLBSE22 dataset](https://nlbse2022.github.io/tools/) after removing duplicates and issues with an empty body.
	The sentence embedding model was trained on the similarity between each issue's title and its body.
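
The filtering described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the `title` and `body` field names, the `build_pairs` helper, and the choice to deduplicate on the exact (title, body) text are all assumptions.

```python
def build_pairs(issues):
    """Return (title, body) pairs, skipping empty bodies and exact duplicates.

    `issues` is an iterable of dicts with hypothetical `title` and `body` keys.
    """
    seen = set()
    pairs = []
    for issue in issues:
        title = (issue.get("title") or "").strip()
        body = (issue.get("body") or "").strip()
        if not body:
            continue  # drop issues with an empty body
        key = (title, body)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        pairs.append(key)
    return pairs


# Toy records for illustration only; they are not taken from the dataset.
issues = [
    {"title": "Crash on startup", "body": "The app crashes when launched."},
    {"title": "Crash on startup", "body": "The app crashes when launched."},
    {"title": "Add dark mode", "body": ""},
    {"title": "Typo in docs", "body": None},
]
pairs = build_pairs(issues)
print(pairs)
```

Each surviving pair can then serve as one (title, body) training example.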


	## Usage (Sentence-Transformers)

	To use this model, first install [sentence-transformers](https://www.SBERT.net):

	```
	pip install -U sentence-transformers
	```

	Then you can use the model like this:

	```python
	from sentence_transformers import SentenceTransformer
	sentences = ["This is an example sentence", "Each sentence is converted"]

	model = SentenceTransformer('Collab-uniba/github-issues-mpnet-st-e10')
	embeddings = model.encode(sentences)
	print(embeddings)
	```
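
Because the model was trained on title/body similarity, a common next step is to compare embeddings with cosine similarity. The sketch below shows that computation in plain Python. The helper names `cosine_similarity` and `rank_candidates` are illustrative, and the short vectors are toy stand-ins for the 768-dimensional output of `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_candidates(query, candidates)
print(ranking)
```

With real embeddings, pass the rows returned by `model.encode(...)` instead of the toy vectors. The sentence-transformers package also provides `util.cos_sim` for the same computation on whole batches.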


	## Training
	The model was trained for ten epochs using Multiple Negatives Ranking Loss, under the assumption that the title and body of the same issue should be similar.
	We used the following parameters:

	DataLoader:

	`torch.utils.data.dataloader.DataLoader` of length 39221 with parameters:
	```
	{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
	```

	Loss:

	`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
	```
	{'scale': 20.0, 'similarity_fct': 'cos_sim'}
	```
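
To make the loss concrete, here is a minimal plain-Python sketch of the idea behind Multiple Negatives Ranking Loss with `scale = 20.0` and cosine similarity. It is an illustration under stated assumptions, not the library's implementation: `mnr_loss` and `cos_sim` are hypothetical helpers, and the 2-dimensional vectors stand in for title and body embeddings. For each title, its own body is treated as the correct match and every other body in the batch acts as a negative.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and all other positives in the batch serve as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cos_sim(anchor, p) for p in positives]
        # log-sum-exp with max subtraction for numerical stability
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)


# Toy batch: each "title" vector is identical to its own "body" vector.
titles = [[1.0, 0.0], [0.0, 1.0]]
bodies = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = mnr_loss(titles, bodies)
loss_swapped = mnr_loss(titles, bodies[::-1])  # bodies paired with the wrong titles
print(loss_aligned, loss_swapped)
```

The loss is close to zero when each title is most similar to its own body, and large when the pairing is swapped. With a batch size of 16, each title is contrasted against 15 in-batch negatives.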

	Parameters of the `fit()` method:
	```
	{
	    "epochs": 10,
	    "evaluation_steps": 0,
	    "evaluator": "NoneType",
	    "max_grad_norm": 1,
	    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
	    "optimizer_params": {
	        "lr": 2e-05
	    },
	    "scheduler": "WarmupLinear",
	    "steps_per_epoch": null,
	    "warmup_steps": 39221,
	    "weight_decay": 0.01
	}
	```
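
The `WarmupLinear` schedule with these values can be sketched as below. This is a hedged illustration: `warmup_linear_lr` is a hypothetical helper, and the total of 392210 steps assumes that, with `steps_per_epoch` set to null, one epoch equals the DataLoader length (39221 batches), so ten epochs give 39221 × 10 steps. Under that assumption the warmup covers the first epoch.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=39221, total_steps=392210):
    """Learning rate at a given optimizer step for a warmup-then-linear-decay schedule.

    total_steps is an assumption: DataLoader length (39221) times 10 epochs.
    """
    if step < warmup_steps:
        # linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # linear decay from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 39221, 215715, 392210):
    print(step, warmup_linear_lr(step))
```

The learning rate starts at 0, peaks at 2e-05 at the end of the warmup, and decays linearly to 0 by the last step.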


	## Full Model Architecture
	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
	  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
	)
	```
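
The pooling layer above has only `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token embeddings. A minimal plain-Python sketch of mask-aware mean pooling follows. `mean_pool` is a hypothetical helper, and the 2-dimensional token vectors are toy stand-ins for the 768-dimensional MPNet outputs.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding tokens."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vec[d]
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [s / count for s in summed]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```

The padding token does not affect the result because its mask entry is 0.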

	## Citing & Authors
	```
	@article{Colavito_2025_Benchmarking,
	title = {Benchmarking large language models for automated labeling: The case of issue report classification},
	author = {Giuseppe Colavito and Filippo Lanubile and Nicole Novielli},
	year = 2025,
	journal = {Information and Software Technology},
	volume = 184,
	pages = 107758,
	doi = {https://doi.org/10.1016/j.infsof.2025.107758},
	issn = {0950-5849},
	url = {https://www.sciencedirect.com/science/article/pii/S0950584925000977},
	keywords = {Issue labeling, Generative AI, Software maintenance and evolution}
	}
	```