# CodeGenDetect-CodeBERT

**Model Name:** `azherali/CodeGenDetect-CodeBert`
**Task:** Code Generation Detection (Human vs. Machine-Generated Code)
**Languages Supported:** C++, Java, Python
**Base Model:** CodeBERT
**Author:** Azher Ali

---
|
|
## Model Overview
|
|
`CodeGenDetect-CodeBert` is a transformer-based classification model designed to distinguish **human-written code** from **machine-generated code** produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning **C++**, **Java**, and **Python**, making it suitable for real-world, cross-language code analysis tasks.


Built on top of **CodeBERT**, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.
|
|
---


## Intended Use Cases


This model is well-suited for:


- **Academic integrity & plagiarism detection**
- **LLM-generated code identification**
- **Code authenticity verification**
- **Research on AI-generated programming artifacts**
- **Code forensics and auditing pipelines**
|
|
---


## Model Details


- **Architecture:** Transformer-based (CodeBERT)
- **Task Type:** Binary sequence classification
- **Labels:**
  - `0` → Human-written code
  - `1` → Machine-generated (LLM) code
- **Input:** Source code as plain text
- **Output:** Class probabilities and predicted label
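
The mapping from raw classifier outputs to the labels above can be sketched without loading the model. The logits below are made-up placeholder values, not real model output; a real run would produce them from the classification head:

```python
import math

# Hypothetical logits for one snippet (index 0 = human, 1 = machine).
logits = [0.3, 1.2]

# Softmax turns raw logits into class probabilities.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Map the highest-probability index to the documented labels.
id2label = {0: "Human-written", 1: "Machine-generated (LLM)"}
pred = max(range(len(probs)), key=probs.__getitem__)

print(id2label[pred], round(probs[pred], 3))
```

With real inputs, the same softmax-and-argmax step is applied to `outputs.logits` from the model.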
|
|
---


## Supported Programming Languages


The model has been trained and evaluated on code written in:


- **C++**
- **Java**
- **Python**


It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.
|
|
---


## Training Summary


- **Training Objective:** Binary cross-entropy loss for classification
- **Tokenization:** CodeBERT tokenizer with fixed-length padding and truncation
- **Optimization:** Fine-tuned end-to-end from the CodeBERT checkpoint
- **Evaluation Metrics:** Accuracy, precision, recall, F1-score


The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
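
As a minimal sketch, the four evaluation metrics listed above can be computed from gold labels and predictions. The arrays here are toy values for illustration, not results reported for this model:

```python
# Hypothetical gold labels and model predictions (1 = machine-generated).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix counts for the positive (machine-generated) class.
tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```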
|
|
---


## Example Usage


```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

code_snippet = """
def add(a, b):
    return a + b
"""

inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)

# Inference only: disable gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"

print(label)
```