# CodeGenDetect-CodeBERT

**Model Name:** `azherali/CodeGenDetect-CodeBert`
**Task:** Code Generation Detection (Human vs. Machine-Generated Code)
**Languages Supported:** C++, Java, Python
**Base Model:** CodeBERT
**Author:** Azher Ali

---
|
|
## Model Overview
|
|
`CodeGenDetect-CodeBert` is a transformer-based classification model designed to distinguish **human-written code** from **machine-generated code** produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning **C++**, **Java**, and **Python**, making it suitable for real-world, cross-language code analysis tasks.


Built on top of **CodeBERT**, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.
|
|
---


## Intended Use Cases


This model is well-suited for:


- **Academic integrity & plagiarism detection**
- **LLM-generated code identification**
- **Code authenticity verification**
- **Research on AI-generated programming artifacts**
- **Code forensics and auditing pipelines**
|
|
---


## Model Details


- **Architecture:** Transformer-based (CodeBERT)
- **Task Type:** Binary sequence classification
- **Labels:**
  - `0` → Human-written code
  - `1` → Machine-generated (LLM) code
- **Input:** Source code as plain text
- **Output:** Class probabilities and predicted label
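
The mapping from raw classifier outputs to the labels above can be sketched without loading the model. The logits below are made-up placeholder values, not real model output; a real run would produce them from the classification head:

```python
import math

# Hypothetical logits for one snippet (index 0 = human, 1 = machine).
logits = [0.3, 1.2]

# Softmax turns raw logits into class probabilities.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Map the highest-probability index to the documented labels.
id2label = {0: "Human-written", 1: "Machine-generated (LLM)"}
pred = max(range(len(probs)), key=probs.__getitem__)

print(id2label[pred], round(probs[pred], 3))
```

With real inputs, the same softmax-and-argmax step is applied to `outputs.logits` from the model.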
|
|
---


## Supported Programming Languages


The model has been trained and evaluated on code written in:


- **C++**
- **Java**
- **Python**


It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.
|
|
---


## Training Summary


- **Training Objective:** Binary cross-entropy loss for classification
- **Tokenization:** CodeBERT tokenizer with fixed-length padding and truncation
- **Optimization:** Fine-tuned end-to-end from the CodeBERT checkpoint
- **Evaluation Metrics:** Accuracy, precision, recall, F1-score


The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
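
As a minimal sketch, the four evaluation metrics listed above can be computed from gold labels and predictions. The arrays here are toy values for illustration, not results reported for this model:

```python
# Hypothetical gold labels and model predictions (1 = machine-generated).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix counts for the positive (machine-generated) class.
tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```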
|
|
---


## Example Usage


```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

code_snippet = """
def add(a, b):
    return a + b
"""

inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)

# Inference only: disable gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"

print(label)
```