Thanks John, this is a great workflow, especially the part about having the model summarize what did and did not work, then feeding those summaries back in as context. That "what broke and what fixed it" record is basically the signal my dataset tries to capture, just at the dependency-upgrade level. Drift really is the quiet killer. Appreciate you sharing the prompt link too; I'll take a look.
Abhisek Behera PRO
Abhisek987
DepDoctor replied to their post 3 days ago
Every Python developer has hit this: you upgrade numpy or pandas, and code that worked yesterday breaks today.
I built an open dataset for exactly this problem. DepDoctor is 6,204 examples of Python code broken by a dependency upgrade, each paired with the fix and a short note on the API change that caused it. It is a mixture of real cases mined from public GitHub commits and synthetic cases generated from a database of known breaking changes.
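To make the "broken code, fix, note" pairing concrete, here is a minimal sketch of what one record could look like. The class name `UpgradeExample` and its field names are hypothetical and chosen only for illustration; they are not the dataset's actual schema. The two breaking changes used as sample data are real: the `np.float` alias was removed in NumPy 1.24, and `DataFrame.append` was removed in pandas 2.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpgradeExample:
    """Hypothetical record layout; field names are illustrative only."""

    library: str
    broken_code: str  # code as written against the old library version
    fixed_code: str   # code that works after the upgrade
    note: str         # short description of the API change

    @property
    def needs_change(self) -> bool:
        # A "leave it alone" example has identical input and output.
        return self.broken_code != self.fixed_code


numpy_case = UpgradeExample(
    library="numpy",
    broken_code="x = np.float(3)",
    fixed_code="x = float(3)",
    note="The np.float alias was removed in NumPy 1.24; use the builtin float.",
)

pandas_case = UpgradeExample(
    library="pandas",
    broken_code="df = df.append(row_df)",
    fixed_code="df = pd.concat([df, row_df])",
    note="DataFrame.append was removed in pandas 2.0; use pd.concat.",
)

# np.float64 is still valid, so the correct "fix" is to change nothing.
untouched_case = UpgradeExample(
    library="numpy",
    broken_code="x = np.float64(3)",
    fixed_code="x = np.float64(3)",
    note="np.float64 is unaffected by the alias removal.",
)

for case in (numpy_case, pandas_case, untouched_case):
    print(case.library, "needs change:", case.needs_change)
```

The third record shows why unchanged examples matter: a model that has only seen edits may rewrite `np.float64` just because it looks like the removed alias.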
A few things I tried to get right:
- 935 "leave it alone" examples, to teach a model restraint, not just what to change.
- Honest evaluation: a fine-tuned Qwen2.5-Coder-7B gets 62% of fixes fully correct. I report that, not just the 97% text-similarity score that hides the truth.
- The main failure mode, over-editing, is measured and explained rather than buried.
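A small sketch of why a text-similarity score can look far better than the fully-correct rate, and of one crude way to flag over-editing. This is an assumption about how such metrics could be computed, using the standard library's `difflib`; it is not the evaluation code behind the numbers above. The function names and the sample snippets are made up for illustration.

```python
import difflib


def exact_match(predicted: str, reference: str) -> bool:
    """Strict metric: the fix counts only if it equals the reference."""
    return predicted.strip() == reference.strip()


def text_similarity(predicted: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; forgiving of small differences."""
    return difflib.SequenceMatcher(None, predicted, reference).ratio()


def over_edited_lines(original: str, predicted: str, reference: str) -> list[str]:
    """Lines the reference fix left alone but the prediction no longer contains.

    A crude line-set comparison, meant only to illustrate the idea.
    """
    reference_lines = set(reference.splitlines())
    predicted_lines = set(predicted.splitlines())
    return [
        line
        for line in original.splitlines()
        if line in reference_lines and line not in predicted_lines
    ]


original = (
    "import numpy as np\n"
    "values = np.array([1, 2, 3], dtype=np.float)\n"
    "print(values.sum())\n"
)
reference = (
    "import numpy as np\n"
    "values = np.array([1, 2, 3], dtype=float)\n"
    "print(values.sum())\n"
)
# The prediction applies the right fix but also rewrites an unrelated line.
predicted = (
    "import numpy as np\n"
    "values = np.array([1, 2, 3], dtype=float)\n"
    "print(values.mean())\n"
)

print("exact match:", exact_match(predicted, reference))
print("similarity above 0.9:", text_similarity(predicted, reference) > 0.9)
print("over-edited lines:", over_edited_lines(original, predicted, reference))
```

Here the prediction differs from the reference by a single method name, so the similarity stays above 0.9 while the program's behavior changes. That gap is what the exact-match number exposes.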
Dataset, fine-tuned model, and a live demo are all open in one place:
https://huggingface.co/collections/Abhisek987/depdoctor
Feedback welcome, especially from anyone working on code repair or API migration.