OpenBugger: a tool for injecting bugs into Python scripts

Name of Project: OpenBugger: a tool for injecting bugs into Python scripts

Proposal in one sentence: Create a tool that automatically injects bugs into working code and synthetically generates debugging instructions to tune OpenAssistant.

Description of the project and what problem it is solving:
The OpenBugger project (GitHub - furlat/OpenBugger: Code to create bugged python scripts for OpenAssistant Training, maintained by https://twitter.com/Cyndesama) aims at generating a large amount of synthetic conversations about debugging code between a user and an assistant. This data will be used as part of the Open-Assistant training set.
To achieve this, the bugger starts from a working code snippet and injects it with errors of several different types (currently ~30 syntax and logic error types). Finally, the dyads of working and bugged code are used to generate a conversation:
e.g.: User: "Is this code snippet correct? {bugged_code, bug_type, num_bugs}"
Assistant: "No, the code snippet is not correct, it has {num_bugs} errors of {bug_type}."
User: "Can you fix the code?"
Assistant: "Sure, I will be glad to do it: {original_code}"
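
To make the mapping concrete, here is a minimal sketch of how such a dyad could be turned into a conversation. This is an illustration only: `dyad_to_conversation` and the message dictionaries are assumptions for this example, not the actual OpenBugger script or the Open-Assistant data schema.

```python
# Hypothetical sketch: turn a (code, bugged_code) dyad plus bug metadata
# into the user/assistant exchange shown above.
def dyad_to_conversation(original_code, bugged_code, bug_type, num_bugs):
    return [
        {"role": "user",
         "text": f"Is this code snippet correct?\n{bugged_code}"},
        {"role": "assistant",
         "text": f"No, the code snippet is not correct, "
                 f"it has {num_bugs} errors of {bug_type}."},
        {"role": "user", "text": "Can you fix the code?"},
        {"role": "assistant",
         "text": f"Sure, I will be glad to do it:\n{original_code}"},
    ]
```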

Grant Deliverables:

Python package at GitHub - furlat/OpenBugger: Code to create bugged python scripts for OpenAssistant Training, maintained by https://twitter.com/Cyndesama, that will be used for OpenAssistant (a demo version is already in the main branch; see Create openbugger_example.ipynb by furlat · Pull Request #418 · LAION-AI/Open-Assistant · GitHub).
The main deliverable is extending the current code to include Runtime errors, Type errors, Name errors, Import errors, and Indentation errors, as well as the scripts to map from tuples of (code, bugged_code) to conversations.
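
As a rough illustration of one of the planned error classes, a Name error could be injected along these lines; `inject_name_error` is a hypothetical helper written for this example, not part of the current package.

```python
import re

def inject_name_error(code: str, name: str) -> str:
    # Hypothetical helper: misspell the first occurrence of `name` so that
    # later uses of the original identifier raise a NameError at runtime.
    return re.sub(rf"\b{name}\b", name + "_typo", code, count=1)

original = "total = 0\nfor n in range(10):\n    total += n\nprint(total)\n"
print(inject_name_error(original, "total"))
# 'total = 0' becomes 'total_typo = 0', so 'total += n' raises NameError
```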

Squad:

Name: Tommaso Furlanello
Twitter handle: @cyndesama
Discord handle: iriden#1609


Interesting. Is Open Assistant already at the stage where you can do chat-bot-like functionality, or do you just plan to generate the dataset while waiting for that?

Hi, thanks for the interest! First of all, a disclaimer: I am just an enthusiast contributor to OpenAssistant and not an organizer, so I can't really speak for the whole project.

The path from pre-trained language model to chat-bot is roughly: fine-tune on a large collection of Q/A and User/Assistant data → use the fine-tuned model to propose multiple answers and let a human evaluate them → use this evaluation to train a critic (in the RL sense) and use it to create a reward signal to further fine-tune the network with policy gradients. Regarding OpenAssistant, we are currently merging multiple Q/A and User/Assistant datasets as well as preparing an interface to collect human data; in parallel, we are producing multiple synthetic data sources like OpenBugger and other projects that extract Q/A from YouTube podcasts, wikiHow, and many others. Other groups are setting up the pipelines and infrastructure for training each of the steps.

The plan for OpenBugger is to work as a data-augmentation tool that enriches whatever code data we can get with negative bugged examples, aligning the bot towards being really good at spotting and correcting bugs.


Have you seen this project? It seems very similar, or maybe it could assist you with making bad scripts.


Whoa, unfortunately they stopped maintaining it after Python 2, but it could almost become adversarial training for code generators: while the model tries to write code that runs, pyringe could attempt minimal changes to it to try and break it.

My approach for OpenBugger is currently much simpler and easier to maintain: it just uses regular expressions and the inspect module. Furthermore, I suggest not running the bugged code outside of Docker or other containers, because some scripts create memory leaks or infinite loops.
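
For a rough idea of what that looks like, here is a minimal sketch of the regex-plus-inspect approach, assuming a logic bug injected by flipping a comparison; this is illustrative only, not the actual OpenBugger internals.

```python
import inspect
import re

def is_zero(x):
    if x == 0:
        return True
    return False

# Grab the function's source with inspect, then inject a logic bug by
# flipping the first equality comparison it finds.
source = inspect.getsource(is_zero)
bugged, num_bugs = re.subn(r"==", "!=", source, count=1)
print(bugged)    # 'if x == 0' is now 'if x != 0'
print(num_bugs)  # 1 -- number of injections applied
```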


Title: “Revolutionizing Code Debugging and Refactoring with OpenBugger: An Open-Source Python Library for Self-Supervised Training of AI Debuggers”

Audience Level: Developers interested in cutting-edge AI-assisted programming and self-supervised machine learning, as well as Python experts who want to automate large-scale code debugging and refactoring.

Brief Description: OpenBugger is an open-source library that uses the power of LibCST to automatically generate bugs in Python code, providing a valuable tool for training AI models for code generation without direct human supervision.

Abstract/Summary:

The OpenAssistant project is an ambitious effort to develop a Large Language Model assistant with capabilities comparable to ChatGPT, with a strong focus on programming and accessing external resources. One of the main challenges of this project is the scarcity of publicly available datasets for code debugging and refactoring. OpenBugger addresses this issue by providing a framework for the automatic generation of bugged Python code, which can be used to train AI models in a self-supervised manner.

OpenBugger uses the power of LibCST to perform reproducible edit-operations on entire codebases. The library generates code with a set of invertible edit operators, called Mutators, that can be chained together to create complex bug patterns. Example mutators range from simple syntax bugs that completely halt code execution, to more subtle changes in logic that introduce infinite loops and memory overflows. The library also provides tools to automatically debug the generated code by storing a trace of the edit operations performed, such that step-by-step debugging instructions can always be recovered.
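
As an illustration of the idea, here is a minimal sketch of what such a Mutator could look like on top of LibCST's CSTTransformer API. The class name, the trace format, and the '==' → '!=' edit are assumptions made for this example, not the actual OpenBugger Mutator interface.

```python
import libcst as cst

class SwapEqualityMutator(cst.CSTTransformer):
    """Hypothetical Mutator: turn '==' into '!=' and record the edit."""

    def __init__(self):
        super().__init__()
        self.trace = []  # edits applied, so the bug can be inverted later

    def leave_ComparisonTarget(self, original_node, updated_node):
        if isinstance(updated_node.operator, cst.Equal):
            self.trace.append(("Equal", "NotEqual"))
            return updated_node.with_changes(operator=cst.NotEqual())
        return updated_node

source = "def is_zero(x):\n    return x == 0\n"
module = cst.parse_module(source)
mutator = SwapEqualityMutator()
bugged = module.visit(mutator)
print(bugged.code)    # the '==' has become '!='
print(mutator.trace)  # [('Equal', 'NotEqual')] -- enough to recover the fix
```

Chaining several such transformers over the same tree, while appending to a shared trace, is the kind of composition the Mutator design describes: the stored trace makes each bug reversible and lets step-by-step debugging instructions be generated from the edit history.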

By the end of this talk, attendees will have a clear understanding of the necessary data sources to train powerful Python code generators, as well as how to create them with OpenBugger.