Automated Repair Techniques
Qualitative evaluation without user study
Using DBGBench, we can evaluate the correctness of the auto-generated patches. We provide the patches generated by the participants, elicit the general fix strategies, and classify each patch as correct or incorrect. For each incorrect patch, we provide a rationale as to why we classify it as incorrect. For some incorrect patches, we even provide test cases that demonstrate their incorrectness.
- To conduct a plausibility check, you could execute the existing test suite and the previously failing test cases and confirm that all test cases pass. Concrete steps are provided here. For GNU findutils, the developers have constructed a test suite in excess of 1,000 test cases. However, plausible patches (i.e., those passing the test suite) are not necessarily also correct (i.e., those passing a code review).
- Hence, to conduct an additional regression check, you could also execute the additional test cases that we provide for the participant-provided, incorrect patches. Executing this extended test suite lets you test for common mistakes made when auto-generating a patch.
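The two checks above can be sketched as a small script. This is a minimal sketch, not DBGBench's own tooling: the function names `run_tests` and `classify_patch` are hypothetical, and we assume that each test case is a command that exits with status 0 when it passes.

```python
import subprocess
from typing import List, Sequence


def run_tests(commands: Sequence[Sequence[str]], timeout: float = 60.0) -> List[bool]:
    """Run each test command; a test passes if it exits with status 0.

    Assumption: every test case can be invoked as a standalone command.
    """
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(list(cmd), capture_output=True, timeout=timeout)
            results.append(proc.returncode == 0)
        except subprocess.TimeoutExpired:
            # A test that hangs is counted as failing.
            results.append(False)
    return results


def classify_patch(
    existing_suite: Sequence[Sequence[str]],
    previously_failing: Sequence[Sequence[str]],
    extra_regression: Sequence[Sequence[str]],
) -> str:
    """Apply the plausibility check, then the additional regression check."""
    # Plausibility check: existing suite plus previously failing tests all pass.
    plausibility_tests = list(existing_suite) + list(previously_failing)
    if not all(run_tests(plausibility_tests)):
        return "implausible"
    # Regression check: test cases derived from incorrect participant patches.
    if not all(run_tests(extra_regression)):
        return "plausible, fails regression check"
    return "plausible, passes regression check"
```

A patch that passes both checks is still only plausible with respect to the extended test suite; deciding whether it is correct remains a matter of code review.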
Qualitative evaluation with user study
Using DBGBench, you can significantly reduce the time and effort required for user studies (e.g., the manual review of auto-generated patches). Users can leverage the bug report, the bug diagnosis, fault locations, simplified and extended regression test cases, and developer-provided patches to make the call.
- User studies can be used to evaluate patch acceptability, i.e., whether users would accept the auto-generated patches. For instance, Kim et al. i) asked several students to rate the acceptability of patches generated by GenProg, Par, and a developer for five bugs on a 3-point Likert scale, and ii) asked several students and practitioners to choose the more acceptable patch from a pair of human-generated and auto-generated patches. Similar user studies can leverage DBGBench to extract the correct/incorrect patches provided by professional software engineers.
- DBGBench includes the time taken by professional software engineers to fix real-world software bugs. We can use this timing information to evaluate the usefulness of an automated program repair tool. To this end, we can design an experiment involving several software professionals and measure the reduction in debugging time when they use an automated program repair tool.
DBGBench is a first milestone towards the realistic evaluation of tools in software engineering that is grounded in practice. DBGBench can thus be used as a necessary reality check for in-depth studies. For now, we strongly suggest also using other benchmarks, such as CoREBench or Defects4J, for the empirical evaluation. Going forward, we hope that more researchers will produce similar realistic benchmarks that take the practitioner into account. To this end, we also publish our battle-tested, formal experiment procedure, effective strategies to mitigate common pitfalls, and all our material.
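One way to quantify the reduction in debugging time for a single bug is to compare median fix times with and without the tool. The following is a minimal sketch under our own assumptions: the function name `relative_reduction` is hypothetical, times are given in minutes, and the median is our choice of summary statistic rather than a prescribed part of the experiment procedure.

```python
from statistics import median
from typing import Sequence


def relative_reduction(
    baseline_minutes: Sequence[float],
    assisted_minutes: Sequence[float],
) -> float:
    """Relative reduction in median debugging time for one bug.

    baseline_minutes: fix times without the repair tool.
    assisted_minutes: fix times measured while using the repair tool.
    Returns a fraction; a negative value means the tool slowed debugging.
    """
    if not baseline_minutes or not assisted_minutes:
        raise ValueError("both groups need at least one measurement")
    base = median(baseline_minutes)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (base - median(assisted_minutes)) / base
```

With only a handful of participants per bug, such a point estimate should be reported together with the raw times, and a statistical test is needed before claiming that the reduction is significant.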