IBM CodeNet dataset aims to train AI to tackle programming challenges
IBM introduced Project CodeNet, which the company claims is the largest open-source dataset for benchmarking AI systems that work with code. Consisting of 14 million code samples, 500 million lines of code, and 55 programming languages including C++, Java, Python, Go, COBOL, Pascal, and FORTRAN, CodeNet is approximately 10 times larger than the next most similar dataset, which has 52,000 samples.
According to a study from the University of Cambridge’s Judge Business School, programmers spend roughly half of their work time (50.1%) debugging rather than writing new code, at a total estimated cost of $312 billion per year. AI-powered code suggestion and review tools, then, promise to cut development costs substantially while enabling coders to focus on more creative, less repetitive tasks.
CodeNet focuses specifically on the problems of code translation, code similarity, and code constraints. The goal is to advance the development of AI systems that can automatically translate code into another programming language, identify overlaps and similarities between different sets of code, and customize constraints based on a developer’s specific needs and parameters.
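To make the code-similarity task concrete, here is a deliberately naive baseline: a token-overlap (Jaccard) measure between two code snippets. This is a minimal sketch for illustration only, far simpler than the learned models CodeNet is meant to train; the tokenizer regex and function names are assumptions, not anything from the dataset.

```python
import re

def tokens(code: str) -> set:
    """Split code into a set of identifier, number, and punctuation tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens over total distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Two near-clones that differ only in variable naming:
print(jaccard("x = x + 1", "y = y + 1"))  # → 0.6
```

A real clone detector would need to be robust to renamed variables and reordered statements, which is exactly where learned representations trained on a corpus like CodeNet come in.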
Programming language translation could be especially useful, given that migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages. For example, the Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java. Transcompilers could help in theory, since they eliminate the need to rewrite code from scratch, but they’re difficult to build in practice because different languages have different syntaxes and rely on distinctive platform APIs, standard-library functions, and variable types.
The CodeNet dataset
CodeNet contains samples designed to train AI to complete a range of programming tasks, including code search and clone detection. Beyond this, the dataset includes metadata and annotations spanning code size, memory footprint, CPU run time, and submission status, which help distinguish correct code from problematic code.
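Metadata like this lends itself to simple filtering before any model training. The sketch below shows the idea; the column names (`status`, `cpu_time`, `code_size`) and values are assumptions based on the fields described above, not the dataset’s exact schema.

```python
import csv
import io

# Toy stand-in for a CodeNet-style metadata file (schema is illustrative).
metadata_csv = """submission_id,language,status,cpu_time,code_size
s001,Python,Accepted,120,342
s002,C++,Runtime Error,15,510
s003,Java,Accepted,95,890
"""

def accepted_submissions(text: str) -> list:
    """Keep only rows whose status marks the code as correct."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["status"] == "Accepted"]

ok = accepted_submissions(metadata_csv)
print([row["submission_id"] for row in ok])  # → ['s001', 's003']
```

Separating accepted from failing submissions this way is what lets a dataset serve as labeled training data for distinguishing correct code from problematic code.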
Over 90% of the sample problems in CodeNet come with descriptions that contain a problem statement and specifications of the input and output format. For over half of the problems, covering seven million code samples, IBM also curated sample inputs and outputs from the problem descriptions.
Using CodeNet, data scientists can execute code samples to extract additional metadata and verify outputs from generative AI models for correctness. IBM says that this will enable researchers to program “intent equivalence” when translating one programming language into another.
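As a toy illustration of that verification loop, one could execute a submission against a problem’s sample input and compare its stdout to the expected output. The file handling and function name below are a minimal sketch, not the dataset’s actual tooling:

```python
import os
import subprocess
import sys
import tempfile

def run_sample(source: str, sample_input: str) -> str:
    """Run a Python submission with the sample input and return its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=10,  # guard against non-terminating submissions
        )
        return result.stdout
    finally:
        os.unlink(path)

# A toy "accepted" submission: print the sum of two integers.
submission = "a, b = map(int, input().split())\nprint(a + b)\n"
assert run_sample(submission, "2 3\n").strip() == "5"
```

Checking a model-generated translation against the original program’s sample outputs in this way is one concrete route to the “intent equivalence” IBM describes.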
“Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision,” Ruchir Puri, IBM fellow and chief scientist at IBM Research, wrote in a blog post.