Hugging Face and ServiceNow are releasing a free code-generating model


AI startup Hugging Face and ServiceNow Research, the R&D division of ServiceNow, have released StarCoder, a free code-generating AI system along the lines of GitHub’s Copilot.

Code-generating systems such as DeepMind’s AlphaCode, Amazon’s CodeWhisperer, and OpenAI’s Codex, which powers Copilot, offer a tantalizing glimpse of what’s possible with AI within the realm of computer programming. Assuming the ethical, technical, and legal issues are resolved one day (and AI-powered coding tools don’t cause more bugs and vulnerabilities than they fix), they could significantly reduce development costs while allowing programmers to focus on more creative tasks.

According to a study by the University of Cambridge, at least half of developers’ efforts are spent on debugging, not active programming, costing the software industry an estimated $312 billion a year. But so far only a handful of code-generating AI systems have been made freely available to the public – a reflection of the commercial incentives of the organizations building them (see: Replit).

In contrast, StarCoder, which is licensed to allow royalty-free use by anyone, including businesses, was trained on more than 80 programming languages as well as text from GitHub repositories, including documentation and programming notebooks. StarCoder integrates with Microsoft’s Visual Studio Code editor and, like OpenAI’s ChatGPT, can follow basic instructions (e.g. “Create an app UI”) and answer questions about code.
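As a sketch of what using the model looks like in practice, the snippet below loads StarCoder through the Hugging Face `transformers` library and asks it to continue a code prompt. The checkpoint name `bigcode/starcoder` is the one published on the Hugging Face Hub; downloading it requires accepting the model license and substantial disk space, so the call is guarded behind `__main__`.

```python
# Minimal sketch of prompting StarCoder for code completion via the
# Hugging Face `transformers` library. The checkpoint id below is the
# publicly hosted one; the prompt and generation settings are illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

CHECKPOINT = "bigcode/starcoder"

def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Return the model's continuation of `prompt`."""
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Heavy: downloads the 15B-parameter model on first run.
    print(complete("def fibonacci(n):"))
```

The same checkpoint also powers the VS Code integration mentioned above, where the editor sends the surrounding file contents as the prompt.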

Leandro von Werra, a machine learning engineer at Hugging Face and a co-lead of StarCoder, claims that StarCoder matches or outperforms the OpenAI AI model used to power early versions of Copilot.

“One thing we learned from releases like Stable Diffusion last year is the creativity and capabilities of the open source community,” von Werra told AapkaDost in an email interview. “Within weeks of release, the community had built dozens of variants of the model, as well as custom applications. By releasing a powerful code generation model, anyone can refine and adapt it to their own use cases, and countless downstream applications will become possible.”

Building a model

StarCoder is part of Hugging Face and ServiceNow’s 600-plus-person BigCode project, launched late last year, which aims to develop “state-of-the-art” AI systems for code in an open and responsible way. ServiceNow provided an in-house compute cluster of 512 Nvidia V100 GPUs to train the StarCoder model.

Several BigCode working groups focus on sub-topics such as collecting datasets, implementing methods for training code models, developing an evaluation suite, and discussing ethical best practices. For example, the Legal, Ethics and Governance working group examined questions about data licensing, attribution of generated code to original code, redacting personally identifiable information (PII), and the risks of running malicious code.

Inspired by Hugging Face’s previous efforts to open source advanced text-generating systems, BigCode seeks to address some of the controversies that arise around the practice of AI-powered code generation. The nonprofit Software Freedom Conservancy, among others, has criticized GitHub and OpenAI for using public source code, not all of which is under a permissive license, to train and monetize Codex. Codex is available through OpenAI and Microsoft’s paid APIs, while GitHub recently started charging for access to Copilot.

For their part, GitHub and OpenAI claim that Codex and Copilot — protected by fair use doctrine, at least in the US — do not violate licensing agreements.

“Releasing a capable code-generating system can serve as a research platform for institutions that are interested in the topic but lack the necessary resources or know-how to train such models,” von Werra said. “We believe that in the long run this will lead to fruitful research into security, capabilities and limits of code-generating systems.”

Unlike Copilot’s underlying model, the 15-billion-parameter StarCoder was trained over the course of several days on an open source dataset called The Stack, which comprises more than 19 million curated, permissively licensed repositories and more than six terabytes of code in over 350 programming languages. In machine learning, parameters are the parts of an AI system learned from historical training data; they essentially determine the system’s proficiency at a problem such as code generation.
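To make the term "parameters" concrete with a deliberately toy example (nothing to do with StarCoder's actual architecture): even a one-neuron model y = w*x + b has two parameters, w and b, which are fitted from data rather than hand-written. StarCoder simply has roughly 15 billion such adjustable numbers.

```python
# Toy illustration of "parameters learned from training data": fitting
# y = w*x + b to example points with plain gradient descent. The two
# values w and b are the model's parameters.
def fit_line(points, steps=5000, lr=0.01):
    """Fit y = w*x + b to (x, y) pairs by minimizing squared error."""
    w, b = 0.0, 0.0  # parameters start arbitrary; training adjusts them
    n = len(points)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data drawn from y = 2x + 1, so training should recover w≈2, b≈1.
w, b = fit_line([(0, 1), (1, 3), (2, 5)])
```

The same principle scales up: a large language model's billions of parameters are adjusted the same way, by nudging them to reduce prediction error over the training corpus.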

The Stack

A graphical representation of the contents of The Stack dataset. Image Credits: BigCode

Because it is permissively licensed, code from The Stack can be copied, modified, and redistributed. But the BigCode project also offers developers a way to opt out of The Stack, similar to efforts elsewhere to have artists remove their work from text-to-image AI training datasets.

The BigCode team has also been working to remove PII from The Stack, such as names, usernames, email and IP addresses, and keys and passwords. They created a separate dataset of 12,000 files containing PII, which they plan to release to researchers through gated access.
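As a rough illustration of what pattern-based PII redaction looks like, the sketch below replaces emails, IP addresses, and long opaque tokens with placeholder labels. The patterns and placeholders here are hypothetical and much cruder than BigCode's actual pipeline, which is not described in detail in the article.

```python
# Hypothetical regex-based PII redaction sketch, in the spirit of (but not
# identical to) what a dataset-cleaning pipeline might do. Patterns and
# placeholder tokens are illustrative only.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "KEY": re.compile(r"\b[A-Za-z0-9_\-]{32,}\b"),  # long opaque tokens
}

def redact(text: str) -> str:
    """Replace matches of each pattern with a <LABEL> placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("contact: dev@example.com from 192.168.0.1"))
# -> contact: <EMAIL> from <IP>
```

Real pipelines at this scale typically combine regexes like these with trained named-entity models, since simple patterns both miss PII and over-match ordinary code.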

In addition, the BigCode team used Hugging Face’s malicious code detection tool to remove files from The Stack that could be considered “unsafe,” such as files with known exploits.

The privacy and security issues with generative AI systems, which are trained for the most part on relatively unfiltered data from the internet, are well established. ChatGPT once voluntarily provided a journalist’s phone number. And GitHub has acknowledged that Copilot can surface keys, credentials, and passwords seen in its training data in novel strings.

“Code is one of the most sensitive intellectual properties for most companies,” von Werra said. “Particularly sharing it outside of their infrastructure is a huge challenge.”

According to him, some legal experts have argued that code-generating AI systems could put companies at risk if they unwittingly incorporate copyrighted or sensitive text from the tools into their production software. As Elaine Atwell points out in a piece on Kolide’s company blog, because systems like Copilot strip code of its licenses, it’s difficult to tell which code is permissible to use and which may have incompatible terms of use.

In response to the criticism, GitHub added a toggle that lets customers prevent suggested code that matches public, potentially copyrighted content on GitHub from appearing. Following suit, Amazon has CodeWhisperer highlight, and optionally filter, the license associated with functions it suggests that resemble snippets found in its training data.

Commercial motives

So what does ServiceNow, a company that deals primarily in enterprise workflow automation software, get out of this? A “high-performing model and a responsible AI model license that permits commercial use,” said Harm de Vries, lead of the Large Language Model Lab at ServiceNow Research and co-lead of the BigCode project.

One imagines that ServiceNow will eventually build StarCoder into its commercial products. The company wouldn’t reveal how much, in dollars, it has invested in the BigCode project, other than to say that the amount of donated compute was “significant.”

“The Large Language Models Lab at ServiceNow Research is building expertise in the responsible development of generative AI models to ensure the safe and ethical deployment of these powerful models for our customers,” said de Vries. “BigCode’s open-scientific research approach provides ServiceNow developers and customers with complete transparency about how everything was developed and demonstrates ServiceNow’s commitment to making socially responsible contributions to the community.”

StarCoder is not open source in the strict sense. Rather, it is released under a licensing scheme, OpenRAIL-M, which contains “legally enforceable” usage restrictions that derivatives of the model — and apps using the model — must comply with.

For example, StarCoder users must agree not to use the model to generate or distribute malicious code. While real-world examples are scarce (at least for now), researchers have shown how AI like StarCoder can be used in malware to evade basic forms of detection.

Whether developers actually honor the terms of the license remains to be seen. Legal threats aside, there is nothing at a basic technical level to stop them from ignoring the terms for their own purposes.

That’s what happened with the aforementioned Stable Diffusion, whose similarly restrictive license was ignored by developers who used the generative AI model to create deepfaked photos of celebrities.

But the possibility hasn’t discouraged von Werra, who believes the drawbacks of not releasing StarCoder outweigh the risks of releasing it.

“At launch, StarCoder won’t provide as many features as GitHub Copilot, but with its open-source nature, the community can help improve it and integrate custom models along the way,” he said.

The StarCoder code repositories, model training framework, dataset filtering methods, code evaluation suite, and research analysis notebooks are available on GitHub as of this week. The BigCode project will maintain them going forward, as the working groups pursue more capable code-generating models fueled by community input.

There is certainly work to be done. In the technical paper accompanying the StarCoder release, Hugging Face and ServiceNow say the model potentially produces inaccurate, offensive, and misleading content, as well as PII and malicious code that managed to get past the dataset filtering stage.
