PolyCoder, an open source code-generating AI that could outperform Codex


We have recently seen an increase in solutions for code generation using artificial intelligence (AI), as the field of natural language processing (NLP) has paved the way for a series of code-generating AIs in various programming languages.

Notable examples include GitHub Copilot, AlphaCode, and Codex, and to these we can now add a new solution from researchers at Carnegie Mellon University, who recently introduced "PolyCoder", a code generator based on OpenAI's GPT-2 language model that was trained on a 249 GB database of code in 12 programming languages.

About PolyCoder

The authors of PolyCoder claim that it can write C more accurately than any known model, including Codex.

Code-generating AI can write source code in different programming languages right out of the box, promising to lower software development costs while allowing developers to focus on creative, less repetitive tasks.

PolyCoder was trained on data from various GitHub repositories, covering 12 popular programming languages: C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala, and TypeScript.

The unfiltered data set totaled 631 GB of data and 38.9 million files. The team said they chose to base PolyCoder on GPT-2 due to budget constraints. PolyCoder is available as open source, and the researchers hope it can democratize research in the field of AI code generation, which until now has been dominated by well-funded companies.
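Going from a 631 GB raw crawl to a 249 GB training set implies a filtering pass over the collected files. A minimal sketch of that kind of preprocessing (the extension list matches the 12 languages above, but the size threshold and helper names are illustrative assumptions, not PolyCoder's actual filters):

```python
# Minimal sketch of filtering a raw code crawl before training.
# The size limit and example files are illustrative, not from the PolyCoder paper.

ALLOWED_EXTENSIONS = {".c", ".cs", ".cpp", ".go", ".java", ".js",
                      ".php", ".py", ".rb", ".rs", ".scala", ".ts"}
MAX_FILE_BYTES = 1_000_000  # drop very large files (often generated/minified code)

def keep_file(path: str, size_bytes: int) -> bool:
    """Return True if a file passes the extension and size filters."""
    ext = "." + path.rsplit(".", 1)[-1] if "." in path else ""
    return ext in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_FILE_BYTES

files = [
    ("main.c", 2_048),             # kept: C source, reasonable size
    ("bundle.min.js", 5_000_000),  # dropped: too large, likely minified
    ("README.md", 1_024),          # dropped: not one of the 12 languages
]
kept = [path for path, size in files if keep_file(path, size)]
print(kept)  # ['main.c']
```

Real pipelines typically also deduplicate near-identical files across repositories, since duplicates inflate the data without teaching the model anything new.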

The researchers believe that PolyCoder works better than other models at generating code in the C language, although Codex still outdoes it in other languages. "PolyCoder dramatically outperforms Codex and all other models in the C language," the team wrote.

“When Copilot came out on GitHub last summer, it became clear that these very large language code models can be very useful in helping developers and increasing their productivity. But no model even close to that scale was publicly available," the researchers told VentureBeat by email. “So [PolyCoder] started with Vincent trying to figure out what was the largest model that could be trained on our lab server, which ended up being 2.7 billion parameters… and that model was a league ahead of other code-oriented models that were publicly available at the time.”
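The "largest model that fits on one lab server" framing comes down to simple arithmetic: parameter count times bytes per parameter, plus gradients and optimizer state. A back-of-the-envelope sketch (the fp16/fp32 mixed-precision Adam layout here is a common setup assumed for illustration, not a detail reported in the article):

```python
# Back-of-the-envelope memory estimate for training a 2.7-billion-parameter model.
# Assumes fp16 weights and gradients plus fp32 Adam state (master copy + two moments),
# which is a typical mixed-precision configuration, not PolyCoder's documented one.

params = 2.7e9

weights_gb   = params * 2 / 1e9      # fp16 weights: 2 bytes each
grads_gb     = params * 2 / 1e9      # fp16 gradients: 2 bytes each
optimizer_gb = params * 4 * 3 / 1e9  # fp32 master weights + Adam m and v: 12 bytes each

total_gb = weights_gb + grads_gb + optimizer_gb
print(f"~{total_gb:.0f} GB before activations")
```

Even before counting activation memory, a 2.7B-parameter model sits near the ceiling of a single multi-GPU server, which is why anything much larger forces a move to a cluster.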

"When comparing only open source models, PolyCoder outperforms the similarly sized GPT-Neo 2.7B model in C, JavaScript, Rust, Scala, and TypeScript," they point out. "In the other 11 languages, all other open source models, including our own, are significantly worse (higher perplexity) than Codex," the CMU researchers added.

With this, PolyCoder is positioned as a very interesting solution: while research laboratories such as Elon Musk's OpenAI and Alphabet's DeepMind have developed powerful code-generating AI, many of the most successful systems are not available as open source. Lower-resourced companies and labs do not have access to them, which limits their research in the field.

For example, the training data for OpenAI's Codex, which powers GitHub's Copilot feature, has not been made public, preventing researchers from fine-tuning the AI model or studying certain aspects of it, such as interpretability.

"Big tech companies are not publicly releasing their models, which is really holding back scientific research and the democratization of such large language code models," the researchers said. “To some extent, we hope that our open source efforts will convince others to do the same. But the big picture is that the community should be able to train these models on their own. Our model pushed the limit of what you can train on a single server – anything larger requires a pool of servers, which dramatically increases the cost.”

Finally, if you are interested in learning more, you can check the details at the following link.
