Should GitHub Be Sued For Training Copilot On GPL Code?

Just yesterday, GitHub announced that it is working on a new feature for its platform called “Copilot”: an AI coding assistant that predicts the next chunk of code a programmer may want to write while developing software, and offers to insert it at just the right time and place.

Figure: Copilot working in practice. The highlighted part is the code suggested by Copilot.

The technology, a bleeding-edge application of deep learning and neural networks, was trained using the public repositories published on GitHub. Training a neural network model means that you take the data (the source code of these repositories, in our case) and feed it to the network, so that it can learn what to do in similar future cases.

Copilot has seen billions of lines of code, functions, classes and object definitions before, and hence can suggest the next steps whenever it has enough information about what the programmer wants to do.
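To make the idea of "learning from existing code to predict what comes next" more concrete, here is a deliberately tiny sketch in Python. It is not how Copilot works internally (Copilot uses a large neural network, not simple counting); it only illustrates the general principle of a model that suggests whatever most often followed the current input in its training data. The function names `train` and `suggest` and the three-line corpus are made up for this illustration.

```python
from collections import Counter, defaultdict


def train(corpus_lines):
    """Count which token follows each token in the training lines."""
    follows = defaultdict(Counter)
    for line in corpus_lines:
        tokens = line.split()
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def suggest(follows, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


# Hypothetical "training data": a few lines of code the model has seen.
corpus = [
    "for i in range(10):",
    "for item in items:",
    "for i in range(n):",
]
model = train(corpus)

print(suggest(model, "for"))    # prints: i
print(suggest(model, "while"))  # prints: None
```

Even in this toy version, everything the model can ever suggest comes directly from the code it was trained on, which is exactly the point the licensing debate revolves around.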

However, this brought a large issue into debate: many of these public repositories are licensed under the GPL and other copyleft licenses such as the AGPL, or under permissive licenses such as MIT, so is this process legal? Is it OK for GitHub to use free software source code to train its proprietary, paid and commercial service?

Different opinions emerged in the open source community.

Table of Contents:

The Conservatives
The Rationalists
The Don’t Cares
Conclusion

The Conservatives

Some open source software developers argued that the resulting neural network is a derivative work of the GPL-licensed code, and hence must be released under the GPL as well.

GitHub’s current CEO said that, from their point of view, this falls under “fair use”, which implies that using a few lines of modified code from public source code is not enough grounds for a lawsuit against them:

In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.

We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!
— Nat Friedman (@natfriedman) June 29, 2021

However, others argue that the neural network sometimes outputs (about 0.1% of the time, on average) snippets copied verbatim from various repositories on GitHub, and hence it cannot fall under fair use:

github copilot has, by their own admission, been trained on mountains of gpl code, so i'm unclear on how it's not a form of laundering open source code into commercial works. the handwave of "it usually doesn't reproduce exact chunks" is not very satisfying pic.twitter.com/IzqtK2kGGo
— eevee (@eevee) June 30, 2021

Moreover, open source developers are already suffering from burnout because gigantic multi-billion-dollar corporations take their free code and rebundle it as SaaS offerings; this new feature takes even more from them than before.

The Rationalists

Others argued that, just like a human who reads various books, tutorials and software source code to understand how software development works and doesn’t need to cite the materials he/she learned from, a neural network is not obligated to do that either.

“What is the difference between this and someone doing it manually? Is it just because the AI can do it faster and with more data that the AI should not do it while humans can?” various users argued on Twitter and Reddit.

Those in the first camp, however, see that as naive thinking: neural networks rely purely on probabilistic methods to decide which code snippets to suggest, and do not actually understand what they are doing or what the right way of fitting a code snippet into the new software would be:

"but eevee, humans also learn by reading open source code, so isn't that the same thing"
– no
– humans are capable of abstract understanding and have a breadth of other knowledge to draw from
– statistical models do not
– you have fallen for marketing
— eevee (@eevee) June 30, 2021

The Don’t Cares

Others may think about it differently: let’s put copyright aside.

Training AI models has proven to have many useful applications for humanity. Whether the training data is publicly available or protected by copyright laws… isn’t actually what matters. What matters is how we, as a human race, can build useful and good AI models that help us in our everyday lives.

Copilot does help the programmer in his/her everyday life.

One could argue, of course, that it is a commercial service that feasts on public free software (as in freedom) given for free (as in free coffee). However, there is nothing that prevents anyone else from doing the same for free. Standing on the same ground, anyone can take the same public repositories and train a large model to suggest the next lines of code, just like Copilot does.

Then, you can offer the source code, data and model for free however you like.

Just because they did it before you and offered it for a price tag doesn’t mean that they are wrong.

If anyone can train his/her AI model on any publicly accessible database, then that is a good thing that should be encouraged and supported, because it means everyone will have access to the same opportunities to unlock the next step in technology. Training AI models on various types of data – by anyone – is crucial for the advancement of our race.

Preventing GitHub from doing it will not help the free software community or the general momentum of technological progress. Instead, it would just slow the development of the human race for a bit while some workarounds get created.

That’s why we think that, regardless of whether US courts see it as fair use or not, it is ethically OK to use data that is publicly available to everyone to train a computational model that provides a service to users, whether for free or for profit. Since this data is normally accessible to the everyday end user, there should be nothing that prevents an AI or a bot from accessing it as well.

As for crediting the original authors of the suggested code snippets: Copilot currently, as claimed, suggests only a few lines of code at a time and doesn’t directly copy and paste from people’s repositories (variable and method names… etc might be changed). GitHub said that it is working on pushing that 0.1% rate of “verbatim code” even lower.
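For readers wondering what "verbatim code" could mean in practice, here is a minimal Python sketch of one possible check: does a suggestion share a long run of identical tokens with any known source? This is only an assumption for illustration and is not how GitHub arrived at its 0.1% figure; the names `token_ngrams` and `has_verbatim_run`, the run length of 6 tokens, and the sample snippets are all hypothetical.

```python
def token_ngrams(text, n):
    """Return the set of all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_verbatim_run(suggestion, known_sources, n=6):
    """True if the suggestion shares any run of n tokens with a known source."""
    suggestion_runs = token_ngrams(suggestion, n)
    return any(suggestion_runs & token_ngrams(source, n) for source in known_sources)


# Hypothetical snippet standing in for code from a public repository.
known_sources = ["def add(a, b): return a + b  # simple helper"]

copied = "def add(a, b): return a + b"     # identical tokens
renamed = "def total(x, y): return x + y"  # same logic, names changed

print(has_verbatim_run(copied, known_sources))   # prints: True
print(has_verbatim_run(renamed, known_sources))  # prints: False
```

Note that the renamed version slips through even though the logic is the same, which is one reason a low "verbatim" rate does not settle the argument for those in the first camp.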

Conclusion

The topic is of course open to debate, and that debate will not end any time soon.

Currently, Copilot is still in the early technical preview phase and has not reached a stable release yet. That’s why very few people have had the chance so far to get their hands on it and see what results it produces in real-world scenarios. Until the public release is available, IANAL tags are expected to be seen in many places on the Internet.

Feel free to leave your two cents in the comments section below.


M.Hanny Sabbagh

Hanny is a computer science & engineering graduate with a master’s degree, and an open source software developer. He has created a lot of open source programs over the years, and maintains separate online platforms for promoting open source in his local communities.

Hanny is the founder of FOSS Post.

