Should GitHub Be Sued For Training Copilot On GPL Code?

Just yesterday, GitHub announced that it is working on a new feature for its platform called “Copilot”: an AI coding assistant that predicts the next chunks of code a programmer may want to write while developing software, and offers to insert them at just the right time and place.

GitHub 5 June 30, 2021
Copilot working in practice; the highlighted part is the code suggested by Copilot

The technology, a bleeding-edge application of deep learning and neural networks, was trained using the public repositories published on GitHub. Training a neural network model means that you take the data (source code of these repositories in our case) and feed it to the network, so that it can learn what to do in future similar cases.

Copilot has seen billions of lines of code, functions, classes and object definitions before, and hence can suggest the next steps whenever there is enough information to determine what the programmer wants.
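To make the idea of "learning from existing code, then suggesting what comes next" more concrete, here is a deliberately tiny sketch. It is not how Copilot works internally: Copilot is built on a large neural network, while this toy swaps in a simple table that counts which token follows which. The `train` and `suggest` names and the three-line corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train(snippets):
    """Count, for every token, which tokens were seen right after it."""
    follows = defaultdict(Counter)
    for snippet in snippets:
        tokens = snippet.split()
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def suggest(follows, token):
    """Return the most frequently seen next token, or None if the token is unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


# Hypothetical stand-in for "public repositories" (assumption, not real data).
corpus = [
    "for i in range ( n ) :",
    "for item in items :",
    "for i in range ( 10 ) :",
]

model = train(corpus)
print(suggest(model, "in"))  # the token seen most often after "in"
```

The suggestion is purely a matter of frequency in the training data, which is the point both camps below argue about.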

However, this brought a large issue into debate: Many of these public repositories are licensed under the GPL and other copyleft licenses such as the AGPL, or under permissive licenses such as MIT that still require attribution, so is this process legal? Is it OK for GitHub to use free software source code to train its proprietary, paid and commercial service?

Different opinions emerged in the open source community.

The Conservatives

Some open source software developers argued that the resulting neural network is a derivative work of the GPL-licensed code, and hence must be released under the GPL as well.

GitHub’s current CEO said that from their point of view, this is a part of “fair use”, which implies that using a few lines of modified code from public source code is not enough to establish any type of lawsuit against them.

However, others argue that the neural network outputs snippets copied verbatim from various repositories on GitHub (about 0.1% of the time), and hence it cannot fall under fair use.

Moreover, open source developers are already suffering from burnout because of gigantic multi-billion dollar corporations taking their free code and re-bundling it as SaaS; introducing this new feature takes even more from them than before.

The Rationalists

Others argued that just like a human who reads various books, tutorials and software source code to understand how software development works and doesn’t need to cite the materials which he/she learned from, a neural network is not obligated to do that either.

“What is the difference between this and someone doing it manually? Is it just because the AI can do it faster and with more data that the AI should not do it while humans can?” various users asked on Twitter and Reddit.

Others from the first camp, however, see that as naive thinking: neural networks rely on purely probabilistic approaches to determine which code snippets to suggest, and do not actually understand what they are doing or what the right way of fitting a code snippet into the new software would be.

The Don’t Cares

Others think in a different way: Let’s put copyright aside.

Training AI models has proven to have many useful applications for humanity. Whether the training data is publicly available or protected by copyright laws… isn’t actually what matters. What matters is how we can – as a human race – build useful and good AI models that help us in our everyday lives.

Copilot does help the programmer in his/her everyday life.

One could argue, of course, that it is a commercial service that feasts on public free software (as in freedom) given for free (as in free coffee). However, there is nothing that prevents anyone from doing the same for free. Standing on the same ground, anyone can take the same public repositories and train a large model to suggest the next lines of code, just like Copilot does.

Then, you can offer the source code, data and model for free however you like.

Just because they did it before you and offered it for a price tag doesn’t mean that they are wrong.

If anyone can train his/her AI model on any publicly accessible database, then that is a good thing that should be encouraged and supported, because it means everyone will have access to the same opportunities to unlock the next step in technology. Training AI models on various types of data – by anyone – is crucial for the advancement of our race.

Preventing GitHub from doing it will not help the free software community or the general momentum of technological progress. Instead, it would just slow the development of the human race for a bit while some workarounds get created.

That’s why we see that regardless of whether US courts see it as fair use or not, it is OK from an ethical point of view to use data that is publicly available to everyone to train a computational model that provides a service to users, whether for free or for profit. Since this data is normally accessible to the everyday end-user, there should be nothing that prevents an AI or bot from accessing it as well.

As for crediting the original authors of the suggested code snippets: Copilot currently, as claimed, only suggests a few lines of code, and doesn’t directly copy and paste from people’s repositories (variable and method names, etc. might be changed). GitHub said that they are working on pushing that 0.1% rate of “verbatim code” even lower.
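One way to picture what “verbatim code” could mean is to check whether a suggestion shares a long run of consecutive tokens with a known source file. The sketch below is only an illustration under assumed names (`token_ngrams`, `looks_verbatim`, and a `min_tokens` threshold of 6 are all hypothetical); it is not a description of how GitHub measures its 0.1% figure.

```python
def token_ngrams(text, n):
    """Return the set of all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def looks_verbatim(suggestion, source, min_tokens=6):
    """True if suggestion and source share at least one run of min_tokens tokens."""
    shared = token_ngrams(suggestion, min_tokens) & token_ngrams(source, min_tokens)
    return bool(shared)


# Made-up examples: an exact copy, and a copy with renamed identifiers.
source = "def add ( a , b ) : return a + b"
copied = "def add ( a , b ) : return a + b"
renamed = "def total ( x , y ) : return x + y"

print(looks_verbatim(copied, source))
print(looks_verbatim(renamed, source))
```

Note that a check like this flags the exact copy but not the version with renamed variables, which is exactly why a low “verbatim” rate does not settle the crediting question by itself.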

Conclusion

The topic is of course open to debate, and will not be settled anytime soon.

Currently, Copilot is still in the early technical preview phase, and hasn’t reached stable status yet. That’s why very few people have had the chance so far to get their hands on it and see what results it produces in real-world scenarios. Until the public release is available, expect to see IANAL tags in many places on the Internet.

Feel free to leave your two cents in the comments section below.

1 Comment
tamusjroyce July 28, 2021
However, if the copilot was able to scrape protocols, function signatures, etc., spin off a version of itself that didn't ingest any non-appropriate-licensed software, and implement those protocols based on the results passed in...it could, in theory, rewrite GPL to MIT code. In fact, the functions themselves could be mini-neural networks.
