GitHub’s Automatic Coding Tool Rests On Untested Legal Ground
Just days after GitHub announced the launch of its new Copilot tool, which generates complementary code for programmers' projects, web developer Kyle Peacock noticed an oddity.
“I love to learn new things and build things,” the algorithm wrote when asked to generate an About Me page. “I have a <a href="https://github.com/davidcelis">Github</a> account.”
While the About Me page was ostensibly generated for a fictitious person, the link points to the GitHub profile of David Celis, who The Verge can confirm is not a figment of Copilot's imagination. Celis is a programmer and a prolific GitHub user who previously worked at the company.
“I'm not surprised that my public repositories are included in Copilot's training data,” Celis told The Verge, adding that he found the algorithm's recitation of his name amusing. While Celis is unconcerned about an algorithm parroting its training data, he is concerned about the copyright implications of GitHub snatching up any code it can to improve its AI.
On June 29, when GitHub announced Copilot, the company stated that the algorithm was trained using publicly available code on GitHub. Nat Friedman, the CEO of GitHub, has stated on forums such as Hacker News and Twitter that the company is on solid legal ground. According to the Copilot page, “training machine learning models on publicly available data is considered fair use in the machine learning community.”
However, the legal question is not as settled as Friedman implies, and the confusion extends well beyond GitHub. Artificial intelligence algorithms are only effective when they analyze massive amounts of data, the majority of which comes from the open internet. A simple example is ImageNet, perhaps the most influential AI training dataset, which is entirely composed of publicly available images that the creators of ImageNet do not own. If a court rules that using this readily available data is illegal, it could make training AI systems significantly more expensive and less transparent.
Despite GitHub's assertion, there is no direct legal precedent in the United States that supports publicly available training data as fair use, according to Stanford Law School professors Mark Lemley and Bryan Casey, who published a paper last year in the Texas Law Review about AI datasets and fair use.
That is not to say they are opposed to it: Lemley and Casey argue that publicly available data should be considered fair use in order to improve algorithms and adhere to machine learning community standards.
And, they assert, there are precedents to support their position. They compare training an algorithm to the Google Books case, in which Google downloaded and indexed over 20 million books to create a literary search database. A federal appeals court upheld Google's fair use claim, finding that the new tool transformed the original work and benefited readers and authors in general, and the Supreme Court declined to hear an appeal.
“There is no debate about the ability to store all that copyrighted material in a database for machine reading,” Casey says of the Google Books case. “What a machine then produces is still in the dark and will be figured out.”
This means that the details will change when the algorithm generates its own media. Lemley and Casey argue in their paper that when an algorithm begins to generate songs in the style of Ariana Grande or directly plagiarizes a coder's novel solution to a problem, the line between fair use and plagiarism becomes much murkier.
Because this has not been directly tested in court, a judge has not been forced to determine how extractive the technology truly is. If an AI algorithm turns copyrighted work into a profitable technology, it is not out of the question for a judge to rule that its creator should pay or otherwise credit the people whose work it takes.
On the other hand, if a judge finds that GitHub's method of training on publicly available code is fair use, GitHub and OpenAI would no longer need to cite the licenses of the coders who wrote their training data. Celis, for example, states on his GitHub profile that he uses the Creative Commons Attribution 3.0 Unported License, which requires attribution for derivative works.
“And I am of the opinion that Copilot's generated code is entirely derivative,” he told The Verge.
Until a court takes up the issue, though, whether this practice is legal remains an open question.
“My hope is that individuals would be receptive to their code being used for training purposes,” Lemley says. “Not that it will appear verbatim in someone else's work, but we are all better off with better-trained AIs.”