Do you want to buy some student data for your AI? The University of Michigan can help. It appears that representatives of the school or its partners are cold-emailing tech employees at Google and other companies, offering data on University of Michigan students to train large language models. The data includes recordings of lectures, student discussions and office hours, as well as essays written by seniors and graduate students, which are available for a nominal license fee. It is not clear whether the students gave their consent or not.
The story surfaced in an X/Twitter post by an employee of Google DeepMind, the company’s AI research hub. Susan Zhang, an engineer at DeepMind, said she received a sponsored LinkedIn message that provided information and offered a free sample of data from the University of Michigan to prove its usefulness.
“I am contacting you because, based on your profile, you may be working with large language models (LLM) or natural language processing,” the sales message said. “I wanted to let you know that the University of Michigan is licensing academic speech data and student papers that could be very useful for training or tuning the LLM.”
The message offers data from 85 hours of lectures, discussion sections, and interviews for $15,595; a second set of 829 papers written by University of Michigan students in a variety of disciplines for $12,595; or a discounted package of both data sets for $25,000.
“I think it’s worth finding out which universities are selling student data and what the terms are,” Zhang told Gizmodo in a message on X. (The creators won’t get a penny, while the reseller who stores the data will capture all the profits.)
The university appears to be working with an organization called Catalyst Research Alliance, which also claims to have a partnership with North Carolina State University. The website offers a sample data set, which comes with an essay titled “The Democratic Inadequacies of the European Union” and what appears to be a recording of a classroom discussion section.
Catalyst Research Alliance and North Carolina State University did not immediately respond to requests for comment. A representative from the University of Michigan said they were preparing a statement. We’ll update this article when we hear back.
Training large language models, such as the software that runs chatbots like ChatGPT and Bard, requires massive, clearly labeled data sets across a variety of topics and subjects. While the University of Michigan data set is small, its well-organized material on a narrow set of topics may be useful for tuning some models, especially tools designed for specific purposes related to education or formal communication, or for improving a more general AI’s performance in individual areas of subject matter expertise.