Feeding the AI Beast
To learn and keep improving, AI needs large data sets. Of text. Images. Videos. Whatever. But already, writes Rahul Matthan:
“The trouble is that the availability of high-quality content needed for training these models is fast dwindling.”
There are multiple reasons for this. One, the cost of storage media keeps falling, so you can store more data at the same cost, and the processing power of CPUs/GPUs keeps growing, so they can process more data in the same time. This combination means that the rate at which training data is fed to the AI far exceeds the rate at which humans produce new content (that could serve as new training material).
Two, as the importance of data sets for AI has become understood, Internet sites (which were the biggest source of such data) have started to put restrictions on how much data can be “scraped”.
Three, content producers have started to file copyright-infringement lawsuits, further reducing the data available to feed the AI. On a similar note, privacy concerns impose further restrictions.
The obvious solution was to create “synthetic” data to feed the AI, i.e., let one AI generate content that serves as training input for a second AI, whose output can in turn be used to train a third AI, and so on.
“If successful, not only does this give us a virtually infinite supply of training data, it suffers from none of the intellectual property and data protection concerns that scraped content must contend with.”
Unfortunately, synthetic data doesn’t work as intended.
“(Already) both the quality and diversity of these underlying models have shown evidence of substantial degradation, a phenomenon the authors call Model Autophagy Disorder (or MAD for short).”
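A toy sketch can show why this feedback loop degrades models. This is not the MAD paper’s actual experiment, just a hypothetical illustration I’m adding here: a “model” that merely estimates the mean and spread of its data, then trains each new generation only on samples drawn from the previous generation. Over many generations the estimated spread tends to drift toward zero, i.e., diversity collapses.

```python
import random
import statistics

def fit_gaussian(samples):
    # "Train" a toy model: estimate the mean and spread of the data.
    return statistics.mean(samples), statistics.pstdev(samples)

def generate(mean, std, n, rng):
    # "Generate" synthetic data from the fitted model.
    return [rng.gauss(mean, std) for _ in range(n)]

rng = random.Random(42)

# Generation 0: "real" (human-made) data with spread 1.0.
mean, std = fit_gaussian(generate(0.0, 1.0, n=20, rng=rng))

# Every later generation trains only on the previous one's output.
stds = [std]
for gen in range(1, 101):
    synthetic = generate(mean, std, n=20, rng=rng)
    mean, std = fit_gaussian(synthetic)
    stds.append(std)

print(f"spread after generation 0:   {stds[0]:.3f}")
print(f"spread after generation 100: {stds[-1]:.3f}")
```

Each generation’s estimate is slightly noisy, and those errors compound multiplicatively instead of averaging out, so the diversity of the “model” shrinks over time. Mixing fresh real data back in at each step is what prevents the collapse.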
Apparently, there is no substitute for “fresh, real data”, i.e., human-created content, if we want the AI to keep improving.
If you thought AI was moving at an insane speed and impacting multiple fields, well, it also seems to have hit a roadblock on how much further it can go. Everything happens at a ridiculously fast pace.