Date: 2024-09-11

Artificial Intelligence (AI) is still in its Wild West phase, where anything and everything goes.

Speculative billions, trillions of dollars are being pumped into this new toy, and investors want to see some return. Humans are using it to write sentences, essays and songs; program robotic vacuum cleaners; borrow ideas; paint pictures; and, of course, make fake porn.

Bespoke AI models are hoovering up anything and everything they can lay their mitts on, sucking it up at industrial levels never before seen, stealing it, storing it, massaging it, weighting it, bending it, publishing it, hallucinating on it, selling it.

Nobody knows where this is heading – hopefully incredible medical, scientific and technological breakthroughs – and if anybody says they do and you believe them, then I have a very impressive bridge that will take you from Kirribilli into the Sydney CBD, and it’s yours for a song. Call my people.

Studies are showing that AI really has no idea what it’s doing. And that means neither do the people training these machines.

Yes, there are theories, hypotheses, hunches, early warnings, tech advances, all of that, but the fact is that within the space of just a few years finance, medicine, education, entertainment and many other sectors have hitched themselves to this wagon because, well, they had no choice. You have to be in the game, because without AI you are in the rear-view mirror, fading fast.

Tech is no different, which is why you see “AI” in every second product launched – phones, speakers, vacuum cleaners, cameras.

A 2024 study from Amazon Web Services AI Labs and the University of California, Santa Barbara, was titled “A Shocking Amount of the Web is Machine Translated”, and the gist of the findings was that “content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT)”.

Per Forbes: “English makes up 52% of the Internet with the rest scattered across 19 languages. Of all content that has been translated, roughly 57% of web-based text has been translated into three or more languages, and the low quality of this content indicates that it was likely machine translated.”

One poor translation leads to an even worse translation, which begets one that is even more nonsensical, and ultimately the whole thing implodes or starts getting high on its own supply. They have a term for it – model collapse.

Another study, published in July in Nature, from a group of researchers at Cambridge and Oxford universities, stated: “The development of LLMs [Large Language Models] is very involved and requires large quantities of training data. Yet, although current LLMs, including GPT-3, were trained on predominantly human-generated text, this may change.

“If the training data of most future models are also scraped from the web, then they will inevitably train on data produced by their predecessors.”

And that means scraping AI-generated and translated data that mightn’t be all that one hopes for.

The AWS/UC study found “Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages”.

“We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT.

“Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.”

The Nature study said that “indiscriminately learning from data produced by other models causes ‘model collapse’ – a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time”.