“Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease,” they added. “We term this condition Model Autophagy Disorder (MAD).”

Interestingly, this could become a more challenging problem as generative AI models are used more widely online and their output makes up a growing share of future training data.
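
As a rough illustration of the mechanism (a toy sketch, not the authors’ code), consider repeatedly fitting a simple Gaussian model to data and then sampling the next generation’s “training data” from that fit. Because every generation is estimated from a finite sample of the previous one, the spread tends to shrink over time, mirroring the loss of diversity (recall) the paper describes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for gen in range(1, 201):
    # "Train" a model: estimate mean and standard deviation from the current data.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on samples drawn from that fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.3f}")
```

With only 50 samples per generation, the estimated variance shrinks by roughly a factor of (n-1)/n in expectation at each step, so sigma drifts towards zero; injecting fresh real data each generation is what prevents that in the scenarios the paper describes.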

  • argv_minus_one@beehaw.org · 2 years ago

    Note that humans do not exhibit this property when trained on other humans, so this would seem to prove that “AI” isn’t actually intelligent.

    • h3ndrik@feddit.de · 2 years ago

      Weren’t the echo chambers during the covid pandemic kind of proof that humans DO exhibit the same property? A good number of people started repeating stuff about nanoparticles, or claiming that black lint in a mask is actually worms that will control your brain.

    • ParsnipWitch@feddit.de · 2 years ago

      Current AI is not actually “intelligent” and, as far as I know, not even its creators directly describe it as such. The programs and models existing at the moment aren’t capable of abstract thinking or reasoning, or of the other processes that make an intelligent being or thing intelligent.

      The companies involved are certainly eager to create something like a general intelligence. But even if they reach that goal, we don’t yet know whether such an AGI would be truly intelligent.

    • echo@sopuli.xyz · 2 years ago

      I don’t think LLMs are intelligent, but “does it work the same way humans do?” is a really bad way to judge something’s intelligence.

      • frog 🐸@beehaw.org · 2 years ago

        Even if we look at other animals: when they learn by observing other members of their own species, they get more competent rather than less. So AIs are literally the only thing that gets worse when trained on their own kind, rather than better. It’s hard to argue they’re intelligent if the answer to “does it work the same as any other lifeform that we know of?” is “no”.

    • lloram239@feddit.de · 2 years ago

      Key point here being that humans train on other humans, not on themselves. They are also always exposed to the real world.

      If you lock a human in a box and only let them interact with themselves they go a bit funny in the head very quickly.

      • ParsnipWitch@feddit.de · 2 years ago

        The reason is different from what is happening with AI, though. Sensory deprivation, extreme isolation, and the Ganzfeld effect lead to hallucinations because our brain apparently needs a constant stream of stimuli to keep functioning; deprived of them, it starts creating things from imagination.

        With AI it is the other way around. Models lose information when they are fed the same (or their own) data again and again, because their statistical models keep reinforcing the most probable outputs, and the rarer material gradually drops out.
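
        As a very rough illustration of that narrowing (a toy sketch, not tied to any particular model): start from a long-tailed distribution of outputs, let each generation slightly favour its most probable outputs when sampling, and re-estimate the distribution from those samples. The entropy, i.e. the diversity, drops generation after generation:

        ```python
        import numpy as np

        rng = np.random.default_rng(1)

        def entropy_bits(p):
            nz = p[p > 0]
            return float(-(nz * np.log2(nz)).sum())

        # Generation 0: a long-tailed (Zipf-like) distribution over 100 possible outputs.
        p = 1.0 / np.arange(1, 101)
        p /= p.sum()
        print(f"generation 0: entropy = {entropy_bits(p):.2f} bits")

        for gen in range(1, 6):
            # Sampling slightly favours the most probable outputs (temperature 0.8
            # stands in for any preference for "typical" text).
            q = p ** (1 / 0.8)
            q /= q.sum()
            samples = rng.choice(len(p), size=5000, p=q)
            # The next generation re-estimates its distribution from those samples only.
            counts = np.bincount(samples, minlength=len(p)).astype(float)
            p = counts / counts.sum()
            print(f"generation {gen}: entropy = {entropy_bits(p):.2f} bits")
        ```

        None of this captures what an LLM does internally; it only shows how re-estimating a distribution from its own most probable samples keeps cutting off the tail.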

  • frog 🐸@beehaw.org · 2 years ago

    Good!

    Was that petty?

    But, you know, good luck completely replacing human artists, musicians, writers, programmers, and everyone else who actually creates new content, if all generative AI models essentially give themselves prion diseases when they feed on each other.

      • frog 🐸@beehaw.org · 2 years ago

        I absolutely agree! I’ve seen so many proponents of AI argue that AI learning from artworks scraped from the internet is no different to a human learning by looking at other artists, and while anyone who is actually an artist (or involved in any creative industry at all, including things like coding that require a creative mind) can see the difference, I’ve always struggled to coherently express why. And I think this is it. Human artists benefit from other human art to look at, as it helps them improve faster, but they don’t need it in the same way, and they’re more than capable of coming up with new ideas without it. Even a brief look at art history shows plenty of examples of human artists coming up with completely new ideas, artworks that had absolutely no precedent. I really can’t imagine AI ever being able to invent, say, Cubism without having seen a human do it first.

        I feel like the only people that are in favour of AI artworks are those who don’t see the value of art outside of its commercial use. They’re the same people who are, presumably, quite happy playing the same same-y games and watching same-y TV and films over and over again. AI just can’t replicate the human spark of creativity, and I really can’t see it being good for society either economically or culturally to replace artists with algorithms that can only produce derivations of what they’ve already seen.

  • Exaggeration207@beehaw.org · 2 years ago

    I only have a small amount of experience with generating images using AI models, but I have found this to be true. It’s like making a photocopy of a photocopy. The results can be unintentionally hilarious though.
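
    For what it’s worth, the “photocopy of a photocopy” effect is easy to reproduce outside of AI. A small sketch of my own, using nothing but Pillow, repeated lossy JPEG re-encoding, and a slight resize each pass to play the role of scanner noise:

    ```python
    import io

    import numpy as np
    from PIL import Image

    rng = np.random.default_rng(0)

    # A stand-in "original": random RGB noise (real photos degrade more slowly,
    # but the trend is the same).
    original = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
    img = Image.fromarray(original)

    for copy in range(1, 11):
        # Re-encode the previous output at a lossy quality setting...
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=60)
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
        # ...and resize down and back up, so each pass works from damaged input.
        img = img.resize((120, 120)).resize((128, 128))
        drift = np.abs(np.asarray(img).astype(int) - original.astype(int)).mean()
        print(f"copy {copy}: mean pixel drift from the original = {drift:.1f}")
    ```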

  • Cybrpwca@beehaw.org · 2 years ago

    So we have generation loss instead of AI making better AI. At least for now. That’s strangely comforting.

  • feeltheglee@beehaw.org · 2 years ago

    You know how when you’re on a voice/video call and the audio keeps bouncing between two people and gets all feedback-y and screechy?

    That, but with LLMs.

  • coolin@beehaw.org · 2 years ago

    For the love of God please stop posting the same story about AI model collapse. This paper has been out since May, been discussed multiple times, and the scenario it presents is highly unrealistic.

    Training on the whole internet is known to produce shit model output; it takes humans producing their own high-quality datasets to feed to these models to get high-quality results. That is why we have techniques like fine-tuning, LoRAs and RLHF, as well as countless curated datasets to feed to models.
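
    For anyone unfamiliar with the term: LoRA just means freezing the pretrained weights and fine-tuning a small low-rank correction on top of them. A bare-bones sketch of the idea in PyTorch (not any particular library’s implementation):

    ```python
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a small trainable low-rank update (the LoRA idea)."""

        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():   # the pretrained weights stay frozen
                p.requires_grad = False
            self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Output = frozen base layer + scaled low-rank correction (B @ A) applied to x.
            return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

    # Only the adapter's parameters would be fine-tuned.
    layer = LoRALinear(nn.Linear(512, 512), rank=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")   # 8*512 + 512*8 = 8192
    ```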

    Yes, if a model were for some reason trained on the raw internet for several iterations, it would collapse and produce garbage. But the current frontier approach for datasets is for strong LLMs (e.g. GPT-4) to produce high-quality datasets and for new LLMs to train on those. This has been shown to work with Phi-1 (really good at writing Python code, trained on high-quality, textbook-level content generated with GPT-3.5) and Orca/OpenOrca (a GPT-3.5-level model trained on millions of examples from GPT-4 and GPT-3.5). Additionally, GPT-4 has itself likely been trained on some synthetic data, and future iterations will train on more and more.

    Notably, by selecting a narrow range of outputs, instead of the whole range, we are able to avoid model collapse and in fact produce even better outputs.
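
    In practice that selection can be as simple as generating several candidates per prompt, scoring them, and keeping only the best before they enter the training set. A hypothetical sketch; `generate_candidates` and `quality_score` are placeholders for whatever teacher model and filter (heuristics, classifiers, human review) a real pipeline would use:

    ```python
    def curate_synthetic_dataset(prompts, generate_candidates, quality_score,
                                 per_prompt=8, keep_fraction=0.25):
        """Keep only the highest-scoring synthetic samples for the next training run."""
        kept = []
        for prompt in prompts:
            candidates = generate_candidates(prompt, per_prompt)
            ranked = sorted(candidates,
                            key=lambda text: quality_score(prompt, text),
                            reverse=True)
            n_keep = max(1, int(len(ranked) * keep_fraction))
            # Discarding the weaker candidates is the "narrow range of outputs" part.
            kept.extend((prompt, text) for text in ranked[:n_keep])
        return kept
    ```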

    • shanghaibebop@beehaw.org (OP) · 2 years ago

      We’re all just learning here, but yeah, that’s pretty interesting to learn about effective synthetic data used for training.

  • h3ndrik@feddit.de · 2 years ago

    Wow. How is this going to affect all the projects that fine-tune Meta’s Llama model with synthetic training data?

    • lloram239@feddit.de · 2 years ago

      Not much at all, I would think. The Llama models get trained on the superior GPT-4 output, not on their own output. In general I think it’s a bit of an artificial problem; nobody really expects to train an AI on its own output and get good results. What actually happens is that AI is used to curate real-world data, and that curated data is used as input, which gives much better results than feeding the raw data directly into the AI (as can be seen with early LLMs that go completely off track and start repeating comment sections and HTML code that has nothing to do with your prompt, but just happens to be part of raw websites).

      • h3ndrik@feddit.de · 2 years ago

        Thank you for explaining. Yes, now that I have skimmed through the paper I’m kind of disappointed in their work. It’s no surprise to me that quality degrades if you design a feedback loop around low-quality data. And does this even mean anything for the distinction between human and synthetic data? Isn’t it obvious that a model will deteriorate if you feed it progressively lower-quality input, regardless of where that input came from? I’m pretty sure that is the mechanism at work here. A better question to ask would be: is there some point where synthetic output gets good enough to train on? How far away is that point? Or can we rule it out because of some properties we can’t get around?

        I’m not sure whether learning from your own output is even possible like this. I as a human certainly can’t teach myself; I would need input like books or curated assignments/examples prepared by other people. There are intrinsic barriers to teaching oneself. I can certainly practice things, but that’s a different mechanism, and difficult to compare to the AI case.

        I’m glad I can continue to play with the language models, have them tuned to follow instructions (with the help of GPT-4 data), etc.

  • voluntaryexilecat@lemmy.dbzer0.com · 2 years ago

    But… isn’t unsupervised backfeeding the same as simply overtraining on the same dataset? We already know overtraining causes broken models.
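
    For reference, classical overtraining (overfitting) looks like this toy sketch: a model flexible enough to memorise a handful of noisy points fits them almost exactly while typically getting worse between them. The paper’s loop is a related but distinct effect, where the training distribution itself narrows from one generation to the next:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # A handful of noisy training points from a simple underlying curve.
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)
    x_test = np.linspace(0, 1, 200)
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        # The degree-9 fit passes almost exactly through the training points
        # but typically does much worse in between them.
        print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
    ```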

    Besides, the next AI models will be fed with the interactions between humans and AI, not just with the AI’s own content. ChatGPT already works like this: it learns with every interaction, every chat.

    And the generative image models will be fed with AI-assisted images where humans will have fixed flaws like anatomy (the famous hands) or other glitches.

    So, as interesting as this is, as long as humans interact with AI, the hybrid output used for training will contain enough new “input” to keep the models on track. There are already refined image generators trained on their own (but human-assisted) output that are better than their predecessors.
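
    That lines up with the quoted conclusion at the top: the loop only degrades when there is not enough fresh real data coming in. Extending the toy Gaussian sketch from the top of the page, mixing even a modest fraction of real samples into every generation is enough, in that toy at least, to keep the spread from collapsing (the 20% figure below is an arbitrary choice, not something from the paper):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    N = 50
    REAL_FRACTION = 0.2   # arbitrary: 20% fresh real data per generation

    def real_data(n):
        # Fresh samples from the "true" distribution (a standard normal).
        return rng.normal(loc=0.0, scale=1.0, size=n)

    data = real_data(N)
    for gen in range(1, 201):
        mu, sigma = data.mean(), data.std()
        n_real = int(N * REAL_FRACTION)
        synthetic = rng.normal(loc=mu, scale=sigma, size=N - n_real)
        # Each generation trains on a mix of its own output and fresh real data.
        data = np.concatenate([synthetic, real_data(n_real)])
        if gen % 50 == 0:
            print(f"generation {gen:3d}: sigma = {sigma:.3f}")
    ```

    In this toy the spread settles near its original value instead of drifting towards zero, which is the “fresh real data in each generation” condition from the article’s quote.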