@starchturrets @justsoup @valpackett Well, the worst part is that I'm 100% sure LLMs were trained on stolen intellectual property... and either nobody noticed, or nobody cares?
I was working on a project where we were the first to pair PCIe DSPs with an x86 CPU. We ran into an unusual problem: the client used two of those DSPs, and each one required 512+512+256MB of 32-bit BAR space (so it had to fit under 4GB).
One DSP was fine, as the default size of the 32-bit BAR window on Intel CPUs is 2GB, but with both devices attached the allocation ran out of resources (they needed about 2.5GB on their own, plus the PCH etc.).
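Back-of-the-envelope, just to show why it couldn't fit (illustrative only, nothing from the actual project):

```c
/* Rough check of the 32-bit BAR budget (illustrative only). */
#include <stdio.h>

int main(void)
{
    unsigned dsp_mb      = 512 + 512 + 256;  /* one DSP's 32-bit BARs, in MB */
    unsigned both_dsps   = 2 * dsp_mb;       /* 2560 MB for the pair */
    unsigned default_win = 2048;             /* default low MMIO window, in MB */

    printf("DSPs alone need %u MB, default window is %u MB\n",
           both_dsps, default_win);
    /* 2560 MB > 2048 MB even before the PCH and other devices are counted,
     * so 32-bit BAR allocation fails with both DSPs attached. */
    return 0;
}
```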
It was a first for me; I couldn't think of a solution off the top of my head and went to sleep. The client's engineers kept looking into it, and when I woke up I had two links to... ChatGPT prompts in my mailbox. I thought "what a waste of time, but it's the client, so I guess I gotta read it".
The first one was an absolute waste of time; in the second one ChatGPT suggested that the MmioSize parameter might increase the BAR space. I know Intel CPUs pretty well, but this parameter isn't well explained anywhere - I checked the FSP headers and it wasn't documented there, and the FSP Integration Guide only mentions that if you allocate more memory to the iGPU, you need to reduce MmioSize because of the 4GB boundary limitation.
So I thought "weird, but ok", passed a pointer that set MmioSize to 2.8GB... and it worked. What the f...?
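(For the curious, this is roughly what it looks like on the bootloader side - a minimal sketch assuming an FSP-M UPD field called MmioSize sized in MB. The struct layout, field name and units vary by platform and FSP release, so treat every identifier here as an assumption, not Intel's actual header.)

```c
/* Hypothetical sketch of bumping the low MMIO window via an FSP-M UPD.
 * The FSP_M_CONFIG layout, the MmioSize field and its units (MB) are
 * assumptions about how FSP UPDs are typically structured. */
#include <stdint.h>

typedef struct {
    /* ...many other FSP-M config fields elided... */
    uint16_t MmioSize;  /* assumed: size of the below-4GB MMIO window, in MB */
} FSP_M_CONFIG;

static void raise_mmio_window(FSP_M_CONFIG *upd)
{
    /* Two DSPs need ~2.5GB of 32-bit BAR space, so the default 2GB window
     * is too small; ask FSP for roughly 2.8GB instead. */
    upd->MmioSize = 2867;  /* ~2.8GB expressed in MB (assumed units) */
}
```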
Surely most people would go "oh cool, it found a solution, hooray, thanks AI", but I went "how did the overclocked text predictor find that information?" I refuse to use code if I don't understand what it's doing.
Started digging - Intel documentation, headers, drivers and so on. Nothing explained it. Then I looked at the FSP source code (which is obviously under NDA), at the place where that pointer is parsed... and I went "oh... shit".
There's no way Intel uploaded their confidential code to train LLMs. I doubt any company leaked it either (the NDA is strict and the penalties would destroy you). So the last remaining explanation was that... OpenAI threw leaks (like ExConfidential) into their training dataset.
Any developer using those leaks would've been sued. I considered reporting this to Intel's legal team, but then I went "hold on... what if OpenAI sues me for this or something along those lines? I can't afford a court case...".
I didn't end up reporting it, but if Intel's lawyers got angry, it would get really ugly. That's the risk of using so-called "LLMs" for coding.