Comment on Amazon discovered a 'high volume' of CSAM in its AI training data but isn't saying where it came from

phx@lemmy.world 16 hours ago

Yeah, a lot of people seem to think that these companies built their AIs by buying or building some sort of special training data set, when in reality no such thing existed.

They’ve basically just scraped every bit of data they can. When it comes to big corps, at least some of that data is likely scraped from customers’ data. There’s also scraping of the Internet in general, including sites such as Reddit (which is a big reason why Reddit locked down its API: it wanted to sell that data), but many have also been caught with a ton of ‘pirated’ data from torrents etc.

I’m sure there was a certain amount of sludge in customers’ synced files and on sites like Reddit, but I’d also hazard a guess that the stuff grabbed from torrents etc. likely included some truly heinous material that they simply added to what was getting force-fed to the AIs, especially the early ones.
