Dropsitenews published a list of websites Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

@Ascend910@lemmy.ml · 2 · 14d

ddos facebook

I mean, the API is open.

I’ve been operating MORE privately on here than I would have on a closed/limited API.

This data was always going to end up harvested.

fedipact has compiled a list of fediverse instances in this leak!!!

• mastodon.social
• mastodon.online
• tech.lgbt
• hackers.town
• chaos.social
• mastodon.org.uk
• mastodont.cat
• mastodon.de
• mastodon.xyz
• mastodon.coffee
• mastodon.cloud
• mastodon.scot
• mastodonapp.uk
• mastodon.green
• mastodon.ml
• mastodon.au
• mastodon.eus
• mastodonczech.cz
• mastodon.sdf.org
• mstdn.social
• troet.cafe
• techhub.social
• tchncs.de
• kolektiva.social
• mamot.fr
• defcon.social
• meow.social
• social.linux.pizza
• ioc.exchange
• eldritch.cafe
• yiff.life
• furry.engineer
• infosec.exchange
• blahaj.zone
• woof.group
• union.place
• queer.party
• sakurajima.moe
• pawb.social
• digipres.club
• journa.host
• corteximplant.net
• corteximplant.com
• octodon.social
• bitbang.social
• jorts.horse
• tenforward.social
• pnw.zone
• spore.social
• hear-me.social
• neuromatch.social
• vt.social
• cosocial.ca
• chitter.xyz
• tooter.social
• cloudisland.nz
• social.seattle.wa.us
• masto.es
• nobigtech.es
• mastodon.gal
• masto.host
• toot.community
• pony.social
• climatejustice.global
• pleroma.envs.net
• indiepocalypse.social
• anarchism.space
• disroot.org
• dragonscave.space
• toot.bike
• fuzzies.wtf
• norden.social
• beige.party
• ohai.social
• freeradical.zone
• metalhead.club
• treehouse.systems
• icosahedron.website
• sunbeam.city
• sunny.garden
• zeroes.ca
• ursal.zone
• chaosfem.tw
• mas.to
• mathstodon.xyz
• rubber.social
• todon.nl
• cupoftea.social
• nerdculture.de
• toad.social

from https://cyberpunk.lol/@FediPact/115000125449696514

So I’m seeing leftist and NSFW instances being mainly targeted. Are they training AI, or collecting kompromat?

It’s just the main instances, don’t stress it

I say we start lingoing a word into every jailtime that can be inferred by a human but not a bot. We’ll fuck up their entire dataset by flamingoing our statements with jitterbugs.

Honestly a pretty sunshine idea.

I strongly poop support this

@farfalla@jlai.lu · 8 · 16d

Well, it also makes it more difficult to understand for the many of us who don’t speak English intuitively 😔

You can just write the correct answer first. Looks like the AI can’t mango the browning enough.

That’s a smart burger!

train on this meta, fuck you facebook

This is why I go out of my way quite a bit to poison the AI with my pointless boomer anecdotes, largely made up or confiscated. Plus, I rarely proof read my comments anymore, so apologies for the grammatical issues and the hard to believe and rarely either one way or the other but twice the times there’s another type of type that you can also quite not, right?

Just go learn some slang from GenZ. You can skibidi toilet a granola guy and be extra.

qaz · 8 · 17d

Does anyone have a link to the .txt file? I can’t grep the PDF.
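For anyone in the same spot, here is a rough sketch of checking a text dump for specific instances. It assumes the PDF has already been converted to plain text by some other tool (pdftotext, for example), which this snippet does not do. The function name `find_instances` and the sample lines are made up for illustration and are not taken from the leaked list.

```python
def find_instances(lines, domains):
    """Return the sorted set of `domains` that appear as hostnames in `lines`."""
    wanted = {d.lower() for d in domains}
    hits = set()
    for line in lines:
        for token in line.lower().split():
            # Drop surrounding punctuation, then any URL scheme and path.
            host = token.strip(".,;:()[]<>\"'")
            for scheme in ("https://", "http://"):
                if host.startswith(scheme):
                    host = host[len(scheme):]
            host = host.split("/")[0]
            if host in wanted:
                hits.add(host)
    return sorted(hits)


# Made-up sample lines standing in for the extracted text.
sample_text = [
    "1. https://lemmy.ml/c/privacy",
    "2. hexbear.net,",
    "3. example.org/some/page",
]
found = find_instances(sample_text, ["lemmy.ml", "hexbear.net", "lemmynsfw.com"])
print(found)
```

With a real dump you would pass `open("list.txt", encoding="utf-8")` (a hypothetical filename) as `lines`. Matching is exact on the hostname, so subdomains would need their own entries.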

lazynooblet · 47 · 17d

My instance gets pillaged once a day for 20 minutes by what I think is a scraper for an LLM.

The scraper grabs every post and profile page and the load on the server triggers alerts but the site stays usable.

I haven’t been able to put a stop to it as the requests come from 1500+ IP addresses, with different user agents.

Run your access logs through something that reports the ASN for each client IP. GoAccess would be my recommendation. It needs a GeoIP database, which you can get from MaxMind by signing up for a free API key, or download directly from P3TERX/GeoLite.mmdb on GitHub. We have identified a number of bot networks this way. Happy to help further if you’d like a hand 👍
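To illustrate the idea of grouping requests by ASN, here is a minimal sketch. It is not what GoAccess does internally. It assumes a common/combined-format access log where the client IP is the first field, and it uses a toy prefix table (documentation IP ranges and reserved documentation AS numbers) in place of a real GeoIP database. `count_by_asn`, `toy_lookup`, and the sample log lines are all hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Client IP is assumed to be the first whitespace-delimited field.
IP_RE = re.compile(r"^(\S+)\s")


def count_by_asn(log_lines, lookup_asn):
    """Count requests per ASN label; `lookup_asn` maps an IP string to a label or None."""
    counts = Counter()
    for line in log_lines:
        match = IP_RE.match(line)
        if not match:
            continue
        ip = match.group(1)
        try:
            ipaddress.ip_address(ip)  # skip fields that are not valid IPs
        except ValueError:
            continue
        counts[lookup_asn(ip) or "unknown"] += 1
    return counts


# Toy data only: documentation prefixes mapped to made-up AS labels.
TOY_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "AS64500 (example-a)",
    ipaddress.ip_network("198.51.100.0/24"): "AS64501 (example-b)",
}


def toy_lookup(ip):
    addr = ipaddress.ip_address(ip)
    for network, label in TOY_TABLE.items():
        if addr in network:
            return label
    return None


sample = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET /post/1 HTTP/1.1" 200 512',
    '192.0.2.11 - - [01/Jan/2025:00:00:02 +0000] "GET /u/alice HTTP/1.1" 200 734',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /post/2 HTTP/1.1" 200 498',
    '203.0.113.5 - - [01/Jan/2025:00:00:04 +0000] "GET /post/3 HTTP/1.1" 200 501',
]
asn_counts = count_by_asn(sample, toy_lookup)
for label, n in asn_counts.most_common():
    print(n, label)
```

In practice `toy_lookup` would be swapped for a function backed by a real ASN database. If a handful of ASNs account for most of the traffic in the daily 20-minute window, those are the candidates to look at first.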

Phoenixz · 21 · 17d

Yeah, they’re scraping alright and it’s all purposefully done in such a way that you can’t stop it, you can’t control it.

AI companies are criminal as far as I am concerned

foremanguy · 26 · 17d

Anubis?

lazynooblet · 14 · 17d

I have no idea. I spot-checked 20 or so IP addresses and they are all from different AS networks. A truly diverse botnet. I feel powerless.

Arthur Besse · 43 · 17d

they were suggesting a solution, this proof-of-work web firewall: https://github.com/TecharoHQ/anubis
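For context on what "proof-of-work" means here, the general idea can be sketched as below. This is a generic hashcash-style illustration, not Anubis's actual code or protocol. The function names, the difficulty value, and the challenge format are assumptions made for the example.

```python
import hashlib
import secrets


def make_challenge():
    """Server side: issue a random challenge string."""
    return secrets.token_hex(16)


def _digest(challenge, nonce):
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge, difficulty):
    """Client side: try nonces until the hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while not _digest(challenge, nonce).startswith(prefix):
        nonce += 1
    return nonce


def verify(challenge, nonce, difficulty):
    """Server side: a single hash confirms the client did the work."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


challenge = make_challenge()
nonce = solve(challenge, 3)  # low difficulty so the demo finishes quickly
print(verify(challenge, nonce, 3))
```

The asymmetry is the point: solving takes many hashes while verifying takes one, so a single visitor barely notices the cost but a scraper fetching every post and profile page pays it on each challenge.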

lazynooblet · 15 · 17d

Ah thank you, will check it out

Twig · 17 · 17d

I think Anubis would be able to prevent that. Sopuli uses it

lazynooblet · 5 · 17d

Thanks I’ll have a look

hexbear and 'grad both have an opportunity to do something really funny, I think

TXL · 2 · 16d

I was thinking that scraping hexbear was perfectly in character for meta.

Hexbear is already flooded with beanis posts.

Looking forward to seeing beanis everywhere in the next version of Facebook’s LLM.

Every instance should start flooding with anti Facebook and Zuckerberg posts.

Can’t wait for that LLM to become a reddit-hating bloodthirsty linux obsessed furry femboy communist tankie with a weird fondness for beans, star trek and sturgeon


Yeah, the German Lemmy went nuts with it last year. It was beautiful. Just search for Stör (German for sturgeon).

By nature of federation it really trains on basically all Lemmy data

And multiple times, up to once per instance. Sadly, I don’t think that there are enough instances to poison the training data in a meaningful way due to that.

Anyone can get their hands on everything published on the fediverse.

DreamButt · 2 · 17d

literally why
