Author Topic: Protecting Web Content From Scraping and AI Training  (Read 838 times)
Spots
Casual Poster
« on: June 10, 2023 @23.30 »

Ever since AI started becoming a really big deal and getting frighteningly good at mimicking art, writing, coding, and other stuff, I've been feeling very hesitant to post anything I make onto the public-facing Internet. I don't want to get too deep into the whole debate of ethics in AI and particularly AI art since that's not the point of this thread, but it's something that bothers me a lot personally and I'm interested in trying to prevent my work from being used as training data as much as possible. So, I'm starting this thread to go over some possibilities for making it harder for website content to be ripped and potentially used for AI training purposes, to see if anyone else is concerned about AI and interested in this concept, and to see if anyone has any critiques or ideas to add.

Before getting into the meat of this topic, I'd like to get a couple things out of the way. First off, I realize that the work of a single person is a drop in the bucket of a huge AI training data set, but I think there are still reasons to do this. If enough people protect their work, then it could start to have an impact. It could also make it harder for an individual artist to be specifically targeted and harassed with AI-generated art in their style. Secondly, there are obviously some negative aspects surrounding asset protection in general. Encrypting and obfuscating the contents of a website could hinder legitimate archival efforts, reduce accessibility, and prevent search engines from recommending it. So, it's really a matter of what your priorities are and whether or not those things are worth it to protect the content. I don't really like the concepts of DRM and asset protection myself and I highly doubt many of the other people here are fans either, but ultimately I dislike AI more than I dislike asset protection.

All right, with that said, let's get into it...



GLAZE

Glaze is a very clever piece of software I discovered recently that aims to modify images in a way that confuses current AI systems and makes them see a different art style than the original image actually has. It doesn't work very well for cartoon/anime-style images at the moment, but hopefully it will improve in that regard in the future. I tried it on some of my drawings which are on the more cartoony side and it ended up putting very noticeable colorful, swirly patterns all over the image. Some other drawbacks are that it takes a considerable amount of time to process images and currently doesn't support Linux. I have an 8-core CPU and it took over half an hour to glaze an image. There's also a GPU version which was much faster, but it ended up just giving me an error and producing no output. Still, it could be useful for anyone with a more painted style and definitely worth keeping an eye on for the rest of us. It's free and can be found here: https://glaze.cs.uchicago.edu/


ENCRYPTION

All right, now here's where things get a little spicy and I'm sure a lot of you will disagree with this method, but I'm gonna throw it out there anyway. Images, text, and other website content could be stored in an encrypted form and then decrypted and presented to the user at load time. That way, any scraper that just downloads all of the content of a website without actually visiting it wouldn't come away with anything immediately usable. A huge drawback to this technique is that it would probably also make search engines just see a bunch of gibberish when trying to index the site, but of course not everything needs to be encrypted. If only images are encrypted but not text, then it would only pose a problem for image search engines but not regular text search engines. I suppose this could also be seen as a benefit if you don't want your site to come up in search results for whatever reason or a non-issue if you don't care about your site getting popular. As long as the decryption keys are stored somewhere on the site itself and not on a remote server, I assume that archived copies of the site will still work so it can be preserved by something like the Wayback Machine.

I'm not exactly an expert on encryption, but I do know of a simple and seemingly pretty fast way of scrambling and unscrambling content that I've tested out myself. It's called an XOR cipher and involves generating a random pattern and XORing it with whatever content needs to be encrypted. The same pattern is used for both encryption and decryption. From what I understand, this method isn't very secure if the random pattern repeats, but if the pattern is suitably random and unique across the whole data, it would be very hard to crack by brute force. Here's an example of image encryption using an XOR cipher:

[attached image: an example image before and after XOR encryption]

As far as how to actually implement this, it can be done with JavaScript pretty easily. To decrypt and display an image, the following process can be used:

  • Request the encrypted image and wait for it to load using an Image object.
  • Draw it to a canvas element of the same size using the drawImage function.
  • Get the raw pixel data using the getImageData function on the canvas context.
  • Loop through all of the pixels and decrypt them.
  • Put the decrypted image onto the canvas using the putImageData function.

The random pattern used for encryption and decryption can either be generated beforehand by any old random function like JavaScript's built-in Math.random() function and then stored somewhere on the site or generated at load time with a custom function that can generate the same pattern every time as long as the same seeds are used. In my test implementation, I went with the latter approach so that only seeds would have to be stored for each piece of data that needs to be decrypted. Here's a simple test implementation that you can use as a reference or build off of if you want to use this in your site. This code works for both encryption and decryption.
Code
<!DOCTYPE html>

<html>
    <head>
        <script>
            window.onload = function() {
                const canvas = document.getElementById("imageviewer");
                let context = canvas.getContext("2d");

                // 2D dot product, used by the hash function below.
                function Dot(x1, y1, x2, y2) {
                    return x1 * x2 + y1 * y2;
                }

                // Deterministic pseudo-random color for a given position;
                // the hardcoded constants act as the seeds.
                function Random2D(vec) {
                    let result = {r: Math.sin(Dot(vec.x, vec.y, 12.541647773170244, 78.8200383690443)) * 43758.685599560275,
                                  g: Math.sin(Dot(vec.x, vec.y, 22.479740860284494, 91.9021302363166)) * 73934.85600553661,
                                  b: Math.sin(Dot(vec.x, vec.y, 45.388330983755054, 66.1049239208788)) * 56655.20010125384};

                    result.r -= Math.floor(result.r);
                    result.g -= Math.floor(result.g);
                    result.b -= Math.floor(result.b);

                    return result;
                }

                let testImage = new Image();
                testImage.src = "encryptedimage.png";

                testImage.onload = function() {
                    // Set the canvas to the size of the image.
                    canvas.width = testImage.width;
                    canvas.height = testImage.height;

                    // Draw the image onto the canvas.
                    context.drawImage(testImage, 0, 0);

                    // Get the pixel data.
                    let imageData = context.getImageData(0, 0, testImage.width, testImage.height);

                    // Encrypt or decrypt the image.
                    let randomColor;

                    for (let i = 0; i < testImage.height; i++) {
                        for (let j = 0; j < testImage.width; j++) {
                            randomColor = Random2D({x: j / testImage.width, y: i / testImage.height});

                            imageData.data[i * testImage.width * 4 + j * 4] ^= Math.floor(randomColor.r * 255);
                            imageData.data[i * testImage.width * 4 + j * 4 + 1] ^= Math.floor(randomColor.g * 255);
                            imageData.data[i * testImage.width * 4 + j * 4 + 2] ^= Math.floor(randomColor.b * 255);
                        }
                    }

                    // Draw the encrypted/decrypted image back to the canvas.
                    context.putImageData(imageData, 0, 0);
                }
            }
        </script>
    </head>

    <body>
        <canvas id="imageviewer"></canvas>
    </body>
</html>

The seeds in the random functions are hardcoded here for demonstration purposes, but they can be stored somewhere clever and passed into the function instead. Both of the functions output values ranging from 0-1 that can be scaled to whatever is needed.
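
Here's a minimal sketch of one way to do that: the same hash, but taking its nine magic numbers as a parameter. The seeds array and its name are just placeholders; where you actually store the numbers is up to you.
Code
// A variant of Random2D that takes its magic numbers as a parameter
// instead of hardcoding them. "seeds" is a hypothetical array you would
// stash somewhere on the site and pass in at load time.
function Random2DSeeded(vec, seeds) {
    let result = {r: Math.sin(vec.x * seeds[0] + vec.y * seeds[1]) * seeds[2],
                  g: Math.sin(vec.x * seeds[3] + vec.y * seeds[4]) * seeds[5],
                  b: Math.sin(vec.x * seeds[6] + vec.y * seeds[7]) * seeds[8]};

    result.r -= Math.floor(result.r);
    result.g -= Math.floor(result.g);
    result.b -= Math.floor(result.b);

    return result;
}

// Example usage inside the pixel loop from the snippet above:
// randomColor = Random2DSeeded({x: j / testImage.width, y: i / testImage.height}, seeds);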

The same principle works for text by storing encrypted ASCII values somewhere and then decrypting them and putting them into an HTML element using the innerHTML property. Since some site hosts like Neocities don't allow free users to store binary data, encrypted text might need to be stored in another format. Two possible options are storing it as base64 text or storing it as an image. I wasn't able to decipher text stored as an image because the getImageData function was giving me the wrong values, but base64 worked and is more convenient anyway. Here's some sample code for decrypting text:

Code
<!DOCTYPE html>

<html>
    <head>
        <script>
            window.onload = function() {
                let pageText = document.getElementById("pagetext");
                
                function Random1D(x) {
                    let result = Math.sin(x) * 86207.11192052712;

                    return result - Math.floor(result);
                }

                // Convert from base64 to normal text.
                let encryptedText = atob(pageText.innerHTML);
                let decryptedText = new Uint8Array(encryptedText.length);
                
                // Loop through the text and decrypt it.
                for (let i = 0; i < encryptedText.length; i++)
                    decryptedText[i] = encryptedText.charCodeAt(i) ^ Math.floor(Random1D(i / encryptedText.length) * 255);
                
                // Replace the encrypted base64 text with the final decoded text.
                pageText.innerHTML = String.fromCharCode(...decryptedText);
            }
        </script>
    </head>

    <body>
        <p id="pagetext">QUtUPaAzhmPWfPuGoEDdVgk6zsP8Rg==</p>
    </body>
</html>

The routine for encrypting the text is slightly different. Here's a little snippet that will encrypt a string, convert it to base64, and dump it to the console so it can be pasted into a webpage:
Code
// Same sin-based pseudo-random function as in the decryption example above.
function Random1D(x) {
    let result = Math.sin(x) * 86207.11192052712;

    return result - Math.floor(result);
}

let testText = "All my homies hate AI.";
let encryptedText = new Uint8Array(testText.length);

// Loop through the text and encrypt it.
for (let i = 0; i < testText.length; i++)
    encryptedText[i] = testText.charCodeAt(i) ^ Math.floor(Random1D(i / testText.length) * 255);

// Convert to base64 and output to console.
console.log(btoa(String.fromCharCode(...encryptedText)));
If you have a question about this code, run into a problem, or want something clarified, feel free to ask. I'm sure there are better and more robust ways of going about this, but these snippets are simply meant as a starting place for integrating this into an actual website.


UNCONVENTIONAL DESIGN

I've heard that some types of scrapers can recognize the layout of a site and take screenshots, so I figure that designing a website in a super weird and unconventional way could be a defense against that. This will probably be the least controversial idea since the whole web revival movement is all about wacky web design anyway. In fact, I'm not sure if we would even need to do much different than we're already doing in order for visually-oriented scrapers to be ineffective since web revival-style sites tend to be very unique and visually complex.


That's all I have for now. Please let me know what you all think. Here are the main questions I'd like to ask everyone:
  • What do you think of these ideas and do you have any ideas to add for ways to thwart web scraping and AI training?
  • Are you concerned about your work being used as AI training data and, if so, is it a big enough deal for you to want to use countermeasures in your website?
  • At what point do you think countermeasures against bots go too far and start to impede other goals like archival and accessibility?

Gans
Sr. Member
Scrap Vulture
« Reply #1 on: June 10, 2023 @374.34 »

Will be tough.

I know a blog writer who deliberately makes lots of spelling mistakes. So every word is a piece of art, if you like. Still readable, downright comical in some cases. This should also come with the side effect of confusing the AI.

In the <head> part of your HTML code, you can use this:
<meta name="robots" content="noindex, nofollow">
"Proper" web crawlers will stop indexing your site now.
"Wacky" web crawlers won't care and still index it.
What a pathetic defense. I'm just assuming that the AI will be going for search engine results in the first place. That's all I have to offer.
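
For comparison, a robots.txt at the site root works one level earlier: it asks compliant crawlers not to fetch pages at all, rather than telling them not to index what they fetched. A minimal sketch that disallows everything for every bot looks like this:
Code
User-agent: *
Disallow: /
It relies on exactly the same politeness as the meta tag though, so the "wacky" crawlers will ignore it just the same.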
sig
Sr. Member ⚓︎
the great
« Reply #2 on: June 10, 2023 @500.54 »

"Wacky" web crawlers won't care and still index it.

at least the AI is goofy about it, The Future Will Be Silly  :ozwomp:  :pc:
The world ends with you. If you want to enjoy life, expand your world. You gotta push your horizons out as far as they go.
arcus
Jr. Member ⚓︎
« Reply #3 on: June 10, 2023 @625.93 »

Quote
What do you think of these ideas and do you have any ideas to add for ways to thwart web scraping and AI training?

Projects such as Glaze are interesting, though I'm not sure how useful they will be long term. They also seem too energy hungry.

Encryption with JavaScript seems excessive. I'll have to test it out later before I can form an opinion on it.

Something interesting that I haven't seen on the English side of the web is passworded art sites. There are Japanese sites out there for shippers who want to keep their art private among other shippers. (Ship art in general. Shipping is still a big taboo there, to the degree that people put "shipping warning" in their Twitter profiles if they draw ship art.) These sites make heavy use of passwords, have measures in place to stop people from saving images, and don't let you view the site at all without an account. You would only know the passwords people use if you were in the know. Sites like these might be handy for fandom artists at least. There are (or were?) similar sites meant to be used exclusively with Twitter.

Quote
Are you concerned about your work being used as AI training data and, if so, is it a big enough deal for you to want to use countermeasures in your website?

Yes, but not because of my art being taken without permission. My issue is that I'm forced into supporting corporations that profit off the training data from my work, especially since these services rely on underpaid workers for quality control. If you're not going to pay me, at least pay your workers.

I haven't uploaded my art publicly in a while. If I were to upload my art again, it would only be sketches, but I wouldn't use any drastic measures.

Quote
At what point do you think countermeasures against bots go too far and start to impede other goals like archival and accessibility?

I'm not sure. This is always up to the artist, and how they feel about their art, and who the art is for. If the art is supposed to be private to a degree, small, private art communities seem more appropriate for this. That said, having the internet become more closed isn't good either.

Quote
I know a blog writer who deliberately makes lots of spelling mistakes. So every word is a piece of art, if you like. Still readable, downright comical in some cases. This should also come with the side effect of confusing the AI.

Spelling mistakes aren't an issue. The corpora that machine learning services use are huge, and machine learning incorporates dictionaries and can correct typos.

Quote
In the <head> part of your HTML code, you can use this:
<meta name="robots" content="noindex, nofollow">
"Proper" web crawlers will stop indexing your site now.
"Wacky" web crawlers won't care and still index it.
What a pathetic defense. I'm just assuming that the AI will be going for search engine results in the first place. That's all I have to offer.

A popular source for corpora is archive.org, so this does help a little.

Quote
I've heard that some types of scrapers can recognize the layout of a site and take screenshots, so I figure that designing a website in a super weird and unconventional way could be a defense against that. This will probably be the least controversial idea since the whole web revival movement is all about wacky web design anyway. In fact, I'm not sure if we would even need to do much different than we're already doing in order for visually-oriented scrapers to be ineffective since web revival-style sites tend to be very unique and visually complex.

This will do nothing. Those screenshots are specifically for saving what websites look like, in addition to crawling all the URLs.

grovyle
Jr. Member
« Reply #4 on: June 10, 2023 @687.37 »

I don't really go here because my HTML and CSS knowledge is close to zero and I don't put up any "content" on the internet. Just wanted to add two things:
  • For AI that steals drawn art, I've seen that making uneven watermarks throws the AI off. It'll still rip your art, but your name will be all over the result if it can't recognize a consistent pattern in the watermark that would let it erase the mark easily. This sounds easier to me than encrypting, but it doesn't really solve the problem of AI ripping people off. Similar to what you meant with "unconventional design". (There's a rough sketch of the idea right after this list.)
  • For fanfic on AO3 we are being told to lock fics so only logged-in users can see them. There are individual people saying that they're feeding unfinished fanfics to ChatGPT so that it finishes them, but I think locking should still help when someone tries to feed it huge amounts of fanfics.
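
Here's a rough sketch of the uneven-watermark idea, using the same canvas approach as the image code earlier in the thread: stamp a name across the picture at random positions, angles, and opacities so there's no single repeating pattern to detect and erase. The image path, the watermark text, and the numbers (how many stamps, how transparent) are just placeholders to play with.
Code
// Rough sketch: stamp a watermark at random positions, angles and opacities
// so there is no single repeating pattern to detect and erase.
window.onload = function() {
    const canvas = document.createElement("canvas");
    const context = canvas.getContext("2d");

    let artwork = new Image();
    artwork.src = "mydrawing.png";   // placeholder path

    artwork.onload = function() {
        canvas.width = artwork.width;
        canvas.height = artwork.height;
        context.drawImage(artwork, 0, 0);

        context.font = "bold 24px sans-serif";
        context.fillStyle = "white";

        for (let i = 0; i < 12; i++) {
            context.save();
            context.globalAlpha = 0.15 + Math.random() * 0.25;
            context.translate(Math.random() * canvas.width, Math.random() * canvas.height);
            context.rotate((Math.random() - 0.5) * Math.PI / 2);
            context.fillText("your name here", 0, 0);   // placeholder watermark text
            context.restore();
        }

        document.body.appendChild(canvas);   // show the result so it can be saved
    };
};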
Memory
Guest
« Reply #5 on: June 10, 2023 @743.07 »

Quote
For fanfic on AO3 we are being told to lock fics so only logged-in users can see them. There are individual people saying that they're feeding unfinished fanfics to ChatGPT so that it finishes them, but I think locking should still help when someone tries to feed it huge amounts of fanfics.

I'm aware of... certain sites that post (or leak, depending on how you look at it) content from sites that require an account and/or payment, like Patreon. Making an account for a bot to scrape content that isn't public isn't too difficult, and might not even be noticed. I doubt any sufficient solution to scraping for AI training data will be made in the near future.
shevek
Sr. Member ⚓︎
˚₊⁀꒷₊˚︰₊︶꒦꒷₊⊹︰꒷
« Reply #6 on: June 23, 2023 @594.66 »

There have been concerns for a little while now (currently being hotly debated) that GitHub could be used to train AI, even on private repositories. That risk is worrying, since many of those repositories contain the secrets of companies that use GitHub for hosting. But this also affects people in our sphere who host their content there.

At the center of the fears are GitHub being owned by Microsoft, and also their stance on privacy and data sharing. GitHub openly says that while no human eyes will see your private repositories except as described in their ToS, the data is still scanned (for safety purposes), and they share aggregate data learned from that analysis with their partners.
Some see a risk of that being slowly reworded into, or already including, permission for AI training on that data.

People who want to protect themselves from this have switched from GitHub to self hosting via Gitea. I thought I could mention it here in case this is also something anyone wants to do to protect their content from being included in AI training.
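
If anyone wants to go that route, the usual way is to run Gitea's official container image. Here's a minimal docker-compose sketch based on the defaults in Gitea's docs; the paths and host ports are just common choices, adjust them to your setup.
Code
# Minimal docker-compose sketch for self-hosting Gitea (based on the
# defaults in Gitea's documentation; adjust paths and ports to taste).
services:
  gitea:
    image: gitea/gitea:latest
    restart: always
    environment:
      - USER_UID=1000
      - USER_GID=1000
    volumes:
      - ./gitea-data:/data        # repositories and config live here
    ports:
      - "3000:3000"               # web interface
      - "2222:22"                 # SSH access for pushing/pulling over git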

Odo was just an idea. Shevek is the proof.
microbyte
Newbie ⚓︎
« Reply #7 on: June 24, 2023 @139.26 »

Just a thought, but adding human-invisible, but computer-visible noise to images is HIGHLY effective at screwing up image classification (e.g. from the search I just did after remembering this technique: https://www.bleepingcomputer.com/news/technology/adding-noise-to-images-is-enough-to-fool-googles-top-notch-image-recognition-ai/). Idk if it'll work with image-generation AI, but I would imagine it does, as a lot of the image generators use classification GANs.
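
As a very rough illustration of the idea using the same canvas tricks from earlier in the thread, here's a sketch that adds faint random noise to an image before you export it. To be clear: the research in that link uses perturbations computed against a specific model; plain random noise like this is much weaker, so treat it as a starting point rather than real protection. The image path is a placeholder.
Code
// Rough sketch: add faint random noise to an image with a canvas.
// NOTE: this is plain random noise, not the model-targeted perturbations
// from the linked article, so treat it as an illustration only.
window.onload = function() {
    const canvas = document.createElement("canvas");
    const context = canvas.getContext("2d");

    let artwork = new Image();
    artwork.src = "mydrawing.png";   // placeholder path

    artwork.onload = function() {
        canvas.width = artwork.width;
        canvas.height = artwork.height;
        context.drawImage(artwork, 0, 0);

        let imageData = context.getImageData(0, 0, canvas.width, canvas.height);

        // Nudge each color channel by up to +/- 4: barely visible to people,
        // but it changes every raw pixel value.
        for (let i = 0; i < imageData.data.length; i += 4) {
            for (let c = 0; c < 3; c++) {
                imageData.data[i + c] += Math.round((Math.random() - 0.5) * 8);
            }
        }

        context.putImageData(imageData, 0, 0);
        document.body.appendChild(canvas);   // show the result so it can be saved
    };
};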
kero
Newbie
« Reply #8 on: July 07, 2023 @818.69 »

 :ohdear: I took down my site in 2021 but found out not too long ago that it had already been used as part of Google’s AI training set before that. I don’t like that.
shevek
Sr. Member ⚓︎
˚₊⁀꒷₊˚︰₊︶꒦꒷₊⊹︰꒷
« Reply #9 on: August 09, 2023 @840.45 »

Update: OpenAI has now released ways to block their crawler GPTBot (allegedly)

Official website

Excerpt:

Code
Disallowing GPTBot

To disallow GPTBot to access your site you can add the GPTBot to your site’s robots.txt:

User-agent: GPTBot
Disallow: /

Code
Customize GPTBot access

To allow GPTBot to access only parts of your site you can add the GPTBot token to your site’s robots.txt like this:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

Bots seem to come from these IPs?

Odo was just an idea. Shevek is the proof.