Author Topic: Bots and Scrapers  (Read 286 times)
brisray
Sr. Member ⚓︎
« on: February 20, 2025 @69.12 »

I'm not a great fan of bots and scrapers but I wrote one anyway.

Towards the end of each month I visit every webring I know of to make sure it is still active. There are over 320 of them, so it was taking a while. I thought I could save a bunch of time if I wrote something to check the rings for me, so the other day I wrote a simple PowerShell script to do just that. What used to take most of the day, I can now do in less than 3 minutes.

Only around 20% of webrings work as they should. For whatever reason the code disappears from the ring sites and so the ring gets broken. The script can be adapted to check the member sites for the ring code, or one of the premade scrapers can be used, or you can even write your own using standard Unix tools like sed and grep on Linux or Macs.
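
A minimal Python sketch of that kind of checker (the real script is PowerShell; the file name and ring-code marker here are hypothetical):

Code
# Hypothetical sketch of a webring checker: fetch each ring URL from a
# list, note the HTTP status, and grep the page for the ring code.
import urllib.request
import urllib.error

RING_MARKER = "webring"  # hypothetical string to look for in each page

def check_url(url, timeout=10):
    """Return (status, note) for one site."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            html = resp.read().decode("utf-8", errors="replace")
            found = RING_MARKER in html.lower()
            return resp.status, "ring code found" if found else "ring code missing"
    except urllib.error.HTTPError as e:
        return e.code, e.reason        # the server answered with an error page
    except OSError as e:               # URLError, DNS failure, timeout, ...
        return None, str(e)

with open("webrings.txt") as f:        # hypothetical list, one URL per line
    for url in (line.strip() for line in f if line.strip()):
        status, note = check_url(url)
        print(f"{url}\t{status}\t{note}")
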
Melooon
Hero Member ⚓︎
« Reply #1 on: February 20, 2025 @105.74 »

I thought you'd find this interesting; it's the complete indexer bot code from a search engine I made a few years ago in Java: https://github.com/Melonking906/Daniels-Network-Search/tree/main/indexer/src/net/danielsnet/indexer/indexers

It essentially takes a list of domains, visits every domain and finds every link within the domain, then gathers every word used on every page and turns it into an index.
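
A toy Python sketch of that loop (the real indexer above is Java; for brevity this one indexes raw HTML without stripping tags):

Code
# Crawl pages within one domain, then build an inverted index of
# word -> set of pages that use it.
import re
import urllib.request
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_domain(start_url, max_pages=50):
    domain = urlparse(start_url).netloc
    index = defaultdict(set)                    # word -> pages that use it
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                            # dead link, move on
        for word in re.findall(r"[a-z']+", html.lower()):
            index[word].add(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain:   # stay on this domain
                queue.append(absolute)
    return index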

I discovered pretty quickly that if you compare sites that use similar words, you can quite accurately guess what their subject or genre is (e.g. sites that say "picture", "art", "commission" on their homepage are almost always sites of artists), so I used it for sorting sites into a simple directory at the time!

It could also be used to create a simple "if you like this, then you might like that" suggestion system that gathers sites that use similar words to each other.
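
A hypothetical sketch of that suggestion idea, assuming a dict mapping each site to the set of words on its homepage (e.g. built from an index like the one above):

Code
def jaccard(a, b):
    """Overlap between two word sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a and b else 0.0

def suggest(site, site_words, top_n=5):
    """'If you like this site, you might like...' by shared vocabulary."""
    scores = [(other, jaccard(site_words[site], site_words[other]))
              for other in site_words if other != site]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]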

Also here is code which was designed to scrape every site hosted on neocities by clicking through each page on the browse listing: https://github.com/Melonking906/Daniels-Network-Search/blob/main/indexer/src/net/danielsnet/indexer/neocities/NeocitiesScrapeRunnable.java

The neocities code is interesting because it essentially means you can compile metrics on every neocities site and figure out community trends, but that proved more of a curiosity for me and the work needed to actually make sense of all the info was not worth the time!

I agree scrapers are a slippery slope and you can get lost in all the data, but the computer scientist in me still loves biting into some crunchy data processing (Before I remember how annoying it all becomes to manage  :tongue: )



For webrings I totally agree, I wouldn't wanna manage a large webring without some sort of scraper to filter out discontinued sites. The tricky bit comes in when you have sites that are just slow to load or temporarily offline. I've not solved this for the surf club yet, but I might add a "three strikes and you're out" system for offline sites!
wolfkitty42
Casual Poster ⚓︎
« Reply #2 on: February 20, 2025 @127.12 »


Quote from: brisray
Towards the end of each month I visit every webring I know of to make sure it is still active. There are over 320 of them, so it was taking a while.


WOW. I just spent a while looking at this webring index! That was really wonderful. I have one question: is the description for ACDS meant to be a joke, or is that just what showed up when you googled the acronym? I recognize a lot of the usernames listed on it and I'm pretty sure it's a webring for members of the Andrew Cunningham Discord Server. If it's a joke feel free to just disregard this, haha.
nobo
Jr. Member ⚓︎
« Reply #3 on: February 21, 2025 @124.80 »

Quote from: Melooon
I might add a "three strikes and you're out" system for offline sites!

You could refill the strikes if uptime resumes.
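
A minimal sketch of strikes-with-refills (the constant and bookkeeping dict are invented):

Code
# Hypothetical "three strikes with refills": a site is dropped after
# three consecutive failed checks, and its strikes refill to zero the
# moment it answers again.
MAX_STRIKES = 3
strikes = {}                               # site URL -> consecutive misses

def record_check(site, is_up):
    """Update a site's strikes after one round; False means "you're out"."""
    if is_up:
        strikes[site] = 0                  # uptime resumed: refill
    else:
        strikes[site] = strikes.get(site, 0) + 1
    return strikes[site] < MAX_STRIKES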



I have a bot that scrapes and patches Super Mario World romhacks. Then it has a TUI launcher to query them and play them.

https://github.com/divSelector/mario-mod-manager

I might write a howto article about this eventually, since it's pretty cool but not straightforward to set up.



Scrapers are a pretty decent intro project for someone who has a little web experience and wants to get into programming.



You can get yourself into a bit of trouble with them though if you're not mindful. :innocent:
candycanearter07
Hero Member ⚓︎
« Reply #4 on: February 21, 2025 @728.15 »


Quote from: nobo
You could refill the strikes if uptime resumes.



I have a bot that scrapes and patches Super Mario World romhacks. Then it has a TUI launcher to query them and play them.

https://github.com/divSelector/mario-mod-manager

That's pretty cool! I used to have a py script to patch SMW roms that I used a lot.

snippet:
Code
import subprocess

# Apply the bsdiff patch with the bspatch CLI:
# bspatch <oldfile> <newfile> <patchfile>
subprocess.run(["bspatch",
                f"{rompath}/{workingdata[0]}",   # original SMW ROM
                patchedfile,                     # output: the patched ROM
                f"{working}/delta.bsdiff4"])     # the downloaded patch
print(f"Created patch file {patchedfile}")
nobo
Jr. Member ⚓︎
« Reply #5 on: February 22, 2025 @266.61 »

Quote from: candycanearter07
I used to have a py script to patch SMW roms that I used a lot.

fellow romhacker  :unite: 

yeah even just that saves a lot of time.

I started with something like that as well until I realized the main barrier to playing another hack is that you have to go online, find another one, download it, patch the vanilla rom, etc.

It starts out as not a big deal. But then you start playing a lot of them and realize that only 1 in 10 is actually interesting anymore. So you have to figure out how to speed up the process.

That's what led to the bot.  :ha:
brisray
Sr. Member ⚓︎
« Reply #6 on: February 25, 2025 @669.06 »

@Melooon It's so cool that you write your own utilities. I love the idea that you can tell these lumps of plastic and metal to do what you want; it's why I got into programming years ago. The three strikes rule is a good idea. Things happen and sites can be unavailable when they are checked. Now and then I have come across pages that are being edited while I am reading them.

@wolfkitty42 - oops my mistake. Some days I can be such a ditz. I usually take the description from the ring home page but messed up with that one.

A little bit of trivia: some algorithms are older than computers! One place I worked at did things to databases, among which was very thoroughly looking for duplicates. We used Soundex a lot. Soundex matches words that sound similar to each other (homophones), and the algorithm for doing it was patented in 1918. Wikipedia and my site.
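
The whole algorithm fits in a few lines; a quick Python sketch of classic Soundex:

Code
# Encode a word as its first letter plus three digits, so names that
# sound alike (Robert, Rupert -> R163) get the same code.
CODES = {ch: d for d, letters in enumerate(
         ("bfpv", "cgjksxz", "dt", "l", "mn", "r"), start=1)
         for ch in letters}

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = CODES.get(word[0])
    for ch in word[1:]:
        if ch in "hw":
            continue                     # h and w don't break a run
        code = CODES.get(ch)
        if code and code != prev:
            digits.append(str(code))
        prev = code                      # a vowel resets the run
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
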
Y2KStardust
Jr. Member ⚓︎
« Reply #7 on: February 28, 2025 @466.60 »

Out of curiosity with the webring scraper - is there a way to tell a false negative from an actual dead site? I immediately thought of how I have a robots.txt file, and iirc those are now a default part of a new site - the ones Neocities gives out don't automatically block AI scrapers and the like, but my robots.txt does, and I feel like more people will be moving to that as a response to things like ChatGPT.

My basic question here is - if a website flags as dead from this scraper, is there a way to tell apart actual dead-ness from a site that's just blocking scraping? :O
Melooon
Hero Member ⚓︎
« Reply #8 on: February 28, 2025 @573.98 »

Quote from: Y2KStardust
from a site that's just blocking scraping?

So a robots.txt is more like an honour system; it's for telling a crawler "hey, this directory is just for internal stuff or things I don't want to show up in search results, please ignore it" - at that point the crawler can choose to honour your request, or it can ignore it. Additionally, some crawlers will tell you their "User Agent", i.e. who they are ("Google Bot"), while others will just give a generic browser name ("Firefox 10.0"). And some crawlers will slow down their clicks to make themselves appear more like a real person reading the page.

There is no reliable way to block a crawler that has decided to ignore your robots.txt request and is not telling you its user agent, and there is no law saying it has to do these things :ziped: It's a wild web out there, and once you put something on the web you have to assume that every person and bot can see it!

Sooo, as to your question! A robots.txt would not bother a bot checking webring compliance, because the bot would probably never look at the robots.txt to get its job done :skull:
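
This is what honouring robots.txt looks like for a polite crawler, using Python's standard library; nothing forces a crawler to run this check (the URL and user agent are placeholders):

Code
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")     # placeholder domain
rp.read()

# The answer depends on the user agent the crawler reports:
print(rp.can_fetch("MyLittleCrawler/1.0", "https://example.com/private/"))
print(rp.can_fetch("*", "https://example.com/"))
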
brisray
Sr. Member ⚓︎
« Reply #9 on: March 02, 2025 @794.87 »

Apart from writing its output to a CSV file, the scraper also records the server error messages, such as "Page Not Found", or "The remote name could not be resolved" if it cannot find the site at all.
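
A hypothetical sketch of that CSV step, reusing check_url() from the checker sketch earlier in the thread (column names invented):

Code
import csv

with open("webrings.txt") as f:                  # the ring list from before
    urls = [line.strip() for line in f if line.strip()]

with open("webring-status.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "status", "note"])
    for url in urls:
        status, note = check_url(url)            # from the earlier sketch
        writer.writerow([url, status or "unreachable", note])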

I keep a separate list of webrings that have gone, and I check those as well. Stuff happens, and pages go offline now and then.

Bots aren't necessarily bad, but the ones that annoy me most are those that belong to companies that make you pay to see whatever information they've gathered. According to my copy of AWStats, one of my sites served 56,000 pages to humans last month, but 858,000 to bots. I'm thinking of going through the list of bots it produces and blocking some of them from the server.
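
A hypothetical sketch of that triage, tallying user agents from a combined-format access log (AWStats already produces such a list; this is the DIY version, and the log path is made up):

Code
import re
from collections import Counter

UA = re.compile(r'"[^"]*" "([^"]*)"$')   # last quoted field = user agent

counts = Counter()
with open("access.log") as log:          # hypothetical log path
    for line in log:
        match = UA.search(line.rstrip())
        if match:
            counts[match.group(1)] += 1

for agent, hits in counts.most_common(20):
    print(f"{hits:8d}  {agent}")
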