The Best of the Web Filter
There are a lot of bad websites out there, and visiting one can do bad things to your network. Plenty of technologies try to detect bad pages and block or filter them, but they're all imperfect. Other solutions, like whitetrash, take the opposite approach and only let you visit sites that are explicitly listed as good (whitetrash is awesome, go play with it). The hard part is building the list of good sites: whitetrash lets users and administrators decide (and I think there's a training mode where it tries to figure it out for you), but I started thinking about other lists you could use.

At about the same time I was writing a paper about Wikipedia, 4chan and Twitter (I may post it here in the future) and was looking for statistics about their popularity. That's when I remembered alexa.com and their handy ratings, and while poking around there I discovered that they'll give you a CSV file of the top 1 million domains. I decided to see what the web would look like if you could only visit the top n sites.
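For reference, each line of that CSV is just a rank and a bare domain name, something like the following (the exact ordering obviously shifts over time, so treat these rows as illustrative), and that rank,domain layout is what the script below keys off:

1,google.com
2,facebook.com
3,youtube.com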
I decided to implement it as a Squid redirector in Python 3. Redirectors are dead simple: Squid writes each request to the redirector's standard input, and the redirector writes the new destination (or a blank line to leave the request alone) to standard output. I chose Python 3 because I keep meaning to spend more time getting used to the language changes (it's trivial to switch it back to 2.6).
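For context, wiring a redirector into Squid is only a couple of lines in squid.conf; something like the sketch below, where the script path is a placeholder and the directive names vary a little by version (older releases call it redirect_program):

url_rewrite_program /usr/local/bin/topsites_redirect.py
url_rewrite_children 5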
Anyway, here's the script (and a download link):
#!/usr/bin/python3
import sys
import urllib.parse

maxsites = 5000
failurl = 'http://myfaildomain/'
pathtofile = '/path/to/top-1m.csv'

# Build the whitelist: every domain in the CSV with a rank <= maxsites.
sites = {l.split(',')[1].strip() for l in open(pathtofile) if int(l.split(',')[0]) <= maxsites}
# Make sure the fail page itself is always reachable.
sites.add(urllib.parse.urlparse(failurl).netloc)

for l in sys.stdin:
    try:
        # Squid passes the requested URL as the first whitespace-separated field.
        fqdn = urllib.parse.urlparse(l.split()[0]).netloc
        # Allow the exact host, or its parent domain (www.example.com -> example.com).
        if fqdn not in sites and \
           ".".join(fqdn.split('.')[1:]) not in sites:
            sys.stdout.write('%s?fqdn=%s\n' % (failurl, fqdn))
        else:
            # A blank line tells squid to leave the request alone.
            sys.stdout.write('\n')
        sys.stdout.flush()
    except Exception:
        # Always answer, even for a mangled request line, otherwise squid
        # will sit waiting for a response that never comes.
        sys.stdout.write('\n')
        sys.stdout.flush()
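If you want to poke at it without a running Squid, you can feed it a line in the redirector input format by hand. The only field the script actually looks at is the URL; the client address, user and method in the example below are made up, and the script name is the same placeholder used in the config snippet above:

echo 'http://www.example.com/ 192.168.0.1/- - GET -' | ./topsites_redirect.py

It prints the fail URL (with ?fqdn= and the host appended) for hosts that miss the list, and a blank line for hosts that hit it.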