Wednesday, December 30, 2009

The Best of the Web Filter

There are a lot of bad websites out there, and visiting one can do bad things to your network. There are plenty of technologies that try to detect bad pages and block or filter them, but the problem is that they're imperfect. There are other solutions like whitetrash that only let you visit sites listed as good (whitetrash is awesome, go play with it). The problem is creating the list of good sites. whitetrash lets users and administrators decide (and I think there's a training mode where it tries to figure it out for you), but I got thinking about other lists you could use. At about the same time I was writing a paper about Wikipedia, 4chan and Twitter (I may post it here in the future) and was looking for statistics about their popularity. That's when I remembered Alexa and their handy ratings, and while poking around there I also discovered that they'll provide you with a CSV file of the top 1 million domains. I decided to see what the web would be like if you could only visit the top n sites.

I decided to implement it as a squid redirector in Python 3. They're dead simple: read a request off standard input, write the new destination to standard output. I chose Python 3 because I keep meaning to spend more time getting used to the language changes (it's trivial to switch it back to 2.6).

Anyway here's the script (and a download link):

import sys
import urllib.parse

maxsites = 5000
failurl = 'http://myfaildomain/'
pathtofile = '/path/to/top-1m.csv'

# Build the whitelist from the first maxsites rows of the CSV (rank,domain)
sites = {l.split(',')[1].strip() for l in open(pathtofile)
         if int(l.split(',')[0]) <= maxsites}

for l in sys.stdin:
    try:
        fqdn = urllib.parse.urlparse(l.split()[0]).netloc
        if fqdn not in sites and \
           '.'.join(fqdn.split('.')[1:]) not in sites:
            # Not whitelisted: send the browser to the fail page
            sys.stdout.write('%s?fqdn=%s\n' % (failurl, fqdn))
        else:
            # Whitelisted: a blank line tells squid to leave the URL alone
            sys.stdout.write('\n')
    except Exception:
        sys.stdout.write('\n')
    sys.stdout.flush()

You may notice that this is horribly inefficient: squid load balances across multiple instances of the script (5 by default), and each one keeps a big data structure in memory containing all the allowed sites. This is part of the reason there's a parameter to limit the number of sites added to the whitelist.
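For reference, hooking the script into squid looks something like the snippet below (paths are hypothetical, and this assumes squid 2.6 or later, where the helper directives are named as shown). Dropping the number of helper children reduces the number of in-memory copies of the whitelist:

```
# squid.conf fragment (hypothetical paths)
url_rewrite_program /usr/bin/python3 /usr/local/bin/topsites_redirect.py
# fewer children = fewer copies of the whitelist in memory
url_rewrite_children 2
```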

The list actually creates a pretty decent browsing experience, but it isn't perfect. First, some sites won't have all their content available because parts of it are served off a non-whitelisted domain (some of that is mitigated by retrying the domain with its least significant subdomain stripped), but this is also a feature, since you miss injected iframes as well as nasty tracking scripts. Secondly, some sites you visit just won't be as popular as you think; in a real implementation I think it'd be worth scraping the top X sites for your country and the top Y sites in each category and adding them to the list as well. Finally, the list is only of popular sites: plenty of sites on it could be unsafe, including several porn sites and other sites of dubious security posture, so there are still risks, and this isn't a good policy-enforcement list.
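To make the subdomain fallback concrete, here's a quick sketch of the check the script performs (the whitelist entries here are just examples, not taken from the real list):

```python
# If the full hostname isn't whitelisted, retry with its least
# significant subdomain removed, so content domains like
# images.google.com still resolve against a whitelist entry of
# google.com. Example whitelist entries only.
sites = {'google.com', 'bbc.co.uk'}

def allowed(fqdn):
    parent = '.'.join(fqdn.split('.')[1:])  # strip leftmost label
    return fqdn in sites or parent in sites

print(allowed('images.google.com'))      # True: parent domain is listed
print(allowed('cdn.adnetwork.example'))  # False: neither form is listed
```

Note this only strips one label, so deeply nested hosts under a listed domain can still miss.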

I don't see much in the way of production use for this type of list, but maybe it would help people pre-screen URLs. For example, if you were doing blacklisting based off a malware database (like Google's Safe Browsing API) but lookups were expensive, you could filter out any domains on the popular list to speed things up (with some risks). The same goes for a high-interaction honeyclient: screening out URLs up front could save you a lot of time.
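A pre-screen along those lines might look like the sketch below. `expensive_lookup` is a hypothetical stand-in for whatever costly check you're doing (a Safe Browsing query, a honeyclient run), and the `popular` set would be loaded from the CSV in practice:

```python
import urllib.parse

# Example entries; in practice, load these from top-1m.csv as above.
popular = {'google.com', 'wikipedia.org'}

def screen(urls, expensive_lookup):
    """Skip the expensive check for domains on the popularity list."""
    results = {}
    for url in urls:
        fqdn = urllib.parse.urlparse(url).netloc
        if fqdn in popular:
            results[url] = 'assumed-ok'   # accept the (small) risk
        else:
            results[url] = expensive_lookup(url)
    return results

print(screen(['http://google.com/', 'http://evil.example/'],
             lambda u: 'checked'))
```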

As an aside I wrote a similar script to the one above to compare URLs against the Google Safe Browsing API database but I seem to have lost the code.
