Since the first days of the web, users and administrators have sought to control the flow of information from the Internet to the local device. There are many different ways to implement internet filters, and numerous goals that organizations may want to achieve:
- blocking malicious sites,
- blocking ads or trackers,
- blocking responses based on the category of content a site serves.
Today’s post explores the last of these: blocking content based on category.
The Customer Goal
The customer goal is generally a straightforward one: Put the administrator in control of what sorts of content may be downloaded and viewed on a device. This is often intended as an enforcement mechanism for an organization’s Acceptable Use Policy (AUP).
An AUP often defines what sorts of content a user is permitted to interact with. For example, a school may forbid students from viewing pornography and any sites related to alcohol, tobacco, gambling, firearms, hacking, or other criminal activity. Similarly, a company may want to forbid its employees from spending time on dating and social media sites, or from using legitimate-but-unsanctioned tools for sharing files, conducting meetings, or interacting with artificial intelligence.
Mundane Impossibility
The simplicity of the goal belies the impossibility of achieving it. On today’s web, category filtering is inherently impossible to perfect for several reasons:
- New sites arrive all day, every day.
- The content served by any site can change at any time.
- There are an infinite number of categories for content (and no true standard taxonomy).
- Decisions of a site’s category are inherently subjective.
- Many sites host content across multiple categories, and some sites host content from almost every category.
- Self-labelling schemes like ICRA and PICS (Platform for Internet Content Selection), whereby a site declares its own category, have all failed to gain adoption.
As an engineer, it would be nice to work only on tractable problems, but in life there are many intractable problems for which customers are willing to buy imperfect, best-effort solutions.
Web content categorization is one of these, and because sites and categories change constantly, it’s typically the case that content filtering is sold on a subscription basis rather than as a one-time charge. Most of today’s companies love recurring revenue streams.
So, given that customers have a need, and software can help, how do we achieve that?
Filtering Approaches
The first challenge is figuring out how and where to block content. There are many approaches; Microsoft’s various Web Content Filtering products and features demonstrate three of them:
- Browser Integration – Defender WCF-atop-Edge, Edge’s Native WCF (a non-free feature)
- On-Device Network Filtering – Defender WCF-atop-Network-Protection
- Remote Proxy or VPN – Microsoft Entra Global Secure Access
Each implementation approach has its pluses and minuses across several dimensions:
- Supported browsers: Does it work in any browser? Only a small list? Only a specific one?
- Performance: Does it slow down browsing? Because the product may categorize billions of URLs, it’s usually not possible to store the full map on the client device (see the caching sketch after this list).
- User-experience: What kind of block notice can be shown? Does it appear in context?
- Capabilities: Does it block only on navigating a frame (e.g. HTML), or can it block any sub-resources (images, videos, downloads, etc.)? Are blocks targeted to the current user, or to the entire device?
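To make the performance point concrete, here is a minimal sketch of the usual mitigation: query a remote categorization service and cache its verdicts on the client. The endpoint, JSON shape, and cache size below are illustrative assumptions, not any real product’s API.

```python
from functools import lru_cache
import json
import urllib.request

# Note: a production client would also need cache expiry, since a
# site's category can change at any time; lru_cache alone has no TTL.
@lru_cache(maxsize=10_000)  # repeat visits to a host skip the network
def lookup_category(hostname: str) -> str:
    # Hypothetical categorization service returning {"category": "..."}.
    url = f"https://categorizer.example/api/lookup?host={hostname}"
    with urllib.request.urlopen(url, timeout=3) as resp:
        return json.load(resp)["category"]
```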
Categorization
After choosing a filtering approach, the developer must then choose a source of categorization information. Companies that are already constantly crawling and monitoring the web (e.g. to build search engines like Bing or Google) might perform categorization themselves, but most vendors acquire data from a classification vendor like NetStar or Cyren that specializes in categorization.
Exposing the classification vendor’s entire taxonomy might be problematic, though: if you bind your product offering too tightly to a third-party classification, any taxonomy changes made by the classification vendor could become breaking changes for your product and its customers. So, it’s tempting to go the other way: ask your customers what categories they expect, then map the classification vendor’s taxonomy onto those customer-visible categories.
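A minimal sketch of that indirection layer, assuming invented category names on both sides (this is not the actual Defender WCF or vendor schema):

```python
# Mapping from the vendor's (large, changeable) taxonomy onto the
# stable product taxonomy. If the vendor renames or splits a category,
# only this table changes; customer policies keep working.
VENDOR_TO_PRODUCT = {
    "Pornography": "Adult",
    "Casino/Betting": "Gambling",
    "Remote Proxies": "Illegal Software",  # the surprising archive.org case
    "Social Networking": "Social Media",
}

def product_categories_for(vendor_categories: set[str]) -> set[str]:
    """Translate a vendor verdict into product categories, dropping any
    vendor category that has not (yet) been mapped."""
    return {VENDOR_TO_PRODUCT[c] for c in vendor_categories
            if c in VENDOR_TO_PRODUCT}

print(product_categories_for({"Remote Proxies"}))  # {'Illegal Software'}
```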
This is the approach taken by Microsoft Defender WCF, but it can lead to surprises. For example, Defender WCF classifies archive.org in the Illegal Software category, because that’s where our data vendor’s Remote Proxies category is mapped. But to the browser user, that opaque choice might be very confusing: while archive.org almost certainly contains illegal content (it’s effectively a time-delayed proxy for the entire web), that is not the category a normal person would first think of when asked about the site.
Ultimately, an enterprise that implements Web Content Filtering must expect that there will be categorizations with which they disagree, or sites whose categorization they accept but wish to allow anyway (e.g. because they run ads on a particular social networking site). Administrators should define a process by which users can request exemptions or reclassifications, then evaluate whether each request is reasonable. Within Defender, an ALLOW Custom Network indicator will override any WCF category blocks of a site.
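In policy-evaluation terms, the exemption simply has to be checked before the category verdict. A minimal sketch, with illustrative names (this is not Defender’s actual evaluation engine):

```python
ALLOW_INDICATORS = {"archive.org"}          # admin-created exemptions
BLOCKED_CATEGORIES = {"Illegal Software"}   # categories the policy blocks

def verdict(host: str, categories: set[str]) -> str:
    # Explicit allow indicators are evaluated first, so an exemption
    # wins even when the site's category is otherwise blocked.
    if host in ALLOW_INDICATORS:
        return "allow"
    if categories & BLOCKED_CATEGORIES:
        return "block"
    return "allow"

print(verdict("archive.org", {"Illegal Software"}))         # allow (exempted)
print(verdict("example-warez.test", {"Illegal Software"}))  # block
```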
Aside: Performance
If your categorization approach requires making a web-service request to look up a content category, you typically want to do so in parallel with the request for the content to improve performance. However, what happens if the resource comes back before the category information?
Showing the to-be-blocked content (e.g. a pornographic website) for even a few seconds might be unacceptable. To address that concern for Defender’s WCF, Edge currently offers a flag on the about:flags page to control this behavior.
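A minimal sketch of the parallel-lookup race, with asyncio stand-ins for the real download and categorization web service; names and delays are illustrative. Both operations start at once, but the response body is never released until the category verdict arrives.

```python
import asyncio

async def fetch_content(url: str) -> bytes:
    await asyncio.sleep(0.05)   # stand-in for the actual download
    return b"<html>...</html>"

async def lookup_category(url: str) -> str:
    await asyncio.sleep(0.20)   # stand-in for the web-service lookup
    return "General"

async def filtered_fetch(url: str) -> bytes:
    # Start both operations at once so the lookup overlaps the download...
    content_task = asyncio.create_task(fetch_content(url))
    category_task = asyncio.create_task(lookup_category(url))
    # ...but gate release on the verdict: even when the content arrives
    # first (as it does here), it is held until the category is known.
    content, category = await asyncio.gather(content_task, category_task)
    if category in {"Adult", "Gambling"}:
        raise PermissionError(f"{url} blocked (category: {category})")
    return content

print(asyncio.run(filtered_fetch("https://example.com/")))
```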
Aside: Sub-resources
It’s natural to assume that checking the category of all sub-resources would be superior to only checking the category of page/frame navigations: after all, it’s easy to imagine circumventing, say, an adult content filter by putting up a simple webpage at some innocuous location with a ton of <video> elements that point directly to pornographic video content. A filter that blocks only top-level navigations will not block the videos.
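A minimal sketch of the difference, with resource types loosely modeled on the browser webRequest API (the categorizer stub and names are illustrative):

```python
BLOCKED_CATEGORIES = {"Adult"}

def categorize(url: str) -> str:
    # Stand-in for a real category lookup service.
    return "Adult" if "porn" in url else "General"

def should_block(url: str, resource_type: str, navigations_only: bool) -> bool:
    if navigations_only and resource_type not in {"main_frame", "sub_frame"}:
        return False  # sub-resources sail through unchecked
    return categorize(url) in BLOCKED_CATEGORIES

# An innocuous page embedding <video> tags that point at a blocked host:
video = ("https://porn.example/clip.mp4", "media")
print(should_block(*video, navigations_only=True))   # False -- bypassed
print(should_block(*video, navigations_only=False))  # True  -- caught
```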
However, the opposite problem can also occur. For example, an IT department recently tried to block a wide swath of Generative AI sites to ensure that company data was not shared with an unapproved vendor. Yet the company had also outsourced fulfillment of its benefits program to an approved 3rd-party vendor, and that Benefits website relied upon a Help Chat Bot powered by a company that had recently pivoted into generative AI. Employees visiting the Benefits website were now seeing block notifications from Web Content Filtering due to the .js file backing the Help Chat Bot. Employees were naturally confused: they were using the site that HR told them to use, and got block notifications suggesting that they shouldn’t be using AI. Oops.
Aside: New Sites
Generally, web content filtering is not considered a security feature, even if it potentially reduces an organization’s attack surface by shrinking the set of sites a user may visit. Of particular interest is a New Sites category: if an organization blocks all sites that are newer than, say, 30 days and have not yet been categorized into another category, it not only reduces the chance of a new site evading a block policy (e.g. a new pornographic site that hasn’t yet been classified by the vendor), it also provides a measure of protection against spear-phishing attacks.
Unfortunately, providing a robust implementation of a New Sites category isn’t as easy as it sounds: for a data vendor to classify a site, it first has to know that the site’s domain name exists, and depending upon the vendor’s data-collection practices, that discovery might take quite a while. That’s because of how the Internet is designed: there’s no “announcement” when a new domain goes online. Instead, a DNS server simply gets a new record binding the site’s hostname to its hosting IP address, and that record is returned only if a client asks for it.
Simply treating all unclassified sites as “new” has its own problems (what if the site is on a company’s intranet and the data vendor will never be able to access it?). Instead, vendors might learn about new sites by monitoring Certificate Transparency logs, crawling web content that links to the new domain, or by integrating with browsers (e.g. as a browser extension) to discover new sites as users navigate to them. After a domain is discovered, the vendor can attempt to load it and use its classification engine to determine the categories to which the site should belong.
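A minimal sketch of how a New Sites verdict might be derived from a first-seen timestamp, with illustrative storage and thresholds (a real vendor would presumably persist this data server-side):

```python
import time

FIRST_SEEN: dict[str, float] = {}   # domain -> unix time first observed
NEW_SITE_WINDOW = 30 * 24 * 3600    # e.g. 30 days, per the policy above

def classify(domain: str, vendor_category: str | None) -> str:
    if vendor_category is not None:
        return vendor_category      # the vendor has already rated it
    first = FIRST_SEEN.setdefault(domain, time.time())
    if time.time() - first < NEW_SITE_WINDOW:
        return "New Site"           # recently discovered, not yet rated
    # Old but still unrated -- e.g. an intranet host the vendor can never crawl.
    return "Uncategorized"

print(classify("brand-new-phish.example", None))  # New Site
```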