How Browser Sorting is like Chemistry Lab


BigBison
November 5th, 2004, 08:42
This PHP script sorts browsers into the following categories:
-- XML capable (with WAP check)
-- handheld device
-- bad spider or robot
-- DOM level 0 or DOM level 1 (and up) compliant desktop browser
-- Other

When I was a freshman chemistry student back in college, one of the major lab projects for the semester was developing a Qualitative Analysis Scheme (qual scheme). You have to determine what tests need to be run, and in what order, to analyze an unknown mixture and distill out any desired ingredients. The tests start off rough -- you want to get the mixture divided out into a half-dozen or so beakers, from which you can draw samples for further testing. Ultimately, you wind up with a few test tubes of desirable substance, and a big mix of what's left over.

Sound familiar? The Web resembles the vat of unknown goop from Chem Lab, to me. The easiest test to apply is checking to see if the browser accepts the "application/xhtml+xml" MIME type. If it does, it's either a nice modern browser or a WAP phone, so we deliver it the appropriate content and send it on its way, quickly. Now that future compatibility is taken care of, with XHTML 1.1 for desktop and XHTML Basic for phones, the problem becomes where to draw the line on backwards compatibility.
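A rough sketch of that first test, for anyone who wants to try it in isolation. The helper names accepts_xhtml() and xml_flavour() are made up for illustration and are not part of the posted script; the flavour names match what the script echoes.

```php
<?php
// Hypothetical helper: true when the Accept header lists application/xhtml+xml.
function accepts_xhtml($accept)
{
    return strpos($accept, 'application/xhtml+xml') !== false;
}

// Hypothetical helper: picks the XML flavour, or null when the client isn't XML capable.
// A 'vnd.wap' entry in the same header is treated as the sign of a WAP phone.
function xml_flavour($accept)
{
    if (!accepts_xhtml($accept)) {
        return null;
    }
    return strpos($accept, 'vnd.wap') !== false ? 'xbasic' : 'xhtml';
}

// A missing header is treated as an empty string.
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
$flavour = xml_flavour($accept);
```

Anything that comes back null falls through to the slower User-Agent sorting.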

I already made a decision to use CSS 2.1 to lay out the site instead of tables, and utilize PNG instead of GIF with CSS image rollovers instead of javascript. My HTML 4.01 version of the site is just the XHTML 1.1 version, minus the />'s, with a different DOCTYPE. I have made a decision to support antique browsers with an HTML 3.2 version with javascript GIF rollovers and CSS 1, probably because I started with that the other month before learning the new standards. The trick, of course, is getting the proper content to the requesting browser. Back to the script flow:

Now that XML-rendering (i.e. modern) browsers are out of the way, we can begin sorting the rest of the goo. The next class of browsers I want to distill out are the known handhelds. Not every handheld made has to be accounted for here, as a great deal of handhelds have already been classified as WAP-compliant in my "qual scheme", but there are some sure-fire telltales.

Interestingly, the strings I'm parsing the USER_AGENT header for don't show up in the UA strings of any known malware, so I altered the code just a touch from the initial flowchart I made in Visio (attached) to allow handhelds to bypass the "BadBot" check. Not a deliberate "express lane" for handhelds, but as long as this is valid it should make for some real-world efficiency.
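As a sketch, the handheld test boils down to one pattern match on the UA string. The function name is_known_handheld() is hypothetical; the list of telltales is the one used in the script.

```php
<?php
// Hypothetical wrapper around the handheld pattern from the script.
// Mozilla/0 through Mozilla/2 are vintage browsers that get sorted along with the handhelds.
function is_known_handheld($ua)
{
    return preg_match('#UP|Windows CE|PPC|Palm|PDA|EPOC|MMP|Mozilla/[0-2]#', $ua) === 1;
}
```

The match is case-sensitive on purpose, the same as in the script.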

Everything left in the goo now gets scanned on a "guilty until proven innocent" basis for the presence of Elbonian Spambots, like Missigua Locator or anything that claims to be "Internet Explore 5.x", or uses easily identifiable UA strings. Unfortunately, the latest trend with these nasties is UA strings of randomly generated gobbledygook; if anyone knows of a solution, don't keep it a secret. I'll set up a spider trap in the future, see link below. The list, unfortunately, doesn't account for the multitude of bots spoofing the IE6 UA, etc., so I'll keep an updated list of IPs blocked at the firewall.
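A minimal sketch of that scan, assuming the list holds fragments that may appear anywhere inside the UA string. The function name is_bad_bot() is hypothetical, and the four entries are borrowed from the full BadBots list further down the thread.

```php
<?php
// Hypothetical helper: true when any listed fragment occurs inside the UA string.
function is_bad_bot($ua, array $fragments)
{
    foreach ($fragments as $fragment) {
        if (strpos($ua, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

// A few entries from the BadBots list.
$badBots = array('Missigua', 'Explore 5', 'WebCopier', 'HTTrack');
```

This does nothing against random-gobbledygook UA strings or IE6 spoofers; those still need the spider trap or the firewall.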

Why should I allow some cottonpickin' son of a bitch bot access to my site, if it either accidentally or deliberately ignores (or downright abuses) the robots.txt file? So sorry, NEC and QWEST, but you have the money to hire someone who knows how to write a proper bot, so I won't pay for your bots to index the files on my "noindex" list, dig? I don't care if it costs me a tiny bit of latency and some processor cycles to block you out; I'm not charged per FLOP like I am per MB.

The logic I use to determine which markup to serve relies on two separate checks. I was wondering aloud the other week, what HTTP version is a Netscape or MSIE 4 client? It took a while for the newer doodads in HTTP 1.1 to catch on, and I had hoped there was a similar lag in compliance. Unfortunately, I forgot the viciousness with which Browser War I was fought. At the first whiff of RFC2616, certain browser vendors immediately reacted by upgrading their headers to identify the browser as an HTTP 1.1 client.

Which was pretty freaking stupid, if you ask me, because the HTTP protocol always specified that compatibility be handled on the server side. Instead, Microsoft and Netscape raced to see whose product could blatantly lie about compliance first, to show how forwards compatible those products were. Hmph.

So why not just check the ACCEPT header for the image/png MIME type? Unfortunately, that idea doesn't hold up under testing in the real world. What else can headers tell us to give us a rough idea about a browser's age? Well, what else was going on in Technology Politics at the time? Right! The sudden and unexpected assertion of patent rights over LZW compression, the heart of the GIF format (the patent was Unisys's, the announcement came by way of CompuServe), with license fees demanded from everything which used that codec. Including the LZW-based Unix "compress" format; gzip, which avoids LZW, was just coming into use on web servers to compress HTTP streams.

The open source community fought back and came up with "deflate". As near as I can tell, it's the same algorithm gzip uses anyway. In the real world, deflate is part of Apache 2, but not Apache 1.3. However, for many years savvy browser developers have been including "deflate" in their headers (and code). There's an exception or two I've found to this, but they both send "identity", which is why I've decided on an either-or check.
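The either-or check itself is tiny. A sketch, with the hypothetical name passes_encoding_check(); a hit is only a hint of a newer browser, not proof.

```php
<?php
// Hypothetical helper for the either-or check: "deflate" or "identity" in Accept-Encoding
// is taken as a hint of an HTML 4.01 / CSS 2.1 / PNG capable client.
function passes_encoding_check($acceptEncoding)
{
    return strpos($acceptEncoding, 'deflate') !== false
        || strpos($acceptEncoding, 'identity') !== false;
}
```

A client that sends only "gzip", or no Accept-Encoding header at all, fails the check and has to be rescued by one of the UA tests.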

There's a very high likelihood that the browsers I'm targeting accept "deflate", with the notable exception of Internet Explorer, which supports it in only certain versions of IE6, only Mac versions of IE5, and all versions of IE4. Since IE4 passes the deflate check without being a browser I want to serve HTML 4.01, the script's flag is set back to HTML 3.2 whenever IE4 is detected. WebTV version 2 is already gone as it's a Win CE device, but we have to add a special exception for WebTV 1, which is 540px wide, so I want it to get the X-Basic version just like WebTV2 does. There are two browsers I know of picked up by the "or identity"; one is Konqueror.

Which is why, after the "deflate check", the script moves on to determining a MSIE compatibility level, if present. Far fewer browsers sport this than sport a "Mozilla" in the UA string. Since the special WebTV check applies very rarely, it's placed where it is so as to hardly ever get checked for at all. The script only does the following Mozilla checks if "MSIE" was not detected in the UA string.

At this point, I want to separate the remaining browsers by their "Mozilla" compatibility level. Versions 2 and under have already been sorted; at this point I just want to know if we're dealing with 3 or 4, or 5 and up, so I alter the $IV toggle accordingly. I use $IV to represent the Roman numeral 4: if it's true the script delivers HTML 4.01, false gets 3.2, and null (never set) gets X-Basic. Modern Mozillas are already gone because they're XML. What if a browser doesn't have a Mozilla or an IE string?
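One thing to watch with a three-state toggle like $IV: PHP's loose comparison treats false and null as equal, so the mapping is safest with strict comparisons. A sketch, with the hypothetical name flavour_for_iv():

```php
<?php
// Hypothetical helper mapping the $IV toggle to a flavour name.
// Strict (===) comparisons keep false and null apart; with == they compare as equal.
function flavour_for_iv($iv)
{
    if ($iv === true) {
        return 'html401';
    }
    if ($iv === false) {
        return 'html32';
    }
    return 'xbasic'; // null: the toggle was never set
}
```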

Fine. Mostly bots are left at this point, with the cellphones and handhelds already sorted. There are two notable exceptions: Opera, which when not spoofing some other string just says Opera, and Dillo, an open-source browser which understands HTML 4.01 and CSS 2.1. Unfortunately, Dillo doesn't support compression, nor does it send "identity". Note that this is a fallback pickup for Opera, which accepts 'deflate', and applies only to non-XML versions. Opera 4 and up understand enough CSS and HTML 4 to avoid being shunted to HTML 3.2 with the other 4-and-under browsers, plus they have PNG support.


All Opera browsers (identifying themselves as such) are accounted for here because there are rare instances of noncompliant proxies stripping out the ACCEPT_ENCODING header entirely. Squid cache is a real-world example. All other deflating browsers use MSIE or Mozilla in their UA string, and are thus accounted for. I know of only one browser misclassified by this qual scheme, and it would be easy enough to fix if I decide to support Opera 3, which is only DOM level 0 although it will probably display my site properly as it does support PNG graphics.

I believe the middle portion of this script, from the handheld check through to the Opera/Dillo check, represented in the first line following, to be the functional equivalent of the next line below it, in JavaScript:

if ($IV === true)
if (document.getElementById && document.createElement)

In other words, the meat of this script is a server-side DOM 1 compliance check. If I were using DHTML, I'd rather detect this on the server side and deliver the script, even if the client has scripting turned off, because that may be a temporary condition. With client-side DOM detection, scripts are only sent to browsers with javascript turned on, so what happens if the first client requesting a page has javascript turned off? That version gets cached, not the DHTML version. This is a problem overlooked in most scripts I've seen which implement the above javascript.

If you need a finer degree of control over DOM support, account for it within the javascript as I've seen done many times, but if you do this rough accounting on the server those client-side scripts can be an awful lot slimmer. I've also seen plenty of scripts (plus products like BrowserHawk) send the javascript snippet above as part of a refresh, which then gets the proper version. Implementing this server-side saves a server transaction, if you think in terms of site optimization.

UA strings: http://www.zytrax.com/tech/web/browser_ids.htm
Spider Traps: http://www.ikt-ret.dk/projects/werd.shtml
List of Bad Bots: http://www.kloth.net/internet/badbots.php

BigBison
November 5th, 2004, 08:44
The more XML browsers people start using, the less load on your server if you do any browser or BadBot detection, and less latency for page delivery to XML browsers. So yes, XML is faster. But not "per se".

MalWare sends ACCEPT:*/*. So do text browsers in general, as well as WGET, cURL, and other tools. Under my scheme, MalWare gets a "204 No Content" response and tools get XHTML Basic. It doesn't matter; all my URLs end with /, not .html, so the links are identical across versions.

Unfortunately, although Safari will properly render XHTML 1.1 delivered as "application/xhtml+xml", it (lazily and stupidly I might add) also sends ACCEPT:*/*, so it will always get HTML 4.01 in my scheme. ATM, Safari is the only misclassified browser I know of, so I'm working on an exception for it. However, I refuse to add a UA parse to the initial check in the script because of Apple's (or anyone else's) sloppiness.

Make that *two* misclassified browsers. I've now been made aware of the fact that Safari is based on KHTML, which is the Konqueror or KDE rendering engine, which is XML-capable. I had previously thought Konqueror and Safari's classification as HTML 4.01 was good enough. I do hope other developers aren't so sloppy after going through all the trouble to develop XML-parsing browsers. Disgraceful not to specify "application/xhtml+xml". It's a half-assed approach. BTW, the keystonewebsites.com script will misclassify these two for the same reason.

There is one improvement that needs to be made. I'm making an awfully big deal about adhering to HTTP standards particularly in headers. So it's disgraceful that I've ignored q values! A client may very well accept "application/xhtml+xml" yet prefer "text/html" and it's wrong to ignore that and force-feed the xml, like my script does.

Not that I'm in any rush to implement this feature. Let's just put it on the "to do" list, shall we? I don't see any evidence of that preference in existing XML-browser UA strings so I'm not all that bothered by this.
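For the "to do" list, here is a hedged sketch of what honoring q values could look like. Both function names are made up, only exact media types are matched (wildcards like */* are ignored on purpose), and this is not what the posted script does.

```php
<?php
// Hypothetical helper: the q value a client gives one exact media type in an Accept header.
// Returns 0.0 when the type isn't listed; wildcards such as */* are deliberately not matched.
function accept_q($accept, $type)
{
    foreach (explode(',', $accept) as $range) {
        $parts = array_map('trim', explode(';', $range));
        if (strtolower($parts[0]) !== $type) {
            continue;
        }
        $q = 1.0; // q defaults to 1 when absent
        foreach (array_slice($parts, 1) as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        return $q;
    }
    return 0.0;
}

// Hypothetical helper: serve XML only when it is accepted and not ranked below text/html.
function prefers_xhtml($accept)
{
    $xhtml = accept_q($accept, 'application/xhtml+xml');
    return $xhtml > 0 && $xhtml >= accept_q($accept, 'text/html');
}
```

With this in place, a client sending "text/html,application/xhtml+xml;q=0.5" would get HTML 4.01 instead of being force-fed the XML.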

When utilizing any form of server-side content negotiation, and analyzing header strings certainly qualifies, you must be certain to send the HTTP "Vary" header, depending on which headers were analyzed to determine what to send the requesting client. So I've added a variable called "$Vary" to the code, it contains the actual string needed for the vary header. It doesn't do anything in this script, of course. It's up to whoever uses the script to figure out what to do with the variable. At this point, I have the script integrated into PmWiki (where it sets a variable, as opposed to 'echo' or 'include') and my solution is only valid within that framework.
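Outside PmWiki, one way to use the variable would be to send it directly. A minimal sketch, assuming nothing has been output yet; build_vary() is a hypothetical helper, and header() and headers_sent() are the stock PHP functions.

```php
<?php
// Hypothetical helper: builds the Vary value from the request headers the script inspected.
function build_vary(array $inspected)
{
    return implode(', ', array_unique($inspected));
}

$Vary = build_vary(array('Accept', 'User-Agent', 'Accept-Encoding'));

// Headers must go out before any page content; skip quietly if it's already too late.
if (!headers_sent()) {
    header('Vary: ' . $Vary);
}
```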

I don't unset() the $Vary variable like I do the others, because I'm passing it along.

Here's the full BadBots string:

'WebCopier','California','Boston','Cerberian','Port Huron','PlantyNet','Explore 5','Indy Library','Teleport','Program Shareware','Green Research','YComp','JoBo','Franklin','Missigua','Scooter','Finder','UrlDispatcher','BravoBrian','http generic','minibot','LWP','lwp','Webdup','GalaxyBot','Missouri','Lachesis','Second Street','Zeus','linko','Openbot','NetResearch','IPiumBot','metabot','URL_Spider','Java','URL Control','DigExt','Capture','Siphon','WebLight','Downloader','WebZIP','HTTrack','w3mir','Sucker','Snagger','Stripper','Extractor','Mewsoft','iaea','FAST'

BigBison
November 5th, 2004, 08:56
<?php
$XML = false;
$BadBot = false; //declared up front so the final check never reads an undefined variable
$Vary = 'Accept';
//a missing header is treated as an empty string
$Accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
//note1
//by reading echo as include perhaps you'll grok this script
if (strstr($Accept, 'application/xhtml+xml')) {
    $XML = true;
    if (strstr($Accept, 'vnd.wap')) {
        echo 'xbasic';
    } else {
        echo 'xhtml';
    }
} else { //ugly-ass hack for clients with ugly-ass headers
    $UA = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    if (strstr($UA, 'Safari') || strstr($UA, 'Konqueror')) {
        $XML = true;
        //PmWiki skin variable; outside PmWiki read this line as echo 'xhtml'
        $PageTemplateFmt = 'pub/skins/bison/xhtml.html';
        $Vary = 'Accept, User-Agent';
        //malware is always HTTP_ACCEPT:*/* without fail... shhh...
    } else { //encapsulates remainder - comment out to test
        //note2
        //note3
        $IV = null; //declaring the variable (optional)
        $Vary = 'Accept, User-Agent';
        //note7
        if (!preg_match('(UP|Windows CE|PPC|Palm|PDA|EPOC|MMP|Mozilla/[0-2])', $UA)) {
            $BB = array('Boston','California','Missouri'); //etc...etc...etc...
            foreach ($BB as $Bot) {
                //substring match: in_array() would only catch a UA equal to the whole string
                if (strpos($UA, $Bot) !== false) {
                    $BadBot = true;
                    break;
                }
            }
            unset($BB, $Bot); //we're done with these
            if (!$BadBot) {
                //note4
                $Vary = 'Accept, User-Agent, Accept-Encoding';
                $AE = isset($_SERVER['HTTP_ACCEPT_ENCODING']) ? $_SERVER['HTTP_ACCEPT_ENCODING'] : '';
                if (strstr($AE, 'deflate') || strstr($AE, 'identity')) {
                    $IV = true; //roman numeral 4.01, get it?
                }
                unset($AE);
                if (stristr($UA, 'MSIE')) {
                    $IV = true;
                    //note5
                    if (preg_match('|MSIE [0-4]|', $UA)) {
                        $IV = false;
                        //note6
                        if (strstr($UA, 'WebTV')) {
                            $IV = null;
                        }
                    }
                } else {
                    if (strstr($UA, 'Mozilla')) {
                        $IV = true;
                        if (preg_match('|^Mozilla/[3-4]|', $UA)) {
                            $IV = false;
                        }
                    } else { //there's always an exception isn't there?
                        if (strstr($UA, 'Opera') || strstr($UA, 'Dillo')) {
                            $IV = true;
                        }
                    }
                }
            }
        } //closed back to handheld check
        unset($UA);
        if (!$BadBot) {
            //strict comparisons: a switch compares loosely, so false would also match the null case
            if ($IV === true) {
                echo 'html401';
            } elseif ($IV === false) {
                echo 'html32';
            } else {
                echo 'xbasic';
            }
            unset($IV);
        } else {
            unset($BadBot);
            header("HTTP/1.1 204 No Content");
        }
    }
} //closes capsule
//
//note1: deliver xml to capable desktop and mobile browsers.
//this is the express lane baby -- we're checking to see if
//the browser accepts the "application/xhtml+xml" MIME type
//and if so, whether it's WAP compliant, in which case it's a
//handheld. either way this accounts for the most capable
//browsers
//
//note2: clients that remain at this point need to clarify a
//few things before we proceed. let's start by reading the UA
//of the browser requesting this page and assigning it to a
//variable
//
//note3: exclude Mozilla handhelds as well as vintage
//Mozillas. WAP, WebTV and Blazer are deliberately left off
//the list. this is a minimum list of surefire handheld
//telltales, and we want to give single-column to any vintage
//Mozillas. some early handheld UA strings claim Mozilla/2 as
//well
//
//note4: with those out of the way we can continue by
//checking for some telltale signs of HTTP1.1 clients. we
//can't look for HTTP1.1 user agents directly because the
//end browser may be HTTP1.0 via an HTTP1.1 proxy or cache.
//however, compliant intermediaries may not change the content
//encoding. gzip has been around too long to be of any use,
//and unfortunately checking the accept header for png
//support ultimately proved worthless - I tried :) so I'm
//making an assumption that the presence of deflate or
//identity means HTML4.01, CSS2.1, and PNG compliant
//
//note5: should we demote this browser to HTML 3.2 / CSS 1?
//(despite the fact that IE4 passes the deflate check)
//(which is why there's no 'else' after the deflate check)
//YPC users are SOL if they're using obsolete Explorer. that
//infinitesimally small chance isn't enough to make me use
//the case-insensitive flag: it costs processor cycles, and
//using a regex is slower than strstr to begin with
//
//note6: since far fewer browsers report MSIE compatibility
//than begin with "Mozilla", we check for WebTV on this tiny
//subset of UA strings for efficiency. new WebTV2 is MSIE6 on
//Windows CE, already sorted. old WebTV would be sorted as
//HTML 3.2 if we didn't nullify the MSIE-compat variable, and
//sites wider than 520 px would horizontal-scroll. I'm doing
//this to avoid futzing about with the css2 tv media type, as
//it's irrelevant with WebTV2 sporting 800x600 imho
//
//note7: "you've been a Bad Bot" list -- to remove a bot it
//must behave itself and mind my robots.txt directives -- in
//the meantime, enjoy receiving 204 or somesuch responses.
//some bots on this list read robots.txt and head right for
//the excluded content -- ain't that sweet -- I'd rather
//take a little latency and some processor cycles than pay
//the bandwidth charges incurred by these jackasses who
//can't even include compression in these bots. say thankya
?>


Yuck! I think the syntax highlighting needs work...

the_pm
November 5th, 2004, 15:21
As I've stated before privately and in other forums, this is a superb piece of work, Eric! Thank you very much for posting it here with all of your explanations. I'd be interested in turning it into an article if you're amenable to that. Full credit and links are a given.

Any objections, or would you like to make changes to anything?

BigBison
November 5th, 2004, 20:08
Any objections, or would you like to make changes to anything?

Feel free.

the_pm
November 11th, 2004, 20:19
Ok, I'm in the process of testing this script on the current IWDN site (and I'm making some CSS modifications as well to use positioning instead of float for the menus, to avoid bugginess in handheld devices). I tried figuring out what to serve in the place of xbasic, but I couldn't find the appropriate header. Also, do you have a live version of this script so I can look at your headers when I visit it, just to make sure I'm returning the right thing?

BigBison
November 11th, 2004, 21:03
No, not live anywhere currently. I don't understand what you're asking about "appropriate header".

the_pm
November 11th, 2004, 21:07
The current script returns a value based on what characteristics it identifies in the browser calling it. I'm replacing those values with the appropriate headers.

Here's my very rough mockup: http://iwdn.net/index2.php

BigBison
November 12th, 2004, 03:52
Uh, all I see is a web page...
If you want to post php sources, put them in their own directory and set up a .htaccess file to serve all .php in that directory as text/plain. If that's what you were after?

The script wasn't really designed to "return" anything. It's procedural, it was intended to include as opposed to return. If you wait a few days, it will be rewritten as a function. I'm still not sure what you meant by "serve in the place of xbasic".

In my script, xbasic is used twice and shouldn't be, really. I'm serving xbasic in both places, doesn't mean you have to. The initial occurrence is used to define XML handhelds, the other occurrence really represents a "default" flavor for your website.

One change that occurs to me is to use "406 Not Acceptable" as the error, instead of "204 No Content". 200-range status codes indicate success. 400-range status codes indicate client errors, which is more appropriate here.

the_pm
November 12th, 2004, 04:02
The reason you simply see a Web page is that I placed this script at the beginning of the page in the place of a DOCTYPE. When you load that page from a good browser, you see the XHTML 1.1 DOCTYPE. When you load it in IE, you see an HTML 4.01 Strict DOCTYPE. I was testing it by placing it directly into the page. I'm assuming you had something in mind where the script would run once and leave its mark on all pages via a session, or something to that effect. Am I right? If so, I just haven't gotten to that point - to the point of asking you the best way to do this. But it sounds like you were alluding to it in that last post.

Don't mind me - like I said in that PM, I'm on sensory overload right now. Not much is making sense to me.

BigBison
November 12th, 2004, 04:20
Read this tomorrow, Paul!

The script executes every time the page is called, which is why (in another thread on another board) we were talking about setting a cookie to override it, and get the other skin.
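A sketch of how such a cookie override might look. The cookie name "skin" and the function choose_flavour() are assumptions for illustration; the flavour names are the ones the script echoes. Only known values are accepted, so a junk cookie can't pick a skin that doesn't exist.

```php
<?php
// Hypothetical helper: a cookie may override the detected flavour, but only with a known value.
// The cookie name "skin" is an assumption, not something the script defines.
function choose_flavour($detected, array $cookies)
{
    $known = array('xhtml', 'xbasic', 'html401', 'html32');
    if (isset($cookies['skin']) && in_array($cookies['skin'], $known, true)) {
        return $cookies['skin'];
    }
    return $detected;
}

// Usage: the detection script's result goes in as the first argument.
$flavour = choose_flavour('html401', $_COOKIE);
```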

In my scheme, I used different skins to present the DOCTYPES, so I was using my script to set the $PageSkin variable in PmWiki. I (think I) understand what you're doing, but I don't know phpBB.

XHTML Basic and XHTML 1.1 are the same regarding what MIME header to send. Bear in mind the standard says "should not" be served as text/html and "should" be served as "application/xhtml+xml", not "must not" and "must". In my system, the default is XHTML Basic served as text/html, because it's primarily spiders that see it and they won't care about that little technicality.
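Put as code, a sketch under the assumptions above: XML flavours go out as "application/xhtml+xml" only to clients that passed the Accept check, and everything else, including the default XHTML Basic, goes out as text/html. The function name content_type_for() is hypothetical.

```php
<?php
// Hypothetical helper: the Content-Type for each flavour.
// $xmlCapable is true only for clients that passed the application/xhtml+xml check.
function content_type_for($flavour, $xmlCapable)
{
    $isXhtml = ($flavour === 'xhtml' || $flavour === 'xbasic');
    if ($isXhtml && $xmlCapable) {
        return 'application/xhtml+xml';
    }
    return 'text/html';
}
```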