BigBison
November 5th, 2004, 08:42
This PHP script sorts browsers into the following categories:
-- XML capable (with WAP check)
-- handheld device
-- bad spider or robot
-- DOM level 0 or level ≥ 1 compliant desktop browser
-- Other
When I was a freshman chemistry student back in college, one of the major lab projects for the semester was developing a Qualitative Analysis Scheme (qual scheme). You have to determine what tests need to be run, and in what order, to analyze an unknown mixture and distill out any desired ingredients. The tests start off rough -- you want to get the mixture divided out into a half-dozen or so beakers, from which you can draw samples for further testing. Ultimately, you wind up with a few test tubes of desirable substance, and a big mix of what's left over.
Sound familiar? The Web resembles the vat of unknown goop from Chem Lab, to me. The easiest test to apply is checking to see if the browser accepts the "application/xhtml+xml" MIME type. If it does, it's either a nice modern browser or a WAP phone, so we deliver it the appropriate content and send it on its way, quickly. Now that future compatibility is taken care of, with XHTML 1.1 for desktop and XHTML Basic for phones, the problem becomes where to draw the line on backwards compatibility.
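For anyone following along, here is a minimal sketch of that first test, written for current PHP (7.1 or later) rather than lifted from the script itself. The function name, the return labels, and the WAP MIME types used to tell phones from desktops are all assumptions for illustration; the post does not say how the WAP check is actually done.

```php
<?php
// Hypothetical helper: classify a client by its Accept header.
// Returns 'xhtml-basic' for WAP-style clients, 'xhtml11' for other
// XML-capable browsers, or null when XHTML is not advertised at all.
// Note: this is a plain substring test and ignores q-values.
function classify_xml_client(string $accept): ?string
{
    $accept = strtolower($accept);
    if (strpos($accept, 'application/xhtml+xml') === false) {
        return null; // falls through to the rest of the qual scheme
    }
    // Assumption: WAP clients also advertise a WAP-specific MIME type.
    if (strpos($accept, 'application/vnd.wap.xhtml+xml') !== false
        || strpos($accept, 'text/vnd.wap.wml') !== false) {
        return 'xhtml-basic';
    }
    return 'xhtml11';
}
```

In a live request this would be fed `$_SERVER['HTTP_ACCEPT'] ?? ''`, so that a missing header is treated the same as one that does not mention XHTML.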
I already made a decision to use CSS 2.1 to lay out the site instead of tables, and utilize PNG instead of GIF with CSS image rollovers instead of javascript. My HTML 4.01 version of the site is just the XHTML 1.1 version, minus the />'s, with a different DOCTYPE. I have made a decision to support antique browsers with an HTML 3.2 version with javascript GIF rollovers and CSS 1, probably because I started with that the other month before learning the new standards. The trick, of course, is getting the proper content to the requesting browser. Back to the script flow:
Now that XML-rendering (i.e. modern) browsers are out of the way, we can begin sorting the rest of the goo. The next class of browsers I want to distill out are the known handhelds. Not every handheld made has to be accounted for here, as a great deal of handhelds have already been classified as WAP-compliant in my "qual scheme", but there are some sure-fire telltales.
Interestingly, the strings I'm parsing the USER_AGENT header for don't show up in the UA strings of any known malware, so I altered the code just a touch from the initial flowchart I made in Visio (attached) to allow handhelds to bypass the "BadBot" check. Not a deliberate "express lane" for handhelds, but as long as this is valid it should make for some real-world efficiency.
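The handheld test described above could look roughly like the following. This is a sketch only: the token list is a placeholder (the post names Win CE elsewhere, the other entries are guesses), and `is_handheld` is a made-up name.

```php
<?php
// Placeholder tokens; the post does not list the strings it actually matches.
const HANDHELD_TOKENS = ['windows ce', 'palmos', 'blackberry', 'symbian'];

// True when the User-Agent string contains any known handheld token.
function is_handheld(string $userAgent): bool
{
    $ua = strtolower($userAgent);
    foreach (HANDHELD_TOKENS as $token) {
        if (strpos($ua, $token) !== false) {
            return true;
        }
    }
    return false;
}
```

A caller would run this before the bad-bot scan and skip that scan entirely when it returns true, which is the bypass described above.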
Everything left in the goo now gets scanned on a "guilty until proven innocent" basis for the presence of Elbonian Spambots, like Missigua Locator or anything that claims to be "Internet Explore 5.x", or uses easily identifiable UA strings. Unfortunately, the latest trend on these nasties is UA strings of randomly generated gobbledygook; if anyone knows of a solution, don't keep it a secret. I'll set up a spider trap in the future, see link below. The list, unfortunately, doesn't account for the multitude of bots spoofing the IE6 UA, etc., so I'll keep an updated list of IPs blocked at the firewall.
Why should I allow some cottonpickin' son of a bitch bot access to my site, if it either accidentally or deliberately ignores (or downright abuses) the robots.txt file? So sorry, NEC and QWEST, but you have the money to hire someone who knows how to write a proper bot, so I won't pay for your bots to index the files on my "noindex" list, dig? I don't care if it costs me a tiny bit of latency and some processor cycles to block you out; I'm not charged per FLOP like I am per MB.
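A bare-bones version of the bad-bot scan might look like this. The two patterns come straight from the examples named above; everything else (the constant, the function name, the use of regular expressions) is an assumption, and a real list would be much longer.

```php
<?php
// Illustrative signatures only: "Missigua" and the misspelled
// "Internet Explore" are the examples given in the post.
const BAD_BOT_PATTERNS = ['/missigua/i', '/internet explore\b/i'];

// True when the User-Agent string matches any known bad-bot signature.
function is_bad_bot(string $userAgent): bool
{
    foreach (BAD_BOT_PATTERNS as $pattern) {
        if (preg_match($pattern, $userAgent) === 1) {
            return true;
        }
    }
    return false;
}
```

The `\b` word boundary keeps the second pattern from firing on a correctly spelled "Internet Explorer". None of this helps against random-string or IE6-spoofing bots, which is why the firewall list is still needed.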
The logic I use to determine which markup to serve relies on two separate checks. I was wondering aloud the other week, what HTTP version is a Netscape or MSIE 4 client? It took a while for the newer doodads in HTTP 1.1 to catch on, and I had hoped there was a similar lag in compliance. Unfortunately, I forgot the viciousness with which Browser War I was fought. At the first whiff of RFC2616, certain browser vendors immediately reacted by upgrading their headers to identify the browser as an HTTP 1.1 client.
Which was pretty freaking stupid, if you ask me, because the HTTP protocol always specified that compatibility be handled on the server side. Instead, Microsoft and Netscape raced to see whose product could blatantly lie about compliance first, to show how forwards compatible those products were. Hmph.
So why not just check the ACCEPT header for the image/png MIME type? Unfortunately, that idea doesn't hold up under testing in the real world. What else can headers tell us to give us a rough idea about a browser's age? Well, what else was going on in Technology Politics at the time? Right! Unisys's sudden and unexpected assertion of patent rights over LZW compression, the heart of CompuServe's GIF format, and its demand for license fees from everything which used that codec. Including: lib.gzip, which was just coming into use on web servers to compress HTTP streams.
The open source community fought back and came up with "deflate". As near as I can tell, it's the same algorithm anyway. In the real world, deflate is part of Apache 2, but not Apache 1.3. However, for many years savvy browser developers have been including "deflate" in their headers (and code). There's an exception or two I've found to this, but they both send "identity" which is why I've decided on an either-or check.
There's a very high likelihood that the browsers I'm targeting accept "deflate", with the notable exception of Internet Explorer, which supports it in only certain versions of IE6, only Mac versions of IE5, and no versions of IE4. So, if IE4 is detected, the script's flag is set for HTML 3.2. WebTV version 2 is already gone as it's a Win CE device, but we have to add a special exception for WebTV 1, which is 540px wide, so that it gets the X-Basic version just like WebTV 2 does. There are two browsers I know of picked up by the "or identity"; one is Konqueror.
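The either-or check on the encoding header can be sketched as follows. This is an illustration under assumptions, not the script's own code: the function name is invented, and q-values are simply discarded rather than honored.

```php
<?php
// True when the Accept-Encoding header lists "deflate" or "identity".
function accepts_deflate_or_identity(string $acceptEncoding): bool
{
    // Split "gzip, deflate;q=0.8" into tokens and drop any ";q=" parameter.
    foreach (explode(',', strtolower($acceptEncoding)) as $part) {
        $token = trim(explode(';', $part)[0]);
        if ($token === 'deflate' || $token === 'identity') {
            return true;
        }
    }
    return false;
}
```

Matching whole tokens instead of doing a substring search avoids false hits on any encoding name that merely contains one of the two words. The input would normally be `$_SERVER['HTTP_ACCEPT_ENCODING'] ?? ''`.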
Which is why, after the "deflate check", the script moves on to determining a MSIE compatibility level, if present. Far fewer browsers sport this than sport a "Mozilla" in the UA string. Since the special WebTV check applies very rarely, it's placed where it is so as to hardly ever get checked for at all. The script only does the following Mozilla checks if "MSIE" was not detected in the UA string.
At this point, I want to separate the remaining browsers by their "Mozilla" compatibility level. Versions 2 and under have already been sorted; at this point I just want to know if we're dealing with 3 or 4, or 5 and up, so I alter the $IV toggle accordingly. I use $IV to represent the Roman numeral 4: if it's positive the script delivers HTML 4.01, negative gets 3.2, nonexistent gets X-Basic. Modern Mozillas are already gone because they're XML. What if a browser doesn't have a Mozilla or an IE string?
Fine. Mostly bots are left at this point, with the cellphones and handhelds already sorted. There are two notable exceptions: Opera, which when not spoofing some other string just says Opera, and Dillo, an open-source browser which understands HTML 4.01 and CSS 2.1. Unfortunately, Dillo doesn't support compression, nor does it send "identity". Note that this is a fallback pickup for Opera, which accepts 'deflate', and applies only to non-XML versions. Opera 4 and up understand enough CSS and HTML 4 to avoid being shunted to HTML 3.2 with the other 4-and-under browsers, plus they have PNG support.
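One way to picture the Mozilla-level step is the sketch below. It assumes the flag is stored as 1, -1 or null (the post only says positive, negative or nonexistent), that the version is read from a leading "Mozilla/N" token, and that versions 2 and under never reach this point; the function name is invented.

```php
<?php
// Returns 1 (serve HTML 4.01), -1 (serve HTML 3.2), or null when there is
// no usable Mozilla token and the caller should leave its flag untouched.
function mozilla_level_flag(string $userAgent): ?int
{
    if (stripos($userAgent, 'MSIE') !== false) {
        return null; // MSIE clients are handled by the earlier MSIE check
    }
    if (preg_match('#^Mozilla/(\d+)#', $userAgent, $m) !== 1) {
        return null; // no Mozilla token: bots, Opera, Dillo and friends
    }
    // Level 5 and up gets HTML 4.01; levels 3 and 4 get HTML 3.2.
    return ((int) $m[1] >= 5) ? 1 : -1;
}
```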
All Opera browsers (identifying themselves as such) are accounted for here because there are rare instances of noncompliant proxies stripping out the ACCEPT_ENCODING header entirely. Squid cache is a real-world example. All other deflating browsers use MSIE or Mozilla in their UA string, and are thus accounted for. I know of only one browser misclassified by this qual scheme, and it would be easy enough to fix if I decide to support Opera 3, which is only DOM level 0 although it will probably display my site properly as it does support PNG graphics.
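The Opera/Dillo pickup could be as small as the sketch below. It assumes both browsers announce themselves at the very start of the UA string ("Opera/..." and "Dillo/...") and reuses the 1/null convention as a stand-in for the $IV flag; the function name is hypothetical. Like the scheme described here, it makes no attempt to single out Opera 3.

```php
<?php
// Returns 1 (serve HTML 4.01) for self-identified Opera or Dillo clients,
// or null for anything else, which at this stage is mostly bots.
function opera_dillo_flag(string $userAgent): ?int
{
    foreach (['Opera', 'Dillo'] as $name) {
        // Matching on the UA string alone means the result does not depend
        // on Accept-Encoding, which a proxy may have stripped out.
        if (strpos($userAgent, $name . '/') === 0) {
            return 1;
        }
    }
    return null;
}
```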
I believe the middle portion of this script, from the handheld check through to the Opera/Dillo check, represented in the first line following, to be the functional equivalent of the next line below it, in JavaScript:
if ($IV == true)
if (document.getElementById && document.createElement)
In other words, the meat of this script is a server-side DOM 1 compliance check. If I were using DHTML, I'd rather detect this on the server side and deliver the script, even if the client has scripting turned off, because that may be a temporary condition. With client-side DOM detection, scripts are only sent to browsers with javascript turned on, so what happens if the first client requesting a page has javascript turned off? That version gets cached, not the DHTML version. This is a problem overlooked in most scripts I've seen which implement the above javascript.
If you need a finer degree of control over DOM support, account for it within the javascript as I've seen done many times, but if you do this rough accounting on the server those client-side scripts can be an awful lot slimmer. I've also seen plenty of scripts (plus products like BrowserHawk) send the javascript snippet above as part of a refresh, which then gets the proper version. Implementing this server-side saves a server transaction, if you think in terms of site optimization.
UA strings: http://www.zytrax.com/tech/web/browser_ids.htm
Spider Traps: http://www.ikt-ret.dk/projects/werd.shtml
List of Bad Bots: http://www.kloth.net/internet/badbots.php