molly.com

Thursday 8 September 2005

Searching for Standards

I did a small comparative analysis of markup practices at several major search engines. It’s interesting to note that only one engine is using valid markup and CSS layouts, and that would be MSN. Close behind is AOL, whose validation problems are mostly related to ampersands not being escaped, and HotBot, who have a few easily corrected errors.

Engine | Markup Language | Table Layouts or CSS? | Markup Validation
--- | --- | --- | ---
Alta Vista | Presentational HTML, no DOCTYPE | Tables | Does Not Validate
AOL (beta) | XHTML 1.0 Transitional | CSS | Does Not Validate (mostly due to ampersands not being escaped)
Excite | Presentational HTML, HTML 4.01 DOCTYPE | Tables | Does Not Validate
Google | HTML, no DOCTYPE | Tables | Does Not Validate
HotBot | XHTML 1.0 Strict | CSS | Does Not Validate (but only a few conformance errors)
Lycos | Presentational HTML, no DOCTYPE | Tables | Does Not Validate
MSN | XHTML 1.0 Strict | CSS | Validates
Yahoo! | HTML 4.01 Transitional with presentational and proprietary elements and attributes in use, partial DOCTYPE | CSS | Does Not Validate

With the exception of Yahoo!, which I know has progressive developers examining markup issues, it’s curious to think that many search engines and portals, which tend to be highly trafficked, haven’t been exposed to the benefits of Web standards.

Filed under:   general
Posted by:   Molly | 03:35 | Comments (72)

Comments (72)

  1. Maybe someone needs to remake Google using CSS and work out the bandwidth savings they’d make due to the (presumably) smaller filesize.

    It looks like that approach (eventually) convinced Slashdot to change so you never know! ;P

  2. Makes you want to go and bang your head against a brick wall. When the biggest names on the Internet can’t be bothered what hope is there for the rest of us!

  3. I’m sure they will realize that soon, and they will regret every minute that passed without complying with the standards.

  4. One of the reasons not to use Google… stuck in the 90’s. They gzip compress their content but serve it based on user agent sniffing, so Opera receives the content uncompressed.

    You forgot at least one search engine, the Yahoo Search-based AlltheWeb. The content is based on tables (partially) and huge amounts of nested divs and spans, but at least it is a bit more accessible. They claim to be Opera compliant too 🙂

  5. I also briefly touched on the state of the code generated by MSN search and Google in my Why accessibility? post.

  6. Google particularly seem to enjoy producing as garbled markup as possible. It’s not just their search results, it is all their content. Google news looks like it is marked up with a random tag generator. All the geniuses they hire and they can’t produce valid code? Before they start invading our desktops they should look to improving their bread and butter products.

  7. Google’s search results page has been retooled using semantic markup and CSS by at least two dozen people as it is, but they just don’t care. It’s annoying, it’s sad, and it’s also pointless.

    There have been some retooling jobs that saved an awful lot of markup and would thus, as a result, save Google ridiculous amounts of bandwidth, but did they show any interest? Nope.

  8. James: I did this a couple years ago –

    http://9rules.com/projects/css_google/

    Wasn’t that hard at all.

  9. Pingback: In Other News

  10. Pingback: Alex | weblog

  11. With AOL Search (which I don’t work on anymore, so can’t really vouch for where it is now), we got close, and decided that we got the benefits of standards mode, and standards-based design by being “close enough” to valid. Escaping ampersands doesn’t help the user any and adds to the weight of the page. It’s nice to be valid, but it’s better to be close enough and faster.
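For a rough sense of the weight being discussed here, a minimal Python sketch follows. The URL and its parameter names are made up for illustration, and `html.escape` from the standard library stands in for whatever escaping a real template system would do. It shows that each escaped ampersand costs four extra bytes.

```python
import html

# Hypothetical search-results link; the parameter names are illustrative only.
raw_href = "/search?q=web+standards&start=10&hl=en"

# Escape the ampersands so the attribute value is valid (X)HTML.
escaped_href = html.escape(raw_href, quote=True)

# Each "&" becomes "&amp;", which adds four bytes per ampersand.
extra_bytes = len(escaped_href.encode("utf-8")) - len(raw_href.encode("utf-8"))

print(escaped_href)
print("extra bytes:", extra_bytes)
```

On a results page with many such links the overhead adds up, though whether it matters depends on how the page compresses.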

  12. Yahoo is very much a mixed bag, with the newer properties they’ve developed using ‘modern’ design (DIV soup sometimes).

    But one thing is clear: they don’t do much testing in Opera, and their browser-sniffing doesn’t detect Opera properly. You can tell the latter by switching ID options between ‘msie’ and ‘opera’ on a site like my.yahoo.com, then noticing that some content is missing when ID-ing as Opera! If you know Yahoo developers actually interested in working together with Opera to fix these issues…

  13. I just searched “bottom dwellers” in both Google and MSN. Here are some stats:

    Google’s results returned 15422 characters (15,435 bytes)

    MSN returned 13219 characters (12827 bytes) came out to 16k.

    MSN has lots of unnecessary spans and nested divs and some more complex form elements. They beat Google in size, but they could certainly stand to improve quite a bit.

  14. Pingback: UltraNormal

  15. If memory serves, Google’s video-search pages are more compliant, as are all of the more recent offerings.

    FindForward.com is a compliant interface for Google results.

  16. In a similar vein, it makes me crazy that Google AdSense ads are not valid HTML. I work really hard to get my page done up just right and ensure that it is valid, and all that gets bonked because Google Ads use crappy code. I’m surprised THIS one hasn’t been talked about more often. Thanks Molly!

  17. I’ve always been stunned by how a site as big and famous as Google doesn’t have a DOCTYPE – when I saw that for the first time, a while back, I nearly fell off my chair!

    And how about MSN Search eh? (Now there’s a turn-up for the books…Microsoft/MSN…Standards….*pinches self*)

  18. Pingback: Tom Raftery’s I.T. views » Blog Archive » Standards compliant Search Engines

  19. Pingback: tag-strategia.com » Standards and Search Engines

  20. Google engineers are going to be some of the most opinionated engineers you’ll meet. A doctype, they’ll tell you, is over a hundred extra characters of weight per page. And as their entire company is premised on extracting ridiculously useful information from bad markup, it’s going to be a hard sell to tell them that valid code is going to get them any farther than they already are. These people are not configured to care about standards.
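The "over a hundred extra characters" figure is easy to check. As a small sketch, assuming the standard XHTML 1.0 Strict declaration is the one in question (the comment does not say which DOCTYPE is meant), this Python snippet measures its length:

```python
# The standard XHTML 1.0 Strict DOCTYPE declaration, as published by the W3C.
# Assumption: this is the declaration being weighed; the comment does not say.
XHTML_STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)

# The declaration is pure ASCII, so characters and bytes are the same count.
doctype_bytes = len(XHTML_STRICT_DOCTYPE.encode("ascii"))

print("DOCTYPE weight in bytes:", doctype_bytes)
```

This counts the uncompressed declaration only; the cost after gzip compression would be smaller.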

  21. Validation is based on the DOCTYPE declared for the page being tested. Since Google’s site does not declare a DOCTYPE, it can’t possibly fail any validation test.

    I’m not saying I agree with Google’s lack of a DOCTYPE, or their inability to embrace standards-based web development. But, until you declare a DOCTYPE, how can you be judged for invalid markup?

  22. I’ve heard through the grapevine that for Google to go with a CSS compliant site, it would actually generate larger file sizes than what they are currently serving up. Seems counterintuitive, but they seem like a pretty bright company and have probably already considered the pros and cons of a web standards site before.

  23. Pingback: PhilLedgerwood.com

  24. A quick check this morning reveals that MSN doesn’t validate (validator.w3.org).

    It is worth pointing out that on many portal sites it is hard to adhere completely to web standards or to ensure 100% validated code due to outside influences such as third-party advert code.

  25. If the major search engines, and the top companies in the world (I checked this out myself) all fail standards tests, I have to ask the question – what is the benefit of web standards?

    Don’t get me wrong, I build all my sites to standard specs, just because I think it’s the best way to code. But I’d really love if someone could show me tangible business benefits, so next time I’m working with a client who has an existing site, I can tell them why re-coding is worth the expense.

  26. Ed Byrne Said:

    Don’t get me wrong, I build all my sites to standard specs, just because I think it’s the best way to code. But I’d really love if someone could show me tangible business benefits, so next time I’m working with a client who has an existing site, I can tell them why re-coding is worth the expense.

    Yesterday, I had to update a prospective client’s website, just the content … built 2-3 years ago by someone using an office generating web application. There were 5 web documents, about 15 printable pages of content. The content was seasoned dramatically with tag soup. Spans, fonts, empty paragraphs, inline styles for sections of sentences, inside font tags, etc. Changes, maybe 3-4 or so per document, though it probably took about 3 hours or more to weed through the mess [using search, find, replace] to make those edits that should have taken about 10-15 minutes for the whole process. I fixed a few areas and added a couple of styles to take care of about 60 lines of markup for just 6 onsite document links. Yes, the 6 link navigation area consumed over 75 lines or more of markup alone. Business benefit number 1. And it is a big one, labor hours.

    If a document is structured, well-formed and follows standards, it is often easier to edit, later. No matter who is doing the edit. Business benefit number 2. Hire a new web team member and they do not have to clean up, or figure out what is going on.

    Business benefit number 3, if the document, in this case, followed better practices, pages would get delivered faster. I did not bother to see how it delivered in a small or alternative device. [If it would.]

    Those are just 3 benefits, that I can personally vouch for.

  27. Pingback: zengun » your search for “valid search engine” yeilded one result

  28. I did a similar survey on the front door of the major CMS vendors and was stunned by what I discovered. Not a one was compliant. Ouch!

    I contacted each vendor via email and shared my passion for web standards.

    What really shocked me, was when I checked the open source CMS sites. At the end of that day, I realized I needed to remember where I was 5 years ago on accessibility. Learning, teaching, evangelizing. And while I expected Web Standards to already be more pervasive, it certainly gives me something to strive for.

    So, my momentary “trough of disillusionment” has turned into my “to do” list. Onward web standards!

  29. I’ve known about this for a long time. It’s sad when Google is beaten by Micro$oft in conforming to web standards…

  30. The way Google is built is very purposeful. Inspect the compressed data: after gzipping, it fits in one packet. They could have crunched harder if they wanted; they made purposeful decisions to make this happen. No doubt they should attempt this in CSS and aim for the same result, but don’t assume people are uncaring or idiots until you ask them why they do something.
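The one-packet claim can be tested for any page. Here is a minimal Python sketch under stated assumptions: the 1460-byte limit is a typical TCP payload on Ethernet (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers), the function name and sample markup are hypothetical, and HTTP response headers, which share the packet, are ignored.

```python
import gzip

# Assumption: typical TCP payload on Ethernet (1500-byte MTU minus 40 header bytes).
ASSUMED_TCP_PAYLOAD_BYTES = 1460


def fits_in_one_packet(markup: str, payload_limit: int = ASSUMED_TCP_PAYLOAD_BYTES) -> bool:
    """Return True if the gzip-compressed markup is no larger than payload_limit."""
    compressed = gzip.compress(markup.encode("utf-8"))
    return len(compressed) <= payload_limit


# Hypothetical minimal search page, written in the unquoted-attribute style
# described elsewhere in this thread; not Google's actual markup.
sample_page = (
    "<html><head><title>Search</title></head><body>"
    "<form action=/search><input name=q>"
    "<input type=submit value=Search></form></body></html>"
)

print(fits_in_one_packet(sample_page))
```

To check a real page, save its source to a file, read it into a string, and pass it to `fits_in_one_packet`; a stricter check would subtract the size of the response headers from the limit.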

  31. Pingback: Twan van Elk » Artikelen » Zoeken naar standaards

  32. I don’t know what’s more surprising: the fact that these high traffic sites would benefit the most from web standards and yet seem to be ignoring them, or the fact that *Microsoft* is the only one complying.

  33. Disappointing. My take on why such a dismal state:

    1) Search engines know browsers (principally IE) are forgiving in what they accept. “it’s fast and works on the popular browser(s), so why ‘fix’ it?”

    2) They don’t really care about their own content (primarily SERPs) being indexed, ranked and aggregated by other engines or like services – they are at the top of the food chain after all.

    3) Web standards and accessibility are still “new” in some respects and it takes time for the business, political and philosophical wheels in these organizations to turn toward adoption.

    4) Engineers are stubborn.

  34. Look at Google’s HTML – it’s stripped down for ‘minimal’ filesize – quotes are missing from attributes, no doctype, etc. In some areas this might lead to smaller files, but surely with something like Gmail it’d hinder filesize more than helping – especially when my Firefox validator extension tells me there are 1341 warnings for the Gmail inbox!

    John Spitzer hits it on the head though. If it ain’t broke don’t fix it, and Google’s bad HTML is hardly affecting their bottom line.

  35. Maybe it isn’t perceived as a strategic business advantage. MSN, which clearly lags behind Google and Yahoo, has to comply with standards to create some differentiator.
    For Google it is simple – most relevant results.

  36. It seems as though M$N is also not XHTML 1.0 Strict valid as your article says when I use the W3C validator.

  37. As web designers and standards advocates, we would love to see a site like Google, HotBot, etc. be a role model when it comes to website development. But from a business standpoint, it seems like the only questions they ask are the following:

    1) Does it work?
    2) Is it efficient?

    Now, the efficiency question, as I believe somebody talked about earlier, is quite tricky. Google claims to not indent and not provide a valid CSS-based layout because they want to save the tiny little bytes of data. But does it actually matter, now that the cost of DSL is significantly lower and many people already have high-speed connections, whether Google has to send a few extra bytes of data to a computer because of a semantically correct page? I don’t know. It is a tough call.

    We are still using TCP/IP and IPv4, which clearly are old-school protocols designed with a great deal of attention to this kind of data efficiency (for example, only a few bits are dedicated to identifying and flagging a packet. If we were to design that packet and protocol layout again, we would probably make those fields a few bits longer to make it easier for developers), so I guess one can make an argument that this is still an important factor.

  38. Or, perhaps, are we misguided zealots in search of a holy grail that isn’t there?

  39. At least from my experience, a high volume site with a very large organizational structure behind it makes it difficult for developers to really turn the corner from the old necessary evils of the past (the old days were ugly). The business keeps shoving revenue-producing projects down the pipe that choke out any kind of refactoring effort. There’s no perceived value in it from a business perspective. One constant is that companies want revenue to grow as quickly as possible, very often at the expense of vision.

    The larger the organization, the harder it is for developers to move through non-project changes. The codebase is often locked down so tightly, that anything other than sanctioned business projects are forbidden. And if the business gets wind that developers are busy doing something other than their projects, they freak out because resources are so tight.

    It was quoted nicely in an above comment… “If it ain’t broke, don’t fix it.”

    I’ve certainly been pushing for standards at every opportunity to become more standards-compliant, but it’s a very difficult effort against such a large audience who frankly just don’t care.

    As with many large sites, global layout files and other shared files make it difficult to really make a lot of headway one small project at a time. It’s a very difficult proposition to rework the main layout files, navigation, etc. etc. because that would require a lot of QA and Project Management time, which seems to be non-existent. So I do what I can with the time available on current projects, but in the end it doesn’t really amount to much. Add in the complacency of many developers, and it’s even harder to make a difference.

    It’s easy to make a brand new site standards-compliant, but it’s a whole different story trying to recover from thousands of existing pages that were written poorly many years ago. Sure, it’s not terribly hard if you have the time to do it (and it does require a fair amount of time and effort to lay the groundwork), but when time isn’t available it’s nearly impossible. Throw in a bunch of teams working on a bunch of concurrent projects on the same code all the time, and it’s even more difficult.

    The only gem of hope I’ve encountered so far is attaching standards to Search Engine Optimization. The SEO buzz word can help developers get support for time and resources to make the move toward standards. Apparently, execs can wrap their arms around the idea that if people can’t find you then they don’t come. Visitors equal money, which helps get focus on moving toward standards in the name of higher search rankings.

    At any rate, I think there are a lot of developers out there who really want to move the sites they work on to standards, but they are often outnumbered by people who see things working fine just the way they are and therefore don’t see the value in it.

  40. Molly,

    I don’t have any other way to contact you aside from commenting on your blog… sorry.
    my name is Maya Shved, I work for the Options Group in New York City.

    Looking for people who are very tech savvy to fill positions in a variety of investment banks.

    Currently looking for Java/C++ developers, as well as those with exposure, experience in FIX protocol.

    Also need GUI developers to support trade floor applications.

    Let me know if you are interested.
    [email protected]

    Thanks,

  41. Molly.
    On Yahoo!, the developers here at Y! know exactly how to use standards… but you’re often forced to work with marketing and third parties which cause your page not to validate. The simple use of Flash on the homepage will throw it off very simple like.

    Not to mention there is tons of legacy code that generates presentational tags, and it takes quite a huge effort to get something standards compliant overnight.

    So in our defense, most of us here are fully aware of how to develop standards compliant websites. We don’t need another Douglas Bowman hoopla coming our way telling us how to design the Yahoo search pages…. although it was quite entertaining and showed several developers a good lesson. We needed a swift kick. Looking back on it all, if you were to compare the old homepage to the new, you need to take into account the tremendous progress.

    When you say MSN “validates” – well, technically… it doesn’t. Of course last I checked Y! has 290 errors showing on the validator… but the effort has been started.

  42. Oh. And thanks for calling us progressive. You rock. 😉

  43. Who cares what’s behind the scenes? I’m a dummy consumer browsing with an IE thing called version 6 or sometimes I browse with a Fox in Fire or a Concert Hall thing called Opera. I’m not trying to play a programmer and looking behind the scenes if some kind of code is written well enough. I don’t care how my electricity reaches my power outlet… I just want to use it. I don’t care if a photograph comes in standard 1.0 or standard 2.0, I just want to see it on my screen… If IE isn’t the standard then make it a standard and throw away the others or make the others in a way that it can read ALL without showing an error. If my DVD player can play several standards, including Mp3, Mpeg4, Jpeg, Tiff, some other codecs, VCD, SVCD, CD-RW, whatever, why can’t a browser be made like that?

  44. Standards are made and broken by their programmers and designers. Everybody throws in something new any time, any place, as everybody at any time wants to invent something more new, something more trendy, something more in dollars. There never will be a standard, I think that’s quite clear now if I look back throughout the 35 years surfing around on what they call the Internet…. http://WWW…. Now it’s CSS, tomorrow you’re talking about SSC, now it’s Ajax, tomorrow it will be HaagenDaz 😉

  45. Pingback: ara pehlivanian » Blog Archive » Why you should escape your ampersands

  46. I get particularly mad at Google for not using standards. We’re talking about a company which claims to include “do no evil” and “improve the world” in its credo, yet they can’t find it in their hearts to hire a standards person (maybe I should say Web Standards Developer ;)).

    I really would have thought Google could make some significant bandwidth savings; even a reduction of a couple of kilobytes would make a difference for them I’d imagine 🙂

  47. Greetings! This just in: http://www.att.com’s Google-powered search now delivers valid XHTML 1.0 Strict markup. Read all about it.

  48. Mario, these standards are what allowed the web to grow into its functional state. CSS was first proposed in 1994, and more folks are using it to separate content from design as awareness grows.
