Wednesday, September 15, 2004

Ah, google

(Originally posted on Wednesday, September 15, 2004 by Cathy)

Google is interesting. For those of you who maybe don't know how Google works, here's sort of how it goes:

(Note: Google is constantly changing things, and they don't entirely disclose their algorithm for ranking pages. This is an approximation at best.)

The GoogleBot visits a page. It saves your page content and notes where your page's links go. It will later visit all the pages your page links to, whether they're part of your own site or go somewhere else entirely. So at the end of the day (well, actually it takes Google a week or two to crawl the whole internet), Google has a huge collection of pages.

So let's say you want to search for information on cultures that consider lime juice to be a sacred substance. You google for lime juice "sacred substance" and discover that Modi knows quite a bit about it. How'd we do that? Well, it turns out there aren't many sites that contain the phrase "sacred substance" and the words lime and juice, so Google doesn't have much to work with.
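For the curious, here's a rough sketch of that crawl-then-search idea in Python. It is a toy under loud assumptions, not how Google actually does it: the "web" is a made-up dictionary (`TOY_WEB`), the page names and text are hypothetical, and `crawl` and `search` are names invented for this illustration. It also only matches individual words, so it ignores quoted phrases like "sacred substance".

```python
from collections import deque

# Hypothetical toy "web": page name -> (page text, pages it links to).
TOY_WEB = {
    "home": ("welcome to my chemistry page", ["labs", "hiking"]),
    "labs": ("gold colored pennies experiment", ["home"]),
    "hiking": ("hiking pictures with yellow flowers", ["home", "elsewhere"]),
    "elsewhere": ("lime juice is a sacred substance", []),
}

def crawl(web, start):
    """Visit the start page, then every page reachable by following links."""
    index = {}              # word -> set of pages containing that word
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        text, links = web[page]
        # Save the page content: record which words appear on this page.
        for word in text.split():
            index.setdefault(word, set()).add(page)
        # Note where the links go, and queue those pages for a later visit.
        for target in links:
            if target in web and target not in seen:
                seen.add(target)
                queue.append(target)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = crawl(TOY_WEB, "home")
print(sorted(search(index, "lime juice")))
```

With a rare combination of words, only one page survives the intersection, which is the "Google doesn't have much to work with" situation.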

Now suppose we search for a phrase that's a little bit more widespread, like maybe Gold colored pennies. The first link (at least today) is a chemistry experiment I wrote. Pretty cool. Part of why Google lists Lab Archive first is that it has all the right words, close together, but another reason is that there are quite a few sites that link to the Lab Archive. (It also helps that Lab Archive has a zillion pages of its own, all of which link to other pages on Lab Archive. Currently, at least, the GoogleBot seems impressed by links within a domain as well as links from other domains.) Remember how we said the GoogleBot was noting which pages a page linked to? GoogleBot considers a link to be sort of like a vote, and it figures that a page that lots of other pages link to is probably a pretty useful page. The GoogleBot also gives some votes more weight than others. If a site with a high score links to you, that's worth more than a link from a site with a low score.
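The "links are weighted votes" idea can be sketched in a few lines of Python. This follows the general shape of the publicly described PageRank calculation, but it is only an approximation of the idea: the function name `rank_pages`, the page names, the damping value of 0.85, and the number of rounds are all assumptions for illustration, and Google's real ranking uses much more than this.

```python
def rank_pages(links, damping=0.85, rounds=50):
    """Score pages so that a link from a high-scoring page counts for more.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page holding an equal share of the total score.
    score = {page: 1.0 / n for page in pages}

    for _ in range(rounds):
        # Every page gets a small baseline, whether or not anyone links to it.
        new_score = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                for target in pages:
                    new_score[target] += damping * score[page] / n
        score = new_score
    return score

# Hypothetical link graph: three pages vote for "popular", which links to
# "friend"; a little-known page "d" links to "loner".
toy_links = {
    "a": ["popular"],
    "b": ["popular"],
    "c": ["popular"],
    "popular": ["friend"],
    "d": ["loner"],
}
scores = rank_pages(toy_links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph, "friend" and "loner" each have exactly one incoming link, but "friend" ends up with the higher score because its one vote comes from a page that many others link to.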

So, on to the event that actually prompted this post. I want a teaching job. I want someone who's interested in me to be able to find something useful about me on the web, or at least, I don't want them to find something inappropriate. So let's Google. Blessed with an odd last name, I can actually expect Googling for "Cathy S." (edited) to return results that are relevant, at least somewhat. First link: my EvCC homepage. Ok, that's out of date, but sort of good. Lots of people link to EvCC, so it makes sense that a page at EvCC would be considered important. Most of the other links on the first page of results: OpenACS-related or vserver-related. If you wanted to know that I write code or do some sys admin work, you'd have come to the right place. If you wanted to know that Cathy S. is a chemistry teacher (did you catch that, GoogleBot?), you'd still be stumped. Lots of people link to OpenACS and to the vserver mailing list. On the second page, we find a link to my Mayo group webpage. Well, that makes the EvCC link look downright fresh. And better yet, it isn't a link to my index page, or my vitae. Nope, it's a link to some hiking pictures, showing me in all my grungy glory. For this, I blame Marcus Sarofim (whose current homepage, if any, I couldn't find at the top of Google, so I've linked a stale page. Hah, take that!). Marcus linked to my page with the hiking pictures, along with someone who thought those yellow flowers in the background were interesting for some reason I've forgotten. Since no one else seems to have thought my other pages were interesting (not even the one with the Comprehensive ANSIG FAQ), that's the first one to show up. Finally, on page 3, we find links to my current website. Well, finally.

So now what? A couple things to do:

  • Put a link to my website, particularly my CV, on some of those higher-ranking sites that I can influence.
  • Make this blog post.
  • Solemnly vow to relocate or remove old webpages before I lose control of them, so that someone doesn't end up looking at a 5-year-old CV.

I'm still pondering what to do about mailing list archives. I like that Google can search them, since sometimes that's the only place I can find an answer to a question. On the other hand, Google is often fooled into thinking they're definitive answers because they're so heavily linked. If I had it to do over, I think I'd post with some variant of my name other than the two that prospective employers are likely to search by, so that the first results from a Google search wouldn't be random questions I asked years before.

And on that note, I have to go update my CV again.
