Wednesday, August 09, 2006
I regularly peruse the top 15 searches on Technorati. Oftentimes, it leads me to an interesting story percolating on the blogs-- (Excuse me while I remove the kitten from behind the computer screen...) --and sometimes it just leaves me with a "Hunh?!" For instance, the number 14 search yesterday was "sinful second homes". Um. Okay. The most interesting thing is that, while it was number 14 on searches, checking out the blog posts on it produced...nothing. That's right: even though tons of folks were searching on the phrase, nothing showed up. Even googling the phrase brought up a great big zippo.

My question: Did some televangelist or radio personality do a show on the subject, and is that what prompted the searching? Did someone with an obsessive personality and a particular hard-on for the subject decide to spend a day inputting the search term so that it would climb up to number 14? What gives? (Oops. Had to separate dawg and cat and kitten as it was turning into a free-for-all.)

More seriously, though, there are times when something that seems vitally interesting hits the Technorati searches. It also gives you an idea of what the mainstream media cares about: sometimes such a search leads to a few news stories, and sometimes it is barely a blip on the MSM's screen. One such story this weekend was that of Adnan Hajj, a freelance Reuters photographer whom bloggers apparently caught doctoring his photos of war-torn Lebanon. That one rose so quickly and was such a hot story that he ended up being fired by Reuters.

Another story, which actually interests me more, was how AOL released 20 million records of actual search data on the internet. The search data was identified not by AOL username but by a randomly assigned ID number, and covered only a small sampling of users' searches over a period of three months. It was put on the web (as far as I can gather) on Friday.
By Sunday afternoon, a spokesman from AOL was wandering the blogosphere posting a sincere apology in comment trails on various tech blogs, and AOL had taken the data set down. But the damage was done: the data set (450MB in zipped format) was immediately mirrored on various download sites, and it's now out there forever.

Why was it put out in the first place? Well, AOL has a research group; they wanted to offer academics a chance to play with real search data. What they should have done is place it on a protected, secure download site and offer username/password combos to the researchers who might be interested. What they did do was just plop it, free and clear, on their research page. (Oh, SHIT. The kitten just knocked down and BROKE one of the plant pots in the windowsill. Excuse me for a few!)

Why is it such a big deal? After all, they cloaked the usernames by assigning generic ID numbers to each user. The problem is that the searches themselves can give a very clear idea of who a person is and where that person is located. For instance, many people (yours truly included) search every now and then on their own name, social security number, phone number, and address, to see if anything dangerous in terms of identity theft is out there. (So far, I haven't found anything dangerous, though I know there's an artist in New England and a couple of researchers in the UK with my name!)

So, taking three months' worth of search data from one person can provide you with a snapshot of that person. Let's say you live in Cincinnati (and search on, say, car repair shops there), have a Cadillac and a BMW (you're searching for car repair shops that do work on those cars), are a landscape architect (you're searching for re-certification charges for landscape architecture), and you're looking to join a country club (you've searched for three country club golfing ranges in Cincinnati). This pretty much narrows it down.
And then, oops, in the middle of all those mundane pieces of data, there's the fact that you searched on stuff like "how to expose cheating wife" -- (Damn. Kitten is now banned from the office.) -- "how to hire private investigator", etc. Or, worse, you searched for "n*ked pictures of h0t w0men" every night after the missus went to bed.

The NYTimes has apparently already tracked one of the searchers down. Google refused to release such data to the Department of Justice for just these reasons.

The story rose enough to actually show up on MSNBC for a few hours, in the Science and Technology section. Then it sank away. The NYTimes, obviously, found it interesting enough to pursue some more.

Anyway, something to think about. Every time you do a search, those search queries are being stored somewhere. ONE search engine (a rather obscure one) refuses to store the search queries. All the others? Your trail is out there.
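
For the code-minded, here is a rough sketch of the Cincinnati example above: why swapping a username for a random ID number doesn't protect much once every search under that ID can be read together. Everything in it is made up for illustration. The log rows, the ID numbers, the clue words, and the function names are my own assumptions, not the format of the AOL file or anything AOL published.

```python
from collections import defaultdict

# Hypothetical log rows: (anonymous_id, query). All values are invented.
LOG = [
    (1001, "cadillac repair cincinnati"),
    (1001, "bmw repair shop cincinnati"),
    (1001, "landscape architect recertification fee"),
    (1001, "country club golf cincinnati"),
    (1001, "how to hire private investigator"),
    (2002, "chocolate chip cookie recipe"),
    (2002, "kitten proofing a windowsill"),
]

# Hypothetical clue words a curious reader might look for in the queries.
CLUES = {
    "city": ["cincinnati"],
    "cars": ["cadillac", "bmw"],
    "job": ["landscape architect"],
}


def group_by_id(rows):
    """Collect every query issued under the same anonymous ID."""
    trails = defaultdict(list)
    for anon_id, query in rows:
        trails[anon_id].append(query)
    return dict(trails)


def profile(queries):
    """Return the clue words, per category, that appear in a query trail."""
    found = {}
    for category, words in CLUES.items():
        hits = sorted({w for w in words for q in queries if w in q})
        if hits:
            found[category] = hits
    return found


trails = group_by_id(LOG)
for anon_id, queries in trails.items():
    print(anon_id, profile(queries))
```

The ID number hides the name, but it also ties all of one person's searches together, and the combination of city, cars, and profession is what does the narrowing. The embarrassing query then comes along for the ride.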