06 May 2005

Judging a Book by its Contents

Name that famous book from just these phrases: "pagan harpooneers," "stricken whale," "ivory leg." Or how about this one: "old sport."

Yes, it's Herman Melville's Moby Dick and F. Scott Fitzgerald's The Great Gatsby, respectively, but the words aren't just a game. They are Statistically Improbable Phrases, the result of a new Amazon.com feature that compares the text of hundreds of thousands of books to reveal an author's signature constructions.
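Amazon has not published how it computes SIPs, so the following is only a rough sketch of the general idea: find phrases that turn up often in one book but rarely in a comparison set of other books. The function names, the two-word phrase length, the `min_count` cutoff and the smoothing are all assumptions made for illustration.

```python
from collections import Counter
import re


def bigrams(text):
    """Split text into lowercase words and return adjacent word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(book, corpus, top=3, min_count=2):
    """Rank two-word phrases that are frequent in `book` but rare in `corpus`.

    Hypothetical scoring, not Amazon's: the phrase's rate in the book
    divided by its (smoothed) rate across the comparison texts.
    """
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter()
    for other in corpus:
        corpus_counts.update(bigrams(other))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, n in book_counts.items():
        if n < min_count:
            continue  # ignore phrases the book itself barely uses
        book_rate = n / book_total
        # Add one to both counts so unseen phrases do not divide by zero.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = book_rate / corpus_rate

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top]]
```

A real system would need a far larger comparison corpus and would likely handle longer phrases and common-word filtering; this version only shows the frequent-here, rare-elsewhere comparison.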

The haiku-like SIPs are not the only word toys on the site. Customers can also see the 100 most common words in a book. Penny pinchers -- or those with back problems -- can check stats on how many words a volume delivers per dollar or per ounce. (Bargain hunters will love the Penguin Classics edition of War and Peace that delivers 51,707 words per dollar.)
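The simpler stats are easy to approximate. The sketch below, with hypothetical function names and made-up figures, counts a text's most common words and divides a word count by a price or a weight; it is not Amazon's code, and the numbers in the usage note are not real book data.

```python
from collections import Counter
import re


def most_common_words(text, n=100):
    """Return the n most frequent words in the text, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]


def words_per_unit(word_count, amount):
    """Words per dollar or per ounce: word count divided by price or weight."""
    return word_count / amount
```

For an imaginary 500,000-word paperback priced at $10, `words_per_unit(500_000, 10.0)` gives 50,000 words per dollar; passing the book's weight in ounces instead of its price gives words per ounce.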

Customers can also see how complicated the writing is (yes, post-structuralist Michel Foucault's prose is foggier than Immanuel Kant's), and how much education you need to understand a book. (To understand French philosopher Pierre Bourdieu, you'll need a second Ph.D.)
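The article does not say which readability formula sits behind these scores. The word "foggier" suggests something like the Gunning fog index, which estimates the years of schooling needed to follow a text from its sentence length and its share of long words. The sketch below assumes that formula and uses a crude vowel-group syllable counter, so its numbers will differ from any official score.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fog_index(text):
    """Simplified Gunning fog index.

    0.4 * (average words per sentence + percentage of words with
    three or more syllables). The standard exclusions for proper
    nouns, compounds and common suffixes are skipped here.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    average_sentence_length = len(words) / len(sentences)
    percent_complex = 100 * len(complex_words) / len(words)
    return 0.4 * (average_sentence_length + percent_complex)
```

Short sentences of short words score low, while a single sentence packed with polysyllabic terms scores far higher, which is the contrast the Foucault and Kant comparison plays on.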

While such services seem to have little value and have generated scant publicity, except from bibliophilic thrill seekers, web watchers say the madcap stats aren't just for kicks.

"(Amazon CEO) Jeff Bezos was born on numbers," said Nathan Torkington, an editor and conference coordinator for O'Reilly Media. "Before starting Amazon.com, he was a Wall Street analyst. They will be looking at this thinking, 'What can we do to drive the bottom line?' There's no way they will be regarding this as, 'We are math geeks and you will enjoy the numbers, too.'"

Torkington thinks Amazon is currently just experimenting, but it will soon find intriguing ways, such as using authoritative texts to answer user questions, to wring profit out of what may well be the largest collection of electronic books in the world.

Bill Carr, Amazon's executive vice president of digital media, confirms that this is a serious attempt to sell more books.

"We've been spending a lot of time thinking, 'We have this rich digital content, how can we pull info out and expose it to customers that makes discovery even better?'" Carr said. "What you are seeing here are the fruits of a lot of experimenting and brainstorming."

Carr points to the "adaptive unconscious" SIP from Malcolm Gladwell's best seller, Blink, as an example of how improbable data mining can get a curious reader into the long tail of Amazon's catalog.

"That distinctive phrase gets to the heart of the book, but also allows customers to discover books that range from topics like psychology to psychotherapy to how a smart woman can land her dream man in six weeks," Carr said. "One of the cool things is getting people to discover books that are not only related, but that they would have a hard time finding anywhere else."

Amazon is also crunching data to automatically categorize books and make related book suggestions, which complement its popular people-who-bought-this-also-bought-that feature.

Benjamin Vershbow, a researcher at the Institute for the Future of the Book, sees Amazon's SIPs as an automated version of tagging, a concept that fuels sites like del.icio.us, a bookmark-sharing site, and photo-sharing site Flickr. Both rely heavily on users attaching descriptive names to websites or photos so others can discover them.

Vershbow found, however, that Amazon's SIPs work much better for nonfiction than for novels.

"I don't see in Moby Dick's SIPs 'whiteness of whale,'" Vershbow said. "This is a big poetic trope, and I don't know why it isn't picked up. Perhaps it's because there are different ways you weave metaphor and tropes into a novel than you do in a theoretical book."

Still, Vershbow sees Amazon's data mining as part of a trend on the web where sites are learning to weave data sources together to create a new web experience. Amazon's Carr agrees.

"We are pioneers here ... in that we have this amazing corpus -- no one else has a corpus of this magnitude -- and are finding exciting ways to leverage that content to make a better discovery process for customers."

From Wired
