Kellblog

This blog is written by Dave Kellogg, CEO of MarkLogic Corporation, covering next-generation information management, enterprise search, and content management technologies along with commentary on Silicon Valley, venture capital, and the business of software.


Semantic Technology at the New York Times

July 2nd, 2009

I recently had the pleasure of meeting Evan Sandhaus, semantic technologist at The New York Times R&D, and wanted to highlight and share a few things that we discussed.

Evan gave an information-packed, 79-slide keynote address at the recent Semantic Technology Conference in San Jose. During our meeting, we went through some of the slides and they were fantastic. The slides aren’t publicly posted yet, but I hope they will be soon, and I will update this post with a link if and when they are.

He also told me about the New York Times’ recent release of a 1.8M-article corpus to the computer science research community, known as The New York Times Annotated Corpus. The corpus includes nearly every article published in the New York Times over twenty years (between 1/1/87 and 6/19/07) in XML format (NITF, to be precise), along with various metadata about the articles.

They believe the corpus can be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization, and automatic content extraction. I think that’s true not only because it’s real content in real volume, but because that content comes with real, high-quality metadata that you can use to build upon or validate various text processing algorithms.

Finally, in prepping for the meeting I found this video interview with Evan at the New York Semantic Meetup. Great stuff, embedded below.

Tags: New York Times · text analytics
