Kellblog

This blog is written by Dave Kellogg, CEO of MarkLogic Corporation, covering next-generation information management, enterprise search, and content management technologies along with commentary on Silicon Valley, venture capital, and the business of software.

IEEE Computer Society Article on NoSQL; An Executive-Level Overview

March 10th, 2010

I found this article today, Will NoSQL Databases Live Up To Their Promise? (PDF), in the IEEE Computer Society publication called Computing Now.  It’s a great IT executive-level overview of NoSQL systems, which explains things at (what one friend calls) a “big animal pictures” level. I’d caution that it’s written by the head of a PR firm, though I can’t tell if he’s writing on behalf of any given client.

Excerpt:

Many organizations collect vast amounts of customer, scientific, sales, and other data for future analysis. Traditionally, most of these organizations have stored structured data in relational databases for subsequent access and analysis. However, a growing number of developers and users have begun turning to various types of non-relational — now frequently called NoSQL — databases.

I’d quibble that most NoSQL systems do not qualify as what I’d call databases (or more precisely database management systems), so I dislike the term “NoSQL databases,” generally preferring “NoSQL systems.”  Some NoSQL systems are databases (e.g., MarkLogic, an XQuery-based XML database/server or CouchDB, a document database) while others are not — e.g., Hadoop is a distributed computing framework, Dynamo is a key-value store, memcached is a distributed caching mechanism, and Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store.  For more examples, see the Wikipedia structured storage page.

While I generally think the article does a good job at the difficult task of explaining things in high-level terms, it does perpetuate the notion that NoSQL is primarily about unstructured data, and I’m not at all sure that it is.

…  NoSQL databases will be used largely for working with unstructured data in ways that require scalability …

While several NoSQL technologies were developed for web applications (e.g., spiders) that handle large amounts of unstructured information, I don’t see much that specifically makes them either good at unstructured information or, for that matter, bad at structured information.  A key-value store works equally well whether the value is a structured record or an unstructured text field, primarily because it doesn’t care much about the value.  It just knows how to find it fast given the key.
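To make the point concrete, here is a minimal sketch in Python of a toy in-memory key-value store. It is not modeled on Dynamo, Cassandra, or any real product; the class name, keys, and sample values are all made up for illustration. The store only ever sees bytes, so a serialized record and a free-text note are handled identically.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only).

    Values are opaque bytes: the store never inspects them.
    """

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


store = KeyValueStore()

# A structured record, serialized by the caller before it is stored.
record = {"customer_id": 42, "name": "Acme", "balance": 1250.75}
store.put("customer:42", json.dumps(record).encode("utf-8"))

# An unstructured text field, stored in exactly the same way.
store.put("note:7", "Call Acme back about the renewal.".encode("utf-8"))

# Interpreting the value is entirely the caller's job.
structured = json.loads(store.get("customer:42"))
unstructured = store.get("note:7").decode("utf-8")
```

Any structure in the value is visible only to the application that wrote it, which is why such a store is neither especially good nor especially bad at either kind of data.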

I think the vast majority of information that people call “unstructured” is actually semi-structured and the trick to managing it well is determining what structure is present, optionally enriching it further, and then leveraging the available structure as much as possible.  For example, consider email, which many people call unstructured.  Email has:

  • Address fields, such as to/from
  • Send time/date
  • Subject line
  • Body text
  • Footer/signature
  • And potentially a series of replies and comments that make up a conversation thread

That’s a lot of structure, and you’d like a good query system to be aware of it:

  • Find all emails that include the word “legal,” but not in the standard footer or disclaimer:  to avoid returning every email in the system if a company’s standard footer includes the word legal.
  • Find the emails that contain the word “option” within three words of “backdate” that were sent to the general counsel before a given date:  to run precise searches.
  • Tell me who sends the most email about subject X:  so I can identify an expert.
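The three queries above can be sketched in a few lines of plain Python once the structure of an email is made explicit. This is only an illustration of the idea and not how MarkLogic (or XQuery) expresses it; the `Email` fields, helper functions, addresses, and sample messages are all assumptions invented for the example, and the word matching is exact, with no stemming.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    # Hypothetical field names; a real system would derive these from the message.
    sender: str
    recipients: list
    sent: date
    subject: str
    body: str    # body text with the footer already split off
    footer: str  # standard footer/disclaimer


def words(text):
    """Lower-cased word tokens of a text."""
    return re.findall(r"[a-z']+", text.lower())


def mentions_outside_footer(email, word):
    """True if `word` occurs in the subject or body, ignoring the footer."""
    return word in words(email.subject) or word in words(email.body)


def near(text, a, b, distance=3):
    """True if words `a` and `b` occur within `distance` words of each other."""
    toks = words(text)
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


FOOTER = "This message may contain legal or privileged information."
GENERAL_COUNSEL = "gc@example.com"  # made-up address

emails = [
    Email("alice@example.com", [GENERAL_COUNSEL], date(2006, 3, 1),
          "Grant timing", "We could backdate the option grant to January.", FOOTER),
    Email("bob@example.com", ["team@example.com"], date(2006, 5, 2),
          "Lunch", "Pizza on Friday?", FOOTER),
    Email("carol@example.com", [GENERAL_COUNSEL], date(2007, 1, 15),
          "Legal review", "Please have legal look at the contract.", FOOTER),
    Email("alice@example.com", ["bob@example.com"], date(2006, 4, 4),
          "Grant timing", "Following up on the grant paperwork.", FOOTER),
]

# 1. "legal" anywhere except the standard footer.
legal_hits = [e for e in emails if mentions_outside_footer(e, "legal")]

# 2. "option" within three words of "backdate", sent to the general counsel
#    before a given date.
backdate_hits = [
    e for e in emails
    if near(e.body, "option", "backdate")
    and GENERAL_COUNSEL in e.recipients
    and e.sent < date(2007, 1, 1)
]

# 3. Who sends the most email whose subject mentions "grant"?
by_sender = Counter(e.sender for e in emails if "grant" in words(e.subject))
top_sender = by_sender.most_common(1)[0][0]
```

With this sample data, a plain full-text search for “legal” would match all four messages because of the footer, while the structure-aware version matches only the one that actually discusses it.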

This, by the way, is exactly what MarkLogic lets you do, and you can see an example of a system running MarkLogic against 40M emails at markmail.org.  Since I view MarkLogic as a NoSQL system, I suppose I’d say that some NoSQL systems are all about unstructured information, but to the extent a system treats unstructured information as a BLOB, I’d argue that it’s not really about unstructured information.  It’s more about providing a vessel in which to put it.

In any case, I still think it’s a nice article to hand the CIO who’s probably hearing some of the NoSQL hype.  If you’d like something one level more technical, I also found this deck, posted yesterday by Harri Kauhanen, which I’ve embedded below.

Tags: NoSQL · semi-structured data · unstructured data
