I found this article today, Will NoSQL Databases Live Up To Their Promise? (PDF), in the IEEE Computer Society publication called Computing Now. It’s a great IT executive-level overview of NoSQL systems, which explains things at (what one friend calls) a “big animal pictures” level. I’d caution that it’s written by the head of a PR firm, though I can’t tell if he’s writing on behalf of any given client.
Excerpt:
Many organizations collect vast amounts of customer, scientific, sales, and other data for future analysis. Traditionally, most of these organizations have stored structured data in relational databases for subsequent access and analysis. However, a growing number of developers and users have begun turning to various types of non-relational — now frequently called NoSQL — databases.
I’d quibble that most NoSQL systems do not qualify as what I’d call databases (or more precisely database management systems), so I dislike the term “NoSQL databases,” generally preferring “NoSQL systems.” Some NoSQL systems are databases (e.g., MarkLogic, an XQuery-based XML database/server or CouchDB, a document database) while others are not — e.g., Hadoop is a distributed computing framework, Dynamo is a key-value store, memcached is a distributed caching mechanism, and Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. For more examples, see the Wikipedia structured storage page.
While I generally think the article does a good job at the difficult task of explaining things in high-level terms, it does perpetuate the notion that NoSQL is primarily about unstructured data, and I’m not at all sure that it is.
… NoSQL databases will be used largely for working with unstructured data in ways that require scalability …
While several NoSQL technologies were developed for web applications (e.g., spiders) which handle large amounts of unstructured information, I don’t see much that specifically makes them either good at unstructured information or, for that matter, bad at structured information. A key-value store works equally well whether the value is a structured record or an unstructured text field, primarily because it doesn’t care much about the value. It just knows how to find it fast given the key.
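To make that point concrete, here is a minimal sketch of a toy in-memory key-value store. It is not the API of Dynamo, Cassandra, or any real product; the `KeyValueStore` class and the sample keys and values are invented for illustration. The store treats every value as opaque bytes, so a structured record and a free-text note go in and come out the same way.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it only files it under the key.
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # Lookup is by key alone, regardless of what the value contains.
        return self._data[key]


store = KeyValueStore()

# A structured record, serialized to bytes by the caller.
structured = json.dumps({"customer_id": 42, "total": 19.99}).encode("utf-8")
# An unstructured text field, also just bytes as far as the store is concerned.
unstructured = "Free-form notes from a support call.".encode("utf-8")

store.put("order:1", structured)
store.put("note:1", unstructured)

# Any structure is recovered by the application, not by the store.
order = json.loads(store.get("order:1"))
note = store.get("note:1").decode("utf-8")
```

Whatever structure the value has is entirely the caller's business, which is why such a store is neither especially good nor especially bad at unstructured data.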
I think the vast majority of information that people call “unstructured” is actually semi-structured and the trick to managing it well is determining what structure is present, optionally enriching it further, and then leveraging the available structure as much as possible. For example, consider email, which many people call unstructured. Email has:
- Address fields, such as to/from
- Send time/date
- Subject line
- Body text
- Footer/signature
- And potentially a series of replies and comments that make up a conversation thread
That’s a lot of structure, and you’d like a good query system to be aware of it:
- Find all emails that include the word “legal,” but not in the standard footer or disclaimer: to avoid returning every email in the system if a company’s standard footer includes the word legal.
- Find the emails that contain the word “option” within three words of “backdate” that were sent to the general counsel before a given date: to run precise searches.
- Tell me who sends the most email about subject X: so I can identify an expert.
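The three queries above can be sketched in plain Python once the email's structure is made explicit. This is not MarkLogic's API or XQuery, only an illustration: the `Email` class, the helper functions, the sample messages, and the addresses are all made up, and "within three words" is assumed to mean that the two words' positions differ by at most three.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    sender: str
    recipients: list
    sent: date
    subject: str
    body: str      # body text with the footer already split off
    footer: str    # standard footer or disclaimer


def words(text):
    """Lowercased word tokens of a piece of text."""
    return re.findall(r"\w+", text.lower())


def mentions_in_body(email, word):
    """True if the word appears in the body, ignoring the footer."""
    return word.lower() in words(email.body)


def within(email, a, b, distance=3):
    """True if words a and b occur at most `distance` positions apart in the body."""
    tokens = words(email.body)
    pos_a = [i for i, t in enumerate(tokens) if t == a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def top_sender(emails, topic):
    """Sender with the most emails whose subject mentions the topic."""
    counts = Counter(e.sender for e in emails if topic.lower() in e.subject.lower())
    return counts.most_common(1)


FOOTER = "This message may contain legal advice."
emails = [
    Email("alice@example.com", ["gc@example.com"], date(2009, 5, 1),
          "Grants", "We could backdate the option grant.", FOOTER),
    Email("bob@example.com", ["team@example.com"], date(2009, 6, 1),
          "Lunch", "Legal wants to join us for lunch.", FOOTER),
    Email("alice@example.com", ["team@example.com"], date(2009, 7, 1),
          "Grants", "Option paperwork is done.", FOOTER),
]

# Query 1: "legal" in the body, not merely in the standard footer.
legal_hits = [e for e in emails if mentions_in_body(e, "legal")]

# Query 2: "option" near "backdate", sent to the general counsel before a date.
backdate_hits = [
    e for e in emails
    if within(e, "option", "backdate")
    and "gc@example.com" in e.recipients
    and e.sent < date(2009, 6, 1)
]

# Query 3: who sends the most email about a given subject.
expert = top_sender(emails, "grants")
```

Every email here carries the word "legal" in its footer, yet only one matches the first query, which is the payoff of treating the footer as its own field rather than as part of an undifferentiated blob.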
This, by the way, is exactly what MarkLogic lets you do, and you can see an example of a system running MarkLogic against 40M emails at markmail.org. Since I view MarkLogic as a NoSQL system, I suppose I’d say that some NoSQL systems are all about unstructured information, but to the extent a system treats unstructured information as a BLOB, I’d argue that it’s not really about unstructured information. It’s more about providing a vessel in which to put it.
In any case, I still think it’s a nice article to hand the CIO who’s probably hearing some of the NoSQL hype. If you’d like something one level more technical, I also found this deck, posted yesterday by Harri Kauhanen, which I’ve embedded below.