Every so often, Google releases a research paper describing one of the sweeping software platforms that help drive its online empire, and a few years later, that paper spawns an open source software project that seeks to share Google’s creation with the rest of the world.
Papers describing the Google File System and Google MapReduce spawned Hadoop, an open source platform that lets you spread data across thousands of dirt-cheap computer servers and then crunch it into something useful. Google BigTable gave rise to an army of “NoSQL” databases that can juggle unusually large amounts of information. Google Pregel delivered multiple “graph” databases that can map the many online relationships between people and things.
Some have complained that the outside world takes far too long to rebuild these groundbreaking Google creations. And that includes Mike Olson, the CEO of Cloudera, a Silicon Valley startup that brought Hadoop to the business world. But this time is different.
On Wednesday, Cloudera uncloaked a software platform known as Impala. Under development for the past two years, Impala is a means of instantly analyzing the massive amounts of data stored in Hadoop, and it’s based on a sweeping Google database known as F1. Google only revealed F1 this past May, with a presentation delivered at a conference in Arizona, and it has yet to release a full paper describing the technology. Two years ago, Cloudera hired away one of the main Google engineers behind the project, a database guru named Marcel Kornacker.
Hadoop is now widely used across the web, driving such big-name operations as Facebook, Yahoo, and Twitter, and it’s spreading into traditional businesses as well. According to market research outfit IDC, it will fuel an $813 million software market by the year 2016.
It was originally designed as a “batch processing” platform. You give it a data-crunching task, and it takes several minutes — or several hours — to complete that task. It can build you, say, an index for the entire internet. With open source tools such as Hive, you can also analyze Hadoop data in much the same way you would query a traditional database using the common Structured Query Language, or SQL. If you’ve collected data describing a collection of digital books, for instance, you could run a query asking for a list of authors. But this too takes time.
Impala lets you query the same data “in real-time” — i.e., in seconds. According to Cloudera, it’s 10 times faster than a tool like Hive.
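To make the "list of authors" query concrete, here is a minimal sketch of what such a SQL query looks like. It is an illustration only: it runs against Python's built-in sqlite3 module as a small stand-in, not against Hive, Impala, or a Hadoop cluster, and the table name `books`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3


def list_authors(rows):
    """Return the distinct authors in a collection of (title, author) rows."""
    # An in-memory SQLite database stands in for the real data store here.
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema: one row per digital book.
        conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
        conn.executemany(
            "INSERT INTO books (title, author) VALUES (?, ?)", rows
        )
        # The query itself: ask for each author once, in alphabetical order.
        cursor = conn.execute(
            "SELECT DISTINCT author FROM books ORDER BY author"
        )
        return [author for (author,) in cursor]
    finally:
        conn.close()


# Made-up sample data for the sketch.
sample_books = [
    ("Moby-Dick", "Herman Melville"),
    ("Typee", "Herman Melville"),
    ("Emma", "Jane Austen"),
]

authors = list_authors(sample_books)
print(authors)
```

On a three-row table this returns at once. The difference the article describes only shows up at Hadoop scale, where the same kind of question can take minutes or hours as a batch job.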
Cloudera is now four years old. But Jeff Hammerbacher — who helped found Cloudera after overseeing the rise of Hadoop at Facebook — refers to Impala as the company’s “version 1.0.” In other words, it’s the beginning. “We’re getting to the point,” he says, “where we’re building what I wanted to build when we started the company.”
Google’s F1 is a massive relational database management system, or RDBMS, that helps run the company’s online ad system. It sits atop Spanner, a much ballyhooed Google creation that lets the company store information across its worldwide network of data centers.
“Spanner stores records and data. F1 gives you access to those records. It runs queries. And it correlates them.”
— Marcel Kornacker