The biggest thing you need to know about Hadoop is that it isn’t Hadoop anymore.
Between Cloudera sometimes swapping out HDFS for Kudu while declaring Spark the center of its universe (thus replacing MapReduce everywhere it is found) and Hortonworks joining the Spark party, the only item you can be sure of in a “Hadoop” cluster is YARN. Oh, but Databricks, aka the Spark people, prefer Mesos over YARN -- and by the way, Spark doesn’t require HDFS.
Yet distributed filesystems are still useful. Business intelligence is a great use case for Cloudera’s Impala and Kudu, a distributed columnar store, is optimized for it. Spark is great for many tasks, but sometimes you need an MPP (massively parallel processing) solution like Impala to do the trick -- and Hive remains a useful file-to-table management system. Even when you’re not using Hadoop because you’re focused on in-memory, real-time analytics with Spark, you still may end up using pieces of Hadoop here and there.
By no means is Hadoop dead, although I’m sure that's what the next Gartner piece will say. But by no means is it only Hadoop anymore.
What in this new big data Hadoopy/Sparky world do you need to know now? I covered this topic last year, but there's so much new ground, I'm pretty much starting from scratch.
Spark is as fast as you've heard it is -- and, more important, the API is much easier to use and requires less code than with previous distributed computing paradigms. With IBM promising 1 million new Spark developers and a boatload of money for the project, Cloudera declaring Spark is the center of everything we know to be good with its One Platform initiative, and Hortonworks giving its full support, we can safely say the industry has crowned its Tech Miss Universe (hopefully getting it right this time).
Economics is also driving Spark's ascendance. Once it was costly to do it in memory, but with cloud computing and increased computing elasticity, the number of workloads that can’t be loaded into memory (at least on a distributed computing cluster) are diminishing. Again, we're not talking about all your data, but the subset you need in order to calculate a result.
Spark is still rough around the edges -- we’ve really seen this when working with it in a production environment -- but the warts are worth it. It really is that much faster and altogether better.
The irony is that the loudest buzz around Spark relates to streaming, which is Spark's weakest point. Cloudera had that deficiency in mind when it announced its intention to make Spark streaming work for 80 percent of use cases. Nonetheless, you may still need to explore alternatives for subsecond or high-volume data ingestion (as opposed to analytics).
Spark isn’t only obviating the need for MapReduce and Tez, but also possibly tools like Pig. Moreover, Spark’s RDD/DataFrames APIs aren’t bad ways to do ETL and other data transformations. Meanwhile, Tableau and other data visualization vendors have announced their intent to support Spark directly.
Hive lets you run SQL queries against text files or structured files. Those usually live on HDFS when you use Hive, which catalogs the files and exposes them as if they were tables. Your favorite SQL tool can connect via JDBC or ODBC to Hive.
In short, Hive is a boring, slow, useful tool. By default, it converts your SQL into MapReduce jobs. You can switch it to use the DAG-based Tez, which is much faster. You can also switch it to use Spark, but the word "alpha" doesn’t really capture the experience.
You need to know Hive because so many Hadoop projects begin with “let’s dump the data somewhere” and thereafter “oh by the way, we want to look at it in a [favorite SQL charting tool].” Hive is the most straightforward way to do that. You may need other tools to do that performantly (such as Phoenix or Impala).
I loathe Kerberos, and it isn’t all that fond of me, either. Unfortunately, it's the only fully implemented authentication for Hadoop. You can use tools like Ranger or Sentry to reduce the pain, but you’ll still probably integrate with Active Directory via Kerberos.
If you don’t use Ranger or Sentry, then each little bit of your big data platform will do its own authentication and authorization. There will be no central control, and each component will have its own weird way of looking at the world.
But which one to choose: Ranger or Sentry? Well, Ranger seems a bit ahead and more complete at the moment, but it's Hortonworks baby. Sentry is Cloudera’s baby. Each supports the part of the Hadoop stack that its vendor supports. If you’re not planning to get support from Cloudera or Hortonworks, then I’d say Ranger is the better offering at the moment. However, Cloudera’s head start on Spark and the big plans for security the company announced as part of its One Platform strategy will certainly pull Sentry ahead. (Frankly, if Apache were functioning properly, it would pressure both vendors to work together on one offering.)
HBase is a perfectly acceptable column family data store. It's also built into your favorite Hadoop distributions, it's supported by Ambari, and it connects nicely with Hive. If you add Phoenix, you can even use your favorite business intelligence tool to query HBase as if it was a SQL database. If you’re ingesting a stream of data via Kafka and Spark or Storm, then HBase is a reasonable landing place for that data to persist, at least until you do something else with it.
There are good reasons to use alternatives like Cassandra. But if you use Hadoop you already have HBase -- if you've purchased support from a Hadoop vendor, you already have HBase support -- so it's a good place to start. After all, it's a low-latency, persistent datastore that can provide a reasonable amount of ACID support. If Hive and Impala underwhelm you with their SQL performance, you'll find HBase and Phoenix faster for some datasets.
Teradata and Netezza use MPP to process SQL queries across distributed storage. Impala is essentially an MPP solution built on HDFS.
The biggest difference between Impala and Hive is that, when you connect your favorite BI tool, “normal stuff” will run in seconds rather than in minutes. Impala can replace Teradata and Netezza for many applications. Other structures may be necessary for different types of queries or analysis (for those, look toward stuff like Kylin and Phoenix). But generally, Impala lets you escape your hated proprietary MPP system, use one platform for structured and unstructured data analysis, and even deploy to the cloud.
There's plenty of overlap with using straight Hive, but Impala and Hive operate in different ways and have different sweet spots. Impala is supported by Cloudera and not Hortonworks, which supports Phoenix instead. While operating Impala is less complex, you can achieve some of the same goals with Phoenix, toward which Cloudera is now moving.
7. HDFS (Hadoop Distributed File System)
Thanks to the rise of Spark and ongoing migration to the cloud for so-called big data projects, HDFS is less fundamental than it was last year. But it's still the default and one of the more conceptually simple implementations of a distributed filesystem.
Distributed messaging such as that offered by Kafka will make client-server tools like ActiveMQ completely obsolete. Kafka is used in many -- if not most -- streaming projects. It's also really simple. If you’ve used other messaging tools, it may feel a little primitive, but in the majority of cases, you don't need the granular routing options MQ-type solutions offer anyway.
Spark isn't so great at streaming, but what about Storm? It's faster, has lower latency, and uses less memory -- which is important when ingesting streaming data at scale. On the other hand, Storm’s management tools are doggy-doo and the API isn’t as nice as Spark’s. Apex is newer and better, but it's not widely deployed yet. I’d still default to Spark for everything that doesn’t need to be subsecond.
10. Ambari/Cloudera Manager
I’ve seen people try and monitor and manage Hadoop clusters without Ambari or Cloudera Manager. It isn’t pretty. These solutions have taken management and monitoring of Hadoop environments a long way in a relatively short period of time. Compare this to the NoSQL space, which is nowhere near as advanced in this department -- despite simpler software with far fewer components -- and you gotta wonder where those NoSQL guys spent their massive funding.
I suspect this is the last year that Pig makes my list. Spark is much faster can be used for a lot of the same ETL cases -- and Pig Latin (yes, that's what they call the language you write with Pig) is a bit bizarre and often frustrating. As you might imagine, running Pig on top of Spark entails hard work.
Theoretically, people doing SQL on Hive can move to Pig in the same way that they used to go from SQL to PL/SQL, but in truth, Pig isn’t as easy as PL/SQL. There might be room for something between plain old SQL and full-on Spark, but I don’t think Pig is it. Coming from the other direction is Apache Nifi, which might let you do some of the same ETL with less or no code. We already use Kettle to reduce the amount of ETL code we write, which is pretty darn nice.
YARN and Mesos enable you to queue and schedule jobs across the cluster. Everyone is experimenting with various approaches: Spark to YARN, Spark to Mesos, Spark to YARN to Mesos, and so on. But know that Spark’s Standalone mode isn’t very realistic for busy multijob, multi-user clusters. If you’re not using Spark exclusively and still running Hadoop batches, then go with YARN for now.
Nifi would have had to try hard not to be an improvement over Oozie. Various vendors are calling Nifi the answer to the Internet of things, but that's marketing noise. In truth, Nifi is like Spring integration for Hadoop. You need to pipe data through transforms and queues, then land it somewhere on a schedule -- or from various sources based on a trigger. Add a pretty GUI and that’s Nifi. The power is that someone wrote an awful lot of connectors for it.
If you need this today but want something a bit more mature, use Pentaho’s Kettle (along with other associated kitchenware, such as Spoon). These tools have been working in production for a while. We’ve used them. They’re pretty nice, honestly.
While Knox is perfectly adequate edge protection, all it does is provide a reverse proxy written in Java with authentication. It's not very well written; for one thing, it obscures errors. For another, despite how it uses URL rewriting, merely adding a new service behind it requires a whole Java implementation.
You need to know Knox because if someone wants edge protection, this is the “approved” means of providing it. Frankly, a small modification or add-on for HTTPD’s mod_proxy and it would have been more functional and offered a better breadth of authentication options.
Technically, you can use Java 8 for Spark or Hadoop jobs. But in reality, Java 8 support is an afterthought, so salespeople can tell big companies they still use their Java developers. The truth is Java 8 is a new language if you use it right -- in that context, I consider Java 8 a bad knockoff of Scala.
For Spark in particular, Java trails Scala and possibly even Python. I don’t really care for Python myself, but it's reasonably well supported by Spark and other tools. It also has robust libraries -- and for many data science, machine learning, and statistical applications it will be the language of choice. Scala is your first choice for Spark and, increasingly, other toolsets. For the more “mathy” stuff you may need Python or R due to their robust libraries.
Remember: If you write jobs in Java 7, you’re silly. If use Java 8, it's because someone lied to your boss.
The Notebook concept most of us first encountered with iPython Notebook is a hit. Write some SQL or Spark code along with some markdown describing it, add a graph and execute on the fly, then save it so someone else can derive something from your result.
Ultimately, your data science is documented and executed -- and the charts are pretty!
Databricks has a head start, and its solution has matured since I last noted being underwhelmed with it. On the other hand, Zeppelin is open source and isn’t tied to buying cloud services from Databricks. You should know one of these tools. Learn one and it won’t be a big leap to learn the other.