Yes, you can haz big data. However, you can haz it the right way or the wrong way. Here are the top 10 worst practices to avoid.
1. Choosing MongoDB as your big data platform. Why am I picking on MongoDB? I'm not, but for whatever reason, the NoSQL database most abused at this point is MongoDB. While MongoDB has an aggregation framework that tastes like MapReduce and even a (very poorly documented) Hadoop connector, its sweet spot is as an operational database, not an analytical system.
When your sentence begins, "We will use Mongo to analyze ...," stop right there and think about what you're doing. Sometimes you really mean "collect for later analysis," which might be OK, depending on what you're doing. However, if you really mean you're going to use MongoDB as some kind of sick data-warehousing technology, your project may be doomed at the start.
2. Using RDBMS schema as files. Yeah, you dumped each table from your RDBMS into a file. You plan to store that on HDFS. You plan to use Hive on it.
First off, you know Hive is slower than your RDBMS for anything normal, right? It's going to MapReduce even a simple select. Look at the "optimized" route for "table" joins. Next, let's look at row sizes -- whaddaya know, you have flat files measured in single-digit kilobytes. Hadoop does best on large sets of relatively flat data. I'm sure you can create an extract that's more denormalized.
3. Creating data ponds. On your way to creating a data lake, you took a turn off a different overpass and created a series of data ponds. Conway's law has struck again and you've let each business group not only create their own analysis of the data but their own mini-repositories. That doesn't sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data. I don't mean flat versus cube -- I mean different answers for some of the same questions. Schema-on-read doesn't mean "don't plan at all," but it means "don't plan for every question you might ask."
Nonetheless, you should plan for the big picture. If you sell widgets, there is a good chance someone's going to want to see how many, to whom, and how often you sold widgets. Go ahead and get that in the common formats and do a little up-front design to make sure you don't end up with data ponds and puddles owned by each individual business group.
4. Failing to develop plausible use cases. The idea of the data lake is being sold by vendors to substitute for real use cases. (It's also a way to escape the constraints of departmental funding.) The data-lake approach can be valid, but you should have actual use cases in mind. It isn't hard to come up them in most midsize to large enterprises. Start by reviewing when someone last said, "No, we can't, because the database can't handle it." Then move on to "duh." For instance, "business development" isn't supposed to be just a titular promotion for your top salesperson; it's supposed to mean something.
What about, say, using Mahout to find customer orders that are common outliers? In most companies, most customer orders resemble each other. But what about the orders that happen often enough but don't match common ones? These may be too small for salespeople to care about, but they may indicate a future line of business for your company (that is, actual business development). If you can't drum up at least a couple of good real-world uses for Hadoop, maybe you don't need it after all.
5. Thinking Hive is the be-all, end-all. You know SQL. You like SQL. You've been doing SQL. I get it, man, but maybe you can grow, too? Maybe you should reach deep down a decade or three and remember the young kid who learned SQL and saw the worlds it opened up for him. Now imagine him learning another thing at the same time.
You can be that kid again. You can learn Pig, at least. It won't hurt ... much. Think of it as PL/SQL on steroids with maybe a touch of acid. You can do this! I believe in you! To do a larger bit of analytics, you may need a bigger tool set that may include Hive, Pig, MapReduce, Uzi, and more. Never say, "Hive can't do it, so we can't do it." The whole point of big data is to expand beyond what you could do with one technology.
6. Treating HBase like an RDBMS. You went nosing around Hadoop and realized indeed there was a database. Maybe you found Cassandra, but most likely you found HBase. Phew, a database -- now I don't have to try so hard! Trust me, HDFS-plus-Hive will drain less glucose from your head muscle (IANAD).
The only real commonality between HBase and your RDBMS is that both have something resembling a table. You can do things with HBase that would make your RDBMS's head spin, but the reverse is also true. HBase is good for what HBase is good for, and it is terrible at nearly everything else. If you try and represent your whole RDBMS schema as-is in HBase, you will experience a searing hot migraine that will make your head explode.
7. Installing 100 nodes by hand. Oh my gosh, really? You are going to hand-install Hadoop and all its moving parts on 100 nodes by Tuesday? Nothing beats those hand-rolled bits and bytes, huh? That is all fun and good until someone loses a node and you're hand-rolling those too. At the very least, use Puppet -- actually, use Ambari (or your distribution's equivalent) first.
8. RAID/LVM/SAN/VMing your data nodes. Hadoop stripes blocks of data across multiple nodes, and RAID stripes it across multiple disks. Put them together, what do you have? A roaring, low-performing, latent mess. This isn't eventurducken -- it's like roasting a turkey inside of a turkey. Likewise, LVM is great for internal file systems, but you're not really going to randomly decide all hundred of your data nodes need to be larger, instead of, like, adding a few more data nodes.
And your SAN, your holy SAN -- loved by many, I/O bound, and latent to all. You're using HDFS for a higher burst rate, so now you're going to stick everything back in the box? The idea is to scale horizontally -- how are you going to do that across the same network pipe to the same box o' disks?
Hey, EMC will sell you more SAN, but maybe you need to think outside the box. VMs are great. However, if you want high-end performance, I/O is king. Fine, you can virtualize the name node and much of the rest of Hadoop, but nothing beats bare metal for data nodes. You can achieve much of the same advantage as virtualization with devops tools. Even most of the cloud vendors are offering metal options.
9. Treating HDFS as just a file system. If you dump stuff onto HDFS, you haven't necessarily accomplished anything. The tooling around it is important, of course. Now you can Hive, Pig, and MapReduce it, but you have to think a bit about what, why, and where you're dumping things onto HDFS. You need to think about how you're going to secure all of this and for whom.
10. Whoo, shiney! Also known as, "today is Thursday, let's move to Spark." Yes, Hadoop is a growing ecosystem, and you want to stay ahead of the curve. I feel you, man, but let's remember that freedom is just another word for nothing left to lose. Once you have real data and real users, you don't have the same amount of freedom as when you had no real responsibility. Now you must have a plan.
Fortunately, you have the tools to manage that evolution and move forward responsibly. Maybe you don't get to deploy this week's cool thing while it is fresh, but you don't have to run Hadoop 1.1 anymore, either. As with any technology -- or anything in life -- find that moderate path that prevents you from being the last gazelle in the pack or the first lemming off the cliff.
This is the current top 10 I'm seeing in the field. How's your big data project going? What anti-patterns or patterns have you found?
This story, "The 10 worst big data practices" was originally published by InfoWorld.