DataStax Enterprise is the commercial distribution of Apache Cassandra, a column-family NoSQL database originally developed at Facebook and probably best known for powering Netflix. The new 4.5 release of DataStax Enterprise, announced June 30, advances DataStax's case that NoSQL is ready for enterprise applications. It features Apache Spark integration for fast in-memory analytics, Hortonworks and Cloudera integration for easy access to Hadoop data, and new diagnostic and security tools.
Probably the most visible new feature in DSE 4.5 is integration with Spark, which enables execution of advanced data analytics in memory. This is a significant change from MapReduce, which requires that all intermediate and final results be written to disk. That's how Spark can claim speeds up to 100 times faster than MapReduce for the same computation. By running Spark on top of Apache Cassandra, DataStax Enterprise 4.5 becomes the first platform to offer users the ability to perform computations on Cassandra data in near real time.
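To make the disk-versus-memory distinction concrete, here is a minimal Python sketch. It is not Spark or MapReduce code: the two-stage word count, the function names, and the sample data are all assumptions chosen for illustration. The first version writes its intermediate result to a temporary file between stages, roughly as a MapReduce-style job would; the second hands it straight to the next stage in memory.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical sample input; any list of text lines would do.
LINES = ["spark runs in memory", "mapreduce writes to disk", "spark is fast"]


def tokenize(lines):
    """Stage 1: split each line into words."""
    return [word for line in lines for word in line.split()]


def count(words):
    """Stage 2: count how often each word occurs."""
    return dict(Counter(words))


def run_with_disk(lines):
    """Persist the intermediate result to disk between the two stages."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(tokenize(lines), handle)  # stage 1 output goes to disk
        with open(path) as handle:
            words = json.load(handle)  # stage 2 reads it back
        return count(words)
    finally:
        os.remove(path)


def run_in_memory(lines):
    """Pass the intermediate result directly to the next stage."""
    return count(tokenize(lines))


disk_result = run_with_disk(LINES)
memory_result = run_in_memory(LINES)
```

Both functions return the same counts. The only difference is the round trip through the filesystem between stages, which is the cost that in-memory execution avoids.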
DataStax has also bundled Shark in the new release. Shark, a project from UC Berkeley's AMPLab, gives users the ability to run Hive queries using the Spark engine. Now DataStax users who have been running batch analytics in Hive will be able to run those jobs quickly in memory without needing to port their HiveQL code to Cassandra Query Language.
Spark integration continues efforts by DataStax to increase speed and performance. With the 4.0 release in February, DataStax introduced the ability to run transactional workloads on Cassandra data in memory. Now with DSE 4.5, users will be able to leverage both in-memory features to run their entire workloads, transactional and analytical, in memory. This opens up the ability to run fast reads and writes and fast analytics over a unified database, something that until now has been out of reach in the Hadoop ecosystem.
DataStax Enterprise 4.5 offers many other new features beyond Spark integration. As part of the release, DataStax has announced official partnerships with commercial Hadoop distributors Hortonworks and Cloudera. This means that DataStax customers will now be able to query across their DataStax database and Hadoop installation simultaneously. According to Schumacher, "customers can now run a Hive query on our platform that joins together a Cassandra table and then an external Cloudera [or Hortonworks] Hive table in the same query."
Furthermore, users can then store the results in their DataStax database or remotely in their commercial Hadoop installation. This is a significant development for customers needing a way to integrate hot data in a DataStax Cassandra database with historical data stored in a commercial Hadoop installation.
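The shape of such a cross-store query can be sketched in Python with the standard library's sqlite3 module. This is only an analogy, not HiveQL and not DataStax's implementation: SQLite stands in for both systems, and the table names, schema name, and sample rows are hypothetical. The main database plays the role of the operational store holding hot data, and a second, attached database plays the role of the external Hadoop warehouse.

```python
import sqlite3

# Main in-memory database: stands in for the operational store (hot data).
conn = sqlite3.connect(":memory:")
# Second in-memory database attached under the schema name "hist":
# stands in for the external Hadoop warehouse (historical data).
conn.execute("ATTACH DATABASE ':memory:' AS hist")

conn.execute("CREATE TABLE orders_today (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE hist.orders_archive (customer_id INTEGER, amount REAL)")

conn.executemany(
    "INSERT INTO orders_today VALUES (?, ?)",
    [(1, 20.0), (2, 5.0)],
)
conn.executemany(
    "INSERT INTO hist.orders_archive VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 30.0)],
)

# A single query that joins one table from each store.
rows = conn.execute(
    """
    SELECT t.customer_id, t.amount AS today, SUM(a.amount) AS historical
    FROM orders_today AS t
    JOIN hist.orders_archive AS a ON a.customer_id = t.customer_id
    GROUP BY t.customer_id, t.amount
    ORDER BY t.customer_id
    """
).fetchall()

# Write the joined result back to the operational side.
conn.execute(
    "CREATE TABLE customer_totals (customer_id INTEGER, today REAL, historical REAL)"
)
conn.executemany("INSERT INTO customer_totals VALUES (?, ?, ?)", rows)
stored = conn.execute("SELECT COUNT(*) FROM customer_totals").fetchone()[0]
conn.commit()
conn.close()
```

The result table could just as easily be created under the `hist` schema, which mirrors the choice between storing results locally or in the remote installation.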
Formally, DataStax will support such integration for only the current and most recent prior versions of Hortonworks and Cloudera Hadoop. If you're willing to forgo the benefits of formal support, however, you'll also be able to integrate your DataStax Enterprise installation with an open source Hadoop installation.
DSE 4.5 offers several other perks as well. The version of OpsCenter, DataStax's visual cluster management interface, which now ships with the release, supports clusters of up to 1,000 nodes. The new release also comes with improved diagnostic and security tools.
The new data dictionary API is a Cassandra Query Language API that provides performance analysis from the cluster level all the way down to the level of individual nodes and even individual statements. The new API was designed to be accessible to users coming from the relational database world who may be familiar with Oracle's V$ Views or Microsoft SQL Server's performance tables. Previously, DataStax required users to access diagnostic tools through a JMX-enabled Java API.
The increased control extends to security. With the 4.5 release, administrators can control a user's access to each cluster and even specify which commands a user is able to execute against a given cluster. That kind of granular security control is a feature that Schumacher says many of DataStax's large customers have been asking for.
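As a rough illustration of what per-cluster, per-command control means, consider the following Python sketch. It does not show DataStax's actual security mechanism or syntax; the user names, cluster names, and the `GRANTS` structure are hypothetical, chosen only to show the two levels of the check.

```python
# Hypothetical grants: user -> cluster -> commands the user may run there.
GRANTS = {
    "analyst": {"analytics": {"SELECT"}},
    "app_service": {"production": {"SELECT", "INSERT", "UPDATE"}},
}


def is_allowed(user, cluster, command):
    """Return True only if this user was granted this command on this cluster."""
    allowed_commands = GRANTS.get(user, {}).get(cluster, set())
    return command.upper() in allowed_commands


# The analyst can read from the analytics cluster but cannot touch production.
analyst_reads_analytics = is_allowed("analyst", "analytics", "select")
analyst_reads_production = is_allowed("analyst", "production", "select")
# The application account can write to production but cannot delete.
service_deletes_production = is_allowed("app_service", "production", "delete")
```

Anything not explicitly granted is denied, including unknown users and unknown clusters.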
Still, Schumacher acknowledges that DataStax has more work to do to match the ease of use of commercial RDBMS solutions.
Take the new diagnostic features in DSE 4.5, for instance. The new data dictionary API provides detailed diagnostic tools. But DataStax has yet to integrate the new API into its OpsCenter visual tool suite. Schumacher calls that integration "the next step," whereas the first step was "getting the diagnostic objects set and ensuring their performance…. But yes, there'll be a whole suite of new overview displays of that type of information that will allow you to point and click your way down to that detail, absolutely. It's a no-brainer."
In the meantime, the OpsCenter 5.0 release, slated for later this month, will include a new "best practices expert" that will be able to scan your cluster and look for deviations from prescribed best practices. The system will also be able to check security, storage, and memory settings, as well as provide advice about fixing any anomalies.
Similarly, DataStax has no current plans for integration with popular system and network management tools, such as the product formerly known as HP OpenView or IBM's Tivoli Management Framework. For customers looking to integrate DataStax management into a unified system management framework, the only current option is to develop a custom solution using OpsCenter's REST API. For most customers, OpsCenter's visual dashboard will remain the most viable way of managing their DataStax clusters.
Schumacher sees continued ease-of-use improvements as one of DataStax's main areas for future development. Previous releases have already made improvements in terms of automated repairs and consistency checks, as well as automatic capacity planning. In the future, DataStax hopes to augment its Automatic Management Services suite with automatic scale-out, automatic upgrades, and automatic backup and recovery. All of this is intended to smooth the transition from an enterprise RDBMS to DataStax.
That focus on ease of use is one part of DataStax's argument that NoSQL is ready for production enterprise environments. The other is performance. With Spark integration, DataStax should be able to provide subsecond latency for most queries. It's hoping to attract customers who have been hesitant to adopt Hadoop as a big data solution because of the high latencies involved in MapReduce. As Schumacher puts it, "We just want to continue to be the go-to NoSQL database for scale and performance."
Ultimately, scale, performance, and ease of use will probably all play a pivotal role in driving adoption of NoSQL solutions by large enterprises. DataStax hopes to lead that trend.
This story, "DataStax Enterprise 4.5 turbocharges speed and security," was originally published by InfoWorld.