This week on the New Tech Forum, we're taking a look at the challenges of traditional storage and compute in the world of big data -- and the growing role of object storage and integrated compute resources.
Jason Hoffman, CTO and founder of cloud service provider Joyent, details how combining object storage and parallel compute clusters can make working with big data easier and faster by eliminating bottlenecks.
How objects and compute will eat the world
Networked storage vendors' days are numbered. Customers are fleeing to consolidated online object storage, and soon object storage will surpass traditional file storage as the primary model for data outside of a DBMS. But there's a subtle and often unappreciated downside to most distributed object storage: data inertia. The implicit limits on moving huge data sets to in-network compute nodes keep business and clinical insights from ever surfacing.
At Joyent, we architected the Manta Storage and Compute Service -- "Manta," for short -- to be a best-in-class object store and an in-storage massively parallel compute cluster. It drives data latency effectively to zero, moving weekly or monthly jobs to an hourly or even an on-demand analytic cadence.
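To make the idea of an in-storage compute cluster concrete, here is a toy sketch of the compute-to-data pattern Manta embodies. This is not Joyent's Manta API; the StorageNode, run_map, and cluster_job names are hypothetical, and the point is simply that the function ships to where each object lives, so only small results cross the network.

```python
# Toy sketch of the compute-to-data pattern (hypothetical names, not Manta's API).
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class StorageNode:
    """A storage server that can also execute code against its local objects."""

    def __init__(self, name, objects):
        self.name = name
        self.objects = objects          # object name -> bytes, held locally

    def run_map(self, map_fn):
        # The function ships to the data: only small results leave the node.
        return [map_fn(name, data) for name, data in self.objects.items()]


def cluster_job(nodes, map_fn, reduce_fn):
    """Run map_fn on every node in parallel, then reduce the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda n: n.run_map(map_fn), nodes)
    flat = [result for per_node in partials for result in per_node]
    return reduce(reduce_fn, flat)


if __name__ == "__main__":
    nodes = [
        StorageNode("node1", {"log-2013-06-01.txt": b"GET /a\nGET /b\n"}),
        StorageNode("node2", {"log-2013-06-02.txt": b"GET /c\n"}),
    ]
    # Count log lines across the whole store without moving the logs.
    total = cluster_job(
        nodes,
        map_fn=lambda name, data: data.count(b"\n"),
        reduce_fn=lambda a, b: a + b,
    )
    print(total)  # -> 3
```

In a real deployment the map phase would run on the physical servers holding the objects, which is what drives the data-movement latency toward zero.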
Whence big data?
Massive data volumes arise from machines (log files, API calls), digitized nature (DNA sequences, video, audio, environmental sensors), and the humanity of billions of people online (Facebook, Baidu, e-commerce). Take a mere 10 million patients' genomes, for example: at roughly 2TB of sequence data per patient, that requires 20 exabytes (EB) of storage. Then there's camera phone resolution and market penetration, both of which have been growing exponentially. And according to Digital Marketing Ramblings, Twitter distributes 400 million updates per day to 500 million subscribers.
But in 2012, all enterprise storage vendors combined shipped just 16EB of capacity.
With the big data wherewithal to capture it all, we could be at the early stages of a deeply disruptive wave of innovation. This sweeping crush threatens business models and technical architectures that assumed a paucity of data and scarcity of places to put it.
The additional hidden cost of networked object storage is the implicit inertia in petabytes of recorded audio or e-commerce server logs: that data must be moved from its resting place to a computational node. Computation is what gleans the business insights, social relevancy, and clinical results that make saving digital ephemera worthwhile. A theoretical 1Gbps or even 40Gbps network is a severe limit on the class of algorithms that can be considered and the rate at which they can be applied.
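A back-of-the-envelope calculation shows how severe that limit is. The sketch below is simple arithmetic, not a benchmark (real throughput would be lower still), computing how long it takes just to move a petabyte over the link speeds mentioned above.

```python
# Time to move a data set to a remote compute node at a given link speed.
PB = 10**15  # bytes in a petabyte (decimal)

def transfer_days(dataset_bytes, link_gbps):
    """Days needed to push dataset_bytes over a link of link_gbps gigabits/s."""
    bytes_per_second = link_gbps * 10**9 / 8
    return dataset_bytes / bytes_per_second / 86400

for gbps in (1, 10, 40):
    print(f"1 PB over {gbps:>2} Gbps: {transfer_days(PB, gbps):6.1f} days")

# 1 PB over  1 Gbps:   92.6 days
# 1 PB over 10 Gbps:    9.3 days
# 1 PB over 40 Gbps:    2.3 days
```

Even on a dedicated, perfectly utilized 40Gbps link, a petabyte spends more than two days in transit before the first algorithm can touch it.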
What is object storage?
Objects differ from files in one key respect: They are immutable once written. An object can be replaced or overwritten in its entirety, but the in-place update allowed by a POSIX or similar file system is verboten. In practice this is not a severe constraint, especially given that most object data itself is immutable: DNA rarely mutates, log tampering is bad, cat videos are remixed and republished. Besides, mutable data winds up in a richly indexed DBMS, which itself may reside on an object or file store.
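The semantic difference is easy to see in code. Below is a minimal sketch using a toy in-memory "object store" (hypothetical, not any vendor's API): objects can only be written or replaced whole, whereas a POSIX file permits the seek-and-overwrite that object stores disallow.

```python
# Toy illustration of whole-object writes vs. POSIX in-place updates.
class ToyObjectStore:
    def __init__(self):
        self._objects = {}  # key -> immutable bytes

    def put(self, key, data: bytes):
        """Create or replace the entire object; the only way to 'update'."""
        self._objects[key] = bytes(data)

    def get(self, key) -> bytes:
        return self._objects[key]


store = ToyObjectStore()
store.put("videos/cat.mp4", b"v1 bytes")

# "Updating" an object means uploading a complete new version:
new_version = store.get("videos/cat.mp4").replace(b"v1", b"v2")
store.put("videos/cat.mp4", new_version)

# Contrast with a POSIX file, where a writer may patch bytes in place:
with open("/tmp/cat.mp4", "wb") as f:
    f.write(b"v1 bytes")
with open("/tmp/cat.mp4", "r+b") as f:
    f.seek(0)
    f.write(b"v2")  # in-place update: exactly what object stores forbid
```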