Shining some light on All Flash Array performance

The industry is finally seeing widespread adoption of All Flash Arrays (AFAs) now that the cost of flash technology has made them reasonably affordable for Enterprise customers.  They represent the next jump in storage technology, one that will cause storage professionals to unlearn what we have learned about performance, cost to value calculations, and capacity planning.

The first arrays to market used existing architectures just without any moving parts.

Not much thought was put into their designs; flash simply replaced spinning disk, with questionable long term results.  They continue to use inefficient legacy RAID schemes and hot spares.  They continue to use legacy processor + drive shelf architectures that limit scalability.  If they introduced deduplication (a natural fit for AFAs), it was an afterthought: a post-process job that didn’t always execute under load.

IDC has recently released a report titled All-Flash Array Performance Testing Framework, written by Dan Iacono, that very clearly outlines the new performance gotchas storage professionals need to watch out for when evaluating AFAs.  It’s easy to think AFAs are a panacea for performance problems.  While it’s hard not to achieve better results than spinning media arrays, IDC does a great job outlining the limitations of flash and how to create performance test frameworks that uncover the “horror stories,” as IDC puts it, in the lab before purchases are made for production.

Defining characteristics of flash based storage

You can’t overwrite a cell of flash memory like you’d overwrite a sector of a magnetic disk drive.  The first time a cell is written, the operation occurs very quickly.  The data is simply stored in the cell, basically as fast as a read operation.  For every subsequent re-write, though, you must first erase a block of cells and then program them again with the new data.  This adds latency to every incoming write IO after the first, and testing should account for it by making sure enough re-writes occur to uncover the performance of the device over time.

Flash wears out over time.  Each time a flash cell is erased and re-written it incurs a bit of damage or “wear.”  Flash media is rated by how many of these program erase (PE) cycles can occur before the cell is rendered inoperable.  SLC flash is typically rated at 100,000 PE cycles.  Consumer MLC (cMLC) is rated around 3,000, whereas enterprise MLC (eMLC) must pass higher quality standards to be rated for 30,000 PE cycles.  Most drives provide a wear-leveling algorithm that spreads writes evenly across the drive to mitigate this.  Workload patterns, though, might still cause certain cells to be overwritten more than others, so wear leveling is not a cure in all cases.
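To put those PE ratings in perspective, here is a rough back-of-the-envelope sketch in Python. It is not a vendor formula. It assumes perfect wear leveling, and the drive size, daily write volume, and write amplification factor are made-up inputs for illustration only; the function name is hypothetical as well.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                             write_amplification=1.0):
    """Rough years until the rated PE cycles are used up.

    Assumes perfect wear leveling (every cell wears at the same rate),
    which real workloads will not fully achieve.
    """
    if host_writes_gb_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write volume and write amplification must be positive")
    # Total data the flash can absorb before every cell hits its rating.
    total_writable_gb = capacity_gb * pe_cycles
    # Data actually written to flash per day, after internal overhead.
    flash_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_writable_gb / flash_writes_gb_per_day / 365.0


# PE ratings quoted in the paragraph above.
RATED_PE_CYCLES = {"SLC": 100_000, "eMLC": 30_000, "cMLC": 3_000}

# Hypothetical drive and workload: 400 GB drive, 2,000 GB of host writes
# per day, write amplification of 3 (all three numbers are assumptions).
for media, cycles in RATED_PE_CYCLES.items():
    years = estimated_lifetime_years(400, cycles, 2000, write_amplification=3.0)
    print(f"{media}: roughly {years:.1f} years")
```

The point is the ratio more than the absolute numbers: under the same workload, a 10x difference in rated PE cycles is a 10x difference in expected media life.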

Erase before write activity can lock out reads for the same blocks of cells until the write completes.  Different AFA vendors handle data protection in different ways, but in many cases, mixed read/write workload environments will exhibit greatly reduced IOPS and higher latencies than the 100% read hero numbers most vendors espouse.  This is yet another reason to do realistic workload testing to reset your own expectations prior to production usage.

How these flash limitations manifest in AFA solutions

Performance degrades over time.  Some AFA solutions will have great performance when capacity is lightly consumed, but over time, performance will greatly diminish unless the implementation overcomes the erase-before-write and cell locking issues.  Look for technologies that are designed to scale, with architectures that overcome the cell locking issues inherent to native flash.

Garbage collection routines crush the array.  Garbage collection routines clean up cells marked for erasure, and if they are not handled properly they can crush array performance under load.  In IDC’s testing, this led to wildly fluctuating AFA performance: sometimes good, sometimes horrible.  Not all arrays exhibit this behavior, and only testing will separate the good from the bad (because the vendors won’t tell you).
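One way to make “wildly fluctuating” measurable in your own lab is to compare tail latency against the median rather than looking at averages alone. The sketch below is my own illustration, not something taken from the IDC report. The function name is hypothetical, and the sample values are invented purely to show the call.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize latency samples so that tail spikes stand out.

    Uses the nearest-rank method for the 99th percentile.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples_ms)
    median = statistics.median(ordered)
    # Nearest-rank p99: rank = ceil(0.99 * n), done in integer arithmetic.
    rank = -(-99 * len(ordered) // 100)
    p99 = ordered[rank - 1]
    return {
        "median_ms": median,
        "p99_ms": p99,
        "max_ms": ordered[-1],
        # A large ratio suggests periodic stalls, e.g. from garbage collection.
        "p99_to_median": p99 / median if median > 0 else float("inf"),
    }


# Invented samples: mostly ~1 ms with two large spikes.
example = [1.0] * 98 + [50.0, 60.0]
print(latency_summary(example))
```

Run the same summary over a long test with a realistic read/write mix and a mostly full array; a steady array keeps the p99-to-median ratio fairly flat over time, while one that is struggling with garbage collection will not.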

$ per usable GB is surprisingly inflated, due to inefficient thin provisioning and deduplication or to best practice requirements to leave unused capacity in the array.  Comparing the cost of the raw installed capacity of each array is the wrong way to measure the true cost of the array.  Make sure you look at the true usable capacity after RAID protection, thin provisioning, deduplication, spare capacity, requirements to leave free space available, and any other mysterious system capacity overheads imposed but undisclosed by the vendor.  The metric you’re after is dollar per usable GB.
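The arithmetic behind that metric can be sketched in a few lines of Python. This is a simplified model under my own assumptions: each overhead is treated as a simple fraction applied in sequence, data reduction is a single ratio, and every number in the example is made up. Real arrays account for overhead in vendor-specific ways, so plug in figures you have measured yourself.

```python
def usable_capacity_gb(raw_gb, raid_overhead, spare_fraction,
                       reserved_free_fraction, data_reduction_ratio=1.0):
    """Usable capacity after overheads, then scaled by data reduction.

    Each overhead is a fraction between 0 and 1 of what is left after the
    previous step. data_reduction_ratio is e.g. 2.0 for 2:1 dedup.
    """
    physical = (raw_gb
                * (1 - raid_overhead)
                * (1 - spare_fraction)
                * (1 - reserved_free_fraction))
    return physical * data_reduction_ratio


def dollars_per_usable_gb(price, **capacity_args):
    """Purchase price divided by usable (not raw) capacity."""
    return price / usable_capacity_gb(**capacity_args)


# Hypothetical array: $100,000 for 10,000 GB raw, 20% RAID overhead,
# 5% spare capacity, 20% recommended free space, no data reduction.
raw_cost = 100_000 / 10_000
usable_cost = dollars_per_usable_gb(
    100_000, raw_gb=10_000, raid_overhead=0.20,
    spare_fraction=0.05, reserved_free_fraction=0.20)
print(f"$/raw GB: {raw_cost:.2f}  $/usable GB: {usable_cost:.2f}")
```

Use the data reduction ratio you observe on your own data in testing, not the one on the data sheet, since that single number can swing the result more than all the other overheads combined.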

Check out the IDC report. It’s a great education about AFAs, and provides a fantastic blueprint to use when testing AFA vendors against each other.

Overselling Public Cloud Idealism to Enterprise IT Private Cloud Customers

Private Cloud in Enterprise IT has a lot of strong value propositions. Running Enterprise IT more like a Service Provider creates a huge potential win for everyone involved by transforming to a just in time financial model, removing fragility and risk using more automated systems deployments, being more responsive to business needs and quicker to market through radical standardization, and so on. This added value of Private Cloud has been discussed over the last several years, and should be well understood given the rate at which customers are adopting the architectures.

I’m beginning to see that the hype around cloud is causing Enterprise IT decision makers to overlook the basic blocking and tackling of performance analysis, resiliency, availability, recoverability, etc. Developers by and large don’t have these skill sets. It’s the infrastructure architects, engineers, and operators that must continue to provide these aspects of Enterprise computing.

Here are a few idealistic (and perhaps false) characteristics of the Public Cloud that will not be readily available to most Enterprise IT within the next 5 years, and are currently being oversold to the Enterprise buyer.

Cloud is Automatically More Resilient to Failure

Does the “cloud” provide any additional reliability to the application? No, not as such. It’s still a bunch of technology sitting in some datacenter somewhere. Drives fail, nodes fail, connectivity fails, power fails, tornadoes and floods happen, etc. It’s very dangerous to assume that calling an infrastructure a “cloud” makes it any more resilient to failure than legacy infrastructure designs. “Cloud” infrastructures can quickly become very large baskets in which to put all our eggs. I’m not going to say the chance of failure is any greater, but certainly the impact of failure can be much more widespread and difficult to recover from. Site failures still take out large chunks of infrastructure unless traditional D/R solutions are in place to provide Business Restart, or next-gen active-active infrastructures provide Business Continuity.

I’ve discussed this assumption with many directors of IT who tell me the applications they run are the most unstable aspect of their IT environment. The promise of “cloud” is that the resilience issues will be handled by the application itself. Why do we honestly expect Enterprise application developers to quickly transform themselves into savvy cloud-scale availability experts in the near term? Applications will continue to be buggy and unstable. Enterprises will continue to invest in products that provide infrastructure level reliability, recoverability, and continuity.

Scale Out of Any Given Application or Database

Let’s address the traditional three tier (presentation, app logic, database) model of Enterprise application deployment. If you’re lucky, the application you’ve purchased for use in the Enterprise allows a certain level of scale out. The Presentation and App tiers are now designed to allow additional nodes to support additional workload, but the database tier is a monolith. I say if you’re lucky, because many COTS apps are still entirely monolithic in nature, and do not follow this “standard model.”

Can you take a monolithic workload and put it on a “cloud” and have it magically adopt a scale out capability the cloud infrastructure provides? No, of course not. If the app is not aware of additional nodes it can’t “scale out.” It can’t inherently load balance, etc. The best you can do is virtualize this application with other monolithic applications and consolidate them onto a common infrastructure. We’ve all been doing this for years.

… and What About Scale Up?

Think about that monolithic database tier, or that other app (you know the one) that demands a larger footprint than the small node-based scale out architectures can provide. Enterprise IT has traditionally been a Scale Up type environment. Consolidation has been in play for a long time, prompting the development of large infrastructure elements to hold these individual components. Times are shifting to more and more scale-out models, but it’ll take years for most of the Scale Up architectures to retire, and many never will. The best infrastructures can scale up and out, providing node based designs that allow for affordable growth while architecturally adding firepower to existing apps so they can scale up to meet additional demand.

Cloud is Automatically Less Expensive

Scale-out promises deferred spend as needs arise, but ultimately if you want to purchase a pool of resources (build it and they will come), you’re necessarily over-purchasing for what you need today. The discipline of Capacity Planning is even more important to engage, because there is an important counterbalance at play: the agility of having a pool of expensive resources sitting idle vs. the cost benefits of just in time purchases and rapid deployments, which may slow down the “time to serve” for applications.

Scale is an important factor in the cost of “cloud.” Many enterprises will need to invest significantly in their first foray into a converged cloud infrastructure, because it starts off so small. Small in my mind today is anything less than 100TB. Much of the financial benefit the public cloud providers gain is due to their large scale. Cost per GB / IOP / Socket / Port goes down quickly when those fixed costs are prorated across an ever growing population of application customers.
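The proration effect is simple arithmetic, sketched below in Python. The model (one lump of fixed cost plus a flat per-GB variable cost) and every dollar figure are my own assumptions for illustration, not real pricing from any provider.

```python
def cost_per_gb(fixed_cost, variable_cost_per_gb, capacity_gb):
    """Cost per GB when a fixed cost is spread over the deployed capacity."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return fixed_cost / capacity_gb + variable_cost_per_gb


# Hypothetical: $500,000 of fixed platform cost plus $2 per GB of capacity.
small = cost_per_gb(500_000, 2.0, 100_000)    # about 100TB
large = cost_per_gb(500_000, 2.0, 1_000_000)  # about 1PB
print(f"100TB: ${small:.2f}/GB   1PB: ${large:.2f}/GB")
```

With these made-up inputs the fixed cost dominates at 100TB and nearly disappears at 1PB, which is the advantage a large provider has over a first Enterprise deployment.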

Enterprise IT Professionals Automatically Know How to Deploy and Operationalize Cloud Models

The transformation and retooling of our people is a large obstacle to Private Cloud models. Silicon Valley can create all kinds of new infrastructure technologies to host our virtual machine environments, but it’ll take years before most Enterprise IT shops are capable enough to utilize them. The bridge to the future is running existing technology stacks in new ways. The technical skills of our people can still be leveraged, and the new processes involved in Private Cloud deployments can be experimented with and refined. Ultimately new software driven architectures will supplant the current hardware infrastructure based models, but not until they provide the same site level continuity capabilities enjoyed by IT today.

I hope I didn’t rain on your cloud. I do absolutely understand that the Cloud / Social / Mobile / Big Data world is the 3rd Platform of computing technology. I fully embrace the added value to the Enterprise Business Users of the Cloud models. I just think that we can’t assume Public Cloud hype (much of which isn’t true anyway) will be fully applicable to Enterprise Private Cloud in the near term.

Thoughts?