You Call That Big Data? More Sensor Data Points Than Stars in the Galaxy…
by Rod Bagg – VP & Architect of InfoSight
InfoSight™ – the predictive analytics engine built into every Nimble array ever shipped – uses big data. So it’s been interesting to see recent announcements of copy-cat products from other storage vendors who have a rather small idea of big data.
Imitation is the sincerest form of flattery – but imitation is not the real thing. For starters, any product can claim it’s predictive, but to provide customers with economically valuable, accurate predictions you need really big data, plus a killer data science team.
Here’s our idea of big data: InfoSight collects more anonymized sensor data points every four hours than there are stars in our galaxy. At over 200 trillion sensor data points collected per year, we have amassed what is probably the biggest repository of storage-related sensor data ever collected.
The key design principles behind InfoSight include:
- Collect the right data – This can only be done because InfoSight was conceived and built as part of the original product’s design. Before any line of code is written at Nimble, careful thought goes into what information needs to be collected. It would be almost impossible to retrofit this kind of instrumentation into a product that has already been built.
- Collect a lot of data from a lot of customers over a long period of time – More than 6,200 customers rely on Nimble arrays that span almost every type of application, environment and use-case. Collected over more than six years, we have more than enough data to achieve statistical significance with our predictive modeling.
- Build a team of world-class data scientists – Nimble employed its first data scientist years before the term became a cool job title. Our data scientists pore over every support case and work to predictively eliminate support issues. More than nine out of ten support cases are initiated by InfoSight to address an issue even before a customer has noticed a problem. Over 85% of these cases are automatically resolved.
It’s About More than Storage
InfoSight collects tens of millions of sensor data points per array per day, not just from the array itself, but also from the surrounding network, servers and virtual machine infrastructure. This allows us to rapidly resolve the most complex performance issues even when they’re not related to storage. For example, at a glance, a user can pinpoint issues inside a VM that are affecting performance, and then see which other hosts this “noisy neighbor” is slowing down – all in real time.
Nimble customers find these capabilities enormously valuable:
“In the space of two months, my Nimble Storage product has detected and allowed me to resolve two non-storage related issues that my other monitoring solutions did not.”
“An unexpected VM was in the top VM list, this time a print server. Someone must be printing something, right? I looked on the Nimble array, at the datastore, and it showed high usage once every five minutes, but all quiet on the printer spools datastore. A quick file search for recently modified files showed the Microsoft BITS download files were constantly modifying. A restart of that service and the Configuration Manager service, and the datastore activity was back to normal.”
“They both couldn’t believe the level of detail and information they could get from InfoSight. Although he tried not to show it, you could see the expression on his face of pure joy.”
Real Data Science, Not Just Pretty Dashboards
Although some Nimble customers are convinced it’s magic, InfoSight actually works by leveraging the latest data science techniques:
- Sliding-window correlation analysis to automatically diagnose resource constraints
- Differential equation models of IO flux in order to assess workload contention
- Personalized forecasts of capacity usage over time by leveraging historical distribution of capacity deltas while performing appropriate outlier detection and filtering
- Automatic comparisons of array performance against similar “peer” workloads on other arrays to identify arrays that are experiencing abnormally high latencies within their peer group
- Bootstrapping and Monte-Carlo methods to quantify the uncertainty of forecasts and sizing predictions.
The resulting numbers from InfoSight look pretty amazing, especially when compared with those published recently by another storage vendor. Here are the results:
- Cumulative hours of latency reduced: Pure Storage 338,311,230 (lifetime) vs. Nimble Storage 441,553,385 (in just the last year)*
- Successful non-disruptive operations completed: Pure Storage 6,799 (lifetime) vs. Nimble Storage 17,347 (in just the last year)
- Data center space reduction, in rack units saved globally: Pure Storage 40,311 (lifetime) vs. Nimble Storage 134,400 (lifetime)
*Pure Storage recently published a statistic on the “performance metric of cumulative hours of latency reduced”. There’s a risk that metrics like this dramatically oversimplify how one should think about performance. From an application perspective, not all latency is created equal: larger ops naturally incur higher latency than smaller ops, and some applications are more latency-sensitive than others. For example, InfoSight’s performance anomaly automation takes op size, application type, and other relevant variables into account when quantifying the relative severity of a latency event. It may not make for flashy marketing numbers – but it makes for smarter customer support.
We’ll close with one of the most powerful metrics provided by InfoSight: For the last two and a half years, Nimble Storage arrays have maintained at least 99.9997% measured availability across our entire installed base.
It’s no wonder that when referring to a competitor’s solution, a prominent storage blogger recently told us “It’s just not InfoSight”.