Techs to CEOs feel the burden of blame in Equifax security breach

The Krebs Security headline reads:  “Breach at Equifax May Impact 143M Americans”[1]

It’s a headline that is sure to make a lot of people angry, across a spectrum of people who rarely agree on anything.  All of our private data… naked and exposed to the most threatening of examination, splayed out there by the huge Equifax security breach, to be seen by any unsavory types willing and able to take advantage of such a humiliating blunder.

And all of it because some piece of code, in some seemingly innocuous framework[2], did not get patched, even though it was long known to be a security hole.  Seems like a simple thing to avoid.  We naturally ask:  “How could they have been so stupid with our cherished data?”

It’s also natural to be caught up in the rush of anger and frustration, but if one of your goals is to make sure that these things happen much less often, it’s important to keep a level head.

We need to study how a simple, “routine” delay in patching some piece of software ended up having such a worldwide impact.  We need to understand why it happened in order to take the appropriate steps to make it happen far less often.

Studying the problem at any depth quickly makes it obvious that it is critical to understand the “work-day in the life of” the people that manage these systems.

When you truly, earnestly put yourself in the position of the technologists and administrators, things will probably look a lot different once you take into account the reality of the challenges faced by people in these careers.

Imagine what you’d be thinking if you were the one responsible for the systems that caused such widespread chaos, while the headlines keep echoing the story of your delay’s impact.


Here’s what you, or an admin who got caught in the middle of this, might write or say…

The Krebs Security headline reads:  “Breach at Equifax May Impact 143M Americans”[1].

There it is.  My secret’s out.  I knew it was probably going to get out eventually, but I’m still not ready for it… now that it’s actually happened.

And now… everyone is talking about the catastrophic data breach.

My catastrophic data breach.  Great.

Well…  I know I’m at least significantly responsible.  I might not be the only one, but keeping those critical systems updated really is one of the myriad things I’m responsible for.  It helps to prevent problems.   Yes…  Problems just like this data breach, and many others.

If our patching process weren’t eight miles long, and didn’t require the dedicated time and patience of Sisyphus, then my reputation wouldn’t have this stain on it.  I need to determine if it would be better to update my resume to reflect a few years of involuntary institutionalization, rather than acknowledge my years and role at Equifax… just to improve the chances I find a new job quickly when this one blows up.

As it sinks into my mind how bad the situation really is, and how bad it is going to be, sudden, almost impossible rivulets of sweat start streaming down my back.  I struggle to focus, as I surf the overwhelming nausea of an adrenaline surge that cannot be heeded, and brace myself to endure the long-term grinding churn at the pit of my stomach.

And the overwhelming sense of impending doom…  If I could just overcome that, I think the situation would be tolerable.

I think back to a time when I thought I was under extreme work-stress, so many years ago that I had almost forgotten it, when that $2.2M Sun setup crashed.  Even though it was not at all my fault, and my warnings to correct the situation went unheeded, I still felt a lot of compassion for the client company.  Already financially at the very end of their rope and completely unable to scale as they had planned, and therefore desperate to sell their company to a major IT player, they had their only demo-able environment crash, for perfectly predictable reasons, about 75 minutes before they went in to give their last-ditch, one-shot demo of their great, highly fault-tolerant solution.  The meeting did not go well and, sadly, they went out of business very shortly after that.  It wasn’t my fault, but I still felt really bad for them… and a lot of stress.

Compared to this world-shattering security shockwave, that company-ending debacle was nothing.  There was really nothing more I could have done in the case of that old Sun system, but in this case, I feel like I should have had the tools to be able to prevent this from happening in the first place.  Surely, by now the technology must exist to enable and empower this.

I was trying to keep up, but it’s just not humanly possible.  It’s just too much.

I know most people will never really understand, and some won’t even try, but some of you must know how it is.  You’ve been there too, haven’t you?

You’ve got a bunch of different technologies that have been pulled together to try to get as much of a competitive edge as you can by identifying elusive patterns in the torrential flood of structured and unstructured data that we now have flowing through our fingertips.

It’s hard enough just getting it all set up… and then trying to keep it all going, let alone actually being able to quickly, easily, and inexpensively perform patching, and all the other complex application lifecycle tasks that we know we need to complete.  There are so many teams that depend on us to enable them to be more efficient and productive, so that we can all deliver better applications to support the needs of our organization.

Of course, you NEED to test even critical updates BEFORE you push them to the production systems.  You need an easy, fast, and inexpensive way to safely test your patch/upgrade process in an exactly-equivalent environment BEFORE you apply them to production.

Of course, you still have to bring new capabilities to your organization as well, but your maintenance and application-lifecycle tasks (like patching, deploying, scaling, cloning, refreshing, upgrading, …) are taking up so much of your time that you cannot proceed as fast as you’d like, or as fast as your team needs.

It’s only human that something slips.  You’re not perfect, and even if you were… you can only do as much as your tools enable you to do.

As you continue to read the startling data breach headlines, and hear the rumors, and see people you’ve worked with and respected make the quiet, box-wielding walk one last time to their car, you start to become very concerned about your situation as well.

Simply not having access to the caliber of tools that make it easy to avoid these kinds of application maintenance problems might just end up being perceived as some rookie CLM (career-limiting move)[3].

It’s hard to be sure if you should feel better or worse that others are feeling the burden of blame too.  This is impacting the careers and lives of so many good, hardworking, well-intentioned people.  It’s already cost two major executives their jobs[4], and then today I was amazed to see that the CEO “resigned” suddenly[5].  I do feel bad for them, but…  at least it’s not just the little guys getting in trouble when we’re doing the best we can with these primitive tools and processes.  Hopefully, this will encourage the new management team to listen to us when we try to explain how challenging it actually is to run these systems, and that we need better tools that allow synergy between all our environments.

Now that the news of my contribution to this whole mess is out, I’m urgently searching for a way to make sure this never happens again.

I’ve been scouring the web… looking for solutions that can help me wrangle this rodeo of technologies and I’m finding so many different point-solutions that can help me in so many different areas, but most of them don’t even come close to being able to support our needs in all the different environments we need to deliver.

I really don’t want to have so many non-interoperable tools and technologies for all my various environments, not only because it makes it so hard to manage everything, but also because it makes it basically impossible to actually test how things are going to work as you transition from ‘Dev to Test to Staging to Production to Scaling to Patching to Upgrading to Cloning to Refreshing’ complete application-stacks and data-pipelines.  You know… all those critical activities being carried out by the DevOps and Production teams.

The good news is that I’ve found that Robin can solve all of these problems for me.

  1. Application Upgrade/Patch Pre-Validation – Robin’s application-stack/data-pipeline thin-cloning capability can create exact copies of my production application environment, without copying the data, achieving fully-functional clones amazingly quickly, easily, and inexpensively. Your upgrades/patches/etc can be safely performed and validated in a fully isolated environment before being pushed to production.
  2. Application In-Place Upgrade/Patch – Once you have validated that your changes are safe, unlike most container-consolidation platforms, Robin also makes it surprisingly easy/fast/inexpensive to update the container-based applications live, in-place.
  3. Host Upgrading – Robin also makes the process of upgrading host OS/Firmware very easy, either by simply shutting down the application for a few minutes and relocating it to another physical host or, in the case of no-downtime databases, by making it so easy to stand up a replica database, get the two in sync, switch the master to the replica, and then shut down the original DB for the host upgrade.
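To make the first two items concrete, here is a minimal sketch of the “validate in a clone, then patch production” gate.  It is only an illustration of the idea, not Robin’s actual API: the `Environment` class, the `patch_with_prevalidation` function, and the package names are all hypothetical, and a real clone would be a full application stack rather than a dictionary of versions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Environment:
    """Hypothetical stand-in for an application environment (not a Robin object)."""
    name: str
    packages: Dict[str, str] = field(default_factory=dict)  # package -> version

    def clone(self, name: str) -> "Environment":
        # An isolated copy: changes to the clone never touch the original.
        return Environment(name=name, packages=dict(self.packages))

    def apply_patch(self, package: str, version: str) -> None:
        self.packages[package] = version


def patch_with_prevalidation(
    prod: Environment,
    package: str,
    version: str,
    checks: List[Callable[[Environment], bool]],
) -> bool:
    """Patch a clone first; touch production only if every check passes."""
    staging = prod.clone(f"{prod.name}-patch-test")
    staging.apply_patch(package, version)
    if not all(check(staging) for check in checks):
        return False  # production is left exactly as it was
    prod.apply_patch(package, version)
    return True


# Usage: a failing check blocks the patch, a passing check lets it through.
prod = Environment("prod", {"web-framework": "1.0.0"})
is_patched = lambda env: env.packages["web-framework"] == "1.0.1"
always_fails = lambda env: False

rejected = patch_with_prevalidation(prod, "web-framework", "1.0.1", [always_fails])
version_after_reject = prod.packages["web-framework"]
accepted = patch_with_prevalidation(prod, "web-framework", "1.0.1", [is_patched])
version_after_accept = prod.packages["web-framework"]
```

The point of the design is that production is only ever modified after the same change has survived every check in an isolated copy, so a bad patch costs a throwaway clone instead of an outage.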

If I already had these abilities alone, I wouldn’t be in this huge mess.  Should I update my resume now, or just hang on and hope for the best?  Never mind, rhetorical question.  Focus.

The thing is, if Robin was helping me solve these problems, not only would I not be in this current nightmare situation, but I’d also be able to really be a hero for my organization.  I mean the whole thing, from end-to-end.

We could have amazingly fluid transitions between our teams and phases, and dynamically scale our apps vertically, on our existing hardware, as well as horizontally and vertically with new systems, without changing our technologies or tools.  With Robin, we’d be able to easily and inexpensively automate:

  1. Dev/Test:
    1. Self-Service: We’d be able to have a system where these groups can be completely self-sufficient.  They wouldn’t need help from the Storage, Networking, Server, or Apps specialists.  No more arduous meeting schedule to plan and execute essential complex lifecycle tasks.
    2. Individual Thin-Clones: Each person can have their own thin clone of the application stack, so anyone can do whatever they want, without worrying about causing trouble for their team members.
    3. Application-Stack Refresh: Each person can take snapshots of their application-stack’s configuration and data at different times, and refresh/restore the entire application back to these snapshot points.
  2. Production:
    1. Application Upgrade/Patch Pre-validation: Like I said, this is so easy with Robin.
    2. Application In-Place Upgrade/Patch: This too is easy, cheap, fast, and safe.
    3. Host Upgrades: Easy, as I mentioned earlier.

On top of preventing these three problems, which would have kept me out of this mess, Robin can also provide:

  1. Quality-of-Service (QoS) – Robin is unique, in the stateful-application consolidation space, in being able to set and enforce the IOPS for any specific volume, of any specific container, acting as a host for any specific application-stack role, for any specific instance of any application-stack. In addition, RAM and CPU can be set and enforced as well.
  2. Vertical Scaling – Robin also has the control to allow QoS to be dynamically scaled, vertically, by varying IOPS/CPU/RAM as needed… per volume, per role, per application instance.
  3. Horizontal Scaling – Now, this one is not so hard, if you’ve really got good abilities, so of course Robin’s got it down too.
  4. Dev/Test-Master-From-Prod – They want it again? (They actually want it a lot more often, this is just the least they can accept.)  Robin’s thick- or thin-clones are great solutions for this, for the entire application stack.  Then, they can deal with all the Dev/Test environment churn.  I’ll focus on more productive things.
  5. DR/HA – Robin can also easily handle real-world DR and HA scenarios, enabling complex strategies and making it so easy to run your workloads anywhere.

Wow, I really didn’t expect to go into all of that right now.  Thanks for listening.  I don’t really know what I’m going to do about the data breach tsunami, and the tidal waves that are threatening so many now, including me.  I need more time to think about what I can do to improve a situation I have no way to change.  It’s in the immutable past.

I really just want to sit, and think, and try to figure out what to do, to come up with something I can do to help the situation, but I think I really need to focus on the future.  One thing I know:  I don’t want this to EVER happen again on my watch.

I could go out and start stitching together a bunch of different open source projects to try to get some of the features I need, but I know, from reading all the reports on these kinds of DIY efforts, that I’ll end up with a difficult/expensive-to-maintain system that is not able to handle most of the use cases I need.

It certainly won’t be an environment where I can easily manage my applications as the entities that they are, throughout their entire lifecycles, and especially not where I can granularly guarantee QoS of all compute resources to the important application instances, such as Production.

And just forget about ever using the higher-value abilities like complex entire-application-level cloning, both thin (where I do not have to consume the time and storage cost of copying the data, and changes are tracked in delta disks) and thick (where I do copy the data for physical parallelism, delivering isolation and performance).
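For anyone who hasn’t worked with the two kinds of clones, here is a minimal sketch of the difference.  It assumes a volume can be modeled as a dictionary of numbered blocks; the `ThinClone` class and `thick_clone` function are hypothetical names for illustration and say nothing about how Robin implements cloning internally.

```python
from typing import Dict, Optional


class ThinClone:
    """Copy-on-write view over a shared base volume (illustrative only)."""

    def __init__(self, base: Dict[int, bytes]) -> None:
        self._base = base                   # shared, never modified by the clone
        self._delta: Dict[int, bytes] = {}  # only the blocks this clone changed

    def read(self, block: int) -> Optional[bytes]:
        # Prefer the clone's own changes, fall back to the shared base.
        return self._delta.get(block, self._base.get(block))

    def write(self, block: int, data: bytes) -> None:
        # Writes land in the delta, so the base volume stays untouched.
        self._delta[block] = data

    def blocks_stored(self) -> int:
        return len(self._delta)


def thick_clone(base: Dict[int, bytes]) -> Dict[int, bytes]:
    """Full physical copy: costs time and space up front, shares nothing."""
    return dict(base)


production = {0: b"accounts", 1: b"orders", 2: b"audit-log"}

thin = ThinClone(production)
thin.write(1, b"orders-patched")

thick = thick_clone(production)
thick[1] = b"orders-patched"
```

In this sketch the thin clone stores a single changed block while still reading the other two from production, whereas the thick clone holds all three blocks itself, which is the storage-versus-isolation trade-off described above.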

I may not know what to do to try to change the past… but I know what to do to make the right changes, to ensure a better, more secure future.

I know I’m calling Robin.  And I don’t care who knows about it this time, I just want things to work out well going forward, at least.

It’s like my friend AI says:

“Let Robin do what he does… behind the scenes… without worrying about taking all the credit.  The credit should go to the technology super-heroes; Robin just wants things to go much more smoothly…  going forward.”

[1] Reported on krebsonsecurity.com, Sept 07, 2017, at: https://krebsonsecurity.com/2017/09/breach-at-equifax-may-impact-143m-americans

[2] Reported on theregister.co.uk, Sept 14, 2017, at:  https://www.theregister.co.uk/2017/09/14/missed_patch_caused_equifax_data_breach/

[3] I’ve been a production DBA and I know the stress they have to deal with.  This is why I was so happy to join Robin, to be working for a company that provides such an advanced toolset that can help so many through their critical day-to-day operations.

[4] Reported on nytimes.com, Sept 14, 2017, at: https://www.nytimes.com/2017/09/14/business/equifax-hack-what-we-know.html?mcubz=0

[5] Reported on nytimes.com, Sept 26, 2017, at:  https://www.nytimes.com/2017/09/26/business/equifax-ceo.html?mcubz=0

Here’s my headline…

Equifax:  Yet another security breach, so easily preventable with Robin.

  • Robin makes it easy for the good guys to protect us from the bad guys.

Author Mark Bayazit, Senior Solutions Architect
