Amazon's cloud failed: How can your cloud be better?

How should the industry respond to a cloud failure like Amazon's? Gregory Machler looks at the root cause and examines how the weakness can be addressed within the cloud product industry.

Amazon's cloud services failure will likely make corporations reluctant to deploy solutions in the public cloud. Companies will probably focus on private cloud solutions until they believe it is safe to dip into the public cloud. The Amazon outage was caused by an improper configuration of network infrastructure components. Human error led to a gigantic cloud failure and financial losses.

The failure points to a significant weakness in the cloud. I mentioned in an earlier disaster recovery article that critical infrastructure products have too many features and models. Cloud products need to be like cars: just as cars share a small set of common engine configurations, cloud products should share a small set of common features. There also needs to be a limited number of cloud product models, just as there is a limited number of car types.

The whole cloud system needs far fewer permutations so that the integration of those products can be properly tested for disaster recovery. Too many permutations are too expensive to test. Some pieces of software, like an Energy Management System (which controls power grids), have complex finite state machines, sophisticated power algorithms, and full system failover capabilities. But, as with many software products, some error paths are never tested.
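To make the cost of permutations concrete, here is a rough sketch in Python. The layer names and the number of models per layer are invented purely for illustration; they are not figures from Amazon or any vendor. The sketch assumes every model in one layer can be combined with every model in the others, so the number of integrations to test is the product of the counts.

```python
from math import prod

# Hypothetical number of deployable models or options per product layer.
# These figures are illustrative assumptions, not vendor data.
full_catalog = {
    "network": 12,
    "storage": 8,
    "hypervisor": 5,
    "os_image": 10,
    "database": 6,
}

# The same layers after an architect limits the approved models.
limited_catalog = {
    "network": 3,
    "storage": 2,
    "hypervisor": 2,
    "os_image": 3,
    "database": 2,
}


def integration_permutations(catalog):
    """Count the distinct end-to-end combinations, one model per layer."""
    return prod(catalog.values())


print(integration_permutations(full_catalog))     # 28800
print(integration_permutations(limited_catalog))  # 72
```

Under these assumed numbers, trimming each layer to two or three approved models cuts the test matrix from tens of thousands of combinations to a few dozen, which is a set a team could plausibly exercise for failover.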

Unlike EMS systems, cloud services need to avoid untested permutations by making integration simple and keeping the sophistication inside modular products. The complexity is hidden within the products and doesn't adversely impact integration. As with large aerospace, telecommunications, and defense projects, there is a need for cloud systems architects responsible for the proper integration and testing of multiple vendors' products. They can analyze the risk associated with the products and their integration. If they see weaknesses, they can turn to other product vendors. They can also enforce a limit on the number of cloud permutations the service provider or company will deploy.

Involving architects in the design of these solutions will put positive pressure on cloud product providers. Architects will influence the choice of products, favoring those that meet the requirements and integrate simply. Let's call these products cloud-aware. Such products could support a limited number of pre-defined templates and integrate well with other products. The use of templates allows these products to integrate with little intervention.
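One way to picture the template idea is a short approved list that every proposed deployment is checked against. The sketch below is only an illustration under assumed names: the `Template` fields, the template names, and the product identifiers such as `net-a` are all hypothetical and do not describe any real vendor's offering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Template:
    """A pre-defined, pre-tested combination of products (hypothetical fields)."""
    name: str
    network: str
    storage: str
    hypervisor: str


# Hypothetical list of combinations that have been integration tested.
APPROVED_TEMPLATES = (
    Template("small", "net-a", "san-a", "hv-a"),
    Template("large", "net-b", "san-b", "hv-a"),
)


def matching_template(network: str, storage: str, hypervisor: str) -> Optional[str]:
    """Return the name of the approved template matching the request, or None."""
    for template in APPROVED_TEMPLATES:
        if (template.network, template.storage, template.hypervisor) == (
            network,
            storage,
            hypervisor,
        ):
            return template.name
    return None


print(matching_template("net-a", "san-a", "hv-a"))  # small
print(matching_template("net-a", "san-b", "hv-a"))  # None
```

A request that matches no template is rejected rather than deployed as an untested permutation, which is how an architect could enforce the limit in practice.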

The use of an architect is really common today. How does one go about finding a good one? I recommend getting a generalist who is good at the 'big picture'. When large projects like the Brooklyn Bridge were developed, a generalist architect often led the project. Often they weren't the smartest; many of the niche architects were smarter and potentially more detail-oriented. But they were good communicators, focused on critical design issues, and resolved disagreements well. They implemented ideas from the best architects and moved the project forward.

The cloud systems architect needs skills similar to those of the historic bridge builders. They need to interact with application architects, platform architects, infrastructure virtualization architects, storage and network architects, and security architects who focus on disaster recovery and other product security concerns. Many counselors and outside experts should be brought in to address the design of the cloud service or private cloud. This up-front spending is well worth it because it will help avoid the need for disaster recovery, as well as potential failures and lawsuits.

More thought needs to go into why, how, and which cloud products integrate well with one another. Maybe the cloud industry needs an association like the storage industry's SNIA (Storage Networking Industry Association). We need more dialogue about how to avoid failure and how to improve and simplify products.

Copyright © 2011 IDG Communications, Inc.
