Lee Atchison is the Senior Director of Cloud Architecture at New Relic. For the last seven years he has helped design and build a solid service-based product architecture that scaled from startup to high-traffic public enterprise. Lee has 32 years of industry experience, including seven years as a Senior Manager at Amazon. At Amazon, he led the creation of the company's first software download store, created AWS Elastic Beanstalk, and managed the migration of Amazon's retail platform to a new service-based architecture. Lee has consulted with leading organizations on how to modernize their application architectures and transform their organizations at scale, including optimizing for cloud platforms, utilizing service-based architectures, implementing DevOps practices, and designing for high availability. This experience led him to write his book "Architecting for Scale," published in 2016 by O'Reilly Media. Lee is an industry expert who is widely quoted in publications such as Diginomica, IT Brief, Programmable Web, CIO Review, and DZone. He has been a featured speaker at events across the globe, from London to Sydney, Tokyo to Paris, and all over North America.
Building a scalable application that has high availability is not easy. Problems can crop up in unexpected ways that can cause your application to stop working and stop serving your customers' needs.

No one can anticipate where problems will come from, and no amount of testing will identify and correct all issues. Some issues end up being systemic problems that arise only from the interaction of multiple systems. Some are more basic, but are simply missed or not anticipated.

Links and More Information

The following are links mentioned in this episode, and links to related information:

Modern Digital Applications Website (https://mdacast.com)
Lee Atchison Articles and Presentations (https://leeatchison.com)
Architecting for Scale, published by O'Reilly Media (https://architectingforscale.com)

Application availability is critical to all modern digital applications. But how do you avoid availability problems? You can do so by avoiding the traps that cause poor availability. There are five main causes of poor availability that impact modern digital applications.

Poor Availability Cause Number 1

Often, the main driver of application failure is success. The more successful your company is, the more traffic your application will receive. The more traffic it receives, the more likely you are to run out of some vital resource that your application requires.

Typically, resource exhaustion doesn't happen all at once. Running low on a critical resource can cause your application to begin to slow down, backlogging requests. Backlogged requests generate more traffic, and ultimately a domino effect drives your application to fail.

But even if it doesn't fail completely, it can slow down enough that your customers leave. Shopping carts are abandoned, purchases are left uncompleted.
Potential customers go elsewhere to find what they are looking for.

Increase the number of users on your system, or increase the amount of data those users consume, and your application may fall victim to resource exhaustion. Resource exhaustion can result in a slow, unresponsive application.

Poor Availability Cause Number 2

When traffic increases, the assumptions you've made in your code about how your application can scale are sometimes proven incorrect. You need to make adjustments and optimizations on the fly in order to resolve or work around those assumptions and keep your system performant. You need to change your assumptions about what is critical and what is not.

The realization that you need to make these changes usually comes at an inopportune time. It comes when your application is experiencing high traffic and the shortcomings start becoming exposed. This means you need a quick fix to keep things operating.

Quick fixes can be dangerous. You don't have time to architect, design, prioritize, and schedule the work. You can't think through whether this change is the right long-term change. You need to make changes now to keep your application afloat.

These changes, implemented quickly and at the last minute with little or no forethought or planning, are a common cause of problems. Untested or lightly tested fixes, hastily conceived fixes, and bad deployments caused by skipping important steps can all introduce defects into your production environment. The fact that you need to make changes to maintain availability will itself threaten your availability.

Poor Availability Cause Number 3

When an application becomes popular, your business needs usually demand that your application expand and add additional features and capabilities. Success drives larger and more complex needs.

These increased needs make your application more complicated and require more developers to manage all of the moving parts.
Whether these additional developers are working on new features, updated features, bug fixes, or other general maintenance, the more individuals working on the application and the more moving parts that exist, the greater the chance of a problem occurring that brings your application down. The more your application is enhanced, the more likely an availability problem is to occur.

Poor Availability Cause Number 4

Highly successful applications usually aren't islands unto themselves. They often interact with other applications, either applications that are part of your own application suite, or third-party applications. Third-party applications can be provided by vendors or partners. They can be external SaaS services. Or they can be integrations with customer systems. The more dependencies you have, the more exposed you are to problems introduced by those external systems.

Your availability will ultimately become tied to the availability and quality of those external applications. The more dependencies you have, the more fragile your application becomes.

Poor Availability Cause Number 5

As your application grows in complexity, the amount of technical debt your application has naturally increases. Technical debt is the accumulation of desired software changes and pending bug fixes that typically build up over time as an application grows and matures. Technical debt, as it builds up, increases the likelihood of a problem occurring. The more technical debt you have, the greater the likelihood of an availability problem.

Conclusion

All fast-growing applications have one or more of these problems, and these problems increase the risk of an availability incident. Availability problems can begin occurring in applications that previously performed flawlessly. The problems can quietly creep up on you, or they may start suddenly without warning.
But most applications, growing or not, will eventually have availability problems. Availability problems cost you money, they cost your customers money, and they cost you your customers' trust and loyalty. Your company cannot survive for long if you constantly have availability problems.

Focusing on these five causes will go a long way toward improving the availability of your applications and systems.

Tech Tapas — Database backup test failure

I want to tell you a story. You tell me if this is ok or not.

This was from a conversation I heard in a company I was working with. The conversation was a message from one engineer to their peers. They were trying to update them on the situation of a production database. The message went like this:

"We were wondering how changing a setting on our MySQL database might impact our performance..."

"...but we were worried that the change might cause our production database to fail."

"Since we didn't want to bring down production, we decided to make the change to the replica database instead...the backup database..."

"After all, it wasn't being used for anything at the moment."

Of course, you can imagine what happened next, and you would be right.

The production database had a hardware failure, and the system automatically tried to switch over to use the replica database. But the replica database was in an inconsistent state due to the experimentation that was going on with it. As such, the replica database was not able to take on the job as the new master...it quickly became overwhelmed...and then it failed as well.

Both the original master and the replica failed. The replica, whose sole purpose for existence was to take over in case the master failed, wasn't able to do so because it was being tinkered with by other engineers.

Those other engineers didn't understand that just because the replica wasn't actively servicing production traffic, that didn't mean it wasn't being used. Its entire job was to sit in wait, ready to take over if necessary.
By experimenting on that replica database, they were inadvertently impacting production. They were introducing risk into the production system — risk that wasn't appropriate. Risk that could — and in this case did — cause serious problems.

This, by the way, was a true story. But it is also not an uncommon story. I hear similar sorts of problems come up in many engineering conversations and many operations management conversations. Not having a clear understanding of, or appreciation for, how certain actions impact the risk management plans for a production system can be disastrous. This is why active and continuous risk management planning is critical for keeping production systems operational.

This podcast uses the following third-party services for analysis:

Chartable - https://chartable.com/privacy
Podtrac - https://analytics.podtrac.com/privacy-policy-gdrp
We often hear that being able to scale your application is important. But why is it important? Why do we need to be able to suddenly, and without notice, scale our application to handle double, triple, or even ten times the load it is currently experiencing? Why is scaling important?

In this episode, I am going to talk about four basic reasons. Four reasons why scaling is important to the success of your business. And then, what is the dynamic cloud?

This is Application Scaling, on Modern Digital Applications.

Links and More Information

The following are links mentioned in this episode, and links to related information:

Modern Digital Applications Website (https://mdacast.com)
Lee Atchison Articles and Presentations (https://leeatchison.com)
Architecting for Scale, published by O'Reilly Media (https://architectingforscale.com)

Why you must scale

We often hear that being able to scale your application is important. But why is it important? Why do we need to be able to suddenly, and without notice, scale our application to handle double, triple, or even ten times the load it is currently experiencing? Why is scaling important?

There are many reasons why our applications must scale. A growing business need is certainly one important reason, but there are others why architecting your application so it can scale is important for your business. I am going to talk about four basic reasons why scaling is important to the success of your business.

Reason #1. Support your growing business

This is the first, and the most basic, reason why your application has to scale. As your business grows, your application's needs grow. But there is more to it than that. There are three aspects of a growing business that impact your application and require it to scale.

The first is the most obvious. As you get more customers, your customers make more use of your application and need more access to your website.
This requires more capacity and more growth in the IT infrastructure for your sites. But that's not the only aspect.

As your application itself grows and matures, you will typically add more and more features and capabilities to it. Each new feature and each new capability means customers will make more use of your application. As each customer uses more of your application, the application itself has to scale. Simply by your business maturing over time, even if the size of your customer base doesn't grow, the computation needs of your application grow, and your application must scale.

And finally, as your business and your application grow and mature, your more complex application will require more engineers to work on it simultaneously, and they will work on more complex components. Your application might be rearchitected to be service based. It might add additional external dependencies and provisions. You will have to support more deployments and more updates. Your application and your application infrastructure will need to scale to support larger development teams and larger projects. This means you need more mature processes and procedures to scale the speed at which your larger team can improve your application.

Reason #2. Handle surprise situations

The second reason you need to be able to scale your application is to handle surprise situations and conditions. All businesses have their biggest days. These are the days when traffic is at its heaviest: days like Black Friday in retail, the day of the Super Bowl for companies that advertise during that event, open enrollment periods, or the start of travel season.

But your business may have unexpected business bumps. These are the traffic increases that occur not because of a known big event, but because of an unknown or unexpected event.
When an event occurs that is favorable to your business, you need to be ready to handle the increased load that comes with it. If you cannot handle the increased load, you risk losing the new business, and you risk disappointing your existing customers. Sudden business success can kill you if you can't scale to meet the need.

Just ask Robinhood Financial.

Robinhood Financial is a company that provides investment management services. On Monday, March 2nd, 2020, Robinhood faced a business crisis: a sudden increase in business. On that day, the United States stock market had a record-breaking day, which resulted in a record number of account signups and customer market transactions. This is good news for a company such as Robinhood. The problem was that their traffic load was not just high, it was too high.

They needed to respond to a huge spike in traffic to their application. Unfortunately, they were unable to keep up with the sudden demand. The result was a failure of their systems...and their application. The Robinhood Financial site was down for a day and a half. This was during a peak stock market time, a time when their customers needed them the most.

As a result, they lost out on a huge amount of easy new business, and they created hardship and disappointment for many of their existing customers. Potential new customers and existing customers alike were disappointed. A potential opportunity for huge growth and huge upside for the company instead turned into a major negative event, one their founders had to publicly apologize for. All because they couldn't scale to handle the surprise traffic load.

To be successful, companies must be able to scale to meet sudden and unexpected traffic demands.

Reason #3. Handle a partial outage

The third reason is a sneaky one. You need to be able to scale in order to handle partial application outages. Partial outages can be a big problem for businesses.
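How big a problem becomes clear with a little arithmetic. Assuming traffic is spread evenly across N locations, losing one pushes each survivor to N/(N-1) times its normal load — a quick sketch (the numbers here are hypothetical, not from the episode):

```python
# Surge seen by each surviving data center when one of N evenly
# loaded centers fails and its traffic is re-routed to the others.

def surge_factor(total_centers: int, failed: int = 1) -> float:
    """Traffic multiplier experienced by each surviving data center."""
    survivors = total_centers - failed
    if survivors <= 0:
        raise ValueError("no surviving data centers")
    return total_centers / survivors

for n in (2, 3, 5, 10):
    print(f"{n} centers, 1 fails -> each survivor sees "
          f"{surge_factor(n):.2f}x its normal load")
```

With only two centers, each must be able to absorb double its normal traffic; with ten, only about 11% extra headroom is needed, which is one reason spreading load across more locations helps — provided each location actually has that headroom.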
You have a large application, distributed across the globe in multiple data centers, or availability zones if you are operating in the cloud. You spread your application out like this for improved redundancy, availability, accessibility, and resiliency.

But now, one of your data centers goes down. Of course, since you are operating in more data centers, a single data center outage is far more likely. There are more chances for something to go wrong in any one of them.

When a data center goes down, the traffic that would normally be sent to it has to be re-routed to the other data centers. This results in a big uptick in traffic to those other data centers. Can they handle the increased traffic? If not, those data centers could go down as well, and your application can fail and become unavailable due to excessive traffic.

This seems counterintuitive, but your plan to increase availability just made your application less available. Your plan for improved redundancy by increasing the number of data centers actually made your application more fragile. By increasing the number of data centers you were using, you increased the risk of a data center failure, and your application isn't able to scale to handle the increased traffic needs of a data center failure. The result is an application meltdown. A step meant to improve availability makes availability worse.

Can your other data centers accept the sudden challenge of handling the additional traffic that is sent to them from a failed data center? Can you respond to this sudden need for scale? You must, or your application is at risk.

Reason #4. Maintain availability

The fourth reason is to maintain availability. As your application gets more complex, it requires more interactions between many different components to work correctly. If one of those components begins to act sluggishly, it can cause performance issues in downstream services.
These downstream performance issues can become worse, and more critical problems can occur: transaction timeouts, corruption, data loss, and ultimately, upset customers. A single service, slowing down for some simple reason, can cascade into a larger problem. And if your application can't scale, the likelihood of individual components saturating and slowing down becomes a matter of when it will happen, not if it will happen.

Lack of scalability turns into lack of availability. Lack of availability turns into failed customer expectations. Failed customer expectations turn into a negative impact on your business.

Scaling is Critical

Scaling is critical to your business success. Whether your business is growing or not, you need to be able to handle the growing and spiky traffic needs of your customers, at any time, or risk application failure, upset customers, and a business failure. Scaling isn't just important, it is a business necessity.

Tech Tapas — Dynamic Cloud

There are two ways that people utilize the cloud. The first is by taking an application that is designed to run anywhere and running it on infrastructure that was created in the cloud. This is typically called the static cloud, because you create resources, such as servers, that are long lived and use them to operate the application. The resource usage typically does not change much — or at all — over the long term as the application runs.

The other way is to allocate only the resources you absolutely need, when you need them. Given that it is very easy to allocate and free resources in a cloud — especially a public cloud — it's relatively easy to build an application that allocates the resources it requires when it requires them, and frees those resources when they are no longer required. This is called the dynamic cloud.

The dynamic cloud is where the true power of cloud computing exists, and where the true benefits of using the cloud can be unlocked.
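The allocate-when-needed, free-when-done idea can be sketched as a toy scaling policy. Everything here is illustrative — the per-instance capacity, the headroom factor, and the traffic figures are assumptions, and no real cloud API is used:

```python
# Toy illustration of the dynamic-cloud idea: size capacity to the
# current load rather than provisioning statically for the peak.
import math

CAPACITY_PER_INSTANCE = 100  # requests/sec one instance serves (assumed)
MIN_INSTANCES = 1            # never scale below this floor

def desired_instances(load_rps: float, headroom: float = 0.25) -> int:
    """Instances needed to serve load_rps with 25% spare headroom."""
    needed = load_rps * (1 + headroom) / CAPACITY_PER_INSTANCE
    return max(MIN_INSTANCES, math.ceil(needed))

# A static deployment would provision for the 12:00 peak all day long.
# A dynamic one tracks the load hour by hour:
for hour, load in [(3, 40), (9, 480), (12, 1200), (20, 300)]:
    print(f"{hour:02d}:00  load={load:4d} rps -> {desired_instances(load)} instances")
```

In this sketch the overnight low needs a single instance while the midday peak needs fifteen; statically provisioning fifteen around the clock is exactly the waste the dynamic cloud avoids.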
The ability to consume only the resources you absolutely require at the moment, coupled with the ability to quickly allocate the additional resources you require as your application's needs increase, gives you incredible capabilities: you can build highly scalable applications that meet your needs no matter the amount of traffic sent to them, yet conserve money when traffic is low.

When you perform a lift-and-shift migration of an application to the cloud, you typically move the application from a static data center to a static cloud. You typically do not take advantage of the dynamic capabilities of cloud computing. Too often such migrations end up being disappointments, because the application does not run any better in the cloud than it did in your own data center, yet the cloud resources may actually cost more money when used statically than equivalent resources in a static data center.

The only way to truly see the advantage of cloud computing is to utilize the dynamic cloud to build dynamic applications. Then you only consume — and pay for — the resources you require at the time you require them, yet you can increase the resources available to your application very quickly to handle sudden increases in traffic.

Whether you are doing dynamic auto scaling, or using dynamic services such as Amazon DynamoDB, Google Bigtable, AWS Lambda, or Azure Functions, using the cloud in a dynamic fashion — using the dynamic cloud — is the key to effectively utilizing the cloud to improve, and hence modernize, your application.

This podcast uses the following third-party services for analysis:

Chartable - https://chartable.com/privacy
Podtrac - https://analytics.podtrac.com/privacy-policy-gdrp
Ken Gavranovic was the Executive Vice President and GM of Product at New Relic. In early 2019, Ken and I were in Boston together for an event, and we recorded an interview discussion about risk management in modern digital applications.

Both Ken and I have experience dealing with risk management issues in current and past assignments. I discuss risk management in my book, Architecting for Scale. Ken used a very similar risk management technique in his past corporate management gigs. In this interview, we compare notes and make recommendations on best practices for risk management that everyone can use.

Links and More Information

The following are links mentioned in this episode, and links to related information:

Modern Digital Applications Website (https://mdacast.com)
Lee Atchison Articles and Presentations (https://leeatchison.com)
Architecting for Scale, published by O'Reilly Media (https://architectingforscale.com)
Risk Management with Ken Gavranovic Video (https://leeatchison.com/2019/02/06/managing-risk-in-modern-enterprise-applications/)
Ken Gavranovic Twitter (https://twitter.com/kgavranovic)
Ken Gavranovic LinkedIn (https://www.linkedin.com/in/gavranovic/)

Risk Management Interview

Ken: I know we both talk to a lot of customers. One of the questions is, where do I get started? What are some of the patterns we see in enterprises and in our own experiences? We have an awesome opportunity to talk to a lot of companies doing digital transformation, but what is something that I can just go do tomorrow to get started?

Lee: One of the things I find it's very easy to wrap your mind around is risk management. How do you build a risk matrix to track the issues and the risks you have within your system? I like to talk to companies about that because it gets people starting to think about what their system is doing, what problems they have, and how they deal with them.
It gets them thinking beyond just the problem/resolution cycle, and more into a pro/con and risk assessment process. What is the benefit of fixing something versus the benefit of mitigating it versus the benefit of simply ignoring it? I like to talk about that because it gets conversations going within the company about the sorts of things that are important to them.

Creating a risk matrix is an important first step for anyone who is thinking about trying to improve their availability, improve their scalability, or modernize their application in many different ways. It helps you get a grip on the issues that already exist in your system and what you are currently doing to manage those risks.

Ken: I 100% agree. I remember in a previous role, on a couple-hundred-million-dollar project, I had some challenges. We created a risk matrix which helped us solve those challenges. So I thought it might be helpful for people watching this video. Let's double click and see what this might look like.

From my perspective, the key questions that need to be asked should be asked in a bottom-up way, not top-down. Agreed?

Lee: Yes, definitely.

Ken: It's not people at the top of the organization that are giving you the answers. It's the team level that gives you the answers you need. Let me give you my shot and tell me where I miss. First of all, the things that go into the risk matrix are the things that can go bump in the night.

Lee: Most people already have an idea of the things that keep them up at night. Things they think about, worry about. The things they think about on a regular basis, and that is a good place to start.

Ken: That makes sense. So, bottom up, by team, just create a list. Just list all the things that we think are some sort of risk to the project.
These are things you know you should be resolving, but instead you have a habit of prioritizing feature development work over them. Next is to think about the likelihood that each risk will actually happen.

Lee: I tell people they need to think about two values for every risk item they come up with. Create a spreadsheet, and list all the risks as rows in the spreadsheet. Each individual risk, line by line. Then, for each risk, add two values in separate columns: likelihood and severity. That is, how likely is this risk to happen, and if it does happen, how much negative impact will it have? They should do this for every risk in the matrix before they even begin to think about fixing or mitigation.

Ken: I think it's important to share that this is what we've seen, not just from personal experience, but from a lot of companies that we work with.

Lee: Right.

Ken: What types of values should I use for likelihood and severity? Some people say I should score it from 1 to 10. I think that's too granular. I like to keep it simple. Just use: Low, Medium, and High.

Lee: I agree with you. You do run into people that want to be highly analytic. They want to use numbers, say, from 1 to 100, and they end up arguing about whether a particular risk is a 35 or a 36. This is way too granular. Keep it simple.

Ken: Sometimes teams like to use their sprint approaches of throwing numbers, such as using cards.

Lee: Yeah, if you really want a more rigorous process, you can do something similar to the sprint card-throwing approach, but just use three playing cards, say Ace, Five, and Ten. Then everyone can vote with a card and use that to determine High/Medium/Low. But that sort of process is only for people that really want a truly analytic solution. It can be done much more simply than that. Often, it is clear to everyone whether an item is a high or a low or somewhere in between.

Ken: So, whether you use cards, or just use Low/Medium/High, or whatever.
At the end of the day, the most important thing is to keep it simple. It's not about a big debate.

Lee: Exactly.

Ken: At this stage, we are not trying to get into a great level of detail. Just a high-level description, likelihood, severity. The next thing for the matrix is: is this risk currently instrumented? Does it have observability? If this risk were to occur, would you know that it is occurring from a notification from an automated system, or would you find out from your customers telling you?

Lee: That's a fantastic way to think about it. It's one thing to know what's going to happen if something goes wrong. It's another thing to know that you'll know when it happens.

Ken: Agreed.

Lee: And, certainly, when we talk later about mitigation, knowing when a risk is occurring is a critical aspect of risk management. This is especially true for your high severity risks, whether they are high likelihood or low likelihood.

Ken: Kickstarting a program like this in an enterprise is obviously hard. You need top-down leadership to support this process.

Lee: Yes.

Ken: Risk matrix, containing lines with items, likelihood, severity, monitored or not monitored. Ok, what else, or is it just that simple?

Lee: Well, coming up with that list is going to get you 80% of the way to what you need. That's because it gets you and your organization thinking about what's going on. That's the most important benefit of this process. You start thinking about risk and the impact risk has on your system. What's going to happen during this risk discovery process is that the engineers in the room, their minds are quickly going to go to the next thing, which is mitigation. They are going to start to think about how to handle the risk. But, you are right.
If you get nothing done but create that list of risks and put them in the matrix during the first meeting or two, that's all you need, and your world will be a whole lot better just by simply having that matrix.

Ken: Right. Another point I want to throw out there and see if you agree is around RCA and incident response processes. I think when you have an incident, during the RCA you should always check whether the issue was already on the risk matrix. If it was not there, then it should be added, and some time should be spent on why it wasn't added in the first place. Maybe a team wasn't as aggressive, and they didn't want to put everything in the matrix. Because, going back to no surprises, you want to understand why this incident was a surprise. One of my favorite phrases is, "surprises: not a fan of giving or receiving". If you have a risk matrix and it's done right, anything that goes bump in the night should have been known and on the risk matrix ahead of time.

Lee: Exactly. You know, every time you have an outage or an incident of any sort, you end up with some sort of post mortem, whether it's formalized or not. One of the key questions has to be, "did you know about this ahead of time?", and that comes back to the risk matrix. Because if you didn't know about it, that's a problem. It needs to be added to the risk matrix so you understand that risk fully. But if you did know about it, you should also verify that the actual severity of the incident matches the severity you gave it on the risk matrix. Were you right or wrong in your estimates? You can gain a lot of knowledge when an incident occurs by answering questions like this with the help of a risk matrix.

Ken: So, let's assume that as a leader, I've told my organization to build a risk matrix. They've done the process, and I now have this risk matrix.
From an execution point of view, I think there are two things that need to happen next.

First, you look at the high/highs – high likelihood, high severity. In some cases, removing these risks might involve rewriting. But the high/highs that you can fix, you should prioritize the work and get them fixed.

Second, you always have business partners. I'm a big believer that you should take that risk matrix and present it, at the executive level, to your business partners. You show the high/highs, the medium/mediums, or whatever they are. Now, as a company, think about one of two things. Should we focus on fixing these high/highs, or should we all take a breath and say we are willing, for whatever reason, to take this risk on as a company? You go into that with open eyes and a blameless culture, and state your willingness to take that risk together.

Lee: Yes, and that's really critical too. Because no matter what, you are not going to remove all the risk from the system. You aren't going to fix all the problems, nor is trying to do that necessarily the right investment for you. The right level of risk is whatever level your organization – your extended organization – is comfortable with. The business cost of the risk, the development cost of fixing it, all of these things have to fit together. But once you know what your risk is, you can evaluate whether you and the culture of your company, and your customers, and the business you provide, are comfortable with that level of risk.

Now, for the things that you are not comfortable with, you have to address those right away. You have to either mitigate those risks or remove them. But for the other risks, the ones where you are comfortable with the level of risk, it's not necessarily a good investment to work on resolving them, because there are going to be higher priority issues you want to work on.

Ken: Another important aspect is the funding perspective. I look at the risk matrix as a living document.
My thought is that you should run this exercise at least twice a year. Then, when you have incidents, you should update the risk matrix to match those incidents. The risk matrix should be accurate and maintained.

Lee: Absolutely.

Ken: Now, most companies fund on an annual basis. My perspective is that a lot of the time, people forget about risk when it comes to funding. In some companies, what gets funded are the "bright shiny objects"; that's where the money is invested. So, for companies that are technology leaders, you should bring the risk matrix to the budgeting discussions. That way you can make sure everybody is clear, and all discussions are open about what we are investing in and why. The risk matrix is part of the budgeting process.

Lee: Yes, it's definitely feedback into your budgeting process. But at a much lower level, it's also feedback into your sprint planning process.

Ken: Totally agree.

Lee: You use it to determine what you can accomplish this sprint, and how much you want to spend on risk management activity during this specific sprint versus building new features or dealing with other problems.

Ken: I know many enterprises that are really focused on the customer experience say high/highs must be done first, unless fixing one involves a full rewrite. If you go into an organization that has a lot of technical debt, that may not be the case; you do as many as you can each sprint.

Lee: Yes, absolutely. But one important thing to consider is that fixing a risk does not have to mean removing it. It might mean creating a mitigation that reduces its severity or likelihood to an acceptable level.

Ken: It might move from a high likelihood to a medium, or from a high impact to a medium or low.

Lee: And just by doing that, you've brought it down to within the comfort level of your organization.
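To make the idea concrete, here is a minimal sketch of a risk matrix as code. The risk names, the 1-3 scoring scale, and the helper functions are all illustrative assumptions, not anything prescribed in the conversation; a real matrix would live in a shared document or tracking tool.

```python
# A minimal sketch of a team risk matrix. The entries and the 1-3
# likelihood/severity scale are hypothetical examples.

RISK_MATRIX = [
    # (description, likelihood, severity) where 3 = high, 1 = low
    ("Search service fails under peak load", 3, 3),
    ("Database replica lags during backups", 2, 2),
    ("TLS certificate expires unnoticed", 1, 3),
]

def prioritize(matrix):
    """Sort risks so high-likelihood/high-severity items come first."""
    return sorted(matrix, key=lambda r: (r[1] * r[2], r[2]), reverse=True)

def mitigate(risk, new_likelihood=None, new_severity=None):
    """Fixing a risk doesn't have to mean removing it; a mitigation
    can simply lower its likelihood or severity score."""
    desc, likelihood, severity = risk
    return (
        desc,
        new_likelihood if new_likelihood is not None else likelihood,
        new_severity if new_severity is not None else severity,
    )

for desc, likelihood, severity in prioritize(RISK_MATRIX):
    print(f"{likelihood * severity:>2}  {desc}")
```

The scoring is deliberately crude; the point is only that ordering by likelihood times severity surfaces the high/highs first, and that a mitigation is modeled as a score reduction rather than a deletion.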
And once it's in the comfort level of the organization, that's a very successful place to be.

Ken: And you and I have seen this at hundreds of global companies. So, from a best-practices standpoint, it really makes a lot of sense. Have a risk matrix, update it semi-annually and when incidents happen, and review it during the RCA process. Rinse and repeat. Then, take what you have and use it in the budgeting process. Anything else we should add?

Lee: The only additional thing is that individual teams need to own their own risk matrices; remember, they are built bottom-up. Individual teams need to have responsibility for their own risk matrices and be held accountable for the content. Then, they all need to bubble up to a high-level list that is known at the highest levels of the organization.

Ken: I agree. And the initiative and guidance to do it need to come top-down, because it's important to the entire organization.

Lee: Yes.

Ken: The actual work itself happens bottom-up. Totally agree.

I'd like to thank Ken for his involvement in this interview and for the insights he provided on the very important topic of risk management. Additionally, thank you to Ken for providing the recording equipment he and I used for the interview.

Tech Tapas — History of the term Cloud

What is the history of the term "cloud" as it is used in cloud computing? That's an interesting question, and unfortunately there are probably as many answers as there are people who work in the cloud, so it's very hard to answer definitively.

But most people who have looked into this subject believe the term originally came from the telephone companies of the 1980s. Network engineers who drew diagrams of portions of their networks would often draw a big blob to indicate a portion of the network that they weren't dealing with at the time.
Rather than drawing these blobs as simple circles, squares, or rectangles, since those icons represented real entities in their diagrams, they drew the blob using interconnected rounded segments that made it look nebulous and nondescript in shape. It was, after all, supposed to represent a nebulous and nondescript part of the network.

In fact, this nebulous and nondescript shape looked like a cloud. So much so that these engineers started talking about the part of the network they weren't focusing on at the time, the part external to their area of concern, as being out in the cloud.

This usage spread to software architects as they built their software diagrams and flow charts as well. They used the symbol for a similar purpose.

The term cloud computing, though, is much more recent. Some would argue that early server farms were really cloud computing, but the term wasn't popularly used back then. Some would say that Google Cloud Platform (GCP) provided some of the earliest cloud computing technology to the industry. Others would say that Salesforce.com was a major early creator of cloud technology. But I think the real mainstream usage of the term cloud computing among technical professionals was popularized with the start of Amazon Web Services. AWS mainstreamed cloud computing, and hence mainstreamed the term cloud computing. This happened in the mid-2000s.

But software running in the cloud was still something reserved for technical people to talk about. Mainstream, non-techy people didn't yet know what the term cloud was all about. This was certainly my personal experience. I was working at AWS in the early days, and I had a hard time telling my non-techy friends what I did for a living. They just didn't know what I meant by "cloud computing". They didn't know what I meant when I said I worked "in the cloud".
They didn't understand what the cloud was.

Like many of the changes in modern popular tech culture, that changed once Apple came into the picture. On October 12, 2011, Apple introduced iCloud to the Apple universe, and overnight the word "cloud" became part of mainstream culture. Those friends of mine who couldn't understand what I was doing when I said I "worked in the cloud" now understood, at least at some level, what the cloud was all about. Apple brought the term cloud to the mainstream.

I know many people will likely disagree with my analysis of who invented the term cloud and who popularized its use. That's because there really isn't a single right answer. But I do believe the biggest events in the history of the term "cloud computing" were, in order: first the network engineers, then AWS, then Apple. Each of those three groups played a role in bringing the word "cloud" into our everyday lives.

This podcast uses the following third-party services for analysis:
Chartable - https://chartable.com/privacy
Podtrac - https://analytics.podtrac.com/privacy-policy-gdrp
Modern applications require high availability. Our customers expect it; our customers demand it. But building a modern, scalable application that has high availability is not easy and does not happen automatically. Problems happen, and when problems happen, availability suffers. Sometimes availability problems come from the simplest of places, but sometimes they can be highly complex.

In this episode, we continue our discussion from last week with the remainder of the five strategies for keeping your modern application highly available.

This is How to Improve Application Availability, on Modern Digital Applications.

Links and More Information

The following are links mentioned in this episode, and links to related information:
Modern Digital Applications Website (https://mdacast.com)
Lee Atchison Articles and Presentations (https://leeatchison.com)
Architecting for Scale, published by O'Reilly Media (https://architectingforscale.com)
Robinhood Announcement (https://blog.robinhood.com/news/2020/3/3/an-update-from-robinhoods-founders)

How to Improve Availability, Part 2

Building a scalable application that has high availability is not easy and does not come automatically. Problems can crop up in unexpected ways that can cause your application to stop working for some or all of your customers. No one can anticipate where problems will come from, and no amount of testing will find all issues. Many of these are systemic problems, not merely code problems.

To find these availability problems, we need to step back and take a systemic look at our applications and how they work.

What follows are five things you can and should focus on when building a system to make sure that, as its use scales upward, availability remains high. In part 1 of this series, we discussed the first two of these focuses: building with failure in mind, and always thinking about scaling.
In part 2 of this series, we conclude with the remaining three focuses.

Number 3 - Mitigate risk

Keeping a system highly available requires removing risk from the system. When a system fails, the cause of the failure often could have been identified as a risk before the failure actually occurred. Identifying risk is a key method of increasing availability.

All systems have risk in them. There is a risk that:
A server will crash
A database will become corrupted
A returned answer will be incorrect
A network connection will fail
A newly deployed piece of software will fail

Keeping a system available requires removing risk. But as systems become more and more complicated, this becomes less and less possible. Keeping a large system available is more about managing what your risk is, how much risk is acceptable, and what you can do to mitigate that risk.

This is risk management, and it is at the heart of building highly available systems. Part of risk management is risk mitigation. Risk mitigation is knowing what to do when a problem occurs in order to reduce the impact of the problem as much as possible. Mitigation is about making sure your application works as well and as completely as possible, even when services and resources fail. Risk mitigation requires thinking about the things that can go wrong and putting a plan together now, so you can handle the situation when it does happen.

For example, consider a typical online e-commerce store. Being able to search for products is critical to almost any online store. But what happens if search breaks?

To prepare for this, you need to have "failed search engine" listed as a risk in your application risk plan. And for that risk, you need to specify a mitigation plan to execute if the risk ever triggers.

For example, we might know from history that 60 percent of people who search our site end up looking at and buying our famous red striped shirt.
So, if our search service stops functioning, rather than simply failing, we could display an appropriate "I'm sorry" page, followed by a list of our most popular T-shirts, including our red striped shirts. For some number of customers, this would be a success. For the rest, it might create alternatives other than simply leaving in frustration. Combine this "I'm sorry" page with a coupon for 10% off their next visit, and you've turned a bad customer experience into one that might just create some return customers.

This is a great example of a risk mitigation plan: a plan that you build in advance of a potential but serious problem, so you can implement it if that problem occurs.

Other risk mitigation plans might be entirely technical. They might involve failover servers, or rapid response plans to resolve an issue. Whatever the plan is, risk mitigation is the process of creating these plans and putting them into place.

Number 4 - Monitor availability

You can't know there is a problem in your application unless you can see the problem. Make sure your application is properly instrumented so that you can see how it is performing.

Proper monitoring depends on the specifics of your application and needs, but usually entails some or all of the following capabilities:

Server monitoring. Monitoring the health of the server infrastructure your application runs on. This might be physical resources, or cloud-based virtual resources.
Configuration change monitoring. Understanding when and how your system infrastructure changes, and how those changes impact the operation of your application.
Application performance monitoring. Looking inside your application and services to make sure they are operating the way you expect.
Synthetic testing.
Monitoring how your application works from an external perspective, in order to catch problems as customers would see them, before they actually see them.

Monitoring involves improving two aspects of modern application operation: MTTD and MTTR, that is, mean time to detection and mean time to resolution. Watching key performance indicators for changes in patterns, and alerting you to those changes, improves your mean time to detection. Giving you a wealth of data that can be used to diagnose the source of a problem improves your mean time to resolution. Both are important measures for improving application availability.

Number 5 - Respond to issues in a predictable and well-defined manner

Monitoring systems are useless unless you are prepared to act on the issues that arise. This means being alerted when problems occur so that you can take action. Additionally, you should establish processes and procedures that your team can follow to help diagnose issues and easily fix common failure scenarios. For example, if a service becomes unresponsive, you might have a set of remedies to try to make it responsive again. These might include running a test to help diagnose where the problem is, restarting a daemon that is known to cause the service to become unresponsive, or, if all else fails, rebooting a server. Having standard processes in place for handling common failure scenarios will decrease the amount of time your system is unavailable.
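The escalating remedies just described (diagnose, restart a known-flaky daemon, reboot as a last resort) can be sketched as a simple runbook loop. The helper functions and the simulated health state below are hypothetical stand-ins for hooks a team would wire to its own infrastructure; this is an illustrative sketch, not a prescribed implementation.

```python
# Sketch of an escalating runbook: try each remedy in order and
# re-check service health after every step.

def recover_service(check_health, remedies):
    """Apply remedies in escalation order until the service is healthy.

    Returns the name of the remedy that restored the service, or None
    if every step was exhausted (time to page a human).
    """
    for name, remedy in remedies:
        remedy()
        if check_health():
            return name
    return None

# Simulated example: the service becomes healthy after the daemon restart.
state = {"healthy": False}

def run_diagnostics():   # hypothetical: gather logs, run a quick test
    pass

def restart_daemon():    # hypothetical: restart the known-flaky daemon
    state["healthy"] = True

def reboot_server():     # hypothetical: last resort
    state["healthy"] = True

fixed_by = recover_service(
    lambda: state["healthy"],
    [("diagnostics", run_diagnostics),
     ("daemon restart", restart_daemon),
     ("server reboot", reboot_server)],
)
print(fixed_by)  # in this simulation: daemon restart
```

Encoding the escalation order this way means the least disruptive remedy is always tried first, and the runbook itself records which step actually restored the service, which is useful input for the follow-up diagnosis.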
Standard processes like these will help improve mean time to resolution. Additionally, they can provide useful follow-up diagnostic information to your engineering teams to help them deduce the root cause of common ailments, reducing the likelihood of a recurrence of the problem, or of the occurrence of a similar one.

When an alert is triggered indicating that a service is failing, or might be failing, the owners of that service must of course be alerted so they can deal with the issue in a timely manner. However, other teams that are closely connected to the problem service may also want to be alerted. If you own a service that depends on the failing service, you might want to be informed of the problem even before it impacts your service, so that you can take preventative measures or institute actions from your risk management plan before things become critical. Additionally, if the failing service is a consumer of your service, you may want to be aware that traffic patterns from the failing service may change as the problem occurs and is resolved. You may want to keep a close eye on your service to make sure any changes do not negatively impact you.

Documented processes and operations are an essential part of this. Support artifacts should be well documented and available to all parties that require them. They should also be frequently reviewed and updated, and updating support artifacts should be a regular part of your process for adding new features and capabilities.

These processes and procedures are especially useful because outages often occur at inconvenient times, such as the middle of the night or on weekends, when your on-call team might not be performing at peak mental efficiency. These recommendations will help your team make smarter and safer moves toward restoring your system to operational status.

Summary

No one can anticipate where and when availability issues will occur.
But you can assume that they will occur, especially as your system scales to larger customer demands and more complex applications. Preparation and planning are critical to improving availability and maintaining it as your application scales.

That's it for the five focuses to help improve your modern application's availability. More information on these five focuses can be found in my book, Architecting for Scale, published by O'Reilly Media. A link can be found in the show notes.

Tech Tapas — Can't Scale? Time to go out of business

Why is it essential that your application scale? Well, why not ask Robinhood Financial. Robinhood is an investment company that provides investment management services for its tech-savvy clientele. Robinhood learned the hard way the cost of success.

On Monday, March 2nd, 2020, the United States stock market had a record-breaking day. After previous significant drops due to virus pandemic scares, good news caused the stock market to rally. The result was record-breaking traffic in the stock market.

For a young investment company like Robinhood Financial, this would typically be considered great news. The company thrives on new account signups and on customer market transactions, and both were available in record numbers to Robinhood that day.

The problem? Their traffic was *too* high.

You see, companies like Robinhood need to be able to respond to variable loads, as spikes in traffic occur all the time. Still, this record-breaking traffic spike, and the volatile market conditions that went with it, were more than Robinhood's infrastructure could handle. The result? Their systems started to fail.
This failure created a "thundering herd" effect, as Robinhood's founders described it, leading to a failure of their DNS system. The result was that Robinhood's systems were down for approximately one and a half days.

One and a half days.

This one-and-a-half-day outage occurred during a peak stock market time, a time when their customers needed them most. During their most critical time, their systems were unavailable.

That's often the problem: scaling-related outages more often than not occur during the good times, not the bad times. This can very quickly turn a significant business opportunity for success into an utter failure. Scaling and availability problems can take the moment of greatness you've been working toward all your life and turn it into an event that can shutter your business.

This isn't true just for Robinhood; all modern companies face problems like this. One of the most challenging things for an online business to handle is success. Success can be the killer of a business. Success can shut you down.

Avoiding problems like this is why it is essential that you consider the scaling and availability needs of your application well before those needs arise. The day when success is staring you in the face is too late to start planning for scaling in your system architecture.

In the show notes, I have a link to Robinhood's blog announcement of the outage, along with links to other useful information.

There are many resources you can use to help build scalability and availability into your business processes and your applications. My book, Architecting for Scale, published by O'Reilly Media, has lots of useful high-level information on the systems and processes involved in building highly scaled, highly available applications. You can also listen to this podcast.
I talk often about scaling and availability topics as they relate to building and operating your modern digital applications.