High Availability, Disaster Recovery and Business Continuity are the three golden terms everyone is concerned about when thinking about IT infrastructure. But these three often confuse beginners. Is there really a difference between High Availability, Disaster Recovery and Business Continuity? Aren't they all based on the same concepts? I personally had this doubt when I first heard these terms. Over time I kept coming back to them and became interested in the concepts behind them and why they matter. This is a brief post that might help beginners understand the underlying meaning of each.
The term High Availability (often abbreviated as HA) refers to an automated system that fails the services over from one node to another in the event of a failure. The failover is initiated automatically, without anyone having to step in. The time taken for the failover is minimal (close to zero), so the service sees little or no noticeable downtime. Clustering is one such high availability technique. In a Windows infrastructure, most high availability techniques depend on Windows Clustering. With this technique, high availability is achieved at the hardware level, the storage level, the operating system level and ultimately at the service level. The most important point is that we do not expect any data loss during a failover. Another point to note is that a classic failover cluster does not depend on a periodic replication mechanism between the nodes; the nodes typically work against the same shared storage. The reason is straightforward: if the data were copied from one node to another at intervals and a node failed in between, the most recent data would not have reached the second node (or the other members of the cluster), which would result in data loss.
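To make the idea of automatic failover a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how Windows Clustering is implemented: the `Node` class, the `pick_active` function and the node names are all hypothetical, and a real cluster adds heartbeats, quorum and shared storage on top of this.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster member; 'healthy' stands in for a real health check."""
    name: str
    healthy: bool = True


def pick_active(nodes, current):
    """Return the node that should own the service right now.

    The current owner keeps the service while it is healthy; otherwise the
    first healthy node in the list takes over without manual intervention.
    """
    if current.healthy:
        return current
    for node in nodes:
        if node.healthy:
            return node
    raise RuntimeError("no healthy node left in the cluster")


node_a = Node("node-a")
node_b = Node("node-b")
cluster = [node_a, node_b]

active = pick_active(cluster, node_a)
print(active.name)  # node-a

node_a.healthy = False  # simulate a failure of the active node
active = pick_active(cluster, active)
print(active.name)  # node-b
```

The point of the sketch is only that nobody has to decide to fail over: the service moves as soon as the owner is detected as unhealthy.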
With Disaster Recovery (DR), on the other hand, we do accept that some data may be lost. Two criteria define a Disaster Recovery strategy, and they are described below.
RPO – Recovery Point Objective
This defines the most recent acceptable state to which the system can be recovered. It is usually based on the SLA of the service. Take the case of a backup. If the backup is scheduled daily at 9 PM, the last recoverable state can be up to 24 hours old in the worst case, because the next backup only happens 24 hours later. If the backup is scheduled every 3 hours, the achievable recovery point is at most 3 hours old.
So the RPO is measured in time (usually hours), and it defines the amount of data loss the business can tolerate. It will differ from one service to another depending on how critical the service or application is, and it needs to be mutually agreed with the client or management for the services under consideration.
Sometimes, depending on the nature of the disaster, the system may have to be started from a recovery site in a different geographical location. For that to work, the data in the primary site has to be replicated to the DR site. This replication might run daily or weekly, and the achievable recovery point changes accordingly.
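The backup arithmetic above can be written down as a short Python sketch. The function names and the 4-hour RPO are assumptions made up for this illustration; the only idea taken from the text is that with periodic backups or replication, the worst-case data loss equals the interval between two copies.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(copy_interval: timedelta) -> timedelta:
    """A failure just before the next backup/replication loses one full interval."""
    return copy_interval


def data_loss_at(failure_time: datetime, last_copy_time: datetime) -> timedelta:
    """Data actually lost if the system fails at failure_time."""
    return failure_time - last_copy_time


def meets_rpo(copy_interval: timedelta, rpo: timedelta) -> bool:
    """True if the schedule can never lose more data than the agreed RPO."""
    return worst_case_data_loss(copy_interval) <= rpo


rpo = timedelta(hours=4)  # hypothetical value agreed with the client

print(meets_rpo(timedelta(hours=24), rpo))  # False: daily backup
print(meets_rpo(timedelta(hours=3), rpo))   # True: backup every 3 hours

# Daily backup at 9 PM, failure at 8:30 PM the next day.
last_backup = datetime(2024, 1, 1, 21, 0)
failure = datetime(2024, 1, 2, 20, 30)
print(data_loss_at(failure, last_backup))  # 23:30:00
```

The same check applies to site-to-site replication: a weekly replication to the DR site simply means a `copy_interval` of seven days.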
RTO – Recovery Time Objective
This defines the time within which the system must be back in production. Note that a system being back online does not necessarily mean the service is ready to resume. After the system is brought online, we may still need to re-install software, restore the data from backup, test the functionality and make some tweaks before it is really back in production, and all of that counts towards the recovery time. RTO is also measured in time (usually hours or days). You will sometimes see RTA (Recovery Time Actual) used alongside it. The two are related but not the same: RTO is the target that was agreed, while RTA is the time the recovery actually took in a test or a real incident.
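As a rough illustration of why "back online" is only part of the recovery time, the sketch below adds up the recovery steps mentioned above and compares the total against the target. The step durations and the 8-hour RTO are invented for the example, and `estimated_recovery_time` is a hypothetical helper, not part of any tool.

```python
from datetime import timedelta

# Hypothetical durations for each recovery step; real values come from DR tests.
recovery_steps = {
    "bring system online at recovery site": timedelta(hours=1),
    "restore data from backup": timedelta(hours=4),
    "test functionality": timedelta(hours=1, minutes=30),
    "final tweaks and go-live": timedelta(minutes=30),
}

rto = timedelta(hours=8)  # hypothetical target agreed with the business


def estimated_recovery_time(steps: dict) -> timedelta:
    """Total time until the service is really back in production."""
    return sum(steps.values(), timedelta())


estimate = estimated_recovery_time(recovery_steps)
print(estimate)         # 7:00:00
print(estimate <= rto)  # True: the plan fits within the target
```

If the restore step alone grew to six hours, the same check would fail, which is usually the signal to revisit either the recovery procedure or the agreed target.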
So in the case of Disaster Recovery (DR), it is evident that some sort of replication or backup is taking place, possibly from one site to another.
Business Continuity is a set of rules and strategies that anticipates the impact of a service outage on the business and lays out what has to be followed to restore the service as early as possible and with minimal impact. This is usually done through a BCP (Business Continuity Planning) process, which considers the different levels of disaster that can happen to the IT infrastructure and the appropriate strategy to follow for each of them.