Snapshot Considerations for Sizing SANs

Posted by Lou Person on May 02, 2011 in Cloud Journey

Snapshots are a great way to ensure a rapid Recovery Time Objective (RTO, how long it takes to recover data) and a tight Recovery Point Objective (RPO, how far back data can be recovered), but they add overhead to the storage pool.

Tape backup typically has an RPO of 24 hours, since backups are performed once a day in the middle of the night. Recovery from tape is a long process that can take anywhere from 4 to 48 hours. Snapshots store block level changes to data at a particular point in time. The actual data is not stored in full, just the blocks that changed. The snapshot window should be defined such that one snapshot completes before the next one starts.

Once the snapshot is on the SAN, retention needs to be considered. Best practice is to store hourly intraday snapshots, plus daily, weekly, monthly and annual snapshots. These snapshots can then be moved to offline media for archiving purposes. Some environments may consider a near line storage device for the snapshots.

The SAN needs to be sized to accommodate the storage of the snapshots. Snapshot size is typically a function of rate of change: how much data is changing and how often. Keep in mind that traditional backups will skew the rate of change, as backups “touch” and “change” most files. You may also be selective about which volumes are included in the snapshot, based on whether or not those files are critical for recovery. Certain temp files drive tremendous change but are not needed for recovery. They should be put on their own volume, which is not part of the snapshot.

As a rule of thumb, account for an additional 20-25% of usable storage to hold hourly, daily, weekly, monthly and annual snapshots. Hourly snapshots will be retained for 1 hour, daily for 1 day, and so on.
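To make the sizing arithmetic concrete, here is a rough Python sketch. The first function applies the 20-25% rule of thumb. The second is an assumption of mine rather than a vendor formula: it multiplies a constant hourly change rate by the number of hours retained. The function names, the 10 TB pool and the 2 GB per hour change rate are all hypothetical.

```python
def rule_of_thumb_reserve(usable_gb, low=0.20, high=0.25):
    """Return the (low, high) snapshot reserve in GB using the 20-25% rule of thumb."""
    return usable_gb * low, usable_gb * high


def change_based_reserve(change_gb_per_hour, retained_hours):
    """Estimate snapshot space as changed data accumulated over the retention window.

    Assumes a constant change rate and that every changed block is unique,
    so treat the result as an upper bound, not a prediction.
    """
    return change_gb_per_hour * retained_hours


if __name__ == "__main__":
    # Hypothetical 10 TB (10,000 GB) usable pool.
    low, high = rule_of_thumb_reserve(10_000)
    print(f"Rule of thumb reserve: {low:.0f} to {high:.0f} GB")

    # Hypothetical workload: 2 GB of changed blocks per hour, held for 24 hours.
    print(f"Change-based upper bound: {change_based_reserve(2.0, 24):.0f} GB")
```

If the change-based figure comes out well above the rule of thumb range, that is a hint to look for high-churn volumes (temp files, backup targets) that could be excluded from the snapshot.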

Snapshots can also be sent offsite, asynchronously, to another SAN. In this scenario, rate of change becomes even more important. The number of snapshots that can be completed is a function of the amount of bandwidth between the production site and the Business Continuity site. On average, 0.8 GB of block level changes can be replicated over a 1 Mbps connection. Roughly, 1 GB of block level changes will take 1.5 hours over a T1 (1.5 Mbps) connection. Since bandwidth is becoming broader and cheaper, more offsite asynchronous snapshots can be taken and stored, which will increase the amount of required storage at the Business Continuity site, as well as the need for additional bandwidth at the production site.
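The T1 figure above can be checked with a quick Python sketch. It assumes decimal units (1 GB = 8,000 megabits), a fully utilized link, and no protocol overhead or compression, so real transfers will likely take longer. The function names are hypothetical.

```python
def transfer_hours(change_gb, link_mbps):
    """Hours needed to replicate change_gb of block changes over a link_mbps link.

    Assumes 1 GB = 8,000 megabits and ignores protocol overhead and compression.
    """
    megabits = change_gb * 8_000
    seconds = megabits / link_mbps
    return seconds / 3600


def snapshots_per_day(change_gb_per_snapshot, link_mbps):
    """How many snapshots of a given size can finish replicating in 24 hours."""
    return int(24 // transfer_hours(change_gb_per_snapshot, link_mbps))


if __name__ == "__main__":
    # 1 GB of changes over a T1, taken as 1.5 Mbps as in the text.
    print(f"1 GB over T1: {transfer_hours(1, 1.5):.2f} hours")
    print(f"1 GB snapshots per day over T1: {snapshots_per_day(1, 1.5)}")
```

Running the numbers this way before choosing a snapshot schedule shows whether each offsite snapshot can finish replicating before the next one starts.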