DR for a Replication Server

Restoring a Primary Database ________________________________________________________________________ Phase I: Clean up the old system 1) Log on to your primary Replication Server and suspend the DSI connection to the primary database. suspend connection to srv1.pdb 2) Re-create the primary database.
  • · If your primary database is irretrievably damaged, drop it. If the database is
marked suspect, you must use the dbcc dbrepair command. Drop database will not execute on a suspect database. drop database pdb dbcc dbrepair (pdb,dropdb)
  • · Drop any damaged database devices using sp_dropdevice.
  • · Create new database devices using disk init (if necessary).
  • · Create the new database using create database with the for load option.
3) Restore the primary database.
  • · Load the database from the database dump.
load database pdb from <dump file>
  • · Restore any transaction log dumps.
load transaction pdb from <dump file>
  • · Activate the database using the online database command.
online database pdb Phase II: Re-Establish Replication If you have done up-to-the-minute recovery, your primary database should be current, and you can go directly to Step 8. Steps 4-7 are only necessary if you have not recovered all of the data in the database. 4) Zero out the LTM truncation point. Use the RSSD for your primary Replication Server, and run the following command: rs_zeroltm srv1, pdb 5) Log on to the primary Replication Server and get the generation number for the primary database. Record this number. admin get_generation, srv1, pdb 6) Use your primary database, and update the generation number. The new number should be higher than the old number. dbcc settrunc(’ltm’, ’gen_id’, <new number>) The reason for doing this is that the generation number will differ between the RSSD on the Repserver controlling this db and the actual db The reason for doing this is that if you are restoring a primary database to an earlier state,  you have to increment the database generation number so that the Replication  Server does not ignore log records submitted after the database  is reloaded. 7) Recover any lost or inconsistent data.
  • · Recover any data modification lost from the primary database.
  • · Re-synchronize the replicate database:
  • · Re-materialize the replicate database.
  • · Run rs_subcmp.
  • · Replay the transaction log from a log backup.
8) Once the primary and replicate databases are correct, you can resume replication. Log on to your primary Replication Server and resume the DSI connection to your primary database. resume connection to srv1.pdb 9) Restart the RepAgent on your primary database. 10) Review your error logs to ensure that no errors have occurred. Re-playing the Transaction Log ________________________________________________________________________ 1) Log on to your Adaptive Server. Create a new database that has the same size and segments as your primary database. This will be your temporary recovery database. 2) Shut down the Replication Server and restart it in single-user mode. To do this, you should have a run server file that contains the –M option. Once the Replication Server comes up, check the error log to ensure that the server came up in stand-alone mode. 3) Put the Replication Server into log recovery mode by executing the following commands: set log recovery for srv1.pdb allow connections 4) Configure a RepAgent for the temporary recovery database that you created in Step 1. Log into your primary data server, use the temporary recovery database, and execute the following commands: exec sp_config_rep_agent temp_rep, ‘enable’, ‘rep1’, ’sa’,’’ 5) Log on to your primary Adaptive Server and load the temporary database you created with the database dump of your primary database. load database temp_rep from ’<dump file>’ go online database temp_rep 6) Start the RepAgent for the temporary recovery database in recovery mode. Although you are starting the RepAgent on the temporary recovery database, the ‘connect database’ is your original primary database: exec sp_start_rep_agent temp_rep, recovery, ‘srv1’,‘pdb’,’rep1’ 7) Verify that the RepAgent has replayed everything that you have loaded. There are two ways to do this:
  • · Examine the error log of your primary Adaptive Server, looking for a message
that states that recovery of the transaction log is complete.
  • · Run sp_help_rep_agent on the RepAgent for the temporary recovery database:
exec sp_help_rep_agent temp_rep, ‘recovery’ The recovery status should be “not running” or “end of log”. 8) Log on to your primary Adaptive Server and load the temporary database you created with the first transaction dump of your primary database: load transaction temp_rep from ’<dump file>’ 9) Restoring the transaction dump will invalidate the LTM truncation point. Reset it by using the temporary recovery database and running the following command: dbcc settrunc(‘ltm’,’valid’) 10) Start the RepAgent for the temporary recovery database in recovery mode: exec sp_start_rep_agent temp_rep, recovery, ‘srv1’,‘pdb’,’rep1’ 11) Verify that the RepAgent has replayed everything that you have loaded. There are two ways to do this:
  • · Examine the error log of your primary Adaptive Server, looking for a message
that states that recovery of the transaction log is complete.
  • · Run sp_help_rep_agent on the RepAgent for the temporary recovery database:
exec sp_help_rep_agent temp_rep, ‘recovery’ The recovery status should be “not running” or “end of log”. 12) Repeat Steps 8, 9, 10, and 11 for each transaction log dump until you have played all transaction log dumps through the replication system. 13) Check the Replication Server error log for any loss detection messages. If you find any, then this process has failed, and you must find another way of re-synchronizing your data in the primary and replicate databases. Your other options are:
  • · Re-materializing the replicate database.
  • · Running rs_subcmp.
14) If you did not find loss detection messages in the previous step, shut down your Replication Server, and restart it in normal mode. To do this, you should have a run server file that does not contain the –M option. If you accidentally bring the server up in single-user mode, there will be a message to this effect in the error log. Shut Replication Server down, and start it up using a correct run server file. 15) Re-establish the LTM truncation point in the original primary database. Use the primary database, and run the following command: dbcc settrunc(‘ltm’,’valid’) 16) Restart the RepAgent for the original primary database in normal mode. exec sp_start_rep_agent pdb 17) Log on to your Replication Server and check for any open transactions admin who,sqt 18) Drop the temporary recovery database, and remove its database devices. Rebuilding Stable Queues ________________________________________________________________________ 1) Log on to your replication server, and drop the damaged partition. drop partition <partition name> 2) If necessary, add a new partition to replace the disk space. Remember, the file or raw partition must already exist before you execute this command.
  • · In UNIX, you can use the touch command to create a file.
  • · In Windows NT, you can create a file using a utility program such as Notepad.
add partition <partition name> on ’<physical_name>’ with size <size> 3) Check the disk space on your Replication Server. Your failed partition should contain the status DROPPED, while your new partition should be listed as ONLINE. admin disk_space 4) Rebuild the stable device: rebuild queues 5) If the connection to your replicate database is suspended, resume it. 6) Examine your Replication Server error log. Wait for the log to display the message “Rebuild Queues: Complete”. You may see the error message, “DSI: detecting loss for database ’srv1.pdb’…” If your RepAgent is not running, this message can be ignored. 7) If the RepAgent for the primary database is not running, start it up. Use your primary database and run the following: exec sp_start_rep_agent pdb 8) Examine your Replication Server error log. Check for loss detection messages. If a data loss was detected, then this process has not fully succeeded, and you have lost transactions.
  • · Compare the primary and replicate databases, and replace any lost data.
  • · Use the ignore loss command to restart replication.
ignore loss from srv1.pdb to srv1.rdb If no data loss was detected, Replication Server will end loss detection mode, and normal replication may continue. 9) Check the disk space used by your Replication Server’s stable device. admin disk_space If the damaged partition no longer appears in this display, the file or raw partition that was used by the damaged partition may be dropped. Restoring an RSSD Database ________________________________________________________________________ Phase I: Clean up the old system 1) Shut down all RepAgents that connect to the current Replication Server. 2) If the Replication Server is still running, log on to it and shut it down. 3) Re-create the RSSD database.
  • · If your RSSD database is irretrievably damaged, drop it. If the database is marked
suspect, you must use the dbcc dbrepair command. Drop database will not execute on a suspect database. drop database rep1_RSSD dbcc dbrepair (rep1_RSSD,dropdb)
  • · Drop any damaged database devices using sp_dropdevice.
  • · Create new database devices using disk init (if necessary).
  • · Create the new RSSD database using create database with the for load option.
4) Restore the RSSD database.
  • · Load the database from the database dump.
load database rep1_RSSD from ‘<dump device>’
  • · Load the database from any transaction log dumps. Be sure to apply the
transaction log dumps in order. load tran rep1_RSSD from ‘<dump device>’
  • · Activate the database using the online database command.
online database rep1_RSSD 5) If your Replication Server’s RSSD had a RepAgent and you were able to recover the database up-to-the-minute, re-establish its truncation point and restart your RepAgents. Use the RSSD database and run the following command: dbcc settrunc(’ltm’, ’valid’) 6) If your Replication Server’s RSSD had a RepAgent and there was an active route from this replication server to another one, but you were unable to recover up-to-the-minute, get the generation number. admin get_generation, srv1, rep1_RSSD Phase II: Rebuild the RSSD If you have done up-to-the-minute recovery, your RSSD database should be current. Restart your Replication Server and RepAgents. Steps 7-12 are only necessary if you have not been able to recover the RSSD database to its current state. 7) Shut down the Replication Server and restart it in single-user mode. To do this, you should have a run server file that contains the –M option. Once the Replication Server comes up, check the error log to ensure that the server came up in stand-alone mode. 8) Rebuild the stable device: rebuild queues 9) Start up the RepAgents for your primary databases in recovery mode. To do this, do the following: sp_start_rep_agent pdb1, recovery 10) Examine your Replication Server error log, and check for loss detection messages. If a data loss was detected:
  • · Compare the primary and replicate databases, and replace any lost data.
  • · Use the ignore loss command to restart replication.
ignore loss from srv1.pdb to srv1.rdb If no data loss was detected, Replication Server will end loss detection mode, and normal replication may continue. 11) If this Replication Server is a primary Replication Server, with a route to a downstream Replication Server, then you must clean up the truncation point in the RSSD. Remember that a Primary Replication Server’s RSSD is a primary database, replicating schema information to other RSSDs. If your Replication Server is the only one in the domain, or if it is a replicate Replication Server, you do not need to run this last step. To clean up the truncation point, do the following:
  • · Clear the RSSD’s LTM truncation point:
dbcc settrunc(’ltm’, ’ignore’)
  • · Move the transaction log forward. Execute the following batch multiple times (40
times, for example): begin tran checkpoint commit If the number of log records does not increase, create a table in the database and drop it, and then try the above batch again.
  • · Reset the RSSD’s LTM truncation point:
dbcc settrunc(’ltm’, ’valid’)
  • · Increment the generation id number in the RSSD:
dbcc settrunc(’ltm’, ’gen_id’, <new number>) 12) Shut down your Replication Server and your RepAgents, and restart them in normal mode.
  • · To start your Replication Server in normal mode, you should have a run server file
that does not contain the –M option. If you accidentally bring the server up in single-user mode, there will be a message to this effect in the error log. Shut Replication Server down, and start it up using a correct run server file.
  • · To start a RepAgent in normal mode, run the following command:
sp_start_rep_agent pdb Avoiding Disaster ________________________________________________________________________ To avoid disasters on your RSSD database: 1) Put your RSSD on database devices that are not being used by any other databases. If you can, put them on separate disk drives. 2) Separate your RSSD data and log portions on to separate database devices and, if possible, separate disk drives. 3) Mirror your RSSD database devices. 4) Keep current backups of your RSSD database. 5) Perform transaction backups of your RSSD database to permit up-to-the-minute recovery. To avoid disasters on your stable device: 6) Mirror your stable device. Note: Sybase does not provide a mechanism to mirror your stable device. Your disk drives or operating system must do this. 7) Always have more disk space in your stable device than is strictly necessary. If replication fails, having more disk space gives you time to fix the problem before the stable device fills up. To avoid disasters on your application databases: 8) Keep current backups of your primary database. 9) If your replicate database is significantly different than your primary database, and would be difficult to rebuild, keep a current backup of it, also.

Miscellaneous suggestions

10) Document your system, so that you know the names and locations of all important objects. 11) Create rs_subcmp parameter files for every subscription you are replicating, so that you don’t need to create them during a crisis. 12) On a regular basis, run rs_subcmp against your primary and replicate tables to ensure that they truly are synchronized.