SQL Server Always On Availability Groups — Setup, Configuration, Monitoring, and Failover
June 2026 | 20 min read | SQL Server, High Availability, Always On, HADR | — views
The definitive hands-on guide to SQL Server's premier high-availability feature — from architecture to T-SQL setup to day-2 operations
Always On Availability Groups (AG) is SQL Server's flagship high-availability and disaster-recovery solution. It protects one or more databases by maintaining synchronized copies on up to eight secondary replicas, provides sub-minute automatic failover within a data centre, and can simultaneously offload heavy read workloads to secondary servers. This guide walks you through everything: the architecture, prerequisites, complete T-SQL setup from scratch, the AG listener, readable secondaries, health monitoring DMVs, failover procedures, troubleshooting common failures, and day-2 maintenance practices including rolling-upgrade patching.
1. What Is an Availability Group?
An Availability Group is a logical container for a set of user databases (called availability databases) that fail over together. Every AG has exactly one primary replica — the SQL Server instance that accepts read-write connections — and one or more secondary replicas that keep synchronized copies of every database in the AG.
SQL Server ships the transaction log records generated on the primary to each secondary replica over a dedicated TCP database mirroring endpoint. The secondary applies those log records asynchronously (redo) to its local copy of the database, keeping it in sync.
Primary and Secondary Replicas
- Primary replica — accepts all DML and DDL. Sends log stream to all secondaries.
- Secondary replica — read-only (if configured as readable secondary). Applies the redo log continuously. Can be promoted to primary on failover.
Synchronous vs Asynchronous Commit
Each secondary can be configured with one of two commit modes:
| Mode | How it works | Data loss on failover | Automatic failover? | Best for |
|---|---|---|---|---|
| Synchronous commit | Primary waits for secondary to harden log before committing | Zero (RPO = 0) | Yes (if health conditions met) | Same datacenter, sub-5 ms latency |
| Asynchronous commit | Primary does not wait — secondary catches up when it can | Possible (RPO > 0) | No — manual only | DR replicas across WAN / cloud |
Automatic vs Manual Failover
Automatic failover occurs without human intervention when the primary replica becomes unavailable and the designated synchronous secondary detects the failure via the Windows Server Failover Cluster (WSFC) health detection mechanism. A maximum of two automatic-failover partners are allowed.
Manual failover is administrator-initiated (planned or unplanned) and can target any secondary regardless of commit mode.
Readable Secondaries
By default secondaries do not accept user connections. When you configure
ALLOW_CONNECTIONS = READ_ONLY on a replica, applications that include
ApplicationIntent=ReadOnly in their connection string are automatically
routed to that secondary by the AG listener. This lets you offload heavy reporting
queries without impacting the primary.
The AG Listener
The listener is a Virtual Network Name (VNN) + one or more IP addresses registered in DNS. Applications connect to the listener name; the cluster routes them to whichever server is currently primary. After a failover the listener IP moves to the new primary automatically — applications simply reconnect without any configuration change.
AG vs Other HA Technologies
| Technology | Granularity | RPO | RTO | Readable secondary | Shared storage? | Edition |
|---|---|---|---|---|---|---|
| Always On AG | Database group | 0 (sync) | < 30 s (auto) | Yes | No | Enterprise (Basic: Standard) |
| FCI (Failover Cluster Instance) | Entire SQL instance | 0 | ~1–5 min | No | Yes (SAN/S2D) | Standard / Enterprise |
| Database Mirroring | Single database | 0 (sync) | < 30 s | Snapshot only | No | Deprecated in 2012+ |
| Log Shipping | Single database | Minutes | Manual | Standby mode only | No | All editions |
| Replication | Table/article | Seconds–minutes | N/A | Yes (subscriber reads) | No | Standard / Enterprise |
2. Prerequisites
Windows Server Failover Clustering (WSFC)
Always On AGs require a Windows Server Failover Cluster for health detection and automatic failover. Every node that hosts a replica must be a member of the same WSFC. The cluster itself does not need shared storage — AGs use shared-nothing architecture.
SQL Server Edition and Version
- Enterprise Edition — required for full AGs (up to 8 secondaries, readable secondaries, multiple databases per AG).
- Standard Edition 2016+ — supports Basic Availability Groups (1 database per AG, no readable secondary, 2 replicas max).
- Same SQL Server version on all replicas — patch level can differ temporarily during rolling upgrades but the major version must match.
- Same collation on all instances — the server-level collation of the primary and all secondaries must be identical.
Network and Port Requirements
- TCP port 5022 open between all replicas (the default database mirroring endpoint port — choose any unused port).
- SQL Server service port 1433 open to application servers.
- AG listener IP must be reachable on the network segment (open firewall rules).
- All replicas need static IP addresses or reliable DNS resolution between nodes.
Hadr Feature Must Be Enabled
Always On is disabled by default. Enable it on every SQL Server instance that will host a replica:
-- Run in SQL Server Configuration Manager (GUI) OR via PowerShell:
-- Enable-SqlAlwaysOn -ServerInstance "NODE1\MSSQLSERVER" -Force
-- Verify it is enabled:
SELECT SERVERPROPERTY('IsHadrEnabled') AS IsHadrEnabled;
-- Returns 1 when enabled, 0 when disabled
Databases Must Be in Full Recovery Model
-- Check current recovery model
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name IN ('SalesDB','HRDatabase','InventoryDB');
-- Switch to FULL recovery model (required for AG)
ALTER DATABASE SalesDB SET RECOVERY FULL;
ALTER DATABASE HRDatabase SET RECOVERY FULL;
-- Take a FULL backup before adding to AG
-- (secondary cannot join without a backup chain)
BACKUP DATABASE SalesDB
TO DISK = N'\\BACKUPSHARE\SalesDB_full.bak'
WITH COMPRESSION, STATS = 10;
BACKUP DATABASE HRDatabase
TO DISK = N'\\BACKUPSHARE\HRDatabase_full.bak'
WITH COMPRESSION, STATS = 10;
3. Step-by-Step AG Setup with T-SQL
The following scripts assume a two-node setup:
NODE1 (primary) and NODE2 (synchronous secondary),
both joined to a WSFC named WSFC-CLUSTER.
Run each block on the indicated server.
Step 3.1 — Enable Always On HADR on Both Nodes
-- Run on NODE1 via SQL Server Configuration Manager (GUI) OR:
-- PowerShell on NODE1:
-- Enable-SqlAlwaysOn -ServerInstance "NODE1" -Force
-- (restarts the SQL Server service)
-- PowerShell on NODE2:
-- Enable-SqlAlwaysOn -ServerInstance "NODE2" -Force
-- Verify on each node after restart:
SELECT SERVERPROPERTY('IsHadrEnabled') AS IsHadrEnabled;
-- Expected: 1
Step 3.2 — Create the Database Mirroring Endpoint (Both Nodes)
The endpoint is the TCP channel over which log records and control messages flow between replicas. Run this script on every instance that will host a replica.
-- Run on NODE1
CREATE ENDPOINT [Hadr_endpoint]
STATE = STARTED
AS TCP (
LISTENER_PORT = 5022, -- choose any unused port
LISTENER_IP = ALL -- listen on all IPs; restrict to specific IP if needed
)
FOR DATA_MIRRORING (
ROLE = ALL, -- can act as principal or mirror
AUTHENTICATION = WINDOWS NEGOTIATEΒ,
ENCRYPTION = REQUIRED ALGORITHM AES
);
-- Verify endpoint is running
SELECT name, state_desc, type_desc, role_desc
FROM sys.database_mirroring_endpoints;
------------------------------------------------------
-- Run the IDENTICAL script on NODE2 as well
------------------------------------------------------
GRANT CONNECT ON ENDPOINT::Hadr_endpoint TO [DOMAIN\SqlSvcNode2];
Step 3.3 — Create the Availability Group (Run on NODE1 — Primary)
-- Run on NODE1 (will become primary replica)
CREATE AVAILABILITY GROUP [AG_Sales]
WITH (
AUTOMATED_BACKUP_PREFERENCE = SECONDARY, -- prefer secondary for backups
FAILURE_CONDITION_LEVEL = 3, -- 1=very conservative .. 5=most aggressive
HEALTH_CHECK_TIMEOUT = 30000, -- ms before declaring failure (default 30s)
DB_FAILOVER = ON, -- fail over AG if any database goes DOWN
DTC_SUPPORT = NONE -- set PER_DB for distributed transactions
)
FOR DATABASE
[SalesDB], -- add every database that should fail over together
[HRDatabase]
REPLICA ON
-- PRIMARY REPLICA (this server)
N'NODE1' WITH (
ENDPOINT_URL = N'TCP://NODE1.corp.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC, -- SQL 2016+: seeds secondary automatically
SECONDARY_ROLE (
ALLOW_CONNECTIONS = NO -- primary role: not readable as secondary
),
PRIMARY_ROLE (
ALLOW_CONNECTIONS = ALL
),
SESSION_TIMEOUT = 10 -- seconds before marking replica as failed
),
-- SECONDARY REPLICA
N'NODE2' WITH (
ENDPOINT_URL = N'TCP://NODE2.corp.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC,
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY -- readable secondary
),
PRIMARY_ROLE (
ALLOW_CONNECTIONS = ALL
),
SESSION_TIMEOUT = 10
);
-- Verify AG was created
SELECT ag.name, ag.automated_backup_preference_desc,
ag.failure_condition_level, ag.health_check_timeout
FROM sys.availability_groups ag;
-- Check replica configuration
SELECT ar.replica_server_name, ar.availability_mode_desc,
ar.failover_mode_desc, ar.seeding_mode_desc,
ars.role_desc, ars.operational_state_desc, ars.connected_state_desc
FROM sys.availability_replicas ar
JOIN sys.dm_hadr_availability_replica_states ars
ON ar.replica_id = ars.replica_id
WHERE ar.group_id = (SELECT group_id FROM sys.availability_groups WHERE name = 'AG_Sales');
Step 3.4 — Join the Secondary Replica (Run on NODE2)
-- Run on NODE2
ALTER AVAILABILITY GROUP [AG_Sales] JOIN;
-- With SEEDING_MODE = AUTOMATIC, SQL Server seeds databases automatically.
-- Monitor seeding progress:
SELECT start_time, completion_time, is_source, current_state,
transferred_size_bytes, estimate_total_size
FROM sys.dm_hadr_physical_seeding_stats;
-- Alternatively (for SEEDING_MODE = MANUAL):
-- 1. Restore a copy of each database to NODE2 WITH NORECOVERY
-- 2. Then run on NODE2:
ALTER DATABASE [SalesDB] SET HADR AVAILABILITY GROUP = [AG_Sales];
ALTER DATABASE [HRDatabase] SET HADR AVAILABILITY GROUP = [AG_Sales];
SEEDING_MODE = AUTOMATIC,
the primary streams a live backup directly to the secondary over the mirroring endpoint —
no manual backup/restore required. Grant the SQL Server service account on the secondary
the ALTER ANY AVAILABILITY GROUP permission:ALTER AVAILABILITY GROUP [AG_Sales] GRANT CREATE ANY DATABASE;
(run on NODE2 after joining).
Step 3.5 — Verify the AG is Healthy
-- Run on PRIMARY (NODE1)
-- Overall AG health
SELECT ag.name AS ag_name,
ags.primary_replica AS current_primary,
ags.synchronization_health_desc AS ag_sync_health,
ags.operational_state_desc AS operational_state
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags
ON ag.group_id = ags.group_id;
-- Per-database synchronization state
SELECT DB_NAME(drs.database_id) AS database_name,
ar.replica_server_name,
drs.synchronization_state_desc,
drs.synchronization_health_desc,
drs.is_suspended,
drs.log_send_queue_size,
drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id;
-- All databases should show: SYNCHRONIZED / HEALTHY
4. The AG Listener
The listener is the single endpoint your applications should always connect to. It consists of a Virtual Network Name (VNN) and one or more IP addresses (one per subnet). After any failover the cluster brings the listener IP online on the new primary node — applications reconnect to the same name transparently.
Create the Listener (Run on PRIMARY)
-- Run on NODE1 (primary)
ALTER AVAILABILITY GROUP [AG_Sales]
ADD LISTENER N'AG-SALES-LN' (
WITH IP (
( N'10.10.1.50', N'255.255.255.0' ) -- static IP on subnet 1
-- add a second IP for multi-subnet AG:
-- , ( N'10.20.1.50', N'255.255.255.0' )
),
PORT = 1433 -- standard SQL port; use different port if 1433 is taken
);
-- Verify listener was created
SELECT agl.dns_name, agl.port,
agla.ip_address, agla.ip_subnet_mask, agla.state_desc
FROM sys.availability_group_listeners agl
JOIN sys.availability_group_listener_ip_addresses agla
ON agl.listener_id = agla.listener_id;
Connection String — Applications
-- ADO.NET connection string using the listener
-- "Data Source=AG-SALES-LN,1433;Initial Catalog=SalesDB;
-- Integrated Security=SSPI;MultiSubnetFailover=True"
-- JDBC (Java) connection string:
-- jdbc:sqlserver://AG-SALES-LN:1433;databaseName=SalesDB;
-- multiSubnetFailover=true;integratedSecurity=true;
-- Read-only (routes to readable secondary via listener):
-- "Data Source=AG-SALES-LN,1433;Initial Catalog=SalesDB;
-- Integrated Security=SSPI;MultiSubnetFailover=True;
-- ApplicationIntent=ReadOnly"
RegisterAllProvidersIP Cluster Setting
-- PowerShell — verify RegisterAllProvidersIP setting
Import-Module FailoverClusters
Get-ClusterResource "AG-SALES-LN" | Get-ClusterParameter RegisterAllProvidersIP
-- For multi-subnet AGs, RegisterAllProvidersIP should be 1 (default).
-- For single-subnet AGs, set to 0 and lower HostRecordTTL for faster DNS failover:
Get-ClusterResource "AG-SALES-LN" | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource "AG-SALES-LN" | Set-ClusterParameter HostRecordTTL 120
5. Readable Secondaries
A readable secondary allows reporting queries, SSRS reports, and ETL reads to run on a secondary replica — completely independent of the primary's I/O. This is one of the most valuable features of an AG in high-traffic environments.
Configure Read-Only Routing
Read-only routing tells the listener which secondary to send ApplicationIntent=ReadOnly
connections to. You must define a routing URL for each replica and a routing list on
the primary.
-- Run on PRIMARY (NODE1)
-- Step 1: Set the read-only routing URL for each replica
ALTER AVAILABILITY GROUP [AG_Sales]
MODIFY REPLICA ON N'NODE2' WITH (
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://NODE2.corp.local:1433'
)
);
-- Also set the routing URL for NODE1 (used when NODE2 is primary):
ALTER AVAILABILITY GROUP [AG_Sales]
MODIFY REPLICA ON N'NODE1' WITH (
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://NODE1.corp.local:1433'
)
);
-- Step 2: Set the routing list on the PRIMARY replica
-- (ordered list — first available replica in the list receives read connections)
ALTER AVAILABILITY GROUP [AG_Sales]
MODIFY REPLICA ON N'NODE1' WITH (
PRIMARY_ROLE (
READ_ONLY_ROUTING_LIST = (N'NODE2', N'NODE1')
-- Try NODE2 first; fall back to NODE1 if NODE2 is unavailable
)
);
-- Also set a routing list for when NODE2 becomes primary:
ALTER AVAILABILITY GROUP [AG_Sales]
MODIFY REPLICA ON N'NODE2' WITH (
PRIMARY_ROLE (
READ_ONLY_ROUTING_LIST = (N'NODE1', N'NODE2')
)
);
-- Verify routing configuration
SELECT ar.replica_server_name,
ar.secondary_role_allow_connections_desc,
ar.read_only_routing_url,
ar.primary_role_allow_connections_desc
FROM sys.availability_replicas ar
WHERE ar.group_id = (SELECT group_id FROM sys.availability_groups WHERE name = 'AG_Sales');
Testing the Read-Only Route
-- Connect via the listener with ApplicationIntent=ReadOnly
-- and verify you land on the secondary:
SELECT @@SERVERNAME AS ServerName,
DATABASEPROPERTYEX(DB_NAME(),'Updateability') AS Updateability;
-- Expected: ServerName = NODE2, Updateability = READ_ONLY
-- This query returns an error on a readable secondary (confirms read-only):
CREATE TABLE Test_RO_Check (id INT);
-- Msg 3906, The database is in read-only access mode.
6. Monitoring AG Health
SQL Server exposes a rich set of DMVs and catalog views for monitoring availability group health. Understanding these views is essential for proactive monitoring and rapid incident response.
Key DMVs and Catalog Views
| Object | What it shows |
|---|---|
sys.availability_groups | AG names, backup preference, failure condition level, health check timeout |
sys.availability_replicas | Replica configuration: server name, endpoint URL, availability mode, failover mode |
sys.dm_hadr_availability_group_states | Live AG state: primary replica, synchronization health (HEALTHY/PARTIALLY_HEALTHY/NOT_HEALTHY) |
sys.dm_hadr_availability_replica_states | Live per-replica state: role, operational state, connected state, recovery health |
sys.dm_hadr_database_replica_states | Per-database per-replica metrics: sync state, log send queue, redo queue, last commit time |
sys.dm_hadr_database_replica_cluster_states | Cluster-level database join state and synchronization health visible from any node |
Complete AG Health Dashboard Query
-- Run on PRIMARY — comprehensive AG health dashboard
SELECT
ag.name AS availability_group,
ags.primary_replica AS primary_replica,
ags.synchronization_health_desc AS ag_sync_health,
ar.replica_server_name,
ar.availability_mode_desc,
ar.failover_mode_desc,
ars.role_desc AS replica_role,
ars.operational_state_desc,
ars.connected_state_desc,
ars.synchronization_health_desc AS replica_sync_health,
ars.recovery_health_desc,
DB_NAME(drs.database_id) AS database_name,
drs.synchronization_state_desc AS db_sync_state,
drs.synchronization_health_desc AS db_sync_health,
drs.is_suspended,
drs.suspend_reason_desc,
drs.log_send_queue_size AS log_send_queue_KB,
drs.log_send_rate AS log_send_rate_KB_s,
drs.redo_queue_size AS redo_queue_KB,
drs.redo_rate AS redo_rate_KB_s,
-- Estimated catch-up time in seconds
CASE WHEN drs.redo_rate > 0
THEN drs.redo_queue_size / drs.redo_rate
ELSE NULL
END AS est_redo_catchup_sec,
drs.last_commit_time,
-- Data loss exposure: how far behind is the secondary?
DATEDIFF(SECOND, drs.last_commit_time,
(SELECT MAX(last_commit_time)
FROM sys.dm_hadr_database_replica_states
WHERE database_id = drs.database_id
AND is_primary_replica = 1)) AS lag_behind_primary_sec
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
ORDER BY ag.name, ar.replica_server_name, DB_NAME(drs.database_id);
Key Metrics to Alert On
| Metric | Column | Alert threshold |
|---|---|---|
| Log send queue | log_send_queue_size | > 50 MB (network or IO bottleneck on secondary) |
| Redo queue | redo_queue_size | > 100 MB (secondary redo thread falling behind) |
| AG sync health | synchronization_health_desc | NOT_HEALTHY or PARTIALLY_HEALTHY |
| Data movement suspended | is_suspended = 1 | Any database suspended |
| Commit lag | lag_behind_primary_sec | > 30 sec on synchronous replica |
| Connected state | connected_state_desc | DISCONNECTED |
AG Dashboard in SSMS
SSMS has a built-in graphical dashboard: right-click the AG in Object Explorer → Show Dashboard. It displays a real-time synchronization health timeline, replica states, and database synchronization status. Useful for a quick visual check, though the DMV queries above are better for automated monitoring.
-- Check for any AGs in a non-healthy state (good for monitoring job):
SELECT ag.name AS availability_group, ags.synchronization_health_desc
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
WHERE ags.synchronization_health_desc <> 'HEALTHY';
-- Alert if this returns any rows
7. Manual and Automatic Failover
Automatic Failover
Automatic failover requires: (1) the failing replica is in synchronous commit mode, (2) the secondary is in a SYNCHRONIZED state, and (3) the WSFC health detection fires. When the primary goes offline and the quorum sees the failure, the cluster promotes the designated synchronous secondary in roughly 20–30 seconds — no DBA action needed.
Planned Manual Failover (Zero Data Loss)
Use this for planned maintenance: patching the primary, hardware upgrades, etc. The secondary must be SYNCHRONIZED.
-- Pre-check: confirm secondary is synchronized
SELECT ar.replica_server_name,
drs.synchronization_state_desc,
drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
WHERE ar.group_id = (SELECT group_id FROM sys.availability_groups WHERE name = 'AG_Sales')
AND drs.is_primary_replica = 0;
-- synchronization_state_desc must be SYNCHRONIZED
-- Perform planned failover from the SECONDARY node (NODE2)
-- Connect to NODE2 and run:
ALTER AVAILABILITY GROUP [AG_Sales] FAILOVER;
-- After failover: verify NODE2 is now primary
SELECT primary_replica FROM sys.dm_hadr_availability_group_states
WHERE group_id = (SELECT group_id FROM sys.availability_groups WHERE name = 'AG_Sales');
Forced Failover (Potential Data Loss)
Use FORCE_FAILOVER_ALLOW_DATA_LOSS only when the primary is completely
unavailable and you need to restore service immediately. This is a last-resort
operation that accepts possible data loss.
-- STEP 1: Assess data loss exposure BEFORE forcing failover
-- Run on the target secondary (NODE2):
SELECT
DB_NAME(drs.database_id) AS database_name,
drs.last_commit_time AS secondary_last_commit,
drs.redo_queue_size AS outstanding_redo_KB,
drs.log_send_queue_size AS unsent_log_KB
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_primary_replica = 0;
-- unsent_log_KB > 0 means transactions on primary are not yet on this secondary
-- Those transactions WILL be lost on a forced failover
-- STEP 2: If you accept the data loss, run on NODE2:
ALTER AVAILABILITY GROUP [AG_Sales] FORCE_FAILOVER_ALLOW_DATA_LOSS;
-- STEP 3: After forced failover, verify databases are accessible:
SELECT name, state_desc FROM sys.databases WHERE name IN ('SalesDB','HRDatabase');
-- STEP 4: If the old primary comes back, it will be in RESOLVING state.
-- Re-join it as a secondary:
-- (on the old primary, after its SQL service restarts)
ALTER AVAILABILITY GROUP [AG_Sales] JOIN;
ALTER DATABASE [SalesDB] SET HADR AVAILABILITY GROUP = [AG_Sales];
ALTER DATABASE [HRDatabase] SET HADR AVAILABILITY GROUP = [AG_Sales];
Failover with SSMS Wizard
For planned failovers, SSMS provides a wizard: right-click the AG in Object Explorer
→ Failover…. The wizard confirms synchronization health, estimates
data loss, and walks you through the failover steps. It is equivalent to running
ALTER AVAILABILITY GROUP … FAILOVER but with a safety checklist.
8. Common AG Problems and Fixes
Problem 1: Data Movement Suspended
-- Identify suspended databases
SELECT DB_NAME(database_id) AS db_name,
replica_id, is_suspended, suspend_reason_desc
FROM sys.dm_hadr_database_replica_states
WHERE is_suspended = 1;
-- Resume data movement
ALTER DATABASE [SalesDB] SET HADR RESUME;
-- Common suspend reasons:
-- USER_SUSPENDED → DBA manually suspended (ALTER DATABASE … SET HADR SUSPEND)
-- SUSPEND_FROM_REDO → redo thread hit an error (check SQL error log)
-- SUSPEND_FROM_APPLY → apply thread error on secondary
Problem 2: Secondary Falling Behind (Large Redo Queue)
-- Identify redo queue size per database per replica
SELECT ar.replica_server_name,
DB_NAME(drs.database_id) AS database_name,
drs.redo_queue_size AS redo_queue_KB,
drs.redo_rate AS redo_rate_KB_s,
CASE WHEN drs.redo_rate > 0
THEN drs.redo_queue_size / drs.redo_rate
ELSE NULL
END AS est_catchup_sec
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
WHERE drs.is_primary_replica = 0
ORDER BY drs.redo_queue_size DESC;
-- Root causes:
-- 1. High write workload on primary (large transactions, bulk loads)
-- 2. Secondary I/O bottleneck (slow disk on secondary)
-- 3. Long-running read queries on readable secondary blocking redo thread
-- Fix for readable-secondary redo blocking (SQL 2019+):
-- Enable accelerated database recovery (ADR) to reduce redo thread stalls:
ALTER DATABASE [SalesDB] SET ACCELERATED_DATABASE_RECOVERY = ON;
Problem 3: Listener Not Resolving
-- Test listener from application server:
-- nslookup AG-SALES-LN
-- Should return the current primary's IP address
-- Verify listener state in cluster:
-- PowerShell: Get-ClusterResource | Where-Object {$_.Name -like "*AG-SALES*"}
-- Check listener IP addresses in SQL Server:
SELECT agl.dns_name, agl.port, agla.ip_address, agla.state_desc
FROM sys.availability_group_listeners agl
JOIN sys.availability_group_listener_ip_addresses agla
ON agl.listener_id = agla.listener_id;
-- state_desc should be ONLINE for the current primary's IP
-- Common fixes:
-- 1. Listener IP not in same subnet as client → check IP configuration
-- 2. Firewall blocking port 1433 to listener IP → open firewall rule
-- 3. DNS TTL too high → lower HostRecordTTL on the cluster resource
-- 4. SQL Browser not running → needed for named instances (not default instance)
Problem 4: Synchronization Stuck — NOT_SYNCHRONIZING
-- Check replica connection state
SELECT ar.replica_server_name, ars.connected_state_desc, ars.operational_state_desc
FROM sys.availability_replicas ar
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id;
-- If DISCONNECTED:
-- 1. Check SQL Server error log on secondary for endpoint errors
-- 2. Verify endpoint is started: SELECT state_desc FROM sys.database_mirroring_endpoints
-- 3. Test TCP connectivity: Test-NetConnection -ComputerName NODE2 -Port 5022
-- 4. Check service account permissions on endpoint
-- Restart the endpoint if needed:
ALTER ENDPOINT [Hadr_endpoint] STATE = STOPPED;
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;
Problem 5: Certificate Expiry on Mirroring Endpoint
If you use certificate-based authentication on the mirroring endpoint (common in workgroup clusters or cross-domain setups), certificates expire. Monitor with:
-- Check certificates used for mirroring endpoints
SELECT c.name, c.expiry_date, c.subject, c.pvt_key_encryption_type_desc
FROM sys.certificates c
JOIN sys.database_mirroring_endpoints ep
ON ep.certificate_id = c.certificate_id;
-- Alert if expiry_date < DATEADD(DAY, 30, GETDATE())
-- To renew: create a new certificate, update the endpoint, and distribute
-- the public key to all partner instances
9. Maintenance with an Availability Group
Rolling Upgrade / Patching Pattern
Always patch secondaries first, then fail over to an upgraded secondary and patch the (now) former primary. This minimises downtime to a single planned failover.
- Patch NODE2 (secondary) — apply CU/SP and restart SQL Server.
- Verify NODE2 re-synchronises after restart (watch
synchronization_state_desc). - Perform a planned manual failover from NODE2 (
ALTER AVAILABILITY GROUP … FAILOVER) — NODE2 becomes primary. - Patch NODE1 (now secondary) — apply CU/SP and restart.
- Verify NODE1 re-synchronises. Optionally fail back to NODE1 if desired.
Index Rebuilds and Replication
-- Index rebuilds on the PRIMARY replicate to secondaries automatically.
-- The redo thread applies all DDL, DML, and index operations.
-- You do NOT rebuild indexes directly on the secondary.
-- For ONLINE index rebuilds (reduces blocking on primary):
ALTER INDEX [IX_Sales_OrderDate] ON [dbo].[SalesOrders]
REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON, MAXDOP = 4);
-- Check that the redo queue doesn't spike after a large index rebuild:
SELECT ar.replica_server_name, drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
WHERE drs.is_primary_replica = 0;
DBCC CHECKDB on a Readable Secondary
-- Run DBCC CHECKDB on the readable secondary to offload I/O from primary
-- Connect to NODE2 (secondary) with ApplicationIntent=ReadOnly or directly
DBCC CHECKDB ('SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
-- Benefits:
-- 1. Does not compete with production I/O on primary
-- 2. Reads data pages locally (no network transfer)
-- 3. Detects page-level corruption on the secondary's copy
-- Note: DBCC checks the secondary's copy — corruption on primary
-- that was already replicated will show up here too.
Backup Preference — Prefer Secondary
-- AG was created with AUTOMATED_BACKUP_PREFERENCE = SECONDARY
-- Ola Hallengren / native backup jobs should respect this.
-- Check current preference:
SELECT name, automated_backup_preference_desc
FROM sys.availability_groups;
-- In your backup job, use sys.fn_hadr_backup_is_preferred_replica()
-- to dynamically decide which replica runs the backup:
IF sys.fn_hadr_backup_is_preferred_replica('SalesDB') = 1
BEGIN
BACKUP DATABASE [SalesDB]
TO DISK = N'\\BACKUPSHARE\SalesDB_diff.bak'
WITH DIFFERENTIAL, COMPRESSION, STATS = 10;
END
-- This ensures only the preferred replica (usually the secondary) runs the backup job,
-- even though the job is scheduled on all nodes.
10. Always On vs Basic Availability Group (Standard Edition)
SQL Server 2016 Standard Edition introduced the Basic Availability Group — a limited version of AG that provides high availability without the Enterprise Edition licence cost.
| Feature | Basic AG (Standard Edition) | Full AG (Enterprise Edition) |
|---|---|---|
| Max replicas | 2 (primary + 1 secondary) | 9 (primary + 8 secondaries) |
| Databases per AG | 1 | Unlimited |
| Readable secondary | No | Yes |
| Read-only routing | No | Yes |
| Backup on secondary | No | Yes |
| SSAS / SSRS integration | No | Yes |
| Distributed AG | No | Yes |
| Automatic seeding | Yes (2016+) | Yes |
| AG listener | Yes | Yes |
| Automatic failover | Yes | Yes |
-- Create a Basic AG on Standard Edition
-- Syntax is identical to full AG — SQL Server enforces limits automatically
CREATE AVAILABILITY GROUP [BasicAG_Finance]
WITH (BASIC) -- ← this keyword enables Basic AG mode
FOR DATABASE [FinanceDB]
REPLICA ON
N'NODE1' WITH (
ENDPOINT_URL = N'TCP://NODE1.corp.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC
),
N'NODE2' WITH (
ENDPOINT_URL = N'TCP://NODE2.corp.local:5022',
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
FAILOVER_MODE = AUTOMATIC,
SEEDING_MODE = AUTOMATIC
);
-- Note: attempting to add SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY)
-- on a Basic AG will fail with an error.
Frequently Asked Questions
Was this guide helpful?
Continue Learning — SQL Server
Quick Navigation
AG Quick Facts
- Max replicas: 9 (1 primary + 8 secondary)
- Auto failover: Synchronous commit only
- Data loss: 0 with synchronous commit
- Endpoint port: 5022 (default)
- Edition: Enterprise (full) / Standard (Basic)
- Requires WSFC: Yes (or workgroup cluster)
- Shared storage: Not required
Related SQL Server Articles
Article Info
- Level: Intermediate / Advanced
- Read time: ~20 min
- Edition: SQL Server Enterprise
- Versions: 2012 – 2022
- Updated: June 2026