
SharePoint, index defragmentation, dm_db_index_physical_stats, physical IO, SQL Server AlwaysOn failover

How do these things come together?

I have a SQL Server AlwaysOn availability group hosting SharePoint databases, and the instance failed over every night. The failovers were caused by the SharePoint nightly "Health Analysis Job", which runs proc_DefragmentIndices against each SharePoint database and effectively killed the SQL Server. The problem with this proc is that it uses dm_db_index_physical_stats to find index fragmentation and then rebuilds the fragmented indexes, which generated huge physical disk IO and hung the SQL Server.

How did I find out? Here is the path I followed.

First of all, the SQL Server error log showed a lot of errors and failover messages, such as:
"The availability group database "sharepoint_content3" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization."

“SQL Server hosting availability group 'AG_SHAREPOINT' did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.”

Accompanying the failover messages, there were I/O problems at the same time:
"SQL Server has encountered 2476 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\datafile\content_4.mdf] in database id 41. "

So I started to monitor the "page life expectancy" and "current disk queue length" performance counters. Both went bad dramatically, as shown in the screenshot; the "current disk queue length" even reached 7000, which I had never seen in my DBA life. The D drive is where all the SharePoint data files are; the E drive holds the transaction log files.
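
If PerfMon is not handy, page life expectancy can also be read from a DMV. This is just a sketch; the counter object name prefix varies between default and named instances, hence the LIKE:

-- Read "Page life expectancy" from the DMV instead of PerfMon
SELECT [object_name], counter_name, cntr_value AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND [object_name] LIKE '%Buffer Manager%'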

So the D drive hung. Why? Was there a lot of disk IO? Use this query to find the disk IO stalls:

SELECT DB_NAME(database_id) AS dbname, io_type, file_id,
       io_stall/1000 AS io_stall_seconds,
       io_pending_ms_ticks/1000 AS io_pending_seconds,
       size_on_disk_bytes/1024/1024/1024 AS sizeInGB,
       scheduler_address
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS t1
JOIN sys.dm_io_pending_io_requests AS t2
  ON t1.file_handle = t2.io_handle

The IO stalls are on file_id = 1, which is the data file on the D drive! Do the cumulative IO statistics support that?

WITH Aggregate_IO_Statistics AS
(
    SELECT DB_NAME(database_id) AS dbname,
           CAST(SUM(num_of_bytes_read + num_of_bytes_written)/1048576 AS DECIMAL(12, 2)) AS io_in_mb,
           CAST(SUM(num_of_bytes_read)/1048576 AS DECIMAL(12, 2)) AS io_read_in_mb,
           CAST(SUM(num_of_bytes_written)/1048576 AS DECIMAL(12, 2)) AS io_write_in_mb
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS [DM_IO_STATS]
    GROUP BY database_id
)
-- select * from Aggregate_IO_Statistics
SELECT ROW_NUMBER() OVER(ORDER BY io_in_mb DESC) AS [I/O Rank], dbname,
       io_in_mb, io_read_in_mb, io_write_in_mb,
       CAST(io_in_mb / SUM(io_in_mb) OVER() * 100.0 AS DECIMAL(5, 2)) AS [I/O Percent]
FROM Aggregate_IO_Statistics
ORDER BY [I/O Rank] OPTION (RECOMPILE);


There was huge disk IO, though maybe not all of it came from the SharePoint job running at that time, so let's check the disk IO of the most recent queries:

select
SUBSTRING(c.text, (a.statement_start_offset/2)+1,
((CASE a.statement_end_offset
WHEN -1 THEN DATALENGTH(c.text)
ELSE a.statement_end_offset
END - a.statement_start_offset)/2) + 1) AS statement_text,execution_count,
last_worker_time/1000000 last_worker_time_sec,last_physical_reads,last_logical_reads,last_execution_time,last_elapsed_time/1000000 last_elapsed_time_seconds,
convert(XML, b.query_plan) ,a.*
from
sys.dm_exec_query_stats a cross apply
sys.dm_exec_text_query_plan(a.plan_handle, a.statement_start_offset, a.statement_end_offset) b
cross apply sys.dm_exec_sql_text(a.sql_handle) c
where 
--c.text like '%SELECT quotename(S.name)%' and
last_execution_time > getdate()-1 and 
last_elapsed_time > 1000000
order by last_elapsed_time desc


Confirmed by last_physical_reads!

So let's dig into the procedure proc_DefragmentIndices. It is not a complicated query; the core of it is something like this:
"
...
select ... IPS.avg_fragmentation_in_percent,  IPS.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(db_id(), null, null, null, N''SAMPLED'')   as  IPS
inner join sys.indexes  as  I
on I.object_id = IPS.object_id and I.index_id = IPS.index_id
inner join sys.objects  as  O on O.object_id = I.object_id
inner join sys.schemas  as  S on S.schema_id = O.schema_id
WHERE...
"
So why is this a problem for disk IO? dm_db_index_physical_stats is the replacement for DBCC SHOWCONTIG, and here it is run against every index in the database. Like DBCC SHOWCONTIG, it scans the table and index pages to gather fragmentation information, and that is why there was so much disk IO on the D drive! A sample of such a scan is shown below:

"
DBCC SHOWCONTIG scanning 'TimerJobHistory' table...
Table: 'TimerJobHistory' (1109578991); index ID: 1, database ID: 5
TABLE level scan performed.
- Pages Scanned................................: 49633
- Extents Scanned..............................: 6221
- Extent Switches..............................: 6253
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.22% [6205:6254]
- Logical Scan Fragmentation ..................: 0.36%
- Extent Scan Fragmentation ...................: 50.70%
- Avg. Bytes Free per Page.....................: 186.7
- Avg. Page Density (full).....................: 97.69%

"

I also checked tempdb usage; it showed some disk IO, but not nearly as much as the SharePoint content databases.
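
For reference, the tempdb check can be scoped with the same virtual file stats DMV used above (a sketch):

-- Cumulative IO on the tempdb files only
SELECT DB_NAME(database_id) AS dbname, file_id,
       num_of_reads, num_of_writes,
       (num_of_bytes_read + num_of_bytes_written)/1048576 AS io_in_mb,
       io_stall/1000 AS io_stall_seconds
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL)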

Because this is not a production environment, the SAN admin did not take it seriously enough to tune the storage, and I am OK with that. Instead, I suggested that the SharePoint admin disable this job, and that we either run the defrag manually (a sketch of that approach follows) or let me run a SQL Server built-in maintenance plan job to defragment the indexes.
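
A minimal sketch of the manual approach, assuming the commonly quoted rule of thumb (reorganize above 5% fragmentation, rebuild above 30%) and ignoring small indexes; review the generated statements before running them:

-- Generate REORGANIZE/REBUILD statements for fragmented indexes in the current database
DECLARE @sql nvarchar(max) = N'';
SELECT @sql = @sql + N'ALTER INDEX ' + QUOTENAME(I.name)
            + N' ON ' + QUOTENAME(S.name) + N'.' + QUOTENAME(O.name)
            + CASE WHEN IPS.avg_fragmentation_in_percent > 30
                   THEN N' REBUILD;' ELSE N' REORGANIZE;' END + CHAR(10)
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, N'LIMITED') AS IPS
JOIN sys.indexes  AS I ON I.object_id = IPS.object_id AND I.index_id = IPS.index_id
JOIN sys.objects  AS O ON O.object_id = I.object_id
JOIN sys.schemas  AS S ON S.schema_id = O.schema_id
WHERE IPS.avg_fragmentation_in_percent > 5
  AND IPS.page_count > 1000
  AND I.name IS NOT NULL;   -- skip heaps
PRINT @sql;                 -- inspect first, then EXEC sp_executesql @sql;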

What is the cause of the memory pressure? I believe it is the metadata for the large number of SharePoint indexes being joined against the other catalog views, with the resulting Cartesian products and aggregation. The quick fix was to increase the SQL Server memory cap from 18GB to 30GB, a somewhat arbitrary number.
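
For reference, the memory cap is the instance-level "max server memory" setting; assuming it is changed via sp_configure, 30GB is 30720MB:

-- Raise the instance memory cap to 30GB (30 * 1024 = 30720 MB)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 30720;
RECONFIGURE;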

The following queries helped to identify issues:

Top queries by elapsed time, with plan, IO, timestamp, etc.:
select
SUBSTRING(c.text, (a.statement_start_offset/2)+1,
((CASE a.statement_end_offset
WHEN -1 THEN DATALENGTH(c.text)
ELSE a.statement_end_offset
END - a.statement_start_offset)/2) + 1) AS statement_text,execution_count,
last_worker_time/1000000 last_worker_time_sec,last_physical_reads,last_logical_reads,last_execution_time,last_elapsed_time/1000000 last_elapsed_time_seconds,
convert(XML, b.query_plan) ,a.*
from
sys.dm_exec_query_stats a cross apply
sys.dm_exec_text_query_plan(a.plan_handle, a.statement_start_offset, a.statement_end_offset) b
cross apply sys.dm_exec_sql_text(a.sql_handle) c
where 
--c.text like '%SELECT quotename(S.name)%' and
last_execution_time > getdate()-1 and 
last_elapsed_time > 1000000
order by last_elapsed_time desc

Who is waiting, for what, and for how long:

SELECT dm_ws.wait_duration_ms,
dm_ws.wait_type,
dm_es.status,
dm_t.TEXT,
dm_qp.query_plan,
dm_ws.session_ID,
dm_es.cpu_time,
dm_es.memory_usage,
dm_es.reads,
dm_es.writes,
dm_es.logical_reads,
dm_es.total_elapsed_time,
dm_es.program_name,
DB_NAME(dm_r.database_id) DatabaseName,
-- Optional columns
dm_ws.blocking_session_id,
dm_r.wait_resource,
dm_es.login_name,
dm_r.command,
dm_r.last_wait_type
FROM sys.dm_os_waiting_tasks dm_ws
INNER JOIN sys.dm_exec_requests dm_r ON dm_ws.session_id = dm_r.session_id
INNER JOIN sys.dm_exec_sessions dm_es ON dm_es.session_id = dm_r.session_id
CROSS APPLY sys.dm_exec_sql_text (dm_r.sql_handle) dm_t
CROSS APPLY sys.dm_exec_query_plan (dm_r.plan_handle) dm_qp
WHERE dm_es.is_user_process = 1

GO
