DB2数据库内存耗尽故障处理经验分享
问题现象
P595服务器中有一个DB2数据库,在连接时,系统PAGING SPACE资源耗尽,从而宕机。
问题分析
问题初步分析
1.如果非计算内存太多,计算内存太少,有可能是这个问题导致计算内存不够用,使用到PAGING SPACE。可以使用vmo -L maxperm%命令来检查文件系统缓存参数的设置
2.如果maxperm%设置为20%左右(对于DB2环境来说,比较适合),那么是不是DB2参数/对象本身分配的内存过多造成计算内存不够用。需要仔细检查实例参数、数据库参数是否设置不当。
3.如果上述原因排除,那么分析db2mtrk得到的数据,分析DB2数据库在运行过程中是否分配了过多的计算内存(内存参数、缓冲池可能使用了AUTOMATIC设置;在这种情况下,内存是按需分配的,理论上可以无限大),如果是,那么分析是由于DB2实例、数据库占用的内存过多,还是代理进程(应用程序)占用的内存过多(比如使用动态数组,在死循环中不停的填充动态数组,导致动态数组无限增长,无论多少内存都能耗光),如果存在这个问题,那么继续分析.
可以使用如下命令跟踪DB2数据库使用内存的情况。
db2mtrk -i -d -p -r 300 500
>> db2mtrk.out
#每5分钟收集一次,收集500次
4.如果上述原因排除,那么结合ipcs得到的数据,分析AIX操作系统中除了DB2使用的内存段之外,还有没有其他的非DB2进程在使用大内存段占用过多的内存,如果有,那么继续分析是否是它导致的.
服务器内存使用
内存:8G
文件系统缓存设置:maxperm%=30
操作系统约消耗1G内存
所以留给DB2(实例、数据库、代理)使用的内存约为:8G-8G*30%-1G=4.5G
检查实例、数据库参数
Database Configuration for Database XXXX Database configuration release level = 0x0b00 Database release level = 0x0b00 Database territory = CN Database code page = 1386 Database code set = GBK Database country/region code = 86 Database collating sequence = UNIQUE Alternate collating sequence (ALT_COLLATE) = Database page size = 4096 Dynamic SQL Query management (DYN_QUERY_MGMT) = DISABLE Discovery support for this database (DISCOVER_DB) = ENABLE Restrict access = NO Default query optimization class (DFT_QUERYOPT) = 9 Degree of parallelism (DFT_DEGREE) = 1 Continue upon arithmetic exceptions (DFT_SQLMATHWARN) = NO Default refresh age (DFT_REFRESH_AGE) = 0 Default maintained table types for opt (DFT_MTTB_TYPES) = SYSTEM Number of frequent values retained (NUM_FREQVALUES) = 10 Number of quantiles retained (NUM_QUANTILES) = 20 Backup pending = NO Database is consistent = YES Rollforward pending = NO Restore pending = NO Multi-page file allocation enabled = YES Log retain for recovery status = NO User exit for logging status = NO Self tuning memory (SELF_TUNING_MEM) = ON Size of database shared memory (4KB) (DATABASE_MEMORY) = AUTOMATIC Database memory threshold (DB_MEM_THRESH) = 10 Max storage for lock list (4KB) (LOCKLIST) = AUTOMATIC Percent. of lock lists per application (MAXLOCKS) = AUTOMATIC Package cache size (4KB) (PCKCACHESZ) = AUTOMATIC Sort heap thres for shared sorts (4KB) (SHEAPTHRES_SHR) = AUTOMATIC Sort list heap (4KB) (SORTHEAP) = AUTOMATIC Database heap (4KB) (DBHEAP) = 4137 Catalog cache size (4KB) (CATALOGCACHE_SZ) = 579 Log buffer size (4KB) (LOGBUFSZ) = 108 Utilities heap size (4KB) (UTIL_HEAP_SZ) = 524288 Buffer pool size (pages) (BUFFPAGE) = 1000 Max size of appl. group mem set (4KB) (APPGROUP_MEM_SZ) = 40933 Percent of mem for appl. group heap (GROUPHEAP_RATIO) = 70 Max appl. control heap size (4KB) (APP_CTL_HEAP_SZ) = 128 SQL statement heap (4KB) (STMTHEAP) = 4096 Default application heap (4KB) (APPLHEAPSZ) = 1024 Statistics heap size (4KB) (STAT_HEAP_SZ) = 4384 Interval for checking deadlock (ms) (DLCHKTIME) = 10000 Lock timeout (sec) (LOCKTIMEOUT) = 11 Changed pages threshold (CHNGPGS_THRESH) = 80 Number of asynchronous page cleaners (NUM_IOCLEANERS) = AUTOMATIC Number of I/O servers (NUM_IOSERVERS) = AUTOMATIC Index sort flag (INDEXSORT) = YES Sequential detect flag (SEQDETECT) = YES Default prefetch size (pages) (DFT_PREFETCH_SZ) = AUTOMATIC Track modified pages (TRACKMOD) = OFF Default number of containers = 1 Default tablespace extentsize (pages) (DFT_EXTENT_SZ) = 32 Max number of active applications (MAXAPPLS) = AUTOMATIC Average number of active applications (AVG_APPLS) = AUTOMATIC Max DB files open per application (MAXFILOP) = 64 Log file size (4KB) (LOGFILSIZ) = 40960 Number of primary log files (LOGPRIMARY) = 50 Number of secondary log files (LOGSECOND) = 50 Changed path to log files (NEWLOGPATH) = Path to log files = Overflow log path (OVERFLOWLOGPATH) = Mirror log path (MIRRORLOGPATH) = First active log file = Block log on disk full (BLK_LOG_DSK_FUL) = NO Percent max primary log space by transaction (MAX_LOG) = 0 Num. of active log files for 1 active UOW(NUM_LOG_SPAN) = 0 Group commit count (MINCOMMIT) = 1 Percent log file reclaimed before soft chckpt (SOFTMAX) = 2000 Log retain for recovery enabled (LOGRETAIN) = OFF User exit for logging enabled (USEREXIT) = OFF HADR database role = STANDARD HADR local host name (HADR_LOCAL_HOST) = HADR local service name (HADR_LOCAL_SVC) = HADR remote host name (HADR_REMOTE_HOST) = HADR remote service name (HADR_REMOTE_SVC) = HADR instance name of remote server (HADR_REMOTE_INST) = HADR timeout value (HADR_TIMEOUT) = 120 HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC First log archive method (LOGARCHMETH1) = OFF Options for logarchmeth1 (LOGARCHOPT1) = Second log archive method (LOGARCHMETH2) = OFF Options for logarchmeth2 (LOGARCHOPT2) = Failover log archive path (FAILARCHPATH) = Number of log archive retries on error (NUMARCHRETRY) = 5 Log archive retry Delay (secs) (ARCHRETRYDELAY) = 20 Vendor options (VENDOROPT) = Auto restart enabled (AUTORESTART) = ON Index re-creation time and redo index build (INDEXREC) = SYSTEM (RESTART) Log pages during index build (LOGINDEXBUILD) = OFF Default number of loadrec sessions (DFT_LOADREC_SES) = 1 Number of database backups to retain (NUM_DB_BACKUPS) = 12 Recovery history retention (days) (REC_HIS_RETENTN) = 366 TSM management class (TSM_MGMTCLASS) = TSM node name (TSM_NODENAME) = TSM owner (TSM_OWNER) = TSM password (TSM_PASSWORD) = Automatic maintenance (AUTO_MAINT) = ON Automatic database backup (AUTO_DB_BACKUP) = OFF Automatic table maintenance (AUTO_TBL_MAINT) = ON Automatic runstats (AUTO_RUNSTATS) = ON Automatic statistics profiling (AUTO_STATS_PROF) = OFF Automatic profile updates (AUTO_PROF_UPD) = OFF Automatic reorganization (AUTO_REORG) = OFF Database Manager Configuration Node type = Enterprise Server Edition with local and remote clients Database manager configuration release level = 0x0b00 CPU speed (millisec/instruction) (CPUSPEED) = 2.676617e-07 Communications bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+02 Max number of concurrently active databases (NUMDB) = 8 Federated Database System Support (FEDERATED) = NO Transaction processor monitor name (TP_MON_NAME) = Default charge-back account (DFT_ACCOUNT_STR) = Java Development Kit installation path (JDK_PATH) = /home/sqllib/java/jdk64 Diagnostic error capture level (DIAGLEVEL) = 3 Notify Level (NOTIFYLEVEL) = 3 Diagnostic data directory path (DIAGPATH) = /home/sqllib/db2dump Default database monitor switches Buffer pool (DFT_MON_BUFPOOL) = ON Lock (DFT_MON_LOCK) = ON Sort (DFT_MON_SORT) = ON Statement (DFT_MON_STMT) = ON Table (DFT_MON_TABLE) = ON Timestamp (DFT_MON_TIMESTAMP) = ON Unit of work (DFT_MON_UOW) = OFF Monitor health of instance and databases (HEALTH_MON) = ON SYSADM group name (SYSADM_GROUP) = DB2GRP1 SYSCTRL group name (SYSCTRL_GROUP) = SYSMAINT group name (SYSMAINT_GROUP) = SYSMON group name (SYSMON_GROUP) = Client Userid-Password Plugin (CLNT_PW_PLUGIN) = Client Kerberos Plugin (CLNT_KRB_PLUGIN) = Group Plugin (GROUP_PLUGIN) = GSS Plugin for Local Authorization (LOCAL_GSSPLUGIN) = Server Plugin Mode (SRV_PLUGIN_MODE) = UNFENCED Server List of GSS Plugins (SRVCON_GSSPLUGIN_LIST) = Server Userid-Password Plugin (SRVCON_PW_PLUGIN) = Server Connection Authentication (SRVCON_AUTH) = NOT_SPECIFIED Database manager authentication (AUTHENTICATION) = SERVER Cataloging allowed without authority (CATALOG_NOAUTH) = NO Trust all clients (TRUST_ALLCLNTS) = YES Trusted client authentication (TRUST_CLNTAUTH) = CLIENT Bypass federated authentication (FED_NOAUTH) = NO Default database path (DFTDBPATH) = /home Database monitor heap size (4KB) (MON_HEAP_SZ) = 90 Java Virtual Machine heap size (4KB) (JAVA_HEAP_SZ) = 2048 Audit buffer size (4KB) (AUDIT_BUF_SZ) = 0 Size of instance shared memory (4KB) (INSTANCE_MEMORY) = AUTOMATIC Backup buffer default size (4KB) (BACKBUFSZ) = 1024 Restore buffer default size (4KB) (RESTBUFSZ) = 1024 Sort heap threshold (4KB) (SHEAPTHRES) = 0 Directory cache support (DIR_CACHE) = YES Application support layer heap size (4KB) (ASLHEAPSZ) = 15 Max requester I/O block size (bytes) (RQRIOBLK) = 32767 Query heap size (4KB) (QUERY_HEAP_SZ) = 1000 Workload impact by throttled utilities(UTIL_IMPACT_LIM) = 10 Priority of agents (AGENTPRI) = SYSTEM Max number of existing agents (MAXAGENTS) = 400 Agent pool size (NUM_POOLAGENTS) = 400 Initial number of agents in pool (NUM_INITAGENTS) = 0 Max number of coordinating agents (MAX_COORDAGENTS) = (MAXAGENTS - NUM_INITAGENTS) Max no. of concurrent coordinating agents (MAXCAGENTS) = MAX_COORDAGENTS Max number of client connections (MAX_CONNECTIONS) = MAX_COORDAGENTS Keep fenced process (KEEPFENCED) = YES Number of pooled fenced processes (FENCED_POOL) = MAX_COORDAGENTS Initial number of fenced processes (NUM_INITFENCED) = 0 Index re-creation time and redo index build (INDEXREC) = RESTART Transaction manager database name (TM_DATABASE) = 1ST_CONN Transaction resync interval (sec) (RESYNC_INTERVAL) = 180 SPM name (SPM_NAME) = SPM log size (SPM_LOG_FILE_SZ) = 256 SPM resync agent limit (SPM_MAX_RESYNC) = 20 SPM log path (SPM_LOG_PATH) = TCP/IP Service name (SVCENAME) = 60009 Discovery mode (DISCOVER) = SEARCH Discover server instance (DISCOVER_INST) = ENABLE Maximum query degree of parallelism (MAX_QUERYDEGREE) = 7 Enable intra-partition parallelism (INTRA_PARALLEL) = YES Maximum Asynchronous TQs per query (FEDERATED_ASYNC) = 0 No. of int. communication buffers(4KB)(FCM_NUM_BUFFERS) = AUTOMATIC No. of int. communication channels (FCM_NUM_CHANNELS) = AUTOMATIC Node connection elapse time (sec) (CONN_ELAPSE) = 10 Max number of node connection retries (MAX_CONNRETRIES) = 5 Max time difference between nodes (min) (MAX_TIME_DIFF) = 60 db2start/db2stop timeout (min) (START_STOP_TIME) = 10
从以上实例、数据库参数分析,没有任何相关参数设置不当,导致分配了过多的内存。
DB2诊断日志分析
问题数据库版本为V9.1,DB2从版本9开始支持数据库内存自动调优。本数据库大部分内存参数也是使用AUTOMATIC方式由数据库引擎自动计算决定(由db2stmm进程进行实际调整),通过上面的从诊断日志中摘取的片段也可以看出来,编号①~⑤表示5个日志事件,日志①②表示db2stmm进程将IBMDEFAULTBP缓冲池根据需要调整了2次大小(第一次从1000调整到1512、第二次从1512调整到2280),日志③表示db2stmm进程将SORTHEAP从61调整到65,到这时为止,数据库的内存使用没有任何问题,问题出在日志④上,有人人为的将IBMDEFAULTBP缓冲池从5160一下子调整到了1300000——为什么说是人为的呢?因为这次调整不是由db2stmm进程完成,而是由db2agent代理程序(应用程序在数据库服务器中的替身)完成 ,而IBMDEFAULTBP缓冲池是在Database_memory总大小中分配的,而Database_memory被设置为AUTOMATIC,所以db2stmm进程必须先将Database_memory调整到足够大,以容纳IBMDEFAULTBP缓冲池(1300000*4K 大约为 5G),于是日志⑤便证明了我们的推断——db2stmm进程将Database_memory从789400 (789400*4K 约等于3G) 一下子调整到了2373371(2373371*4K 约等于9G),而操作系统留给DB2使用的内存只有4.5G左右,所以这时数据库进程将使用PAGING SPACE(也就是虚拟内存,相当于物理内存的扩展)来满足需要,一旦进程使用到了PAGING SPACE,那么操作系统将发生大量的PAGE IN/PAGE OUT操作,PAGE IN/OUT操作将导致系统震荡,响应缓慢,操作系统出于自我保护,甚至会杀死DB2服务进程,最严重的情况是:当PAGING SPACE也用完时,系统出于自我保护目的,会自动宕机。我们这个数据库由于最后PAGING SPACE也被用完,最终导致了系统宕机。
问题解决
通过以上的分析,我们知道IBMDEFAULTBP设置的太大了(1300000*4K),连接数据库时导致系统宕机。所以我们下面设法将IBMDEFAULTBP改小。
操作步骤:
db2set DB2_OVERRIDE_BPF=10000
#使IBMDEFAULTBP使用这个值
db2 terminate
db2stop
db2start
db2 connect to db
#连上db
db2 alter bufferpool ibmdefaultbp numblockpages 0
#原来的块SIZE太大,我们这里禁用它
db2 force applications all
db2 connect to db
db2 alter bufferpool ibmdefaultbp immediate size 50000
#将SIZE改小
db2set DB2_OVERRIDE_BPF= #设置为空,还原回去
db2 terminate
db2 force applications all
db2stop
db2start
OK!!!!
- 上一篇: C++ 数组到底能开多大。。
- 下一篇: 在JS数组指定位置插入元素