Subject: Please review: DS7.2 Linux_RHEL4_x86_64-64 Acceptance test report - 63% passed on nightcrawler.dsqa.sjc.redhat.com using /share/dev4/ds/ds72/ships/20061209.1/predator_x86-64_rhel4/directory-72-64bit-us.x86_64-unknown-linuxoldld.tar.gz
From: Noriko Hosoi
Date: Mon, 11 Dec 2006 18:20:31 -0800
To: ldap-devel-list@redhat.com

It turned out it was caused by the check in I made to fix this bug:
Resolves: #202843
Summary: referential integrity plugin does not stack with Class of Service appliance (Comment #11)
When shutting down the server, CoS is "closed" pretty early (note: before MMR)
[11/Dec/2006:15:51:26 -0800] - plugin_dependency_closeall:
[11/Dec/2006:15:51:26 -0800] - Roles Plugin
[11/Dec/2006:15:51:26 -0800] - Legacy Replication Plugin
[11/Dec/2006:15:51:26 -0800] - HTTP Client
[11/Dec/2006:15:51:26 -0800] - Class of Service
[11/Dec/2006:15:51:26 -0800] - ACL preoperation
[11/Dec/2006:15:51:26 -0800] - ACL Plugin
[11/Dec/2006:15:51:27 -0800] - 7-bit check
[11/Dec/2006:15:51:27 -0800] - Views
[11/Dec/2006:15:51:27 -0800] - State Change Plugin
[11/Dec/2006:15:51:27 -0800] - Multimaster Replication Plugin <== shutdown hangs here.

Then, when it comes to MMR, it calls some "internal" operations, which depend upon CoS for the post operation.  But CoS is already shutdown and MMR failed to finish closing and hangs there.  Since the plugin is not available, it could happen anything...  Instead of crashing there, the server went to the weird mode, and the backend was just left without calling shutdown.

With the fix, CoS is closed after MMR and MMR gets closed without any problem and so was the backend.
[11/Dec/2006:17:43:40 -0800] - plugin_dependency_closeall:
[11/Dec/2006:17:43:40 -0800] - [0<41] Legacy Replication Plugin
[11/Dec/2006:17:43:40 -0800] - [1<41] Roles Plugin
[11/Dec/2006:17:43:40 -0800] - [2<41] Multimaster Replication Plugin
[11/Dec/2006:17:43:41 -0800] - [3<41] HTTP Client
[11/Dec/2006:17:43:41 -0800] - [4<41] Class of Service
[11/Dec/2006:17:43:41 -0800] - [5<41] ACL preoperation
[11/Dec/2006:17:43:41 -0800] - [6<41] ACL Plugin
[11/Dec/2006:17:43:41 -0800] - [7<41] 7-bit check
[11/Dec/2006:17:43:41 -0800] - [8<41] Views
[11/Dec/2006:17:43:41 -0800] - [9<41] State Change Plugin
[11/Dec/2006:17:43:41 -0800] - [10<41] ldbm database
[11/Dec/2006:17:43:41 -0800] - dblayer_close: dblayer_stop_thread: 0
[11/Dec/2006:17:43:41 -0800] - Waiting for 4 database threads to stop
[11/Dec/2006:17:43:42 -0800] - All database threads now stopped
[11/Dec/2006:17:43:42 -0800] - [11<41] chaining database

I added the CoS plugin dependency to MMR.  Then, it solved the problem.   I'm also adding the dependency to Legacy Replication and Retro Changelog for safety, which use internal operations. 

Thanks,
--noriko

Noriko Hosoi wrote:
Actually, the problem has started earlier than the replica crashes.  It's in the startup of mmrepl/accept.  It generates 2 masters (slapd-s1, s2) and 2 replicas (slapd-c1, c2), then sets up SSL on slapd-s1 first.  In the phase, adds/modify/delete encryption related entries, and stops the server.  The stop failed and the server did hung up.  The process still alives.  The attached stack traces are from the hung up slapd-s1 process.  It might be that some thread exists without releasing a lock, but not sure yet...
[log from mmrepl/accept startup]
-- ssl_on. LOCAL_LDAP_PORT=24202
ldap_delete: No such object
modifying entry cn=config

modifying entry cn=encryption,cn=config

deleting entry cn=RSA,cn=encryption,cn=config

adding new entry cn=RSA,cn=encryption,cn=config

/export/DS7.2-2623/server/usr/bin/ldapsearch -1 -h nightcrawler -p 24202 -s base -b  objectclass=* supportedldapversion
dn:
supportedldapversion: 2
supportedldapversion: 3
Server still running!! Failed to stop the ns-slapd process: 452. Please check the errors log for problems.

[stacktraces from the hung-up slapd-s1]
(gdb) thread 1
[Switching to thread 1 (Thread 182909167136 (LWP 452))]
#0  0x0000003cc1f0ae0b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003cc1f0ae0b in __lll_mutex_lock_wait ()
  from /lib64/tls/libpthread.so.0
#1  0x0000000000000001 in ?? ()
#2  0x000000000000000b in ?? ()
#3  0x0000003cc1f07c17 in pthread_mutex_lock () from /lib64/tls/libpthread.so.0
#4  0x0000000000000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 1084389728 (LWP 514))]
#0  0x0000003cc14bf946 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003cc14bf946 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x0000002a955cb8ea in DS_Sleep () from ./libslapd.so
#2  0x0000000000412ef5 in time_thread ()
#3  0x0000002a960bfd5e in PR_Select () from ./libnspr4.so
#4  0x0000003cc1f0610a in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003cc14c68c3 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 1074805088 (LWP 478))]
#0  0x0000003cc14bf946 in __select_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003cc14bf946 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x0000002a955cb8ea in DS_Sleep () from ./libslapd.so
#2  0x0000002a98949173 in ?? ()
#3  0x000000000000024a in ?? ()
#4  0x0000002a9893c078 in ?? ()
#5  0x000000000075cc60 in ?? ()
#6  0x0000002a960bfd5e in PR_Select () from ./libnspr4.so
#7  0x0000003cc1f0610a in start_thread () from /lib64/tls/libpthread.so.0
#8  0x0000003cc14c68c3 in clone () from /lib64/tls/libc.so.6
#9  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (Thread 1074538848 (LWP 477))]
#0  0x0000003cc1f0b90b in __fsync_nocancel () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003cc1f0b90b in __fsync_nocancel () from /lib64/tls/libpthread.so.0
#1  0x0000002a960bc89d in PR_GetSpecialFD () from ./libnspr4.so
#2  0x0000002a9559984e in slapd_log_error_proc () from ./libslapd.so
#3  0x0000002a955995d6 in slapd_log_error_proc () from ./libslapd.so
#4  0x0000002a955994a4 in slapd_log_error_proc () from ./libslapd.so
#5  0x0000002a98937dea in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 1074272608 (LWP 476))]
#0  0x0000003cc1f0ae0b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003cc1f0ae0b in __lll_mutex_lock_wait ()
  from /lib64/tls/libpthread.so.0
#1  0x0000000000000001 in ?? ()
#2  0x0000000000000065 in ?? ()
#3  0x0000003cc1f07c17 in pthread_mutex_lock () from /lib64/tls/libpthread.so.0
#4  0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (Thread 1074006368 (LWP 475))]
#0  0x0000003cc1f0ae0b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003cc1f0ae0b in __lll_mutex_lock_wait ()
  from /lib64/tls/libpthread.so.0
#1  0x0000000000000001 in ?? ()
#2  0x0000000000000065 in ?? ()
#3  0x0000003cc1f07c17 in pthread_mutex_lock () from /lib64/tls/libpthread.so.0
#4  0x0000000000007080 in ?? ()
#5  0x000000000000002d in ?? ()
#6  0x0000000000000000 in ?? ()

--noriko

Noriko Hosoi wrote:
The acceptance test result on 64-bit Linux went poor on 12/09 (100% pass on 12/08).  The cause looks like the 2 replicas crashed in the transaction checkpoint, then it made the rest of the tests and the next day's tests failed due to the piled up ns-slapd.... :(  I'll keep investigating it...

[nhosoi@laputa accept]$ ls logs*
logs-c1:
access               audit               core.854  errors.rotationinfo
access.rotationinfo  audit.rotationinfo  errors

logs-c2:
access               audit               core.1056  errors.rotationinfo
access.rotationinfo  audit.rotationinfo  errors

(gdb) bt
#0  0x0000003cc1468518 in malloc_consolidate () from /lib64/tls/libc.so.6
#1  0x0000003cc1469537 in _int_malloc () from /lib64/tls/libc.so.6
#2  0x0000003cc146b272 in malloc () from /lib64/tls/libc.so.6
#3  0x0000002a9629eb2a in __os_malloc () from ./libdb-4.2.so
#4  0x0000002a9629d830 in __memp_sync_int () from ./libdb-4.2.so
#5  0x0000002a9629e31b in __memp_sync () from ./libdb-4.2.so
#6  0x0000002a962a3242 in __txn_checkpoint () from ./libdb-4.2.so
#7  0x0000002a962a33a8 in __txn_checkpoint_pp () from ./libdb-4.2.so
#8  0x0000002a98937c8f in ?? ()
#9  0x000000000000003c in ?? ()
#10 0x0000002a9893c642 in ?? ()
#11 0x0000000000000000 in ?? ()

$ ps -ef | egrep DS7.2 | egrep ns-slapd | egrep -v egrep | wc
    42     588    9435

Thanks,
--noriko





Index: create_instance.c =================================================================== RCS file: /cvs/dirsec/ldapserver/ldap/admin/src/create_instance.c,v retrieving revision 1.38 diff -t -w -U8 -r1.38 create_instance.c --- create_instance.c 8 Dec 2006 18:11:09 -0000 1.38 +++ create_instance.c 12 Dec 2006 02:13:58 -0000 @@ -2986,41 +2986,44 @@ fprintf(f, "objectclass: extensibleObject\n"); fprintf(f, "cn: Legacy Replication Plugin\n"); fprintf(f, "nsslapd-pluginpath: %s/libreplication-plugin%s\n", cf->plugin_dir, shared_lib); fprintf(f, "nsslapd-plugininitfunc: replication_legacy_plugin_init\n"); fprintf(f, "nsslapd-plugintype: object\n"); fprintf(f, "nsslapd-pluginenabled: on\n"); fprintf(f, "nsslapd-plugin-depends-on-type: database\n"); fprintf(f, "nsslapd-plugin-depends-on-named: Multimaster Replication Plugin\n"); + fprintf(f, "nsslapd-plugin-depends-on-named: Class of Service\n"); fprintf(f, "\n"); fprintf(f, "dn: cn=Multimaster Replication Plugin,cn=plugins,cn=config\n"); fprintf(f, "objectclass: top\n"); fprintf(f, "objectclass: nsSlapdPlugin\n"); fprintf(f, "objectclass: extensibleObject\n"); fprintf(f, "cn: Multimaster Replication Plugin\n"); fprintf(f, "nsslapd-pluginpath: %s/libreplication-plugin%s\n", cf->plugin_dir, shared_lib); fprintf(f, "nsslapd-plugininitfunc: replication_multimaster_plugin_init\n"); fprintf(f, "nsslapd-plugintype: object\n"); fprintf(f, "nsslapd-pluginenabled: on\n"); fprintf(f, "nsslapd-plugin-depends-on-named: ldbm database\n"); fprintf(f, "nsslapd-plugin-depends-on-named: DES\n"); + fprintf(f, "nsslapd-plugin-depends-on-named: Class of Service\n"); fprintf(f, "\n"); fprintf(f, "dn: cn=Retro Changelog Plugin,cn=plugins,cn=config\n"); fprintf(f, "objectclass: top\n"); fprintf(f, "objectclass: nsSlapdPlugin\n"); fprintf(f, "objectclass: extensibleObject\n"); fprintf(f, "cn: Retro Changelog Plugin\n"); fprintf(f, "nsslapd-pluginpath: %s/libretrocl-plugin%s\n", cf->plugin_dir, shared_lib); fprintf(f, "nsslapd-plugininitfunc: retrocl_plugin_init\n"); fprintf(f, "nsslapd-plugintype: object\n"); fprintf(f, "nsslapd-pluginenabled: off\n"); fprintf(f, "nsslapd-plugin-depends-on-type: database\n"); + fprintf(f, "nsslapd-plugin-depends-on-named: Class of Service\n"); fprintf(f, "\n"); /* cos needs to be placed before other same type'ed plugins (postoperation) */ fprintf(f, "dn: cn=Class of Service,cn=plugins,cn=config\n"); fprintf(f, "objectclass: top\n"); fprintf(f, "objectclass: nsSlapdPlugin\n"); fprintf(f, "objectclass: extensibleObject\n");