Current config is: Orion Platform 2013.1.0, SAM 5.5.0, NCM 7.1.1, NPM 10.5, NTA 3.11.0, IVIM 1.6.0, VNQM 4.0.1
I'm running a custom application monitor template built from SQL Server User Experience Monitor components, 14 in total, each with a unique query. Consistently, the same 3-4 components go 'Down' at random times, triggering false alerts.
The template is set to use SQL Authentication first. The queries are nothing special; the login works and the components test fine.
The monitor credentials stored in the SAM credential library are for a SQL-authenticated login with full sysadmin privileges on the target machine, and the target machine has unlimited connections enabled.
I was able to reproduce the problem from a separate NPM instance: similar errors and down-state events, but at slightly different times.
After enabling debug logging on the application monitor and reviewing both the login errors recorded on the target instance and the security log, it appears that when poll requests go out, these 3-4 queries report login failures (the others log in just fine), so access is denied and the components go to a down state. It's totally random.
One of our server admins shared the following (I've edited out the account and domain):
"This is pretty standard log for audit failure. Really it is pretty cut and dry, the user account or password are wrong. There are 16,000 of these going back just to 9/5 for this account. Almost all of these are from dxxxxxxxx. Some have the user account in the request like this:
Account For Which Logon Failed:
Security ID: NULL SID
Account Name: example
Account Domain: 10.10.10.10 (example)
Failure Information:
Failure Reason: Unknown user name or bad password.
Status: 0xc000006d
Sub Status: 0xc0000064
Others the account info is missing:
Subject:
Security ID: NULL SID
Account Name: -
Account Domain: -
Logon ID: 0x0
I have only seen this when the information is in fact missing or incorrect in the login request."
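For reference, the two status codes in the quoted event can be decoded with a small lookup. This is only a sketch: the helper names `decode_status` and `summarize_event` are made up for illustration, and the meanings come from Microsoft's published NTSTATUS values. Sub status 0xC0000064 says the account name does not exist as a Windows account, which would be consistent with a SQL login name being presented to Windows authentication, though that is an inference and not something the log itself confirms.

```python
# NTSTATUS values that commonly appear in Windows logon-failure events.
# Meanings are taken from Microsoft's NTSTATUS documentation.
NTSTATUS_MEANINGS = {
    0xC000006D: "STATUS_LOGON_FAILURE: bad user name or authentication information",
    0xC0000064: "STATUS_NO_SUCH_USER: the account name does not exist",
    0xC000006A: "STATUS_WRONG_PASSWORD: user name is correct but the password is wrong",
}


def decode_status(code):
    """Translate a status code (int or hex string such as '0xc000006d')."""
    value = int(code, 16) if isinstance(code, str) else code
    return NTSTATUS_MEANINGS.get(value, "unknown status 0x%08X" % value)


def summarize_event(text):
    """Pull the Status and Sub Status lines out of pasted event text and decode them."""
    result = {}
    for line in text.splitlines():
        key, sep, val = line.partition(":")
        key = key.strip()
        if sep and key in ("Status", "Sub Status"):
            result[key] = decode_status(val.strip())
    return result


if __name__ == "__main__":
    sample = """Failure Information:
Failure Reason: Unknown user name or bad password.
Status: 0xc000006d
Sub Status: 0xc0000064"""
    for field, meaning in summarize_event(sample).items():
        print(field, "->", meaning)
```

If the sub status were 0xC000006A instead, that would point at a wrong password for an account that does exist, which is a different problem from the one logged here.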
The SAM 5.5 debug log shows that SQL Authentication is tried first, fails, and then Windows Authentication is tried. I think what our admin means by 'others' is those attempts by SAM to use Windows Authentication, which don't pass any account information, so the log sees blanks. The first failure reason, 'Unknown user name or bad password', I think is associated with the initial SQL Authentication attempts.
I'm wondering if this is perhaps somehow performance related:
1.) I do have a modestly high disk queue length on the Orion SQL instance, and it's a RAID 5 setup with NTA; the tables are in need of a rebuild.
2.) This target node in particular has four application monitors running on it, all using the exact same SQL-authenticated account. I intentionally disabled some unrelated SQL 'health' monitoring components, thinking it would reduce the frequency of 'down' states, but I just saw one of the four custom queries go down again. Argh!
I'm attacking the problem from the performance angle for now, but could having, say, four different SQL-authenticated application monitors, with a total of only about 33 component monitors, cause a timeout or garble the username and password at times?
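One way to test the concurrency theory outside of SAM would be to fire a batch of logins at the target from a script and count how many fail. Below is a minimal sketch, not anything SAM does internally. `run_probe` and `make_pyodbc_connect` are hypothetical helper names, the defaults of 33 attempts and 4 workers just mirror the numbers above, and the pyodbc part assumes pyodbc and a SQL Server ODBC driver are installed on the machine running the script.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_probe(connect, attempts=33, workers=4):
    """Call connect() `attempts` times across `workers` threads.

    Returns (results, failures); each result records the attempt number,
    elapsed seconds, and the error text (None on success).
    """
    def one(i):
        start = time.perf_counter()
        try:
            connect()
            err = None
        except Exception as exc:  # record any login or timeout error
            err = repr(exc)
        return {"attempt": i, "seconds": time.perf_counter() - start, "error": err}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one, range(attempts)))
    failures = [r for r in results if r["error"] is not None]
    return results, failures


def make_pyodbc_connect(conn_str, login_timeout=15):
    """Build a connect() callable using pyodbc (assumed to be installed).

    conn_str is a placeholder for your own SQL-authenticated connection string.
    """
    import pyodbc  # imported lazily so run_probe works without it

    def connect():
        conn = pyodbc.connect(conn_str, timeout=login_timeout)
        try:
            conn.cursor().execute("SELECT 1").fetchone()
        finally:
            conn.close()

    return connect
```

Usage would look something like `results, failures = run_probe(make_pyodbc_connect(conn_str))`, run a few times while the monitors are polling. If the same account never fails here but still fails from SAM, that would point away from the target instance and back toward the poller.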