Reasons why Batch Accounting is Under-reporting
Or more specifically, why we can rule out some possible scenarios.
Current logging under-reports by about 2-5%. This is pretty significant.
Busted scenarios should be struck through and moved to the bottom of the list for reference.
Events may be reported late
What needs to be checked is how frequently events are logged to the database. Two things need to be done here:
- The number of records for a recent hour should be logged repeatedly over a period of time, and extensively: every single event that appears should be recorded so that, if reporting is delayed, we can cross-check which events arrived late.
- The most recent event entered into the database should be monitored over a period of time to test for any delay in logging.
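A minimal sketch of the two checks above, assuming each poll of the database yields the set of event ids seen for the hour plus the newest eventTime. The function names and the sample ids are made up for illustration; the actual query against the database is left out.

```python
from datetime import datetime, timedelta

def late_arrivals(snapshot_before, snapshot_after):
    """Event ids that appear in the later snapshot of an hour but not the earlier one."""
    return sorted(set(snapshot_after) - set(snapshot_before))

def reporting_lag(newest_event_time, observed_at):
    """How far the newest event in the database trails the time of the poll."""
    return observed_at - newest_event_time

# Hypothetical polls of the same hour, taken some minutes apart.
poll_1 = {"job-101", "job-102"}
poll_2 = {"job-101", "job-102", "job-103"}
late = late_arrivals(poll_1, poll_2)

# Hypothetical newest eventTime versus the moment the poll was made.
lag = reporting_lag(datetime(2013, 7, 9, 14, 55, 0),
                    datetime(2013, 7, 9, 15, 0, 0))

print(late)
print(lag)
```

If late is ever non-empty for an hour that has already closed, events are being reported late. A lag that stays well above zero across many polls points the same way.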
Daily reporting might be done too early and an hour of logs is being trimmed
Hasn't been tested, but this is a possible scenario if the report is trying to fetch logs for a time after the daily reporting run. An hour is about 4% of a day, after all.
Currently the resacc cronjob is set to run at four minutes past midnight every day, which may cause problems in the event of a bad timezone configuration.
All CERN services, including batchmon, seem to be configured for CEST, so this might not be the problem.
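To get a feel for how much a timezone mismatch would cost, here is a rough sketch. It assumes the report covers "yesterday" as seen by the script, and that the database interprets the same wall-clock bounds in its own timezone. missing_fraction is a hypothetical helper, the run date is invented, and fixed UTC offsets stand in for real timezone rules.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")
CET = timezone(timedelta(hours=1), "CET")

def missing_fraction(run_time, script_tz, db_tz):
    """Fraction of the reported day that is still in the future when the job runs."""
    # The day the script believes it is reporting on: yesterday in its own timezone.
    day = (run_time.astimezone(script_tz) - timedelta(days=1)).date()
    # The same wall-clock bounds, as the database would interpret them.
    start = datetime(day.year, day.month, day.day, tzinfo=db_tz)
    end = start + timedelta(days=1)
    # Part of the window that has not happened yet at run time.
    future = max(timedelta(0), end - max(run_time, start))
    return future / timedelta(days=1)

# Cron fires at 00:04 CEST (hypothetical date).
run = datetime(2013, 7, 10, 0, 4, tzinfo=CEST)

same_tz = missing_fraction(run, CEST, CEST)
one_hour_off = missing_fraction(run, CEST, CET)

print(same_tz)
print(round(one_hour_off * 100, 2), "% of the day not yet logged")
```

With matching timezones nothing is missing. With a one-hour offset in this direction, 56 of the 1440 minutes in the window lie after the run, which is why a mismatch is worth ruling out explicitly rather than assuming.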
The last second of each day might not be accounted for
stopdate = stopdate.replace(hour=23, minute=59, second=59, microsecond=999)
...
where = "WHERE loc.eventTime >= :startdatebind AND loc.eventTime < :enddatebind"
I'm not sure how cx_Oracle handles dates with a sub-second component (note that microsecond=999 is 999 microseconds, not 999 milliseconds), but in any case the Oracle DATE type does not store any time unit smaller than a second.
Whether this is an issue depends on whether cx_Oracle keeps the sub-second part when binding the value. If it does, this should not be an issue. If it doesn't, the upper bound becomes exactly 23:59:59 and the strict < comparison drops events stamped in the last second of each day, i.e. one second in every 86,400.
Obviously not the cause of the 2-5% loss, but a potential discrepancy nevertheless.
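One way to sidestep the question entirely would be a half-open window whose exclusive upper bound is midnight of the next day, so that nothing depends on sub-second handling. A sketch in plain Python, with no database involved. day_bounds and in_window are illustrative names rather than anything from the existing script, and the date is made up.

```python
from datetime import datetime, timedelta

def day_bounds(day):
    """Return (start, end) for a half-open window [start, end) covering one day."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)

def in_window(event_time, start, end):
    """Mirror of the WHERE clause: eventTime >= start AND eventTime < end."""
    return start <= event_time < end

day = datetime(2013, 7, 9)  # hypothetical reporting day
start, end = day_bounds(day)

# An event stamped in the last second of the day, at DATE (whole second) precision.
last_second = datetime(2013, 7, 9, 23, 59, 59)

# What the current upper bound becomes if the sub-second part is dropped on binding.
truncated_stop = day.replace(hour=23, minute=59, second=59, microsecond=0)

dropped = not in_window(last_second, start, truncated_stop)
kept = in_window(last_second, start, end)

print(dropped)
print(kept)
```

The first check shows the event falling outside the window when the bound is truncated to 23:59:59; the second shows it being kept when the bound is the start of the following day.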
Some events might not be recorded in the database
Unlikely, but this scenario has not been fully ruled out yet. I tested it before I understood more about batch and batchtzero.
I need to check this a second time, taking both LSF job sources into consideration, to make sure there is 100% parity between AFS logging and the records in the database.
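The re-check could be as simple as a set comparison, sketched below. It assumes a job can be identified by a (source, job id) pair in both the AFS logs and the database; that key, the parity_report name and the sample ids are all assumptions for illustration.

```python
def parity_report(afs_jobs, db_jobs):
    """Compare job keys seen in the AFS logs with those recorded in the database."""
    afs, db = set(afs_jobs), set(db_jobs)
    return {
        "missing_from_db": sorted(afs - db),
        "missing_from_afs": sorted(db - afs),
    }

# Hypothetical job keys, combining both LSF sources before comparing.
afs_jobs = [("batch", 1001), ("batch", 1002), ("batchtzero", 2001)]
db_jobs = [("batch", 1001), ("batchtzero", 2001)]

report = parity_report(afs_jobs, db_jobs)
print(report)
```

Both lists in the report being empty for a full day would confirm parity. Anything under missing_from_db is an event that never reached the database.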
Potential timezone differences between AFS logging and the database
There are no differences in timestamps for AFS logs and database entries so there's no problem there.