Meetings
2011/12/14
We discussed migration plans to the new WMAgent-based Tier0 system for next year. The main underlying consideration is that we have a working system and should not switch until the new system has been shown to work equally well. As such, the move won't happen before we need a working Tier0 for at least the commissioning phase. This means we will install a ProdAgent Tier0 from scratch and start using it for production. We will then switch to a WMAgent Tier0 whenever that is ready. This could be after pp data taking has started (during one of the technical stops).
The plan is for Tier0 Development (Dirk) to provide a "Milestone 0", a minimally working WMAgent Tier0 that allows Ops to inject streamer files into a T0AST instance and have the Tier0 issue jobs. This will allow Tier0Ops to start testing general WMAgent stability in the CERN/LSF environment. Initially the type of jobs is of secondary importance, although it would be good if they loaded the system in terms of both IO and CPU (like express processing jobs, for instance).
Once this exists, the work splits into three parts that are somewhat orthogonal.
- 1) Expand the workflows this minimal WMAgent can run and make them fully functional Tier0 workflows (complete the WMSpec workflow definitions)
- 2) Stress test the system and become familiar with the operation of a WMAgent Tier0
- 3) Work out how to monitor the system
Part 1 is all development and part 2 is all operations. Part 3 is a mix of both. Development needs to look at Tier0Mon and evaluate which parts can be kept as is, which need to be modified to work in the new environment, and which would have to be completely redone. Ops needs to provide feedback on how important it is to have each part of the monitoring functional. Also, some of the monitoring functionality might be better handled via (SLS) alarms.
Action items:
- Dirk: provide "Milestone 0" minimally working Tier0 asap
- Dirk: evaluate T0Mon codebase
- Samir: look at alarm system, check how it needs to change with WMAgent and how it can help with monitoring
Monitoring
Within-Agent alarming
- We need to make a list of the error conditions we would like to get alarms for:
   - Failing jobs (any number)
   - Cooloffs (above a (small) threshold)
   - Workflows with PromptReco or Express tags that have not finished after X hours
   - Jobs running too long (we may leave this to the pure SLS alarming)
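As a starting point for that list, the conditions above could be written down as simple checks. This is only a sketch: the `WorkflowStatus` record, the threshold values and the function name are hypothetical placeholders, not anything that exists in WMAgent, and the real "X hours" still has to be decided.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical placeholder values; the real thresholds are still to be agreed.
COOLOFF_THRESHOLD = 5                    # the "(small) threshold"
MAX_WORKFLOW_AGE = timedelta(hours=48)   # the "X hours"
WATCHED_TAGS = ("PromptReco", "Express")


@dataclass
class WorkflowStatus:
    """Hypothetical summary of one workflow; not a WMAgent class."""
    name: str
    tags: Tuple[str, ...]
    started: datetime
    finished: bool
    failed_jobs: int = 0
    cooloff_jobs: int = 0


def check_workflow(wf: WorkflowStatus, now: datetime) -> List[str]:
    """Return one message per alarm condition that the workflow triggers."""
    alarms = []
    # Failing jobs: any number raises an alarm.
    if wf.failed_jobs > 0:
        alarms.append(f"{wf.name}: {wf.failed_jobs} failed job(s)")
    # Cooloffs: only above the threshold.
    if wf.cooloff_jobs > COOLOFF_THRESHOLD:
        alarms.append(f"{wf.name}: {wf.cooloff_jobs} jobs in cooloff")
    # PromptReco/Express workflows that are not finished after X hours.
    watched = any(tag in WATCHED_TAGS for tag in wf.tags)
    if watched and not wf.finished and now - wf.started > MAX_WORKFLOW_AGE:
        alarms.append(f"{wf.name}: not finished after {MAX_WORKFLOW_AGE}")
    return alarms
```

The "jobs too long" condition is left out on purpose, since it may stay with the pure SLS alarming.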
- This is the chat I had with Dave within the wave, to give an overview of what it is and how we can use it, plus a bunch of useful links. TODO: organize this info.
You could also start thinking about alert conditions that you want the prompt skimming system to generate.
Aug 27
Me:
Does it have a built-in alarm system? I ask because I had in my task list doing some SLS alarms which trigger mails from error conditions. It's a framework we now know how to use, and it could be relatively easy to code the alarms (almost no new code). But if WMAgent has something built in, it would be good to at least know how it works.
Aug 27
Dave:
The whole WM System (and possibly PhEDEx and DBS) will be picking up the WMCore Alerts system built on ZeroMQ messages.
Basically, for every alert condition you can think of, you give it a severity between 1 (info) and 10 (building on fire) plus whatever details you want, and it gets packed into a JSON message and sent off to various alert sinks (RSS, Email, etc.; we probably need to write one for SLS as well).
Aug 27
Me:
Great, do we have any example of how we can see a working alarm like that in action? Writing one for SLS can be easy, as it's fed with XML, and it's a good idea, as SLS is one of the most used monitoring interfaces for CMS Computing Shifts.
Aug 27
Dave:
Here are some relevant bits of code:
The Alert object itself
https://svnweb.cern.ch/trac/CMSDMWM/browser/WMCore/trunk/src/python/WMCore/Alerts/Alert.py
The Alert processing sinks:
https://svnweb.cern.ch/trac/CMSDMWM/browser/WMCore/trunk/src/python/WMCore/Alerts/ZMQ/Sinks
The AlertGenerator component that watches the agent & the machine that it is running on to generate Alerts about the state of the overall system:
https://svnweb.cern.ch/trac/CMSDMWM/browser/WMCore/trunk/src/python/WMComponent/AlertGenerator
For adding SLS support, it would mean implementing a new Sink, probably something like the FileSink, but writing XML instead of JSON...
Aug 27
Me:
We probably already have a server that runs the sinks?
Aug 27
Dave:
Each Agent runs an AlertCollector component that collects all the alerts from that agent (all components are being instrumented to send alerts, plus those generated by the AlertGenerator comp) and dispatches them to a set of sinks configured by the operator.
There is also a ForwardSink that pushes the alerts to another (external) Alert Processor, such as an aggregator or central collection point for alerts.
After that it should just be a case of implementing something like twitterfall for alert visualisation in the various control rooms
Aug 27
Me:
Great. So the AlertCollector could also be (or is) the alert handler (sink), in the case of a simple deployment?
Aug 27
Dave:
Yeah, the AlertCollector runs the sinks.
FYI: All open & closed trac tickets relating to the Alert system can be found here: http://tinyurl.com/3h5j7ks
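To make the flow from the chat concrete (a severity from 1 to 10 plus details, packed into JSON and handed to a set of sinks, with an XML-writing sink as a possible route to SLS), here is a rough sketch. It does not use the WMCore Alerts code: every function name, dictionary key and XML element below is a hypothetical stand-in, and the XML layout is not the real SLS schema.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict, List

Alert = Dict[str, Any]


def make_alert(source: str, severity: int, details: Dict[str, Any]) -> Alert:
    """Build an alert; severity runs from 1 (info) to 10 (building on fire)."""
    if not 1 <= severity <= 10:
        raise ValueError("severity must be between 1 and 10")
    return {"source": source, "severity": severity, "details": details}


def json_sink(alert: Alert) -> str:
    """Hypothetical sink that serializes the alert as a JSON message."""
    return json.dumps(alert, sort_keys=True)


def xml_sink(alert: Alert) -> str:
    """Hypothetical sink that writes XML instead of JSON.

    The element names are made up for illustration, not the SLS format.
    """
    root = ET.Element("alert")
    ET.SubElement(root, "source").text = str(alert["source"])
    ET.SubElement(root, "severity").text = str(alert["severity"])
    details = ET.SubElement(root, "details")
    for key, value in sorted(alert["details"].items()):
        ET.SubElement(details, "detail", name=str(key)).text = str(value)
    return ET.tostring(root, encoding="unicode")


def dispatch(alert: Alert, sinks: List[Callable[[Alert], str]]) -> List[str]:
    """Hand one alert to every configured sink, as a collector would."""
    return [sink(alert) for sink in sinks]


if __name__ == "__main__":
    alert = make_alert("JobAccountant", 7, {"workflow": "Express_Run1"})
    for message in dispatch(alert, [json_sink, xml_sink]):
        print(message)
```

A real sink would write to a file or push to a service rather than return a string; returning the text just keeps the sketch easy to try out.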
SLS Alarms
We now have a few SLS alarms that do a good job. This is the list, with comments about migration:
- T0 - Permanent failures in ProdAgent
   - It looks in the ProdAgent filesystem. If WMAgent has the same concept of a failure archive, we can just point it to the new directory and it should work out of the box.
- CMST0 Jobs Backlog
   - It looks into ProdAgent scripts, so it will die in the migration. We need to figure out how backlogs appear in WMAgent and how to spot them; the rest will be easy.
   - We may ask for the developers' help, but it's too soon for that; first we need a working system.
- T0 - Long Running jobs in LSF queue
   - It looks directly in LSF, so we don't need to touch it.
--
SamirCury - 19-Dec-2011