LCG Management Board
Tuesday 3 October 2006 at 16:00
(Version 1 - 6.10.2006)
A.Aimar (notes), D.Barberis, L.Bauerdick, S.Belforte, I.Bird, K.Bos, Ph.Charpentier, L.Dell’Agnello, B.Gibbard, J.Gordon, I.Fisk, D.Foster, F.Hernandez, M.Lamanna, E.Laure, S.Lin, H.Marten, P.Mato, B.Panzer, L.Robertson (chair), J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Tuesday 10 October from 16:00 to 17:00, CERN time
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
Comments on the minutes were received:
- Additional explanations received from D.Barberis on the ATLAS tag database and their rates to/from tape (email)
Minutes (“version 2“) updated and uploaded to the MB Minutes page.
F.Hernandez asked whether C.Eck could check with the experiments whether it is possible to gather separately the input and output network requirements to the sites, instead of a single “maximum” value per site.
Note: Outside the MB meeting, C.Eck confirmed that he will do it, modify his Megatable accordingly and distribute it to the MB.
1.2 Matters Arising
1.2.1 Quarterly Reports 2006Q3 (QRs to complete) – A.Aimar
To complete before Monday 9 October 2006 and send to A.Aimar.
1.2.2 Megatable Update Process (document) – L.Robertson
Updated version of the document (agreed in the MB of the 26.09.2006) about the process used to maintain the T1-T2 table is available.
1.2.3 Revised Requirements
There was an action to be completed by end of September (for ATLAS by 6-October) to release official versions of the revised requirements.
- ATLAS will provide it, as previously agreed, after the approval of the ATLAS Collaboration Board on 6 October.
- CMS (reported by S.Belforte) has discussed it during the CMS week, 10 days before, and will send a note with updated values and motivations for the changes.
- LHCb (reported by Ph.Charpentier) have it ready and will send it after the MB.
It was recalled that these values were needed for the RRB and must be circulated to the Overview Board before being included in the RRB paper.
6 Oct 2006 - ATLAS, CMS and LHCb should send the updated values of resources required at the Tier-1 sites.
1.2.4 CMS Management Changes – L.Bauerdick
The CMS management structure now includes an Executive Board that, among others, includes:
- M.Kasemann as Computing Coordinator
- J.Harvey as Offline Coordinator, with L.Silvestris as his deputy.
The transition will happen during the next weeks and should be in effect after CSA06, towards the end of 2006.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
H.Marten asked that the action on “experiments resources requirements for next 6 months (assigned to H.Renshall)” be followed up by the MB and included in the MB Action List.
13 Oct 2006 - Experiments should send to H.Renshall their resource requirements at all Tier-1 sites (cpu, disk, tape, network in and out) covering at least 2007Q1 and 2007Q2.
3.1 ASGC (transparencies) - S.Lin
S.Lin presented the status and plans of the 24x7 user support at the ASGC Tier-1 site. Slides by Min Tsai.
At ASGC there are three levels of support teams in place:
- Administrators. Handling of complex Grid services.
- Administrators. Handling of network faults, coordination of recovery, notification of affected parties.
- Operators. Handling of power, cooling, security and environment issues.
The operational tools used are based on Nagios for the monitoring and provide:
- Automated email notification sent on fault detection
- Existing tests are used to debug the problems
- CE: job submission test
- LFC: catalog functionality, information system
- VOMS: proxy creation
- Castor MSS: GridFTP, SRM transfer test
- OS issues: ping, service port, disk, loading
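As an illustration only, an OS-level disk test of the kind listed above could be sketched in Python as follows. The sketch relies on the Nagios plugin convention that the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds, function names and default path are assumptions, not the actual ASGC configuration.

```python
import shutil

# Nagios plugin convention: the exit code of a check carries the service state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a disk fill percentage to a Nagios state (thresholds are assumed)."""
    if not 0.0 <= used_percent <= 100.0:
        return UNKNOWN
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/"):
    """Return (state, message) for the file system that holds `path`."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    state = classify(used_percent)
    return state, f"DISK {LABELS[state]} - {used_percent:.1f}% used on {path}"
```

A wrapper script would print the message and exit with the returned state, so that the monitoring system can trigger the email notification.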
Everything is migrated to resilient hardware and controlled via remote management.
Critical components are running on blade servers with
- Redundant power and RAID disks
- All instrumented with remote KVM over IP
- Hot backups deployment for grid services
The Help Desk system used is based on OTR (a free ticket tracker system) and:
- Automatically parses and creates tickets received from GGUS
- But the updates to GGUS are done manually for the moment
The ASGC plans for next year are to increase monitoring coverage, in particular:
- Cover the Experiment services hosted at Tier-1 site
- Improve Worker Node environment testing
- Move to a more detailed Castor testing system
The Grid Service support coverage should improve and:
- Start rotation for 24x7 on-call Grid administrator coverage
- Hire operators to extend on-site coverage to 16x7, by 2007
In addition ASGC wants to test and deploy High Availability solutions for the critical Grid components.
J.Gordon asked how much of the time, on average, skilled people are on site.
S.Lin said that in principle, once in place, the team will consist of 6 grid-skilled people, each of whom will be on call in rotation once or twice a week.
3.2 FNAL (transparencies) - I.Fisk
3.2.1 Support Infrastructure at FNAL
The 24x7 Support for the Tier-1 center works within the existing infrastructure at FNAL for off-hour support.
Critical Systems and Services are monitored by their NGOP tool (Next Generation OPerations tool, developed at FNAL).
All failures on site are flagged and logged:
- System stops responding to ping
- Disk passes a fill percentage
- Load Level is exceeded
- Processes are missing
- Service tests fail
- SRM transfers are run between Tier-1 and Tier-2 centers
- Job submission is performed for both OSG and LCG
- File systems are monitored and worker nodes are held in case of failure
- dCache is heavily monitored, data scans, integrity checks, cron jobs
All errors are passed to Remedy, which initiates a ticket or a page:
- Tickets are tracked and reminders sent
- Pages go through a rotation of primary, secondary and tertiary responders
3.2.2 Current Team
The team currently consists of:
- 4 FTE for facility operations. Growing over the next 12 months
- 2 FTE of troubleshooting and integration work. One of the two is an open position they are trying to fill
- 1 FTE for storage operations (CMS contribution to a much larger team)
- Additional positions in grid development and integration are available
Everyone who is hired in facility support is asked whether they are willing to carry a pager and provide off-hour support.
The requirements are in the job description. Candidates who do not accept them are not hired:
- Off-hour support does not incur additional cost, but the management is aware that they need to prevent wearing people out
- Pager rotation of the primary support person is performed weekly: one week as primary, then one week as secondary, then two weeks off
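The weekly rotation described above (one week primary, one week secondary, two weeks off) can be sketched as a four-week cycle. This is a minimal illustration, assuming exactly four people in the rotation; the names and the helper functions are hypothetical and not FNAL's actual scheduling tool.

```python
# Four-week cycle taken from the text: primary, secondary, then two weeks off.
CYCLE = ("primary", "secondary", "off", "off")


def role_for(person_index, week):
    """Role of a person in a given week; person i starts as primary in week i."""
    return CYCLE[(week - person_index) % len(CYCLE)]


def schedule(people, weeks):
    """Return one {name: role} dictionary per week."""
    return [
        {name: role_for(i, week) for i, name in enumerate(people)}
        for week in range(weeks)
    ]


# Hypothetical names, for illustration only.
rota = schedule(["A", "B", "C", "D"], weeks=4)
```

With four people in the cycle, each week has exactly one primary and one secondary responder.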
3.2.3 Current Status
FNAL switched to 24x7 operations of critical grid components in July 2006.
They monitor the health of machines and generate pages for failures:
- Response to ping of the SRM server
- Response to ping of the grid gatekeepers
They also monitor the existence of processes. Some of these can generate pages, but many generate tickets.
If a percentage of the cluster is held, the operators are paged.
The list of services that result in an immediate page is given here: http://cmsmon1.fnal.gov/cgi-bin/get_critical_item_status and the main monitoring page is here http://cmsmon1.fnal.gov/cgi-bin/status.
The US-CMS Tier-1 center is going to be ready to offer 24x7 operations of all critical components by the beginning of 2007.
- Many services are currently monitored and responded to
- Operations effort is increasing, which should allow improved response time and quality of service
- More services and functionality are being added to the currently monitored services
I.Bird asked how many of their monitoring tools (e.g. dCache monitoring) could be used by other sites.
I.Fisk replied that the tools and scripts can probably be shared to some degree (or be used as examples). He also noted that, in any case, it would be good if the operation models were similar from site to site.
J.Templon asked if the monitoring tools that check the health of the different systems and services also keep a log history and whether this could be used for some kind of independent availability metrics, in addition to the SAM testing.
I.Fisk replied that in principle it could be done, but that they have not looked into it.
J.Gordon added that this could actually be used in addition to the SAM tests, because these sometimes fail for unclear reasons. It could provide a further validation of the system status and of the test results.
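As an illustration of the idea discussed above, an availability figure could be derived from a monitoring log history roughly as follows. This is only a sketch: the record format (a timestamp plus an up/down flag, each state holding until the next record) and the function name are assumptions, not the format of any existing tool.

```python
from datetime import datetime, timedelta


def availability(records, window_end):
    """Fraction of time a service was up between the first record and window_end.

    `records` is a time-sorted list of (datetime, is_up) pairs; each state is
    assumed to hold until the next record (or until window_end for the last).
    """
    if not records:
        raise ValueError("no monitoring records")
    total = window_end - records[0][0]
    if total <= timedelta(0):
        raise ValueError("window_end must be after the first record")
    up = timedelta(0)
    ends = [t for t, _ in records[1:]] + [window_end]
    for (start, is_up), end in zip(records, ends):
        if is_up:
            up += end - start
    return up / total


# Invented sample history: up for 6 h, down for 3 h, up again for 3 h.
t0 = datetime(2006, 10, 3, 0, 0)
history = [
    (t0, True),
    (t0 + timedelta(hours=6), False),
    (t0 + timedelta(hours=9), True),
]
value = availability(history, t0 + timedelta(hours=12))
```

A figure computed this way could then be compared with the SAM result for the same period.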
3.3 PIC (transparencies) - G. Merino
3.3.1 Basic Information about the Infrastructure
The available Power Supply consists of:
- 200 kVA UPS
- 500 kVA diesel generator
Cooling capacity is 300 kW, which was sufficient last summer even though it was exceptionally hot.
The Network is provided by the Spanish NREN (RedIRIS), with the same level of support as GÉANT, which implies 24x7 support (with an emergency telephone line available).
The plan is to provide High Availability hardware for critical services. Today many servers are still running on “WN-like” hardware because new services in the last years had to be deployed/tested/run.
Currently they are moving critical services to a standardized “server-like” building block h/w:
- Dual power supply
- Mirrored system disk
- High quality standard HDs, hot swappable
- Dual Ethernet (using 2 separated switches)
3.3.2 High Availability in Critical Servers
For the Basic Infrastructure, HA hardware will be used for DNS:
- Use secondary server in case of primary failure
- Move to robust platform in the near future
For Databases the hardware will be upgraded to HA:
- FTS (Oracle) and LFC (MySQL):
- RAID1 system and DB-quality disks
- Regular hot backup (FTS: 24h ; LFC: 1h)
For the two Storage systems that they have installed now:
- Still using castor1 in production. Servers are not HA.
- Now migrating to castor2. Production servers will be deployed on reliable h/w
- Core services are already deployed on 5 servers with reliable h/w
Deployment schema has already
3.3.3 Monitoring Tools
- Nagios: for alarm handling.
- The operator also watches the SAM monitoring pages; they are in the process of interfacing this as a local Nagios alarm
- Ganglia: for metric time-dependence monitoring
They plan to:
- evaluate other tools, like Lemon, with integrated capabilities and the possibility of full monitoring history archiving.
- create a dashboard that facilitates global status check to the MoD that integrates the different
3.3.4 Monitoring Status
The staff is complemented by two engineers from a collaborating company (TID), who are developing INGRID:
- A framework for implementing an “expert system” that takes recovery actions depending on the alarms raised by the given services
- Not yet in production. The plan is to deploy it for the most critical services by 2007.
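The general idea of a rule-based “expert system” that maps alarms to recovery actions can be illustrated with a small lookup table. This is not INGRID's design, which was not presented in detail; all service names, alarm names and actions below are made-up examples, apart from the DNS fallback to the secondary server, which is the one recovery mentioned in these minutes.

```python
# (service, alarm) -> recovery action. Every entry is a hypothetical example.
RECOVERY_RULES = {
    ("dns", "primary_down"): "switch_to_secondary_dns",
    ("lfc", "daemon_missing"): "restart_daemon",
    ("fts", "disk_full"): "call_second_line_expert",
}

# Anything without a rule is handed to a human (assumed default).
DEFAULT_ACTION = "notify_mod"


def decide(service, alarm):
    """Return the recovery action for an alarm, or the default escalation."""
    return RECOVERY_RULES.get((service, alarm), DEFAULT_ACTION)
```

Keeping the rules as data rather than code means new services can be covered by adding table entries only.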
3.3.5 Manager on Duty
The MoD is in charge of:
- Monitoring: support mailing list + alarms for critical services
- Redirecting issues to relevant experts
- Tracking each problem until its resolution, using the internal ticketing system both to follow up and as a “knowledge database”
- Getting back to the user
- Writing a daily logbook/report with the main incidents
The MoD team is a pool of 7 people (will be 10 in 2007) working weekly shifts (Wed-Wed)
- Today: MoD only active during working hours
The 24x7 Plan that is being implemented is:
- Implement SMS service for critical service alarms
- ADSL at home provided to all PIC employees
- MoD on-call during non-working hours
The MoD will act as 1st line support for alarms and will be able to call a 2nd line expert for escalation if needed.
The on-call system is being developed now (open points: formal issues with contracts, paid extra hours vs. extra holidays, voluntary scheme).
The plan is to finalise definition of 24x7 procedures by Dec-2006, and start operating it by March-2007.
PIC is not planning to have staff on site 24x7. Therefore the emphasis is put on:
- Deploying services in a reliable/robust way
- Monitoring and automating recovery actions as much as possible
The pool of engineers taking Manager on Duty shifts will evolve to cover non-working hours through an on-call scheme.
PIC intends to provide 24x7 support by end of 2007Q1.
L.Robertson asked whether they had encountered particular problems with “normal” hardware, and whether this was the reason for moving to HA hardware now.
G.Merino answered that there have been some hardware failures when using WNs as servers. Several disks broke after about one year of usage; this may have been due to the batch of disks purchased at one given time.
D.Barberis noted that if there is no 24x7 staff on site, experiments' requests (not only alarms) should still be answered by someone quite promptly (within hours, not on the next working day).
G.Merino answered that if there is an agreement to answer within a given time, this would be allowed for some authorized persons and could be plugged into the alarm system.
J.Gordon said that there will probably be a group of “authorized” people at experiments and sites that will be allowed to call for urgent actions or to trigger the alarm system of a site.
L.Robertson noted that the rules and process for contacting sites in an emergency outside their normal working hours should possibly be defined in a uniform way across sites. Otherwise one needs to know each site's rules and protocols. An assessment should be done later this year.
No slides prepared.
The Job Priorities WG result is ready to be deployed but some additional steps are needed:
- Make sure that yaim supports the mapping of VOMS groups and roles to UNIX accounts/groups.
- A new version of Torque/maui has been released that implements hierarchical fair share
- A bug in job list match was found and is being fixed
- The Information Provider is ready in CVS but there is not an official release
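The mapping of VOMS groups and roles to UNIX groups mentioned in the first step above can be illustrated as follows. This is a sketch only: it parses VOMS attribute strings of the form `/vo/group/Role=role` and applies a first-match table. The table contents and the UNIX group names are hypothetical, and this is not the configuration that yaim generates.

```python
def parse_fqan(fqan):
    """Split a VOMS attribute string into (group path, role or None)."""
    group_parts, role = [], None
    for part in fqan.strip("/").split("/"):
        if part.startswith("Role="):
            value = part[len("Role="):]
            role = None if value == "NULL" else value
        elif part.startswith("Capability="):
            continue  # capabilities are ignored in this sketch
        else:
            group_parts.append(part)
    return "/" + "/".join(group_parts), role


# First match wins: (VOMS group, role or None for "any role", UNIX group).
# All entries are hypothetical examples.
MAPPING = [
    ("/atlas", "production", "atlasprd"),
    ("/atlas", "lcgadmin", "atlassgm"),
    ("/atlas", None, "atlas"),
]


def unix_group(fqan, mapping=MAPPING):
    """Return the UNIX group for a VOMS attribute, or None if nothing matches."""
    group, role = parse_fqan(fqan)
    for rule_group, rule_role, target in mapping:
        in_group = group == rule_group or group.startswith(rule_group + "/")
        if in_group and (rule_role is None or rule_role == role):
            return target
    return None
```

Ordering the table from most to least specific lets a plain VO member fall through to the generic group.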
Some of this will be in the current release; the rest (VOMS support and Information Provider) is postponed to the next release cycle.
The components will have to be integrated, released and installed at the production sites.
The policies at the sites should be defined by the VOs. How to implement these policies is documented and the site admins can configure it.
Yaim will be configured to generate skeleton configuration files to facilitate the installation. The complexity will depend on how frequently the VO wants to change the configuration and the priorities.
L.Robertson asked where the prototype is being deployed.
replied that this is deployed only on
At CERN there is a temporary solution with 3 queues that allows queuing short, medium and long jobs.
Ph.Charpentier noted that this would be a solution that should for now be implemented at all sites, waiting for a better one.
asked whom the sites should receive their configuration from. A single point of contact with the experiments should be defined.
The percentages between experiments should not change often (about six months?), but percentages can change among groups within a VO.
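The two levels of percentages (between experiments, and among groups within a VO) combine by multiplication. The arithmetic can be sketched as follows; all share values and group names are invented for illustration and do not reflect any agreed allocation.

```python
def effective_shares(vo_shares, group_shares):
    """Combine VO-level and group-level percentages into site-wide percentages.

    `vo_shares` maps VO -> percent of the site; `group_shares` maps
    VO -> {group: percent of that VO's share}. A VO without a group
    breakdown is treated as a single "default" group (an assumption).
    """
    result = {}
    for vo, vo_percent in vo_shares.items():
        groups = group_shares.get(vo, {"default": 100.0})
        for group, group_percent in groups.items():
            result[(vo, group)] = vo_percent * group_percent / 100.0
    return result


# Invented example values.
vo_level = {"atlas": 50.0, "cms": 30.0, "lhcb": 20.0}
group_level = {"atlas": {"production": 80.0, "users": 20.0}}
shares = effective_shares(vo_level, group_level)
```

In this example a VO can change its internal split without affecting the percentages of the other experiments.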
I.Fisk asked how the experiment knows that the requested changes have been made. J.Templon answered that this is not easily visible, except when the full shares are being used: one can then see whether the percentage used matches the one requested.
asked for a final estimate of the time by which the system will be
K.Bos added that this would be discussed at the GDB the day after.
H.Marten asked how the interface to the scheduler can be implemented.
J.Templon answered that one should create the groups and roles and map them using LCMAPS and in the Information Provider.
Currently the system is implemented only for LSF and Torque/maui, not for the other systems used elsewhere.
noted that moving to many groups and roles will be a complex problem to handle, and therefore it should be tried soon (if it is an expected use case).
2. Feedback from the Comprehensive Review (document, transparencies) - L.Robertson
L.Robertson presented the key points from the LHC Comprehensive Review. See transparencies.
Beforehand he had also distributed a written summary of the main feedback from the reviewers. See document.
2.1 Key Points on Services
1 - Focus on stable services, fixing bugs, ramping up scale and performance. New features are lower priority
This applies to services and software:
- CERN, Tier-1 and Tier-2 services (move towards a level of stability that makes 24x7 realistic)
- Middleware and any other software
2 - Additional job performance and system usage metrics are needed. They should include failure rates at each site, etc.
3 - The Experiments' involvement at Tier-0/Tier-1 sites is considered essential, and the first level of support could be within the experiments.
4 - A Service Operation Coordinator is needed (the SCOD proposed last week could be a solution)
5 - The VOs want to have User level accounting.
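As an illustration of the per-site failure rate metric asked for in point 2 above, a minimal sketch is given below. The record format (site name plus a success flag) and the sample data are assumptions made for the example, not real job records.

```python
from collections import defaultdict


def failure_rates(jobs):
    """Return {site: failed jobs / total jobs} from (site, succeeded) records."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for site, succeeded in jobs:
        totals[site] += 1
        if not succeeded:
            failures[site] += 1
    return {site: failures[site] / totals[site] for site in totals}


# Invented sample records with placeholder site names.
sample_jobs = [
    ("SITE-A", True),
    ("SITE-A", False),
    ("SITE-B", True),
    ("SITE-A", True),
    ("SITE-A", True),
]
rates = failure_rates(sample_jobs)
```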
2.2 Key Points on Software
1 - SRM 2.2 - Essential for start-up – concentrate on delivering the currently agreed functionality. Do not move further to SRM 2.3, etc.
2 - 3D Phase 2 sites should move to production.
3 - Job priorities. They consider it a non-trivial issue to manage and suggest a step-by-step approach.
4 - CASTOR 2 – still a critical area – does it need more staff?
2.3 Key Points on Outlook and Planning
1 - Usage of CAF must be defined by the experiments. The sooner it is defined, the sooner CERN knows what to purchase and install.
2 - Decisions by LHCb, CMS and ATLAS on level of interest in PROOF are needed also for completing point 1 above.
3 - Integration of the DAQ and testing of the full data flow – but the timing of this is outside our scope. It depends on the experiments' schedule.
4 - An open question: Should we relax the 2007 commissioning performance targets in view of the new assumptions about start-up? And rather look for stability of service and then ramp-up?
5 - Revised capacity requirements should allow funding agencies to fulfil the requirements – but they should NOT be tempted to reduce the level of funding.
Summary: The MB must define the objectives and plan for the next 9 months, taking into account the review’s recommendations.
L.Robertson noted that next year the review should not coincide with the EGEE Conference.
4. Summary of New Actions
The full Action List, current and past items, will be on this wiki page before the next MB meeting.