LCG Management Board
Tuesday 13 October 2009 16:00-18:00 – F2F Meeting
(Version 1 – 3.10.2009)
A.Aimar (notes), J.Bakken, I.Bird (chair), M.Bouwhuis, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, S.Foffano, Qin Gang, J.Gordon, A.Heiss, M.Kasemann, G.Merino, A.Pace, H.Renshall, M.Schulz, Y.Schutz, J.Shiers, O.Smirnova, J.Templon
Mailing List Archive
Tuesday 27 October 2009 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
No feedback received about the minutes of the previous meeting.
1.2 Comments to VO SAM Tests (CommentsVOSAM.zip) – A.Aimar
A.Aimar distributed the comments received on the VO SAM tests from the 4 Experiments.
Experiments and Sites should comment on them before they are added to the Quarterly Report for 2009Q3.
Received from F.Hernandez after the meeting: In addition, I would also suggest that we record in the minutes that remote participation by phone was almost impossible due to the extremely bad quality of the links. This problem is recurrent and needs attention.
2. Action List Review (List of actions)
Not done by: FR-CCIN2P3 and NDGF,
provided a URL but SLS cannot access it because of some certificate issues.
3. Operations Weekly Report (Slides)
Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting. All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
This summary covers the week 5th to 11th October. It included week 1 of CMS analysis tests and ATLAS throughput tests.
A mixture of problems was encountered for a total of 26 test alarm tickets of which:
- 11 from ATLAS. One to follow-up via BNL to trigger an SMS in OSG and one still assigned to TRIUMF – received but not closed
- 8 from CMS – all closed
- 7 from LHCb – all closed
- All ALICE tickets (closed) were raised the previous week.
A few incidents leading to eventual service incident reports
- RAL disk subsystem failures taking down FTS, LFC and CASTOR from 4th to 9th (7th for LFC and FTS). SIR later.
- SRM failures at CC-IN2P3 on Thursday 8 October and Saturday 10 October 2009.
- ASGC ATLAS conditions database unstable/partially corrupt.
The meeting attendance is still not satisfactory in several cases.
3.2 VO SAM Availability (slide 5)
RAL Disk Failures
ATLAS, CMS and LHCb SAM tests saw the RAL LFC, FTS and CASTOR downtimes (4 to 7 October for LFC and FTS and up to 9 October for CASTOR) due to failing disk sub-systems. ALICE only tests their VOBoxes and saw an interruption caused by an SL4 to SL5 migration.
4 October: RAL CASTOR runs on a SAN with disk systems containing primary and mirrored databases. Hardware faults on mirror since 10 September also hit primary on 4 October and CASTOR went down.
Decision was to revert to older hardware then revalidate the failing systems. Suspicion early on was temperature problems.
6 October similar fault appeared on other racks hosting the FTS and LFC databases. Work started to revert those also.
7 October FTS and LFC brought back. A side effect was ATLAS DQ2 trying to pull data from RAL and failing, causing SRM problems at other Tier-1 sites. Decision taken to move the 3D database to alternative hardware as well; suspicion shifting towards power supply problems.
8 October CASTOR being restored without loss for ALICE and CMS and losing a few hours’ transactions for ATLAS and LHCb – estimated at 10000 files. List of lost files being prepared for experiment decision.
9 October CASTOR restored – experiments to recover lost files. Vendor working with RAL to understand root cause of failures.
ATLAS throughput tests 5 to 9 October
Goals were to make a final stress test of Tier-0 data distribution and to validate FTS 2.2 including its use of checksums.
Day 1 and 2: First two days distribution of many small files in functional tests.
Day 3: Increased file sizes led to overloaded site services. File sizes were increased further and frequency was decreased. This allowed reaching the target rate of 3 GB/sec out of CERN. Ran 24 hours with aggregate 4 GB/sec out of the CERN Tier-0 pool, of which 1 GB/sec was to CERN tape. Input rate of 1 GB/sec from the ATLAS load generator. Conclusion is that Tier-0 is ready for DAQ.
Day 4: sites (notably BNL) reported many transfers failing with connection refused over sets of CERN disk servers.
Failures analysed by the CERN FTS team to be a consequence of a design change. Request pools for SRM and GridFTP are now separate and a ratio of 2 SRM per GridFTP was used. Resource allocation delays led to GridFTP processes timing out before being picked up by an SRM process. Changing the ratio to 1.5 brought the error rate down from 50% to 10%.
Check-summing tests from day 4 were more successful with little apparent overhead but more tests with realistic loads are needed.
IN2P3 SRM failures
08-Oct-2009 23:00 - Number of processes on the SRM machine rose quickly.
09-Oct-2009 00:30 - Maximum number of processes was reached. Machine crashed.
09-Oct-2009 06:30 - On-duty staff noticed the problem. Unscheduled downtime set with GOCDB (4h).
09-Oct-2009 08:30 - Expert restarted the service, reports made.
09-Oct-2009 10:00 - SRM services available again.
10-Oct-2009 03:30 - Number of processes on the SRM machine rose quickly.
10-Oct-2009 08:30 - On-duty staff noticed the problem. Unscheduled downtime set with GOCDB (5h).
10-Oct-2009 08:30 - Expert restarted the service, reports made.
10-Oct-2009 09:00 - SRM services up.
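The timeline above can be turned into elapsed times with a few lines of Python. This is only an illustrative sketch: the timestamps are copied from the entries above, but the choice of start event (first sign of the process count rising) and end event (SRM services available again) is an assumption. The outage figures quoted elsewhere in these minutes (10 and 5 hours) are rounder and may have been counted from a different starting point.

```python
from datetime import datetime

# Format of the timeline entries, e.g. "08-Oct-2009 23:00"
FMT = "%d-%b-%Y %H:%M"

# Start = first symptom, end = service available again (assumed choice).
incidents = {
    "Thursday/Friday": ("08-Oct-2009 23:00", "09-Oct-2009 10:00"),
    "Saturday": ("10-Oct-2009 03:30", "10-Oct-2009 09:00"),
}


def hours_between(start, end):
    """Return the elapsed time between two timeline entries, in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


durations = {name: hours_between(*span) for name, span in incidents.items()}

for name, hours in durations.items():
    print(f"{name}: {hours:.1f} h from first symptom to service restored")
```

With these start and end points the two incidents last 11.0 and 5.5 hours respectively.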
The post-incident analysis has not yielded any cause for the erratic behaviour of the system, although a dCache bug is suspected.
On Saturday, the rise in the number of processes was detected at an earlier stage, which allowed the service to be restarted more quickly.
FTS transfers have been impacted by these failures. Inward transfers (writing operations) were unavailable for 10 hours (Thursday) and 5 hours (Saturday).
- The administrators have submitted a ticket to dCache support to get a deeper analysis into the causes of the incident.
- dCache experts will set up a system to monitor processes on SRM servers and to trigger an automatic reboot procedure when needed.
- Monitoring system will be enhanced to highlight the criticality of the alert during on-duty slots.
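The kind of process monitoring mentioned in the follow-up actions could look roughly like the sketch below. It is not the system the dCache experts will deploy: the function name, the 90% threshold, the three-sample window and the sample values are all hypothetical, chosen only to show the idea of restarting before the process limit is actually hit.

```python
def should_restart(process_counts, limit, threshold=0.9, consecutive=3):
    """Decide whether an automatic restart should be triggered.

    Returns True when the last `consecutive` samples all exceed
    threshold * limit, i.e. the process count stays close to the maximum
    rather than spiking once. All parameters are illustrative assumptions.
    """
    if len(process_counts) < consecutive:
        return False
    recent = process_counts[-consecutive:]
    return all(count > threshold * limit for count in recent)


# Hypothetical samples of the number of processes on an SRM machine,
# taken at regular intervals, with a hypothetical limit of 1000.
samples = [120, 130, 480, 940, 970, 990]
print(should_restart(samples, limit=1000))
```

Requiring several consecutive high samples is one simple way to avoid rebooting on a single transient peak; a real deployment would tune this against observed behaviour.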
ASGC ATLAS Online DB Problems
ASGC – ongoing instabilities/partial corruption in the ATLAS conditions DB for the past two weeks. Confused situation – site review meeting at CERN yesterday.
ASGC reports that the DB was shut down successfully 30 minutes before a scheduled intervention on 26 September at 23:30. It remains unclear why the DB service could not then restart normally.
Working with RMAN (Oracle Recovery Manager), the final (second) point-in-time restore, performed last Friday, used the full backup of 21 September. It seems that this backup is no longer valid for recovering Streams replication, or that the streams are unable to recover from a two-week backlog.
The CERN DM group recommends performing a complete re-instantiation using transportable tablespaces. ASGC needs to agree with another Tier-1 site to help transfer the files.
Reported today: streams replication may soon be restarted.
Update on Sites Upgrades and Security
For some time sites have been encouraged to upgrade their SL4/RHEL4 Linux OS to close the security holes discovered in August. Last week the EGEE PMB authorised the suspension, with a week’s notice, of sites that had not upgraded. Experiments and Tier-1 sites were reminded at the Tuesday daily meeting (only Tier-2 and Tier-3 sites were affected).
Some 50 sites were given 48 hours’ warning of suspension on Wednesday by the security team. Experiments also proactively contacted their sites.
In the end the results were good, with only two sites suspended.
A coordinated DB recovery validation exercise that is regularly tested should be considered to avoid such problems.
H.Renshall proposed that the recovery validation should be a milestone for all Sites.
All Tier-1 Sites should validate their DB recovery solutions.
M.Kasemann reported that one CMS Tier-2 Site claimed it did not know about the security issues and mandatory upgrades.
I.Bird replied that the security contacts in all ROC regions were informed and no other Site complained about lack of information.
J.Gordon added that it is up to the Sites to keep their security contacts up to date.
4. Report on RRB Meeting (Agenda) – I.Bird
I.Bird provided a summary of the latest RRB Meeting.
The only question on the status report on WLCG was on the transition from EGEE to EGI.
It was noted that non-European countries (Russia, Taipei, etc) are not included in EGI but some form of collaboration should be defined. WLCG should monitor the situation.
On the resources report by S.Foffano, the future timescale of the process was discussed. The Experiments’ needs for 2011 were requested by 1 March 2010. If this is not possible the Experiments should reply saying so.
M.Kasemann added that the estimates should be prepared according to the 2011 schedule, in terms of dates and running periods.
No other particular issues were mentioned.
5. High Level Milestones Update (Tier2_Reliab.pdf; WLCG_HL_Milestones.pdf) – A.Aimar
The MB reviewed the status of the WLCG High Level Milestones (see File) that will be added to the coming Quarterly Report.
A.Aimar will check with M.Litmaath on the validation of the VO Pilot Jobs frameworks.
Average reliability of 95% is reached by considerably fewer Sites than expected in the milestone above. The values are obtained from the monthly Tier-2 Reliability reports (see File).
I.Bird added that the only way to encourage Sites to improve is to publish the reports.
A.Aimar added that he will ask that the GridView report show Sites in green above 95% rather than 90%.
A.Aimar will ask GridView to change the threshold to 95% in the Tier-2 report.
There should be a percentage of how many WN are on SL5 vs. SL4 from the Information System.
A.Aimar will ask S.Traylen or L.Field how to extract the information on the deployment of SL5 at the Tier-1 Sites.
These should be checked in the Information System with the installed capacity.
The deployment of SCAS should be verified. Will be discussed at the GDB on the following day.
Check with S.Traylen.
J.Gordon will report on whether the Sites are reporting User Level accounting.
Both IN2P3 and NL-T1 have installed the CREAM CE. IN2P3 will have it in production in November.
WLCG-09-27: There are several Tier-2 installations for each VO. Action done for the 4 Experiments.
M.Schulz noted that the CREAM CE can be used by Experiments other than ALICE only once the WMS has been updated.
Ph.Charpentier added that LHCb could use it but do not want to use workarounds.
For the moment there are no clear dates for the milestones above.
WLCG 09-11 should be discussed with the team working on Nagios.
It should be verified whether the Sites report in the new benchmark.
Will be discussed at the GDB on the following day.
6. Update on Accounting Reports and HEP-SPEC06 (Slides) – J.Gordon
J.Gordon reported on the status of accounting and benchmarking at the WLCG Sites.
The User Accounting Policy has been approved by the WLCG and by the EGEE MB.
F.Hernandez added that the French sites involved in the WLCG collaboration obtained approval from the French authority for the protection of privacy and personal data to collect and record information on the usage of grid resources. Some of the Tier-2s and Tier-3s are already publishing the accounting data with the associated DN.
Many sites are publishing HEPSPEC06 numbers but scaling them by 4 so that accounting still publishes SI2K hours. HEPSPEC06 hours can be recovered by scaling by 4 again.
Accounting has no knowledge of which benchmark has been run, but the Information Service can tell us. Once a threshold number of sites has been reached, the Portal can flip from reporting kSI2K to HEPSPEC06.
He has requested parallel reporting with HEPSPEC06 as an alternative to Normalised CPU.
A patch to APEL is required to use the Scaling Factor taken from BDII for normalisation. This is under development and should be available for testing next week.
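The factor-of-4 scaling described above can be illustrated with a short sketch. It assumes that the factor relates HEPSPEC06 to kSI2K as 4 HEPSPEC06 per kSI2K, so that sites divide by 4 before publishing and the portal multiplies by 4 to recover HEPSPEC06 hours. The function names and the example job are hypothetical, and this is not the APEL implementation.

```python
# Assumed conversion: 4 HEPSPEC06 correspond to 1 kSI2K.
HS06_PER_KSI2K = 4


def hs06_to_ksi2k(hs06):
    """Scale a HEPSPEC06 value down to the kSI2K value a site would publish."""
    return hs06 / HS06_PER_KSI2K


def ksi2k_hours_to_hs06_hours(ksi2k_hours):
    """Recover HEPSPEC06 hours from published kSI2K hours."""
    return ksi2k_hours * HS06_PER_KSI2K


# Hypothetical job: 10 hours on a core benchmarked at 8 HEPSPEC06.
core_hs06 = 8
wall_hours = 10

published_ksi2k_hours = hs06_to_ksi2k(core_hs06) * wall_hours
recovered_hs06_hours = ksi2k_hours_to_hs06_hours(published_ksi2k_hours)

print(published_ksi2k_hours)  # kSI2K hours as published
print(recovered_hs06_hours)   # HEPSPEC06 hours recovered by the portal
```

The round trip only works for sites that really benchmarked with HEPSPEC06, which is why knowing which benchmark each site ran matters.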
Data is being collected but not from all Sites
- 336 different site names publishing to APEL
- 313 sites publishing FQAN info
- 190 sites publishing UserDN
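As a quick check of how complete the collection is, the counts above can be expressed as fractions of the sites publishing to APEL. The sketch assumes that the FQAN and UserDN counts are subsets of the 336 site names; the variable names are illustrative.

```python
# Counts taken from the list above.
total_sites = 336
publishing = {"FQAN": 313, "UserDN": 190}

# Percentage of APEL-publishing sites that also publish each attribute.
coverage = {name: 100 * count / total_sites for name, count in publishing.items()}

for name, percent in coverage.items():
    print(f"{name}: {percent:.1f}% of sites publishing to APEL")
```

Under that assumption roughly 93% of the sites publish FQAN information and about 57% publish the UserDN.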
This is a big improvement in six months. Feedback is now needed from the Experiments on whether the portal is showing the required information for FQAN and UserDN.
Gstat2 is measuring status and RAL will store historical data, but they are short of staff to implement this and will redeploy someone to this task soon.
There is active development to change the APEL client to use ActiveMQ instead of R-GMA. This is in alpha testing. The date for roll-out is unknown but it should be before the end of EGEE III.
There is a parallel development to distribute the APEL repository to NGIs but it is planned to continue to host the LHC VOs at RAL and NGIs will republish these, and other, VOs to RAL.
RAL will run the Accounting Repository EGI Global Task
See S.Traylen’s talk from last MB meeting and tomorrow’s GDB. Data will be gathered and stored and merged with reports (e.g. T1, T2)
Lots of errors to correct in the publishing of logical and physical CPUs.
The next steps are:
- Publish UserDNs
- Benchmark with HEPSPEC06
- Publish installed capacity (S.Traylen Talk)
- Storage Accounting
Some countries and communities did not like the name HEPSPEC.
It could be changed to GRIDSPEC or EGEESPEC, or there could be multiple equivalent names.
7. Update on the EGI InSPIRE ROSCOE Proposals (Slides) – J.Shiers
J.Shiers presented an update on the EU proposals.
The details of the “18.104.22.168 – Service Deployment” component of the EGI InSPIRE proposal are now converging, as are those of Robust Scientific Communities for EGI (aka ROSCOE).
This includes the funding levels that we expect to request as well as the specific programme of work and partners.
There is reason to believe both proposals will be funded – and that future funding is at least a possibility.
7.2 Progress during EGEE09
During EGEE ’09 it was realised that further funding – at least for “EGI” – will not be possible under FP7. This has meant that the proposal has been adapted to a 4 year period within the same budget envelope – €25M for 22.214.171.124/2.
The talk focuses on the services of particular relevance to WLCG: TSA4.x
ROSCOE will continue with a 3-year Programme of Work. Prior to EGEE ’09 we foresaw a 3-year project with the following breakdown:
- VO services: 10.5 FTE x 3 years
- Dashboard services: 4 FTE x 3 years
- Ganga services: 2 FTE x 3 years
Plus equivalent services for other “HUCs”.
In the last two weeks this was modified to fit the new 4 year profile, as follows
It is assumed that the funding ratio will be 50:50 – EU:institute. Were this still a 3-year project, the total FTE (per year) would be 15 2/3 [fifteen and two-thirds].
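The effort figures above can be checked with exact fractions. The sketch below only reproduces arithmetic on numbers quoted in these minutes; treating the 50:50 ratio as an even split of FTE-years is an assumption, and the services for other “HUCs” are not included.

```python
from fractions import Fraction

# Breakdown foreseen prior to EGEE '09 (FTE per year, over 3 years).
original_per_year = {
    "VO services": Fraction(21, 2),  # 10.5 FTE
    "Dashboard services": Fraction(4),
    "Ganga services": Fraction(2),
}
years = 3

original_total_per_year = sum(original_per_year.values())
original_fte_years = original_total_per_year * years

# Revised figure quoted above: 15 2/3 FTE per year on a 3-year basis.
revised_per_year = Fraction(15) + Fraction(2, 3)
revised_fte_years = revised_per_year * years

# Assumed even 50:50 split between EU and institute funding.
eu_funded_fte_years = revised_fte_years / 2

print(float(original_total_per_year), float(original_fte_years))
print(float(revised_fte_years), float(eu_funded_fte_years))
```

On these numbers the original plan amounts to 16.5 FTE per year (49.5 FTE-years), while 15 2/3 FTE per year over 3 years corresponds to 47 FTE-years.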
We have task descriptions and need to iterate on these and complete the proposal. The proposal will fit the budget but one should nevertheless foresee a negotiation phase.
7.3 SA4 – Tasks, Milestones and Deliverables
Significant commonality in the tasks proposed.
- Workload management, data management, monitoring, service automation and operation
- All consistent with model that funding for SA4 is a “one-off” and “Services for user communities that are heavy users of DCIs and have multi-national dimension”
Feedback from the EU (actually on SSCs) is that semi-annual reports are not enough: one should assume quarterly reports, but “much lighter weight than EGEE QRs”.
Suggest – for the proposal – to recast currently proposed tasks into a common framework
- Common deliverables: quarterly reports
- Milestones: some candidates, particularly in ATLAS tasks
But do we want to tie ourselves to these unless absolutely required? Retain as many ATLAS / WLCG milestones as possible as internal ones.
Manpower table (exists); task and deliverables table (as above)
A number of “vertical” SSCs and a few “horizontal” ones are bidding against 1.2.3 – Virtual Research Communities – total budget €23M
Total request for ROSCOE is currently €8.5M (€2.7M for HEP).
More than WLCG: includes ILC and FAIR although SMEs dropped a long time ago to fit foreseen budget.
This seems to be the right ballpark, although it has been suggested that we justify it well.
And it has only been possible by stripping back to the threshold level for participation. Any less and partners (including CERN) no longer consider it worthwhile to participate.
50:50 co-funding for all partners except CERN – closer to 1:2 depending on “matching funding” (real costs).
We should still anticipate some level of reduction: it may in any case be enforced e.g. foreseeing a time-profile as done for SA4.
7.5 ROSCOE WPs and Timeline
The first draft was simply the component SSC proposals glued together. This has been revised into a small number of NA, SA and at least one JRA (tbd).
HEP tasks and manpower have not changed – a lot of work to re-write and finalize the proposal. There are limits on the total # pages as well as length of partner descriptions etc. We must also keep in mind the review criteria.
The main Service Activities are Integration, Operations and Distributed Analysis Support, and they need to be clearly differentiated from SA4.
Cal Loomis asked for revised proposals by last Friday but we did not fully meet this deadline. It is getting harder and harder to find manpower for this proposal. There is a con-call this Friday and a F2F on Wednesday 21st. These two meetings will be used to finalize both the “HEP SSC” components of ROSCOE and the task descriptions of SA4. This includes (brief!) partner descriptions and all outstanding input.
There is a ROSCOE F2F at LAL Wednesday 28th – “final proposal” that day / week followed by focus on “administrative matters” (EGI SSC F2F at CERN cancelled – not enough people).
Obtaining PICs and completed A2s etc. has taken a long time and was still not finished at the end of last week. It has been a hard summer, but we need to continue to maximize our chances of funding.
It is widely assumed that funding will be back-dated one month to 1st May 2010. We have a number of good people – some ending this year – plus others “in the pipeline”. Around ½ of the 8 + 3 “CERN slots”?
7.6 Partners Funded Activities
Likely that there will be unfunded NA activities in addition to the above (+ needed management work).
7.7 Next Steps
Need to complete detailed description of tasks and partner involvement / contributions this week (applies also for SA4)
Partners now expected to be pro-active on this point.
G.Merino asked whether the resources for the Tier-1 are still included.
J.Shiers replied that it was already agreed that the NGI should provide those resources.
8. Update on the EMI Proposal (Middleware) (Slides) – M.Schulz
M.Schulz presented the status of the EMI proposal showing the material prepared by A.Di Meglio, the interim EMI project director.
The latest news after EGEE09 is:
- University of Manchester (Gridsite) has decided not to join
- Added GRNET and TCD as new partners, bringing the total count to 22
- The second round of proposal drafting has started, next revision available on Fri 16/10
- Following EGI decision to extend duration to 4 years, EMI considering similar extension beyond 3 years
Probably a 6 or 12 month extension only, focused on maintenance and mostly with matching funding, to avoid changing current budget and effort estimates (~90 FTEs in total co-funded)
Next major events are:
- EMI Core meeting (technical) on 19/10 in Juelich
- EMI Collaboration Board (administrative) on 20/10 in Juelich
The expected CERN role is:
- Project Director
- One WP leader (SA2 – Quality Assurance?)
- One coordinator of a major technical area (Data Management?)
- Development on Data Management (FTS, DPM/LFC and related components), Information Systems (BDII, Messaging), configuration (YAIM)
- Software Engineering support (Quality Assurance, ETICS)
Estimated CERN effort is 12-14 FTEs cofunded
Project objectives not yet completely shared by all partners:
- gLite and UNICORE partners more focused on maintenance of production components, standardization, operational aspects of the middleware and exploration of new computing models (clouds)
- ARC partners more focused on standardization, development and evolution of components independently of actual production status, operational aspects considered out of scope.
The status and future of Data Management are not clear. Expertise may no longer be available by the time EMI starts.
Need to discuss details of effort and in particular matching effort with potential names.
Need to understand the requirements of WLCG and the experiments and make sure EMI supports them. How? When?
Day 0 Distribution
At the beginning a long list of components will be included, with many overlaps. See slide 9.
J.Gordon asked whether accounting has finally been accepted.
M.Schulz replied that although some consider accounting as an operational tool, to have reference tools is important. Therefore accounting tools are included.
I.Bird asked how all these packages, basically almost all known middleware plus some future one, are going to be supported.
M.Schulz replied that there is a list of institutes supporting each of the tools above. The tools important to WLCG are well funded. The list above does not really take into account, in due proportion, which tools are more or less used.
J.Gordon added that WLCG and EGI should define and propose the priorities among these tools.
I.Bird proposed that, as a large user community, WLCG should express its requirements and priorities.
M.Schulz added that the EU wants the 3 middleware stacks to be merged and all have to be satisfied. Some middleware projects are seeking a level of funding higher than is warranted by the grid resources they support. He also added that forcing standardization may divert effort from supporting the currently used solutions.
O.Smirnova noted that the situation is not as difficult as presented. And that the EMI project needs to have the 3 middleware projects.
I.Bird replied that the list of components seems too extensive to be supported well.
O.Smirnova added that the manpower is distributed unevenly: some components have 5 FTEs, others have 0.1 FTEs for support.
O.Smirnova noted that the resources provided by NDGF/ARC have a relevant role in the WLCG and they would like to receive sufficient funding to continue to provide them.
I.Bird concluded that the EMI components needed by WLCG must be known, because they are a requirement for WLCG to benefit from the EMI effort.
9.1 SAM Tests ASGC to FZK
DE-KIT is noting some FTS tests from ASGC which are always failing and reducing the availability of FZK.
M.Schulz replied that these are probably the Gstat jobs running and probing the BDII. This should be investigated with the BDII Service.
9.2 Castor and Chimera Updates
J.Gordon will ask all Tier-1 Sites about their updates to Castor and dCache/Chimera and their plans for those upgrades.
Ph.Charpentier noted that the client must also be upgraded and that the client distributed by gLite is not the correct version.
10. Summary of New Actions
No new actions for the MB.