LCG Management Board
Tuesday 11 March 16:00-17:00
(Version 1 - 12.3.2008)
A.Aimar (notes), I.Bird (chair), Ph.Charpentier, L.Dell’Agnello, M.Ernst, I.Fisk, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, M.Litmaath, U.Marconi, H.Marten, P.Mato, G.Merino, A.Pace, B.Panzer, R.Pordes, Di Qing, H.Renshall, M.Schulz, O.Smirnova, R.Tafirout, J.Templon
Mailing List Archive:
Tuesday 18 March 2008 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
Missing in the previous MB minutes:
M.Schulz proposed a clean-up and update of the SAM tests, which should be done in the next few weeks.
1.2 Tape Efficiency Metrics
Not all sites can produce the metrics proposed. Those sites should therefore provide alternative metrics suitable for measuring and reporting on the performance of their MSS.
M.Ernst commented that the metrics are not easy to obtain from HPSS: BNL has one single HPSS cache and cannot see the metrics for multiple writes. The information is not obtainable from HPSS, but this should not be an issue for BNL.
I.Bird suggested that the sites using HPSS could skip the metrics that are not relevant and propose their alternatives.
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
- 26 Feb 2008 - The Sites and Experiments should confirm to A.Aimar that they have updated the list of their contacts (correct emails, grid operators’ phones, etc). Here is the current contact information: https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails
- 29 Feb 2008 - A.Aimar will verify with the GridView team the possibility to recalculate the values for BNL.
Not done. GridView has been asked, but the recalculation still needs to be implemented.
- 29 Feb 2008 - A.Aimar will verify why the reliability values for the Tier-2 sites seem incorrect (being lower than availability).
On the way. Being verified.
- 18 Mar 2008 - Sites should propose new tape efficiency metrics that they can implement, in case they cannot provide the metrics proposed.
Will be verified next week.
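As background to the Tier-2 reliability action above, the following minimal sketch shows why a reliability value lower than the availability value points to a calculation problem. It assumes the usual definitions (availability is uptime over the whole period, reliability is uptime over the period excluding scheduled downtime). The function names and the figures are illustrative only and are not taken from GridView.

```python
def availability(up_hours, total_hours):
    """Fraction of the whole period during which the site was up."""
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """Fraction of the period, excluding scheduled downtime, during which the site was up."""
    return up_hours / (total_hours - scheduled_down_hours)


# Hypothetical figures for a 30-day month (not real site data).
total = 720.0
scheduled_down = 36.0
unscheduled_down = 36.0
up = total - scheduled_down - unscheduled_down

a = availability(up, total)
r = reliability(up, total, scheduled_down)
print(f"availability={a:.3f} reliability={r:.3f}")

# The denominator for reliability is never larger than the one for availability,
# so under these definitions reliability cannot be lower than availability.
assert r >= a
```

Under these assumed definitions the two values coincide only when there is no scheduled downtime.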
3. CCRC08 Update (Slides) – H.Renshall
H.Renshall presented the weekly update on CCRC08.
CCRC'08 is now in phase 1.5 (i.e. between the February phase 1 run and phase 2 in May). There are no formally coordinated activities or metrics in this phase as yet. Currently it involves individual Experiments doing functionality, throughput and stress testing of their computing model components and sites.
ALICE continue to exercise their data export reaching up to 300 MB/sec to their 5 major Tier1 sites, well above the required 60 MB/sec for p-p running. They plan to add RAL in the near future (this or next week).
ATLAS performed their M6 detector cosmics run including data storage at Tier 0. 40TB of data was stored on tape between Friday and Monday and calibration streams were sent to the planned 4 Tier 2 sites of Naples, Rome, Munich and Michigan. This week they are functionality testing Tier-1 to Tier-1 data movement to PIC. They will then distribute 20 datasets of 100 files each to the other Tier-1 sites for multiple Tier-1 to PIC tests next week. The same exercise will rotate around all Tier-1 sites over the next 2 months. Intensive MC production for CCRC’08 phase 2 continues.
G.Merino warned that next week PIC will have a long scheduled electrical shutdown. ATLAS activities at PIC will not be possible during the shutdown, and ATLAS should be aware of this.
ATLAS has also proposed a draft plan for March and April. See slide 5.
CMS continue preparations for the May run and a prior CMS global run in March. Reprocessing is going well, with some site issues (dCache was slow at FNAL and IN2P3 had disk pool problems). In Tier-1 to Tier-1 commissioning only the RAL to ASGC pair is missing.
Their Tier 0 operations suffered from a one-week long instability in LSF which had hit a performance-related bug/feature in synchronising to the failover server. Automatic failover is currently disabled while this is investigated but CMS are suggesting a separate LSF instance for their Tier 0 operations.
M.Kasemann added that the agreement is that for the next 4 weeks they will try not to set up a new gateway for local submission and will instead use the single common gateway. If after 4 weeks this solution proves inadequate, it will be changed for May’s CCRC.
LHCb are preparing the workflow for their stripping jobs to be part of the 4 weeks steady running at nominal rate in May. They are setting up to evaluate and act on CPU time remaining for their grid jobs and setting up SRMv2 endpoints to go into their SAM test suite.
Future meetings coming up are the following:
CCRC’08 Face-to-Face Tuesday 1st April:
Site focused session in the morning then Experiment and Service focused session in the afternoon.
– 25th April WLCG Collaboration Workshop (Tier0/1/2) in CERN main
WLCG Service Reliability: focus on Tier2s and progress since November 2007 workshop (1 day?)
CCRC'08 & Full Dress Rehearsals - status and plans (2 days?)
Operations track (2 days, parallel)
(2 days, parallel)
– 13th June CCRC’08 Post Mortem:
J.Templon commented that the Site-focused session needs the presence of the Experiments. The session is focused on presenting how the sites work and how best to use their resources. For the session to be useful, the Experiments’ presence is therefore really necessary.
J.Templon noted that the data that ALICE is sending to SARA consists entirely of “0” data, which is either a mistake or means that it is not real ALICE raw data but only data for testing the transfer.
4. Update of the HL Milestones (HLM 11.03.2008)
The MB verified all due milestones in the High Level Milestones dashboard HLM 11.03.2008.
Here is the updated dashboard (HLM 15.3.2008), reflecting the discussion below:
- FZK: H.Marten confirmed that it will be ready by end of March 2008.
- ASGC: Done.
- CNAF: 24x7 support is provided for all critical services but not for the non-critical ones. The timescale for completion is end of April.
- PIC: Done.
- RAL: Not represented at the meeting.
- ASGC: Done.
- NDGF: Done
- SARA: In the process of changing to a single SARA/NIKHEF help desk. It will be ready by end of April.
- ASGC: No SLA defined yet. Will be done by March.
- IN2P3: Not defined yet. After the proposal the steps should be quick because the procedures are actually already in place.
- CERN: The VOBox SLA is being discussed actively with the experiments so we are on the way to turning this green.
FZK: Only the agreement from CMS is
- NDGF: For ATLAS there are no VO Boxes to run at the sites. For ALICE they have 7 VO Boxes and the SLA is being prepared.
- PIC: Done for LHCb. ATLAS has no VOBoxes. For CMS an SLA similar to the LHCb one is being proposed and should be done by end of March.
- SARA: Document is ready but not implemented. Will be checked in the next two weeks.
- CERN: the uploading of accounting data is now probably OK, but we are seeing discrepancies of around 10% between the APEL and local accounting, most probably due to the single normalisation factor used by APEL.
- IN2P3: The supplied CPU material was not adequate and had to be sent back to the supplier. Hopefully will ready
- CERN: Also have delivery problems and a supplier had to be replaced. Will be ready for end of April.
- CNAF: Will have the CPUs by mid-May and the storage by the beginning of May. The delay is due to administrative issues.
- NDGF: CPU will be there by April and storage by September.
- PIC: Will have CPU by end of April and storage by June 2008.
- The tests are being done in CCRC08, and some have been done recently.
CASTOR is now at version 2.1.6 instead, but it was tested only at CERN, by ATLAS and CMS.
WLCG-07-39: This is not complete yet and needs to be reviewed in the next few weeks.
- CMS: They ramped-up the resources but will do stress tests in May.
- LHCb: This means running analysis at CERN; it will be done in May 2008.
- OSG: There will be a VDT release with the SE tests. The Availability and Reliability calculations are progressing with the SAM team.
J.Templon added that it seems that the issues are minor and will be verified in April when D.Collados is back.
Verify whether there are issues with NDGF SAM tests. Some comments from J.Templon and D.Collados have not been answered by M.Eller.
5. Multi-Users Pilot Jobs Working Group (Slides) - M.Litmaath
The Pilot Jobs Frameworks working group, launched by the GDB, was mandated by WLCG MB on Jan. 22, 2008.
Its mission is to:
- Review security issues in the pilot job framework of each experiment.
- Define a minimum set of security requirements
- Advise on improvements
- Use of a common library or tool set for proxy management (although this seems unlikely)
- Report to GDB and MB; the time frame is a few months
The members of the working group are:
- ALICE: Predrag Buncic
- ATLAS: Torre Wenaus
- CMS: Igor Sfiligoi
- LHCb: Andrei Tsaregorodtsev
- EGEE: David Groep
- FNAL: Eileen Berman
- OSG: Mine Altunay
- WLCG: Maarten Litmaath (chair)
Three phone conferences have been held, and a 4th call is planned for the end of March (Friday March 28).
The discussion progresses mostly via the mailing list.
Each experiment is to provide a document about their system:
- LHCb were the first; the next version will incorporate feedback from the discussions so far. They already had their document before the meeting, and it has set the tone for the quality and content of the documents.
- CMS provided a first version last week
- ALICE and ATLAS needed more time and have not provided any document yet.
A security questionnaire is being discussed
- Currently at v0.4
- Agreement on the relevance/scope of a question is not always evident
- Each document should provide the Experiment’s answers in an annexe.
Some experiments do not agree on some questions.
- E.g. How user tokens are used by the proxies, from submission until the job is started on the WN. What happens if the job crashes and how the clean-up is done?
- The Experiments replied that these requirements are not imposed on the general gLite components (e.g. WMS has a lot of proxies).
M.Schulz noted that the TCG has launched the security verification of LFC and other components. WMS will be reviewed as soon as the next version is out.
M.Litmaath added that the fact that some gLite components have not yet been verified is not a good reason not to verify the security of the VOs’ frameworks.
I.Bird proposed that VOs that provide the document and the questionnaire and pass the security check be allowed to use their framework, while those not passing or not providing the information should be on hold until they do so.
M.Litmaath replied that it is possible to configure gLExec to allow only some groups or users, but this will require configuration at each single site. In practice this is very difficult to achieve.
I.Bird replied that the PJF working group should report on each VO separately and then the GDB and MB could decide what to do.
J.Gordon added that in the future other applications (Bio-Med, etc) should go through the same verifications (but this is not a WLCG matter).
Ph.Charpentier noted that the sites should already have gLExec installed so that when the solutions are approved the sites are ready.
J.Templon added that gLExec can be configured at the sites only if they define all the details, and that the temporary space can be created in different ways at the sites (e.g. job’s subdirectory, permissions to protect the proxy files, etc).
Ph.Charpentier replied that guidance is needed from the gLExec experts.
M.Litmaath added that all these issues are covered by the discussions that are taking place in the working group.
I.Bird proposed that a general recommendations document on security should be provided even before the frameworks are all certified.
Ph.Charpentier proposed that one site could provide an example installation so that all VOs can test the environment and the configuration in depth before it is deployed elsewhere.
6. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.