LCG Management Board
Tuesday 27 March 2007 - 16:00–17:00 - Phone Meeting
(Version 1 – 29.3.2007)
A.Aimar (notes), L.Betev, I.Bird, N.Brook, T.Cass, Ph.Charpentier, L.Dell’Agnello, Di Quing, C.Eck, I.Fisk, S.Foffano, D.Foster, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, H.Marten, G.Merino, H.Renshall, L.Robertson (chair), O.Smirnova, R.Tafirout, J.Templon
Mailing List Archive:
Tuesday 3 April 2007 - 16:00-18:00 – F2F Meeting
1. Minutes and Matters arising (minutes)
1.1 Minutes of Previous Meeting
No comments received. Minutes approved.
1.2 Matters Arising
No matters arising.
2. Action List Review (list of actions)
Actions that are late are highlighted in RED.
Done. The calendars by experiment and site are already available on the SC4 Experiments Plans Wiki, which contains links to all the plans.
VOs should always send all updates to H.Renshall (via the ECM meeting).
Done. Alice and LHCb sent their targets for 2007.
3 Apr 2007 - A.Aimar will update the targets for all four LHC experiments and distribute it to the MB.
Not done. Waiting for CMS (D.Newbold) and LHCb (N.Brook).
Done. Summary presented at this MB meeting.
Not done. L.Robertson will propose a new date for this milestone. (end May 2007)
Cancelled. L.Robertson had discussed the issue with J.Engelen. Beyond 2009 there are several options about the evolution of the LHC, but nothing has been decided yet and so computing estimates must remain speculative.
3. Site Reliability Reports for February 2007 – Summary (Reliability Data; Site Reports; Slides) – A.Aimar
All sites commented on the Reliability Data document distributed at the end of February (pages 3 to 5 refer to February 2007).
The attached Slides show the February daily reliability values (slides 2 and 3), which are summarized and compared with the January values in slide 4 and in the table below (sites ordered as in the Reliability Data document):
Reliability >= 88% (>= Target)
Reliability >= 79% (>= 90% of Target)
Reliability < 79% (< 90% of Target)
The target of 88% for the best 8 sites was not reached:
- 5 sites > 88% (target)
- 5+3 sites > 79% (90% of target)
In January there were 5+4 sites > 79%.
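As an illustration, the banding above (target 88%; 90% of the target, reported as 79%) can be sketched in a short script. The site names and reliability values below are hypothetical, not the actual February figures:

```python
# Classify a site's monthly reliability into the colour bands used above.
TARGET = 0.88          # monthly reliability target
NEAR = 0.9 * TARGET    # 90% of the target, ~79%

def band(reliability):
    """Return the band a reliability value falls into."""
    if reliability >= TARGET:
        return "green (>= target)"
    if reliability >= NEAR:
        return "orange (>= 90% of target)"
    return "red (< 90% of target)"

# Hypothetical site values, for illustration only
sites = {"Site-A": 0.95, "Site-B": 0.81, "Site-C": 0.60}
for name, value in sites.items():
    print(f"{name}: {value:.0%} -> {band(value)}")
```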
J.Gordon asked why the NDGF data is “n/a” while reliability data for NDGF is actually available in GridView.
L.Robertson replied that it was agreed to wait for a proposal from NDGF on the use of tests that they would develop and which would be equivalent to those of SAM. The results for NDGF will be included when this has been agreed and the new tests developed.
O.Smirnova confirmed that some SAM tests execute successfully (e.g. the BDII tests) but, for instance, those for the CE and the WN need to be developed specifically by NDGF because of the structure of the NDGF Tier-1 site; those tests are not ready yet.
The table below summarizes the Site Reports received, in “Problem, Solution” format where a solution is available:
One can notice that the main issues were related to the SRM/MSS systems at the sites:
J.Templon added that the dCache patch only partially fixed the “gridftp doors” problem, which is why further intervention at the site was required.
F.Hernandez pointed out that the “misconfigured RB” was not due to IN2P3; rather, IN2P3 was overloaded because of that problem.
There were no outstanding issues, but it can be noted that:
- Several problems were due to the upgrades of the dCache systems to SRM 2.2.
- The SRM and MSS systems are starting to be overloaded at many sites, probably due to more realistic usage by the experiments.
- Individual site BDIIs are needed (decided at the Operations Meeting). Sites that have installed them have solved their timeout problems.
- A hardware upgrade was sufficient in many cases of service overloading (SRM, CE, etc.).
- There were no “false positives”, so the reliability of the SAM tests seems to have improved. But site reliability did not improve in February compared to January. This may be due to teething troubles with the dCache upgrades at several sites, but the 88% target seems to be difficult to reach for 8 sites. Other upgrades (SL4 migration, gLite 3.1, SRM 2.2, etc.) have still not been implemented, and they could introduce new problems that further reduce the reliability and stability of the sites.
The work for checking and reporting on site reliability to the Operations Meeting in the weekly site reports is ongoing. It will not be ready for March, but probably by the end of April 2007.
F.Hernandez noted that, in spite of past requests, SAM had not evolved to be more “site-friendly” with readable failure messages. This makes it very difficult to understand the causes of the failures.
I.Bird replied that this work is on the list of things to do and will be taken into account after other, more urgent issues.
Received from M.Ernst:
As correctly stated in the table on "Operational Issues", SAM tests have been failing for BNL since 19 February because of problems with the gLite middleware (in particular with the CE) since the farm was upgraded to SL4 (and a new Condor version).
I think the real issue here is that we need to correct the metrics. Monitoring the LCG/gLite CE at BNL means we are probing the availability of a component that is not needed in the (…). No production/analysis job ever gets dispatched through this CE.
Jamie and I discussed the matter last week and concluded that the metrics will be corrected for BNL such that the CE will be taken off the list of components to be tested.
4. VO Boxes - Sites Support Level (Slides) – G.Merino
G.Merino was asked to present the level of support of VOBoxes at PIC in order to trigger discussion and clarification of those issues between sites and experiments. A high-level milestone is set for April 2007: “Sites should propose and agree with the VO the level of support (upgrade, backup, restore, etc) of VOBoxes”.
4.1 VOBoxes at PIC
PIC supports VOBoxes for the three experiments: ATLAS, CMS and LHCb
For each of the experiments PIC has one person on-site acting as VO-liaison who:
- knows and operates the VOBox services (CMS)
- knows whom to contact in the experiment for VOBox issues (ATLAS, LHCb)
The VOBoxes at PIC run respectively:
- ATLAS: Data Transfer agents (Don Quijote)
- CMS: PhEDEx Agents
- LHCb: DIRAC Configuration Service and Transfer Agents
4.2 VOBoxes Operation
Backup: No backup requested from the site.
- CMS: The only important files to keep are the PhEDEx agents’ configuration files. These are kept in an external (CMS-owned) CVS repository.
- ATLAS: The MySQL tables are re-generated centrally in case of failure.
Monitoring: No VO-specific sensors provided for sites to show the VOBox status. All experiments have VO-specific sensors external to the site and they detect whether the VO-specific agents are working correctly.
- If any problem occurs, the VO contacts the local VO-liaison
- Site operators do not get alerted
The site should only monitor basic metrics (host is alive, CPU load, etc.). If any of these alerts is triggered, the local VO-liaison is notified.
Recovery: No special recovery action is agreed.
- If a major problem happens, the site admin tries to reboot the VOBox node; if the machine is not responsive, a new default VOBox is reinstalled.
L.Robertson asked what “response time” had been agreed with the VOs.
G.Merino replied that there is no formal agreement; the VOBoxes are “ping-ed” like any other host. PIC has no special spare for VOBoxes or for specific purposes. They have spare equipment that is like a “powerful WN”, not a full server, which would be used to replace a VOBox if needed.
As of today, preparing a VOBox from scratch would take about a full day. Is this adequate for the experiments?
4.3 Specific Issues on VOBoxes at PIC
1. The CMS VOBox still performs quite a lot of "class 2" functions. Today it would not work if taken out of the site LAN.
For instance: checking whether a pre-staged file is already on disk, by using "local" commands (CASTOR 1 commands, for instance). Is this just a temporary solution until we get SRM-v2.2?
Nobody from CMS present to reply.
2. Many of the LHCb transfers hitting PIC's SRMs do not use FTS
The LHCb Transfer Agents in the VOBox seem to play the FTS role (transfer queue, retries, etc.). This prevents the PIC system administrators from having any control over the LHCb data flow. When transfers are FTS-driven, one can administer the FTS channels (in particular, close them when there are problems). For LHCb, PIC cannot do this, since only LHCb people can interact with the transfer agents in the VOBox.
For PIC it was impossible to recover a transfer backlog because all the retries were done directly from the WNs, which would overload and block all network transfers.
N.Brook replied that the VOBoxes retry the transfers when they fail. LHCb will continue to use the WN only for the first copy attempt, but failures will be retried by the VOBox using FTS channels in the future. He noted, however, that the transfer rate integrated on all Tier-1 sites is less than 10 MB/sec for LHCb.
3. There are some LHCb Tier-2s with no disk at all
The data flow from some LHCb Tier-2 sites into PIC will always come directly from the WN to the SRM. PIC lacks any FTS “operations” control. Is this a temporary situation, or will every Tier-2 site have some small SRM-disk to act as a buffer for those transfers?
N.Brook replied that in this case also the VOBoxes will use FTS; therefore the site will be able to control those transfers via the normal FTS service administration.
L.Robertson asked for comments from other sites:
- H.Marten reported that similar informal procedures are defined at FZK with VO-liaisons and basic backup/recovery like at PIC.
- J.Gordon added that RAL has similar agreements but nothing is formally specified.
H.Marten added that a written agreement common to all sites would help to make sure all issues are clear between site and VOs and all sites use the same standards. The proposal was not discussed further for now.
L.Robertson concluded that it is important that all sites know what to do if there is a VOBox failure. For now sites should verify that independently. This will be verified by the high-level milestone scheduled for April.
5. Mid-Term Resource Planning (Slides; document, 2Q2007 Req. Table) – H.Renshall
The document attached is a new version of L.Robertson’s document “Summary of the Process for Reporting Experiment Requirements, and Site Capacity and Usage Data for CERN and the Tier-1 Centres (version 7)”. Changes are highlighted in yellow in the document.
From 1 April we will start the new resource reporting process, using the values in the Mid-Term Resource Planning tables. The 2Q2007 table of WLCG Service Coordination planning uses values for ‘installed’ capacity taken from the end of January accounting plus any increments announced at the January workshop. Tier-1 sites should verify their values.
Other miscellaneous changes to 2Q2007 table were:
- addition of LHCb requirements for June and July
- addition of new columns of Allocated disk space (in italics) per site per experiment taken from the end of February accounting
- updated pledges for 2007, which need to be available to experiments from 1 July 2007 till 1 April 2008; the 2006 pledges should be available up till 30 June 2007
- added CERN Tier-0 and CAF resources for completeness
Issues for discussion:
- New proposal that Tape1Disk0 disk buffer size be quantified and be added to disk requirements.
- The attached draft 2Q2007 table shows discrepancies that we will follow up individually (after sites have verified their installed capacities).
- For ATLAS and CMS the allocated disk capacity is much larger than that formally required so this is to be understood (we know, for example, that ATLAS MC event sizes are bigger than predicted).
- The averages of the site offers of disk and CPU were used to give a single site share for each experiment, and that share was used to calculate all the site requirements. Is this an oversimplification (tape offers are often much lower)?
This point needs to be investigated further.
- In the longer term we are assuming that the full 2007 requirements, as in the TDR addenda, need to be available to experiments from 1 October 2007 (corrected during the meeting to 1 July 2007) and that the full 2008 requirements, as per the ‘Summary of Regional Centre Capacities 01/02/2007’, need to be available from 1 April 2008, at which point the mid-term planning tables converge to the annual planning and probably no longer need a separate existence.
- The ramp-ups to 4Q2007 (double of 2Q2007) and then to 2Q2008 (double of 4Q2007) are very steep. A lot of hardware needs to be bought, installed and commissioned.
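As a rough arithmetic check, the two doubling steps above compound to a factor of four within a year. A minimal sketch, using a hypothetical baseline value rather than any real site pledge:

```python
# Sketch of the capacity ramp-up described above:
# 4Q2007 = 2 x 2Q2007 and 2Q2008 = 2 x 4Q2007, i.e. x4 overall.
baseline = 1000  # hypothetical 2Q2007 capacity (e.g. in kSI2K), for illustration
capacity = {"2Q2007": baseline}
capacity["4Q2007"] = 2 * capacity["2Q2007"]
capacity["2Q2008"] = 2 * capacity["4Q2007"]
for quarter, value in capacity.items():
    print(quarter, value)  # 2Q2008 ends up at 4x the 2Q2007 baseline
```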
L.Robertson reminded the sites that from March 2007 the available and installed data for resource accounting will be taken from these tables as at the end of each month (i.e. March values from the 2Q2007 table, taken on the 8th of April 2007) and from the APEL repository for automatic resource accounting.
Following the email exchange preceding the MB meeting, L.Robertson asked the MB to agree (again) that the requirements from the experiments include the efficiency factors agreed in the TDR and that all values are gross values. The MB agreed.
The efficiency factors already include all the disk needed for buffers, etc., and the experiments should request gross values (including efficiency), which is what the sites will install and report on.
N.Brook added that LHCb will send to H.Renshall their updated requests, in order to include the efficiency factors.
Ph.Charpentier expressed his scepticism that tape efficiency can be 100% without overhead, as agreed in the TDR.
H.Marten replied that this was probably decided taking into account that data compression by the tape equipment can give 15-20% compensation. The MB agreed to discuss this in the future if the 100% factor for tape proves to be incorrect.
7. Summary of New Actions
3 Apr 2007 - A.Aimar will update the targets for all four LHC experiments and distribute it to the MB.
The full Action List, current and past items, will be in this wiki page before next MB meeting.