LCG Management Board
Tuesday 11 December 2007 16:00-17:00 – Phone Meeting
(Version 1 - 13.12.2007)
A.Aimar (notes), D.Barberis, I.Bird, T.Cass, Ph.Charpentier, L.Dell’Agnello, S.Foffano, C.Grandi, F.Hernandez, R.Kalmady, M.Kasemann, M.Lamanna, E.Laure, H.Marten, H.Meinhard, G.Merino, P.Nyczyk, B.Panzer, L.Robertson (chair), A.Sciabá, J.Shiers, O.Smirnova, R.Tafirout, J.Templon
Mailing List Archive:
Next Meeting: Tuesday 18 December 2007 16:00-17:00 – Phone Meeting
1. Minutes and Matters arising (Minutes)
1.1 Minutes of Previous Meeting
1.3 Site Reliability Report - November 2007 (Site Reports; Slides)
A.Aimar distributed the Site Reports for November 2007.
The table below shows the sites’ reliability since January 2007.
In November 2007 9 sites were above 93% (target is 91%) and another 2 were above 82% (90% of target).
The average for the 8 best sites is 95% and the average across all sites is 92%:
· Avg. 8 best sites: May 94% Jun 87% Jul 93% Aug 94% Sept 93% Oct 93% Nov 95%
· Avg. all sites: May 89% Jun 80% Jul 89% Aug 88% Sept 89% Oct 86% Nov 92%
2. Action List Review (List of actions)
Actions that are late are highlighted in RED.
· 21 October 2007 - Sites should send to H.Renshall their resource acquisition plans for CPU, disk and tape up to April 2008.
NDGF and NL-T1 should send to H.Renshall and S.Foffano an estimate about the delivery of their 2008 capacity.
Update: NL-T1 sent a mail giving November 2008 as the estimated date for making the 2008 capacity available.
· 30 November 2007 - The Tier-1 sites should send to A.Aimar the name of the person responsible for the operations of the OPN at their site.
Not completed. Received so far from TW-ASGC, FR-CCIN2P3 (Jerome Bernier), IT-INFN (Stefano Zani) and RAL (Robin Tasker).
· 14 Dec 2007 - L.Dell’Agnello, F.Hernandez and G.Merino will prepare a questionnaire/check-list for the Experiments in order to collect the Experiments’ requirements in a form suitable for the Sites.
In preparation. Being discussed with the storage and system administration experts at the respective sites.
3. SRM 2.2 Weekly Update (Slides) - J.Shiers
J.Shiers presented a summary of the SRM 2.2 deployment progress.
The tables below are examples of the Sites’ and Experiments’ respective views of the SRM installations and configurations. These tables will be used to summarize the status of all sites in a concise view.
The main points described are:
- More than 50% of the sites are now running SRM 2.2.
- No new bugs have been found in the client tools.
- There is a bug in the FTS (the file is not marked as permanent); it is solved but not yet released.
- There is a bug in srmCopy; the temporary workaround is to use globus-url-copy.
- The Experiments have specified their storage class requirements.
- The issue of space for recalled files will not be solved for now; one first needs to understand the issue and collect experience, and later design a standard solution for all implementations.
Below is the slide presented.
L.Robertson asked whether the installation at CNAF is completed or not (there was a question mark in the slide above).
L.Dell’Agnello explained that the endpoints for CASTOR SRM 2.2 are working at CNAF. They still need to agree with the Experiments that the new endpoints should be used. The agreement must be made before 17 December because no changes can be performed during Christmas.
L.Robertson asked whether the Experiments now have enough sites with SRM 2.2 to start using them for test transfers.
J.Shiers replied that some combinations useful to the Experiments are now possible. The tables above should help to visualize the situation by Site and Experiment.
4. Update on CCRC-08 Planning (Slides) - J.Shiers
The action list agreed at the F2F CCRC Meeting of the previous week is shown below.
5. Status of the VO-specific SAM tests (VO test results - new and old)
5.1 CMS (Slides) - A.Sciabá
5.1.1 SAM in the CMS Tests
SAM is used in CMS to test the basic functionality needed by the CMS workflows, both for Monte Carlo production and for analysis. In this context the SAM tests are used to test both EGEE and OSG sites.
The submission is in all cases done through the LCG Resource Broker. Two sensors are used so far: the CE and the SRM sensor. The tests are run only at specific sites, essentially all CMS Tier-N plus a few others.
CMS has been submitting custom tests for the CE since the beginning of 2007. Tests are submitted every two hours. All tests are run with the lcgadmin role, except the MC test, which is run with the production role.
Since June 2007, CMS has used custom tests for SRM v1. File transfer is done via srmcp and the “production” role is used.
These tests have a dependency on the PhEDEx database.
Tests for SRM v2 are in development. There are no tests for the SE, and CMS sees no reason to use both the SE and the SRM sensor in SAM.
5.1.2 WLCG Availability Calculation
The WLCG availability is determined by the choice of the critical tests.
- Job submission: the CE is unavailable if it cannot run a CMS job via RB [run by CMS]
- CA certs: the CE is unavailable if it does not have the correct CA certificates [run by ops]
- VO tag management: the CE is unavailable if the publication of experiment tags does not work [run by ops]
- Put: the SRM is unavailable if it is not possible to copy a file on it via srmcp [since 10/12/07]
- SE, FTS, RB, etc.: no critical tests defined
There are some problems with the WLCG Availability calculation:
The availability calculation in GridView is wrong (bug #31233): if a service type stops having critical tests, all its instances will have status UNKNOWN but the combined service status is not updated any more.
This is serious: CMS stopped having critical tests for the SE on 13/11, and since then the SE status has been frozen to what it was immediately before.
The impact on the Tier-1 global availability is effectively “random”:
- ASGC: SE always available, no impact
- CERN-PROD: SE always unavailable, serious impact (always red)
- FNAL: SE status UNKNOWN, no impact
- FZK: SE status UNKNOWN, serious impact (always grey)
- IN2P3: SE always on maintenance (not for real!), serious impact (always yellow)
- INFN-T1: same as FZK
- PIC: same as FZK
- RAL: same as IN2P3
In other words, the Tier-1 WLCG availability for CMS has been wrong for about a month.
R.Kalmady explained that this problem is fixed and GridView will be released in the next few days.
The graphs will be regenerated when possible. In some cases the hosts have been modified and there is no data for some of them.
There is also a problem in the new WLCG availability algorithm (bug #31233):
- If a service has no critical tests defined, its status is UNKNOWN; but
- if a VO says that no test is critical for a service type, it means that that service is always available for them (unless it is in maintenance, of course).
- Therefore, if e.g. the SE has no critical tests, all SEs should always be OK.
R.Kalmady asked what status should be assigned if a service has no critical tests defined: should it be considered up?
A.Sciabá and Ph.Charpentier proposed to consider the service as working.
R.Kalmady asked whether it should be considered “green” or whether there should be a new state.
L.Robertson and Ph.Charpentier noted that there should be a status showing that the service is “not tested”, but that it should be propagated as “OK” in the calculations.
The MB agreed that the service should be in a state “irrelevant” or “not tested” and should be ignored (considered “up”) in the calculations.
R.Kalmady asked what happens if a not-tested service is scheduled down: should it be marked as such?
The MB decided that when a service without critical tests is scheduled down it should be marked “scheduled down” and that status should be propagated.
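The rule agreed above can be sketched as follows. This is a minimal illustration only: the function and status names are assumptions for clarity, not GridView's or SAM's actual code.

```python
# Sketch of the status rule agreed by the MB (names are illustrative):
# a service with no critical tests is shown as "not tested" but counts
# as up in availability calculations, unless it is in scheduled downtime.

def service_status(has_critical_tests, tests_pass, scheduled_down):
    """Return the status of a service instance under the agreed rule."""
    if scheduled_down:
        return "SCHEDULED_DOWN"   # propagated even for untested services
    if not has_critical_tests:
        return "NOT_TESTED"       # displayed as "not tested", counted as up
    return "OK" if tests_pass else "DOWN"

def counts_as_available(status):
    """NOT_TESTED is ignored (treated as up) in the availability calculation."""
    return status in ("OK", "NOT_TESTED")
```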
G.Merino asked why the SE service is still included, given that the SRM service is provided by all the Tier-1 sites and the SE service is a legacy.
There are BDII inconsistencies at FNAL, which publishes its resources on two different “GLUE” sites:
- uscms-fnal-wc1-ce: contains one CE
- uscms-fnal-wc1-ce2: contains another CE
The effect: the FNAL WLCG availability ignores the status of one of the CEs; the availability may therefore be overestimated when that CE is down.
5.1.3 CMS Availability
CMS has been using its own custom definition of availability for internal use.
It is calculated as the daily fraction of CMS SAM tests for the CE which were successful:
- No test is really “critical”; every failure just degrades the estimate a bit.
- The SRM tests are not included in the calculation.
It is calculated by a script run by hand.
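The internal CMS definition described above can be sketched in a few lines. The function name and interface are illustrative assumptions, not the actual CMS script.

```python
# Sketch of the internal CMS availability definition: the daily
# availability of a CE is the fraction of that day's CMS SAM tests
# that succeeded. No single test is "critical" (each failure only
# degrades the estimate) and SRM tests are excluded.

def daily_ce_availability(test_results):
    """test_results: booleans, one per CMS CE SAM test run that day."""
    if not test_results:
        return None  # no tests ran that day: availability undefined
    return sum(test_results) / len(test_results)
```

For example, a day with three successes out of four test runs yields an availability of 0.75.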
A new, more WLCG-like calculation has been implemented in the ARDA dashboard:
- The algorithm is very similar to the WLCG one.
- All CMS tests are taken as critical, for now.
See slides 13 and 14 for the differences in the availability calculated with the new and the old definitions.
The choice of critical tests for WLCG is constrained by the fact that if a CE fails a critical test it is also removed from the BDII by the FCR. The choice must therefore be careful and conservative.
For CMS, a test might be critical if its failure prevents some high-level workflow from working. The choice can therefore include other tests (e.g. whether jobs run, MC is OK, or access to the calibration DB fails).
5.1.4 SAM Tools
The FCR is the only tool to set the critical tests and to whitelist or blacklist specific site instances.
There are two problems:
- OSG sites are not included any more, although they were some time ago.
- The only service types supported are the CE and the SE, but not the SRM.
The standard SAM web interface is inadequate and has been basically frozen for several months. It does not show EGEE and OSG sites together, it does not allow showing only “real” CMS sites, and it has some bugs in the history view.
CMS has turned to the ARDA dashboard team in order to have a better graphical interface. It is very easy to have new features put in place, and the work can easily be reused by other VOs.
5.1.5 Future Plans
- Add more tests, in particular “analysis” tests including read access to local data and stage-out to remote storage.
- Feed the CMS availability back into SAM as another SAM test for easy viewing, and plug the CMS SAM tests into the site monitoring; tools are in development in the SAM group at CERN.
F.Hernandez noted that the Experiments are using several visualization tools and this diversity is not easy for the sites to use.
I.Bird commented that the Sites should include the SAM results in their own site monitoring systems, as also suggested by the Monitoring Working Group.
J.Templon added that there should be tests corresponding to the MoU requirements so that a site can monitor whether it is within the MoU agreements.
5.2 LHCb (Slides) - Ph.Charpentier
5.2.1 LHCb and SAM Tests
LHCb Critical Tests:
Dedicated jobs (SAM jobs):
- Check site capabilities:
  - SW repository access rights
  - correct sgm account mapping
  - the platform and deployed middleware (lcg-utils version)
- Install the LHCb software:
  - from a list of current releases of applications
  - installation on the SW repository (shared area)
- Run test applications:
  - simulation, digitisation, reconstruction on 10 events
Not yet in Production:
- Dedicated tests (cron jobs):
  - FTS transfers (full matrix)
  - SRM tests (response time)
- Information gathered by operations:
  - SE performance (staging time, SRM response)
  - transfer error analysis
5.2.2 Tests Execution and Logging
Slide 3 shows how the tests are submitted through DIRAC. They target all CEs accepting LHCb jobs, report to the SAM DB and upload the log files to the DIRAC log system.
Slide 5 shows that GridView is reporting a difference with the new algorithm. There is a bug in the reliability calculation.
R.Kalmady acknowledged that there is a bug in the reliability display.
Slide 6 shows that LHCb reports only the status of the CE; but, in order to correct for the services that have no tests, LHCb has added some dummy tests (as already discussed).
The SAM DB keeps records of all past tests; a clean-up is really needed.
5.2.3 Comments on GridView
- The list of Tier-1 does not correspond to those serving LHCb. How to restrict Tier1’s to those relevant?
- Why are NIKHEF and SARA two Tier1’s? Should be NIKHEF CE with SARA SE.
J.Templon added that NL-T1 actually also has CEs at SARA, so the combination is more complicated than simply CEs at NIKHEF and SEs at SARA.
P.Nyczyk added that the calculation in GridView is simpler than the one the VOs and the sites seem to expect. But to make it more customizable is a lot of work (which is being started).
- The SRM state should NOT simply be the “OR” of the instances; that is not enough.
P.Nyczyk added that these principles are those he presented at the GDB, but they imply major changes in SAM and it will take time to study and implement them. Maybe the right place for these calculations is the dashboard, where this is already implemented.
- Ergonomics of the queries: the menus are not really convenient and selecting time ranges is cumbersome (limited to 31 days!).
J.Templon questioned the fact that VOBOXes could be used for site availability while they are completely under the responsibility of the VO. They had a problem with the ALICE VOBOX at SARA. If the VOBOX reliability is low, unless it is a hardware problem, the responsibility should fall not on the sites but on the VO.
6. HEP Benchmarking (Slides) - H.Meinhard
H.Meinhard presented a summary of the activities of the HEPiX working group on Benchmarking.
In Autumn 2006 the IHEPCCC chair contacted the HEPiX conveners for help on two technical topics: file systems and CPU benchmarking.
The CPU Benchmarking group was started in April 2007, but its work really began at the HEPiX meeting in St Louis in November 2007.
The focus is initially on benchmarking the processing power of worker nodes. People who can spare a WN machine temporarily will announce this to the list.
A standard set of benchmarks (the SPEC benchmarks) should be run, and the group also seeks the collaboration of the Experiments in order to check how well real HEP code scales with the industry-standard benchmarks.
The environment is fixed to the one in use by the Experiments (SL 4 x86_64, 32-bit applications with gcc 3.4.x), with the compilation options agreed by the LCG Architects’ Forum; multi-threaded benchmarks vs. multiple independent runs should also be evaluated. H.Meinhard proposed an interim report at the HEPiX meeting at CERN in May 2008.
L.Robertson reminded the MB that the adoption of new benchmarks is very urgent for the Experiments and for the Sites, in order to specify the requirements and proceed with the tenders. In October, after the presentation by M.Michelotto, it was agreed to prepare a range of machines on which to run the SPEC 2006 benchmarks and the applications from the Experiments. The current time scale seems too late for the needs of the LCG.
H.Meinhard agreed that SPEC 2006 is the most interesting benchmark to verify and that this should be done sooner. The Experiments will initially run their jobs themselves on the benchmarking machines; later they could package their benchmark applications to be run without the Experiments’ experts.
L.Robertson reminded the MB that the new benchmarks should be adopted by early March because they are needed for the Resources Scrutiny Group meeting in March.
Ph.Charpentier noted that the benchmarks are needed also because:
- the Experiments will have to re-calculate their requirements using the new unit and
- the Sites will have to use it for their pledges and tenders.
H.Meinhard agreed to ask the working group to quickly proceed to the preparation of some hosts and report to the LCG MB the progress. He will report to the MB in January about the setup and the initial benchmarking.
Ph.Charpentier proposed that, when the new unit is known, all worker nodes should be evaluated under this new unit.
Experiments should nominate who is responsible for the benchmarking of their applications on the machines made available by the HEPiX Benchmarking Working Group.
8. Summary of New Actions
The full Action List, current and past items, will be in this wiki page before next MB meeting.