LCG Management Board
Tuesday 8 July 2008, 16:00-18:00 - F2F Meeting
(Version 1 - 15.7.2008)
A.Aimar (notes), I.Bird (chair), K.Bos, D.Britton, T. Cass, Ph.Charpentier, L.Dell’Agnello, F. Donno, M.Ernst, I. Fisk, J.Gordon, A.Heiss, H.Marten, P.McBride, H.Meinhard, G.Merino, A.Pace, B.Panzer, Y.Schutz, J.Shiers, R.Tafirout, J.Templon
Next Meeting: Tuesday 22 July 2008 16:00-17:00 – Phone Meeting
1. Minutes and Matters Arising (Minutes)
1.1 Minutes of Previous Meeting
The minutes of the previous MB meeting were approved.
1.2 SAM Tests for June 2008
- VO-Specific SAM Tests - June 2008 (Reliab_Summary_200806)
A.Aimar attached the Sites Reliability Report for OPS and VO-specific test for June 2008.
He had asked for explanations of the VO-specific results; replies were received from ATLAS and CMS, but not from ALICE and LHCb.
Tier-1 Report: The Tier-1 Report will have the data for 6-7 June corrected for all sites. The data for 22-23 June for CERN will also be corrected: the downtime was not due to CERN but to the crash of a tool used by SAM.
Tier-2 Report: The OSG data is still not accurate for June but will be fixed for July.
Ph.Charpentier added that the LHCb tests are incorrect and will be fixed.
I.Bird proposed that from July onward the results should be officially published.
Update (12 July 2008) - Here are the updated reports. Link
1.3 LHCb’s Request about Multi-User Pilot Jobs
The request presented at the previous MB meeting (Link, section 1.2) was accepted by the MB.
2. Action List Review (List of actions)
The issue was discussed at the LHCC mini review and J.Gordon received the updated information. J.Gordon will pass the information to H.Renshall.
Ongoing. It will be installed on the pre-production test bed (PPS) at CERN and LHCb will test it. Other sites that want to install it should confirm it.
The only information still missing is the list of four CMS user DNs that can post alarms to the sites' email addresses.
Done. Presentation during this meeting.
When CERN was down, no new SAM tests were run. This can be seen for CERN in GridView, where there is a large gap from 03:22:55 to 14:49:53 on 30/05/2008.
The last test CERN had (at 03:22) was GREEN, so during the whole period it was considered GREEN by GridView – thus availability was 100%.
After the fact, two downtimes were entered for CERN: 06:36-08:00 and 08:13-10:00. GridView only takes into consideration downtimes entered 2 hours BEFORE an intervention. Therefore, reliability was 100%.
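The 2-hour rule can be illustrated with a small sketch (a hypothetical helper, not GridView's actual code): a downtime is honoured only if it was registered at least two hours before the intervention started, so the two retroactive CERN entries are ignored.

```python
from datetime import datetime, timedelta

def honoured(downtimes, min_notice=timedelta(hours=2)):
    """Keep only downtimes registered at least `min_notice` before they start
    (the GridView rule described above)."""
    return [d for d in downtimes if d["start"] - d["registered"] >= min_notice]

day = datetime(2008, 5, 30)
# Both CERN downtimes were entered after the interventions had already
# started, so GridView ignores them and the day stays 100% reliable.
downtimes = [
    {"registered": day.replace(hour=10, minute=30),
     "start": day.replace(hour=6, minute=36), "end": day.replace(hour=8)},
    {"registered": day.replace(hour=10, minute=30),
     "start": day.replace(hour=8, minute=13), "end": day.replace(hour=10)},
]
print(len(honoured(downtimes)))  # 0
```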
3. Services Weekly Report (Slides) - J.Shiers
J.Shiers presented a summary of status and progress of the LCG Services. This report covers the last two weeks.
Notes from each daily meeting can be found on these pages:
Below are the main issues reported at the meeting. Other issues, and more details, are in the slides attached.
3.1 Power and Cooling
IN2P3 – On the weekend of 21/22 June IN2P3 suffered a serious problem with the A/C and had to stop about 300 WNs. The report received was insufficient and was given only at the Operations Meeting.
CNAF – Numerous power-related problems, but it was difficult to receive a response and clear news about them. Nobody replied to the ATLAS emails for 40 hours.
L.Dell’Agnello explained that the request was taken into account and the support staff at CNAF were working actively, but until Sunday nobody replied to the emails from ATLAS.
CERN – Scheduled power test, in order to check that the problems found in the previous (unscheduled) cut were fixed. Power tests will probably continue on a quarterly basis.
J.Shiers stressed the fact that Sites should send their post-mortem reports quickly and without waiting for a request from the WLCG Operations.
CERN 20-22 June – Availability was at zero from June 20th to the 22nd. The problem was traced to a limit (~100) on the number of tuples published from gstat to SAM. The number of CEs at CERN, due to the hardware ‘rotation’, was higher and this crashed gstat.
CMS Dashboard Load – An index was created on the query suspected of causing the performance bottleneck. Performance seems better, but more follow-up is needed to understand the real issue.
GridView Throughput - The throughput display collapsed on the 26 June. A restart of the Tomcat applications server cured it but this should be monitored in the future (i.e. monitor the monitoring).
FTM Installation - Check of which sites have installed FTM – follow-up at weekly joint ops meeting.
Network Problem – A network problem affecting all BDIIs hit SAM monitoring on Wednesday 2nd July. Will availability be automatically corrected? John Shade to follow up.
CMS SAM Tests - Request from CMS to use WMS to submit CMS SAM tests – need to understand user mapping for this case.
Faulty SAM Alarms at RAL – A problem with SAM on Thursday night caused many alarms to be sent to RAL (and also to CERN). RAL staff were called out by SMS.
For performance measurements, except for Lemon for CERN-related activities and the T0-T1 transfer display in GridView, nothing else was provided to show the combined picture of experiment metrics sharing the same resources. Work will be done to provide better tools, but FTM should be installed at all sites.
The current FTM deployment is:
ASGC: Already deployed and operational.
Sites are still somewhat disoriented: there is insufficient information from the VOs, and they have no clear idea how to assess their own role and performance and whether they are serving the VOs well.
J.Templon, F.Hernandez and R.Santinelli will work together in order to define a solution (probably à la GridMap) by the end of the summer.
3.4 Data Management
The long-standing LHCb problem with gsidcap at IN2P3 (multiple connections) is fixed by dCache patch level 8. More details are in the slide notes. Encouraging xrootd tests (LHCb) are ongoing at SARA.
Gfal 1.10.14 should now be thread safe (to be confirmed).
The LFC at RAL was crashing; installing version 1.6.10-6, which fixes this problem, is recommended. RAL should then send out information through the Joint Operations meeting(s). The LFC for LHCb at CERN had the wrong number of threads (20); it has now been reset to 60.
A new version of CASTOR and CASTOR-SRM will be released and requires an intrusive upgrade (i.e. a down-time to be scheduled).
New space tokens for ATLAS and LHCb were introduced. Further clarification of ATLAS space token permissions for DPM is needed. It could require DPM development and therefore take weeks.
ATLAS continues to have problems with reprocessing at the dCache sites – needs follow-up to locate the problem (dCache, gfal, ATLAS, etc).
Slides 9 to 11 were not described but show issues encountered by other services (with details on Data and DB Services).
Slides 12 to 15 show a summary of the reports by the 4 LHC Experiments.
In summary there was a lot of unstructured activity, but this will probably be the normal trend. Summer will be difficult to cover, but it is better that people take their holidays in July and August rather than later.
It is important that post-mortem reports are sent by the sites; if the downtime is longer than a given amount of time, the report should be sent to the OB and included in the QR report.
The MB agreed that the Post-mortem reports should be sent to the Overview Board (in the QR reports) from now on.
J.Gordon and R.Pordes asked why the patch to VOMS also included unwanted changes (i.e. non-backward-compatible ones that caused problems). This caused failures in WMS, and the services had to be rolled back to the previous version of VOMS. This should have been explained at the Operations meeting and reported.
4. Proposal for Obtaining Installed Capacities (Slides) - F.Donno
F.Donno presented a proposal for obtaining, in an automatic manner, the installed CPU and SE capacity at all sites.
The goal is to provide the management with information about installed capacity (per VO) with information about actual resources usage (per VO), focusing on CPU and SE resources.
The intention is to obtain the information from the existing information system and distribute it via the APEL WLCG Accounting system.
4.1 Storage Resources
The technical specifications are available in the CCRC Twiki, in the SSWG section (see Link).
The documentation available covers only SE resources and reports on the conclusions reached during focused meetings with developers and information providers, including some specific solutions that had to be found to cover dCache internal specialized buffers and avoid double counting. The document has been agreed by the storage developers, storage information providers and data management developers.
The current accounting collects data from the BDII, but some data is incorrect and it is still a mix of Glue 1.2 and 1.3 schemas.
The current reports cover the last hour and the monthly summary.
The agreement is to use the GlueSA class for both disks and tapes.
- Shared spaces are published as a Storage Area with multiple Access Control Rules, without distinguishing the details of per-VO usage.
- The type of space is represented through the Capability attributes (read-only, wan, lan, etc.)
The following GlueSA attributes will provide this information:
- GlueSAReservedOnlineSize - Space physically allocated to a VO, or a set of VOs if shared – Installed capacity
- GlueSATotalOnlineSize (in GB = 10**9 bytes) - Total Online Space available at a given moment. It does not account for broken disk servers, draining pools, etc.
- GlueSAUsedOnlineSize (in GB = 10**9 bytes) - Size occupied by files that are not candidates for garbage collection – Resources usage by the VOs.
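As an illustration of how these attributes could be aggregated once published, the sketch below (with made-up sample records, not real BDII output) sums the three sizes per VO:

```python
# Hypothetical GlueSA records as they might be extracted from the BDII;
# sizes are in GB (10**9 bytes), as specified above.
records = [
    {"GlueSAAccessControlBaseRule": "VO:atlas",
     "GlueSAReservedOnlineSize": 100,   # installed capacity
     "GlueSATotalOnlineSize": 95,       # currently online
     "GlueSAUsedOnlineSize": 60},       # occupied by non-collectable files
    {"GlueSAAccessControlBaseRule": "VO:atlas",
     "GlueSAReservedOnlineSize": 50,
     "GlueSATotalOnlineSize": 50,
     "GlueSAUsedOnlineSize": 10},
    {"GlueSAAccessControlBaseRule": "VO:lhcb",
     "GlueSAReservedOnlineSize": 40,
     "GlueSATotalOnlineSize": 38,
     "GlueSAUsedOnlineSize": 5},
]

def per_vo_totals(records):
    """Aggregate the three GlueSA size attributes per VO."""
    totals = {}
    for r in records:
        vo = r["GlueSAAccessControlBaseRule"]
        t = totals.setdefault(vo, {"reserved": 0, "total": 0, "used": 0})
        t["reserved"] += r["GlueSAReservedOnlineSize"]
        t["total"] += r["GlueSATotalOnlineSize"]
        t["used"] += r["GlueSAUsedOnlineSize"]
    return totals

print(per_vo_totals(records)["VO:atlas"])
# {'reserved': 150, 'total': 145, 'used': 70}
```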
J.Templon noted that installed capacity includes all resources that are ready but not necessarily assigned to the VOs, though they should be assignable within minutes. There should be a mechanism to show the resources installed at a site but not yet assigned to VOs.
F.Donno replied that if a resource is not assigned it is currently not accounted for by this system, but the plan is to also be able to report installed but still unallocated resources.
I.Bird noted that an automatic solution is needed for the Tier-2 sites because there is no way to collect the data manually.
J.Templon added that in his opinion this information is management information and is not needed for the usage of the resources, therefore should not be mixed with the Information System.
I.Bird replied that this seemed a sensible solution, but if there are better ones they should be proposed.
J.Gordon added that requesting this information from the sites via other channels (e.g. email) yields only about a 10% reply rate.
R.Pordes added that the “installed resources” is management information and OSG will define a solution for storing this information (probably not via GLUE).
4.2 Status and plans of the Implementations
CASTOR - The dynamic information provider for CASTOR is already available but corrections are needed to comply with what was agreed. Packaging and distribution will go through the CASTOR CVS; the first test installation will be at CERN and should be tested by the end of July 2008.
dCache - The dynamic information provider for dCache is also available. One still needs to verify with the dCache developers that pinned-files usage information is available, plus a few other details. It will take 2-3 weeks to implement the proposal, and the changes will also be reflected in the new official dCache information provider.
DPM - The dynamic information provider for DPM is already implemented. It is installed at Edinburgh and the testing phase has started.
StoRM - The information is already available for StoRM; probably only minor additions (for VOInfoPath) are needed.
BestMan - No information available. The WG will need to contact the developers in order to start this work with BestMan as well.
4.3 Computing Resources
In the current pledges table (in the MoU) all resources are expressed in terms of “KSI2000 per federation”. APEL already provides the information about resource usage; work is in progress to make the published information about installed capacity more reliable (see the presentation of S.Traylen at the GDB). GridMap already provides the information per region; work is ongoing to provide a view per federation and an interface to retrieve the data.
Proposal: Publish the static pledges and the dynamic view in the current Megatable format (by federation) by providing APEL plug-ins: the installed capacity is retrieved from the Information System and stored in APEL. This will already make it easier to detect discrepancies with respect to the pledges.
R.Pordes asked whether the end of August for the availability of this data is a requirement.
F.Donno replied that it is just an estimate of when the APEL plug-in could be available.
I.Bird concluded that, after the discussion at the GDB on the following day, a working group should clarify all open issues (installed vs. available resources, CPU vs. number of cores, etc.) and agree on the usage of all the fields. All sites should publish the same information in the correct places, and a clear document should describe all the details of the data formats, data flows and storage.
5. Proposal for CPU Benchmarking (Slides) - H.Meinhard
H.Meinhard presented the work of the HEPiX CPU Benchmarking working group and a proposal for a future CPU unit to use in the WLCG accounting, pledges, etc.
Since May all experiments have run a variety of benchmarks on the benchmarking nodes, and simple correlation studies were done. Some anomalies were found, in particular in ATLAS simulation; work is being done to understand how random numbers are used and the impact of the HW architecture on random number sequences.
A full, statistically consistent least-squares fit is in the pipeline, but not yet done because of other priorities of the people involved.
5.1 Benchmark Options and Environment
The working group is still at work and there are currently two options:
- Wait for a detailed understanding of the differences in random number usage, and for a full statistical least-squares fit to all data. With the summer period coming up, there will be no final result before autumn.
- Trust the reasonable correlations shown at HEPiX, and pick one of the SPECcpu2006 benchmarks. Not very scientific and not a bullet-proof solution, but the indications are that it is good enough, and it can be done now.
Choosing now implies the choices of:
- Benchmark to use (SPECint2006, SPECfp2006 or SPECall_cpp2006)
- Benchmarking environment (OS, compiler, compilation options and running mode)
The current benchmark options are:
- SPECint2006 (INT, 12 applications). Well established, published values available; the HEP applications are mostly integer calculations and the correlations with experiment applications were shown to be fine.
- SPECfp2006 (FP, 17 applications). Well established, published values available, and the correlations with experiment applications were shown to be fine.
- SPECall_cpp2006 (CPP, 7 applications: 3 from SPECint and 4 from SPECfp). Exactly as easy to run as SPECint2006 or SPECfp2006, but without published values (not necessarily a drawback). It takes about 6 h (SPECint2006 and SPECfp2006 take about 24 h) and does the best modelling of the FP contribution (10-15%) to HEP applications.
The table in the slides shows, for each benchmark, the geometric mean over that benchmark's applications, and compares it to SI2K.
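For reference, a SPEC suite score is the geometric mean, not the arithmetic mean, of the per-application scores; a minimal sketch with made-up numbers:

```python
import math

def geometric_mean(scores):
    """SPEC suite scores are the geometric mean of the per-application
    (normalized) scores, not the arithmetic mean."""
    return math.prod(scores) ** (1.0 / len(scores))

# Hypothetical per-application scores for a 7-application CPP-style run.
scores = [20.0, 22.0, 18.0, 25.0, 19.0, 21.0, 23.0]
print(round(geometric_mean(scores), 2))  # ≈ 21.03
```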
More meaningful are:
- Column 2: the ratio with respect to SPECint2K. More recent machines (2007 and 2008) show a higher discrepancy in the scaling factor with the 2K6 benchmarks.
- Column 3: the ratio between CPP and SPECint2K.
- Column 4: the ratio of CPP vs. SPECint2K6.
The benchmarking environment, as discussed and recommended by the benchmarking WG, is the following:
- SL4 x86_64
- System compiler gcc 3.4.6
- Flags as defined by the LCG Architects Forum: -O2 –fPIC –pthread –m32
- Multiple independent parallel runs, which is more similar to the usage in HEP (SPECrate instead starts synchronized on all cores).
There is a script available that takes care of all steps and prints a single number – it only requires SPECcpu2006 and a machine installed with SL4 x86_64.
The proposal is to:
- Adopt SPECall_cpp2006 and the environment presented.
- CERN will make the script available (it builds and runs the benchmark on as many cores as needed).
- Sites are to purchase a SPECcpu2006 licence.
- Call it HEP-SPEC or keep the SPECall_cpp2006 name.
Transition of existing capacities:
- Run the old benchmark the old way (whatever that was), and HEP-SPEC on samples of all HW configurations in use.
Transition of existing requests/commitments:
- These were based on some reference machine type; both benchmarks are to be run there.
J.Templon noted that not using the INT benchmarks will require additional explanations and calculations.
H.Meinhard replied that using the CPP benchmark will force the sites and vendors to actually run the benchmarks rather than take unrelated values from paper specifications. Sites ought to run the benchmarks anyway in order to provide realistic values. The CPP target is already available in the standard build of the benchmarks, therefore it is as easy as the INT target.
6. Conversion of Pledges and Exp Requirements to New Unit - I.Bird
I.Bird summarized to the MB the issues that were mentioned during the previous presentation.
Adoption of the CPP suite instead of INT and FP
J.Templon proposed to use the standard INT and FP values, which can be found without running any benchmark and without any explanation required. Statistically the difference between INT and CPP (table above) is only about 10-15%. L.Dell’Agnello supported this proposal.
H.Meinhard added that without actually running the benchmarks the difference can reach 30%, as happened for SPEC2K; this caused major problems in the past. Forcing the execution of the benchmark ensures that sites account for their resources correctly.
I.Fisk supported the execution of specific benchmarks because the HW changes internally (motherboard, etc.) and one would otherwise end up comparing resources across sites in a very incorrect manner (up to 30-40% off).
I.Bird stated that the benchmarks should be run in any case; the decision to introduce scaling factors, as was done for SPEC2K, caused major discrepancies among sites.
He asked for feedback from the Sites and Experiments on the proposed CPP benchmark.
- The sites, with mild reservations from SARA and CNAF, supported the proposal of the HEPiX working group.
- ALICE, CMS and LHCb prefer to have the benchmark executed at the sites.
- ATLAS noted that it is important to use the same system across all sites, whatever is chosen.
The MB agreed that the CPP benchmark suite should be used for the WLCG CPU resources.
J.Templon asked that the reference be RHEL4, with SL4 just seen as a derivative of it. SARA will not install SL4 just for the benchmarks.
H.Meinhard replied that RHEL4 and SL4 behave exactly in the same way.
The MB agreed that the reference platform for CPU benchmarking is RHEL 4 and its descendants (i.e. SL4).
Information by each WN
Ph.Charpentier asked that this information be available for each worker node, not as an estimated average.
I.Bird agreed and added that this will be discussed at the GDB as part of the normalization process within the Accounting information.
The MB agreed that the resource information should be at the WN level.
Recalculation of Experiment Pledges
This will have to be agreed so that it is done in the same manner by all 4 Experiments and the same normalization is applied.
7. Analysis of the MonALISA/ALICE Accounting (Slides, More Information) - H.Marten
H.Marten presented the comparison, by M.Alef, between the accounting done by ALICE with MonALISA and the accounting done with PBS at GridKa.
In slide 2 the difference for FZK is highlighted:
- According to ALICE, 365 kSI2K were delivered to ALICE.
- According to the WLCG Accounting, 674 kSI2K were delivered to ALICE.
7.1 Delivered vs. Consumed Resources
There is an important remark on the ”delivered versus consumed“ CPU time:
- The GridKa CPU resources were increased by a factor of about 3 for the 1st April 2008 milestone.
- Usage by the Experiments in April was much less than 100%.
- Some worker nodes (job slots) have always been idle.
Therefore ALICE consumed only 365 kSI2K of wall time, but this is not what GridKa delivered.
Y.Schutz noted that in some cases the Experiments cannot use all the resources because other components (e.g. middleware) are not working.
J.Templon replied that there are several cases in which no ALICE jobs were submitted even though the site was ready to receive them.
Ph.Charpentier agreed: LHCb does not submit jobs if it does not need to, and this is not a problem of the sites; therefore accounting should consider what is installed.
7.2 Comparison at FZK
No local MonALISA logging (at site level) is enabled by default. In order to check the log files and compare the MonALISA data with the GridKa PBS logs, local MonALISA logging (on a per-job basis) was enabled on the VOBox alice-fzk.gridka.de.
GridKa accounting is based on the log files of the local batch system (PBS Professional).
The comparison is based on all jobs which have finished on Saturday, 21st June 2008. All jobs which were executed on the sample worker node c01-108-117 have been checked in more detail.
From MonALISA, 6 jobs were selected and analysed, and 4 PBS jobs were found to match them:
One can see that the pilot jobs hide the jobs they execute.
In contrast to PBS, only the real workload is considered in the MonALISA accounting; the workloads account for about 90% of the whole job.
Slide 11 shows that MonALISA calculates the time in milliseconds:
- a run (wall clock) time of 13513.0 ms, which does not match the time between the first and last time stamps;
- a CPU time of 10911.0 ms.
The conclusion is that MonALISA conceals some overhead within the ALICE workload tasks, or the calculations need to be clarified further.
In addition, the kSI2K factors used in MonALISA differ from the official numbers used at GridKa, and even vary between job executions.
Slide 15 shows jobs that use wall time but no CPU and have no kSI2K calculated: they block the CPU without the resource being accounted for, not even as wall time.
Slide 17 shows a case where both CPU and wall time are accounted but no kSI2K, left unconsidered by MonALISA (~15%!).
As a result, for that same day the summary results were (see the information attached):
- the sum of the highest run_ksi2k entries from the MonALISA log files (column Q in Excel sheet ”MonALISA 20080621“);
- the sum of wall time (computed from the time stamps in the MonALISA files, multiplied by the right kSI2K factor – column R in that Excel sheet);
- the sum of the PBS accounting records (column K in Excel sheet ”PBS 20080621“).
In summary, there are several reasons for the differences between the accounting data computed by ALICE (MonALISA) and by GridKa, adding up to a total of ~50%:
- The GridKa accounting measures the whole time interval during which the worker node (job slot) is occupied by the (ALICE) job.
- -20%: ALICE jobs are pilots; in contrast to PBS, only the real workload is considered in the MonALISA accounting.
- -6%: The wall time of the workload tasks computed in MonALISA is less than the real wall time (the difference between the first and last timestamps in the log file).
- -7%: Wrong kSI2K factors are used.
- -4%: Many idle workload tasks (0 seconds of CPU usage) are probably not accounted by MonALISA.
- -15%: Some ALICE jobs do not report their kSI2K usage, although they have consumed CPU time.
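A quick arithmetic check of the figures listed above (treating the contributions as additive, as the minutes do):

```python
# Contributions to the MonALISA-vs-PBS accounting gap listed above (percent).
contributions = {
    "pilot overhead not in MonALISA": 20,
    "wall time shorter than timestamps": 6,
    "wrong kSI2K factors": 7,
    "idle workload tasks unaccounted": 4,
    "jobs not reporting kSI2K": 15,
}
total = sum(contributions.values())
print(total)  # 52, consistent with the ~50% overall difference

# Cross-check against the headline numbers: 365/674 ≈ 0.54,
# i.e. MonALISA reports roughly half of what GridKa accounted.
print(round(365 / 674, 2))  # 0.54
```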
Y.Schutz agreed that the difference is probably due to the different scaling factors applied (a factor of 1.5 makes the 50% difference).
H.Marten noted that the 1.5 factor is not applied to the jobs that were not accounted, only to those MonALISA did account for; therefore the issue is not really clarified.
I.Bird concluded that:
- the calculations in MonALISA should be fixed in order to account for all the CPU/wall-time resources used;
- at the same time, the normalization factors should be improved at the GDB.
8. Proposal for User Space Management (Slides) - B.Panzer
B.Panzer presented a proposal on how to support end-user analysis.
Below is the proposed high level dataflow. It is CERN-centric but can be generalized to other sites.
The list of requirements to be considered comes from the Experiments and from experience:
- Storage capacity of 1-2 TB per user (assume 1.5 TB) for Ntuples, data samples, log files, etc.
- Reliable storage (‘server-mirroring’); 99.9% availability = unavailable 4 times per year for 4 h each
- No tape access (too many small files)
- Some backup possibility (with archive?); backup = 5% changes per day (75 GB/day)
- Quota system
- Easy accessibility from batch and interactive worker nodes and from notebooks
- POSIX-type access, also via distributed file systems
- World-wide access
- High file access read/write performance
- User identity and security
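The backup figure in the requirements list follows from simple arithmetic; a small sketch using those numbers (the number of users is a hypothetical assumption, not from the slides):

```python
capacity_tb = 1.5      # assumed per-user storage, from the list above
change_rate = 0.05     # 5% of the data changes per day

daily_backup_gb = capacity_tb * 1000 * change_rate
print(daily_backup_gb)  # 75.0 GB/day, matching the figure quoted above

# Scaling to a hypothetical community of 1000 analysis users:
users = 1000
print(users * capacity_tb)             # 1500.0 TB of user space
print(users * daily_backup_gb / 1000)  # 75.0 TB/day of backup traffic
```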
A draft scenario with costs (but with no support included) could be the one below, with the user’s notebook at the centre of the analysis.
There are actually still very many questions to be answered:
Where are these ‘extra’ resources coming from?
Is there only one unique storage per user world-wide?
- What about users working on different sites?
- Do they have multiple end-user storage instances?
- How is data transferred between instances?
The difference between the ‘home-directory’ storage and end-user analysis space is small.
- Analysis tools/programs and the data must be accessed at the same time.
Who decides which user gets how much space where?
- Experiment specific policies
What is the data flow model?
- Notebook disk + site local file system + global file system
- Notebook disk + site local scratch + cloud storage
- Global file system only
OS support, virtual analysis infrastructure, network connectivity = data ‘gas station’
Is there a common interest to solve the problem at the scale of the WLCG?
K.Bos proposed to limit the scope at CERN. P.McBride agreed with the proposal.
Ph.Charpentier noted that each Tier-1 site was required to, and agreed to, pledge and install user space for each Experiment.
J.Templon asked for clear use cases of what the Experiments expect from the Sites for user analysis (e.g. on individual notebooks? from batch queues? etc.).
I.Bird proposed the formation of a small working group (1 person per Experiment + 1 FIO + 1 DM, led by B.Panzer) that will come up with a common proposal; the MB agreed.
9. AOB
9.1 US Tier-2 Accounting
J.Gordon noted that the Tier-2 accounting report for June does not include the US Tier-2 sites; it is not understood why the data seems to have arrived only in the last few days.
9.2 Report of the LHCC Mini Review
The report of the LHCC Mini Review is highly positive. I.Bird will attach the reviewers’ slides to the agenda.
Summary of New Actions