WLCG Management Board

Date/Time

Tuesday 10 March 2009 – F2F Meeting - 16:00-18:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=49394

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 13.3.2009)

Participants

A.Aimar (notes), D.Barberis, I.Bird, K.Bos, D.Britton, T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, X.Espinal, I.Fisk, Qin Gang, J.Gordon (chair), A.Heiss, F.Hernandez, S.Foffano, M.Kasemann, M.Lamanna, U.Marconi, P.Mato, A.Pace, B.Panzer, H.Renshall, Y.Schutz, J.Shiers, O.Smirnova, R.Tafirout, J.Templon

Invited

F.Donno

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 17 March 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters Arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were approved without comments.

1.2      LHC Schedule (more information)

As agreed I.Bird attached the email reporting the expected LHC colliding time for 2009 and 2010.

 

The Experiments’ representatives noted that the Experiments need to discuss this information; for the moment the Tier-1 sites should wait for further feedback from the Experiments.

1.3      LHCC Mini Review: Slides of the Reviewers (Slides)

For information, all presentations of the Reviewers at the LHCC Mini Review are attached to this agenda.

The main comments from the reviewers had been discussed in the previous MB Meeting (see Minutes, Section 4).

1.4      Meeting on Experiments' Requirements (31 March?)

During the F2F meeting in February (see Minutes) it was agreed that at the first MB Meeting after CHEP there would be the presentation of the Experiments’ Requirements for 2009 and 2010. The date proposed is 31 March.

 

Representatives from the Experiments noted that it will be difficult to have the information ready so early. Some will take their decisions only in early April, or will have only preliminary numbers.

J.Gordon noted that this is necessary in order to have the proposal ready for the RRB Meeting in mid-April. This was agreed with the spokespersons in February.

 

From the AOB section where the topic was further discussed:

 

I.Bird, who could not be present at the beginning of the meeting, asked about the decisions on the presentation of the Experiments’ Requirements on the 31st of March. The requirements must be ready by the RRB in April. If new values arrive before the end of March, the requirements will have to be modified accordingly.

 

The MB agreed that the discussion on the 31st will take place even if the values are not the final ones. The final values will be defined on 7 April, in the presence of the spokespeople.

 

1.5      Tier1 SAM Availability and Reliability Reports (200902.zip)

A.Aimar will collect comments and feedback from the people in charge of the VO SAM Tests and distribute them before the next MB meeting.

 

2.   Action List Review (List of actions)

 

 

  • SCAS Testing and Certification

 

There is a report at the GDB the following day (slides from the GDB).

 

  • VOBoxes SLAs:
    • Experiments should respond to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

 

Below is the latest assessment.

ATLAS and LHCb: All SLAs approved.

CMS: Several SLAs still to approve.

ALICE: Still to approve the SLA with NDGF. Comments sent to NL-T1.

 

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The best is to have the information in the form provided by the Data flows from the Experiments (example: Dataflow from LHCb).

 

The dataflow and rates will be discussed at the WLCG Workshop before CHEP.

 

3.   LCG Operations Weekly Report (Daily Meeting Minutes; Slides) - J.Shiers

 

Summary of the status and progress of the LCG Operations since the last MB meeting. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Major Service Incidents

The run of “no major service incidents” has been broken with several incidents in the last two weeks.

 

Site | When | What | Report?
CNAF | 21 Feb | Network outage | Promised…
ASGC | 25 Feb | Fire | E-mails 25/2 & 2/3
NL-T1 | 3 Mar | Cooling | E-mailed
CERN | 3 Mar | Human error | Provided by IT-FIO (Olof) (FIO wiki of service incidents)

 

Full recovery from the fire in Taipei will take a long time (up to 2 months); the recovery is under way. LFC is back and FTS will follow soon. An update will be presented at the GDB (http://indico.cern.ch/conferenceDisplay.py?confId=45473).

 

The CASTOR-related problems due to the network intervention at CERN also need further analysis: human mistakes are probably inevitable, but this outage was COMPLETELY avoidable. Several human errors caused this incident.

 

There is a growing, and too wide, disparity in the quality of the reports, both in the level of detail and in the delay in producing them. Some reports have even been pending for a long time. It had been agreed that they should be produced by the following MB meeting, even if some issues were still not fully understood.

 

Would adopting a template, such as that used by IT-FIO or GridPP, help? This will be discussed at the pre-CHEP workshop.

 

The MB is not satisfied with the current level of reporting.

3.2      CASTOR Switch Intervention

Here is the sequence of events and problems:

 

-       Announcements: at the IT “CCSR” meeting of 25 Feb, interventions on the private switches of the DBs for the CASTOR+ services were announced.
Oracle DB services: The firmware of the network switches in the LAN used for accessing the NAS filers, as well as the LAN used to implement the Oracle Cluster interconnect, will be upgraded. This intervention should be transparent for users since these LANs use a redundant switch configuration.

-       Only the intervention of 2 Mar was put on the IT service status board, and no EGEE broadcast was sent (“at risk” would have been appropriate). The intervention was nevertheless done on 4 Mar!

-       News regarding the problem and its eventual resolution was poorly handled: no update was made after 11:30 on 4 Mar, despite a “promise”. The only message was “We will update on the status and cause later today, sorry for inconvenience”.

-       The reports at the 4 Mar CCSR were inconsistent and incomplete: the service as seen by the users was down from around 9:45 for 3-4 hours.

-       At least some CASTOR daemons / components are not able to reconnect to the DB in case of problems. This is NOT CONSISTENT with the WLCG service standards agreed with the SE developers.
Slides 5 and 6 show the timeline as seen by the CASTOR team.

 

How much did this poorly managed intervention cost? Several days of IT staff time, and a much higher cost in the disruption of the users’ work and time.

 

A much larger network intervention will take place in the near future (postponed to 1st April?).

 

There will be an “Important Disruptive Network Intervention on March 18th”

-       06:00 – 08:00 Geneva time: This will entail a ~15min interruption, which will affect access to AFS, NICE, MAIL and all Databases which are hosted in the GPN among other services.
Next, the switches in the General Purpose Network that have not been previously upgraded will be upgraded, resulting in a ~10min interruption.
All services requiring access to services hosted in the Computer Centre will see interruptions.

-       08:00 – 12:00 Geneva time: The routers of the LCG network will be upgraded at 08:00 a.m., mainly affecting the Batch system and CASTOR services, including Grid related services.
The switches in the LCG network that have not been previously upgraded will be upgraded next.

3.3      ASGC Fire

Slide 8 shows the information that was distributed by the Site, which is really inadequate. No further report or timeline was provided to WLCG Operations.

 

I.Fisk reported that CMS had received quick and detailed information from ASGC (and from FZK for the issue discussed in the previous weeks). But clearly this does not seem to be the case for the reports to WLCG Operations.

J.Shiers agreed but noted that Sites had agreed to also provide quick and consistent summary information to Operations, because it is useful to the Services and to other Sites.

 

A.Heiss noted that the Site has several Experiments (plus many other VOs) to support and all have to be informed with the highest urgency. They cannot always participate in the WLCG meetings at 3PM.

J.Shiers replied that only important information and incidents must be reported, not the daily progress discussed with the VOs. Participation should be shared among different people so that it is easier to have someone present.

3.4      GGUS Summary

Below is the summary of the GGUS tickets received in the last 2 weeks.

 

 

The alarm tests were performed successfully against the Tier0 and Tier1s. There are still problems with the mail2SMS gateway at CERN (FITNR), and some VOs sent an “empty” alarm rather than a “sample scenario” as agreed. Should we re-test soon or wait 3 months for the next scheduled test?

 

Below are some examples of behaviours at the Sites. Usually the test alarms were answered adequately, in less than one hour.

 

 

LHCb (R.Santinelli) have also performed tests yesterday – interim results are available at  http://santinel.web.cern.ch/santinel/TestGGUS.pdf 

 

These results are still being analyzed and it is premature to draw concrete conclusions from them, but it would be interesting to understand why and how the ATLAS and CMS tests were globally successful whereas for LHCb at least some sites, and possibly also the infrastructure, gave some problems.

 

For this and other reasons J.Shiers suggested that we prepare carefully for another test to be executed and analyzed PRIOR to next month’s F2F/GDB.  The MB approved the proposal.

3.5      SAM VO Tests

Slides 12 to 14 show the results of the VO SAM tests in the last two weeks. This is a very useful visualization in which patterns across VOs and Sites are easy to identify (ASGC, CNAF and RAL seem to have problems).

 

L.Dell’Agnello and J.Gordon noted that the tests could have failed during some interventions at the Sites.

L.Dell’Agnello added that the OPN was working, while the general internet was down. This caused CNAF to fail some SAM tests.

M.Kasemann replied that this is correct because Experiments submit jobs via the internet; only data transfers use the OPN. Therefore the SAM job submission tests correctly failed.

 

4.   Feedback Update on "Busy Storage Services" (Slides) – F.Donno

 

Before the presentation F.Donno asked whether SRM V2 should become the default for all WLCG utilities (in particular GFAL and lcg-utils).

 

New Action:

Sites should report whether GFAL and lcg-utils can start using SRM V2 by default without impacting VOs outside WLCG.

4.1      “Busy” Storage Services Return Codes

F.Donno presented an update on the fact that storage services might become very slow or unresponsive when “busy”, i.e. running at the limit of some resources (high load).

 

The current Data Management clients do not correctly handle the situation where the storage service is “busy”. Some clients (FTS) may [abort and] immediately retry a failed request, making the busy status of the storage server even more severe. There was no agreed way to communicate the busy status of a server to a client.

 

F.Donno organized 2 phone conferences with the developers of storage services and data management clients, data management experts and some managers (OSG and EGEE):

-       CASTOR, dCache, DPM, StoRM, BeStMan

-       GFAL/lcg-utils, FTS, dCache srm* clients, StoRM+BeStMan clients

-       Experiments DM clients

 

The goal was to agree on a way to communicate to clients the “busy” status of a server and to advise clients on how to react in the presence of a “busy” server.

 

The discussion is available here https://twiki.cern.ch/twiki/bin/view/LCG/CCRC08SSWGStorageBusyMeeting090226 and the conclusions are:

 

-       The SRM server MUST return SRM_INTERNAL_ERROR when experiencing transient problems (caused by high load on some components). SRM_FAILURE can always be returned to fail a request.

-       If the request will be processed the SRM server SHOULD return an “estimatedWaitTime” for each file in the request to tell the client when the next polling SHOULD happen in order to have a new update on the status of each file.

-       If the client application receives an SRM_INTERNAL_ERROR from the SRM server, it MAY repeat the request. If it does, it SHOULD use a randomized exponential retry time. The algorithm used by Ethernet when congestion is detected was proposed as a good example of randomized exponential retry algorithm.
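The client-side behaviour agreed above can be sketched as follows. This is only an illustration in Python and not code from any WLCG client: `submit_with_backoff`, `send_request`, the slot length and the retry limits are hypothetical names and values chosen for the example. It follows the Ethernet-style truncated binary exponential backoff: after the n-th consecutive SRM_INTERNAL_ERROR the client waits a random number of slots, drawn uniformly between 0 and 2^min(n, cap) - 1, before retrying. SRM_FAILURE is treated as final and is never retried.

```python
import random
import time

SRM_SUCCESS = "SRM_SUCCESS"
SRM_INTERNAL_ERROR = "SRM_INTERNAL_ERROR"  # transient: the server is "busy"
SRM_FAILURE = "SRM_FAILURE"                # final: do not retry


def backoff_delay(attempt, slot_seconds=1.0, max_exponent=10, rng=random):
    """Delay in seconds before retry number `attempt` (1-based).

    A number of slots is drawn uniformly from [0, 2**k - 1], where
    k = min(attempt, max_exponent). Slot length and cap are assumptions.
    """
    k = min(attempt, max_exponent)
    return rng.randint(0, 2 ** k - 1) * slot_seconds


def submit_with_backoff(send_request, max_retries=8, sleep=time.sleep, rng=random):
    """Call `send_request()` until it stops returning SRM_INTERNAL_ERROR.

    `send_request` is a hypothetical callable returning an SRM status string.
    Returns (final_status, list_of_delays_used).
    """
    delays = []
    status = send_request()
    attempt = 0
    while status == SRM_INTERNAL_ERROR and attempt < max_retries:
        attempt += 1
        delay = backoff_delay(attempt, rng=rng)
        delays.append(delay)
        sleep(delay)  # wait a randomized, exponentially growing time
        status = send_request()
    return status, delays
```

The randomization matters as much as the exponential growth: if many clients hit the same busy server and all retried after the same fixed delay, they would simply overload it again at the same moment.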

 

J.Templon asked why the server does not return “service unavailable” or a specific clear “busy” error instead of an “internal error”.

F.Donno replied that the limitation comes from the constraint of not changing the current WSDL definition and of not creating new error codes, which would require client modifications. In addition, in the SRM specs the internal error is a valid code for asking for a retry.

 

Current official WLCG Data Management clients can already catch the SRM_INTERNAL_ERROR return code. However, they do not act in the optimal way since they might either abort the request or retry immediately.

 

Current storage services already send the SRM_INTERNAL_ERROR code, but only in very rare cases (failure of some internal component, e.g. the database). The return code is sent to the clients when it is already too late.

 

The solution suggested is backward-compatible, although it is advisable but not required for the “busy storage services”-aware clients to be deployed before “busy storage services”-enabled servers are deployed. Clients might otherwise fail more often than needed.

 

More information can be found on the Storage Solution Working Group twiki page:

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCommonComputingReadinessChallenges#Storage_Solution_Working_Group_S

Please check the section “Busy Storage Services”.

4.2      Status of the WLCG DM Clients

The official WLCG Data Management clients that need code changes are:

-       FTS 2.2 will not have any of the agreed changes, and the priority is now on the checksum checks. The coding for the issue under discussion will begin in April, and at least three months are expected before it is in production.

-       For GFAL/lcg-utils the implementation of the suggested changes will not start before mid March.

-       There is not yet a precise schedule for the dCache srm* clients.

-       StoRM clients perform only atomic operations at the moment (no retries). Therefore they do not need any changes. Retries will be implemented later on.

-       BeStMan clients will be implementing the required changes in the next two weeks.

 

There is no control over the VOs’ DM applications, but what can be offered is:

-       Reference implementations are being provided for all known use-cases to provide specific guidelines and examples.

-       Follow up with the experiments DM developers

-       Make GGUS supporters aware of this work

 

A first reference implementation for the pre-stage use case can be found here:

http://grid-deployment.web.cern.ch/grid-deployment/flavia/prestage_release_example.py

Reference implementations for all known use-cases can be ready in about 2 months.

4.3      Status of Storage Servers

The implementation of needed changes can start now for all storage services. Storage developers will make available a development test endpoint implementing the agreed behaviour.

 

A pre-release of the GFAL basic building block functions needs to be available in order to check the correct implementation of the storage server against them.

 

The reference implementations made available for the experiments will be used to test the pre-release of storage service (development test endpoint).

 

The S2 stress test suite will be used to make the servers “busy” by causing high load.

4.4      Deployment Plans

It seems feasible to have both WLCG official Data Management clients and storage services tested and ready for deployment by September 2009. This date is very close to LHC start-up. How should we proceed?

 

I.Bird noted that if the new version is available and if there is enough time, it should be deployed.

 

Ph.Charpentier asked why the issue is discussed at the MB instead of agreeing the dates with the DM developers.

F.Donno replied that this was done in the two meetings mentioned, and that it is reported at the MB in order to ensure that it is clear to all. She had also discussed it with the DM developers.

I.Bird added that the issue was raised at the MB a few weeks ago and the MB asked for a recommendation on how to proceed.

M.Kasemann noted that this should have been verified with the Experiments and the input of the DM developers should be taken into account.

 

New Action:

Experiments should confirm whether the schedule for the changes regarding “Busy” Storage Services is acceptable.

 

D.Barberis noted that the Experiments’ code should already deal with the current return code. If there is not so much development work, why is it scheduled for September?

F.Donno replied that this is because other, higher priorities have been defined (e.g. data checksum verification). If the MB changes the priorities, it can be done sooner.

I.Bird asked what the other issues are, in order to compare the current priorities.

 

5.   Plans for Virtualization and Multi-core in the Applications Area (Slides) – P.Mato

 

P.Mato presented the two R&D projects under development in the Applications Area.

5.1      Introduction

Two work-packages on physics data analysis, simulation and computing were retained from the R&D proposals in the White Paper under Theme 3:

 

-       WP8 - Parallelization of software frameworks to exploit multi-core processors

-       WP9 - Portable analysis environment using virtualization technology; it will be reviewed for continuation after 2 years.

 

Both work-packages started in January 2008 for a duration of 4 years with about 14 FTE-year Fellows over 3-4 years  

5.2      WP8 - Parallelization of Software Frameworks to exploit Multi-core Processors

The aim of the R&D project is to investigate novel software solutions to efficiently exploit the new multi-core architecture of modern computers in our HEP environment. The activity is divided into four “tracks”:

-       Technology Tracking & Tools

-       System and core-lib optimization

-       Framework Parallelization

-       Algorithm Parallelization

 

Collaboration has been established with the experiments, OpenLab, Geant4 and ROOT, with close interaction with the experiments (bi-weekly reports in the AF). Workshops are held every six months (first in April, second in October, next in spring 2009).

 

There was also a survey of HW and SW technologies. The targets are multi-core (8-16/box) in the short term and many-core (96+/box) in the near future. The goal is to optimize the use of the CPU/memory architecture, exploit modern OS and compiler features (copy-on-write, MPI, OpenMP) and also prototype solutions in the experiments and common projects (ROOT, Geant4) and in the R&D project itself.

 

The HEP code does not exploit the power of current processors: One instruction per cycle at best, little or no use of SIMD, poor code locality, abuse of the heap.

 

Running N jobs on N=8 cores is still efficient, but memory (and to a lesser extent CPU cycles) is wasted by not sharing “static” conditions and geometry data, I/O buffers, and network and disk resources; caches (memory on the CPU chip) are wasted and thrashed.

This situation is already bad today and will only become worse on future architectures (either multi-thread or multi-process).

 

The objective is to investigate solutions to parallelize current LHC physics software at application framework level and to identify reusable design patterns and implementation technologies to achieve parallelization and produce prototypes.

For instance, the reconstruction memory footprint shows large conditions data that could be shared.

 

Modern operating systems share read-only pages among processes dynamically: a memory page is copied and made private to a process only when modified. The prototype in ATLAS and LHCb (the latter using WP8 personnel) is giving encouraging results as far as memory sharing is concerned (50% shared), but there are concerns about I/O (the need to merge output from multiple processes).

 

Memory (ATLAS)

One process: 700MB VMem and  420MB RSS

(before) evt 0: private: 004 MB | shared: 310 MB

(before) evt 1: private: 235 MB | shared: 265 MB

. . .

(before) evt50: private: 250 MB | shared: 263 MB

                                                                     

One can see that about 50% of the memory can be shared.
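The 50% figure can be checked directly from the numbers quoted above. The short Python calculation below is only a worked example; it takes the shared fraction as shared / (private + shared) and leaves out evt 0, where almost no private memory has been allocated yet.

```python
# (private MB, shared MB) per event, copied from the ATLAS figures above
samples = {"evt 1": (235, 265), "evt 50": (250, 263)}


def shared_fraction(private_mb, shared_mb):
    """Fraction of the process memory that sits in shared pages."""
    return shared_mb / (private_mb + shared_mb)


fractions = {evt: shared_fraction(p, s) for evt, (p, s) in samples.items()}
for evt, fraction in fractions.items():
    print(f"{evt}: {fraction:.0%} shared")
# prints:
# evt 1: 53% shared
# evt 50: 51% shared
```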

 

Another method is using KSM.

KSM is a Linux driver that allows dynamically sharing identical memory pages between one or more processes. It has been developed as a backend of KVM to help memory sharing between virtual machines on the same host.

 

KSM scans only the memory that was registered with it. A test was performed “retrofitting” TCMalloc with KSM: just one single line of code added to your application!
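As an illustration of what such a registration can look like: on current Linux kernels, memory is registered with KSM by calling madvise() with the MADV_MERGEABLE flag. These minutes do not state which interface was used in the tests reported here, and it may have differed, so the sketch below is an assumption and not the code that was added to TCMalloc. It uses Python 3.8+ (`mmap.madvise`); `register_with_ksm` is a hypothetical helper name.

```python
import mmap


def register_with_ksm(size_bytes):
    """Allocate private anonymous memory and ask KSM to scan it.

    Returns (buffer, registered). `registered` is False when the platform
    or the kernel does not offer MADV_MERGEABLE.
    """
    if not hasattr(mmap, "MADV_MERGEABLE"):
        # Not Linux, or Python older than 3.8: fall back to ordinary memory
        return bytearray(size_bytes), False
    # KSM only merges private anonymous pages, hence MAP_PRIVATE
    buf = mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf.madvise(mmap.MADV_MERGEABLE)  # the single registration call
        return buf, True
    except OSError:
        # Kernel built without KSM support
        return buf, False


buf, registered = register_with_ksm(16 * mmap.PAGESIZE)
buf[:5] = b"hello"  # the buffer is used like any other memory
```

Registration alone does not merge anything: the KSM daemon must also be running on the host (on current kernels it is enabled by writing 1 to /sys/kernel/mm/ksm/run).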

 

CMS reconstruction of real data (Cosmics). No code change required. 400MB private data; 250MB shared data; 130MB shared code. Similar results with ATLAS applications.
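To give a feeling for what these figures mean for a whole node, the following back-of-the-envelope Python calculation estimates the memory used by 8 such processes on one box. It is an idealized sketch under stated assumptions, not a measurement: the shareable data is assumed to be either fully duplicated or perfectly merged, and the code pages are counted once in both cases, on the assumption that file-backed code is shared by the OS anyway.

```python
# CMS figures quoted above, in MB
PRIVATE_DATA_MB = 400
SHARED_DATA_MB = 250
SHARED_CODE_MB = 130


def node_memory_mb(n_procs, merge_shared_data):
    """Idealized memory use of n_procs identical processes on one host."""
    if merge_shared_data:
        data = n_procs * PRIVATE_DATA_MB + SHARED_DATA_MB
    else:
        data = n_procs * (PRIVATE_DATA_MB + SHARED_DATA_MB)
    return data + SHARED_CODE_MB  # code pages counted once (assumption)


without_merging = node_memory_mb(8, merge_shared_data=False)  # 5330 MB
with_merging = node_memory_mb(8, merge_shared_data=True)      # 3580 MB
saved = without_merging - with_merging                        # 1750 MB
```

Under these assumptions the saving is 7 x 250 MB, i.e. one copy of the shareable data for every process beyond the first.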

 

The method is being tried with PROOF and Geant4:

-       PROOF Lite is a realization of PROOF in 2 tiers. The client starts and directly controls the workers. Communication goes via UNIX sockets and there is no need for daemons.

-       Multi-threaded Geant4: event-level parallelism, with separate events on different threads, and work to increase the sharing of memory between threads.

 

WP8 is also working on event-level parallelism: separating events on different threads and working to increase the sharing of memory between threads.

 

The future work is:

-       Release production-grade parallel application at event level. Exploit copy-on-write (COW) in multi-processing (MP), develop affordable solution for sharing of the output file and leverage G4 experience to explore multi-thread (MT) solutions

-       Continue optimization of memory hierarchy usage

-       Expand Minuit experience to other areas of “final” data analysis

-       Explore new Frontier of parallel computing. Scaling to many-core processors (96-core processors foreseen for next year) will require innovative solutions

5.3      WP9 - Portable Analysis Environment using Virtualization Technology

The goal is to provide a complete, portable and easy to configure user environment for developing and running LHC data analysis locally and on the Grid independent of physical software and hardware platform (Linux, Windows, MacOS):

-       Decouple application lifecycle from evolution of system infrastructure

-       Reduce effort to install, maintain and keep up to date the experiment software

-       Lower the cost of software development by reducing the number of compiler-platform combinations

 

The key building blocks are:

-       rPath Linux 1 (www.rpath.org)

       Slim Linux OS binary compatible with RH/SLC4

-       rAA - rPath Linux Appliance Agent

       Web user interface

       XMLRPC API

-       rBuilder

       A tool to build VM images for various virtualization platforms

-       CVMFS - CernVM file system

       Read-only file system optimized for software distribution, with aggressive caching

       Operational in offline mode, for as long as you stay within the cache

 

One can produce many variations of the VM for the different VOs. The CernVM File System (CVMFS) is derived from Parrot (http://www.cctools.org) and its GROW-FS code base, and adapted to run as a FUSE kernel module.

 

Experiments publish new releases themselves, and the installation is done in a dedicated virtual machine, which then synchronizes with the Web Server. It should all be transparent to CernVM end-users, i.e. new versions appear in the ‘local’ file system.

 

The final goal is to make a complete Data Analysis environment available for each Experiment, providing:

-       Code check-out, editing, compilation, local small test, debugging, …

-       Castor data files access, Grid submission, …

-       Event displays, interactive data analysis, …

 

The release is available now for download from http://cern.ch/cernvm/?page=Release1.01

It can be run on:

-       Linux (KVM, Xen, VMware Player, VirtualBox)

-       Windows(VMware Player, VirtualBox) 

-       Mac (Fusion, Parallels, VirtualBox)

Release Notes are here http://cern.ch/cernvm/?page=ReleaseNotes  and the How-to is here http://cern.ch/cernvm/?page=HowTo

 

The appliance can be configured and used with ALICE, LHCb, ATLAS (and CMS) software frameworks

 

The next steps are:

-       Remove single point of failure, develop and test a Content Delivery Network

-       Migrate  CernVM to rPath Linux 2 (SLC5 compatible)

-       Migration of our pilot services on IT hosted resources

-       Investigate CernVM as job hosting environment

       Voluntary computing such as BOINC

       Explore the use of dedicated virtual facilities

 

There are changes needed in order to use the systems on the current Grids:

-       Running multi-thread or multi-process applications will be efficient

       Memory footprint, I/O optimization, etc.

       Scaling-down the total # of jobs to be managed

-       Batch systems and ‘Grids’ need to be adapted for multi-core jobs

       A user should be able to submit jobs using the 8 cores or more available in a box. The scheduling should be straight-forward without having to wait for resources to be available

 

A discussion followed on how it is possible to launch 8 jobs on the same CPU in order to use the 8 cores efficiently. And how do batch systems know when jobs are of the same kind? It was agreed that the issue needs to be investigated.

 

The CernVM platform is being used by Physicists to develop/test/debug data analysis. Ideally the same environment should be used to execute their ‘jobs’ in the Grid with features like:

-       Experiment software validation requires large datasets

-       Software installation ‘on-demand’

-       Decoupling application software from system software

 

How can the existing ‘Grid’ be adapted to CernVM?

-       Virtual Machine submission to the worker nodes?

-       Building a ‘virtual’ Grid on top of the ‘physical’ Grid?

 

J.Gordon asked how this solution fares security-wise.

I.Bird replied that the current services are also verified, but not everything that the applications can do is under control.

 

J.Templon asked what limitations apply on the Amazon EC2 service.

P.Mato replied that rPath can generate images for Amazon EC2, although there may be limitations. If Amazon accepts VMs, the Grid could also accept jobs that do not access the local infrastructure but have a specific task to execute.

 

J.Templon noted that if the Sites already use VMs for their worker nodes then CernVM cannot run inside them. VMs cannot run inside other VMs.

P.Mato replied that one should standardize on the VM hypervisor so that the Experiments use images for that hypervisor.

 

T.Cass noted that one should also verify how, for instance, AFS clients can be included in a virtual machine.

 

I.Bird proposed a discussion on how jobs from a single VO can be grouped on the same multi-core CPUs.

 

New Action:

The MB should organize a discussion about the work needed to have a working prototype at CERN that runs with KSM and schedules jobs from the same VO on the same host.

 

6.   EGI Workshop Summary (Slides) – I.Bird

 

The material attached was not presented.

If there are questions they can be asked at the next MB meeting (or via email).

 

7.   High Level Milestones (HLM_20090310.pdf) – A.Aimar

 

 

Postponed to next week.

 

 

8.   AOB

 

 

I.Bird, who could not be present at the beginning of the meeting, asked about the decisions on the presentation of the Experiments’ Requirements on the 31st of March. The requirements must be ready by the RRB in April. If new values arrive before the end of March, the requirements will have to be modified accordingly.

 

The MB agreed that the discussion on the 31st will take place even if the values are not the final ones. The final values will be defined on 7 April, in the presence of the spokespeople.

 

 

 

9.   Summary of New Actions

 

 

New Action:

Sites should report whether GFAL and lcg-utils can start using SRM V2 by default without impacting VOs outside WLCG.

 

New Action:

Experiments should confirm whether the schedule for the changes regarding “Busy” Storage Services is acceptable.

 

New Action:

The MB should organize a discussion about the work needed to have a working prototype at CERN that runs with KSM and schedules jobs from the same VO on the same host.