WLCG Management Board

Date/Time

Tuesday 17 February 2009 – Phone Meeting - 16:00-17:00

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=49391

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 21.2.2009)

Participants

A.Aimar (notes), D.Barberis, I.Bird (chair), T.Cass, Ph.Charpentier, L.Dell’Agnello, M.Ernst, Qin Gang, M.Kasemann, M.Lamanna, G.Merino, A.Pace, R.Pordes, Y.Schutz, J.Shiers, R.Tafirout

Invited

---

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 24 February 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters Arising (Minutes)

 

1.1      Minutes of Previous Meeting

The minutes of the previous meeting were modified following the emails of D.Barberis and J.Gordon. They now state that ATLAS had visited its Tier-1 Sites rather than holding a workshop.

1.2      Conclusions on GDB and MB schedule

The GDB schedule will not change, but the MB could instead be held every two weeks.

Because of the limited attendance at this MB meeting, the subject will be discussed at the next F2F meeting.

 

I.Bird noted that if the MB meets less frequently it should always be fully attended and should take longer (probably 2 hours). In addition, the GDB agendas should be agreed at the preceding MB (i.e. 2 weeks before).

1.3      T1 Reliability Reports (all VOs) - January 2009 (T1-SAM-Reports-200901.zip; VO Feedback)

The SAM Reliability Reports for the VOs, with comments, are available (T1-SAM-Reports-200901.zip; VO Feedback).

 

I.Bird noted that some OSG sites were reporting their test results incorrectly. The OSG results will now be reported correctly.

1.4      Preparation of Overview Board meeting

On Monday 23 February 2009 there will be an Overview Board meeting. It will focus on the new LHC schedule, the planning of WLCG, the presentation of the EGI Blueprint, and the preparation of the necessary EGEE transition. The status report will be similar to the quarterly report.

1.5      2008 Resources at ASGC and NDGF

I.Bird asked for clarification about the CPU capacity for 2008 that has still not been installed.

-       ASGC: Qin Gang replied that the installations, delayed because of problems, are in progress. The data in the accounting report is correct.

-       NDGF: Seems to be missing 200 TB, but NDGF was not represented at the MB meeting.

 

2.   Action List Review (List of actions)

 

 

  • SCAS Testing and Certification

17.2.2009 - M.Schulz reported that analysis of the process's memory occupation shows that there is still a small memory leak, and errors recur every few minutes. The response time is very variable, sometimes up to 10 seconds. NIKHEF should provide a new release soon, but the main developer will be absent for a few weeks.

 

Therefore the software is not ready for production installation, but it could be deployed as a pilot at some Sites. Sites willing to host such pilot installations are needed.

 

  • VOBoxes SLAs:
    • Experiments should respond to the VOBoxes SLAs at CERN (all 4) and at IN2P3 (CMS).
    • NL-T1 and NDGF should complete their VOBoxes SLAs and send them to the Experiments for approval.

CERN: Done for ALICE, ATLAS, LHCb. CMS still needs to agree on the SLA document.

NL-T1: J.Templon reported that the NL-T1 SLA has been sent to the Experiments for review and approval. ATLAS approved the SLA.

NDGF: O.Smirnova reported that NDGF has sent their SLA proposal to ALICE and is waiting for a reply.

IN2P3: Waiting for CMS.

 

  • 16 Dec 2008 - Sites requested clarification on the data flows and rates from the Experiments. The best is to have the information in the form of the data-flow descriptions provided by the Experiments (Dataflow from LHCb).

Not done yet. I.Bird proposed that someone be appointed to complete this action, using a template (G.Merino?).

J.Shiers reported that the CCRC 2007 document is the most recent version and clarifies what information the Sites need. The document will be updated for the WLCG Workshop.

 

  • 17 Feb 2009 - R.Pordes agreed to provide, within 2 weeks, the milestones for OSG reporting installed capacity into APEL.

Not done; it will be done on 24 February.

 

3.   LCG Operations Weekly Report (Daily minutes; Slides) - J.Shiers

 

Summary of the status and progress of LCG Operations since the last MB meeting. The daily meeting summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

No “major” incidents last week. Just the usual background of problems that were addressed as they arose – see minutes from daily phone meetings (Daily minutes).

 

As mentioned at yesterday’s LHCC mini-review, it would be nice to include some additional “key performance indicators” – such as:

-       Summary of (un)scheduled interventions (including overruns) at main sites,

-       Summary of sites “suspended” by VOs. Do sites know that they have been suspended?

-       Production / analysis summaries (e.g. “VOviews”)

 

Proposal from CMS:

“I (Daniele Bonacorsi) have been filling - on behalf of CMS, and just for the WLCG Ops daily calls of ours, now since 2 weeks - the twiki: https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports It seems to me it works, both as a reference for discussion, and for your minutes. If so, and if you agree, I propose I keep this habit of mine for the future.”

 

This is very useful, and it would be good if it could be adopted for the reports from the other Experiments. This would also facilitate reporting to other meetings.

 

The proposal is that the summary should be written beforehand and only briefly mentioned at the daily meeting.

3.2      GGUS Tickets

Only one alarm ticket was received, a test from CMS (Daniele Bonacorsi to CNAF):

“To be sure that a problem I had with GGUS alarm to CNAF is now solved, please anybody at CNAF receiving this 1) be aware it's a TEST and not a problem report, and 2) just CLOSE IT and mail me any details. Regards, DanieleB (CMS)”

 

VO concerned | USER | TEAM | ALARM | TOTAL
ALICE        |    3 |    0 |     0 |     3
ATLAS        |   20 |    2 |     0 |    22
CMS          |    8 |    0 |     1 |     9
LHCb         |    3 |    3 |     0 |     6
Totals       |   34 |    5 |     1 |    40

 

Ph.Charpentier stated that LHCb had submitted more than 3 tickets; possibly some submitters did not specify the correct VO.

3.3      Experiment Specific Issues

 

Experiment | Issue
ALICE      | On-going WMS issues still being debugged; they seriously impacted the experiment’s production (next steps below)
ATLAS      | Some issues related to scheduling / communication of the cleaning of the PNFS database at FZK: now completed
CMS        | Several issues reported, but promptly followed up by experts / site contacts
LHCb       | Some issues related to low numbers of running batch jobs – on-going reconfiguration and investigation (believed to be related to implementing the pilot role at CERN, which gave problems with the LSF shares – now reported as fixed)

 

ALICE explicitly asked to report that the WMS issue is still unsolved:

  1. Two new WMS with the latest 4.3 version are being set up at CERN and will be deployed for ALICE use only. They will be put in production alongside the current ones, so that the experts can stop them, drain them, or perform any other operation in a way that is totally transparent for ALICE.
  2. In addition, the egee-rb-09 WMS is being put in production at CNAF. It also contains some fixes for ALICE, for example the drain flag: this puts the WMS in drain mode as soon as the number of input requests becomes impossible to manage.
  3. The CNAF procedure has been sent to the WMS experts at CERN so that the same procedure can be followed there, but it seems it is not yet in production.
  4. ALICE hopes to gain enough familiarity with these procedures to provide feedback to both the developers and the site admins.

 

Slide 7 shows an example of a KPI report (with fake data). Such a report could cover all interventions:

-       Scheduled

-       Overran

-       Unscheduled

-       Hours Scheduled

-       Hours Unscheduled

3.4      Information to Off-line Sites

Slide 8 shows an example of an Experiment putting a Site off-line: NL-T1 was put offline for 8 hours by ATLAS. Slide 9 shows ASGC's unavailability for CMS during half of the week. It is not certain that the Site is notified in a standard way.

 

If the graphs shown in slides 8 and 9 could be prepared for all VOs, they would give useful information about the Sites' KPIs.

 

Qin Gang noted that, despite ASGC's low values in the graph on slide 9, no CMS jobs were lost at ASGC; only the responses to the tests were incorrectly configured at ASGC.

3.5      Workshop News

About 220 people had registered by the end of last week, including 20 for the workshop only. The numbers in Victoria and Mumbai were a little lower – 180 people on both occasions by the time of the event.

 

The agenda is now rather full – speakers should aim to leave at least 30% (or more) time for questions and discussions. The talks should be oriented towards operations / service delivery and not just status reports…

 

4.   User Analysis Working Group: Status and Progress (Slides) - M.Schulz

 

M.Schulz presented a summary of the feedback received about the Experiments' expectations for their User Analysis at CERN.

4.1      Background Information

All Experiments provided their feedback by mid-December 2008, in different formats.

 

ALICE by document and links

-       Very independent of the underlying infrastructure

-       Shares, allocation and access are controlled by the Alien framework

-       Production-like analysis with the “Train” concept

-       End-user analysis is mostly via non-grid means (“PROOF”) and via Alien task queues; it is not clear which will be the preferred way.

 

LHCb by mail and links

-       Not a very detailed description at the moment

-       End-user work requires very little storage (a few GB, i.e. on the desktop scale)

 

CMS by a document with concrete requirements

-       Missing functionality

-       Storage requirements

-       Split between production and analysis use of CPU

-       But does not provide the number of users or the amount of concurrent usage

 

ATLAS provided a set of links

-       Two ways to access the resources.

-       Very independent of the underlying infrastructure.

4.2      General Comments

The analysis models of the experiments depend to a large degree on their own grid systems that are layered on top of the provided infrastructure.

 

-       ALICE is in this respect very independent and flexible and has simple requirements, since most complex aspects are handled in the Alien layer.

-       The models look similar at a high level, but the different implementations impact the infrastructure in different ways (calibration data access, data access control).

-       In addition, the larger experiments use several different systems (pilot- and WMS-based submissions). This makes it difficult to give universally applicable advice to sites and to have a generic set of requirements for services.

 

All experiments have exercised their frameworks. It is not clear how close this is to the activity level expected when there is beam.

 

For data access there is no clear metric to measure a Tier-2's capability (data size, number of file accesses, or number of parallel active users is too simple).

 

Each system impacts the fabric in a different way. The internal structure of the files has been shown to have a large impact on the I/O efficiency (this is still under investigation).

 

It might be instructive to take a look at some of CMS's requests and the observed issues.

4.3      CMS (as example)

CMS has a strong concept of locality:

-       Each physicist is assigned to one or more Tier-2 Sites.

-       Transfers are done with the experiment's tools only, after registering the files in a global catalogue.

-       Jobs are sent by CRAB to the data.

-       In addition official working groups manage their own data.

 

However, this locality cannot currently be enforced by the underlying infrastructure; it is enforced indirectly by the experiment's tools.

 

The CMS T2 storage is as follows (a worked size example follows the list):

-       >20 TB temporary space for production, controlled by the prod team

-       30 TB space for centrally managed official data sets

-       N × 30 TB for each official physics group. Each group can have multiple sites; each site can host multiple groups

 

-       Regional (local) user spaces are managed regionally.

-       SRM-based stage-out space (future)

-       Some T3 space at the site, probably for off-grid analysis
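As a worked example, using only the numbers above (the group count is hypothetical): a Tier-2 hosting two official physics groups would provide at least 20 + 30 + 2 × 30 = 110 TB, plus the regionally managed user, stage-out and T3 space.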

 

Complex ACLs and quotas are required. VOMS can express this by using groups for locality and roles below that level, combined with complex mapping at the fabric level; a sketch follows.
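As an illustration only, such a scheme could be written in an LCMAPS-style group-mapfile; the FQANs, the site group name and the local accounts below are hypothetical, not taken from any real CMS configuration:

    # production role inside a hypothetical locality group
    "/cms/T2_XX_Example/Role=production" cmsprod
    # a physics group below the locality group, mapped to pool accounts
    "/cms/T2_XX_Example/higgs" .cmshiggs
    # ordinary regional users of this locality
    "/cms/T2_XX_Example" .cmsusr

The fabric-level quotas and ACLs then have to be attached to the mapped local accounts, which is the complex mapping referred to above.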

 

The CMS T2 CPU allocation is as follows (a configuration sketch follows the list):

-       A 50/50 % share between production and all other activities, with a timeframe for equalization of shares under load of less than a week.

-       Split and prioritization at the granularity of analysis groups, taking locality into account.
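Since LSF shares are mentioned elsewhere in these minutes, a minimal sketch of how such a 50/50 split could be expressed as an LSF queue fairshare; the queue name and the group accounts cmsprod and cmsana are hypothetical:

    Begin Queue
    QUEUE_NAME = cms
    # equal shares: under load the scheduler equalizes production
    # vs. analysis usage over its fairshare window
    FAIRSHARE  = USER_SHARES[[cmsprod, 50] [cmsana, 50]]
    End Queue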

 

Currently VOMS is used only for the production/analysis distinction and, at some places, to express locality.

The full implementation with the current infrastructure would be very cumbersome and would put a significant load on the information system.

 

Prioritization of individual users based on recent usage requires fixed mappings between users and local IDs to work, but it is a very complex solution.

 

The issues reported by CMS do not concern analysis only:

-       Storage systems reliability

-       Improved SRM APIs and tools (bulk operations are missing)

-       Quotas, accounting and ACLs for space

-       SE and CE scalability. SRM operations. CE high load figures

-       Lack of support for multi-user pilot jobs (parallel to the push model): SCAS/gLexec soon, GUMS/gLexec in use

-       Batch scheduler configuration appears not to be in line with agreements

-       lcgadmin mapping

-       Except for the more diverse resource allocation requests, the reported issues apply as much to analysis as they do to production tasks.

4.4      Summary

On an abstract level the different analysis models look similar. But the tools used are very different and, by the way they are individually layered on the infrastructure, they create different requirements and constraints (job flow, data transport and access, calibration data access).

 

The analysis-specific problems encountered are mostly specific to the way the infrastructure is used (e.g. data access and catalogue operations); rollout experience supports this observation.

 

It is very difficult to estimate the true scale of activity after LHC start-up in terms of users, “grid users”, etc. There is a shift from push to pilot jobs which will make a huge difference for job management, but it is not a complete transition.

 

There is a similar move for data access (xrootd). For resource configuration issues, direct communication between the Tier-2 Sites and their experiment seems to be more efficient. The common problem domain is storage-specific, and solutions depend on the SE implementation.

 

I.Bird asked how much this proposal had been discussed with the Experiments.

M.Schulz replied that the information came from the documents and web links received.

 

I.Bird asked whether there is also a need for a throttling system to control the load on a given SRM.

Ph.Charpentier confirmed that Sites should have these controls, otherwise they are overloaded.

 

It was asked whether bulk operations will be implemented in the next release.

A.Pace replied that these will not be available in the upcoming release. The problems with SRM are the new error reports from the servers, which can crash the clients.

 

M.Schulz added that authorization and authentication will add a major load on the system, in order to check the permissions of every fine-grained operation.

 

I.Bird replied that with pilot jobs this will not be a problem: if CMS also moves to pilot jobs, there is no need for fine-grained ACLs because this is done by the VO.

M.Kasemann replied that CMS has not yet decided for analysis; for production CMS will stay with the WMS. For some time both WMS and pilot-job methods will be in use.

 

I.Bird suggested that the issues be listed and checked with all the other VOs, in particular regarding the SRM APIs and tools.

M.Lamanna asked that, before the list is sent to the SRM developers, it be discussed and prioritized by the Analysis WG.

 

New Action:

4 March 2009 - M.Schulz to present the list of priorities for the Analysis working group.

 

5.   Feedback from the LHCC Mini Review - I.Bird

 

 

No feedback yet. The presentation at the closed session is on Thursday.

 

6.   AOB

 

 

No AOB.

 

7.   Summary of New Actions

 

 

New Action:

4 March 2009 - M.Schulz to present the list of priorities for the Analysis working group.