LCG Management Board

Date/Time:

Tuesday 3 April 2007 - 16:00-18:00 – F2F Meeting in Prague

Agenda:

http://indico.cern.ch/conferenceDisplay.py?confId=11632

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 8.4.2007)

Participants:

A.Aimar (notes), D.Barberis, I.Bird, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, T.Doyle, M.Ernst, I.Fisk, D.Foster, J.Gordon, C.Grandi, F.Hernandez, M.Kasemann, J.Knobloch, M.Lamanna, E.Laure, S.Lin, H.Marten, P.McBride, G.Merino, L.Robertson (chair), O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive:

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting:

Tuesday 10 April 2007 – 16:00-17:00

1.      Minutes and Matters arising (minutes)  

 

1.1         Minutes of Previous Meeting

M.Ernst had a comment about the minutes of the previous meeting, regarding the information on "Site Reliability Reports” for BNL.

As stated in the table on "Operational Issues", the SAM tests have been failing for BNL since 19 February because of problems with the gLite middleware (in particular with the CE), after the CE farm was upgraded to SL4 (and a new Condor version).

 

The LCG/gLite CE is not used in the US ATLAS operations model; US ATLAS uses an OSG CE instead. He noted that the LCG/gLite CE is still probed by SAM, but it has not been installed since the upgrade to SL4 and therefore the test fails.

 

The minutes have been modified. M.Ernst also asked for the test on the LCG/gLite CE to be removed from the list of SAM tests executed for BNL (see M.Ernst's complete comment (NICE login required)).

 

I.Bird replied that OSG had agreed to develop tests equivalent to the SAM tests in order to check the OSG CE. The SAM tests should therefore not be disabled at BNL but replaced by equivalent OSG tests.

M.Ernst agreed that having OSG tests is the desired final solution, but added that continuing to check the (un)availability of the LCG CE does not represent the real availability of BNL. The issue is being discussed with J.Shiers and J.Casey in order to find a suitable temporary solution.

 

OSG was not represented at the meeting (excused), but it would be good to have a report on the status of the OSG SAM tests in the near future.

 

Issue to follow:

OSG should report on the status of the equivalent tests for the SAM framework for the OSG sites and services.

1.2         Matters Arising

2007Q1 QR reports (QR_2007Q1.zip)

Distributed on 2 April 2007; they should be completed and sent back to A.Aimar by Thursday 12 April 2007.

 

 

2.      Action List Review (list of actions)  

Actions that are late are highlighted in RED.

 

  • 15 Mar 2007 - CMS and LHCb should send their requirements up to 2011 to C.Eck.

Not done. The ALICE values were sent to C.Eck. Still waiting for CMS (D.Newbold) and LHCb (N.Brook).

 

  • 3 Apr 2007 - A.Aimar will update the 2007 targets for all four LHC experiments and distribute them to the MB.

Not done.

 

 

3.      SL4 Migration Status - I.Bird

 

 

I.Bird provided a summary of the migration to SL4 of the gLite middleware.

A more detailed report will be presented on the following day by L.Field, at the GDB meeting.

 

The SL4 native build is still in progress:

-          The WN is installed on the Pre-Production Service (PPS), in 32-bit mode. It should be available for deployment in a couple of weeks.

-          The UI is also being built and should be available on the PPS in a couple of weeks, and a few weeks later for deployment at the sites. The old EDG submission clients did not work after the port; fixing them is additional work that was not originally foreseen by the UI developers.

 

Almost all of the gLite code (about 99%) is building. Installation, packaging and configuration are being completed and will need to be tested on the PPS before deployment at the sites.

 

All the components mentioned above are in 32-bit mode. For the 64-bit mode, M.Schultz and L.Field presented a proposal to the GDB in March on how to deploy the 64-bit version and how to specify the correct versions of Python and Perl (for sites supporting both 32-bit and 64-bit platforms).

 

J.Templon warned that using PATH variables to select the versions of Python and Perl could be dangerous for the experiments' set-up. I.Bird agreed and asked the experiments to evaluate and reply to the current proposal.
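Purely as an illustration of the kind of environment manipulation under discussion (the actual proposal is not reproduced in these minutes), a job wrapper might prepend an architecture-specific directory so that "python" and "perl" resolve to the matching build; all directory names in the sketch below are invented. It also shows why the warning matters: anything the experiment's own set-up places earlier in PATH would be shadowed by the prepended directory.

import os
import platform

# Invented locations for the sketch: one directory per architecture,
# each holding the matching python/perl builds.
ARCH_DIRS = {
    "x86_64": "/opt/glite/bin-x86_64",   # assumed 64-bit location
    "i686":   "/opt/glite/bin-i386",     # assumed 32-bit location
}

def grid_environment():
    """Return a copy of the environment with the interpreter directory prepended."""
    env = dict(os.environ)
    arch_dir = ARCH_DIRS.get(platform.machine(), ARCH_DIRS["i686"])
    # This is the step that can interfere with the experiments' own set-up:
    # whatever they put earlier in PATH is now shadowed for python/perl.
    env["PATH"] = arch_dir + os.pathsep + env.get("PATH", "")
    return env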

 

Issue to follow:

Experiments should evaluate and reply to M.Schultz's proposal on how to handle sites supporting both 32-bit and 64-bit farms.

 

 

4.      SRM 2.2 Update (SRM-Status-20070403) - F.Donno

 

4.1         Progress since Last Report

Not much progress was made on the SRM 2.2 tests. A major network problem with ESNet slowed all testing of the SRM 2.2 systems. A router in Chicago was losing packets and causing transfers between CERN and FNAL to hang. Similar problems were seen with transfers between LBNL and FNAL. An upgrade was performed by ESNet about two weeks ago; however, the same problems had already been observed four weeks ago, and transfers appeared to deteriorate slowly over time.

 

A fix was put in place by ESNet last Thursday, but hanging transfers are still observed for the stress-test endpoint at FNAL. The dCache team at FNAL has been fully involved in understanding the source of the problem, and a considerable debugging effort was also made at CERN (M.Litmaath and F.Donno).

 

If notifications about such network problems were posted somewhere (as happens for site problems and downtimes), useless and time-consuming investigations of the SRM systems would have been avoided. A user-level network monitoring tool is being put in place and is really needed.

 

D.Foster asked who the network people involved were. F.Donno will send him that information.

4.2         Other Problems

Other temporary problems were encountered:

-          DPM
Ran out of VOLATILE space.

-          CASTOR:
Until yesterday only a few instabilities remained to be fixed. Since yesterday there have been major instabilities (in both the basic and the use-case tests) due to Name Server and/or network interventions (the back-end is suspected).

-          dCache:
No improvements with the use cases yet, mostly due to the network-problem investigations and the preparation of the 1.8 beta release.

4.3         Stress Tests

The SRM stress tests started on 22 March 2007 and are executed manually once or twice a day. Nine more nodes dedicated to stress tests were deployed at CERN by the FIO group.

 

The stress tests are available:

-          Relocatable RPMs were created for the test suite. Binaries are linked statically to avoid installing local dependencies.

-          These RPMs are being used by SAM; the first attempts to integrate the tests in SAM look successful.

-          The same RPMs are also used by CMS to develop PhEDEx use cases (with the S2 libraries) for SAM

 

A mechanism is being put in place to run the stress tests on all nodes and to collect, aggregate and publish the results.
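As a minimal sketch of such an aggregation step (the actual mechanism is not described in these minutes), the snippet below collects per-node result files and prints pass/fail totals per test; the one-JSON-file-per-node layout and the record fields are assumptions made for the illustration.

import json
from collections import Counter
from pathlib import Path

def aggregate_results(result_dir):
    """Read one JSON file per node, each containing a list of
    {"test": ..., "outcome": "pass" | "fail"} records (assumed format),
    and return pass/fail counts per test."""
    totals = {}
    for path in Path(result_dir).glob("*.json"):
        for record in json.loads(path.read_text()):
            totals.setdefault(record["test"], Counter())[record["outcome"]] += 1
    return totals

if __name__ == "__main__":
    for test, counts in sorted(aggregate_results("results").items()):
        print(f"{test}: {counts['pass']} pass / {counts['fail']} fail")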

 

The current status is:

-          CASTOR:
It is not really possible at the moment to stress-test CASTOR because PutDone is too slow (it can kill the server). The next CASTOR version, which also contains the LSF plug-in fix, is really needed.

-          dCache:
The endpoint does not yet respond correctly under stress-test conditions. The developers are working to improve the situation.

-          DPM and BeStMan:
Both behave well under stress with the resources available at the moment.

-          StoRM:
Some problems were observed with StoRM. The developers have been working to improve the situation.

4.4         Basic and Use Cases Tests

Work has been performed to develop more tests that explicitly check new scenarios:

-          BringOnline for dCache and CASTOR (simulating the problems experienced by LHCb with dCache SRM v1.1)

-          AbortRequest/Files used by FTS to recover from errors and implement retries

-          Incorrect output parameters returned in special conditions (size, file locality, etc.). CASTOR: on one occasion, the srmStatusOfGetRequest operation returned an invalid file size (different from previously returned values).

 

The functionality tests are executed five times a day on SRM endpoints that are not the ones used for the stress tests.

 

Slides 6 to 9 show the status of the tests today and one month ago. The network problems already mentioned make the current results not very meaningful. The use-case results are worse than before, but most of the bad results are also due to the network problems encountered.

4.5         News about Available Endpoints

dCache

-          Stress and functionality endpoints available at FNAL with dCache 1.7 + development version of SRM

-          As of today, new endpoint available at DESY with dCache v1.8beta.

CASTOR

-          Functionality and stress endpoints are available at CERN.

-          A stress test endpoint (c2itdc) with the new version of CASTOR (PutDone + LSF plugin fix) may be made available tomorrow

DPM

-          New version (1.6.4) of DPM just released. Two endpoints available at CERN. One used for functionality and stress testing.

-          Waiting for a UK site to install it.

StoRM

-          Only one endpoint available at CNAF used for functionality and stress tests.

-          A more robust endpoint is being made available for stress testing at CNAF.

-          Waiting for a wider deployment

BeStMan

-          Only one endpoint available at LBNL, used for both functionality and stress tests.

4.6         Test of the SRM Clients

GFAL/lcg-utils

-          Independent tests are being made by M.Ciriello (INFN) using the latest versions of the clients available on all endpoints.

-          lcg-utils works for all implementations except BeStMan (which requires a delegated proxy, not supported by lcg-utils).

-          Sporadic problems with GFAL for all endpoints except DPM.

-          Results are published on the SRMDev Twiki page: https://twiki.cern.ch/twiki/bin/view/SRMDev 

 

FTS

-          Tests executed almost daily. Results reported on srm-tester mailing list and published on Twiki: https://twiki.cern.ch/twiki/bin/view/SRMDev

-          Tests are working for all SRM implementations except for BeStMan.

-          More endpoints will be added soon.

4.7         Update on the GSSD Working Group

The next GSSD face-to-face meeting will be at CERN on April 12, 2007.

 

The main topics that will be discussed are:

-          SRM v1 to v2 migration plan

-          Experiments' problems with Storage Services

-          Monitoring utilities

 

A parallel GSSD session will take place during the CMS offline week:

-          Discussion of CMS input on Storage Classes and Tier-1s configuration.

-          CMS GSSD Wiki pages are being updated (https://twiki.cern.ch/twiki/bin/view/LCG/GSSD)

 

Planning of activities and of the next face-to-face meetings is ongoing:

-          Major milestones will be discussed during the July and September meetings.

-          The US and Asia need to be involved. We need to plan very well in advance.

 

S.Lin asked for VRVS to be set up for the GSSD event, to allow people from Taipei to connect.

F.Donno replied that the GSSD sessions will all be on VRVS, and that the sessions involving Asian sites will be scheduled in the mornings and those with the US in the afternoons, to better deal with time differences.

 

5.      Job Reliability Reports (Slides; document) - M.Lamanna

 

M.Lamanna presented the proposal of a document to be used as the "Job Reliability Monthly Report", which shows the job successes vs. failures for each site and each VO.

 

The first version was circulated in February 2007. The documents attached are for February and March 2007.

 

News since last presentation:

-          The summary table now covers the “Job attempts” only.

-          The statistics are being updated (to have access to more data, rather than just a sub-sample).
The ALICE and LHCb data have already been regenerated.

-          Few jobs are running at BNL (always submitted via the WMS). These are not to be considered part of the report so far (it is just a test activity).

 

Slide 3 shows the new summary table layout extracted from the reports of February and March 2007.

 

February 2007

 

March 2007

 

The summaries above show job-attempt successes vs. failures for each site and each VO. This is not the "user-perceived" efficiency, because failures are counted for each job attempt: a job can succeed for the user after having failed several attempts.

 

Jobs submitted directly, without using the normal grid submission systems, are not counted because the information is not available. It is difficult to correlate attempts and successes for jobs submitted directly (e.g. the global grid job id is missing).

 

L.Robertson asked for an explanation of the "CMS Total in February 2007": about 31000 successes vs. 49900 failures, while the CRAB results in the document look much better (CRAB is always above 90% success for all sites). Why the difference?

M.Lamanna explained that the CRAB values do not include the retries and attempts, while the summary table does. Still, the exact reasons were not very evident.

 

L.Robertson suggested that the tables above should also include, in additional columns, the final "job results without the attempts" for each site and VO.

M.Lamanna agreed that the perceived efficiency should also be shown in the same format. In the end there will be both the effective site efficiency (counting all job attempts) and the perceived efficiency (counting only the final results of the users' jobs).
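To make the distinction concrete, the short sketch below (not part of the report itself) computes both quantities from a list of job-attempt records; the record layout is an assumption made for the illustration. Each failed attempt counts against the site where it happened, while the user-perceived result depends only on whether the job eventually succeeded.

from collections import defaultdict

def efficiencies(attempts):
    """attempts: iterable of (site, job_id, succeeded) records (assumed layout).
    Returns per-site attempt efficiency and the overall user-perceived efficiency."""
    per_site = defaultdict(lambda: [0, 0])      # site -> [successful attempts, all attempts]
    job_succeeded = {}                          # job_id -> did any attempt succeed?
    for site, job_id, succeeded in attempts:
        per_site[site][0] += int(succeeded)
        per_site[site][1] += 1
        job_succeeded[job_id] = job_succeeded.get(job_id, False) or succeeded
    site_eff = {site: ok / total for site, (ok, total) in per_site.items()}
    user_eff = sum(job_succeeded.values()) / len(job_succeeded)
    return site_eff, user_eff

# Example: one job fails twice at SITE-A, then succeeds at SITE-B.
site_eff, user_eff = efficiencies([
    ("SITE-A", "job1", False),
    ("SITE-A", "job1", False),
    ("SITE-B", "job1", True),
])
print(site_eff)   # {'SITE-A': 0.0, 'SITE-B': 1.0}  -- attempt-level (site) view
print(user_eff)   # 1.0                             -- user-perceived view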

 

J.Templon asked whether job attempts that failed at other sites are counted against the site where the job finally succeeded.

M.Lamanna replied that failures are counted against the site where they happen, not against the site where the job is finally executed.

 

L.Robertson asked whether these job reliability reports are mature enough to be presented to the LHCC Referees.

M.Kasemann noted that users see better behaviour than is reported in these tables (because of the attempts and retries). The reports should only be presented while making very clear that they show the sites' performance, not the results for individual users.

 

D.Barberis asked why the reliability at a given site is so different for different VOs, and also why very few jobs are reported for some VOs, without it being clear which types of jobs are taken into account.

M.Lamanna replied that any job submitted using the WMS is taken into account. For ATLAS, for instance, the jobs submitted via Ganga are counted.

 

L.Robertson noted that, looking only at the numbers for CMS/CERN/March 2007, the successes seem to be all due to CRAB jobs while all the others seem to fail. The numbers should be studied to understand what is covered, and reported to the Referees only when we are sure of what they mean.

 

D.Barberis noted that all the experiments probably have the success rates for their jobs in their own monitoring systems.

 

O.Smirnova added that it is also important to know what kind of jobs have failed (short or long ones? after one minute or after days of execution?); just counting the jobs can be extremely misleading. N.Brook added that failures could be due to job problems or to site misconfiguration; how can the two be distinguished?

 

Decision:

The MB agreed to assess again in the coming weeks what should be shown to the Referees and how to present it. M.Lamanna will distribute a new version of his "Job Reliability Report" proposal.

 

6.      Job Priorities Update (Slides) - J.Templon

 

 

J.Templon presented an update of the implementation of Job Priorities at the Tier-0 and Tier-1 sites.

 

The tables below provide a summary update.

 

Values in March 2007

 

 

Values in April 2007

 

 

-          The rows in dark yellow correspond to sites that have not replied and have not sent any update.

-          FZK, FNAL and NDGF never sent any information and have not specified their VOMS roles.

-          L.Dell’Agnello stated that CNAF has been publishing its information since the morning and it should be available in the repository.

-          Until now only SARA is publishing correctly; the others are not compliant:
PIC, IN2P3 and ASGC are publishing, but with an incorrect format (see [a], [b], [c] above).
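The exact format problems referred to as [a], [b] and [c] are in a table that is not reproduced in these minutes. Purely as an illustration, the sketch below checks that published access-control rules look like the "VO:<vo>" or "VOMS:<FQAN>" values (e.g. VOMS:/atlas/Role=production) generally used for job priorities; the assumption that the rules are published as GlueCEAccessControlBaseRule values, and the regular expression itself, are made up for the example.

import re

# Assumed value formats for GLUE access-control rules (illustrative only):
#   "VO:<vo-name>"   plain VO rule, e.g. VO:atlas
#   "VOMS:<FQAN>"    VOMS role rule, e.g. VOMS:/atlas/Role=production
ACBR_PATTERN = re.compile(r"^(VO:[a-z0-9_.-]+|VOMS:/[A-Za-z0-9_./=-]+)$")

def check_acbr(values):
    """Return the published values that do not match the expected format."""
    return [v for v in values if not ACBR_PATTERN.match(v)]

print(check_acbr(["VOMS:/atlas/Role=production", "atlas"]))   # -> ['atlas']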

 

G.Merino asked whether only ATLAS will test the VOMS roles or whether other VOs are planning to send requests soon.

M.Kasemann and N.Brook replied that CMS and LHCb do not plan to use VOMS roles in the next month.

 

L.Robertson asked whether all sites could progress on this topic in the next few weeks, in order to have VOMS roles implemented at all sites by the end of April 2007.

 

Issue to Follow:

The MB agreed to follow the status of VOMS roles at the sites in the next couple of weeks.

 

7.      CMS Top 5 Issues/Concerns (Slides) - M.Kasemann

 

M.Kasemann presented the top issues and concerns of CMS with respect to the LCG services.

 

Slides 2 to 5 show the priorities that CMS had presented in the Baseline Services document; this is just to show that the CMS priorities have not changed since then.

 

Slides 6 and 7 show the current top concerns.

 

1) Storage Element: CASTOR functionality and performance
Issues & Concerns are:

-          The single CASTOR request queue is a potential bottleneck

-          Under the stress of data taking, the global priorities must work and there must be sufficient capacity in the system

-          Performance for use-cases at CERN and at Tier-1 sites

-          Lack of strategy with respect to the use of xrootd

 

CASTOR, both its functionality and its performance, is crucial for the functioning of CMS.

The issues/concerns mentioned are the single request queue and the functioning of priorities.

CMS has no plan B regarding CASTOR.

 

L.Robertson noted that the last bullet, referring to xrootd, is more a CMS decision than a CASTOR issue.

M.Kasemann agreed that the strategy towards xrootd is a CMS issue.

 

T.Cass added that CMS had sometimes reached the maximum number of connections to CASTOR; xrootd could be useful in this context because it handles several users over a single connection.

 

2) Storage Element: SRM interface with functionality and performance
CMS is especially interested in:

-          Name-space interaction, which will allow CMS to alleviate the need for local VO services for data management (if it can be made to scale)

-          SRM explicit delete

-          Good support for SRM copy in push mode

-          Capability to control access rights consistently across different SEs using VOMS

 

In particular, the capability to control access rights is important for managing the resources.

 

I.Fisk added that until now SE access control for CMS is only by groups. A user within a group can currently do anything to the SE; a mistake such as a recursive delete in the wrong place could remove an important portion of the data. For this reason SRM 2.2 should have access control implemented with ACLs.

 

3) FTS servers

-          Monitoring of transfers and queues

-          Backup servers

-          Heartbeat and similar ”service-is-up" monitoring

 

CMS will not fail without these FTS features, but it is important to monitor and know about transfer failures in order to improve efficiency and make better use of the available resources.

 

4) Workload Management

-          Capable of fully utilizing available compute resources. Scaling to 200K jobs / day

-          Strategy to converge on a few (2-3) submission tools to be used as best fit. Candidates: gLite RB, Condor-G submission

-          We need to improve the transparency of job submission and diagnosis for the users

 

The WMS should reach the limit of 200K jobs/day. CSA06 tested 50K jobs/day and CSA07 will test 100K jobs/day.

CMS should decide soon which submission tools to use.

The transparency of job execution should allow the result of a job to be investigated completely, with the level of detail adjustable by the experiment; especially in case of failures one may want to see all execution details.

 

C.Grandi asked how many submission servers could be needed for 200K jobs/day.
M.Kasemann replied that about 10 could be sufficient. I.Bird confirmed that a similar number seems sufficient.
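A quick back-of-the-envelope check of these figures (illustrative only, assuming the load is spread evenly across the submission servers):

jobs_per_day = 200000
servers = 10
per_server = jobs_per_day / servers        # 20,000 jobs/day per server
rate = per_server / (24 * 3600)            # about 0.23 jobs/second per server
print(f"{per_server:.0f} jobs/day per server, ~{rate:.2f} jobs/s sustained")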

 

J.Templon asked whether JRA1 has improved the error messages, making them clearer for the sites.

C.Grandi said that as components are modified the messages are being improved substantially, especially in the UI of gLite 3.1 and in the “syslog” logging for the site managers. Further improvements will be defined during the gLite restructuring in the next few months.

 

5) Scalability, reliability and redundancy of the information system

-          Mitigate the dependency on the central CERN BDII

 

The BDII should scale up to the required load and be reliable. There should be several servers, rather than a dependency on a single BDII server.

 

I.Bird noted that this is already available, but the applications should use it correctly by sending queries to the right BDII level, not always only to the CERN top-level server.

 

J.Templon added that the choice of the BDII depends on the application, on the WN configuration, on the site defaults and on the VO defaults.
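As a rough sketch of what querying “the right BDII level” can look like (not taken from the minutes), the snippet below queries a site-level BDII directly instead of the CERN top-level one. The host name is a placeholder, and the use of the ldap3 Python library and of the GLUE 1.x base DN ("mds-vo-name=<SITE>,o=grid" on port 2170) are assumptions of the example.

from ldap3 import ALL, Connection, Server

# Placeholder host: in practice the application, the WN configuration,
# the site defaults or the VO defaults determine which BDII is used.
SITE_BDII = "site-bdii.example.org"

server = Server(SITE_BDII, port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)   # BDIIs allow anonymous reads

# Assumed GLUE 1.x layout: a site BDII publishes under mds-vo-name=<SITE>,o=grid.
conn.search(
    search_base="mds-vo-name=EXAMPLE-SITE,o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEStateStatus"],
)
for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateStatus)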

 

L.Robertson concluded that the “experiments’ issues/concerns” should be collected in April so that they can be summarized at the next GDB meeting in May.

 

8.      AOB 

 

 

No AOB.

 

 

9.      Summary of New Actions

 

 

 

The full Action List, with current and past items, will be available on this wiki page before the next MB meeting.