LCG Management Board

Date/Time

Tuesday 4 August 2009 16:00-17:00 – Phone Meeting 

Agenda

http://indico.cern.ch/conferenceDisplay.py?confId=62552

Members

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 7.8.2009)

Participants

A.Aimar (notes), O.Barring, I.Bird(chair), D.Britton, Ph.Charpentier, L.Dell’Agnello, D.Duellmann, M.Ernst, X.Espinal, I.Fisk, Qin Gang, J.Gordon, A.Heiss, M.Kasemann, O.Keeble, R.Pordes, H.Renshall, Y.Schutz, J.Shiers, R.Tafirout

Invited

P.Fuhrmann, A.Sciabá 

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Mailing List Archive

https://mmm.cern.ch/public/archive-list/w/worldwide-lcg-management-board/

Next Meeting

Tuesday 18 August 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)

 

1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.

1.2      Explanation of the APEL/R-GMA issues

During the meeting J.Gordon distributed the post-mortem (PM) report on the APEL issues of the previous week.

 

Matter arising from the minutes of 21/7. Explanation of the APEL/R-GMA issues raised.

 

We do not believe there were problems per se with either Tomcat or R-GMA; we do not believe this happened during STEP09; and we have no evidence of lost data.

 

What happened was that some weeks prior to the problem on 20/6 we replaced the hardware of the flexible archiver which receives accounting data from sites. This appeared to be working correctly, but the combination of Java, Tomcat and MySQL was not configured optimally. We are not sure whether this service gradually ran slower over the weeks or whether some incident tipped it over the limit, but on 20th June it stopped receiving data and failed soon after each restart. Diagnosis and solution were prolonged by the APEL expert, Tomcat expert and principal sysadmin for the service all being on holiday. They all had cover, but their backups were slightly more tentative in their progress than the principals would have been. The initial problem was fixed later that week (24th) and data was being received successfully. The backlog took about a week to catch up.

 

Unfortunately the subsequent processes which summarise the data, prepare it for the portal views and run SAM tests to verify publishing could not cope with the size of the backlog. The crons had not completed when the next wave started and new data was trickling through very slowly. Despite broadcasts this situation was made worse by sites repeatedly trying to republish data which was already safely in the repository because they could not see the results in the portal.

 

This was eventually fixed by taking the service offline, letting it process the backlog, and then dealing with the new backlog when it was restarted. This took almost another two weeks. We now believe that everything is working well; the backlog has been cleared and is visible in the portal. In order to stop this happening again we propose to freeze data older than 13 months. At the moment the summaries for each site/VO/month are rebuilt every day; we will now only rebuild the last 13 months of data (see the sketch below). Any site wishing to publish data older than this will need to make a specific request. The old data will still be visible in the portal, but sites cannot automatically republish old data (e.g. in order to change site name).
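
As an illustration of the proposed change, the sketch below shows how a daily summarisation job could restrict itself to the last 13 months. It is only a sketch: the table and column names (JobRecords, SiteVoMonthSummary, SumCPU, etc.) are hypothetical and do not reflect the real APEL schema.

# Hypothetical sketch of the 13-month rebuild window for the daily APEL
# summarisation. Table and column names are illustrative only.
from datetime import date

def rebuild_cutoff(today: date, months_back: int = 13) -> str:
    """Return the earliest month (as 'YYYY-MM') that is still rebuilt daily."""
    year, month = today.year, today.month - months_back
    while month <= 0:
        month += 12
        year -= 1
    return f"{year:04d}-{month:02d}"

def summary_rebuild_sql(today: date) -> str:
    """Rebuild per-site/VO/month summaries only for the last 13 months;
    older months stay frozen and need a specific request to republish."""
    cutoff = rebuild_cutoff(today)
    return (
        "REPLACE INTO SiteVoMonthSummary (Site, VO, Month, SumCPU, NumJobs) "
        "SELECT Site, VO, DATE_FORMAT(EndTime, '%Y-%m'), SUM(CPUTime), COUNT(*) "
        "FROM JobRecords "
        f"WHERE DATE_FORMAT(EndTime, '%Y-%m') >= '{cutoff}' "
        "GROUP BY Site, VO, DATE_FORMAT(EndTime, '%Y-%m')"
    )

if __name__ == "__main__":
    print(summary_rebuild_sql(date(2009, 8, 4)))   # cutoff is 2008-07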

 

All of the problems described above were with the core processes of the service and will not change significantly when we change the transport layer from R-GMA to ActiveMQ. The replacement APEL client is under test and will be released to 'guinea pig' testing soon. Given the bottlenecks in the certification process we do not expect widespread deployment of the client that does not use R-GMA until the end of this year. We also need to work with James Casey to plan a deployment model: the R-GMA MON box at each site is used by APEL independently of R-GMA (it is where the site accounting data is held), and it will need to be upgraded to SL5/gLite 3.2, without significant breaks in service, once R-GMA is removed from the site. This plan is related to that for rolling out an ActiveMQ infrastructure to sites.

 

Summary: problem would have been difficult to foresee but could have been handled more quickly if not for holidays. All working well again and limiting republishing to last 13 months should prevent recurrence.

 

Action: better documentation and diagnostics of the internal processes of APEL.

1.3      Changing T1 and T2 Site Names

I.Bird noted that Sites intending to change their name should consider all the side effects of such a decision. This item was raised because BNL would like to change its name in the WLCG databases.

 

The Site name is the key in all databases. Changing it has major consequences:

-       All reports generated have to be modified.

-       A new name implies a new Site; the information under the old name stops.

-       All historical information on the Site is disconnected from the new data.

-       No statistical or historical trends are possible against the old name's data.

-       Even some FTS channels could be using the names.

 

M.Ernst commented that BNL wants to consolidate on the new name following the OSG naming standards. The current names were set up in WLCG a long time ago and there is a lot of legacy that should not be lost. OSG will possibly continue to publish the information with the old name in the WLCG databases, using a mapping in order to keep it working. The issue seems understood, but it will wait until the end of August, after ATLAS has finished the current processing. No changes will happen in August.

 

J.Gordon added that for the accounting it is possible to map many names to the same Site, but this is not done automatically and is only true for the APEL accounting.
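
A minimal sketch of the kind of mapping described above, assuming a simple alias table maintained by hand; the names used are purely illustrative and not the actual site names involved.

# Illustrative sketch: several published names mapped onto one accounting
# Site, as is possible (manually) for APEL. The alias table is hypothetical.
SITE_ALIASES = {
    "EXAMPLE-OLD-NAME": "EXAMPLE-SITE",   # legacy WLCG-style name (hypothetical)
    "EXAMPLE-NEW-NAME": "EXAMPLE-SITE",   # new OSG-style name (hypothetical)
}

def canonical_site(published_name: str) -> str:
    """Resolve a published name to its canonical accounting Site."""
    return SITE_ALIASES.get(published_name, published_name)

assert canonical_site("EXAMPLE-OLD-NAME") == canonical_site("EXAMPLE-NEW-NAME")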

1.4      SL5 Deployment

I.Bird reported that some Sites have asked for clarification about the SL5 deployment.

 

J.Gordon added that many Sites are now running SL5 but the deployment is not as smooth as expected. He asked that the MB approve and request the move to SL5, following the request of the Experiments, which initially want a separate CE in order to test SL5 nodes at a given Site. The Experiments will then ask for the change in the proportion of SL4/SL5 nodes at each Site.

 

O.Keeble added that the Sites want a clear request from the WLCG in order to proceed. The meta-package for the SL5 deployment is ready and should be used at the Sites. There is a wiki page with all the information available. The advice from each Experiment is under construction.

 

At the last GDB all Experiments except ATLAS were ready for the move. Since then ATLAS has also indicated that they are ready to move to SL5. An official statement from ATLAS would be good.

 

A.Heiss added that SL5 will be the default for ALICE and CMS at FZK, as agreed with the Experiments. They can still choose a specific CE but the default will be SL5.

M.Ernst added that BNL has also moved all resources to SL5, with the SELinux component disabled.

 

I.Fisk asked that the Tier-2 Sites should also be mentioned in the request.

 

Ph.Charpentier noted that there is a problem in GFAL, not yet ported to the LCG Applications Area, which prints out a long log of errors for every error; this should be fixed. D.Duellmann replied that he will follow up on the issue right away.

 

A web page is available to provide pointers to the relevant information to support the migration, including links to the necessary packages.  (https://twiki.cern.ch/twiki/bin/view/LCG/SL4toSL5wnMigration). It is understood that the Experiments are all ready and able to use SL5 resources.

 

Decision:

The MB agreed to ask for the SL5 migration at all Sites, including the Tier-2 Sites.

 

2.   Action List Review (List of actions)

 

 

The MB Action List was not reviewed this week.

 

3.   LCG Operations Weekly Report (Slides) – H.Renshall
 

 

Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.

All daily meetings summaries are always available here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings

3.1      Summary

The report covers the two weeks 19 July to 1 August.

There was a mixture of problems, mostly site related, and no alarm tickets.

 

The incidents leading to service incident reports were:

-       ASGC Power Cut

-       NL-T1 dCache degradation (awaited)

3.2      GGUS Tickets

 

VO          User    Team    Alarm   Total
ALICE          2       0       0       2
ATLAS         12      45       0      57
CMS            9       0       0       9
LHCb           1      21       0      22
Totals        24      66       0      90

 

3.3      Services Availability

Slide 4 shows the VO SAM results.

 

One can notice that:

-       ATLAS yellow/red at SARA on 23/24 July was a period when all ATLAS CEs were unavailable.

-       ATLAS orange/yellow at SARA on 28/29 July was due to a dCache upgrade exposing new logging features, which led to service degradation.

 

-       CMS failures at ASGC on 19-23 July were due to a misconfigured CE (previously upgraded); the farm was also overloaded, which caused the CE-cms-mc test to fail for a few days with timeout errors.

-       CMS at CNAF had hardware problems with two disk servers serving the software area via GPFS on July 23-24. The SAM test failures at CNAF on July 28-29 are related to the first tests of the TSM+GPFS system; these had been coordinated with CMS DataOps and did not have any major effect on production activities, but caused the tests to fail.

-       CMS at GridKa: the SRM service for the dCache production instance ran out of memory (also affecting PhEDEx transfers) on July 29. There were problems with the generation of the gridmap file on July 31, which induced SRM authentication errors. There were also network problems on some dCache read pools, again on July 31.

-       CMS at FNAL was in scheduled downtime on July 23 – usually a constant dark green.

3.4      ASGC Power Failure SIR

Site: T1_TW_ASGC

Incident Date: July 17, 2009

Service: power system

Impacted: most services

 

Incident Summary: A power surge happened suddenly at around 6:00 UTC and lasted only 3 seconds, but since there is no UPS available in the computing centre all services went off. Most servers were restarted and recovered within 2 hours; some cron jobs were delayed for a few more hours.

 

Incident duration: ~4 hours

 

Report date: July 17, 2009 (was not in last MB report)

Reported by: Gang Qin

 

Future plan: set up a second power route from another region in the near future. Small UPS units might be applied to core servers, but a large UPS will certainly not be used.

 

I.Bird asked for an explanation of this second power route and whether it is a solution for an unexpected power cut or for a power glitch.

 

Qin Gang replied that currently the power route is from the campus and a second line from another source is being added. The switch-over should take only a few milliseconds. ASGC does not want to use a UPS after the experience suffered in the recent past.

 

I.Bird added that the switch-over should be tested; it seems very risky for power glitches.

Qin Gang confirmed that the failover will be tested, and that if the switch-over happens within 5 power pulses it will not impact the running equipment.

3.5      NL-T1 dCache Degraded (28/29 July)

 

Detailed SIR with impact/recovery action/plans still due.

 

H.Renshall presented the information available.

SARA started upgrade to dCache 1.9.3-3 on the morning of 28 July but could not then start their SRM servers.

The next day they reported that the dCache server partition had been filled by the ‘billing’ log – 5 GB instead of the normal few MB.

The log was filled with records of orphaned files – files on disk but not in the dCache name space – and the normal cron log-compression job could not keep up. This is now fixed, with no need to roll back dCache.

This would affect other sites moving to this version that have orphaned files.

3.6      Miscellaneous Reports

Below is a summary of the reports filed in the last two weeks and miscellaneous information reported by H.Renshall:

-       Tape drive contention between ATLAS and CMS was seen at ASGC. Studies indicate they should grow from 7 to 12 drives – planned for early October. Migration of CMS data to tape was stuck for a few days.

Qin Gang added that the fair share between Tier-1 and Tier-2 was incorrect and is now corrected.

ASGC can currently only run 200 jobs a day; this will be increased in August.

 

-       SARA have added 12 more tape drives.

-       BNL added 2 PB of disk space. The ATLAS farm was upgraded to SL5. The site name change, as part of the move to OSG, was reverted when FTS transfers failed; this is understood and will be re-planned (it happened anyway during a BNL downtime). FTS name-change scripts are now ready for future such changes.

-       CERN AFS acron failed for users whose names begin with a to l (served by afsdb2) for several hours one morning – this stopped the ATLAS service status displays and the site functional tests.

-       RAL banned a CMS userid which killed half their worker nodes by triggering the out of memory kill procedure. Now restored - being analysed by CMS as workflows are usually pre-tested.

 

-       ATLAS restarted STEP’09-like reprocessing at ASGC and experienced Tier-1/Tier-2 contention with Monte Carlo jobs. They switched to pilot-job mode in a single grid queue, with the PANDA server doing the scheduling, and this was successful. Currently jobs are failing after worker nodes ‘lost’ part of the ATLAS software (having previously been validated).

-       The ATLAS DBATL database at IN2P3 was down for 12 hours while a corrupt block was repaired. This stopped Streams replication of the ATLAS Metadata Interface database to CERN.

-       ATLAS batch at CNAF was down for a morning after the shared software area server failed, followed by a switch port failure to the backup server. This affected all VOs.

-       The LHCb data disk service class at CERN filled up during Monte Carlo production; only 443 TB of their 2009 pledge of 991 TB was allocated. More space (100 TB) has now been added.

-       CMS tape migration stuck at RAL since Sunday 2 August. Reported yesterday (Monday) as due to a mechanical problem with a tape robot.

J.Gordon added that they will improve the monitoring of the migration process.

 

 

 

4.   Storage Issues (Slides) – A.Sciabá

 

 

A.Sciabá summarized the current status and issues of the different SRM systems.

4.1      News from the SE Software and SRM Implementations

CASTOR

CASTOR 2.1.8-10 was released; an appropriate time to upgrade at CERN is being sought.

It fixes the serious xrootd bug (failed transfers looked OK to xrootd) and addresses many other bugs.

 

SRM 2.8 is being stress-tested and is close to release:

-       Improved logging

-       srmLs returns file checksum

-       Asynchronous SRM requests are processed in time order

-       srmStatusOfBringOnline works correctly

 

dCache

Will be covered in the following presentation.

 

StoRM

A wish list is being collected for the next release. The current items are:

-       SL5 support

-       Checksum calculated in background on srmPutDone

-       Better dynamic info provider

-       Group-based and user-based quotas

-       Better logging

 

DPM

The latest version in production is 1.7.2 and the next release is in the works.

It fixes a bug that makes the SRM 2.2 daemon crash and adds new xrootd code to increase the stability of xrootd access for ALICE.

 

FTS

FTS 2.2 is still under certification, but they would like to add the patches for the “SRM busy” logic and the checksum support.

The checksum support needs to be validated by the experiments

 

GFAL/lcg-util

GFAL 1.11.8-1 and lcg-util 1.7.6-1 just certified for SLC4.

They fix a serious bug for CMS: a file could not be copied using a GridFTP TURL as destination.

Better BDII failover implemented.

 

L.Dell’Agnello and Ph.Charpentier noted that certification on SL5 is also needed, because these packages are needed on the WN.

4.2      SRM Implementations

There is an effort underway to document how the storage providers implement the SRM protocol.

 

The motivation is that SRM is often ambiguous: the SRM specification is not enough to define the actual behaviour, and a description of the implementations is unavailable, incomplete or scattered around.

 

The goal is to stop re-interpreting the specifications again and again. It will be useful for clients and data management frameworks, and it will make discussions on SRM more focused on facts.

 

It will be maintained as a “living” document, preferably a wiki, built from existing documentation and from input provided by the developers on a “best effort” basis.

4.3      Addendum to the SRM MoU

The experiments should be asked what is really needed among the items still missing from the MoU addendum (GDB in April):

-       VOMS support in CASTOR

-       Space selection for read operations in dCache

-       srmLs returning all spaces where a file exists (not available in CASTOR)

-       “Client-driven” tape optimization: not available anywhere, therefore irrelevant

 

The ranking from the Experiments should be:

-       0 = useless

-       1 = if there, it could allow for more functionality in data management, or better performance, or easier operations, but it is not critical

-       2 = critical: it should be implemented as soon as possible and its lack causes a significant degradation of functionality / performance / operations

 

Action:

The Experiments have to send to A.Sciabá their ranking for the features in the SRM MoU addendum.

 

I.Bird noted that the goal of this request is to really define the minimal necessary set of features among those already agreed.

 

M.Kasemann noted that the input of the Technical Forum should also be taken into account.

 

Ph.Charpentier gave a reminder about other known CASTOR security issues (not reported in these minutes) that would allow a malicious user to delete a lot of data. The insecure access method must be closed.

 

 

5.   DCache: Migration to Chimera, and dCache “Golden” release (Paper; Slides) – P.Fuhrmann

                                                                                                                                                

 

P.Fuhrmann presented the status of dCache and the migration to Chimera. There are also papers (link above) explaining in more technical detail the motivations and the procedures to follow.

5.1      Pnfs to Chimera migration

The dCache file system engine is one of the most important dCache components. All other services rely on that functionality.  The performance of the file system engine determines the performance of the entire system.

 

Pnfs is the initial implementation of the file system engine and was introduced a decade ago. The technology Pnfs is built on doesn't allow the system to be improved any further.

 

Pnfs is not a solution for the petabyte storage domain. The main flaws are:

-       All Pnfs accesses have to go through the NFS 2/3 interface (even dCache's own).

-       A single lock protects an entire Pnfs database, which means that Pnfs performance doesn't benefit from multiple cores.

-       Data is stored in BLOBs, which doesn't allow SQL queries.

 

DCache.org designed and implemented a successor: Chimera. Chimera is based on standard database technology.

The advantages are that :

-       There is no Chimera server, which allows Chimera to scale with the number of CPU/cores of the name space host.

-       Standard SQL allows easy monitoring and maintenance.

-       Further dCache features and speed improvements assume that the name space is based on Chimera.

-       e.g.: ACL inheritance.

 

Chimera has been in production use at some sites since Nov 2008. The NDGF Tier-1 migrated to Chimera at the end of March '09.

In the meantime most German Tier-2 sites have migrated to Chimera, as well as sites in the US, Scandinavia and Italy. Other Tier-2s will follow soon. GridKa/KIT will build the new CMS instance on Chimera.

 

Migration is composed of 3 steps:

-       Dumping Pnfs into the SQL format.

-       Injecting the SQL into the new database.

-       Verifying Pnfs against Chimera (md5 checksums)

 

For NDGF (8 million files) these steps took 11 + 3.5 + 11 hours, i.e. about 25.5 hours in total. Different Pnfs databases can be converted concurrently.
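
As an illustration of the third step only, the sketch below compares two dumps of checksum/path pairs. It assumes the Pnfs and Chimera name spaces can each be exported to a text file with one "<md5> <path>" entry per line; this format and the file names are assumptions, not the output of the actual dCache migration tools.

# Illustrative sketch of the verification step: compare md5 checksums of
# name-space entries dumped from Pnfs and from Chimera. The dump format is
# an assumption, not the real format produced by the dCache tools.
def load_dump(filename: str) -> dict:
    """Read a dump file into a {path: md5} mapping."""
    entries = {}
    with open(filename) as f:
        for line in f:
            md5sum, path = line.rstrip("\n").split(None, 1)
            entries[path] = md5sum
    return entries

def verify(pnfs_dump: str, chimera_dump: str) -> bool:
    """Report entries that are missing or that differ between the two dumps."""
    pnfs, chimera = load_dump(pnfs_dump), load_dump(chimera_dump)
    ok = True
    for path, md5sum in pnfs.items():
        if path not in chimera:
            print("MISSING in Chimera:", path)
            ok = False
        elif chimera[path] != md5sum:
            print("MISMATCH:", path)
            ok = False
    for path in chimera.keys() - pnfs.keys():
        print("EXTRA in Chimera:", path)
        ok = False
    return ok

if __name__ == "__main__":
    verify("pnfs.md5", "chimera.md5")   # hypothetical dump file names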

 

The migration process can be tested without interference with the production system.

The Pnfs dump step does a very strict Pnfs consistency check. If Pnfs inconsistencies are found, they can be fixed in advance. This allows running an uninterrupted 'real' conversion and it allows a good estimate on the time it takes.

 

There is no way back from Chimera to Pnfs unless one accepts to sacrifice all newly created files.

dCache will treat chimera issues (if there are any) with highest priority, assuming fixing the potential problem will take less time than converting back to Pnfs.

 

The real risk assessment has to be done by the site manager. DCache.org tries to provide as much information as possible.

 

There is no risk associated with the migration process itself, as one can fall back to Pnfs as long as no file has been written into Chimera. The migration process can be tested many times in advance to find the optimal setup, without interfering with the production system.

 

Chimera has been in production for about 8 months; Pnfs has been in production for about a decade. However, Chimera was used by one Tier-1 and many Tier-2 Sites during the STEP09 exercise and no problems were reported.

 

Pnfs doesn't support ACL inheritance, which is required by the SRM automatic directory creation.

Reasonable monitoring, maintenance and accounting can only be done with Chimera, e.g. the amount of data by user or group, or the amount of data on tape or pending migration to tape (see the sketch below).
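
A minimal sketch of such an accounting query is given below. It assumes a PostgreSQL back-end and a Chimera t_inodes table with iuid, igid and isize columns; the schema and connection details should be checked against the local installation before relying on it.

# Illustrative sketch of a "data volume per user/group" query against the
# Chimera database. The schema (t_inodes with iuid, igid, isize) and the
# connection string are assumptions to be verified locally.
import psycopg2

QUERY = """
    SELECT iuid AS uid, igid AS gid,
           SUM(isize) AS bytes, COUNT(*) AS files
    FROM t_inodes
    GROUP BY iuid, igid
    ORDER BY bytes DESC
"""

def data_by_owner(dsn: str = "dbname=chimera user=chimera host=localhost") -> None:
    """Print the amount of data stored per uid/gid pair."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            for uid, gid, nbytes, files in cur.fetchall():
                print(f"uid={uid} gid={gid}: {nbytes/1e12:.2f} TB in {files} files")

if __name__ == "__main__":
    data_by_owner()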

 

In case of Pnfs inconsistencies not many people would be able to fix the system, while Chimera is based on standard SQL tables and somebody with reasonable database experience would be able to investigate. The structure of the tables is self-describing and no server is needed to extract the file system information.

There is no way back from Chimera without writing one's own back-conversion scripts, which can be done since the source code and the design are available.

 

If performance issues are found during data taking, Pnfs cannot be sped up; one would have to schedule an upgrade to Chimera. Running faster hardware is only moderately beneficial as Pnfs cannot scale with the number of cores/CPUs.

 

The STEP09 exercise can be seen as a hint for judging Pnfs behaviour, if one believes that this was a reasonable test. If in the past the PnfsManager has shown loaded queues over a long period of time, a migration to Chimera would be recommended.

 

The NFS 4.1 interface is only available with Chimera.

 

Ph.Charpentier asked whether the name spaces will be unchanged, even if they contain Pnfs in the name.

P.Fuhrmann replied that for the final users there is no difference.

 

I.Bird asked the Sites whether they have enough information and confidence to migrate to Chimera now. They should comment in the next few weeks at the Operations meeting.

 

5.2      DCache Release Policy (Golden Releases)

DCache.org is moving towards time-based releases. Time-based releases are a common practice in large software projects.

The time of a release is known, but not exactly its features.

 

The advantages are several: 

-       Easier to synchronise with distributors of the software

-       No waiting for a release because the implementation of a feature is delayed.

-       Sites can properly schedule updates.

 

Golden Releases: Golden Releases are supported much longer than regular ones. The last Golden Release was the CCRC branch, which was a great success.

 

The next Golden Release will be supported throughout the entire first LCG run period.

The Golden Release will be dCache 1.9.5. No new features will be added to that release, only bugs will be fixed.

There will of course be new regular releases published, with shorter lifetimes and new features.

Sites are free to choose the Golden or the Feature Releases. But the advice is to use Golden Releases.

 

Release 1.9.2 will no longer be supported once 1.9.5 is released.

 

So Release 1.9.5 will be supported for one year. Release 1.9.6 and others will be published, but 1.9.5 will receive bug fixes for one year. New releases will appear every 6 weeks to 3 months.

 

The features will be negotiated with the users, but the timing will be fixed.

 

D.Duellmann noted that a similar approach could also be considered for CASTOR.

 

I.Bird noted that Release 1.9.5 is planned for late September or early October. Are Sites expected to move to 1.9.5 at the last minute? This seems unrealistic. If people cannot migrate, what will be supported?

P.Fuhrmann agreed but replied that it cannot be done earlier. Before data taking Sites should migrate to 1.9.5. Release 1.9.3 will be supported until 1.9.6 is released.

I.Bird noted that upgrades can take some time and it is risky to do them at the beginning of data taking. There will also be old versions of dCache to support next year.

 

P.Fuhrmann noted that there are Tier-1 Sites running older versions, even 1.9.0, but they should upgrade and there is a direct path that avoids intermediate versions. dCache.org has regular Tier-1 meetings to discuss these issues.

 

J.Gordon asked who is following the calendar in order not to have many Sites down at the same time.

J.Shiers agreed that Sites should report to the Operations meeting in advance and discuss the schedules beforehand in order to avoid clashes.

 

P.Fuhrmann agreed to provide a proposal for the upgrades.

 

R.Tafirout proposed that the Tier-2 Sites should upgrade to Chimera first, so that there is a larger number of installations before moving to the Tier-1 Sites.

P.Fuhrmann agreed that Tier-2 can be encouraged and he will talk to M.Jouvin about it.

 

 

6.    AOB

 

 

6.1      SL5 and Other Changes before Data Taking

J.Gordon noted that this should be followed up by the Experiments. Tier-2 Sites in the UK said they want to move to SL5 at the end of the year.

Sites were asked to freeze their installations by August, but the current delay of 6 weeks could allow Sites to move their freeze date by one month.

 

I.Bird and J.Shiers noted that the Experiments are going into production anyway (cosmics, etc.) so the end-of-August date should remain. Upgrades are still possible during production periods, but Sites must consider themselves in production.

 

M.Kasemann agreed that the mode of operations should be that all changes must be discussed before taking place. CMS needs the migration to SL5 by the 1st of September and most Sites are proceeding.

 

J.Gordon added that Sites want a month of stability before data taking and therefore the Experiments should also not ask for changes after that deadline.

6.2      CERN at Telecom 2009 (Slides) – I.Bird

CERN will participate in the Geneva Area stand at Telecom 2009.

The slides show a proposal of possible demos to provide. One of those is a 3-day data challenge, but this seems difficult with the ongoing activities of the Experiments. CERN is involved, and the accelerator and the Experiments (online and offline) are going to be involved.

 

To what extent do the Experiments want to participate in the PR material and demos? Will it be static monitoring or something else?

 

Experiments should send their input within 2 weeks. I.Bird will send a reminder by email.

 

J.Shiers suggested that, in order to minimize work and risks, the STEP09 data could be replayed. Some EGEE 09 demos can also be used. Live demos are not really going to be possible.

 

 

7.    Summary of New Actions

 

 

 

 

Action:

The Experiments have to send to A.Sciabá their ranking for the features in the SRM MoU addendum.