LCG Management Board


Tuesday 4 August 2009 16:00-17:00 – Phone Meeting 




(Version 1 – 7.8.2009)


A.Aimar (notes), O.Barring, I.Bird(chair), D.Britton, Ph.Charpentier, L.Dell’Agnello, D.Duellmann, M.Ernst, X.Espinal, I.Fisk, Qin Gang, J.Gordon, A.Heiss, M.Kasemann, O.Keeble, R.Pordes, H.Renshall, Y.Schutz, J.Shiers, R.Tafirout


P.Fuhrmann, A.Sciabá 

Action List

Mailing List Archive

Next Meeting

Tuesday 18 August 2009 16:00-17:00 – Phone Meeting

1.   Minutes and Matters arising (Minutes)


1.1      Minutes of Previous Meeting

No comments received about the minutes. The minutes of the previous MB meeting were approved.

1.2      Explanation of the APEL/R-GMA issues

During the meeting J.Gordon distributed the post-mortem report on the APEL issues of the previous week.


Matter arising from the minutes of 21/7. Explanation of the APEL/R-GMA issues raised.


We do not believe there were problems per se with either Tomcat or R-GMA; we do not believe this happened during STEP09; and we have no evidence of lost data.


What happened was that some weeks prior to the problem on 20/6 we replaced the hardware of the flexible archiver which receives accounting data from sites. This appeared to be working correctly, but the combination of Java, Tomcat and MySQL was not configured optimally. We are not sure whether this service gradually ran slower over the weeks or whether some incident tipped it over the limit, but on 20 June it stopped receiving data and failed soon after each restart. Diagnosis and solution were prolonged by the APEL expert, Tomcat expert and principal sysadmin for the service all being on holiday. They all had cover, but their backups were slightly more tentative in their progress than the principals would have been. The initial problem was fixed later that week (24th) and data was again being received successfully. The backlog took about a week to catch up.


Unfortunately the subsequent processes which summarise the data, prepare it for the portal views and run SAM tests to verify publishing could not cope with the size of the backlog. The cron jobs had not completed when the next wave started, and new data was trickling through very slowly. Despite broadcasts, this situation was made worse by sites repeatedly trying to republish data which was already safely in the repository, because they could not see the results in the portal.


This was eventually fixed by taking the service offline, letting it process the backlog, and then dealing with the new backlog when it was restarted. This took almost another two weeks. We now believe that everything is working well; the backlog has been cleared and is visible in the portal. In order to stop this happening again we propose to freeze data older than 13 months. At the moment the summaries for each site/VO/month are rebuilt every day. We will now only rebuild the last 13 months of data. Any site wishing to publish data older than this will need to make a specific request. The old data will still be visible in the portal, but sites cannot automatically republish old data (e.g. in order to change site name).
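The proposed 13-month freeze is conceptually a simple cutoff on the site/VO/month summary rebuild. A hypothetical sketch in Python (record fields and names are invented for illustration, not the actual APEL schema):

```python
from datetime import date

# Hypothetical sketch of the proposed republishing window: summaries are
# rebuilt only for months within the last 13 months; older data is frozen
# and can only be changed by a specific request.
FREEZE_MONTHS = 13

def months_back(today, months):
    """Return the first day of the month `months` months before `today`."""
    total = today.year * 12 + (today.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)

def records_to_rebuild(records, today):
    """Keep only the site/VO/month summaries newer than the freeze cutoff."""
    cutoff = months_back(today, FREEZE_MONTHS)
    return [r for r in records if r["month"] >= cutoff]

# Example: with "today" taken as the meeting date, July 2008 summaries are
# still rebuilt while June 2008 summaries are frozen.
records = [
    {"site": "RAL-LCG2", "vo": "atlas", "month": date(2008, 6, 1)},
    {"site": "RAL-LCG2", "vo": "atlas", "month": date(2008, 7, 1)},
    {"site": "RAL-LCG2", "vo": "atlas", "month": date(2009, 7, 1)},
]
fresh = records_to_rebuild(records, date(2009, 8, 4))
```

The point of the design is that the daily cron only touches a bounded window of data, so a republishing storm can no longer trigger an unbounded rebuild.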


All of the problems described above were with the core processes of the service and will not change significantly when we change the transport layer from R-GMA to ActiveMQ. The replacement APEL client is under test and will be released soon for "guinea pig" testing. Given the bottlenecks in the certification process we do not expect widespread deployment of the client which does not use R-GMA until the end of this year. We also need to work with James Casey to plan a deployment model, as the R-GMA MON box at each site is used by APEL independently of R-GMA (it is where the site accounting data is held) and will need to be upgraded to SL5/gLite 3.2, once R-GMA is removed from the site, without significant breaks in service. This plan is related to that for rolling out an ActiveMQ infrastructure to sites.


Summary: the problem would have been difficult to foresee but could have been handled more quickly if not for holidays. All is working well again, and limiting republishing to the last 13 months should prevent recurrence.


Action: better documentation and diagnostics of the internal processes of APEL.

1.3      Changing T1 and T2 Sites names

I.Bird noted that sites intending to change their name should consider all the side effects of such a decision. This item is mentioned because BNL would like to change its name in the WLCG databases.


The site name is the key in all databases. Changing it has major consequences:

-       All generated reports have to be modified

-       A new name implies a new site; the old site's information stops

-       All historical information on the site is disconnected from the new data

-       No statistical or historical trends are possible against the old name's data

-       Even some FTS channels could be using the names.


M.Ernst commented that BNL wants to align its name with the OSG naming standards. The current names were set up in the WLCG a long time ago and there is a lot of legacy that should not be lost. OSG will possibly continue to publish the information with the old name in the WLCG databases; they will do a mapping in order to keep it working. The issue seems understood but will wait until the end of August, after ATLAS has finished the current processing. No changes will happen in August.


J.Gordon added that for the accounting it is possible to map many names to the same site, but this is not done automatically and applies only to the APEL accounting.
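Such a mapping is conceptually just a many-to-one alias table from published name to canonical accounting site. A minimal illustrative sketch (the alias names below are invented examples, not the real BNL/OSG names or the APEL implementation):

```python
# Illustrative many-to-one mapping of published site names onto one
# canonical accounting site. This is NOT the actual APEL code; it only
# shows why old and new names can be merged without losing history.
SITE_ALIASES = {
    "BNL-ATLAS": "BNL",    # hypothetical old WLCG name
    "BNL_ATLAS_1": "BNL",  # hypothetical new OSG-style name
}

def canonical_site(published_name):
    """Resolve a published name to its accounting site (identity if unmapped)."""
    return SITE_ALIASES.get(published_name, published_name)

def aggregate_cpu(usage_records):
    """Sum CPU hours per canonical site, merging records published under aliases."""
    totals = {}
    for name, cpu_hours in usage_records:
        site = canonical_site(name)
        totals[site] = totals.get(site, 0) + cpu_hours
    return totals
```

Because the alias table is maintained by hand, the merge only happens for names that someone has explicitly registered, which matches the remark that it "is not done automatically".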

1.4      SL5 Deployment

I.Bird reported that some Sites have asked for clarification about the SL5 deployment.


J.Gordon added that many sites are now running SL5 but the deployment is not as smooth as expected. He asked that the MB approve and request the move to SL5, following the request of the Experiments, which initially want a separate CE in order to test SL5 nodes at a given site. The Experiments will then ask for the change in the proportion of SL4/SL5 nodes at each site.


O.Keeble added that the Sites want a clear request from the WLCG in order to proceed. The meta-package for the SL5 deployment is ready and should be used at the Sites. There is a wiki page with all the information available. The advice from each Experiment is under construction.


At the last GDB all Experiments except ATLAS were ready for the move. Since then ATLAS has also said they are ready for moving to SL5; an official statement from ATLAS would be good.


A.Heiss added that SL5 will be the default for ALICE and CMS at FZK, as agreed with the Experiments. They can still choose a specific CE but the default will be SL5.

M.Ernst added that BNL has also moved all resources to SL5, with the SELinux component disabled.


I.Fisk asked that the Tier-2 sites also be mentioned in the request.


Ph.Charpentier noted that there is a problem in GFAL, not yet ported to the LCG Applications Area: it prints a long log of errors for every error and should be fixed. D.Duellmann replied that he would follow up on the issue right away.


A web page is available providing pointers to the relevant information to support the migration, including links to the necessary packages. It is understood that the Experiments are all ready and able to use SL5 resources.



The MB agreed to ask for the SL5 migration at all sites, including the Tier-2 sites.


2.   Action List Review (List of actions)



The MB Action List was not reviewed this week.


3.   LCG Operations Weekly Report (Slides) – H.Renshall


Summary of status and progress of the LCG Operations. This report covers the WLCG activities since the last MB meeting.

All daily meeting summaries are always available here:

3.1      Summary

The report covers the two weeks 19 July to 1 August.

There was a mixture of problems, mostly site related, and no alarm tickets.


The incidents leading to service incident reports were:

-       ASGC Power Cut

-       NL-T1 dCache degradation (awaited)

3.2      GGUS Tickets


3.3      Services Availability

Slide 4 shows the VO SAM results.


One can notice that:

-       ATLAS yellow/red at SARA on 23/24 July was a period when all ATLAS CEs were unavailable.

-       ATLAS orange/yellow at SARA on 28/29 July was because a dCache upgrade exposed new logging features, which led to service degradation.

-       CMS failures at ASGC on 19-23 July were due to a misconfigured (previously upgraded) CE; the farm was also overloaded, which caused the CE-cms-mc test to fail for a few days with timeout errors.

-       CMS at CNAF had hardware problems with two disk servers serving the software area via GPFS on July 23-24. The SAM test failures at CNAF on July 28-29 are related to first tests of the TSM+GPFS system; these had been coordinated with CMS DataOps and did not have any major effect on production activities, but caused the tests to fail.

-       CMS at GridKa: the SRM service for the dCache production instance ran out of memory (also affecting PhEDEx transfers) on July 29. On July 31 there were problems with the generation of the gridmap file, which induced SRM authentication errors, and also network problems on some dCache read pools.

-       CMS at FNAL was in scheduled downtime on July 23 – usually a constant dark green.

3.4      ASGC Power Failure SIR

Site: T1_TW_ASGC

Incident Date: July 17, 2009

Service: power system

Impacted: most services


Incident Summary: A power surge happened suddenly at around 6:00 UTC and lasted only 3 seconds, but since there is no UPS available in the computing centre, all services went down. Most servers were restarted and recovered within 2 hours; some cron jobs were delayed for a few more hours.


Incident duration: ~4 hours (approx)


Report date: July 17, 2009 (was not in last MB report)

Reported by: Gang Qin


Future plan: set up a second power route from another region in the near future. Some small UPSes might be applied to core servers, but a big UPS will surely not be used.


I.Bird asked for an explanation of this second power route and whether it is a solution for an unexpected power cut or for a power glitch.


Qin Gang replied that currently the route is from the campus and a second line from another source is being added. The switchover should happen within a few milliseconds. ASGC does not want to use a UPS after the experience suffered in the recent past.


I.Bird added that the switchover should be tested; it seems very risky for power glitches.

Qin Gang confirmed that the failover will be tested, and that if the switchover happens within 5 power cycles it will not impact the running equipment.
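For scale, the arithmetic below shows that five mains cycles correspond to well under a tenth of a second (assuming a 60 Hz supply, which is what Taiwan uses; the 5-cycle tolerance figure is the one quoted in the meeting, not a verified equipment specification):

```python
# Rough arithmetic: how long is the quoted 5-power-cycle tolerance window,
# assuming a 60 Hz mains supply?
MAINS_HZ = 60

def cycles_to_ms(n_cycles, hz=MAINS_HZ):
    """Duration of n_cycles of mains power, in milliseconds."""
    return n_cycles * 1000.0 / hz

# Five cycles at 60 Hz is ~83 ms, so a switchover "within a few
# milliseconds" would be comfortably inside that window.
window_ms = cycles_to_ms(5)
```

This is why the claimed few-millisecond transfer switch would, if it works as advertised, be invisible to the equipment.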

3.5      NL-T1 dCache Degraded (28/29 July)


Detailed SIR with impact/recovery action/plans still due.


H.Renshall presented the information available.

SARA started an upgrade to dCache 1.9.3-3 on the morning of 28 July but could not then start their SRM servers.

The next day they reported that the dCache server partition had been filled by the 'billing' log – 5 GB instead of the normal few MB.

The log was filled with records of orphaned files – files on disk but not in the dCache name space – and the normal cron log-compression job could not keep up. This is now fixed, with no need to roll back dCache.

It would affect other sites moving to this version that have orphaned files.

3.6      Miscellaneous Reports

Below is a summary of the reports filled in the last two weeks and miscellaneous information reported by H.Renshall:

-       Tape drive contention between ATLAS and CMS was seen at ASGC. Studies indicate they should grow from 7 to 12 drives – planned for early October. Migration of CMS data to tape was stuck for a few days.

Qin Gang added that the fair share between Tier-1 and Tier-2 was incorrect and has now been corrected.

ASGC can currently only run 200 jobs a day; this will be increased in August.


-       SARA have added 12 more tape drives.

-       BNL added 2 PB of disk space. ATLAS farm was upgraded to SL5. Site name change as part of move to OSG was reverted when FTS transfers failed. Understood and to be re-planned (was anyway during a BNL downtime). FTS name change scripts now ready for future such changes.

-       CERN AFS acron failed for users whose names begin with a to l (served by afsdb2) for several hours one morning – this stopped the ATLAS service status displays and site functional tests.

-       RAL banned a CMS userid which killed half their worker nodes by triggering the out-of-memory kill procedure. Now restored; being analysed by CMS, as workflows are usually pre-tested.


-       ATLAS restarted STEP'09-like reprocessing at ASGC and experienced Tier-1/Tier-2 contention with Monte Carlo jobs. They switched to pilot-job mode in a single grid queue with the PanDA server doing the scheduling, and this was successful. Currently jobs are failing after worker nodes 'lost' part of the ATLAS software (having previously been validated).

-       The ATLAS DBATL database at IN2P3 was down for 12 hours while a corrupt block was repaired. This stopped Streams replication of the ATLAS Metadata Interface database to CERN.

-       ATLAS batch at CNAF was down for a morning after the shared software area server failed, followed by a switch port failure to the backup server. This affected all VOs.

-       The LHCb data disk service class at CERN filled up during Monte Carlo production. They had only 443 TB allocated of their 991 TB 2009 pledge; 100 TB more has now been added.

-       CMS tape migration has been stuck at RAL since Sunday 2 August. It was reported yesterday (Monday) as being due to a mechanical problem with a tape robot.

J.Gordon added that they will improve the monitoring of the migration process.




4.   Storage Issues (Slides) – A.Sciabá



A.Sciabá summarized the current status and issues of the different SRM systems.

4.1      News from the SE Software and SRM Implementations


CASTOR 2.1.8-10 was released; an appropriate time to upgrade at CERN is being sought.

It fixes the serious xrootd bug (failed transfers looked OK to xrootd) and addresses many other bugs.


SRM 2.8 is being stress-tested and is close to release:

-       Improved logging

-       srmLs returns file checksum

-       Asynchronous SRM requests are processed in time order

-       srmStatusOfBringOnline works correctly



dCache will be covered in the following presentation.



A wish list for the next release is being collected. The current items are:

-       SL5 support

-       Checksum calculated in background on srmPutDone

-       Better dynamic info provider

-       Group-based and user-based quotas

-       Better logging



The latest version in production is 1.7.2, and the next release is in the works.

It fixes a bug which makes the SRM 2.2 daemon crash and includes new xrootd code to increase the stability of xrootd access for ALICE.



FTS 2.2 is still under certification, but they would like to add the patches for the "SRM busy" logic and the checksum support.

The checksum support needs to be validated by the Experiments.



GFAL 1.11.8-1 and lcg-util 1.7.6-1 have just been certified for SLC4.

This fixes a serious bug for CMS: a file could not be copied using a GridFTP TURL as destination.

Better BDII failover was implemented.


L.Dell’Agnello and Ph.Charpentier noted that SL5 certification is also needed, because these tools are needed on the WN.

4.2      SRM Implementations

There is an effort to document how the storage providers implement the SRM protocol.


The motivation is that SRM is often ambiguous: the SRM specification is not enough to define the actual behaviour, and a description of the implementations is unavailable, incomplete, or scattered around.


The goal is to stop re-interpreting the specifications again and again. It will be useful for clients and data management frameworks, and will make discussions on SRM more focused on facts.


It will be maintained as a "living" document, preferably a wiki, built from existing documentation and from input from developers on a "best effort" basis.

4.3      Addendum to the SRM MoU

Check with the experiments what is really needed among what is still missing from the MoU addendum (GDB in April):

-       VOMS support in CASTOR

-       Space selection for read operations in dCache

-       srmLs returning all spaces where a file exists, not available in CASTOR

-       ("Client-driven" tape optimization is not available anywhere, and therefore irrelevant)


The ranking from the Experiments should be:

-       0 = useless

-       1 = if there, it could allow for more functionality in data management, or better performance, or easier operations, but it is not critical

-       2 = critical: it should be implemented as soon as possible and its lack causes a significant degradation of functionality / performance / operations



The Experiments have to send to A.Sciabá their ranking for the features in the SRM MoU addendum.


I.Bird noted that the goal of this request is to really define the minimal necessary set of features among those already agreed.


M.Kasemann noted that the input of the Technical Forum should be also taken into account. 


Ph.Charpentier reminded about other CASTOR security issues known (not reported in these minutes) that would allow a malicious user to delete a lot of data. The insecure access method must be closed.



5.   DCache: Migration to Chimera, and dCache “Golden” release (Paper; Slides) – P.Fuhrmann



P.Fuhrmann presented the status of dCache and the migration to Chimera. There are also papers (link above) explaining the motivations and the procedures to follow in more technical detail.

5.1      Pnfs to Chimera migration

The dCache file system engine is one of the most important dCache components. All other services rely on that functionality.  The performance of the file system engine determines the performance of the entire system.


Pnfs is the initial implementation of the file system engine and was introduced a decade ago. The technology Pnfs is built on doesn't allow the system to be improved any more.


Pnfs is not a solution for the petabyte storage domain. Its main flaws are:

-       All Pnfs accesses have to go through the NFS 2/3 interface (even dCache's).

-       A single lock protects an entire Pnfs database, which means the Pnfs performance doesn't benefit from multiple cores.

-       Data is stored in BLOBs, which doesn't allow SQL queries.

The dCache team therefore designed and implemented a successor: Chimera, based on standard database technology.

The advantages are that:

-       There is no Chimera server, which allows Chimera to scale with the number of CPUs/cores of the name space host.

-       Standard SQL allows easy monitoring and maintenance.

-       Further dCache features and speed improvements (e.g. ACL inheritance) assume that the name space is based on Chimera.


Chimera has been in production use at some sites since November 2008. The NDGF Tier-1 migrated to Chimera at the end of March 2009.

In the meantime most German Tier-2 sites have migrated to Chimera, as have sites in the US, Scandinavia and Italy. Other Tier-2s will follow soon. GridKa/KIT will build the new CMS instance on Chimera.


Migration is composed of 3 steps:

-       Dumping Pnfs into the SQL format.

-       Injecting the SQL into the new database.

-       Verifying Pnfs against Chimera (md5 checksums)


For NDGF (8 million files) this took 11 + 3.5 + 11 hours for the respective steps, and different Pnfs databases can be converted concurrently.
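The verification step can be pictured as comparing per-entry digests of the two name-space dumps. A hypothetical sketch (the record format here is invented; the real comparison is done by the dCache migration tooling, not this code):

```python
import hashlib

# Hypothetical sketch of step 3: verify that Pnfs and Chimera agree by
# comparing md5 digests of each file's name-space record. Each dump is
# modelled as {path: record-text}; the real on-disk format differs.
def digest(record):
    """md5 of a canonical text rendering of one name-space entry."""
    return hashlib.md5(record.encode("utf-8")).hexdigest()

def verify(pnfs_dump, chimera_dump):
    """Return the paths whose records differ (or are missing) between dumps."""
    mismatches = []
    for path in sorted(set(pnfs_dump) | set(chimera_dump)):
        a = pnfs_dump.get(path)
        b = chimera_dump.get(path)
        if a is None or b is None or digest(a) != digest(b):
            mismatches.append(path)
    return mismatches
```

An empty mismatch list is what licenses cutting over to Chimera; any non-empty result points at the inconsistent entries, which is also how the strict consistency check during the dump step can surface problems in advance.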


The migration process can be tested without interference with the production system.

The Pnfs dump step performs a very strict Pnfs consistency check; if inconsistencies are found, they can be fixed in advance. This allows running an uninterrupted 'real' conversion and gives a good estimate of the time it will take.


There is no way back from Chimera to Pnfs unless one accepts losing all newly created files.

The dCache team will treat Chimera issues (if there are any) with the highest priority, assuming that fixing a potential problem will take less time than converting back to Pnfs.


The real risk assessment has to be done by the site manager. The dCache team tries to provide as much information as possible.


There is no risk associated with the migration process itself, as one can fall back to Pnfs as long as no file has been written into Chimera. The migration process can be tested many times in advance to find the optimal setup, without interfering with the production system.


Chimera has been in production for about 8 months, while Pnfs has been in production for about a decade. However, Chimera was used by one Tier-1 and many Tier-2 sites during the STEP'09 exercise and no problems were reported.


Pnfs doesn't support ACL inheritance, which is required by the SRM automatic directory creation.

Reasonable monitoring, maintenance and accounting can only be done with Chimera, e.g. the amount of data by user or group, or the amount of data on tape or pending migration to tape.


In case of Pnfs inconsistencies not many people would be able to fix the system, while Chimera is based on standard SQL tables and anybody with reasonable database experience would be able to investigate. The structure of the tables is self-describing and no server is needed to extract the file system information.

There is no way back from Chimera without writing one's own back-conversion scripts, which can be done since the source code and the design are available.


If performance issues are found during data taking we cannot speed up Pnfs; one would have to schedule an upgrade to Chimera. Running faster hardware is only moderately beneficial, as Pnfs cannot scale with the number of cores/CPUs.


The STEP'09 exercise can be seen as a hint for judging Pnfs behaviour, if one believes that this was a reasonable test. If in the past the PnfsManager has shown loaded queues over long periods of time, a migration to Chimera is recommended.


The NFS 4.1 interface is only available with Chimera.


Ph.Charpentier asked whether the name spaces will remain unchanged, even if they contain "pnfs" in the name.

P.Fuhrmann replied that for the final users there is no difference.


I.Bird asked the sites whether they have enough information and confidence to migrate to Chimera now. They should comment in the next few weeks at the Operations meeting.


5.2      DCache Release Policy (Golden Releases)

dCache is moving towards time-based releases, a common practice in large software projects. The time of a release is known, but not exactly its features.


The advantages are several: 

-       Easier to synchronise with distributors of the software

-       No waiting for a release because the implementation of a feature is delayed.

-       Sites can properly schedule updates.


Golden Releases: Golden releases are supported much longer than regular ones. The last Golden Release was the CCRC branch, which was a great success.


The next Golden Release will be supported throughout the entire first LCG run period.

The Golden Release will be dCache 1.9.5. No new features will be added to that release; only bugs will be fixed.

There will be of course new regular releases being published with shorter lifetime and new features.

Sites are free to choose the Golden or the Feature Releases. But the advice is to use Golden Releases.


With the release of 1.9.5, version 1.9.2 will no longer be supported.


So the 1.9.5 release will be supported for one year. Release 1.9.6 and others will follow, but 1.9.5 will receive bug fixes for one year. A new release will come between 6 weeks and 3 months after the previous one.


The features will be negotiated with the users but time will be fixed.


D.Duellmann noted that also for CASTOR one could look for a similar approach.


I.Bird noted that the 1.9.5 release is planned for late September or early October. Are sites expected to move to 1.9.5 at the last minute? That seems unrealistic. If people cannot migrate, what will be supported?

P.Fuhrmann agreed, but replied that it cannot be done earlier. Before data taking, sites should migrate to 1.9.5. Release 1.9.3 will be supported until 1.9.6 is released.

I.Bird noted that upgrades can take some time, and it is risky to upgrade at the beginning of data taking. There will be old versions of dCache to support next year as well.


P.Fuhrmann noted that there are Tier-1 sites running older versions, even 1.9.0, but they should upgrade, and there is a direct path that avoids intermediate versions. dCache holds regular Tier-1 meetings to discuss these issues.


J.Gordon asked who is following the upgrade calendar in order not to have many sites down at the same time.

J.Shiers agreed that sites should report to the Operations meeting in advance and discuss their schedules in advance in order to avoid clashes.


P.Fuhrmann agreed to provide a proposal for the upgrades.


R.Tafirout proposed that the Tier-2s should upgrade to Chimera first, so that there is a larger number of installations before moving to the Tier-1 sites.

P.Fuhrmann agreed that the Tier-2s can be encouraged and he will talk to M.Jouvin about it.



6.    AOB



6.1      SL5 and Other Changes before Data Taking

J.Gordon noted that this should be followed up by the Experiments. Tier-2s in the UK said they want to move to SL5 by the end of the year.

Sites were asked to freeze their installations by August, but the current delay of 6 weeks could allow sites to move their freeze date by one month.


I.Bird and J.Shiers noted that the Experiments are going into production anyway (cosmics, etc.), so the end-of-August date should remain. Upgrades are still possible during production times, but sites must consider themselves in production.


M.Kasemann agreed that the mode of operations should be that all changes must be discussed before taking place. CMS needs the migration to SL5 by the 1st of September and most Sites are proceeding.


J.Gordon added that Sites want a month of stability before data taking and therefore also the Experiments should not ask for changes after that deadline.

6.2      CERN at Telecom 2009 (Slides) – I.Bird

CERN will participate in the Geneva Area stand at Telecom 2009.

The slides show a proposal of possible demos to provide. One of those is a 3-day data challenge, but this seems difficult given the ongoing activities of the Experiments. CERN is nonetheless involved, and the accelerator and the Experiments (online and offline) are going to be involved.


To what extent do the Experiments want to participate in the PR material and demos? Will it be static monitoring or something else?


Experiments should send their input within 2 weeks. I.Bird will send a reminder by email.


J.Shiers suggested that, in order to minimize work and risks, the STEP09 data could be replayed. Some EGEE 09 demos can also be used. Live demos are not really going to be possible.



7.    Summary of New Actions






The Experiments have to send to A.Sciabá their ranking for the features in the SRM MoU addendum.