WLCG Operations Coordination Minutes - 7th February 2013

Agenda

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary), Ian Fisk, Simone Campana, Borut Kersevan, Ikuo Ueda, Felix Lee, Maria Dimou, Maarten Litmaath, Michail Salichos, Jan Iven, Manuel Guijarro, Jerome Belleman, Alessandro Di Girolamo, Daniele Spiga, Nicolò Magini
  • Remote: Cristina Aiftimiei, Dave Dykstra, Ian Collier, Andrea Valassi, Di Qing, Jeremy Coles, Joel Closier, Gareth Smith, Alessandra Forti, Oliver Gutsche, Burt Holzman, Rob Quick, Andreas Petzold, Ron Trompert, Andrew Sansum, wchang, Shawn McKee

News (Maria Girone)

Maria announces that Pepe and Alessandra have agreed to co-chair the meeting. They have started a discussion on how to involve the Tier-2 sites more deeply, possibly by having a small number of them participate actively in discussions and activities.

From March 4th the daily WLCG operations meeting will be held twice a week instead, on Mondays and Thursdays.

Middleware news and baseline versions (Nicolò Magini)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Highlights:

  • DPM baseline version has been moved to the latest (1.8.6, EMI-2 update 8), released last week and including several security fixes. It is mandatory for DPM sites, including those which still have the gLite version, as the patch is only for the EMI versions. Not yet in UMD, but it is foreseen for Feb 18.
  • the UI baseline version is now the EMI-2 UI, with the known caveat that the submission via gLite WMS is not reliable; the fix will come with the Feb EMI update. It has no other known issues.
    • Added Feb 20: there also are 2 issues related to CREAM jobs, see the baseline page for details.

Tier-1 Grid services (Nicolò Magini)

Storage deployment

Site | Status | Recent changes | Planned changes
CERN | CASTOR 2.1.13-6.1; SRM-2.11 for all instances. EOS: ALICE (EOS 0.2.20 / xrootd 3.2.5), ATLAS (EOS 0.2.25 / xrootd 3.2.7 / BeStMan2-2.2.2), CMS (EOS 0.2.22 / xrootd 3.2.6 / BeStMan2-2.2.2), LHCb (EOS 0.2.21 / xrootd 3.2.5 / BeStMan2-2.2.2) | |
ASGC | CASTOR 2.1.11-9; SRM 2.11-0; DPM 1.8.5-1 | Jan 29th: running on 2.5Gb backup link during intervention to 10Gb link, minor impact on storage | Feb 9th-17th: New Year Holiday; on-site staff covering service ops during daytime (0200 - 1100 UTC). Feb 25th whole day: DPM unavailable during UPS intervention + CASTOR scheduled upgrade
BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None
CNAF | StoRM 1.8.1 (Atlas, CMS, LHCb) | |
FNAL | dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3; Scalla xrootd 2.9.7/3.2.4-1.osg; Oracle Lustre 1.8.6; EOS 0.2.22-4/xrootd 3.2.4-1.osg with Bestman 2.2.2.0.10 | |
IN2P3 | dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-24 on pool nodes; Postgres 9.1; xrootd 3.0.4 | |
KIT | dCache: atlassrm-fzk.gridka.de 1.9.12-11 (Chimera), cmssrm-fzk.gridka.de 1.9.12-17 (Chimera), lhcbsrm-kit.gridka.de 1.9.12-24 (Chimera); xrootd (version 20100510-1509_dbg) | |
NDGF | dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes. | |
NL-T1 | dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF) | |
PIC | dCache 1.9.12-20 (Chimera) - doors at 1.9.12-23 | |
RAL | CASTOR 2.1.12-10; 2.1.12-10 (tape servers); SRM 2.11-1 | |
TRIUMF | dCache 1.9.12-19 (Chimera); gPlazma2 with voms+kpwd | | move to dcache 2.2 in the first half year

FTS deployment

Site | Version | Recent changes | Planned changes
CERN | 2.2.8 - transfer-fts-3.7.12-1 | |
ASGC | 2.2.8 - transfer-fts-3.7.12-1 | |
BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None
CNAF | 2.2.8 - transfer-fts-3.7.12-1 | |
FNAL | 2.2.8 - transfer-fts-3.7.12-1 | |
IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 | |
KIT | 2.2.8 - transfer-fts-3.7.12-1 | |
NDGF | 2.2.8 - transfer-fts-3.7.12-1 | |
NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 | |
PIC | 2.2.8 - transfer-fts-3.7.12-1 | |
RAL | 2.2.8 - transfer-fts-3.7.12-1 | |
TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 | |
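The FTS table lends itself to a quick consistency check. The following Python sketch simply hard-codes the transfer-fts package versions reported above and lists the sites that differ from the most common one. The dictionary and function names are illustrative only and are not part of any WLCG tool.

```python
from collections import Counter

# transfer-fts package versions as reported in the FTS deployment table
FTS_PACKAGES = {
    "CERN": "3.7.12-1", "ASGC": "3.7.12-1", "BNL": "3.7.10-1",
    "CNAF": "3.7.12-1", "FNAL": "3.7.12-1", "IN2P3": "3.7.12-1",
    "KIT": "3.7.12-1", "NDGF": "3.7.12-1", "NL-T1": "3.7.12-1",
    "PIC": "3.7.12-1", "RAL": "3.7.12-1", "TRIUMF": "3.7.12-1",
}


def sites_off_majority(packages):
    """Return (majority_version, sorted list of sites on another version)."""
    majority, _count = Counter(packages.values()).most_common(1)[0]
    outliers = sorted(site for site, ver in packages.items() if ver != majority)
    return majority, outliers


majority, outliers = sites_off_majority(FTS_PACKAGES)
print(majority, outliers)  # 3.7.12-1 ['BNL']
```

With the values reported at this meeting, only BNL runs a different package build (3.7.10-1).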

LFC deployment

Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans
BNL | 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s | SL5, gLite | Oracle | ATLAS | None
CERN | EMI2, 1.8.6 | All nodes in production are now running in SLC6 with the EMI2 software | Oracle | ATLAS, LHCb, OPS |
CERN, test | lfc-server-oracle-1.8.3.2-1 | SLC6, EMI2 | Oracle | ATLAS | Xroot federations

Other site news

Ian requested, and was granted, a postponement of the planned EOSCMS upgrade until after the end of the LHC run (which was recently extended by a few days).

Data management provider news

The latest EMI update (update 8, 28th Jan) contains new versions of DPM/LFC, lcg_util and FTS.

http://www.eu-emi.eu/emi-2-matterhorn/updates/-/asset_publisher/9AgN/content/update-8-28-01-2013-v-2-6-0-1

  • DPM/LFC 1.8.6 is a bugfix release
  • gfal/lcg_util 1.14.0 is a bugfix release, implementing better default timeouts for SRM operations
  • FTS 2.2.9 is a consolidation of all existing patches - up to date sites do not need to upgrade.

Experiment operations review and plans

ALICE (Maarten Litmaath)

The efficiency of ALICE analysis jobs at KIT had been low for many months due to performance issues in accessing the Xrootd installation. These led to frequent failover to remote sites, which more recently caused high load on the KIT firewall. The site was therefore mostly used for production workflows instead, with good efficiency. On Jan 30 the Xrootd experts at KIT and the ALICE grid experts had a very effective Vidyo meeting in which short- and medium-term strategies for improvements were developed:

  • The file accesses and failover were analyzed for a number of recent jobs.
  • Routing problems were discovered and fixed for a number of WN racks, significantly improving the access to the Xrootd servers and letting Xrootd traffic bypass the firewall also for those WN, whenever jobs need to access remote SEs.
  • Installation of new gateway servers to boost throughput.
  • Xrootd SW upgrade.
  • Improving the redundancy: allowing each GPFS partition to be accessed through multiple Xrootd servers.
  • Improving the monitoring.

Already after the first improvements the efficiency of the analysis became as good as at other sites and allowed KIT to be used at full capacity, up to ~6800 concurrent jobs so far. Many thanks to the team at KIT!

ATLAS (Ikuo Ueda)

Status

  • we are running very important production/analysis jobs for winter conferences
  • data taking of heavy ion collisions on-going (until next week)

Issues

Maarten says that the GFAL/lcg_util developer and Felix are actively working with high priority on the lcg-cp problem.

Simone gives an update on the SL6 validation by ATLAS. Production and analysis are fully validated on a subset of Athena releases. What is still missing is the group production which must be done by the physics groups; RAL will be used for that and the timescale for completing the validation is about end of March (Moriond permitting). This is not a show stopper: if a Tier-2 site has good reasons to migrate now, it can do so, but a Tier-1 site must wait for the completion of the validation.

Alessandra Forti says that sites in the UK have just finished a mammoth upgrade to EMI-2 and are certainly inclined to wait before upgrading to SL6. For Simone this is perfectly fine.

CMS (Ian Fisk)

2012 data reprocessing in full swing, several issues
  • activating /cms/Role=t1production at CERN took more than 5 days, GGUS:91055
  • there seem to have been communication problems between GGUS and SNOW (the SNOW ticket seems not to have been updated from GGUS with several comments, and the SNOW ticket receiver took no further action)

Maria D. points out that even if the status "In progress" was not propagated from GGUS to SNOW (which is something to be fixed), still the comments added by CMS to GGUS saying that the problem was not fixed were correctly propagated to SNOW. CMS did well to escalate the ticket.

  • IN2P3:
    • old problems have resurfaced when staging out files to MSS. IN2P3 configured dcap as read protocol and we seem to have triggered an old bug reported 1.5 years ago
    • IN2P3 is moving now to use xrootd
    • since the start of the rereco campaign (January 22nd), no rereco jobs have been run successfully
    • on a positive note, the srm problem with long proxies has a fix available that is being tested now, GGUS:90390

  • GGUS migration of all savannah functionality
    • no time to check with CMS about discussed changes, don't know when I will have time SAV:131565

  • latest problem: long running jobs getting killed
    • Problem appears to be in the pilots themselves. Being checked by experts.

After the run we will look at how we treat the CERN site. Likely we will activate more PhEDEx links and treat it as yet another Tier-1.

An issue that came up at the MB was the schedule for the move to SLC6 in the context of the proposed move of the lxplus alias. In the past we would move batch resources first and interactive nodes last. CMS ensures that an older binary will run on the newer OS, but we can virtually guarantee that a newer binary will fail on an older OS. Therefore the user environments generally move last. In other words, until most sites have SL6, users still need to compile their code on SL5.
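The compatibility rule described here can be written down as a small check. This is only an illustrative sketch: the release ordering and the function name are assumptions made for the example, and it models the rule as stated in the discussion (a binary is expected to run on the OS release it was built on or on a newer one), not any actual CMS tooling.

```python
# Hypothetical ordering of OS releases, oldest first (assumption for illustration)
OS_ORDER = ["SL5", "SL6"]


def binary_expected_to_run(built_on, runs_on):
    """Forward-compatibility rule: a binary built on an older (or the same)
    release is expected to run; one built on a newer release is not."""
    return OS_ORDER.index(built_on) <= OS_ORDER.index(runs_on)


# Print the full build/run matrix for the two releases
for built_on in OS_ORDER:
    for runs_on in OS_ORDER:
        print(built_on, "->", runs_on, binary_expected_to_run(built_on, runs_on))
```

Under this rule, code compiled on SL5 is usable on both SL5 and SL6 worker nodes, whereas code compiled on SL6 is only usable on SL6, which is why compiling on SL5 remains the safe choice while sites run a mix of both.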

Manuel explains that the proposal to move the lxplus alias on April 30 was made to the WLCG MB and the architects forum and there is a twiki with the details.

Ian adds that with the SL4-to-SL5 migration there was greater urgency for sites to upgrade due to hardware support issues, but this is not the case today. He proposes to agree in this working group on a schedule for site migration to SL6. Alessandra proposes September, although sites should be encouraged to start testing SL6 much sooner, for example using HammerCloud jobs. Maarten says that in two months we may in fact be able to tell the sites that it is safe to upgrade to SL6.

Borut sees no problem with the lxplus alias migration because of the different way ATLAS users compile their code but raises the issue of VOBOXes: migrating all ATLAS services to SL6 will take quite some time. He asks if it is worth waiting for Agile and Puppet: Manuel answers that yes, it is probably better to migrate at the same time to SLC6 and to the Agile/Puppet system, which should be completely available at the end of this month. Most VOCs are already trying Puppet anyway.

Simone asks if after April it will still be possible to get both physical and virtual machines on SL5, if the need arises: the answer is yes.

LHCb (Joel Closier)

  • 2011 data reprocessing is 50% done, estimating another 3 weeks to finish the operation
  • we continue to transfer disk only data from CASTOR to EOS.
  • request to the LFC team to replace, at the DB level, the old SRM endpoint of GridKa with the new one.

Concerning the migration of the lxplus alias, this is not a big issue for LHCb: it would even be better to remove it altogether, so users know exactly what kind of nodes they are using.

GGUS tickets (Maria Dimou)

No ticket submitted this time from experiments or sites.

We decided today to remove this permanent item from the agenda because GGUS issues are included in the experiment reports. In particular, once the daily meeting takes place only twice a week, the GGUS ticket escalation button should be used when needed. At the highest escalation level MariaD will also be notified.

Task Force reports

CVMFS (Ian Collier)

  • 97 sites targeted by the task force
    • 47 sites have deployed CVMFS (+5 since last meeting)
  • In view of the deployment target date of 30 April for sites supporting ATLAS and LHCb, another 23 sites need to deploy
    • The GGUS tickets for those sites will be updated with a reminder in mid-February
  • Initially 11 sites had not replied with information on deployment plans; this has now dropped to 6 sites
    • egee.irb.hr, egee.srce.hr, INFN-TORINO, INSU01-PARIS, NCP-LCG2, ru-Moscow-SINP-LCG2
    • more information (ticket numbers etc.) is available on the cvmfs deployment page
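The figures reported by the task force can be turned into a few derived numbers. The short Python sketch below only does arithmetic on the values quoted in this report; the variable names are illustrative.

```python
TARGETED_SITES = 97           # sites targeted by the task force
DEPLOYED_SITES = 47           # sites that have deployed CVMFS
NEWLY_DEPLOYED = 5            # sites that deployed since the last meeting
STILL_NEEDED_BY_TARGET = 23   # ATLAS/LHCb sites that must deploy by 30 April

deployed_fraction = DEPLOYED_SITES / TARGETED_SITES
previous_deployed = DEPLOYED_SITES - NEWLY_DEPLOYED
not_yet_deployed = TARGETED_SITES - DEPLOYED_SITES

print(f"deployed so far: {DEPLOYED_SITES}/{TARGETED_SITES} ({deployed_fraction:.1%})")
print(f"deployed at the previous meeting: {previous_deployed}")
print(f"targeted sites not yet deployed: {not_yet_deployed}")
print(f"of which due by 30 April (ATLAS/LHCb): {STILL_NEEDED_BY_TARGET}")
```

Just under half of the targeted sites have deployed so far, and 23 of the remaining 50 fall under the 30 April target.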

gLExec (Maarten Litmaath)

  • CMS
    • gLExec is used in production on the following EGI sites:
      • T1_UK_RAL
      • T2_BE_IIHE
      • T2_CH_CSCS
      • T2_DE_RWTH
        • does not show up in the SAM tests for OPS
      • T2_ES_CIEMAT
      • T2_PT_LIP_Lisbon
      • T2_UK_London_Brunel
      • T2_UK_SGrid_RALPP
    • Getting more EGI sites working: no timeline yet, possibly as of early March.
    • OSG sites are handled by USCMS/OSG
  • LHCb
    • The last check of gLExec support in DIRAC was in June/July 2012; there have been significant code changes since.
    • The current status will be checked in the DIRAC certification installation.

In a couple of weeks, with the help of CMS and LHCb we should be able to propose some milestones.

SHA-2 migration (Maarten Litmaath)

  • Still waiting for the new CERN CA, should become available soon...

Generic links:

Middleware deployment (Maarten Litmaath)

  • Good progress on the WLCG VOBOX:
    • All rpms tested OK for SL5 and SL6 in ALICE VOBOX test deployments.
    • EGI intend to have a WLCG repository available early March.
  • Worker Node testing for WLCG

FTS 3 integration and deployment (Nicolò Magini)

  • Discussed FTS3 requirements with ATLAS DDM developers
    • Some promptly implemented: verifying user-provided file size, appending user-specified metadata to jobs/file transfers.
    • New feature requests for scheduling need evaluation
  • Demo of new functionality:
    • xrootd third party transfers (requires latest xroot clients and servers, so not compatible with current production storage elements)
    • can now request explicit staging of files through srmBringOnline before transfer (main user LHCb)
  • Testing
    • Stress-testing by developers ongoing
    • Testing with real transfers by VOs to start at low level and ramp up after winter conferences

Maria G. asks when a deployment schedule could be available, as requested by Philippe Charpentier at the last pre-GDB. Alessandro says that the testing scale first needs to be ramped up in March, which is fine for the experiments. After that we can prepare a schedule. Maria strongly suggests putting it in a written proposal.

Simone asks whether the xroot third-party transfers were tested with a beta version of xroot or with an official version that can be rolled out. Michail answers that it is a beta version, used just for FTS3 testing. Alessandro adds that this version will become 3.3, while EOS has 3.2.x. He will discuss with the EOS developers whether it can be installed on the EOS development servers. At that point, a second site will be needed to run scale tests of third-party transfers, which according to Simone are important for understanding which protocol is best for WAN transfers.

Squid monitoring (Dave Dykstra)

See slides in the agenda

Dave proposes a SAM test for squid based on the MRTG monitoring information and a SAM test for squid based on hits to Frontier/CVMFS reverse-proxy Squids. Simone says that the latter is of no interest for ATLAS (too difficult to interpret) but the former can be useful for all experiments. It is not clear what manpower could be used to develop it, also considering that from May onward the role of IT-ES in SAM probe development is not yet clear.

PerfSONAR (Simone Campana)

See slides in the agenda

Contacts are needed for the Asia, Spain and Germany regions.

Oliver notes that in the PS Dashboard the LHC OPN looks horrible (lots of reds). Who should look after these problems? Simone and Alessandro think that the network working group should be involved, but first we must be totally sure that the tests and the PS Dashboard are correct and that the error thresholds are properly tuned. Michel says that sites have no idea what to do when these errors appear. Documentation and procedures are needed, and Shawn points out that OSG has produced a lot of documentation to help site administrators troubleshoot network problems.

Expertise from the sites is very useful: sites that would like to volunteer some help can contact the perfSONAR task force. Andreas offers help in looking for a contact in the Germany region.

Tracking tools

News from other WLCG working groups

AOB

Action list

  • ALICE, ATLAS, CMS and LHCb should give feedback on the lxplus migration to SL6 (twiki). The plan is to move a significant part of lxbatch to SLC6 before the alias switch takes place. The amount of SLC6 WN resources provided by other sites has not been taken into account for planning this alias switch (M. Guijarro).
  • Maarten will look into SHA-2 testing by the experiments when the new CERN CA has become available.
  • MariaD will update the WLCGCriticalServices twiki. DONE: the twiki is superseded by this.
  • Tracking tools TF members who own savannah projects to list them and report to the TF <wlcg-ops-coord-tf-tracktools@cern.ch> (which includes the savannah and jira developers) what they wish to do with them (freeze/migrate-to-jira/other(what)).

-- AndreaSciaba - 04-Feb-2013

Topic revision: r26 - 2013-02-20 - MaartenLitmaath