WLCG Operations Coordination Minutes - 24th January 2013

Agenda

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary), Andrea Valassi, Simone Campana, Oliver Keeble, Ikuo Ueda, Maarten Litmaath, Nicolò Magini, Maria Dimou, Felix Lee, Ian Fisk, Maite Barroso, Domenico Giordano, Alessandro Di Girolamo, Jan Iven
  • Remote: Daniele Bonacorsi, Alessandra Forti, Claudio Grandi, Dave Dykstra, Oliver Gutsche, Andreas Petzold, Shawn Mc Kee, Joel Closier, Pepe Flix, Di Qing, Alessandro Cavalli, Gareth Smith, Ron Trompert, Jeremy Coles, Christoph Wissing, Ian Collier, Christopher Walker, Peter Gronbech, Massimo Sgaravatto

News (M. Girone)

Maria G. announces that Pepe F. and Alessandra F. have kindly accepted to help coordinate the working group. In general we should aim for a stronger direct involvement of sites in the WG activities. She also presents some new topics that will be coordinated by our WG, namely data placement, data access, operations of the common analysis framework and cloud infrastructure testing. She announces a meeting on the common analysis framework for next Thursday in this time slot.

Middleware news and baseline versions (N. Magini)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

Nicolò announced a few changes:

  • added CREAM from EMI-2 to the list of baseline services
  • an update to squid for Frontier has been released; it will become mandatory within a few months due to a change in the list of CERN monitoring hosts that sites must allow to connect
  • a new version of the EMI-2 WN tarball is available (thanks to GridPP). It is not yet officially validated, but it has been in use at some sites for quite some time; sites are welcome to try it and give feedback
  • in the EMI-2 UI there is still a bug that causes ~2% of the jobs submitted to a gLite WMS to abort, but there are no other known issues; sites can update to it at their own discretion. Massimo explains that the fix was initially scheduled for December but is now foreseen for the end of February, which means that there is a gap between the end of support of the gLite 3.2 UI and that release date. Maarten thinks that security issues due to the UI are rather unlikely, so this should not be a problem; still, sites and experiments can decide to upgrade if the WMS issue does not affect them. The relocatable UI is still months away. The WLCG VOBOX is based on the EMI UI and all pieces are there; it is just a matter of setting up the official WLCG repository, so the VOBOX should be officially released in a couple of weeks. Ueda brings up an outstanding problem with the GFAL and lcg_utils versions currently in the EMI-2 WN, whose very short default timeouts for file transfers seriously affect ATLAS jobs. Maarten says that there is a fix, which according to Oliver should be released by EMI next Monday. Until now only the sites visibly affected have been contacted via tickets and given workarounds and pointers to the fixed rpms; an emergency update was not pursued because the problem had not been raised in other WLCG operations meetings. After some discussion, it is agreed that next Monday after 15:00 a broadcast will be sent telling sites to apply the update. Simone adds that it really should be rolled out within the next week.
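For illustration, a minimal sketch (in Python, wrapping the CLI) of how a transfer could be invoked with explicit timeouts instead of the short defaults is shown below. This is not the official workaround distributed in the tickets; the timeout values, the SURL and the destination are hypothetical, and only --srm-timeout is taken from the tickets quoted in the ATLAS report.

#!/usr/bin/env python
# Minimal sketch, not the official workaround: call lcg-cp with explicit
# timeout options instead of relying on the short defaults shipped with the
# affected EMI-2 WN lcg_utils. The timeout values and the example SURL and
# destination are hypothetical.
import subprocess
import sys

def copy_with_timeouts(src_surl, dest,
                       connect_timeout=300,
                       sendreceive_timeout=3600,
                       srm_timeout=3600):
    """Run lcg-cp with explicit timeouts (in seconds) and return its exit code."""
    cmd = [
        "lcg-cp", "-v",
        "--connect-timeout", str(connect_timeout),
        "--sendreceive-timeout", str(sendreceive_timeout),
        "--srm-timeout", str(srm_timeout),
        src_surl, dest,
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    rc = copy_with_timeouts(
        "srm://se.example.org/dpm/example.org/home/atlas/somefile",  # hypothetical SURL
        "file:///tmp/somefile")
    sys.exit(rc)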

Task Force reports

CVMFS (S. Roiser & M. Girone)

  • 96 sites targeted by this task force
  • 42 have deployed CVMFS as of today (+19 since last meeting)
  • GGUS tickets have been opened for all sites that either did not respond at all or promised deployment by "end of last year"
  • Reminder: deployment target date for sites supporting ATLAS & LHCb is 30 April 2013
    • GGUS tickets to sites supporting the two VOs and missing deployment will be opened in Feb (if not yet done)
Maria adds that the MB and the GDB are informed about sites that did not respond to the tickets; Luca Dell'Agnello will investigate with the Italian sites why (apparently some claim that they did not receive any ticket).

gLExec (M. Litmaath)

Maarten plans to arrange soon a meeting with CMS and LHCb to decide how to move forward.

SHA-2 migration (M. Litmaath)

Generic links:

The new CERN CA will be ready very soon; then we will get in touch with the experiments and arrange test plans.

Middleware deployment (S. Campana)

There was a survey one week ago about SL6 readiness: the statements from CMS and LHCb were positive, as they are ready to run on it. The ATLAS situation is more complicated: there is progress on compiling and running Athena on SL6, but only a few releases are supported. Pilots mostly work on SL6, apart from some issues in contacting the LFC: it is not yet clear whether they are just network glitches or something more serious (for example, a mismatch of Globus libraries). Rod Walker needs to run some more tests. Another problem is that there is not yet a meta-rpm for the SL5 compatibility libraries needed by ATLAS and CMS. Oliver answers that in fact it is ready but not yet released. Andrea V. says that it will be released for CERN (hence SLC6) tomorrow and that it is being validated by LHCb. There are some significant differences between SLC6 and SL6, so it is not yet known whether it works on SL6.

FTS 3 integration and deployment (N. Magini)

  • FTS3 demo and taskforce meeting Jan 16th
    • Features demonstrated:
      • mysql backend - 3 T1s already using it with no issues in setup or operations
      • changing/reading server config - CLI to update config (passing json as input), all commands logged in server's DB
      • blacklisting users and storage elements with CLI
      • transfer-submit/status CLI output now also available in JSON format (see the sketch below)
      • new FTS3 Monitor webpage, to replace Lyon FTS2 Monitor: uses Django, has various stats, supports filtering, ...
    • Discussion on deployment model: will test FTS3 server scalability and come back with a proposal (generally in favour of single "global" FTS3 server deployment)
  • Dedicated meeting on Jan 25th to discuss FTS3 requirements with ATLAS DDM developers
  • Next regular meeting on Jan 30th
Maria summarizes the next steps: 1) complete tests, 2) assess scalability, 3) write proposal for deployment model.
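As an illustration of the JSON output mentioned in the demo notes above, the sketch below (Python) queries the state of a transfer job through the fts-transfer-status CLI and parses the result; the endpoint URL and job ID are placeholders, and the "-j" flag name for JSON output is an assumption, not taken from the demo.

#!/usr/bin/env python
# Minimal sketch: query an FTS3 job through the CLI and parse the JSON output
# mentioned in the demo. The endpoint and job ID below are placeholders, and
# the "-j" flag name for JSON output is an assumption.
import json
import subprocess

FTS3_ENDPOINT = "https://fts3.example.org:8446"  # hypothetical server

def job_status(job_id):
    """Return the parsed JSON status of an FTS3 job, or None if the CLI fails."""
    try:
        out = subprocess.check_output(
            ["fts-transfer-status", "-s", FTS3_ENDPOINT, "-j", job_id])
    except subprocess.CalledProcessError:
        return None
    return json.loads(out.decode("utf-8"))

if __name__ == "__main__":
    status = job_status("12345678-abcd-1234-abcd-1234567890ab")  # placeholder job ID
    if status is not None:
        print(status)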

Squid monitoring

Andrea V. thinks that the topic of the Squid WN information does not fit in this task force, which is about monitoring, even if the point arose because of the information needed by the MRTG monitoring. Dave is proposing a new service and it is not clear who would use it if Frontier and CVMFS keep using the current systems to get the information. Dave proposes that a new TF be formed to work on this topic. Maria agrees with separating the two topics and proposes to close the squid monitoring task force once its goals have been reached. For the WN information, one could form a new TF, or use the CVMFS TF, or the Information System TF. Dave agrees to propose a mandate for a new TF, if he finds it appropriate.

Tracking tools

News from other WLCG working groups

Experiment operations review and plans

ALICE (M. Litmaath)

  • continuing the p-A data taking; the MB (minimum bias) sample is completed (130 M events), now in 'rare trigger' mode. All OK.
  • data registration in CASTOR is fine, replication to the T1s follows quickly.
  • quasi-online reconstruction working fine at all T0/T1 sites; 'fast' reconstruction for muon and calorimeter analysis completed a few hours after data taking; 'calibrated' reconstruction follows the standard procedure, no surprises
  • plans to finish a full pass of everything shortly after the end of the data taking period (a few weeks), then re-calibration and a new full pass (~1 month after)
  • plans to run p-A MC anchored to the RAW data taking conditions in parallel (~1.5 months of work for the sites)
  • longer-term plans to run Pb-Pb MC for 2011 data in preparation for a new processing (~3 months).

ATLAS (I. Ueda)

Status
  • we are running very important production jobs for winter conferences
  • data taking of heavy ion collisions (incl. proton-nucleus) ongoing
    • the expected data volume would be 750 TB at T1 disk, 500 TB at T1 tape, 250 TB at T2 disk, or a little more (e.g. 1 PB at T1 disk) depending on the achieved luminosity.
Issues
  • we observe some bottlenecks
    • in the network, the FTS channel configuration, and our own workflow, rather than in CPU
    • we should bring the technical discussion to the corresponding WG/TF (?); probably a combined effort of the two is needed
    • we hope WLCG Ops Coordination will follow the issues
  • the CERN CA CRL issue needs to be followed up and avoided in the future
    • we (WLCG Ops) should follow the follow-ups presented at the MB
  • lcg-cp issues with EMI releases (more than one)
    • specified timeout did not take effect (GGUS:86202)
    • needing additional --srm-timeout (GGUS:89163, GGUS:89892, GGUS:89998)
    • the patch for the second needs to be announced (the workaround should have been announced earlier).
    • how can we prevent such problems in the future?
  • Any news about KIT tape system / throughput?
Maria notes that Tier-1 representation has been very scarce recently; it is important to do everything possible to have the Tier-1s connect to the meeting.

Points raised at the last meetings to be followed up

CMS (I. Fisk)

  • CMS has enjoyed reasonably stable operations at the T1 level. Support over the holidays was good and the work was completed.

  • There have been some smaller problems with individual sites at the T2 level, which are being followed up individually; nothing systematic.

  • We have transitioned the Tier-0 infrastructure for the HI run; it is based on the framework used for the rest of the organized processing. We will begin decommissioning the old system in stages, beginning today.

  • We have finished the high-priority MC workflows for the winter conferences. There will be some stragglers, but they are mostly complete.

  • The large-scale re-reconstruction pass of all 2012 data has started
    • includes all parked data
    • expected to last until at least the end of April
    • This is the largest data processing attempted by CMS in terms of number of events.
    • We will use CERN after the HI run is over

  • Work on the HLT cloud continues; we are working on increasing its scale and stability.

  • Issues:
    • The missed CRL update over the weekend was disruptive. There are two items we wanted to bring up in this meeting:
      1. The error code the user receives is misleading
      2. The disruption caused by this failure is potentially worse than the security issue being addressed. Perhaps it should not be fatal?

  • A 2-hour EOS upgrade was requested: we would prefer it after the HI run is concluded, given that EOS is in the data-taking path.

  • The CMS document on the motivation for Disk/Tape Separation is attached to the agenda.
About the CRL issue, Maria hopes that there will be a dedicated discussion with the relevant people. Maarten mentions some possible improvements, but they would only come by the end of LS1.

LHCb

  • We are starting the reprocessing activity, which should last six weeks
  • we are using the ONLINE farm for simulation in parallel with the p-A data taking
  • there was a problem during the weekend due to CRL expiration: is there a way to avoid this kind of problem?
  • GridKa created a dCache SRM endpoint dedicated to LHCb one week ago; there were problems at the beginning in updating all the references from the old SRM endpoint to the new one
  • could we envisage that after the p-A run the daily WLCG operations meeting becomes twice a week?
    • see AOB

GGUS tickets

Tier-1 Grid services

Storage deployment

Site Status Recent changes Planned changes
CERN CASTOR 2.1.13-6.1; SRM-2.11 for all instances.

EOS:
ALICE (EOS 0.2.20 / xrootd 3.2.5)
ATLAS (EOS 0.2.25 / xrootd 3.2.7 / BeStMan2-2.2.2)
CMS (EOS 0.2.22 / xrootd 3.2.6 / BeStMan2-2.2.2)
LHCb (EOS 0.2.21 / xrootd 3.2.5 / BeStMan2-2.2.2)
Jan 21: update of EOSATLAS. Need to schedule updates for all the other EOS instances.
ASGC CASTOR 2.1.11-9
SRM 2.11-0
DPM 1.8.5-1
Dec 23-25: DPM upgraded to 1.8.5-1
Dec 28 storage down due to power cut, SIR in preparation
Next week limited bandwidth to storage for 10 Gb link maintenance, date TBA.
In February CASTOR upgrade to 2.1.12-10 (tests OK) or 2.1.13-6 (tests in progress). Exact date TBD.
BNL dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup)
http (aria2c) and xrootd/Scalla on each pool
None None
CNAF StoRM 1.8.1 (Atlas, CMS, LHCb)    
FNAL dCache 1.9.5-23 (PNFS, postgres 8 with backup, distributed SRM) httpd=2.2.3
Scalla xrootd 2.9.7/3.2.4-1.osg
Oracle Lustre 1.8.6
EOS 0.2.22-4/xrootd 3.2.4-1.osg with Bestman 2.2.2.0.10
   
IN2P3 dCache 1.9.12-16 (Chimera) on SL6 core servers and 1.9.12-24 on pool nodes
Postgres 9.1
xrootd 3.0.4
None None
KIT dCache
  • atlassrm-fzk.gridka.de: 1.9.12-11 (Chimera)
  • cmssrm-fzk.gridka.de: 1.9.12-17 (Chimera)
  • lhcbsrm-kit.gridka.de: 1.9.12-24 (Chimera)
xrootd (version 20100510-1509_dbg)
Migrated LHCb from gridka-dcache.fzk.de to lhcbsrm-kit.gridka.de and from PNFS to Chimera. Additionally we updated the dCache version from 1.9.12-17 to -24.  
NDGF dCache 2.3 (Chimera) on core servers. Mix of 2.3 and 2.2 versions on pool nodes.    
NL-T1 dCache 2.2.4 (Chimera) (SARA), DPM 1.8.2 (NIKHEF)    
PIC dCache 1.9.12-20 (Chimera) - doors at 1.9.12-23. Upgraded dCache doors to 1.9.12-23. None
RAL CASTOR 2.1.12-10
2.1.12-10 (tape servers)
SRM 2.11-1
XROOTD federated access now enabled for ATLAS. Testing 2.1.13-6.
TRIUMF dCache 1.9.12-19 with Chimera namespace    

FTS deployment

Site Version Recent changes Planned changes
CERN 2.2.8 - transfer-fts-3.7.12-1 updated to transfer-fts-3.7.12-1  
ASGC 2.2.8 - transfer-fts-3.7.12-1 None None
BNL 2.2.8 - transfer-fts-3.7.10-1 None None
CNAF 2.2.8 - transfer-fts-3.7.12-1    
FNAL 2.2.8 - transfer-fts-3.7.12-1    
IN2P3 2.2.8 - transfer-fts-3.7.12-1   None
KIT 2.2.8 - transfer-fts-3.7.12-1    
NDGF 2.2.8 - transfer-fts-3.7.12-1    
NL-T1 2.2.8 - transfer-fts-3.7.12-1    
PIC 2.2.8 - transfer-fts-3.7.12-1   None
RAL 2.2.8 - transfer-fts-3.7.12-1    
TRIUMF 2.2.8 - transfer-fts-3.7.12-1    

LFC deployment

Site Version OS, distribution Backend WLCG VOs Upgrade plans
BNL 1.8.0-1 for T1 and 1.8.3.1-1 for US T2s SL5, gLite Oracle ATLAS None
CERN 1.8.2-0 SLC5, gLite (upgrade to EMI2 / SLC6 ongoing) Oracle ATLAS, LHCb Upgrade to EMI2: LFC DB schema updates successfully done by the DBA on 23 Jan for ATLAS and 24 Jan for LHCb and SHARE; no problem with the new schema. The deployment of front-end nodes was stopped due to GSS errors (see INC:226493). The errors are solved by pointing to an Oracle 10 kit on AFS. This is not an ideal solution, because it makes the LFC service depend on AFS, as there is no Oracle 10 Instant Client for SLC6, and Oracle 10 is deprecated for security reasons. Please note that we asked the LFC developer for a build of the EMI2 LFC against Oracle 11 a while ago, but we have not got it yet.
CERN, test lfc-server-oracle-1.8.3.2-1 SLC6, EMI2 Oracle ATLAS Xroot federations  

Oliver announces for tomorrow an unofficial LFC build against Oracle 11, which will allow the CERN LFCs to be upgraded. EMI-3 will be built against Oracle 11, but only from the end of February. Maria worries that we might end up in a situation where Oracle 10 is no longer usable while the EMI-3 versions of the LFC or other Grid services (e.g. VOMS and FTS) are not yet sufficiently stable for deployment; she therefore proposes that EMI-2 versions of the affected components may need to be built against Oracle 11 as well. Maarten recommends bringing this up with EMI and Oliver suggests that tickets be created to speed up the process. Maarten will follow up.

Other site news

Data management provider news

AOB

In order to make it easier for the Tier-1 contacts to join, Maria proposes to have the baseline versions news, the Tier-1 storage services review and the experiment reports at the beginning of the meeting, and to make sure that the Tier-1s are properly reminded about the meeting.

It is agreed (as nobody ever objected) that with the end of the run the daily meeting will become twice weekly, on Mondays and Thursdays; the new schedule will start as of February 25 or March 4 (to be decided).

Maria reminds again that on January 31 there will be a meeting on the common analysis framework. The next operations coordination meeting will be on February 7.

Action list

  • Maarten will look into SHA-2 testing by the experiments when the new CERN CA has become available.
  • MariaD will follow up on the ongoing Vidyo problems. DONE, see INC:210753. In case the ticket is not accessible to everyone, the solution was: "Problem should be solved by gateway software update." Nevertheless, Vidyo problems regularly re-occur and are regularly reported in other such tickets.
  • MariaD will follow up with PES about the VOMS-GGUS synchronisation problem. This action is DONE. The answer is: VO members who change their DN/CA pair currently also need to update their own Groups/Roles themselves. Steve (the VOMS manager) offered to automate this update in future cases where a CA DN changes and thus affects many VO members.
  • Jeremy will follow up the review of the fall-back procedure for GOCDB (as discussed in the ATLAS report). DONE: see the final report
  • MariaD will update the WLCGCriticalServices twiki.
  • AndreaV will follow up on the HEPOSlibs meta rpm package. DONE: the meta-rpm and some documentation on twiki are available.
  • Tracking tools TF members who own Savannah projects should list them and report to the TF <wlcg-ops-coord-tf-tracktools@cern.ch> (which includes the Savannah and JIRA developers) what they wish to do with them (freeze / migrate to JIRA / other, specifying what).