WLCG Operations Coordination Minutes - October 2nd, 2014

Agenda

Attendance

Operations News

  • HEP_OSlibs-7.0.0-0.el7.cern.x86_64.rpm for CentOS7 has been released. The metarpm is based on the previous SL6 one, with the packages that are no longer present in CERN CentOS7 removed. More information:
  • The CHEP2015 deadline for abstracts is the 15th of October 2014.
  • The Shellshock vulnerability is a set of security bugs affecting the bash shell. It was disclosed on 24th September, and the WLCG security experts have since been evaluating the impact of these vulnerabilities on the WLCG infrastructure. Some perfSONAR nodes have been found to be compromised. The security teams, as well as WLCG Operations, strongly recommend that all sites terminate their perfSONAR instances as a precautionary measure until the attacks are contained, unless the Bash packages on the perfSONAR instance(s) were patched by Friday 26 Sep and the site can actively confirm, by checking network logs, that NO IRC traffic was emitted from its hosts. Investigations of other services are ongoing. Web servers are particularly targeted.
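
The perfSONAR recommendation above amounts to a two-condition rule, which can be sketched as follows. This is only an illustration: the log format (`<src_ip> <dst_ip> <dst_port>` per line) and the list of IRC ports are assumptions, not something specified by the security teams, and IRC traffic on non-standard ports would not be caught.

```python
from datetime import date
from typing import Iterable, Optional

PATCH_DEADLINE = date(2014, 9, 26)  # "Friday 26 Sep" from the minutes
# Assumption: ports commonly used by IRC; real traffic may use other ports.
IRC_PORTS = set(range(6660, 6670)) | {6697}


def saw_irc_traffic(log_lines: Iterable[str]) -> bool:
    """Return True if any line targets an assumed IRC port.

    Hypothetical log format: '<src_ip> <dst_ip> <dst_port>' per line.
    """
    for line in log_lines:
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit() and int(parts[2]) in IRC_PORTS:
            return True
    return False


def should_terminate(bash_patched_on: Optional[date], log_lines: Iterable[str]) -> bool:
    """Keep the instance only if bash was patched in time AND no IRC traffic was seen."""
    patched_in_time = bash_patched_on is not None and bash_patched_on <= PATCH_DEADLINE
    return not (patched_in_time and not saw_irc_traffic(log_lines))
```

For example, a host patched on 25 Sep with only HTTPS connections in its log would be kept, while an unpatched host or one with any connection to port 6667 would be terminated.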

Middleware News

  • Baselines:
    • News from the EGI URT meeting on Monday
      • New versions of the UI and WN will hopefully be included in the next UMD release, foreseen for the end of October.
      • The dCache 2.2.x decommissioning deadline is 31-10-2014. The baseline for the 2.6.x series is 2.6.34, which fixes issues with Brazilian CA certificates.
      • Globus 6 is in epel-testing and PTs are invited to test compatibility. We are aware of FTS and DPM being tested; so far no blocking issues have been discovered, so Globus 6 will soon go to stable (the exact date will be discussed during the next URT).

  • MW Issues:
    • The xroot package deployed with ROOT 6 breaks access to dCache storage, affecting LHCb. The problem is on both the client and the service side. A fix for dCache has been developed but not yet released; for the moment there will be a workaround in ROOT.
    • Installation of several grid products is broken. CREAM, WMS, L&B, UI and WN cannot be installed at the moment because the classads package (a dependency of all of them) was declared orphaned in EPEL and retired from the EPEL repository. For the moment the package is going to be included in the UMD and EMI third-party repos while waiting for a maintainer. CESNET should take care of it, but they are not happy with this extra effort.

  • T0 and T1 services
    • IN2P3
      • dCache upgrade to 2.6.34
    • NL-T1
      • dCache upgrade to 2.6.34
    • KIT
      • Update of dCache for CMS and LHCb to 2.6.34
      • Update xrootd configuration for FAX and AAA to respect EU privacy policy Thursday 08:00 - 08:30 UTC.
      • Update for LHCb dCache to next version that fixes issues with ROOT6 not scheduled yet (new dCache release required first).
    • JINR-T1
      • one dCache instance upgrade to 2.6.34
      • one dCache instance running 2.2.27 to be upgraded to 2.6 or 2.10 in early November
    • BNL
      • FTS upgrade to 3.2.27
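
As a rough illustration of how a site could compare a dCache instance against the 2.6.34 baseline quoted above, here is a minimal sketch. It assumes purely numeric dotted version strings, and it assumes that later series such as 2.10 count as meeting the baseline; neither assumption comes from the minutes.

```python
BASELINE = (2, 6, 34)  # dCache 2.6.x baseline from the minutes


def parse_version(text):
    """Turn a dotted version string such as '2.6.34' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_baseline(version, baseline=BASELINE):
    """Compare versions component by component (tuple ordering)."""
    return parse_version(version) >= baseline


# Versions mentioned in the T0/T1 reports above.
for version in ("2.6.34", "2.2.27"):
    status = "OK" if meets_baseline(version) else "below baseline"
    print(version, status)
```

Comparing tuples of integers rather than raw strings avoids the pitfall where "2.10" would sort before "2.6" alphabetically.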

Oracle Deployment

  • IT-DB new hardware installations in: CERN computer centre and Wigner.
  • Timeline: testing in October, production move - by the end of 2014. Schedule will be updated accordingly.
  • The following table includes only those DB services that concern WLCG

| Database | Comment | Destination | Upgrade plans, dates |
| ATONR | Data Guard for Atlas Online | Wigner | |
| ATLR | Data Guard for Atlas Offline | Wigner | |
| ADCR | Data Guard for ADC | Wigner | |
| CMSR | Data Guard for CMS offline | Wigner | |
| LHCBR | Data Guard for LHCB Offline | Wigner | |
| LCGR | Data Guard for WLCG | Wigner | |
| CMSONR | Data Guard for CMS Online | Wigner | |
| LHCBONR | Data Guard for LHCB Online | Wigner | |
| CASTORNS | Data Guard for Castor Nameserver | Wigner | |
| ATONR | Active Data Guard for Atlas Online | CERN CC | |
| ALIONR | Active Data Guard for Alice Online | CERN CC | |
| ADCR | Active Data Guard for ADC | CERN CC | |
| LHCBONR | Primary Database for LHCB Online | CERN CC | |

Tier 0 News

  • AFS UI: Waiting for feedback from the experiments (see action list)
  • WMS service was decommissioned on October 1st
  • plus5/batch5: user feedback fully analyzed, no major showstoppers. Some alternatives discussed with different user groups. Lxplus5 will be stopped in October, exact date being discussed.
  • Next job efficiency meeting on October 10th, https://twiki.cern.ch/twiki/bin/view/PESgroup/MeetingHeld10thOct2014

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • CERN: investigation of job failure rates and inefficiencies
    • the batch team adjusted the parameters of the ALICE queue - thanks!
    • the effects should become visible over the coming days
    • the pilot ("job agent") logic has been checked and a potential improvement is being looked into
  • CERN: HLT farm running as an ALICE site since Sep 24
    • being ramped up over the coming months

ATLAS

  • DC14
    • digi+reco Run2 in 8-core mode will finish in about a week from now
    • some more simulation samples launched
    • further AOD2AOD on reprocessed data will be launched in one week (0.5PB)
  • Multicore recommendations for 8-core reconstruction
    • Allocate 16GB physical memory per job
    • if limiting memory per process: 3GB RSS and/or 5GB VMEM
    • For cgroup-enabled sites: total RSS of the job should be 16GB
  • Deployment of ATLAS MCORE queues
    • more than 70k cores were used this week for multicore jobs exclusively
    • after fixing initial memory issues, the digi+reco processing is progressing very fast, at up to 50M events per day - thanks to the sites for fast action
    • to all the sites, please continue the deployment of multicore queues
    • serial production tasks in the future will be limited
  • SAM3 tests
    • all ARC-CE sites fixed their configuration, and the ATLAS_CRITICAL tests have been effective since 1 October.
  • Rucio and Prodsys-2 commissioning ongoing, still no fixed date for deployment
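
The multicore memory recommendation above (16GB of physical memory per 8-core job) works out to 2GB per core. A minimal sketch of the resulting slot arithmetic for a worker node follows; the node sizes are hypothetical, and memory reserved for the operating system is ignored.

```python
CORES_PER_JOB = 8    # 8-core reconstruction jobs, from the minutes
JOB_MEMORY_GB = 16   # recommended physical memory per job, from the minutes


def memory_per_core_gb():
    """Memory per core implied by the recommendation."""
    return JOB_MEMORY_GB / CORES_PER_JOB


def multicore_slots(node_cores, node_memory_gb):
    """Number of 8-core jobs a node can host, limited by cores or memory."""
    return min(node_cores // CORES_PER_JOB, node_memory_gb // JOB_MEMORY_GB)


# Hypothetical node: 32 cores but only 48GB of RAM, so memory is the limit.
print(memory_per_core_gb())
print(multicore_slots(32, 48))
```

On such a node the cores would allow four jobs, but the memory only three, which is the kind of mismatch that caused the initial memory issues to surface.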

CMS

  • Processing overview:
    • Not much work in the system
    • Preparations for new campaign (PHYS14) ongoing
  • Scale testing of HTCondor and GlideinWMS by OSG colleagues
    • Launch many pilots on one acquired job slot to reach high scales
    • Caused some trouble at sites
      • Firewalls not able to handle that many connections
      • Maximum number of NAT connections exhausted
      • Report problems to CMS (via ticket or HyperNews) - we negotiate with the testers
  • Problems with Dashboard reporting
    • Dashboard team presently working on a job monitoring collector with UDP
  • AFS-UI at CERN
    • Analysis of AFS access logs
      • ~45 individual analysis users - usage likely to decrease after the closing of lxplus5
      • ~5 users from central production
    • Extending the UI availability beyond Oct 31st is still preferred
      • Perhaps even needed - migration path of some services still to be understood
  • Reminders for sites
    • Participate in space monitoring (compare last meeting)
    • Update xrootd fallback configuration
      • Opened tickets to various sites - quite some took action - thanks!
    • Add "Phedex Node Name" to site configuration

LHCb

  • Access to dCache storage sites is broken when going through ROOT6/xrootd: the negotiation for a vector read fails and ROOT subsequently crashes. A fix has been proposed by the dCache team. On the ROOT side, an intermediate stopgap solution can be deployed until the dCache fix is released and deployed.
  • AFS UI references have been checked and eventually cleaned from all LHCb distributed computing clients and tools. The retirement of the UI should be possible for LHCb.
  • It was discovered that the WLCG reports for CERN contain statistics not only for worker nodes but for all resources used by the experiment (including VOBOXes, build nodes, etc.). This makes it hard to compare e.g. job efficiencies with other sites, and LHCb proposed to publish only worker node figures.
  • A new stripping campaign is currently being prepared by LHCb. This campaign will produce a "legacy dataset" for 2010-12 data. The plan is also to run a partial reconstruction and include tagging information, which will result in more work to be executed. The net result for the sites is that staging will most likely not be the bottleneck for this operation.
  • New VOMS servers are currently being tested in certification by LHCbDIRAC with full workflows.
  • IPv6 tests done for LHCbDIRAC, network configuration of LHCbDIRAC and authentication of services/agents across different machines is working

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

Machine/Job Features

  • NTR

Middleware Readiness WG

The 6th meeting of the WG took place yesterday, as planned. Please follow the presentations of the MW Officer Andrea M. and the MW Package Reporter developer L. Cons, linked from the agenda, to see the products in the pipeline for Readiness verification and the scenarios for the Collector/Reporter under evaluation. Actions were completed. Next meeting on Wed Nov 19th at 4pm CET. Please do note the date/time!

Multicore Deployment

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • node firewalls are being opened selectively per experiment
      • ALICE OK
      • LHCb looking good
      • CMS in progress
        • FNAL FTS-3 config needed fixing
      • ATLAS in progress
        • BNL FTS-3 config needed fixing
    • EGI will soon conclude their campaign to get all EGI sites to recognize the new servers for the Ops VO
      • the port for Ops will be opened at that time
    • our special routing rules have been extended until Tue Nov 18 (sic)
      • those rules allow remote sites to get "Connection refused" instead of timeouts
      • by that time we still have 1 week to fix unwanted behavior
      • we should have things in good shape long beforehand...
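
The difference between "Connection refused" and a timeout, which the special routing rules above rely on, can be observed from a remote site with a simple TCP probe. The sketch below is only an illustration: it classifies the outcome of a connection attempt, and any host name or port passed to it is the caller's own choice, since the minutes do not list the VOMS endpoints.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout or other error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote side (or a router acting for it) actively rejected the attempt.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

A "refused" result comes back almost immediately, whereas a "timeout" blocks the client for the full timeout period, which is why the former is the preferable failure mode while the ports are still closed.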

WMS Decommissioning TF

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

  • No progress to report again this meeting

Network and Transfer Metrics WG

  • Details on the Shellshock vulnerabilities and their impact on perfSONAR are available at https://twiki.cern.ch/twiki/bin/view/LCG/ShellShockperfSONAR
  • We recommend ALL sites that didn't patch bash before Friday Sep 26 to terminate their instances and wait until perfSONAR 3.4 is released
  • perfSONAR 3.4 to be released on Mon Oct 6, WLCG and EGI broadcasts will be sent with the installation instructions
  • perfSONAR operations meeting this Friday (Oct 3 at 3PM), agenda at https://indico.cern.ch/event/342995/
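
For sites unsure whether the bash on a host is patched, the widely circulated probe for the original Shellshock bug (CVE-2014-6271) can be wrapped in a short script, sketched below. It covers only that first CVE and not the follow-up ones, and a negative result says nothing about whether the host was already compromised; the marker string and function names are illustrative.

```python
import os
import subprocess

MARKER = "vulnerable"


def output_shows_vulnerability(stdout):
    """A vulnerable bash executes the injected 'echo' and prints the marker."""
    return MARKER in stdout.splitlines()


def bash_is_vulnerable(bash="bash"):
    """Run bash with a crafted function-like environment variable.

    A vulnerable bash runs the command that trails the function definition
    while importing the environment; a patched bash ignores it.
    """
    env = dict(os.environ)
    env["x"] = "() { :;}; echo " + MARKER
    result = subprocess.run(
        [bash, "-c", "echo probe-finished"],
        env=env,
        capture_output=True,
        text=True,
        timeout=10,
    )
    return output_shows_vulnerability(result.stdout)
```

Calling `bash_is_vulnerable()` on a host returns True only if the system bash still executes code smuggled in through the environment.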

Action list

  1. ONGOING on the WLCG middleware officer: to take the steps needed to enable the CVMFS UI distribution as a viable replacement for the AFS UI.
    • The CVMFS repository grid.cern.ch contains the emi-ui-3.7.3 SL6 (path /cvmfs/grid.cern.ch/emi-ui-3.7.3-1_sv6v1) and also provides CA certs, CRLs and VOMS lsc files. Given the new UI release, we can also plan to upload the UI v3.10.0.
    • TODO: clarify the responsibilities (including ticketing etc.) for the maintenance of the CVMFS UI, in particular running fetch-crl
    • UPDATE: Steve said that the maintenance of the grid.cern.ch CVMFS server is under PES responsibility, and so is the fetch-crl update process. In case of issues the Configuration Management SE should be addressed.
  2. ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
  3. ONGOING on Andrea S.: to understand with EGI if it is possible to bypass the validation of VO card changes in the case of the LHC VOs. Agreed by Peter Solagna.
  4. ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Report about showstoppers. Status: the SAM team made a proposal on the steps to be taken to enable SAM. ATLAS is following up to make sure that the new CEs are correctly visible in AGIS, while for the CMS VO feed they will be taken directly from OIM. The plan is first to test HTCondor-CEs in preproduction and later switch to production. It is not foreseen to monitor GT5 and HTCondor endpoints on the same host at the same time.
    • No showstopper for SAM. Need to discover topology; publishing queues in BDII not necessary for SAM probes since Condor can choose the queue based on the proxy.
  5. ONGOING on Alessandro DG: find out from OSG about plans for publication of HTCondor CE in information system, and report findings to WLCG Ops. To be followed up with Michael Ernst and Brian Bockelman.

AOB

-- NicoloMagini - 19 Sep 2014

Topic revision: r24 - 2014-10-02 - AndreaManzi