
WLCG Operations Coordination Minutes, November 19th 2015

Highlights

Agenda

Attendance

  • local:
  • remote:
  • apologies:

Operations News

Middleware News

  • Issues:
    • Critical vulnerability broadcast by SVG on Friday 6 November affecting NSS (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183). All software whose SSL handshaking is based on Mozilla Network Security Services is affected, which includes RedHat 6 and 7 and their derivatives (for instance, libcurl uses NSS). EGI CSIRT set 2015-11-13 as the deadline for patching hosts. Sites failing to act and/or failing to respond to requests from the EGI CSIRT team risk site suspension. A minimal version-check sketch is given after this list.
    • This problem affects not only grid services: the CERN Security team also sent an email this week asking all service admins to patch their hosts.

  • T0 and T1 services
    • KIT
      • dCache upgraded to v 2.13.9
    • CERN
      • All EOS deployments upgraded to EOS 0.3.135-aquamarine
    • JINR
      • dCache upgraded to v 2.10.44
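
For illustration, a minimal sketch of the kind of check a site admin could run on an RPM-based host. The minimum version below is a placeholder, not the advisory value; take the real one from the SVG/vendor advisory for your distribution, and note that a production check should use rpm's own version comparison rather than the naive tuple comparison used here.

```python
#!/usr/bin/env python
"""Sketch: is the locally installed nss package at least the patched
version? MIN_NSS_VERSION is a placeholder, not the advisory value."""

import subprocess

MIN_NSS_VERSION = (3, 19, 1)  # hypothetical; see the SVG advisory

def installed_nss_version():
    # Query the RPM database for the installed nss version string.
    out = subprocess.check_output(["rpm", "-q", "--qf", "%{VERSION}", "nss"])
    return tuple(int(x) for x in out.decode().strip().split("."))

ver = installed_nss_version()
status = "OK" if ver >= MIN_NSS_VERSION else "VULNERABLE - please patch"
print("nss %s: %s" % (".".join(map(str, ver)), status))
```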

Tier 0 News

  • The LSF 9 upgrade of the WNs is in QA testing. The ATLAS Tier-0 LSF instance has been upgraded to v9; its clients are also in QA, and ATLAS will decide when to upgrade them.
  • The HTCondor capacity represents some 5% of the total batch capacity at CERN; we plan to move resources from LSF to HTCondor rather quickly to reach some 20-25%. The two ARC CEs are declared obsolete.
  • The Kilo-1 configuration that resulted from performance optimization work done jointly with the cloud team is now running on some 100 lxbatch hosts, so far with very satisfactory results and no indication of any unwanted effect. It will be extended to all hosts when the OpenStack Kilo release is deployed, estimated for the end of November.
  • IPv6 has been enabled in MyProxy and VOMS for testing purposes, in dual-stack (IPv4 and IPv6) mode; a connectivity-check sketch follows this list.
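
As an illustration of what dual-stack means operationally, a minimal sketch that checks whether a service answers TCP over both address families. The hostname is hypothetical; port 7512 is the MyProxy default.

```python
#!/usr/bin/env python
"""Sketch: verify a service resolves and accepts connections over both
IPv4 and IPv6. Substitute the real MyProxy/VOMS endpoint under test."""

import socket

HOST = "myproxy.example.org"  # hypothetical endpoint
PORT = 7512                   # MyProxy default port

def check_family(family, label):
    try:
        for info in socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM):
            s = socket.socket(info[0], info[1], info[2])
            s.settimeout(5)
            s.connect(info[4])
            s.close()
            print("%s: connected to %s" % (label, info[4][0]))
            return
        print("%s: no address records" % label)
    except socket.error as exc:  # covers gaierror and timeouts too
        print("%s: FAILED (%s)" % (label, exc))

check_family(socket.AF_INET, "IPv4")
check_family(socket.AF_INET6, "IPv6")
```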

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • generally normal to high activity
  • preparations for heavy ion reco jobs:
    • important changes in the code and workflow have been implemented to reduce the memory usage
    • they were tested with 2011 heavy ion reference data
    • if all goes well, for this year's heavy ion data the reco jobs will only need ~2.5 GB RAM
    • to be on the safe side, special arrangements were made with the sites that will receive heavy ion raw data
    • CNAF, KISTI, KIT and SARA have set up dedicated high memory queues
    • at CERN the jobs can request 2 cores and hence have twice the memory (see the arithmetic sketch after this list)
    • all setups have been tested with normal jobs
    • we thank the sites for the good support!
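
A minimal sketch of the memory arithmetic behind the multi-core trick, assuming (hypothetically) that batch memory is granted per allocated core; the numbers are illustrative.

```python
# Sketch: how many cores must a job request to fit its memory footprint
# when the batch system grants memory per core? Values are illustrative.
import math

mem_per_core_gb = 2.0  # assumed per-core memory grant
job_rss_gb = 2.5       # expected footprint of a heavy-ion reco job

cores = int(math.ceil(job_rss_gb / mem_per_core_gb))
print("request %d core(s) -> %.1f GB limit for a %.1f GB job"
      % (cores, cores * mem_per_core_gb, job_rss_gb))
# -> request 2 core(s) -> 4.0 GB limit for a 2.5 GB job
```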

ATLAS

  • Activity as usual
    • new record in parallel running slots: 250k, thanks to the impact of opportunistic resources like Sim@P1 and NERSC_Edison (together they contributed more than 50k slots)
  • Frontier and Squid: during the past few days we observed that some of the jobs we are running now (mc15b campaign) request an excessive amount of conditions data. This is creating trouble for some squids and Frontier servers. The problem is understood and fixed; no new tasks like this will be launched. The existing ones are almost over, so we will let them finish.
  • Heavy-ion data taking: we are ready for it. Since the processing time for HI data is very long, we are ready to use the Tier-1s/Tier-2s for reconstruction as well.
  • Deletion agents: the deletion agents were switched off between Sunday night and Wednesday, to allow time to recover data that was scheduled for deletion but was actually still needed by some people. They have now been restarted, but are struggling to keep up with the high volume of deletions.
  • PRODDISK has been decommissioned on all the Tier-2s (and the Tier-3s that wanted it).

CMS

  • Preparations for Heavy Ion running continuing
    • No issues so far from the Computing side
  • Very high load in the system
    • Last week sustained ~120k parallel jobs
    • Multi-billion-event MC RECO campaign ahead
    • Situation expected to stay like this for weeks

LHCb

  • Operations
    • Very high activity on distributed computing resources with user and simulation workflows
    • Some low-level data processing activities ongoing
    • LHCb will participate and take data in lead-ion runs until mid-December
  • Issues
    • Several days of failures at SARA when the SRM was overloaded by a local user (GGUS:117413, GGUS:117483)
    • Issues with tape movers at RRCKI (GGUS:117444, GGUS:117267)
    • A security vulnerability was reported in the LHCb setup script in CVMFS, which is sourced before every workflow. Under investigation.
  • Development / Outlook
    • Working on interface to HTCondor-CE

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

HTTP Deployment TF

The 5th TF meeting took place on 11 November: https://indico.cern.ch/event/459419

Minutes are attached to the agenda.

The TF now has a working Nagios probe, endpoint lists from the experiments, regular monitoring of the infrastructure (see links on the agenda) and a GGUS support unit. The TF is thus ready to do a "dry run" of its principal activity: helping sites to get their HTTP storage in shape. In the next couple of weeks we will run with a small group of volunteer sites to test and optimise the process, which will then be used to ticket and support all remaining sites.
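
For illustration, a minimal sketch of what a Nagios-style HTTP storage probe does: contact the endpoint and map the outcome to the Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL). The URL is hypothetical, and the TF's actual probe (linked from the agenda) additionally authenticates with an X.509 proxy and exercises WebDAV operations.

```python
#!/usr/bin/env python
"""Sketch of a Nagios-style probe: exit 0/1/2 = OK/WARNING/CRITICAL."""

import sys
import requests  # assumes python-requests is installed

URL = "https://storage.example.org:8443/webdav/atlas/"  # hypothetical

try:
    # Only checks that the endpoint answers HTTPS at all; a real probe
    # also authenticates and performs read/write WebDAV tests.
    r = requests.head(URL, timeout=10,
                      verify="/etc/grid-security/certificates")
except requests.RequestException as exc:
    print("CRITICAL: %s unreachable (%s)" % (URL, exc))
    sys.exit(2)

if r.ok:
    print("OK: %s returned %d" % (URL, r.status_code))
    sys.exit(0)
print("WARNING: %s returned %d" % (URL, r.status_code))
sys.exit(1)
```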

Information System Evolution


  • The first draft of the Future Use Cases document is now available for comments. The deadline to provide input is 24.11. The document will be presented at the December GDB.
  • There was a TF meeting on 12.11 (Minutes). All the experiments presented their plans to move to GLUE 2.0 and proposals to simplify the interactions with the IS. Several action items were defined after the meeting:
    • Define a roadmap to stop publishing GLUE 1.3 in coordination with EGI and OSG.
    • Information validation:
      • Document existing validation mechanisms (this is now documented in the TF wiki)
      • Actively validate information that is important for WLCG. Feedback from experiments is needed (especially ATLAS). In particular, validation of the Waiting Jobs GLUE attribute for ALICE has been implemented (SSB); a query sketch is given after this list.
      • It was agreed that after the feedback collected so far, it doesn't make sense to define a GLUE 2.0 profile for WLCG.
      • There are ongoing discussions with the MW Officer to integrate glue-validator within the different services running a resource BDII, to improve information quality before it gets published. This will be proposed at the URT meeting on 14 December.
    • Study the proposal of publishing a subset of the current GLUE schema that is useful for WLCG in JSON over HTTPS. Andrew McNab presented his work on publishing Vac/Vcycle resources using this approach; a schematic example follows this list.
  • Next meeting is on 26.11 (Agenda)
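
For illustration, a minimal sketch of how the Waiting Jobs attribute can be read back from a site BDII for validation. The hostname is hypothetical; port 2170 and the o=glue GLUE 2.0 base are standard, and the ldap3 Python module is just one possible client.

```python
#!/usr/bin/env python
"""Sketch: read GLUE2ComputingShareWaitingJobs from a site BDII."""

from ldap3 import ALL, Connection, Server

BDII = "site-bdii.example.org"  # hypothetical site BDII

server = Server(BDII, port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)  # BDIIs allow anonymous reads

conn.search(search_base="GLUE2GroupID=grid,o=glue",
            search_filter="(objectClass=GLUE2ComputingShare)",
            attributes=["GLUE2ShareID",
                        "GLUE2ComputingShareWaitingJobs"])

for entry in conn.entries:
    print(entry.GLUE2ShareID, entry.GLUE2ComputingShareWaitingJobs)
```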
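And a schematic example of what a JSON rendering of a small GLUE subset served over HTTPS might look like. The field names mirror GLUE 2.0 attributes, but the exact JSON schema is an assumption; the TF had not fixed one at this point.

```python
# Sketch: a GLUE-subset record as it might be published in JSON.
import json

record = {
    "GLUE2EndpointID": "https://ce01.example.org:8443/arex",  # hypothetical
    "GLUE2EndpointInterfaceName": "org.nordugrid.arc",
    "GLUE2ComputingShareRunningJobs": 310,
    "GLUE2ComputingShareWaitingJobs": 42,
}

print(json.dumps(record, indent=2))
```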

IPv6 Validation and Deployment TF


Middleware Readiness WG


Multicore Deployment

Network and Transfer Metrics WG


  • perfSONAR collector, datastore, publisher and dashboard in production (stable operations)
  • Additional monitoring metrics will be added to psomd.grid.iu.edu to capture the collector's efficiency and report on the freshness of the metadata in the OSG Datastore (for each sonar).
  • perfSONAR 3.5: 205 sonars were updated; ALL sites are encouraged to enable auto-updates for perfSONAR.
  • Pilot projects: ATLAS PanDA; the perfSONAR stream is now in ATLAS Network Analytics (https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ATLASAnalytics), with several Kibana dashboards available (site link stats). Jorge and Ilija are working on a cost matrix that feeds the round-trip time and packet loss into the Mathis formula to infer bandwidth (predictions based on this model will follow); the formula is sketched after this list.
  • Pilot projects: the LHCb DIRAC bridge is now functional, processing the perfSONAR stream and inserting packet-loss metrics into DIRAC, including mapping to LHCb sites. Henryk, Federico and Stefan are working on this.
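
For reference, the Mathis et al. model bounds single-stream TCP throughput by (MSS / RTT) * (C / sqrt(p)), with C = sqrt(3/2), roughly 1.22. A small sketch with illustrative perfSONAR-style inputs:

```python
# Sketch of the Mathis et al. TCP model: throughput <= (MSS/RTT) * C/sqrt(p).
import math

MSS_BITS = 1460 * 8       # typical maximum segment size, in bits
C = math.sqrt(3.0 / 2.0)  # ~1.22 for periodic loss

def mathis_mbps(rtt_ms, loss_rate):
    """Upper bound on single-stream TCP throughput, in Mbit/s."""
    return (MSS_BITS / (rtt_ms / 1000.0)) * (C / math.sqrt(loss_rate)) / 1e6

# e.g. a transatlantic path: 90 ms RTT, 0.01% packet loss
print("%.1f Mbit/s" % mathis_mbps(90.0, 1e-4))  # ~15.9 Mbit/s
```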

RFC proxies

  • NTR

Squid Monitoring and HTTP Proxy Discovery TFs

Action list

Creation date | Description | Responsible | Status | Comments
2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work with the new algorithm; otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets were opened for incorrect SRM and MyProxy certificates, 6 already closed. OSG and EGI were contacted (Maarten also alerted the few affected EGI sites). A certificate check sketch follows this table.
2015-10-01 | Follow up on reporting of number of processors with PBS | John Gordon | ONGOING |
2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, so for the moment the only way is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting
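
For illustration, a minimal sketch of the check behind the certificate action: the stricter RFC 2818-style name matching in newer globus-gssapi-gsi requires the service hostname to match the host certificate, and Python's ssl.match_hostname applies the same rule. The endpoint is hypothetical.

```python
#!/usr/bin/env python
"""Sketch: does a service's host certificate match its hostname?"""

import socket
import ssl

HOST, PORT = "srm.example.org", 8443  # hypothetical SRM endpoint

context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
context.verify_mode = ssl.CERT_REQUIRED
context.load_verify_locations(capath="/etc/grid-security/certificates")

try:
    sock = context.wrap_socket(
        socket.create_connection((HOST, PORT), timeout=10))
    # The RFC 2818-style check that new Globus versions also enforce.
    ssl.match_hostname(sock.getpeercert(), HOST)
    print("OK: host certificate matches %s" % HOST)
    sock.close()
except ssl.CertificateError as exc:
    print("HOSTNAME MISMATCH: %s" % exc)
except (ssl.SSLError, socket.error) as exc:
    print("connection or chain verification failed: %s" % exc)
```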

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion

Specific actions for sites

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details | ATLAS | - | - | None | -
2015-09-03 | T2s are requested to change the analysis share from 50% to 25%, since ATLAS runs centralised derivation production for analysis | ATLAS | - | - | a.s.a.p. | CLOSED
2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - | - | None yet | CLOSED

AOB

-- AndreaSciaba - 2015-11-17
