WLCG Operations Coordination Minutes, July 30th 2015




  • local: Maria Alandes (Minutes), Maria Dimou
  • remote:

Operations News

  • Following yesterday's GGUS release, this month's Did you know? article reminds readers of existing GGUS features requested in the WLCG Site Survey last autumn, as decided at this meeting a month ago.
  • The new WLCG Operations Portal is now online: http://wlcg-ops.web.cern.ch/. The portal aims at sharing information related to WLCG Operations with sys admins, experiments and TF/WG members. For example, sys admins will be able to find a summary of action items (like the ones presented in the Ops meeting) and useful documentation for the services they have to maintain. Please check the portal and do not hesitate to send us your feedback! We plan to add some dynamic content with articles covering topics of interest to our community. Stay tuned and help us keep the portal useful to everyone and up to date!

Middleware News

  • Baselines:
    • Support for dCache 2.6.x ended in May 2015. The deadline for decommissioning is 21/09/2015, and starting from 31/08/2015 sites still running dCache 2.6.x will be ticketed (more details at https://wiki.egi.eu/wiki/Software_Calendars#dCache_v._2.6.x). ~20 instances are still running 2.6.x (no T1s).

  • Issues
    • Critical vulnerability affecting Red Hat 5, 6 and 7 broadcast by EGI CSIRT (https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/libuser-2015-07-24) which allows a local root exploit. The vulnerability can only be exploited via a local user account, so only UIs where access is granted via a local passwd file could be affected. A fix for the affected package (libuser) is available for Red Hat 6 and 7, but not for Red Hat 5, so those sites should either upgrade to a newer OS release or apply the workaround described at https://access.redhat.com/articles/1537873. According to the CSIRT procedures, sites not patched by 31/07 risk site suspension.

  • T0 and T1 services
    • JINR
      • dCache upgraded to 2.10.36
    • NL-T1
      • DPM upgraded to v 1.8.9 at NIKHEF in order to fix a data transfer issue.
    • PIC
      • dCache upgraded to 2.10.37

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports


ALICE

  • high activity
    • new record of 83k concurrent jobs briefly reached on Jul 29
  • CERN:
    • raw data copies from Point 2 to CASTOR were timing out (GGUS:115145)
      • raw data reconstruction jobs were keeping many disk server slots busy
      • many more slots appeared to be occupied by stale transfers
      • to be followed up further with the CASTOR devs
    • job submissions became really slow multiple times (GGUS:115153 and GGUS:115238)
      • some issues were cured on the Argus side
      • the real cause of such problems has not yet been identified
  • NDGF reported inefficient data transfers and noise in their logs
    • due to 2 transfer methods failing before the 3rd succeeds
    • Xrootd client only checks if the source supports 3rd party copies
      • also the destination should be checked
      • a bug has been opened for the Xrootd devs
    • meanwhile a workaround has been applied on the ALICE side


ATLAS

  • Activity as usual, no major issues.
  • working on the dedicated ATLAS Tier-0 cluster to understand "slow" nodes (not nodes with 10% lower performance, but nodes running at half or a third of the speed of the others). A HammerCloud stress test with single-core analysis jobs has now been set up; it is not yet able to fully saturate the cluster (currently 12k slots out of 14.5k). Investigating with experts.
  • we are reviewing the ATLAS Central Service monitoring: procedures (and twiki pages) are being set up
  • minor: GGUS had an unscheduled downtime. It was published in GOCDB, but we were not aware of it. We suggest that an email be sent to atlas-adc-crc at cern.ch next time.
  • minor: the Kibana instance meter.cern.ch was down; the issue was announced in the IT Status Board. We discussed with Pedro (IT monitoring team) that ATLAS would like to send a GGUS team ticket for such issues, and he agreed.
  • a draft KB article is in preparation for the issue of users contacting CERN directly for ATLAS issues


CMS

  • Main production activities
    • PromptRECO (Tier-0): No major infrastructure problems
    • DIGI-RECO for Run2 and Upgrade: using all T1s and around 15 T2s
    • Continued GEN-SIM production
  • Assignment of custodial location of Primary Datasets to T1 sites
    • One tape copy always at CERN, 2nd tape copy at another T1
    • First 50ns data went all to CERN and FNAL
    • The distribution of forthcoming data to all T1 sites is being iterated on
  • Operational Issues
    • Oversubscribed PIC disk space
      • The production system queried a bad source for the available disk space
      • Sorted out with good support from the PIC team
      • Improvements to the tools are under way
    • A dataset needed by a SAM test was accidentally removed at several sites
    • Bad HammerCloud results at many sites under investigation
      • Appears to be a monitoring issue - not a site problem
  • CMS User tickets in SNOW
    • CMS has a Functional Element (FE) "CMS Support"
    • This FE should be used to route CMS-related user issues
    • Supporters will help directly or forward the user to the appropriate channel


LHCb

  • Operations
    • Currently finishing a restripping of the Run1 legacy data and of the 50 ns Run2 ramp-up data
    • Discussions with the CERN/LSF team about the queue capabilities; problems were found both in LSF and in DIRAC (GGUS:115027)
    • Preparations for the 25ns ramp up ongoing.
  • Developments
    • HammerCloud testing for LHCb is currently being revitalized. The probe will check the possibility to run user analysis jobs with protocol access at sites.
    • perfSONAR data extraction from WLCG sources is almost finished; currently working on publishing the data into LHCbDIRAC

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • NTR

Machine/Job Features

  • A Nagios probe checking the availability and sanity of machine/job features has been developed. It is currently running in preproduction for the LHCb SAM instance. Results can be seen at http://cern.ch/go/Gzn8. LHCb sites providing MJF are:
    • CERN
    • GRIDKA
    • LPNHE
    • Imperial College
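The probe's core check can be sketched as follows. This is a hypothetical illustration, not the actual probe code: per the MJF specification, $MACHINEFEATURES and $JOBFEATURES point to directories where each file name is a feature key and the file content is its value. The particular required keys checked below (hs06, wall_limit_secs) are assumptions for illustration; the real probe may demand a different set.

```python
# Hypothetical sketch of an MJF sanity check, not the actual Nagios probe.
import os
from pathlib import Path

def read_features(feature_dir):
    """Return a dict of all feature keys/values found in feature_dir."""
    directory = Path(feature_dir)
    if not directory.is_dir():
        return {}
    return {f.name: f.read_text().strip()
            for f in directory.iterdir() if f.is_file()}

def mjf_status(environ=os.environ):
    """Check that both MJF directories are advertised and hold the expected keys.

    The required keys below are illustrative assumptions.
    """
    required = {"MACHINEFEATURES": ["hs06"],
                "JOBFEATURES": ["wall_limit_secs"]}
    problems = []
    for var, keys in required.items():
        path = environ.get(var)
        if not path:
            problems.append("%s not set" % var)
            continue
        features = read_features(path)
        for key in keys:
            if key not in features:
                problems.append("%s missing key %s" % (var, key))
    return "OK" if not problems else "CRITICAL: " + "; ".join(problems)
```

A Nagios wrapper would map the returned status string to the usual exit codes (0 for OK, 2 for CRITICAL).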

Middleware Readiness WG

Multicore Deployment

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

Network and Transfer Metrics WG

  • Successfully tested publishing of the perfSONAR results to the message bus directly from the OSG collector. A possible SLA to run this as a production service in collaboration with OSG is being discussed.
  • The OSG datastore is on track to go into production at the end of July. It will be a service provided to WLCG, storing all the perfSONAR data and providing an API.
  • Started testing the proximity service, which helps to map sonars to storage endpoints and thus enables integration of the network and transfer metrics.
  • Review of the experiments use cases was presented/discussed at the last meeting, see slides for details (https://indico.cern.ch/event/393101/)
  • FTS performance study update - see slides for details (https://indico.cern.ch/event/393101/), observations from the report so far:
    • Peak transfer rates between Europe and North America are less asymmetric than they were last month (to be followed up)
    • Almost all incoming to BNL uses TCP=1 (Alejandro confirmed this is how BNL is configured right now, the other FTS instances use auto-tuning)
    • CMS T1s have better transfer rates compared to ATLAS and LHCb (to be followed up)
    • CMS uses TCP=1 more often than ATLAS and LHCb for large files
    • TCP stream=1 transfers do time out about 2-3% of the time; however, the timeouts are concentrated at a few sites.
    • Throughput dependence on TCP streams possibly understood (see http://egg.bu.edu/lhc/fts/docs/2015-05-26-status/results_so_far.pdf)
  • perfSONAR operations status
    • Agreed to establish WLCG-wide meshes for top 100 sites (based on the contributed storage and location). This will enable full mesh testing of latencies, traceroutes and throughput (ongoing).
    • ESnet is interested in the perfSONAR configuration interface developed for WLCG; a design document for an open-source project based on it is currently being discussed.
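The throughput dependence on TCP streams mentioned above can be illustrated with the standard bandwidth-delay-product argument: each stream is limited to roughly window/RTT, while the aggregate is capped by the bottleneck link. This is a back-of-the-envelope model, not taken from the FTS report, and all numbers below are illustrative assumptions.

```python
# Back-of-the-envelope model of throughput vs. number of TCP streams.
# Each stream is limited by window/RTT (the bandwidth-delay product);
# the aggregate is capped by the bottleneck link capacity.
# All numbers are illustrative assumptions, not FTS measurements.

def tcp_throughput_mbps(streams, window_bytes, rtt_s, link_mbps):
    """Aggregate throughput (Mbit/s) of `streams` parallel TCP connections."""
    per_stream_mbps = window_bytes * 8 / rtt_s / 1e6  # window/RTT limit
    return min(streams * per_stream_mbps, link_mbps)

# With a 4 MB window on a 150 ms transatlantic path, a single stream tops
# out near 213 Mbit/s, so filling a 10 Gbit/s link needs many streams; on
# a 2 ms intra-European path a single stream already reaches the link cap.
```

This is consistent with long-RTT (e.g. transatlantic) paths benefiting from multiple streams while short-RTT paths see little difference.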

HTTP Deployment TF

Information System Evolution

  • The first TF meeting took place last week (agenda, minutes)
    • It was agreed to implement a set of easy fixes in REBUS. For more details, please check the REBUS known issues page.
    • A set of action items was defined; for more details, please check the Task tracking and timeline page. A summary below:
      • Requirements to remove information (Physical CPU) from REBUS or to change how information is collected (HS06) will be followed up
      • Agree on a better definition of Installed Capacities, or even decide to change this name to "Available Capacities" or something similar
      • Discuss at the MB the possibility of adding T3s and also publishing pledges per site in REBUS
  • A draft document describing the use cases from experiments and project activities relying on the information system has been circulated among TF members for their contributions. It will be presented at a future MB (date to be confirmed), although we aim to have the document ready by the end of August.

Action list

Creation date | Description | Responsible | Status | Comments
2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1.

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | ALICE, ATLAS, CMS have made progress after discussing with the T0 manager. They will present at the next meeting. | July 30 | ~40%

Specific actions for sites

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
2015-06-18 | Some sites have still not enabled multicore accounting | All | Multicore Deployment | Instructions here | a.s.a.p. | Almost DONE. HERE is the list of the remaining still pending sites.
2015-06-04 | ALL ATLAS sites implementing a cap to their multicore resources (whether their configuration is dynamic just for a portion of nodes or it is a static partition) should review the cap to give 80% of the ATLAS production resources to multicore. As a reminder, the shares for ATLAS jobs are as follows: T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2s. More info here | ATLAS | Multicore | None | | CLOSED
2015-06-04 | LHCb T1s are requested to make sure that all the RAW data will be stored on the same tape set in each tape system when it is feasible | LHCb | - | More details in GGUS:114018 | |
2015-06-18 | CMS requests an adjustment of the Tier-1 fair share target for the following VOMS roles: /cms/Role=production 90% (was 95%), /cms/Role=pilot 10% (was 5%). Note that for CMS SAM tests the role /cms/Role=lcgadmin is used; it basically needs very little fair share but should be scheduled asap to avoid the tests timing out. Overall at least 50% of the pledged T1 CPU resources should be reachable via multi-core pilots (this is as before - just mentioned for completeness) | CMS | | None yet | | CLOSED (confirmed regarding config). Verification is longer term
2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - | None yet | | ~10 T2 sites missing, ticket open


GGUS: How do users (e.g. VO shifters) receive GGUS downtime notifications?


-- MariaALANDESPRADILLO - 2015-07-27

Topic revision: r15 - 2015-07-30 - AleDiGGi