WLCG Operations Coordination Minutes, July 30th 2015

Highlights


Agenda

Attendance

  • local: Maria Alandes (Minutes), Maria Dimou
  • remote:

Operations News

  • Following yesterday's GGUS release, this month's "Did you know?" article reminds users of existing GGUS features that were requested in the WLCG Site Survey last autumn, as decided at this meeting a month ago.

Middleware News

  • Baselines:
    • Support for dCache 2.6.x ended in May 2015. The decommissioning deadline is 21/09/2015, and from 31/08/2015 sites still running dCache 2.6.x will be ticketed (more details at https://wiki.egi.eu/wiki/Software_Calendars#dCache_v._2.6.x). About 20 instances are still running 2.6.x (no T1s).

  • Issues
    • A critical vulnerability affecting Red Hat 5, 6 and 7, which allows a local root exploit, was broadcast by EGI CSIRT (https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/libuser-2015-07-24). The vulnerability applies only when access is via a local user account, so only UIs where access is granted through the local passwd file could be affected. A fix for the affected package (libuser) is available for Red Hat 6 and 7 but not for Red Hat 5, so sites should either upgrade to a newer OS release or apply the workaround described at https://access.redhat.com/articles/1537873. According to the CSIRT procedures, sites not upgrading by 31/07 risk site suspension.

  • T0 and T1 services
    • JINR
      • dCache upgraded to 2.10.36
    • NL-T1
      • DPM upgraded to v 1.8.9 at NIKHEF in order to fix a data transfer issue.
    • PIC
      • dCache upgraded to 2.10.37

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
    • new record 83k briefly reached on Jul 29
  • CERN:
    • raw data copies from Point 2 to CASTOR were timing out (GGUS:115145)
      • raw data reconstruction jobs were keeping many disk server slots busy
      • many more slots appeared to be occupied by stale transfers
      • to be followed up further with the CASTOR devs
    • job submissions became really slow multiple times (GGUS:115153 and GGUS:115238)
      • some issues were cured on the Argus side
      • the real cause of such problems has not yet been identified
  • NDGF reported inefficient data transfers and noise in their logs
    • due to failed attempts with 2 methods before the 3rd succeeds
    • Xrootd client only checks if the source supports 3rd party copies
      • also the destination should be checked
      • a bug has been opened for the Xrootd devs
    • meanwhile a workaround has been applied on the ALICE side

ATLAS

CMS

LHCb

  • Operations
    • Currently finishing a restripping of the Run1 legacy data and of the 50 ns Run2 ramp
    • Discussion with CERN/LSF team about the queue capabilities, problems found both in LSF and Dirac (GGUS:115027)
    • Preparations for the 25ns ramp up ongoing.
  • Developments
    • HammerCloud testing for LHCb is currently being revitalized. The probe will check the possibility of running user analysis jobs with protocol access at sites.
    • perfSonar data extraction from WLCG sources is almost finished; work is now ongoing on publishing the data into LHCbDIRAC.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • NTR

Machine/Job Features

  • A Nagios probe checking the availability and sanity of machine/job features (MJF) has been developed. It is currently running in preprod for the LHCb SAM instance. Results can be seen at http://cern.ch/go/Gzn8. LHCb sites providing MJF are:
    • CERN
    • GRIDKA
    • LPNHE
    • Imperial College

Middleware Readiness WG


Multicore Deployment

  • Accounting: the latest update on sites that have not yet enabled multicore accounting was sent to the GDB mailing list on 6/7/2015, with the following list of sites:
    • Austria - HEPHY-UIBK, Hephy-Vienna
    • Germany - RWTH-AACHEN, DESY-HH, MPPMU
    • India - IN-DAE-VECC-02
    • Mexico - ICN-UNAM
    • Russia - RRC-KI, ru-Moscow-SINP-LCG2, RU-SPbSU, Ru-Troitsk-INR-LCG2
    • Spain - IFIC-LCG2, UB-LCG2
    • UK - UKI-LT2-IC-HEP, UKI-SCOTGRID-DURHAM, UKI-SOUTHGRID-OX-HEP

Could the sites listed check and let us know? (The UK sites are already known about, which is why I struck them off.)

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

Network and Transfer Metrics WG


HTTP Deployment TF

Information System Evolution

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | ALICE, ATLAS, CMS have made progress after discussing with the T0 manager. They will present at the next meeting. | July 30 | ~40% |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-06-18 | Some sites have still not enabled multicore accounting | All | Multicore Deployment | Instructions here | a.s.a.p. | Almost DONE. HERE is the list of the remaining still pending sites. |
| 2015-06-04 | ALL ATLAS sites implementing a cap to their multicore resources (whether their configuration is dynamic just for a portion of nodes or it is a static partition) should review the cap to give 80% of the ATLAS production resources to multicore. As a reminder the shares for ATLAS jobs are as follows T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2. More info here | ATLAS | Multicore |  | None | CLOSED |
| 2015-06-04 | LHCb T1s requested to make sure that all the RAW data will be stored on the same tape set in each tape system when it is feasible | LHCb | - | More details in GGUS:114018 |  |  |
| 2015-06-18 | CMS requests an adjustment of the Tier-1 fair share target for the following VOMS roles: /cms/Role=production 90% (was 95%), /cms/Role=pilot 10% (was 5%). Note that for CMS SAM tests the role cms/Role=lcgadmin is used, it basically needs very little fair share but should be scheduled asap to have the test not timing out. Overall at least 50% of the pledged T1 CPU resources should be reachable via multi-core pilots (this is as before - just mentioned for completeness) | CMS |  |  | None yet | CLOSED (confirmed regarding config) Verification is longer term |
| 2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - |  | None yet | ~10 T2 sites missing, Ticket open |
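
For reference, the ATLAS multicore cap in the 2015-06-04 action above ("80% of 95% at T1s and 80% of 50% at T2") can be worked out as a fraction of a site's total ATLAS resources. The sketch below is only an illustration of that arithmetic; the function name `multicore_share` is hypothetical and not part of any ATLAS or site tool.

```python
def multicore_share(production_share, multicore_fraction=0.80):
    """Fraction of a site's total ATLAS resources that multicore should get.

    production_share: fraction of ATLAS resources given to production
                      (0.95 at T1s, 0.50 at T2s, per the action item).
    multicore_fraction: fraction of production given to multicore (80%).
    """
    return production_share * multicore_fraction


t1 = multicore_share(0.95)  # 80% of 95%
t2 = multicore_share(0.50)  # 80% of 50%

print(f"T1 multicore share of ATLAS resources: {t1:.0%}")
print(f"T2 multicore share of ATLAS resources: {t2:.0%}")
```

In other words, the cap corresponds to roughly 76% of the ATLAS resources at a T1 and 40% at a T2.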

AOB

GGUS: How do users (e.g. VO shifters) receive GGUS downtime notifications?

https://its.cern.ch/jira/browse/GGUS-1454

-- MariaALANDESPRADILLO - 2015-07-27

Topic revision: r11 - 2015-07-30 - MaiteBarroso
 