WLCG Operations Coordination Minutes, July 16th 2015

Highlights

  • The dates of the next WLCG Workshop in Lisbon, Portugal, are 1-3 February 2016 (Monday am to Wednesday lunchtime), with a DPHEP event attached on 3-4 February.
  • A new Information System Evolution Task Force has been set up. The TF's twiki explains the Mandate, Goals, first issues tackled and how to take part.
  • The issue of lost files due to a race condition between Rucio and FTS is now fixed. ATLAS will make a list of affected files and inform the shifters.
  • Good progress has been made in the experiment-T0 discussions to provide a description of each experiment's computing support structure, so that tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected. The results will be presented at the next meeting (the deadline of the action).
  • A few sites have still not enabled multicore accounting. The list is here. Congrats to those who are done.
  • In the MW Readiness verification effort, volunteer sites have started expressing interest in testing CentOS7/SL7 versions of Middleware. Thanks!


Agenda

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes), Andrea Sciabà, Jerôme Belleman, Oliver Keeble, Alessandro di Girolamo, David Cameron.
  • remote: Hung-Te Lee, Di Qing, Gareth Smith, Renaud Vernet, Rob Quick, Alessandra Doria, Frédérique Chollet, Alessandra Forti, Andrew McNab, Thomas Hartmann, Ulf Tigerstedt, Catherine Biscarat, Vincenzo Spinoso, Julia Andreeva, Jeremy Coles.
  • apologies: Maite Barroso, Pepe Flix, Maarten Litmaath, Andrea Manzi.

Operations News

  • The dates of the next WLCG Workshop in Lisbon, Portugal, are 1-3 February 2016 (Monday am to Wednesday lunchtime), with a DPHEP event attached on 3-4 February. Please note them in your calendars. Registration will open in October.
  • The dates of the next Operations Coordination meetings are:
    • 30th July
    • 20th August
    • 3rd September

New TF: The future of the Information System

  • Maria A. presented slides, completing the action created on 2015-06-18. The future of the Information System needs to be examined because, if OSG no longer publishes in the BDII, the WLCG InfoSys will be missing information. The new TF has a twiki including a Mandate, e-group and meeting information. It will meet on Thursdays at 4pm CE(S)T, at a commonly agreed frequency. Rob Q. was added to the e-group by Maria D.; the moderator is Maria A. Ale di Gi suggested that the TF set clear goals so that its lifetime becomes obvious. REBUS issues will be handled first; the rest will be defined at the first meeting of the TF.

Middleware News

  • Baselines:
    • NTR

  • Issues
    • NTR

  • T0 and T1 services
    • NTR

Tier 0 News

Tier 1 Feedback

  • NDGF-T1: Getting tired of ATLAS sending tickets about missing files due to the FTS3<->Rucio bug.
  • PIC: GGUS:114648 was also wrongly assigned to PIC, as the issue there was with Rucio as well.
David and Ale reported that the Rucio issue was solved on July 1st. Updated instructions were given to the ATLAS shifters not to disturb the sites when apparently lost files are in the list of "suspicious files". Ale said that real cases of file loss do exist; case-by-case investigation by the site therefore remains part of WLCG operations. Further information is in the ATLAS report below.

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
  • CERN: LSF cap removed on Tue July 14, thanks!
    • it used to be 15k and was very often reached over the past several weeks
    • the concurrent jobs are often 20k+ now
    • a top of 27k+ was reached for many hours!
  • CNAF: tape SE has been upgraded to Xrootd 4.1.3, thanks!
    • testing in progress
    • disk SE will follow

ATLAS

  • Data taking is back, at a reduced rate compared to June; processing is going smoothly
  • Grid consistently full with 150-200k slots used
  • Lost files: In May/June many files were lost due to a race condition between Rucio and FTS. The issue is now fixed and ATLAS is investigating which files are affected.
    • A consistency check for all files suspected to be lost is underway
    • In addition we requested all T1 sites to provide storage dumps for large-scale consistency checks
    • We will inform shifters to check problems with non-existent files against a known list of suspicious files and not send tickets to sites if they are in the list.
  • Recent Issues
    • Daily LSF reconfiguration problem badly affected T0 processing two days in a row GGUS:114929
    • Accidental overloading of TRIUMF tape buffer GGUS:114796. Protection will be put in place on ATLAS side
    • Problems reading data from NIKHEF GGUS:114431. Maybe a DPM version issue?
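
The shifter procedure described in the lost-files item above amounts to a membership check against the list of suspicious files. A minimal sketch, assuming the list is available as plain file names; the function name and the example file names are hypothetical and do not reflect the actual ATLAS shifter tooling.

```python
def needs_site_ticket(missing_file: str, suspicious_files: set) -> bool:
    """Return True only if the missing file is NOT on the known suspicious list."""
    return missing_file not in suspicious_files


# Hypothetical file names, for illustration only.
suspicious = {"data15.example.0001.root", "data15.example.0002.root"}

# On the list: already known to ATLAS, so no ticket to the site.
print(needs_site_ticket("data15.example.0001.root", suspicious))
# Not on the list: a possible real loss, so the site should be contacted.
print(needs_site_ticket("data15.example.9999.root", suspicious))
```

Files not on the list still need case-by-case investigation by the site, since real cases of file loss do exist.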

CMS

  • Production overview
    • Upgrade DIGI-RECO with PU=200 (rather resource demanding)
    • Run2 DIGI-RECO
    • Monte Carlo production
  • Tier-0
    • Now creates MINIAOD (compressed analysis format)
  • Operational issues
    • Ran into a CVMFS bug on Thursday last week
      • Link in SITECONF directory after optimization for nested catalogues
      • Basically all CMS jobs started to fail "everywhere"
      • Alarm ticket GGUS:114933
    • Needed some re-tuning after SRM re-shuffling for CASTOR at CERN
      • DN mapping
      • Firewall configuration
    • Global redirector
      • Runs out of threads occasionally
      • Investigations by experts/developers ongoing
      • Upgrade to xrootd 4.2 planned
  • Experiment specific tickets in SNOW
    • Discussion in CMS almost concluded
    • Needs some input from people presently on holidays
    • CMS will report at the next meeting (within the deadline)

LHCb

  • No report.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • NTR

Machine/Job Features

Middleware Readiness WG


  • The ATLAS and CMS Volunteer sites are active verifying new MW versions, especially dCache v.2.13.3 and StoRM.
  • The MW Officer Andrea M. is now polling sites' interest in testing CentOS7/SL7 versions of MW.
  • The new pakiti-client v.3.0.1 is now available. It contains a new tag to distinguish publishing of packages run for MW Readiness. See explanation in jira ticket MWREADY-67. All documentation linked from our twiki. Direct link HERE.
  • Next vidyo meeting on Wed September 16th at 4pm CEST.

Maria A. asked about UMD availability for CentOS7/SL7 of some common MW packages like yaim-core. Vincenzo said there will be an EGI meeting next week, followed by a broadcast including packages recently released in UMD.

Multicore Deployment

  • Accounting: The latest update on sites that have not yet enabled multicore accounting was sent to the GDB mailing list on 6/7/2015, with the following list of sites
    • Austria - HEPHY-UIBK, Hephy-Vienna
    • Germany - RWTH-AACHEN, DESY-HH, MPPMU
    • India - IN-DAE-VECC-02
    • Mexico - ICN-UNAM
    • Russia - RRC-KI, ru-Moscow-SINP-LCG2, RU-SPbSU, Ru-Troitsk-INR-LCG2
    • Spain - IFIC-LCG2, UB-LCG2
    • UK - UKI-LT2-IC-HEP, UKI-SCOTGRID-DURHAM, UKI-SOUTHGRID-OX-HEP

Could the sites listed check and let us know? (We already know about the UK sites, which is why they are struck off.)

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

  • No news

Network and Transfer Metrics WG


HTTP Deployment TF

The 3rd meeting of the TF took place on Wed 15th July and focused on monitoring.

https://indico.cern.ch/event/401396/

For functional validation of storage, the TF decided to use a shared SAM probe which would be integrated into the experiment SAM instances. Such a probe already exists (created by the IT-SDC group at CERN) and will be extended to cover the required functionality.

For access monitoring, two solutions were presented: a UDP stream compatible with the xrootd f-stream, and the publication of JSON messages. Storage providers are free to choose the solution they prefer.

Full minutes will be available shortly.

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-18 | Organise further discussions on the InfoSys future | Maria Alandes | DONE | Organise a new TF to discuss the future of the Infosys. |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-07-02 | Provide a description of each experiment computing support structure, so tickets wrongly assigned to the T0 (via SNOW or GGUS) can be properly redirected; evaluate the creation of SNOW Functional Elements for the experiments, if this is not already the case | all | n/a | ALICE, ATLAS, CMS have made progress after discussing with the T0 manager. They will present at the next meeting. | July 30 | ~40% |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-06-18 | Some sites have still not enabled multicore accounting | All | Multicore Deployment | Instructions here | a.s.a.p. | Almost DONE. HERE is the list of the remaining still pending sites. |
| 2015-06-04 | ALL ATLAS sites implementing a cap to their multicore resources (whether their configuration is dynamic just for a portion of nodes or it is a static partition) should review the cap to give 80% of the ATLAS production resources to multicore. As a reminder the shares for ATLAS jobs are as follows T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2. More info here | ATLAS | Multicore |  | None |  |
| 2015-06-04 | LHCb T1s requested to make sure that all the RAW data will be stored on the same tape set in each tape system when it is feasible | LHCb | - | More details in GGUS:114018 |  |  |
| 2015-06-18 | CMS requests an adjustment of the Tier-1 fair share target for the following VOMS roles: /cms/Role=production 90% (was 95%), /cms/Role=pilot 10% (was 5%). Note that for CMS SAM tests the role cms/Role=lcgadmin is used, it basically needs very little fair share but should be scheduled asap to have the test not timing out. Overall at least 50% of the pledged T1 CPU resources should be reachable via multi-core pilots (this is as before - just mentioned for completeness) | CMS |  |  | None yet | CLOSED (confirmed regarding config) Verification is longer term |
| 2015-06-18 | CMS asks all T1 and T2 sites to provide Space Monitoring information to complete the picture on space usage at the sites. Please have a look at the general description and also some instructions for site admins. | CMS | - |  | None yet | ~10 T2 sites missing, Ticket open |
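
The multicore share quoted in the ATLAS action above (80% of the production share at each tier) can be worked out explicitly. This is a minimal sketch of the arithmetic only; the constant and function names are illustrative and are not part of any ATLAS or site configuration tool.

```python
# Production share of ATLAS jobs per tier, as stated in the action above.
PRODUCTION_SHARE = {"T1": 0.95, "T2": 0.50}
# Fraction of the production resources that should go to multicore.
MULTICORE_FRACTION = 0.80


def multicore_share(tier: str) -> float:
    """Return the fraction of a site's total ATLAS resources meant for multicore."""
    return MULTICORE_FRACTION * PRODUCTION_SHARE[tier]


for tier in ("T1", "T2"):
    print(f"{tier}: {multicore_share(tier):.0%} of ATLAS resources to multicore")
```

Under these figures a T1 cap corresponds to 76% of the site's ATLAS resources and a T2 cap to 40%.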

AOB

-- MariaDimou - 2015-07-13

Topic revision: r40 - 2018-02-28 - MaartenLitmaath
 