DRAFT

WLCG Operations Coordination Minutes, November 8th, 2018

Highlights

Agenda

https://indico.cern.ch/event/769760/

Attendance

  • local:
  • remote:
  • apologies:

Operations News

Special topics

Hammer Cloud for commissioning of the compute resources

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal activity levels on average
  • No major issues on the grid
  • Task queue was affected by a HW I/O problem late Oct - early Nov
    • Causing erratic activity levels and job failures
    • Resolved by replacing the machine

ATLAS

  • Smooth Grid production over the last weeks with ~330k concurrently running grid job slots.
    • During last week’s LHC TS and MD, an additional ~90k job slots came from the HLT farm with Sim@P1. Such a large single resource is interesting for ATLAS as a whole, since it acts as a stress test for parts of the central ADC infrastructure.
    • Additional HPC contributions with peaks of ~100k concurrently running job slots and ~10k jobs from Boinc.
  • Commissioning of the Harvester submission system via PanDA is on-going: currently finishing the migration of the UK cloud and starting with the remaining clouds in FR, NL, CA, DE and the US.
  • The Heavy Ion data throughput from CERN Point1 to EOS to tape will be higher than initially planned and higher than the rate used in the September throughput test, i.e. 4.5GB/s during the run. A 50% duty cycle is still expected (this comes from the LHC and is not up to ATLAS to decide). This should be just about compatible with the tested throughput to tape (2-2.5GB/s); backup plans in case of trouble have been defined.
  • Asked sites to slowly migrate to CentOS7 in the coming months
  • Crash of EOSATLAS on Wednesday morning: link
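
The Heavy Ion throughput estimate above can be checked with a short calculation. This is only a sketch: it assumes that the sustained rate to tape is simply the peak rate multiplied by the duty cycle (i.e. that EOS buffers the data between fills), and the function and variable names are illustrative, not taken from any ATLAS tool.

```python
def average_rate(peak_gb_per_s: float, duty_cycle: float) -> float:
    """Time-averaged rate of a source writing at peak rate for a fraction of the time."""
    return peak_gb_per_s * duty_cycle


PEAK_GB_PER_S = 4.5             # expected rate during the run (from the minutes)
DUTY_CYCLE = 0.5                # expected LHC duty cycle (from the minutes)
TESTED_TAPE_RANGE = (2.0, 2.5)  # tested throughput to tape in GB/s (from the minutes)

avg = average_rate(PEAK_GB_PER_S, DUTY_CYCLE)
low, high = TESTED_TAPE_RANGE
fits = low <= avg <= high
print(f"average rate: {avg:.2f} GB/s, within tested range: {fits}")
# prints "average rate: 2.25 GB/s, within tested range: True"
```

Under this assumption the average of 2.25GB/s sits inside the tested 2-2.5GB/s window with little headroom, which is consistent with backup plans having been defined.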

CMS

  • Setup for HI run ongoing
  • Good CPU utilization recently
    • ~160k cores for production and ~50k cores for analysis
  • Emergency stop of production activities ~two weeks ago because of disk shortage
    • We reached 85% of "unmovable data"
    • Situation improved (back to 68% of "unmovable data") thanks to some cleaning performed and to early availability of 2019 pledges from some sites
  • Still some EOS instabilities related to fuse mount INC:1784940 (recent EOS crashes OTG:0046403, OTG:0046125)
    • Getting quite good support, with daily interactions and fixes
  • Upgrading several services to CentOS7
  • Planned switchover to CentOS7 for CMSSW 2019 releases
    • Singularity makes the change transparent for most of the MC and Analysis jobs submitted through the grid
    • Still need solutions with sl6/sl7 availability for users running "local jobs". We proposed that IT look into a solution which uses Singularity at the level of the batch system, and we would like their feedback on that

LHCb

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • The latest WLCG Accounting Task Force meeting discussed HTCondor accounting. We were lacking a standard solution for the HTCondorCE + HTCondor batch configuration. Stephen Jones evaluated the PIC implementation and is evolving it further so that it will be straightforward to use for all sites which have such a configuration.

Archival Storage WG

Update of providing tape info

PLEASE CHECK AND UPDATE THIS TABLE
| Site   | Info enabled | Plans | Comments |
| NLT1   | YES |  | Almost done, waiting for the opening of the firewall, on the order of a couple of days |
| KISTI  | YES |  | KISTI has been contacted. Will work on it in the second half of September |
| CERN   | YES |  |  |
| BNL    | YES |  |  |
| FNAL   | YES |  |  |
| JINR   | YES |  |  |
| KIT    | YES |  |  |
| NRC-KI | YES |  |  |
| TRIUMF | YES |  |  |
| NDGF   | NO  |  | NDGF has a distributed storage which complicates the task. Discuss with NDGF the possibility of doing the aggregation on the storage space accounting server side. Should be accomplished by the end of the year |
| CNAF   | YES |  | Space accounting info is integrated in the portal. Other metrics are on the way |
| IN2P3  | YES |  | Space accounting info is integrated in the portal. Other metrics are on the way |
| PIC    | YES |  | Space accounting info is integrated in the portal. Other metrics are on the way |
| RAL    | YES |  | Space accounting info is integrated in the portal. Other metrics are on the way |
One can see all sites integrated in storage space accounting for tapes here

Information System Evolution TF

  • Work is in progress on the specification of the computing resource description (Computing Resource Reporting, CRR) in order to provide an alternative to the CE description via BDII.

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • perfSONAR infrastructure status - CC7/4.1 campaign ongoing
    • Sites were reminded to upgrade to CC7 and review their configuration (preferably by end of October)
    • Still only around 50% of nodes are on CC7 as of today - we'll soon start contacting sites directly
    • Some sites waiting for/deploying new hardware; e.g. SARA deployed 100Gbps perfSONAR (first in Europe), BNL deployed 2x40 Gbps perfSONAR
  • WG update was presented at HEPiX and LHCOPN/LHCONE workshop
  • WLCG/OSG network services working fine
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Squid Monitoring and HTTP Proxy Discovery TFs

Traceability WG

Container WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

-- JuliaAndreeva - 2018-10-08

Topic revision: r23 - 2018-11-08 - JuliaAndreeva
 