WLCG Operations Coordination Minutes, September 13th, 2018

Highlights

  • A report on EOS incidents, improvements and plans
    • Despite recent issues, the outlook is positive for the rest of 2018 and beyond

  • CMS CRIC is deployed in production

  • Sites should upgrade perfSONAR to v4.1 on CentOS 7

Agenda

Attendance

  • local: Alberto (CERN storage), Borja (monitoring + WLCG), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Massimo (CERN storage), Tommaso (CMS), Vincent (security)
  • remote: Alessandra D (Napoli), Alessandra F (Manchester + ATLAS + WLCG), Brij (TIFR), Catherine (LPSC + IN2P3), Cristi (CERN storage), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Giuseppe (CMS), Johannes (ATLAS), Martin (Prague), Matt (EGI), Puneet (TIFR), Stephan (CMS), Vladimir (LHCb)
  • apologies: Pepe (PIC)

Operations News

  • Experiment contacts are kindly asked to provide info for the list of services that deal with user information and will therefore need to provide a privacy notice. See more details in the GDPR discussion during the GDB

  • The next meeting will be on Thu Oct 11
    • Please let us know if that date would pose a significant problem.

Special topics

EOS report

See the presentation

Discussion

  • Johannes:
    • EOS-ATLAS has been fairly stable over the summer
    • 1-2 short outages in the last 3 weeks, not that serious
    • The improvements have worked, the situation now looks OK for ATLAS
    • 1 rogue user caused an outage during the spring, dealt with on the ATLAS side

  • Massimo:
    • The presentation refers to a report that compares before and after
    • It agrees with the conclusions from ATLAS
    • 1 reboot was caused by a power cable incident
    • Should be better with the HA setups we will have in the future
    • Will also mitigate SW issues
    • Reboots lasting 1000+ seconds up to a few hours have been the main issue

  • Tommaso:
    • CMS has had the opposite experience - major incidents last week
    • See the CMS report
    • 1-2 h downtime can be handled OK, silent corruptions cannot!
      • The source files will typically have been deleted in the meantime
    • Fortunately that problem was noticed after just ~20 minutes
    • Is it understood?

  • Massimo:
    • It was caused by a background activity, viz. the MGM memory compaction
    • The files disappeared only from the name space and usually can be recovered
    • This kind of trouble will go away with the new MGM

  • Tommaso:
    • We also ran into directories where only root could write
    • And there were FUSE mount problems on some of our service machines
      • E.g. processing log files
    • Do we have to switch to the new plugin?

  • Massimo:
    • The old plugin is still maintained
    • The new version is in QA for LHCb and AMS - maybe CMS wants to join?

  • Tommaso:
    • Let's try it out on specific machines
    • Can the FUSE plugin be used in production?

  • Massimo:
    • Over the last 2 years we have invested a huge effort in the FUSE plugin
    • We now have v2 which indeed is supported for production

  • Alberto: where will it be tested?

  • Tommaso: on a few CMS VOBOXes

  • Tommaso: w.r.t. the HI data-taking tests this week,
    there does not seem to be a lot of tape writing by ALICE?

  • Maarten: the 10 PB ALICE disk buffer in EOS + CASTOR should hold all the new data,
    which will be copied to tape as fast as possible in the background

  • Massimo:
    • We have mainly focused on a global test
    • We have seen the CMS rates get lower after some time

  • Tommaso: we still have ~2 days to debug any remaining issues

  • Giuseppe: NTA

  • Julia: do LHCb or ALICE have concerns to mention to the EOS team?

  • Vladimir: no major issues for LHCb

  • Maarten: neither for ALICE

DPM SRR deployment TF

See the presentations

Discussion

  • Julia:
    • The timeline has not been decided yet, possibly by the next meeting
    • Regarding the DPM upgrade to version 1.10.3 or higher, we want to be in line with the TPC activity in DOMA.
      We currently aim for the version upgrade to be finished by the end of spring. Re-configuration might come later.
    • Start with a small set of pioneer sites to polish the procedures and instructions and then go for wider deployment

  • Catherine:
    • There are 9 DPM sites in France, 1 of which has been using DOME for 1 year
    • We have discussed the deployment and the expected timeline is OK
    • The current priority is on the dual-stack deployment for IPv6 support

Middleware News

  • Useful Links
  • Baselines/News
  • Issues:
    • UMD-4 update on July 11 broke SL6 CREAM CEs
      • Reported here for the record - already included in the Service Report for August
      • Tomcat could not start with the newer versions of canl-java, bouncy-castle and voms-api-java
      • This was not caught in the Staged Rollout, because CREAM itself was not updated
        • Will be handled better in the future
      • Several high-priority tickets were opened, e.g. GGUS:136074
      • CREAM developers quickly provided fixes in their own repository
      • UMD-4 was again updated on July 24
        • With additional instructions in the release notes
        • Still leaving some loose ends to be tied up after the holidays
      • Tickets were updated with workaround recipes in the meantime

Discussion

Tier 0 News

  • CERN would like to ask the experiments how much notice they would need before the majority of batch resources here are changed to CC7, assuming any intervention would take a couple of weeks to roll out.

An action for the experiments has been created

Tier 1 Feedback

  • IN2P3-CC: Because of a RAID problem, a disk server of the XROOTD storage will be lost. 110 TB of data lost for ALICE.

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal activity levels on average over the summer
  • IN2P3-CC: 110 TB lost due to RAID problem
    • Mostly recovered from replicas

ATLAS

  • Smooth Grid production over the last weeks with ~300k concurrently running grid job slots. Additional HPC contributions with peaks of ~100k concurrently running job slots.
  • Over the last 6 weeks a large digitisation and reconstruction campaign of MC16e ran on about 150k job slots. This will be followed in the next weeks by a larger derivation production campaign.
  • Commissioning of the Harvester submission system via PanDA is on-going on the Grid: Iberian cloud mostly done, IT and UK cloud now on-going
  • Tape carousel R&D staging campaign at Tier1s on-going: BNL, FZK, PIC, INFN-T1, TRIUMF, SARA, IN2P3-CC done so far. About 200 TB of AODs are being staged from tape and possible improvements of the workflows are being evaluated.
  • Heavy ion TDAQ to EOS/Castor throughput test started.

Discussion

  • Maarten, Julia: it would be good to present your tape carousel results
    in the DOMA project and the Archival Storage WG

  • Johannes: yes, after they have been presented within ATLAS

CMS

  • MD3 in progress
  • preparing for HI rate test in the following week
  • various EOS issues at CERN during the last month
  • local file access issues at RAL and PIC under investigation, GGUS:136028, GGUS:136677
  • compute systems busy at about 220k cores, usual mix of 80% production and 20% analysis
  • processing backlog, lower/medium priority Monte Carlos not progressing much

LHCb

  • Usual activity with Data reconstruction and stripping, MC simulation and User analysis

Ongoing Task Forces and Working Groups

Accounting TF

  • The problem in the CERN accounting has been investigated and hopefully understood.

Archival Storage WG

Update on providing tape info

PLEASE CHECK AND UPDATE THIS TABLE

Site | Info enabled | Plans | Comments
CERN | YES | |
BNL | YES | |
CNAF | YES | | Space accounting info is integrated in the portal. Other metrics are on the way
FNAL | YES | |
IN2P3 | YES | | Space accounting info is integrated in the portal. Other metrics are on the way
JINR | YES | |
KISTI | NO | | KISTI has been contacted. Will work on it in the second half of September
KIT | YES | |
NDGF | NO | | NDGF has a distributed storage which complicates the task. Discuss with NDGF the possibility to do aggregation on the storage space accounting server side. No news recently
NLT1 | NO | | Almost done, waiting for opening of the firewall, on the order of a couple of days
NRC-KI | YES | |
PIC | YES | | Space accounting info is integrated in the portal. Other metrics are on the way
RAL | YES | | Space accounting info is integrated in the portal. Other metrics are on the way
TRIUMF | YES | |

One can see all sites integrated in storage space accounting for tapes here

Information System Evolution TF

  • CMS CRIC is deployed in production. Functionality currently enabled is to replace SiteDB both for topology and user info.
  • Working on CRIC for WLCG central operations which will provide topology for all 4 experiments. Progressing well.
  • The next IS Evolution Task Force meeting will take place next Thursday. The main topic is how to provide a description of computing resources in a JSON file, similar to what is proposed in the Storage Resource Reporting document for storage services

Discussion

  • Alessandra F: UK sites intend to drop the BDII
  • Maarten: have you considered the consequences w.r.t. EGI, e.g. the ARGO tests?
  • Alessandra F:
    • Other VOs are supported through the GridPP DIRAC service
    • We first remove the LHC experiments from the BDII and then we will see
  • Matt:
    • Will discuss this matter with my colleagues
    • Alessandro Paolini would know more details
  • Julia: he has participated before; it would be good if he could join next Thu

IPv6 Validation and Deployment TF

Detailed status here.

See the status report presented in the Sep GDB

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • perfSONAR infrastructure status
    • perfSONAR 4.1 was released a few weeks ago - the main new feature is an improved central/remote configuration
    • WLCG broadcast was sent this week to remind sites to upgrade to CC7 and review their configuration (preferably by end of October)
    • Around 50% of sonars are on CC7 as of today
  • WG update will be presented at the upcoming HEPiX
  • WLCG/OSG network services
    • Central configuration service (meshconfig/psconfig) was updated to the version released in 4.1 (officially supported by perfSONAR team)
    • psconfig.opensciencegrid.org is currently unreachable via IPv6 from non-LHCONE sites due to a routing issue; this is being followed up by the network team at MSU
  • The NSF-funded projects SAND and IRIS-HEP are starting; both will contribute in different ways to the OSG Network Area - more details will be provided in the HEPiX talk
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
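
As an illustration of the IPv6 reachability issue mentioned above, a site could check from its own network whether a host resolves to IPv6 addresses and accepts a TCP connection over IPv6. The following is only a minimal sketch: the function names are hypothetical, port 443 is an assumption about the service, and the resolver argument exists purely so that the lookup can be replaced in tests.

```python
import socket


def ipv6_addresses(host, port=443, resolver=socket.getaddrinfo):
    """Return the sorted IPv6 addresses that `host` resolves to (empty list if none)."""
    try:
        infos = resolver(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        # No AAAA record, or name resolution failed altogether.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def reachable_over_ipv6(host, port=443, timeout=5.0, resolver=socket.getaddrinfo):
    """Try a TCP connection to each IPv6 address of `host`; True if any succeeds."""
    for address in ipv6_addresses(host, port, resolver):
        try:
            with socket.create_connection((address, port), timeout=timeout):
                return True
        except OSError:
            # Unreachable, refused or timed out: try the next address, if any.
            continue
    return False
```

For example, reachable_over_ipv6("psconfig.opensciencegrid.org") would return False from a site affected by the routing issue, provided the service listens on the assumed port. A False result can also mean a local firewall or a missing IPv6 route, so it is a hint rather than a diagnosis.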

Squid Monitoring and HTTP Proxy Discovery TFs

  • Thanks to Michal Svatos, there are now CVMFS and ATLAS frontier failover monitors linked from http://wlcg-squid-monitor.cern.ch
    • Based on GOCDB/OIM squid registration
    • There's also one for CMS but it hasn't yet been cut over into production, waiting on more squid registrations
  • Some duplicate MRTG SNMP queries have been removed by sharing data between MRTG plots
    • The same technique will be used for generating a CMS-only MRTG page based on registrations, which will eliminate more SNMP query duplication and make it easier to implement

Traceability WG

Container WG

See the status report presented in the Sep GDB

Action list

Creation date | Description | Responsible | Status | Comments
03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | In progress | GGUS:133915
07 Jun 2018 | Followup of OSG service URL changes | WLCG Operations | DONE | We suggest that for all middleware using various OSG-related URLs the experiments look at this page and inform operations in case you need more help
07 Jun 2018 | GDPR policy implementation across WLCG and experiment services | WLCG Operations + experiments | Ongoing | Details here

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments
13 Sep 2018 | moving most of CERN batch to CC7 | all | - | 11 Oct | | how much advance warning needed?

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

Topic revision: r17 - 2018-09-17 - MaartenLitmaath
 