Week of 180305
WLCG Operations Call details
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Portal
- Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local: Kate(DB, chair), Julia (WLCG), Maarten (ALICE, WLCG), Remy (storage), Vincent (sec), Alberto (mon), Borja (mon), Gavin (comp), Marian (network)
- remote: Alexander (NL-T1), Dmytro (NDGF), Kyle (OSG), Marcelo (CNAF), Sang Un (KISTI), Stephan (CMS), Di Qing (TRIUMF), Darren (RAL), David (IN2P3), Pepe (PIC), Victor (JINR), Xavier (KIT), Zoltan (LHCb)
Experiments round table:
- CMS reports -
- No major grid computing issue(s)
- Running smoothly at a little over 200k cores
- 60% production, 40% analysis
- Tier-0 migration (OpenStack to HTCondor-CEs) in progress
- Singularity deployment progressing well
- large ticket for factory operations to correct entries (glexec/OS) for CEs where sites did not contact them after Singularity setup
- tickets for all Tier-1,2 sites without Singularity and no known plan/schedule
- CMS global cosmic run starting today
- may not be able to attend the meeting but will read minutes, sorry!
- ALICE -
- Lowish activity level on average
- New productions are being prepared
- LHCb reports -
- Activity
- HLT farm fully running
- 2017 data re-stripping ongoing
- Stripping 29 reprocessing is ongoing
- Site Issues
- Tier2D
- Users with UK certificates are having problems accessing data at CBPF, Glasgow, CSCS, NCBJ (3 DPM, 1 dCache): GGUS:133667, GGUS:133617
Maarten offered to check the certificate issue.
Sites / Services round table:
- ASGC: nc
- BNL: nc
- CNAF:
- LHCb data move from old SE to new SE in progress, should finish by middle~end of this week. Also renewal of machine certificates.
- Shared file system work in progress, the farm should be re-opened today or tomorrow.
- Singularity is installed in CINECA farm.
| Service | VO | Status | Expected restart date | Readiness | GGUS ticket | CNAF comment | VO comment |
| Electric power line | - | Maint. | 20.02, 21.02 | One line in Production with UPS | | The second redundant line is still down | |
| Tape buffer | ALICE | OK | | Production | 133582 | all ALICE services at CNAF are running in the final configuration. | Looks OK |
| Tape buffer | ATLAS | OK | | Production | 131742 | Ok | |
| Tape buffer | CMS | OK | | Production | 133515 | Ok | |
| Tape buffer | LHCb | Being Rebuilt | Production in March | Not Ready | 133673 | | |
| Disk | ALICE | Parity OK | | Production | 133582 | all ALICE services at CNAF are running in the final configuration. | Looks OK |
| Disk | ATLAS | Parity OK | | Production | 131742 | Ok | |
| Disk | CMS | Degraded Parity | | Production | 133515 | raid5 in a few LUNs, raid6 in the others | Disks to be replaced |
| Disk | LHCb | Parity Restored | Production in March | Not Ready | 133673 | raid6 in all LUNs | Started the copy of data from old system to new one, ETA middle~end of this week. Tapes should be ready to stage when copy is done |
| Computing farm | - | | Ready | | CMS: configure singularity on only CINECA farm (DONE); ATLAS: FS mounting on farm nodes and restart LSF queues. | | Now all CEs in unscheduled downtime - WIP, should be back today or tomorrow |
- EGI: nc
- FNAL: nc
- IN2P3: Tomorrow morning, ALICE and LHCb production will be fully moved to CentOS7 WNs. Next Tuesday, March 13th, IN2P3-CC will be in scheduled maintenance. CE and SE will be in downtime and back by the end of the day.
- JINR: NTR
- KISTI: NTR
- KIT: NTR
- NDGF:
- Switch maintenance tomorrow morning (07-08 CET), which might make a chunk of ATLAS data and the squid server unavailable for a short time.
- NL-T1: NTR
- NRC-KI: nc
- OSG: NTR
- PIC: NTR
- RAL: NTR
- TRIUMF: During maintenance on the site core router on March 4, there were two network connectivity outages, each lasting about 12 minutes.
- CERN computing services: NTR
- CERN storage services: EOS-CMS was slow during the night over the weekend due to a find command that was run; currently being investigated
- CERN databases: CMSONR DB hosts were restarted today to refresh DNS settings
- GGUS: NTR
- Monitoring:
- Final SAM report for Jan 2018 sent to WLCG office.
- MW Officer: nc
- Networks: AGLT2 to LHCONE sites (NDGF/CERN) under investigation, narrowed down to US/ESNet segment
- Security: NTR
AOB: