Week of 120123

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local (Andrea, Elisa, Massimo, Eva, Eddie, Yuri, Philippe); remote (Elizabeth/OSG, Michael/BNL, Tobias/KIT, Rolf/IN2P3, Onno/NLT1, Thomas/NDGF, Burt/FNAL, Paolo/CNAF, Tiju/RAL; Stefano/CMS).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
      • [Yuri: just experienced a GGUS issue, cannot close a ticket, this fails with a strange web error. Sent a private email to GGUS developers. Tobias: will report it to GGUS colleagues in the offices next door. Yuri: will send Tobias more details about the issue.]
    • T1s
      • RAL-LCG2: Sunday morning: DNS issues resulted in transfer failures; TEAM ticket GGUS:78459 was created, the site announced an unscheduled downtime and fixed the issue.
      • TRIUMF-LCG2: Sunday ~19:30, a number of file transfer failures with "source file doesn't exist" errors. GGUS:78468.
      • PIC: Monday morning ~4:30, many transfer failures ("locality is unavailable") and stage-in job failures. GGUS:78470 solved at ~8:00. Connectivity problems on half of the DDN blades serving ATLAS data yesterday afternoon: due to a power issue, this device did not start correctly when all services were restarted. Fixed.

  • CMS reports -
    • NTR - crunching data happily
    • CRC on duty: Stefano Belforte

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • ntr
      • [Philippe: tomorrow the LFC will be affected by the migration to 11g. Eva: this will actually be a general downtime for LHCb for the upgrade to 11g of the offline Oracle server, from 10h to 14h. It will affect all services including the LFC, bookkeeping and others.]
    • T1
      • PIC: unscheduled downtime yesterday (Sunday). After all services were restarted (around 8pm), some problems were still observed with LFC Streams. Site availability turned red again today at about 12h. [Eva: the Streams problem is due to a known bug in the mixed configuration we deploy, with 11g downstream and 10g at the Tier1 sites. Following up the PIC recovery with the site experts.]
      • IN2P3: migration of the data from the old space tokens to the new ones is ongoing (GGUS ticket): T0D1 and T1D0 space tokens migrated, T1D1 ongoing.
      • SARA: problem with CVMFS on some nodes solved on Friday 20th (GGUS:78391)
    • T2
      • ntr

Sites / Services round table:

  • Elizabeth/OSG: ntr
  • Michael/BNL: ntr
  • Tobias/KIT: nta
  • Rolf/IN2P3: ntr
  • Onno/NLT1: ntr
  • Thomas/NDGF: downtime this afternoon for upgrading the dcache head node, but it did not boot after the upgrade, so it was replaced by a spare
  • Burt/FNAL: ntr
  • Paolo/CNAF: ntr
  • Tiju/RAL:
    • DNS problem for 7h on Sunday as reported by ATLAS
    • successfully upgraded the CMS SRM to 2.11
  • Gonzalo/PIC (sent after the meeting):
    • Yesterday at around 15:45 CET we had a power problem in the building which affected the cooling system of the main machine room and cut the power to the whole WN module (a container-like extra room where 80% of the WNs sit). Because of this, most of the WNs lost power abruptly and various services were gracefully switched off to avoid overheating the main room. The recovery took a few hours and by about 21:00 CET most of the Tier1 services were back. A large part (about 60%) of the storage service did not recover properly until this morning at around 9:00 CET: it was a DDN system for which one controller did not start properly during yesterday's power-on procedure. A Service Incident Report is in preparation and will be published in the WLCG ops wiki in the coming days.

  • Massimo/Storage:
    • Moved CMS this morning to new tape gateway, will do ALICE tomorrow
    • Upgraded the stager DB for ALICE this morning
  • Eddie/Dashboard: ntr
  • Eva/Database:
    • LHCb upgrade to 11g tomorrow as already mentioned
    • Applying security patches to all development and integration databases this week, today and in the coming days. The interventions are rolling, but a short full downtime is scheduled at the end to change some database parameters.

AOB:

  • (MariaDZ) GGUS Release this Wednesday. Scheduled downtime was announced in GOCDB.

Tuesday:

Attendance: local (Andrea, Elisa, Eva, Eddie, Yuri, John, Ignacio, MariaDZ); remote (Michael/BNL, Tobias/KIT, Gonzalo/PIC, Rolf/IN2P3, Jeremy/GridPP, Thomas/NDGF, Tiju/RAL, Ronald/NLT1, Burt/FNAL, Rob/OSG, Paolo/CNAF, Giovanni/CNAF; Jose/CMS).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • The GGUS ticket "close/solve" issue was reported to GGUS support and resolved yesterday. Details in ATLAS Elog:33239,33243,33267. The problematic GGUS:78461 was solved. [Tobias: this was a synchronization problem in the ticketing system. To reduce the number of separate email threads, Guenther Grein suggests that next time ATLAS should again open a GGUS ticket, as was done here, but also copy the ATLAS shifters in CC. Yuri: OK, will report this to ATLAS.]
      • [Yuri: saw that ATLAS space tokens are blacklisted, is this related to the CASTORPUBLIC intervention today? Ignacio: yes, this is related: the SAM tests are using CASTORPUBLIC and should be changed. John: will follow this up offline.]
    • T1s
      • ntr

  • CMS reports -
    • SLS showed unavailability of the CMS CASTOR-SRM for ~1 hour (10:30-11:30 CET). [Ignacio: same as for ATLAS, this is because the SAM tests are using CASTORPUBLIC, which was affected by an intervention today.]
    • CRC on duty: Jose Hernandez

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • Oracle upgrade ongoing (from 10h to 14h); the main services affected are the LFC and the Bookkeeping. Users have been notified in advance. [Elisa: saw that the DB intervention is over but Streams are not restarted yet. Eva: yes, this is normal; Streams will be restarted later in the afternoon because it takes a bit more time to set them up.]
    • T1
      • PIC: yesterday opened a GGUS ticket for aborted pilots, promptly fixed by the site.

Sites / Services round table:

  • Michael/BNL: ntr
  • Tobias/KIT: today the GridKa conditions DB for ATLAS is moving to 11g
  • Gonzalo/PIC: all OK now after the power cut, preparing a SIR. The problem was caused by a glitch in the input power to the building, in a power configuration that is being fixed.
  • Rolf/IN2P3: ntr
  • Jeremy/GridPP: ntr
  • Thomas/NDGF: ntr
  • Tiju/RAL: ntr
  • Ronald/NLT1: ntr
  • Burt/FNAL: ntr
  • Rob/OSG: a maintenance window started 15 minutes ago; it should be transparent but there may be small outages
  • Paolo/CNAF and Giovanni/CNAF: ntr

  • Eddie/Dashboard: ntr
  • Ignacio/Grid:
    • CERN Pilot FTS: the fts-pilot-service.cern.ch service was changed yesterday from gLite 2.2.8 to the EMI 2.2.8 release candidate. It was mostly transparent, apart from some dropped transfers around 17:00. Early indications are promising.
    • CVMFS: the repository /cvmfs/atlas-condb.cern.ch will be switched tomorrow morning at 09:00 UTC to be served from a new stratum 0 data source. This is expected to be transparent. No configuration change at the stratum 1s or elsewhere is required (see ITSSB).
  • John/Storage
    • CASTORPUBLIC: DB update to 11g this morning went OK
    • CASTORALICE: tape gateway move pushed back to tomorrow, same time (9.30am to 10.30am)
    • EOSCMS update on Wed: unclear, may have to be cancelled/delayed (due to new issues)
  • Eva/Database: nta

AOB:

  • (MariaDZ) GGUS Release tomorrow. There will be test ALARMs as per the agreed timezone schedule. Please visit the list of new features here.
  • (MariaDZ) Note for IN2P3: Following the relevant presentation at the T1SCM last Thursday, the GGUS tickets on network problems to Japan and the USA were updated by the experiments, who see no improvement in one direction. Please read the meeting notes and continue the dialogue to reach a solution satisfactory for all. [Andrea: who should follow up? Maria: this is mainly on IN2P3 now. Rolf: yes, we are still working on this.]

Wednesday

Attendance: local (Andrea, Luca, Alessandro, Elisa, Eddie, Yuri, John, Ignacio, MariaDZ); remote (Burt/FNAL, Michael/BNL, John/RAL, Tobias/KIT, Gonzalo/PIC, Alexander/NLT1, Thomas/NDGF, Paolo/CNAF, Jhen-Wei/ASGC, Rolf/IN2P3, Rob/OSG; Jose/CMS).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s
      • TRIUMF-LCG2: transfer failures (SCRATCHDISK) due to missing user files. GGUS:78468, GGUS:78245. Understood: a list of the 180 files declared as bad to DDM was created, so they should soon be deleted from the LFC.
      • IN2P3-CC: ~25k file transfer failures (both to and from the site) with "failed to contact on remote SRM". GGUS:78547 filed at ~4:40. Also some related production job failures (stage-in/out). The SRM server had crashed; it was restarted successfully at ~6:30.

  • CMS reports -
    • T1 sites:
      • GGUS:78555 ASGC tape migration issue acted upon by site. Data flowing now.
      • GGUS:78548 IN2P3 SRM problems. SRM server restarted after problems with dCache

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • After the Oracle upgrade there were some issues with conditions replication to the Tier1s, fixed around 21h.
    • T1
      • PIC: since Monday a high percentage of jobs fail when trying to upload the output data. Probably due to failed authentication with LFC. Under investigation in collaboration with the site.
      • IN2P3: some failed transfers to IN2P3 this morning from 4am to 8am. Currently working fine. [Elisa: this is the same issue affecting also CMS and ATLAS, any news? Rolf: still investigating, will report at the meeting tomorrow.]

Sites / Services round table:

  • Burt/FNAL: ntr
  • Michael/BNL: ntr
  • John/RAL: two scheduled downtimes tomorrow, one for SRM DNS in the morning and one for CASTOR in the afternoon
  • Tobias/KIT:
    • ATLAS conditions DB move to 11g went ok, replication and Frontier have also been restarted. [Alessandro: was FTS also moved to 11g? Tobias: no, FTS was not moved to 11g yet.]
      • [Question from Dario Barberis after the meeting: Frontier at KIT was off for 24 hours for the DB migration, why so long? Tobias: the downtime (including Streams and Frontier) took longer than expected because after the upgrade to 11g not all nodes came up. Additionally, we had to contact a firewall expert outside normal working hours because the firewall settings had to be adapted. Also, our database expert waited to put the Squids back online until the replication of the DB was finished this morning.]
    • GGUS release went ok, tests alarms were sent and received successfully.
  • Gonzalo/PIC: ntr
  • Alexander/NLT1: ntr
    • [Alessandro: noted some inconsistencies in ATLAS: FTS has downtimes in GOCDB both today and one week from now, is anything wrong? Alexander: the intervention was initially planned for today, then moved to next week, but we were unable to remove today's downtime from GOCDB. We have notified GOCDB of the problem. Alessandro: please copy me in CC on that ticket. MariaDZ: please also copy me on the ticket; there have been similar issues with GOCDB in the past, but it is not clear nowadays how these issues should be followed up (Savannah, emails, ...).]
  • Thomas/NDGF: the network link to one site was down for 1h yesterday, then again for less than 1h this morning; some data was not reachable. The network providers have investigated and will change a faulty card later today.
  • Paolo/CNAF: ntr
  • Jhen-Wei/ASGC: ntr
  • Rolf/IN2P3: nta
  • Rob/OSG: ntr

  • John/Storage:
    • CASTORLHCB: the database upgrade Thursday morning may be cancelled and rescheduled. [Elisa: please inform us as soon as you know whether or not the intervention tomorrow will be cancelled.]
    • EOSCMS: update on Wed: cancelled, will be rescheduled once issues resolved
    • EOSATLAS: two crashes this morning may have resulted in some files being unavailable. The user triggering these crashes has been temporarily blocked. [Alessandro: actually there was a bug in EOSATLAS affecting the operations of this user, who was behaving correctly. Blocking the user was the only workaround for this bug, but the user was not at fault for the problem.]
  • Luca/Database: ntr
  • Eddie/Dashboard: ntr
  • Ignacio/Grid: ntr

AOB:

  • (MariaDZ) GGUS Release today was successful. 13 test ALARM tickets were issued to the Tier0, the European Tier1s and Taiwan_LCG2. All are in status 'solved'. Drills showed that operators, scripts or supporters (according to each site's procedures) reacted appropriately and promptly to the test. Thanks to all involved. American sites' tests are coming up, as per the timezone agreements. NB! A method to handle concurrent ticket editing is introduced with this release. Details in: https://ggus.eu/pages/news_detail.php?ID=452. [MariaDZ: in this new GGUS release there is a new ticket category for test alarms, which can be useful for reporting. Andrea: until now test alarms were included in the summary table and plots (not the detailed drills) for the MB operations report, will this still be the case? Maria: good point, the new release makes it possible to exclude these tickets. Maria will follow up with the SCODs to ask what the policy should be from now on.]

Thursday

Attendance: local (Andrea, Philippe, Eddie, Yuri, Elisa, MariaDZ); remote (Michael/BNL, Jeremy/GridPP, Ronald/NLT1, Thomas/NDGF, Gareth/RAL, Gonzalo/PIC, Elizabeth/OSG, Marc/IN2P3, Xavier/KIT, Jhen-Wei/ASGC, Paolo/CNAF; Jose/CMS).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s
      • IN2P3-CC: GGUS:78547 solved. SRM problem fixed. [Marc: we have a few hints about what went wrong; there was a peak load at the same time as the database backups were running. Now fixed, anyway.]
      • NDGF-T1: production job failures (issue with downloading input files). GGUS:78615 in progress since ~8pm yesterday. The network link to one of the subsites went down. [Thomas: yesterday a faulty network card was changed to fix the issues observed the previous day, but this did not help. Further investigation showed that the fibre was probably dirty, but this could not be fixed yesterday. Then this morning at 8.15am, without any intervention, the link came up again. This is not understood and will be followed up; the network provider will monitor the system until tomorrow.]

  • CMS reports -
    • Site Status Board unavailable: GGUS:78605 TEAM ticket. Not critical (I decided not to trigger a GGUS ALARM ticket) since we could continue monitoring site status using the Dev SSB instance. Dashboard team investigating how to restore the information that was reported during the gap. Dashboard support did not get alarms; investigating why that happened, to make sure it won't happen in the future. [Eddie: the virtual machine hosting the CMS SSB, maintained by IT-PES, went down. We only found out this morning so could not fix it earlier. Now investigating with PES what went wrong in the alarm flow.]

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • Upgrade of the Castor databases this morning. Finished? [Elisa: it was meant to finish at 1pm; will follow up with IT-DB whether all went OK.]
    • T1
      • PIC: the problem with uploading output data (mentioned yesterday) has been tracked down to failed connections to the PIC LFC catalogue. On some worker nodes, faulty drivers were causing a timeout in the connection; they have been replaced.
      • IN2P3: this morning a high percentage of failed MC jobs, which were identified as stalled. Under investigation (no ticket opened so far). [Marc: what login was used? Elisa: all jobs were running as lhcbpilot; can give more details offline if needed.]

Sites / Services round table:

  • Michael/BNL: ntr
  • Jeremy/GridPP: ntr
  • Ronald/NLT1: ntr
  • Thomas/NDGF: nta
  • Gareth/RAL:
    • updating all SRMs; did the Castor "gen" instance this morning and announced the intervention on the ATLAS one for next Monday in GOCDB
    • had planned a change in the Castor information provider to the BDII, this has been postponed to next week
  • Gonzalo/PIC: LHCb problem is still a consequence of the power cut on Sunday: some nodes were reinstalled with a wrong network driver version, now fixed
  • Elizabeth/OSG: ntr
  • Marc/IN2P3: dCache/SRM is stable now after yesterday's problems; will not do a detailed investigation because the dCache installation will undergo a major (software and hardware) upgrade in two weeks
  • Xavier/KIT: ntr
  • Jhen-Wei/ASGC: a hardware failure affected CMS transfers, now recovered
  • Paolo/CNAF: ntr

  • Eddie/Dashboard: nta
  • Philippe/Grid: ntr

AOB:

  • (MariaDZ) The suggestion to omit 'test' GGUS tickets from reporting is now recorded in Savannah:125862. Please note this month's Did you know?... It explains who will receive which email notification depending on their role in the GGUS ticket (submitter, supporter, subscriber, in Cc, in 'Assign to one person', in 'Involve others', in site contact list).

Friday

Attendance: local (Andrea, Massimo, Maarten, Ignacio, Eva, Eddie, Yuri, Elisa); remote (Michael/BNL, Gonzalo/PIC, Xavier/KIT, Mette/NDGF, Jhen-Wei/ASGC, John/RAL, Lisa/FNAL, Onno/NLT1, Rob/OSG, Rolf/IN2P3; Jose/CMS).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
    • T1s
      • FZK-LCG2: file transfer failures: "connection reset by peer". GGUS:76658 solved. Seems to have been a temporary issue; transfers succeeded again the same day.
      • RAL: a new FTS version was successfully tested on Jan 26 for several sites in the UK. Elog:33302,33311,33314. [Maarten: is it FTS 2.2.8, and in that case is it the gLite or the EMI version of 2.2.8? John: confirmed, it is 2.2.8 from EMI. Massimo: is this different software with the same name, or different packaging of the same software? Maarten: it is the same software, but with very different packaging.]

  • CMS reports -
    • NTR. No site effects of network glitch at CERN this morning.

  • ALICE reports -
    • The CVMFS repository for ALICE has been emptied and the latest version of the CVMFS init scripts used at sites no longer mounts that repository. This should prevent further occurrences of old AliEn versions interfering with ALICE jobs, as observed at RAL earlier this week. A broadcast was already sent yesterday asking sites to correct their CVMFS configuration where needed (a minimal check sketch follows this item). [Ignacio: still using BitTorrent then? Maarten: yes.]
    • [Maarten: ALICE was also probably affected by the network problems at CERN - at least, some jobs failed during that time.]
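    • A minimal, illustrative sketch (not an official ALICE or CVMFS tool) of the kind of check a site could run to see whether the ALICE repository is still configured locally. It assumes the conventional /etc/cvmfs/default.local file and the CVMFS_REPOSITORIES key; actual paths and key names may differ per site, and any removal or reload would be done with the site's usual configuration tools.

      #!/usr/bin/env python
      # Illustrative sketch only: report whether an 'alice' repository is still
      # listed in the local CVMFS configuration. Assumes the conventional
      # /etc/cvmfs/default.local file and the CVMFS_REPOSITORIES key (adjust for
      # the actual site setup).
      import re
      import sys

      CONFIG = "/etc/cvmfs/default.local"

      def listed_repositories(path):
          """Return the repository names found in CVMFS_REPOSITORIES, if any."""
          try:
              with open(path) as cfg:
                  for line in cfg:
                      match = re.match(r'\s*CVMFS_REPOSITORIES\s*=\s*"?([^"\n]*)"?', line)
                      if match:
                          return [r.strip() for r in match.group(1).split(",") if r.strip()]
          except IOError:
              pass  # no local override file: nothing explicitly configured here
          return []

      if __name__ == "__main__":
          repos = listed_repositories(CONFIG)
          if any(r == "alice" or r.startswith("alice.") for r in repos):
              print("WARNING: an ALICE repository is still configured in %s" % CONFIG)
              sys.exit(1)
          print("OK: no ALICE repository configured in %s" % CONFIG)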

  • LHCb reports -
    • Experiment activities
      • MC11: Monte Carlo productions
    • T0
      • Yesterday Castor databases upgrade. All went fine
      • This morning about 30% of the jobs completed and were waiting to upload their job logs. This might be related to the general network problem of this morning, reported in the IT Service Status Board. Later during the day the situation returned to normal
    • T1
      • PIC: issue solved after replacement of the faulty drivers. No significant job failure rate during the last day

Sites / Services round table:

  • Michael/BNL: ntr
  • Gonzalo/PIC: ntr
  • Xavier/KIT: announcing a downtime for the LHCb LFC and conditions DB on Feb 1
  • Mette/NDGF: ntr
  • Jhen-Wei/ASGC: ntr
  • John/RAL: ntr
  • Lisa/FNAL: ntr
  • Onno/NLT1: overview of 4 upcoming downtimes, all in GOCDB except the fourth one
    • Feb 1st: SARA Oracle; FTS & LFC-LHCb
    • Feb 1st night: warning Nikhef network maintenance, should be transparent
    • Feb 3rd: warning SARA network maintenance, should be transparent
    • Feb 7th: all Grid systems at SARA will be down, not in GOCDB yet
  • Rob/OSG:
    • GGUS alarm tests went ok
    • started automatic exchange of tickets with FNAL service-now (that replaced Remedy there)
  • Rolf/IN2P3: ntr

  • Massimo/Storage: two upcoming 'at risk' for Castor
    • ATLAS Mon 30, 9am-10am Castor config change, should be transparent
    • LHCb Tue 31, again transparent, to complete the tape gateway move (by the way, a factor 2 performance improvement was observed)
  • Ignacio/Grid: ntr
  • Eddie/Dashboard: ntr
  • Eva/Database: ntr

AOB: none

-- JamieShiers - 12-Jan-2012
