Week of 100308

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Availability SIRs & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Attendance: local(Maria, Nicolo, Jamie, Andrea, Jacek, Graeme, Alessandro, Patricia, Roberto, Jean-Philippe, Majik, Timur, Jan, MariaDZ, Gavin);remote(Jon/FNAL, Gonzalo/PIC, Rolf/IN2P3, Michaela Lechner (NDGF), Gareth/RAL, Michael/BNL, Angela/KIT, Ron/NL-T1, Gang/ASGC, Rob/OSG).

Experiments round table:

  • CMS reports -
    • T1 Highlights:
      • MC processing and ReReco completing at T1s: FNAL, RAL, PIC, KIT
        1. cmsHotDisk service class deployed at RAL to solve access issues to hot input files, adjusting optimal number of replicas of hot files.
      • IN2P3: SRM SAM test failures Sunday 7th, disappeared after ~4 hours.
      • CNAF: CE SAM test failures - LSF master dying, batch system admins investigating.
    • T2 highlights
      • MC ongoing in all T2 regions.
        1. Some jobs seem lost in communication problem between CREAM CEs and WMS (status Done on CREAM but Running on WMS). Update: known WMS/ICE bug, patch already in certification Savannah #61405.
        2. Several workflows failing on all T2s due to operator mistake (requested SLC4 CMSSW architecture for jobs running on SLC5 CMSSW version), resubmitting corrected workflows.
        3. SAM test failures at T2_BR_UERJ, T2_UK_London_IC, T2_RU_JINR
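The CREAM/WMS discrepancy above (jobs Done on the CREAM CE but still Running on the WMS, Savannah #61405) is essentially a status-reconciliation problem. A minimal illustrative sketch of such a cross-check is below; it is not the WMS/ICE code itself, and the status names and job IDs are assumptions for illustration:

```python
# Illustrative sketch (not the actual WMS/ICE implementation): find jobs
# whose CREAM status and WMS status disagree, as in the reported bug where
# a job is terminal on the CREAM CE but the WMS still considers it active.

def find_stuck_jobs(cream_status, wms_status):
    """Return job IDs that CREAM reports as terminal while the WMS still
    considers them active. Both arguments map job ID -> status string."""
    terminal = {"DONE-OK", "DONE-FAILED", "ABORTED"}   # assumed CREAM states
    active = {"Running", "Scheduled", "Ready"}         # assumed WMS states
    return sorted(
        job_id
        for job_id, status in cream_status.items()
        if status in terminal and wms_status.get(job_id) in active
    )

# Hypothetical sample data for illustration only.
cream = {"job1": "DONE-OK", "job2": "REALLY-RUNNING", "job3": "DONE-OK"}
wms = {"job1": "Running", "job2": "Running", "job3": "Done"}
print(find_stuck_jobs(cream, wms))  # ['job1']
```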

  • ALICE reports - GENERAL INFORMATION: Production dominated by the user analysis jobs (also a small fraction of jobs running at the T0 for Pass1 reconstruction)
    • T0 site
    • T1 sites - Minimum activity at these sites for the moment
    • T2 sites
      • Configuration of the new Hiroshima services in the ALICE LDAP finished this morning. Site admin informed about the operations. Site back in production.
      • The last two French sites (IPNL and Grenoble) which were still not providing a CREAM-CE system for ALICE are now setting up this service. When this operation is finished, one of the current dedicated WMSes for ALICE in France will be removed.

  • LHCb reports - MC productions running at sustained regime w/o major problems (9-10K jobs concurrently)
    • T0 sites issues:
      • CASTOR upgrade to 2.1.9-4 + SRM upgrade to 2.8-6 went w/o problems to report
      • Yesterday some problems with the merging jobs at CERN because input data was not available (RT open). Today it became available again.
    • T1 sites issues: Noticed that the CNAF CE was failing the critical job submission SAM test on Sunday morning (all CEs affected; problem with LRMS local submission); IN2P3 also had the critical unit test for SRM failing on Sunday.
    • T2 sites issues: Shared area issue at IN2P3-LPC; jobs failing at Pisa.

Sites / Services round table:

  • PIC - had a GGUS ticket opened by ATLAS - team ticket - reported that some jobs fail as they do not find a pool xml file catalog in s/w area. s/w area seems ok - but file is not there. s/w installation problem? GGUS #56328. Tomorrow am scheduled downtime for FTS 2.2.3 upgrade - will coincide with stress test.
  • FNAL - ntr
  • NDGF - ntr
  • IN2P3 - had some problems with BDII over w/e. Known performance problem of SL5 BDII. Might explain SRM difficulties. Graeme - don't think so as this was a transfer stage failure with error from gridftp itself.
  • ASGC - last week we had many errors "could not load client credentials". Deleted credentials from disk. Applied cron to do this before upgrading to FTS 2.2.3. No date yet...
  • RAL - ntr
  • BNL - ntr for T1. GGUS ticket 55101 for NDGF not accepting proxy from BNL for LFC. Sent by mistake? Manually dispatched. Problem lies with NDGF. They have not uploaded most recent LCG VOMS certs so proxy not accepted. NDGF - trying to find someone who wants to test - installed newest files. Don't have RPM based installation. Please can someone test this? From our view should work now. Graeme - will try to find someone with DoE cert to test. Michael - will also look at it.
  • KIT - ntr
  • NL-T1: on Saturday evening a power outage in East Amsterdam which also involved NIKHEF part of NL-T1. All WNs turned off as a result - all should now be back ok. Tomorrow morning at 06:30 a router will be replaced which means Internet connectivity briefly interrupted.
  • OSG: AHM for OSG at FNAL this week so most people at FNAL for this. Send mail to me (Rob) for any issues.

  • CERN: CERN site linux upgrade proceeding today. All will be upgraded by end of day.

  • CERN DB: problem with replication of LFC for LHCb to RAL. Looks like replica was manually updated and hence replication failed. Data should be R/O but not enforced.

  • CERN - SRM will roll forward at 09:00. As from 09:30 ATLAS will drive data (large first) and 30' later small files. Planned rollback at 16:00. T1s should be prepared to get data from CERN during this time. Rollback earlier in case of problems. Known problem with this SRM release which would affect CMS (srm cp)


  • USATLAS - would like sites to publish downtimes under OSG resource group names. Details - see Alessandro.


Attendance: local(Elisa, Jamie, Maria, Graeme, Gavin, Cedric, Majik, Nicolo, Harry, Jean-Philippe, Jacek, Andrea, Alessandro, Lola, Stephane, Andrew, Miguel, Julia, Timur, MariaDZ);remote(Jon/FNAL, Michael/BNL, Gang/ASGC, Ronald/NL-T1, Gareth/RAL, Rob/OSG, Pepe/PIC, Angela/KIT, Michaela/NDGF, Gonzalo/PIC, Rolf/IN2P3).

Experiments round table:

  • ATLAS reports -
    • CERN
      1. Problematic files reported at yesterday's meeting actually transferred ok after a few hours, so no ticket was submitted.
      2. There is a problem with one RAW file which has been unavailable for 32 hours, but not migrated to tape (https://gus.fzk.de/ws/ticket_info.php?ticket=56293). [ Jan - 20 files missing, similar symptoms as the other disk server. Files should have migrated to tape but the box died before this could happen. Graeme - the call back from T0 to the SFOs was only to release files when the migration bit is set - not the case for these (FILE CLASS TAPE). ]
      3. Disk server MIA: https://gus.fzk.de/ws/ticket_info.php?ticket=56184. News? [ Jan - no news, still being investigated. ]
    • T1
      1. LFC upgrade at TRIUMF, no problems observed.
      2. Network upgrade at PIC, no problems observed.
      3. CNAF reduced transfer efficiency for a few hours last night: https://gus.fzk.de/ws/ticket_info.php?ticket=56284. Recovered by ~00:00 UTC.
      4. Enabled checksums for INFN-T1 in FTS, but problems observed: https://gus.fzk.de/ws/ticket_info.php?ticket=56296.
      5. Production and analysis offline at SARA for two days of interventions.
    • T2
      1. INFN-MILANO back in production.
      2. LSPC back in production.
      3. RRC-KI still being validated.
      4. Storage problems at CA-SCINET-T2, https://gus.fzk.de/ws/ticket_info.php?ticket=56286.
    • SRM tests - look ok. Have to decide if we continue tests overnight, tomorrow, Thursday etc and cross check with LHC schedule.

  • CMS reports -
    • T1 Highlights:
      • MC processing and ReReco completing at T1s
      • PIC: some MC datasets from T2s subscribed for custodial storage before tape families were set up, ticket open to track correct tape migration.
    • T2 highlights
      • MC ongoing in all T2 regions.
      • SAM test failures at T2_FR_IPHC, T2_IN_TIFR, T2_IT_Bari
    • [Data Ops]
      • Tier-0: mainly testing (CASTOR test). Higher-rate tests expected this morning [O(1 kHz) in stream A (tracker out)]. Event sizes smaller than usual. Online to ensure the Express rate is reasonable.
      • Tier-1: move back to backfill as MC and other processing is finishing. Validation of 3_5 for the large-scale reprocessing is ongoing as well.
      • Tier-2: large scale MC production. We will be using 35X for the first time for full MC production.
    • [Facilities Ops]
      • CMS preparing for first 900 GeV collisions.
      • Follow-up on Tape Families setup in ASGC to bring the site to Operations before the run. Some 2009 Cosmic Data to be transferred in custodial/non-custodial mode to validate new Tape Families.
      • Follow-up on T1_DE_KIT transfer issues with a few T2 sites and the creation of new FTS channel for low throughput links.
      • The CMS SAM submission client has been migrated to a SLC5 VOBox.
      • The Job Robot will probably follow today. [ Andrea - done ] For this reason, the submission of jobs from the old SLC4 VOBox was stopped yesterday, 8 March, to let already-submitted jobs drain.
      • Today there is a CMS Computing Shift (CSP) Tutorial dedicated to the Cukurova Institute (Turkey)... Agenda here. More tutorials to follow.
      • CMS week: 15-19 March 2010 at CERN; Indico agenda here. According to the current schedule, on Monday 15 March 2010 the CMS Computing Operations Meeting takes place as normal.

  • ALICE reports - GENERAL INFORMATION: Production dominated by two RAW reconstruction cycles and some unscheduled user analysis jobs
    • T0 site
      • Reconfiguration of the WLCG ALICE VOBOXES performed this morning. This reconfiguration has been done to ensure a stable and redundant setup during the data taking:
        • Single VOBOX for WMS backend submissions
        • Two VOBOXES for CREAM backend submissions
        • One VOBOX (voalice14) taken out of production. This machine will be used as development machine only
    • T1 sites - Several T1 sites participating in the Pass2 reconstruction running today with no issues to report (small fraction of jobs)
    • T2 sites - no new issues to report

  • LHCb reports - Small productions running at a very low rate; occasional failures mostly due to application-related issues. Last week's production finished quite smoothly... just the merging of smallish files was causing some trouble in the DIRAC logic.
    • T0 sites issues:
      • Some merging jobs failed at CERN on Saturday and Sunday because of input data temporarily not available. On Monday the problem had disappeared.
    • T1 sites issues:
      • RAL: Apply process of the Streams replication to LFC@RAL was failing yesterday. This seems to be due to a spurious entry in the RAL DB (read-only) that was introduced manually. This is something that should not happen at all. [ Now under control and fixed. ]
    • T2 sites issues:
      • Continuing the investigation of the data-upload problem from several UK sites to CERN. Close collaboration with Sheffield and Glasgow people.

Sites / Services round table:

  • FNAL - as Nicolo said FNAL received request to replicate hot files - 6 files - replicated to >100 pools on different nodes. Appears to be working.
  • BNL - ntr
  • NL-T1 - ntr
  • ASGC - ntr
  • RAL - have disk server out of action, part of ATLAS MC DISK. Hope back later. STREAMS replication problem. Some investigations - looks like additional record arrived due to regional Nagios monitoring (?!) being followed up.
  • KIT - ntr
  • IN2P3 - ntr
  • NDGF - problem yesterday has been solved.
  • PIC - FTS upgrade had to be cancelled and will be done next Wed/Thu. Today PIC is closed due to snow storm yesterday!
  • OSG - ntr

  • CERN: Monthly Linux upgrade pulled in the latest gLite-WN software which caused a service incident on the LCG-CEs affecting jobs which were submitted to SLC5 worker nodes. See SIR.

  • CERN: We will update the LSF version on the batch service to a new patch version (7.0.5 -> 7.0.6). We will also reboot the master batch nodes to pick up a new kernel. There will be two short periods (up to 5 minutes) when no new jobs can be submitted. Running jobs will not be affected. Andrea - does this do anything regarding high rate of jobs aborting due to Maradona error? Gav - working with Platform another patch will be applied soon.

  • CERN - The MIA disk server is back and being drained. The vendor will have to work on it.

  • CERN Dashboards - update of ATLAS DDM monitoring today. Downtime very short - a few minutes.

AOB: (MariaDZ) Just a reminder of the ALARM test timezones for tomorrow, as people asked yesterday: in the afternoon of the GGUS release date, tests are initiated by the GGUS developers to the Tier1s, as part of the service verification procedure, in 3 slices:

  1. Asia/Pacific right after the release,
  2. European sites early afternoon (~12:00 UTC),
  3. US sites and Canada late afternoon (~18:00 UTC).
Please put alarm test results/comments in https://savannah.cern.ch/support/?112668.

  • Nico - noticed CNAF has completed FTS 2.2.3 upgrade.


Attendance: local(Edoardo, Patricia, Majik, Elisa, MariaDZ, Andrea, Jamie, Przemek, Harry, Timur, Jean-Philippe, Nicolo, Miguel, Graeme, Simone, Ricardo, Nilo, Jan, Alessandro, Cedric, Giuseppe);remote(Jon/FNAL, Michael/BNL, Gonzalo/PIC, Rob/OSG, Angela/KIT, Tiju/RAL, Rolf/IN2P3, Gang/ASGC, Onno/NL-T1).

Experiments round table:

  • CMS reports -
    • T0 Highlights: Replay running.
    • T1 Highlights:
      • MC processing and ReReco completing at T1s
        1. RAL - jobs failing for permission issue on new cmsHotDisk service class, now fixed.
      • Starting preproduction for next processing round.
        1. FNAL - jobs failed for incorrect architecture specification in workflow
        2. IN2P3 - some jobs failed because they were submitted to IN2P3-CC-T2 CE - resubmitted to T1 CE, running.
      • CNAF: since the upgrade to FTS 2.2.3, ~50% of CMS T2 site contacts are no longer authorized to submit transfer jobs to the CNAF FTS [ Should now be corrected - incorrect mapping in the gridmapfile. ]
        1. glite-transfer-submit exits with 'You are not authorised to submit jobs to this service'
    • T2 highlights: MC ongoing in all T2 regions.
    • Other: SAM test history not updating in Dashboard for a few hours - probably SAM programmatic interface not returning results used to import data in Dashboard.
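The CNAF authorization failure above came down to an incorrect mapping in the gridmapfile. As a rough illustration of what such a check involves, the sketch below parses grid-mapfile entries of the usual `"<DN>" <account>` form and looks a DN up; it is not the FTS authorization code, and the DNs and account names are invented for the example:

```python
# Illustrative sketch, not the FTS authorization logic: parse a
# grid-mapfile and check whether a user DN is mapped to an account.
# A missing or wrong mapping yields exactly the "not authorised to
# submit jobs" symptom reported for CNAF.
import shlex

def parse_gridmap(text):
    """Return a dict mapping DN -> account from grid-mapfile text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = shlex.split(line)  # shlex handles the quoted DN
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping

# Hypothetical file contents for illustration.
sample = '"/DC=ch/DC=cern/OU=Users/CN=alice" cms001\n"/DC=org/CN=bob" cms002\n'
grid_map = parse_gridmap(sample)
print("/DC=org/CN=bob" in grid_map)  # True
```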

  • ALICE reports - GENERAL INFORMATION: Massive MC production is foreseen at the end of this week. For the moment the current production is dominated by the Pass1 and 2 reconstruction tasks and some user analysis jobs (small proportion in comparison with the reconstruction jobs) [ About 8000 user analysis jobs, not using analysis train, seen just before meeting. ]
    • T0 site: The T0 is currently running most of the jobs due to the Pass1 reconstruction activities. Stable behavior of the services after the redistribution of VOBOXes performed yesterday morning. Both CERN CREAM-CEs are now in production (2 VOBOXes), as is the LCG-CE (one VOBOX).
    • T1 sites: Responsible at this moment of most of the Pass2 reconstruction jobs, nothing special to report
    • T2 sites: Clermont (France): The new VOBox provided by the site, deployed to submit to the local CREAM-CE, has not yet been validated. Problems with the local software area are preventing the correct behaviour of the PackMan service and the installation of AliEn at the site. Following up on the issue with the site admins.

  • LHCb reports - No activity yesterday, system pretty idle.
    • T0 sites issues: Tomorrow morning a 90-minute downtime [ from 08:30 ] on all central DIRAC machines for an intervention on the network service f513-c-ip169-shpyl-10.
    • T1 sites issues: CNAF: after yesterday's upgrade of StoRM, the SAM unit test for SRM keeps failing, affecting the whole site availability. The reason is not yet clear.
    • T2 sites issues: Continuing the investigation of the data-upload problem from several UK sites to CERN. Close collaboration with Sheffield and Glasgow people.

Sites / Services round table:

  • FNAL - ntr
  • BNL - outage under investigation. Not caused by SRM. Combination of SRM and pnfs server. Gained more details of nature of problem - hopefully find a fix soon.
  • PIC - FTS migration to 2.2.3, cancelled due to snow, has been rescheduled for Thursday 18th March. Seeing quite a lot of spurious errors from the SAM test for LHCb - the LFC catalog test which checks Streams. Intermittent errors. Triggering alarms to the manager on duty - they risk ignoring alarms if they are too frequent. Elisa - most probably due to the intervention at CERN to fix the problem at RAL. Maria - problem with most T1s plus another problem with RAL. Elisa - just one hour after the intervention...
  • IN2P3 - ntr
  • KIT - ntr
  • RAL - ntr
  • NL-T1 - SARA FTS/FTA downtime still in progress. Tomorrow's SARA SRM downtime is cancelled. Graeme - rescheduled? Onno - will be rescheduled for the week after next, during the LHC machine stop two weeks from now - we will use that machine stop for the maintenance. Graeme - have to be quite agile with the schedule!
  • ASGC - ntr
  • OSG - ntr

  • CERN DB - yesterday evening one application inserted tons of rows in the ATLAS online DB - big impact on Streams replication: 6 hours of delay online->offline. Can also impact propagation to Tier1s. Investigation under way.

  • CERN batch intervention this morning "transparent".

  • CERN - 90% of files recovered from the 2 disk servers mentioned. If you have any of those files (the remaining 10%) please let us know.


  • Reminder of tomorrow's Tier1 Service Coordination meeting.

  • LHC - All beams except LHC beams will be stopped on Friday (12 March) at 18:00. All LHC-type beams will be stopped on Monday (15 March) at 05:00 to respect the 3h radiation cool-down imposed by RP. Access will be given as from 08:00. Restart of the machines is planned for Tuesday (16 March) as from 18:00. Slot for 450 GeV collisions to be redefined.


Attendance: local();remote().

Experiments round table:

Sites / Services round table:



Attendance: local();remote().

Experiments round table:

Sites / Services round table:


-- JamieShiers - 04-Mar-2010

Topic revision: r7 - 2010-03-10 - JamieShiers