Week of 101213

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO summaries of site usability: ALICE ATLAS CMS LHCb
  • SIRs, open issues & broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS information: GgusInformation
  • LHC machine information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alessandro, Cedric, David, Dirk, Gavin, Harry, Jacek, Jamie, Maarten, Maria D, Maria G, Roberto, Simone, Ulrich);remote(Gareth, Gonzalo, Jon, Luca, Michael, Peter, Rob, Rolf, Ron, Stephen, Suijian, Tore, Xavier).

Experiments round table:

  • ATLAS reports -
    • NDGF tape issues, ticket GGUS:65013, 1103 files unavailable
    • Simone: this week there will be commissioning tests for full T2-T2 transfer mesh
    • Simone: no big plans for Xmas break, just normal background activities

  • CMS reports -
    • Experiment activity
      • Shutdown activities
      • Stephen: reprocessing is ending, skimming ongoing; during the Xmas break there could be a big MC production, but no request has been received yet
    • CERN and Tier0
      • Still working on recovering CVS repository.
      • Tier-0 Idle
    • Tier1 issues and plans
      • Rereco in process.
    • Tier-2 Issues
      • Nothing to report.

  • ALICE reports -
    • T0 site
      • Production and analysis downtime Dec 15-17 for the deployment of the new AliEn v2.19 at CERN, the T1s and some T2s.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Xmas plans: run huge MC productions for physics studies. This is pending the removal of old MC data at all centers (the decision is to archive them in a T1D0 service class at CERN).
      • Reprocessing: at 98.4%. The remaining jobs are only due to an internal DIRAC problem: jobs are not being recreated for unprocessed files.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1 site issues:
        • NTR

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • will replace disk back-end for storage name server Wed morning, 2h downtime, will drain queues beforehand
  • CNAF
    • downtime Tue for ATLAS SRM upgrade
  • FNAL - ntr
  • IN2P3
  • KIT - ntr
  • NDGF
    • downtime for SRM head nodes Tue morning
    • ATLAS ticket in progress
  • NLT1
    • 4h maintenance Tue to move Oracle RAC back to original hardware
  • OSG
    • 3rd machine foreseen to be added to the OSG BDII on Tue, running BDII v5 (the others are still on v4); approval from USCMS is still needed (a query sketch for checking the new instance follows this round table)
  • PIC
    • downtime Tue to move PNFS servers to new hardware
  • RAL
    • Gareth: Gridview, and I think the experiments, have shown the RAL Tier1 as in maintenance (outage) from Saturday at 08:00 until today at around 11:30. In fact the site was up. We had declared a 'Warning' on the site for this time, but not an outage. I think (but am not certain) that this may be linked to the GOCDB reverting the 'Warning' state to 'At Risk'.
    • Gareth: the VO CE tests failed, but the other tests did not - why?
    • Gareth: will get in touch with GOCDB team
    • Peter: on Fri a ticket related to Oracle was opened for the GOCDB
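On the Gridview/GOCDB discrepancy above, the state GOCDB actually publishes for a site can be cross-checked with its programmatic interface. A minimal sketch, assuming the get_downtime method and topentity parameter of the GOCDB PI; the endpoint URL, the site name and the XML field names are assumptions, not details taken from these minutes.

#!/usr/bin/env python3
# Sketch only: list the downtimes GOCDB publishes for a site, to compare the
# declared severity (WARNING vs OUTAGE) with what Gridview shows.
# Endpoint URL, site name and XML field names are assumptions based on the
# GOCDB programmatic interface, not details taken from these minutes.

import urllib.request
import xml.etree.ElementTree as ET

GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"   # assumed PI endpoint
SITE = "RAL-LCG2"                                 # assumed GOCDB site name

url = f"{GOCDB_PI}?method=get_downtime&topentity={SITE}"
with urllib.request.urlopen(url, timeout=30) as resp:
    root = ET.fromstring(resp.read())

for dt in root.findall("DOWNTIME"):
    print(dt.findtext("SEVERITY"),             # e.g. WARNING or OUTAGE
          dt.findtext("CLASSIFICATION"),       # SCHEDULED / UNSCHEDULED
          dt.findtext("FORMATED_START_DATE"),
          dt.findtext("FORMATED_END_DATE"),
          dt.findtext("DESCRIPTION"))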

  • CASTOR - ntr
  • dashboards - ntr
  • databases
    • CMS online DB will be stopped this evening because of a power cut; it will be back on Thu
    • DB upgrade on Wed will affect TOTEM
  • GGUS - ntr
  • grid services
    • SLC4 batch has been closed
    • lxplus4 will be closed on Tue
    • remaining 2 LCG-CEs submitting to SLC4 will be drained
    • 2 of the 3 CREAM CEs had problems over the weekend: patches had been delayed until after the HI run; the nodes will probably be reinstalled to get rid of accumulated history
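Regarding the OSG BDII item above: once the third machine is added, whether it actually publishes data can be verified with a plain LDAP query against it. A minimal sketch using the standard ldapsearch client; the host name is a placeholder, while port 2170 and the mds-vo-name=local,o=grid base are the usual top-level BDII settings.

#!/usr/bin/env python3
# Sketch only: count the GlueSite entries a top-level BDII instance returns,
# as a quick sanity check that a newly added machine is publishing data.
# The host name below is a placeholder, not the real OSG BDII host.

import subprocess

BDII_HOST = "is-new.grid.example.org"   # placeholder
BDII_PORT = 2170                        # standard top-level BDII port

cmd = [
    "ldapsearch", "-x", "-LLL",
    "-H", f"ldap://{BDII_HOST}:{BDII_PORT}",
    "-b", "mds-vo-name=local,o=grid",
    "(objectClass=GlueSite)", "GlueSiteUniqueID",
]
result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
sites = [line for line in result.stdout.splitlines()
         if line.startswith("GlueSiteUniqueID:")]
print(f"{BDII_HOST} publishes {len(sites)} GlueSite entries")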

AOB:

Tuesday:

Attendance: local(Alessandro, Cedric, David, Harry, Lola, Maarten, Maria D, Massimo, Miguel, Roberto, Simone);remote(Cristina, Gonzalo, Jeremy, Jon, Michael, Rob, Rolf, Suijian, Tiju, Tore).

Experiments round table:

  • ATLAS reports -
    • Alarm ticket for Lyon, but the site is now in downtime.
    • BNL VOMS certificate problem at some sites (Milano, Napoli, IFIC); fixed now.
    • UKI-NORTHGRID-LANCS-HEP has some corrupted files, leading to many failing transfers in the UK cloud (ticket GGUS:65007). The files are being identified and will be corrected.

  • CMS reports -
    • Experiment activity
      • Shutdown activities
    • CERN and Tier0
      • Still working on recovering CVS repository.
      • Tier-0 Idle
    • Tier1 issues and plans
      • Rereco in process.
    • Tier-2 Issues
      • Nothing to report. Details of next MC production round (over Xmas) not there yet.
    • AOB: CRC of the week is Stefano. He cannot attend the call today; apologies.

  • ALICE reports -
    • T0 site
      • Yesterday, following Ulrich's suggestion, we took ce202 out of production until it was back in good shape. Since yesterday afternoon it has been back in production and performing well
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Fixed the internal problem in DIRAC and submitted the remaining jobs for reprocessing. It is no longer so clear whether there will be MC production during the Xmas break.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1 site issues:
        • RAL: found 38 corrupted files (checksum mismatch), due to a bug in CASTOR related to incompletely transferred files.
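Since the corruption above shows up as a checksum mismatch, re-checking the affected files after re-transfer amounts to comparing the Adler32 of the local copy with the value recorded in the catalogue. A minimal sketch; the file paths and expected checksums are placeholders, not the actual RAL list, and Adler32 is assumed as the checksum type.

#!/usr/bin/env python3
# Sketch only: recompute Adler32 checksums for a list of files and compare
# them with catalogue values. Paths and expected checksums below are
# placeholders, not the actual list of 38 corrupted files at RAL.

import zlib

EXPECTED = [
    ("/data/lhcb/prod/file1.dst", "0a1b2c3d"),   # placeholder entries
    ("/data/lhcb/prod/file2.dst", "deadbeef"),
]

def adler32(path, chunk=1 << 20):
    """Stream the file through zlib.adler32 in 1 MB chunks."""
    value = 1  # adler32 seed value
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            value = zlib.adler32(block, value)
    return f"{value & 0xffffffff:08x}"

for path, expected in EXPECTED:
    actual = adler32(path)
    status = "OK" if actual == expected.lower() else "MISMATCH"
    print(f"{status:8} {path}  catalogue={expected}  recomputed={actual}")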

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • reminder of tomorrow's (4h30m) downtime
  • CNAF
    • Purdue should react to GGUS:64771 - the proposal is to reduce the MTU size on a test machine to see whether that would solve the issue
    • Jon: they are unlikely to change their MTU, but he will contact them
    • Gonzalo: GGUS:64773 (affecting PIC) is also for Purdue; will add more info from e-mail thread
    • Rob: OSG is not involved with either ticket - should it be?
    • Alessandro: the network incident procedure was presented at the GDB - one of the two sites should be the owner, but a T1 might take precedence when the FTS is involved
    • Maria D: support for that procedure will be implemented in GGUS, but is on hold for approval by the MB (Savannah:115213)
    • Maria D: in general all info should be put into the ticket; the submitter or a supporter can select Purdue as the site and OSG would be automatically involved (done for GGUS:64773)
  • FNAL - ntr
  • GridPP
    • Lancaster file corruption under investigation; it seems hardware-related and most files seem to be OK
  • IN2P3
    • still in downtime, most services are back, except for FTS, LFC and VOMS
    • ATLAS alarm ticket was for another reason, details unknown
  • NDGF - ntr
  • NLT1 (added after the meeting)
    • Ron: the migration of the Oracle DBs to the original hardware, which was planned for today, has been cancelled due to a hardware failure on a switch. There was no outage today. We will postpone this action until the new year.
  • OSG
    • 3rd BDII machine will not be added today (USCMS could not test it), hopefully Tue next week
  • PIC
    • scheduled intervention to move PNFS servers to new HW is tomorrow (Wed), not today
  • RAL - ntr

  • CASTOR - ntr
  • dashboards - ntr
  • GGUS
    • see AOB

AOB: (MariaDZ) As KIT (the GGUS hosting institute) is closed from 2010/12/24 to 2011/01/02, please note that the TEAMers' and ALARMers' extract from VOM(R)S is now fully automated. Nevertheless, if VOs plan to enter extra members in these groups for the year end, it would be good to do so before the CERN closure, in case we need to follow up any unexpected issues. ALARM ticket processing will depend on T0 and T1 responsiveness (the notification process is automatic).

On the PIC-Purdue (GGUS:64773) and CNAF-Purdue (GGUS:64771) network issues, MariaDZ and RobQ suggested notifying the site (is it RCAC, Rossmann or Steele?) so that OSG(Prod) receives the assignment and all the information. Please copy the related information that circulated in e-mail threads into these tickets. On the general issue of network ticket handling via GGUS, the implementation will emerge from https://savannah.cern.ch/support/?115213.
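On the MTU test proposed for the Purdue transfer issues above: the usable path MTU towards a remote host can be estimated by sending ICMP probes with the don't-fragment bit set and shrinking the payload until the probes pass. A minimal sketch wrapping the Linux ping utility; the target host is a placeholder, not an actual Purdue machine.

#!/usr/bin/env python3
# Sketch only: estimate the usable path MTU towards a remote host by sending
# pings with the don't-fragment bit set (Linux iputils: ping -M do) and
# decreasing payload sizes. The target host below is a placeholder.

import subprocess

TARGET = "gridftp.example-purdue.edu"    # placeholder host
SIZES = [8972, 4052, 1972, 1472, 972]    # payloads; 1472 + 28 bytes = 1500 MTU

for size in SIZES:
    cmd = ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(size), TARGET]
    ok = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL).returncode == 0
    print(f"payload {size:5d} bytes ({size + 28:5d} on the wire): "
          f"{'passes' if ok else 'blocked or lost'}")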

Wednesday

Attendance: local(Cedric, David, Gavin, Harry, Jamie, Luca, Maarten, Maria D, Massimo, Miguel);remote(Cristina, Gonzalo, Joel, Jon, Onno, Rob, Rolf, Suijian, Tiju, Xavier).

Experiments round table:

  • ATLAS reports -
    • T1s:
      • Lyon extended its downtime yesterday, but has been back since ~22:30
      • NDGF tape problem seems to be fixed (GGUS:65013)
      • FZK/KIT had FTS problems during the night. Fixed this morning.
        • Xavier: the HW for the FTS is being replaced; the DNS round-robin setup needed a fix (a lookup sketch follows this report)
    • T2s:
      • Yesterday we started our Sonar tests of the full T2-to-T2 transfer matrix. Some errors have already been reported for particular T2-T2 links; we will have a better view tomorrow.
      • Corrupted files at UKI-NORTHGRID-LANCS-HEP (GGUS:65007) identified and declared to DQ2 for recovery.
    • Cedric: problem seen with ATLAS dashboard for site services around lunch time, now fixed?
      • David: workaround has been put in place, fix for proper DB caching under development
    • Maria D: ticket GGUS:65013 OK now?
      • Cedric: yes
    • Maria D: ticket GGUS:61440 (BNL-CNAF network issue)?
      • Cedric: update tomorrow
      • Cristina: performance OK as far as CNAF is concerned
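On the KIT FTS round-robin fix mentioned above: whether an alias resolves to the full set of back-end hosts can be checked by listing all the addresses it maps to. A minimal sketch; the alias name is a placeholder, not the real KIT FTS alias.

#!/usr/bin/env python3
# Sketch only: resolve a DNS round-robin alias and list every address it
# maps to, to verify that all intended back-end hosts are in the rotation.
# The alias below is a placeholder, not the real KIT FTS alias.

import socket

ALIAS = "fts.example-kit.de"   # placeholder round-robin alias

infos = socket.getaddrinfo(ALIAS, 443, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})

print(f"{ALIAS} resolves to {len(addresses)} address(es):")
for addr in addresses:
    # A reverse lookup helps spot a host left out after the hardware swap.
    try:
        host = socket.gethostbyaddr(addr)[0]
    except socket.herror:
        host = "(no PTR record)"
    print(f"  {addr:15}  {host}")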

  • ALICE reports -
    • T0 site
      • Deployment of the new AliEn v2.19 is proceeding smoothly. It will take three days. Services are stopped, so there is no activity on the GRID
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Reprocessing almost completely finished and its output merged.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 1
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1 site issues:
        • IN2P3: users are affected when accessing files in a disk pool that was disabled yesterday (GGUS:65290)
        • SARA: request to restore the original share (which had been exceptionally increased to cope with the merging backlog). It should be clear that in the long term NIKHEF/SARA should adjust the share internally (and transparently) in a more reasonable way (for example 50%/50%)
          • Onno: fair share issue will be discussed internally
        • RAL corrupted files: LHCb will look at the list provided and re-transfer them.
    • Joel: MC production over Xmas under study

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL
    • some users are requesting files from CNAF that have bad checksums; can CNAF look into Savannah:118151?
    • Cristina: OK
  • IN2P3
    • yesterday's downtime was extended because of an unforeseen problem in the migration of the Oracle DBs
    • the LHCb problem is due to a disk server that is offline with a bad network card; it will take some time to recover the files
  • KIT
    • network maintenance today led to dCache pools getting disabled and needing a restart; some transfers may have been affected
  • NLT1 - ntr
  • OSG - ntr
  • PIC
    • today's downtime should end at 15:00 UTC as planned
  • RAL - ntr

  • CASTOR - ntr
  • dashboards
    • downtime for ATLAS DDM dashboard for DB maintenance this morning
  • databases
    • ATLAS offline DB at risk: 2 nodes crashed, high load on other nodes
  • GGUS - ntr
  • grid services - ntr

AOB:

  • Friday morning the VOMRS server at CERN is foreseen to be disabled for the DTEAM VO, as part of the transition of DTEAM to EGI: https://wiki.egi.eu/wiki/Dteam_vo
  • Jeff (NIKHEF): Now that our CVMFS service is being used by all LHCb jobs, we added a second squid server. We have tested that CernVM FS fails over to this 2nd squid if we sever the network connection to the first one.
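A complementary way to exercise the failover described above, without severing a network link, is to ask each squid in turn for the repository manifest (.cvmfspublished) and check that every proxy answers. A minimal sketch; the proxy and server names are placeholders, not the actual NIKHEF setup.

#!/usr/bin/env python3
# Sketch only: check that each squid proxy in a CVMFS proxy group can serve
# the repository manifest (.cvmfspublished), so clients can fail over if the
# first squid becomes unreachable. Proxy and server names are placeholders.

import urllib.request

PROXIES = ["http://squid1.example.nl:3128", "http://squid2.example.nl:3128"]
MANIFEST = "http://stratum.example.org/cvmfs/lhcb.example.org/.cvmfspublished"

for proxy in PROXIES:
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    try:
        with opener.open(MANIFEST, timeout=10) as resp:
            print(f"{proxy}: OK ({len(resp.read())} bytes of manifest)")
    except OSError as exc:
        print(f"{proxy}: FAILED ({exc})")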

Thursday

Attendance: local(Alessandro, Andrea, Cedric, David, Gavin, Harry, Jacek, Jamie, Maarten, Maria D, Maria G, Massimo, Simone);remote(Cristina, Foued, Gareth, Jeremy, Joel, Jon, Michael, Pepe, Rob, Rolf, Ronald, Suijian).

Experiments round table:

  • ATLAS reports -
    • Central services :
      • ATLR was dead yesterday from 15:00 to 16:15, causing Panda server/DQ2 unavailability.
        • Jacek: we needed to replace one fibre channel switch, normally a transparent operation; it resulted in high I/O loads ending in reboots by the clusterware; the load moved to the 3 other nodes and looked OK, but then the DB got stuck; a Service Request was opened with Oracle; after all nodes were rebooted the DB was OK
      • ADCR testing ongoing with DQ2 and Panda. Current setup seems able to handle the load.
      • The problem of timeouts in callbacks to the Dashboard is still there. Some VOboxes have a large number of callbacks in this situation, which makes it difficult to correctly monitor the transfers for the clouds they serve.
        • David: BUG:76472 in work
        • Simone: Fernando will produce a patch, should be deployed on Friday
    • T1s :
      • No problems with the downtimes of yesterday.
    • T2s :
      • Sonar tests are ongoing, but because of the timeout problems mentioned above it is difficult to get a clear picture.
    • Cedric: major offline DB reorganization (ATLR --> ADCR) will require a downtime on Mon Jan 17 (https://twiki.cern.ch/twiki/bin/viewauth/Atlas/OfflineDBSplit)
    • Cedric: CNAF-BNL network ticket GGUS:61440 should be kept on the list

  • CMS reports -
    • Experiment activity
      • Shutdown activities
    • CERN and Tier0
      • Tier-0 down
    • Tier1 issues and plans
      • No issues
      • Rereco winding down. Another pass to be decided by Monday.
    • Tier-2 Issues
      • No issues
      • Fall MC production coming to an end. Getting ready for Winter10 MC.
        • Pepe: MC production, skimming and rereco expected at some of the T1s

  • ALICE reports -
    • T0 site
      • Deployment of new AliEn v2.19 ongoing, no activity on the GRID.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports -
    • Experiment activities:
      • Reprocessing almost completely finished and its output merged. MC production ready to start after the final tests.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1 site issues:
        • nothing reported
    • Joel: how to install SW at IN2P3 in the Xmas break - through SAM job or via GGUS ticket?
      • Rolf: we will discuss it with our local LHCb support and inform you offline
    • David: the Conditions DB SAM tests failed at NIKHEF, PIC and SARA, but the errors are more likely due to issues with those tests themselves

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • GGUS:65290 (disabled LHCb pool node) solved, all files accessible again
    • Savannah:118357 (CMS reprocessing jobs aborting): why are updates in that ticket not bridged to the corresponding GGUS:65348 ticket?
      • Maria D: will look into this
  • KIT - ntr
  • NLT1
    • ALICE VOBOX at SARA needed to be restarted after running out of memory
  • OSG - ntr
  • PIC
    • yesterday's downtime was not properly reflected in the SAM test results
      • Maarten: please open a GGUS ticket for the SAM team
  • RAL - ntr

  • CASTOR - ntr
  • dashboards
    • see ATLAS and LHCb reports
  • databases
    • see ATLAS report
    • Jamie: was the problem with reboots due to logrotate understood?
      • Jacek: will check
  • GGUS - ntr
  • grid services - ntr

AOB:

Friday

Attendance: local(Alessandro, Cedric, Dirk, Edoardo, Harry, Ignacio, Jacek, Jamie, Lola, Maarten, Simone);remote(Alessandro I, Andrea, Elizabeth, Gonzalo, Joel, Jon, Kyle, Michael, Onno, Rolf, Stefano, Suijian, Tiju, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • Central services :
      • The fix for the problem of callbacks to the Dashboard is available. It will be applied manually on the VOboxes.
    • T1s :
      • Consolidation to tape of the ESD datasets from the last reprocessing campaign has started. This represents ~1 PB.
      • The RAL SRM is being hit by too many transfers/jobs (GGUS:65500).
        • Tiju: we halved the FTS slots and reduced the number of job slots; there still is a backlog at the moment
        • Simone: possible relation to CASTOR upgrade?
        • Tiju: our CASTOR experts will look into the problem
        • Ignacio: RAL's SRM was not upgraded
        • Maarten: the upgraded back-end could have an issue
        • Simone: this is the first big activity since the upgrade
    • T2s :
      • ntr

  • CMS reports -
    • Experiment activity
      • Shutdown activities
    • CERN and Tier0
      • Tier-0 down
    • Tier1 issues and plans
      • ASGC - GGUS:65224 data access problems; the site acknowledged high load but offered no solution
      • CC-IN2P3 - GGUS:65348 reconstruction jobs abort, likely due to a timeout; it is still being investigated whether this happens during data access or stage-out
      • For both tickets the site replied in Savannah, not GGUS. We will address within CMS how to better deal with GGUS vs. Savannah to enforce proper reporting.
      • Rereco winding down. Another pass to be decided by Monday.
    • Tier-2 Issues
      • No issues
      • Fall MC production coming to an end. Getting ready for Winter10 MC.

  • ALICE reports -
    • T0 site
      • Deployment of the new AliEn v2.19 is ongoing; 18 sites have already been migrated. At CERN the migration has been smooth and CREAM submission is performing well
    • T1 sites
      • Most of the sites are not running due to the migration
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Reprocessing almost completely finished and its output merged. MC production ready to start after the final tests. Coordination with CERN IT is needed for the LHCb intervention.
        • Ignacio + Jacek: upgrades of physics DB and CASTOR name server can be done at the same time, foreseen for Jan 6
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • NTR
      • T1 site issues:
        • NTR

Sites / Services round table:

  • ASGC
    • CMS ticket GGUS:65224 in progress; please check whether the problem persists
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • a problem was detected with SW installations for ATLAS; Alessandro de Salvo has been contacted
    • CMS ticket GGUS:65348 in progress
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • LHCOPN connection from SARA to CERN is foreseen to be moved to another fiber on Monday, should be transparent
      • Edoardo: confirmed
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • CASTOR - ntr
  • databases
  • networks
    • on Jan 11 the replacement of Force10 routers will be started, should be transparent on that day
    • on Jan 13 new routers will be put in production, may not be transparent

AOB:

  • today was Harry Renshall's last official day before retirement - many thanks and best wishes !
  • next meeting Wed Jan 5

Season's Greetings !

-- JamieShiers - 10-Dec-2010
