Week of 120507

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Andrea V, Jan, Jhen-Wei, Lukasz, Maarten, Maria D, Przemek, Simone);remote(Gonzalo, Joel, Kyle, Lisa, Michael, Onno, Paolo, Stephen, Ulf, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • Taiwan-LCG2: AGENT error during TRANSFER_SERVICE. A test space token suffered a large number of transfers; the errors ceased once the transfers were done. No harm to the production service; ticket closed. GGUS:81877
      • Taiwan-LCG2: Core switch down (04:00 - 08:00 UTC, Monday).

  • CMS reports -
    • LHC machine / CMS detector
      • Three fills over weekend
    • CERN / central services and T0
      • Database problem on Saturday causing jobs to fail everywhere. INC:126416.
        • Przemek: our NAS has problems when a single directory contains a great many files; a workaround was implemented on Saturday and a permanent fix is expected soon. As the fix will not be transparent, the intervention will be agreed with CMS (a generic illustration of the issue follows this report)
    • Tier-1/2:
      • T1_TW_ASGC has CASTOR problems causing jobs to fail. Savannah:128277 / GGUS:81887.
        • Jhen-Wei: as of 1h ago CASTOR should be OK; CMS should verify that through a few production jobs
      • T1_TW_ASGC core network went down. Reported to be back, with the vendor working on a fix.
        • Jhen-Wei: OK now
      • T2_US_UCSD lost cooling water on Thursday evening. The two CRAB Servers for glideinWMS hosted there still had problems over the weekend.
    • Other:
      • NTR
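
Przemek's note above points at a generic filesystem scaling issue: operations on a directory slow down as its entry count grows. Below is a purely illustrative sketch of the usual mitigation, spreading files over hashed subdirectories; the actual workaround applied to the CMS NAS is not described in INC:126416, and every name here is invented.

    # Hypothetical sketch: shard files into hashed subdirectories so no single
    # directory accumulates a huge number of entries.
    import hashlib
    import os

    def sharded_path(base_dir, filename, levels=2):
        """Map a filename to base_dir/ab/cd/filename using its MD5 prefix."""
        digest = hashlib.md5(filename.encode()).hexdigest()
        parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
        target = os.path.join(base_dir, *parts)
        os.makedirs(target, exist_ok=True)  # create the shard directories
        return os.path.join(target, filename)

    # Two files land in different subdirectories, keeping each one small.
    print(sharded_path("/tmp/nas-demo", "job_12345.log"))
    print(sharded_path("/tmp/nas-demo", "job_12346.log"))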

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier2s
    • Prompt reprocessing of data
    • New GGUS (or RT) tickets
    • T0
    • T1
      • RAL : 1 disk server down during the weekend.
        • Joel: will be looked into tomorrow (today is a bank holiday)
      • Joel: was there a modification of the queue length at NIKHEF? It was requested 2 weeks ago and supposedly done, but seems to have been reset now
        • Onno: will check
    • Others
      • FTS : Problems with FTS transfers between RAL and CNAF (under investigation by LHCb)

Sites / Services round table:

  • ASGC - nta
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • Mon May 14 13:30-14:00 UTC downtime for SARA SRM, dCache will be restarted
  • OSG
    • we received an e-mail stating that the REBUS topology is not available; could this be related to the issue affecting the forwarding of accounting records from Gratia to APEL?
      • Maarten: APEL services were affected by last week's network problem at RAL; an EGI broadcast announced delays in the processing of accounting records, but recovery would be automatic; send e-mail to the operations list and/or open a GGUS ticket if you still see issues on the OSG side; REBUS currently looks OK (a quick availability probe is sketched below)
  • PIC - ntr
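
A plain HTTP probe of the topology page is a quick way to separate "REBUS itself is down" from a problem on the Gratia-to-APEL forwarding path. A minimal sketch; the URL is assumed to be the REBUS topology endpoint, and the timeout is an arbitrary choice.

    # Minimal availability probe for the REBUS topology page.
    import urllib.error
    import urllib.request

    URL = "https://wlcg-rebus.cern.ch/apps/topology/"  # assumed endpoint

    try:
        with urllib.request.urlopen(URL, timeout=30) as response:
            print("REBUS reachable, HTTP status", response.status)
    except urllib.error.URLError as exc:
        print("REBUS not reachable:", exc)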

  • dashboards - ntr
  • databases
    • Oracle security patches will be applied to the WLCG integration DBs on Wed and, if no problems are seen, to the production DBs on Mon; existing connections may be affected by the interventions; the physics DBs should also be done soon (a client-side reconnect sketch follows this round table)
  • GGUS/SNOW
    • see AOB
  • storage
    • EOSATLAS had a problem 11:30-13:30 CEST, some modifications may have been lost; under investigation
    • rolling DB updates for CASTOR on Wed and Thu, transparent for clients
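
Since rolling patches briefly take individual instances away, clients that retry their connection ride through the intervention. A minimal sketch, assuming the cx_Oracle driver; the DSN, credentials and retry parameters are placeholders, not the actual CERN setup.

    # Reconnect with linear backoff across a rolling database intervention.
    import time
    import cx_Oracle

    def connect_with_retry(user, password, dsn, attempts=5, delay=10):
        """Try to connect, waiting a bit longer after each failure."""
        for attempt in range(1, attempts + 1):
            try:
                return cx_Oracle.connect(user, password, dsn)
            except cx_Oracle.DatabaseError:
                if attempt == attempts:
                    raise  # give up after the last attempt
                time.sleep(delay * attempt)

    conn = connect_with_retry("reader", "secret", "db.example.org:1521/SOMEDB")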

AOB:

  • (MariaDZ) The file ggus-tickets.xls with the total numbers of GGUS tickets per experiment per week is up to date and attached to the WLCGOperationsMeetings page. There was one real ALARM last week, GGUS:81786, submitted on May 1 by ATLAS against SARA-MATRIX. Detailed drills next week for the 2012/05/15 WLCG MB (they will cover 8 weeks - 13 real ALARMs so far since the last MB).
  • (MariaDZ) In case people haven't noticed: the latest https://ggus.eu/pages/didyouknow.php#2012-04-25 describes the new status "Closed", introduced in production with the GGUS release of 2012/04/25.
    • the new state has already been applied to some tickets

Tuesday

Attendance: local(Massimo, Claudio, Simone, Gavin, Luca, Maria D, Maarten);remote(Gonzalo, Joel, Lisa, Kyle, Ulf, Xavier, Ronald, Jeremy, Tiju, Jhen-Wei, Giovanni).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
    • Questions to WLCG/sites
      • After the EOS downtime yesterday and the intervention today, I understand there is no data loss. Is this correct/confirmed?
      • At the T1SCM last week, T1 sites have been asked to patch FTS for the gridftp2 issue. Could we have a statement from sites about who applied the patch?

  • CMS reports -
    • LHC machine / CMS detector
      • Data taking with beam during the night, cosmics afterwards
    • CERN / central services and T0
      • Waiting for the DB upgrade to fix the cause of the problem seen on Saturday (SNOW: INC:126416)
    • Tier-1/2:
      • T1_TW_ASGC Savannah:128277 / GGUS:81887 solved in principle but still open (high job inefficiency because of high load on the storage servers).
      • T2_US_UCSD: CRAB Servers are back
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier2s
    • Prompt reprocessing of data
    • New GGUS (or RT) tickets
    • T0
    • T1
      • NIKHEF : (GGUS:81930) Pilots aborted; the queue length had been reset and was put back this morning
      • IN2P3 : (GGUS:81927) Pilots failed at one CREAM CE (creamce05)
      • GRIDKA : What are the plans for the migration of the SRM instance? Meanwhile, can we have more disk space? We are running low.
    • Others
      • FTS : (GGUS:81996) Problems with FTS transfers between RAL and CNAF (CNAF does not see the RAL FTS instance)

Sites / Services round table:

  • ASGC: Investigating the CMS problem
  • BNL:
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3:
  • KIT: new SRM for LHCb: the procedure is being prepared; expect firm dates/planning in ~2 weeks. For the LHCb disk space addition: 30 TB ready to go in; LHCb to confirm details
  • NDGF: ntr
  • NLT1: next Tue (May 15): all-day downtime for multiple maintenance activities
  • PIC: 1 CE (out of 4) was drained. Noticed that Panda is using only one CE (hardcoded in Panda). ATLAS took note of it
  • RAL: FTS patch applied. The disk server (LHCb) having problems is back in R/O: all the files should be available
  • OSG: ntr

  • CASTOR/EOS: confirmed that the intervention on EOSATLAS should recover all the data
  • Central Services: ntr
  • Databases: CMS Data Guard problem: one more intervention (on the server file system) is needed; the DB team will inform the experiment
  • Dashboard:

AOB:

Wednesday

Attendance: local(Massimo, Simone, Luca, Maria D);remote(Claudio, Lisa, Kyle, Ulf, Pavel, Ron, Tiju, Shu-Ting, Paolo).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • After yesterday's intervention on EOS to restore the metadata lost in Monday's name-server problem, we still see missing files. GGUS:81907 has been updated.
    • T1s/CalibrationT2s
    • More investigation of the analysis queues at PIC, reported by Gonzalo yesterday:
      • The pilot factories were submitting to several CEs, not only ce07.pic.es (which was in downtime)
      • The number of analysis jobs at PIC was low, but not zero (100 jobs out of 2000 jobs running in total for ATLAS). This is consistent with the 5% share of analysis we asked T1s to configure
      • in conclusion, we believe there was never a problem in running jobs at PIC
    • A problem in the Taiwan DPM was noticed by the TW people before ATLAS realized it. No GGUS. Problem fixed in approx. 1h
    • Problem with PrepareToPut in the SARA SRM. GGUS:82032 was submitted during the night by Asia-Pacific shifters. Problem under investigation at the time of writing this report.
    • RAL reported an on-site connectivity problem, spotted before ATLAS could notice it. No GGUS.

  • CMS reports -
    • LHC machine / CMS detector
      • Only cosmics data taking
    • CERN / central services and T0
      • Patching of CMS integration databases (INT2R and INT9R) today apparently gave no problems
      • Patch on CMSR database expected on Monday. Will the ADG reconfiguration happen at the same time?
    • Tier-1/2:
      • T1_TW_ASGC Savannah:128277 / GGUS:81887 site managers are still investigating
      • T2_US_UCSD some 'unmerged' data loss due to the cooling incident last week
    • Other:

Sites / Services round table:

  • ASGC: DPM crash; the dev team has been contacted (a patch will be provided)
  • BNL:
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: 30 TB added. Pilot problem solved (waiting for confirmation). Good progress on the WMS.
  • NDGF: 16:00 UTC network outage (Finland-Sweden link). Some ALICE data might be unavailable for ~2h
  • NLT1: FTS patched (ATLAS req)
  • PIC:
  • RAL: at 4:00 UTC last night a network switch went down (~1 h). Some disk servers and batch nodes were unavailable
  • OSG: ntr

  • CASTOR/EOS: EOSATLAS file recovery ongoing
  • Central Services:
  • Databases: security patches applied on all instances, also in preparation for various upgrades next week.
  • Dashboard:

AOB:

Thursday

Attendance: local(Massimo, Claudio, Simone, Jarka, MariaDZ, Maarten);remote(Gonzalo, Michael, Joel, Ulf, Ron, Kyle, John, Lisa, Giovanni, Jhen-Wei, Rolf).

Experiments round table:

  • ATLAS reports
    • T0/Central Services
      • NTR
    • T1s/CalibrationT2s
      • At 18:30 yesterday the problem at SARA mentioned in GGUS:82032 appeared again. Ticket has been re-opened and is being looked into.
  • CMS reports -
    • LHC machine / CMS detector
      • Still no collisions
    • CERN / central services and T0
      • ntr
    • Tier-1/2:
    • Other:
      • ntr

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: dCache pool update today in DK: some ALICE files unavailable for a few hours. Overnight the Norwegian batch capacity went down; now back to normal
  • NLT1: investigating the ATLAS problem (it might be a misconfiguration in the storage DB)
  • PIC: ntr
  • RAL: CASTOR upgraded
  • OSG: ntr

  • CASTOR/EOS: DB upgrades (transparent) in ~ 10 days
  • Central Services: CERN FTS: today at 07:00 UTC the t0export and t2 FTS services had the gridftp2 and msg patches applied.
  • Dashboard: ntr

AOB: (MariaDZ)

  1. GGUS-SNOW interface will be ready with the May Release of 2012/05/30 to support the creation of REQUESTS (not only incidents). Details in Savannah:120007.
  2. The BIOMED VO asked for the implementation of an automatic TEAM ticket creation out of the Operations dashboard https://operations-portal.egi.eu/dashboard. We'd like to understand if the WLCG VOs use this dashboard for GGUS ticket creation and if this functionality would be useful for the WLCG community. Details in Savannah:127494.
  3. Important info for developers of ticketing systems interfaced to GGUS: a new SOAP web service is available for testing in the GGUS test system https://train-ars.ggus.eu/arsys/WSDL/public/train-ars/GGUS . It will eventually replace the current one, so please DO test it. Details in Savannah:127763.
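
For developers who want to probe the new endpoint, any WSDL-capable SOAP client will do; below is a minimal sketch using the Python "suds" library (the library choice is an assumption, GGUS does not mandate a client). Printing a suds Client lists the methods and complex types the test service exposes.

    # Inspect the new GGUS test web service via its WSDL.
    from suds.client import Client

    WSDL = "https://train-ars.ggus.eu/arsys/WSDL/public/train-ars/GGUS"

    client = Client(WSDL)  # downloads and parses the WSDL
    print(client)          # shows the service's methods and types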

Friday

Attendance: local(Massimo, Lukasz, Claudio, Simone, Stephan, Jan, Jarka, Maarten, Eva, Ignacio); remote(Michael, John, Alexandre, Kyle, Xavier, Giovanni, Rolf, Jhen-Wei, Catalin, Roger).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • NTR
    • T1s/CalibrationT2s
      • At 18:30 yesterday the problem at SARA mentioned in GGUS:82032 appeared again. Ticket has been re-opened and is being looked into.

  • CMS reports -
    • LHC machine / CMS detector
      • CMS magnet was down during LHC collisions last night
    • CERN / central services and T0
      • ntr
    • Tier-1/2:
      • T1_TW_ASGC Savannah:128277 / GGUS:81887 many "unmerged" files were garbage collected during the intervention and production was stopped again; it will restart soon
    • Other:
      • ntr

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier2s
    • Prompt reprocessing of data

    • T0
    • T1

Sites / Services round table:

  • ASGC: ntr
  • BNL: incident last night (SRM DB); the system was initially switched to a backup and is now back in production
  • CNAF: ntr
  • FNAL: FTS being updated
  • IN2P3:
    • A very high number (50k) of extremely short-lived pilot jobs, resulting in very low efficiency
    • The site asks LHCb for examples of corrupted files
  • KIT: downtime 21-MAY [EDIT by Xavier: wrong date corrected] from 6:00 UTC to 11:00 UTC (tape back-end intervention; only tape reading/writing will be blocked)
  • NDGF: ntr
  • NLT1: ATLAS problem under investigation (hopefully fixed next week)
  • PIC: ntr
  • RAL: 2 downtimes uploaded for next week
  • OSG: ntr

  • CASTOR/EOS: Retry policy for migration changed
  • Central Services: as requested by ATLAS (Cedric), an additional node has been added to prod-lfc-atlas.
  • Databases: rolling DB upgrades next week
  • Dashboard: ntr

AOB:

-- JamieShiers - 11-Apr-2012
