Week of 120924

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability SIRs, Open Issues & Broadcasts Change assessments
ALICE ATLAS CMS LHCb WLCG Service Incident Reports WLCG Service Open Issues Broadcast archive CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alex, Eva, Fernando, Ivan, Lothar, Maarten, Maria D, Massimo);remote(Christian, Jhen-Wei, Kyle, Lisa, Onno, Rolf, Saverio, Tiju, Vladimir, Xavier).

Experiments round table:

  • ATLAS reports -
    • CERN/T0/Central services
      • Global job failures due to an expired proxy on the Panda server, with a side effect on HammerCloud site exclusion.
    • T1
      • INFN-T1 automatically excluded in DDM during 1h scheduled downtime.
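The expired-proxy incident above is the kind of failure a periodic lifetime check can catch before jobs start failing globally. A minimal sketch, assuming a hypothetical `needs_renewal` helper and a 6-hour safety margin; in a real deployment the remaining lifetime would be read from the proxy itself (e.g. via `voms-proxy-info -timeleft`), not passed in as a parameter:

```python
from datetime import datetime, timedelta

def needs_renewal(proxy_expiry: datetime,
                  now: datetime,
                  margin: timedelta = timedelta(hours=6)) -> bool:
    """Return True if the proxy expires within the safety margin.

    Hypothetical helper: the expiry timestamp would normally be
    extracted from the proxy certificate on the service host.
    """
    return proxy_expiry - now <= margin

now = datetime(2012, 9, 24, 15, 0)
ok_proxy = datetime(2012, 9, 25, 15, 0)     # 24 h left -> no action
short_proxy = datetime(2012, 9, 24, 17, 0)  # 2 h left  -> renew now

print(needs_renewal(ok_proxy, now))     # False
print(needs_renewal(short_proxy, now))  # True
```

Run from cron on the Panda server, a check like this would have triggered a renewal well before the proxy actually expired.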

  • CMS reports -
    • LHC / CMS
      • CMS starting up, ready for data
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039)
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework.
        • Massimo: we contacted the developers
        • Maarten: in the Storage Interfaces Working Group a possible hack on the LHCb side was mentioned ("if EOS, fix the TURL...")
    • T1 :
      • CNAF: SE downtime
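The "fix the TURL" workaround mentioned for the EOS protocol problem could look roughly like the following sketch: when a TURL points at an EOS endpoint with a protocol the framework cannot use, rewrite the scheme to the one protocol EOS does support (gsiftp). The host name and path are illustrative, not the actual LHCb configuration:

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative EOS endpoint list; the real names would come from the
# experiment's storage configuration, not be hard-coded like this.
EOS_HOSTS = {"eoslhcb.cern.ch"}

def fix_turl(turl: str) -> str:
    """If the TURL targets an EOS host with a protocol other than
    gsiftp, force the gsiftp scheme; otherwise return it unchanged."""
    parts = urlsplit(turl)
    if parts.hostname in EOS_HOSTS and parts.scheme != "gsiftp":
        parts = parts._replace(scheme="gsiftp")
    return urlunsplit(parts)

print(fix_turl("root://eoslhcb.cern.ch//eos/lhcb/data/run.raw"))
# → gsiftp://eoslhcb.cern.ch//eos/lhcb/data/run.raw
```

Non-EOS TURLs pass through untouched, which is what makes this viable as a client-side hack while waiting for the storage side to add protocol support.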

Sites / Services round table:

  • ASGC
    • during the weekend a disk partition error started affecting the ATLAS tape buffer; migrations are pending while the problem is being worked on
  • CNAF - ntr
    • Vladimir: what about the unscheduled SE downtime?
    • Maarten: please update the GOCDB downtime calendar as needed
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • there have been problems with the tape system since last month, some files cannot be retrieved; the matter is still being worked on; so far ATLAS and LHCb have been affected, but not (yet) CMS
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • RAL
    • tomorrow CASTOR upgrade for ATLAS

  • dashboards - ntr
  • databases
    • rolling intervention on CMS archive DB tomorrow still to be confirmed by CMS
  • GGUS/SNOW
    • next standard GGUS release on Wed Sep 26, with tests for alarms and attachments as usual
  • grid services - ntr
  • storage
    • network interventions (10-30 min each) on Thu Sep 27 08:30-12:30 CEST will affect CASTOR and EOS machines for ALICE and ATLAS

AOB:

Tuesday

Attendance: local(Fernando, Ivan, Maarten, Maria D, Massimo, Ulrich);remote(Gonzalo, Jhen-Wei, Kyle, Lisa, Michael, Rolf, Ronald, Salvatore, Tiju, Vladimir, Xavier).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • RAL in scheduled downtime 10:00-16:00 for the CASTOR upgrade. UK cloud set to broker-off.

  • CMS reports -
    • LHC / CMS
      • CMS starting up, ready for data
    • CERN / central services and T0
      • Testing a new Tier-0 Framework
    • Tier-1/2:
      • Tier-1 Consistency check requests issued as tickets

  • LHCb reports -
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039)
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework.
    • T1 :
      • CNAF: SE downtime
        • Vladimir: CNAF is blocked for LHCb work until some time next week
        • Salvatore: we will need ~5 more days to repair the situation
          • further details were provided after the meeting, see below

Sites / Services round table:

  • ASGC
    • still working on the ATLAS tape staging problem reported yesterday
  • BNL - ntr
  • CNAF
    • details on the LHCb SE incident are reported below
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • no update on the tape recall problem reported yesterday
  • NLT1 - ntr
  • OSG
    • today there is maintenance on the central services
    • tomorrow the ticket exchange system will be upgraded to handle the SOAP interface change coming with the new GGUS release
  • PIC - ntr
  • RAL
    • the CASTOR upgrade for ATLAS finished OK at 12:00 local time

  • dashboards - ntr
  • GGUS/SNOW
    • reminder: GGUS release tomorrow, with alarm ticket tests etc.
  • grid services
    • scheduled network intervention Thu morning will affect various services, e.g. LFC
      • Fernando: please provide further details on which services
      • Ulrich: OK
    • CREAM ce203 is back in production after migration to using Argus; ce204 is next, now being drained
  • storage - ntr

AOB:

CNAF LHCb SE incident

PROBLEM DESCRIPTION:
On Friday 21st Sept 2012 at 15:00 CET the storage vendor's support carried 
out a routine on-site intervention on the hardware hosting the LHCb disk 
pools. The intervention on a RAID disk storage array (the substitution of 
a redundant power supply) triggered an unexpected major failure on the 
disk array chains, forcing a number of disk arrays off-line. The cause of 
the failure is under investigation by the vendor support and the Tier-1 
staff, and a detailed report has been requested from the vendor. As an 
immediate action the LHCb GPFS file system was unmounted from all the 
worker nodes, all the Tier-1 LHCb services were stopped and GPFS was shut 
down. The affected disk storage system contains about 650 TB of the LHCb 
disk space.

ACTIONS AND RELATED SOLUTION
The vendor support managed to bring the disk array back on-line during 
the evening of the 21st, but a large number of RAID disk arrays then 
started to rebuild, and these rebuild operations take a long time to 
complete. They must finish before the volumes can be validated at the 
GPFS level and before a subsequent file-system check can be started to 
properly ensure the data consistency of the whole file system. Only 
after all of these steps have completed can the storage area be made 
available to the users and the Tier-1 LHCb services be restored.

ESTIMATION OF UNSCHEDULED DOWNTIME
So far (25th Sep. 2012, 12:45 CET) two further days for the completion 
of the ongoing rebuilds is a plausible estimate. After that, an 
additional 2-3 days are required for a complete and detailed file-system 
check. A full file-system check, though time-consuming, is strongly 
advisable after such a disruptive event. If no major corruption of the 
file system is detected, the disk storage area should be on-line by 1st 
October 2012, or earlier if possible. A detailed SIR will be submitted 
later on.

Wednesday

Attendance: local(Fernando, Ian, Ivan, Maarten, Maria D, Massimo, Ulrich);remote(Gonzalo, Jhen-Wei, John, Kyle, Lisa, Michael, Pavel, Rolf, Ron, Salvatore, Thomas, Vladimir).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • LFC and DDM registration slowness between 12:00 and 24:00. Many sessions open on the ADCR databases, affecting the Panda workload management system and raising high-load alarms on the machines. The problem is not fully understood yet, but the situation is OK now.
    • T1
      • NTR

  • CMS reports -
    • LHC / CMS
      • CMS running
    • CERN / central services and T0
      • Testing a new Tier-0 Framework. Some question about its load on Castor. Shutting down the test components until we understand better
        • Massimo: we see a high load indeed, which causes our probes to time out, but their timeouts are rather short; next week would be better, when new HW has been added
        • Ian: the current load ought to be equivalent to a highish load from the old system, i.e. not a problem; we will look into the difference in behavior; the new system will allow CMS to eliminate special components that are only being used in the Tier-0 Framework; the plan is to switch in time for the proton-ion run and possibly already earlier
    • Tier-1/2:
      • Tier-1 Consistency check requests issued as tickets

  • ALICE reports -
    • Since ~20:30 yesterday evening new jobs are failing immediately at CERN and many other sites; being investigated
      • a change on the AliEn side was rolled back and things look better since ~15:45 CEST

  • LHCb reports -
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039)
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework.
    • T1:
      • CNAF: SE downtime
      • SARA: wrong TURLs returned from SARA's SRM (GGUS:86435)

Sites / Services round table:

  • ASGC
    • yesterday the tape buffer disk problem was fixed and the backlog of pending migrations for ATLAS has been cleared
  • BNL - ntr
  • CNAF
    • no news about the ongoing LHCb SE recovery
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG
    • the ticket exchange system has stopped working due to today's SOAP interface change that came with the new GGUS release; working on it with the GGUS team
  • PIC - ntr
  • RAL - ntr

AOB:

Thursday

Attendance: local(Fernando, Ian, Ivan, Maarten, Maria D, Massimo, Ulrich, Vladimir);remote(Jhen-Wei, John, Kyle, Lisa, Michael, Rolf, Ronald, Thomas, WooJin).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • NTR

  • CMS reports -
    • LHC / CMS
      • CMS running
    • CERN / central services and T0
      • Degradation of CASTOR yesterday afternoon after we had shut down the Tier-0 testing framework. We need the bad user list again.
    • Tier-1/2:
      • Tier-1 Consistency check requests issued as tickets. PIC seems done. FNAL done and ticket closed. IN2P3 and ASGC in progress. No news on the others.
        • WooJin: will have a look
        • John: GGUS or Savannah ticket?
        • Ian: it should have been bridged to GGUS, will check

  • LHCb reports -
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039)
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework.
    • T1 :
      • CNAF: SE downtime: Solved
      • SARA: wrong TURLs returned from SARA's SRM (GGUS:86435): Solved

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - nta
  • NDGF
    • this morning some dCache head nodes were found to have been using an outdated CA repository for ~1 year; now fixed
  • NLT1 - ntr
  • OSG
    • the GGUS-OSG ticket exchange issues reported yesterday were resolved shortly after yesterday's meeting
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW
    • ALARM test results show us that it is wise to repeat them specifically for CERN. Explanation in Savannah:131998#comment12. All tests to the Tier1s worked well. The OSG-GGUS problem is fixed, it was not due to GGUS. Details in Savannah:127763#comment30.
      • Maria: the CERN tests are foreseen to be repeated on Monday
  • grid services - ntr
  • storage
    • assuming there will be less activity starting from the scrubbing run at the end of next week and during the subsequent machine development, the following interventions are foreseen:
      • Mon Oct 8 14:00-17:00 CASTOR-CMS upgrade + DB HW upgrade
      • Tue Oct 9 10:00-12:00 CASTOR-ALICE upgrade
      • Tue Oct 9 14:00-16:00 CASTOR-LHCb transparent update
      • Thu Oct 11 14:00-17:00 CASTOR-ATLAS upgrade + DB HW upgrade

AOB: (MariaDZ) The Hadoop courses at CERN are now in the Training catalogue (one may need a CERN login to open the pages). The price of each course has been calculated for a minimum number of participants: 8 for the Developers and Administrators courses and 20 for the Masterclass; if there are more students, the final price will of course be lower. Participants coming from the WLCG Tier-1s, if any, should take care of their own travel and accommodation. If you are not in the CERN phonebook and/or you don't have a team account for the course payment, you'll need to contact technical.training@cern.ch for CERN access and payment details.

  1. Hadoop Masterclass. This one-day course is designed for managers who need a high-level view/understanding of Hadoop concepts and benefits and/or for technical staff who need to provide support/advice on Hadoop. Date: 06-Nov-12. Maximum number of participants: 35.
  2. Hadoop for Administrators. This 3-day course (50% lab) is designed for installers and maintainers of a Hadoop cluster. Upon completion of the course, students will receive a voucher for the "Hortonworks Certified Apache Hadoop Administrator" Certification Exam (online). Dates: 07-Nov-12 to 09-Nov-12. Maximum number of participants: 12.
  3. Hadoop for Developers. This 5-day course (60% hands-on lab exercises) is designed for developers who want to better understand how to create Apache Hadoop solutions. Upon completion of the course, students will receive a voucher for the "Hortonworks Certified Apache Hadoop Developer" Certification Exam (online). Dates: 12-Nov-12 to 16-Nov-12. Maximum number of participants: 12.

Friday

Attendance: local(Fernando, Ivan, Maarten, Massimo, Ulrich);remote(Gonzalo, Jeremy, Jhen-Wei, John, Kyle, Lisa, Michael, Onno, Rolf, Saerda, Salvatore, WooJin).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • NTR

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • T0:
      • CERN : cleaning TMPDIR on lxbatch (GGUS:86039)
      • CERN : problem with the protocols supported by EOS (GGUS:86226): only gsiftp is supported, which prevents access to the data from our framework.
    • T1:
      • SARA: SE misconfiguration (GGUS:86509); most probably disk is full

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • the LHCb SE is OK since yesterday after a GPFS rebuild and file system check, no data was lost; a SIR will be created; the LHCb share has been increased for the weekend to help LHCb recover from the backlog
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • LHCb ticket: it does not seem to be a misconfiguration; the space is simply full. There are limits on two levels, the space token and the pool group; the pool group has 9 TB more, yet it is the one reported as full, so there may be a lot of dark data stored outside the space token. We will look further into that.
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr
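The two-level accounting NL-T1 describes (a space token inside a larger pool group) can be illustrated with simple arithmetic. Only the 9 TB of pool-group headroom comes from the report; all other figures below are made up for illustration:

```python
# Hypothetical usage figures in TB; only the 9 TB difference between
# the pool group and the space token comes from the meeting report.
space_token_size = 100.0
pool_group_size = space_token_size + 9.0  # pool group has 9 TB more
space_token_used = 96.0                   # accounted to the space token
pool_group_used = pool_group_size         # pool group reported full

# Data sitting in the pool but not accounted to the space token
# ("dark data"): the pool can be full even though the token is not.
dark_data = pool_group_used - space_token_used
print(dark_data)  # 13.0 TB unaccounted for in this made-up scenario
```

This is why the pool group can report full while the space token still shows free space: any bytes written outside the token consume the shared pool without appearing in the token's accounting.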

  • dashboards - ntr
  • grid services - ntr
  • storage - ntr

AOB:

-- JamieShiers - 02-Jul-2012

Topic revision: r18 - 2012-09-28 - MaartenLitmaath