Week of 120611

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local (Andrea, Eric, Maarten, Vladimir, Massimo, David); remote (Rolf/IN2P3, Burt/FNAL, Kyle/OSG, Onno/NLT1, Tiju/RAL, Dimitri/KIT, Zeeshan/NDGF; Ian/CMS, Simone/ATLAS).

Experiments round table:

  • ATLAS reports -
    • T0
      • On Friday at approx. 18:30 EOS started failing all operations. An alarm ticket was submitted to CERN, but the EOS team was already looking at the problem (we should have checked the IT Status Board first). The problem was fixed shortly afterwards.
    • T1
      • One disk server at RAL was unavailable from Friday. The data became available again on Saturday. Many thanks to the RAL people: those data were urgently needed and would have taken some time to regenerate.
    • [Simone: problem with GGUS is still being investigated.]
    • [Zeeshan/NDGF: downtime at PDC today for ATLAS.]

  • CMS reports -
    • LHC machine / CMS detector
      • Productive weekend for data taking
    • CERN / central services and T0
      • We ran short of EOS space at CERN, which led to a number of stage-out failures in repacking, express and reco. This seems to be related to disk servers attempting to accept a file, then filling up and failing. We have freed up space and will adjust the threshold we use to decide that the system is full. [Massimo: as we speak, the EOS space for CMS is being increased.]
    • Tier-1/2:
      • We had a problem with the Site Availability tests over the weekend. We were unable to determine why some sites were failing the SUM CE tests, because there were no reports. Reported and resolved by the dashboard team.
      • We received a variety of responses to the glexec tickets last week. We may need some help with the deployment.

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T1:
      • IN2P3 : Short scheduled Downtime for GE upgrade
    • [Vladimir: updated wildcard information, but this information is now waiting for validation by USCT. Maarten: please open a GGUS ticket.]

Sites / Services round table:

  • Rolf/IN2P3: scheduled downtime at risk for batch, the update went well, all ok now
  • Burt/FNAL: ntr
  • Kyle/OSG: ntr
  • Onno/NLT1: ntr
  • Tiju/RAL: ntr
  • Dimitri/KIT: ntr
  • Zeeshan/NDGF: nta

  • David/Dashboard:
    • comment on CMS problem, seems due to a VM crash, will investigate with SAM experts
    • can KIT get in touch with the FTS developers to resolve the FTS messaging issue? [Dimitri/KIT: is there a ticket about this? David: will find out, otherwise will open one.]
  • Massimo/Storage: nta

AOB: none

Tuesday

Attendance: local (Andrea, Eric, David, Massimo, Eva, Alessandro, MariaDZ, Manuel); remote (Xavier/KIT, Kyle/OSG, Rolf/IN2P3, Burt/FNAL, Tiju/RAL, Jhen-Wei/ASGC, Ronald/NLT1, Giovanni/CNAF; Alessandro/ATLAS, Vladimir/LHCb).

Experiments round table:

  • CMS reports -
    • LHC machine / CMS detector
      • NTR
    • CERN / central services and T0
      • EOS space issues seem resolved and running smoothly
      • Numerous apparent false alarms this morning due to SLS issues. No issues since 10:00
      • [Massimo: noticed spikes of traffic in T1 transfers, is this normal? Eric: will investigate, thanks.]
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T0:
      • CERN:
    • T1:
      • PIC : lhcbweb.pic.es does not respond (GGUS:83177); Fixed

Sites / Services round table:

  • Xavier/KIT: ntr
  • Kyle/OSG: in maintenance period, some services may drop
  • Rolf/IN2P3: ntr
  • Burt/FNAL: minor degradation on tape services, has been resolved and should not have been visible
  • Tiju/RAL: reminder: tomorrow we will upgrade the databases behind LFC/FTS, and the Castor upgrade will take the whole day
    • [Alessandro: how long for FTS? Tiju: scheduled 8am to 5pm but it will be shorter than that; we will take care of draining the FTS channels one hour before. Alessandro: so the whole UK cloud will be unavailable for transfers tomorrow; there is the ICHEP conference and ATLAS is struggling to get MC production done. Tiju: this was discussed with ATLAS already.]
  • Jhen-Wei/ASGC: ntr
  • Ronald/NLT1: ntr
  • Giovanni/CNAF: ntr

  • David/Dashboard:
    • followup on FTS messaging from KIT, we are receiving messages, we just did not notice the endpoint had changed, sorry about that
    • about CMS visualization there were two issues, one dashboard VM crash (will now get alarms to dashboard support list) and SAM ATP component issues (fixed in next release)
  • Massimo/Storage: tomorrow from 7:00 to 8:30 the network group will intervene on switches affecting 10 of our machines; we contacted ATLAS T0 and they are OK with it
  • Eva/Databases: ntr
  • Manuel/Grid: ntr

AOB:

  • MariaDZ: ATLAS ALARM ticket GGUS:82854 against T0 has been in status "Assigned" since 2012/06/04. [Massimo: the issue was fixed but we forgot to update the ticket; there was no problem with the interface. MariaDZ: it is a pity because this incorrectly gives bad statistics in the MB reports.]

Wednesday

Attendance: local (Andrea, Luca, Eric, David, Manuel, MariaDZ); remote (Kyle/OSG, Jhen-Wei/ASGC, Ron/NLT1, Zeeshan/NDGF, Rolf/IN2P3, Gareth/RAL, Lorenzo/CNAF; Vladimir/LHCb, Alessandro/ATLAS).

Experiments round table:

  • ATLAS reports -
    • RAL Oracle11 upgrade: whole UK cloud set to brokeroff

  • CMS reports -
    • LHC machine / CMS detector
      • Access 1700-2200, physics ongoing
    • CERN / central services and T0
      • CASTOR disk server dropped out, restored, GGUS:83214.
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T0:
      • CERN:
    • T1:
      • RAL : Scheduled Downtime
      • CNAF: Pilots failed; Fixed without GGUS ticket

Sites / Services round table:

  • Kyle/OSG: ntr
  • Jhen-Wei/ASGC: ntr
  • Ron/NLT1:
    • small network problem this morning for 1h; no GGUS ticket was sent, so the impact on users cannot have been severe
    • dCache upgrade next Monday
  • Zeeshan/NDGF: ntr
  • Rolf/IN2P3: closed as unresolved the LHCb ticket about corrupted files, because data corruption does not seem specific to IN2P3 but rather seems a well known problem, for which the workaround is to use the checksum to check for data corruption and retry the transfer if necessary. Please look at GGUS:82247 for more details. [Vladimir: there are different reasons why there were corrupted files at IN2P3, the main one being site-specific problems in the transfer from WN to SE, but we agree that this ticket can be closed.]
  • Gareth/RAL:
    • Castor upgrade to 2.1.11-9 was done and the services are ok now
    • problems were encountered during the 11g upgrade of the Oracle database behind FTS, a rollback had to be done, this is being followed up and ATLAS will be kept informed. [Alessandro: thanks for following up. Understood that the 11g upgrade at other sites was done using two instances, one on 10g and one on 11g, and only switching the alias at the end, was this done at RAL? Gareth: no we cannot do this as we only have one small RAC and this is the one being upgraded. We do have a small FTS instance that we could turn on in the meantime if the problem takes much longer, could this be useful? Alessandro: yes thanks, this could be useful. Will be followed up offline.]
  • Lorenzo/CNAF: ntr

  • Luca/Databases: ntr
  • David/Dashboard: ntr
  • Manuel/Grid: ntr
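
The checksum workaround mentioned in the IN2P3 report above (see GGUS:82247) could look roughly like the following. This is only a minimal sketch, not the procedure actually used by LHCb or IN2P3: the choice of Adler-32 is an assumption (the minutes do not name the checksum algorithm), and `transfer`, `destination` and `max_attempts` are hypothetical names standing in for whatever copy tool and settings are really used.

```python
import zlib
from typing import Callable


def adler32_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def transfer_with_verification(
    transfer: Callable[[str], None],  # hypothetical: copies the file to `destination`
    destination: str,
    expected_checksum: str,
    max_attempts: int = 3,            # assumed retry limit
) -> int:
    """Run the transfer, verify the checksum, and retry on mismatch.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        transfer(destination)
        if adler32_of_file(destination) == expected_checksum.lower():
            return attempt
    raise IOError(
        f"checksum mismatch for {destination} after {max_attempts} attempts"
    )
```

The expected checksum would come from the source catalogue entry, so that a file corrupted anywhere between WN and SE is detected and copied again rather than silently registered.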

AOB:

  • MariaDZ/GGUS:
    • two upcoming releases at unusual dates, Mon 25/6 with new reporting tools and Mon 9/7 with other development items in the pipeline
    • the BIOMED VO likes the team ticketing functionality and has been using the TEAM ticket submit form for quite some time already. They wish to change the default ticket priority value from 'top priority' to a lesser value. If the experiments are happy with this request, please send feedback to Maria. Otherwise the BIOMED request will be refused and we shall continue as now.

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS reports -
    • CERN-PROD LSF bsub slow (response time always >1sec): ALARM GGUS:83252
    • RAL Oracle11 upgrade did not make it, site rolled back. To be re-scheduled

  • CMS reports -
    • LHC machine / CMS detector
      • NTR
    • CERN / central services and T0
      • LSF issues overnight: failing and coming back up many times. Working OK as of this morning, but we would like to understand whether it is a load issue or related to recent upgrades.
      • VO virtual machine (vocms161) for CMSWEB lost a partition. Still not resolved; the hypervisor is suspected. This is causing intermittent errors in crabserver, dbs, phedex, reqmgr, sitedb, t0datasvc and t0mon. The machine has been removed from our cluster.
    • Tier-1/2:
      • NTR

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping at T1s ongoing
    • MC production at Tier-2s
    • T0:
      • CERN: RAW file not migrated (GGUS:83239)
    • T1:
      • ntr

Sites / Services round table:

AOB: (MariaDZ) Concerning the Hadoop course in preparation at CERN, announced last week: interested people from T1 centres can also attend, but they will have to pay like any other course participant. Arranging things via budget codes will be complex, so we should know quite early how many people we are talking about. The course will take place mid-September at the earliest. Meanwhile, please send Maria candidate names.

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 22-May-2012

Topic revision: r14 - 2012-06-14 - MaartenLitmaath
 