Week of 120528

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday - No meeting - Whit Monday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Tuesday

Attendance: local(Alex, Elisa, Ian, Ikuo, Luc, Maarten, Mike, Xavier E);remote(Jeff, Jeremy, Jhen-Wei, Lisa, Michael, Rob, Rolf, Tiju, Ulf, Xavier M).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking. Important and urgent MC production on-going (not new).
    • WLCG services
      • GOCDB (GGUS:82543) site/downtime info not accessible. Solved.
    • T0
      • CERN EOS (GGUS:82557, verified): open/create error "No space left on device"
        • solution: the file quota in DATADISK was raised from 8M to 9M
    • T1
      • SARA SE instability (GGUS:82032): SARA removed (blacklisted) from T0 export until the SE issue is solved
      • TAIWAN-LCG2_MCTAPE back in T0 (SAV:128845)
      • TRIUMF MCTAPE (GGUS:82556): Free=3.774, Total=25.0. Solved.
      • TRIUMF reported 1 tape lost
      • INFN-T1 DATADISK (GGUS:82517): invalid path, likely due to too many directories; empty directories are being removed
      • BNL-OSG2 (GGUS:82539, assigned): failed to contact remote SRM. Apparently solved, but no further response in the ticket after the first update ("The hardware of the SRM database had a disk problem. We are working on it.")
        • Michael: on Fri there was a problem with the SSD hosting the SRM DB, which was resolved within 1 hour; will check what happened with updating the ticket
        • Ikuo: the important matter was that the problem was solved very quickly; there also may have been confusion with GGUS:82538 which was closed before further updates could be added

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • Storage problem over the weekend, tracked to a user overloading the system. Since resolved.
        • Ian: the affected site was FNAL; it looked like a PNFS overload
      • Simulation remains the highest priority for the week
    • Other:
      • Ian Fisk is CRC

  • LHCb reports -
    • User analysis, DataReprocessing of 2012 data and prompt reconstruction ongoing at T1s
    • MC production at Tier-2s
    • New GGUS (or RT) tickets
    • T0:
      • ntr
    • T1:
      • SARA: unscheduled downtime this morning
      • IN2P3: ongoing investigation about corrupted files (GGUS:82247)
        • Elisa: the last affected file dated from March 20
        • Rolf: LHCb does not test for checksum errors immediately after their transfers, while ATLAS and CMS do, thereby avoiding problems later
        • Elisa: LHCb will implement such checks for FTS transfers; it may take a few weeks before the new code is in production (a minimal checksum-verification sketch is given after this report); still, such problems are not normal!
        • Rolf: not clear where the problem is exactly, it may not be the site's fault
        • Elisa: what about stageout from the WN?
        • Ikuo: ATLAS also test the checksums for those transfers
        • Ian: for CMS it depends on the site; we have not often seen such problems and will rather detect them in PhEDEx when an affected data set is transferred
    • Central services
      • request to LFC support (GGUS:81924)
        • reassigned to CERN LFC support after the meeting
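
    As a follow-up to the IN2P3 corrupted-files discussion above: a minimal sketch of a post-transfer checksum check, assuming Python and that the catalogue value for the LFN is available. The catalogue lookup function and all names are hypothetical placeholders, not the actual LHCb/DIRAC implementation.

      # Sketch only: compare the adler32 of a freshly written copy (e.g. after
      # stage-out from a WN) against the checksum recorded in the file catalogue.
      # catalogue_adler32() is a hypothetical placeholder for the real catalogue client.
      import zlib

      def adler32_of(path, blocksize=1 << 20):
          """Return the adler32 of a local file as an 8-digit hex string."""
          value = 1  # adler32 starts at 1
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(blocksize), b""):
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xFFFFFFFF)

      def catalogue_adler32(lfn):
          """Hypothetical: fetch the adler32 stored for this LFN in the file catalogue."""
          raise NotImplementedError("replace with the experiment's catalogue client")

      def verify_copy(local_path, lfn):
          expected = catalogue_adler32(lfn).lower()
          actual = adler32_of(local_path)
          if actual != expected:
              raise RuntimeError("checksum mismatch for %s: %s != %s" % (lfn, actual, expected))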

Sites / Services round table:

  • ASGC - ntr
  • BNL - nta
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - nta
  • KIT - ntr
  • NDGF
    • ATLAS reported they could not get certain files from tape, due to a problem with the tape system in Bergen; some data might be lost
    • the main OPN link to CERN is down for ongoing work; currently connected through backup route via SARA
  • NLT1
    • as the SARA SRM problem (GGUS:82490) started after an upgrade, a downgrade was performed to see if that helps
    • today's unscheduled downtime (11:30-13:30 UTC) was to deal with network issues
  • OSG - ntr
  • RAL
    • Yesterday, Castor-LHCb was switched to using the transfer manager. Tomorrow we will switch Castor-Gen (used by ALICE).
    • Request to other Tier-1s: if they have enabled hyper-threading on their batch farm, please contact Alastair Dewhurst (alastair.dewhurst@cern.ch)

  • dashboards
    • only CNAF and KIT still need to fill out the ActiveMQ patch Doodle poll
      • Xavier M: KIT should have the patch running
      • Mike: the KIT FTS still has a problem, viz. a 57% deficit of transfers reported via ActiveMQ
  • grid services - ntr
  • storage - ntr

AOB: (MariaDZ) GGUS release tomorrow! Usual round of ALARM and user test tickets as per Savannah:128369.

Wednesday

Attendance: local(Alessandro, Alex, Elisa, Jan, Luc, Maarten, Marcin, Maria D, Massimo, Mike);remote(Giovanni, Gonzalo, Ian, Jhen-Wei, Lisa, Michael, Pavel, Rob, Rolf, Ron, Tiju).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking (No stable beam since 10:30 yesterday). Important and urgent MC production on-going (not new).
    • WLCG services
      • NTR
    • T0
      • NTR
    • T1
      • SARA SE instability (GGUS:82490): correlation with data deletion. SARA remains removed from T0 export until the SE issue is solved.
      • INFN-T1 unscheduled downtime (SRM). A priori ATLAS not affected; as a precaution INFN-T1 was removed from T0 export.
      • Slow transfers IN2P3-CC -> TRIUMF: 2 bad machines at Lyon identified (GGUS:82363). Dedicated FTS monitoring might be needed.
        • Luc: difficult to track time spent for a given transfer
        • Alessandro: the info is probably available via the message bus, but so far it is not clear to us; we need to be able to select transfers taking longer than N hours (see the sketch below)
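
    A minimal sketch of the selection Alessandro describes, assuming transfer-completed records from the message bus are already available as Python dicts; the field names (submit_time, finish_time, src, dst, file) are assumptions for illustration and not the actual FTS message schema.

      # Sketch only: flag transfers that took longer than N hours from a stream of
      # transfer-completed records. Field names are illustrative, not the real schema.
      from datetime import datetime, timedelta

      def slow_transfers(records, threshold=timedelta(hours=6)):
          """Yield (duration, record) for every transfer exceeding the threshold."""
          for rec in records:
              start = datetime.strptime(rec["submit_time"], "%Y-%m-%dT%H:%M:%S")
              end = datetime.strptime(rec["finish_time"], "%Y-%m-%dT%H:%M:%S")
              duration = end - start
              if duration > threshold:
                  yield duration, rec

      # e.g. report slow IN2P3-CC -> TRIUMF transfers:
      # for dur, rec in slow_transfers(stream):
      #     if rec["src"] == "IN2P3-CC" and rec["dst"] == "TRIUMF":
      #         print(dur, rec.get("file"))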

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking
    • CERN / central services and T0
      • Lost a disk server in the T0streamer pool. Lots of errors in repack and Express. A team ticket was created and the problem is being addressed.
    • Tier-1/2:
      • HammerCloud failure at T1_FR_IN2P3. Exit code indicates failure to open an input file.
      • T1_IT_CNAF is expected back up soon

  • ALICE reports -
    • NIKHEF jobs using Torrent since yesterday evening.
    • SARA job submission was blocked overnight by stuck resource BDII on CREAM CE (GGUS:82618)
    • CNAF: no jobs running since last night, 299 waiting.

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping ongoing at T1s
    • MC production at Tier-2s
    • New GGUS (or RT) tickets
    • T1:
      • IN2P3 ticket for jobs killed due to memory limit has been closed (GGUS:82544)
        • Elisa: is the 5 GB memory limit for the long queue at IN2P3 per process or per job?
        • Rolf: per process group, i.e. per job

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • brief VOMS outage this morning, a hanging daemon needed to be restarted
  • CNAF
    • work ongoing to recover from earthquake fallout; not clear when affected services can be restored for CMS and ALICE
    • at-risk downtime 09:00-12:00 UTC on May 31 for network backbone maintenance
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • the FTS manager tried to apply the ActiveMQ patch, but it was no longer in the repository; the FTS developers have been contacted
  • NLT1
    • still investigating the slowness of the SRM (GGUS:82490); LHCb are hammering the service with srmLs requests at ~10 Hz, which may not fully explain the problem, but certainly needs fixing
      • Elisa: we will look into tuning our clients better to lower the rate (a possible client-side throttling approach is sketched after this round table)
      • Alessandro: correlation with deletion activity?
      • Ron: unclear, but deletions do not come with their own srmLs overload; note that the Chimera service is shared between the experiments
  • OSG - ntr
  • PIC - ntr
  • RAL
    • this morning's CASTOR upgrade went OK

  • dashboards - nta
  • databases
    • yesterday the ATLAS PVSS service was moved to the 3rd node of the ATLAS online DB, after which nodes 1 and 2 were rebooted to clear alarms from the ATLAS DCS about event buffering; the underlying cause is not understood, but the issue was also cured by a reboot the previous time, a few months ago
    • today nodes 1 and 2 of the WLCG RAC were rebooted after agreement with the Dashboard team; other services were not affected
  • GGUS/SNOW
    • a new GGUS release happened today; a summary of the test alarms and test tickets with attachments will be given tomorrow
  • grid services
    • 3 extra WMS nodes were added for CMS, after recent user activities caused stress on the sandbox file systems of the other machines
  • storage
    • no news yet on the CMS t0streamer disk server; the hope is that all files can be recovered
    • EOSCMS had a deadlock in the early afternoon, cured by a restart after some debugging
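
    As referenced in the NLT1 item above: a minimal sketch of client-side throttling of SRM requests, assuming Python; the srm_ls() call and the 2 Hz target are hypothetical placeholders, not the actual LHCb client or an agreed rate.

      # Sketch only: a token-bucket throttle a client could wrap around its srmLs
      # calls to stay below a target request rate (here 2 Hz instead of ~10 Hz).
      import time

      class Throttle(object):
          """Allow at most `rate` operations per second (burst of `burst`)."""
          def __init__(self, rate, burst=1):
              self.rate = float(rate)
              self.burst = float(burst)
              self.tokens = float(burst)
              self.last = time.time()

          def wait(self):
              while True:
                  now = time.time()
                  self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
                  self.last = now
                  if self.tokens >= 1:
                      self.tokens -= 1
                      return
                  time.sleep((1 - self.tokens) / self.rate)

      def srm_ls(surl):
          raise NotImplementedError("placeholder for the real SRM client call")

      throttle = Throttle(rate=2.0)  # hypothetical target: at most 2 requests/s

      def list_all(surls):
          for surl in surls:
              throttle.wait()  # block until a token is available
              yield srm_ls(surl)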

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS - physics data taking (many short fills yesterday). Important and urgent MC production on-going (not new).
    • WLCG services
      • NTR
    • T0
      • NTR
    • T1
      • SARA SE instability (GGUS:82490): situation more stable now (lower LHCb activity?), but still Destination errors. Still removed from T0 export.
      • INFN-T1 unscheduled downtime (09:00-12:00) for backbone network intervention. Still removed from T0 export.

  • LHCb reports -
    • User analysis, prompt reconstruction and stripping ongoing at T1s
    • MC production at Tier-2s
    • New GGUS (or RT) tickets
    • T1:
      • SARA: decreased the rate of job submission for merging jobs, in order to reduce the pressure on the SRM a bit

Sites / Services round table:

AOB: (MariaDZ) Yesterday's GGUS release went well. Of the test tickets, 5 are still waiting to be closed by WLCG supporters (CERN, OSG, NGI_IT); ticket numbers are in Savannah:128369#comment4. Our "Did you know?..." this month explains the meaning of GGUS ticket number colours in email notifications and search results. The gus.fzk.de domain decommissioning was abandoned to avoid any user inconvenience; details in https://ggus.eu/pages/news_detail.php?ID=461. Other highlights of the release include a standard Cc address in LHCb TEAM tickets and new Support Units for Experiment Dashboards and REBUS; details in https://ggus.eu/pages/news_detail.php?ID=458

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 11-Apr-2012
