Week of 121105

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Massimo, Alexey, Doug, Andrey, Jerome, MariaD, Maarten, Felix, Michael, Eva);remote(Oliver, Tiju, Lisa, Michael, Pavel,...).

Experiments round table:

  • ATLAS reports -
    • Tier0/1s
      • TRIUMF slow T0 export GGUS:88111: the backup OPN link (1 Gb/s) is being used instead of the primary one (5 Gb/s)
      • RAL slow T0 export GGUS:88112: the backup link is also being used, but it has the same capacity as the primary one.

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Cream CE problems noticed in HammerCloud tests for CERN: GGUS:87987
        • Error messages like:
          • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted
          • Transfer to CREAM failed due to exception: CREAM Start failed due to error [the job has a status not compatible with the JOB_START command!]
        • ticket updated over the weekend with more failures
      • Castor:
        • The 1.4 TB file has been copied off Castor; ticket GGUS:87940 closed
        • Stuck files in the T0EXPORT pool (GGUS:88147): it seems to work again, but there is no comment in the ticket; we have resubmitted the jobs
      • EOS:
        • Problems copying files from Castor to EOS (known problem): the xrootd process crashes and transfers go into an illegal state; older ticket INC:179785 reused (a copy sketch is given after this report)
        • Upgrade on Wednesday, Nov. 7, 10 AM - 12 PM CERN time, announced to the collaboration
    • Tier-1:
      • T1_TW_ASGC: continuing problems writing to tape at ASGC because no media are available. Although tape usage is well below the pledge, ASGC ran out of tape and is struggling to provide enough tape space. We ask for a resolution as quickly as possible. Related ticket GGUS:88148
    • Tier-2:
    • NTR
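  For illustration only (not discussed at the meeting): a minimal sketch of the kind of Castor-to-EOS copy affected by INC:179785, assuming an illustrative CMS file path (the real transfers are driven centrally):
      # copy a file from the CMS Castor instance to EOSCMS (path is illustrative)
      xrdcp -f root://castorcms.cern.ch//castor/cern.ch/cms/store/example/file.root \
            root://eoscms.cern.ch//eos/cms/store/example/file.root
      # if the copy hangs in an "illegal state", the Castor stager view of the source file can be checked
      stager_qry -M /castor/cern.ch/cms/store/example/file.root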

  • ALICE reports -
    • CERN: castoralice/alicedisk migration still on track for the end of November, despite EOS headnode instabilities
    • KIT: for at least 2 months, SAM CE tests sent via the WMS often failed with "no compatible resources"; this was due to 2 of the 3 site BDII nodes being unreachable, fixed Sunday evening (GGUS:88110); other VOs would have been affected as well (a basic reachability check is sketched below)
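  For illustration only (not discussed at the meeting): a basic reachability check against one site BDII node, of the kind that would have exposed the problem; the hostname is an assumption (the actual KIT node names are not given in the report):
      # query one site BDII node directly and list the CEs it publishes
      ldapsearch -x -LLL -H ldap://site-bdii1.gridka.de:2170 \
          -b mds-vo-name=FZK-LCG2,o=grid '(objectClass=GlueCE)' GlueCEUniqueID GlueCEStateStatus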

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/1s
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • IN2P3:
        • data access failures over the weekend, retries successful
      • Gridka:
        • data access failures; agreed to increase the number of GridFTP movers from 5 to 10 in each LHCb pool
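  For illustration only (not discussed at the meeting): a sketch of the kind of change this implies in the dCache admin shell; the pool name is an assumption, and the actual Gridka pool and queue layout is not given in the report:
      # on each LHCb pool cell, raise the mover limit from 5 to 10 and persist the setting
      (local) admin > cd lhcb-pool-01
      (lhcb-pool-01) admin > mover set max active 10
      (lhcb-pool-01) admin > save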
Sites / Services round table:
  • ASGC: a vendor call forced them to schedule an intervention tomorrow, at very short notice. They apologise, as this was not intended.
  • BNL: ntr
  • FNAL:ntr
  • IN2P3: asks LHCb to open a ticket; they are investigating some dCache issues and more information would be useful
  • KIT: Reinstalling CEs. Problems with ATLAS dCache pools (several offline): experts at work
  • NDGF: dCache related problems under investigation (GGUS:87999)
  • NLT1: ntr
  • PIC:ntr
  • RAL: 12:00-18:30, due to DB overload, the SRM service for ATLAS was not working. Now fixed
  • OSG:ntr

  • CASTOR/EOS: nta
  • Central Services: Wednesday morning: deployment of the EMI2 WNs
  • Databases: ntr
  • Dashboard: ntr
AOB: Phone interface not working (we cannot see who is connected from the web interface)

Tuesday

Attendance: local(Massimo, Alexey, Doug, Andrey, Jerome, Maarten, Felix, Michael);remote(Ulf, Stefano, Oliver, Tiju, Burt, Michael, Rolf, Ronald, Kyle).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0/T1
        • SARA: T1 transfer-from-tape errors ongoing; files do make it to disk, but only after several attempts - GGUS:88174
        • SARA: T0 export problems due to a broken DNS (solved) - GGUS:88230
        • CERN: T0 export errors for CERN-PROD DATADISK and DATAPREP - GGUS:88241
    • ATLAS Bulk reprocessing continuing

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Cream CE problems noticed in HammerCloud tests for CERN: GGUS:87987
        • verifying that the problem no longer appears
      • EOS:
        • Problems copying files from Castor to EOS (known problem): the xrootd process crashes and transfers go into an illegal state; older ticket INC:179785 reused. The problem is fixed; keeping the ticket open for further checks and verification that it actually works
        • Upgrade on Wednesday, Nov. 7, 10 AM - 12 PM CERN time, announced to collaboration
    • Tier-1:
      • T1_TW_ASGC: continuing problems writing to tape at ASGC because no media are available. Although tape usage is well below the pledge, ASGC ran out of tape and is struggling to provide enough tape space. We ask for a resolution as quickly as possible. Related ticket GGUS:88148; still ongoing, now trying to move files off site to store them custodially at other T1 sites
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/1s
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
      • CERN:
        • Castor intervention 1pm-4pm today, 6 Nov; CERN SRM storage will be banned for read-write access for any activity.
    • T1:
      • Gridka:
        • 2 CEs were moved from WMS access to direct access; the VO-lhcb-pilot tag is set to allow pilots with VOMS Role=prod. This solved yesterday's problem of aborted pilots at Gridka (see the proxy sketch below).
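  For illustration only (not discussed at the meeting): how a pilot proxy with the role mentioned above is typically created and inspected; the exact group/role string used by the LHCb pilots is an assumption taken from the report:
      # create a VOMS proxy carrying the production role that the CEs now accept
      voms-proxy-init --voms lhcb:/lhcb/Role=prod
      # verify the attributes the CE authorization layer will see
      voms-proxy-info --all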
Sites / Services round table:
  • ASGC: 80 TB of tapes arrived today, another 80 TB will arrive tomorrow. CMS asks for the plan to reach the pledge (~1 PB)
  • BNL:ntr
  • CNAF:ntr
  • FNAL:ntr
  • IN2P3:ntr
  • NDGF: FTS upgraded. dCache will be upgraded this Friday
  • NLT1: GGUS:88174 is partly related to a known dCache issue, but we are investigating because it does not explain the problem completely
  • RAL:ntr
  • OSG: ntr

  • CASTOR/EOS: the ATLAS ticket is solved (spontaneous reboot of an EOS headnode machine)
  • Central Services:ntr
  • Dashboard: ntr
AOB:

Wednesday

Attendance: local(Ale, Doug, Massimo, Andrey, Felix, Jerome, Luca, Maarten, Michael);remote(Audioconf tool not working).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • CREAM CEs: unknown 'held' reasons for glideinWMS pilots: GGUS:88238 (old ticket, recently bridged to GGUS)
      • EOS:
        • Problems copying files from Castor to EOS, older ticket INC:179785 reused; need to revisit after today's upgrade
        • Upgrade on Wednesday, Nov. 7, 10 AM - 12 PM CERN time, announced to collaboration
    • Tier-1:
      • T1_TW_ASGC: tape writing resumed after 100 TB of media had been added yesterday, see GGUS:88148; Wednesday morning 17 TB were waiting to go to tape
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: many job failures after the site was switched to using Torrent yesterday; being investigated.

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/1s
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0: NTR
    • T1:
      • RAL:
        • General power cut; the site is banned for usage by LHCb
Sites / Services round table:
  • ASGC: Problems with CMS (stuck RFIO processes). Solved, ticket can be closed
  • BNL: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: After the dCache upgrade, several disks (ATLAS data) cannot be brought back yet. Work is ongoing
  • PIC: ntr
  • RAL: Power cut at 11:30 UTC; the diesel generators failed. As we speak (14:00 UTC), power is coming back, but we cannot forecast when services will be back (the UPS could not protect critical services like the DB servers)
  • OSG: ntr

  • CASTOR/EOS: Short reboot of EOS ATLAS
  • Central Services: All batch is now on EMI2. We are investigating the LSF ticket (TEAM-->ALARM). In an offline discussion, Massimo suggested not to upgrade tickets from TEAM to ALARM when the service managers have already taken the issue in hand.
  • Data bases: ntr
  • Dashboard: ntr
AOB:

Thursday

Attendance: local(Ale, Doug, Massimo, Andrey, Felix, Jerome, Maarten, Michael, MariaD);remote(Audioconf tool not working).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • CREAM CEs: unknown 'held' reasons for glideinWMS pilots: GGUS:88238 (old ticket, recently bridged to GGUS), being followed up by both sides
      • Castor:
      • EOS:
        • Problems copying files from Castor to EOS, older ticket INC:179785 reused: can be closed, though some follow-up on the cause would still be nice
        • Problem accessing files written to EOS: checksums were missing, GGUS:88297
    • Tier-1:
      • T1_TW_ASGC: tape writing resumed after 100 TB of media had been added yesterday, see GGUS:88148; Thursday morning still 1 TB to migrate; new files go to tape quickly, older files take longer; closing the ticket today
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: job profile looks a lot better now, but not yet clear if yesterday's problem has been fully resolved.

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/1s
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
      • CERN:
        • The LHCb EOS storage allocation was increased from 250 TB to 450 TB, but the LHCb quota was not increased accordingly, making the new capacity unavailable. Solved now by increasing the LHCb quota (see the sketch after this report).
    • T1:
      • RAL:
        • After the general power cut, the site is still banned for usage by LHCb
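  For illustration only (not discussed at the meeting): the kind of EOS quota adjustment that resolves such a mismatch; the e-group, namespace path and exact option syntax are assumptions, only the 450 TB figure is from the report:
      # raise the volume quota on the LHCb namespace to match the new 450 TB allocation
      eos -b quota set -g lhcb-grid -v 450TB -p /eos/lhcb/grid/
      # check the resulting quota node
      eos quota ls /eos/lhcb/grid/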
Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: New CREAM CEs installed. Experiments should check them out
  • NDGF: Still problems (disk-related, on dCache). We hope not to lose any data. Short intervention tomorrow to upgrade the headnodes
  • NLT1: ntr
  • RAL: downtime just ending (14:00 UTC). Key services (FTS, CASTOR) are ready, batch capacity will be enabled immediately after
  • OSG: ntr

  • CASTOR/EOS: An intervention on CMS T0EXPORT/T1TRANSFER is needed to fix recent degradations (and avoid further impact on tape access times). This requires a reconfiguration and a temporary reduction of resources (25-30%)
  • Central Services: The LSF problem should be better now. No issues with the EMI2 WN upgrade
  • Dashboard: ntr

  • GGUS: Changes at the next release (2012/11/28): the 'lcg-ce' GGUS Support Unit will be removed (Savannah:133467); adding multiple ticket attachments with the same filename will no longer be allowed (Savannah:133020).
AOB:

Friday

Attendance: local(Alexandre, Doug, Massimo, Andrey, Jerome, Felix, Maarten);remote(AudioTool not working).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0/T1
        • Recovered from RAL outage - Thanks to site for their efforts
        • Continuing problems with slow dispatch of jobs by CERN LSF. We are in contact with the experts, who are working on the problem.
    • ATLAS Bulk reprocessing still proceeding

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • CREAM CEs:
        • unknown 'held' reasons for glideinWMS pilots: GGUS:88238, waiting for a reply from the factory teams in the US
        • Jobs being aborted and failing on the CERN CREAM CEs, GGUS:88304
      • Castor:
        • T0EXPORT under high load; 8 older, lower-performance disk servers were removed from the pool and 2 newer ones will be added to return to full capacity; currently OK for CMS
      • EOS:
        • Problem accessing files written to EOS (checksums were missing), GGUS:88297: more files fixed (a checksum query is sketched after this report)
        • Stage-out to EOS occasionally stuck, with a high occurrence rate; GGUS:88355, solved
        • Missing files from EOSCMS, GGUS:88358
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR
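  For illustration only (not discussed at the meeting): a quick way to see whether a checksum is actually stored for a file on EOSCMS, relevant to GGUS:88297; the file path is an assumption:
      # query the checksum through the xrootd interface
      xrdfs eoscms.cern.ch query checksum /eos/cms/store/example/file.root
      # or via the EOS CLI, which also shows per-replica information
      eos fileinfo /eos/cms/store/example/file.root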

  • ALICE reports -
    • KIT: the SW installation on the WNs times out for most jobs; the suspicion is that Torrent traffic between the WNs is blocked somehow, i.e. they cannot help each other and all their SW has to come from the seeder at CERN, which has limited bandwidth (on purpose).

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/1s
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • RAL:
        • Disk server failure in the LHCb_DST space, recovered after the server restart.
Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: the CREAM CEs to be used (and monitored) are cream-[12678]-kit.gridka.de
  • NDGF: Swedish dCache site: still recovering dead disks. Some more intervention on Monday (firmware)
  • NLT1: ntr
  • RAL: ntr (but RAL is back after the power cut)
  • OSG: Around 13:00 UTC one BDII went down. The second is taking all traffic. Working on restoring the failed one.
  • CASTOR/EOS: nta
  • Central Services: ntr
  • Dashboard: ntr
AOB:

-- JamieShiers - 18-Sep-2012
