Week of 121126

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(AndreaS, Alexei, Giuseppe, Maarten, Xavier, Stefan, WeiJen, Guido, Elena, Mike, Eva, Ulrich, MariaD);remote(Boris/NDGF, Paolo/CNAF, Michael/BNL, Onno/NL-T1, Tiju/RAL, Rolf/IN2P3, Rob/OSG, Dmitri/KIT).

Experiments round table:

  • ATLAS reports -
    • ATLAS General
      • Massive cleaning of the T1 DATADISKs was done over the weekend
      • ATLAS production is recovering from yesterday's incidents (see below)
    • Central Services:
      • AFS ALARM ticket GGUS:88856: the afs154 server had a hard disk problem, which was fixed by rebooting. Tier0 activity was affected. [Xavi reports that the last action in the logs was the operator trying to plug in the console; we are trying to find out what happened]
      • WEB: the AFS problem caused the whole atlas.web.cern.ch web site to be unavailable for hours; it was fixed this morning (INC:198000). The unavailability of the TiersOfAtlasCache web copy affected ATLAS production. Possible ways to avoid this in the future are under discussion.

  • CMS reports -
    • LHC / CMS
      • Machine Development to last until ~noon Thursday
    • CERN / central services and T0
      • CASTOR problem on Sat morning GGUS:88846: problems retransferring files to the T0 [Xavi explains that this was caused by a head node restart while a file was being written, which left that file in an inconsistent state. Only one file was affected]
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • General: some sites have set the wall clock time limit equal to the CPU time limit. This makes any matching based on the CPU work requirement useless, as the wall clock limit will always fire first (unless the job has an efficiency > 1); see the sketch after this report. Sites don't seem to understand the issue. This is not a major problem in most cases, as our job efficiency is good, but it could be degraded by events outside our control (machine too heavily loaded, or too high an overcommitment of the machine in slots vs. cores).
    • Reprocessing: running last jobs at RAL, Gridka and Cnaf "groups". New conditions expected next week.
    • Prompt reconstruction: 50TB collected in last 72 hours
    • MC productions at T2s and T1s if resources available
    • T0: NTR
    • T1:
      • FTS transfer failures to Gridka disk from different sites
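
To make the matching issue in the first LHCb bullet above concrete, here is a minimal Python sketch. It is not DIRAC/LHCb code; the function name, the 48-hour limits and the 90% efficiency figure are made-up examples, used only to illustrate why the wall clock limit fires before the CPU limit whenever the efficiency is below 1.

    def job_fits(cpu_work_needed_s, cpu_limit_s, wallclock_limit_s, efficiency):
        # A job can only finish in a queue if it fits within both limits.
        # efficiency = CPU time used / wall-clock time used (<= 1 in practice,
        # e.g. on an overloaded or overcommitted worker node).
        wallclock_needed_s = cpu_work_needed_s / efficiency
        return (cpu_work_needed_s <= cpu_limit_s
                and wallclock_needed_s <= wallclock_limit_s)

    # Hypothetical queue where the site set wall clock limit == CPU time limit (48 h each):
    limit_s = 48 * 3600
    # A job needing 47 h of CPU work matches on the CPU requirement alone...
    print(job_fits(47 * 3600, limit_s, limit_s, efficiency=1.0))   # True
    # ...but at 90% efficiency it needs ~52 h of wall clock time, so the wall
    # clock limit fires first and the job is killed despite "matching" on CPU work.
    print(job_fits(47 * 3600, limit_s, limit_s, efficiency=0.9))   # False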

Sites / Services round table:

AOB:

Tuesday

Attendance: local(AndreaS, Stefan, Nicolò, Maarten, Mike, Elena, Xavier, Alex, MariaD);remote(Wei-Jen/ASGC, Michael/BNL, Paolo/CNAF, Lisa/FNAL, Rolf/IN2P3, Xavier/KIT, Boris/NDGF, Jeff/NL-T1, Gonzalo/PIC, Tiju/RAL, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • FZK: problem with T0 export to DATATAPE (GGUS:88877). FZK DATATAPE is removed from T0 export.
      • IN2P3: reprocessing was failing because of a backlog on the GridFTP door of the dCache system (GGUS:88899).

Rolf explains that the backlog at IN2P3 was due to the restart of a CE, which in turn restarted a large number of ATLAS jobs, and dCache could not cope with the load. He is confident that the backlog will be quickly absorbed. Elena asks if one can put a cap on the number of jobs. Rolf says that this is not possible with SGE (nor is capping the ramp-up, which is more relevant in this situation). There are discussions with the developers to have this feature in the future.

  • CMS reports -
    • LHC / CMS
      • Machine Development to last until ~noon Thursday
    • CERN / central services and T0
      • VOMS: ANSPGrid CA not recognized by CMS VOMS server GGUS:88889
      • CMS wants to retire the CMSPRODLOG service class. If it is confirmed that all files can be deleted, IT-DSS can do a bulk cleanup
    • Tier-1:
      • ASGC: slow tape migrations, GGUS:88873
      • ASGC: HammerCloud test job failures on decommissioned computing elements, will investigate how to exclude them from testing GGUS:88872
      • RAL: HammerCloud test job failures with qsub timeouts, looks OK now, SAV:134046
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, network investigation in progress GGUS:88752
    • Tier-2:
      • NTR

  • ALICE reports -
    • CERN: EOS free space is down to 180 TB and the occupancy is 94% - what is the plan for adding more space from CASTOR there soon? [Xavier: need to check with Jan, but it should already be planned. Checked: some more capacity will be added by tomorrow; the ~5% free space (180 TB) is not a problem for now]

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: ntr
  • Dashboard (strictly, SAM team): the update of the production SAM API (grid-monitoring instance) will happen TOMORROW (between 10:00 and 11:00 CET). Service disruption of ~30 minutes while the web server is off. The Nagios machines will probably be updated on Thursday; the SAM team will contact the experiment representatives on Wednesday afternoon to discuss further.

AOB:

Wednesday

Attendance: local(AndreaS, Wei-Jen, Ulrich, David, Luca, Nicolò, Elena, Xavier, MariaD);remote(Pavel/KIT, Gareth/RAL, Boris/NDGF, Joel/LHCb, Rob/OSG, Rolf/IN2P3, Michael/BNL, Ron/NL-T1, Lisa/FNAL).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • FZK: ~40K files from reprocessing have been declared as lost.

  • CMS reports -
    • LHC / CMS
      • Machine Development to last until ~noon Thursday
    • CERN / central services and T0
      • CMS Oracle Online Active Data Guard (CMSONR ADG) stopped refreshing at 23:30, affecting conditions and other services at P5 and T0. FIXED this morning by IT-DB. [Luca explains that it was due to a human error during an intervention to make a configuration change around 22:00, which lasted longer than foreseen. The issue was fixed at 1:30.]
      • VOMS: ANSPGrid CA not recognized by CMS VOMS server GGUS:88889 IN PROGRESS.
    • Tier-1:
      • KIT: FTS server outage due to DB issues
      • ASGC: transfer submission to FTS was failing due to a CRL issue on the PhEDEx VOBOX at ASGC GGUS:88961 FIXED
      • ASGC: slow tape migrations, GGUS:88873 FIXED
      • ASGC: HammerCloud test job failures on decommissioned computing elements: list of ASGC CEs needs to be updated in CMS SiteDB, not a site problem GGUS:88872 IN PROGRESS
      • RAL: HammerCloud test job failures with qsub timeouts, looks OK right now but NO REPLY from site yet, SAV:134046
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, network investigation IN PROGRESS GGUS:88752
    • Tier-2:
      • NTR

  • ALICE reports -
    • NTR
    • [Xavier reports that space has been added on EOS and now ALICE has 0.5 PB free]

  • LHCb reports -
    • Reprocessing: running last jobs at RAL, Gridka and Cnaf "groups" (see http://lhcbproject.web.cern.ch/lhcbproject/Reprocessing/sites.html )
    • Prompt reconstruction: CERN + 5 Tier2 sites
    • MC productions at T2s and T1s if resources available
    • T0: NTR
    • T1:
      • FTS transfer failures to Gridka disk from different sites (GGUS:88906). It should be mentioned that we did not receive the notification of the downtime from CIC.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: there was a serious database problem after a routine intervention, but now the situation is much better. The experts say that FTS has now recovered. A SIR will be prepared. About the notification: yesterday around 20:00 an unscheduled downtime was declared, from 00:00 on 27/11 to 09:00 on 28/11.
  • NDGF: ntr
  • NL-T1: today we migrated the FTS database to new hardware. No issues seen.
  • RAL: the GGUS ticket corresponding to the Savannah ticket mentioned above by CMS is in fact solved. Nicolò thinks it could be a Savannah-GGUS bridge problem.
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: the ARCHIVE service class on CASTORCMS is ready. It would be good to know from CMS what the expected usage pattern is.
  • Dashboards: the SAM central services and the SUM production pages were successfully upgraded this morning

AOB:

  • The GGUS release test was OK on European and Asian sites. Tomorrow a complete analysis of the results will be reported.

Thursday

Attendance: local(AndreaS, Wei-Jen, Elena, Nicolò, DavidT, Stefan, Maarten, Ulrich, Eva);remote(Marian/KIT, Michael/BNL, Ronald/NL-T1, Gareth/RAL, Lisa/FNAL, Boris/NDGF, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • FZK: Ongoing issue with dCache (GGUS:88877). FZK DATATAPE is still removed from T0 export. [Marian: in the ticket today there are some comments from Xavier]
      • IN2P3: problem with SRM server (GGUS:88984).
      • RAL: problem with Castor overnight (GGUS:88989). Fixed in early morning. Thanks.

  • CMS reports -
    • LHC / CMS
      • Machine Development & access to last until afternoon, then normal physics data taking from the evening. High pile-up run tomorrow.
    • CERN / central services and T0
      • VOMS: ANSPGrid CA not recognized by CMS VOMS server GGUS:88889 IN PROGRESS. [Ulrich: unfortunately the VOMS expert is away]
      • Tier-0 batch system: LSF monitoring unavailable GGUS:89009 FIXED.
    • Tier-1:
      • RAL: CASTOR issues affecting SAM SRM availability and transfers, now improved, GGUS:89003 and GGUS:89004 IN PROGRESS
      • KIT: HammerCloud test job failures on cream-1-kit.gridka.de with error "Transfer to CREAM failed due to exception: Failed to create a delegation id" GGUS:89022 SUBMITTED
      • FNAL: HammerCloud test job failures on cmsosgce2.fnal.gov after host cert renewal, FIXED yesterday SAV:134134
      • ASGC: HammerCloud test job failures: now jobs are properly submitted to new computing elements, but with wrong role, so they are failing with authorization issues - still not a site issue GGUS:88872 IN PROGRESS
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, network investigation IN PROGRESS GGUS:88752
    • Tier-2:
      • NTR
    • AOB

  • LHCb reports -
    • Reprocessing: running last jobs at RAL, Gridka and Cnaf "groups", restarting with new files on 10 Dec
    • Prompt reconstruction: CERN + 5 Tier2 sites
    • MC productions at T2s and T1s if resources available
    • T0: NTR
    • T1:
      • FTS transfer failures to Gridka disk from different sites (GGUS:88906) [Marian: now FTS should be fine]

Sites / Services round table:

  • ASGC: A CASTOR disk server failed this morning due to a defective storage controller; now fixed
  • BNL: ntr
  • FNAL: ntr
  • KIT: ntr
  • NDGF: this morning a network provider in Denmark had a planned downtime to upgrade a central switch. It had been advertised as just a reduction in redundancy, but in fact it was an outage. It lasted 15 minutes, but it took more time for all services to go back to normal.
  • NL-T1: ntr
  • RAL: The problem last night with CASTOR reported by ATLAS and CMS was fixed. Now we are seeing unusually high failure rates in CMS transfers: CASTOR is OK, we are trying to understand the cause.
  • OSG:
    • we detected problems connecting to bdii206.cern.ch, which now seems to be fine. Ulrich will check if that node can really be used; in the meantime it might be better to monitor the alias instead of the individual nodes. Maarten proposes to provide a list of the BDII nodes currently in production.
    • The test GGUS tickets sent to OSG were received and dealt with without any problem
  • CERN batch and grid services: ntr
  • Dashboards: ntr
  • Databases: ntr

AOB:

  • the usual round of tests for the latest GGUS release was fully successful
  • all the experiment SAM Nagios servers at CERN were updated to the latest version

Friday

Attendance: local(AndreaS, Elena, Wei-Jen, Jarka, Stefan, Maarten, Xavier, Nicolò, Ulrich);remote(Michael/BNL, Gonzalo/PIC, Luis/NDGF, Xavier/KIT, John/RAL, Jeremy/GridPP, Ronald/NL-T1, Rob/OSG, Lisa/FNAL, Rolf/IN2P3, Paolo/CNAF).

Experiments round table:

  • ATLAS reports -
    • CERN/T0
      • NTR
    • T1
      • IN2P3: ongoing problem with the SRM server (GGUS:88984). Excluded from T0 export. We would appreciate it if the site could declare a downtime when it has a severe problem. [Rolf: the SRM problem, which is triggered by long proxies, should be solved by a patch provided by the dCache developers, but we need more time to be sure of that. It is difficult to find out which users cause it, but most likely some CMS users do, as the problem is also visible at a CMS-only Tier-2. About the downtime: it was a mistake, we did not switch it from "at risk" to "outage". We reminded people to be more careful. The ticket should have been updated already]
      • INFN-T1: SRM problem (GGUS:89033). Fixed. Thanks
      • RAL: Frontier server had been down since 18:00 yesterday (GGUS:89063).
      • FZK: Frontier server was down from midnight, fixed in the morning. Thanks. RAL is the backup for FZK; ATLAS jobs in the DE cloud were affected.

  • CMS reports -
    • LHC / CMS
      • Proton-proton physics data taking. High pile-up run this afternoon.
    • CERN / central services and T0
      • VOMS: ANSPGrid CA not recognized by CMS VOMS server GGUS:88889 IN PROGRESS.
      • P5 to Tier-0 transfer system: transfer of data for a test stream got stuck, because the machine used at P5 for the transfers had not yet been added to the exception list of machines allowed to access CASTOR without a kerberos token. FIXED GGUS:89027
    • Tier-1:
      • ASGC: HammerCloud test jobs aborted with CREAM JobSubmissionDisabledException; looks like a transient issue SAV:134184. NOTE: since sites started to upgrade their CREAM CEs to EMI-2, we have seen this behaviour at several sites, e.g. KIT GGUS:88853. CREAM for EMI-2 will auto-disable job submission when it is overloaded, but the default configuration for this feature is a bit conservative and might require some tuning (see the illustrative sketch after this report).
      • IN2P3: requested to increase number of replicas for files in "hot" input dataset GGUS:89054
      • IN2P3: SRM unavailable overnight, affected SAM tests and transfers, FIXED (no ticket)
      • RAL: CASTOR issues affecting transfers, now improved, GGUS:89004 IN PROGRESS
      • RAL: CASTOR DB issues affecting local file access, seen in production jobs, SAM test jobs and HammerCloud jobs, FIXED GGUS:89034
      • KIT: HammerCloud test job failures on cream-1-kit.gridka.de with error "Transfer to CREAM failed due to exception: Failed to create a delegation id" GGUS:89022 FIXED
      • CNAF-->FNAL: ongoing issues with CNAF-->FNAL transfers, network investigation IN PROGRESS GGUS:88752
    • Tier-2:
      • NTR
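
Regarding the CREAM EMI-2 auto-disable behaviour mentioned in the ASGC item above, the Python sketch below only illustrates the general idea of a threshold-based submission limiter; the metric names, threshold values and hard-coded readings are hypothetical and are not the actual CREAM defaults or configuration keys, which each site would tune in its own CE configuration.

    import os

    # Hypothetical thresholds, deliberately on the conservative side; a real
    # limiter monitors many more metrics than the three shown here.
    THRESHOLDS = {
        "load15": 10.0,           # 15-minute load average on the CE node
        "active_jobs": 2000,      # jobs currently managed by the CE
        "pending_commands": 500,  # queued submission commands
    }

    def current_metrics():
        # The job and command counts are hard-coded placeholders for the example.
        load1, load5, load15 = os.getloadavg()
        return {"load15": load15, "active_jobs": 1500, "pending_commands": 120}

    def submission_enabled(metrics, thresholds=THRESHOLDS):
        # New job submission is auto-disabled if any metric exceeds its limit;
        # "tuning" the feature means adjusting these thresholds for the site.
        return all(metrics[name] <= limit for name, limit in thresholds.items())

    print("job submission enabled:", submission_enabled(current_metrics()))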

  • ALICE reports -
    • CNAF: yesterday afternoon there was a lot of contention for disk I/O on the WNs, in particular slowing down jobs using CVMFS. As on the previous occasion when this problem was observed, the ALICE task queue was almost empty, leading to many agents preparing SW for tasks that had meanwhile started at other sites, without new tasks taking their place. It is not clear if that fully explains the problem, though. For now ALICE has switched back from Torrent to using the shared SW area instead.

  • LHCb reports -
    • Reprocessing: running last jobs at RAL and Gridka "groups", restarting with new calibration/files on 10 Dec
    • Prompt reconstruction: CERN + 5 Tier2 sites
    • MC productions at T2s and T1s if resources available
    • T0: NTR
    • T1:
      • CNAF: the CVMFS problems turned out to be caused by a backlink into AFS which was not working correctly. LHCb is working on a fix
      • FTS transfer failures to Gridka disk from different sites (GGUS:88906)

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: since last week there were several read-only ATLAS pools with I/O errors. They are now back online and we are cleaning up the inventory to catch up with the files that ATLAS deleted last week. Then we will check if there are any corrupted files.
  • NDGF: ntr
  • NL-T1: ntr
  • RAL: 1) We fixed the ATLAS Frontier server (the ticket is updated). We will also update the CMS ticket. 2) An LHCb disk server is out of action; it will be back tomorrow.
  • OSG: ntr
  • GridPP: ntr
  • CERN batch and grid services: ntr
  • CERN storage: ntr
  • Dashboards: ntr

AOB:

-- JamieShiers - 18-Sep-2012
