Week of 141201

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Tsung-Hsun (ASGC), Stefan (SCOD), Herve (DSS), Raja (LHCb), Luca (Databases), Maarten (ALICE), Alessandro (ATLAS)
  • remote: Ulf (NDGF), Dimitri (KIT), Michael (BNL), Onno (NL-T1), Tiju (RAL), Dea-Han (KISTI), Rolf (IN2P3), Pepe (PIC), Elizabeth (OSG), Lisa (FNAL)
  • apologies: Antonio (CNAF)

Experiments round table:

  • ATLAS reports (raw view)
    • Data loss:
        
      Approximately 600k files were physically deleted from storage between 24 November at 22:00 and 27 November 
      at 21:00 because of a configuration problem in the Rucio integration infrastructure. Approximately 50k of them 
      were single-replica files, belonging to approximately 250 datasets hosted in datadisk, groupdisk and 
      localgroupdisk. The detailed list of lost files is still being compiled. Measures are being taken now to recover the 
      files that have other replicas available. The recovery procedures for the deleted single-replica files will be discussed 
      and agreed with production and physics coordination in the coming days, once the detailed list of lost files is produced.
      
    • Migration final steps summary https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/52111
      Hi,
      as agreed at this morning's meeting:
      - Tadashi will stop the registering crons this morning. Tadashi will kill all the production jobs and move tasks 
        from active to paused.
      - In parallel to Tadashi's action, DDM ops will cancel the dis and sub subscriptions. SS will be stopped once 
        it has been checked that the subscriptions are gone.
      - All of Central Catalog except traces will be stopped. The logs of the reader CC will be checked.
      - David will send an elog to inform users that their tasks may be "holding". -- done
      - The Rucio migration of the last 300k should start, without dis and sub.
      - Cedric will send an announcement when it is time to restart. There could be iterations if there are datasets 
        that cannot be migrated.
      - Once PanDA is restarted we will check DEFT too.
      Good luck
      

  • CMS
    • NR

  • ALICE
    • NTR

  • LHCb
    • MC and user jobs. Validation for the "Legacy Run1 stripping campaign" is almost finished. Now waiting for responses from the Physics groups before getting the green light to launch the full campaign.
    • T0: Problem with lbvoboxes authenticating to LFC without proxy, starting Friday night (GGUS:110469). Significant effect on productions over the weekend, as new jobs were not being created automatically. User jobs were not affected.
    • T1: NTR

Sites / Services round table:

  • ASGC: Downtime on 5 December to upgrade worker node memory, GOCDB entry done
  • BNL: NTR
  • CNAF: NTR
  • FNAL: NTR
  • GridPP: NR
  • IN2P3: Incident report for last week has been submitted. Downtime on Tuesday 9 Dec for the whole day, almost all services affected, GOCDB entry done
  • JINR: NR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: Full downtime for 1 hour on Wednesday because of dCache head-node updates and reboot. GOCDB entry done
  • NL-T1: NTR
  • OSG: NTR
  • PIC: NTR
  • RAL: A problem occurred on CMS CASTOR on Saturday night; it was fixed on Sunday afternoon. Downtime for LHCb CASTOR because of an OS upgrade on the headnodes, entered in GOCDB
  • RRC-KI: NR
  • TRIUMF: NR

  • CERN batch and grid services: NR
  • CERN storage services: tomorrow CASTOR/CMS and CASTOR/ATLAS will be upgraded to the latest release; this should have no impact on operations
  • Databases: NTR
  • GGUS: NR
  • Grid Monitoring: NR
  • MW Officer: NR

AOB:

Thursday

Attendance:

  • Local: Stefan (SCOD), Tsung-Hsun (ASGC), Herve (Storage), Raja (LHCb), Ulrich (Grid Service), Maarten (ALICE)
  • Remote: Ulf (NDGF), Lisa (FNAL), John (RAL), Antonio (CNAF), Rolf (IN2P3), Jeremy (GridPP), Christoph (CMS), Rob (OSG), Pepe (PIC), Dea-Han (KISTI), Dimitri (KIT)
  • Apologies: Alessandro (ATLAS), Dennis (NL-T1)

Experiments round table:

  • ATLAS
    • Daily Activity
      • Apologies: ATLAS cannot be present today.
      • Nothing special to report for sites.
      • Tuning of ProdSys2 and Rucio, both now in production, is ongoing.

  • CMS
    • NTR

  • ALICE
    • KIT: low job efficiency since the raw data reprocessing started on Monday
      • input data files are being read remotely because they are not available from the local SE
      • a local SE issue was fixed this afternoon, so the efficiency may increase again

  • LHCb
    • MC and user jobs. Validation for the "Legacy Run1 stripping campaign" is finished. Plan to launch the full campaign this evening.
    • T0: Problem with lbvoboxes authenticating to LFC without proxy (GGUS:110469). Still no information in the ticket on whether all the lbvoboxes have been whitelisted for read-only access.
    • T1: NTR
    • Others: "DOS" issue - a large number of queries have been hitting the LHCb configuration services since the morning of 3 December 2014. Three IP addresses were banned in the LHCb configuration (2 from Romania and 1 from FZK). The FZK address was un-banned this morning and jobs are running again at GridKa. An email was sent to the Romanian site admin and we are waiting for a response.

Sites / Services round table:

  • ASGC: NTR
  • BNL: NR
  • CNAF: Problem with CVMFS today because of a faulty squid server. It should be solved now but is being monitored.
  • FNAL: NTR
  • GridPP: NTR
  • IN2P3: NTR
  • JINR: NR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: dCache update yesterday went fine. Migrating to new tape system, some staging may take a bit longer.
  • NL-T1: NTR
  • OSG: VOMS switch-over done, package release next Tuesday to remove old VOMSes from all configurations. Maarten: did you experience any problems? Rob: did not hear of any significant problems.
  • PIC: The dCache instance will be upgraded on 15 Dec. All monitoring plugins for xrootd have been installed.
  • RAL: In the process of upgrading CASTOR headnodes to SLC6. LHCb done, CMS / ATLAS next week.
  • RRC-KI: NR
  • TRIUMF: NR

  • CERN batch and grid services: GGUS ticket from LHCb: all done, the servers have been whitelisted and the ticket is to be updated.
  • CERN storage services: NTR
  • Databases: NR
  • GGUS: NR
  • Grid Monitoring: NR
  • MW Officer: NR

AOB:

Topic revision: r11 - 2014-12-06 - MaartenLitmaath