Week of 110822

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability | SIRs, Open Issues & Broadcasts | Change assessments
ALICE, ATLAS, CMS, LHCb | WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive | CASTOR Change Assessments

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Jamie, Dirk, Cedric, Hong, Ian, Mike, MariaDZ, Lola, Eva, Alex);remote(Michael, Gonzalo, Onno, Tiju, Jhen-Wei, Giovanni, Catalina, Vladimir, Rob).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • ntr
  • T1 sites
    • BNL problem mentioned on Friday (LFC not reachable from outside : GGUS:73642) fixed (firewall issue)
    • Files that cannot be staged at TRIUMF GGUS:73645 (fixed), GridKa GGUS:73678 (fixed), PIC GGUS:73680, IN2P3-CC GGUS:73683
    • SARA: tape downtime. T0 export to the site stopped.
    • IN2P3-CC : Problem at Lyon on IN2P3-CC_VL GGUS:73461
    • RAL :
      • SRM suffering from some deadlocks GGUS:73644 [ Tiju - fixed on Saturday night ]
      • Disk server gdss230, part of the atlasStripDeg (d1t0) space token, is currently unavailable at the RAL Tier1 (Sunday 21 August 2011 11:25 BST).
  • T2 sites
    • CA-SCINET-T2 : SRM problems GGUS:73684 (Thunderstorm activity in the Toronto area caused power glitches which took out our cooling system at the datacentre, hence all systems were shutdown.)
    • IL-TAU-HEP : SRM problem GGUS:73681


  • CMS reports -
  • CERN / central services
    • CERN-PROD dropped out of the BDII. A ticket was submitted. This only has a small analysis impact (a quick BDII check sketch follows this report).
  • T1 sites:
    • We were trying to get pilots running on the new whole-node queue at IN2P3. It looks like a CREAM issue and experts are working on it.
    • File consistency problem reported at FNAL
  • T2 sites:
    • NTR
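
The CMS item above concerns CERN-PROD dropping out of the BDII. As a minimal sketch of how such a check can be made (assuming the python-ldap module and the standard top-level BDII endpoint lcg-bdii.cern.ch:2170; this is an illustration, not part of the original report), the information system can be queried directly:

    # Sketch: query a top-level BDII for the CERN-PROD site entry.
    # Assumptions: python-ldap is installed; lcg-bdii.cern.ch:2170 is reachable.
    import ldap

    BDII_URI = "ldap://lcg-bdii.cern.ch:2170"   # assumed top-level BDII endpoint
    BASE_DN = "mds-vo-name=local,o=grid"        # conventional GLUE 1.3 search base

    conn = ldap.initialize(BDII_URI)
    conn.simple_bind_s()                        # anonymous bind
    results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                            "(GlueSiteUniqueID=CERN-PROD)",
                            ["GlueSiteUniqueID", "GlueSiteName"])
    print("CERN-PROD published" if results else "CERN-PROD missing from the BDII")
    conn.unbind_s()

An empty result would correspond to the symptom tracked in the ticket (GGUS:73700, resolved on Tuesday by a BDII restart).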


  • LHCb reports - Ongoing processing of data.
    • T1
      • IN2P3 :
        • "slow access to data" solved
      • PIC :
        • Problems with data access.
        • "Problems transferring files to CERN" (GGUS:73576) was solved.
      • GRIDKA :
        • Faulty connections from 192.108.46.248 (GGUS:73630) still open.

Sites / Services round table:

  • BNL - ntr
  • PIC - comment on the LHCb situation: just back from vacation - it is true that last week we were a bit thin and unresponsive for LHCb, as the contact had to be away; starting today, acting as the LHCb backup contact. Regarding the GGUS ticket opened for files transferred PIC-CERN and failing: the situation is understood - some hanging transfers from tape required a restart; about 100 files were affected, mainly ATLAS but also LHCb. Following up - there are still issues but we will check.
  • NL-T1 - ntr
  • ASGC - ntr
  • CNAF - concerning the CE problems reported on Friday: the SAM test failure was due to the FE, which had a disk problem. No update yet on the other test failure.
  • FNAL - investigating an 8K-file mismatch between PhEDEx and the local data system. Initial analysis: those files were somehow moved, but nothing else.
  • OSG - ntr

  • CERN - looking into the missing CERN-PROD entry in the BDII

AOB:

Tuesday:

Attendance: local(Jamie, Oliver, Hong, Mike, Manuel, Lola);remote(Joel, Kyle, Xavier, Michael, Tore, Tiju, Ronald, Catalina, Shu-Ting, Gonzalo, Giovanni).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • 2 central service interventions yesterday:
      1. ATLAS DDM central catalogue intervention: temporary glitches in ADC activities
      2. Frontier intervention: not well advertised to ADC shifters/experts (Savannah:85879)
    • T0 data export service (Santa-Claus) upgrade this morning
    • files have locality "unavailable" in the CERN-PROD_DATADISK space token (GGUS:73746) [ latest news from the ticket: a disk server was inaccessible but is now back in production ]
  • T1 sites
    • INFN: failure to contact the SRM endpoint (GGUS:73619)
    • IN2P3-CC: job failures (GGUS:73461) and tape staging issue (GGUS:73683). Site claimed both problems should be fixed now.
    • PIC:
      • transient file transfer failures from CERN-PROD_TZERO to PIC_DATADISK due to SRM overload; problem fixed.
      • massive job failures due to a missing ATLAS release (GGUS:73732); software re-installation in progress. [ Gonzalo - related to the installation of an ATLAS release, which is ongoing. A number of jobs are failing as this release is still not ready. There is a Savannah ticket where the ATLAS team is following this. Since jobs are still being sent, the queue has been stopped for the time being. ]
  • T2 sites:
    • ntr


  • CMS reports -
  • CERN / central services
    • CERN-PROD dropped out of the BDII. This only has a small analysis impact. Solved by a BDII restart, GGUS:73700
    • tomorrow, Aug. 24th, WLCG production database (LCGR) rolling intervention; the service will be available but degraded, shifters and users informed. voms-proxy-init - will it fail or hang? A: should work OK during the intervention (Manuel)
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • File consistency problem resolved at FNAL. The problem was that the adler32 checksum calculation crashed for 12 production jobs; the jobs were not declared as failures and resubmitted, and the files were registered without an adler32 checksum. File metadata and the production infrastructure have been corrected (a minimal adler32 sketch follows this report).
  • T2 sites:
    • NTR
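
The FNAL item above concerned files registered without an adler32 checksum. As an illustration only (not the CMS production code), the checksum can be computed in a streaming fashion with Python's zlib module:

    # Sketch: streaming adler32 of a file, of the kind stored in file metadata.
    # Illustration only; the function name and chunk size are arbitrary choices.
    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        checksum = 1                            # adler32 is seeded with 1
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                checksum = zlib.adler32(chunk, checksum)
        return checksum & 0xFFFFFFFF            # force an unsigned 32-bit value

    # e.g. print("%08x" % adler32_of_file("/path/to/file.root"))

Jobs whose checksum step crashes without being retried end up with metadata lacking this value, which is the consistency problem that was corrected.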


  • ALICE reports -
    • T0 site
      • ALICE has agreed to the CASTOR upgrade on 31st August, as proposed last week by Massimo
      • GGUS:73759: big mismatch between the number of jobs reported by MonALISA and by the information provider. Under investigation.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Sao Paulo. GGUS:73676. The site administrator's certificate was invalidated by the VOMRS CA synchronizer. The registration appears as 'Approved' but he cannot create a proxy as lcgadmin. Solved.

  • LHCb reports - Ongoing processing of data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0
  • T1
    • PIC :
    • SARA :
      • Problem with space tokens for lhcb-user (5 space tokens, 2 with DATA but only 1 active)


Sites / Services round table:

  • BNL - ntr
  • KIT - ntr
  • NDGF -
  • RAL - ntr
  • NL-T1 - ntr
  • FNAL - we investigated the 8K-file mismatch between PhEDEx and the local file system - just a name change. Reminder of the downtime on Thursday, ~4h.
  • ASGC - ntr
  • PIC - ntr
  • CNAF - ntr
  • OSG - ntr

AOB:

Wednesday

Attendance: local(Hong, Jamie, Maria, Mike, Oliver, Luca, Fernando, Alex, Edoardo, MariaDZ);remote(Michael, Joel, Tore, Onno, Kyle, Gonzalo, John, Claudia, Catalina, Soichi, Shu-Ting).

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • Hypervisor incident this morning caused T0 export services to be down for a few hours
    • 2 DB rolling interventions this afternoon
      1. ATLARC
      2. LCGR - we were reminded yesterday by the CMS report; it would be appreciated if IT-DB could give a brief reminder in the daily WLCG Operations Meeting one day before the intervention. Also, since the intervention will affect/degrade the LFC and VOMS services (a simple proxy-check sketch follows this report), it would be nice if an "at-risk" downtime could be declared in GOCDB. [ Luca - it has been on the status board for a few days and was communicated to the ATLAS DBAs. For the LCG DBs we put the intervention on the status board and announce it ~1 week before. Will try to add a reminder. ]
  • T1 sites
    • IN2P3-CC: still see files not being staged from tape, GGUS:73683 reopened
  • T2 sites
    • CA-SCINET-T2: failed to contact the remote SRM, GGUS:73684 reopened
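
The LCGR intervention discussed above will affect/degrade the LFC and VOMS services. As a minimal, hedged sketch of a shifter-side check that proxy creation keeps working during the intervention (the VO name is an assumption and this is not an official procedure), the standard VOMS client tools can be driven from Python:

    # Sketch: check that a VOMS proxy can still be created and has lifetime left.
    # Assumptions: voms-proxy-init/voms-proxy-info are installed; VO name "atlas".
    import subprocess

    def proxy_ok(vo="atlas"):
        # voms-proxy-init contacts the VOMS server to obtain VO attributes
        if subprocess.call(["voms-proxy-init", "--voms", vo]) != 0:
            return False
        # voms-proxy-info --timeleft prints the remaining proxy lifetime (seconds)
        out = subprocess.check_output(["voms-proxy-info", "--timeleft"])
        return int(out.decode().strip()) > 0

    if __name__ == "__main__":
        print("proxy creation OK" if proxy_ok() else "proxy creation FAILED")

A hang rather than a clean failure (the concern raised by CMS) would show up as this check never returning.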

  • CMS reports -
  • CERN / central services
  • T1 sites:
    • Still trying to get pilots running on the new whole node queue at IN2P3. Looks like a CREAM issue and experts are working.
    • tomorrow, Aug. 25th, 3 PM to 7 PM CERN time, FNAL maintenance downtime
    • having trouble with some WMS at CNAF, tracked in savannah: SAVANNAH:123072
  • T2 sites:
    • NTR


  • ALICE reports -
    • General Information: 6 MC cycles, Pass0 and Pass1 reconstruction and a couple of analysis trains ongoing. ~33K running jobs
    • T0 site
      • Mismatch between the number of jobs reported by MonALISA and the information provider solved yesterday. GGUS:73759 solved
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - Ongoing processing of data; finalisation of the current production to clear the old problem of data access.

Sites / Services round table:

  • BNL - ntr
  • NDGF - ntr
  • NL-T1 - small follow-up on the issue reported by LHCb yesterday: there are 5 entries in our DB with the LHCb user space token, only one of which is active. The rest is historical - kept by dCache but no longer working. Trying to find out why some files are not in any space token - that will take some time.
  • PIC - the situation with ATLAS is that the Panda queues are still closed, waiting for the release of 16.2.7 to be finalized; then we will open the queues again
  • RAL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • ASGC - ntr
  • OSG - Soichi from OSG Operations has questions about the BDII. Talking to David Collados about GGUS:73058. Working to upgrade OSG's production BDII CMON to the latest version; this requires switching to the UMD repository. We didn't know about this and wanted to make sure it is the right approach. [ MariaDZ - will take this offline with David ]

  • CERN DB - ATLAS archive and tag DBs patched; WLCG now; tomorrow rolling security patches for the ALICE and COMPASS offline DBs. Had a crash on one instance of ADCR (instance 3) at 12:54, due to an Oracle bug. DQ2 and Panda don't run on that instance.

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS reports -
  • T0/Central Services
    • The AMI (read-only) instance at CERN did not start up properly after the hypervisor incident yesterday
    • 2 out of 3 load-balanced machines for the Panda server are out of the load balancing, causing high load on the remaining machine and therefore service hangs. The reason is that "/tmp" on those 2 machines is filled up by temporary files created by Panda. Experts are investigating (a simple /tmp check sketch follows this report). Production and distributed analysis activities are affected.
    • Distribution of reprocessed AOD/DESD data to the T1s was started yesterday and is progressing well.
  • T1 sites
    • IN2P3-CC: the last part of the RAW-to-ESD jobs of the data reprocessing (the part involving tape staging) finished this morning. The local batch system was reconfigured by the site to avoid failures of file-merging jobs caused by high load on the WNs.
    • PIC got ATLAS release 16.6.7.6.1 reinstalled last night, production queue re-opened.
  • T2 sites
    • ntr
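
The Panda server item above traces the service hangs to "/tmp" filling up on two of the load-balanced machines. As a minimal sketch of the kind of local check that catches this (standard-library Python only; not the actual Panda server monitoring), one can report usage and the oldest temporary files:

    # Sketch: report /tmp usage and the oldest temporary files on a node.
    # Illustration only; paths and the number of files listed are arbitrary.
    import os, time

    def tmp_usage_percent(path="/tmp"):
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bavail * st.f_frsize
        return 100.0 * (total - free) / total

    def oldest_files(path="/tmp", limit=5):
        entries = []
        for name in os.listdir(path):
            full = os.path.join(path, name)
            if os.path.isfile(full):
                entries.append((os.path.getmtime(full), full))
        return sorted(entries)[:limit]

    if __name__ == "__main__":
        print("/tmp usage: %.1f%%" % tmp_usage_percent())
        for mtime, path in oldest_files():
            print("old temp file:", path, time.ctime(mtime))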



  • ALICE reports - General information: 6 MC cycles, Pass0 and Pass1 reconstruction and a couple of analysis trains ongoing. ~33K running jobs
    • T0 site
      • Mismatch between the number of jobs reported by MonALISA and the information provider solved yesterday. GGUS:73759 solved
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations


  • LHCb reports -
  • Ongoing processing of data; finalisation of the current production to clear the old problem of data access

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
  • T1
    • CERN : file outside SPACE token (GGUS:73810)
    • SARA: is it possible to assign the 1TB of the "historical" LHCB-USER space token to LHCB-USER? (GGUS:73087)


Sites / Services round table:

  • CERN FTS T2 Service Proposal:
    • Migrate last 2 of the 3 FTS agent nodes from SL4 to SL5 on Wednesday 31st August, 08:00 to 10:00 UTC.
    • During the intervention newly submitted transfers will queue up rather than being processed. Otherwise it is expected to be transparent.
    • The Tier2 service has been partially running SL5 for a couple of months already for some other active channels.
    • Rollback is quite possible.
    • Affected channels: CERN-MUNICHMPPMU CERN-UKT2 CERN-GENEVA CERN-NCP NCP-CERN CERN-DESY DESY-CERN CERN-MICHIGAN CERN-INFNROMA1 CERN-INFNNAPOLIATLAS CERN-MUNICH CERN-KOLKATA KOLKATA-CERN INDIACMS-CERN CERN-INDIACMS CERN-JINR CERN-PROTOVINO JINR-CERN PROTOVINO-CERN CERN-KI KI-CERN CERN-SINP SINP-CERN CERN-TROITSK TROITSK-CERN STAR-TROITSK CERN-ITEP ITEP-CERN STAR-ITEP CERN-PNPI PNPI-CERN STAR-PNPI STAR-PROTOVINO STAR-JINR STAR-KI STAR-SINP CERN-IFIC CERN-STAR STAR-CERN
    • The migration will be confirmed (or not) at the Monday 29th 15:00 CEST meeting.

  • CERN central services
    • gLite WMS for CMS back to normal, status of jobs up-to-date on all WMS nodes.

AOB:

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 18-Jul-2011
