Week of 150511

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Alandes (chair and minutes), Maarten Litmaath (ALICE), Herve Rousseau (Storage), Lorena Lobato (DB), Ulrich Schwickerath (Batch&Grid), Mark Slater (LHCb)
  • remote: Felix Lee (ASGC), Michael Ernst (BNL), Lisa Giacchetti (FNAL), Rolf Rumler (IN2P3), Sang-Un Ahn (KISTI), Dimitri (KIT), Onno Zweers (NL-T1), Pepe Flix (PIC), Tiju Idiculla (RAL), Di Qing (TRIUMF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Apologies, due to other commitments no one from ATLAS can join today
    • The Grid is full, mainly with MC simulation.
    • CVMFS Stratum-1 problems: the Taiwan Stratum-1 is down very often, but shifters will be asked to ignore it. The FNAL Stratum-1 was also partially down over the weekend, but it is not used by ATLAS.

  • CMS reports (raw view) -
    • Apologies for absence(s) this week as well. Hyper-busy weeks with CMS meetings:
      • CMS Collaboration Week ran last week, and a very busy three-day cross-project workshop is taking place now.
    • ESNet problems seem to be fixed
    • <100% availability of some services, but nothing crucial
    • The rest is business as usual

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Operations dominated by MC jobs and user analysis
    • T1
      • Problem contacting the SARA SRM (GGUS:113324). Waiting for NL-T1 to deploy fetch-crl 3.0.16 (only available in EPEL-testing).

Onno explains that NL-T1 is updating fetch-crl from EPEL-testing and that all systems should be up to date by tomorrow.
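
As an editorial aside, one quick way to confirm that fetch-crl is keeping the CRLs fresh is to check the age of the CRL files it maintains. The sketch below is a minimal illustration only, assuming the standard /etc/grid-security/certificates location for *.r0 CRL files and a nominal refresh interval of 24 hours; both are assumptions to adjust per site.

    #!/usr/bin/env python3
    """Minimal sketch: flag CRL files that fetch-crl has not refreshed recently.
    Assumptions: CRLs live as *.r0 files under /etc/grid-security/certificates
    and are normally refreshed at least once a day by the fetch-crl cron job."""
    import time
    from pathlib import Path

    CRL_DIR = Path("/etc/grid-security/certificates")  # standard trust-anchor location (assumed)
    MAX_AGE_HOURS = 24                                  # assumed refresh interval

    def stale_crls(crl_dir, max_age_hours):
        """Yield (path, age in hours) for CRL files older than the threshold."""
        now = time.time()
        for crl in sorted(crl_dir.glob("*.r0")):
            age_hours = (now - crl.stat().st_mtime) / 3600.0
            if age_hours > max_age_hours:
                yield crl, age_hours

    if __name__ == "__main__":
        for crl, age in stale_crls(CRL_DIR, MAX_AGE_HOURS):
            print(f"{crl.name}: last refreshed {age:.1f} hours ago")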

Sites / Services round table:

  • ASGC: Almost all services have recovered from last week's outage. A few tape drives still have problems; this affects ATLAS and is being followed up via GGUS.
  • BNL: NTR
  • CNAF: Not present
  • FNAL: NTR
  • GridPP: Not present
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: Unscheduled intervention on the OPN link from Daejeon to Geneva for a 24-hour looping test on the backup line. KISTI could be completely inaccessible from 11 May 2015 16:00 (UTC) to 12 May 2015 16:00 (UTC).
  • KIT: NTR
  • NDGF: Not present
  • NL-T1: See LHCb report
  • NRC-KI: Not present
  • OSG: NTR
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: Scheduled downtime tomorrow to update the dCache instance. The intervention will last 3 hours, starting at 17:00 (UTC).

  • CERN batch and grid services: Due to the heavy load on the batch system, CMS site functional SAM tests are timing out because they are competing with pilot jobs. This causes the site to appear as unknown. This was already reported at the WLCG Operations Coordination meeting last week, and both T0 and CMS experts are looking into it to find a solution.
  • CERN storage services:
    • CASTORATLAS Upgrade: To be rescheduled
    • EOSCMS MGM Intervention: Tuesday morning (9:00 to 11:00)
    • EOSCMS SRM Intervention: Tuesday afternoon (16:00 to 17:00)
    • CASTORCMS Upgrade: Wednesday morning (10:00 to 11:00)
    • CASTORLHCB Upgrade: Wednesday afternoon (14:00 to 15:00)
  • Databases: NTR
  • GGUS: Not present
  • Grid Monitoring: Not present
  • MW Officer: Not present

AOB:

  • Next meeting on Friday

Thursday: Ascension holiday

  • The meeting will be held on Friday instead.

Friday

Attendance:

  • local: Maria Alandes (Chair&Minutes), Maarten Litmaath (ALICE), Ulrich Schwickerath (Batch&Grid)
  • remote: Tommaso Boccali (CMS), Lucia Morganti (CNAF), Thomas Hartmann (KIT), John Kelly (RAL), Rob Quick (OSG), Rolf Rumler (IN2P3), Mark Slater (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • Apologies, no one from ATLAS can attend today
    • 2 ALARM tickets: CERN and BNL. The tickets were sent to check on the status of the FTS servers: a bug (related to the GFAL2 version used in the FTS server release deployed last week) is causing the bring-online daemon to crash, stopping the staging. RAL found some agents with the daemon dead, as did BNL; CERN was OK.
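
As an illustration of the kind of check the sites ran (not the actual procedure used), the sketch below looks for a live bring-online daemon in the process table. It assumes a Linux host with /proc available and that the daemon's command line contains the substring "bringonline"; the exact binary name may differ between FTS releases.

    #!/usr/bin/env python3
    """Minimal sketch: is a bring-online (staging) daemon alive on this FTS agent host?
    Assumption: its command line contains "bringonline"; adjust for the actual release."""
    from pathlib import Path

    PATTERN = b"bringonline"   # assumed substring of the daemon's command line

    def find_daemon(pattern):
        """Return the PIDs of processes whose /proc command line matches the pattern."""
        pids = []
        for proc in Path("/proc").iterdir():
            if not proc.name.isdigit():
                continue
            try:
                cmdline = (proc / "cmdline").read_bytes()
            except OSError:
                continue  # process exited while we were scanning
            if pattern in cmdline:
                pids.append(int(proc.name))
        return pids

    if __name__ == "__main__":
        pids = find_daemon(PATTERN)
        if pids:
            print("bring-online daemon running, PID(s):", pids)
        else:
            print("bring-online daemon NOT running - staging would stall")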

  • CMS reports (raw view) -
    • Will probably not connect, sorry (at a review meeting)
    • 2015 Production ongoing (at least the one for Trigger)
    • A few issues which were worth a ticket (mostly for cern / storage related problems):
      • GGUS:113687 (CERN REOPENED): all the SAM tests for T2_CH_CERN* are failing on eoscms, with "ERROR: [SE][GetSpaceTokens][SRM_AUTHORIZATION_FAILURE] httpg://srm-eoscms.cern.ch:8443/srm/v2/server: not mapped./DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba". It could be a nasty effect of user "sciaba" having primary group = ALEPH(xu) rather than CMS(zh), but in any case it is not clear why it is happening just now (it has always been like this). See the sketch after the discussion below.
      • GGUS:113678 (CERN CLOSED): "failures of transferring files to T2_CH_CERN EOS.". Probably fixed by a change of firewall rules at CERN.
      • GGUS:113664 (CERN CLOSED): a serious problem with all the files written to EOS by the T0 becoming unreadable later on, with a metadata error. Solved by IT, but it basically blocked T0 activities for at least ~12 hours.
      • GGUS:113657 (CERN OPEN): "the transfers to/from T0_CH_CERN are failing"; it seems to be working by now. The answer we got was that CASTOR was overloaded, but the ticket is still open.
      • GGUS:113707 (IN2P3 OPEN): job failures with "site-local-config.xml does not exist in CVMFS (looked at /cvmfs); is CVMFS running and configured properly?". Not completely clear whether the problem is on the site side or on the service side; investigating.

It is agreed to report on the SAM issue at the next meeting on Monday, since the ticket was updated in the morning and more clarifications may arrive later today. Maria explains that no one from the CERN storage team is present at today's meeting. She will follow up on the issues related to the firewall configuration and the metadata error, to understand whether the EOS team has fully understood the causes and can prevent similar issues in the future. Rolf asks Tommaso to put more details in the IN2P3 ticket, since all jobs seem to be running fine at the site.
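
As a purely illustrative aside on the GGUS:113687 mapping question, the sketch below shows one way to inspect the primary FQAN of a proxy. It assumes a voms-proxy-info client on the PATH and that the primary FQAN is the first one reported; it is not the procedure followed in the ticket, and the expected VO prefix is an assumption.

    #!/usr/bin/env python3
    """Minimal sketch: check which VO the primary FQAN of the current proxy belongs to.
    Assumptions: voms-proxy-info is on the PATH and lists the primary FQAN first."""
    import subprocess

    EXPECTED_VO_PREFIX = "/cms"   # assumed VO prefix to check against

    def primary_fqan():
        """Return the first FQAN reported for the current proxy (assumed to be the primary)."""
        out = subprocess.run(["voms-proxy-info", "-fqan"],
                             capture_output=True, text=True, check=True).stdout
        fqans = [line.strip() for line in out.splitlines() if line.strip()]
        return fqans[0] if fqans else ""

    if __name__ == "__main__":
        fqan = primary_fqan()
        status = "OK" if fqan.startswith(EXPECTED_VO_PREFIX) else "unexpected primary group"
        print(f"primary FQAN: {fqan or '(none)'} -> {status}")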

  • ALICE -
    • IN2P3-CC: an unknown recent change affected the VOBOX starting around Saturday evening, May 9, causing the site to be drained of ALICE jobs this week
      • job submissions stopped as the CEs reported more jobs waiting in the queue than allowed by the threshold configured in AliEn for the site
      • when the threshold was increased late Thursday afternoon, newly submitted jobs ended up "crippled" somehow
        • errors were logged for input sandbox transfers
        • on the WN each job exited quickly
      • a simple cure was found and applied to the environment on the VOBOX in the evening
        • a spurious directory was removed from LD_LIBRARY_PATH (a sketch of this kind of cleanup follows the discussion below)
        • it is not clear why that directory suddenly became a problem
      • the site then worked fine again

Rolf would like to follow up on this issue and understand the details. Maarten explains that no ticket was opened, since the problem was eventually gone, and that there was a mail exchange with Renaud, who has all the details. Rolf will talk to him.
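
For illustration only, the sketch below shows the kind of environment cleanup described above: removing one spurious entry from LD_LIBRARY_PATH before restarting the services on the VOBOX. The directory name is a hypothetical placeholder; the real path is only in the mail exchange mentioned above.

    #!/usr/bin/env python3
    """Minimal sketch: drop a spurious entry from LD_LIBRARY_PATH.
    The directory below is a hypothetical placeholder, not the real one."""
    import os

    SPURIOUS = "/opt/some/stale/lib"   # placeholder only

    def cleaned_ld_library_path(spurious):
        """Return LD_LIBRARY_PATH with every occurrence of `spurious` removed."""
        current = os.environ.get("LD_LIBRARY_PATH", "")
        kept = [d for d in current.split(":") if d and d != spurious]
        return ":".join(kept)

    if __name__ == "__main__":
        os.environ["LD_LIBRARY_PATH"] = cleaned_ld_library_path(SPURIOUS)
        print("LD_LIBRARY_PATH =", os.environ["LD_LIBRARY_PATH"])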

  • LHCb reports (raw view) -
    • Operations dominated by MC jobs and user analysis
    • T1
      • Problem contacting the SARA SRM (GGUS:113324): fetch-crl 3.0.16 seems to have solved the issue. Ticket closed.

Sites / Services round table:

  • ASGC: Not present
  • BNL: Not present
  • CNAF: The ongoing scheduled downtime for the storage system should end this evening. More details in GOCDB.
  • FNAL: Not present
  • GridPP: Not present
  • IN2P3: Comments for CMS and ALICE. See experiment reports
  • JINR: Not present
  • KISTI: Not present (Sang-Un unable to connect. NTR)
  • KIT: The dCache instance was upgraded successfully. The database cleanup was postponed, since the actual purging of the obsolete entries is quite heavy on the database.
  • NDGF: Not present
  • NL-T1: Not present (Official holiday, sent apologies)
  • NRC-KI: Not present
  • OSG: Nothing to report
  • PIC: Not present
  • RAL: CASTOR DB upgrade to be scheduled next week, likely on Tuesday; a downtime will be declared in GOCDB.
  • TRIUMF: Not present

  • CERN batch and grid services:
    • MyProxy running on myproxy.cern.ch will be updated on May 20th (please take a look at the ITSSB entry for more details)
  • CERN storage services: Not present
  • Databases: Not present
  • GGUS: Not present
  • Grid Monitoring: Not present
  • MW Officer: Not present

AOB:
