Week of 160321

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
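The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Downtime` record, the site names `SITE-A` to `SITE-D`, the severity labels and the helper names `scod_window` and `find_conflicts` are all hypothetical, and the downtimes are hard-coded rather than read from the real downtimes calendar. Two downtimes are treated as conflicting when both are classified as "outage", they belong to different sites that share at least one VO, and their time ranges overlap within the current and the following two weeks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record for one declared downtime."""
    site: str
    severity: str      # assumed labels: "outage" or "warning"
    vos: frozenset     # VOs supported by the site
    start: datetime
    end: datetime


def scod_window(today):
    """Return (start, end) covering the current week and the following two."""
    monday = today - timedelta(days=today.weekday())
    start = datetime(monday.year, monday.month, monday.day)
    return start, start + timedelta(weeks=3)


def find_conflicts(downtimes, window_start, window_end):
    """Return (a, b, shared_vos) for overlapping outages at different sites."""
    outages = [d for d in downtimes if d.severity == "outage"]
    found = []
    for a, b in combinations(outages, 2):
        if a.site == b.site:
            continue
        shared = a.vos & b.vos
        # The overlap of two intervals runs from the later start to the earlier end.
        overlap_start = max(a.start, b.start)
        overlap_end = min(a.end, b.end)
        overlaps = overlap_start < overlap_end
        in_window = overlap_start < window_end and overlap_end > window_start
        if shared and overlaps and in_window:
            found.append((a, b, shared))
    return found


# Made-up example data; none of these downtimes are real.
calendar = [
    Downtime("SITE-A", "outage", frozenset({"atlas", "cms"}),
             datetime(2016, 3, 22, 8), datetime(2016, 3, 22, 22)),
    Downtime("SITE-B", "outage", frozenset({"atlas"}),
             datetime(2016, 3, 22, 12), datetime(2016, 3, 23, 12)),
    Downtime("SITE-C", "warning", frozenset({"atlas"}),
             datetime(2016, 3, 22, 9), datetime(2016, 3, 22, 10)),
    Downtime("SITE-D", "outage", frozenset({"lhcb"}),
             datetime(2016, 3, 22, 9), datetime(2016, 3, 22, 18)),
]
window = scod_window(datetime(2016, 3, 21))

for a, b, shared in find_conflicts(calendar, *window):
    print(f"{a.site} and {b.site} overlap for {sorted(shared)}")
```

In this made-up calendar only SITE-A and SITE-B are flagged: SITE-C declared a "warning" rather than an "outage", and SITE-D shares no VO with the other sites.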

Links to Tier-1 downtimes

ALICE   ATLAS   CMS     LHCB
        BNL     FNAL

Monday

Attendance:

  • local: Kate Dziedziniewicz-Wojcik (chair and minutes), Julia Andreeva (WLCG), Jose (ATLAS), Xavier Espinal (Storage), Laurence Field (Computing), Nilo Segura (DB), Vincent Brillault (Security), Maria Dimou (GGUS), Maria Alandes (WLCG), Maarten Litmaath (ALICE)
  • remote: Antonio Falabella (CNAF), Christian (NDGF), Di Qing, John Kelly (RAL), Jose Felix Molina (PIC), Onno Zweers (NL-T1), Pavel Weber (KIT), Vladimir Romanovskiy (LHCb), Patrick (CMS)

Experiments round table:

  • ATLAS reports (raw view) -
    • A new campaign of pile (digi+reco reconstruction) is running at about 70 Mevts per day (using 100k cores, 8-core jobs). This campaign will go on for the next months.
    • Heavy Ion tests. There will likely be a memory issue with this production. The requirements are still being understood; this week we need to check whether jobs will work with 3 GB per core.
    • FTS shares are not really working, causing annoying transient failures. There was an email thread about this at FTS level and steering.
    • INFN T1 transfer failures are high, with really low efficiency during the last 36h. Is it understood what happened? Many jobs are there and we would like it to be working during the week.
    • FZK-LCG2 has a high rate of transfer failures.
Maria A. asked for a ticket for the INFN issue. No ticket is open; the issue might be mentioned in a few tickets, but otherwise it was something seen during the weekend.

  • CMS reports (raw view) -
    • md5/FTS issue now understood: legacy proxies signed with MD5.
    • Enabling Tier-2 sites for multi-core pilots in progress.
    • CMS users suffered from low voms-admin performance.
    • Taking cosmics this week (MWGR4).

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Validation of 2015 data reprocessing (ProtonArgon and Lead)
    • Validation of RUN1 incremental stripping
    • MC and User jobs
    • PIC: (GGUS:120317) Pilots aborted due to closed queues.

Sites / Services round table:

  • ASGC: NA
  • BNL: NA
  • CNAF:
    • Transfers from CBPF to CNAF are failing. Maarten explained this is a result of a network issue on the CBPF end, there is a ticket open on it and network support are investigating.
  • FNAL: NA
  • GridPP: NA
  • IN2P3: NA
  • JINR: NA
  • KISTI: NA
  • KIT:
    • In order to carry out a reconfiguration of storage file systems, we were migrating data between ATLAS pools. Over the weekend, this resulted in a situation where ATLAS had no writable pools left until this morning (GGUS:120313). Only *DATA endpoints were affected; the *TAPE endpoints were not.
  • NDGF: NTR
  • NL-T1:
  • NRC-KI:
  • OSG: NA
  • PIC: Downtime preparations are ongoing, queues are being emptied. PIC will be down the whole day tomorrow due to an electrical intervention. The site will be back up late in the evening.
  • RAL: NTR
  • TRIUMF: NTR

Maria D. asked for missing pledges slides.

  • CERN computing services: (By MariaD) GGUS:120274 contains CERN VOMS server trouble and needs attention.
Laurence explained the issue is being discussed with developers and is related to all the users expiring at the same time. There is a proposal to assign a random expiry date to the users.
  • CERN storage services: CASTORATLAS upgrade to new v2.1.16 finished today at 10h15' - OK but some errors reported after lunch (being investigated). EOSALICE namespace restart forced yesterday at 11h due to memory usage (waiting for burn-in to finish to deploy new servers with bigger memory)
Xavi explained that the ALICE problems are related to memory limits being reached on the machines (large number of files, each file has a memory footprint). Daemons hang due to lack of memory and require a restart. New machines have arrived with a delay and will be installed and tested as soon as possible (end of this/early next week). Maarten advised not to rush things now, as we do not want a faulty node in the service.
  • CERN databases:
    • One instance of the LHCBONR database ran out of processes and had to be restarted as a result of user activity on Thursday afternoon. The activity has been reported to LHCb.
  • GGUS:
    • Release this Wed 23/3 with ALARM tests as usual.
    • The Support Unit (unused for over 4 years) Network Operations will be decommissioned with this release. The proof: GGUS:118795
    • A TEAM ticket functionality enhancement will be implemented, unless people object. See https://its.cern.ch/jira/browse/GGUS-1497 for details. Summary: TEAM tickets will give 3 assignment possibilities:
      • ROC/NGI selection (via "Notify site" or by selecting the SU) (default)
      • TPM (offered as today)
      • DMSU (new option for all other problems).

  • Security: LibNss package update deadline is today. Reminder tickets will be sent from tomorrow on.

  • MW Officer:

AOB:

  • The new GOCDB release contains WLCG-specific tags (wlcg, alice, cms, atlas, lhcb, tier1, tier2). Please use this functionality to filter WLCG information. This is especially useful for the Downtime calendar.
    • Sites are also reminded to check the tags associated with their services and correct them if they are wrong.
  • ATTENTION: next meeting on Tuesday March 29 !
  • Have a good Easter break !

Topic attachments

  • GGUS-for-MB-Mar-16.pptx (2848.3 K, 2016-03-21 16:42, MariaDimou): GGUS Slides for the March WLCG MB Service Report
Topic revision: r19 - 2016-03-22 - MaartenLitmaath
 