Week of 151214

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints make it impossible to choose another time slot, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
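
The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Downtime` record, the `find_outage_conflicts` helper, and the site names `T1-A` to `T1-D` are hypothetical and do not correspond to the actual downtimes calendar or to any real API. Two downtimes are treated as conflicting when both are classified as "outage", they belong to different sites, their time slots overlap, and the sites share at least one VO.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record for one entry of the downtimes calendar."""
    site: str
    severity: str      # "outage" or a milder class such as "warning"
    start: datetime
    end: datetime
    vos: frozenset     # VOs supported by the site


def find_outage_conflicts(downtimes):
    """Return (a, b, shared_vos) for every pair of "outage" downtimes at
    different sites that overlap in time and share at least one VO."""
    outages = [d for d in downtimes if d.severity == "outage"]
    conflicts = []
    for a, b in combinations(outages, 2):
        if a.site == b.site:
            continue
        # Two intervals overlap when each one starts before the other ends.
        overlaps = a.start < b.end and b.start < a.end
        shared = a.vos & b.vos
        if overlaps and shared:
            conflicts.append((a, b, shared))
    return conflicts


# Made-up calendar entries, for illustration only.
calendar = [
    Downtime("T1-A", "outage", datetime(2015, 12, 29, 0, 0),
             datetime(2015, 12, 29, 12, 0), frozenset({"atlas", "alice"})),
    Downtime("T1-B", "outage", datetime(2015, 12, 29, 8, 0),
             datetime(2015, 12, 29, 18, 0), frozenset({"atlas", "cms"})),
    Downtime("T1-C", "warning", datetime(2015, 12, 29, 6, 0),
             datetime(2015, 12, 29, 9, 0), frozenset({"atlas"})),
    Downtime("T1-D", "outage", datetime(2015, 12, 30, 0, 0),
             datetime(2015, 12, 30, 6, 0), frozenset({"lhcb"})),
]

for a, b, shared in find_outage_conflicts(calendar):
    print(f"Conflict: {a.site} and {b.site} for VO(s) {sorted(shared)}")
```

A Tier-1 planning a new downtime could add its candidate slot to the list before running the check; the SCOD could run the same check over the entries for the current and the following two weeks.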

Links to Tier-1 downtimes

ALICE   ATLAS   CMS     LHCB
-       BNL     FNAL    -

Monday

Attendance:

  • local: Eric (CMS), Andrew (LHCb), Maarten (ALICE), Xavi (IT-DSS+SCOD)
  • remote: Lisa (FNAL), Jens (NDGF), Michael (BNL), Dimitri (KIT), Alexander (NL-T1), Peter (ATLAS), Tiju (RAL), Kyle (OSG), Rolf (IN2P3)

Experiments round table:

  • ATLAS reports -
    • Bulk reprocessing due to start this week with corresponding staging from tape.
    • ATLAS detector shutdown this morning for tech stop.

  • CMS reports -
    • Follow-up on the comments at the last meeting about missing files.
      • This is not a problem with our transfer system, but a problem with the production system. It is very rare, possibly a race condition.
      • Working to track it down, debugging added in the last couple of weeks to try
    • Tier0 working on clearing backlog from pp reference run and HI run. High data rates, high uptime.

  • ALICE -
    • The heavy ion data taking has ended successfully: thanks to all sites and experts involved!
      • Reconstruction and reprocessing will continue for many more weeks
    • CERN: team ticket GGUS:118321 opened Sunday evening
      • A small but persistent fraction of the reco jobs could not read their input files from CASTOR
      • Those files had been expunged from the disk cache due to a suboptimal policy
      • The policy has been changed to LRU now, thanks!

  • LHCb reports -
    • Data Processing
      • p-Ar/p-Ne reconstruction almost 100% complete
      • Lead-lead reconstruction started last week and is progressing (note: the files are smaller, so there are several times more of them than for p-p data)
      • Monte Carlo at low pace, user analysis at T0/1/2D sites
      • Data prestaging for the Restripping 15 (Run II data) has started
    • Issues:
      • LHCb VO box VM rebooted overnight. This interrupted some agents and caused some internal false alarms.

Sites / Services round table:

  • ASGC: np
  • BNL: ntr
  • CNAF: np
  • FNAL: ntr
  • GridPP: np
  • IN2P3: ntr
  • JINR: np
  • KISTI: np
  • KIT: ntr
  • NDGF: dCache upgraded to v2.14, including a DB schema upgrade (~3h)
  • NL-T1: ntr
  • NRC-KI: np
  • OSG: ntr
  • PIC: ntr (Pepe sent apologies before the meeting)
  • RAL: ntr
  • TRIUMF: np

  • CERN batch and grid services:
  • CERN storage services: ntr
  • Databases:
  • GGUS:
    • All the test alarms after the release on the 9th have been closed. The INFN alarm took five days to close.
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: David (ATLAS), Manuel (PES), Maarten (ALICE), Asa (ASGC), Xavier (SCOD+DSS)
  • remote: Antonio (CNAF), Michael (BNL), Andrew (Nikhef), Sang-Un (KISTI), Christoph (CMS), Rolf (IN2P3), Tiju (RAL), Rob (OSG)

Experiments round table:

  • ATLAS reports -
    • Bulk reprocessing of 2015 pp data began last night - will run through the xmas break
    • Staging of RAW data from tape caused some timeout errors; this is not really a problem as long as the data keeps flowing
    • This staging in turn caused problems with FTS being overloaded
    • ATLAS computing experts will be best-effort only for next two weeks

  • CMS reports -
    • Quite some work in the pipeline
      • CMS still has a backlog of PromptRECO for Heavy Ion and ppRef data
      • Large MC DIGI-RECO will keep us busy until January
      • RE-RECO of 2015 data about to start
    • Short outage of FTS3 at CERN on Wednesday: GGUS:118383
    • Some Dashboard instabilities since Wednesday: GGUS:118378

  • ALICE -
    • The CASTOR team has rearranged the ALICE disk servers into a single pool:
      • to allow convenient usage of all available resources, thanks!
    • High-memory arrangements for heavy ion reconstruction were undone also at KISTI and KIT
      • to allow more job slots to be used again, thanks!
    • KIT has added more servers to their ALICE tape SE for higher throughput: thanks!
    • CBPF providing opportunistic job slots since Mon evening: thanks!
    • Expectations for the end-of-year break:
      • steady MC production
      • heavy ion reconstruction
      • low analysis activity
    • Thanks to all sites and experts for another successful year!
    • Season's greetings and best wishes for 2016!

  • LHCb reports -
    • Data Processing
      • High level of Monte Carlo running, with intention of continuing to run this during YETS (including on HLT farm); user analysis at T0/1/2D sites
      • Data prestaging for Stripping 24 Run II data ongoing.
    • Issues:
      • Ticket for VOMS (GGUS:118361). The situation was confused by a proxy-chain-related bug in the version of DIRAC released at the same time. There is still an open question about intermittent, but then serious, problems accessing VOMS.

Sites / Services round table:

  • ASGC: Downtime for a storage upgrade of a DPM server: Dec 29 00:00am - Dec 12pm
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP: ntr
  • IN2P3: ntr
  • JINR: ntr
  • KISTI: ntr
  • KIT: ntr
  • NDGF: Upgraded dCache as reported. Large number of files became cache copies (i.e. could be erased without penalty) after the service restart, due to a serious bug affecting 2.12, 2.13 and 2.14. Restored with data loss, list of ~17k files will be sent to ATLAS and ALICE. Three other T1s affected.
  • NL-T1: ntr
  • NRC-KI: ntr
  • OSG: Limited staff, best effort during xmas break.
  • PIC: np
  • RAL: ntr
  • TRIUMF: np

  • CERN batch and grid services: During the xmas break, 3 of the 10 OpenStack cells will be drained and will accept only short jobs. In early January they will be redeployed with new parameters to improve efficiency.
  • CERN storage services:
  • Databases:
  • GGUS:
    • On the 21st of December, from 6:00 to 9:00 UTC, maintenance on central KIT network components will affect GGUS. During the maintenance work, interruptions of the network connection of up to one hour may occur.
  • Grid Monitoring:
  • MW Officer:

AOB:

Topic revision: r16 - 2015-12-17 - MaartenLitmaath
 