Week of 170417

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that needs to be discussed at the daily meeting requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow another time slot to be chosen, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
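The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Downtime` record, its fields and the `conflicts` function are hypothetical names, not part of any WLCG or GOCDB tool, and the downtimes calendar is assumed to have already been read into a list of such records.

```python
from datetime import datetime
from typing import List, NamedTuple


class Downtime(NamedTuple):
    """Hypothetical record for one declared or planned downtime."""
    site: str
    vos: frozenset      # VOs supported by the site, e.g. frozenset({"atlas"})
    severity: str       # "outage" or "at risk"
    start: datetime
    end: datetime


def conflicts(planned: Downtime, declared: List[Downtime]) -> List[Downtime]:
    """Return declared "outage" downtimes at other sites that overlap
    the planned "outage" and share at least one VO with it."""
    if planned.severity != "outage":
        return []  # only "outage" downtimes are subject to the procedure
    return [
        d for d in declared
        if d.site != planned.site
        and d.severity == "outage"
        and d.vos & planned.vos                             # same VO(s)
        and d.start < planned.end and planned.start < d.end  # time overlap
    ]
```

An empty result corresponds to step 1 finding no conflict; a non-empty result means that another slot should be chosen or, failing that, that the conflict should be raised with the SCOD list. Back-to-back downtimes, where one ends exactly when the other starts, are not treated as overlapping in this sketch.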

Links to Tier-1 downtimes

ALICE   ATLAS   CMS     LHCB
        BNL     FNAL

Monday: Easter Monday holiday

  • The meeting will be held on Tuesday instead.

Tuesday

Attendance:

  • local: Andrew (LHCb), Vincent (security), Maarten (ALICE, WLCG), Julia (WLCG), Kate (WLCG, DB, chair)
  • remote: Luca L (CNAF), Tommaso (CMS), Di Qing (TRIUMF), FaHui (ASGC), John K (RAL), Xin Zhao (BNL), Dave M (FNAL), Nurcan (ATLAS), David B (IN2P3)

Experiments round table:

  • ATLAS reports -
    • Activities:
      • data16 and data15 reprocessing is running at full speed, with up to ~320k slots occupied overall during the last week; operations were smooth during the Easter break.
      • CERN Frontier servers are under stress from reprocessing and user jobs. An additional server has been installed but is not functional yet. Frontier loads are under investigation.
    • Problems:
      • VOMS issue: INC:1333585 "Cannot get user attributes from VOMS using voms-admin API". Need urgent help from the VOMS service manager.
Maarten commented that VOMS was upgraded because of an urgent security issue and that the new version contains a bug. A rollback is not possible because of the security threat. The developers will be contacted to check when the bug fix will be available.

  • CMS reports -
    • not much to say, overall a calm week
    • data 16 rereco is coming in the next few days, followed by MC17 processing ...
    • AFS/Kibana was pink from Friday morning until it was solved this morning. There was no apparent problem; we wrote to itmon, but the emergency level is very low. An AFS expert confirmed this was a monitoring issue.
    • a few DAS/CMSWEB alarms on vocms 140/141 since Sunday; still, the system seems to work

  • ALICE -
    • Very high activity

  • LHCb reports -
    • Activity
      • MC Simulation, Data Stripping and user analysis
      • Staging campaigns are ongoing for Data Stripping.
    • Site Issues
      • T0: SRM problems fixed quickly last week (GGUS:127638)
      • T1:
        • CNAF: Upload problems over the weekend are fixed (GGUS:127728); they were due to another VO's GPFS usage pattern.
        • RAL: running with a limit on the number of Merge jobs to avoid problems with storage (GGUS:127617), but the situation is better than before the version downgrade.
        • RRCKI: running with a limit on the number of user jobs due to limits on concurrent open files in dCache (no GGUS for this)
      • T2:
        • Seeing SL6.9 openssl problems at several sites. Tickets issued.

Sites / Services round table:

  • ASGC: ntr
  • BNL: dCache was interrupted last week (memory usage increased after the upgrade to release 3). Increasing the memory in the machine took 2 hours instead of the expected 5 minutes.
  • CNAF: NTR
  • EGI: nc
  • FNAL: ntr
  • IN2P3: ntr
  • JINR:
    • SAM dropped on 10-04-2017 due to a LAN problem, fixed by rebooting one of our two CEs.
    • Downtime "at risk" for tomorrow 18-Apr-17 08:00:00 UTC. One of the two main electric transformers will be replaced; we hope the UPSes will level the power drop.
  • KISTI: nc
  • KIT: nc
  • NDGF: nc
  • NL-T1: nc
  • NRC-KI: nc
  • OSG: nc
  • PIC: nc
  • RAL: LHCb SRM had to be downgraded
  • TRIUMF: NTR

  • CERN computing services: nc
  • CERN storage services: nc
  • CERN databases:
    • LHCBONR was out of processes Wed/Thu night due to a storm of connections from one of the users.
    • An issue with a web server blocked the APEX and ORAWEB applications on Sunday; the ATLAS LAr electronics database was not available as a result.
  • GGUS: ntr
  • Monitoring: nc
  • MW Officer:
    • RHEL/SL 6.9 openssl update fall-out
      • openssl 1.0.1e-57 by default prohibits TLS to be used with DH keys smaller than 1024 bits
      • Java-based services will fail openssl client connections if their version of Java is too old
      • Or if its disabled algorithms are defined incorrectly
      • Java-based services need to run a sufficiently recent version of Java to avoid such problems
      • The latest 1.7 and 1.8 releases should be OK
      • Last Thursday an alert about this matter was sent to the wlcg-operations list
  • Networks:
    • perfSONAR 4.0 was released on 17th of April (http://www.perfsonar.net/release-notes/version-4-0/)
      • Sites on auto-updates will get it automatically; no action is needed.
      • Sites planning to update perfSONARs to CC7 are encouraged to wait until 4.1 is released.
  • Security: ntr

AOB:

Topic revision: r13 - 2017-04-18 - NurcanOzturk
 