Week of 100208

LHC Operations

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Jean-Philippe, Ueda, MariaG, Tim, Nicolo, Harry, Ignacio, Timur, Steve, David, Andrea, Eva, Patricia, Roberto, MariaD, Alessandro);remote(Jon, Michael, Rolf, Angela, Gonzalo, Ron, Gang, Rob, Gareth).

Experiments round table:

  • ATLAS reports (Ueda) - NDGF LFC was unstable on Saturday (fixed). Some files were not available (fixed: ggus:55328). NDGF - PIC -- Transfers timeout (ggus:55330). BNL - TAPE buffer became full (fixed, but no response in ggus:55326 -- GGUS-OSG issue?). SARA - some files not available due to migration of data (fixed, but no downtime?, ggus:55334)

  • CMS reports (Nicolo)- T0: Issues with reading files from t0export pool - GGUS #55341. No evidence of delegation issue on FTS2.2 at CERN during weekend. ASGC: Files pending in migration to tape since December now on tape. RAL: 115 files waiting for tape migration. CREAM-CE tests ongoing with problems at CNAF and PIC (developers contacted).

  • ALICE reports (Patricia)- T0 site: Good behavior of the services during the weekend. Both LCG-CE and CREAM-CE have been running in parallel; at this moment (14:00) the number of jobs submitted via CREAM is being increased. T1 sites: No issues to report. T2 sites: GRIF_IPNO: the site is in downtime from today until next Thursday due to the system disk on the SE (two disks in RAID 0). Cyfronet: The site announced a new power cut this week. Today the site will shut alice-se.grid.cyf-kr.edu.pl down and tomorrow they will start a scheduled downtime. The downtime will be long (probably till the end of the week) because they have to rearrange some racks and cables. CapeTown: Issues reported with the local CREAM-CE seem to be solved now. The site admin has asked us to test the system again. GRIF announced on Friday evening the setup of the CREAM-CE for ALICE. The system has been tested by the site admin with positive feedback. It will be put in production as soon as the CREAM-CE is set to point to the same cluster as the LCG-CE.

  • LHCb reports (Roberto)- The few remaining MC productions ran to completion during the weekend; a few more were submitted (of the order of 10M events to be produced). T0 sites issues: LFC Streams SAM test failing against PIC, RAL and CNAF all at the same time. An intervention on the downstream databases had been announced on the IT Status Board but we missed it. T1 sites issues: NIKHEF: MC efficiency lower than anywhere else. Long jobs landed in shortish queues. GGUS ticket opened and closed... most likely a problem on the application side. GridKA: Jobs having permission problems at user script level on local WNs (GGUS open, under investigation). RAL: Shortage in the power supply affected the air conditioning system for a short while. No appreciable impact on LHCb.

Sites / Services round table:

  • FNAL: NTR
  • BNL: the tape buffer full problem reported by ATLAS is due to an incorrect calculation of space by SRM (being looked at). However, the ATLAS ticket has not been received by BNL.
  • IN2P3: NTR
  • KIT: tape hardware back to normal: libraries are available.
  • PIC: the timeout problem for transfers between NDGF and PIC is still not fully understood. The latest ticket has been attached to the master ticket 54806.
  • NDGF: no update on this issue, but not all (geographically distributed) pools are on OPN.
  • NLT1: tape interface to dCache broken, cannot migrate data to tape, being investigated.
  • ASGC: 400 SLC4 nodes are offline to be migrated to SL5 before 12th February.
  • RAL: the CASTOR DB problem reported is due to the fact that it takes 10 minutes to reconfigure the rack. A parameter could be changed but we need to be careful. Power issue on Friday but all services stayed up. Outage tomorrow morning for network reconfiguration: some failures can be expected.

AOB: (MariaDZ) The ALARM 'address' given by the OSG developers for BNL is 6314879307@messaging.sprintpcs.com. When tried last week during the round of ALARM tests after the GGUS release (Feb 3rd), this didn't work. We don't know if this is due to the system not recognising the GGUS certificate, treating the test as spam, or a failure of the sms-to-email interface.

Tuesday:

Attendance: local(Jean-Philippe, Andrea, Lola, Ignacio, Julia, Timur, Harry, Roberto, Patricia, Steve);remote(Jon Bakken, Michael, Ronald, Tomas/NDGF, Angela, John Kelly, Rob).

Experiments round table:

  • ATLAS reports - INFN-T1 -- in preparation for the downtime, transfers to the TAPE endpoint are suspended and queues are set offline. TAIWAN-LCG2 -- in preparation for the downtime, transfers to the site are suspended and queues are set offline (by Suijian). No major issues to report.

  • CMS reports (Andrea)- RAL: the 115 files waiting for tape migration were too small to trigger tape migration, so migration was forced. Transfers T0-->RAL recovered successfully after the end of the scheduled networking intervention. CNAF: SL5 CEs still publishing SL4 OS in the BDII to support VOs not ready for migration - the site reports that this will be changed before the end of February. TIFR: batch system problems. Lisbon: CMS SW installed. CSCS: missing module finally installed. ASGC: setup of tape families.

  • ALICE reports (Lola)- There are no requests for the LCG-CE (in terms of jobs), nor for the CREAM-CE. The CREAM-CE was re-tested this morning before the next MC cycle starts. No issues found with ce201, which is the CREAM-CE used by ALICE. T1 sites: No issues to report. The Pass4 reconstruction at FZK finished with no incidents. T2 sites: Follow-up of the issues reported yesterday with the CREAM-CE in CapeTown. The local system is still facing some instabilities, which have been reported to the developers.

  • LHCb reports (Roberto)- No large activity in the system today. T0 sites issues: LFC: propagation of the trusted VOBOX into the quattor template. Has it been done? T1 sites issues: RAL: Network intervention extended by ~ one hour. Some LHCb users noticed the outage. GridKA: Jobs having permission problems at user script level on local WNs (open). CNAF: the CREAM-CE is failing all jobs submitted through it.

Sites / Services round table:

  • FNAL: network upgrade completed (memory was added to the routers/switches).
  • BNL: Ticket routing being investigated. Jon confirms that the ticket was properly received and handled at FNAL. It was just a misunderstanding. MariaD updated the ticket.
  • NLT1: The interface problem between the tape system and dCache has been solved. A router board will be replaced at Nikhef tomorrow afternoon.
  • KIT: maintenance needed on one of the tape libraries. This will take place between 09:30 and 13:00. Writing will not be affected but reading can be affected.
  • RAL: the scheduled downtime this morning overran by half an hour. Everything is back OK.
  • ASGC: preparing for the migration from SLC4 to SL5.

  • Ignacio: castoratlas upgrade tomorrow between 09:00 and 11:00 (down time).
  • MariaD: Could BNL please update ticket 55326?

AOB: (MariaDZ) Regarding the ATLAS report yesterday: information on GGUS-OSG routing is in https://gus.fzk.de/ws/ticket_info.php?ticket=55326. The GGUS ALARM test to BNL yesterday worked. https://gus.fzk.de/ws/ticket_info.php?ticket=54260 has been re-opened and requires input from Panda experts please.

Wednesday

Attendance: local(Jean-Philippe, Harry, Roberto, Timur, Nicolo, Alessandro, Patricia, Eva, MariaD, Ignacio);remote(Onno/NLT1, Michael/BNL, Thomas/NDGF, Jon/FNAL, Angela/KIT, Rolf/IN2P3, Tiju/RAL, Rob/OSG).

Experiments round table:

  • ATLAS reports - RAL -- problem in transfers to Tier2s (ggus:55419). CERN CASTOR -- no problem observed after the upgrade. A large number of errors has been observed since last night in accessing the t0merge pool due to a rather high load. ATLAS Tier0 experts are aware of the problem and will probably contact CASTOR operations to consult.

  • CMS reports (Nicolo)- T1s: ReReco running on CNAF, FNAL. FNAL: Power and cooling failure, network issues, services recovered in 2-3 hours. RAL: lcgadmin job stuck in queue blocking new sw installation jobs, fixed by site admins. Merge job failure at Vienna T2 and PBS problem at Indian T2.

  • ALICE reports (Patricia)- Currently there is a single MC cycle running plus several users submitting their analysis jobs to the Grid. Therefore, the number of jobs currently running in the Grid is still low. The highest peak of running jobs was reached today at noon with 2500 concurrent jobs. T0 site: Production services were tested yesterday to have them ready for the new MC cycle. T1 sites: No issues to report. T2 sites: Several instabilities could be observed at Subatech. Due to the small number of jobs currently running in production, we are testing some of the new developments which will enter the next AliEn version (2.18). In particular at Subatech, we are testing the new 2nd VOBOX backup infrastructure discussed and agreed during the last ALICE TF Meeting. More details tomorrow during the next ALICE TF Meeting.

  • LHCb reports (Roberto)- Small MC productions were submitted and run yesterday; these are now complete and others are in the pipeline. Internal problem with software installation modules: only CNAF among the T1s seems to have the LHCb application properly installed. LHCb experts are looking at that. Some SAM tests are not publishing their information: the reason is not clear yet. T0 sites issues: none. T1 sites issues: SARA/NIKHEF, GridKA, IN2P3, PIC and CNAF: users with UK-issued certificates seem to experience problems accessing data.

Sites / Services round table:

  • Onno/NLT1: 2 down times are scheduled: one at SARA on Friday morning for disk server firmware upgrade and new kernel, the down time is announced for half day but could be extended; there will be a disk server firmware upgrade at Nikhef next Tuesday.
  • Michael/BNL: NTR
  • Thomas/NDGF: NTR
  • Jon/FNAL: power + cooling incident yesterday due to an EPO circuit problem. CMS services were up and running in 2-3 hours. There was also an authorization module failure which made some SAM jobs fail (due to a kernel bug?). A 4-hour downtime is scheduled for tomorrow morning.
  • Angela/KIT: NTR
  • Rolf/IN2P3: NTR
  • Tiju/RAL: two scheduled downtimes tomorrow: one for CASTOR DB 12:00-14:00 UTC. Also at risk because of the cooling system from 09:00.
  • ASGC: Scheduled downtime today.

  • OSG: (RobQ) The solution of https://gus.fzk.de/ws/ticket_info.php?ticket=52982 will take time. Maria will put the ticket 'in progress' until the job is really done. Rob agrees with this. He also asked to receive a notification for every ALARM ticket to the USA Tier1s (including tests), as well as the OSG GOC. Maria said that all such tickets send email notifications (which may be converted to sms at the destination) to the sites, whereas they are assigned automatically to the ROC. In the USA case this is OSG(Prod). So the OSG GOC gets the ticket assignment, hence the email notification. Notification to Rob's personal email is possible with no action on the GGUS side if Arvind/Shoichi put Rob's personal email in the alarm email field of the Tier1s' Resource Group.

  • Ignacio: CASTOR upgrade for Atlas stager went well this morning, but waiting for more experience with this release before planning CMS upgrade.
  • Eva: privileges on Java packages were removed this morning from the DB servers. There was a replication problem (conditions DB and LFC) from CERN to RAL yesterday morning and the replication had to be restarted in the afternoon. Replication problem between CERN and BNL: collecting traces to pass to Oracle. This will take time. Currently 3 schemas are not getting all the updates.
  • MariaD: Could Atlas experts update the ticket about the Panda problem which occurred during the Christmas shutdown? Alessandro will follow up.

AOB: (MariaDZ) Updated the test ALARM procedure with the Tier1 timezones and opened a standard-format Savannah ticket for the next GGUS release to monitor the success of the exercise every month. Details in https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru

Thursday

Attendance: local(Timur, Lola, Jamie, Maria, Steve, Roberto, Nicolo, Gavin, Ignacio, Andrea, Alessandro, Simone, Giuseppe);remote(Xavier, Jon, Michael, Gang, Thomas Bellman/NDGF, Angela, Ronald, Rob, Rolf, Gareth).

Experiments round table:

  • ATLAS reports -
    1. DDM and Panda being reconfigured to use StoRM+TSM at CNAF (LFC bulk update done yesterday)
    2. Problem with the Spanish CRL not being updated at CERN yesterday. The CRL is imported from the Spanish CA into a cache at CERN (to minimize the number of queries to the CA). But the CERN cache was blacklisted by the Spanish CA 1 month ago because it was hammering the server at 70 Hz. The problem is multi-fold:
      • Communication: the CA alerted CERN 1 month ago, according to the CA. Why was this overlooked? Has the right channel been used?
      • Technical: why so many requests? Even yesterday, when CERN was whitelisted, the rate was 4 Hz. But this should be a few times per day ...
    3. Reprocessing validation was reported yesterday. Pending issues in UK and FR for memory-demanding jobs (killed by batch system). Campaign will start tomorrow. No T2s involved.

Point 2: Gavin - problem understood: a stock squid setup is used as proxy cache. Many conditional GETs were hitting the remote CA server - this caused them to block us (so CRLs did not get updated). Considering setting a lifetime on the cache (default expiry time is 1 hour - one should see roughly this rate of requests to the server checking for updates).
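
A minimal sketch of how such a cache lifetime could be set in a stock squid setup (a hypothetical squid.conf fragment; the CRL file pattern and the timings are illustrative assumptions, not the actual CERN configuration):

  # Treat CRL files (.crl/.r0) as fresh for at least 6 hours (360 min)
  # and at most 1 day (1440 min), overriding the Expires/Last-Modified
  # headers sent by the CA, so squid stops revalidating them with a
  # conditional GET on every client request.
  refresh_pattern -i \.(crl|r0)$ 360 50% 1440 override-expire override-lastmod

With a rule like this the proxy would contact the CA server only a few times per day per CRL, instead of once per client request.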

  • CMS reports -
    • T1s: ReReco of data and MC run on KIT, FNAL, CNAF. Skims running at FNAL - one corrupted file on disk was restaged from tape. Low rate backfill tests of CREAM CE at KIT, CNAF, RAL, PIC.
    • ASGC: Files with wrong checksum on disk buffer, restaged from tape. Squid servers down, restarted.
    • RAL: MC dataset not migrating to tape.
    • T2 highlights
      • MC ongoing in the PIC, IN2P3, RAL, FNAL, CNAF T2 regions
        1. Merge job issues fixed in T2_AT_Vienna
        2. Jobs aborted at T2_FR_CCIN2P3
        3. Problem with /var/condor full at T2_US_Purdue
        4. Transfer of input generator files not moving to T2_TR_METU
      • SAM test failures at T2_IN_TIFR (ongoing, PBS issues)
      • T2_BR_UERJ - network issues, increased bandwidth, restarting transfers.

Jon - not aware of any disk corruption at FNAL yesterday. Nico - it was in the elog; will check and forward the info.

  • ALICE reports - GENERAL INFORMATION: A new set of 4 MC cycles has been running since yesterday night with positive feedback from the WLCG services at all sites. There are no new reconstructions of raw data at T1 sites and no new analysis trains.
    • T0 site
      • Good behaviour of the services during the last night when the new MC cycles were started up.
    • T1 sites
      • All ALICE T1s are currently running in production. CNAF announced just this morning the setup of a new CREAM-CE system. It will be put in production this evening.
    • T2 sites
      • Waiting for news from both Cyfronet and GRIF_IPNO. Both sites announced service interruptions for this week.
      • In general terms, good behavior of all T2 sites during this new MC production. Individual issues will be followed up just after the operations meeting.

  • LHCb reports - No activity (500 jobs in the system from a couple of users)
    • T1 sites issues:
      • All dCache sites: the file access issue reported yesterday was found to be due to a known incompatibility between the version of ROOT (5.26/00a) used by the LHCb analysis application (DaVinci) since Monday and the dcap libraries.
      • Issue with Spanish CA affecting all Spanish users and services running with Spanish credentials.

Sites / Services round table:

  • PIC - besides the CRL problem, there was a scheduled downtime today intended for 1h; it took 1h more because of a 32/64-bit issue with the dCache upgrade. In contact with the dCache team. Jobs were suspended during the intervention - no jobs crashed. FTS channels were set inactive and are now restored - data flowing.
  • FNAL - 1) started downtime - expected to last 4 more hours. 2) A switch crash last night caused a 2h outage of part of the system; it came back normally. Experts say it is not unusual for switches to crash after a power outage (we had one 2 days ago).
  • BNL - ntr
  • ASGC - ntr; scheduled downtime finished and most services have recovered.
  • NDGF - ntr
  • KIT - ntr
  • NL-T1 - last night some downtime at NIKHEF. Scheduled for 45' as a temporary network interruption for the replacement of a border router. Unforeseen config changes - rolled back to the old setup. Investigation ongoing. No date yet for the next attempt to replace the router. 2h unavailable.
  • OSG - follow-up from yesterday: will get a 2nd alarm page for the BNL and FNAL sites. Alarm mail will go to the T1 and also to OSG. An alarm for BNL seems completed but is still open in GGUS; it can be closed.
  • IN2P3 - ntr
  • RAL - "At risk" - modification to pumps for A/C. Memory limit for ATLAS jobs - looking at temporary workaround for current reprocessing. Jobs being chucked out from 3GB queue. Will discuss longer term solution with ATLAS.

  • CERN - following the problems on T0MERGE, identified 3 of the new servers (replacing out-of-warranty machines) that were performing badly. They were removed from the configuration - this seems to cure the problem. Still to follow up.

AOB:

Friday

Attendance: local(Steve, Jean-Philippe, Lola, Timur, Eva, Simone, Roberto, Nicolo);remote(Jon Bakken/FNAL, Xavier/KIT, Michael/BNL, Jeremy/GRIDPP, Rolf/IN2P3, John/RAL, Gang/ASGC, Rob/OSG, Vera/NDGF, Onno/NLT1, Gonzalo/PIC).

Experiments round table:

  • ATLAS reports (Simone)- Two T1s back in production (INFN-T1 and ASGC). Reprocessing should start today. 9 T1s have been validated (FR, DE, NG, US, IT, ES, NL, CA, UK).

  • CMS reports (Nicolo)- T1s: Skims running at FNAL - batch farm issues - and at CNAF - completed. Increased rate of backfill tests of the CREAM CE. RAL: Tape migration restarted. Problems in Lyon not fixed. Hardware problems fixed at the Indian T2. Mapping problem in Bari. Installing SL5 in Pakistan.

  • ALICE reports (Lola)- GENERAL INFORMATION: Production ongoing today with one active MC cycle currently running. No activities in terms of analysis trains for the moment. The issue reported yesterday in terms of SAM (the SAM test suite had not been executed for several hours and results were not published) was solved yesterday afternoon. The certificate with which the SAM infrastructure runs for ALICE had expired. The new certificate was installed on the SAM UIs and in the evening new results were already being published on the SAM page. T0 site: No issues to report. T1 sites: Yesterday afternoon the 2nd CREAM-CE system announced by the site admin entered production for ALICE. T2 sites: Prague T2: Changing the configuration of LDAP for this site in order to comply with the new hardware available at the site.

  • LHCb reports (Roberto)- System draining the last MC productions in preparation for the major upgrade of all DIRAC central machines next Tuesday. LCG58a released from AA; it fixes the incompatibility issue that prevented using the dCache sites (currently banned). T0 sites issues: none. T1 sites issues: RAL: disk server on the USER space token unavailable for a short while yesterday. T2 sites issues: Shared area issues at UFRJ-IF, ru-Moscow-SINP-LCG2 and INFN-NAPOLI-CMS. SAM jobs failing at UKI-NORTHGRID-LANCS-HEP.

Sites / Services round table:

  • FNAL: downtime scheduled in OSG, but information not propagated to CMS and CERN. Availability figures need to be updated.
  • KIT: NTR
  • BNL: NTR
  • GRIDPP: NTR
  • IN2P3: NTR
  • RAL: problems with 2 disk servers belonging to ATLASMCDISK, put offline (probably for the weekend). Disk server numbers: GDSS401 and GDSS402.
  • ASGC: NTR
  • OSG:
    • We have opened GGUS ticket 55511 based on Jon Bakken's downtimes in SAM.
    • I'd like to discuss whose responsibility it is to retire Alarm tickets in GGUS, the sender or the receiver. I've sent mail to Maria and GGUS on this item.
    • I want to request a follow-up Alarm test before the next scheduled test, to be sure we have ironed out all the communication and exchange issues encountered in the tests earlier this week.
    • Email and ticket communication has been started on all 3 items.
  • NDGF: NTR
  • NLT1: scheduled downtime for SARA SRM upgrade this morning: everything ok.
  • PIC: NTR

AOB: