Week of 110328

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, Maarten, Mattia, Fernando, Dirk, MariaD, MariaG, Jamie, Douglas, Luca, Steven, AndreaS); remote(Jon, Ron, Rolf, Rob, Gonzalo, Claudia, Tiju, Felix, Michael).

Experiments round table:

  • ATLAS reports -
    • T0, Central Services
      • Problems contacting the CERN SRM on Saturday morning, causing many transfer failures from all over. The problem was ticketed and continued for an hour or so, then went away; the cause is not understood (GGUS:69077). The GGUS ticket has a link to a SNOW ticket, but the link does not seem to work.
    • T1
      • Problems with BNL data tape on Sunday morning for a few hours. People were concerned, but the Tier-0 exports were not yet falling behind; by mid-day this was fixed. The cause was a loss of space for the FTS logs: this space is normally cleaned automatically by the FTS log cleaner, which apparently failed in this case (GGUS:69083). A minimal sketch of such an age-based log cleanup is given after this report.
    • T2/3
      • Problems with transfers to Genova ("globus_ftp_client: Login incorrect"); this has been reported a number of times over the past couple of weeks. (GGUS:69075)
      • Problems with transfers from Tokyo to Milano; these seem to come and go every few days. (Savannah:80112)
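
    The FTS log-space item above (GGUS:69083) is a generic disk-housekeeping problem. Below is a minimal sketch, in Python, of the kind of age-based cleanup such a log cleaner performs; the directory path, retention period and free-space threshold are illustrative assumptions, not the actual FTS configuration.

      import os
      import shutil
      import time

      # Illustrative values only; the real FTS log cleaner has its own configuration.
      LOG_DIR = "/var/log/fts/transfers"   # hypothetical log directory
      MAX_AGE_DAYS = 7                     # hypothetical retention period
      MIN_FREE_FRACTION = 0.10             # warn if less than 10% of the filesystem is free

      def clean_old_logs(log_dir=LOG_DIR, max_age_days=MAX_AGE_DAYS):
          """Delete plain files under log_dir that are older than max_age_days."""
          cutoff = time.time() - max_age_days * 86400
          removed = 0
          for root, _dirs, files in os.walk(log_dir):
              for name in files:
                  path = os.path.join(root, name)
                  try:
                      if os.path.getmtime(path) < cutoff:
                          os.remove(path)
                          removed += 1
                  except OSError:
                      pass  # file vanished or is not removable; skip it
          return removed

      def free_fraction(path=LOG_DIR):
          """Return the fraction of the filesystem holding 'path' that is still free."""
          usage = shutil.disk_usage(path)
          return usage.free / usage.total

      if __name__ == "__main__":
          removed = clean_old_logs()
          frac = free_fraction()
          print("removed %d old log files, %.0f%% of the filesystem free" % (removed, 100 * frac))
          if frac < MIN_FREE_FRACTION:
              print("WARNING: log partition nearly full; transfer retries may start failing")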

  • CMS reports -
    • LHC / CMS detector
      • Smooth running over the weekend at 2.75TeV
      • Now in technical stop
    • CERN / central services
      • We are trying to figure out why cmsprod was staging. We could use some example file names to chase this down.
      • Degradation of transfer efficiency on Debug from CERN to Tier-1s. GGUS ticket submitted.
    • Tier-0 / CAF
      • NTR
    • Tier-1
      • RAL and KIT in maintenance today
    • Tier-2
      • NTR

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • EXPRESS and FULL validation of work flows for Collision11
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • Oracle intervention this morning on the LHCb Bookkeeping. Tomorrow: SRM CASTOR intervention.
        • We did not see any major problem with the settings of CERNVMFS.
      • T1
        • Recovery of lost data still in progress at RAL, GRIDKA and IN2P3.
      • T2 site issues:
        • NTR

Sites / Services round table:

Sites:

  • ASGC: Downtime announcement (31-3-2011) for 3 hours due to hardware maintenance (electric panels)
  • BNL: On the FTS log space problem (GGUS:69083), BNL clarified that since the logs are used for retrying failed transfers, the problem impacted only the retries and not the majority of the transfers
  • IN2P3: Recovery of deleted LHCb files has finished (100% success). Files will be copied to an area available to the experiment
  • KIT: Downtime announcement (31-3-2011) between 8:00 and 9:00 local time (6:00-7:00 UTC) affecting FTS and LFC due to Oracle move to new hardware
  • NDGF: End of intervention (ATLAS). ALICE is coming back during the meeting
  • NL-T1: Two dCache pools offline this morning (hardware failure); now being restored.
  • OSG: Generation of gridview reports fixed. The numbers for BNL are now correct
  • RAL: CASTOR CMS upgrade ended. Running validation tests

Central services:

  • Dashboard: Timeout observed on the LHCb SRM test
  • Databases: NTR
  • CASTOR (SRM): Intervention announcement (29-3-2011):
    • 10:30-11:00 : SRM-LHCB update to 2.10-1, service may be unavailable
    • 11:00-12:00 : SRM-PUBLIC update to 2.10-1, service may be unavailable
    • 15:00-15:30 : SRM-ALICE update to 2.10-1, service may be unavailable
  • CASTOR (GGUS:69077):
    • Two pathologies observed (in one case there is a core dump); now with the development team
  • Grid Services: NTR

AOB:

  • (MariaDZ) The SNOW-to-GGUS direction of the interface entered production today. The March GGUS release was announced on the CIC portal for this Wednesday 2011/03/30. Among other things, a special mapping for ALARM tickets to the Tier0 will allow CERN service managers to update the SNOW tickets also outside working hours. Related ticket: Savannah:119821

Tuesday

Attendance: local(Massimo, Maarten, Ken Bloom (CMS), Jamie, Dirk, MariaG, MariaD, Zbigniew, Alexei);remote(Michael, Joel, Rolf, Jhen, Jos, Jon, Christian, Tiju, Gonzalo, Claudia).

Experiments round table:

  • ATLAS reports -
    • Technical stop
    • ADC
      • Started data10_hi DESD and NTUP dataset replication to Tier-1s (2 primary replicas ATLAS wide)
      • HI reps are asked to confirm the data placement policy
      • Continuing mc10_7TeV AOD dataset replication to Tier-1s (2 primary replicas ATLAS wide)
    • Alarm tickets
      • AFS at CERN (GGUS:69121; see the AFS entry under Central services below)

  • CMS reports -
    • LHC / CMS detector
      • Now in technical stop, catching up on runs which haven't been reprocessed due to 48 hour conditions hold
    • CERN / central services
      • We are trying to figure out why cmsprod was staging. We could use some example file names to chase this down.
      • Degradation of transfer efficiency on Debug from CERN to Tier-1s. GGUS ticket submitted, has been set to "very urgent" priority.
    • Tier-0 / CAF
      • NTR
    • Tier-1
      • Some MC production in progress, many sites available for WMAgent testing
    • Tier-2
      • Several sites in scheduled downtimes, otherwise MC production continues

  • ALICE reports -
    • T0 site
      • WMS production was enabled yesterday
      • Scheduled intervention to upgrade kernel of ALICE CAF nodes tomorrow afternoon
    • T1 sites
      • FZK: the firewall is at the edge of its capacity and our machines cannot be moved to the LHC OPN. A solution is being investigated
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • EXPRESS and FULL validation of work flows for Collision11
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • SRM castor intervention
      • T1
        • Recovery of lost data is done for all sites. Marco Cattaneo, our computing coordinator, sent a message to our T1 contacts: "I would like to warmly thank all those involved in this recovery for the considerable effort that went into this. This saves LHCb from having to reprocess the full 2010 dataset ahead of the next stripping campaign"
      • T2 site issues:
        • NTR

Sites / Services round table:

Sites:

  • ASGC: NTR
  • CNAF: Tomorrow intervention on ATLAS and LHCb STORM
  • BNL: NTR
  • FNAL: NTR
  • IN2P3: NTR
  • KIT:
    • Discussion on the ALICE issue. Suggestion to look at recent problems experienced by ATLAS. The idea of moving the monitoring traffic to the OPN does not seem the best way to proceed, and in any case it should not be discussed in this meeting
  • NDGF: NTR
  • NL-T1: NTR
  • OSG: NTR
  • PIC: NTR
  • RAL: Tomorrow (Wed) 8am-4pm CASTOR intervention (all VOs affected except CMS)

Central services:

  • Dashboard: NTR
  • Databases:
    • Outage yesterday for the Online DB (due to a network intervention).
    • Thursday we will do a test on the CMS online DB (on CMS request) to move it into standby mode and then back
  • CASTOR (SRM): Interventions going on (29-3-2011): Public and LHCb in the morning, ALICE this afternoon. CMS on Thursday
  • CASTOR (Central Service): CASTOR central services (VMGR, VDQM and Cupv) will be upgraded to version 2.1.10-1, in preparation for tape sizes greater than 2 TB. The upgrade might cause temporary instabilities in accessing data (we expect this only for access from tape).
  • AFS: The incident generating the alarm ticket (GGUS:69121) was solved at 14:45. A hardware failure on server afs155 around 12:45 rendered 3 partitions and about 110 ATLAS volumes inaccessible. Apologies for not having acknowledged the ticket upon reception.
  • Grid Services: NTR

AOB:

Wednesday

Attendance: local(Massimo, Ken, Jarka, Jamie, MariaD, Alessandro, Gavin, Eva, Nilo, Lola, Edoardo); remote(Michael, Rob, Andreas, Jon, Gonzalo, Joel, Stefan, Ron, Tiju, Claudia, Rolf).

Experiments round table:

  • ATLAS reports -
    • Technical stop. Cosmic run will be started around 6pm CEST
    • ADC
      • MC and HI datasets replication between Tier1s
      • No update on GGUS:69121 alarm though the problem is gone
      • DDM Central catalog issue (Mar 29, 20:30-21:30 CEST)

  • CMS reports -
    • LHC / CMS detector
      • Now in technical stop.
    • CERN / central services
      • We are trying to figure out why cmsprod was staging. We could use some example file names to chase this down. (I know nothing more on this!)
      • Degradation of transfer efficiency on Debug from CERN to Tier-1s. (GGUS:69097) submitted, has been set to "very urgent" priority. We have seen no activity on this and it is considered serious.
      • VOMS Core Service on lcg-voms.cern.ch is down.
      • SNOW ticket (INC:026034): there is a file that appears to exist only in the CASTOR namespace at CERN, but with no physical copy.
    • Tier-0 / CAF
      • Still trying to complete processing of several runs.
    • Tier-1
      • Some MC production in progress (trying to get to 50% utilization), many sites available for WMAgent testing
      • Went through open GGUS tickets for the T1SCM meeting. Maria didn't think these were over the threshold for that meeting, but: there are five tickets (68837, 68839, 68842, 68843, 68845) related to the support and configuration of glexec at T1 sites (ASGC, PIC, KIT, CCIN2P3 and FNAL, respectively). These were all created on March 22. All sites involved seem to be investigating the matter.
    • Tier-2
      • Several sites in scheduled downtimes, otherwise MC production continues

  • ALICE reports -
    • T0 site
      • Scheduled intervention to upgrade kernel of ALICE CAF nodes : ongoing now but not finished yet
    • T1 sites
      • SARA-MATRIX (GGUS:69140): the VOBox had not been reachable since last night. The reason was a peak of memory consumption during the night, which crashed the machine.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • EXPRESS and FULL validation of work flows for Collision11
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • SRM castor problem with bringonline: (GGUS:69165)
      • T1
        • IN2P3: Cannot reproduce the shared area problem at IN2P3.
        • RAL: could the EGEE broadcast be identified with a correct message? The unscheduled downtime message was the same as the scheduled one, and when the unscheduled message was sent it was not clear that the CASTOR intervention was not yet over.
        • CNAF: StoRM intervention.
      • T2 site issues:
        • NTR

Sites / Services round table:

Sites:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: Only the ATLAS StoRM instance was upgraded. Since problems were observed, the other instances have not been moved, and CNAF is ready to roll back if further instabilities appear. The upgrade was a StoRM upgrade plus a move to SL5
  • IN2P3: NTR. LHCb (and Dashboard) observed some unavailability of the shared area. Since the situation is not yet clear, a ticket has not been submitted.
  • FNAL: NTR
  • KIT: Tomorrow's downtime for FTS/LFC-Oracle backend is postponed to April 7 for LFC. However, FTS will be upgraded tomorrow: short downtime between 8am and 9am UTC. Dashboard reported some failures (lcg_cp) in SRM tests at KIT.
  • NDGF: NTR
  • NL-T1: some disk server rebooted (affecting ATLAS)
  • OSG: NTR
  • RAL: Some unscheduled downtime added (to correct the confusion between the intervention in local time and UTC)

Central services:

  • Dashboard: NTR
  • Databases: ALICE pit intervention today. Tomorrow the foreseen test on the CMS online DB will be done (failover test)
  • CASTOR: Today's upgrade OK (central services). Tomorrow CMS will be upgraded to the new version of SRM. During the afternoon a decision will be taken regarding the observed instabilities (LHCb on the new version and CMS on the old), and a possible new deployment schedule will be discussed with the experiments
  • Grid Services: The observed VOMS problem is a monitoring glitch
  • Network: NTR

AOB: (MariaDZ) GGUS Release this morning as announced on Monday. The full list of fixes and new features is at https://gus.fzk.de/pages/news_detail.php?ID=442. Also a reminder that the Ticket Timeline Tool (TTT) is available at https://gus.fzk.de/pages/didyouknow.php . ALARM tests started around 10:30am. Notes in Savannah:119812

Thursday

Attendance: local(Massimo, Steve, Gavin, Alexei, Jamie, Stefan, Maarten, Oliver, Ken, Mike, Andrea, Nilo, Eva, Ian, MariaG); remote(Jhen Wei, Michael, Jon, Ronald, Tiju, Gonzalo, Claudia).

Experiments round table:

  • ATLAS reports -
    • Technical stop. Combined cosmic run during the night
    • ADC
      • Central catalog issue (Mar 29th evening). postmortem
      • CASTOR problem: the ticket (GGUS:69077) went unanswered for 10 hours
        • "We are still experiencing waves of failures on the SRM" (ML)
      • AFS problem. (GGUS:69192) alarm ticket.
      • MC derived datasets replication is in progress
      • An issue with prodsys containers registration (fixed)
      • Ticketing. Central Services reported that the request for new machines was postponed for 2+ weeks, because information wasn't propagated to Bernd

  • CMS reports -
    • LHC / CMS detector
      • Now in technical stop, waiting for run plan to be clarified, no high-lumi running expected until Monday.
    • CERN / central services
      • (GGUS:69097) resolved. Turned out to be a stuck server which was rebooted Wednesday afternoon. This exposed some problems:
        • SNOW does not update the GGUS ticket; CMS was unable to know the status of the investigations. A SNOW ticket was made for that (INC:027276).
        • Why did it take two days for the issue to be addressed? Clearer communications are needed; please use the cms-crc-on-duty@cern.ch list.
      • SNOW ticket mentioned yesterday also resolved through the reboot; file was on that server.
      • Intervention this morning on Castor, keeping an eye out for problems, none observed yet.
      • We are trying to figure out why cmsprod was staging. We could use some example file names to chase this down. (Still waiting.)
    • Tier-0 / CAF
      • No news.
    • Tier-1
      • Some MC production in progress (trying to get to 50% utilization), many sites available for WMAgent testing
    • Tier-2
      • Several sites in scheduled downtimes, otherwise MC production continues

  • ALICE reports -
    • T0 site
      • Scheduled intervention to upgrade the kernel of the ALICE CAF nodes: all nodes have been updated successfully except for one, which is still in progress.
    • T1 sites
      • FZK: the problem with MonALISA is still there and has not been understood yet. The site is working and we are submitting jobs, but the monitoring for ALICE is not working
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • EXPRESS and FULL validation of work flows for Collision11
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • The SRM CASTOR problem with bringonline (GGUS:69165) was fixed very quickly. The new version was deployed yesterday evening and fixes the problem.
        • Occasional timeouts from the CERN top-level BDII have been experienced by DIRAC agents since last Wednesday (INC:026625); a sketch of a simple connectivity probe is given after this report.
        • Webafs was down this morning, resulting in pilot jobs not downloading and installing themselves properly. The server is up again; to be confirmed that the problem is solved (INC:027185)
      • T1
        • GRIDKA: Pilots are starting and subsequently dying very fast. The problem could be related to the webafs problem - to be confirmed (GGUS:69224)
      • T2 site issues:
        • NTR
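
    The BDII timeouts reported by LHCb above (INC:026625) were intermittent. Below is a minimal sketch, in Python, of a probe that repeatedly times TCP connections to a BDII endpoint in order to catch such intermittent slowness; the host alias is an assumption (port 2170 is the standard BDII LDAP port), and the probe only checks TCP reachability, not a full LDAP query.

      import socket
      import time

      # Assumed endpoint for illustration; adjust to the actual top-level BDII alias.
      BDII_HOST = "lcg-bdii.cern.ch"
      BDII_PORT = 2170          # standard BDII LDAP port
      TIMEOUT_S = 15            # treat anything slower than this as a failure
      PROBES = 20
      INTERVAL_S = 30

      def probe_once(host, port, timeout):
          """Time a single TCP connection attempt; return (ok, elapsed_seconds)."""
          start = time.time()
          try:
              sock = socket.create_connection((host, port), timeout=timeout)
              sock.close()
              return True, time.time() - start
          except OSError:
              return False, time.time() - start

      if __name__ == "__main__":
          failures = 0
          for i in range(PROBES):
              ok, elapsed = probe_once(BDII_HOST, BDII_PORT, TIMEOUT_S)
              print("probe %2d: %s in %.2fs" % (i + 1, "ok" if ok else "FAILED", elapsed))
              if not ok:
                  failures += 1
              time.sleep(INTERVAL_S)
          print("%d/%d probes failed" % (failures, PROBES))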

Sites / Services round table:

Sites:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: Storm upgrade: increased memory consumption on the GPFS side now under control. System under observation
  • IN2P3: ntr
  • FNAL: In 1h the load-balanced SRM will be put in prod (transparent)
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: Maintenance between 9pm and tomorrow morning (5am). No downtime but reduced throughput rate to tape storage
  • OSG: Alarm test successful
  • PIC: At risk today (dCache patch being applied; transparent)
  • RAL: ntr

Central services:

  • Dashboard: ntr
  • Databases: Switch-over test (CMS online) successful. Since ~11:00 the LHCb online DB has been under load.
  • CASTOR: The combination of high load from maintenance operations triggered a number of instabilities (GGUS:69077, GGUS:69100, GGUS:69180); the manifestation is crashes of the SRM (core dump of the front end), seen as low transfer efficiency. The maintenance load comes from a high number of DB connections from the stagers (5 stagers active due to the migration of boxes) and from draining processes (retiring/recycling/accessing disk servers, which triggers disk-to-disk copies). The load was reduced overnight (draining of machines stopped) and the DB connections were reduced around 10:00 UTC. The LHCb and CMS SRMs have been upgraded to 2.10-2
  • Grid Services: On Monday VOMSr will be upgraded (transparent). As of today all lxbatch nodes have the CVMFS client.
  • Network: ntr

AOB:

Friday

Attendance: local(Massimo, Alexei, Mattia, Ken, Maite, Maarten, Pedro, Eva, Nilo, MariaG, Simone, Jamie); remote(Michael, Jon, Jhen Wang, Alexandre, mariaF, Rolf, Jeremy, Tiju).

Experiments round table:

  • ATLAS reports -
    • Technical stop. Combined cosmic run (data11_cos project pattern)
    • ADC
      • Short LFC@FZK outage on Mar 31
      • MC derived datasets replication between Tier-1s

  • CMS reports -
    • LHC / CMS detector
      • Technical stop is over, but no stable beams expected over the weekend.
    • CERN / central services
      • Ta da -- tested GGUS-SNOW coupling today (GGUS:69258), and now SNOW does update GGUS tickets appropriately. Progress!
      • CRC was phoned yesterday and asked to intercede with rogue user who was doing large amounts of Castor staging. The user was contacted and told to desist, which he did.
      • We are trying to figure out why cmsprod was staging. We could use some example file names to chase this down. (We'll still take the info.)
    • Tier-0 / CAF
      • No news.
    • Tier-1
      • Some MC production in progress (trying to get to 50% utilization), many sites available for WMAgent testing
    • Tier-2
      • Two sites in scheduled downtimes, otherwise MC production continues

  • ALICE reports -
    • T0 site
      • Yesterday all new jobs submitted to CERN started failing due to an incompatibility that had arisen with the older version of AliEn still in use on the VOBOXes; those machines were upgraded to the latest version, now installed on the local disk instead of AFS; various configuration and other issues needed to be debugged before jobs were successful again late Thu evening
    • T1 sites
      • FZK: MonALISA currently is working again
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • EXPRESS and FULL validation of work flows for Collision11
    • New GGUS (or RT) tickets:
      • T0: 1
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • SRM CASTOR problem: the list of protocols passed by the client was not taken in the right order. The problem was worked around by changing the default list of protocols on the SRM side, until the fix is deployed in production. A sketch of this kind of protocol-preference selection is given after this report.
        • All WNs at CERN are now using CERNVMFS (there was a problem with the value of the variable VO_LHCB_SW_DIR on some WNs for 2 hours)
      • T1
      • T2 site issues:
        • NTR
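
    The protocol-ordering issue in the LHCb report above comes down to honouring the client's preference order when selecting a transfer protocol. Below is a minimal, self-contained sketch, in Python, of that selection logic; the protocol names and the helper function are illustrative assumptions and do not reflect the actual CASTOR SRM code.

      def choose_protocol(client_preference, server_supported):
          """Return the first protocol in the client's preference order that the
          server also supports, or None if there is no overlap.

          Iterating in the wrong order (e.g. over the server's list instead of the
          client's) is the kind of bug described above: a supported but unintended
          protocol gets picked.
          """
          supported = set(server_supported)
          for proto in client_preference:   # the client's order is authoritative
              if proto in supported:
                  return proto
          return None

      if __name__ == "__main__":
          # Hypothetical protocol lists, for illustration only.
          client_preference = ["xroot", "rfio", "gsiftp"]
          server_supported = ["gsiftp", "rfio"]
          print(choose_protocol(client_preference, server_supported))  # -> "rfio"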

Sites / Services round table:

Sites:

  • ASGC: ntr
  • BNL: Last night we noticed a degradation of the data transfer performance accompanied by an increased transfer failure rate. We found that the problem was related to the SRM server performance. Detailed investigations of the problem were carried out over the course of last night. We found that all components associated with the SRM service were up and running and the hardware supporting the service was performing at the expected level and stability. The only anomalous behavior we noticed was I/O-intensive activity on the SRM database (running on a dedicated server). This has rarely been observed so far and was mostly caused by postgres vacuum, but even disabling vacuum entirely did not help in this case. We then looked at the database queries and managed to isolate a few of them, finding one kind that locked the table for significant periods of time (a sketch of this kind of query inspection is given after this list). We also noticed that the read I/O rate reached 200 MB/s, which is very unusual for the typical access pattern that hits the database in the context of SRM operations. In summary, we believe that the SRM server caused the problem by sending some pathological queries to its database. The performance returned to normal once the queries were no longer active, but we expect the problem to recur once such queries hit the database again.
  • CNAF: ntr. It was mentioned that although the SRM intervention for LHCb did not take place GOCDB messages were generated anyway
  • IN2P3:
  • FNAL: Load-balanced SRM intervention yesterday OK
  • KIT: Short intervention OK
  • NDGF: Monday 11-15 UTC ATLAS data not available (PDC)
  • NL-T1: ntr
  • OSG: ntr
  • PIC:
  • RAL: ntr
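
  The BNL entry above describes isolating long-running, lock-holding queries on the SRM's PostgreSQL database. Below is a minimal sketch, in Python, of how such queries can be listed via pg_stat_activity; it assumes PostgreSQL 9.2+ column names and the psycopg2 driver, and the connection parameters are placeholders, not BNL's actual setup.

    import psycopg2  # assumed driver; any DB-API PostgreSQL driver would do

    # Placeholder connection parameters, for illustration only.
    DSN = "dbname=srm user=monitor host=localhost"

    # Non-idle sessions ordered by how long their current query has been running.
    # Column names (pid, state, query) are those of PostgreSQL 9.2 and later.
    LONG_QUERIES_SQL = """
        SELECT pid,
               now() - query_start AS runtime,
               state,
               left(query, 120)    AS query_head
          FROM pg_stat_activity
         WHERE state <> 'idle'
         ORDER BY runtime DESC
         LIMIT 20;
    """

    def show_long_running_queries(dsn=DSN):
        conn = psycopg2.connect(dsn)
        try:
            with conn.cursor() as cur:
                cur.execute(LONG_QUERIES_SQL)
                for pid, runtime, state, query_head in cur.fetchall():
                    print("pid=%s runtime=%s state=%s query=%s" % (pid, runtime, state, query_head))
        finally:
            conn.close()

    if __name__ == "__main__":
        show_long_running_queries()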

Central services:

  • Dashboard: ntr
  • Databases: LHCb online problem mentioned yesterday solved by 5pm yesterday. Root cause: network problem. No data loss
  • CASTOR: Behaviour difference in SRM observed by LHCb; fixed by configuration change
  • Grid Services: ntr
  • Network:

AOB: Positive evidence that AFS is improving as a follow-up of some tickets (n.b. not all servers have been migrated yet)

-- JamieShiers - 22-Mar-2011
