Week of 101129

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Douglas, Maria, Jamie, Marcin, Maarten, Eddie, Dirk, Steve, Massimo, Harry, Flavia, Roberto, MariaDZ, John); remote(Rolf, Jon, Michael, Joel, Xavier, Kyle, Vladimir, Tiju, Stephen).

Experiments round table:

  • ATLAS reports -
    • T0:
      • PVSS2COOL replication stalled for TRT folders at 03:00 on Sunday. Experts were notified at the time by the shifter; things were back to normal by mid-morning.
      • Allowed logins for lxvoadm5 are still under discussion. A temporary list of about 40 people will be put in place, and a meeting will be held this week or next on how best to manage the allowed-login list for the service.
    • T1:
      • Reprocessing failures at Taiwan were traced to a problem with the release 16.2.1 installation. This release had installation problems at many sites, so it was not a failure of Taiwan but a central release-installation failure. It caused all reprocessing at the Tier-1s to stop, and tasks had to be shuffled to the CERN and DE clouds, which in turn caused a few data-transfer issues between Tier-1s. The aim is to finish the reprocessing cycle today, so this became urgent. The release installation was fixed for Taiwan this morning, but problems remain at other sites.
      • TRIUMF is running low on space in the datatape space token. This may affect the ability to store data for the CA cloud. Current status unclear.
      • INFN-T1 is still having problems with the CE SAM tests. This was reported on Friday and believed fixed enough to go green later, but it failed again over the weekend. The site responded that "the node chosen for the test is not mounting ATLAS' file systems and should not be used for such tests", but the test is still red and something should be fixed here. (GGUS:64660)
    • T2:
      • Disk space issues at a number of sites. This seems to be interference between the data-placement and data-cleaning services on top of the new group and heavy-ion data productions. People are finding workarounds and discussing how best to clean things up.
      • Unscheduled downtime for UKI-SCOTGRID-GLASGOW, believed to be still being fixed.
      • SRM contact issues with WEIZMANN-LCG2, NCG-INGRID-PT, RU-Protvino-IHEP and UKI-SCOTGRID-DURHAM.

  • CMS reports -
    • Experiment activity
      • Good running till this morning
    • CERN and Tier0
    • Tier1 issues and plans
      • Mopping up production at Tier-1s
      • T1_US_FNAL transfer issues. Large tape queue for exports. Many sites are providing files with wrong checksums, Savannah:118013. Exports look like they have recovered. [ Jon - the report indicates a large tape queue at FNAL but there isn't any queue. Stephen - the line has been here since Wednesday with no update; "Exports look like ..." would be consistent with no tape queue. Jon - the source sites have to resolve the problems with files with bad checksums. The whole line is a little old. ]
    • Tier-2 Issues
      • Savannah:118022 - some DPM sites are seeing job failures. Using the DPM 1.7.4 library instead of the CMS-shipped 1.7.0 seems to solve it; an update to CMSSW is pending. (A hedged job-wrapper sketch follows below.)
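
A minimal job-wrapper sketch of the workaround above, assuming the site's DPM 1.7.4 client library lives in a directory such as /opt/lcg/lib64 (hypothetical path) and that the job is launched via cmsRun (also illustrative). The only point is that the site library directory is put in front of LD_LIBRARY_PATH so it is picked up ahead of the CMS-shipped 1.7.0 one.

```python
#!/usr/bin/env python3
# Sketch of a job wrapper that prefers the site's DPM 1.7.4 client library
# over the CMS-shipped 1.7.0 one.  The library path is HYPOTHETICAL; the real
# location depends on the site installation.
import os
import subprocess

SITE_DPM_LIB = "/opt/lcg/lib64"   # hypothetical: where the site's newer libdpm lives

def run_with_site_dpm(cmd):
    """Run `cmd` with the site DPM library directory prepended to LD_LIBRARY_PATH."""
    env = os.environ.copy()
    env["LD_LIBRARY_PATH"] = SITE_DPM_LIB + os.pathsep + env.get("LD_LIBRARY_PATH", "")
    return subprocess.call(cmd, env=env)

if __name__ == "__main__":
    # Illustrative CMSSW invocation; the point is only the environment override.
    run_with_site_dpm(["cmsRun", "analysis_cfg.py"])
```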

  • ALICE reports -
    • T0 site
      • Nothing major to report. The xrootd redirector is losing keys - a SINDES problem, being followed up.
    • T1 sites
      • FZK: issues with proxy renewal solved. The problem was a router that does not like connections to myproxy at CERN using the same source port at KIT - a known problem with certain routers. The workaround is very easy: unset the Globus port range and everything started working nicely (a sketch of the workaround follows after this list). Hopefully this will make the ALICE VO box at KIT somewhat more stable.
    • T2 sites
      • Usual operations
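
A minimal sketch of the KIT workaround mentioned above, assuming the renewal goes through myproxy-logon against myproxy.cern.ch (the exact command and options used on the VO box are an assumption). The point is simply to unset the Globus port-range variables before the call, so outgoing connections use ordinary ephemeral source ports.

```python
#!/usr/bin/env python3
# Sketch: run the proxy-renewal command with the Globus port-range variables
# unset, so the site router does not see repeated connections from the same
# source port.  The renewal command itself is illustrative.
import os
import subprocess

def renew_proxy(myproxy_server="myproxy.cern.ch"):
    env = os.environ.copy()
    # Drop any fixed port range for outgoing Globus/myproxy connections.
    for var in ("GLOBUS_TCP_PORT_RANGE", "GLOBUS_TCP_SOURCE_RANGE"):
        env.pop(var, None)
    # Illustrative renewal call; options depend on the local setup.
    return subprocess.call(["myproxy-logon", "-s", myproxy_server], env=env)

if __name__ == "__main__":
    renew_proxy()
```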

  • LHCb reports - Reprocessing going at full steam. Merging during the weekend
    • T0
      • The CASTOR LHCBDST service class was unusable for almost the whole weekend (GGUS:64693). The intense merging activity and the very aggressive FAILOVER system in LHCb led to a situation with many queued transfers, and the two disk servers there could not cope with the load.
    • T1 site issues:
      • Issue with SARA on Friday when some users could not delete files. Some files were registered with a role that was not a user role; this has now been fixed.
      • Timeouts were increased at Lyon and the number of failures decreased drastically.
      • SAM tests (shared-area tests) were timing out at SARA, but the problem disappeared on Sunday.

Sites / Services round table:

  • FNAL - ntr
  • BNL - ntr
  • IN2P3 - a CMS user opened a ticket for the IN2P3 CC and the site is now waiting for a reply from this user. The ticket has been urgent since November 23. Please can we get a reply, or close it? GGUS:64447.
  • KIT - ntr
  • RAL - one disk server (a DISK0TAPE1 machine) is down - no un-migrated files.
  • INFN - our site was mentioned as having problems with SAM tests. There is a problem with the submission of those tests: some tests landed on a WN that is not mounting the ATLAS file system. Is there a problem with the role of those tests?
  • OSG - network problems with BDII. Network now seems to be stable and now all working correctly. Rob will send mail today about rolling back changes on CERN side.

  • CERN DB - PVSS2COOL problem. The ATLAS offline instance 4 rebooted due to high load, with many sessions connecting to the DB at the same moment. Not sure whether this is a consequence of what happened at 03:00 UTC or not.
  • CERN FTS Monitor update: on Monday 6th December at around 08:30 UTC http://fts-monitor.cern.ch/ will migrate to https://fts-monitor.cern.ch/ . The migration includes an update of the Lyon FTS monitor package and an update to SL5. A preview of the new service is available at https://fts300.cern.ch/ . During this migration of up to 1 hour the SLS status for the FTS may appear grey (unavailable, not down) for a time.

AOB:

  • GGUS outage on Friday. The investigation is still ongoing and a SIR will be provided.
  • GGUS release on Wednesday 1st December, published in GOCDB and sent to all support units by mail. It will be minuted today and repeated tomorrow. Early on Wednesday morning GGUS will be unavailable while the new release is uploaded. With this release the request from ATLAS to allow shifters to escalate TEAM and ALARM tickets will be implemented (escalation of user tickets was already possible).
  • FroNTier - new RPMs were announced last Wednesday. They fix a vulnerability and hence T1s are encouraged to upgrade. New Squid RPMs were announced last Thursday - ATLAS T1s (at least) are encouraged to upgrade; this should improve service restart times a lot. The CERN services have already been upgraded.

Tuesday:

Attendance: local(Harry(chair), Steve, Nilo, Patricia, Eddie, Roberto, Alessandro, Marcin, Gavin, Flavia, Massimo, Manuel, MariaD, Zbyszek); remote(Jon(FNAL), Joel(LHCb), Thomas(NDGF), Rob(OSG), Gareth(RAL), Pepe(PIC), Jeremy(GridPP), Ronald(NL-T1), Vladimir(CNAF), Stefan(CMS), Dimitri(KIT), Rolf(IN2P3)).

Experiments round table:

  • ATLAS reports - T0 & Central Services:
    • CERN CASTOR problem. Notification of the problem via email worked ok and ATLAS were kept informed of what was going on.
    • DB instance 4 rebooted due to high load caused by COOL, problem lasted for just 15 mins. ATLAS is working to understand what caused the high load.
    • ProdSys Dashboard not updating. Dashboard support contacted. This application is supported by ATLAS.

  • T1:
    • IN2P3-CC transfer problems from FZK and PIC, under investigation: GGUS:64687. An issue with unavailable files was logged as GGUS:64780.
    • IN2P3 have requested ATLAS to increase the number of files being transferred from CERN back to its previous, higher value, but ATLAS would first like to understand what has changed. The request will now be made in a GGUS ticket from Lyon and can be discussed further at the Thursday service coordination meeting.

  • CMS reports - Experiment activity: Good running till this morning

  • CERN and Tier0: No open issue

  • Tier1 issues and plans
    • Testing reprocessing with pile-up at RAL and PIC
    • T1_US_FNAL transfer issues. Many sites are providing files with wrong checksums, Savannah:118013. Queues are OK now. The reason for the wrong checksums is still being investigated, but it is not affecting operations.

  • Tier-2 Issues: no big T2 issues

  • ALICE reports - T0 site
    • swap_full error messages were reported by voalice09 again during the past weekend. The problem was solved by restarting the xrootd daemon locally. This is an old issue that appears quite frequently and, in addition, PES reminded us that this machine is running out of warranty. The replacement machine is ready (new VO box: voalice16) and its configuration was performed this morning. voalice09 will be retired shortly.

  • T1 sites
    • FZK: cream-1 has been reporting 0 available CPUs since this morning. The information is not critical for ALICE; however, the ALICE contact person at the site has been informed about this issue.
    • FZK SE and IN2P3-CC SE: both services are reporting xrdcp problems today in MonALISA. However, the central ML service was itself reporting problems this morning, so the ML expert was contacted before triggering any action at the sites. The latest news is that it is not an ML problem; IN2P3 report it as coming from the xrdcp command.

  • T2 sites
    • Good behaviour of all T2 sites in general; following up the status of Kolkata-T2 (out of production for several days).

  • LHCb reports - Reprocessing going at full steam. Problem with the merging during the weekend (LHCb application problem)

  • T0
    • Requested CASTOR to change the DN of the LHCb data manager to " /C=BR/O=ICPEDU/O=UFF BrGrid CA/O=CBPF/OU=CAT/CN=Renato Santana ", since the certificate of the previous data manager will be revoked overnight. Checks show that this is in place. Massimo suggested in future to do the revocation manually during the working day, so that correct operation can be verified.
    • The issue with the CASTOR LHCBDST service class (GGUS:64693) has not been fully understood. In touch with S. Ponce; it could be related to the same cause that brought down disk servers at RAL months ago: each merging job requires tens of small (~100 MB) files to be downloaded, so the system was running at its limits and further requests (e.g. from the FAILOVER) just piled up (snowball effect). Once the SE was banned for a few hours yesterday afternoon it recovered and started working again. We set the slot limit to 80 - the hardware behind it is 2 disk servers with 22 disks, so there were too many transactions. (A rough toy model of the pile-up follows after this list.)
    • Requested to change LHCb_RDST from T1D1 to T1D0 at CERN only, to switch on garbage collection as it is filling up. Will rethink before the next data taking.
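
A rough toy model of the snowball effect described above; all rates are illustrative assumptions, not measurements. It only shows that once merging plus FAILOVER requests arrive faster than the ~80 slots on the two disk servers can drain them, the backlog grows without bound, and that banning the SE (i.e. cutting the arrival rate) lets it recover.

```python
#!/usr/bin/env python3
# Toy model (all numbers are illustrative assumptions) of the "snowball"
# effect on the LHCBDST pool: if requests arrive faster than the ~80
# concurrent slots can drain them, the queue of pending transfers grows.

SLOTS = 80           # concurrent transfer slots set on the pool
SERVICE_RATE = 0.5   # assumed transfers finished per slot per minute (~100 MB files)
PEAK_ARRIVALS = 60.0 # assumed new requests per minute during peak merging + FAILOVER

def simulate(minutes=120, arrivals=PEAK_ARRIVALS):
    queued = 0.0
    for _ in range(minutes):
        queued += arrivals                            # new transfer requests
        queued -= min(queued, SLOTS * SERVICE_RATE)   # what the slots can drain
    return queued

if __name__ == "__main__":
    print("backlog after 2h at peak load:    %.0f requests" % simulate())
    print("backlog after 2h with SE banned:  %.0f requests" % simulate(arrivals=10.0))
```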

  • T1 site issues:
    • RAL: requested CASTOR to change the DN of the LHCb data manager to " /C=BR/O=ICPEDU/O=UFF BrGrid CA/O=CBPF/OU=CAT/CN=Renato Santana ".
    • GRIDKA: thousands of timeouts were observed last night when retrieving tURLs from the LHCb_DST space token (GGUS:64772). Roberto reported this was due to a hardware failure during the night.
    • LHCb found a problem in their reprocessing software this morning and have rebuilt their application to restart activity, testing first then probably full speed tomorrow.

Sites / Services round table:

  • NDGF: will have a short downtime tomorrow at 12 for kernel upgrades.

  • RAL: have an internal network problem causing slow file transfers.

  • NL-T1: Their scheduled intervention was cancelled but the downtime declaration was left in place (clarified after a question from Joel).

  • KIT: As reported by LHCb there is a hardware problem in one of their diskpools - under investigation.

  • OSG: Yesterday a mail was sent to the pertinent people about resetting the bdii timeout.

  • IN2P3: A reminder to use tickets for reporting problems rather than direct mails to individuals, to avoid problems due to personnel changes. They would also appreciate an answer from CMS to GGUS ticket 64447. Stefano will take a look.

  • CERN CASTOR: There was a one hour CASTOR outage yesterday due to a corrupted database file. Recovery was fast and an incident report will be prepared.

  • CERN database: As reported by ATLAS, the ATLR4 database server has now rebooted twice at the same time of day, around 16:10. It has many open sessions, so it looks like contention. ATLAS distributed computing staff are investigating.

  • CERN Linux: slc4 lxbatch services will close on 13 December and lxplus services on the 14th December.

  • GGUS: new GGUS release tomorrow, to be followed by the usual alarm tests. Please send any GGUS issues for the Thursday SCM to Maria Dimou so she can prepare the slides (someone else will have to present them).

AOB: Flavia is working on the accounting of storage resources for ALICE. She invites KIT to install the xrootd information providers that an ALICE collaborator has made available, so as to validate them and make storage accounting information available for ALICE.

Wednesday

Attendance: local(Harry(chair), Massimo, Steve, Patricia, Roberto, Eddie, Alessandro, Marcin, Eduardo); remote(Jon(FNAL), Onno(NL-T1), Thomas(NDGF), Vladimir(CNAF), Rob(OSG), Stefano(CMS), Tiju(RAL), Pepe(PIC), Dimitri(KIT)).

Experiments round table:

  • ATLAS reports - T0, T1s & Central Services: nothing special to report except a small incident yesterday when the NDGF SRM tape endpoint failed for a few hours.

  • CMS reports - Experiment activity: Taking data with HI fill. No new incidents.

  • CERN and Tier0: No open issue

  • Tier1 issues and plans
    • Not much going on
    • T1_US_FNAL transfer issues. Many sites are providing files with wrong checksums, Savannah:118013. Queues are OK now. The reason for the wrong checksums is still being investigated. Will stop reporting this daily, since it is not being solved fast. Jon reminded that this is due to bad incoming files, of which there are about 70, with sources at two Tier-1s and multiple Tier-2s. Stefano agreed this is an issue for CMS dataops.
    • T1_FR_CCIN2P3, about GGUS:64447 as indicated yesterday:
      • ticket content: it seems a user wrote a confused/incomplete report to the wrong place; we tell everyone to use the CMS crabFeedback list for this, at least assuming the user was using the CMS-supported tool CRAB. Nothing prevents adventurers from doing their own grid submissions.
      • ticket envelope: sites should be able to assign tickets to CMS for clarification without going through this meeting, correct? To be followed up with GGUS.

  • Tier-2 Issues: no big T2 issues

  • ALICE reports - T0 site
    • No issues reported by ALICE after the swap of voalice09 to the new voalice16.
    • Pending actions on voalice16 are to increase its importance in CDB and put it back into production mode. After this voalice09 can be retired.

  • T1 sites
    • CNAF: the local CREAM system (ce01) failed at submission time this morning. The problem appears only when the submitted JDL file includes an ISB; the reported error indicates an error response from the FTP server (GGUS:64827). The ALICE contact person at the site was contacted and this is now solved. The site also suggests using their ce07 to reduce any overload.
    • Local SEs at IN2P3 and FZK: good behaviour of the local SE at IN2P3. However FZK still fails with the same error messages reported yesterday; the issue is being tracked through GGUS:64836. The probable cause is lack of disk space.

  • T2 sites
    • The Kolkata issue reported yesterday has been identified and is awaiting a solution.

  • LHCb reports - Reprocessing of 2010 data going at full steam. Merging: testing jobs running.

  • T0
    • Opened GGUS:64826 for changing the VO admin of our FTS servers at CERN. All sites have replied except CERN. Steve reported that CERN had in fact replied, but the reply had not got through the GGUS-Remedy mail interface.

  • T1 site issues:
    • Opened 6 GGUS tickets asking that the new LHCb Data Manager inherit the same capabilities (GGUS:64829 to GGUS:64835).
    • GRIDKA: still some instabilities with the overall storage system (regardless of space token) affecting our reconstruction jobs (GGUS:64772).

Sites / Services round table:

  • RAL: Yesterday's problem with slow transfers was due to a bad transceiver in a network switch, now replaced. RAL has just had a power outage at 13:30 local time and the whole site has been put in downtime.

  • KIT: Had a 1 hour cream-ce maintenance downtime to repair slow performance issues.

  • CERN databases: The ATLAS PVSS2COOL database replication stalling reported on Monday is now thought to be connected with the audit table - possibly an interference between the application and the Oracle internal auditing. Should be able to give more details at the next meeting.

AOB:

Thursday

Attendance: local(Andrea V, Maarten, Jamie, Maria, Steve, Eddie, David, Harry, Alessandro, Andrea S, Roberto, Massimo, Nilo, Marcin, Zbyszek, Stephane, Lola);remote(Jon, Foued, Vladimir, Gareth, Stefano, Joel, Rolf, Ronald, Rob).

Experiments round table:

  • ATLAS reports -
    • LHC/ATLAS activity: no physics foreseen at least till tomorrow. Shift crew reduced (no night shift)
    • T0 & Central Services:
      • DB instances rebooted: atlr4 (05:00) and atlr3 (11:30). Causes are still under investigation. [ Unfortunately we still don't know what is going on - these two reboots are slightly different from the previous two. There is now almost no load on the DB and the reboots still happen. An additional instance is affected: so far ATLAS instance 4, now also instance 3 (for Panda). The symptoms previously considered the root cause now seem not to be - perhaps a consequence. I/O errors were seen, but the root cause is still unknown; each time there are new symptoms. High write-rate spikes just before the reboots could be related to the I/O errors. Maarten - are these machines physically next to each other? Power? A - don't think so. ]
      • CERN-PROD GGUS:64857 problems in accessing DBReleases on HOTDISK.
    • T1:
      • RAL-LCG2 unscheduled downtime due to a power outage, from 2010-12-01 14:30 CET to 11:00 the following day.
    • Network 'degradation': ATLAS is observing poor throughput between BNL and both UKI-SCOTGRID-GLASGOW and UKI-NORTHGRID-LANCS-HEP. Should we submit a GGUS ticket? To whom?

  • CMS reports -
    • Experiment activity
      • Taking data with HI fill (when there). Running happily so far
    • CERN and Tier0
      • No open issue
    • Tier1 issues and plans
      • Not much going on; running backfill at the T1s
      • Weathered the RAL power cut without noticing
    • Tier-2 Issues
      • no noteworthy T2 issues, files get lost here and there, disk pools come and go... usual things. Site admins always very responsive to those problems, thanks.

  • ALICE reports -
    • T0 site
      • voalice16 is already in production mode and performing well (it is now the new xrootd redirector for ALICE)
      • Central AliEn services will be stopped for 1 hour for a DB upgrade. AliEn will be inaccessible in this period; however, already running jobs will not be affected.
    • T1 sites
      • CNAF: GGUS:64827 closed; ce01 is working well. Due to ce01's failure yesterday we added ce07 to LDAP to take over ce01's submissions, and as it is performing quite well we will leave them working together. The CREAM ce01-lcg BDII has some problem and the experts are looking into it.
      • CNAF: a different problem from the one observed yesterday was also seen with ce07. It was a problem with the CREAM AliEn module; the new version, which will be included in AliEn v2.19, was installed to avoid confusing messages in the CE log.
      • FZK: issue with the SE is still there, GGUS:64836 ongoing
    • T2 sites
      • Usual operations

  • LHCb reports - Reprocessing going at full steam. Merging is running smoothly
    • T0
      • Quickly updated the FTS instances to take on board the new Data Manager roles.
      • LHCb_RDST has to be converted to T1D0 (basically: switch ON the garbage collector)
    • T1 site issues:
      • The 6 GGUS tickets opened yesterday asking that the new LHCb Data Manager inherit the same capabilities (GGUS:64829 to GGUS:64835): all sites reacted in less than 24 hours.
      • GRIDKA: the instabilities observed with their dCache system (GGUS:64772) seem to have gone. Check this URL.
      • NIKHEF - using CernVM FS as shared area for all analysis activities and have not seen problems so far. Will discuss later today at T1SCM

Sites / Services round table:

  • FNAL - ntr
  • KIT - ntr
  • RAL - following up on the short power outage: everything came back. 2 disk servers for CMS and 2 for ATLAS are still waiting to be brought back, but otherwise everything is back.
  • IN2P3 - ntr
  • CNAF - ntr
  • NL-T1 - ntr
  • OSG - have talked with Ricardo about BDII and will be reverting to original BDII timeouts on Monday and will be watching closely.

  • CERN storage - this morning there was a problem on ATLAS: jobs failing, traced to access to one file. One server out of 8 was in maintenance. We observe that ATLAS is pushing the system to its maximum; if this is just a spike it is OK, if not more boxes are needed. Reported in a ticket to ATLAS. Ale - ATLAS was running more than 3K parallel analysis jobs. We think the behaviour was good, in the sense that at this level we can accept some degradation due to hardware limitations. CPUs allocated for grid usage at CERN are 3K in total, so we don't think we need to re-discuss this; in this case fair share also comes into play.

  • CERN - small problem between 09:00 and 11:00 this morning: notifications from CERN Remedy did not get back to GGUS. Now OK and the backlog has been cleared. Andrea - some CMS SAM tests were stuck this morning doing voms-proxy-init.

AOB:

Friday

Attendance: local(Eddie, Andrea, Simone, Jamie, Maria, Alessandro, Flavia, Maarten, Jacek, Harry, Roberto, Lola, Ignacio, Ueda);remote(Ulf, Gonzalo, Jon, Xavier, Joel, Onno, Kyle, Rolf, Tiju, Vladimir).

Experiments round table:

  • ATLAS reports -
    • T0 & Central Services:
      • Heavy-ion processing will be done at the Tier-0 for the next few days of data taking, and no longer at the Tier-1s.
      • EOS (LargeScaleTest) endpoint failing 30% of file transfers, GGUS:64939.
    • T1:
      • TAIWAN-LCG2 job failures, GGUS:64920. Due to the LFC path not being properly configured; fixed very quickly by the site.

  • CMS reports -
    • Experiment activity
      • Waiting for next HI fill
    • CERN and Tier0
      • No open issue
    • Tier1 issues and plans
      • Not much going on; running backfill at the T1s
      • T1 KIT has a ticket, GGUS:64918; it sounds like a misunderstanding with our shifter. Stefano will follow up.
    • Tier-2 Issues
      • no noteworthy T2 issues, files get lost here and there, disk pools come and go... usual things. Site admins always very responsive to those problems, thanks.
    • Central services
      • One voms.cern.ch node had a problem causing voms-proxy-init to hang at times - an annoyance for users, but a killer for scripts (SAM test submission), GGUS:64905. Do we need a smarter load balancer? (A hedged sketch of a client-side timeout guard follows below.)
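
A minimal client-side guard for scripts that call voms-proxy-init, sketching the kind of timeout-and-retry protection the SAM submission scripts could use. This is not the SAM framework's actual code; the VO name and timeout values are illustrative.

```python
#!/usr/bin/env python3
# Sketch: kill a hanging voms-proxy-init (e.g. when it lands on a bad
# voms.cern.ch node) and retry, instead of blocking the whole submission.
import subprocess

def get_proxy(vo="cms", attempts=3, timeout_s=60):
    for i in range(attempts):
        try:
            subprocess.run(["voms-proxy-init", "--voms", vo],
                           check=True, timeout=timeout_s)
            return True                      # proxy obtained
        except subprocess.TimeoutExpired:
            print("voms-proxy-init hung (attempt %d), retrying" % (i + 1))
        except subprocess.CalledProcessError as err:
            print("voms-proxy-init failed: %s" % err)
    return False

if __name__ == "__main__":
    if not get_proxy():
        raise SystemExit("could not obtain a VOMS proxy")
```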

  • ALICE reports -
    • T0 site
      • nothing to report
    • T1 sites
      • CNAF: CM connection timeouts observed this morning; a minor operation.
      • IN2P3-CC: the log of the failed copy attempts to alice::ccin2p3::tape is missing; an operation by the ALICE experts is required.
    • T2 sites
      • nothing to report

  • LHCb reports - Reprocessing going at full steam. Merging is running smoothly
    • T0
      • 2 new disk servers added to LHCb_DST
        • A bug was also found in CASTOR. The workaround is to decrease the LSF timeout from 30 minutes to 30 s, so that newly arriving requests are handled in time.
      • LHCb_RDST has to be converted to T1D0 (basically: switch ON the garbage collector) - Done
    • T1 site issues:

Sites / Services round table:

  • NDGF - ntr
  • PIC - yesterday around 22:00 there was an incident with the cooling of a module that affects part of the WN farm. Since then we have been running at 50% capacity; we hope to recover during the afternoon.
  • FNAL - ntr
  • KIT - ntr
  • NL-T1 - ntr
  • IN2P3 - ATLAS problems: we consider that we are back to the nominal state - all services are open. Some things are not completely understood, but discussions are ongoing between the dCache developers and the local experts. A post-mortem will follow once it is understood.
  • RAL - update on the power issue. All services are back except for one disk server for CMS, which will have to be reinstalled. Last night we had a problem with the tape robot which was fixed early this morning; some VOs might see a backlog but the system is back up. The upgrade of the ATLAS CASTOR instance is on Monday - already declared.
  • CNAF - ntr
  • ASGC - ntr
  • OSG - on Monday CERN will be rolling back the BDII timeouts to the previous default state at 15:00 local time. On December 9 GGUS will renew its certificates, so the GOC will be unavailable for a short while whilst the new certs are loaded.

  • CERN CASTOR (CMS) - broadcast about an unscheduled downtime due to a DB index rebuild: "Rebuild of an index in the castorcms stager DB that has been found to be corrupted", 10:15 - 10:16. (The DBAs found the corrupted index and sent a warning to CMS. They say this is due to a known bug in the current Oracle version, which will be fixed in the upgrade to 10.2.0.5 to be done in January.)

  • CERN DBs: the ATLAS offline DB problem is still not understood. Some extra monitoring is being deployed to help understand it.

AOB:

-- JamieShiers - 25-Nov-2010
