Week of 100118

LHC Operations

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:



Attendance: local(Jamie, Maria, Harry(chair), Nicolo, Andrew, Jean-Philippe, David, Alessandro, Jaroslava, Gavin, Ignacio, Ricardo, Simone, Patricia, Eva, Roberto, Julia, Dirk);remote(Gonzalo(PIC), Angela(KIT), Michael(BNL), Onno(NL-T1), Gang+Jason(ASGC), Pepe(PIC+CMS), Daniele(CMS), Mark(IN2P3), Gareth(RAL)).

Experiments round table:

  • ATLAS reports - 1) Problems with the NDGF FTS service (out of space in /var). A quick intervention from the T1 alleviated the immediate problem, but the cause is still under investigation. 2) The Tier-2s of Napoli, Lisbon and QMC-London were down over the weekend.

  • CMS reports - 1) Pending jobs revealed that some hosts in the CMS T0 farm apparently never run CMS jobs; this was already seen in December, but at that time the unused hosts were in maintenance. About 500 job slots are affected, so this is being followed up. 2) A new round of backfill jobs was submitted to test the CREAM CEs; the issues at CCIN2P3 and CNAF reported last week were fixed. 3) A debugging library was deployed at FNAL to track a rare error in opening files on dCache. 4) SAM CE test failures at ASGC this morning after a power surge.

  • ALICE reports - 1) Four pending requests for myproxy registration of new Tier-2 VO-boxes have now been completed. 2) Production is ongoing at all T1 sites with no incidents to report (scheduled downtime at SARA). 3) The high overload of both ALICE VOBOXES at IN2P3 reported on Friday was solved on Saturday morning. The CREAM-VOBOX entered production at that moment; no interruptions of the services were observed and good performance was seen.

  • LHCb reports - 1) There are 4 new production requests, all concerning full reconstruction of COLLISION09 data. Only user jobs ran over the weekend and there are now fewer than 1000 jobs in the system. 2) Issue with SRM unit tests failing at some sites; working on a solution. 3) Waiting for news of a second box at CERN behind the LFC-RO instance. 4) Intervention to migrate CASTOR to TSM at INFN-T1 (6TB in total, ~2K files). They are all ready to proceed; the StoRM and CASTOR endpoints at CNAF have been banned so the intervention can proceed. 5) Angela reported from KIT on the CREAM-CE mapping problem seen by LHCb last week: there were wildcard entries (*) in the mapping configuration that CREAM cannot work with. These are now removed and LHCb will test again.

Sites / Services round table:

  • PIC: Reminder that there will be a cooling intervention tomorrow with batch capacity reduced down to about 400 slots.

  • KIT: A sudden increase of ATLAS transfers to Wuppertal filled the FTS disks to over 50% with logs, which had to be cleaned - a cron job has since been added, but this activity was not expected. Simone reported this was FTS traffic to a group file space where the issue was not bandwidth but the number of files, about 70K, and he suggested retuning the FTS settings. It is not clear this would help, since the issue was log files filling space.
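The kind of cron-driven cleanup KIT added can be sketched roughly as below. This is a hypothetical illustration only: the real FTS log directory layout, file naming and retention policy are assumptions, not details from the meeting.

```python
import os
import time

def purge_old_logs(log_dir, max_age_days=7):
    """Delete plain files in log_dir older than max_age_days (a guess at
    a sensible retention window); return how many files were removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        # Only touch regular files whose last modification predates the cutoff.
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed += 1
    return removed
```

Run daily from cron against the transfer-log directory, such a job keeps the disk from silently filling when an unexpected burst of transfers arrives.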

  • BNL: Had a network problem on Sunday when a 10 Gbit Ethernet line card on a core router failed and had to be replaced. Some jobs failed and were put by PanDA into the "holding" state; they will eventually fail with "lost heartbeat".

  • NL-T1: Nikhef is draining batch queues for a torque and bdii upgrade tomorrow afternoon. At SARA all nodes have been rebooted following the scheduled kernel upgrades. Roberto queried SAM test failures at SARA for LHCb today and Onno reported that the SRM had not been restarted properly but is now fixed.

  • ASGC: They suffered a power surge at about 07:25 UTC which took down most core services. Most were recovered within 30 minutes, but the critical Oracle services took 2 hours. Job scheduling is back to normal but the CASTOR service and SRM are not yet restored - this is expected to take another hour.

  • RAL: Last Friday's FTS problem was resolved by the end of the afternoon, though there was another problem for 10 minutes this morning when a database node behind the service was rebooting. There are at-risk interventions scheduled this week: CASTOR SRM updates tomorrow and Wednesday, and a memory upgrade to the Oracle RAC nodes behind CASTOR.

  • IN2P3: Currently have a problem of worker nodes being stressed by jobs to the point of being rebooted. The suspicion is on ALICE jobs, which have hence been blocked in batch for now. There were CREAM-CE problems at the weekend which may be connected with this. There was also an SRM failure at the weekend where the service was down for 17 hours, leading to SAM test failures; a script is being prepared to automate recovery.

  • CERN Databases: The latest Oracle security patches are being deployed on the integration services.

  • CERN dashboards: The poor performance of the CMS dashboard has finally been resolved; it was due to memory caching. The memory has been doubled and performance has improved threefold.

  • CERN networking: Traffic of 2.5 Gbit/s from Uni Geneva overloaded the CERN firewall and had to be stopped. The suggestion is for them to register to use the firewall bypass. Simone reported this was due to a single user downloading a dataset that was only at Geneva. It should have been done via a subscription, letting FTS move the data, rather than by direct copy, but such actions at individual Tier-2 sites are unpredictable and might happen at any time. Eduardo pointed out that Geneva is special in having a 10 Gbit link, but it was agreed this issue needs following up.

  • CERN FTS: The issue with corrupted FTS delegated credentials is now understood, thanks to FNAL. It turned out that the longstanding race-condition bug was erroneously reintroduced in a build of FTS 2.2 by including an old dependency. A new version, also including a fix for the bug of agents crashing, will be released as soon as possible.

  • CERN CASTOR: The castort3 (CMS and ATLAS analysis stager) was upgraded this morning during a 4 hour scheduled downtime.

AOB: (MariaD) On GGUS-OSG issues: The following is taken from last Monday's 2010-01-10 wlcg notes: 'Could OSG please update https://gus.fzk.de/ws/ticket_info.php?ticket=54538 (urgent). Michael clarifies: Harvard and Boston together form one T2 center. MariaD will discuss ticket routing for this case offline with Kyle and Michael.' Today this ticket is still urgent and untouched. Comment by Guenter: if there is still no corresponding OSG ticket, someone should do the following:

- assign the ticket to GGUS
- assign the ticket to OSG again.

Michael (BNL) reported that the problem had long been resolved and what remained was to solve the consistency issue of Harvard and Boston being essentially a single Tier-2. Should be solved in the next few days.

The conclusion of the monthly ALARM test discussion is that 2010-02-03 pm will be the first testing date, right after the January GGUS release. Doc updated: https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#Periodic_ALARM_ticket_testing_ru Details: https://savannah.cern.ch/support/?111475#comment13


Attendance: local(Jaroslava, Harry(chair), Nicolo, AndreaV, Roberto, Jean-Philippe, Eva, Patricia, Lola, Alessandro, AndreaS, Jamie, Julia, TimurS);remote(Gonzalo(PIC), Angela(KIT), Jens(NDGF), John(RAL), Ronald(NL-T1), Pepe(PIC&CMS), Jeremy(GridPP), Gang, Jason).

Experiments round table:

  • ATLAS reports - 1) SARA/NIKHEF back from downtime yesterday before noon. FTS did not come back but was fixed a few hours afterward. NL cloud back online from this morning. 2) Wuppertal SE was down in the morning so they restarted their SRM door. However the FTS channel had meanwhile been set offline by FZK to reduce errors and had to be reopened to restart data transfers. 3) Jens queried the reason for the large amount of transfers seen at NDGF over the weekend and Alessandro reported it as user activity that must be expected from now on.

  • CMS reports - Daily report: 1) IN2P3 suffering in CREAM CE tests from a dCache issue in stage-out since last week. According to the local contact, the issue was identified, but the final fix is not available and a workaround will be put in place soon. 2) KIT experiencing 'Error 73 - Failed writing to read-only file system.' in CREAM CE tests. 3) Dashboard reporting failures in CREAM CE tests at CNAF; no detailed logs available in the CMS submission framework yet. Also problems staging in from tape, delaying data export from CNAF. 4) The FNAL error reported yesterday in opening files on dCache is possibly related to too many open file descriptors. 5) The SAM CE test failures reported yesterday at ASGC are fixed, but there are now SRM test failures. 6) Julia reported that mismatches between CMS running jobs and the number reported by the dashboard are due to many jobs staying in the WMS-LB system for hours. Weekly planning: 1) Tier-0 running workflow and release tests. 2) Tier-1 running backfill testing jobs at all T1s (except ASGC). 3) Tier-2 to finish the two running MadGraph workflows; might receive new requests this week. 4) Following up the SL5 WN migration and tape recycling/repack at the ASGC Tier-1, to bring the site back to operations and be ready for the 2010 run. 5) Following up CREAM CE tests and failures observed at several Tier-1s. 6) Several Tier-2 sites still have only SL4 worker nodes, and several more have SL5 WNs but no SL5 builds due to various problems; tickets are open for them and progressing. 7) CMS first mid-week global run of 2010 (MWGR) to occur at the end of next week (27/28-Jan).

  • ALICE reports - The ALICE contact person at IN2P3 has been contacted in order to collect more information about possible ALICE jobs blocking the local WNs with an excess of I/O operations. From the ALICE side, the experiment did not change the production type over the weekend and the only additional element is indeed the submission through CREAM. IN2P3 will set up a specially monitored subcluster and redirect ALICE jobs there.

  • LHCb reports - 1) Ongoing activities in certifying the new DIRAC production system (based on SVN) on the new h/w delivered for the central boxes. 2) The CERN volhcb13 server got its dirac partition full and the job logging info database was then corrupted, causing some user problems in recovering outputs. 3) GRIDKA confirmed that the CREAM CEs are now mapping sgm users correctly. 4) At CNAF StoRM is being upgraded to a recent version to support TSM. 5) SRM SAM jobs were failing at RAL. This seems related to the new code of the LHCb unit test, which is now being debugged; the critical unit test has currently been rolled back to the old stable code. 6) There is another failed disk server at RAL in the lhcbDst space token. John reported a disk server containing 3200 files had temporarily been taken out of production yesterday for a memory check. He reported there was another LHCb disk server giving fsprobe errors that was being worked on, but Roberto confirmed this was an older problem. 7) Jean-Philippe had followed up on the LHCb request for another LFC server behind the CERN read-only instance and asked them to submit a ticket - earliest delivery will be next week. A Remedy ticket was opened on Monday (CT654872).

Sites / Services round table:

  • NDGF: Close to the end of an FTS update. Also added a disk to cope with future numbers of FTS log files, following the large number (1 million logs from 600000 transfers) seen over the weekend.
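NDGF's numbers imply roughly 1.67 log files per transfer (retries and multi-step transfers can produce more than one log each). A back-of-envelope sizing like the following helps decide how much disk to add; the average log size and retention window below are purely illustrative assumptions, not figures from the meeting.

```python
# Back-of-envelope sizing for FTS log storage, prompted by NDGF's
# ~1,000,000 log files from ~600,000 transfers over one weekend.
def log_disk_estimate_gb(transfers, logs_per_transfer, avg_log_kb, retention_periods):
    """Estimate the disk (in GB) needed to retain logs for a given number
    of busy periods of the stated size. avg_log_kb is an assumed average."""
    files = transfers * logs_per_transfer
    return files * avg_log_kb * retention_periods / (1024 * 1024)

# Observed ratio from the NDGF weekend: ~1.67 logs per transfer.
ratio = 1_000_000 / 600_000

# E.g. with an assumed 10 KB average log and 4 weekends of retention:
estimate = log_disk_estimate_gb(600_000, ratio, avg_log_kb=10, retention_periods=4)
```

With those (assumed) inputs the estimate comes out at roughly 38 GB, i.e. a single modest disk covers a month of such weekends, which is consistent with NDGF adding one disk.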

  • NL-T1: Restarting after the SARA maintenance led to two problems - the reboot of the ATLAS LFC got stuck and the FTS version was accidentally upgraded to version 2.2 but has now been returned to 2.1. At NIKHEF the bdii migration to a 64-bit platform is completed as are the kernel upgrades and the torque upgrade is ongoing.

  • GridPP: The London Royal Holloway Tier-2 (ATLAS and CMS) will be in extended downtime for the next 4 weeks.

  • BNL (email report): Unable to attend due to overlapping meeting but no particular issues.

  • ASGC: Following yesterday's power surge we found the SRM transfer efficiency reduced to only 60-70%. This affects ATLAS production transfers, and CMS also observed SRM transfer instability today. Since yesterday we have been able to fix the following: direct file I/O now completes promptly; the scheduling problem was fixed after restarting the rm master, and the job manager after force-killing the IPC processes; the full functionality of the stager and the rh service is confirmed. We found that the SRM had sent requests that were received by the rhserver but never picked up by the stager. The statistics were recomputed yesterday. The stage timeout has now been reduced to 90s, while the SRM timeout for file transfers remains at 180s.



Attendance: local(Jaroslava, Harry(chair), Nicolo, Jamie, MariaG, Jean-Philippe, TimurS, Simone, Dirk, Steve, Denise, Eduardo, Patricia, Roberto, MariaD, Miguel);remote(Angela(KIT), Gonzalo(PIC), Ron(NL-T1), Gang+Jason(ASGC), John(RAL), Rolf(IN2P3), Jens(NDGF)).

Experiments round table:

  • ATLAS reports - Nothing to report today.

  • CMS reports - 1) CERN CASTORCMS DEFAULT was degraded last night 23:00-01:00. 2) Investigating failures in CREAM CE tests at IN2P3. 3) KIT: '426 Transfer aborted' error in FNAL-->IN2P3 transfers. 4) ASGC: SRM issue fixed. 5) Software deployment on SLC5 starting on T2_UK_London_IC. 6) Still 3 T2s with open tickets. 7) T2_EE_Estonia lost its NFS software area, also reflected in SAM tests.

  • ALICE reports - 1) Change of general production cycle, so an unstable job profile should be expected for today. 2) Ticket sent to PX support from Birmingham by mistake (CT655327); please skip it. 3) Ticket sent to VOBOX support: CT655195 - ssh access to the ALICE VOBOXES is closed. The issue might come from a change in the default behavior of the ncm-useraccess component which was announced in December and went into production with the last scheduled upgrades (last week?). Operational use of the machines is ensured (via gsissh), but any sudo operation is denied. Following up with the experts. 4) Following the CCIN2P3 issue reported since Monday, the site admin locked every aligrid job. At about 16:00 the jobs were unlocked in a controlled way by lowering the BQS limit during the observation period (until today), i.e. setting the limit to 300 slots. Once the slots were opened, ALICE closed submission to the local CREAM-CE system to ensure a single submission procedure, therefore using the WMS submission mode only. We have asked the site admin whether the problems observed on Monday are still visible and whether we can reopen submission to CREAM. Rolf later reported that the analysis of the problematic ALICE jobs is ongoing, but it could be just a coincidence as regards ALICE.

  • LHCb reports - 1) The LFC 3D replication out of CERN had some problems yesterday (trapped by our SAM and SLS tests and confirmed by Eva, who received an alarm). At 16:47 we got an alarm from our monitoring system: the capture process was delayed for 90 minutes. There was a problem receiving the archive log files from the source database on the downstream system, and a GGUS team ticket was submitted. 2) LHCb provisionally agree on the intervention for the CERN CASTORLHCb upgrade to 2.1.9 on the 26th. 3) CNAF: StoRM successfully upgraded (1.4 --> 1.5) and SAM confirmed it is back to life. Data migrated to TSM; now changing the service configuration and adding these replicas in the LFC. The TxD1 endpoints have been reintegrated.

Sites / Services round table:

  • PIC: 1) ATLAS have submitted a ticket on transfers from NDGF to PIC failing with timeout errors. We have asked NDGF if they have changed something, as PIC have not, but this problem may have existed for many days and is perhaps related to the earlier trial of using jumbo frames. 2) For several days LHCb jobs have been running at very low CPU efficiency, seemingly waiting for something. Roberto thought these must be individual user jobs and will check.

  • NL-T1: 1) NIKHEF have increased the MTU on their router to the LHCOPN from 1518 to 9018. 2) NIKHEF also have an issue with their MAUI client, which cannot talk to the MAUI server after an upgrade - being worked on. 3) At SARA one pool node went down when both disks of a RAID-1 array failed at the same time, though with no data loss. It is being reinstalled and should come back later today.

  • ASGC (email report): The root cause of the CASTOR problems was bad service assignment on the backend DB cluster. The services should be split across three nodes of the CASTOR DB cluster, but after the power cycle all services fell back onto the same node, so we observed an extreme load situation on that particular node. We have forcibly altered the distribution and had the other services take the free node. Furthermore, after the power cycle a non-SMP kernel was booted, which limited the performance of the current dual-core instances of the Oracle cluster. We have gradually rebooted all 5 nodes of the two clusters, which improved the overall situation a lot. A simple service rearrangement 20 minutes ago also helped, improving transfer efficiency up to 83%, and the SRM and stager services are being restarted right after the rearrangement of the other services. Transfers are expected to stabilize further after the change.

Release report: deployment status wiki page

AOB: 1) (MariaD) USAG tomorrow Thursday 2010-01-21 at 9:30am in room 28-R-014 and by EVO. Two major developments to be discussed - see Agenda http://indico.cern.ch/conferenceDisplay.py?confId=81363. 2) (MariaG) The first of a new series of WLCG weekly service coordination meetings will be held tomorrow from 16.00. The mandate and agenda are at http://indico.cern.ch/categoryDisplay.py?categId=2726


Attendance: local();remote().

Experiments round table:

  • ALICE reports (reported by Patricia before the meeting) - The problem with access to the ALICE VOBOXES at the T0 is solved. Once changes to the VOBOX templates are included, it takes a while for them to be fully distributed to all nodes. Now both the sudo privileges and access to the machines are working fine; Ricardo has been informed and the ticket can be closed. Clermont and Strasbourg today mentioned bad results visible in the MonALISA page regarding the publication of the user proxy status. The issue comes from the script responsible for testing this proxy: differences between the libraries used by this script (AliEn environment) and the libraries of the VOBOX service itself created these bad (and fake) results. Solved at both sites; propagation to the rest of the sites will be discussed today during the ALICE TF meeting.

Sites / Services round table:



Attendance: local();remote().

Experiments round table:

Sites / Services round table:


Topic revision: r13 - 2010-01-21 - PatriciaMendezLorenzo