Week of 100104

LHC Operations

WLCG Service Incidents, Interventions and Availability

CMS | LHCb | WLCG Service Incident Reports | Broadcast archive |

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

General Information
CERN IT status board | M/W PPSCoordinationWorkLog | WLCG Baseline Versions | Weekly joint operations meeting minutes

Additional Material:

STEP09 | ATLAS | ATLAS logbook | CMS | WLCG Blogs


Monday:

Attendance: local(Olof,Eva,Jean-Philippe,Miguel,MariaG,Jamie,Antonio,Manuel);remote(Michael,Onno,Angela,Gareth,Michel/GRIF,Rolf).

Experiments round table:

  • ATLAS -

  • ALICE (Reported by Patricia before the meeting) - ALICE has been regularly running MC production during the vacation period, as planned before Christmas. The new MC cycle began on 25 December and is still running without interruption. In total ALICE has executed 240K jobs, achieving peaks of over 15K concurrent jobs on several days. In general the behaviour of the sites has been quite good, and the stability of the services at the T0 was also remarkable. The WMS at CERN have also been able to serve external sites, including some T1 sites, with no incidents to report. In addition, ALICE has finished blacklisting the sites that are not running SL4 on their WNs and/or VOBOXes. In total around 7 sites have been blacklisted, although for some of them the site admins have confirmed that the systems will be upgraded by the end of this week. Finally, a reminder of the list of pending actions sent before Christmas to the PX support list concerning new VOBOX registrations in the list of trusted nodes of the MyProxy server. Currently the following requests are in status REGISTERED:
UNIBA: deprecation of one VOBOX and inclusion of a new one; Cagliari-INFN: registration of 2 VOBOXes; Birmingham: registration of 1 VOBOX; ITEP: registration of 1 VOBOX; CIEMAT: registration of 1 VOBOX; Trieste: registration of 1 VOBOX; SINP: registration of 1 VOBOX (request sent during the vacation period); Bologna-T2: registration of 1 VOBOX (request sent during the vacation period); Prague: registration of 1 VOBOX (request sent today).

Sites / Services round table:

  • Michael (BNL): smooth running
  • Onno (NLT1): two issues at NIKHEF: a controller burned out in the new setup (waiting for an assessment from the vendor; a small amount of ATLAS data is not accessible), and Torque instability (trying to tweak parameters to improve stability). SARA will be unavailable for most or all of next week.
  • Angela (KIT): the ATLAS SRM had to be restarted; the cause is not yet clear. FTS was down for a few hours because of a full disk (quickly fixed).
  • Michel (GRIF): NTR
  • Gareth (RAL): DB errors: a BIGID problem for CASTOR; is it the same problem as a few months ago? FS probe problem on one LHCb disk server: will contact LHCb to get the file checksums to validate the server. Outage tomorrow as previously announced (testing the UPS bypass).
  • Rolf (IN2P3): dCache upgrade today: should be back in one hour.

  • Miguel (CERN): CASTOR upgrade on Wednesday for ATLAS.
AOB:

Tuesday:

Attendance: local(Miguel,Ueda,Eva,Jean-Philippe,Jamie,Roberto,MariaG,Simone,Alessandro,Antonio);remote(Nicolo,Michael,Angela,Gareth,Alessandro/INFN,Ronald,Jeremy,Rolf).

Experiments round table:

  • ATLAS (Ueda)- Transfer timeouts between PIC and SARA (both directions): a ticket has been opened for PIC. Transfer problem between the Weizmann T2 and SARA (port not opened). Credential errors at SARA: probably fixed by the upgrade this morning.

  • CMS reports (Nicolo)- The Tier0 is shut down; an upgrade is planned before data taking. A few job failures at T1s before Christmas (being looked at). Millions of MC jobs at T2s running ok. Not much activity during the Christmas break, but smooth operation. ASGC: installation of new tape drives, 2 tickets closed, performing final consistency checks. IN2P3: the retransfer of datasets removed by mistake got stuck before Christmas because of a tape staging issue at FNAL, fixed yesterday; job submission failures for the CREAM-CE at IN2P3. PIC: transfer failures from Lisbon because of a problem with the proxy renewal configuration. CNAF: a few problems in transfers, which resolved themselves. Problems in the deployment of CMS software at T2s because of missing dependencies.

  • ALICE (Reported by Patricia before the meeting) - No urgent issues since yesterday apart from the registration of the new VOBOXes (reported yesterday) in the list of trusted nodes of the MyProxy server. Production through the corresponding sites is stopped until the nodes are fully registered in myproxy.cern.ch. Concerning the sites, an issue with the local CREAM-CE system at JINR was reported yesterday and has required the assistance of the developers: the local system cannot accept jobs when the corresponding JDL includes an InputSandbox; the system reports a globus-gridftp error (a minimal example of such a JDL is sketched after the experiment reports below). The solution will be reported to the whole ALICE Task Force list for the benefit of the rest of the sites. Small issues with the Hungarian and Polish T2 sites were reported this morning and have already been solved. Production is ongoing smoothly with peaks of over 18K concurrent jobs.

  • LHCb reports (Roberto)- Not much activity and the system ran almost unattended. MC application errors will be fixed in the next LHCb application release. No reprocessing/reconstruction during the break. 100k user jobs, with a few failures in application code and because of data unavailability. A few tickets are still open for RAL because of file unavailability (RAL has received the list of checksums from LHCb to validate the server). Shared area instability at T2s.
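To make the JINR issue above concrete, the following is a minimal, purely illustrative sketch of a CREAM JDL that declares an InputSandbox (whose files are staged to the CE, which is where a gridftp error can surface), written out from Python. The file names and the CE/queue in the final hint are hypothetical placeholders, not the actual JINR or AliEn configuration.

```python
# Illustrative only: write a minimal CREAM JDL that declares an InputSandbox,
# the kind of JDL reported to fail at JINR with a globus-gridftp error.
# All names below (files, CE endpoint, queue) are hypothetical placeholders.
from pathlib import Path

jdl_text = """[
  Type = "Job";
  Executable = "agent.sh";
  InputSandbox = {"agent.sh", "agent.cfg"};
  StdOutput = "std.out";
  StdError = "std.err";
  OutputSandbox = {"std.out", "std.err"};
]
"""

Path("agent-with-sandbox.jdl").write_text(jdl_text)
print("Wrote agent-with-sandbox.jdl; it could then be submitted with, e.g.:")
print("  glite-ce-job-submit -a -r <cream-host>:8443/cream-pbs-<queue> agent-with-sandbox.jdl")
```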

Sites / Services round table:

  • Michael: smooth operation. SLAC in downtime because of ongoing work on a storage server. Planning a LAN upgrade at BNL on Thursday to connect the storage servers to a new switch: there could be some instability for one hour with a few failures, but no outage will be declared in GOCDB; an entry will be put in OIM and in the eLog. Also adding 700TB to the disk storage configuration (transparent).
  • Angela: HW problem on GPFS filesystem serving Atlas pool. Waiting for engineer. Should be fixed soon.
  • Gareth: the test of bypassing UPS went well this morning. All is ok.
  • Alessandro/INFN: CREAM-CE now running well. Tomorrow is a holiday in Italy, but urgent tickets will be handled.
  • Ronald: NTR
  • Jeremy: NTR
  • Rolf: unscheduled outage yesterday 5PM-10PM because of an overload of the local batch system. This was triggered by a user error: the user generated 24M entries in the DB instead of 2M; the cancel did not work and the user resubmitted several times with the same error, producing 24M entries each time. No VO could submit jobs. Jobs already submitted were largely unaffected.

  • Miguel: Atlas CASTOR upgrade tomorrow as planned.

AOB:

Wednesday

Attendance: local(Miguel, Jean-Philippe, Andrew, Eva, Edoardo, Harry, Alessandro, Antonio, Roberto, Jamie, Simone);remote(Michael, Gareth, Onno, Rolf).

Experiments round table:

  • ATLAS (Ueda)- reminder of FTS problem at CERN during the Christmas break ("could not load client credentials"): FTS support looking at it. CASTOR upgrade done this morning as planned: transfers resumed ok but not all servers in ATLDATA are back in prod yet after the upgrade to SL5. There are also some XROOT Manager problems (probably unrelated).

  • ALICE (Reported by Patricia before the meeting) - MC production ongoing with no major incidents and new peaks of over 19K concurrent jobs. The requirement mentioned yesterday concerning new VOBOX registrations in the list of trusted nodes of the MyProxy server at CERN has already been fulfilled; thanks to PX support for the prompt action. The site admins of the corresponding sites will be informed, and the sites will be included in production as soon as the VOBOX service has been checked and validated at all of them.

  • LHCb reports (Roberto)- Only user activity is going on, at a very low rate. No complaints.

Sites / Services round table:

  • Michael/BNL: reminder of the planned intervention tomorrow. An announcement has been made in OIM. The intervention will start at 10AM (Eastern time) and will last about 5 hours.
  • Gareth/RAL: FSprobe problem: the checksum validation has not been completed yet but the ticket has been updated with the list of files.
  • Onno/NLT1: two issues with the SARA SRM. The first is the transfer problem with PIC: PIC had increased its MTU value a week ago, and SARA does not observe errors when the transfer goes to a SARA node with a low MTU value, so this is probably the cause. The second problem concerns transfers with the Weizmann Institute in Israel (ticket 54416): the gridFTP control channel is ok, but the data channel times out. The firewall at SARA seems to be ok, and transfers between CERN and Weizmann are ok. Still investigating. The credential problems at SARA have been fixed by installing dCache 1.9.5-11.
  • Rolf/IN2P3: for the CREAM-CE problem reported yesterday, IN2P3 would like CMS to submit a ticket.

  • Edoardo/CERN: new router installed for LCG: so far so good, but please monitor failures and report problems. Tomorrow a transparent recabling of some routers will take place.
  • Antonio: gLite update: staged rollout of gLite 3.1 Update 60 and gLite 3.2 Update 7. It is a consistent release. Sites having some nodes on SL4 and some nodes on SL5 should upgrade all of them at the same time because of incompatibilities. For gLite 3.1: a new version of the WMS and a fix in CREAM for the BLAH vulnerability. For gLite 3.2: introduction of gLExec, CREAM and SCAS.
  • Eva: BIGID problem reported yesterday: it is not the same problem as the one reported and fixed a few months ago. The problem is being investigated.

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local(Miguel, Jean-Philippe, Simone, Ueda, Harry, Alessandro, Roberto, Andrea, Gavin, Manuel, Jamie, MariaG, Antonio);remote(Angela/KIT, Gonzalo/PIC, Tristan/NLT1, Michael/BNL, Gareth/RAL, Alessandro/INFN, Josep/PIC, Jason/ASCC).

Experiments round table:

  • ATLAS (Alessandro)- The dashboard is not visible from outside; the problem has been reported. Several interventions: BNL, NDGF (dCache upgrade), INFN-T1 (CASTOR SRM upgrade to 2.8.5, some problems after the upgrade). TRIUMF issue yesterday because of too frequent srmLs requests: the polling interval was changed from 1 to 600 seconds. The fix could also be applied to the BNL Site Services; Alessandro will send the details to Michael.

  • CMS reports (Nicolo)- No particular issue. CREAM-CE problem at IN2P3: SAM tests are now ok; waiting for production to be ok before closing the ticket. Cleanup at ASGC ongoing; some files are stuck in migration; some files are not registered correctly: waiting for CMS to check. Transfers to RAL failing: probably a problem in the CMS catalogue. T2 migration to SL5: most tickets closed, still 5 sites to upgrade.

  • ALICE (Reported by Patricia before the meeting) - MC production ongoing with no major issues to report (around 18K concurrent jobs maintained since the last report). A new AliEn implementation is currently being tested at CNAF, affecting submission to the CREAM-CE service: ALICE has implemented the CLI submission mode in the AliEn environment, but this submission mode allows the declaration of only a single queue. ALICE has therefore implemented the possibility to choose, for each agent submission and for each site, a local queue taken at random from a list of local queues defined in the central services (see the sketch after the experiment reports below). Once this implementation has been tested, it will be distributed to the rest of the sites in a way that is transparent for them. Concerning T2 sites, two sites that were blacklisted on the 3rd of January (SINP in Russia and IRES in France) are expected to be put back in production soon, as the site admins have confirmed this week the migration of the remaining WNs and/or VOBOXes to SL5.

  • LHCb reports (Roberto)- 4 different MC productions are running, with a few hundred thousand events requested per production. No major issue. User analysis: no major issue. At RAL another disk server needs to be checked. CNAF plans to move LHCb data from CASTOR to GPFS and TSM, as has already been done for CMS. About 6TB in total need to be copied to the new system, hopefully before the LHC restart.
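The queue-selection change described in the ALICE report above (choosing, per agent submission, a random local queue from a centrally defined list and submitting through the CREAM CLI) could look roughly like the sketch below. The CE host, queue names and JDL file are hypothetical placeholders, and the glite-ce-job-submit call is only indicative of the CLI submission mode, not of the actual AliEn code.

```python
# Rough sketch of the idea in the ALICE report: for each agent submission,
# pick one local queue at random from a centrally defined list and submit
# to it with the CREAM CLI. Host, queues and JDL file are placeholders;
# this is not the actual AliEn implementation.
import random
import subprocess

CE_HOST = "cream-ce.example.infn.it:8443"   # placeholder CREAM endpoint
LOCAL_QUEUES = ["alice", "alicesgm"]        # placeholder list kept in the central services

def submit_agent(jdl_file: str) -> None:
    queue = random.choice(LOCAL_QUEUES)     # one random queue per agent submission
    ce_id = f"{CE_HOST}/cream-lsf-{queue}"  # <host>:<port>/cream-<lrms>-<queue>
    # '-a' requests automatic proxy delegation; '-r' names the target CE/queue.
    cmd = ["glite-ce-job-submit", "-a", "-r", ce_id, jdl_file]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)

if __name__ == "__main__":
    submit_agent("job-agent.jdl")
```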

Sites / Services round table:

  • Angela/KIT: atlas disk pool has been fixed.
  • Gonzalo/PIC: The transfer problems between SARA and PIC could be due to the enabling of jumbo frames; they have been disabled. SARA/CERN/PIC experts are in contact to solve this problem in a more permanent way.
  • Tristan/NLT1: The problem could be similar. By decreasing the MTU to 8800, transfers are ok. What is the maximum MTU size to be supported in WLCG, and what is the policy on accepting ICMP packets? Could the network experts at CERN provide information on these two points, please?
  • Michael/BNL: Transparent upgrade will start in one hour and will last around 5 hours. No outage. Plan to move production to new file server appliance next Tuesday. Will need to drain the queues on Monday evening. The actual intervention will take place on Tuesday morning and will last one hour. Production should be restarted by Tuesday 11AM Eastern Time.
  • Gareth/RAL: RAL is at risk today because of an issue with one air-conditioning chiller and one pump. Some files at RAL are not accessible because a server is in draining mode.
  • Brian/RAL: FTS uses srmLs before srmPrepareToGet. There seems to be a problem when files have to be prestaged or copied from one pool to another. A ticket should be submitted so that FTS support can investigate.
  • Alessandro/INFN-T1: SRM upgrade this morning. The upgrade was ok, but there were a few hiccups because of incorrect Quattor templates. Quickly fixed (40 minutes of instability).

AOB:

Friday

Attendance: local(Jean-Philippe, Miguel, Alessandro, MariaG, Jamie, Simone, Roberto, Harry);remote(Nicolo, Angela, Jeremy, Michael, Onno, Jason, Massimo, Luca, Rolf, Gareth, Brian).

Experiments round table:

  • ATLAS (Alessandro)- A burst of errors at CERN because of CERN network instabilities (ticket 54549). Only 23 out of 123000 reconstruction jobs failed: very good. Simone wants to stress the criticality of the corrupted FTS proxy problem at CERN (bug 60261). As the problem is only seen at CERN, ATLAS would like to know whether FTS 2.2 can safely be deployed at other sites.

  • CMS reports (Nicolo)- 115 files have been in migration status at ASGC for a very long time. As they will probably stay in that state, CMS will delete the files and retransfer them. One of the datasets deleted by mistake at a T1 was also deleted at the originating T2 and is therefore permanently lost. The problem of the file not accessible at RAL has been fixed by invalidating the entry in the CMS catalogue; the problem could be a leftover from a CASTOR problem last year. Bulk submission to the CREAM CE has been restarted; no report yet on its stability.

  • ALICE (Reported by Patricia before the meeting) - MC production ongoing with more than 16K jobs running continuously. A new CREAM-CE system was provided last night for ALICE production at JINR; the CREAM-CE and the 2nd VOBOX were validated and put into production last night. A 2nd SL5 VOBOX has been defined at CNAF, replacing the old VOBOX. Bologna-T2 has already upgraded its VOBOX and will be put back in production soon (the operation is currently being performed by the site admin).

  • LHCb reports (Roberto)- MC production completed: no major problem. Currently about 1000 user jobs. WMS problems at CERN and T1s: tickets have been submitted. The GridKa problem has been understood; the problem at PIC and SARA remains. The problem is due to ICE. A patch has been successfully installed at CERN, so PIC and SARA should upgrade.

Sites / Services round table:

  • Angela/KIT: NTR
  • Jeremy/GRIDPP: NTR
  • Michael/BNL: yesterday's intervention went well. Michael wants to stress that the strategy they chose (upgrade without stopping the service) worked very well, even though at some point 30% of the storage and of the networking was not available.
  • Onno/NLT1: The transfer problems between PIC and SARA and also between Weizmann and SARA are definitely related to the MTU value. Between PIC and SARA the LHC OPN is used: any MTU value lower than 8996 is ok, MTU=8996 blocks, and higher values are not accepted by some of the routers on the OPN. Between Weizmann and SARA the public network is used rather than the LHC OPN, so ICMP probably cannot be used. Is MTU path discovery supported? What about "packetization layer path MTU discovery" (RFC 4821)? Could a CERN expert come to the meeting on Monday to discuss these issues, please? (A probing sketch is given after the round table below.)
  • Jason/ASGC: NTR
  • Massimo and Luca/CNAF: LSF congestion, studying the problem. After LSF master reboot, the situation is better. Investigating with Platform the root cause of the problem.
  • Rolf/IN2P3: NTR
  • Gareth/RAL: outage on CASTOR ALICE
  • Brian/RAL: transfer problems to BNL are due to a server in draining mode and the fact that RAL is still using CASTOR 2.1.7 (whose drain functionality is sub-optimal). The problem is alleviated by manually re-prioritizing the requests for datasets being requested for transfer. As about 500000 files remain to be drained and RAL only drains during working hours (to be able to react to problems), it will still take 2 to 3 weeks to complete the operation.

  • Miguel/CERN: CASTOR was affected (as were many other services) by the network instabilities at CERN around 11:00 and 15:00. The problems are located in the main core routers and are probably a consequence of the intervention around 09:00.
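As a follow-up to the MTU discussion above, the sketch below shows one common way to estimate the usable path MTU between two hosts, by sending non-fragmentable ICMP echoes of varying size with Linux iputils ping ("-M do" prohibits fragmentation). The target host is a placeholder, the probed range brackets the standard 1500-byte and jumbo-frame 9000-byte cases discussed above, and the method assumes ICMP is allowed end to end, which, as noted, may not hold on the public path towards Weizmann.

```python
# Illustrative sketch only: find the largest ICMP payload that crosses the
# path unfragmented, using Linux iputils "ping -M do". The target host and
# probed range are placeholders. Path MTU = best payload + 28 bytes
# (20-byte IPv4 header + 8-byte ICMP header).
import subprocess

TARGET = "srm.example-t1.org"   # placeholder peer, e.g. a remote pool node

def payload_fits(size: int) -> bool:
    """True if a single non-fragmentable ping with this payload gets a reply."""
    cmd = ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(size), TARGET]
    return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode == 0

def probe_path_mtu(lo: int = 1472, hi: int = 8972) -> int:
    """Binary-search the largest payload in [lo, hi] that fits; return the MTU."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if payload_fits(mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best + 28 if best else 0

if __name__ == "__main__":
    print("estimated path MTU towards", TARGET, ":", probe_path_mtu())
```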

AOB:

-- JamieShiers - 17-Dec-2009
