Week of 101018

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Harry(chair), Jan, Yuri, Graeme, Carlos, Ulrich, Alessandro, Patricia, Lola, MariaD);remote(Michael(BNL), Barbara(CNAF), Jon(FNAL), Gonzalo(PIC), Xavier(KIT), Rolf(IN2P3), Marie-Christine(CMS), Kyle(OSG), Gang(ASGC), Tiju(RAL), Onno(NL-T1), Roberto(LHCb)).

Experiments round table:

  • ATLAS

  • T0
    • CERN: ssh access to the lxvoadm machine failed (BUG:117331). A request was sent to vobox.support on Oct. 16 at ~7pm. (This was due to a partial network switch failure.)
    • The ATLAS production dashboard issue was still present on Oct. 18 at 00:20; experts informed (BUG:73904). Scheduled downtime on Oct. 19 (2h) for an Oracle DB upgrade.

  • T1s
    • Taiwan-LCG2: transfer errors on SCRATCHDISK (Elog 18322); the site was informed and is working on it. SE unavailable on Oct. 16 at 11:20; GGUS:63170 solved.
    • BNL: transfer failures ("failed to contact on remote SRM"); GGUS:63164 verified: the SRM and pnfs services were restarted on Oct. 16 at 02:40. File transfer failures from DATADISK with HTTP_TIMEOUT to the SRM; GGUS:63177 solved on Oct. 17 at 03:41.
    • NDGF: DATADISK transfer failures related to GRIDFTP_ERROR. GGUS:63179 assigned on Oct. 17 at 07:30; no reply yet.
    • INFN-T1: MCDISK transfer failures with "Source file/user checksum mismatch" (GGUS:63175). (A checksum-verification sketch follows this report.)
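
Several of the tickets this week concern mismatches between a file's stored checksum and the value recorded in the catalogue. As a rough illustration only (not any experiment's actual tooling), the Python sketch below recomputes the adler32 checksum commonly used in WLCG data management and compares it with a catalogue value; the file path and catalogue checksum in the usage comment are hypothetical.

    import zlib

    def adler32_of(path, chunk_size=1 << 20):
        """Recompute the adler32 checksum of a local file, as an 8-digit hex string."""
        value = 1  # adler32 starts from 1
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xFFFFFFFF)

    def matches_catalogue(path, catalogue_checksum):
        """Compare the recomputed checksum with the value recorded in the catalogue."""
        return adler32_of(path) == catalogue_checksum.lower().zfill(8)

    # Hypothetical usage: matches_catalogue("/pool/data/file.root", "ad1e32cc")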

  • CMS

  • Experiment activity
    • Preparing for the last fill before the technical stop; a high rate (800-1000 kHz) is expected on Stream A when data taking resumes

  • CERN and Tier0
    • The T0 PromptReco injector crashed on Saturday night due to memory exhaustion; the job was relaunched
    • GGUS:62696: occasional errors when opening files (via xrootd) on the Tier-0 processing pools; on hold
    • CASTOR intervention for the heavy-ion upgrade planned for tomorrow, starting at 09:00

  • Tier1 issues
    • Reprocessing at ASGC still ongoing with glitches - talking to site.

  • Tier2 Issues

  • MC production
    • large production ongoing

  • ALICE

  • T0 site
    • Good behavior of the services in general during the last weekend
    • voalice10 (xrootd) showed some error messages last night. The issues were again associated with the firewall (reported last Friday). Problem solved this morning. An iptables configuration will be prepared in the quattor profile of this machine (a sketch follows at the end of this report).

  • T1 sites
    • CNAF: AliEn user proxy expiration during the weekend. Services restarted this morning
    • Good behavior of the T1 sites in general

  • T2 sites
    • Usual operations concerning GRIF and UNAM in particular. In contact with the sites.
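
Regarding the voalice10 firewall issue above: the actual quattor profile change is not described in these minutes, so the following is a minimal sketch only of the kind of host-local iptables rules one might persist for an xrootd VO box. Port 1094 is the xrootd default; the port and the rule set are assumptions.

    import subprocess

    # Hypothetical host-local rules (not the real quattor profile): keep established
    # connections working and accept xrootd traffic on its default port.
    RULES = [
        ["iptables", "-A", "INPUT", "-m", "state", "--state", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
        ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", "1094", "-j", "ACCEPT"],
    ]

    def apply_rules():
        for rule in RULES:
            subprocess.check_call(rule)  # requires root privileges

    if __name__ == "__main__":
        apply_rules()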

  • LHCb

  • Experiment activities: Mostly user jobs in the morning; reconstruction of the new data taken yesterday was not problematic.

  • T0
    • The data consistency suite running at CERN is catching files with incorrectly computed checksums (due to a bug also affecting the RAL installation). The same suite also reports files damaged due to a problem during upload; the reason for the latter still has to be investigated.
    • LHCb Service Status Board: the feeders have to be restarted

  • T1 site issues:
    • CNAF: a user reported jobs with problems accessing data. A couple of WNs were not properly configured (GPFS not mounted properly)
    • CNAF: many jobs stalled: an LSF configuration issue
    • IN2P3: received the list of files on the faulty disk server. The data manager removed them at the catalogue level (GGUS:63024)
    • RAL: half of the jobs were stalling (and eventually killed by either the DIRAC watchdog or the LRMS for exceeding the wall-clock time). Investigation concluded this was a DIRAC issue rather than a site issue; a new DIRAC patch will be put into production.

Sites / Services round table:

  • BNL: Have defined a maintenance window tomorrow for rearranging power distribution to about 20% of worker nodes. They will be drained overnight then off for about 8 hours.

  • CNAF: For the LHCb LSF issue a bug in the LSF license server is suspected. They are in touch with LSF support.

  • PIC: On Saturday morning there was an accidental cooling stoppage affecting about 60% of the worker nodes, where running batch jobs had to be killed. Full capacity was restored after 2-3 hours.

  • IN2P3: The LHCb disk server failure ticket is being followed up.

  • ASGC: There are 3 files CMS jobs cannot access. Looks like an inconsistency in CASTOR - being investigated.

  • CERN CC: Annual power tests are scheduled for 1 November. They should be transparent but all grid services have been put at-risk for that day.

  • CERN CASTOR: ALICE and CMS CASTOR upgrades tomorrow from 09:00. Restrictions on ATLAS users mounting tapes in stagepublic have been put in place. Following the rescheduling of the LHC technical stop, any possible heavy-ion high-rate data-taking tests with ALICE and CMS will have to be rescheduled.

AOB: The latest LHC schedule is to be found at https://espace.cern.ch/be-dep/BEDepartmentalDocuments/BE/2010-LHC-schedule_v1.9.pdf indicating a technical stop from Tuesday 19 October to Friday 22 October followed by about two weeks of proton physics leading to Heavy Ion setting up starting Friday 5 November.

Tuesday:

Attendance: local(Harry(chair), Edward, Maarten, Alessandro, Jan, Ignacio, Lola, MariaD);remote(Michael(BNL), Gonzalo(PIC), Roberto(LHCb), Jon(FNAL), Marie-Christine(CMS), Ronald(NL-T1), Tore(NDGF), Tiju(RAL), Kyle(OSG), Graeme(ATLAS), Dimitri(KIT)).

Experiments round table:

  • Ongoing issues
    • CNAF-BNL network problem (slow transfers): GGUS:61440, GGUS:63134. Failures are due to transfer timeouts for large files (>2GB, often ~4GB).

  • ATLAS
    • Magnets off for maintenance and shifts reduced during LHC technical stop.

  • T0
    • Bulk reconstruction of weekend data is finishing, so data export to T1s proceeds as normal.
    • Access permissions on the CERN ATLAS Twiki have changed so only members of ATLAS can read - being investigated.

  • T1
    • IN2P3-CC report recovery from "locality is unavailable" problems. GGUS:62783, GGUS:63180.
    • Update from Hiro on CNAF-BNL network problem. GGUS:61440. Situation is improving though not quite solved.
    • INFN-T1 checksum problems on one file - no response from site. GGUS:63184. Suspicion is of file corruption during transfer from worker node to storage.

  • Middleware
    • There is a serious, but easy to fix, bug in the gLite 3.2 WN build that affects T2s using SuSE as their base release. This includes some important ATLAS T2 sites (MPP and LRZ) so ATLAS would like the issue escalated. GGUS:61106. (Note the original ticket dates from August).

  • CMS

  • Experiment activity
    • Technical stop

  • Tier1 issues
    • Various re-reco certifications ongoing
    • Problem with the site BDII at PIC early this morning: the site had disappeared from the dashboard. Fixed (Savannah SR #117347: SAM tests visibly in error for ~3 hours)

  • Tier2 Issues
    • Nothing to report

  • MC production
    • large production ongoing

  • AOB
    • HI Tests likely to be Thursday, with ALICE
    • New CRC: Stefano Belforte, Oct 20-25

  • ALICE

  • T0 site
    • Small operations on voalice13 to start up the proper PackMan service. Production ramping up
    • No issues with voalice10 today (reported yesterday). SE@CERN performing well.

  • T1 sites
    • All SEs at the T1 sites reporting fine in MonALISA; no issues for the workload management system

  • T2 sites
    • UNAM and GRIF issues reported yesterday. Both sites were in downtime yesterday
    • Following up today on the operations at Bologna-T2, Legnaro and Clermont

  • LHCb

  • Experiment activities:
    • No data received last night. Reconstruction running to completion. Some delay related to the problem at IN2P3 (RAW data not available because of the disk server) and the problem at RAL (stalling jobs)

  • T0
    • The CASTOR test suite run has trapped two data inconsistencies on CASTOR - probably bad uploads of MC files from worker nodes.
    • LHCb SSB: the feeders have to be restarted

  • T1 site issues:
    • IN2P3: shared-area issue preventing installation of the latest version of the LHCb application (GGUS:63234). This problem with the shared area at Lyon must be escalated.
    • IN2P3: LHCb want a clear estimate of when the disk server will be back in service, in order to decide how to handle the few remaining data to reconstruct.
    • RAL: the problem of stalling jobs was due to a bug introduced in DIRAC that broke the estimation of the remaining CPU time

Sites / Services round table:

  • PIC: The site BDII was not publishing any data overnight for about 8 hours; fixed by reconfiguring. This does not seem to have caused major problems for the experiments. The experiment SAM SRM tests did not show any failures while the ops ones did. Alessandro stated that within the SAM framework, tests to a site with no BDII visible are not published. This should be followed up.

  • FNAL: Network connection between FNAL and KIT was down for 3 hours last night.

  • RAL: One disk server for LHCb is down.

  • KIT: Have rescheduled the downtime of a CREAM CE and an LCG CE from this Thursday to next Thursday, due to a cluster split.

  • CERN CASTOR
    • Have upgraded the ALICE and CMS CASTOR instances to version 2.1.9 in preparation for the HI run.
    • A 24 hour test of recording ALICE and CMS simulated raw data to tape (no data export) at 2 GB/sec each is now scheduled from 14.00 on Thursday 21 October.

  • AOB: Maarten reported that ATLAS tests at RAL of the gLite 3.2 FTS have found a critical bug: on delegation, proxies of an unsupported version are created. Sites are advised to stay with the working 3.1 version while this is fixed. The problem lies in the web-service component and not the agent.

AOB:

Wednesday

Attendance: local(Renato, Harry(chair), Graeme, Patricia, Maarten, Gavin, Edward, Ulrich, Jan, Ignacio, Vawid, MariaD, Alessandro, Roberto);remote(Michael(BNL), Gonzalo(PIC), Jon(FNAL), Rolf(IN2P3), Gang(ASGC), Tiju(RAL), Kyle(OSG), Stefano(CMS), Onno(NL-T1), Jens(NDGF), Luca(CNAF), Dimitri(KIT)).

Experiments round table:

  • ATLAS
    • No data taking during LHC technical stop.

  • T1
    • The IN2P3-CC locality problem is not completely solved (GGUS:62782). We continue to report files we see failing, but could the site please check from their side so that we don't have to tediously report transfer by transfer.
    • INFN-T1 corrupted file, site has confirmed this is the only copy and the checksum is bad - it will be purged. GGUS:63184.
    • TAIWAN-LCG2 errors on transfers to BNL ("source file failed on the SRM with error [SRM_FAILURE]"). GGUS:63286.
    • RAL-LCG2 disk server down; reported by the site to be for a memory change.
    • RAL-LCG2 SRM went down for ~1 hour. GGUS:63291. Now solved.

  • Middleware
    • FTS on gLite3.2 testing in the UK was stopped after progress in understanding the proxy delegation problem.

  • Central Services
    • DDM functional tests did not run for ~24 hours due to a bad service restart - now fixed.
    • Twiki access restrictions were put in place yesterday which, by default, limit ATLAS twiki access to ATLAS authors only. We are relaxing these restrictions on a page-by-page basis, but please let us know if something you think you should be able to see is inaccessible (email atlas-adc-expert@cern.ch).

  • CMS

  • Experiment activity
    • Technical stop

  • CERN and Tier0
    • Opened GGUS:63246 for BDII problems

  • Tier1 issues
    • Various re-reco certifications ongoing
    • The problem with the site BDII at PIC yesterday was fixed (SAV:117347)
    • CMS T1_FR_IN2P3 is experiencing low transfer rates on some disk pools (under control)

  • Tier2 Issues
    • Nothing to report

  • MC production
    • large production ongoing

  • AOB
    • HI tests likely Thursday 10:00 to Friday 10:00, with ALICE. The CASTOR team had this scheduled for 14:00 to 14:00, but CMS need to stop earlier. To be confirmed with ALICE when they can start.
    • CRC: Stefano Belforte, Oct 20-26

  • ALICE

  • GENERAL INFORMATION:
    • Pass 0 and Pass 1 reconstruction activities ongoing, together with two MC cycles. Good status of all T1 (and T0) SEs in MonALISA

  • T0 site
    • GGUS:63282: all CREAM CEs at CERN were failing this morning. (This was due to transient LSF master blockages.)

  • T1 sites
    • All T1 sites in production, no remarkable issues

  • T2 sites
    • Usual operations applied to several T2 sites with no remarkable issues

  • LHCb

  • Experiment activities:
    • Impressive number of MC jobs (65K run in the last 24 hours) with a very small failure rate (~1%) (see the plots in the report linked above)

  • T0
    • isscvs.cern.ch is no longer accessible from outside CERN (CT719539). The issue was immediately handled and the problem understood to be related to the CERN firewall setup for the CVS servers.

  • T1 site issues:
    • GridKa: observed instabilities on the SRM endpoint (SAM results and real activities perfectly correlated in time) (GGUS:63253)
    • RAL: one disk server of the lhcbMdst service class was unavailable yesterday (GGUS:63230)
    • IN2P3: any news from Sun (GGUS:63024)? Have since received a report that Sun had left and that the service is expected back tomorrow morning (some 34000 files are concerned).

Sites / Services round table:

  • BNL: Michael reported a new RHEL 5 root exploit in the GNU dynamic linker under certain setuid configurations (probably present on many hosts). The CVE number is CVE-2010-3847, and BNL have already deployed a workaround that can be applied during normal operations with no reboot. Maarten reported that this exploit is already under discussion by the EGI security teams and may lead to some temporary loss of capacity as sites decide how to respond.

  • PIC: had another overnight service incident when 4 of their 5 dcap doors went down at the same time around midnight causing many running jobs to fail. The doors have been restarted and the cause is under investigation.

  • NL-T1: They also have a workaround in place for the root exploit reported by BNL.

  • NDGF: Will be patching some pools in the next few days for the root exploit. They are also aware of a new, easily exploited kernel root vulnerability (CVE-2010-3904), for which they are preparing kernel patches (to blacklist one kernel module; a sketch of such a blacklist follows below).
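
As a minimal sketch of the kind of module blacklisting mentioned above (not NDGF's actual procedure): CVE-2010-3904 concerns the RDS protocol module, so "rds" is used here as an assumed module name.

    def blacklist_module(name, conf_dir="/etc/modprobe.d"):
        """Write a modprobe override so that any attempt to load the module becomes a no-op."""
        conf_path = "%s/disable-%s.conf" % (conf_dir, name)
        with open(conf_path, "w") as conf:
            conf.write("install %s /bin/true\n" % name)
        return conf_path

    if __name__ == "__main__":
        # "rds" is an assumption based on the CVE; writing under /etc requires root.
        print(blacklist_module("rds"))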

  • CNAF: Have put an at-risk downtime for a storage upgrade - should be transparent. The open ggus ticket from ATLAS yesterday (for a corrupted file) did not get into the Italian ROC ticketing system so this needs to be checked.

  • CERN Batch: A fix for the above problem will be deployed today, together with a patch for the CMS Maradona problem. Working on both of the root exploits reported above.

  • CERN CASTOR: Preparing for the heavy-ion running test tomorrow. Both the ALICE and CMS raw-data pools will have 40 tape drives associated with them to guarantee their rates of 2 GB/sec to tape.

  • CERN SRM: SLS monitoring of the CERN SRM has shown various unavailabilities that are thought to be a monitoring-only issue. It affected c2public, which is used by all the SRM tests.

  • CERN databases: There was an intervention today on the ATLAS online cluster with a switch replacement. Maria reported that she has been getting Oracle errors from the CERN CA for many hours, but nothing is shown on SLS. The ticket she submitted will be followed up.

AOB:

Thursday

Attendance: local();remote().

Experiments round table:

  • ATLAS
    • Some cosmics runs, but no export to T1s.

  • T1
    • IN2P3-CC locality problem still not completely solved, GGUS:62782.
    • TAIWAN-LCG2 errors on transfers to BNL solved, disk server overload. GGUS:63286.
    • TAIWAN-LCG2 jobs failing due to a software-area overload last night. The site reduced the number of job slots for ATLAS. Recovered with reduced capacity. BUG:74033.
    • RAL-LCG2 SRM problems yesterday - report from site that this was being caused by an ATLAS user pulling data from RAL into an NDGF cluster. User was suitably chastised and investigation ongoing.

  • Middleware
    • FTS on gLite3.2 - some UK sites were not accepting ATLAS VOMS credentials issued by the BNL VOMS server. Sites have been alerted.
      • Reminder to all sites to ensure that all 3 ATLAS VOMS servers are properly configured (a configuration-check sketch follows below).
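
As a rough illustration of such a configuration check (not an official ATLAS or gLite tool), the sketch below lists the VOMS endpoints configured for the atlas VO by scanning a vomses directory. It assumes /etc/vomses is a directory of vomses files (it can also be a single file), and no endpoint names are hardcoded since they are not given in these minutes; a site would verify that all three ATLAS VOMS servers appear in the output.

    import os
    import shlex

    def atlas_vomses_entries(vomses_dir="/etc/vomses"):
        """Return (host, port) pairs configured for the atlas VO in a vomses directory."""
        entries = []
        for name in sorted(os.listdir(vomses_dir)):
            path = os.path.join(vomses_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path) as f:
                for line in f:
                    # vomses line format: "alias" "host" "port" "server DN" "vo"
                    fields = shlex.split(line, comments=True)
                    if len(fields) >= 5 and fields[4] == "atlas":
                        entries.append((fields[1], fields[2]))
        return entries

    if __name__ == "__main__":
        for host, port in atlas_vomses_entries():
            print("%s:%s" % (host, port))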

  • CMS

  • Experiment activity
    • Technical stop

  • CERN and Tier0
    • HI test in progress. No information yet.

  • Tier1 issues
    • Reprocessing ongoing, no issues

  • Tier2 Issues
    • large production ongoing, no issues
    • User analysis reached 100K jobs/day averaged over last week

  • MC production
    • large production ongoing

  • AOB
    • CRC: Stefano Belforte, Oct 20-26

  • ALICE

  • GENERAL INFORMATION:
    • Same reconstruction and MC activities ongoing with no major issues to report.

  • T0 site
    • CREAM CEs all back in production after the issue reported yesterday. A transparent patch was expected today; no issues found

  • T1 sites
    • All T1 sites in production

  • T2 sites
    • no remarkable issues to report

  • LHCb

  • Experiment activities:
    • No MC, no reconstruction; validation of the new reprocessing ongoing. No major issues

  • T0
    • none

  • T1 site issues:

Sites / Services round table:

AOB: (MariaDZ) A large number of new GGUS Support Units come to life with the major GGUS release 8.0 on 2010/10/27. They are listed on 2 slides at https://gus.fzk.de/pages/news_detail.php?ID=420, linked from the GGUS home page under Latest News.

Friday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

-- JamieShiers - 16-Oct-2010
