Week of 100412

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: LHC Machine Information, CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation, Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(IUeda, Jean-Philippe, Nilo, Jan, Lola, David, Dirk, Andrea, Nicolo, Alessandro, Simone, Maarten, Maria, Flavia);remote(Michael/BNL, Gonzalo/PIC, Jon/FNAL, Rolf/IN2P3, Rob/OSG, Gareth/RAL, Gang/ASGC, Angela/KIT, Ron/NLT1, Luca/CNAF).

Experiments round table:

  • ATLAS reports -
    • ALARM for CERN DB (GGUS:57169) created 2010-04-10 04:31 UTC, still "assigned" as of 2010-04-11 19:45 CET
    • ATLAS online to offline Oracle Streams blocked since 2010-04-09 19:20 CET; the ATLAS DBA called the CERN DBA by 21:21 CET. With no solution by the next morning, the ATLAS Tier-0 expert submitted an alarm ticket around 2010-04-10 11:00, followed by a reboot of ATLAS online DB instance 1. ~15 hours of backlog (consumed by 2010-04-10 16:30 CET)
    • NOTE: the restart of the ATONR instance affected ATLAS data taking; some coordination is needed in future. Question: no update in the GGUS ticket -- is GGUS the proper way to report DB issues at CERN?
    • CERN - GGUS:57187 - temporary failure in FTS "failed to contact on remote SRM [httpg://srm-atlas.cern.ch:8443/srm/managerv2].". The same error message as in GGUS:57057 caused by a BDII problem.
    • CERN - GGUS:57204 - "Connection refused" for lxfsrb6003, lxfsrb5803, lxfsrb4809, etc. - possibly due to load, but no success with retries after several hours.
    • KIT - 2k jobs failed at FZK-LCG2 (GGUS:57167) -- three servers of the software cluster lost the ATLAS directory
    • PIC - transfer failures (GGUS:57166, GGUS:57177): fixed or being handled
    • BNL - GGUS:57183 - FTS failure in PIC-BNL channel [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space left on device]. Is it an FTS problem as the SE has plenty of space?
    • IN2P3-CC - GGUS:57190 - many job failures (>1000) with "lcg_cr: Connection timed out" also SAM tests failures - SRM server restarted. Ok now.

  • CMS reports -
    • T0 Highlights: One CMSR Oracle node crashed on Friday, several CMS DB services affected Remedy #675438.
    • T1 Highlights: Monte Carlo reprocessing running at all T1s.
      • PIC: major issues. Jobs initially failing with staging permission issues, fixed. Then jobs crashing with "# 8004 - std::bad_alloc exception (memory exhaustion)", GGUS:57173.
      • FNAL: unmerged area full, cleaned up.
      • CNAF: ~600 jobs stuck in "Submitted" state on WMS, Savannah #113773.
      • KIT: transfers failing due to proxy expiration (GGUS:57178), fixed. Also job abort errors on CE.
      • IN2P3: SRM issues in SAM tests and transfers, fixed before ticket submission.
      • Comment from Gonzalo: could CMS put the full Savannah URL in GGUS tickets?
    • T2 highlights: MC production running. SAM CE errors on T2_IN_TIFR, fixed restarting gridftp server on CE. SAM SRM errors on T2_IT_Rome, fixed restarting dCache.

  • ALICE reports -
    • GENERAL INFORMATION: successful reconstruction operations during the last night (Pass 0 reconstruction together with calibration reconstruction activities). Special emphasis on the analysis train activities, with 4 different cycles currently running. For the moment all MC cycles are completed and there are no new simulation activities running in the system.
    • T0 site: Performing Pass0 reconstruction activities with no incidents to report
    • T1 sites: Several instabilities might be expected today at FZK due to the reconfiguration of the site after the upgrade to AliEn v2.18 (activity ongoing, with testing expected during the afternoon). In addition, a 3rd VOBOX has been provided at the site for the testing and development of the glexec infrastructure for ALICE. Configuration of this new VOBOX will be performed this afternoon.
    • T2 sites: Configuration problems at the VOBOX in Wuhan (user proxy renewal mechanism not working). Working together with the site admin to solve the problem.

Sites / Services round table:

  • BNL: FTS issues for transfers between PIC and BNL. ATLT2 issues being worked on.
  • PIC: do other T1s enabling gsidcap see the same problems? Comment from Ron: the problem affects CMS, ATLAS and LHCb (conflict in libraries), but could be fixed with an LD_PRELOAD (see recipe by Ron under AOB). Comment from Jon: gsidcap is not allowed at FNAL. Comment from Gonzalo: requirements from the different experiments are not the same, so it is a problem for shared instances; gsidcap is needed for tape protection and for ACLs to protect production areas from users.
  • FNAL: 4 hour robot outage on Saturday; unmerged dCache section 100% full - CMS data operators notified & backfill jobs were aborted. Reminder - CSP alerts for idle data are not due to FNAL
  • IN2P3: tape library outage tomorrow as announced before; there should be enough disk space to store new files, but existing files that are only tape-resident will be inaccessible. Work on the extension of the building will start this week, so some risk cannot be excluded. There should not be problems with big files in Lyon as the FTS is set up correctly. ATLAS files normally never exceed 6-7 GB.
  • OSG: there was a recovery exercise last Saturday. It was not as transparent as expected, but there was no activity. The ticketing interface was down from 09:00-13:00 (US time). Learning from this test, the procedures will be improved before doing another test.
  • RAL: problem today just before lunch with the DBs used by LFC and FTS. FTS is just back now and LFC service is being restored.
  • ASGC: NTR
  • KIT: Shared software area problems: the fileservers used for LHCb and Atlas got unstable; 2 servers were added and the system should be stable again. Some Worker Nodes in bad shape (some daemons taking all memory, although the swap space is ok). An automatic script has been put in place to detect this condition and take the machine out of production.
  • NLT1: see discussion about gsidcap
  • CNAF: CMS problems during the weekend to access storage as the bandwidth was saturated by too many concurrent jobs. Being discussed with CMS. CMS should limit the number of concurrent jobs while waiting for a configuration upgrade.

  • CERN Central services:
    • Ian: transparent upgrade of CASTOR public to 2.1.9-5
    • David: next Wednesday at 09:30, the LCG routers will be replaced but this should be transparent as there are 4 servers available.
    • DB: the 4th CMS offline DB server rebooted with a kernel panic last Friday (solved this morning). An Oracle bug affected the transfers between online and offline DBs (solved last Saturday). Again Oracle server kernel panic this morning (solved by 10:00). The Atlas DB backlog was cleaned at 14:00 today.

AOB: (MariaDZ) ATLAS ALARM ticket GGUS:57169 has been waiting for CERN DB supporters' input since April 10th. A reminder was sent by M. Barroso this morning to the physics DB e-group. Ticket GGUS:57206 was wrongly assigned to ALICE. User Romain Reuillon is trying to use ALICE resources but couldn't be found in any LHC VO (?!)

From Ron: At SARA LHCb and ATLAS independently reported issues regarding data access through gsidcap. The error message was something like:

Domain[ROOT_All] Info gfal:gsidcap://bee34.grid.sara.nl:22128//pnfs/grid.sara.nl/data/atlas/atlashotdisk/cond09_data/000009/gen/cond09_data.000009.gen.COND/cond09_data.000009.gen.COND._0002.pool.root

Error ( POLLIN POLLERR POLLHUP) (with data) on control line [83]

According to Rod Walker from ATLAS there is a conflict between gsidcap (through gfal) and the oracle libs. Probably a function with the same name. There is also a savannah ticket on this issue: https://savannah.cern.ch/bugs/?65620.

Rod says that the issue could be fixed through:

export LD_PRELOAD=$LCG_LOCATION/'$LIB'/libgfal.so

This can be put in

$VO_ATLAS_SW_DIR/local/setup.sh.local

It means this lib is loaded for all dynamically linked executables, but it should be harmless.
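Put together, Rod's workaround amounts to appending that one export line to the site-local ATLAS setup script. A minimal shell sketch, assuming the standard grid environment variables are defined on the worker node (the /tmp fallback below is illustrative only, not part of the actual recipe):

```shell
# Hypothetical paths: $VO_ATLAS_SW_DIR and $LCG_LOCATION are normally defined
# on a grid worker node; the /tmp fallback is only so this sketch runs
# stand-alone.
SETUP_LOCAL="${VO_ATLAS_SW_DIR:-/tmp/atlas_sw}/local/setup.sh.local"
mkdir -p "$(dirname "$SETUP_LOCAL")"

# The heredoc delimiter is quoted so neither variable is expanded here:
# $LCG_LOCATION should expand when the setup file is sourced on the WN,
# while '$LIB' must stay literal for the dynamic linker (ld.so substitutes
# lib or lib64 for it at load time).
cat >> "$SETUP_LOCAL" <<'EOF'
export LD_PRELOAD=$LCG_LOCATION/'$LIB'/libgfal.so
EOF

echo "appended preload line to $SETUP_LOCAL"
```

Note that the single quotes around `$LIB` matter: they keep the token from being expanded when the setup file is sourced, so ld.so sees it and picks the right library directory per architecture.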

Comment from Gonzalo: However, it looks to me that the issue reported by CMS accessing files through gsidcap at PIC (in the savannah below) is not quite the same. The error messages quoted are somewhat different at least:

https://savannah.cern.ch/support/index.php?113774

In this ticket one of the CMS experts says "I know how to hack the environment that allows CMSSW to run with GSIDCAP. The recipe can be provided"... so, we can wait and see if this recipe to patch the CMS-gsidcap problem has something to do with the ATLAS one, or we are here collecting gsidcap patches of different colours.

Tuesday:

Attendance: local(Ricardo, Jean-Philippe,Maarten, Harry, Maria, Lola, Ignacio, Nicolo, Andrea, Simone);remote(Jon/FNAL, Michael/BNL, Gonzalo/PIC, Tore/NDGF, Angela/KIT, Rolf/IN2P3, Gareth/RAL, Ronald/NLT1, Gang/ASGC, Luca/CNAF, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • SAM looping jobs: a very high load on the ATLAS DDM central catalog has been observed in the last week. The problem comes from a new SAM test that in pathological cases would start looping and spawn tens of processes aggressively accessing the central catalog. The problem is being cured, but there are still many jobs running as sgm on several long queues. Sites will be contacted for some help in killing those jobs. A different deployment scenario for the Central Catalogs is also being discussed (to prevent "user" activity affecting central activities)
    • No problem observed this morning with the Lyon tape buffer during the intervention (did not get full, which was one worry expressed yesterday).
    • gsidcap/gfal/oracle: long thread going on. The workaround involving the preloader is not working (there are side effects which have to do with the ATLAS sw setup). From Gonzalo: using dcap + tape protection would mean that asking via dcap for a file on tape would fail; this is OK for ATLAS, since files on tape are always prestaged via SRM before access, and if for some reason the file is not on disk, the dccp command is welcome to fail. So using dcap instead of gsidcap is a possible workaround for PIC for the moment. But is using dcap (instead of gsidcap) a good option for other sites (access to data at SARA from NIKHEF WNs is an example)? A proper fix needs to be found.

  • CMS reports -
    • T1 Highlights: Monte Carlo reprocessing running at all T1s. PIC - issues opening files with gsidcap, GGUS #57173. Site reverted to dcap and removed tape staging protection, files pre-staged. FNAL - job failures with file access errors, also correlated to low quality in T0-->FNAL transfers - caused by heavy processing load? KIT - jobs now running OK - still unstable in CE SAM tests however. CNAF - ~600 jobs stuck in "Submitted" state on WMS, Savannah #113773
    • T2 highlights: MC production running. SAM SRM errors on T2_BE_IIHE, overloaded with remote user stageouts. SAM CE and SRM errors on T2_FI_HIP, ongoing network issues. SAM CE on T2_EE_Estonia, blackhole worker node, fixed. Unscheduled extension of downtime for SE upgrade at T2_CN_Beijing.

    • weekly plan:
      • Tier-0: data processing
      • Tier-1: run redigi/rereco 35x at all sites, prompt-skimming
      • Tier-2: MC productions for low filter eff. workflows, pre-production.
      • Moving JobRobot to new CMSSW version. Plan to stop automatic migrations of datasets to CAF DBS, restricting CAF DBS to CAF-only data (express streams, user-produced datasets). Next Thursday, T2 Support Meeting. T2s invited to join: T2_EE_Estonia, T2_FR_IPHC, T2_IN_TIFR, T2_RU_IHEP, T2_RU_PNPI, T2_RU_SINP, T2_UK_London_IC, T2_UK_Sgrid_Bristol.

  • ALICE reports -
    • GENERAL INFORMATION: 2 new MC cycles started yesterday night, together with Pass0 reconstruction at the T0 and 4 analysis trains.
    • IMPORTANT NEWS: ALL sites have been migrated to AliEn v2.18. All site admins and regional experts were informed yesterday afternoon and are encouraged to check the good behavior of the local AliEn services at the VOBOXes
    • T0 site: Over 1400 concurrent jobs currently running in the system, mostly through the CREAM-CE resources. No issues reported by the experiment
    • T1 sites: All T1 sites will be put in failover mode by today; the action will be transparent for the site and will not stop any production or activity at the site
    • T2 sites: Configuration problems reported yesterday with the VOBOX in Wuhan have been solved. Now the local CREAM-CE is suffering from some authentication problems which prevent the setup of the site in the production partition of ALICE.

Sites / Services round table:

  • FNAL: Backlog in writing files to tape due to enstore bug, too many requests overloaded dCache for 2 hours but it recovered on its own
  • BNL: NTR
  • PIC: dcap has always been available at PIC. CMS jobs must be authorized to stage files from tape, so need to disable tape protection for all VOs. An LHCb problem will be discussed offline.
  • NDGF: NTR
  • KIT: still problems with some Worker Nodes. It seems that part of the /proc filesystem is not accessible. Trying to put in place a script to detect this condition.
  • IN2P3: NTR
  • RAL: disk server belonging to ATLASMCDISK had a memory problem. This was fixed and the filesystems are being fsck'ed.
  • NLT1: dcap versus gsidcap is being discussed between SARA and NIKHEF.
  • ASGC: CASTOR firmware upgrade being done. Should be completed in the next 2 hours.
  • CNAF: NTR
  • OSG: NTR

  • CERN central services: NTR
  • AOB: the user mentioned in the AOB yesterday has nothing to do with the ALICE problem. The ticket has been assigned back to ALICE.

AOB:

Wednesday

Attendance: local(Ricardo, Jean-Philippe, Ignacio, Lola, Eva, Nilo, Simone, Alessandro, Maarten, Nicolo, Harry, Maria);remote(Gonzalo/PIC, Ron/NLT1, Tore/NDGF, Jon/FNAL, Michael/BNL, Greig/LHCb, Rob/OSG, Angela/KIT, Gang/ASGC, Tiju/RAL, Luca/CNAF).

Experiments round table:

  • ATLAS reports -
    • SAM looping jobs are being killed. O(20) GGUS tickets were sent yesterday and most of them have been answered. Load on the Central Catalog went down by a factor of 10. Yesterday at approx 19:00 CEST the machine at the CERN Computer Center running the runQuery service went down (possibly a hardware failure). The service is being reinstalled on a different machine.

  • CMS reports -
    • T0 Highlights: CASTORCMS DEFAULT degraded in SLS, due to 6k user requests
    • T1 Highlights: Monte Carlo reprocessing running at all T1s.
      • 70 jobs stuck on CREAM CE at RAL - CNAF WMS admins were asked to kill them. Elog #1746 - 3 jobs failing persistently, input files corrupted, invalidating Savannah #113844.
      • KIT batch system instabilities still causing intermittent failures in CE SAM tests Savannah #113824.
      • ASGC Intermittent issue with Maradona errors in CE SAM and JobRobot tests, Savannah #113850. SRM issues in SAM tests and transfers. Report from site contact: "Problem due to oracle DB archive log has exceeded limit size again, we are cleaning some log files now." (now fixed).
    • T2 highlights: MC production running. Jobs failing in France region T2s, probably due to issue with input generator file. SAM SRM errors on T2_BE_IIHE, overloaded with remote user stageouts.
    • Other: Hardware issues with Dashboard collector - results not available for 6 hours last night, now being reimported.

  • ALICE reports -
    • GENERAL INFORMATION: Finishing the 2 MC cycles started up at the beginning of the week. In addition reconstruction and train analysis tasks ongoing
    • T0 site: ALICE services restarted this morning at voalice13.cern.ch to restart the agents submission through the WMS service
    • T1 sites: FZK and RAL have been reconfigured to use both available voboxes in failover mode. No incidents observed at the site after this operation. CNAF and CCIN2P3 voboxes will be set this afternoon
    • T2 sites:
      • Bari-T2: CREAM-CE system providing timeout messages at submission time. site admin contacted yesterday evening
      • Troitsk: local resource BDII of the CREAM-CE not reachable from the VOBOX (although working properly from lxplus for example). A possible communication issue between the VOBOX and the local CREAM-CE could be the reason of the problem. Site admin contacted

  • LHCb reports -
    • Reconstruction at T1s: no problem
    • data access problem (RFIO) to CASTOR at CERN. Not yet understood. Ticket 57243 has been updated with filenames and timestamps.
    • FZK NFS upgrade ok (site unbanned)
    • SARA dcap door opened
    • PIC: eLog entry, no GGUS ticket because it was a transient problem.

Sites / Services round table:

  • PIC: CMS and gsidcap: currently CMS has to use dcap and the tape protection should be removed. However this is not a transparent intervention. Nicolo will check with CMS if it is urgent. In case it is, the intervention needs to be scheduled with all VOs. Currently CMS uses pre-staging to overcome the problems.
  • NLT1: NTR
  • NDGF: NTR
  • FNAL: NTR
  • BNL: NTR
  • KIT: kernel upgrade on Worker Nodes
  • ASGC: High Availability feature and extra frames successfully added to the robot. Tests being run.
  • RAL: NTR
  • CNAF: asking about LHC schedule to schedule an intervention. The official long term plan is at
https://espace.cern.ch/be-dep/BEDepartmentalDocuments/BE/2010-LHC-schedule_v1.4.pdf which shows the next technical stop on 26-28 April.
  • OSG: With today's GGUS release, the contact/emergency email addresses are present at the level of the OSG (OIM) Resource Group. This allows users submitting usual, TEAM or ALARM GGUS tickets not to see the long list of individual OSG resources. Thanks to the GGUS and OSG developers this ends the saga of multiple site names as documented in
https://twiki.cern.ch/twiki/bin/view/EGEE/SA1_USAG#GGUS_to_OSG_routing_July_2009_sn and constitutes a great functionality improvement!

  • CERN central services:
    • router upgrade postponed by 2 weeks.
AOB:

Thursday

Attendance: local(Ricardo, Jean-Philippe, Miguel, Alessandro, Maarten, Harry, Simone, Lola, Ignacio, Nicolo, Flavia, Maria);remote(Jon/FNAL, Greig/LHCb, Tore/NDGF, Michael/BNL, Gang/ASGC, Tiju/RAL, Angela/KIT, Ronald/NLT1, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • LYON: tonight, starting from approx 4:00 AM CEST, Lyon started failing production jobs with an SRM error in lcg-cr (PrepareToPut). At the same time, FTS transfers started failing as well, with a misleading SOURCE error from CERN. The case is tracked in GGUS https://gus.fzk.de/ws/ticket_info.php?ticket=57343. The problem was cured at 10:30 CEST by restarting the SRM frontend, but the same problem happened again just before the meeting and the ticket was re-opened. Question: is it possible to convert a team ticket into an alarm ticket? Probably FTS timeouts should be tuned. To be discussed offline.
    • RAL: 2 files could not be delivered to RAL in the last 48h. The problem was actually in FTS at CERN, where 2 FTS jobs had completed all file transfers apart from one, which had been in Active state for 36 hours. The FTS jobs have been killed and the transfers resubmitted (now completed), but it would be good to know why the two transfers were stuck. FTS support has been contacted.
    • A tentative plan for the ATLAS reprocessing campaign is in place. The current (tentative) dates are from April 22nd to May 5th. Details in
https://twiki.cern.ch/twiki/bin/view/Atlas/ADCDataReproSpring2010

  • CMS reports -
    • T0 Highlights:
      • CASTORCMS DEFAULT degraded in SLS, due to 6k user requests. Comment from Miguel: the pool configuration can be changed to avoid this problem.
      • One node in standby on CASTORCMS T1TRANSFER Remedy #676592
      • 3 FTS jobs Active since April 6th on CERN-RAL FTS channel, file transfers seem stuck in 'Preparing' phase, srmStatusOfPrepareToPut causing heavy load on RAL SRM GGUS TEAM #57363
    • T1 Highlights:
      • Monte Carlo reprocessing running at all T1s. Prestaging requested at PIC (DataOps confirms that it is OK until PIC can schedule the intervention). 24 files failing persistently at ASGC, possibly I/O related, notifying site contact Elog #1756. 3 files have been manually restaged by ASGC. They will do the same with the remaining files.
      • FNAL: Transfer failures from KIT and CERN - Savannah #113871. DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://cmssrm.fnal.gov:8443/srm/managerv2]. Givin' up after 3 tries
      • IN2P3: SRM SAM test failures Savannah #113870, CMS and ATLAS transfers also affected GGUS #57343, now OK
      • KIT: batch system instabilities still causing intermittent failures in CE SAM tests Savannah #113824.
    • T2 highlights:
      • MC production running smoothly
      • SAM CE errors on T2_IT_Pisa caused by blackhole worker node, now fixed.
      • Reduced SAM availability at T2_RU_IHEP since Apr 5.

  • ALICE reports -
    • Reconstruction tasks: Pass 0 and Pass 1 reconstruction activities ongoing at CERN with minimal activity.
    • MC activities: two new MonteCarlo cycles have been started yesterday night.
    • Analysis: 4 train analysis cycles currently running
    • Several sites have claimed that the latest AliEn v2.18 has not been properly distributed to all sites. Distribution issues of the latest release will be discussed today during the TF meeting.

  • LHCb reports -
    • Experiment activities: Data reconstruction and user analysis continuing.
    • Issues at the sites and services:
      • T0 sites issues: Data access problem with CASTOR (57243) seems to have been resolved although there is concern that the service can become overloaded. Comment from Miguel: if the I/O load in that pool becomes higher the pool size has to increase as well. The current configuration allows 100 concurrent accesses to a single file or 1000-1200 concurrent accesses if many files are accessed. One can also limit the number of concurrent access per user and the number of file instances can also be increased.
      • T1 sites issue: NIKHEF/SARA: dcap file access working with no problems.

Sites / Services round table:

  • FNAL: dcache serving 32000 files locally to production & users - struggling. Added pnfs:// option for the user files, previously had only been used in production. Still working on adding pnfs:// option for recent files. Also - FTS server stuck, restarted server.
  • NDGF: NTR
  • BNL: NTR
  • ASGC: NTR
  • RAL: NTR
  • KIT: NTR
  • NLT1: NTR
  • OSG: new Web Interface put in production. All alarm tickets correctly displayed and handled.

AOB:

  • Maria:
    • successful GGUS release yesterday.
      • There is now one entry for BNL and 2 entries for FNAL (T1 + T3). The FNAL T3 will not get the alarms. ATLAS, please update your twiki. There is no more need to tell your shifters which BNL site name to choose: only one is shown on the GGUS ticket submit form.
      • Support for international character sets has been added, but CERN PRMS does not understand them (strange characters are displayed).
      • Emergency Email address values (used for GGUS ALARM tickets) disappeared from GOCDB for the 2nd time since last October. GOCDB developers should implement a test that the field is not empty, at least for WLCG Tier1s. Info in https://savannah.cern.ch/support/?113228#comment5 and https://gus.fzk.de/ws/ticket_info.php?ticket=57319 .
      • https://gus.fzk.de/ws/ticket_info.php?ticket=57131 was wrongly assigned to the T0. It must be fixed by ATLAS!

Friday

Attendance: local(Harry, Jean-Philippe, Ricardo, Simone, Alessandro, Patricia, Eva, Nilo, Ignacio, Maarten, Nicolo, Jan);remote(Jon/FNAL, Gang/ASGC, Gareth/RAL, Michael/BNL, Barbara/CNAF, Gonzalo/PIC, Rolf/IN2P3, Andreas/KIT, Brian/RAL, Onno/NLT1, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • The good news: yesterday a large number of datasets was subscribed for MC reconstructed data consolidation (mostly merged AODs). New ATLAS aggregated data transfer record achieved (9 GB/s over the grid).
    • Issues at T0:
      • LSF: Starting from 11 AM yesterday, the number of LSF running jobs for the ATLAS T0 decreased dramatically. The issue was reported at 5 PM by the T0 experts via ALARM ticket https://gus.fzk.de/ws/ticket_info.php?ticket=57371. The problem is known and is a rare event, cured by a reconfiguration of LSF which happens anyway every morning. The problem in fact went away at 6:30, but at the time of writing this report (10:30 AM) the alarm ticket had still not been answered. From a subsequent email thread, it looks like the problem was in the ticket flow, since the CERN unit responsible for LSF never got the ticket. A post mortem is needed.
      • CASTOR SRM: CASTOR SRM showed signs of degradation starting from 4 AM. At that moment the failure rate was on the order of 20% and the ATLAS AMOD decided it could wait a couple of hours for people to wake up, so he asked the ATLAS shifter to send a GGUS TEAM ticket (not ALARM). But at 8 AM the situation was much worse (80% failures), since the T0 had at the same time restarted full processing after the LSF incident and the load of file transfers increased. So the AMOD "promoted" the ticket to ALARM https://gus.fzk.de/ws/ticket_info.php?ticket=57382. In this case things worked smoothly. The Data Operations expert called the AMOD by phone a few minutes later and the problem was solved at 8:55 CEST (from the ticket: "1 problematic machine identified and taken out of production"; I assume it is an SRM frontend).
      • "Fake SOURCE problems" from CERN to many T1s: the problem was discussed yesterday for Lyon and it has to do with T1 SRM being slow returning a turl to FTS after a PrepareToPut, and therefore exceeding the gridftp server lifetime at the source. This is now happening for many T1s and the possibility that something wrong is happening at the T0 should probably be investigated (some of the T1s show no issue in data access from jobs for example). The issue is tracked in GGUS https://gus.fzk.de/ws/ticket_info.php?ticket=57381.
    • Issues at T1s:
      • SARA SRM problems: SARA is showing SRM errors starting from yesterday evening. Unfortunately, the issue was initially tracked on an existing GGUS
https://gus.fzk.de/ws/ticket_info.php?ticket=57189 about staging of files from dcache to WNs, which had nothing to do with the current problem, since staging from dCache to WNs does not use SRM but rather gsidcap. In addition the mentioned issue has to do with the ATLAS SW stack and has been reassigned to the VO support. The SRM issue is now tracked in a new GGUS https://gus.fzk.de/ws/ticket_info.php?ticket=57377. Failures in SRM put in SARA are observed both for MC production and data transfers. The problem has been solved in the late morning, at 11:30 CEST it was gone.
      • ASGC 0-length files and auto-retry in FTS: ASGC is showing errors in importing data from many sources. There are two distinct problems, both tracked in
https://gus.fzk.de/ws/ticket_info.php?ticket=57372. First, in some cases the SRMPrepareToPut in CASTOR fails and leaves a zero-length file. Second, the ASGC FTS is configured with 3 auto-retries (as a reminder, ATLAS asks everyone to set this to 0), so the second retry will try to overwrite the 0-length file; but since the overwrite option of FTS is not used (on purpose), the transfer fails with a FILE_EXIST message (which confused the ATLAS shifter).
      • CNAF SRM failures: high number of errors observed both for MC production and data transfers. Issue tracked in https://gus.fzk.de/ws/ticket_info.php?ticket=57374. From the ticket: "There is an hw problem on one of the storage systems dedicated to Atlas: this causes degradation of the overall system and failures. Now (since 21.30 CET) the situation is stable and we waiting for the hw maintenance intervention." It would be good to have an estimate of the timescale for the intervention. In addition, there are also errors for INVALID_PATH on put, not clear if they are correlated (they constitute 99% of the errors). GGUS has been updated asking for more info.
      • LYON SRM instabilities: followed up in
https://gus.fzk.de/ws/ticket_info.php?ticket=57343. According to the solution, there is now an auto-restart of SRM in case of problems. And in fact problems did reappear and went away without any ticket being submitted by ATLAS. The cause of the problem is still not clear to ATLAS, nor whether a proper fix is to be expected soon.
      • FZK batch system is experiencing problems since one week. From the DE cloud contact "Update 16-04-2010 8:36 : The batch farm is still in a strange state. Many worker nodes freeze because the /proc filesystem becomes partially unreadable. User and system commands stall when they try to access the /proc filesystem, e.g. in order to read their environment file. The PBS mom daemon is also affected. This causes substantial PBS server problems because the PBS scheduler also stalls when it connects to a frozen mom."
      • TRIUMF files disappearing: since a couple of weeks or so, a very rare (but at the same time worrying) event has been noticed at TRIUMF. The issue is tracked in
https://gus.fzk.de/ws/ticket_info.php?ticket=56849. In short, a file at TRIUMF is transferred to another site (could be a CA T2 or another T1). The transfer fails. For whatever reason, the file is removed from the SOURCE. There is an absolute time correlation (within the same second) between the FTS logs and the dCache logs about the deletion. There is nothing in the code of the DDM site services removing files. FTS does remove files, but only the destination in case of failure. So investigations are currently concentrated on dCache. The debug level in the logs has been increased (the current one is not sufficient), but the problem is not reproducible; it happens from time to time, so it is hard to debug.

  • CMS reports -
    • T1 Highlights:
      • Monte Carlo reprocessing running at T1s. Jobs stuck in Submitted state cleaned up manually by WMS admins, Savannah #113903
      • PIC: 2 files continually failing exports from PIC, Savannah #113899
      • ASGC: Files with incorrect checksum on CASTOR disk pool, restaging from tape, Savannah #113897
    • T2 highlights:
      • MC production running smoothly
      • CMSSW reinstalled at T2_IT_Pisa, site back in production after 1 day.
      • T2_IT_Legnaro CMS software area ran out of space.

  • ALICE reports -
    • GENERAL INFORMATION: No new inputs in the ALICE shifter offline log last night. Pass1 reconstruction activities ongoing, mostly at the T0 site. In addition, 4 analysis trains are also ongoing.
    • News during yesterdays ALICE TF meeting:
      • Replication of all RAW (which were processed) starts
      • Fraction of RAW will be transferred to specific storages for detector code commissioning exercises
      • Pass1 reconstruction activities at the T0 will continue until the end of the week
      • Start preparation for Pass 2 (better calibration and code updates)
      • MC cycles as requested by Physics Coordination
    • T0 site: Good behavior of both backends (LCG-CE and CREAM-CE) currently in production
    • T1 sites: Manual actions were needed at one of the 2 VOBOXes at FZK yesterday night to ensure the proper configuration environment of AliEn v2.18. The VOBOX is back in production this morning. No requirements in terms of production for the rest of the T1 sites
    • T2 sites: Restart of all services in Trujillo this morning. The site came back in production after these operations. Grenoble CREAM-CE seems to be correctly passing the SAM tests. It will be checked this afternoon by ALICE before putting it in production.

Sites / Services round table:

  • FNAL: dCache overload due to a user renaming files and using their own dCache library.
  • ASGC: will fix soon the number of FTS retries. The problem of transfers failing is complicated but is being investigated.
  • RAL: some issues migrating CMS data to tape.
  • BNL: NTR
  • CNAF: confirms ATLAS report. Asks if CMS is going to start the new processing soon. Nicolo will check.
  • PIC: will upgrade the LAN switches on the 27th April. Could solve the problem reported by ATLAS earlier this week.
  • IN2P3: NTR
  • KIT: Batch system currently runs fine after approx. 150 WNs (out of 1400 in total) have been rebooted. All WNs meanwhile have a kernel update. (The symptoms were: partially unreadable /proc file system and kswapd using 100% CPU even though no swapping was done -> kernel bug suspected). This action took very long because affected WNs needed to be identified and reset manually since shutdown was also hanging and KIT's system management scripts also were affected by the batch server problems and hanging connections to the affected WNs. Will monitor during the weekend.
  • NLT1: SARA SRM performance problem fixed by changing Postgres buffer size and some dCache tuning.
  • Brian: asking about FTS timeouts. Ticket has not been updated.
  • OSG: one bug discovered in ticket exchange. An exception was raised because of a field value missing. Being fixed.

AOB:

-- JamieShiers - 08-Apr-2010

Topic revision: r20 - 2010-04-16 - JeanPhilippeBaud
 