Week of 100308

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Availability: ALICE ATLAS CMS LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Maria, Nicolo, Jamie, Andrea, Jacek, Graeme, Alessandro, Patricia, Roberto, Jean-Philippe, Majik, Timur, Jan, MariaDZ, Gavin);remote(Jon/FNAL, Gonzalo/PIC, Rolf/IN2P3, Michaela Lechner (NDGF), Gareth/RAL, Michael/BNL, Angela/KIT, Ron/NL-T1, Gang/ASGC, Rob/OSG).

Experiments round table:

  • CMS reports -
    • T1 Highlights:
      • MC processing and ReReco completing at T1s: FNAL, RAL, PIC, KIT
        1. cmsHotDisk service class deployed at RAL to solve access issues to hot input files, adjusting optimal number of replicas of hot files.
      • IN2P3: SRM SAM test failures Sunday 7th, disappeared after ~4 hours.
      • CNAF: CE SAM test failures - LSF master dying, batch system admins investigating.
    • T2 highlights
      • MC ongoing in all T2 regions.
        1. Some jobs seem lost in a communication problem between CREAM CEs and the WMS (status Done on CREAM but Running on the WMS). Update: known WMS/ICE bug, patch already in certification, Savannah #61405. (A cross-check sketch follows this list.)
        2. Several workflows failing on all T2s due to operator mistake (requested SLC4 CMSSW architecture for jobs running on SLC5 CMSSW version), resubmitting corrected workflows.
        3. SAM test failures at T2_BR_UERJ, T2_UK_London_IC, T2_RU_JINR
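
A minimal cross-check sketch for the WMS/ICE status mismatch above, assuming a gLite UI where the standard glite-wms-job-status and glite-ce-job-status commands are available; the job IDs and the Done/Running comparison are illustrative only, not an official CMS tool:

    #!/usr/bin/env python
    # Hypothetical cross-check for the CREAM/WMS status mismatch (Savannah #61405):
    # compare the status seen by the WMS with the status seen by the CREAM CE.
    import re, subprocess

    def wms_status(wms_job_id):
        # glite-wms-job-status prints a line like "Current Status:     Running"
        out = subprocess.run(["glite-wms-job-status", wms_job_id],
                             capture_output=True, text=True).stdout
        m = re.search(r"Current Status:\s+(\S+)", out)
        return m.group(1) if m else "UNKNOWN"

    def cream_status(cream_job_id):
        # glite-ce-job-status prints a line like "Status        = [DONE-OK]"
        out = subprocess.run(["glite-ce-job-status", cream_job_id],
                             capture_output=True, text=True).stdout
        m = re.search(r"Status\s*=\s*\[([A-Z-]+)\]", out)
        return m.group(1) if m else "UNKNOWN"

    if __name__ == "__main__":
        # Placeholder job IDs for illustration only.
        wms_id = "https://wms.example.org:9000/abcdef"
        cream_id = "https://cream.example.org:8443/CREAM123456"
        if cream_status(cream_id).startswith("DONE") and wms_status(wms_id) == "Running":
            print("Mismatch: CREAM reports Done but the WMS still reports Running")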

  • ALICE reports - GENERAL INFORMATION: Production dominated by user analysis jobs (also a small fraction of jobs running at the T0 for Pass1 reconstruction)
    • T0 site
    • T1 sites - Minimum activity at these sites for the moment
    • T2 sites
      • Configuration of the new Hiroshima services in the ALICE LDAP finished this morning. Site admin informed about the operations; site back in production
      • The last two French sites (IPNL and Grenoble) which were still not providing a CREAM-CE system for ALICE are now setting up this service. Once this operation is finished, one of the current dedicated WMS for ALICE in France will be removed

  • LHCb reports - MC productions running at a sustained rate w/o major problems (9-10K jobs concurrently)
    • T0 sites issues:
      • CASTOR upgrade to 2.1.9-4 + SRM upgrade to 2.8-6 went w/o problems to report
      • Yesterday some problems with the merging jobs at CERN because input data was not available (RT open). Today it became available again.
    • T1 sites issues: Noticed the CNAF CE has been failing the critical job-submission SAM test since Sunday morning (all CEs affected, problem with LRMS local submission); IN2P3 also had the critical unit test for SRM failing on Sunday.
    • T2 sites issues: Shared area issue at IN2P3-LPC; jobs failing at Pisa

Sites / Services round table:

  • PIC - had a GGUS ticket opened by ATLAS - team ticket - reporting that some jobs fail as they do not find a POOL XML file catalog in the s/w area. The s/w area seems ok - but the file is not there. S/w installation problem? GGUS #56328. Tomorrow morning: scheduled downtime for the FTS 2.2.3 upgrade - will coincide with the stress test.
  • FNAL - ntr
  • NDGF - ntr
  • IN2P3 - had some problems with BDII over w/e. Known performance problem of SL5 BDII. Might explain SRM difficulties. Graeme - don't think so as this was a transfer stage failure with error from gridftp itself.
  • ASGC - last week we had many errors "could not load client credentials". Deleted the credentials from disk. Applied a cron job to do this pending the upgrade to FTS 2.2.3. No date for the upgrade yet...
  • RAL - ntr
  • BNL - ntr for T1. GGUS ticket 55101 about NDGF not accepting a proxy from BNL for the LFC. Sent by mistake? Manually dispatched. Problem lies with NDGF: they have not uploaded the most recent LCG VOMS certs, so the proxy is not accepted. NDGF - trying to find someone who wants to test - the newest files are installed. Don't have an RPM-based installation. Please can someone test this? From our view it should work now. Graeme - will try to find someone with a DoE cert to test. Michael - will also look at it. (A minimal test sketch follows this list.)
  • KIT - ntr
  • NL-T1: on Saturday evening there was a power outage in East Amsterdam which also affected the NIKHEF part of NL-T1. All WNs were turned off as a result - all should now be back ok. Tomorrow morning at 06:30 a router will be replaced, which means Internet connectivity will be briefly interrupted.
  • OSG: OSG All-Hands Meeting (AHM) at FNAL this week, so most people are at FNAL for this. Send mail to me (Rob) for any issues.
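
As a follow-up to the BNL/NDGF LFC item above, a minimal test sketch, assuming a UI with the LFC client tools (lfc-ls) and a valid ATLAS VOMS proxy; the host name and path below are illustrative placeholders, not the actual NDGF endpoint:

    #!/usr/bin/env python
    # Quick check that an LFC endpoint accepts the current VOMS proxy: list a
    # directory via lfc-ls (which reads the endpoint from LFC_HOST) and report the result.
    import os, subprocess

    def check_lfc(host, path="/grid/atlas"):
        env = dict(os.environ, LFC_HOST=host)
        proc = subprocess.run(["lfc-ls", path], env=env,
                              capture_output=True, text=True)
        if proc.returncode == 0:
            print("%s: proxy accepted, listing ok" % host)
        else:
            print("%s: failed: %s" % (host, proc.stderr.strip()))
        return proc.returncode == 0

    if __name__ == "__main__":
        # Assumes "voms-proxy-init -voms atlas" has been run beforehand.
        check_lfc("lfc.example.ndgf.org")   # placeholder host name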

  • CERN: CERN site linux upgrade proceeding today. All will be upgraded by end of day.

  • CERN DB: problem with replication of LFC for LHCb to RAL. Looks like replica was manually updated and hence replication failed. Data should be R/O but not enforced.

  • CERN - the SRM will be rolled forward at 09:00. From 09:30 ATLAS will drive data (large files first) and 30' later small files. Planned rollback at 16:00. T1s should be prepared to receive data from CERN during this time. Rollback earlier in case of problems. There is a known problem with this SRM release which would affect CMS (srm cp).

AOB:

  • USATLAS - would like sites to publish downtimes under OSG resource group names. Details - see Alessandro.

Tuesday:

Attendance: local(Elisa, Jamie, Maria, Graeme, Gavin, Cedric, Majik, Nicolo, Harry, Jean-Philippe, Jacek, Andrea, Alessandro, Lola, Stephane, Andrew, Miguel, Julia, Timur, MariaDZ);remote(Jon/FNAL, Michael/BNL, Gang/ASGC, Ronald/NL-T1, Gareth/RAL, Rob/OSG, Pepe/PIC, Angela/KIT, Michaela/NDGF, Gonzalo/PIC, Rolf/IN2P3).

Experiments round table:

  • ATLAS reports -
    • CERN
      1. Problematic files reported at yesterday's meeting actually transferred ok after a few hours, so no ticket was submitted.
      2. There is a problem with one RAW file which has been unavailable for 32 hours but is not migrated to tape (https://gus.fzk.de/ws/ticket_info.php?ticket=56293). [ Jan - 20 files missing, similar symptoms as the other disk server. Files should have migrated to tape but the box died before this could happen. Graeme - the callback from the T0 to the SFOs was only to release files when the migration bit was set - not the case for these (FILE CLASS TAPE). ]
      3. Disk server MIA: https://gus.fzk.de/ws/ticket_info.php?ticket=56184. News? [ Jan - no news, still being investigated. ]
    • T1
      1. LFC upgrade at TRIUMF, no problems observed.
      2. Network upgrade at PIC, no problems observed.
      3. CNAF reduced transfer efficiency for a few hours last night: https://gus.fzk.de/ws/ticket_info.php?ticket=56284. Recovered by ~00:00 UTC.
      4. Enabled checksums for INFN-T1 in FTS, but problems observed: https://gus.fzk.de/ws/ticket_info.php?ticket=56296.
      5. Production and analysis offline at SARA for two days of interventions.
    • T2
      1. INFN-MILANO back in production.
      2. LSPC back in production.
      3. RRC-KI still being validated.
      4. Storage problems at CA-SCINET-T2, https://gus.fzk.de/ws/ticket_info.php?ticket=56286.
    • SRM tests - look ok. Have to decide if we continue tests overnight, tomorrow, Thursday etc and cross check with LHC schedule.

  • CMS reports -
    • T1 Highlights:
      • MC processing and ReReco completing at T1s
      • PIC: some MC datasets from T2s subscribed for custodial storage before tape families were set up, ticket open to track correct tape migration.
    • T2 highlights
      • MC ongoing in all T2 regions.
      • SAM test failures at T2_FR_IPHC, T2_IN_TIFR, T2_IT_Bari
    • [Data Ops]
      • Tier-0: mainly testing (CASTOR test). Higher-rate tests expected this morning [O(1 kHz) in stream A (tracker out)]. Event sizes smaller than usual. Online to ensure the Express rate is reasonable.
      • Tier-1: moving back to backfill as MC and other processing is finishing. Validation of 3_5 for the large-scale reprocessing is ongoing as well.
      • Tier-2: large scale MC production. We will be using 35X for the first time for full MC production.
    • [Facilities Ops]
      • CMS preparing for first 900 GeV collisions.
      • Follow-up on Tape Families setup in ASGC to bring the site to Operations before the run. Some 2009 Cosmic Data to be transferred in custodial/non-custodial mode to validate new Tape Families.
      • Follow-up on T1_DE_KIT transfer issues with a few T2 sites and the creation of new FTS channel for low throughput links.
      • The CMS SAM submission client has been migrated to a SLC5 VOBox.
      • The Job Robot will probably follow today. [ Andrea - done ] For this reason, the submission of jobs from the old SLC4 VOBox was stopped yesterday (8 March) to let already-submitted jobs drain.
      • Today there is a CMS Computing Shift (CSP) Tutorial dedicated to the Cukurova Institute (Turkey)... Agenda here. More tutorials to follow.
      • CMS week: 15-19 March 2010 at CERN; Indico agenda here. According to the current schedule, on Monday 15 March 2010 the CMS Computing Operations Meeting takes place as normal.

  • ALICE reports - GENERAL INFORMATION: Production dominated by two RAW reconstruction cycles and some unscheduled user analysis jobs
    • T0 site
      • Reconfiguration of the WLCG ALICE VOBOXES performed this morning. This reconfiguration has been done to ensure a stable and redundant setup during the data taking:
        • Single VOBOX for WMS backend submissions
        • Two VOBOXES for CREAM backend submissions
        • One VOBOX (voalice14) taken out of production. This machine will be used as development machine only
    • T1 sites - Several T1 sites participating in the Pass2 reconstruction running today with no issues to report (small fraction of jobs)
    • T2 sites - no new issues to report

  • LHCb reports - Small productions running at very low rate; occasional failures mostly due to application-related issues. Production last week finished quite smoothly... just the merging of smallish files was causing some trouble in the DIRAC logic.
    • T0 sites issues:
      • Some merging jobs failed at CERN on Saturday and Sunday because of input data temporarily not available. On Monday the problem had disappeared.
    • T1 sites issues:
      • RAL: the Apply process of the Streams replication to LFC@RAL was failing yesterday. This seems to be due to a spurious entry in the RAL DB (which should be read-only) that was introduced manually. This is something that should not happen at all. [ Now under control and fixed. ]
    • T2 sites issues:
      • Continuing the investigation of the data-upload problem from several UK sites to CERN. Working closely with Sheffield and Glasgow people.

Sites / Services round table:

  • FNAL - as Nicolo said FNAL received request to replicate hot files - 6 files - replicated to >100 pools on different nodes. Appears to be working.
  • BNL - ntr
  • NL-T1 - ntr
  • ASGC - ntr
  • RAL - have a disk server out of action, part of ATLAS MC DISK. Hope to have it back later. Streams replication problem: some investigation - looks like the additional record arrived due to regional Nagios monitoring (?!); being followed up.
  • KIT - ntr
  • IN2P3 - ntr
  • NDGF - problem yesterday has been solved.
  • PIC - the FTS upgrade had to be cancelled and will be done next Wed/Thu. Today PIC is closed due to yesterday's snow storm!
  • OSG - ntr

  • CERN: Monthly Linux upgrade pulled in the latest gLite-WN software which caused a service incident on the LCG-CEs affecting jobs which were submitted to SLC5 worker nodes. See SIR.

  • CERN: We will update the LSF version on the batch service to a new patch version (7.0.5 -> 7.0.6). We will also reboot the master batch nodes to pick up a new kernel. There will be two short periods (up to 5 minutes) when no new jobs can be submitted. Running jobs will not be affected. Andrea - does this do anything regarding the high rate of jobs aborting due to the Maradona error? Gav - working with Platform; another patch will be applied soon.

  • CERN - the MIA disk server is back and being drained. The vendor will have to work on it.

  • CERN Dashboards - update of ATLAS DDM monitoring today. Downtime very short - a few minutes.

AOB: (MariaDZ) Just a reminder of the ALARM test timezones for tomorrow, as people asked yesterday. On the afternoon of the GGUS release date, test alarms are initiated by the GGUS developers to the Tier1s, as part of the service verification procedure, in 3 slices (converted to local times in the sketch after the list):

  1. Asia/Pacific right after the release,
  2. European sites early afternoon (~12:00 UTC),
  3. US sites and Canada late afternoon (~18:00 UTC).
Alarm test results/comments in https://savannah.cern.ch/support/?112668 please.
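
A minimal conversion sketch for the slice times, assuming the release date is Wednesday 10 March 2010 (the day after this entry); zoneinfo applies the March-2010 DST rules automatically (CERN still on CET, US East coast still on EST):

    #!/usr/bin/env python
    # Convert the alarm-test slice times from UTC to CERN and US East coast local time.
    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo   # Python 3.9+

    slices = {"European sites": 12, "US sites and Canada": 18}   # hours UTC, from the list above
    for label, hour in slices.items():
        t = datetime(2010, 3, 10, hour, 0, tzinfo=timezone.utc)
        print("%s: %02d:00 UTC = %s at CERN / %s US East" % (
            label, hour,
            t.astimezone(ZoneInfo("Europe/Zurich")).strftime("%H:%M"),
            t.astimezone(ZoneInfo("America/New_York")).strftime("%H:%M")))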

  • Nicolo - noticed CNAF has completed the FTS 2.2.3 upgrade.

Wednesday

Attendance: local(Edoardo, Patricia, Majik, Elisa, MariaDZ, Andrea, Jamie, Przemek, Harry, Timur, Jean-Philippe, Nicolo, Miguel, Graeme, Simone, Ricardo, Nilo, Jan, Alessandro, Cedric, Giuseppe);remote(Jon/FNAL, Michael/BNL, Gonzalo/PIC, Rob/OSG, Angela/KIT, Tiju/RAL, Rolf/IN2P3, Gang/ASGC, Onno/NL-T1).

Experiments round table:

  • CMS reports -
    • T0 Highlights: Replay running.
    • T1 Highlights:
      • MC processing and ReReco completing at T1s
        1. RAL - jobs failing for permission issue on new cmsHotDisk service class, now fixed.
      • Starting preproduction for next processing round.
        1. FNAL - jobs failed for incorrect architecture specification in workflow
        2. IN2P3 - some jobs failed because they were submitted to IN2P3-CC-T2 CE - resubmitted to T1 CE, running.
      • CNAF: since the upgrade to FTS 2.2.3, about ~50% of CMS T2 site contacts are no longer authorized to submit transfer jobs to the CNAF FTS [ Should now be corrected - incorrect mapping in the gridmap file. ] (A submission probe sketch follows this list.)
        1. glite-transfer-submit exits with 'You are not authorised to submit jobs to this service'
    • T2 highlights: MC ongoing in all T2 regions.
    • Other: SAM test history not updating in Dashboard for a few hours - probably SAM programmatic interface not returning results used to import data in Dashboard.
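
A minimal authorisation probe for the CNAF issue above, assuming a gLite UI with glite-transfer-submit and a valid CMS VOMS proxy; the endpoint and SURLs are placeholders, not real CNAF paths:

    #!/usr/bin/env python
    # Submit one test transfer to an FTS endpoint and flag the 'not authorised'
    # error reported at CNAF (typically a missing or incorrect gridmap mapping).
    import subprocess, sys

    FTS = "https://fts.example.it:8443/glite-data-transfer-fts/services/FileTransfer"  # placeholder
    SRC = "srm://source.example.org/data/test-file"                                    # placeholder
    DST = "srm://dest.example.it/data/test-file"                                       # placeholder

    proc = subprocess.run(["glite-transfer-submit", "-s", FTS, SRC, DST],
                          capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    if "not authorised" in output or "not authorized" in output:
        sys.exit("Authorisation failure: check the DN mapping on the FTS server")
    print("Submission accepted, job id:", proc.stdout.strip())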

  • ALICE reports - GENERAL INFORMATION: Massive MC production is foreseen at the end of this week. For the moment the current production is dominated by the Pass1 and 2 reconstruction tasks and some user analysis jobs (small proportion in comparison with the reconstruction jobs) [ About 8000 user analysis jobs, not using analysis train, seen just before meeting. ]
    • T0 site: The T0 is currently running most of the jobs due to the Pass1 reconstruction activities. Stable behaviour of the services after the redistribution of VOBOXes performed yesterday morning. Both CERN CREAM-CEs are now in production (2 VOBOXes), as well as the LCG-CE (one VOBOX)
    • T1 sites: Currently responsible for most of the Pass2 reconstruction jobs, nothing special to report
    • T2 sites: Clermont (France): the new VOBOX provided by the site to submit to the local CREAM-CE has not yet been validated. Problems with the local software area are preventing the PackMan service from working properly and AliEn from being installed at the site. Following up the issue with the site admins

  • LHCb reports - No activity yesterday, system pretty idle.
    • T0 sites issues: Tomorrow morning a 90-minute downtime [ from 08:30 ] on all central DIRAC machines for an intervention on the network service f513-c-ip169-shpyl-10.
    • T1 sites issues: CNAF: after yesterday's upgrade of StoRM, the SAM unit test for SRM keeps failing, affecting the whole site availability. The reason is not yet clear.
    • T2 sites issues: Continuing the investigation of the data-upload problem from several UK sites to CERN. Working closely with Sheffield and Glasgow people.

Sites / Services round table:

  • FNAL - ntr
  • BNL - outage under investigation. Not caused by SRM. Combination of SRM and pnfs server. Gained more details of nature of problem - hopefully find a fix soon.
  • PIC - the FTS migration to 2.2.3, cancelled due to snow, has been rescheduled for Tuesday 18th March. Seeing quite a lot of spurious errors from the SAM test for LHCb - the LFC catalogue test which checks Streams. Intermittent errors, triggering alarms to the manager on duty - they risk ignoring alarms if they are too frequent. Elisa - most probably due to the intervention at CERN to fix the problem at RAL. Maria - a problem with most T1s plus another problem with RAL. Elisa - just one hour after the intervention...
  • IN2P3 - ntr
  • KIT - ntr
  • RAL - ntr
  • NL-T1 - SARA FTS/FTA downtime still in progress. The SARA SRM downtime tomorrow is cancelled. Graeme - rescheduled? Onno - will be rescheduled for the week after next, during the LHC machine stop two weeks from now - will use that machine stop for the maintenance. Graeme - have to be quite agile with the schedule!
  • ASGC - ntr
  • OSG - ntr

  • CERN DB - yesterday evening one application inserted tons of rows into the ATLAS online DB - big impact on Streams replication: 6 hours of delay online->offline. Can also impact propagation to the Tier1s. Investigation under way.

  • CERN batch intervention this morning "transparent".

  • CERN - 90% of the files on the 2 disk servers mentioned have been recovered. If you have copies of the remaining 10% of the files, please let us know.

AOB:

  • Reminder of tomorrow's Tier1 Service Coordination meeting.

  • LHC - All beams except LHC beams will be stopped on Friday (12 March) at 18:00. All LHC-type beams will be stopped on Monday (15 March) at 05:00 to respect the 3h radiation cool-down imposed by RP. Access will be given as from 08:00. Restart of the machines is planned for Tuesday (16 March) as from 18:00. The slot for 450 GeV collisions is to be redefined.

Thursday

Attendance: local(Nicolo, Lola, Cedric, Gavin, Maria, Jan, Jamie, Andrea, Miguel, Harry, Timur, Jean-Philippe, Roberto, Alessandro, Patricia, Simone, Nilo, Eva);remote(Jon/FNAL, Michael/BNL, Gonzalo/PIC, Angela/KIT, Rob/OSG, Ronald/NL-T1, Gang/ASGC, Rolf/IN2P3, Tiju/RAL, Gareth/RAL, Barbara/CNAF, Michaela/NDGF).

Experiments round table:

  • CMS reports -
    • T0 Highlights
      • PromptReco replay successful.
      • CASTORCMS DEFAULT service class red in SLS last night, caused by a large number of user requests. Elog #148
    • T1 Highlights:
      • MC processing and ReReco tails completing at T1s
      • Preproduction for next processing round.
        1. IN2P3 - no successful jobs so far: jobs running for a long time; some eventually fail (killed while waiting for file open on dCache).
      • CNAF: authorization issue on FTS 2.2.3 solved
      • PIC: site reports that Maradona errors in JobRobot decreased with no intervention on their side.
    • T2 highlights
      • MC ongoing in all T2 regions.
    • Other
      • JobRobot failures at all sites from 4pm to midnight caused by mistake in configuration - results not accounted against sites.
      • Investigating configuration issue with squid server at T2_IN_TIFR.

  • ALICE reports - GENERAL INFORMATION: There are just a few jobs currently running in the system, coming basically from the last Pass1 reconstruction jobs and some chaotic user analysis jobs. AliEn experts announced this morning that due to some technical problems the AliEn installation is not available. Any AliEn installation should be avoided for the moment until the experts solve the problem
    • T0 site: Tiny number of jobs currently in production, no particular issues to report
    • T1 sites: CNAF: the new T1D0 storage has been installed and is working, as the site admin has just announced. The new system has 880 TB on tape (which will increase during the year) and 50 TB of disk for file staging. The system was successfully tested by ALICE this morning
    • T2 sites:
      • Hiroshima T2: a hardware fault of a LAN switch between the LCG-CE and the WNs has been detected, and the site has been declared in "maintenance" mode in GOCDB since the connection to all WNs behind the CE is lost. The problem will be solved after the weekend
      • French federation: there are currently only 2 sites still using the LCG-CE: Grenoble and IPNL. Following the site admins' request to deprecate one of the current dedicated ALICE WMS, the configuration at these 2 remaining sites has been changed to use the single ALICE WMS that will be maintained in the future: grid07.lal.in2p3.fr

  • LHCb reports - No activity at all.
    • T0 sites issues: LFC replication tests failing occasionally at all T1's. The reason is to be investigated (could be due to the master at CERN). [ Not a problem with the service but with the SAM suite for LHCb ]
    • T1 sites issues: SAM suite still failing at CNAF: put some protection in the test code to shield it from unexpected responses given by StoRM.

Sites / Services round table:

  • FNAL - ntr
  • BNL - Observation: a flood of several K analysis jobs entering the analysis resources at BNL. Extremely short jobs - less than 2' runtime. Stage-out of 6-10 files / job using lcg_cp, hence many transfer requests from the WNs to the SE using gridftp. Each transfer opens sockets and uses ports that are not immediately freed up: we see thousands of sockets in TIME_WAIT, which lasts ~2'. The number piles up until the machine runs out of usable ports. Looking at ways of reusing ports before the timeout. Maybe some impact from this? To be seen. (Some transfer failures seen but no outage.) User-defined job - not a production job; the user is just looking for some attributes in the file, hence the short jobs. (A back-of-envelope sketch of the port arithmetic follows this list.)
  • PIC - ntr
  • IN2P3 - the CMS problems have been solved - a problem with the tape mount scheduling system. Not much more known - more info tomorrow.
  • NL-T1 - FTS at SARA: transfer agents refuse to start even after a clean install. Reported to the FTS developers but no reaction so far... JPB will check.
  • RAL - FTS upgrade: next Wednesday morning. Will contact local expt reps to check. "At risk" on LHCb CASTOR over w/e - draining out some RAID5 disk servers. LHCb won't access area during this drain.
  • ASGC -ntr
  • KIT - ntr
  • OSG - some issues with alarm tickets. Alarm ticket submission failed to open OSG tickets: alarms came through to pagers but no tickets were opened at OSG, BNL or FNAL. Will retest in ~40' from now. (The submitter name was not added to the alarm.) GGUS will generate the alarm according to the normal procedure.
  • NDGF - ntr
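
A back-of-envelope sketch of the BNL port-exhaustion observation above; the port range and 2-minute TIME_WAIT figures are typical Linux defaults, assumed here for illustration rather than taken from the BNL nodes:

    #!/usr/bin/env python
    # Count TCP sockets in TIME_WAIT (state 06 in /proc/net/tcp) and estimate how
    # many new outbound connections per second a node can sustain before the
    # ephemeral port range is exhausted.

    def count_time_wait():
        n = 0
        for table in ("/proc/net/tcp", "/proc/net/tcp6"):
            try:
                with open(table) as f:
                    next(f)                          # skip the header line
                    n += sum(1 for line in f if line.split()[3] == "06")
            except IOError:
                pass
        return n

    if __name__ == "__main__":
        ports = 61000 - 32768                        # default ip_local_port_range: ~28k ports
        time_wait = 120.0                            # default TIME_WAIT: ~2 minutes
        print("Sockets currently in TIME_WAIT:", count_time_wait())
        print("Sustainable new connections/s before running out of ports: ~%d"
              % (ports / time_wait))                 # ~235/s with these defaults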

  • CERN - starting to drain disk servers running out of warranty for CMS - the CMS CAF in particular. Should be a background activity and "transparent". Jan - for the ATLAS SRM test, SLS is now working. Decision after the meeting on whether to stay at SRM 2.9.
  • CERN - team ticket from ATLAS on ATLAS T0 batch - problem dispatching jobs to WNs. FTS update: planned for next week; details tomorrow. Following up an issue with the Turkish CA, which is publishing a CRL with an expiry date too far in the future. [ All Turkish sites started failing SAM tests yesterday ]
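
A minimal sketch for checking a CRL's validity window with openssl, relevant to the Turkish CA item above; the file path is an example - on gLite nodes CRLs normally live under /etc/grid-security/certificates as <hash>.r0 files:

    #!/usr/bin/env python
    # Print the lastUpdate/nextUpdate fields of a CRL; a nextUpdate far in the
    # future is the symptom described above.
    import subprocess, sys

    crl = sys.argv[1] if len(sys.argv) > 1 else "/etc/grid-security/certificates/example.r0"
    out = subprocess.run(["openssl", "crl", "-in", crl, "-noout",
                          "-lastupdate", "-nextupdate"],
                         capture_output=True, text=True)
    print(out.stdout.strip() or out.stderr.strip())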

AOB: (MariaDZ) ALARM test results' summary in https://savannah.cern.ch/support/?112668#comment7

  • LHC schedule: technical stop March 15 at 05:00 to March 17. Close machine and experiments at lunchtime on that day.

Friday

Attendance: local(Cedric, Maria, Jamie, Nilo, Eva, Nicolo, Gavin, Majik, Timur, Jean-Philippe, Miguel, Lola, Alessandro, Roberto);remote(Jon/FNAL, Michael/BNL, Rolf/IN2P3, Tore Mauset (NDGF), Gang/ASGC, Onno Zweers (NL-T1), Gareth/RAL, Gonzalo/PIC, Rob/OSG, Xavier Mol (KIT)).

Experiments round table:

  • CMS reports -
    • T1 Highlights:
      • MC processing and ReReco tails completing at T1s
        1. Last 4 jobs failed at PIC, site contact notified.
      • Preproduction for next processing round.
        1. IN2P3 - jobs now running, following site intervention to make files accessible on dCache
        2. RAL - jobs failing consistently on one file: file OK, probably software problem.
    • T2 highlights
      • MC ongoing in all T2 regions.
        1. 35 files produced at T2_IT_Bari lost before transfer to T1_IT_CNAF for custodial storage - invalidated.
      • SAM tests running infrequently on CEs at T2_FI_HIP - possibly issue with registration of ArcCEs in SAMDB.
      • Fixed configuration issues with squid server at T2_IN_TIFR. [ Now reported to be ok ]
      • T2_AT_Vienna reported issues in contacting voms.cern.ch to build mapfiles after an upgrade - caused by entries with an obsolete format in the YAIM groups.conf file.

  • ALICE reports - GENERAL INFORMATION: The new MC production cycle announced for the end of this week has been postponed, waiting for the new version of AliRoot. Currently the production comes from the still-running Pass1 and Pass2 reconstruction tasks
    • T0 site - Running the Pass1 reconstruction tasks; no particular issues to report. The (pilot) AliEn v2.18 has been installed this week at CERN for testing purposes and is showing good performance.
    • T1 sites - No particular issues to report
    • T2 sites
      • Subatech (France) announced before the daily operations meeting that they had to reset the machine because the operating system hung with an "Out of Memory" message. Still looking into the problem
      • Subatech and Torino will be used as pilot sites to test the latest AliEn v2.18 version. Site admins will be notified in advance at the beginning of next week

  • LHCb reports - Relaunched the stripping of bbar and ccbar over the MC production of the 10th of February. Testing the new code for SRM Unit test (temporarily set to "non critical")

Sites / Services round table:

  • FNAL - ntr for FNAL, but a reminder that the US switches to daylight saving time on Sunday
  • BNL - ntr
  • IN2P3 - no news on incident of yesterday - tried to get info from local CMS staff; no ticket; no source of info.
  • ASGC - ntr
  • RAL - currently draining some LHCb disk servers. Have stopped access to the CASTOR service class whilst this goes on, and are failing some SAM tests because of load.
  • PIC - ntr
  • NL-T1 - have an issue with FTA: FTS & FTA were upgraded last Wednesday, since when we have not been able to get FTA working. Highest priority - GGUS 56384. Tried many things without success. The agent forks to create a child process; the child immediately exits with a segfault. Provided several logs, strace output and a core dump in the GGUS ticket. gLite developers are having a look at it. If we can't fix it by Monday we will consider downgrading, but will decide on Monday.
  • NDGF - ntr
  • OSG - have 2 issues: 1) regarding GGUS 56346, where several jobs failed on an ATLAS midwest T2. Initially traced to the C++ library - maybe not STDLIB but the ATLAS release; still being worked on by US ATLAS & OSG, awaiting some input from BNL. 2) The alarm re-test went smoothly - FNAL perfectly; BNL ok but a bug was found: 2 tickets on the same problem were merged, which caused a large # of emails to be returned to the OSG ticketing system. Found & fixed quickly. [ Michael - the first issue is now resolved; release reinstalled and now ok ]

  • CERN - planning to switch the alias of the FTS T0export service next Tuesday (but see discussion). The old service - FTS 2.1 - has quite heavy use at the moment. Will discuss offline how best to change. The T2 service will be switched Wednesday. Simone - some DNs and IPs to trace requests? Nicolo - can request T1s to stop / switch on Monday pm. [ Solve offline with ATLAS/CMS ]

AOB:

-- JamieShiers - 04-Mar-2010
