Week of 110221

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alessandro, Jamie, Douglas, Roberto, Mike, Uli, Maarten, MariaDZ, Andrea, Simone, Eva, Miguel, Nilo);remote(Michael, Jon, Gonzalo, Ron, Dimitri, John, Thomas, Rolf, Jeremy, Suijan, Rob, Stefano, Daniele Andreotti).

Experiments round table:

  • ATLAS reports -
    • T0, Central services
      • Problems with file transfer and access at FZK on Sunday from 13:00-17:00 CET. This was traced to a corrupted CRL file for the "seegrid" CA. The corrupted file blocked all access, even though no transfers or file access were requested by members of this CA. This was ticketed, and at 17:00 FZK copied a new CRL file for this CA from CERN AFS, which fixed the problem. The reply to the ticket is: "I think the update to the EGI Trust Anchor in combination with fetch-crl-3.0.5-1 caused the problems". (GGUS:67685) (See the CRL-check sketch after this report.)
      • GGUS problems on Sunday at the same time. According to shifter reports, people with a grid certificate from the CERN CA could not use GGUS, while people with certificates from other CAs could. The response from the GGUS developers is that this was a related problem, since they get their CRL files from FZK. But as far as we could tell, the CERN CRL file at FZK was fine. It would be good for the GGUS developers to investigate why one corrupted CRL file would block access for members of other CAs. [ GGUS incident in AOB ]
      • Problems with access to datatape at CERN. Functional tests to datatape at CERN have been hanging until timeout for a few days. There is a ticket on this (GGUS:67675), and a promise to investigate, but no solution to this ticket yet. [ Ale - problem not yet fixed but two tickets - will correlate. Miguel 1 ticket from Armin on scheduling errors (being investigated) and another on SRM access. SRM bug with Oracle - restart of daemons this morning. ]
      • Beam splash events on Saturday night; people were very excited and pleased. Data was reconstructed in a timely fashion, things are looking good for ATLAS, and we are looking forward to more beam.
    • T1
      • Taiwan was placed back into production by ATLAS today, after the storage changes were finished.
      • Queues are offline at IN2P3, and this is causing problems for ATLAS production; jobs are working at the Tier-2 sites in the cloud, so this is only a problem for jobs which need to run at the Tier-1. [ Rolf - the queues for ATLAS are open, so we don't understand the comment. We did not close the queues - please investigate. Probably taken offline on the ATLAS side. ]
      • There were no issues reported from the updates to the ATLAS pilot code on Friday, and things were working for RAL over the weekend.
    • T2
      • A number of problems at tier-2 sites, but these are getting worked out.
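
    For reference, a minimal sketch of how a corrupted or expired CRL can be spotted and refreshed on a site, assuming the standard /etc/grid-security/certificates layout and the fetch-crl tool mentioned above (paths are generic, not FZK-specific):

      CERTDIR=/etc/grid-security/certificates
      # A corrupted CRL fails to parse; an expired one shows a nextUpdate date in the past.
      for crl in "$CERTDIR"/*.r0; do
          openssl crl -in "$crl" -noout -nextupdate 2>/dev/null \
              || echo "CORRUPT or unreadable: $crl"
      done
      # Re-download all CRLs from the CAs' distribution points.
      fetch-crl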

  • CMS reports -
    • CMS / CERN / central services
      • Detector/DAQ OK getting beam splashes
      • GGUS:67719 opened a short while ago: inconsistency in the top-level BDII
    • Tier-0
      • HeavyIon zero-suppression turned out to have a large fraction of errors not caught in the initial validation; we will need to re-run.
        • The Tier-0 needs to be available for data taking now; finding the right place and time for the re-run is in progress, and IT will be consulted.
    • Tier-1's
      • Pre-production with CMSSW 3_11 has started; also running backfills at T1's to test the new production tool.
      • GGUS:67573 with ASGC: their Castor storage is overloaded to the point that SAM tests are timing out. Production jobs are also affected. SAM errors went away and came back. Not clear if the site has understood the problem or done anything. For sure it has nothing to do with the WAN interruption hinted at on Friday. [ Site has decreased job slots for CMS to overcome many jobs reading from CASTOR. Don't know why the timeouts. ]
    • Tier-2's
      • little production yet, waiting for 3_11 here as well. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • CNAF: GGUS:67729. In MonALISA we see ~200 jobs running, but the information provider reports ~1200. The LCG status is REALLY RUNNING, but probably those jobs are doing nothing.
    • T2 sites
      • Usual operations

  • LHCb reports - MC activity: further production was submitted over the weekend. Major update of DIRAC tomorrow - preparing for that. Reported on Friday a possible issue with CREAM "cutting parameters of the executable" run in the pilot - it seems to be related to a bug fix in the new version of the CREAM CE. The developers suggest rolling back to the previous version. BUG:59343
    • T0
      • Pilots aborting against CREAMCE (GGUS:67710) [ seems to be fixed ]
    • T1
      • GridKA: SRM endpoint not working (no matter which space token). GGUS was not working either; the ticket was opened afterwards (GGUS:67687)
    • T2 site issues:
      • NTR

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • PIC - ntr
  • NL-T1 - ntr
  • KIT - as mentioned, we think the update to the CRL caused the problem seen yesterday
  • RAL - ntr
  • IN2P3 - in downtime for MSS until 24th Feb.
  • NDGF - ntr
  • ASGC - nta
  • CNAF - ntr
  • GridPP - ntr
  • OSG - items related to SAM records. Downtimes for mid-January were sent and confirmation received; the BNL downtimes for mid-January should be in the SAM DB now. Still working on records affected by a bug in our RSV monitoring which caused a lot of records to be sent as unknown; that has been fixed now too.

  • CERN Dashboards - the CMS site usability interface was upgraded today, having been validated by the VO for the past two weeks. No interruption, but there will be some changes on the SSB site for shifters, to be discussed within CMS and then with the dashboard team. A new version of the Site Status Board collectors was deployed last week, with much better performance and stability than previously. The ATLAS SSB will be upgraded this week with the additional functionality asked for. CMS will be upgraded too in one week's time.

  • CERN CE - a bug in ncm-sudo wiped out include statements. Something triggered a run of this over the weekend - fixed over the weekend. CERN is upgrading all CREAM CEs to the latest version; 4 done so far. Probably only one left for LHCb at the moment - prefer not to roll back, as the previous version had other problems at CERN - maybe a hot fix. Maarten - think we can get a hot fix - will respond to the thread.

AOB: (MariaDZ) There was a problem accessing GGUS pages on Sunday 20/2 pm (first reports around 14:30 by Jarka) due to an error in the CRL update. A workaround was suggested to the shifters by MariaDZ via Douglas and Jarka around 16:50. The problem was solved around 18:40. Some users needed to restart their browser to regain access to the GGUS pages. The ATLAS report at the meeting showed that it remains unclear which certificate holders were affected by this problem, so a SIR is needed for the next T1SCM, this Thursday.

Tuesday:

Attendance: local(Jamie, Uli, Miguel, Simone, Stefan, Maarten, Mike, Alessandro, Eva, Nilo, Andrea);remote(Jon, Michael, Stefano, John, Ronald, Rolf, Suijan, Rob, Gonzalo, Foued).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info, Central services
      • project tag: data11_900GeV
    • T0, T1
      • CERN-PROD open tickets: GGUS:67708 (no stager id request - answer now in ticket. Need to understand what we should do if this issue happens again: team ticket? alarm ticket? Miguel - ask in the ticket! A work-around is being put in place for the SRM problems; it should not cause long unavailabilities), GGUS:67754 (file problematic in data export), GGUS:67780 (one user file lost on SCRATCHDISK)
      • IN2P3-CC Panda queues had been set offline yesterday due to a misunderstanding of the DB downtime: it was thought the DB downtime would affect Tier-1 services like FTS and LFC, but that is not the case. Now online.

  • CMS reports -
    • CMS / CERN / central services
      • Detector/DAQ OK getting beam splashes
      • GGUS:67719 inconsistency in top BDII, opened yesterday, (still) no news [ Uli - under investigation even though ticket has not yet been updated ]
    • Tier-0
      • ready for data
    • Tier-1's
      • Pre-production with CMSSW 3_11 has started; also running backfills at T1's to test the new production tool.
      • GGUS:67573 with ASGC: the CASTOR overload was tracked to a pileup sample not replicated to other pools, as suggested in the past. Nevertheless jobs are running, slowly but OK. We hope to get the current work finished by Friday; then the site has a green light to upgrade CASTOR asap. Anyhow, please confirm with CMS DataOps before starting the upgrade.
    • Tier-2's
      • little production yet, waiting for new requests. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte

  • ALICE reports - General: Since Sunday evening there was a problem affecting one of the central services (an optimizer), due to which jobs were failing. The problem was solved yesterday afternoon.
    • T0 site
      • A SINDES problem is delaying the migration of voalicefs01 to SLC5. Issue ongoing.
    • T1 sites
      • FZK: cream-1 was taken out of production yesterday because it was not working properly. cream-3 entered production to replace it.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Problem of truncated parameters when submitting to CREAM CE after upgrade to 1.6.4 (BUG:78565) [ Uli - might make sense to open a GGUS ticket - requested by developers ]
    • Tier 0
      • Dirac system drained yesterday due to intervention / shutdown today of voboxes hosting Dirac service at CERN. The downtime is over (since 11:30) and services have been restarted.
    • T1
      • NTR
    • T2 site issues:
      • NTR

Sites / Services round table:

  • FNAL - ntr
  • BNL - ntr
  • RAL - ntr
  • NL-T1 - ntr
  • IN2P3 - still in downtime for MSS; the downtime for the ATLAS DBs DBATL and DBAMI is over and all went fine [ Ale - when will the MSS downtime end? Rolf - for the moment all is going as planned, so it should be over during 24 Feb (Thursday) ]
  • ASGC - ntr: the experiment has requested the replication of the datasets and the ticket has been updated
  • PIC - ntr
  • KIT - ntr
  • OSG - ntr

  • CERN DB - one of the nodes of the CMS offline cluster rebooted yesterday afternoon; the cause is being investigated
  • CERN - a power cut this morning affected LHCb: some diskservers went offline

AOB: (MariaDZ) GGUS Release tomorrow!!! Registered in GOCDB and announced via the CIC portal to all Support Units!

Wednesday

Attendance: local(Mattia, Jamie, Maria, Simone, Miguel, Roberto, Alessandro, Eva, Giuseppe, Lola, Edoardo, Uli, MariaDZ);remote(Michael, Jon, Gonzalo, Dimitri, Lorenzo, John, Jeremy, Thomas, Suijan, Stefano, Rolf, Rob, Ron).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info, Central services
      • project tag: data11_1Beam (no data export)
    • T0
      • CERN-PROD storage issues (10 GGUS tickets to CERN-PROD related to storage problems in the last 7 days, 18 in the last month):
        • GGUS:67804 (no stager id request; the old one 67708 has been closed and verified, this is a new one),
        • GGUS:67754 (file problematic in data export; since yesterday) ,
        • GGUS:67780 (one user file lost on SCRATCHDISK; since yesterday),
        • GGUS:67793 (file transfer export problems from SCRATCHDISK),
        • GGUS:67798 (general problems with the storage, put/get highly inefficient).
        • Another problem with a file not accessible on a disk server (not reported via GGUS): from the stager point of view, the file /castor/cern.ch/grid/atlas/DAQ/2010/00169751/physics_bulk/data10_hi.00169751.physics_bulk.daq.RAW._lb0340._SFO-12._0001.data was ONLINE and NEARLINE on t0atlas, but the BringOnLine bulk request was stuck in a pending state since the file was not really accessible. A stager_rm of that file left it only NEARLINE; a subsequent bringonline of the file (on default) made it ONLINE and NEARLINE and accessible again, problem solved (see the sketch after this report). [ Miguel - degradation since the SRM upgrade; a broken diskserver and other problems are all being investigated. Prioritizing and working in parallel - Giuseppe on SRM, Miguel on scheduler errors etc. The tickets about the broken diskserver are not the most urgent. Simone - serious data taking in ATLAS starts in 5 days. Possible decisions include rolling back the SRM. We cannot take data like this! Review the situation tomorrow. Giuseppe - should have a full picture by tomorrow. Most things are understood - would rather "roll forward" than back. ]
    • T1 -- nothing to report
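
    For reference, a sketch of the recovery sequence described above using the CASTOR stager client; the command options are quoted from memory and worth double-checking:

      F=/castor/cern.ch/grid/atlas/DAQ/2010/00169751/physics_bulk/data10_hi.00169751.physics_bulk.daq.RAW._lb0340._SFO-12._0001.data
      stager_qry -M "$F"   # ONLINE/NEARLINE status as seen by the stager
      stager_rm  -M "$F"   # drop the stale disk copy: file becomes NEARLINE only
      stager_get -M "$F"   # trigger a fresh recall from tape (default service class)
      stager_qry -M "$F"   # should return ONLINE and NEARLINE once the recall completes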

  • CMS reports -
    • CMS / CERN / central services
      • Detector/DAQ OK getting beam splashes
      • GGUS:67719 inconsistency in top BDII: being followed up, not resolved.
    • Tier-0
      • ready for data
    • Tier-1's
      • Pre-production with CMSSW 3_11 has started; also running backfills at T1's to test the new production tool.
      • GGUS:67573 with ASGC: the CASTOR overload was tracked to a pileup sample not replicated to other pools.
        • We hope to get the current work finished by Friday; then the site has a green light to upgrade CASTOR asap. Anyhow, please confirm with CMS DataOps before starting the upgrade.
      • CMS production jobs could not match the FNAL CEs because the CNAF BDII was not listing them. Investigations still ongoing, possibly something to change on the CNAF side. GGUS:67826 SAV:119300 (see the BDII query sketch after this report). [ Jon - we have received a GGUS ticket regarding this. We believe it is not correct for FNAL or any OSG site to register in GOCDB. Stefano - being in GOCDB is not needed - US T2s are not in GOCDB but are in the BDII. The CEs are in the CERN BDII but not in the CNAF BDII. Jon - I am going to record in the ticket that we at FNAL have no relevant actions to take. Rob - my understanding is the same, OSG sites should register with OIM and not GOCDB. Stefano - the ticket was opened by a support person at CNAF and we will disseminate the information. ]
    • Tier-2's
      • little production yet, waiting for new requests. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte
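
    For reference, a minimal sketch of how the publication of the FNAL CEs can be compared between two top-level BDIIs, assuming the Glue 1.3 schema and the standard BDII port 2170 (the hostnames below are illustrative, not necessarily the exact endpoints in the ticket):

      for bdii in lcg-bdii.cern.ch egee-bdii.cnaf.infn.it; do
          echo "== $bdii =="
          ldapsearch -x -LLL -H "ldap://${bdii}:2170" -b o=grid \
              '(&(objectClass=GlueCE)(GlueCEUniqueID=*fnal.gov*))' GlueCEUniqueID
      done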

  • ALICE reports -
    • T0 site
      • Nothing to report - the migration of the xrootd servers to SLC5 is still not done. It was due today at 10:30, but some new configuration was desired, and as the machines are Quattor-managed this is not desirable. Hopefully tomorrow at 09:30.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports - MC production running smoothly.
    • Issues at the sites and services
      • Problem of truncated parameters when submitting to a CREAM CE after the upgrade to 1.6.4 (BUG:78565). A patch was rolled out quickly and installed on a couple of problematic CREAM CEs at GridKA and CERN. LHCb confirms the problem has gone.
    • T0
      • Recovered normal operations after the downtime of all DIRAC services yesterday.
      • There is an ever more evident problem with AFS in general. For all users it is almost impossible to work on lxplus, irrespective of user location. Users report long times to execute basic AFS commands like ls or any tab-completion commands. A SNOW ticket was opened a couple of days ago (see the diagnostic sketch after this report). [ Uli - under investigation ]
    • T1
      • NTR
    • T2 site issues:
      • NTR
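
    For reference, a few client-side checks a shifter can run on an lxplus node to see whether the slowness is on the AFS servers or the local client (purely illustrative):

      fs checkservers              # lists any AFS file servers the client believes are down
      fs getcacheparms             # local AFS cache size and usage
      time ls /afs/cern.ch/user/   # crude latency probe against the CERN cell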

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • PIC - ntr
  • KIT - ntr
  • CNAF - ntr
  • RAL - ntr
  • NDGF - some dCache pools crashed around lunchtime - working on getting them back up. Simone - a thread with NDGF about the stager id for bringonline not being filled has been tracked to a problem on the ATLAS side. Sorry for the noise.
  • ASGC - CASTOR upgrade: will inform both ATLAS and CMS before we do it. Setting up a testbed to test the upgrade so that things go smoothly.
  • IN2P3 - will terminate the downtime for HPSS a bit earlier than planned. In fact HPSS is up, but the connection to dCache is not yet completely tested and will be up sometime this afternoon.
  • NL-T1 - ntr
  • GridPP - ntr
  • OSG - ntr

  • CERN DB - reboot of one of the CMS nodes; the cause is understood: excessive XX consumption by one application. Workaround at the DB level + working with the developers to avoid this in future. Also a reboot of one node of the ATLAS offline DB, due to high load from COOL reader accounts. After the reboot the node was added back to the clusterware. From the users' point of view all was OK, but the Streams processes had problems accessing the local file system, so the node had to be removed again. Local file system corruption. Now the node is back again.

AOB:

  • GGUS - monthly release this morning. Alarm tests have started. Presentation tomorrow at the T1SCM on GGUS issues, if any, plus progress on the GGUS-SNOW interface. The team-to-alarm upgrade put in operation last month had a bug concerning alarm email notification; this is fixed in the new release. It would be good to have feedback from alarmers if any issues remain.

Thursday

Attendance: local(Lola, Mike, Alessandro, Uli, Andrea, Jamie, Stefan, Jan, Jacek, Maarten, MariaDZ);remote(Rolf, Stefano, Thomas, John, Xavier, Gonzalo, Jon, Daniele, Kyle, Michael).

Experiments round table:

  • ATLAS reports -
    • ATLAS general info, Central services
      • project tag: data11_7TeV (we might get parasitic collisions)
    • T0
      • CERN-PROD GGUS:67856 (2 HI files not accessible, now fixed, ticket closed)
      • Have the new "features" introduced in SRM-ATLAS to 2.10 understood? [ Jan - from Tier0 we don't have a patched release of SRM; Current version has 2 issues; triggering an Oracle error much more than before and also copying files into default pool which shouldn't happen. The latter is understood and could be fixed but no release for the moment. For first problem no good work-around. Could restart SRM e.g. every 20-30'. Regular restarts would help but would have background noise of refused connections. Problem was observed on testbed even with old release. Only see it in production with current release and ATLAS load. Roll back possible but would include DB rollback (this means changes to schema, not rollback of Oracle version). Not guaranteed to be a fix. DB update was one month ago but SRM upgrade one week ago. Propose workaround now and as soon as new code to address disk-to-disk copies deploy as well. Would seriously impact data taking / distribution. Jan will summarize and send around a mail ]
    • T1
      • IN2P3-CC HPSS mass storage downtime ended, tape endpoints whitelisted in DDM.

  • CMS reports -
    • CMS / CERN / central services
      • Detector/DAQ OK getting beam splashes
      • GGUS:67719 inconsistency in top BDII: being followed up, not resolved.
    • Tier-0
      • ready for data
    • Tier-1's
      • Pre-production with CMSSW 3_11 has started; also running backfills at T1's to test the new production tool.
      • CMS production jobs could not match the FNAL CEs because the CNAF BDII was not listing them. GGUS:67826 SAV:119300. Understood: a BDII patch released in January was only scheduled for installation tomorrow. The patch was not flagged as high priority. [ Maarten - will follow up ]
    • Tier-2's
      • little production yet, waiting for new requests. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • The migration to SLC5 of the first xrootd server was completed successfully this morning. We will proceed with the other 4 machines during the afternoon and tomorrow.
    • T1 sites
      • IN2P3: AFS software area problem last night; it was quickly solved. The read-only and read-write volumes were out of sync (see the sketch after this report). We need to modify an AliEn module to solve this problem - will be done in the coming days. Grateful for the assistance from IT-CF and IT-DSS in the upgrade.
    • T2 sites
      • Usual operations
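
    For reference, the usual way to re-synchronise AFS read-only replicas with their read-write volume is a "vos release"; a minimal sketch with a placeholder volume name (not the real IN2P3 volume):

      vos examine q.alice.sw    # shows the RW volume, its RO sites and their update dates
      vos release q.alice.sw    # pushes the current RW contents to all RO replicas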

  • LHCb reports - MC production running smoothly.
    • T0
      • SNOW ticket on AFS slowness open at INC014928. Currently with Steve Traylen.
      • CREAM issues - [ Uli - the patch from the CREAM developers was rolled out on all CREAM CEs at CERN, and LHCb confirmed that this works. One problem dating back to the beginning of the week is also solved. The related ticket was set to unsolved in GGUS by LHCb; if there is a new problem, please create a new ticket ]
    • T1
      • Some SAM tests failing at Tier-1 sites - following up
    • T2 site issues:
      • NTR

Sites / Services round table:

  • OSG - identified the reason that downtime records didn't make it from OSG to SAM in mid-January: the description was too long for the DB. Working on a way to truncate the records so that they are sent correctly.
  • CNAF - ntr
  • PIC - ntr
  • KIT - ntr
  • RAL - ntr
  • NDGF - ntr
  • IN2P3 - ntr
  • BNL - ntr
  • FNAL - ntr

  • CERN - would like to check with ATLAS the current usage of the CREAM CEs. ATLAS doesn't use the upgraded CEs - some problems with the Panda configuration. More than one thousand jobs are running on a CREAM CE which is out of production(!). Ale - will raise the point with the Panda experts.

AOB: (MariaDZ) Please observe the https://gus.fzk.de/pages/didyouknow.php entry this month. These are reminders of useful tips changing at every GGUS release, i.e. monthly.

Friday

Attendance: local(Uli, Lola, Simone, Maarten, Stefan, Jamie, Giuseppe, Mike, Alessandro, Eva);remote(Michael, Jon, Gonzalo, Alexander, Lorenzo, Rolf, Kyle, Xavier, Gareth, Stefano, Suijan, Christian).

Some changes may have been lost due to TWiki trouble

Experiments round table:

  • ATLAS reports -
    • ATLAS general info
      • project tag: data11_7TeV (no stable beam foreseen, but possible parasitic collisions)
    • T0
      • CERN storage: yesterday at around 16:30 ATLAS agreed with CERN-IT-DSS on a downgrade from SRM 2.10 (+ a new CASTOR lib) to SRM 2.9-4 (+ the old CASTOR lib). The rollback did not work, so we stayed with 2.10 until this morning. Around 11:00 a new patch to SRM 2.10 was put in place to fix the issues (SRM 2.10-1).
    • T1
      • PIC storage failing transfers: GGUS:67907
      • A GGUS ticket was wrongly assigned to BNL; the ticket has been closed. This was about functional tests for channel commissioning, difficult for shifters to spot. [ Simone - issue importing into RAL from any other Tier-1. The rate per file is ~2 MB/s - about one order of magnitude less than expected. RAL was contacted; they are re-evaluating the setup of the pools after the CASTOR migration. On the same subject, sites ask more and more whether there is a procedure to debug links, and whose responsibility it is to debug links in general. ] [ Michael - comment on the problem mentioned by Ale about UK sites, e.g. Birmingham and Manchester to BNL. Since BNL is on the receiving side we can view the FTS log info. We found that the data transfer phase starts and halts after about 50 KB of data transferred - always the same pattern: the transfer stalls after the first few packets. This seems true in all cases. We may have to get the network providers involved again! The problem is limited to Tier-2 sites - the path is different. ] (A manual-transfer sketch follows below.)
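
    For reference, a hedged sketch of how such a stalling transfer can be reproduced by hand outside FTS with globus-url-copy (the GridFTP URLs are placeholders, not the actual endpoints involved):

      SRC=gsiftp://se.example-t2.ac.uk/dpm/example-t2.ac.uk/home/atlas/testfile
      DST=gsiftp://gridftp.example-t1.gov/pnfs/example-t1.gov/atlas/scratch/testfile
      globus-url-copy -vb -dbg -p 1 "$SRC" "$DST"
      # -vb prints the running byte count and rate: if it stops around ~50 KB the
      # pattern seen in the FTS logs is reproduced; -dbg shows the control-channel
      # dialogue, which helps separate a data-channel/network problem from an
      # SRM or GridFTP server problem.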

  • CMS reports -
    • CMS / CERN / central services
      • Detector/DAQ OK getting beam splashes
      • GGUS:67719 inconsistency in top BDII: being followed up, not closed yet.
      • Problems with CASTOR for the Tier-0 since early this morning, blocking P5-to-T0 transfers
        • GGUS:67891 opened at 9:50 as TEAM, escalated to ALARM at 10:36, no reply as of 13:15
        • GGUS:67901 opened as TEAM at 10:02, no reply
      • problem with Castor affecting HeavyIon processing at T0 since yesterday
    • Tier-0
      • ready for data
    • Tier-1's
      • Pre-production with CMSSW 3_11 has started; also running backfills at T1's to test the new production tool.
      • GGUS:67826 SAV:119300. The CNAF BDII was missing the needed patch. Now understood; the GGUS ticket can be closed.
    • Tier-2's
      • A serious problem was reported for CREAM / SGE: half of the jobs are reported as aborted although they were successful. T2_ES_IFCA decided to take CREAM out of production. The priority was raised in Savannah, but SGE support appears understaffed. We may ask WLCG to support the LCG-CE until this is solved.
      • little production yet, waiting for new requests. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte

  • ALICE reports - General Information: During the previous couple of weeks pass0 and pass2 reconstruction has been ongoing. The T0 and T1s were reserved for running only reconstruction jobs.
    • T0 site
      • Migration to SLC 5 of the xrootd servers was completed this morning successfully. All the machines are back in production up and running.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • MC production running smoothly.
    • Certification for the next Dirac release is ongoing
    • The failing SAM jobs yesterday were due to a misconfiguration of Dirac which has been fixed now.
    • T0
      • Snow ticket on slow interactive usage of lxplus nodes open at INC014928.
    • T1
      • NTR
    • T2 site issues:
      • Some T2 sites are running CREAM CE 1.6.4 and LHCb jobs fail because of BUG:78565. [ Maarten - the official mechanism is via EGI broadcasts; after that, open GGUS tickets to the sites ]

Sites / Services round table:

  • BNL - nta
  • FNAL - ntr
  • PIC - had an SRM overload issue this morning from 01:00 to 02:00, caused by an internal process related to the backup of some internal databases. All now fixed; transfers are fine.
  • NL-T1 - ntr
  • CNAF - ntr
  • KIT - plan on updating CREAM servers on Monday one by one. Did not decide on exact time - will be announced later today
  • RAL - nta
  • IN2P3 - have an operational problem with jobs from one ATLAS user. As the Tier-1 and Tier-2 share the same WNs, we had to limit the number of jobs - the user is asking for 4 GB of memory, which is above the limit, and hence the jobs get killed. If several such jobs run, this goes beyond the physical limit, which induces crashes: 17K jobs, 13K of which have crashed. The local ATLAS person has contacted the ATLAS user but there is no response yet. Ale - who contacted cloud support? Rolf - going through the pilot jobs; will give the name offline. Ale - we are aware of the issue of HI reprocessing jobs which need > 4 GB of memory; no solution for the moment. NL-T1 limited the number of concurrent jobs as a temporary patch. Rolf - we can do the same but this will limit total capacity. (A memory-cap sketch follows the CERN report below.)
  • ASGC - ntr
  • NDGF - ntr
  • OSG - ntr

  • CERN - slow response on lxplus and batch is still being investigated; keep an eye on the CERN status board. CE - still many ATLAS jobs on the out-of-service CE. Simone - a week ago asked about the decommissioning of LFC-ATLAS-CENTRAL. Confirmed that this can be retired: the front-end can be stopped and the DB one week later. Open a SNOW ticket!
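
    For reference, a minimal sketch of how a job wrapper can cap a job's virtual memory so that a single oversized job is killed rather than destabilising a shared worker node (the 4 GB figure mirrors the limit discussed above; the wrapper and payload names are hypothetical):

      #!/bin/bash
      ulimit -S -v $((4 * 1024 * 1024))   # soft cap on virtual memory, in kB (= 4 GiB)
      exec ./run_payload.sh "$@"          # placeholder for the real payload command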

AOB:

-- JamieShiers - 17-Feb-2011
