Week of 100830

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Peter Love, Maria, Jamie, Ewan, Zsolt, Luca, Harry, Maarten, Eddie, Lola, Marie-Christine, Andrea, Ignacio, Dirk);remote(Michael, Vera, Gonzalo, Davide, Angela, Kyle, Vladimir Romanovsky, Catalin, Ron).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • SARA Oracle outage GGUS:61265 (need a plan to get NL online) - based on the GGUS update, a meeting needs to be organized to prepare a plan to get the NL-T1 cloud back online
      • INFN-BNL network problem GGUS:61440
      • FZK-NDGF transfer issue GGUS:60437 (test pending)
      • RAL-NDGF functional test GGUS:61306 (FT transfers continue to fail)
    • Aug 28 - Aug 30 (Sat - Mon)
      • IT cloud offline for production due to bad task (ATLAS issue)
      • All quiet

  • CMS reports -
    • Experiment activity
      • Physics over the weekend until this morning; high data rate into stream A (>400 Hz)
    • Central infrastructure
      • T0Mon sometimes hung for more than 1 h (esp. Tier1Scheduler, sometimes stuck for 2 to 3 hours)
    • Tier1 issues
      • SAM-SRM tests failing on Sunday at T1-ES-PIC for almost 3 hours; see https://savannah.cern.ch/support/?116471. The failures disappeared without any action having to be taken; maybe a CMS SAM problem. Further analysis showed that lcg-cp timeouts seem to have affected several T1s during the weekend and last week. It could be that the UI we use at CERN for running those tests has a problem (a hedged timeout probe is sketched after this report). [ Andrea - looked at the SAM tests and didn't see any correlation with problems at Tier1s. If there were a UI problem it would also be visible at Tier2s. ]
    • Tier2 Issues
    • MC production
      • Very large (1000M) MC production ongoing at T1s and T2s; will start the reprocessing with CMS_SW 38XX soon
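
The UI/timeout hypothesis in the Tier1 item above can be cross-checked directly from the suspect UI. Below is a minimal, hedged probe sketch (not the actual SAM test): the SURL and destination are placeholders to be replaced by a real test file, and it assumes lcg-cp and a valid grid proxy are available on the UI.

```python
#!/usr/bin/env python
"""Rough probe for lcg-cp timeouts from a given UI (illustrative sketch).

Assumptions (not from the minutes): lcg-cp is installed and a valid grid
proxy exists; SURL points to a real test file at the site under study.
"""
import subprocess
import time

SURL = "srm://srm.example.org/dpm/example.org/home/cms/testfile"  # placeholder
DEST = "file:///tmp/lcg-cp-probe"
TIMEOUT_S = 300  # give up after 5 minutes

start = time.time()
try:
    proc = subprocess.run(["lcg-cp", "-v", SURL, DEST],
                          capture_output=True, text=True, timeout=TIMEOUT_S)
    status = "OK" if proc.returncode == 0 else "FAILED rc=%d" % proc.returncode
except subprocess.TimeoutExpired:
    status = "TIMED OUT after %ds" % TIMEOUT_S
print("%s in %.1fs" % (status, time.time() - start))
```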

  • ALICE reports - GENERAL INFORMATION: Pass1 reconstruction and 3 MC cycles ongoing
    • T0 site
      • Nothing to report.
    • T1 sites
      • SARA: issue with the tape storage element. It reports "unequal file sizes" between the copied file and the local one, but the error message itself shows that the two file sizes are identical
    • T2 sites
      • Usual operation issues

  • LHCb reports -
    • NOTE: Very intense activity in WLCG these days, with backlogs of data to recover and with reprocessing+merging of old data to be delivered soon to our users. More explanation is given in the corresponding entry of the LHCb elog (see full report). The observed poor performance of disk servers/SRMs at T1s is due to the exceptional amount of merging jobs and the associated merged-data transfers (1 GB/s sustained) across T1s.
    • T0 site issues: none
    • T1 site issues:
      • RAL: Disk server was down, solved after ~12 hours (GGUS:61625 from RAL)
      • CNAF: All jobs aborted at two CEs GGUS:61633, removed from production
      • GRIDKA: All jobs aborted at two CREAM CEs GGUS:61636, removed from production
      • IN2P3: SRM was down GGUS:61634, solved 12 hours later with "The SRM was restarted"
      • CNAF: Problem with the 3D replication GGUS:61646

Sites / Services round table:

  • NL-T1 - one issue at NIKHEF: a few LHCb pilot jobs were killed because they filled up the scratch filesystem. SARA: installed one node with a normal filesystem (no ASM) and will try to do a restore on that node. Peter - how long will it take? A: currently in progress. Peter - if it doesn't work we'd like to schedule a phone call to understand the options (possibly Wednesday morning, as many people are away in Grenoble).
  • BNL - ntr
  • NDGF - ntr
  • PIC - ntr
  • CNAF - as LHCb said a couple of issues - looking into them. One related to CEs, other LFC, expect to be fixed this afternoon.
  • KIT - ntr
  • FNAL - ntr
  • OSG - ntr

  • CERN DB - around 03:00 a message that Streams replication to CNAF (LFC & conditions) went down. This afternoon propagation could in principle be restarted for LFC but not for conditions. Davide - a DBA is looking into the problem.

  • CERN storage - transparent upgrade of CASTOR ATLAS. Sorting out some minor monitoring issues, probably not related to the update.

AOB:

  • Problem with the gLite 3.2 VOMS server: released recently, but it generates VOMS proxies that are not understood by WMS nodes. A formal broadcast is expected shortly.

Tuesday:

Attendance: local(Stephane, Jan, Alessandro, Nilo, Maria, Maria, Ewan, Edward, Jamie, Marie-Christine, Patricia, Harry, Luca, Flavia, Farida, Maarten, MariaDZ);remote(Angela, Catalin, Alessandro, Vera, Kyle, Michael, Gonzalo, Rolf, Jeremy, Vladimir).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • SARA oracle outage GGUS:61265 (Need plan to get NL online)
      • INFN-BNL network problem GGUS:61440 [ Michael - in the process of talking to ESNET. Had a coordination meeting yesterday. Their engineers will look at the peering connection to Geant2 today, so we hope for more news tomorrow. ]
      • RAL-NDGF functional test GGUS:61306 (FT transfers continue to fail) [ Vera - RAL on holiday so don't expect much on their side ]
      • FZK-NDGF transfer issue GGUS:60437 (SOLVED and VERIFIED the 31st)
    • Aug 31 (Tue)
      • Nothing to report except the ongoing issues (on top)

  • CMS reports -
    • Experiment activity
      • Nothing to report
    • Central infrastructure
      • T0Mon sometimes hung for more than 1 h (esp. Tier1Scheduler, sometimes stuck for 2 to 3 hours)
    • Tier1 issues
      • SAM-SRM test failures showing up sometimes at T1-ES-PIC again, see old ticket https://savannah.cern.ch/support/?116471 [ Gonzalo - did some investigation this morning and saw that the sporadic failures were always on the same pool. It was rebooted around 10:00 - 11:00 and this is expected to have solved the problem. ]
      • T1_TW_ASGC coming up after scheduled downtime; not visible yet and shows SAM_SRM errors [ site downtime extended until 16:00 ]
    • Tier2 Issues
      • T2-BE-UCL back to normal.
    • MC production
      • Very large (1000M) MC production ongoing at T1s and T2s; will start the reprocessing with CMS_SW 38XX soon
    • AOB
      • request from a T3 to become a T2 ongoing (US Vanderbilt)
      • Next CRC on duty Sep 1-7: Ian Fisk.

  • ALICE reports - GENERAL INFORMATION: Usual Pass1 reconstruction activities at the T0 plus 4 MC cycles. Low rate of raw data transfers in the last 24h (target sites: IN2P3 and CNAF)
    • T0 site
      • CASTOR operation foreseen for today currently ongoing
      • GGUS:61667 submitted yesterday afternoon (one of the ALICE CAF nodes was unreachable). SOLVED
      • GGUS:61691 submitted this morning (ce202 refusing any connection at submission time) [ Ewan - CE problem being looked at. Looks like a Tomcat problem. There is a cron job but it is not catching everything. Maarten - this node would ideally have been put in scheduled downtime. Can you deal with one node being in unscheduled downtime? Ulrich upgraded CE201 whilst on holiday. A fix for this particular issue has been released and it will be applied when Ulrich is back. ]
    • T1 sites
      • All T1 sites currently in production
    • T2 sites
      • Performing the usual operations for three new sites entering the ALICE production

  • LHCb reports -
    • Reconstruction, Monte-Carlo jobs and high user activity.
    • T0 site issues:
      • CERN: none
    • T1 site issues:
      • RAL: Disk server was down, solved after ~12 hours (GGUS:61625 from RAL)
      • CNAF: All jobs aborted at two CEs GGUS:61633, removed from production
      • GRIDKA: All jobs aborted at two CREAM CEs GGUS:61636, removed from production
      • IN2P3: SRM was down GGUS:61634, solved 12 hours later with "The SRM was restarted"
      • CNAF: Problem with the 3D replication GGUS:61646 [ Alessandro - replication solved, one instance of server up and running. Admins working to get other(s) back ]

Sites / Services round table:

  • KIT - one of our LCG CEs went down overnight due to a h/w failure. Restarted on new h/w. Down for 3 hours.
  • FNAL - ntr
  • CNAF - ntr
  • NDGF - ntr
  • BNL - ntr
  • PIC - ntr
  • IN2P3 - pre-announcement of a major outage on 21 - 22 Sep for maintenance purposes: maintenance on robotics, electrical and cooling, Oracle, dCache, iRODS, SRB, etc. Job submission will start to be closed on the evening of 19 Sep to account for very long jobs. Hope to restart on 23 Sep.
  • NL-T1 - a few issues. DB issue - last night created a filesystem on one of the DB nodes to bypass Oracle ASM. The restore there also failed due to data corruption. Now doing a full restore on a completely different node; in progress, assumed finished this evening. This is to verify that we do have a correct backup. There are suspicions about the storage h/w on which the DB resided. In parallel the old DB h/w has been power cycled and everything is being re-created from scratch to see if this helps. Just now had a phone conf with Eric from CERN, who gave some tips, e.g. going to do a full restore from an older full backup copy directly to CERN, again to see if we have a somehow correct copy of the DB. Other issues: a power dip at NIKHEF this morning caused part of the compute farm to crash. A number of LHCb jobs took >16GB of memory each and put >130GB of data into /tmp; NIKHEF is contemplating killing such jobs! [ Stephane - this is very important for ATLAS and very useful if we have help from CERN. Will follow up with the SARA people ]
  • OSG - the operations centre is wondering if it could get a copy of the BDII change log (email Lawrence and Ricardo), to see if there is anything that they should apply. [ Maarten - the latest top-level BDII version in gLite 3.2 is 5.09; toward the end of Sep the next major release of gLite 3.2 is expected. ]
  • GridPP - ntr but confirm RAL closed for site holiday

  • CERN DB - today the memory on the CMS online DB was upgraded to 32 GB and the services were restarted with the larger memory. CMS PVSS replication online to offline stopped due to a user error - an added column not supported by Streams. One node crashed due to high load (ATLAS) with Panda. Should be fixed today.

  • CERN Storage - yesterday CMS reported missing files in T0export; a half-dead disk server was rebooted and is now OK. Looking to change the model so as not to depend on a single disk server. CASTOR ALICE and CMS upgrades this morning; gridftp checksums back on. CASTOR LHCb and NS upgrades (transparent) tomorrow.

AOB:

  • Dashboards - some performance problems with DB of CMS dashboards. Now back to normal thanks to help of DBA
  • Frontier servers are now monitored through the SLS service at CERN. sls-monitor.cern.ch should be allowed to make queries to the sites' Frontier servers. This concerns all Tier1s (a hedged reachability check is sketched below).
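
A simple way for a Tier-1 to verify the firewall opening requested above is to check that its Frontier server answers HTTP from the monitoring side. A minimal sketch; the host, port and servlet path below are placeholders/assumptions, not values from these minutes.

```python
#!/usr/bin/env python
"""Check that a Frontier server answers HTTP (illustrative sketch).

HOST, PORT and PATH are placeholders/assumptions, not values from the
minutes; substitute the site's own Frontier launchpad URL and run the
check from (or on behalf of) sls-monitor.cern.ch.
"""
import urllib.error
import urllib.request

HOST = "frontier.example-t1.org"  # placeholder Tier-1 Frontier host
PORT = 8000                       # commonly used Frontier port (assumption)
PATH = "/Frontier"                # servlet path (assumption)

url = "http://%s:%d%s" % (HOST, PORT, PATH)
try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print("reachable: HTTP %d from %s" % (resp.status, url))
except urllib.error.HTTPError as exc:
    # The server answered, just not with a 2xx; connectivity itself is fine.
    print("reachable (HTTP %d): %s" % (exc.code, url))
except Exception as exc:
    print("NOT reachable: %s (%s)" % (url, exc))
```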

Wednesday

Attendance: local(Renato, Ignacio, Maria, Alessandro, Jamie, Patricia, Luca, Eddie, Steve, Harry, Simone, Stephane, Kors, Jean-Philippe, Maarten, Jan, MariaDZ, Andrea, Edoardo, Roberto);remote(Xavier, Michael, Gonzalo, Ron, Kyle, Vera, Rolf, Tiju, Alessandro, Catalin, IanF).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
    • Sep 1 (Wed)
      • IN2P3-CC SRM failures with error message "NO_SPACE_LEFT", but lcg-stmd cmd reports ~9TB free. GGUS:61712 [ Rolf - will follow up ]
      • TAIWAN-LCG2 extended their downtime (restoring castor stager db).
        • General note about GOCDB and OIM downtime declarations: if a site declares a downtime (scheduled or unscheduled makes no difference) as "outage", then the ATLAS tools will automatically exclude the affected services. If instead a site declares "at risk", no action is taken (this rule is sketched below).
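
The downtime-handling rule just described can be summarised in a few lines of logic. A minimal illustrative sketch (not the actual ATLAS tool code):

```python
# Illustrative sketch of the rule above, not the actual ATLAS tool code:
# an "outage" downtime (scheduled or unscheduled) leads to automatic
# exclusion of the affected services; an "at risk" leads to no action.

def action_for_downtime(severity):
    """Map a GOCDB/OIM downtime severity string to the action taken."""
    sev = severity.strip().lower().replace("_", " ")
    if sev == "outage":
        return "automatically exclude the affected services"
    if sev == "at risk":
        return "no action taken"
    return "unrecognised severity: no automatic action"

if __name__ == "__main__":
    for sev in ("OUTAGE", "AT_RISK"):
        print(sev, "->", action_for_downtime(sev))
```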

  • CMS reports -
    • Experiment activity
      • Nothing to report (the CASTOR intervention was indeed transparent to CMS).
    • Central infrastructure
      • Upgrading Tier-0 components and moving to next release of CMS Software. CMS will change the run "epoch" coming out of the technical stop. New period will be called Run2010B.
    • Tier1 issues
      • New tape families needed for new run period. IN2P3 and FNAL have already responded.
    • Tier2 Issues
      • One of the samples needed by the SAM tests was accidentally cleaned out of several sites. We're putting it back, but some sites may have lower availability as this gets back to normal
    • MC production
      • Very large (1000M) MC production ongoing at T1s and T2s; will start the reprocessing with CMS_SW 38XX soon
    • AOB
      • request from a T3 to become a T2 ongoing (US Vanderbilt)

  • ALICE reports - GENERAL INFORMATION: Massive MC production still ongoing with 4 MC cycles
    • T0 site
      • Intervention on the CASTOR name server successfully completed this morning
      • GGUS:61691 submitted yesterday morning (ce202 refusing any connection at submission time) SOLVED. System back in production
    • T1 sites
      • Intermittent instabilities found at RAL this morning with one of the CREAM-CE systems: lcgce01.gridpp.rl.ac.uk. Wrong information provided by the system in terms of scheduled jobs.
      • Still some issues with the SE results published by ML for NDGF and SARA (dCache). The origin of the problem lies in the setup of the envelope for this system in AliEn. Experts are working on it
    • T2 sites
      • Setup of three new sites into the Alice production still ongoing

  • LHCb reports -
    • Reconstruction, Monte-Carlo jobs and high user activity.
    • LFC: observed some performance degradation (as originally reported in ticket GGUS:61551). On the DIRAC side all suggestions and improvements from the LFC developers have been put in place, but it seems there is no way to improve the situation: closing a session when the server is overloaded simply does not work (a sketch of the session calls involved follows this report). The information from SLS does not always reflect the real situation.
    • PIC and CNAF report a critical SRM SAM test failing because the USER space token is full. They provide the pledged capacity for this space token, so the availability of these sites should not be affected by this failure. The SAM tests check neither whether the space is full nor whether the site provides the pledged capacity.
    • T0 site issues:
      • CERN:
        • (LFC degradation - GGUS:61551).
        • The intervention on the CASTOR name server has finished OK this morning (upgrade to 2.1.9-8).
    • T1 site issues:
      • RAL: Due to the limited number of connections to the disk servers, almost all activities (FTS jobs, data upload, read access) are failing with timeouts (waiting for a slot to be freed). It is fine to protect disk servers, but we cannot survive like that. [ Tiju - having a meeting to discuss disk server job slot limits ]
      • SARA/NIKHEF: user jobs filling up the disk space on the WNs. Chasing this up with the culprits. Apart from that, access to the conditions DB is still problematic - using other T1s' databases instead.
      • CNAF: 3D replication problem: what is the status? (GGUS:61646) [ Davide - replication problems should be solved by now ]
      • GRIDKA: CREAM CE problem (GGUS:61636) still under investigation. Requested the GridKA people to indicate which endpoint should be re-enabled in the LHCb production mask to verify the solution.
      • PIC: many users report timeouts when setting up jobs; this is a typical issue related to the shared area. Confirmed that the NFS server was under heavy load due to some ATLAS activity.
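
Regarding the LFC degradation discussed in the LHCb report above, the pattern in question is the LFC session API used by DIRAC-style clients. A minimal sketch, assuming the LFC Python bindings (import lfc) are installed and mirror the C API calls lfc_startsess/lfc_access/lfc_endsess; the server name and path are placeholders.

```python
#!/usr/bin/env python
"""Illustrative sketch of the LFC session pattern discussed above.

Assumptions (not from the minutes): the LFC Python bindings ("import lfc")
are installed and mirror the C API; the server name and test path are
placeholders. The point of interest is that lfc_endsess() is how a client
closes a session, and the report notes this does not help once the server
is already overloaded.
"""
import os
import lfc

LFC_HOST = "lfc-lhcb.cern.ch"           # placeholder server name
TEST_PATH = "/grid/lhcb/some/test/dir"  # placeholder LFC path

# Open a session so that several catalogue calls reuse one connection.
if lfc.lfc_startsess(LFC_HOST, "session probe") != 0:
    raise SystemExit("lfc_startsess failed")

try:
    # A trivial catalogue operation performed inside the session.
    rc = lfc.lfc_access(TEST_PATH, os.F_OK)
    print("lfc_access returned", rc)
finally:
    # Close the session; under heavy server load even this can stall.
    lfc.lfc_endsess()
```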

Sites / Services round table:

  • NL-T1 - finally some good news. Yesterday restored the Oracle DBs on new h/w. Now gradually putting things back into production: the ATLAS LFC is up, FTS is up, the DB people are working with the 3D people to get Streams going again and get the 3D DBs up to speed. Continuing to investigate the root cause of the problem. At NIKHEF people are rebooting WNs with new kernels - rolling, done when a WN is idle. [ Simone - have to try to understand how many entries were "lost" in the LFC, as it was restored up to a point a few hours back. Ron - 08:24 on Wed Aug 18 is the point in time of the restore. Simone - think it was something like 2 hours lost? Ron - yes. SIR requested. ]
  • KIT - ntr
  • BNL - reminder that we have started major network upgrade transparent for ATLAS and so far progressing well
  • PIC - in the last 12h had an overload of the NFS s/w area. It mostly affected LHCb jobs but also others. Situation fine since this morning. The load was caused by ~200 ATLAS jobs that run gmake on the WNs, which puts a big load on the shared area. In contact with people to try to understand the details.
  • NDGF - ntr
  • RAL - issue raised by ATLAS about test failures RAL-NDGF. Discussed with network people but found nothing wrong "RAL-side"
  • IN2P3 - nta
  • CNAF - ntr
  • FNAL - ntr
  • OSG - ntr

  • CERN storage - CASTOR NS transparent intervention; index creation went smoothly. Also a stager upgrade for LHCb, also transparent. This closes the series of upgrades to 2.1.9-8.

  • CERN Grid - CREAM s/w upgrade scheduled for 10:00 tomorrow

  • CERN DB - replication to SARA - will have to use transportable tablespaces to restore. RAL offering to help.

  • Network: BNL-CNAF issues: agreed to do a test over the OPN to see if the problem is still there. It may be the network or the servers.

  • VOMS problem reported yesterday: not due to VOMS but to a component used by the WMS and others, "gridsite", which has a bug that hard-codes the signature algorithm, and that algorithm has changed. The developer has agreed to fix it with high priority - in a few weeks there should be a new release. (gLite 3.2 VOMS proxies cannot be used to do anything with services using gridsite, which includes WMS, FTS, CREAM, DDM, Panda, ...) A hedged way to inspect the signature algorithm of a proxy is sketched below.
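
For reference on the gridsite item above, the signature algorithm of a given proxy can be inspected with the openssl CLI. A minimal sketch; the proxy location follows the usual X509_USER_PROXY / /tmp/x509up_u<uid> convention and is only an assumption.

```python
#!/usr/bin/env python
"""Print the signature algorithm of the current grid proxy (illustrative).

Assumes the openssl CLI is available; the proxy location follows the usual
X509_USER_PROXY / /tmp/x509up_u<uid> convention. Proxies signed with an
algorithm the buggy gridsite versions do not expect will be rejected by
services built on gridsite (WMS, FTS, CREAM, ...).
"""
import os
import subprocess

proxy = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())

# openssl reads the first certificate in the file, i.e. the proxy itself.
out = subprocess.run(["openssl", "x509", "-in", proxy, "-noout", "-text"],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    if "Signature Algorithm" in line:
        print(line.strip())
        break
```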

AOB: (MariaDZ) The problem reported by Vladimir yesterday in updating a GGUS ticket is not related to his privileges as a GGUS TEAM member. It is due to a problem in the NGI-DE interface to GGUS for parsing attachments. This is a recent interface. It entered production with the July GGUS Release of 20100721, see https://savannah.cern.ch/support/?114561. It is documented in: https://gus.fzk.de/pages/ggus-docs/interfaces/29410_Interface_NGI_DE.pdf . Problem being investigated by NGI-DE ticketing system developers.

Thursday

Attendance: local(Jean-Philippe, Maria, Jamie, Maarten, Ignacio, Ewan, Patricia, Eddie, MariaDZ, Harry, Stephane, Jacek);remote(Kyle, Michael, Gonzalo, Angela, Alessandro, Jeremy, Vera, Rolf, Ronald, Catalin, Tiju, IanF, Roberto).

Experiments round table:

  • ATLAS reports -
    • Ongoing issues
      • SARA Oracle outage GGUS:61265 (plan to get NL online decided - see below)
      • INFN-BNL network problem GGUS:61440
      • RAL-NDGF functional test GGUS:61306 (FT transfers continue to fail)
    • Sept 2 (Thu)
      • IN2P3-CC: as observed yesterday, today again SRM failures with error message "NO_SPACE_LEFT", while the lcg-stmd command reports ~7TB free. GGUS:61712 updated, priority changed to very urgent. No news from the site yet (a small lcg-stmd wrapper is sketched after this report).
      • TAIWAN-LCG2 extended (again) their downtime. Any timescale on when the issue will be fixed?
      • NL cloud:
        • dq2-replica-consistency was run on the whole cloud for the datasets whose replicas were created from ~7 days before the LFC issue onwards and that were incomplete. The list of datasets to be eventually resubscribed is ready. It is still to be decided when to subscribe them, most probably after the weekend.
        • Production and Analysis queues are Offline (and will be kept offline for the next hours/days)
        • DDM: the cloud is in 'read-only' mode
        • ADC Ops decided to send Functional Test to SARA-MATRIX_DATADISK, configuration to allow it will be set up this afternoon.
    • ATLAS internal:
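
For the IN2P3-CC item above, a small wrapper can make it easier to quote the lcg-stmd output in the GGUS ticket. This sketch deliberately does not hard-code any lcg-stmd options: whatever arguments (endpoint, space token, ...) the shifter would normally use are passed straight through on the command line.

```python
#!/usr/bin/env python
"""Run lcg-stmd and pull out the size-related lines (illustrative sketch).

Pass exactly the arguments you would normally give lcg-stmd (endpoint,
space token, ...); no options are hard-coded here. The filtered output can
then be quoted in the GGUS ticket next to the SRM "NO_SPACE_LEFT" error.
"""
import subprocess
import sys

cmd = ["lcg-stmd"] + sys.argv[1:]
result = subprocess.run(cmd, capture_output=True, text=True)

print("exit code:", result.returncode)
for line in result.stdout.splitlines():
    # Keep only the lines that mention a size.
    if "size" in line.lower():
        print(line.strip())

if result.returncode != 0:
    sys.stderr.write(result.stderr)
```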

  • CMS reports -
    • Experiment activity
      • Nothing to report
    • Central infrastructure
      • Migration of software on online and offline systems to CMSSW_3_8 is completed. New datasets from testing triggers are working their way through the system
    • Tier1 issues
      • New tape families needed for new run period. Tickets open at ASGC, KIT, PIC, and RAL
      • Some confusion about ASGC being in downtime yesterday.
    • Tier2 Issues
      • Working on commissioning links to 2 of the Russian Tier-2s (PNPI, ITEP) from 3 of the Tier-1s. [ Maarten - recently there was also a GGUS ticket that ended up at CERN because of network problems affecting sandbox transfers. ]
    • MC production
      • Very large (1000M) MC production ongoing at T1s and T2s; will start the reprocessing with CMS_SW 38XX soon (potentially starting tomorrow).

  • ALICE reports - GENERAL INFORMATION: Decreasing the number of MC cycles currently running. Rest of activities ongoing
    • T0 site
      • The number of jobs decreased to almost zero during the night due to a lack of synchronization between the rw and ro AFS volumes. The origin of the problem has been found and is currently being worked on. GGUS:61516: the CREAM DB at CERN was reporting 15K running jobs. The system has been updated but the DB is still reporting a huge number of jobs; this is not a problem at CERN but with CREAM itself, and has been observed at other sites too. ce203 stopped scheduling jobs for several reasons but is now back in production.
    • T1 sites
      • CNAF: Zero jobs running at this site since yesterday at 17:00. Reported to the ALICE contact person at the site: all scheduled jobs (over 500) have been removed, leaving space for new submissions. Site back in production
    • T2 sites
      • Usual operations. No remarkable issues

  • LHCb reports -
    • Merging and data reprocessing. 8K jobs concurrently at T1s (quite impressive). Activities dominated by user analysis.
    • Held a meeting with LFC developers and LHCb to address some strange access patterns observed in the server's logs.
    • T0 site issues:
      • CERN:
        • FAILOVER full. Space managed by lhcb.
    • T1 site issues:
      • RAL: another disk server (gdss473, lhcbmdst service class) developed hardware problems two hours after being put back in production. The list of affected files was provided to the LHCb data manager. [ Tiju - this server is back in production; all limits have been removed ]
      • PIC: the MC_M-DST, M-DST and DST space tokens got full. The pledged capacity is provided; the space tokens have been banned for writing.
      • CNAF: MC_M-DST and MC_DST space tokens getting full.
      • IN2P3: Many pilots aborting against the CREAM CE. GGUS:61766

Sites / Services round table:

  • BNL - ntr. There is a category 4 hurricane that may hit Long Island tomorrow; the site may be affected by the storm.
  • PIC - during morning saw spurious timeouts for ATLAS opening SRM. Experts working on it.
  • KIT - ntr
  • CNAF - ntr
  • NDGF - ntr
  • NL-T1 - work is in progress to restore streams
  • IN2P3 - there might be an explanation for the LHCb pilot job issue: there was an 'at risk' this morning to fix a CREAM CE problem. The ticket was issued during this downtime - will check whether the LHCb jobs were impacted by it.
  • FNAL - ntr
  • RAL - nta
  • OSG - ntr
  • GridPP - ntr

  • CERN DB - in contact with ASGC regarding their problem with CASTOR DB.

AOB: (MariaDZ) Please remember to click, from time to time, on the Did you know...? link on the left banner of all GGUS web pages. There we put tips from functionality that exists and may be forgotten. The content changes every month with the GGUS Release.

Friday

Attendance: local(Maria, Jamie, Przemek, Luca, Eric, Ewan, Zsolt, Maarten, Harry, Patricia, Roberto, Eddie, Jan);remote(Michael, Kyle, Alexander, FNAL, Rolf, Gonzalo, CNAF, Tiju, IanF, Andreas Davour).

Experiments round table:

  • CMS reports -
    • Experiment activity
      • Nothing to report
    • Central infrastructure
      • Testing xroot instead of rfio on the Tier-0 (an access-protocol sketch follows this report) [ Jan - please give us a heads-up when you start this - service class and time frame. Ian - in "playback mode" - run from the streamers themselves. ]
    • Tier1 issues
      • New tape families needed for new run period. Tickets open at ASGC, KIT
      • ASGC - prognosis for coming out of downtime? Does it require a SIR? A:yes
    • Tier2 Issues
      • Nothing to report
    • MC production
      • Very large (1000M) MC production ongoing at T1s and T2s; will start the reprocessing with CMS_SW 38XX soon.
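
On the xroot-vs-rfio test mentioned in the CMS report above, the difference at file-open level can be illustrated with PyROOT. A minimal sketch; the CASTOR path and the xrootd redirector name are placeholders/assumptions, not the actual Tier-0 playback configuration.

```python
#!/usr/bin/env python
"""Open the same CASTOR file via rfio and via xrootd with PyROOT (sketch).

The CASTOR path and the xrootd redirector name are placeholders, not the
real Tier-0 playback configuration; the point is only the change of access
protocol in the URL passed to TFile::Open.
"""
import ROOT

CASTOR_PATH = "/castor/cern.ch/cms/store/some/streamer/file.dat"  # placeholder

urls = [
    "rfio://%s" % CASTOR_PATH,                    # legacy RFIO access
    "root://castorcms.cern.ch/%s" % CASTOR_PATH,  # xrootd access (redirector name is an assumption)
]

for url in urls:
    f = ROOT.TFile.Open(url)
    ok = bool(f) and not f.IsZombie()
    print("%-4s %s" % ("OK" if ok else "FAIL", url))
    if f:
        f.Close()
```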

  • ALICE reports - GENERAL INFORMATION: Preparation of a new AliEn release (middle of October) discussed yesterday during the TF meeting.
    • T0 site
      • Following the advice of the PES experts, ce202 has been taken out of production (status CLOSED) while the experts perform several operations on the system. ce203 is back in production
    • T1 sites
      • CNAF issue reported yesterday: decrease of the number of jobs to zero. Solved by the site manager removing all scheduled agents. Site performing well today
      • SE issue reported this week and affecting NDGF and SARA: bad results published for these two sites in MonALISA. The origin of the problem lay in the setup of the envelope for this system in AliEn. SOLVED.
    • T2 sites
      • No remarkable issues to report

  • LHCb reports -
    • Merging and data reprocessing + user analysis. ~40K jobs run (only) at T1s over the last 24 hours.
    • T0 site issues:
      • CERN:
        • The LHCb_DST space token (pure disk for un-merged files) has been delivered. The pool ACLs still have to be sorted out.
    • T1 site issues:
      • SARA: Oracle server hosting the ConditionDB down. GGUS:61799
      • SARA: for a few days all production jobs have been failing to set up the application environment (timeout) GGUS:61795. The local conditions DB is suspected.
      • RAL: Oracle server hosting the ConditionDB down. GGUS:61800
      • RAL: very high failure rate since the full capacity of their LRMS was restored. We are killing their disk servers: plenty of transfer failures, jobs failing to access data and all activities affected. Ticket GGUS:61798 opened to track this down and to ask the site to throttle the number of job slots to see if it has some positive effect. The plot (see full report) shows the production jobs at RAL in the last 24 hours, with failures dominated by input data resolution. [ Tiju - have reduced the limits to 750 now. Don't see any pending jobs on CASTOR - looks like all jobs are being processed quickly. ]

Sites / Services round table:

  • NL-T1 - we are preparing to sync our DB with the DB from RAL; this will be done on Monday. Regarding GGUS:61795: this was a problem connecting to the conditions DB - the firewall setting has now been repaired.
  • BNL - ntr
  • PIC - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • CNAF - ntr
  • RAL - nta
  • NDGF - ntr
  • OSG - ntr

  • CERN DB - SARA mentioned streams replication to their site. ASGC - DB for CASTOR mostly recovered. Still a lot of work to do after that.

AOB:

-- JamieShiers - 27-Aug-2010
