Week of 110117

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Ricardo, Ignacio, Yuri, Ueda, Eva, Simone, David, Maria, Jamie, Massimo, Luca, MariaDZ, Stefan, Alessandro, Dirk, Lola, Andrea, Maarten, Roberto);remote(Jon, Gonzalo, Xavier, Federico, Stefano, Rolf, Ulf, Ron, Gareth, Suijan, Daniele, Paolo Franchini).

Experiments round table:

  • ATLAS reports -
    • SLS - problem with updating the service status on Friday afternoon and Monday morning
      • apparently when some of the services did not return their status?
    • a user reported that the voms-proxy-init command hung while BNL was in downtime
      • different issue from WLCGDailyMeetingsWeek110110#Monday
      • similar behavior was observed in Nov when BNL VOMS service was not functioning.
      • checking one of our systems we see "Contacting vo.racf.bnl.gov:15003 ... Failed Error: Could not establish authenticated connection with the server. ... Trying next server ...", i.e. having the BNL VOMS down is not in itself a problem. [ Maarten - there is a bug in the voms-proxy-init code that could cause an indefinite hang. It is fixed in the latest 3.2 UI, but ATLAS is using the 3.1 UI; a minimal failover sketch follows the ATLAS report below. ]
    • announcement for a gocdb downtime (object_id=24047) for RAL-LCG2
      • https://next.gocdb.eu/portal/index.php?Page_Type=View_Object&object_id=24047&grid_id=0
      • ANNOUNCEMENT on December 21, 2010 12:25:45 PM GMT+01:00 (Declaration Timestamp 21-DEC-10 10.48.00.000000)
      • no announcement at
          • Start Timestamp 17-JAN-11 08.00.00.000000
          • End Timestamp 18-JAN-11 17.00.00.000000
          • Announce Timestamp 16-JAN-11 08.00.00.000000
      • in general, how can we check whether a certain announcement was sent out when we do not receive it? [ Rolf - GOCDB was down from Friday evening until noon today, so the notification mechanisms of both GOCDB and the operations portal were affected; no notifications were sent at all during this period. The failover version of GOCDB was down too. ]
    • ATLAS distributed computing system downtime for database reorganization on 17-18 Jan
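
The hang Maarten describes above comes from iterating over VOMS servers without a bound on each attempt. Below is a minimal sketch of the intended failover pattern, assuming a simple TCP reachability probe with a per-attempt timeout; the endpoint list, the fallback server and the timeout value are illustrative assumptions, not ATLAS' real configuration or the actual voms-proxy-init code.

    import socket

    # Illustrative endpoint list: the first entry is the server quoted in the
    # log fragment above, the second is an assumed fallback (not real config).
    VOMS_ENDPOINTS = [
        ("vo.racf.bnl.gov", 15003),
        ("lcg-voms.cern.ch", 15001),
    ]

    def first_reachable_endpoint(endpoints, timeout_s=10.0):
        """Return the first endpoint accepting a TCP connection, else None."""
        for host, port in endpoints:
            try:
                # The timeout bounds each attempt, so a wedged server is
                # skipped instead of hanging the whole proxy initialisation.
                with socket.create_connection((host, port), timeout=timeout_s):
                    return (host, port)
            except OSError as exc:
                print("Contacting %s:%d ... Failed (%s). Trying next server ..."
                      % (host, port, exc))
        return None

    if __name__ == "__main__":
        print("Usable VOMS server:", first_reachable_endpoint(VOMS_ENDPOINTS))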

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 ready and waiting for data. DAQ integration to restart on Feb 2, cosmics on Feb 9.
      • SLS monitoring disappeared from Friday afternoon to ~midnight. Helpdesk ticket CT738166. [ From the IT Service Status Board, update Monday 17/01 9:35 am: The problem is again caused by a user service that causes SLS to get stuck while processing the service updates. The service has been removed from SLS and everything should be back to normal. We are investigating how to fix the root cause of the problem. ]

    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on at full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples. Users' analysis also keeps running in the 100K jobs/day zone, though lower than before Christmas.
      • CMS still needs to work out a solution for proper monitoring of sites with CREAM-only CEs. We have started to look at solutions with the dashboard team.
    • AOB
      • CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site: MonALISA had not been reporting on voalice12 since Friday night. On Friday an upgrade to AliEn 2.19.49 was done on the machine and the service configuration file was not updated; the rest of the services ran smoothly during the weekend.
    • T1 sites: CNAF: the Cluster Monitor was not working due to a stuck connection and, as a consequence, neither was the CE. The services had to be restarted this morning and are now performing well.
    • T2 sites: Usual operations

  • LHCb reports -
    • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
    • Today we will restart the stripping (we thought about restarting it Friday, still waiting for a last test result).
    • T0
      • none
    • T1 site issues:
      • IN2P3 : (GGUS:59880). "Historical" problem with shared area. Ticket finally closed.
    • T2 site issues:
      • CBPF : GGUS:66204 : all pilots aborted, host cert expired
      • Glasgow: GGUS:66203 : data uploading problems, investigating

Sites / Services round table:

  • FNAL - ntr
  • PIC - ntr
  • KIT - ntr
  • IN2P3 - nta
  • CNAF - ntr
  • NDGF - software updates ongoing at the moment - still in scheduled downtime - but seems to be going ok
  • NL-T1 - reminder of downtime tomorrow morning to make Oracle RAC DB on old Sun h/w the primary DB again
  • RAL - as referred to in the ATLAS report, there is a scheduled outage for upgrading CASTOR disk servers today and tomorrow; we will then be able to enable checksumming in ATLAS CASTOR. We take it that the problem with the GGUS notifications was simply in the mechanics of the notifications and that this downtime was not a surprise? Can't report much about GOCDB itself - it was a failure of the RAID array behind GOCDB. Ueda - after this downtime can we start using checksums with FTS? Gareth - we need to specifically enable checksums, maybe after a day or so to make sure things run as before. Alessandro - ticket GGUS:66044, in which ATLAS asked all T1s to check that the new robot certificate is properly mapped in the storage and FTS server: if there is any news please update the ticket. It is important that RAL and ASGC verify that the config is ok.
  • ASGC - ntr

  • CERN storage - this morning CASTOR ATLAS was upgraded to 2.1.10, and the stager and SRM databases to Oracle 10.2.0.5.

AOB:

Tuesday:

Attendance: local(David, Roberto, Stefan, Maarten, Simone, Alessandro, Jamie, MariaDZ, Yuri, Luca, Ricardo);remote(Rolf, Ulf, Michael, Francesco, Ronald, Dimitri, Tiju, Suijan, Paolo, Jeremy, Rob, Jon, Gonzalo, Stefano).

Experiments round table:

  • ATLAS reports -
    • Central services
      • GOCDB downtime announcements undelivered: GGUS:66226 and Savannah BUG:77090 (for the database failover mechanism used in GOCDB-4)
      • upgrade of CASTORATLAS to CASTOR 2.1.10-0 and the upgrade of the stager and SRM DBs to Oracle 10.2.0.5 finished yesterday
      • migration of ATLAS Distributed Computing DBs ATLR->ADCR completed yesterday at ~17:20.
      • restarting ADC production services
    • T1s
      • RAL,PIC,SARA downtime
      • BNL,TRIUMF downtime completed.
      • CNAF - BNL network: engineers focussing on European part between Amsterdam and Vienna. Geant intervention underway, expect to know more tomorrow or Thursday

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on at full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples. Users' analysis also keeps running in the 100K jobs/day zone, though lower than before Christmas.
      • About CREAM-only sites: we suggested our sites not remove the LCG-CE until April. Anyhow, a solution in the dashboard has been identified and is being pursued so that monitoring works both in the meanwhile and in case sites do not all switch from LCG-CE to CREAM on the same day.
      • Maarten - by April WLCG sites should have a CREAM CE, as the availability calculations will then - or soon after - use the CREAM CE as "the" production service for workload management; a site without a CREAM CE will be marked as bad. This does not mean that sites cannot continue with an LCG CE afterwards. Once everything is in place for the CREAM CE (FCR, availability calculations etc.) the phase-out of the LCG CE can be discussed - a matter of a few months (or so...)
    • AOB
      • CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • New CAF nodes have to be configured in quattor
    • T1 sites
      • SAM tests: this morning a problem was spotted with the ALICE SAM tests. Despite reporting green in the dashboard, the tests had not been running since 12 January. We found that the proxy under which the tests run had expired. The proxy will not be renewed but replaced by my (i.e. Lola's) proxy. [ David - added Savannah bug BUG:77092 for this to follow up. ]
    • T2 sites
      • Usual operations

  • LHCb reports -
    • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference. [ 25K concurrent jobs ]
    • Today we will restart the stripping in manual mode to confirm that the results are OK before moving to automatic mode.
    • T0
      • To fix the MyProxyServer problem with the CREAM CE: can we change the configuration on AFS for LHCb only, or does this require a new release of the UI? [ Maarten - by default a MyProxy server has to be supplied for a UI. We could have a post-configuration step - automatable - to remove that configuration line for LHCb, and make sure this is done for future versions of the UI. ]
    • T1 site issues:
      • PIC : downtime (the LHCb web portal redirects to the CERN one).
    • T2 site issues:

Sites / Services round table:

  • IN2P3 - pre-announce of downtime Tue Feb 8 - complete outage of DB and long outage of HPSS MSS. Further details will be communicated 1 week before outage - should be in GOCDB already
  • NDGF - will have a storage outage Thursday afternoon to renumber the IPs of the storage pool head nodes. One T2 is currently down due to CMS overload; the only comment in the GGUS ticket is "so many CMS jobs that the whole site died". Working on getting it back. A GOCDB downtime is scheduled for the pool outage.
  • BNL - completed the multi-day intervention in foreseen time window. All services restored last night.
  • CNAF - ntr. Q: FTS checksums - are they compared as numbers or as strings? Clarified after the meeting:
    • According to these fixed bugs, the checksums are stripped of leading spaces and zeroes and then compared case-insensitively as strings; a minimal sketch of this rule follows the round table below.
  • NL-T1 - downtime at SARA has finished. DB for FTS and LFC moved back to RAC h/w that broke in Sep. Single node running as a hot spare.
  • KIT - ntr
  • RAL - intervention progressing to plan. Upgraded O/S and currently running tests.
  • ASGC - ntr
  • FNAL - ntr
  • PIC - currently downtime due to UPS maintenance. Restarting services this afternoon
  • OSG - question about the installed capacity numbers reported and why they do not appear in the monthly reports; OSG numbers have been sent for close to 1 year - why are they not appearing? Michael - they do not appear in the WLCG reliability reports for T2s, otherwise they do show up.
  • GridPP - ntr

  • CERN DB - additional point: CMS online upgraded to 10.2.0.5 in scheduled intervention this morning
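
A minimal sketch of the checksum comparison rule clarified in the CNAF item above: strip leading spaces and zeroes, then compare case-insensitively as strings. It only illustrates the stated rule and is not the actual FTS code.

    def normalize_checksum(value):
        """Strip surrounding spaces and leading zeroes, lower-case the rest."""
        return value.strip().lstrip("0").lower()

    def checksums_match(a, b):
        """Compare two checksum strings according to the rule stated above."""
        return normalize_checksum(a) == normalize_checksum(b)

    # Example: a zero-padded Adler32-style value still matches the bare form,
    # and the comparison is case-insensitive.
    assert checksums_match("0001b2c3d4", "1B2C3D4")
    assert not checksums_match("1b2c3d4", "1b2c3d5")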

AOB: (MariaDZ)

Wednesday

Attendance: local(David, Yuri, Jamie, Massimo, Maarten, Julia, Stefan, Lola, Simone, Andrea, Alessandro);remote(Elizabeth/OSG, Michael, Jon, Gonzalo, Ulf, Rolf, Francesco, Stefano, Paolo, Tiju, 77139, Ronald, Suijan, Dimitri).

Experiments round table:

  • ATLAS reports -
    • T0/CERN, Central services
      • migration of ATLAS Distributed Computing DBs completed, no problems reported.
      • all ADC central production services have been recovered after the downtime.
      • scheduled intervention on the ATONR database today (Oracle update 10.2.0.4 -> 10.2.0.5), downtime 14:00-17:00.
    • T1
      • TRIUMF file transfer failures: "DESTINATION error, SRM Authentication failed". GGUS:66262 submitted at ~20:30, fixed at ~23:30. edg-mkgridmap had accidentally dropped a few DNs from the SRM authentication file; corrected and verified (a simple sanity-check sketch follows the ATLAS report below).
      • RAL-LCG2_SCRATCHDISK file transfer failures. GGUS:66263 filed at ~21:00. Transfers from TRIUMF to RAL failed with the file unavailable at the source. Fixed and verified this morning at ~9:00. Linked to GGUS:66262 (see above).
    • T2
      • MPPMU: ~500 file transfer failures, "SOURCE error: locality is UNAVAILABLE". GGUS:66254 assigned at 17:20, Jan. 9. One pool lost. Blacklisted on the site's request until they can provide the list of lost files. Savannah support ticket 118739 submitted at ~8:30 today.
      • Another T2 site has a problem with its BDII - under investigation
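
For the TRIUMF incident above (edg-mkgridmap accidentally dropping DNs from the SRM authentication file), one possible safeguard is sketched below: diff the freshly generated file against the installed one and refuse to install it if DNs would disappear. It assumes a grid-mapfile-style format with one quoted DN per line; the file paths and policy are hypothetical, not TRIUMF's actual setup.

    import re
    import sys

    # Lines of the form: "/DC=ch/DC=cern/OU=.../CN=..." some_account
    DN_RE = re.compile(r'^"(?P<dn>[^"]+)"')

    def read_dns(path):
        """Collect the set of DNs found in a grid-mapfile-style file."""
        dns = set()
        with open(path) as handle:
            for line in handle:
                match = DN_RE.match(line)
                if match:
                    dns.add(match.group("dn"))
        return dns

    def main(current="/etc/grid-security/grid-mapfile",         # assumed path
             candidate="/etc/grid-security/grid-mapfile.new"):   # assumed path
        removed = read_dns(current) - read_dns(candidate)
        if removed:
            print("Refusing to install: %d DN(s) would disappear:" % len(removed))
            for dn in sorted(removed):
                print("   " + dn)
            return 1
        print("No DNs removed; safe to install.")
        return 0

    if __name__ == "__main__":
        sys.exit(main(*sys.argv[1:3]))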

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • A serious problem with xrootd authentication surfaced for Tier-0: https://cern.ch/helpdesk/problem/CT738571&email=gowdy@cern.ch - it basically prevents us from running. IT reproduced the problem, which appears to be due to the recent K5 server migration; a solution appears to be in sight. We need a working configuration by the end of the week; Steve Gowdy is following up with IT. [ Massimo - the problem is really a Kerberos problem. Some nodes are in pre-production and under some conditions cross submission between new and old nodes creates a problem. An engineer is working on it. COMPASS production is also affected for similar reasons. Simone - should warn users. ]
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on at full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples. Users' analysis also keeps running in the 100K jobs/day zone, though lower than before Christmas.
    • AOB
      • Andrea - at 4 T1s and one T2 jobs abort because no CE is matched at the site, with an impact on site readiness. Is the problem at the sites or elsewhere? Maarten - will investigate
      • CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • voalice13: PackMan got stuck last night (5:05 AM) trying to download an AliRoot package to install. Because of that, the AliEn CE component stopped working while waiting for PackMan to finish the installation. After some cleanup and a restart of the services this morning, the problem was solved.
    • T1 sites
      • IN2P3: since yesterday afternoon they have been suffering from an AFS problem, so the site has been unavailable since last night. The site is back, but still no jobs are running.
    • T2 sites
      • Usual operations

  • LHCb reports -
    • MC jobs running at full steam (30-40K jobs per day).
    • Stripping restarted
    • T0
      • SLS issue: Shibboleth authentication is always requested from outside CERN when accessing the rrdgraph.php pages (used by internal components of the DIRAC portal at PIC). CT738660. Fixed promptly yesterday.
    • T1 site issues:
      • NIKHEF: SAM jobs for the shared area failing occasionally. This is consistent with the 10% failure rate observed for MC jobs at NIKHEF when setting up the environment using CernVM-FS (GGUS:66287). [ Ronald - it is our understanding that this failure rate is only marginally larger than at other sites; is this correct? Stefan - will need to cross-check. There were problems with SetupProject in the past said to be due to timeouts with CVMFS. ] [ Rolf - a similarly high failure rate was observed, but with AFS. LHCb jobs have a very high access rate to the shared area during setup, so the problem affects not only AFS but also CVMFS. Stefan - the tool preparing the runtime environment touches several files to do this; a flat setup script is being prepared that would not need to check deep in the tree when preparing the runtime, which would overcome the problem (see the sketch after this report). ]
    • T2 site issues:
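
The "flat setup script" idea mentioned by Stefan in the NIKHEF item above can be illustrated as follows: capture the environment for one application version into a single file that jobs source, instead of traversing the shared-area tree at every job start. This is only a sketch under that assumption; the paths, variable names and application version are hypothetical and this is not LHCb's actual tooling.

    import os

    def write_flat_setup(var_names, outfile):
        """Dump the selected environment variables as one shell snippet."""
        with open(outfile, "w") as handle:
            handle.write("# generated once per application version\n")
            for name in sorted(var_names):
                handle.write('export %s="%s"\n' % (name, os.environ.get(name, "")))

    if __name__ == "__main__":
        # Hypothetical application version; a job would then simply run
        #   source setup_DaVinci_v28r2.sh
        # instead of walking the shared software tree at start-up.
        write_flat_setup(["PATH", "LD_LIBRARY_PATH", "PYTHONPATH"],
                         "setup_DaVinci_v28r2.sh")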

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • PIC - ntr
  • NDGF - NDGF T1 will have downtime on the head nodes for srm.ndgf.org and all storage tomorrow due to IP renumbering. The time is 12:30 to 14:30 UTC.
  • IN2P3 - just to add a point on the ALICE report: it says no jobs running, but the site was available - only a low number of jobs. The local ALICE support person installed a new version of AliEn; after this we noticed all was OK. Not necessarily the solution to the problem - still looking.
  • CNAF - ntr
  • RAL - have enabled checksums on ATLAS CASTOR
  • NL-T1 - ntr
  • ASGC - ntr
  • OSG - ntr

  • CERN storage - in addition to the above (Kerberos-induced problems), there was an upgrade of ALICE CASTOR to 2.1.10, and a transparent intervention on CMS which did not succeed fully: some users on the default pool saw requests dropped. Going to full SLC5 for the head nodes; an SLC5 node was promoted to be master. Major reconfiguration for CMS: moving 1/2 PB from T0STREAMER to T0EXPORT to allow the experiment to do compacting of data and processing of HI.

AOB:

Thursday

Attendance: local(Ricardo, Steve, Maarten, Maria, Jamie, Yuri, David, Nilo, Eva, Massimo, Ignacio, Alessandro, Simone, Andrea, MariaDZ);remote(Paco Bernabe, Federico, Ulf, Jeremy, Michael, Gonzalo, Jon, Francesco, Tiju, Stefano, Rob, Suijan).

Experiments round table:

  • ATLAS reports -
    • T0/CERN, Central services
      • nothing to report
    • T1
      • Taiwan-LCG2: job failures due to stage-in problems. GGUS:66140 reopened. Disk servers in hotdisk suffer from the high load when many jobs read the input in parallel. Some tuning done. Ticket closed at ~2:30, failure rate decreased to <1% (5066 completed jobs and 26 failed jobs).
      • TRIUMF file transfer failures. GGUS:66313. The issue with DDM delegating proxy is under investigation. [ Simone - DDM proxy looks ok, same machine is sending data to several sites. Could be some issue with delegated proxy in FTS or clock sync. Add this info to GGUS ticket and ask TRIUMF to check further. ]
      • RAL Frontier SAM test, request to add/tune some failover squid-related entries T2->T1 (central site monitoring). BUG:77104 in progress.
      • NDGF-T1 scheduled downtime (SE, SRM at risk), 7:00-15:00.
      • IN2P3: AMI DB transparent intervention to add a new schema to the AMI stream,13:00-14:00
    • T2

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on at full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples. Users' analysis also keeps running in the 100K jobs/day zone, though lower than before Christmas.
    • AOB
      • Offline DB updated yesterday with a two-hour downtime. The Frontier launchpad unavailability made SAM tests fail at all sites (sorry); this cleared in a few hours and the reason was found and fixed. The PhEDEx verification agent was not tested with the new DB and has a problem; sites can keep it off, a fix will come in the next release.
      • The problem with JobRobot jobs (see yesterday's minutes) happens when jobs are submitted without the correct role in the proxy, which explains why they could not match the sites. [ Maarten - the jobs don't match certain queues that are only open to certain privileged roles. The jobs should have been submitted with the correct proxy, but by the time of matchmaking on the WMS the proxy looks like an ordinary CMS proxy. Something wrong on the client? On the WMS? Such mix-ups have been seen in the past, but a few years back... ] A pre-submission proxy-role check is sketched after this report.
      • CRC-on-duty : Stefano Belforte
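
One lightweight way to catch the proxy-role mix-up discussed above before submission, assuming the standard voms-proxy-info client is available, is to check that the proxy actually carries the expected FQAN. This is only a sketch; the FQAN value is an example, not the JobRobot's real configuration.

    import subprocess
    import sys

    EXPECTED_FQAN = "/cms/Role=production"   # example role, not the real config

    def proxy_fqans():
        """List the FQANs of the current proxy via voms-proxy-info."""
        out = subprocess.check_output(["voms-proxy-info", "-fqan"], text=True)
        return [line.strip() for line in out.splitlines() if line.strip()]

    if __name__ == "__main__":
        fqans = proxy_fqans()
        if not any(f.startswith(EXPECTED_FQAN) for f in fqans):
            sys.exit("Proxy lacks %s - refusing to submit. FQANs: %s"
                     % (EXPECTED_FQAN, fqans))
        print("Proxy role OK:", fqans[0])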

  • ALICE reports -
    • T0 site
      • Occasional instabilities in user services for acquiring tokens or submitting jobs.
      • Users still adapting to small changes in AliEn functionality.
    • T1 sites
      • IN2P3 intend to let PackMan synchronize the read-only replicas of the AFS volume for ALICE software just as done at CERN.
    • T2 sites
      • GRIF_IRFU unavailable for 10 days.

  • LHCb reports -
    • MC jobs running at full steam (30-40K jobs per day).
    • Stripping restarted
    • T0
      • NTR
    • T1 site issues:
      • NIKHEF: the investigation of the CernVM-FS issue is ongoing. A potential problem with the LAN was found. [ Paco - we are still working on this issue. ]
    • T2 site issues

Sites / Services round table:

  • NL-T1 - nta
  • NDGF - pool updates should have gone ok and all back to normal
  • BNL - ntr
  • PIC - all fine; noticed that over the last ~24h quite a large bunch of jobs arrived from CMS which are waiting for input files from tape. Reprocessing? Some pre-staging operations? Stefano -
  • FNAL - ntr
  • CNAF - ntr
  • RAL - ntr
  • ASGC - ntr
  • KIT - downtime on 24 Jan from 05:30 to 09:30: network uplink down for 2-3 hours. On 26 Jan: maintenance of many services (firmware upgrades, OS updates etc.). Full site down; LFC and FTS will be available but at risk. Downtimes are in GOCDB.

  • CERN Batch - Kerberos-related problem now fixed.

  • CERN VOMRS - between 17 January and today (the 20th) the VOMRS synchronizer was not running at all, and the new VOMS was receiving no updates from VOMRS for any VO. Human error: the service was left not running after investigating a non-problem.

AOB:

Friday

Attendance: local(Edoardo, David, Simone, Yuri, Maarten, Jamie, Maria, Alessandro, Gavin, Jan, Roberto, Ricardo);remote(Rolf, Tiju, Andreas, Suijan, Paolo, Stefano, Rob, Michael, Jon, Alexander, Xavier, Gonzalo).

Experiments round table:

  • ATLAS reports -
    • T0/CERN, Central services
      • Blacklisting service (bourricot DB) problem, reported at ~16:00 by the expert. Elog:21374, 21377. Resolved at ~17:30. Perhaps due to a bug in the client software, BUG:76437.
      • Functional test (data transfer) to Tier-2s issue: fixed at ~23:00.
    • T1
      • RAL: after successful tests the checksum verification for transfers to RAL has been activated: BUG:76712, Tier0 to come next. Frontier/squid request BUG:77104 resolved.
      • TRIUMF: file transfer issue fixed, GGUS:66313 solved.
      • SARA: job failures in the stage-in step, GGUS:66347 in progress: under investigation is why jobs run at SARA while some of their input files stay at NIKHEF.
    • T2
      • A number of SRM-related issues at Tier-2s; some of them have already been fixed.
    • AOB
      • Upgrade of ATLAS DDM catalog next week, intervention date to be confirmed

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on at full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples.
    • AOB
      • The problem mentioned by Andrea Sciaba'/Maarten Litmaath of CMS JobRobot jobs aborting due to REQUEST EXPIRED has been understood: a mix of a bug in the gLite WMS and one in the CMS submission client, triggered by an unusual usage pattern (i.e. all in the category of unforeseen situations). Clearly not a site problem. The near-term fix is via the CMS client and is being pursued; site availability will be corrected by hand until then. [ Maarten - the same WMS nodes serve 2 streams: there is another WMS issue open since a couple of years, where one such stream might have to wait a long time before seeing any progress. That risk is still there - a bug directly in Condor-G; we have not managed to persuade the developers to fix it. Proxy delegation issue - probably only implemented for gLite 3.2. ]
      • CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • progress foreseen next week w.r.t. these high-priority items:
        • upgrade of Xrootd servers from SLC4 to SLC5
        • extension of CAF capacity by a factor 3
    • T1 sites
      • NTR
    • T2 sites
      • KISTI scheduled downtime for power maintenance has started early

  • LHCb reports - MC jobs running at full steam (30-40K jobs per day).
    • T0
      • SAM spotting SRM problems at CERN. GGUS:66351 [ Jan - a known bug between the SRM and the Oracle backend; the only known cure is to restart the servers. We had hoped it would be fixed with Oracle 10.2.0.5, but that is not the case. It is an Oracle server-side cache corruption issue. ]
      • Observed an anomalous failure rate of MC jobs due to timeout in setting up the environment (shared area issue). Increased the timeout to 2400 seconds.
    • T1 site issues:
      • PIC: FTS issues submitting transfer jobs (seems like a clock is not in sync): GGUS:66355 [ Gonzalo - just status: we don't understand what is going on - on first check the symptoms look as if a clock is 5' behind, but the clocks are well in sync and we cannot reproduce it. The timestamps in the log files match to the second with the LHCb client, without a 5' delay. Still to be understood. Is this error only seen at PIC? Maarten - is the time wrong on the CERN UI? ] A small client-side diagnostic is sketched after this report.
    • T2 site issues:
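
For the PIC item above (an FTS error that looks like a 5-minute clock offset), a quick client-side diagnostic, sketched below, is to compare the local clock with the delegated proxy's validity window using the standard openssl x509 options. The default proxy path is an assumption; this is only a diagnostic sketch, not what PIC or LHCb actually ran.

    import os
    import subprocess
    from datetime import datetime, timezone

    def proxy_validity(proxy_path):
        """Return (notBefore, notAfter) of a certificate via openssl x509."""
        out = subprocess.check_output(
            ["openssl", "x509", "-in", proxy_path, "-noout",
             "-startdate", "-enddate"], text=True)
        dates = {}
        for line in out.splitlines():   # e.g. notBefore=Jan 21 14:00:00 2011 GMT
            key, _, value = line.partition("=")
            dates[key] = datetime.strptime(
                value.strip(), "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
        return dates["notBefore"], dates["notAfter"]

    if __name__ == "__main__":
        # Conventional default proxy location, overridable via X509_USER_PROXY.
        proxy = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())
        not_before, not_after = proxy_validity(proxy)
        now = datetime.now(timezone.utc)
        print("local UTC now  :", now)
        print("proxy notBefore:", not_before,
              "(in the future - apparent clock skew)" if now < not_before else "")
        print("proxy notAfter :", not_after)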

Sites / Services round table:

  • IN2P3 - for the ALICE problem we decided to split the volume into R/O and R/W parts, which apparently solved the problems. Local ALICE support has to do some adaptations.
  • RAL - ntr
  • NDGF - we will have a short restart of the SRM service to improve the stability of pin management. At risk 11:00 - 14:00 in GOCDB.
  • ASGC - ntr
  • CNAF - ntr
  • BNL - ntr
  • FNAL - ntr
  • NL-T1 - ntr
  • KIT - between 02:00 and 07:00 staging was not possible for ATLAS as the corresponding ATLAS staging pools were down. 13:00 - 14:00 unscheduled maintenance on one of the tape libraries, so staging is degraded.
  • PIC - around 12:30 there was a water leak incident in the module hosting part of the WNs. It triggered an alarm, which meant switching off all WNs in this module - about 400 jobs were killed and half the WNs switched off. Will be solved over the weekend.
  • OSG - a couple of tickets were submitted about data formatting in the information providers (VO info reported to BeStMan). Forwarded to the GIP developers for action.

  • CERN storage - ongoing shuffling of disk servers for CMS. Two short interventions on the ATLAS and ALICE LSF head nodes next Mon and Tue respectively: a 2' restart and shuffling of the LSF head node. OK given.

  • CERN dashboards - announced an upgrade of the ATLAS DDM dashboard on Tue 25 at 10:00, downtime < 1h, to use the ADC standard cloud names and colours. This can affect people using the DDM dashboard API. Savannah bug report 77165.

AOB:

-- JamieShiers - 14-Jan-2011
