Week of 110110

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Cedric, Julia, Maria, Jamie, Manuel, Dirk, Peter, Ueda, Stefan, Maarten, Massimo, Ignacio, MariaDZ, David, Luca);remote(Michael, Jon, Gareth, Alessandro, Rolf, Gonzalo, Ron, Tore, Suijan Zhou, Rob, Daniele).

Experiments round table:

  • ATLAS reports -
    • CERN-PROD_LOCALGROUPDISK SRM errors GGUS:65932
    • Problem with BNL voms server GGUS:65944. Fixed [ Michael - the problem was with the VOMS service, not the server. An admin mistakenly thought that a particular certificate was not in use; once the original cert was restored yesterday morning all was ok. ]
    • LFC problem at GridKa : GGUS:65942

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
    • Tier1
      • No outstanding issues
    • Tier-2
      • No outstanding issues
    • AOB
      • Mail Gateway issue at CERN : affected sendmail from CMS hypernews, not user mails. One of these gateways was overloaded by an lxplus machine sending over 100,000 mails. CERN/IT investigating who caused this volume of traffic
      • CRC this week (starting tomorrow) : Stefano Belforte (connecting remotely...)

  • ALICE reports -
    • T0 site
      • Efforts ongoing to make AliEn v2.19 work for users this week. Production continues and CAF is available for analysis on limited, pre-staged data sets (new data sets can be requested).
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report

  • LHCb reports - MC productions ongoing. Need to rerun the stripping for two streams (CHARM FULL and CHARM CONTROL). This is a very large activity covering all 2010 data.
    • T0
      • NTR
    • T1 site issues:
      • IN2p3: still a problem installing software in their AFS area (vos release problem). An LHCb-IN2P3 meeting is currently being held to discuss the status of the shared area and plans for the future. [ VOS release has now been done ]

Sites / Services round table:

  • BNL - nta
  • FNAL - ntr
  • RAL - had been running with some ATLAS FTS channels turned down to 50% of normal as one of the ATLAS areas was getting full. Changed this morning to 75% of nominal. Will restore asap.
  • CNAF - ntr
  • IN2P3 - nta
  • PIC - a pnfs glitch this morning due to human error. It caused the dCache service to be down for a couple of hours.
  • NL-T1 - downtime next week Tuesday to attempt to switch Oracle over to new h/w (7 node RAC). Planned in December but had to be postponed due to faulty network h/w.
  • NDGF - ntr
  • ASGC - ntr
  • KIT - issue with FTS now fixed.
  • OSG - ntr

  • CERN DB - ntr
  • CERN storage - this morning did the CASTOR CMS upgrade to 2.1.10. Also upgrading the stager and SRM DBs to Oracle 10.2.0.5.
  • CERN Grid services - during the last two weeks we have seen some degradation of the batch system, correlated to a high rate of submission from grid ILC batch queues. Raised a ticket to find out why. GGUS:65965

AOB:

Tuesday:

Attendance: local(Eddie, Roberto, Maarten, Jamie, Maria, Gavin, Miguel, Ueda, Stefan, Massimo, Julia, Simone, Jacek);remote(Tiju, Stefano, Jon, Ulf, Jeremy, Rolf, Suijan, Alessandro, Rob, Xavier).

Experiments round table:

  • ATLAS reports -
    • ATLAS distributed computing system downtime for database reorganization on 17-18 Jan,
      • start draining on 16 Jan, in the evening
    • ATLAS restarted a series of data transfer measurements over the full matrix (every site to every site)
      • except the sites declared as not appropriate for the tests
      • sites would observe transfers from/to unusual sites
    • Please provide us with a pointer to the CERN FTS monitor [ Gavin - it is https://fts-monitor.cern.ch/ ]

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still down [ Miguel - is there any Tier0 issue? A - no, waiting for a new version of the CMS SW to start reprocessing when the new cosmic ray run comes. Not processing data, but not broken. ]
    • Tier1
      • No outstanding issues
    • Tier-2
      • No outstanding issues
    • AOB
      • CRC-on-duty : Stefano Belforte
      • Meeting held to discuss Dashboard problems during the Xmas break. For scheduled downtimes someone is working on it full time; a new version of the SSB collector will be in production in one week. Will report at the next CMS Facilities meeting on Monday.

  • ALICE reports -
    • T0 site
      • Efforts ongoing to make AliEn v2.19 work for users this week. Production continues and CAF is available for analysis on limited, pre-staged data sets (new data sets can be requested).
    • T1 sites
      • Nothing to report
    • T2 sites
      • Nothing to report
  • LHCb reports - MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
    • T0
      • NTR
    • T1 site issues:
      • IN2p3: After the SW was installed yesterday, MC jobs ramped up at the IN2P3-CC and IN2P3-T2 centres.
    • AOB - conditions DB SAM test failing at 5 sites yesterday. (One of the tests in the critical availability list - for the time being taken out of the critical test list)

Sites / Services round table:

  • IN2P3 - ntr
  • RAL - FTS channels for ATLAS; updated now to full values
  • ASGC - ntr
  • FNAL - ntr
  • NL-T1 - downtime tomorrow for top level BDII, CREAM CE, WMS, to be migrated to a more stable host
  • KIT - ntr
  • CNAF - ntr
  • GridPP - ntr
  • OSG - question: had people at FNAL who run VOMS for CDF and D0 asking whether they should be registered in GOCDB. Whom to contact? A: Tiziana Ferrari

  • NDGF - We have srm.ndgf.org pool software updates + tape system expansion tomorrow. AT_RISK has been scheduled, as some ATLAS and ALICE data might be unavailable. The CSC T2 site has the ARC-CE node jade-cms.hip.fi down with hardware problems.

  • CERN VOMS - On Thursday 13 January at 10:00 CET the host certificate for lcg-voms.cern.ch, hosting the VOs dteam, cms, atlas, alice, lhcb and ops, will be updated. A new lcg-vomscerts 6.3.0 was released before the new year.

  • CERN DB - this morning the ALICE online DB was down for 3h due to a power cut at the pit.

AOB:

  • Next WLCG T1SCM on Thursday 20 January - agenda to be circulated shortly.

Wednesday

Attendance: local(Jamie, MariaDZ, Ueda, Lola, Eva, Stefan, David, Gavin);remote(Jon, Francesco Norefini, Rolf, Joel, Michael, Onno, Tiju, Alessandro, Paolo, Suijan, Rob, Stefano).

Experiments round table:

  • ATLAS reports -
    • ATLAS will switch the certificate used for data transfers via DDM in the near future.
    • CNAF-BNL network problem (slow transfers) GGUS:61440 (since 2010-08-23)
      • The last update on 2010-12-17: "News from NRNs side: It seems the "Production Path" has been cleared. A commissioning phase is ongoing." [ Michael - the production link has been re-engineered - a new circuit has been set up. Tests initiated by ESNET engineering look fine so far. They want to ramp up traffic so that it looks like production, to see if there is packet loss. Engineering still working; ESNET, USLHCNET and GEANT involved. News in the next days ]
      • ATLAS can perform some test transfers once the "commissioning" is done and the sites request it.

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still not running, waiting for next cosmic ray runs
      • CMS Dashboard issues: all big problems reported lately are solved now. Still many Savannah tickets open. Will review at next Monday's CMS ops meeting.
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples. User analysis also keeps running in the 100K jobs/day zone. Starting to see resource saturation effects (jobs pending at some sites for several days)
    • AOB
      • CRC-on-duty : Stefano Belforte

  • ALICE reports - Problems with user job submission. Work still ongoing to fix them; big improvements but still not fully fixed. Downtime of the web interface: the MonALISA webpage has not been accessible since 13:30. It should have been down for 90' but is still not back.

  • LHCb reports - MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
    • T0
      • NTR
    • T1 site issues:
      • IN2p3: After the SW was installed yesterday, MC jobs ramped up at the IN2P3-CC and IN2P3-T2 centres.

Sites / Services round table:

  • NDGF: Our disk pool software upgrade will start soon. IMCSUL and IMCSUL-INF, two T2 sites, have dropped off the internet; we don't know whether it is a power or network problem, as we have not had any connection to the machines there for 3h. They have been put in downtime in GOCDB.

  • FNAL - ntr
  • CNAF - nta
  • IN2P3 - ntr [ Joel - didn't receive the EGI broadcast saying the intervention had started. BTW it is a downtime for LHCb, not an AT RISK. Rolf - this is the business of the portal people - don't know if there is a problem or not. ]
  • BNL - ntr
  • NL-T1 - announcement: next Tuesday Jan 18 scheduled downtime to move Oracle DB back to original h/w. Affects FTS and LFC ATLAS and LHCb.
  • RAL - ntr
  • ASGC - ntr
  • KIT - ntr
  • OSG - ntr. Was able to contact Tiziana about CDF and D0 VOs. In good shape there!

  • CERN DB - this morning ALICE online was upgraded to Oracle 10.2.0.5. At the moment the CMS ARCHIVE DB is being done. More upgrades are scheduled for next week and the week after. The LCG DB is proposed for Tue 25, and next Monday the split of the ATLAS offline DB: DQ2, PanDA and prodsys move to a new cluster. 3 hours full downtime and then these 3 apps will be available R/O until 18:00.

AOB:

Thursday

Attendance: local(Lola, David, Jamie, Maria, Ueda, Simone, Roberto, Jacek, Carlos, Alessandro, Stefan, Steve);remote(Michael, Jon, Gonzalo, Daniele, Ronald, Gareth, Joel, Rolf, Ron, Kyle, Andreas, Alessandro, Suijan, Foued).

Experiments round table:

  • ATLAS reports -
    • GGUS tickets sent out concerning the switch of the certificate -- thanks for the replies.
      • tests will be carried out step-by-step.
    • Data migration to tape started at the end of last year (ADCOperationsDailyReports2010#Dec_26_28_Sun_Tue).
      • Continuing with the data with the "project names" mc09_* and mc10_*
    • GGUS:66074 submitted to CERN for EOS space being full was a mistake -- it is an ATLAS internal issue -- apologies.

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still not running, waiting for next cosmic ray runs
      • CMS analysis submission tool (CRAB) started failing on lxplus due to new snoopy.so (unannounced). snoopy removed by IT (thanks) and debug/fix in progress
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples. User analysis also keeps running in the 100K jobs/day zone, though lower than before Christmas.
      • At least one of our T2s (TR_METU) has moved to CREAM-only CE-wise, but SAM still treats CREAM differently from LCG-CE and OSG-CE, so the site appears in error in the CMS availability/reliability because it has no working CE. A correction was put in by hand on our report; the priority of a proper fix has been raised to top. Help from the dashboard team may be needed.
    • AOB
      • CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • Efforts ongoing to make the new AliEn version work 100% for users; it will still take a few days.
      • We are still affected by some degradation of the batch system, correlated to the high rate of submission from grid ILC batch queues, so there is not a lot of activity [ Joel - looking at this problem with Stephane Paus. Steve - will ask and get back. Roberto - it is not the number of jobs but the way ILC is querying the IS; see the sketch after this report. ]
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations
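
Roberto's point about ILC "querying the IS" refers to LDAP queries against the BDII / CE information providers. Below is a minimal sketch of such a query, purely to illustrate what kind of load is involved; the python-ldap module, the lcg-bdii.cern.ch endpoint and the Glue 1.3 attribute names are assumptions, and this is not the actual ILC query pattern.

```python
# Illustrative sketch only: one Glue 1.3 query against a top-level BDII of the
# kind an experiment framework issues to discover CE state. The overload
# discussed above concerns the rate/pattern of such queries, not this filter.
# Assumptions: python-ldap is installed and lcg-bdii.cern.ch:2170 is reachable.
import ldap

bdii = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
entries = bdii.search_s(
    "o=grid", ldap.SCOPE_SUBTREE,
    "(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:ilc))",
    ["GlueCEUniqueID", "GlueCEStateWaitingJobs", "GlueCEStateRunningJobs"],
)
for dn, attrs in entries:
    print(dn, attrs.get("GlueCEStateWaitingJobs"))
```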

  • LHCb reports - MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for Moriond conference.
    • T0
      • CERN: GGUS:66067 cannot import lfc with the SL5 grid env (not for Python 2.5 or Python 2.6)
      • CERN: Problem with the information published by the CEs. The problem boiled down to the overload generated by concurrent ILC activity. Problem also spotted by Alice (GGUS:65947)
    • T1 site issues:
      • CNAF: ce07-lcg: lots of pilots sent even if the system is full (investigation in progress). It seems related to a VOVIEW problem. [ Roberto - problem with the VOVIEW system & this CE: FQAN-bound instead of VO-bound. Alessandro - have fixed the problem; VOVIEW was not configured properly, should now be ok. ]
      • IN2P3 - should soon have the possibility to install s/w properly on AFS - implementing something similar to CERN

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • KIT - last night machines hosting dCache SRM and pnfs went offline. Restarted and currently all functional. Looking for reason.
  • NL-T1 - ntr
  • RAL - since the start of the meeting a problem has come up on the LFC which we are looking at
  • IN2P3 - yesterday Joel mentioned a problem with downtime notification. Discussed with the ops portal people and they told me that since the move to GOCDB4 there is a problem: downtimes stored in GOCDB are now forwarded to the ops portal regardless of age. The notification mechanism got overwhelmed by the number of messages, which caused a delay. Known problem - the definitive fix is on the GOCDB side. Developers are working on it; the ops portal is also working on a workaround.
  • NDGF - have had a few issues mentioned in earlier meetings this week. The Latvian T2 went offline - now confirmed to be a firewall issue; they should be online again - the SE is still offline but hopefully solved soon. A downtime is booked and will be revoked when ok. The T2 in Finland, where the ARC CE supporting CMS has flaky hardware, doesn't accept new jobs but keeps processing old ones; no idea when it will be solved. Upgrades on pool software are all finished and done. Next week on Mon-Tue more upgrades; Mon: a bunch of disk pools - ATLAS data unavailable; later the same day tape pools, which will take more data offline; Tue: a bigger installation of new tape drives on one machine - mostly ALICE data, which will probably be offline for all of Tuesday. 2/3 of ALICE tape capacity will still be online.
  • CNAF - ntr
  • ASGC - power maintenance at data centre. Announced 4h downtime but fixed already.
  • PIC - scheduled downtime next Tuesday
  • OSG - ntr

  • CERN Network - LCG router was scheduled to be put in production today but deferred

AOB:

Friday

Attendance: local(Eva, Nilo, Lola, German, Miguel, Maarten, Gavin, Jamie, Maria, Stefan, Ueda, Julia, Eddie, Roberto, Alessandro, Simone);remote(Rolf, Jeremy, Michael, Jon, Xavier, Joel, Onno, Ulf, Kyle, Franceso, Gonzalo, Suijan, Tiju, Stefano).

Experiments round table:

  • ATLAS reports -
    • CERN -- UNAVAILABLE files reported by the shifter (GGUS:65193)
      • trying to understand if these are already in the list of lost files (due to the powercut) or not.
      • response to GGUS:65847 ?
    • ATLAS distributed computing system downtime for database reorganization on 17-18 Jan
      • starting draining on 16 Jan, evening

  • CMS reports -
    • Experiment activity
      • Shutdown activities, Physics analysis of 2010 data, heavy preparation period for Winter Physics conferences
    • CERN and Tier0
      • Tier-0 still not running, waiting for next cosmic ray runs
      • CMS analysis submission tool (CRAB) started failing on lxplus due to new snoopy.so. Snoopy.so promptly fixed yesterday, rollout of new one ongoing. Incident closed.
    • Tier1
      • No outstanding issues. Reprocessing of 2010 data going on at full steam.
    • Tier-2
      • No outstanding issues. MC going on full steam, completing Fall 10 MC and starting 8 TeV Spring 11 samples. User analysis also keeps running in the 100K jobs/day zone, though lower than before Christmas.
      • We still need to work out a solution for proper monitoring of sites with CREAM-only CEs; see the sketch after this report. [ Julia - CMS has a separate instance of the dashboard for availability calculations. Can change the algorithm but need to understand the requirements. ]
    • AOB
      • CRC-on-duty : Stefano Belforte
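
On the CREAM-only CE monitoring issue raised under Tier-2 above: below is a minimal sketch of the kind of change being discussed, not the actual SAM/Dashboard availability algorithm. The flavour names and the "OK" status value are assumptions, used only to illustrate counting a CREAM-only site as having a working CE.

```python
# Sketch (not the Dashboard algorithm): treat CREAM-CE as an accepted CE
# flavour when deciding whether a site has a working CE, so a CREAM-only
# site is no longer marked as being in error for lacking an LCG-CE/OSG-CE.
CE_FLAVOURS = {"LCG-CE", "OSG-CE", "CREAM-CE"}   # assumed flavour names

def ce_available(sam_results):
    """sam_results: list of (service_flavour, test_status) pairs for one site."""
    ce_statuses = [status for flavour, status in sam_results
                   if flavour in CE_FLAVOURS]
    # The CE service counts as available if at least one CE of any accepted
    # flavour passes its critical tests.
    return bool(ce_statuses) and any(s == "OK" for s in ce_statuses)

# Example: a CREAM-only site (e.g. a T2 after migrating off the LCG-CE)
# now counts as available.
print(ce_available([("CREAM-CE", "OK"), ("SRMv2", "OK")]))   # True
```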

  • ALICE reports - GENERAL INFORMATION: at the beginning of February the pass 2 reconstruction of the HI run will start (affects all sites). AliEn issues not fully solved, but fewer and fewer issues
    • T0 site Nothing to report.
    • T1 sites
      • IN2P3: some issues with the proxy which have already been put in expert hands
    • T2 sites
      • Trieste-T2 is (not) back in production - had some h/w problems.
      • Announced downtime in Torino 22nd of January

  • LHCb reports -
    • MC jobs running at full steam (30-40K jobs per day). New requests coming almost continuously for the Moriond conference. The decrease of running jobs that we saw yesterday was due to a problem in the LHCbDIRAC code (we tried to fix a problem of proxy delegation for the CREAM CE but it was not working, so we went back to the previous version and we are discussing with the CREAM CE developer how to fix it properly). [ Roberto - the problem was that using the option -myproxyserver=" " triggers the proxy renewal mechanism. CREAM CEs keep a cache, and using the option this way the proxy is not delegated at each submission. To bypass this, set the variable in the JDL. With the 3.1 UI this prevents submission of parametric jobs, such as MC jobs. The solution is to use a local configuration file and move the pilot agent to gLite 3.2 instead; this will fix both problems. ]
    • Today we will restart the stripping.
    • T0
      • CERN : GGUS:66067 cannot import lfc with the SL5 grid env (closed but with the problem not solved) [ Maarten - the gLite UI always comes by default hard-coded with the assumption that the system python will be used. The experiments have all moved to newer versions of python, but then you cannot expect to import the LFC python API; refer to the AA, which is prepared for other versions of python. In this ticket "both worlds" are mixed, which does not work. The next version of the UI will allow the user to choose the version of python: an extra argument when setting the environment through the script, or a variable set before sourcing the script. But there can only be one. See the sketch after this report. ]
    • T1 site issues:
      • CNAF : problem of voview fixed
    • T2 site issues:
      • Torino : GGUS:66093 : 4 cores with 4 GB of memory is a little bit too low for LHCb (only 1 GB per job slot, below the 1.5 GB virtual memory minimum). [ Ale - is this not set on the VO card? A: the minimum for 32-bit is 750 MB of RAM and 1.5 GB of VM. ]
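
Regarding GGUS:66067 above: a minimal sketch of the failure mode Maarten describes, assuming the LFC python binding shipped with the SL5 gLite UI is compiled only against the system python; the printed guidance is illustrative, not the ticket's resolution.

```python
# Sketch of the "import lfc" problem on an SL5 gLite UI (assumption: the
# compiled _lfc extension shipped with the UI matches only the system python).
import sys

try:
    import lfc  # LFC python binding provided by the gLite UI
    print("lfc binding loaded with python %s" % sys.version.split()[0])
except ImportError:
    # A newer interpreter (e.g. python 2.5/2.6 from the experiment AA) cannot
    # load an extension built for the system python, hence the failure.
    print("no lfc binding for python %s - use the system python or an LFC "
          "build matching this interpreter" % sys.version.split()[0])
```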

Sites / Services round table:

  • IN2P3 - ntr
  • BNL - ntr; reminder of major intervention Sat/Sun and probably Mon
  • FNAL - had about 20 files that were held at FNAL non-custodially and got accidentally flushed out of dCache disk - retransferring.
  • KIT - ntr
  • NL-T1 - ntr
  • CNAF - ntr
  • PIC - ntr
  • ASGC - ntr
  • RAL - on Mon and Tue outage on ATLAS CASTOR to upgrade diskservers to SL5 64bit
  • GridPP -
  • OSG - ntr

  • NDGF - Next week we will have multiple storage downtimes, all declared as AT_RISK with possible outages reading ATLAS and ALICE data. These are part of the srm.ndgf.org work:
    • HPC2N tape upgrade: Monday Jan 17th 10:00-18:00 CET - ALMS, library firmware
    • HPC2N tape upgrade: Tuesday Jan 18th 10:00-18:00 CET - LTO5 tape drive install
    • PDC pool upgrades: Monday Jan 17th from 10:00 CET
    • Pool upgrades for DCSC and NSC pools: Monday 17.1.2011 13:00-17:00 CET

  • CERN Grid - overload from ILC VO has now gone, understood.

  • CERN Storage - CERN tape operations - will have to make an intervention on two STK libraries which hold half of the tape data stored at CERN; 1 day per library. The rails on which the hand-bots move have to be fixed better. This needs to be done before the restart. During the 8 h intervention the data in that library will not be readable: read requests will queue and writes will be redirected to the other libraries. Tue 9 Feb for the first library and Wed 10 Feb for the 2nd. [ Jon - comment: we did this intervention and a dramatic improvement in reliability resulted. ]

  • CERN DB - due to power tests at P5, the CMS online DB will not be available for the whole of Monday morning.

AOB:

-- JamieShiers - 07-Jan-2011
