Week of 080915

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Julia, Gavin, Simone, Andrea, Miguel, Harry, Jamie, Maria, Daniele, James, Roberto, Patricia, Felix, Nick, Steve);remote(Michael, Gareth).

elog review:

Experiments round table:

  • CMS - see also https://prod-grid-logger.cern.ch/elog/CCRC%2708+Logbook/463. In running mode. Over the w/e had several small cosmic runs (Sat). No processing failures - just one run stuck in DAQ. Couple of long runs soon after - all subdetectors apart from pixel, tracker and end-cap. Some 'blind regions' in Lemon monitoring - see daily log. Sunday: more long cosmic runs, injected by manual update of the T0 component. (Recover cmsmon). Some runs took longer in prompt reco than expected. Data taking a bit slower this morning - now ok. Custodial data to T1 sites. ASGC recovering after the typhoon. Temporary glitches with SAM tests - maybe related to the GridKA power supply problems. Several T2 job robot failures and s/w installation failures.

  • ATLAS - quite a lot of cosmic data taking, exported to T1s as usual. Worked well. Internal copying of data from one pool to another at CERN - still some problems; too much data to be transferred and the rate too high. 250MB/s - restricted list. Glitch at FZK as mentioned - power outage. Recovered ok.

  • LHCb - last week's real data: looked at at the pit, no request to store in CASTOR nor to replicate (other than historical). Online and Grid still strongly recoupled. MC data - couple of main issues. PIC - problem staging files from tape; fixed after a GGUS ticket. 2nd problem with CERN WMS failures with list mismatch. Daemon to be restarted?

  • ALICE - VO box tests through SAM failed over the w/e. Internal ALICE problem - credentials. Nothing for sites to worry about. The SAM user interfaces are inaccessible to ALICE - following up with Judit. Pass 1 reconstruction at CERN using VO boxes at CERN - publication of CEs at CERN not always working - completely stopped reconstruction of raw data. Changed JDL to try to work around. Q: manifestation? 90% (!) of CEs disappeared for several hours. Q: solved? Checked CE by CE for ALICE. Harry - always the LSF interface, not responding. Load related - talk to Ulrich.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

  • Quiet weekend. High load at TRIUMF (streams) - 3rd time - related to ATLAS/COOL. TRIUMF reprocessing exercise which should stress conditions (but not that much...). Check with Rod Walker. Last week it happened at GridKA and SARA so maybe different? Simone to follow up.

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Ricardo, Harry, Andrea, Gavin, Miguel, Olof, Simone, Roberto);remote(Gareth, Jeremy, Michael, Jeff, Pepe).

elog review:

Experiments round table:

LHCb (RS): 1) Lost a few real data files due to a flaw in the mechanism that transfers files from the pit to CASTOR. 2) Only a few production jobs are running at CERN (answer from Ricardo was that they will reshuffle the LHCb LSF queues to put the highest priority jobs first). 3) They have a problem with their stripping application where one job in 5 crashes. 4) Have started running MC at Tier 2 sites and found one where the working directory is being cleaned before jobs finish. Jeff then asked why there are no jobs at Nikhef - the answer was there is no data to be processed there (Tier 1s are only used for stripping and reprocessing).

ATLAS (SC): The system is smoothly collecting cosmics data. Yesterday they produced muon ntuples for export to the Tier 2s. There is a level of 4% failures in CERN to CERN copies (to change CASTOR pools), which go via SRM/FTS, and this needs investigation. There are FTS proxy delegation failures happening only at PIC, where the cron job to refresh the delegation often fails with a message indicating a clock problem. Pepe said they were already talking to the FTS developers and it was agreed that ATLAS would give some debug information to them. CMS PhEDEx uses a different mechanism and LHCb are not currently transferring much to PIC, so neither sees the problem.

Sites round table:

PIC (PF): 1) We received a GGUS alarm ticket about an ATLAS FTS transfer problem where the information was of insufficient quality to help. SC apologised and agreed to tighten up their usage of such tickets. 2) We have thousands of SRM file accesses from LHCb jobs and this is affecting other VOs. RS said this was catching up on a backlog of LHCb work and would soon stop.

NIKHEF (JT): Asked why there is no ATLAS work there? SC suggested to ask Hurng as he follows ATLAS work at NL-T1 closely. This was agreed.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB: 1) Following up on yesterday's 'disappearing CE' at CERN, RS said that LHCb had been incorrectly using hardwired CE names (including obsolete ones) rather than discovering them in the information system. 2) Following up on yesterday's report of overloaded Tier 1 conditions DBs, Maria Girone and Sasha Vanyashin have worked on this. When ATLAS start a set of reconstruction jobs together and each job opens 20 DB connections, the combination of 20 to 30 such jobs (of the order of 400-600 concurrent sessions) together with Oracle streams receiving data into the Tier 1 DB from CERN is sufficient to overload the Oracle backend. ATLAS tested this but without the streaming component; this is now a serious problem for them, and for other experiments in the case of shared servers.

Wednesday:

Attendance: local(Andrea, Simone, Maria, Harry, Jamie, Roberto, Jan);remote(Jeremy, Luca, Gareth).

elog review:

Experiments round table:

  • ATLAS - not much to report. New project tag for collision data at 900 GeV defined. Various emails yesterday - should be consistent & sites should now be aware. In the last 24H the only problem was some transfer inefficiency from other places to RAL. Disappeared about lunchtime. Gareth - CASTOR job manager failed this morning, down for a while and now resolved. Luca - mail from Kors about foreseen collisions. This weekend?? A: nobody really knows when, but the prefix will change as soon as this happens to indicate collision data. Luca - should conform to what is written by Kors? A: yes, absolutely. Maria - not for the w/e but pushing to get beams tomorrow (likely); not expecting collisions before next week. Continue to the end of September and then a ~10 day break. Re-establish circulating beams, then commission optics, then 450+450 collisions. Might try to ramp to 1 TeV per beam - the maximum allowed with the current machine configuration - but then need to stop to push to higher energy. Simone - next week have to organise a meeting about reprocessing to understand the DB access pattern giving load issues. Exercise now stopped. Expects currently in BNL.

  • CMS - (https://prod-grid-logger.cern.ch/elog/CCRC%2708+Logbook/464) T0 operation quite stable and smooth. T1 workflows - custodial data to sites recently subscribed OK. Unscheduled downtime at CNAF - "at risk" to fix the backup diesel generator. T2s: job robot failures at some sites, some CMS-SAM test failures, but nothing really crucial.

  • ALICE - following up an issue observed at RAL with the local WMS which prevents the submission of ALICE agents to the site. It might be associated with the certificate of the WMS but is still under investigation.

  • LHCb - running dummy MC in steady mode (output uploaded to T1 SE then scrapped). Computational activity. A few small problems but not for here... About 40 jobs failing to stage files at PIC - promptly fixed, recognised to be a broken tape, replaced. 400 jobs at PIC, each accessing 21 files, causing some storage troubles (but not a typical load for this site). From Ron got mail about a major problem with the storage system at SARA - no broadcast or GOCDB entry? Now ready to migrate all SRM v1 endpoints to SRM v2 so the SRM v1 endpoints can be decommissioned. Need to change all entries in the file catalog. Will send a request to the "LFC people". Q: SRM v1 data? A: change the host name and access without a spacetoken. Luca - and notify also the T1s? A: yes. Q: global catalog change - change the master then resync all???

Sites round table:

  • CNAF - circulated a ticket today. Discovered the diesel engine not working; an engineer is coming now to repair it, hence currently "at risk". A power cut longer than a few minutes would be disruptive. Back to normal this evening. Next Monday will reinstall one of the three CEs. Should we announce a downtime for this? The CE in question will be drained and no longer announced via the IS. Users will access the other 2 CEs. Should declare it as down in GOCDB. Site availability still ok. Traced problem on CASTOR 2 weeks ago (DB RAC) - known bug, applied patch; post-mortem to appear. Runs Oracle on RHEL5. Also the only site on Oracle 10.2.0.4. Maria - will extend discussions within the DBA community to cover also CASTOR DB issues.

Core services (CERN) report:

  • Tomorrow morning at 10:00 will failover the SRM b/e nodes to standby servers. May cause some temporary disturbances to running programs. Posted on the CERN IT status board. Part of the move of the SRM b/e. CASTOR - 5% efficiency of transferring data from CERN to CERN (pool to pool). Olof found the problem. Simone increased the load to 200MB/s and the problem seems solved. Thanks. Jan - looking at ways of tightening the handle on CASTOR config issues - 2nd time in 2 weeks we've had config problems.

DB services (CERN) report:

  • Yesterday helped LHCb to complete the migration of the old online DB server to the new cluster. Migration painful but now completed. Are analysing with ATLAS and CMS the access to online DBs in general. For CMS, ALICE and LHCb it is "controlled"; for ATLAS the online DB is externally visible and rather heavily used for connections from external sites. Is this what is required? ATLAS will try to close it but for the moment cannot. For the moment the external activity is impressive - need to follow up.

Monitoring / dashboard report:

Release update:

AOB:

Thursday:

Attendance: local(Jamie, Daniele, Harry, Roberto, Simone);remote(Gareth, Jeremy, Miguel).

elog review:

Experiments round table:

  • CMS - (https://prod-grid-logger.cern.ch/elog/CCRC%2708+Logbook/465) day relatively quiet, series of magnet ramps up/down, ready for 0 and 3T. Some failures seen in the merging step (wrong numbers of). T1 workflows - transfers OK apart from some stuck to IN2P3. For T2s: nothing particular. URGENT REQUEST - 2 machines very soon to address the problem that affected cmsmon. Higher priority than the existing pending request. Harry will phone Bernd to expedite. Transfers for CMS in the German region: GridKA->Aachen, Aachen->GridKA and DESY->GridKA; cannot exclude the same problem as ATLAS. Simone - huge SRM v1/v2 migration ongoing. In many cases found it much more convenient to get data from elsewhere, hence the FZK site is very busy. Recent ATLAS+CMS reprocessing at GridKA - would like to share plots of performance at the site. Simone - not sure when reprocessing was run at FZK. Need to understand the dates to do the analysis. ATLAS - dedicated meeting with DB people about reprocessing tomorrow.

  • ATLAS - RAL notified about transfers into RAL. Downtime today finishing at 1:00. Problem solved by the middle of the day. DB problem? Gareth - can confirm a DB problem overnight, resolved in the morning. FTS at FZK for transfers within the cloud has very long latency scheduling requests. Channel independent. Something at the level of the FTS server? The transfer itself takes a normal time; scheduling takes up to 30'. This degrades the overall service quite a lot. Stephane Jezequel in contact with the German cloud. Con-call with Taipei. Progress solving problems with the pool and ACLs. Jason reset ACLs for millions of files so data in the pipeline for deletion can now be deleted. In progress but a good step forward.

  • LHCb - Outage of the WMS at CNAF. Became unresponsive this morning and had to be rebooted. For RAL - CASTOR instance of LHCb had problems due to the b/e DB. rootd protocol for accessing data in CASTOR without rfio - any comments from the RAL side? Gareth - will have to follow up. For PIC: remaining jobs for DC06 stripping still there - just 20/30 jobs failing in staging files in the last 24H. In Gonzalo's hands, looking at the problems, probably a tape problem. LFC migration: asked Sophie how many entries - about 1.5M replicas to be changed. Agreed it must be done in collaboration with the 3D people - will affect streams and hence needs planning / scheduling.

Sites round table:

  • RAL - had a request about xrootd from LHCb - need to check the details.

It seems RAL first got an xrootd request and today got a rootd request, and the latter arrived in the form of a helpdesk ticket but not the former. One understanding is that LHCb want rootd because (a) it is not RFIO and (b) even if it does use RFIO internally it retries if RFIO fails. RAL is looking at whether the rootd server daemon interfaces with the TCastorFile class and how this can be implemented.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

  • Roberto - does ATLAS have problems accessing the conditions DB at sites? A: yes, hence the dedicated meeting. Heavy load. Tomorrow at 13:30. The main site where this was tested was SARA, with Oracle auditing turned on. The site is currently in a critical situation and hence reprocessing tests cannot be done there. Discussion to move elsewhere, e.g. CNAF, TRIUMF, Lyon. 20-30 connections per job, 30 jobs.

Friday:

Attendance: local(Simone, Jamie, Harry, Andrea, Eva, Miguel, Gavin, Maria, Roberto, Patricia, Jan, Olof);remote(Gareth, Jeremy).

LHC status: progress report, week 1

elog review:

Experiments round table:

  • ATLAS - nothing to report concerning sites & site issues. Meeting with IT DB experts on conditions DB access & reprocessing. Richard Hawkings explained the workflow and access patterns. Part of the problem will be cured by a new ATHENA release about to be deployed. Part of the problem comes from the workflow (steering and release of jobs); maybe tune some parameters, e.g. not starting bunches of jobs together. Repeat the test in a more controlled way, T1 by T1, coordinating with the DB team. Jeremy - gLite update problem with the Python path breaking ATLAS. In contact with Graeme. Likely to affect also other sites. The latest release of gLite omits $LCG_LOCATION/lib/python from PYTHONPATH, which breaks everything for ATLAS (user jobs and production) - see the sketch after this list. Simone will talk to Remi today.

  • CMS - little to say, all quiet due to the LHC magnet quench. Bug assigned to the dashboard for a timeout: the interactive page to see the status of submitted jobs was not working for one hour yesterday.

  • LHCb - quiet week due to the LHCb s/w week. 1) MC dummy production proceeding smoothly. All sites running their payload smoothly. 2) A couple of new T1 stripping activities coming soon. Reprocessing of DC06 production as requested by the physics group. All jobs still waiting to be picked up by pilots except at GridKA where they are already running. 3) Stripping - PIC - why are some files problematic in staging? Gonzalo. 4) Accident this morning + yesterday at RAL + PIC + Pisa etc: a user filling the home directory with jobs. User advised. Some sites ban the whole VO!! Production meeting questions whether the whole VO should be banned?? Possibility from the LCG side to limit the size of the output? Andrea - why in the m/w and not the batch system? Jeremy - our process, if the user can be identified, is to ban the user and not the VO. Check with Pisa why the VO queue is closed.

  • ALICE - announced a new upgrade of AliEn at all sites. Migrating to v51. Upgrade going smoothly, with the usual small issues. One issue with RAL - migrating to the WMS and not the RB any more, but the WMS is not working. Registered but giving mapping problems wrt authorization. Same problem seen last week. The s/w area seems to have problems - it must be shared between the WNs and the VObox. The config is currently not seen by the WNs. WNs can see some of the s/w packages but not all. Jeremy - wrt the WMS, RAL tests seem ok - an ALICE specific problem. Patricia - cannot submit with her cert. Will send mail after the meeting.
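
On the gLite/PYTHONPATH problem in the ATLAS item above, a minimal sketch of the kind of job-side workaround a site or user could apply is given below. This is an illustration under assumptions (it is not the agreed fix, and the module names in the final comment are only examples of what lives in that directory): it simply re-prepends $LCG_LOCATION/lib/python before importing anything that needs the LCG bindings.

# Sketch of a job-side workaround (assumption, not the official fix): the
# latest gLite release no longer puts $LCG_LOCATION/lib/python on PYTHONPATH,
# so re-add it before importing anything that needs the LCG Python bindings.
import os
import sys

lcg_location = os.environ.get("LCG_LOCATION")
if lcg_location:
    lcg_python = os.path.join(lcg_location, "lib", "python")
    if os.path.isdir(lcg_python) and lcg_python not in sys.path:
        # Prepend so the LCG modules win over anything else on the path.
        sys.path.insert(0, lcg_python)

# After this, imports of the bindings installed under that directory
# (e.g. the lfc or gfal modules, names given only as examples) should
# resolve again for user jobs and production alike.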

Sites round table:

  • RAL - will turn off SRM v1 at the end of the month.

  • CNAF CASTOR post-mortem - attached below.

Core services (CERN) report:

  • Ongoing investigations into the slowness of scheduling transfers at FZK - Paolo is looking at it.

DB services (CERN) report:

  • Meeting with the DB sites for DB operations. Not much to report from sites in terms of interventions in the pipeline. CNAF DBs for 3D have been rolled back to 10.2.0.3 and still need to be upgraded to 10.2.0.4 due to a problem during the previous upgrade. Hope to do this in the first half of October. Some issue related to a discussion triggered by a cursor sharing bug in Oracle - appeared 2 years ago in 10.2.0.4. Seems to be manifesting itself now and again. Upgrades do not seem to fix it fully. 10.2.0.4 seemed to contain a good bug fix but there is a new patch for it(!) - a rolling patch. Proposing to deploy on the validation clusters next week, run there, and upgrade the production servers a minimum of 2 weeks later. No evidence of problems from this bug at CERN. However, it affected LFC and VOMS last time. Popped up in relation to CASTOR at outside sites.

  • For the LFC migration, 1.4M entries. Through streams this would take several days to be replicated. Proposal: do as follows. 1) Sophie will prepare a script, which will be tested at CERN (a minimal sketch of such a bulk update is given after this list). When ready, stop streams and set the tag to stop changes being replicated. Run the change at CERN. Then run the change at each T1. Then restart streams replication. 2 questions: a) will the LFC service be stopped? Stop activity so that it runs faster... stop at least writes. b) Presume the script can be run at the destination sites too? Q: how often such changes? < or << 1 per year. Coordinate with RAL re the above. Tentatively early October.
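
A minimal sketch of the kind of bulk SFN rewrite being discussed is given below. The DSN, table name (cns_file_replica) and column name (sfn) are assumptions about the LFC Oracle back-end schema and would have to be confirmed with the LFC and 3D teams; the real script is Sophie's to prepare and would only be run with streams replication stopped, as described above.

# Hedged sketch only: rewrite SRM v1 host names to the SRM v2 host in the
# LFC replica table. The schema/table/column names and the connection string
# are assumptions, not the confirmed LFC schema; run only with streams
# replication stopped, first at CERN and then at each T1.
import cx_Oracle

OLD_HOST = "srm-v1.example-tier1.org"   # placeholder SRM v1 endpoint host
NEW_HOST = "srm-v2.example-tier1.org"   # placeholder SRM v2 endpoint host

conn = cx_Oracle.connect("lfc_admin/xxxxx@lfc_db")   # placeholder credentials/DSN
cur = conn.cursor()

# Single bulk UPDATE rather than row-by-row changes, so the catalogue is
# consistent immediately and the same statement can be re-run at each site.
cur.execute(
    "UPDATE cns_file_replica "
    "   SET sfn = REPLACE(sfn, :old_host, :new_host) "
    " WHERE sfn LIKE '%' || :old_host || '%'",
    old_host=OLD_HOST, new_host=NEW_HOST)

print("Rows updated: %d" % cur.rowcount)
conn.commit()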

Monitoring / dashboard report:

Release update:

AOB:

  • Emphasize that issues reported here should be associated with a GGUS ticket + an elog entry. Sites cannot react to problems they do not know about! 2nd: hardwired CEs in applications have been deprecated for some time - also discussed at the GDB last week!

  • ALICE cannot make the corresponding change from one day to the next. ALICE does not trust the BDII to find CEs. Cannot this be announced in this meeting?

  • Hard-coding nodenames is strongly deprecated. CE retirements are announced at the LCG SCM.
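
As an illustration of discovering CEs from the information system instead of hard-coding node names, a minimal sketch is given below. It assumes a top-level BDII reachable at lcg-bdii.cern.ch:2170 and the GLUE 1.x attributes GlueCEUniqueID / GlueCEAccessControlBaseRule; the actual endpoint, VO name and filter would have to be adapted to the experiment's configuration, and this is not any experiment's production code.

# Minimal sketch (assumptions as noted above): query the top-level BDII for
# the CEs that advertise support for a given VO, instead of keeping a
# hard-coded list of CE host names in the application.
import ldap  # python-ldap

BDII_URI = "ldap://lcg-bdii.cern.ch:2170"   # assumed top-level BDII endpoint
BASE_DN = "o=grid"                          # GLUE information tree base

def discover_ces(vo):
    """Return the GlueCEUniqueIDs of CEs that accept jobs from `vo`."""
    conn = ldap.initialize(BDII_URI)
    # GLUE 1.x: a CE lists the VOs it accepts in GlueCEAccessControlBaseRule.
    search_filter = (
        "(&(objectClass=GlueCE)"
        "(GlueCEAccessControlBaseRule=VO:%s))" % vo
    )
    results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                            search_filter, ["GlueCEUniqueID"])
    return sorted(set(attrs["GlueCEUniqueID"][0] for dn, attrs in results))

if __name__ == "__main__":
    # VO name is only an example; retired CEs simply stop being published
    # and therefore drop out of this list automatically.
    for ce in discover_ces("alice"):
        print(ce)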
Topic attachments

  • post-mortem_of_September_7_CNAF_CASTOR_problem.pdf (PDF, 48.7 K, 2008-09-19, JamieShiers) - post-mortem of CNAF CASTOR problem 7 September