Week of 080526

Open Actions from last week:

Daily CCRC'08 Call details

To join the call at 15.00 CET, Tuesday to Friday inclusive (usually in CERN building 28, R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Kors, Nick, Gavin, Roberto, Jan, Simone, James, Jamie, Andrea);remote(Daniele, Alberto Pace, Pepe Flix/PIC, Andreas Heiss/FZK).

elog review:

Experiments round table:

  • ATLAS (Simone): the plan - run the full exercise. First news - have to start tomorrow morning. Cosmic data from the w/e (>30TB) is being distributed and has priority. A variety of things still to set up - take a day to fix them; it would still be good to have the exercise for 4 days. This would give too little ESD data for T1+partner(s), while T1+all others would be too much. Thus 2 days of functionality test & then 2 days of throughput test - all subscribed everywhere. The metrics will be different - TBD. Continue with 'fake' MC production (the data can be deleted at the end of the exercise), but a lot of jobs at T2s and T1s. Will start tomorrow with the agreement of the people who define the jobs. That's the plan!

    And now the problems... The "good news" is that the alarm procedure set up for CERN works(?) beautifully: the CASTOR SRM went down at 10:30, an alarm was sent, the problem was fixed in ~1 hour, and a call came from the SMOD. Jan - not quite as seamless: by lucky chance the SMOD read the e-mail and GGUS ticket and called the experts, who noted that the offending machine had already been rebooted(!). The problem started at 07:00 and lasted until 11:00, when the DB server was rebooted. Being followed up... This has happened twice in 3 days and it is not clear it is the same problem - a variation on a theme: the DB server got too many connections and could no longer serve requests (a different ORA error). Gav's cronjob to work around another problem was very active, but the root problem was different, hence it didn't help. A post-mortem has been started. Not clear why it suddenly went wild at 07:00; hunch: a problem with one of the MSS? Try to limit the damage if something similar occurs: limit the number of user sessions per account to 200, and enable "keep-alive" on the DB server, propagating this change also client-side - (hope to) prevent the server falling over... Asked ATLAS if we can move forward the RAC h/w upgrade and move the ATLAS account - not decided... Can it be done early tomorrow morning? Or Thursday? Try for the first slot with the second as fallback... This would / should / could relax the load. Some parameters can be set in the SRM config - number of threads, number of connections per front-end / back-end node, SOAP backlog. Tuning could help with "cgsi soap".
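    For illustration only - the actual mitigation was done in the CASTOR SRM / Oracle configuration, not in Python - a minimal sketch of the client-side "keep-alive" idea mentioned above: enabling TCP keep-alive on a client connection so that half-dead sessions are detected and torn down instead of accumulating on the DB server. The host, port and timing values are assumptions for the example.

      import socket

      def connect_with_keepalive(host="srm-atlas.example.cern.ch", port=8443):
          """Open a TCP connection with keep-alive probes enabled (Linux options)."""
          s = socket.create_connection((host, port))
          s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
          # Linux-specific tuning: start probing after 60s idle, probe every 30s, give up after 5 probes.
          if hasattr(socket, "TCP_KEEPIDLE"):
              s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
              s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
              s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
          return s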

  • LHCb (Roberto): observation - fairly good reconstruction and data rates over the w/e at 5/7 sites. RAL & IN2P3 banned, waiting on the cause of the reconstruction jobs crashing there - mis-configuration of the sites? Managed to reproduce the good results of 2 weeks ago; moving forward to stripping: 11K stripping jobs - ancestor resolution. Issues: the working directory for grid jobs goes through /tmp, and tmpwatch is cleaning up files that are still being accessed by running jobs. This will become a problem - to be addressed by site admins(?). Put in place a quick and dirty hack to shield the issue - only for production jobs, not user analysis jobs. SARA - metadata problem; poor SRM performance also at this site...
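    The minutes do not say what the "quick and dirty hack" is; purely as a hypothetical illustration of one way such a shield could work, the sketch below periodically refreshes the timestamps of a production job's working files so that an age-based cleaner like tmpwatch does not consider them stale while the job is still running. The directory name and interval are assumptions.

      import os, time

      def keep_workdir_fresh(workdir="/tmp/lhcb_prod_job", interval=3600, rounds=24):
          """Touch every file under workdir at regular intervals (hypothetical shield)."""
          for _ in range(rounds):
              for root, _dirs, files in os.walk(workdir):
                  for name in files:
                      path = os.path.join(root, name)
                      try:
                          os.utime(path, None)   # reset atime/mtime to 'now'
                      except OSError:
                          pass                   # file may have vanished; ignore
              time.sleep(interval)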

  • ALICE (Patricia, by e-mail):
    1. The workplan of VOBOX migration and testing is complete.
    2. AliEn v2-15 will probably be ready for deployment at sites this week - it will be announced to the sites in time.

Sites round table:

  • TRIUMF - CA problem solved, but see the comment below on site emergency contacts...

Core services (CERN) report:

  • On Friday we exercised the "emergency contacts" phone number for TRIUMF (see site contacts). The procedure clearly needs to be refined (i.e. key-words, action(s), etc.)

DB services (CERN) report (Maria, by e-mail):

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Jamie, Harry, Roberto, Patricia, Simone, Andrea, Miguel, Gavin);remote(Gonzalo, JT, Jeremy Coles, Derek).

elog review:

Experiments round table:

  • LHCb: see LHCb CCRC08 meeting (minutes) of 26 May for a useful summary!
  • CMS: We have had a productive day. There are a few failures here and there, as reported by Maarten and Victor, but overall it looks very good. Details below. Not counting FNAL, we are running ~2.2K jobs at T1s for ReReco and we have processed 37.2M events over the last 24h. For skimming we are at the moment only running ~100 jobs at IN2P3 and FZK; we'll start another round at some point tomorrow. Other tests continue as expected. For more details on them, see the CMS CCRC08 Phase 2 Operations elog.
    • ASGC: ~300 running jobs, 3.6M processed events (many workflows)
    • CNAF: ~250 running jobs, 2.5M processed events (JetET20)
    • FNAL: ~3100 running jobs, running all kinds of workflows; ~28M unmerged + ~26M merged evts in the last 24h, all re-reco (see ProdMon plot)
    • FZK: ~900 running jobs, but only 5.1M processed events (MinBias) since we have a large failure rate
    • IN2P3: ~300 running jobs, ~2.0M processed events (MuonPT11)
    • PIC: ~350 running jobs (good), 11.0M processed events (JetET110 plus cosmics)
    • RAL: only ~120 running jobs at this moment (!), but we have processed 18.2M events (cosmics, but still not bad :-))
  • LHCb (Roberto): smooth pit-T0-T1 transfers. Ancestors problem identified - a flaw in the online system logic; fixed (purely LHCb-side). Recons currently running happily only at CERN; all other sites have issues, mainly with file access - many elog entries. SARA: our recons app crashing - dcap port dead; Ron rebooted the port - should now be ok(?). General 'slowness' in querying SRM metadata also fixed - 2 pool nodes were hanging and were restarted. NIKHEF - RAS, apart from the /tmp cleanup issue. PIC: file access - dcap connection (GGUS ticket); same for FZK, which also has the /tmp problem. RAL - long-standing rfio issue. CNAF - recons/stripping jobs: some fail, some ok(?) - no clues yet. IN2P3: dcap problem accessing data through the local file access protocol. Reintegrated RAL & IN2P3 this morning with a new mechanism of copying the file to the WN and opening it locally; seems to work fine - analyse this later...
    Next week: start analysis and MC production in parallel to the current activity.
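    A rough sketch of the "copy to the WN and open locally" fallback described above, under stated assumptions: the direct-protocol open and the copy command (lcg-cp) are placeholders for the example, not details taken from the minutes.

      import os, subprocess, tempfile

      def open_input(turl, surl):
          """Try direct (remote-protocol) access first; on failure, download a local copy."""
          try:
              return open(turl, "rb")                      # placeholder for dcap/rfio access
          except OSError:
              workdir = tempfile.mkdtemp(prefix="wn_input_")
              local = os.path.join(workdir, os.path.basename(surl))
              # Assumed copy tool; any grid copy client would play the same role here.
              subprocess.check_call(["lcg-cp", surl, "file:" + local])
              return open(local, "rb")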

  • ALICE (Patricia): phase III postponed until June. Not much else - AliEn tests with the VO boxes at CERN.

  • ATLAS (Simone): the inclusive exercise (Tx-Ty) started this morning. 8 hours' worth of data taking had been produced but not dispatched; all of it was subscribed this morning to see how the 8-hour backlog is digested. After 1 hour 2 sites had caught up; after 2 hours all had. Now at nominal rate - no particular issues...
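    A back-of-the-envelope check of what "caught up in 1-2 hours" implies, assuming data kept arriving at the nominal rate while the 8-hour backlog was drained (an assumption, not stated in the minutes):

      def catchup_factor(backlog_h, catchup_h):
          """Multiple of the nominal rate needed to drain a backlog while new data keeps arriving."""
          return (backlog_h + catchup_h) / catchup_h

      print(catchup_factor(8, 1))   # ~9x nominal for the sites that caught up in 1 hour
      print(catchup_factor(8, 2))   # ~5x nominal for the sites that took 2 hours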

Sites round table:

  • NIKHEF (JT): new Thumpers in production last week; still gaining experience. Hit a bug when growing a partition: 2TB of ATLAS data (ATLASDATADISK) currently not accessible - keep until the end of the week and then dump, don't try to recover the files. The bug is in growing an XFS filesystem beyond 2TB: an integer overflow corrupts the superblock.
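    As a plausibility check of the "integer overflow beyond 2TB" statement (assumption: the overflowing quantity is a signed 32-bit block count; the minutes give no detail):

      MAX_INT32 = 2**31 - 1
      print(MAX_INT32 * 512  / 1e12)   # ~1.1 TB if the counted unit were 512-byte sectors
      print(MAX_INT32 * 1024 / 1e12)   # ~2.2 TB with 1 KiB blocks - matching the ~2TB limit seen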

  • SARA: still troubles with the new dCache release (p5) - the gsidcap doors are less stable. Should other sites upgrade??

Core services (CERN) report:

  • SRM ATLAS intervention Thursday, 1 hour at 09:00 - SRM software problem with the handling of DB connections; moving the DB to a bigger setup in the hope of reducing exposure. Brought forward due to the recent problems. The FTS queues for ATLAS only will be stopped shortly before 09:00 and restarted afterwards.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

Attendance: local(Andrea, Roberto, Harry, Gav, Simone, Miguel, Maria, Julia, Patricia);remote(Gonzalo, JT, Daniele, Michel, David/Marc/IN2P3, Andreas).

elog review:

Experiments round table:

  • CMS (Daniele) - see the notes in yesterday's Twiki. Today - mostly transfers & T1 processing. All T1s working nicely at full capacity. Issues with skimming at PIC - not a site issue but a framework issue (CMS). Parallel reco of s43, re-reco of s156 & skimming. Very few failures, mainly memory issues and some reading problems. CSA team: the other exercises also finished on schedule despite the 10-day delay in starting. Transfers - a change wrt the planning: redo T1-T2 for PIC & RAL; will contact the sites to fix a new schedule (HN). Stop all iCSA processing Tuesday next week. The CCRC tests may end before that - some may continue to give additional post-mortem data. Gonzalo: skimming at PIC generates a lot of internal bandwidth - do other T1s see this too? Not so far - should it also be checked at CASTOR sites? Will rerun some skimming at PIC to try to understand more and feed back to the framework people so that they can fix it - it seems to be related to that. Gonzalo - is it consistent that other dCache sites don't see it? Data not yet collected; PIC is maybe the first to do a real analysis of this. Look more next week...

  • LHCb (Roberto) - only good news. All problems reported yesterday are understood or fixed. PIC - faulty WN (IP problem); CNAF - gfs problem fixed. IN2P3 unbanned and rolled back to remote file access (to demonstrate whether the problem is fixed or still there). RAL - working peacefully but using the downloading approach. SARA: several issues fixed yesterday by Ron; some problems with bringOnline using gfal - still under investigation. FZK - problem with the dcap connection; now running fine with very few jobs failing. AOB: discussions on increasing the share at PIC, which could sustain more than it gets now; this could complicate the analysis of CCRC, hence postpone to next week and increase then. Next week - as a tail - run some analysis & MC.

  • ATLAS (Simone) - no big issues. 3 small problems during the night: a problem bringing files to ASGC - FTS in Taiwan (jobs remained in the submitted state forever) - followed up by Jason and fixed, backlog cleared; NDGF - due to the UK CA upgrade, fixed this morning, no backlog. All sites performing well for the functionality test. Tomorrow, after the CASTOR upgrade, subscribe all to all - 'full steam reprocessing'. Production - the inputs have been generated; should start this afternoon. Miguel - rate tomorrow? CASTOR - still 450MB/s in and 700MB/s out. The big load is on the Tier1s as they replicate all to all; for the Tier1s it depends on their size, but all should sustain 60MB/s from the other Tier1s plus the rate from CERN.
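    Rough arithmetic behind the quoted rates, with an equal-share split of the CERN export across the Tier1s assumed purely for illustration (the real split follows the experiment shares):

      n_tier1 = 10                       # assumed number of ATLAS Tier1s
      from_other_t1s = 60.0              # MB/s, quoted above
      cern_export = 700.0                # MB/s, quoted CASTOR "out" rate
      from_cern = cern_export / n_tier1  # ~70 MB/s under the equal-share assumption
      print(from_other_t1s + from_cern)  # ~130 MB/s inbound per Tier1 during the test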

  • ALICE (Patricia) - green light for the latest AliEn v2-15 version; deploy at all sites -> beginning of run 3. Activate the sites and start...

Sites round table:

Core services (CERN) report:

  • Reminder of move of ATLAS SRM DB tomorrow. (Agreement on intervention on C2LHCb next Thursday)

DB services (CERN) report:

  • Intervention 17:00 - 19:00 on ATLAS RAC of TRIUMF - transparent - to increase memory from 4GB to 10GB per node.

Monitoring / dashboard report:

Release update:

AOB:

Thursday

Attendance: local(Jamie, Harry, Simone);remote(Derek, Michel, Jeremy, Daniele, Gonzalo).

elog review:

Experiments round table:

  • CMS (Daniele): continuing - the post-mortem has started & is ramping up. Collecting plots for the full fac. ops meeting (Fridays) next week - site reports (T1s & connected regions) plus CERN --> post-mortem workshop. Recons - fine; job submission & WMS ok. High throughput and very good performance of all T1s: more than 100M events for each of the rounds, 5-6 days each. Long transfer & merge tails -> work closely with the sites. Skimming (pure CCRC) shows interesting features: the CMS application needs some I/O optimisation - skimming can bring sites down! T1-T1 basically closed. T1-T2 continuing - a long tail into early next week. Job submission - routinely 100K jobs/day to all tiers, peaking at 200K jobs. Analysis exercises: individuals can submit 100K jobs in 10 days. Stop-watch exercise - also early next week. Extension of the transfer tests into e.g. T1 -> CERN: end-to-end tests - prompt reco + transfer to T1s & measure latency; trying to finalize... Rerun of the February test of P5 to CERN, including writing to tape.
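    A quick rate check on the quoted CMS numbers (the 5.5-day round length is the midpoint of the quoted 5-6 days, taken as an assumption):

      events_per_round = 100e6
      round_days = 5.5
      print(events_per_round / (round_days * 86400))   # ~210 events/s sustained across the Tier1s
      jobs_per_day = 100e3
      print(jobs_per_day / 86400)                      # ~1.2 job submissions/s on average, ~2.3/s at the 200K peak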

  • ATLAS (Simone): the last week is running relatively smoothly. The mode of the T1-T1 tests switched from functionality to throughput: all ESDs subscribed to all other T1s. Going ok - some issues with a few T1s (solved ones skipped): RAL - problem with srmPut into storage (see elog); also a problem at FZK (network problem - being fixed). Derek - RAL was overloaded earlier today. MC production ramped up over night: 20K running jobs - real jobs, simulation & reconstruction. Haven't observed any degradation (jobs vs transfers & vice versa). Continue into tomorrow... Load generation stops tomorrow at 12:00; the data flow continues. Saturday: the last part of M7 - including all detectors - will start, i.e. CCRC stops. A deletion exercise will start over the w/e into Monday - room is needed for the FDR data for next week. Michel - jobs only at T1s or also T2s? Also T2s. French T2s see no jobs - drop a mail to Stephane.

  • LHCb: Things are running smoothly as far as transfers are concerned (T0-T1, Pit-T0, T1-T1 upload of RDST to T1 from the WN). After a move to the new Online storage there were some problems that caused minor stoppages in the transfers last night. Reconstruction is also running fine at all sites except NL-T1, where we have a long-standing issue with the staging at SARA.

    Downloading of the input data locally to the WN is being used at IN2P3 / RAL.

    The issue at SARA (GGUS ticket #36800, open since the 26th) continues, with all metadata queries returning that the files are not online. The suspicion is that there is an inconsistency in what is reported via SRM, e.g. the file can be on tape and on disk but is not reported as online (see the entry in the elog: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/369; a minimal sketch of the locality check is given after this item).

    For the stripping, a small issue in the workflow.
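    A minimal sketch of the locality check implied above, assuming the standard SRM v2.2 file-locality values (ONLINE, NEARLINE, ONLINE_AND_NEARLINE); the SARA symptom is that files with a disk copy keep coming back as not online:

      STAGED = {"ONLINE", "ONLINE_AND_NEARLINE"}

      def is_online(file_locality):
          """True if the SRM-reported locality means a disk copy is available."""
          return file_locality in STAGED

      assert is_online("ONLINE_AND_NEARLINE")      # file on tape AND on disk -> usable
      assert not is_online("NEARLINE")             # what the metadata queries at SARA keep returning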

Sites round table:

Core services (CERN) report:

  • From the announced maintenance: "One LCG backbone router (l513-c-rftec-4.cern.ch) will be configured to provide some special debug output." This maintenance was completed successfully without any impact on the traffic.

  • ATLAS SRM DB - following the upgrade, a burst of timeouts. The paused FTS channels -> spike in the number of requests sent to the SRM; stagerGet times out. The reason for the timeouts at very high load is not understood yet - CASTOR or SRM? -> more analysis.

DB services (CERN) report:

  • Arranging with experiments & LCG community dates for migration to 10.2.0.4. A couple of hours downtime per service are to be expected.

Monitoring / dashboard report:

Release update:

AOB:

  • T2 session at workshop. (Daniele) get input from Ken etc.
  • FTS logs: should a report be produced from the May run, or should the experiments look at the logs themselves? Gav - there is lots of info in the DB, e.g. on possible contention between experiments; will try to produce something... T1-T1: what about the logs of the T1 FTS servers? No plans - at CERN more monitoring is kept in the DB, and some of that is not quite ready for the T1s running on the CERN DB...

Friday

Attendance: local(Jamie, Sophie, Andrea, Maria, Julia, Harry, Simone);remote(Michael, David+Julien/IN2P3, Michel, Daniele, Derek, Gonzalo, Stephen).

elog review:

Experiments round table:

  • CMS (Daniele): longish reports over the past couple of days; focus now on the following days. Next week - long tails of the T1-T2 transfers, to give all sites the opportunity to be tested in the same way (non-regional transfers); re-run of the CRUZET-1 exercise, if possible with a first repacker T0 component in it. FZK sees issues importing data from CERN; GGUS #36971 posted (since the ELOG is still down). They have also been learning a lot about dCache on-site, and have now lost a day due to the CERN power cut, so we will give some days of grace next week for the T0-T1 tests. FTS - transfers out of CERN since 09:00 GMT but the rate is very low. Actions to recover FTS? Sophie - will check.

  • ATLAS (Simone): a brief statement on the situation after the power cut. FTS? The load-generators stopped with the power cut. Exporting very little from CERN right now - to be checked. Transfers to RAL for the last 2 hours. 1 remaining issue after the power cut - the PANDA server at CERN: the machine can be pinged but one cannot ssh into it. Birger tried a remote boot - it did not work. Asked the console operators - in the queue! To be followed up... Harry will check...

  • ALICE (Patricia): We continue with the alien deployment at all sites, therefore nothing really new to report.

  • LHCb (Roberto): RAS

Sites round table:

Core services (CERN) report:

  • LCG OPN: Following the power incident this morning we have detected a problem on the router module that connects to IN2P3. We need to intervene on it immediately.
    DATE AND TIME:
    Friday 30th of May, 14:20 to 15:00 CEST
    IMPACT:
    The LHCOPN CERN-IN2P3 link will flap several times. Traffic to IN2P3 will be rerouted to the backup path.

  • ELOG: it seemed to have a problem booting from its disk - the sysadmin team is now looking at it...

  • All services should be back or coming back. Report any problem...

DB services (CERN) report:

  • Affected rather heavily by the power cut. The setup is a mixed configuration with half of the servers on critical power and half not; half of the servers went down - not a transparent situation. The problems are all under investigation. We were not called by the computer operators - only DES was called. We detected the problem at 08:00 when the DB monitoring came back. Problem with the CMS RAC - no clear disk replacement procedure, manual intervention required; fixed at 10:30. ATLAS - nodes 1 & 2 ok, nodes 3 & 4 went down; boot sequence problem - fixed at 11:30. LCG RAC - node 1 not affected initially, but a problem was detected with corrupted clusterware files; fixed only at 11:00. Daniele - we need to understand when services are reliably back in production. LHCb RAC - startup sequence problems; back at 11:30. Remaining issues: 3D OEM & Streams monitoring - a filesystem corruption problem. Conclusion: not a transparent exercise! For the sake of a few kW we got several hours of downtime and a complicated recovery.

Monitoring / dashboard report:

  • Survived more or less smoothly. The CMS dashboard lost some info. The MonaLisa server on the old host did not come up - have to migrate; a new machine is ready. The new MonaLisa server serving analysis restarted smoothly and should not have lost data. The CMS dashboard has a copy of the data on disk and so can reload data if necessary. For ATLAS, if the DB is not available some info will be lost. Simone - for the transfer dashboard? A: mostly job monitoring. Simone - the transfer dashboard should in principle be as robust as for CMS. Andrea - submission of the CMS SAM tests was moved to the CMS VO box (a machine with 8 cores); the submission frequency was increased to once per hour.

Release update:

AOB:

-- JamieShiers - 22 May 2008
