Week of 050711

Open Actions from last week:

  • LEMON sensors for
    • FTS - Gavin UNDERWAY
  • Gavin: Install and publish UI details at CERN - DONE
  • Gavin: install last FTS version and test.
  • call WORK not enough -> review which errors should be reported 24*7 - Sophie DONE
  • LFC log monitoring: how often should the log be parsed ? (ORA + number of threads used) -> every half hour. DONE (passed on to LEMON team by Sophie)

On Shift:

  • David/Roberto


Log: System worked over the weekend - managed a peak of 700MB/s from WAN pool + 200MB/s from castor1 pool concurrently. Lots of problems seen with remote SRMs. 1 hour downtime for all at ~11pm last night - FTS problem?

New Actions:

  • Mail out plan for throughput phase to sites - James - DONE
  • Move all users over to castor2 - James WAITING
  • Look at why gridftp monitoring is broken - David/Roberto - DONE
  • Hourly tests (like SFT) - Simone FIRST VERSION DONE
  • See why FTS may have died - Gavin DONE
  • Publish FTS Pilot details to expts - Sophie DONE
  • LFC Pilot migration for GSSDATLAS - Sophie DONE


Log: castor2 breakage last night at 5.30PM. Very high load on machines. Problem solved after 2 hours. Reboot this morning. Gridftp logs now working. R-GMA will take responsibility for putting in missed data (3 days).

Update on actions:

  • hourly tests in progress - will run on 26d
  • NCM component was removing users - fixed by Vlado
  • watchdog script doesn't restart if stale lockfile
  • Pilot migration ongoing - a few days to go
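
The stale-lockfile problem noted for the watchdog script can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the lockfile is taken to contain the owner's PID as text, and the function names and lock path are hypothetical, not taken from the actual watchdog script.

```python
import os

def lock_is_stale(lock_path):
    """Return True if the lockfile exists but its owner is no longer running.

    Assumes (hypothetically) that the lockfile holds the owner's PID as text.
    """
    try:
        with open(lock_path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return False  # no lockfile at all, so nothing is stale
    except ValueError:
        return True   # empty or unreadable contents: treat as stale
    if pid <= 0:
        return True   # not a valid process ID
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return True   # owner has gone away
    except PermissionError:
        return False  # process exists but belongs to another user
    return False

def acquire_lock(lock_path):
    """Take the lock, clearing a stale lockfile first. Returns True on success."""
    if lock_is_stale(lock_path):
        os.remove(lock_path)
    try:
        # O_EXCL makes creation fail if a live owner still holds the lock
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True
```

With a check like this, a watchdog that finds a leftover lockfile from a dead process would clear it and restart, rather than refusing to start.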

Info: Simone - GSSDATLAS had a production meeting with schedules for next seven months.

New Actions: James/David - tell site of plans for this morning. DONE

quick site call at 3? NOT TO DO

Need to reboot all nodes (castor1 + castor2) DONE
lcg-mon-gridftp details DONE (1 time per hour, 1 restart, call operators)
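
One possible reading of the lcg-mon-gridftp policy (check once per hour, attempt one restart, then call operators) is sketched below. This is an interpretation, not the actual monitoring configuration: the function name and the three callables are hypothetical, and the hourly scheduling is assumed to be handled externally (for example by cron).

```python
def check_once(is_healthy, restart, call_operators, max_restarts=1):
    """Run one monitoring cycle and return 'ok', 'restarted' or 'escalated'.

    is_healthy, restart and call_operators are hypothetical callables
    supplied by the caller; max_restarts=1 mirrors the "1 restart" note.
    """
    if is_healthy():
        return "ok"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "restarted"
    # The service is still down after the allowed restarts: escalate
    call_operators()
    return "escalated"
```

Passing the checks and actions in as callables keeps the escalation logic separate from how the service is probed or restarted.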


Log: Saw some outages due to castor2 bug being hit. Bug fixed at 17:00, and rate went back up. Rate not high over rest of night, due to not enough channels having data.

Simone waiting on lcg-utils with timeouts in order to finish the hourly monitoring scripts.


  • David/Roberto : Files deleted via ns from castor. DONE
  • James/Yaodong: lcg-cr with timeouts DONE
  • James/Vlado: upgrade rpms on FTS nodes DONE

New Actions:

  • Jan/Sophie/Laurence : How to run "central" LEMON sensors


Log: Cleaned up entries from castor ns and recopied the 001 files (5 had a problem - understood to be oplapro76). System kept running. Need to understand why rates have dropped.


  • David/Roberto/James : Debug why site rates have dropped
  • ...: Communicate with sites on tuning parameters DONE
  • Vlado/JanVEldik: publish bhosts info DONE
  • All: Go through the 2nd level support
  • FTS: 3rd level support DONE

Talks for GDB:

  • David - POOL/LFC
  • Patricia - Exp. interaction
  • Gavin - Multi-VO FTS
  • ?? - Monitoring
  • James - Overall status
  • Vlado - castor2 status


Log: Finished copy of the 'bad' files in castor - could now get them out. Tuning of sites - no files/streams. Buffer size problem on oplapros: 4MB, not 2MB. 1 SRM died on oplapro79 - not clear why it died. LEMON hadn't picked it up - Jan has fixed that. Alarms on Ben's test box, which we moved out of production.


  • SRM Copy problem - trying to reproduce/debug
  • SRM Pools - keep separate WAN/PHEDEX until end July/ poss. end aug.


  • Ben: check oplapro79 to see if system killed SRM process
  • Olof: Try with one node moving
  • David: bdflush parameters. lxshare220d.
  • James/Ben: check access to oplapro80 for Ben.

Topic revision: r8 - 2007-02-02 - FlaviaDonno