Week of 050711
Open Actions from last week:
- LEMON sensors for
- Gavin: Install and publish UI details at CERN -
DONE
- Gavin: install last FTS version and test.
- call WORK not enough -> review which errors should be reported 24*7 - Sophie
DONE
- LFC log monitoring: how often should the log be parsed? (ORA errors + number of threads used) -> every half hour (see sketch below).
DONE
(passed on to LEMON team by Sophie)
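A minimal sketch of the kind of half-hourly check agreed above, assuming a cron entry such as "*/30 * * * *"; the log path, the state-file location and the "lfcdaemon" process name are assumptions, not the actual LEMON sensor:

#!/usr/bin/env python
# Hypothetical half-hourly LFC log check: scan only the log lines added
# since the previous run for ORA- errors and count lfcdaemon processes
# (threads) via /proc. Paths and process name are placeholders.
import os, glob

LOG = "/var/log/lfc/log"                 # assumed log location
STATE = "/var/run/lfc-logcheck.offset"   # remembers where the last scan stopped

def new_lines(path, state):
    offset = 0
    if os.path.exists(state):
        text = open(state).read().strip()
        if text:
            offset = int(text)
    if os.path.getsize(path) < offset:   # log rotated: start from the top
        offset = 0
    f = open(path, "rb")                 # binary, so offsets are plain byte counts
    f.seek(offset)
    lines = f.readlines()
    open(state, "w").write(str(f.tell()))
    f.close()
    return lines

def lfc_process_count():
    count = 0
    for cmdline in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            if "lfcdaemon" in open(cmdline).read():
                count += 1
        except IOError:
            pass                         # process exited while scanning
    return count

if __name__ == "__main__":
    ora = [l for l in new_lines(LOG, STATE) if b"ORA-" in l]
    print("%d ORA- error line(s) since last check, %d lfcdaemon process(es)"
          % (len(ora), lfc_process_count()))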
On Shift:
Monday:
Log: System worked over the weekend - managed a peak of 700MB/s from the WAN pool + 200MB/s from the castor1 pool concurrently. Lots of problems seen with remote SRMs. 1 hour of downtime for all at ~11pm last night - FTS problem?
New Actions:
- Mail out plan for throughput phase to sites - James -
DONE
- Move all users over to castor2 - James
WAITING
- Look at why gridftp monitoring is broken - David/Roberto -
DONE
- Hourly tests (like SFT) - Simone
FIRST VERSION DONE
- See why FTS may have died - Gavin
DONE
- Publish FTS Pilot details to expts - Sophie
DONE
- LFC Pilot migration for GSSDATLAS - Sophie
DONE
Tuesday:
Log: castor2 breakage last night at 5.30PM. Very high load on machines. Problem solved after 2 hours. Reboot this morning. Gridftp logs now working. R-GMA will take responsibility for putting in the missed data (3 days).
Update on actions:
hourly tests in progress - will run on 26d
NCM component was removing users - fixed by Vlado
- watchdog script doesn't restart the daemon if a stale lockfile is left behind (see the sketch after this list)
Pilot migration ongoing - a few days to go
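As referenced above, a minimal sketch of how the watchdog could handle a stale lockfile, assuming the lockfile holds the daemon's PID; the lockfile path and restart command are placeholders:

#!/usr/bin/env python
# Hypothetical watchdog fragment: only honour the lockfile if the PID
# recorded in it still exists; otherwise treat it as stale, remove it
# and restart the daemon.
import os, subprocess

LOCKFILE = "/var/lock/transfer-agent.lock"                  # assumed
RESTART  = ["/sbin/service", "transfer-agent", "restart"]   # assumed

def pid_alive(pid):
    return os.path.exists("/proc/%d" % pid)

def daemon_running():
    if not os.path.exists(LOCKFILE):
        return False
    try:
        pid = int(open(LOCKFILE).read().split()[0])
    except (ValueError, IndexError):
        pid = 0
    if pid and pid_alive(pid):
        return True
    os.remove(LOCKFILE)        # stale lockfile: the process is gone
    return False

if __name__ == "__main__":
    if not daemon_running():
        subprocess.call(RESTART)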
Info:
Simone -
GSSDATLAS had a production meeting with schedules for the next seven months.
New Actions:
James/David - tell site of plans for this morning.
DONE
- James/Jamie
- quick site call at 3?
NOT TO DO
- Vlado
- Need to reboot all nodes (castor1 + castor2)
DONE
- Vlado/James/Roberto
- lcg-mon-gridftp details
DONE
(check once per hour; attempt 1 restart, then call the operators - see sketch below)
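A minimal sketch of that policy (hourly check, one restart attempt, then hand over to the operators), assuming an hourly cron entry; the marker-file path, restart command and syslog escalation are placeholders for the real alarm path:

#!/usr/bin/env python
# Hypothetical hourly check for lcg-mon-gridftp: if the monitor is down,
# restart it once; if it is still down at the next check, escalate.
import os, glob, subprocess, syslog

MARKER  = "/var/run/lcg-mon-gridftp.restarted"              # assumed
RESTART = ["/sbin/service", "lcg-mon-gridftp", "restart"]   # assumed

def monitor_running():
    for cmdline in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            if "lcg-mon-gridftp" in open(cmdline).read():
                return True
        except IOError:
            pass
    return False

if __name__ == "__main__":
    if monitor_running():
        if os.path.exists(MARKER):
            os.remove(MARKER)          # recovered: reset the escalation state
    elif not os.path.exists(MARKER):
        open(MARKER, "w").close()      # first failure: one restart attempt
        subprocess.call(RESTART)
    else:
        # second consecutive failure: hand over to the operators
        syslog.syslog(syslog.LOG_ALERT, "lcg-mon-gridftp down after restart")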
Wednesday
Log: Saw some outages due to castor2 bug being hit. Bug fixed at 17:00, and rate went back up. Rate not high over rest of night, due to not enough channels having data.
Simone waiting on lcg-utils with timeouts in order to finish the hourly monitoring scripts.
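A minimal sketch of what such an hourly probe could look like while the built-in timeouts are missing: it wraps lcg-cr and kills it after a fixed wall-clock limit. The VO (dteam), destination SURL and file names are placeholders, not the real test setup:

#!/usr/bin/env python
# Hypothetical hourly copy-and-register probe with an external timeout,
# as a stopgap until lcg-utils provides its own timeout options.
import subprocess, sys, time

TIMEOUT = 600                      # seconds allowed for one transfer
TESTFILE = "/tmp/probe.dat"        # small local test file

def run_probe():
    with open(TESTFILE, "wb") as f:            # make sure the source exists
        f.write(b"x" * 1024 * 1024)
    cmd = ["lcg-cr", "--vo", "dteam",
           "-d", "srm://se.example.cern.ch/dpm/example/probe-%d" % int(time.time()),
           "file:%s" % TESTFILE]
    try:
        proc = subprocess.run(cmd, timeout=TIMEOUT,
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return proc.returncode
    except subprocess.TimeoutExpired:
        return -1                              # a hung transfer counts as failure

if __name__ == "__main__":
    rc = run_probe()
    print("probe %s (rc=%d)" % ("OK" if rc == 0 else "FAILED", rc))
    sys.exit(0 if rc == 0 else 1)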
Actions:
- David/Roberto: Files deleted via ns from castor.
DONE
- James/Yaodong: lcg-cr with timeouts
DONE
- James/Vlado: upgrade rpms on FTS nodes
DONE
New Actions:
- Jan/Sophie/Laurence: How to run "central" LEMON sensors
Thursday
Log: Cleaned up entries from castor ns and recopied the 001 files (5 had a problem - understood to be oplapro76). System kept running. Need to understand why rates have dropped.
Actions:
- David/Roberto/James: Debug why site rates have dropped
- ...: Communicate with sites on tuning parameters
DONE
- Vlado/Jan van Eldik: publish bhosts info
DONE
- All: Go through the 2nd level support
- FTS: 3rd level support
DONE
Talks for GDB:
- David - POOL/LFC
- Patricia - Exp. interaction
- Gavin - Multi-VO FTS
- ?? - Monitoring
- James - Overall status
- Vlado - castor2 status
Friday
Log: Finished copy of the 'bad' files in castor - could now get them out. Tuning of sites - number of files/streams. Buffer size problem on the oplapros: 4MB, not 2MB. 1 SRM died on oplapro79 - not clear why it died. LEMON hadn't picked it up - Jan has fixed that. Alarms on Ben's test box, which we moved out of production.
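A minimal sketch of a check for the buffer-size mismatch mentioned above: it reads the kernel's TCP buffer limits and flags any maximum that differs from an expected value (4MB here is a placeholder, not the confirmed correct setting):

#!/usr/bin/env python
# Hypothetical check for inconsistent TCP buffer limits across transfer
# nodes: print the current kernel settings and flag mismatches.
EXPECTED_MAX = 4 * 1024 * 1024       # assumed target maximum

SETTINGS = [
    "/proc/sys/net/core/rmem_max",
    "/proc/sys/net/core/wmem_max",
    "/proc/sys/net/ipv4/tcp_rmem",   # min / default / max
    "/proc/sys/net/ipv4/tcp_wmem",
]

for path in SETTINGS:
    values = open(path).read().split()
    maximum = int(values[-1])        # last field is the maximum
    status = "ok" if maximum == EXPECTED_MAX else "MISMATCH"
    print("%-35s max=%d %s" % (path, maximum, status))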
Discussion:
- SRM Copy problem - trying to reproduce/debug
- SRM Pools - keep WAN/PHEDEX separate until end of July, possibly end of August.
Actions:
- Ben: check oplapro79 to see if the system killed the SRM process (see the sketch at the end of this list)
- Olof: Try with one node moving
- David: bdflush parameters. lxshare220d.
- James/Ben: check oplapro80 for access for Ben.
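For the oplapro79 check above, a minimal sketch of how to look for an out-of-memory kill in the system log; the log path and match strings are assumptions:

#!/usr/bin/env python
# Hypothetical check for whether an out-of-memory condition took down a
# daemon on the node: scan the system log for OOM-killer messages.
LOGFILE = "/var/log/messages"
PATTERNS = ("oom-killer", "Out of Memory", "Out of memory")

hits = []
for line in open(LOGFILE, errors="replace"):
    if any(p in line for p in PATTERNS):
        hits.append(line.rstrip())

if hits:
    print("possible OOM kills found:")
    for h in hits:
        print("  " + h)
else:
    print("no out-of-memory messages in %s" % LOGFILE)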