What are the possible implications? What should we look out for?
Need people to look into timeouts of merge jobs. Merge jobs should be short; there is no reason they should be timing out. We have been blaming "network issues", but has anything else changed over the last couple of weeks that could be causing this problem? Jen is seeing it a LOT in Redigi and Luis is noting issues in StoreResults as well.
Site issues
Who put Caltech in drain? When you put a site in drain you must e-log and file tickets. They have 4k idle nodes and nobody knows why they are in drain.
Turns out that StoreResults is having the same issue with merge timeouts as Redigi. Luis reported that WFs he ran with no issues several weeks ago are now having timeout issues, and was going to investigate further. Luis, do you have an update?
Working my way through the list of WFs in complete. Most of them are due to timeouts. At FNAL we were blaming the timeouts on network issues, but I am seeing them across the board. We need to figure this out; it's killing us in latency to have to make 2-4 ACDCs per workflow to get everything through.
FWIW I agree with Dave, this sounds dangerous. We all know requestors can put "stupid things in" that could really break things badly. Having a bit of a buffer in there may slow things down, but fast is not always best.