DAILY MEETING LOG
Gdrb06: ticket number 130573 (ancient rack down in the vault, might be related)
No UPS till 1 pm, starting now
WMS remedy ticket solved by Rolandas
Gdrb03 -> Tim Bell? Smartd_wrong
- find somebody to upgrade gLite RBs together with Thorsten
- 2 new ATLAS UIs are being prepared
- Memory matchmaking after the WMS modification?? Discussion at HEPiX
- mail from Veronique about the change of IP addresses, etc. (only Antonio answered) CHECK!
- EGEE broadcasts ended up in the SPAM folder. Solved with the mail group (EGEE broadcast added to the white list)
Short stoppage of LXBATCH tomorrow morning, 14.09.2006 6:55am & change of MPI s/w on LXPLUS
We will suspend LXBATCH tomorrow morning at 6:55am for a few minutes. This is done as a precaution for an AFS intervention on one of the AFS servers. The intervention should be transparent; the LXBATCH suspension is done to reduce the risk of a failure. We will also use this intervention to do necessary kernel upgrades on the LXMASTER cluster.
We also want to take this opportunity to announce a small software change on LXPLUS (SLC3). To be consistent with LXBATCH and the GRID middleware, we want to change the MPI software installation on LXPLUS from 'lam' to 'mpich'. Please let us know if you see a problem with this.
Batch will be resumed without further notice once the intervention has finished successfully. The expected intervention time is ~10 minutes.
SUMMARY OF LFC PROBLEM
Around 10:30, while checking the logs, ATLAS (Miguel Branco) saw that the servers had died. Over the last 7 days they had been dropping many requests and showing internal communication errors.
Simone said that they did three things in the afternoon: they sent a mail to GGUS (he will send it to me so we can track what happened) and sent mails to LFC support. Sophie answered immediately; Jan van Eldik (service manager) was involved through Remedy a few minutes later and escalated to the 3rd level. David Collados (3rd level) saw it and restarted the servers.
David restarted them around 15:30, which is why we see the alarms at that time. They were stuck. Yesterday we did not get any alarm -> some wrong logic in the Lemon sensor that we should check? The no-read and no-write exceptions were there, and also a third exception, but no alarm was raised. ACTION: review the Lemon sensor.
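One illustrative hypothesis to check when reviewing the sensor (a minimal sketch, assuming nothing about Lemon's actual internals; the function names, flags, and the AND/OR framing are all hypothetical, not Lemon code): if the sensor combines the individual exceptions with AND instead of OR, every exception can fire during a window without the alarm ever being raised.

```python
# Hypothetical sketch of alarm-combination logic, NOT the actual Lemon
# sensor code. It only illustrates why reviewing the boolean logic is a
# reasonable action item after "exceptions present, but no alarm".

def alarm_and(no_read, no_write, third):
    # Suspected faulty logic: all three exceptions must be active
    # in the same sampling interval for an alarm to be raised.
    return no_read and no_write and third

def alarm_or(no_read, no_write, third):
    # Expected logic: any single exception should raise the alarm.
    return no_read or no_write or third

# Invented samples, one (no_read, no_write, third) tuple per monitoring
# interval: each exception fires at some point, but never all three at once.
samples = [(True, False, False), (False, True, True), (True, True, False)]

print(any(alarm_and(*s) for s in samples))  # -> False: no alarm ever raised
print(any(alarm_or(*s) for s in samples))   # -> True: alarm raised as expected
```

The same style of check applies to thresholds and sampling windows: any condition that must hold simultaneously across several metrics can silently swallow alarms.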
Could the security update have made the daemon loop? The update was applied in the morning, so it might be related.
An EGEE broadcast should have been sent to inform the users. Yes: we were busy solving the problem and did not think about it. To be done next time.
Version 1.5.8 of the LFC server will solve the timeout error. What is the status of this version? Not known.
Discussion about how to announce OS and security updates (which sometimes, as in this case, affect services). They are currently announced via Linux support and quattor announce, and then at the morning meeting. There are many of them and they are difficult to follow.
The cern-prod-admin mailing list (where all service managers are) will be subscribed to quattor announce, so that at least we are aware of the updates and can correlate them whenever a service has problems.
Discuss Remedy ticket WMS -> Lemon
- Lcgctb1, 2, 3, 4: no contact, service manager contacted
- Lxb0116: looks like Massimo's machine
SOLVED: these were not production services, clarified. They were marked as “standby” in Lemon. Contact Thorsten and Veronique if this happens again.
Lxnoq machines are normally not monitored, unless the Service Managers of the machines have explicitly installed LEMON on them, or the disks were not scratched when handed over from FIO to GD.
But it seems that they are registered in LAS because they have the state "standby" instead of "unmanaged", which is the default for "lxnoq".
So I will clear the SMS state, which should solve the problem.
Can you confirm?
From: Maite Barroso Lopez
Sent: Friday, September 15, 2006 10:42 AM
To: Veronique Lefebure; Thorsten Kleinwort; Ulrich Schwickerath
Subject: Machines in standby
At this morning's meeting I was pointed to several machines that are not (to my knowledge) grid production services, but that were referred to as grid machines, and so the gmod was asked to have a look.
I checked in Lemon: they are in standby state, and they are owned either by Massimo Lamanna's team or by Markus' team.
Are they really production machines? Should they be monitored?
These are the details:
- 14 SEP 2006 14:35
no_contact, unable to ping. Service manager contacted firstname.lastname@example.org email@example.com firstname.lastname@example.org
INFO IN LEMON: standby
Reason: quiescing (For Power issue in CC) requested by lefebure on 13/07/2006 11:36:13
- 14 SEP 2006 14:35
lxnoq lcgctb1 +3
no_contact, able to ping, no right to connect. Service manager contacted email@example.com firstname.lastname@example.org email@example.com firstname.lastname@example.org email@example.com firstname.lastname@example.org email@example.com firstname.lastname@example.org
INFO IN LEMON: standby
Reason: change cluster (for Virtual Machines for Markus Schulz) requested by Lefebure on 27/04/2006 15:52:33
SUMMARY: No pending issue left for next week