Disable puppet agent

This way the upgrade does not happen automatically on a Puppet run, and you can perform it manually on each node.

This works but affects both production and dev:

[aiadm]$ wassh -l root -c dashboard/elasticsearch 'puppet agent --disable' 

You will need a list of hosts to upgrade them one by one anyway, so get the list from Foreman and run:

[aiadm]$ wassh -l root 'node1 node2 ... nodeN' 'puppet agent --disable' 
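If wassh is not at hand, an equivalent plain-ssh loop works the same way (the hostnames below are placeholders for the list from Foreman):

```shell
# Disable the puppet agent on each host from the Foreman list (placeholder names)
for h in node1 node2 nodeN; do
    ssh root@"$h" 'puppet agent --disable'
done
```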

Configure the upgrade in Puppet

Use the hg_dashboard repository: branch master for production, branch qa for the development cluster.

Change the ES version in Hiera

Set elasticsearch::version in data/hostgroup/dashboard/elasticsearch.yaml
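As a sketch, the Hiera entry would look like this (the version string below is only an example; use the release you are upgrading to):

```yaml
# data/hostgroup/dashboard/elasticsearch.yaml
# hypothetical version string -- substitute the actual target release
elasticsearch::version: '1.4.4'
```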

For a minor version upgrade (1.X -> 1.Y), also change the yum repository in code/manifests/elasticsearch/base.pp:

-    osrepos::ai121yumrepo { 'elasticsearch-1.3':
-        descr    => 'Elasticsearch repository for 1.3.x packages',
-        baseurl  => 'http://linuxsoft.cern.ch/elasticsearch/elasticsearch-x86_64/RPMS.es-13-el6/',
+    osrepos::ai121yumrepo { 'elasticsearch-1.4':
+        descr    => 'Elasticsearch repository for 1.4.x packages',
+        baseurl  => 'http://linuxsoft.cern.ch/elasticsearch/elasticsearch-x86_64/RPMS.es-14-el6/',

Do a rolling upgrade-restart

ssh root@hostname

If it's a production search node, remove it from the DNS alias

touch /etc/nologin

Wait (5-10 minutes) until nslookup dashb-es no longer lists this host's IP

We have only one search node in dev, so removing it from the alias is not useful there, but it should work the same way (unless the dashb-es-dev DNS alias is not load-balanced; is http://sls.cern.ch/sls/service.php?id=DNSLOADBALANCING still supported? I don't see it in the list).

If it's a data node, migrate the data away

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_name":"-node-hostname-here-"}}}}}}'
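To confirm the exclusion was accepted, the transient cluster settings can be read back (a sketch; plain grep is used since jq may not be installed on the node):

```shell
# The transient settings should now contain the excluded node name under _name
curl -s localhost:9200/_cluster/settings?pretty | grep '_name'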

Wait (30-60 minutes) until the data has migrated away. Watch progress at https://dashb-es.cern.ch/_plugin/whatson or https://dashb-es.cern.ch/_plugin/head
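The migration can also be watched from the command line; the _cat/shards endpoint is available from ES 1.0, so a sketch like this should work (node-hostname-here is a placeholder):

```shell
# Count shards still located on the excluded node; the count should reach 0
watch -n 30 'curl -s localhost:9200/_cat/shards | grep -c node-hostname-here'
```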

Without this step there is a brief period of red cluster health when the node is stopped, while shards that had their primary on this node get a new primary promoted from their replicas, followed by a long period of yellow health while new replicas are created to replace the data that was on the node.

service elasticsearch stop

I did not notice problems when upgrading a running ES node and then restarting it, but just in case, it does not hurt to stop the service before upgrading.

puppet agent --enable && puppet agent -t

Puppet upgrades the package and starts the service

Wait for the node to start and join the cluster:

Watch /var/log/elasticsearch/clustername.log (or /var/elasticsearch/log/clustername.log on the data nodes)

curl localhost:9200
curl localhost:9200/_cluster/health?pretty
curl localhost:9200/_nodes/hostname/info?pretty

Or check https://dashb-es.cern.ch/_plugin/head

Check the node version:

curl localhost:9200/_nodes/hostname/info?pretty
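To check versions across the whole cluster at once, the _cat/nodes endpoint (also available from ES 1.0) can help; a sketch:

```shell
# List host and ES version for every node in the cluster
curl -s 'localhost:9200/_cat/nodes?h=host,version'

# Or just the distinct versions currently present (should be one line when done)
curl -s 'localhost:9200/_cat/nodes?h=version' | sort -u
```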

If it's a search node, add it back to the DNS alias

rm /etc/nologin

Wait (5-10 minutes) until nslookup dashb-es shows this host's IP again

If it's a data node, enable storing data on it again

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_name":""}}}}}}'

If you did not migrate the data away before restarting, wait (~30 minutes) for new replicas to be created and the cluster status to turn green again.

curl localhost:9200/_cluster/health?pretty

Or check https://dashb-es.cern.ch/_plugin/head or https://dashb-es.cern.ch/_plugin/whatson
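Waiting for green can also be scripted; a sketch that polls the compact (non-pretty) health JSON:

```shell
# Poll cluster health once a minute until the status is green
until curl -s localhost:9200/_cluster/health | grep -q '"status":"green"'; do
    sleep 60
done
echo "cluster is green"
```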


On the previous upgrade I skipped migrating data away before restarting the data node, because I think it was noticeably slower than just restarting the node and waiting for new replicas. There is probably a way to configure more simultaneous shard migrations; if migration is too slow, we should look into that.

It's possible to migrate data away from several nodes at a time and upgrade them at the same time, but I don't think it will lead to a shorter upgrade time.

By default indexes have 2 replicas, so even without migrating data away it should be safe to restart 2 data nodes at once from a green cluster state. But this does not actually save time, as recreating replicas after a double restart takes twice as long. It is better to restart one data node at a time.

The MIG data writing service does not mind if we restart a search node without updating the DNS alias first (it retries and gets the other IP from the alias). But it is more seamless for everyone to use /etc/nologin, and compared to moving data around it takes very little time.

We have 3 master nodes. Make sure to never restart 2 masters at the same time, as 2 need to be running for safe master election.
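Before restarting a master, it may help to see which node currently holds the elected-master role; the _cat/master endpoint exists from ES 1.0, so a sketch:

```shell
# The last column of _cat/master output is the node name of the elected master
curl -s localhost:9200/_cat/master | awk '{print $NF}'
```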

It's probably best not to restart masters while the cluster is yellow, but search nodes can be upgraded while waiting for the data nodes to move data.

Topic revision: r3 - 2014-12-04 - IvanKadochnikov