Upgraded Elasticsearch and Kibana from 5.6.2 to 6.1.2

That said, this wasn't an in-place upgrade: I double-wrote to both the old and the new Elasticsearch and then switched over.

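In our environment, data is written to Elasticsearch along the path fluentd → kafka → kafka-fluentd-consumer → fluentd → Elasticsearch.
We run fluentd 0.12.
Since kafka sits in the middle, I wanted to use this opportunity to try multi worker and put fluentd 1 on the path after kafka, but we are still on 0.12 for the reasons below.
That said, if you are trying this from now on, fluentd 1 is probably fine.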

fluentd's CPU usage went up.
https://github.com/fluent/fluentd/issues/1801
This issue should be fixed by now, though. I haven't been able to try it, because we are already in production on 0.12.

fluent-plugin-record_modifier and fluent-plugin-prometheus did not support multi worker.
prometheus seems to support it now. https://github.com/fluent/fluent-plugin-prometheus/pull/44

We use record_modifier to collect fluentd's own logs and rewrite the tag, like this:

<match fluent.**>
  @type record_modifier
  tag "fluentd"
  include_tag_key yes
  tag_key "loglevel"
  hostname ...
  portnum ...
</match>

On the Elasticsearch client side, elasticsearch-ruby, which fluent-plugin-elasticsearch depends on, did not look compatible across the two versions, so I completely separated the ruby environments.
https://github.com/elastic/elasticsearch-ruby

As for the Java client, even the 5.4 client worked against 6.1.

In a few places we call the Elasticsearch API from JavaScript, and those needed an explicit Content-Type.
https://www.elastic.co/guide/en/elasticsearch/reference/6.0/breaking_60_rest_changes.html#_content_type_auto_detection
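As a rough illustration of the Content-Type change, here is a minimal sketch in Python (not the JavaScript we actually use). The host, index name, and the `build_search_request` helper are made up for the example. It only builds the request and does not send it.

```python
import json
import urllib.request


def build_search_request(base_url, index, query):
    """Build (but do not send) a search request with an explicit Content-Type.

    base_url and index are hypothetical placeholders for this sketch.
    """
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        # Since 6.0, requests with a body need a supported Content-Type;
        # it is no longer auto-detected from the body.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request(
    "http://localhost:9200", "logs-2018.01.31", {"match_all": {}}
)
```

Passing `req` to `urllib.request.urlopen` would actually send it. The only part that matters here is the `headers` argument.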

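On the Elasticsearch server side: in our environment we had so many indices, or rather shards, that indexing produced errors at the moment the date changed, so we had a batch job create the next day's indices in advance.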
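A batch like that could look roughly like the sketch below. This is not our actual script: the `prefix-YYYY.MM.DD` naming, the helper names, and the default shard and replica counts are all assumptions for illustration. The request is built but not sent.

```python
import datetime
import json
import urllib.request


def next_day_index(prefix, today):
    """Return tomorrow's index name, assuming a 'prefix-YYYY.MM.DD' scheme."""
    tomorrow = today + datetime.timedelta(days=1)
    return f"{prefix}-{tomorrow:%Y.%m.%d}"


def build_create_index_request(base_url, index_name, shards=1, replicas=1):
    """Build (but do not send) a create-index request with explicit settings."""
    settings = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


name = next_day_index("logs", datetime.date(2017, 12, 31))
create_req = build_create_index_request("http://localhost:9200", name)
```

Run from cron some time before midnight, one request per index prefix, this creates the indices before the first document of the new day arrives.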

Well, I wondered whether having that many indices was OK in the first place, so I asked on discuss, and sure enough the answer was basically "reduce them".
https://discuss.elastic.co/t/how-to-handle-many-indices/102803

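So, up through Elasticsearch 5.4.2 we had set the number of shards equal to the number of nodes, but I changed that: for the few indices that are large, I set the shard count so that one shard is roughly 50GB or less, and every other index gets 1 shard.
This brought the number of shards down from nearly 30,000 to a little over 2,000.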

The 50GB figure comes from this post:
https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

TIP: Avoid having very large shards as this can negatively affect the cluster's ability to recover from failure. There is no fixed limit on how large shards can be, but a shard size of 50GB is often quoted as a limit that has been seen to work for a variety of use-cases.
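The sizing rule above is simple enough to write down. Here is a minimal sketch, assuming you already have an estimate of each index's size in GB. The `shard_count` function and its parameter names are mine, not something from Elasticsearch.

```python
import math


def shard_count(expected_size_gb, max_shard_gb=50.0):
    """Number of primary shards so that each shard stays at or under max_shard_gb.

    Small (or empty) indices get a single shard.
    """
    if expected_size_gb <= 0:
        return 1
    return max(1, math.ceil(expected_size_gb / max_shard_gb))


small = shard_count(10)    # small index: one shard
large = shard_count(120)   # large index: split so each shard is <= 50GB
```

With these inputs a 10GB index gets 1 shard and a 120GB index gets 3. The 50GB default is just the rule of thumb from the quoted TIP, not a hard limit.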

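Also, once I set up dedicated master nodes, the indexing errors at the date change went away, so it looks like I can cut back on the batch job too.
Until now every node was both a master and a data node.
One problem, though: dedicated master nodes do not show up in cerebro.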

One difference between Elasticsearch 5.4.2 and 6.1.2 that I noticed right away is index size.
On 5.4.2, 1.6 billion documents took 4.2TB; on 6.1.2 that is down to 3.1TB.
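To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes decimal terabytes (1TB = 10^12 bytes) and that both clusters hold the same 1.6 billion documents, which should be roughly true with double writes but is not something I measured exactly.

```python
def reduction_ratio(before_tb, after_tb):
    """Fraction of storage saved going from before_tb to after_tb."""
    return (before_tb - after_tb) / before_tb


def bytes_per_doc(size_tb, doc_count):
    """Average on-disk bytes per document, assuming 1TB = 1e12 bytes."""
    return size_tb * 1e12 / doc_count


saved = reduction_ratio(4.2, 3.1)        # fraction saved on 6.1.2
old_avg = bytes_per_doc(4.2, 1.6e9)      # average bytes per document on 5.4.2
new_avg = bytes_per_doc(3.1, 1.6e9)      # average bytes per document on 6.1.2
```

Under those assumptions the saving is about 26%, or roughly 2.6KB per document going down to about 1.9KB.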

This post looks like a good write-up of those improvements:
https://www.elastic.co/blog/minimize-index-storage-size-elasticsearch-6-0

Recovery on restart also seems to have gotten faster.
Probably thanks to this: https://www.elastic.co/blog/elasticsearch-sequence-ids-6-0

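Search got faster too, but we also doubled the number of machines, so that probably accounts for most of it.
Still, there do seem to be performance improvements in there as well: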
https://www.elastic.co/blog/index-sorting-elasticsearch-6-0

Other than that, the variation in CPU usage has gone down.

As for kibana, part of our setup creates index patterns from a batch job, and I changed that to use the kibana API.
https://github.com/elastic/kibana/issues/3709
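For reference, here is a sketch of what creating an index pattern over HTTP can look like. It assumes Kibana's saved objects endpoint (`/api/saved_objects/index-pattern`) and the `kbn-xsrf` header, so check both against your Kibana version before relying on them. The Kibana URL, the pattern title, the time field, and the helper name are placeholders. The request is built but not sent.

```python
import json
import urllib.request


def build_index_pattern_request(kibana_url, title, time_field="@timestamp"):
    """Build (but do not send) a request that creates a Kibana index pattern.

    Assumes the saved objects API; kibana_url, title and time_field are
    hypothetical values for this sketch.
    """
    body = {"attributes": {"title": title, "timeFieldName": time_field}}
    return urllib.request.Request(
        url=f"{kibana_url}/api/saved_objects/index-pattern",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Kibana rejects write requests that lack this header.
            "kbn-xsrf": "true",
        },
        method="POST",
    )


pattern_req = build_index_pattern_request("http://localhost:5601", "logs-*")
```

Going through the HTTP API keeps the batch job from depending on how the `.kibana` index is laid out internally.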