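A roundup of sites useful for understanding GBDT

GBDT comes up constantly in data science competitions and at work, but the algorithmic details differ from package to package, which makes it complicated. Ideally I would read the official documentation, the papers, and the implementations closely, but that is beyond my current ability, so I am collecting reference sites here instead. These are notes to myself, written while my understanding is still loose and fuzzy.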

GBDT

Gradient Boosting Interactive Playground

Lets you check how GBDT behaves on toy data. You can play with it for a while.

YouTube

Very easy-to-follow explainer videos. Watching Parts 1-4 gives you the basics of the algorithm.

Gradient Boost Part 1: Regression Main Ideas
Gradient Boost Part 2: Regression Details
Gradient Boost Part 3: Classification
Gradient Boost Part 4: Classification Details

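After watching them, the commonly seen algorithm listing shown in the figure below no longer looks intimidating.
(One difference from the videos: step 2-3 of the listing performs a line search.)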

[Figure: the GBDT algorithm (from Wikipedia)]
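
As a rough companion to the algorithm listing, here is a minimal sketch of gradient boosting for regression in plain Python. It assumes squared-error loss (so the negative gradient is just the residual), a single feature, and depth-1 trees. The names `fit_stump`, `fit_gbdt`, and `mse` are made up for this note and do not come from any library. It also uses a fixed learning rate instead of the separate line search in the listing, so treat it as an illustration of the loop, not as the Wikipedia algorithm itself.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold, two leaf means) by least squares."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct feature values.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean


def fit_gbdt(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on the residuals, starting from the mean of the targets."""
    f0 = sum(ys) / len(ys)  # initial constant model
    trees = []
    preds = [f0] * len(ys)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(t(x) for t in trees)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: a step function with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
baseline = fit_gbdt(xs, ys, n_rounds=0)  # just the mean
model = fit_gbdt(xs, ys, n_rounds=50)
```

Comparing `mse(baseline, xs, ys)` with `mse(model, xs, ys)` shows the training error shrinking as rounds are added.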

Boosting algorithm: GBM - Towards Data Science

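Covers everything from an overview of the algorithm to a walkthrough of the sklearn source code, at a comfortable level of detail.
I often come back to it for review, and it is a personal favorite.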

XGBoost

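There are plenty of articles in Japanese. Getting the overview from those and then reading the paper seems like a good approach.

Articles in Japanese

Every one of these articles is clear and well organized. (A search turns up other clear articles as well.)

Phrases like "add a regularization term to the objective function" and "approximate with a second-order Taylor expansion" no longer look intimidating.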
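
For my own reference, those two phrases correspond roughly to the following, written in the notation of the XGBoost paper. Here $f_t$ is the tree added at round $t$, $T$ is its number of leaves, $w_j$ is the weight of leaf $j$, and $G_j$, $H_j$ are the sums of $g_i$, $h_i$ over the instances that fall in leaf $j$. The last line drops the constant term that does not depend on $f_t$.

```latex
\begin{aligned}
\mathcal{L}^{(t)}
  &= \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t),
  \qquad
  \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \\
  &\approx \sum_{i=1}^{n} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)
     + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Bigr] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \hat{y}_i^{(t-1)}},
  \qquad
  h_i = \frac{\partial^2 l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^2}, \\
w_j^{*} &= -\frac{G_j}{H_j + \lambda},
  \qquad
  \tilde{\mathcal{L}}^{(t)}(w^{*}) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T .
\end{aligned}
```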

Boosting algorithm: XGBoost - Towards Data Science

A concise summary, from an overview of the algorithm to how it differs from GBDT.

Tree Boosting With XGBoost - Why Does XGBoost Win "Every" Machine Learning Competition?

Seems like the thing to read when you want to study the GBDT and XGBoost algorithms.
It runs to about 100 pages, though, so it may be inconvenient for a quick review.

Tree Boosting With XGBoost — Why Does XGBoost Win “Every” Machine Learning Competition?

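Summarizes the key points of the roughly 100-page document introduced above.
I often come back to it for review, and it is a personal favorite.
It helps sort out things like the difference between Newton tree boosting and gradient tree boosting.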
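
To make that difference a little more concrete, here is a small sketch of how a single leaf value could be computed under each view. It assumes logistic loss with labels in {0, 1} and raw scores before the sigmoid. The helper `leaf_values` is hypothetical and written only for this note. The "gradient" value is the mean negative gradient (what a least-squares fit to the negative gradients gives for a leaf), while the "Newton" value also uses the second derivatives. Gradient boosting implementations often re-optimize the leaf values afterwards, which this sketch does not show.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_values(labels, raw_scores, reg_lambda=0.0):
    """Return (first-order leaf value, Newton leaf value) for one leaf under logistic loss."""
    grads, hess = [], []
    for y, f in zip(labels, raw_scores):
        p = sigmoid(f)
        grads.append(p - y)          # first derivative of the loss w.r.t. the raw score
        hess.append(p * (1.0 - p))   # second derivative of the loss w.r.t. the raw score
    G, H = sum(grads), sum(hess)
    gradient_value = -G / len(labels)      # mean of the negative gradients
    newton_value = -G / (H + reg_lambda)   # Newton step, optionally L2-regularized
    return gradient_value, newton_value


# Four instances in the same leaf, all starting from a raw score of 0 (p = 0.5).
labels = [1, 1, 1, 0]
raw_scores = [0.0, 0.0, 0.0, 0.0]
gradient_value, newton_value = leaf_values(labels, raw_scores)
```

On this toy leaf the first-order value is 0.25 and the Newton value is 1.0. The constant that actually minimizes the logistic loss here is log(3), about 1.10, so the Newton step lands much closer in a single update.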

Introduction to Boosted Trees — xgboost 1.0.0-SNAPSHOT documentation

Official documentation

[1603.02754] XGBoost: A Scalable Tree Boosting System

The original paper

LightGBM

There are fewer explainer articles than for XGBoost. Going straight to the paper may be the quickest route to understanding.

NIPS2017 reading group (NIPS2017読み会) LightGBM: A Highly Efficient Gradient Boosting Decision T…

Shows that LightGBM = GBDT + GOSS + EFB. It also has clear explanations of GOSS and EFB.
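
As a memo on the GOSS part (Gradient-based One-Side Sampling), here is a minimal sketch of the sampling step as I understand it from the LightGBM paper: keep the instances with the largest absolute gradients, randomly sample from the rest, and up-weight the sampled ones by (1 - a) / b. The function `goss_sample` and its argument names are illustrative and are not LightGBM's actual code. EFB is not covered here.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance index: weight} for the instances kept by a GOSS-style sampler."""
    n = len(gradients)
    top_n = int(top_rate * n)     # number of large-gradient instances, all kept
    rand_n = int(other_rate * n)  # number of small-gradient instances sampled at random
    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx = order[:top_n]
    rest = order[top_n:]
    rng = random.Random(seed)
    sampled_idx = rng.sample(rest, rand_n)
    # Sampled small-gradient instances are amplified to compensate for the ones dropped.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top_idx}
    weights.update({i: amplify for i in sampled_idx})
    return weights


gradients = [0.9, -0.8, 0.05, -0.02, 0.01, 0.03, -0.04, 0.02, 0.01, -0.01]
weights = goss_sample(gradients)
```

With ten instances and these rates, the two large-gradient instances (indices 0 and 1) are always kept with weight 1, and one of the remaining eight is kept with an amplified weight.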

LightGBM and XGBoost Explained | Machine Learning Explained

Gives brief explanations of the techniques specific to XGBoost and LightGBM.
Afterwards, the following terms no longer look intimidating.

Level-wise growth strategy, Leaf-wise growth strategy
Histogram-based method
Ignoring sparse inputs
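
Of the terms above, the histogram-based method is the easiest to sketch in a few lines. The idea is to bucket a feature into a small number of bins, accumulate gradient and Hessian sums per bin, and evaluate splits only at bin boundaries. The sketch below assumes equal-width bins, a feature with at least two distinct values, and an XGBoost-style gain without the per-leaf penalty. The function `best_histogram_split` is hypothetical, and real libraries build their bins differently.

```python
def best_histogram_split(values, grads, hess, n_bins=4, reg_lambda=1.0):
    """Find the best split threshold for one feature using per-bin gradient statistics."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # equal-width bins; assumes hi > lo
    grad_hist = [0.0] * n_bins
    hess_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        grad_hist[b] += g
        hess_hist[b] += h

    def score(g, h):
        return g * g / (h + reg_lambda)

    G, H = sum(grad_hist), sum(hess_hist)
    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    # Scan bin boundaries only: n_bins - 1 candidates instead of one per data point.
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        h_left += hess_hist[b]
        gain = score(g_left, h_left) + score(G - g_left, H - h_left) - score(G, H)
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


# Eight instances whose gradient flips sign halfway along the feature.
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 8
threshold, gain = best_histogram_split(values, grads, hess)
```

On this toy feature the best boundary is the one that separates the negative-gradient half from the positive-gradient half.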

Features — LightGBM 2.3.2 documentation

Official documentation
Explains how categorical variables are split (how they are encoded).

The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
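
The quoted idea can be sketched roughly as follows: accumulate gradient and Hessian sums per category, sort the categories by sum_gradient / sum_hessian, and then scan the sorted order as if it were an ordered feature. The function `best_categorical_split` is hypothetical, it assumes every category has a positive Hessian sum, and it reuses an XGBoost-style gain. LightGBM's actual implementation includes additional safeguards that are not shown here.

```python
def best_categorical_split(categories, grads, hess, reg_lambda=1.0):
    """Return (set of categories sent left, gain) for the best split on one categorical feature."""
    # Per-category histogram of (sum_gradient, sum_hessian).
    sums = {}
    for c, g, h in zip(categories, grads, hess):
        sg, sh = sums.get(c, (0.0, 0.0))
        sums[c] = (sg + g, sh + h)
    # Sort categories by sum_gradient / sum_hessian.
    ordered = sorted(sums, key=lambda c: sums[c][0] / sums[c][1])

    def score(g, h):
        return g * g / (h + reg_lambda)

    G = sum(sg for sg, _ in sums.values())
    H = sum(sh for _, sh in sums.values())
    best_gain, best_left = 0.0, None
    g_left = h_left = 0.0
    # Scan prefixes of the sorted order; the last category must stay on the right.
    for k, c in enumerate(ordered[:-1]):
        g_left += sums[c][0]
        h_left += sums[c][1]
        gain = score(g_left, h_left) + score(G - g_left, H - h_left) - score(G, H)
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[: k + 1])
    return best_left, best_gain


categories = ["a", "b", "c", "a", "b", "c", "d", "d"]
grads = [-1.0, 1.0, -0.5, -1.0, 1.0, -0.5, 0.8, 0.8]
hess = [1.0] * 8
left, gain = best_categorical_split(categories, grads, hess)
```

Here the categories with negative gradient sums ("a" and "c") end up on one side and those with positive sums ("b" and "d") on the other, which a one-hot split on a single category could not express in one step.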