In [2]:
import warnings
warnings.simplefilter("ignore")
In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
According to various benchmarks (papers, kaggle.com), the **general-purpose algorithms** that most often achieve the best performance are **Gradient Boosted Trees, Random Forest, and SVM**.

Decision Trees

In [12]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(max_depth=2)
In [13]:
from sklearn.model_selection import train_test_split

X = pd.read_csv('../vol/intermediate_results/X_opening.csv')
y = X['worldwide_gross']
X = X.drop('worldwide_gross',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
In [14]:
model.fit(X_train,y_train)
Out[14]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
In [16]:
import graphviz
In [17]:
from sklearn.tree import export_graphviz

treedot = export_graphviz(model,out_file=None, feature_names=X.columns)
In [18]:
treedot
Out[18]:
'digraph Tree {\nnode [shape=box] ;\n0 [label="opening_gross <= 41613376.0\\nmse = 4.4919943637e+16\\nsamples = 1665\\nvalue = 141540319.054"] ;\n1 [label="opening_gross <= 22074048.0\\nmse = 1.33338221931e+16\\nsamples = 1506\\nvalue = 92999937.199"] ;\n0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;\n2 [label="mse = 4.9236662412e+15\\nsamples = 1257\\nvalue = 64781848.271"] ;\n1 -> 2 ;\n3 [label="mse = 3.147813102e+16\\nsamples = 249\\nvalue = 235450289.735"] ;\n1 -> 3 ;\n4 [label="opening_gross <= 70351576.0\\nmse = 1.10398118716e+17\\nsamples = 159\\nvalue = 601300162.289"] ;\n0 -> 4 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;\n5 [label="mse = 4.06753884592e+16\\nsamples = 92\\nvalue = 440868287.554"] ;\n4 -> 5 ;\n6 [label="mse = 1.22264857987e+17\\nsamples = 67\\nvalue = 821594676.851"] ;\n4 -> 6 ;\n}'
In [19]:
graphviz.Source(treedot)
Out[19]:
[Rendered tree (graphviz): root split at opening_gross <= 41613376.0 (samples = 1665); second-level splits at opening_gross <= 22074048.0 and opening_gross <= 70351576.0; four leaves with mean values ranging from roughly 6.5e7 to 8.2e8.]

Strengths of decision trees:

  • A powerful, well-proven method
  • Interpretable
  • No need to scale the data, and less feature preprocessing in general

However, in practice there are models that achieve better performance. How can we improve on the decision tree model?

Ensembles

General Concept

Random Forest and Gradient Boosted Trees belong to a family of algorithms called ensembles.

$$ Ensemble = Submodels \rightarrow Training \rightarrow Predictions_{intermediate} \rightarrow Vote \rightarrow Prediction_{final}$$

How does the Random Forest algorithm work?

We generate hundreds of decision tree models, each trained on a bootstrapped sample of the original dataset, and at every split the set of eligible features is a random subset of the original feature set.

Each trained tree then votes with its own prediction, and we average those votes.
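
A minimal sketch of this idea, using only numpy, DecisionTreeRegressor, and the X_train/y_train/X_test split defined earlier (for real work, use sklearn.ensemble.RandomForestRegressor directly):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
trees = []
for _ in range(100):
    # Bootstrap: sample rows with replacement from the original training set
    idx = rng.randint(0, len(X_train), size=len(X_train))
    # At each split, only a random subset of features (sqrt of the total) is eligible
    tree = DecisionTreeRegressor(max_features='sqrt', random_state=rng.randint(10**6))
    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
    trees.append(tree)

# Each tree "votes" with its prediction; for regression the final vote is an average
y_pred = np.mean([tree.predict(X_test) for tree in trees], axis=0)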

Poor man's ensembles

  • Train several models by hand
  • Average their results
  • Owen Zhang, ranked #1 on Kaggle.com for a long time, used this strategy, averaging several XGBoost models.
  • from sklearn.ensemble import VotingClassifier can be used, for example, to build a manual classification ensemble

In general, poor man's ensembles work because each of the models voting together is already fairly strong, as sketched below.
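
A minimal sketch of a manual ensemble for our regression problem, averaging two strong models (the model choice and equal weighting are assumptions for illustration, not Owen Zhang's exact recipe):

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Train two reasonably strong models by hand
rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train, y_train)
gbt = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)

# "Vote": average the intermediate predictions to obtain the final prediction
y_pred = (rf.predict(X_test) + gbt.predict(X_test)) / 2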

Why is Random Forest so powerful?

**Leo Breiman**, the creator of Random Forest, showed that an ensemble can generalize well if:
  1. The submodels have good predictive power
  2. The submodels are decorrelated

The Random Forest algorithm therefore gives up a bit of the predictive power of each individual decision tree, but the random way in which the trees are built keeps them strongly decorrelated.
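
A rough empirical check of this decorrelation (a sketch, not a formal argument): fit a forest and look at the average correlation between the predictions of its individual trees.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

forest_check = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train, y_train)

# Predictions of each individual tree on the test set
per_tree_preds = np.array([tree.predict(X_test.values) for tree in forest_check.estimators_])

# Mean pairwise correlation between trees, excluding the diagonal
corr = np.corrcoef(per_tree_preds)
n = len(corr)
print((corr.sum() - n) / (n * (n - 1)))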

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

forest = RandomForestRegressor(n_estimators=200)
results = cross_validate(forest,X,y,cv=5,scoring='r2',return_train_score=True)
In [22]:
test_scores = results['test_score']
train_scores = results['train_score']
print(np.mean(train_scores))
print(np.mean(test_scores))
0.965572885203
0.524298909847

A better result than Lasso! We no longer suffer from high bias and we get a better r2 score. However, there is a large gap between the training and test scores (overfitting).
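
One way to probe that gap (a sketch, not part of the original analysis) is to restrict the complexity of the individual trees and re-run the same cross-validation; the max_depth and min_samples_leaf values below are illustrative guesses:

# More regularized trees usually narrow the train/test gap, possibly at some cost in test score
forest_small = RandomForestRegressor(n_estimators=200, max_depth=8, min_samples_leaf=5)
results_small = cross_validate(forest_small, X, y, cv=5, scoring='r2', return_train_score=True)
print(np.mean(results_small['train_score']))
print(np.mean(results_small['test_score']))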

Gradient Boosted Trees

In [50]:
from sklearn.ensemble import GradientBoostingRegressor

ensemble = GradientBoostingRegressor()
results = cross_validate(ensemble,X,y,cv=5,scoring='r2',return_train_score=True)
In [51]:
test_scores = results['test_score']
train_scores = results['train_score']
print(np.mean(train_scores))
print(np.mean(test_scores))
0.915139214355
0.525016912471

How do we optimize the parameters of this last model?

Hyperparameter optimization

  • Fix a high learning rate
  • Fix the tree parameters
  • With those fixed, choose the best number of estimators for the ensemble
  • (Homework) With the chosen learning rate and the optimal number of estimators, optimize the tree parameters (a sketch of this step appears after the grid search below)

Grid Search

So far we have said that:

  • train_test_split is useful for quick evaluations, tests, and prototyping
  • cross_validate is a more robust method for estimating the performance of your algorithm

However, once the prototyping stage is over and we want to settle on a final model, we should follow the workflow below.

In [64]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=1)
In [77]:
from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators':range(20,501,20)}
estimator = GradientBoostingRegressor(learning_rate=0.1, 
                                       min_samples_split=500,
                                       min_samples_leaf=50,
                                       max_depth=8,
                                       max_features='sqrt',
                                       subsample=0.8,
                                       random_state=10)
gsearch1 = GridSearchCV(estimator, 
                        param_grid = param_test1, 
                        scoring='r2', 
                        cv=5)
In [78]:
gsearch1.fit(X_train,y_train)
Out[78]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=8,
             max_features='sqrt', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=50, min_samples_split=500,
             min_weight_fraction_leaf=0.0, n_estimators=100,
             presort='auto', random_state=10, subsample=0.8, verbose=0,
             warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': range(20, 501, 20)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='r2', verbose=0)
In [79]:
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_
Out[79]:
([mean: 0.65534, std: 0.05764, params: {'n_estimators': 20},
  mean: 0.71947, std: 0.06256, params: {'n_estimators': 40},
  mean: 0.73472, std: 0.06360, params: {'n_estimators': 60},
  mean: 0.73893, std: 0.06236, params: {'n_estimators': 80},
  mean: 0.74205, std: 0.06271, params: {'n_estimators': 100},
  mean: 0.74593, std: 0.06236, params: {'n_estimators': 120},
  mean: 0.74954, std: 0.06335, params: {'n_estimators': 140},
  mean: 0.75082, std: 0.06305, params: {'n_estimators': 160},
  mean: 0.75257, std: 0.06344, params: {'n_estimators': 180},
  mean: 0.75349, std: 0.06447, params: {'n_estimators': 200},
  mean: 0.75457, std: 0.06342, params: {'n_estimators': 220},
  mean: 0.75531, std: 0.06489, params: {'n_estimators': 240},
  mean: 0.75517, std: 0.06572, params: {'n_estimators': 260},
  mean: 0.75389, std: 0.06495, params: {'n_estimators': 280},
  mean: 0.75460, std: 0.06569, params: {'n_estimators': 300},
  mean: 0.75250, std: 0.06545, params: {'n_estimators': 320},
  mean: 0.75350, std: 0.06492, params: {'n_estimators': 340},
  mean: 0.75354, std: 0.06623, params: {'n_estimators': 360},
  mean: 0.75259, std: 0.06542, params: {'n_estimators': 380},
  mean: 0.75254, std: 0.06469, params: {'n_estimators': 400},
  mean: 0.75186, std: 0.06477, params: {'n_estimators': 420},
  mean: 0.75205, std: 0.06508, params: {'n_estimators': 440},
  mean: 0.75157, std: 0.06449, params: {'n_estimators': 460},
  mean: 0.75051, std: 0.06352, params: {'n_estimators': 480},
  mean: 0.75096, std: 0.06327, params: {'n_estimators': 500}],
 {'n_estimators': 240},
 0.7553059694284987)
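
Note that grid_scores_ was removed in later scikit-learn releases (0.20+); the same information is available through cv_results_, for example:

import pandas as pd

cv_results = pd.DataFrame(gsearch1.cv_results_)
print(cv_results[['param_n_estimators', 'mean_test_score', 'std_test_score']])
print(gsearch1.best_params_, gsearch1.best_score_)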
In [80]:
gsearch1.best_estimator_
Out[80]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=8,
             max_features='sqrt', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=50, min_samples_split=500,
             min_weight_fraction_leaf=0.0, n_estimators=240,
             presort='auto', random_state=10, subsample=0.8, verbose=0,
             warm_start=False)
In [84]:
cross_validate(gsearch1.best_estimator_,X_train,y_train,return_train_score=True)
Out[84]:
{'fit_time': array([ 0.12527442,  0.12485075,  0.12371683]),
 'score_time': array([ 0.00255871,  0.00274563,  0.00292444]),
 'test_score': array([ 0.64008566,  0.81626875,  0.76780843]),
 'train_score': array([ 0.86953232,  0.77262356,  0.81279493])}
In [85]:
final_results = cross_validate(gsearch1.best_estimator_,X_train,y_train,return_train_score=True)
In [87]:
test_scores = final_results['test_score']
train_scores = final_results['train_score']
print(np.mean(train_scores))
print(np.mean(test_scores))
0.81831693665
0.741387612516
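
The homework step from the hyperparameter list above, keeping the learning rate at 0.1 and n_estimators at the value found by the first search, could then be sketched as a second grid over the tree parameters (the ranges below are illustrative assumptions, not tuned values):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_test2 = {'max_depth': range(4, 13, 2), 'min_samples_split': range(200, 1001, 200)}
estimator2 = GradientBoostingRegressor(learning_rate=0.1,
                                       n_estimators=gsearch1.best_params_['n_estimators'],
                                       min_samples_leaf=50,
                                       max_features='sqrt',
                                       subsample=0.8,
                                       random_state=10)
gsearch2 = GridSearchCV(estimator2, param_grid=param_test2, scoring='r2', cv=5)
gsearch2.fit(X_train, y_train)
print(gsearch2.best_params_, gsearch2.best_score_)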

Closing thoughts

Resources

  • Reddit: /r/MachineLearning and /r/learnmachinelearning
  • Analytics Vidhya and KDnuggets
  • Kaggle.com and its blog, No Free Hunch
  • arXiv papers
  • Books: "Pattern Recognition and Machine Learning" by C. Bishop and "The Elements of Statistical Learning".

Next steps

  • Mathematics
  • Practice: Feature Engineering, Model Selection, and Tuning
  • Deep Learning for NLP and Computer Vision
  • Bayesian Machine Learning