Provide an input CSV and a target field to predict, generate a model + code to run it.
Under MIT License
By minimaxir

python machine-learning tensorflow keras xgboost automl

automl-gs


Give an input CSV file and a target field you want to predict to automl-gs, and get a trained high-performing machine learning or deep learning model plus native Python code pipelines allowing you to integrate that model into any prediction workflow. No black box: you can see exactly how the data is processed, how the model is constructed, and you can make tweaks as necessary.



automl-gs is an AutoML tool which, unlike Microsoft's NNI, Uber's Ludwig, and TPOT, offers a zero code/model definition interface for getting an optimized model and data transformation pipeline in multiple popular ML/DL frameworks, with minimal Python dependencies (pandas + scikit-learn + your framework of choice). automl-gs is designed for citizen data scientists and engineers without a deep statistical background, under the philosophy that you don't need to know any modern data preprocessing and machine learning engineering techniques to create a powerful prediction workflow.


Nowadays, the cost of computing many different models and hyperparameters is much lower than the opportunity cost of a data scientist's time. automl-gs is a Python 3 module designed to abstract away the common approaches to transforming tabular data, architecting machine learning/deep learning models, and performing random hyperparameter searches to identify the best-performing model. This allows data scientists and researchers to better utilize their time on model performance optimization.



The models generated by automl-gs are intended to give a very strong baseline for solving a given problem; they're not the end-all-be-all that often accompanies the AutoML hype, but the resulting code is easily tweakable to improve from the baseline.


You can view the hyperparameters and their values here, and the metrics that can be optimized here. Some of the more controversial design decisions for the generated models are noted in DESIGN.md.


Framework Support

Currently automl-gs supports the generation of models for regression and classification problems using the following Python frameworks:



To be implemented:



How to Use

automl-gs can be installed via pip:


```shell
pip3 install automl_gs
```


You will also need to install the corresponding ML/DL framework (e.g. tensorflow/tensorflow-gpu for TensorFlow, xgboost for xgboost, etc.).


After that, you can run it directly from the command line. For example, with the famous Titanic dataset:


```shell
automl_gs titanic.csv Survived
```


If you want to use a different framework or configure the training, you can do it with flags:


```shell
automl_gs titanic.csv Survived --framework xgboost --num_trials 1000
```


You may also invoke automl-gs directly from Python (e.g. via a Jupyter Notebook):


```python
from automl_gs import automl_grid_search


automl_grid_search('titanic.csv', 'Survived')
```
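
If you would rather drive the command-line tool from a script (for example, to queue several runs), the flags shown earlier can be assembled programmatically. The sketch below is only an illustration: `build_automl_gs_command` is a hypothetical helper, not part of automl-gs, and it uses only the `--framework` and `--num_trials` flags that appear in this README.

```python
import shlex


def build_automl_gs_command(csv_path, target_field, framework=None, num_trials=None):
    """Assemble an argv list for the automl_gs command-line tool.

    Hypothetical helper: it only knows the two flags shown in this README.
    """
    command = ["automl_gs", csv_path, target_field]
    if framework is not None:
        command += ["--framework", framework]
    if num_trials is not None:
        command += ["--num_trials", str(num_trials)]
    return command


command = build_automl_gs_command(
    "titanic.csv", "Survived", framework="xgboost", num_trials=1000
)
print(shlex.join(command))
# prints: automl_gs titanic.csv Survived --framework xgboost --num_trials 1000

# To actually launch the run (requires automl_gs to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting problems with CSV paths that contain spaces.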


The output of the automl-gs training is:



Once the training is done, you can run the generated files from the command line within the generated folder above.


To predict:


```shell
python3 model.py -d data.csv -m predict
```


To retrain the model on new data:


```shell
python3 model.py -d data.csv -m train
```
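
To make the two commands above concrete, here is a minimal sketch of how a script with this interface could parse its arguments. It is not the generated `model.py`: the long option names (`--data`, `--mode`), the `main` function, and the returned messages are assumptions for illustration; only the `-d` and `-m` flags and the `train`/`predict` values come from the commands above.

```python
import argparse


def parse_args(argv=None):
    """Parse -d and -m in the same shape as the commands above."""
    parser = argparse.ArgumentParser(description="Hypothetical model.py-style entry point")
    # The long names --data and --mode are assumptions for readability.
    parser.add_argument("-d", "--data", required=True, help="path to the input CSV")
    parser.add_argument("-m", "--mode", required=True, choices=["train", "predict"])
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    if args.mode == "train":
        # A real script would refit the model on the new data here.
        return f"would retrain on {args.data}"
    # A real script would load the saved model and write predictions here.
    return f"would write predictions for {args.data}"


print(main(["-d", "data.csv", "-m", "predict"]))
print(main(["-d", "data.csv", "-m", "train"]))
```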


CLI Arguments/Function Parameters

You can view these at any time by running automl_gs -h in the command line.



Examples


For a quick Hello World on how to use automl-gs, see this Jupyter Notebook.


Due to the size of some examples with generated code and accompanying data visualizations, they are maintained in a separate repository. That repository also explains why there are two distinct "levels" in the example visualization above.


How automl-gs Works

TL;DR: automl-gs generates raw Python code using Jinja templates and trains a model using the generated code in a subprocess, repeating with different hyperparameters until done and saving the best model.


automl-gs loads a given CSV and infers the data type of each column to be fed into the model. Then it tries an ETL strategy for each column field as determined by the hyperparameters; for example, a Datetime field has its hour and dayofweek binary-encoded by default, but hyperparameters may dictate the encoding of month and year as additional model fields. ETL strategies are optimized for frameworks; TensorFlow, for example, will use text embeddings, while other frameworks will use CountVectorizers to encode text (when training, TensorFlow will also use a shared text encoder via Keras's functional API). automl-gs then creates a statistical model with the specified framework. Both the model ETL functions and model construction functions are saved as a generated Python script.
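
As a rough illustration of the default Datetime handling described above, the sketch below turns a timestamp into indicator features for hour and day of week. It reads "binary-encoded" as one indicator column per possible value, which is an assumption; the function names and the timestamp format are also made up for this example, and the code uses only the standard library rather than the pandas/scikit-learn pipeline that automl-gs generates.

```python
from datetime import datetime


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def encode_datetime(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Encode a timestamp string as hour-of-day and day-of-week indicators."""
    parsed = datetime.strptime(value, fmt)
    hour_features = one_hot(parsed.hour, 24)         # positions 0..23
    weekday_features = one_hot(parsed.weekday(), 7)  # Monday == 0
    return hour_features + weekday_features


features = encode_datetime("2019-03-25 14:30:00")  # a Monday afternoon
print(len(features))      # 31 columns: 24 for hour + 7 for day of week
print(features.index(1))  # 14, the hour indicator
print(features[24])       # 1, the Monday indicator
```

Encoding month and year as additional fields, as the hyperparameters may dictate, would follow the same pattern with more indicator blocks appended.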


automl-gs then runs the generated training script as a typical user would. Once the model is trained, automl-gs saves the training results in its own CSV, along with all the hyperparameters used to train the model. automl-gs then repeats the task with another set of hyperparameters, until the specified number of trials is hit or the user kills the script.


The best model Python script is kept after each trial, which can then easily be integrated into other scripts, or run directly to get the prediction results on a new dataset.
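
The generate, run, record, and keep-the-best loop described in this section can be sketched in a few lines. This is a toy under stated assumptions, not the automl-gs implementation: it uses the standard library's `string.Template` in place of Jinja, a made-up two-parameter search space, and a fake scoring formula instead of real model training. All names (`run_search`, `SEARCH_SPACE`, the file names) are hypothetical.

```python
import csv
import json
import random
import subprocess
import sys
import tempfile
from pathlib import Path
from string import Template

# Stand-in for a Jinja template: the generated "training script" computes
# a toy score from its hyperparameters and prints it as JSON.
SCRIPT_TEMPLATE = Template("""\
import json
learning_rate = $learning_rate
depth = $depth
# Toy objective standing in for real model training.
score = 1.0 - abs(learning_rate - 0.1) - 0.01 * abs(depth - 6)
print(json.dumps({"score": score}))
""")

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {"learning_rate": [0.01, 0.05, 0.1, 0.3], "depth": [3, 6, 9]}


def run_search(num_trials, workdir, seed=0):
    """Random search: generate a script, run it, log the result, keep the best."""
    rng = random.Random(seed)
    workdir = Path(workdir)
    best = None
    with open(workdir / "results.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["trial", *SEARCH_SPACE, "score"])
        writer.writeheader()
        for trial in range(num_trials):
            # Sample one value per hyperparameter at random.
            params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
            script = workdir / "trial_model.py"
            script.write_text(SCRIPT_TEMPLATE.substitute(params))
            # Run the generated script in a subprocess, as a user would.
            completed = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, check=True,
            )
            score = json.loads(completed.stdout)["score"]
            writer.writerow({"trial": trial, **params, "score": score})
            if best is None or score > best["score"]:
                best = {"score": score, "params": params}
                # Keep a copy of the best script seen so far.
                (workdir / "best_model.py").write_text(script.read_text())
    return best


with tempfile.TemporaryDirectory() as tmp:
    best = run_search(num_trials=5, workdir=tmp)
    print(best["params"], round(best["score"], 3))
```

Running each trial in its own process keeps a crash or memory leak in one trial from affecting the search loop, and the results CSV gives a complete record of what was tried.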


Helpful Notes

Known Issues

Future Work

Feature development will continue on automl-gs as long as there is interest in the package.


Top Priority

Elsework

Maintainer/Creator

Max Woolf (@minimaxir)


Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.


License

MIT


The code generated by automl-gs is unlicensed; the owner of the generated code can decide the license.