Fastest dataset optimization and management for machine and deep learning. Stream data real-time & version-control it.
Under Mozilla Public License 2.0
By activeloopai


Dataset Format for AI


## Why use Hub?
**ML engineers spend the majority of their time building infrastructure, transferring data, and writing boilerplate code. The Hub format + API simplifies these tasks so that users can focus on building amazing machine learning models.**

Hub enables users to stream unlimited amounts of data from the cloud to any machine without sacrificing performance compared to local storage. In addition, Hub connects datasets to PyTorch and TensorFlow with minimal boilerplate code, and it has powerful tools for dataset version control, building ML pipelines, and running distributed workloads.

Hub is best suited for unstructured datasets such as images, videos, point clouds, or text. It works locally or on any cloud.

Google, Waymo, Red Cross, Omdena, and Rarebase use Hub.

## How does Hub work?

Databases, data lakes, and data warehouses are best suited for tabular data and are not optimized for deep-learning applications using images, videos, and text. By storing data as chunked compressed arrays, Hub significantly increases data transfer speeds between network-connected machines. This eliminates the need to download entire datasets before running code, because computations and data streaming can occur simultaneously without increasing the total runtime.
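
As a rough illustration of the idea (not Hub's actual storage format), the sketch below packs fixed-size samples into zlib-compressed chunks, reads a single sample by decompressing only the chunk that holds it, and streams samples in order while a background thread prepares the next chunks. The function names, the chunk size, and the in-memory list standing in for cloud storage are all assumptions made for this example.

```python
import queue
import threading
import zlib

SAMPLE_SIZE = 64         # bytes per fake sample (assumed fixed for simplicity)
SAMPLES_PER_CHUNK = 4    # assumed chunk size


def build_chunks(samples):
    """Pack equally sized byte samples into zlib-compressed chunks."""
    return [
        zlib.compress(b"".join(samples[start:start + SAMPLES_PER_CHUNK]))
        for start in range(0, len(samples), SAMPLES_PER_CHUNK)
    ]


def read_sample(chunks, index):
    """Return one sample by decompressing only the chunk that holds it."""
    chunk_id, offset = divmod(index, SAMPLES_PER_CHUNK)
    raw = zlib.decompress(chunks[chunk_id])
    return raw[offset * SAMPLE_SIZE:(offset + 1) * SAMPLE_SIZE]


def stream_samples(chunks, prefetch=2):
    """Yield samples in order while a background thread decompresses ahead."""
    buffer = queue.Queue(maxsize=prefetch)
    end = object()  # sentinel marking the end of the stream

    def producer():
        for chunk in chunks:
            buffer.put(zlib.decompress(chunk))
        buffer.put(end)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        raw = buffer.get()
        if raw is end:
            return
        for begin in range(0, len(raw), SAMPLE_SIZE):
            yield raw[begin:begin + SAMPLE_SIZE]


# Ten fake 64-byte "images", each filled with its own index value
samples = [bytes([i]) * SAMPLE_SIZE for i in range(10)]
chunks = build_chunks(samples)

one_sample = read_sample(chunks, 6)        # random access touches one chunk
streamed = list(stream_samples(chunks))    # sequential access with prefetching
```

In a real setting the producer would be fetching chunks over the network, so the consumer (for example, a training step) could work on one chunk while the next is still in flight.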

Hub also significantly reduces the time to build machine learning workflows, because its API eliminates boilerplate code that is typically required for data wrangling ✌.

## Features
### Current Release
* Easy dataset creation and hosting on Activeloop Cloud, S3, or Google Cloud
* Rapid dataset streaming to any machine
* Simple dataset integration to PyTorch and TensorFlow with no boilerplate code
* Rapid data processing using transformations on distributed compute
* Data pipelines

### Coming Soon
* Dataset version control
* Dataset hosting on Azure
* Dataset query without having to download the entire dataset
* Rapid visualization of image datasets via integration with Activeloop Platform

Visualization of a dataset uploaded to Hub

## Getting Started with Hub
### Installation
Hub is written entirely in Python and can be quickly installed using pip.
```sh
pip3 install hub
```
### Loading Datasets
Accessing datasets in Hub requires a single line of code. Run this snippet to get the first image in the Objectron Bikes Dataset as a NumPy array:
```python
import hub

ds = hub.load('hub://activeloop/objectron_bike_train')
image_arr = ds.image[0].numpy()
```
To access and train a classifier on your own Hub dataset stored in the cloud, run:
```python
import hub

ds = hub.load("s3://bucket_name/dataset_folder")
data_loader = ds.pytorch(batch_size=16, num_workers=4)

for batch in data_loader:
    pass  # Training loop here
```
### Creating Datasets
To upload your own dataset to Hub:
```python
import hub

fns = my_images  # Placeholder: list of image files in the dataset

# Define empty dataset
ds = hub.empty("gcp://bucket_name/dataset_folder")

# Upload data
with ds:
    # Create tensors
    ds.create_tensor('images', htype='image', sample_compression='jpg')
    ds.create_tensor('labels', htype='class_label')

    # Append data
    for fn in fns:
        ...  # Append each image and its label to the tensors here
```
## Documentation
Getting started guides, examples, tutorials, API reference, and other usage information can be found on our documentation page.

## For Students and Educators
Hub users can access and visualize a variety of popular datasets through a free integration with Activeloop's Platform. Users can also create and store their own datasets and make them available to the public. Free storage of up to 300 GB is available.

## Comparisons to Familiar Tools
### Hub and DVC
Hub and DVC both offer dataset version control similar to git for data, but their methods for storing data differ significantly. Hub converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. When datasets are composed of many files (e.g., many images), the Hub format makes dataset versioning significantly easier than the traditional file structures used by DVC. An additional distinction is that DVC primarily uses a command-line interface, whereas Hub is a Python package. Lastly, Hub offers an API to easily connect datasets to ML frameworks and other common ML tools.

### Hub and TensorFlow Datasets (TFDS)
Hub and TFDS both seamlessly connect popular datasets to ML frameworks. Hub datasets are compatible with both PyTorch and TensorFlow, whereas TFDS datasets are compatible only with TensorFlow. A key difference between Hub and TFDS is that Hub datasets are designed for streaming from the cloud, whereas TFDS datasets must be downloaded locally prior to use. In addition to providing access to popular publicly available datasets, Hub also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus.

### Hub and HuggingFace
Hub and HuggingFace both offer access to popular datasets, but Hub primarily focuses on computer vision, whereas HuggingFace primarily focuses on natural language processing. HuggingFace Transformers and other computational tools for NLP are not analogous to features offered by Hub.

## Community

Join our **Slack community** to learn more about unstructured dataset management using Hub and to get help from the Activeloop team and other users.

We'd love your feedback: please complete our 3-minute **survey**.

As always, thanks to our amazing contributors!

Made with contributors-img.

Please read the contributing guide to get started with making contributions to Hub.

## README Badge

Using Hub? Add a README badge to let everyone know:



## Disclaimers

### Dataset Licenses
Hub users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

### Usage Tracking
By default, we collect anonymous usage data using Bugout (the code that does it is in the repository). It does not collect user data and only logs the Hub library's own actions. This helps our team understand how the tool is used and how to build features that matter to you! After you register with Activeloop, data is no longer anonymous, but you can opt out of reporting using the CLI command below:

```sh
activeloop reporting --off
```

## Acknowledgment
This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.