
ebulk

TOOL DESCRIPTION

Ebulk is a wrapper for Embulk, an open-source bulk data loader that transfers data between various databases, storages, file formats, and cloud services. It supports input files of any format, parallel and distributed execution to deal with big data sets, transaction control to guarantee all-or-nothing file transfer, and resuming of interrupted operations. Ebulk is as easy to use as git, so big data transfers can be done with very few commands.

BIG DATA SHARING PLATFORM

Combined with the Wendelin platform, ebulk forms an easy-to-use Data Lake for sharing petabytes of data grouped into data sets. This project addresses the big data sharing problem by solving the following key points:

  • Huge transfer (over slow and unreliable network)
  • Huge storage (with little budget)
  • Many protocols (S3, HTTP, FTP, etc.)
  • Many binary formats (ndarray, video, etc.)
  • Trade secret

PROJECT CONTENT:

  • Bash script for ingestion and download
  • Embulk plugins
  • Configuration files (yml)

REQUIREMENTS

This tool relies on the Embulk Java application (see docs). Please make sure that Java 8 is installed.
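
As a quick way to verify the Java requirement, the following is a minimal sketch, not part of ebulk. The helper name `java_major_version` is hypothetical. It assumes the usual version strings reported by `java -version`, where the legacy scheme `1.8.0_292` means Java 8 and the newer scheme `11.0.2` means Java 11.

```shell
# Hypothetical helper (not part of ebulk): extract the major Java version
# from a version string such as "1.8.0_292" or "11.0.2".
java_major_version() {
  version="$1"
  major="${version%%.*}"
  if [ "$major" = "1" ]; then
    # Legacy scheme: "1.8.0_292" means Java 8.
    rest="${version#1.}"
    major="${rest%%.*}"
  fi
  echo "$major"
}

# Usage: `java -version` writes to stderr; the version is the quoted string
# on its first line.
if command -v java >/dev/null 2>&1; then
  reported="$(java -version 2>&1 | head -n 1 | sed -E 's/.*"([^"]+)".*/\1/')"
  echo "Java major version: $(java_major_version "$reported")"
else
  echo "java not found on PATH"
fi
```

If the reported major version is not 8, Embulk may not run as expected.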

On first use after the package is installed, the bash script will try to install Embulk automatically if it is not already present. If your OS requires special permissions, you may need to install Embulk v0.9.7 manually:

```
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

QUICK START

To start the download, run the following command:

```
ebulk pull <DATA_SET>
```

where <DATA_SET> is the dataset reference shown on the site (e.g. ebulk pull my-dataset).

This will automatically install Embulk and ask for user credentials. After authentication, it will start the download and create an output directory, named after the dataset reference, containing the downloaded files.

<DATA_SET> can also be a path, in which case the last directory is interpreted as the dataset reference (e.g. ebulk pull my_directory/sample/ --> the dataset reference will be "sample").
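
The path rule above can be illustrated with a short sketch. The helper `dataset_reference` is hypothetical and only mirrors the rule as described here; it is not ebulk's own code.

```shell
# Hypothetical helper: returns the last directory of a path, which is the
# part the text above says is used as the dataset reference.
dataset_reference() {
  # basename ignores trailing slashes, so "my_directory/sample/" gives "sample".
  basename "$1"
}

dataset_reference "my-dataset"            # prints: my-dataset
dataset_reference "my_directory/sample/"  # prints: sample
```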

CUSTOMIZE CHUNK SIZE

If you need to specify the chunk size for split downloads (e.g. because of memory errors with big files), run the command with these parameters:

```
ebulk pull <DATA_SET> -c <CHUNK_SIZE>
``` 

where <CHUNK_SIZE> is an integer that sets the chunk size in MB (e.g. ebulk pull my-dataset -c 10).
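
To get a feel for the effect of a given chunk size, the sketch below estimates how many chunks a file would be split into. It assumes that a file is cut into consecutive pieces of at most <CHUNK_SIZE> MB, which is an assumption about ebulk's behavior rather than something stated here. The helper `chunk_count` is hypothetical.

```shell
# Hypothetical helper: number of chunks for a file of a given size,
# assuming consecutive chunks of at most <chunk size> MB (rounded up).
# Arguments: file size in MB, chunk size in MB.
chunk_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

chunk_count 95 10    # prints: 10
chunk_count 100 50   # prints: 2
```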

CUSTOMIZE OUTPUT DIRECTORY

This option lets you use a custom output directory, different from the dataset reference. That location will be linked to the dataset reference.

```
ebulk pull <DATA_SET> -d <PATH>
```

where <PATH> is the output location for the downloaded files (e.g. ebulk pull my-dataset -d some/different/path).

The content of <DATA_SET> will be downloaded in <PATH>, and that location will be linked to the reference <DATA_SET>. This means that even if the directory is moved or renamed, operations on it will still refer to the dataset reference (e.g. ebulk pull moved/or/renamed/path will try to download the dataset 'my-dataset').

INGESTION QUICK START

To start the ingestion, run the following command:

```
ebulk push <DATA_SET>
```

where <DATA_SET> is both the dataset reference for your dataset and the input directory where the files are located (e.g. ebulk push my-dataset).

This will automatically install Embulk and ask for user credentials. After authentication, it will start the ingestion.

CUSTOMIZE CHUNK SIZE AND INPUT DIRECTORY

Customizing the chunk size used to split ingestions, or the input directory, works as in the download operation (e.g. ebulk push my-dataset -c 10, or ebulk push my-dataset -d some/different/path).

USE A DIFFERENT INPUT STORAGE

Ebulk comes with some preinstalled input storages that can be used to ingest from locations other than the file system. These are:

  • File transfer protocol: ftp
  • HTTP request: http
  • Amazon Web Services S3: s3

To use one of these storages as input, run the following command:

    ebulk push <DATA_SET> --storage <STORAGE>
    

where <STORAGE> is one of the following available inputs: ftp, http, s3 (e.g. ebulk push my-dataset --storage http).

Each storage will ask the user for the inputs it needs, such as credentials or URLs.
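
When scripting ingestions, it can help to check the storage name before calling the tool. The sketch below is a hypothetical wrapper, not part of ebulk: `push_command` only prints the command it would run, and it accepts just the three storages listed above.

```shell
# Hypothetical wrapper: validates the storage name against the preinstalled
# storages (ftp, http, s3) and prints the matching ebulk command.
# It does not run ebulk itself.
push_command() {
  dataset="$1"
  storage="$2"
  case "$storage" in
    ftp|http|s3)
      echo "ebulk push $dataset --storage $storage"
      ;;
    *)
      echo "unsupported storage: $storage (expected ftp, http or s3)" >&2
      return 1
      ;;
  esac
}

push_command my-dataset http   # prints: ebulk push my-dataset --storage http
```

Any other storage name makes the function return a non-zero status, so a calling script can stop before starting an ingestion.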

ADVANCED STORAGE

The user can edit the Embulk configuration file of the selected storage to run more complex scenarios.

  • Please keep in mind that some knowledge about Embulk is required

    ebulk push <DATA_SET> --storage <STORAGE> --advanced
    

CUSTOM

The user can request the installation of a new input storage by running the following command:

```
ebulk push <DATA_SET> --custom-storage
```

The tool will ask the user for the desired Embulk input plugin (gem) in order to install it. The input gem can be picked from here: http://www.embulk.org/plugins/