Commit 4b9ac6c0 authored by Ivan Tyagov

Update ebulk readme

See merge request !3
parents a49e39ff e33c1e6e
......@@ -9,16 +9,58 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
- Many binary formats (ndarray, video, etc.)
- Trade secret
# PROJECT CONTENT:
- Bash script for ingestion and download
- Embulk plugins
- Configuration files (yml)
# REQUIREMENTS
This tool relies on **Embulk** Java application (see [docs](http://www.embulk.org/)).
Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.
After the package is installed, the bash script will try to install Embulk automatically on first use (if it is not already installed).
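The Java 8 requirement above can be checked before running the tool. The following is a minimal sketch, not part of ebulk: the helper names `java_major` and `installed_java_version` are hypothetical, and it assumes `java -version` prints a line containing `version "..."`, as common JDK builds do.

```shell
# Extract the major version from a string such as "1.8.0_292" or "11.0.11".
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: 1.8.0_292 -> 8.0_292
  esac
  # Drop everything from the first ".", "_" or "-" onwards.
  echo "${v%%[._-]*}"
}

# Print the version string of the `java` found on PATH; fail if there is none.
installed_java_version() {
  command -v java >/dev/null 2>&1 || return 1
  java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

if v="$(installed_java_version)"; then
  echo "Java major version: $(java_major "$v")"
else
  echo "Java not found on PATH"
fi
```

A reported major version of 8 matches the requirement stated above; whether newer versions work is not covered by this README.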
# INSTALL
Please use the package installation for your operating system and follow the installation instructions.
## Linux
The ebulk package is available in an Ubuntu PPA repository, so the tool can be installed with apt commands.
Make sure `software-properties-common` is installed so that all the apt commands below can run:
```
sudo apt-get install software-properties-common
```
Add the ppa repository:
```
sudo add-apt-repository ppa:rporchetto/ebulk-ppa
```
Update your local sources and install ebulk:
```
sudo apt-get update
sudo apt-get install ebulk
```
## Debian considerations
If apt installation fails for your OS version or series, it is recommended to install ebulk directly from the `.deb` package.
Please download the latest ebulk `.deb` package and install it by running:
```
dpkg -i ebulk_package.deb
```
## Mac OS X
Installation on Mac OS can be done via homebrew packages by running:
```
brew install https://github.com/roquegit/homebrew-ebulk/raw/master/ebulk.rb
```
## Potential installation issues
During the package installation, or during the first ebulk execution, the bash script will try to install Embulk automatically (if it is not installed).
If your OS needs special permissions, it may be necessary to install Embulk v0.9.7 manually:
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
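After a manual download like the one above, the file usually still has to be made executable before it can be invoked. The sketch below assumes the install path used in the curl command; the helper name `finish_embulk_install` and the `EMBULK_BIN` variable are hypothetical and not part of ebulk.

```shell
# Assumed install location, taken from the curl command above.
EMBULK_BIN="${EMBULK_BIN:-$HOME/.embulk/bin/embulk}"

# Make the downloaded file executable and report whether it is usable.
finish_embulk_install() {
  bin="$1"
  if [ ! -s "$bin" ]; then
    echo "missing or empty: $bin" >&2
    return 1
  fi
  chmod +x "$bin"
  echo "ready: $bin"
}

finish_embulk_install "$EMBULK_BIN" || echo "download Embulk first"

# Optionally put the directory on PATH for the current shell session.
PATH="$(dirname "$EMBULK_BIN"):$PATH"
export PATH
```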
......@@ -30,11 +72,11 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
To start the download, run the following command:
```
ebulk pull <DATA_SET>
```
where `<DATA_SET>` is the dataset reference shown on the site.
(e.g. **ebulk pull my-dataset**)
This will automatically install Embulk and it will ask for user credentials.
......@@ -47,9 +89,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
If you need to specify the chunk size for split download (e.g. due to memory errors with big files),
run the command with these parameters:
```
ebulk pull <DATA_SET> -c <CHUNK_SIZE>
```
where `<CHUNK_SIZE>` is an integer that sets the chunk size in Mb.
(e.g. **ebulk pull my-dataset -c 10**)
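Since the chunk size must be an integer, a script that calls ebulk could validate the value first. This is a minimal sketch under that assumption; `is_positive_int` and `pull_cmd` are hypothetical helpers, and `pull_cmd` only prints the command rather than running it.

```shell
# Succeed only if $1 is a positive integer (digits only, greater than zero).
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}

# Validate the chunk size, then print the ebulk command that would be run.
pull_cmd() {
  dataset="$1"
  chunk="$2"
  if ! is_positive_int "$chunk"; then
    echo "chunk size must be a positive integer (Mb)" >&2
    return 1
  fi
  echo "ebulk pull $dataset -c $chunk"
}

pull_cmd my-dataset 10
```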
......@@ -57,9 +99,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
# CUSTOMIZE OUTPUT DIRECTORY
This option lets you use a custom output directory, different from the dataset reference. That location will be linked to the dataset reference.
```
ebulk pull <DATA_SET> -d <PATH>
```
where `<PATH>` is the output location of the downloaded files.
(e.g. **ebulk pull my-dataset -d some/different/path**)
......@@ -73,9 +115,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
# INGESTION QUICK START
To start the ingestion, run the following command:
```
ebulk push <DATA_SET>
```
where `<DATA_SET>` is the dataset reference for your dataset, and also the input directory where the files are.
(e.g. **ebulk push my-dataset**)
......@@ -95,9 +137,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
- Amazon web service S3: s3
To use one of those storages as input, run the following command:
```
ebulk push <DATA_SET> --storage <STORAGE>
```
where `<STORAGE>` is one of the following available inputs: ftp, http, s3
(e.g. **ebulk push my-dataset --storage http**)
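A calling script could reject unsupported storage names before invoking ebulk. The sketch below uses only the three names listed above; `is_supported_storage` and `push_cmd` are hypothetical helpers, and `push_cmd` prints the command instead of executing it.

```shell
# Succeed only for the storage names listed in this README.
is_supported_storage() {
  case "$1" in
    ftp|http|s3) return 0 ;;
    *) return 1 ;;
  esac
}

# Validate the storage name, then print the ebulk command that would be run.
push_cmd() {
  dataset="$1"
  storage="$2"
  if ! is_supported_storage "$storage"; then
    echo "unsupported storage: $storage (expected ftp, http or s3)" >&2
    return 1
  fi
  echo "ebulk push $dataset --storage $storage"
}

push_cmd my-dataset http
```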
......@@ -108,16 +150,16 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
The user can edit the Embulk configuration file of the selected storage to run more complex scenarios.
* Please keep in mind that some knowledge of Embulk is required
```
ebulk push <DATA_SET> --storage <STORAGE> --advanced
```
# CUSTOM
The user can request the installation of a new input storage by running the following command:
```
ebulk push <DATA_SET> --custom-storage
```
The tool will ask the user for the desired Embulk input plugin (gem) in order to install it.
The input gem can be picked from here: http://www.embulk.org/plugins/
......