Update ebulk readme

See merge request !3

4b9ac6c0 · Ivan Tyagov · a49e39ff · e33c1e6e · 4b9ac6c0
Commit 4b9ac6c0 authored Oct 28, 2020 by Ivan Tyagov

Showing with 71 additions and 29 deletions

README.md +71 -29

--- a/README.md
+++ b/README.md
@@ -9,16 +9,58 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
 - Many binary formats (ndarray, video, etc.)
 - Trade secret

-# PROJECT CONTENT:
- - Bash script for ingestion and download
- - Embulk plugins
- - Configuration files (yml)
-
 # REQUIREMENTS
- This tool relies on **Embulk** Java application (see [docs](http://www.embulk.org/)). 
+ This tool relies on **Embulk** Java application (see [docs](http://www.embulk.org/)).
 Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.

- After installing the package and in the first use, the bash script will try to install Embulk automatically (if it is not installed). 
+# INSTALL
+
+Please use the package installation for your operating system and follow the installation instructions.
+
+## Linux
+
+The ebulk package, available in the ubuntu-ppa repository, makes it easy to install the tool using apt commands.
+
+Make sure `software-properties-common` is installed in order to run all apt commands:
+
+```
+sudo apt-get install software-properties-common
+```
+
+Add the ppa repository:
+
+```
+sudo add-apt-repository ppa:rporchetto/ebulk-ppa
+```
+
+Update your local sources and install ebulk:
+
+```
+sudo apt-get update
+sudo apt-get install ebulk
+```
+
+## Debian considerations
+
+If the apt installation does not work for your OS version/series, it is recommended to install ebulk from the `.deb` package directly.
+
+Please download the latest `.deb` ebulk package and install it by running:
+
+```
+dpkg -i ebulk_package.deb
+```
+
+## Mac OS X
+
+Installation on Mac OS can be done via homebrew packages by running:
+
+```
+brew install https://github.com/roquegit/homebrew-ebulk/raw/master/ebulk.rb
+```
+
+## Potential installation issues
+
+ During the package installation, or during the first ebulk execution, the bash script will try to install Embulk automatically (if it is not installed).
 If your OS needs special permissions, it may be necessary to install Embulk v 0.9.7 manually:

    curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
@@ -30,11 +72,11 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake

 To start the download, run the following command:

-	```
-	ebulk pull <DATA_SET>
-	```
+```
+ebulk pull <DATA_SET>
+```

- being `<DATA_SET>` the dataset reference showed in the site. 
+ where `<DATA_SET>` is the dataset reference shown on the site.
 (e.g. **ebulk pull my-dataset**)

 This will automatically install Embulk and it will ask for user credentials.
@@ -47,9 +89,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
 If there is need to specify the chunk size for split download (e.g. due to memory errors with big files),
 run the command with these parameters:

-	```
-	ebulk pull <DATA_SET> -c <CHUNK_SIZE>
-	``` 
+```
+ebulk pull <DATA_SET> -c <CHUNK_SIZE>
+```

 where `<CHUNK_SIZE>` is an integer that sets the chunk size in Mb.
 (e.g. **ebulk pull my-dataset -c 10**)
@@ -57,9 +99,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
 # CUSTOMIZE OUTPUT DIRECTORY
 Allows the use of a custom output directory, different from the dataset reference. That location will be linked to the dataset reference.

-	```
-	ebulk pull <DATA_SET> -d <PATH>
-	```
+```
+ebulk pull <DATA_SET> -d <PATH>
+```

 where `<PATH>` is the output location of the downloaded files.
 (e.g. **ebulk pull my-dataset -d some/different/path**)
@@ -73,9 +115,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
 # INGESTION QUICK START
 To start the ingestion, run the following command:

-	```
-	ebulk push <DATA_SET>
-	```
+```
+ebulk push <DATA_SET>
+```

 where `<DATA_SET>` is the dataset reference for your dataset, and the input directory where the files are located.
 (e.g. **ebulk push my-dataset**)
@@ -95,9 +137,9 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
 - Amazon web service S3: s3
 To use one of those storages as input, run the following command:

-	```
-	ebulk push <DATA_SET> --storage <STORAGE>
-	```
+```
+ebulk push <DATA_SET> --storage <STORAGE>
+```

 where `<STORAGE>` is one of the following available inputs: ftp, http, s3
 (e.g. **ebulk push my-dataset --storage http**)
@@ -108,16 +150,16 @@ Along with Wendelin platform, ebulk is combined to form an easy to use Data Lake
 The user can edit the Embulk configuration file of the selected storage to run more complex scenarios
 * Please keep in mind that some knowledge about Embulk is required

-	```
-	ebulk push <DATA_SET> --storage <STORAGE> --advanced
-	```
+```
+ebulk push <DATA_SET> --storage <STORAGE> --advanced
+```

 # CUSTOM
 The user can request the installation of a new input storage, running the following command:

-	```
-	ebulk push <DATA_SET> --custom-storage
-	```
+```
+ebulk push <DATA_SET> --custom-storage
+```

 The tool will request the user to input the desired Embulk input plugin (gem) in order to install it.
 The input gem can be picked from here: http://www.embulk.org/plugins/
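
The `ebulk pull` options documented in the README (`-c <CHUNK_SIZE>` and `-d <PATH>`) can be illustrated with a small dry-run sketch. The function `build_pull_cmd` is a hypothetical helper and not part of ebulk. It only prints the command line it would run, so it can be tried without ebulk installed. It also assumes that `-c` and `-d` may be combined in one call, which the README does not state explicitly.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: build_pull_cmd is a hypothetical helper, not part of ebulk.
# It prints the `ebulk pull` command line instead of executing it (a dry run).

build_pull_cmd() {
  local dataset="$1"          # dataset reference, e.g. my-dataset
  local chunk_size="${2:-}"   # optional chunk size, passed to -c
  local out_dir="${3:-}"      # optional output directory, passed to -d
  local cmd=(ebulk pull "$dataset")

  if [ -n "$chunk_size" ]; then
    cmd+=(-c "$chunk_size")
  fi
  # Assumption: -c and -d can be used together; the README shows them separately.
  if [ -n "$out_dir" ]; then
    cmd+=(-d "$out_dir")
  fi

  # Join the array elements with single spaces and print the result.
  printf '%s\n' "${cmd[*]}"
}

build_pull_cmd my-dataset
build_pull_cmd my-dataset 10
build_pull_cmd my-dataset "" some/different/path
```

Once the printed command looks right, it can be copied and run directly. Passing an empty string as the second argument skips the chunk size while still setting the output directory.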