README on front page

Commit f57f73de authored Jun 20, 2019 by Roque

Showing with 120 additions and 0 deletions

--- a/README.md
+++ b/README.md
+# ------ EBULK INGESTION-DOWNLOAD TOOL ------
+
+# CONTENT:
+ - Bash script for ingestion and download
+ - Embulk plugins
+ - Configuration files (yml)
+
+# REQUIREMENTS
+ This tool relies on the **Embulk** Java application (see [docs](http://www.embulk.org/)).
+ Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.
+
+ On first use after the package is installed, the bash script tries to install Embulk automatically if it is not already installed.
+ If your OS requires special permissions, you may need to install Embulk v0.9.7 manually:
+
+    curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
+    chmod +x ~/.embulk/bin/embulk
+    echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
+    source ~/.bashrc
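
Before running the tool, it can help to confirm that both prerequisites are reachable from the shell. The following is a minimal sketch, not part of ebulk: `have_cmd` and `check_prereqs` are hypothetical helper names, and the check only tests for presence on `PATH`, not for Java 8 or Embulk 0.9.7 specifically.

```shell
# have_cmd is a hypothetical helper, not part of ebulk.
# It succeeds when the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print one status line for each prerequisite named in the README.
check_prereqs() {
    for cmd in java embulk; do
        if have_cmd "$cmd"; then
            echo "$cmd: found"
        else
            echo "$cmd: missing"
        fi
    done
}

check_prereqs
```

If `embulk` is reported as missing after a manual install, the `PATH` line added to `~/.bashrc` has probably not been loaded in the current shell yet.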
+            
+# ------ DOWNLOAD ------
+# QUICK START
+
+ To start the download, run the following command:
+
+	```
+	ebulk pull <DATA_SET>
+	```
+
+ where `<DATA_SET>` is the dataset reference shown on the site
+ (e.g. **ebulk pull my-dataset**).
+
+ This automatically installs Embulk if needed and asks for user credentials.
+ After authentication, the download starts and the files are saved in an output directory named after the dataset reference.
+
+ `<DATA_SET>` can also be a path, in which case the last directory is interpreted as the dataset reference
+ (e.g. **ebulk pull my_directory/sample/**  -->  the dataset reference is "sample").
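
The path rule above matches what the standard `basename` utility does with a trailing slash. The sketch below illustrates the rule only; `dataset_reference` is a hypothetical helper name, and how ebulk resolves the reference internally is not shown in this README.

```shell
# dataset_reference is a hypothetical helper, not an ebulk command.
# It prints the last path component, ignoring any trailing slash.
dataset_reference() {
    basename "$1"
}

dataset_reference "my_directory/sample/"   # prints: sample
dataset_reference "my-dataset"             # prints: my-dataset
```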
+
+# CUSTOMIZE CHUNK SIZE
+ If you need to specify the chunk size for a split download (e.g. due to memory errors with big files),
+ run the command with these parameters:
+
+	```
+	ebulk pull <DATA_SET> -c <CHUNK_SIZE>
+	``` 
+
+ where `<CHUNK_SIZE>` is an integer that sets the chunk size in MB
+ (e.g. **ebulk pull my-dataset -c 10**).
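
To get a feel for the effect of the chunk size, the number of pieces a file is split into can be estimated with a ceiling division. This is a rough sketch under two assumptions that the README does not confirm: every chunk holds at most `<CHUNK_SIZE>`, and the last chunk carries the remainder. `chunk_count` is a hypothetical helper name.

```shell
# chunk_count SIZE CHUNK  (hypothetical helper, not part of ebulk)
# Prints how many chunks of at most CHUNK units cover SIZE units.
# Both arguments must be positive integers in the same unit.
chunk_count() {
    size=$1
    chunk=$2
    echo $(( (size + chunk - 1) / chunk ))
}

chunk_count 25 10   # prints: 3
```

A smaller chunk size means more, smaller pieces, which is the trade-off to consider when memory errors appear with big files.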
+
+# CUSTOMIZE OUTPUT DIRECTORY
+ You can use a custom output directory, different from the dataset reference. That location will be linked to the dataset reference.
+
+	```
+	ebulk pull <DATA_SET> -d <PATH>
+	```
+
+ where `<PATH>` is the output location for the downloaded files
+ (e.g. **ebulk pull my-dataset -d some/different/path**).
+
+ The content of `<DATA_SET>` is downloaded into `<PATH>`, and that location is linked to the reference `<DATA_SET>`.
+ This means that even if the directory is moved or renamed, operations on it still refer to the dataset reference
+ (e.g. **ebulk pull moved/or/renamed/path** will try to download the dataset 'my-dataset').
+
+
+
+# ------ INGESTION ------
+# QUICK START
+ To start the ingestion, run the following command:
+
+	```
+	ebulk push <DATA_SET>
+	```
+
+ where `<DATA_SET>` is both the dataset reference for your dataset and the input directory containing the files
+ (e.g. **ebulk push my-dataset**).
+
+ This automatically installs Embulk if needed and asks for user credentials.
+ After authentication, the ingestion starts.
+
+# CUSTOMIZE CHUNK SIZE AND INPUT DIRECTORY
+ The chunk size used to split the ingestion and the custom input directory work as in the download operation
+ (e.g. **ebulk push my-dataset -c 10**)
+ (e.g. **ebulk push my-dataset -d some/different/path**).
+
+# USE A DIFFERENT INPUT STORAGE
+ The Ebulk tool has some preinstalled input storages that you can use to ingest from locations other than the file system. These are:
+ - File transfer protocol: ftp
+ - HTTP request: http
+ - Amazon Web Services S3: s3
+
+ To use one of these storages as input, run the following command:
+
+	```
+	ebulk push <DATA_SET> --storage <STORAGE>
+	```
+
+ where `<STORAGE>` is one of the available inputs: ftp, http, s3
+ (e.g. **ebulk push my-dataset --storage http**).
+
+ Each storage prompts for the inputs it needs, such as credentials and URLs.
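
When calling the tool from another script, the storage name can be checked against the three preinstalled values before `ebulk push` is invoked. This is a minimal sketch; `is_supported_storage` is a hypothetical wrapper-side helper, not an ebulk option, and it only covers the names listed in this README.

```shell
# is_supported_storage is a hypothetical helper, not part of ebulk.
# It succeeds only for the preinstalled storages listed in the README.
is_supported_storage() {
    case "$1" in
        ftp|http|s3) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: classify a few candidate storage names.
for s in ftp http s3 sftp; do
    if is_supported_storage "$s"; then
        echo "$s: preinstalled"
    else
        echo "$s: not preinstalled"
    fi
done
```

A name that is not preinstalled would need to go through the custom storage installation instead.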
+
+# ADVANCED STORAGE
+ You can edit the Embulk configuration file of the selected storage to run more complex scenarios.
+ * Please keep in mind that this requires some knowledge of Embulk.
+
+	```
+	ebulk push <DATA_SET> --storage <STORAGE> --advanced
+	```
+
+# CUSTOM
+ You can request the installation of a new input storage by running the following command:
+
+	```
+	ebulk push <DATA_SET> --custom-storage
+	```
+
+ The tool will ask for the Embulk input plugin (gem) to install.
+ The input gem can be picked from here: http://www.embulk.org/plugins/
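
For orientation, Embulk itself installs plugins with `embulk gem install <gem>`, and input plugins are conventionally named `embulk-input-*`. The sketch below only prints the command that a manual install would use; `plugin_install_command` is a hypothetical helper, and whether ebulk runs exactly this command internally is an assumption.

```shell
# plugin_install_command is a hypothetical helper, not part of ebulk.
# It prints the Embulk command that would install an input plugin gem,
# and fails if the name does not follow the embulk-input-* convention.
plugin_install_command() {
    case "$1" in
        embulk-input-?*) echo "embulk gem install $1" ;;
        *) echo "not an Embulk input plugin gem: $1" >&2; return 1 ;;
    esac
}

plugin_install_command "embulk-input-sftp"   # prints: embulk gem install embulk-input-sftp
```

Checking the name first catches the common mistake of supplying an output or filter plugin where an input plugin is expected.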
+
+
+
+