Commit f57f73de authored by Roque

README on front page

parent b8f0c2ec
# ------ EBULK INGESTION-DOWNLOAD TOOL ------
# CONTENT:
- Bash script for ingestion and download
- Embulk plugins
- Configuration files (yml)
# REQUIREMENTS
This tool relies on the **Embulk** Java application (see [docs](http://www.embulk.org/)).
Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.
After the package is installed, the bash script will try to install Embulk automatically on first use (if it is not already installed).
If your OS requires special permissions, it may be necessary to install Embulk v0.9.7 manually:
```
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```
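Before the first run, it can help to confirm that the requirements above are actually reachable from the shell. The following is a minimal sketch; `check_tool` is a hypothetical helper name used only for illustration and is not part of the ebulk tool.

```shell
# Hypothetical helper (not part of ebulk): report whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH" >&2
        return 1
    fi
}

# Usage:
#   check_tool java   && java -version
#   check_tool embulk && embulk --version
```

If `embulk` is reported as missing after a manual install, the `PATH` change written to `~/.bashrc` has probably not been loaded in the current shell yet.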
# ------ DOWNLOAD ------
# QUICK START
To start the download, run the following command:
```
ebulk pull <DATA_SET>
```
where `<DATA_SET>` is the dataset reference shown on the site
(e.g. **ebulk pull my-dataset**).
This will automatically install Embulk, if it is not already installed, and ask for user credentials.
After authentication, it will start the download and create an output directory, named after the dataset reference, containing the downloaded files.
`<DATA_SET>` can also be a path, in which case the last directory is interpreted as the dataset reference
(e.g. **ebulk pull my_directory/sample/** uses "sample" as the dataset reference).
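The path rule above can be illustrated with standard shell tools. This is only a sketch of the documented behaviour, not the tool's actual implementation; `dataset_reference` is a hypothetical helper name.

```shell
# Hypothetical helper: mimic the documented rule that the last directory
# of a path is taken as the dataset reference.
dataset_reference() {
    # basename ignores trailing slashes and returns the final path component
    basename -- "$1"
}

# Usage:
#   dataset_reference my_directory/sample/
#   dataset_reference my-dataset
```

A plain reference without slashes is returned unchanged, so the same helper covers both forms of `<DATA_SET>`.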
# CUSTOMIZE CHUNK SIZE
If you need to specify the chunk size used to split the download (e.g. because of memory errors with big files),
run the command with the following parameters:
```
ebulk pull <DATA_SET> -c <CHUNK_SIZE>
```
where `<CHUNK_SIZE>` is an integer that sets the chunk size in MB
(e.g. **ebulk pull my-dataset -c 10**).
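Since `<CHUNK_SIZE>` must be an integer, a wrapper script may want to validate the value before passing it to `-c`. The sketch below assumes the size must also be greater than zero, which the text above does not state; `is_valid_chunk_size` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the argument is a positive integer.
# Assumption: a chunk size of 0 is not meaningful.
is_valid_chunk_size() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
    esac
    [ "$1" -gt 0 ]
}

# Usage:
#   size=10
#   is_valid_chunk_size "$size" && ebulk pull my-dataset -c "$size"
```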
# CUSTOMIZE OUTPUT DIRECTORY
Use the `-d` option to download into a custom output directory instead of one named after the dataset reference. That location will be linked to the dataset reference.
```
ebulk pull <DATA_SET> -d <PATH>
```
where `<PATH>` is the output location for the downloaded files
(e.g. **ebulk pull my-dataset -d some/different/path**).
The content of `<DATA_SET>` will be downloaded into `<PATH>`, and that location will be linked to the reference `<DATA_SET>`.
This means that even if the directory is moved or renamed, operations on it will still refer to the dataset reference
(e.g. **ebulk pull moved/or/renamed/path** will try to download the dataset 'my-dataset').
# ------ INGESTION ------
# QUICK START
To start the ingestion, run the following command:
```
ebulk push <DATA_SET>
```
where `<DATA_SET>` is both the dataset reference for your dataset and the input directory containing the files
(e.g. **ebulk push my-dataset**).
This will automatically install Embulk, if it is not already installed, and ask for user credentials.
After authentication, it will start the ingestion.
# CUSTOMIZE CHUNK SIZE AND INPUT DIRECTORY
The chunk size used to split the ingestion and the custom input directory work as in the download operation
(e.g. **ebulk push my-dataset -c 10**)
(e.g. **ebulk push my-dataset -d some/different/path**)
# USE A DIFFERENT INPUT STORAGE
The Ebulk tool comes with some preinstalled input storages that can be used to ingest from locations other than the file system. These are:
- File transfer protocol: ftp
- HTTP request: http
- Amazon Web Services S3: s3
To use one of those storages as input, run the following command:
```
ebulk push <DATA_SET> --storage <STORAGE>
```
where `<STORAGE>` is one of the available inputs: ftp, http, s3
(e.g. **ebulk push my-dataset --storage http**).
Each storage will ask the user for the inputs it needs, such as credentials or URLs.
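When the push command is called from a script, the storage name can be checked against the three documented values before invoking the tool. This is a minimal sketch; `is_supported_storage` is a hypothetical helper name and not part of ebulk.

```shell
# Hypothetical helper: accept only the preinstalled storages listed above.
is_supported_storage() {
    case "$1" in
        ftp|http|s3) return 0 ;;
        *)           return 1 ;;
    esac
}

# Usage:
#   storage=http
#   if is_supported_storage "$storage"; then
#       ebulk push my-dataset --storage "$storage"
#   else
#       echo "unsupported storage: $storage" >&2
#   fi
```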
# ADVANCED STORAGE
The user can edit the Embulk configuration file of the selected storage to handle more complex scenarios.
* Please keep in mind that this requires some knowledge of Embulk.
```
ebulk push <DATA_SET> --storage <STORAGE> --advanced
```
# CUSTOM STORAGE
The user can request the installation of a new input storage by running the following command:
```
ebulk push <DATA_SET> --custom-storage
```
The tool will ask the user for the desired Embulk input plugin (gem) in order to install it.
The input gem can be picked from here: http://www.embulk.org/plugins/
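Embulk input plugins are conventionally published as gems named `embulk-input-<name>`, so a quick check of the name can catch an output or filter plugin picked by mistake. The sketch below relies on that naming convention only; `is_input_plugin_gem` is a hypothetical helper name, and the tool itself may accept other names.

```shell
# Hypothetical helper: check that a gem name follows the conventional
# "embulk-input-<name>" pattern used by Embulk input plugins.
is_input_plugin_gem() {
    case "$1" in
        embulk-input-?*) return 0 ;;   # prefix plus at least one character
        *)               return 1 ;;
    esac
}

# Usage:
#   is_input_plugin_gem embulk-input-postgresql && echo "looks like an input plugin"
```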