first commit

parents
This diff is collapsed.
# ------ EBULK INGESTION-DOWNLOAD TOOL ------
# CONTENT:
- Bash script for ingestion and download
- Embulk plugins
- Configuration files (yml)
# REQUIREMENTS
This tool relies on **Embulk** Java application (see [docs](http://www.embulk.org/)).
Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.
After installing the package and in the first use, the bash script will try to install Embulk automatically (if it is not installed).
If your OS needs special permission, it maybe will be necessary to install Embulk v 0.9.7 manually:
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# ------ DOWNLOAD ------
# QUICK START
To start the download, run the following command:
```
ebulk pull <DATA_SET>
```
being `<DATA_SET>` the dataset reference showed in the site.
(e.g. **ebulk pull my-dataset**)
This will automatically install Embulk and it will ask for user credentials.
After authentication, it will start the download and create an output directory named as the dataset reference with the downloaded files.
`<DATA_SET>` could be also a path, then the last directory will be interpreted as the dataset reference
(e.g. **ebulk pull my_directory/sample/**) --> dataset reference will be "sample"
# CUSTOMIZE CHUNK SIZE
If there is need to specify the chunk size for split download (e.g. due to memory errors with big files),
run the command with these parameters:
```
ebulk pull <DATA_SET> -c <CHUNK_SIZE>
```
being `<CHUNK_SIZE>` an integer to set the size in Mb.
(e.g. **ebulk pull my-dataset 10**)
# CUSTOMIZE OUTPUT DIRECTORY
Allows to use a custom output directory, different from the dataset reference. That location will be linked to the dataset reference.
```
ebulk pull <DATA_SET> -d <PATH>
```
being `<PATH>` the output location of the downloaded files.
(e.g. **ebulk pull my-dataset -d some/different/path**)
The content of the <DATA_SET> will be downloaded in <PATH>, and that location will be linked to the reference <DATA_SET>
This means that no matter if the directory is moved or renamed, the operations will refer to the dataset reference:
(e.g. **ebulk pull moved/or/renamed/path** will try to download the dataset 'my-dataset')
# ------ INGESTION ------
# QUICK START
To start the ingestion, run the following command:
```
ebulk push <DATA_SET>
```
being `<DATA_SET>` the dataset reference for your dataset, and the input directory where the files are.
(e.g. **ebulk pull my-dataset**)
This will automatically install Embulk and it will ask for user credentials.
After authentication, it will start the ingestion.
# CUSTOMIZE CHUNK SIZE AND OUTPUT DIRECTORY
The chunk size to split the ingestions or the input directory customization works as in the download operation.
(e.g. **ebulk push my-dataset 10**)
(e.g. **ebulk push my-dataset -d some/different/path**)
# USE A DIFFERENT INPUT STORAGE
Ebulk tool has some preinstalled input storage that user can use to ingest from different locations than file system. These are:
- File transfer protocol: ftp
- HTTP request: http
- Amazon web service S3: s3
To use one of those storages as input, run the following command:
```
ebulk push <DATA_SET> --storage <STORAGE>
```
being `<STORAGE>` one of the following available inputs: ftp, http, s3
(e.g. **ebulk push my-dataset --storage http**)
Each storage will request the user inputs like credentials, urls, etc. depending on each case.
# ADVANCED STORAGE
The user can edit the Embulk configuration file of the selected storage to run more complex scenarios
* Please keep in mind that some knowledge about Embulk is required
```
ebulk push <DATA_SET> --storage <STORAGE> --advanced
```
# CUSTOM
The user can request the installation of a new input storage, running the following command:
```
ebulk push <DATA_SET> --custom-storage
```
The tool will request the user to input the desired Embulk input plugin (gem) in order to install it.
The input gem can be pick from here: http://www.embulk.org/plugins/
exec:
max_threads: 1
min_output_tasks: 1
in:
type: file
path_prefix: ./csv/
parser:
charset: UTF-8
type: csv
delimiter: ';'
columns:
- {name: id, type: string}
- {name: id2, type: string}
- {name: id3, type: string}
- {name: id4, type: string}
out:
type: wendelin
erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
user: "zope"
password: "asd"
exec:
max_threads: 1
min_output_tasks: 1
in:
type: file
path_prefix: ./csv/
parser:
charset: UTF-8
# newline: CRLF
type: csv
delimiter: ';'
# quote: '"'
# escape: ''
# null_string: 'NULL'
columns:
- {name: id, type: string}
- {name: id2, type: string}
- {name: id3, type: string}
- {name: id4, type: string}
out:
type: wendelin
erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
user: "zope"
password: "asd"
exec:
max_threads: 1
min_output_tasks: 1
in:
type: wendelin
erp5_url: "https://softinst102878.host.vifib.net/erp5/"
user: "asd"
password: "asd"
data_set: "sample"
chunk_size: "50"
output_path: "sample"
tool_dir: "."
out:
type: fif
output_path: "sample"
tool_dir: "."
exec:
max_threads: 1
min_output_tasks: 1
in:
type: wendelin
erp5_url: $DOWN_URL
user: $USER
password: $pwd
data_set: $DATA_SET
chunk_size: $CHUNK
output_path: $DATASET_DIR
tool_dir: $TOOL_DIR
out:
type: fif
output_path: $DATASET_DIR
exec:
max_threads: 1
min_output_tasks: 1
in:
type: wendelin
erp5_url: $DOWN_URL
user: $USER
password: $pwd
data_set: $DATA_SET
chunk_size: $CHUNK
output_path: $DATASET_DIR
tool_dir: $TOOL_DIR
out:
type: fif
output_path: $DATASET_DIR
tool_dir: $TOOL_DIR
exec:
max_threads: 1
min_output_tasks: 1
in:
type: fif
path_prefix: ["input/"]
supplier: [SUPPLIER]
data_set: [DATA_SET]
chunk_size: 0
out:
type: wendelin
erp5_url: 'https://softinst79462.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk'
user: [USER]
password: [PASSWORD]
tag: supplier.dataset.filename.extension.end
exec:
max_threads: 1
min_output_tasks: 1
in:
type: fif
path_prefix: [$DATASET_DIR]
supplier: $USER
data_set: $DATA_SET
chunk_size: $CHUNK
erp5_url: $DOWN_URL
user: $USER
password: $pwd
tool_dir: $TOOL_DIR
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
in:
type: fif
path_prefix: [$DATASET_DIR]
supplier: $USER
data_set: $DATA_SET
chunk_size: $CHUNK
erp5_url: $DOWN_URL
user: $USER
password: $pwd
tool_dir: $TOOL_DIR
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
tool_dir: $TOOL_DIR
# CUSTOM CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR CUSTOM EMBULK PLUGIN
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
# PLEASE FILL THE 'IN' SECTION ACCORDING TO YOUR PLUGIN
in:
# FOR EXAMPLE CSV FILES
# type: file
# path_prefix: MY_CSV_DIRECTORY
# FOR EXAMPLE AWS-S3 storage:
# type: s3
# bucket: MY_BUCKET
# path_prefix: ""
# access_key_id: MY_KEY_ID
# secret_access_key: MY_SECRET_KEY
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
storage: $STORAGE
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
# CUSTOM CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR CUSTOM EMBULK PLUGIN
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
# PLEASE FILL THE 'IN' SECTION ACCORDING TO YOUR PLUGIN
in:
# FOR EXAMPLE CSV FILES
# type: file
# path_prefix: MY_CSV_DIRECTORY
# FOR EXAMPLE AWS-S3 storage:
# type: s3
# bucket: MY_BUCKET
# path_prefix: ""
# access_key_id: MY_KEY_ID
# secret_access_key: MY_SECRET_KEY
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
input_plugin: $STORAGE
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
# FTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR FTP STORAGE
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
type: ftp
host: $FTP_HOST
user: $FTP_USER
password: $FTP_PASSWORD
path_prefix: $FTP_PATH
#ssl_verify: false
#port: 21
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
storage: $STORAGE
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
# FTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR FTP STORAGE
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
type: ftp
host: $FTP_HOST
user: $FTP_USER
password: $FTP_PASSWORD
path_prefix: $FTP_PATH
#ssl_verify: false
#port: 21
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
storage: $STORAGE
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
# HTTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR HTTP URL
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
type: http
url: "http://archive.ics.uci.edu/ml/machine-learning-databases/00000/Donnees%20conso%20autos.txt"
method: "get"
# basic_auth:
# user: MyUser
# password: MyPassword
# params:
# - {name: paramA, value: valueA}
# - {name: paramB, value: valueB}
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: "zope"
data_set: "http"
tool_dir: "."
chunk_size: "50"
storage: "http"
path_prefix:
out:
type: wendelin
erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
user: "zope"
password: "telecom"
exec:
max_threads: 1
min_output_tasks: 1
# HTTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR HTTP URL
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
type: http
url: $HTTP_URL
method: $HTTP_METHOD
# basic_auth:
# user: MyUser
# password: MyPassword
# params:
# - {name: paramA, value: valueA}
# - {name: paramB, value: valueB}
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
storage: $STORAGE
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
# HTTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR HTTP URL
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
type: http
url: $HTTP_URL
method: $HTTP_METHOD
# basic_auth:
# user: MyUser
# password: MyPassword
# params:
# - {name: paramA, value: valueA}
# - {name: paramB, value: valueB}
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
storage: $STORAGE
path_prefix: $HTTP_PREFIX
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
exec:
max_threads: 1
min_output_tasks: 1
in:
type: s3
bucket: "roque5"
path_prefix: ""
access_key_id: "AKIAJLY3N4YBNAJMBLGQ"
secret_access_key: "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"
parser:
type: binary
supplier: "zope"
data_set: "encoding"
tool_dir: "."
chunk_size: "5"
input_plugin "s3"
out:
type: wendelin
erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
user: "zope"
password: "telecom"
# S3 CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR S3 BUCKET
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
type: s3
bucket: $S3_BUCKET
path_prefix: $S3_PREFIX
access_key_id: $S3_ACCESS_KEY
secret_access_key: $S3_SECRET_KEY
auth_method: $S3_AUTH_METHOD
# endpoint:
# region:
# path_match_pattern:
# http_proxy:
# host:
# port:
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
storage: $STORAGE
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
# S3 CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR S3 BUCKET
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
type: s3
bucket: $S3_BUCKET
path_prefix: $S3_PREFIX
access_key_id: $S3_ACCESS_KEY
secret_access_key: $S3_SECRET_KEY
auth_method: $S3_AUTH_METHOD
# endpoint:
# region:
# path_match_pattern:
# http_proxy:
# host:
# port:
# PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
parser:
type: binary
supplier: $USER
data_set: $DATA_SET
tool_dir: $TOOL_DIR
chunk_size: $CHUNK
storage: $STORAGE
path_prefix: $S3_PREFIX
out:
type: wendelin
erp5_url: $ING_URL
user: $USER
password: $pwd
exec:
max_threads: 1
min_output_tasks: 1
source 'https://rubygems.org/'
gemspec
MIT License
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# Embulk-wendelin-dataset-tool input-output plugin for Embulk
Input and output plugins for wendelin dataset-tool.
################### INPUT PLUGINS ###################
## Overview
* **Plugin type**: fif
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
## Configuration
- **path_prefix**: description (array, required)
- **supplier**: description (string, default: `"default"`)
- **dataset**: description (string, default: `"default"`)
- **chunk_size**: description (integer, default: `0`)
## Schema (self included in plugin)
- {"name"=>"supplier", "type"=>"string"}
- {"name"=>"dataset", "type"=>"string"}
- {"name"=>"file", "type"=>"string"},
- {"name"=>"extension", "type"=>"string"}
- {"name"=>"end", "type"=>"string"}
- {"name"=>"data_chunk", "type"=>"string"}
## Overview
* **Plugin type**: wendelin
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
- **chunk_size**: description (integer, default: `0`)
## Configuration
- **erp5_url**: description (array, required)
- **user**: description (string, required)
- **password**: description (string, required)
- **supplier**: description (string, default: `"default"`)
- **dataset**: description (string, default: `"default"`)
################### OUTPUT PLUGINS ###################
## Overview
* **Plugin type**: wendelin
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
## Configuration
- **erp5_url**: description (array, required)
- **user**: description (string, required)
- **password**: description (string, required)
- **tag**: "supplier.dataset.filename.extension.end"
## Overview
* **Plugin type**: fif
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
## Configuration
- **output_path**: description (string, required)
require "bundler/gem_tasks"
task default: :build
Gem::Specification.new do |spec|
spec.name = "embulk-wendelin-dataset-tool"
spec.version = "0.1.0"
spec.authors = ["Roque Porchetto"]