# ------ EBULK INGESTION-DOWNLOAD TOOL ------
# CONTENT:
- Bash script for ingestion and download
- Embulk plugins
- Configuration files (yml)
# REQUIREMENTS
This tool relies on the **Embulk** Java application (see [docs](http://www.embulk.org/)).
Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.
On first use after installing the package, the bash script will try to install Embulk automatically (if it is not already installed).
If your OS needs special permissions, it may be necessary to install Embulk v0.9.7 manually:
```
curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
chmod +x ~/.embulk/bin/embulk
echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```
# ------ DOWNLOAD ------
# QUICK START
To start the download, run the following command:
```
ebulk pull <DATA_SET>
```
where `<DATA_SET>` is the dataset reference shown on the site
(e.g. **ebulk pull my-dataset**).
This will automatically install Embulk if needed, and it will ask for user credentials.
After authentication, it will start the download and create an output directory, named after the dataset reference, containing the downloaded files.
`<DATA_SET>` can also be a path; in that case the last directory is interpreted as the dataset reference
(e.g. **ebulk pull my_directory/sample/** --> dataset reference will be "sample").
# CUSTOMIZE CHUNK SIZE
If you need to specify the chunk size used to split the download (e.g. due to memory errors with big files),
run the command with these parameters:
```
ebulk pull <DATA_SET> -c <CHUNK_SIZE>
```
where `<CHUNK_SIZE>` is an integer setting the size in MB
(e.g. **ebulk pull my-dataset -c 10**).
# CUSTOMIZE OUTPUT DIRECTORY
Allows using a custom output directory, different from the dataset reference. That location will be linked to the dataset reference.
```
ebulk pull <DATA_SET> -d <PATH>
```
where `<PATH>` is the output location of the downloaded files
(e.g. **ebulk pull my-dataset -d some/different/path**).
The content of `<DATA_SET>` will be downloaded into `<PATH>`, and that location will be linked to the reference `<DATA_SET>`.
This means that even if the directory is moved or renamed, operations on it will still refer to the dataset reference
(e.g. **ebulk pull moved/or/renamed/path** will try to download the dataset 'my-dataset').
# ------ INGESTION ------
# QUICK START
To start the ingestion, run the following command:
```
ebulk push <DATA_SET>
```
where `<DATA_SET>` is the dataset reference for your dataset and also the input directory where the files are
(e.g. **ebulk push my-dataset**).
This will automatically install Embulk if needed, and it will ask for user credentials.
After authentication, it will start the ingestion.
# CUSTOMIZE CHUNK SIZE AND OUTPUT DIRECTORY
Customizing the chunk size used to split the ingestion, or the input directory, works as in the download operation.
(e.g. **ebulk push my-dataset -c 10**)
(e.g. **ebulk push my-dataset -d some/different/path**)
# USE A DIFFERENT INPUT STORAGE
The ebulk tool ships with some preinstalled input storages that can be used to ingest from locations other than the file system. These are:
- File transfer protocol: ftp
- HTTP request: http
- Amazon web service S3: s3
To use one of those storages as input, run the following command:
```
ebulk push <DATA_SET> --storage <STORAGE>
```
where `<STORAGE>` is one of the following available inputs: ftp, http, s3
(e.g. **ebulk push my-dataset --storage http**).
Each storage will request the relevant user inputs (credentials, URLs, etc.) depending on the case.
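For reference, the prompted values end up in the `in:` section of the corresponding Embulk configuration file. A minimal sketch for the http storage is shown below; the URL is only an illustrative placeholder, and the rest of the file is filled in by the tool:
```
in:
  type: http
  url: "http://some.site/some/file.csv"
  method: "get"
```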
# ADVANCED STORAGE
The user can edit the Embulk configuration file of the selected storage to run more complex scenarios.
* Please keep in mind that some knowledge about Embulk is required.
```
ebulk push <DATA_SET> --storage <STORAGE> --advanced
```
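As a sketch of the kind of edit advanced mode enables: the FTP template shipped with the tool keeps some options commented out (for instance `ssl_verify` and `port`), and the user can enable or adjust them there. The values below are placeholders, not defaults:
```
in:
  type: ftp
  host: my.ftp.host
  user: my_user
  password: my_password
  path_prefix: /remote/data/
  ssl_verify: false
  port: 21
```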
# CUSTOM STORAGE
The user can request the installation of a new input storage, running the following command:
```
ebulk push <DATA_SET> --custom-storage
```
The tool will ask the user for the desired Embulk input plugin (gem) in order to install it.
The input gem can be picked from here: http://www.embulk.org/plugins/
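Once the gem is installed, the custom configuration file must be completed with that plugin's own `in:` section. As an illustration, the template itself suggests an S3-style block; all values below are placeholders:
```
in:
  type: s3
  bucket: MY_BUCKET
  path_prefix: ""
  access_key_id: MY_KEY_ID
  secret_access_key: MY_SECRET_KEY
```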
exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: file
  path_prefix: ./csv/
  parser:
    charset: UTF-8
    type: csv
    delimiter: ';'
    columns:
    - {name: id, type: string}
    - {name: id2, type: string}
    - {name: id3, type: string}
    - {name: id4, type: string}
out:
  type: wendelin
  erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
  user: "zope"
  password: "asd"

exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: file
  path_prefix: ./csv/
  parser:
    charset: UTF-8
    # newline: CRLF
    type: csv
    delimiter: ';'
    # quote: '"'
    # escape: ''
    # null_string: 'NULL'
    columns:
    - {name: id, type: string}
    - {name: id2, type: string}
    - {name: id3, type: string}
    - {name: id4, type: string}
out:
  type: wendelin
  erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
  user: "zope"
  password: "asd"
exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: wendelin
  erp5_url: "https://softinst102878.host.vifib.net/erp5/"
  user: "asd"
  password: "asd"
  data_set: "sample"
  chunk_size: "50"
  output_path: "sample"
  tool_dir: "."
out:
  type: fif
  output_path: "sample"
  tool_dir: "."

exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: wendelin
  erp5_url: $DOWN_URL
  user: $USER
  password: $pwd
  data_set: $DATA_SET
  chunk_size: $CHUNK
  output_path: $DATASET_DIR
  tool_dir: $TOOL_DIR
out:
  type: fif
  output_path: $DATASET_DIR

exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: wendelin
  erp5_url: $DOWN_URL
  user: $USER
  password: $pwd
  data_set: $DATA_SET
  chunk_size: $CHUNK
  output_path: $DATASET_DIR
  tool_dir: $TOOL_DIR
out:
  type: fif
  output_path: $DATASET_DIR
  tool_dir: $TOOL_DIR
exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: fif
  path_prefix: ["input/"]
  supplier: [SUPPLIER]
  data_set: [DATA_SET]
  chunk_size: 0
out:
  type: wendelin
  erp5_url: 'https://softinst79462.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk'
  user: [USER]
  password: [PASSWORD]
  tag: supplier.dataset.filename.extension.end

exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: fif
  path_prefix: [$DATASET_DIR]
  supplier: $USER
  data_set: $DATA_SET
  chunk_size: $CHUNK
  erp5_url: $DOWN_URL
  user: $USER
  password: $pwd
  tool_dir: $TOOL_DIR
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd

exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: fif
  path_prefix: [$DATASET_DIR]
  supplier: $USER
  data_set: $DATA_SET
  chunk_size: $CHUNK
  erp5_url: $DOWN_URL
  user: $USER
  password: $pwd
  tool_dir: $TOOL_DIR
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
  tool_dir: $TOOL_DIR
# CUSTOM CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR CUSTOM EMBULK PLUGIN
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
# PLEASE FILL THE 'IN' SECTION ACCORDING TO YOUR PLUGIN
in:
  # FOR EXAMPLE CSV FILES
  # type: file
  # path_prefix: MY_CSV_DIRECTORY
  # FOR EXAMPLE AWS-S3 storage:
  # type: s3
  # bucket: MY_BUCKET
  # path_prefix: ""
  # access_key_id: MY_KEY_ID
  # secret_access_key: MY_SECRET_KEY
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: $USER
    data_set: $DATA_SET
    tool_dir: $TOOL_DIR
    chunk_size: $CHUNK
    storage: $STORAGE
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
exec:
  max_threads: 1
  min_output_tasks: 1

# CUSTOM CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR CUSTOM EMBULK PLUGIN
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
# PLEASE FILL THE 'IN' SECTION ACCORDING TO YOUR PLUGIN
in:
  # FOR EXAMPLE CSV FILES
  # type: file
  # path_prefix: MY_CSV_DIRECTORY
  # FOR EXAMPLE AWS-S3 storage:
  # type: s3
  # bucket: MY_BUCKET
  # path_prefix: ""
  # access_key_id: MY_KEY_ID
  # secret_access_key: MY_SECRET_KEY
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: $USER
    data_set: $DATA_SET
    tool_dir: $TOOL_DIR
    chunk_size: $CHUNK
    input_plugin: $STORAGE
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
exec:
  max_threads: 1
  min_output_tasks: 1
# FTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR FTP STORAGE
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
  type: ftp
  host: $FTP_HOST
  user: $FTP_USER
  password: $FTP_PASSWORD
  path_prefix: $FTP_PATH
  #ssl_verify: false
  #port: 21
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: $USER
    data_set: $DATA_SET
    tool_dir: $TOOL_DIR
    chunk_size: $CHUNK
    storage: $STORAGE
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
exec:
  max_threads: 1
  min_output_tasks: 1
# HTTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR HTTP URL
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
  type: http
  url: "http://archive.ics.uci.edu/ml/machine-learning-databases/00000/Donnees%20conso%20autos.txt"
  method: "get"
  # basic_auth:
  # user: MyUser
  # password: MyPassword
  # params:
  # - {name: paramA, value: valueA}
  # - {name: paramB, value: valueB}
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: "zope"
    data_set: "http"
    tool_dir: "."
    chunk_size: "50"
    storage: "http"
    path_prefix:
out:
  type: wendelin
  erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
  user: "zope"
  password: "telecom"
exec:
  max_threads: 1
  min_output_tasks: 1

# HTTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR HTTP URL
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
  type: http
  url: $HTTP_URL
  method: $HTTP_METHOD
  # basic_auth:
  # user: MyUser
  # password: MyPassword
  # params:
  # - {name: paramA, value: valueA}
  # - {name: paramB, value: valueB}
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: $USER
    data_set: $DATA_SET
    tool_dir: $TOOL_DIR
    chunk_size: $CHUNK
    storage: $STORAGE
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
exec:
  max_threads: 1
  min_output_tasks: 1

# HTTP CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR HTTP URL
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
  type: http
  url: $HTTP_URL
  method: $HTTP_METHOD
  # basic_auth:
  # user: MyUser
  # password: MyPassword
  # params:
  # - {name: paramA, value: valueA}
  # - {name: paramB, value: valueB}
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: $USER
    data_set: $DATA_SET
    tool_dir: $TOOL_DIR
    chunk_size: $CHUNK
    storage: $STORAGE
    path_prefix: $HTTP_PREFIX
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
exec:
  max_threads: 1
  min_output_tasks: 1
exec:
  max_threads: 1
  min_output_tasks: 1
in:
  type: s3
  bucket: "roque5"
  path_prefix: ""
  access_key_id: "AKIAJLY3N4YBNAJMBLGQ"
  secret_access_key: "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"
  parser:
    type: binary
    supplier: "zope"
    data_set: "encoding"
    tool_dir: "."
    chunk_size: "5"
    input_plugin: "s3"
out:
  type: wendelin
  erp5_url: "https://softinst102878.host.vifib.net/erp5/portal_ingestion_policies/wendelin_embulk"
  user: "zope"
  password: "telecom"
# S3 CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR S3 BUCKET
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
  type: s3
  bucket: $S3_BUCKET
  path_prefix: $S3_PREFIX
  access_key_id: $S3_ACCESS_KEY
  secret_access_key: $S3_SECRET_KEY
  auth_method: $S3_AUTH_METHOD
  # endpoint:
  # region:
  # path_match_pattern:
  # http_proxy:
  # host:
  # port:
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: $USER
    data_set: $DATA_SET
    tool_dir: $TOOL_DIR
    chunk_size: $CHUNK
    storage: $STORAGE
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
exec:
  max_threads: 1
  min_output_tasks: 1

# S3 CONFIGURATION FILE
# PLEASE FILL THE FILE WITH THE CONFIGURATION OF YOUR S3 BUCKET
# ONLY THE 'IN' SECTION, OTHERS MUST REMAIN AS THEY ARE
in:
  type: s3
  bucket: $S3_BUCKET
  path_prefix: $S3_PREFIX
  access_key_id: $S3_ACCESS_KEY
  secret_access_key: $S3_SECRET_KEY
  auth_method: $S3_AUTH_METHOD
  # endpoint:
  # region:
  # path_match_pattern:
  # http_proxy:
  # host:
  # port:
  # PLEASE LEAVE THE SECTIONS BELOW AS THEY ARE (unless you know what you are doing)
  parser:
    type: binary
    supplier: $USER
    data_set: $DATA_SET
    tool_dir: $TOOL_DIR
    chunk_size: $CHUNK
    storage: $STORAGE
    path_prefix: $S3_PREFIX
out:
  type: wendelin
  erp5_url: $ING_URL
  user: $USER
  password: $pwd
exec:
  max_threads: 1
  min_output_tasks: 1
source 'https://rubygems.org/'
gemspec
MIT License
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# Embulk-wendelin-dataset-tool input-output plugin for Embulk
Input and output plugins for wendelin dataset-tool.
################### INPUT PLUGINS ###################
## Overview
* **Plugin type**: fif
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
## Configuration
- **path_prefix**: description (array, required)
- **supplier**: description (string, default: `"default"`)
- **dataset**: description (string, default: `"default"`)
- **chunk_size**: description (integer, default: `0`)
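As a rough sketch of how these options are laid out (mirroring the fif ingestion templates elsewhere in this repository; the values are placeholders):
```
in:
  type: fif
  path_prefix: ["input/"]
  supplier: my-supplier
  data_set: my-dataset
  chunk_size: 0
```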
## Schema (self included in plugin)
- {"name"=>"supplier", "type"=>"string"}
- {"name"=>"dataset", "type"=>"string"}
- {"name"=>"file", "type"=>"string"},
- {"name"=>"extension", "type"=>"string"}
- {"name"=>"end", "type"=>"string"}
- {"name"=>"data_chunk", "type"=>"string"}
## Overview
* **Plugin type**: wendelin
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
## Configuration
- **erp5_url**: description (string, required)
- **user**: description (string, required)
- **password**: description (string, required)
- **supplier**: description (string, default: `"default"`)
- **dataset**: description (string, default: `"default"`)
- **chunk_size**: description (integer, default: `0`)
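For illustration, a download configuration along the lines of the download templates in this repository (all values are placeholders) would be:
```
in:
  type: wendelin
  erp5_url: "https://my.erp5.site/erp5/"
  user: my-user
  password: my-password
  data_set: my-dataset
  chunk_size: "50"
  output_path: "my-dataset"
  tool_dir: "."
```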
################### OUTPUT PLUGINS ###################
## Overview
* **Plugin type**: wendelin
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
## Configuration
- **erp5_url**: description (string, required)
- **user**: description (string, required)
- **password**: description (string, required)
- **tag**: "supplier.dataset.filename.extension.end"
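A minimal `out:` section in the style of the ingestion templates in this repository (all values are placeholders):
```
out:
  type: wendelin
  erp5_url: "https://my.erp5.site/erp5/portal_ingestion_policies/wendelin_embulk"
  user: my-user
  password: my-password
  tag: supplier.dataset.filename.extension.end
```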
## Overview
* **Plugin type**: fif
* **Resume supported**: not for now
* **Cleanup supported**: not for now
* **Guess supported**: no
## Configuration
- **output_path**: description (string, required)
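A minimal `out:` section in the style of the download templates in this repository (the value is a placeholder):
```
out:
  type: fif
  output_path: "my-dataset"
```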
require "bundler/gem_tasks"
task default: :build
Gem::Specification.new do |spec|
spec.name = "embulk-wendelin-dataset-tool"
spec.version = "0.1.0"
spec.authors = ["Roque Porchetto"]
spec.summary = "Input/output plugin for ingestion/download of datasets to/from wendelin"
spec.description = "Loads records from fif files in local storage and sends them to wendelin via http. Loads records from wendelin via http and stores them locally."
spec.email = ["roqueporchetto@gmail.com"]
spec.licenses = ["MIT"]
# TODO set this: spec.homepage = "https://github.com/roqueporchetto/embulk-wendelin-dataset-tool"
spec.files = `git ls-files`.split("\n") + Dir["classpath/*.jar"]
spec.test_files = spec.files.grep(%r{^(test|spec)/})
spec.require_paths = ["lib"]
#spec.add_dependency 'YOUR_GEM_DEPENDENCY', ['~> YOUR_GEM_DEPENDENCY_VERSION']
spec.add_development_dependency 'embulk', ['>= 0.8.15']
spec.add_development_dependency 'bundler', ['>= 1.10.6']
spec.add_development_dependency 'rake', ['>= 10.0']
end
require 'singleton'
# class representing a file logger
class LogManager
include Singleton
INFO = "INFO"
ERROR = "ERROR"
WARN = "WARNING"
def initialize()
now = Time.now.strftime("%d-%m-%Y")
@filename = "#{now.to_s}.log"
end
def setFilename(tool_dir, prefix)
log_dir = "#{tool_dir}/"
if not File.directory?(log_dir)
Dir.mkdir log_dir
end
@path = "#{log_dir}#{prefix}_#{@filename}"
File.open(@path, 'a') { |file| file.puts "------------------------------------------------" + "\r\n" }
end
def getLogPath()
return @path
end
def info(message, print=FALSE)
log(message, print, type=INFO)
end
def warn(message, print=FALSE)
log(message, print, type=WARN)
end
def error(message, print=FALSE)
log(message, print, type=ERROR)
end
def abortExecution()
puts
info("PROCESS ABORTED")
unless @path.nil?
puts "PROCESS ABORTED : For more detailed information, please refer to the log file '#{@path}'"
end
exec("Process.kill 9, Process.pid >/dev/null 2>&1")
end
def logOutOfMemoryError(reference)
error("The data chunk size is too large. Proccess aborted for #{reference}.", print=TRUE)
info("Please, check the help or README.md to customize the chunk size.", print=TRUE)
end
def logArray(messagesArray, print, type)
messagesArray.each { |message| log(message, print, type) }
end
def log(message, print=FALSE, type=INFO)
if message.kind_of?(Array)
logArray(message, print, type)
return
end
if print
puts "[#{type}] #{message}"
end
now = Time.now.strftime("%d-%m-%Y %H:%M:%S").to_s
entry = "#{now} - #{type} : #{message}" + "\r\n"
unless @path.nil?
File.open(@path, 'a') { |file| file.puts entry }
end
end
end
require_relative '../wendelin_client'
require 'fileutils'
module Embulk
module Input
class Wendelininput < InputPlugin
CHUNK_SIZE = 50000000 #50mb
MEGA = 1000000
UPDATE = "U"
RESUME = "R"
DOWNLOAD = "D"
ABORT = "A"
TASK_REPORT_FILE = ".download-task-report"
COMPLETED_FILE = ".download-completed"
RUN_DONE = "done"
RUN_ERROR = "error"
RUN_ABORTED = "aborted"
Plugin.register_input("wendelin", self)
def self.getTaskReportFilename(data_set_directory)
return data_set_directory + TASK_REPORT_FILE
end
def self.getCompletedFilename(data_set_directory)
return data_set_directory + COMPLETED_FILE
end
def self.getPendingDataStreams(data_streams, task_report_file)
downloaded_data_streams = []
File.readlines(task_report_file).each do |line|
record = line.split(";")
if record[1].chomp == RUN_DONE
downloaded_data_streams.push(record[0])
end
end
pending_data_streams = []
data_streams.each do |data_stream|
if ! downloaded_data_streams.include? data_stream[1]
pending_data_streams.push(data_stream)
end
end
return pending_data_streams
end
def self.askUserForAction(task, completed_file, task_report_file, action)
if action == RESUME
action_message = "#{RESUME}: Resume. Continues download from last file."
else
action = UPDATE
action_message = "#{UPDATE}: Update. Checks for new files."
end
valid_option = FALSE
while not valid_option
@logger.info("Please select an option [#{action}, #{DOWNLOAD}, #{ABORT}]", print=TRUE)
@logger.info(action_message, print=TRUE)
@logger.info("#{DOWNLOAD}: Download. Downloads the dataset from scratch.", print=TRUE)
@logger.info("#{ABORT}: Abort operation.", print=TRUE)
option = gets
option = option.chomp
if not [action, DOWNLOAD, ABORT].include? option
@logger.info("Invalid option", print=TRUE)
else
valid_option = TRUE
end
end
case option
when action
task['data_streams'] = getPendingDataStreams(task['data_streams'], task_report_file)
File.delete(completed_file) if File.exist?(completed_file)
if task['data_streams'].empty?
@logger.info("No new files in dataset.", print=TRUE)
@logger.info("Your downloaded dataset is already up to date.", print=TRUE)
end
when DOWNLOAD
ebulk_file = @data_set_directory + "/.ebulk_dataset"
ebulk_file_content = ""
if File.file?(ebulk_file)
ebulk_file_content = File.read(ebulk_file)
end
FileUtils.rm_rf(@data_set_directory)
unless File.directory?(@data_set_directory)
FileUtils.mkdir_p(@data_set_directory)
end
if ebulk_file_content != ""
File.open(ebulk_file, 'w') { |file| file.write(ebulk_file_content) }
end
File.delete(completed_file) if File.exist?(completed_file)
File.open(task_report_file, 'w') {}
when ABORT
@logger.abortExecution()
end
end
def self.transaction(config, &control)
begin
@tool_dir = config.param('tool_dir', :string)
@logger = LogManager.instance()
@logger.setFilename(@tool_dir, "download")
@erp5_url = config.param('erp5_url', :string)
@data_set = config.param('data_set', :string)
@logger.info("Dataset name: #{@data_set}")
if @data_set == '$DATA_SET'
@logger.error("There was an error setting the configuration file", print=TRUE)
@logger.info("Please try manual download or update manually the download configuration file.", print=TRUE)
@logger.abortExecution()
end
@user = config.param("user", :string, default: nil)
@logger.info("User: #{@user}")
@password = config.param("password", :string, default: nil)
@chunk_size = config.param('chunk_size', :float, default: 0) * MEGA
@output_path = config.param("output_path", :string, :default => nil)
if not File.directory?(@output_path)
@logger.error("Output directory not found.", print=TRUE)
@logger.abortExecution()
end
@wendelin = WendelinClient.new(@erp5_url, @user, @password)
task = {
'erp5_url' => @erp5_url,
'data_set' => @data_set,
'user' => @user,
'password' => @password,
'chunk_size' => @chunk_size,
'output_path' => @output_path,
'tool_dir' => @tool_dir
}
if task['chunk_size'] == 0
task['chunk_size'] = CHUNK_SIZE
end
@logger.info("Chunk size set in #{task['chunk_size']/MEGA}MB")
@data_set_directory = @output_path.end_with?("/") ? @output_path : @output_path + "/"
task['data_set_directory'] = @data_set_directory
data_stream_list = @wendelin.getDataStreams(@data_set)
if data_stream_list["status_code"] == 0
if data_stream_list["result"].empty?
@logger.error("No valid data found for data set " + @data_set, print=TRUE)
@logger.info("Please use a valid dataset reference from the list of datasets available in the site.", print=TRUE)
@logger.abortExecution()
end
task['data_streams'] = data_stream_list["result"]
else
@logger.error(data_stream_list["error_message"], print=TRUE)
@logger.abortExecution()
end
task_report_file = getTaskReportFilename(@data_set_directory)
completed_file = getCompletedFilename(@data_set_directory)
if File.file?(task_report_file)
if File.file?(completed_file)
puts
@logger.info("This dataset was already downloaded. What do you want to do?", print=TRUE)
puts
self.askUserForAction(task, completed_file, task_report_file, action=UPDATE)
else
puts
@logger.info("There was a previous attempt to download this dataset but it did not finish successfully.", print=TRUE)
@logger.info("What do you want to do?", print=TRUE)
puts
self.askUserForAction(task, completed_file, task_report_file, action=RESUME)
end
else
dir_entries = Dir.entries(@data_set_directory).length
if File.file?(@data_set_directory+"/.ebulk_dataset")
dir_entries -= 1
end
if dir_entries > 2
puts
@logger.info("Dataset download directory is not empty! It will be overwritten: " + @data_set_directory, print=TRUE)
@logger.info("Continue with download? (y/n)", print=TRUE)
option = gets
option = option.chomp
if option == "n"
@logger.info("Download cancelled by user.", print=TRUE)
@logger.abortExecution()
end
end
ebulk_file = @data_set_directory + "/.ebulk_dataset"
ebulk_file_content = ""
if File.file?(ebulk_file)
ebulk_file_content = File.read(ebulk_file)
end
FileUtils.rm_rf(@data_set_directory)
unless File.directory?(@data_set_directory)
FileUtils.mkdir_p(@data_set_directory)
end
if ebulk_file_content != ""
File.open(ebulk_file, 'w') { |file| file.write(ebulk_file_content) }
end
File.open(task_report_file, 'w') {}
end
columns = [
Column.new(0, "reference", :string),
Column.new(1, "data_chunk", :string),
Column.new(2, "data_set", :string)
]
resume(task, columns, task['data_streams'].length, &control)
rescue Exception => e
@logger.error("An error occurred during operation: " + e.to_s, print=TRUE)
@logger.error(e.backtrace)
puts "[INFO] For more detailed information, please refer to the log file."
@logger.abortExecution()
end
end
def self.resume(task, columns, count, &control)
@logger = LogManager.instance()
task_reports = yield(task, columns, count)
if task_reports.any?
@logger.info("Reports:", print=TRUE)
if task_reports.length > 15
@logger.info(task_reports[0, 5], print=TRUE)
@logger.info(".....", print=TRUE)
@logger.info(task_reports[task_reports.length-5, task_reports.length-1], print=TRUE)
else
@logger.info(task_reports, print=TRUE)
end
@logger.info("Full task report:")
@logger.info(task_reports)
end
next_config_diff = task_reports.map{|hash| hash[RUN_DONE]}.flatten.compact
if(next_config_diff.length == count)
@logger.info("Dataset successfully downloaded.", print=TRUE)
@logger.info("#{count} files downloaded.", print=TRUE)
@logger.info("The files were saved in dataset directory: " + @data_set_directory, print=TRUE)
completed_file = getCompletedFilename(@data_set_directory)
File.open(completed_file, 'w') {}
if count > 10
next_config_diff = {}
end
end
return {RUN_DONE => next_config_diff}
end
def initialize(task, schema, index, page_builder)
super
@data_set = task['data_set']
@chunk_size = task['chunk_size']
@data_set_directory = task['data_set_directory']
@wendelin = WendelinClient.new(task['erp5_url'], task['user'], task['password'])
@logger = LogManager.instance()
end
def run
data_stream = task['data_streams'][@index]
id = data_stream[0]
ref = data_stream[1]
@logger.info("Getting content from remote #{ref}", print=TRUE)
begin
@wendelin.eachDataStreamContentChunk(id, @chunk_size) do |chunk|
if chunk.nil? || chunk.empty?
content = ""
else
content = Base64.encode64(chunk)
end
entry = [ref, content, @data_set]
page_builder.add(entry)
page_builder.finish
end
rescue java.lang.OutOfMemoryError
@logger.logOutOfMemoryError(ref)
return_value = RUN_ABORTED
rescue Exception => e
@logger.error("An error occurred during processing: " + e.to_s, print=TRUE)
@logger.error(e.backtrace)
puts "[INFO] For more detailed information, please refer to the log file: " + @logger.getLogPath()
return_value = RUN_ERROR
else
return_value = RUN_DONE
end
task_report_file = Wendelininput.getTaskReportFilename(@data_set_directory)
File.open(task_report_file, 'ab') { |file| file.puts(ref+";"+return_value+";") }
return {return_value => ref}
end
end
end
end
require 'base64'
require 'fileutils'
require_relative '../filelogger'
module Embulk
module Output
class Fifoutput < OutputPlugin
Plugin.register_output("fif", self)
def self.transaction(config, schema, count, &control)
@logger = LogManager.instance()
task = { "output_path" => config.param("output_path", :string, :default => nil) }
if not File.directory?(task['output_path'])
@logger.error("Output directory not found.", print=TRUE)
@logger.abortExecution()
end
task_reports = yield(task)
next_config_diff = {}
return next_config_diff
end
def init
@output_path = task["output_path"]
@logger = LogManager.instance()
end
def close
end
def add(page)
begin
page.each do |record|
reference = record[0]
data_chunk = Base64.decode64(record[1])
data_set_directory = @output_path.end_with?("/") ? @output_path : @output_path + "/"
ref = reference.reverse.sub("/".reverse, ".".reverse).reverse.sub(record[2]+"/", "")
if ref.end_with?(".none")
ref = ref[0...-5]
end
dirname = File.dirname(data_set_directory + ref)
unless File.directory?(dirname)
FileUtils.mkdir_p(dirname)
end
File.open(data_set_directory + ref, 'ab') { |file| file.write(data_chunk) }
end
rescue Exception => e
@logger.error("An error occurred while writing file.", print=TRUE)
@logger.error(e.backtrace)
raise e
end
end
def finish
end
def abort
end
def commit
task_report = {}
return task_report
end
end
end
end
require 'base64'
require_relative '../wendelin_client'
module Embulk
module Output
class Wendelin < OutputPlugin
Plugin.register_output("wendelin", self)
def self.transaction(config, schema, count, &control)
task = {
"erp5_url" => config.param("erp5_url", :string),
"user" => config.param("user", :string, defualt: nil),
"password" => config.param("password", :string, default: nil),
"path_prefix" => config.param("path_prefix", :string, :default => nil),
}
task_reports = yield(task)
next_config_diff = {}
@logger = LogManager.instance()
@logger.info("Your ingested files will be available in the site in a few minutes. Thank for your patience.", print=TRUE)
return next_config_diff
end
def init
credentials = {}
@erp5_url = task["erp5_url"]
@user = task["user"]
@password = task["password"]
@logger = LogManager.instance()
@wendelin = WendelinClient.new(@erp5_url, @user, @password)
end
def close
end
def add(page)
page.each do |record|
supplier = (record[0].nil? || record[0].empty?) ? "default" : record[0]
dataset = (record[1].nil? || record[1].empty?) ? "default" : record[1]
filename = record[2]
extension = record[3]
eof = record[5]
data_chunk = record[4]
reference = [supplier, dataset, filename, extension, eof].join("/")
begin
if not @wendelin.ingest(reference, data_chunk)
raise "could not ingest"
end
rescue Exception => e
@logger.error(e.backtrace)
raise e
end
end
end
def finish
end
def abort
end
def commit
task_report = {}
return task_report
end
end
end
end
require_relative '../filelogger'
class Index
include Singleton
def initialize()
@index = 0
end
def increase()
@index = @index + 1
end
def get()
return @index
end
end
module Embulk
module Parser
class BinaryParserPlugin < ParserPlugin
Plugin.register_parser("binary", self)
CHUNK_SIZE = 50
MEGA = 1000000
EOF = "EOF"
def self.transaction(config, &control)
tool_dir = config.param('tool_dir', :string, default: ".")
@logger = LogManager.instance()
@logger.setFilename(tool_dir, "parser")
task = {
chunk_size: config.param('chunk_size', :float, default: CHUNK_SIZE) * MEGA,
supplier: config.param("supplier", :string, default: "parser"),
data_set: config.param("data_set", :string),
input_plugin: config.param("storage", :string, default: "parser"),
date: Time.now.strftime("%Y-%m-%d_%H-%M-%S")
}
columns = [
Column.new(0, "supplier", :string),
Column.new(1, "data_set", :string),
Column.new(2, "file", :string),
Column.new(3, "extension", :string),
Column.new(4, "data_chunk", :string),
Column.new(5, "eof", :string)
]
yield(task, columns)
end
def run(file_input)
@index = Index.instance().get()
@logger = LogManager.instance()
while file = file_input.next_file
begin
filename = "file_from_#{task['input_plugin']}_#{task['date']}"
each_chunk(file, filename, task['chunk_size']) do |record|
@page_builder.add(record)
end
@page_builder.finish
Index.instance().increase()
rescue java.lang.OutOfMemoryError
@logger.logOutOfMemoryError(filename)
return
rescue Exception => e
@logger.error("An error occurred during file ingestion: " + e.to_s, print=TRUE)
@logger.error(e.backtrace)
puts "[INFO] For more detailed information, please refer to the log file: " + @logger.getLogPath()
end
end
end
private
def each_chunk(file, filename, chunk_size=CHUNK_SIZE)
extension = @index.to_s.rjust(3, "0")
npart = 0
next_byte = file.read(1)
first = TRUE
while true
data = next_byte
if not next_byte
if first
values = [task['supplier'], task['data_set'], filename, extension, "", EOF]
yield(values)
end
break
end
first = FALSE
data += file.read(chunk_size)
next_byte = file.read(1)
if not next_byte
eof = EOF
else
npart += 1
eof = npart.to_s.rjust(3, "0")
end
content = Base64.encode64(data)
values = [task['supplier'], task['data_set'], filename, extension, content, eof]
yield(values)
end
end
end
end
end
require 'net/http'
require 'openssl'
require 'yaml'
require 'open-uri'
require_relative 'filelogger'
# class representing a Wendelin client
class WendelinClient
def initialize(erp5_url, user, password)
@erp5_url = erp5_url
@user = user
@password = password
@banned_references_list = []
@logger = LogManager.instance()
@last_ingestion = Time.new - 2
end
def removeEOF(reference)
root = reference.dup
return root[0...root.rindex('/')]
end
def exists(reference)
uri = URI("#{@erp5_url}/ingestionReferenceExists?reference=#{reference}")
begin
res = open(uri, http_basic_authentication: [@user, @password]).read
rescue Exception => e
@logger.error("An error occurred while checking if reference exists: " + e.to_s)
@logger.error(e.backtrace)
return FALSE
else
return res.to_s == 'TRUE'
end
end
def ingest(reference, data_chunk)
@logger.info("Ingestion reference: #{reference}", print=TRUE)
if @banned_references_list.include? removeEOF(reference)
return FALSE
end
if Time.new - @last_ingestion < 2
# avoid sending ingestions too close together (especially for split ones)
sleep 3
end
if exists(reference)
@logger.info("There is another ingestion already done for the pair data_set-filename. Reference "\
+ removeEOF(reference), print=TRUE)
@logger.info("Rename your reference or delete the older ingestion.", print=TRUE)
@banned_references_list << removeEOF(reference)
return FALSE
end
if reference.include? "#" or reference.include? "+"
raise "Invalid chars in file name. Please rename it."
end
begin
uri = URI("#{@erp5_url}/ingest?reference=#{reference}")
rescue Exception => e
@logger.error("An error occurred while generating url: " + e.to_s)
@logger.error(e.backtrace)
raise "Invalid chars in file name. Please rename it."
end
response = handleRequest(uri, reference, data_chunk)
if response == FALSE
return FALSE
end
@logger.info("Record successfully ingested.", print=TRUE)
@last_ingestion = Time.new
return TRUE
end
def eachDataStreamContentChunk(id, chunk_size)
uri = URI("#{@erp5_url}#{id}/getData")
@logger.info("Downloading...", print=TRUE)
first = TRUE
res = open(uri, http_basic_authentication: [@user, @password]) {
|content|
while true
chunk = content.read(chunk_size)
if chunk.nil?
if first
yield chunk
end
@logger.info("Done", print=TRUE)
break
end
first = FALSE
yield chunk
end
}
end
def getDataStreams(data_set_reference)
@logger.info("Getting file list for dataset '#{data_set_reference}'", print=TRUE)
uri = URI("#{@erp5_url}getDataStreamList?data_set_reference=#{data_set_reference}")
str = handleRequest(uri)
if str == FALSE
@logger.abortExecution()
end
if not str.nil?
str.gsub!(/(\,)(\S)/, "\\1 \\2")
return YAML::load(str)
end
return {'status_code' => 0, 'result' => []}
end
private
def handleRequest(uri, reference=nil, data_chunk=nil)
req = Net::HTTP::Post.new(uri)
req.basic_auth @user, @password
if data_chunk != nil
@logger.info("Setting request form data...", print=TRUE)
begin
req.set_form_data('data_chunk' => data_chunk)
rescue java.lang.OutOfMemoryError
@logger.logOutOfMemoryError(reference)
@banned_references_list << removeEOF(reference)
return FALSE
end
@logger.info("Sending record:'#{reference}'...", print=TRUE)
end
begin
res = Net::HTTP.start(uri.hostname, uri.port,
:use_ssl => (uri.scheme == 'https'),
:verify_mode => OpenSSL::SSL::VERIFY_NONE,
:ssl_timeout => 32000, :open_timeout => 32000, :read_timeout => 32000,
) do |http|
http.request(req)
end
rescue Exception => e
@logger.error("HTTP ERROR: " + e.to_s, print=TRUE)
@logger.error(e.backtrace)
return FALSE
else
if res.kind_of?(Net::HTTPSuccess) # res.code is 2XX
@logger.info("Done", print=TRUE)
return res.body
else
@logger.error("HTTP FAIL - code: #{res.code}", print=TRUE)
if res.code == '500' or res.code == '502' or res.code == '503'
@logger.error("Internal Server Error: if the error persists, please contact the administrator.", print=TRUE)
elsif res.code == '401'
@logger.error("Unauthorized access. Please check your user credentials and try again.", print=TRUE)
@logger.abortExecution()
else
@logger.error("Sorry, an error ocurred. If the error persists, please contact the administrator.", print=TRUE)
#@logger.error(res.value)
end
return FALSE
end
end
end
end
ebulk ingest-download tool help
usage: ebulk <command> <dataset> [options...]
commands:
pull <dataset> Downloads the content of the target dataset from the site into the output folder
push <dataset> Ingests the content of the input folder into a target dataset on the site
-h, --help Tool help
-r, --readme Opens README file
argument:
dataset Mandatory. Unique reference for the target dataset
It must start with a letter, and only alphanumerics, dots ( . ), underscores ( _ ) and hyphens ( - ) are allowed
* For download, the reference must be one of the available datasets on the site
* For ingestion, an existing reference will append the files to the corresponding dataset
* A new reference will create a new dataset on the site
It can be a path; in that case the last directory will be interpreted as the reference
e.g. pull my_directory/sample/ --> dataset reference will be "sample"
options:
-d, --directory <path>    Besides the dataset reference, sets the dataset directory and links that location to the reference
-c, --chunk <chunk>       Sets the chunk size (in megabytes) used to split large files
-s, --storage <storage>   Uses the selected input storage from this set: [http, ftp, s3]
-cs, --custom-storage     Allows the user to set up a new input storage
-a, --advanced            Allows editing the Embulk configuration file of the input storage
examples:
ebulk pull <DATASET>
* downloads the content of target dataset
ebulk push <DATASET>
* ingests files into the target dataset
ebulk pull <DATASET> -d <PATH>
* downloads the content of target dataset in target PATH
* future operations on PATH directory will use the DATASET reference implicitly
ebulk push <DATASET> -c 20
* ingests files into the <DATASET>, splitting them into chunks of 20MB
ebulk push <DATASET> -s <STORAGE>
* ingests the content of the input storage [http, ftp, s3] into the target dataset
ebulk push <DATASET> -s <STORAGE> --advanced
* allows the user to edit the configuration file of the selected storage
ebulk push <DATASET> --custom-storage
* user can install and configure a new input plugin storage