Commit f57f73de, authored Jun 20, 2019 by Roque
README on front page
Parent: b8f0c2ec
Showing 1 changed file with 120 additions and 0 deletions
README.md (0 → 100755)
# ------ EBULK INGESTION-DOWNLOAD TOOL ------
# CONTENT:
- Bash script for ingestion and download
- Embulk plugins
- Configuration files (yml)
# REQUIREMENTS
This tool relies on the **Embulk** Java application (see [docs](http://www.embulk.org/)).
Please make sure that [Java 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) is installed.
After installing the package, on first use the bash script will try to install Embulk automatically if it is not already installed.
If your OS requires special permissions, it may be necessary to install Embulk v0.9.7 manually:

    curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.bintray.com/embulk/maven/embulk-0.9.7.jar"
    chmod +x ~/.embulk/bin/embulk
    echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
    source ~/.bashrc

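Before the first run, it can help to confirm that both prerequisites are reachable from the shell. The following is a minimal sketch, not part of ebulk: `has_command` is a hypothetical helper name, and the check only tests that `java` and `embulk` are on the `PATH`, not which versions they are.

```shell
# Hypothetical helper (not part of ebulk): succeed if a command is on PATH.
has_command() {
  command -v "$1" >/dev/null 2>&1
}

# Report the status of each prerequisite named in this README.
for tool in java embulk; do
  if has_command "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```

If `embulk` is reported missing after a manual install, the `PATH` change in `~/.bashrc` has probably not been loaded in the current shell yet.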
# ------ DOWNLOAD ------
# QUICK START
To start the download, run the following command:
```
ebulk pull <DATA_SET>
```
where `<DATA_SET>` is the dataset reference shown on the site (e.g. **ebulk pull my-dataset**).
This will automatically install Embulk and ask for user credentials.
After authentication, it will start the download and create an output directory, named after the dataset reference, containing the downloaded files.
`<DATA_SET>` can also be a path, in which case the last directory is interpreted as the dataset reference
(e.g. for **ebulk pull my_directory/sample/** the dataset reference will be "sample").
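The path rule above can be illustrated with the standard `basename` utility. This is only a sketch under the assumption that ebulk's interpretation matches `basename`; the tool's actual implementation is not shown in this README, and `dataset_reference` is a hypothetical helper name.

```shell
# Hypothetical helper: take the last path component as the dataset reference.
# Assumption: ebulk's rule behaves like POSIX basename (trailing slash ignored).
dataset_reference() {
  basename "$1"
}

dataset_reference "my_directory/sample/"   # prints: sample
dataset_reference "my-dataset"             # prints: my-dataset
```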
# CUSTOMIZE CHUNK SIZE
If you need to specify the chunk size used to split the download (e.g. because of memory errors with big files),
run the command with these parameters:
```
ebulk pull <DATA_SET> -c <CHUNK_SIZE>
```
where `<CHUNK_SIZE>` is an integer that sets the size in MB (e.g. **ebulk pull my-dataset -c 10**).
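To get a feel for a chunk size, a rough estimate of how many pieces a file would be split into is a ceiling division. This is an illustrative sketch only: `chunk_count` is a hypothetical helper, and how ebulk actually splits files internally is not described in this README.

```shell
# Hypothetical helper: number of chunks for a file of SIZE_MB split into
# pieces of at most CHUNK_MB (integer ceiling division).
chunk_count() {
  size_mb=$1
  chunk_mb=$2
  echo $(( (size_mb + chunk_mb - 1) / chunk_mb ))
}

chunk_count 95 10    # prints: 10
chunk_count 101 10   # prints: 11
```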
# CUSTOMIZE OUTPUT DIRECTORY
A custom output directory, different from the dataset reference, can be used. That location will be linked to the dataset reference.
```
ebulk pull <DATA_SET> -d <PATH>
```
where `<PATH>` is the output location of the downloaded files (e.g. **ebulk pull my-dataset -d some/different/path**).
The content of `<DATA_SET>` will be downloaded to `<PATH>`, and that location will be linked to the reference `<DATA_SET>`.
This means that even if the directory is moved or renamed, operations on it will still refer to the dataset reference
(e.g. **ebulk pull moved/or/renamed/path** will try to download the dataset 'my-dataset').
# ------ INGESTION ------
# QUICK START
To start the ingestion, run the following command:
```
ebulk push <DATA_SET>
```
where `<DATA_SET>` is both the dataset reference for your dataset and the input directory where the files are (e.g. **ebulk push my-dataset**).
This will automatically install Embulk and it will ask for user credentials.
After authentication, it will start the ingestion.
# CUSTOMIZE CHUNK SIZE AND INPUT DIRECTORY
The chunk size used to split the ingestion and the input directory can be customized in the same way as in the download operation
(e.g. **ebulk push my-dataset -c 10** or **ebulk push my-dataset -d some/different/path**).
# USE A DIFFERENT INPUT STORAGE
The ebulk tool comes with some preinstalled input storages that can be used to ingest from locations other than the file system. These are:
- File transfer protocol: ftp
- HTTP request: http
- Amazon Web Services S3: s3
To use one of those storages as input, run the following command:
```
ebulk push <DATA_SET> --storage <STORAGE>
```
where `<STORAGE>` is one of the following available inputs: ftp, http, s3 (e.g. **ebulk push my-dataset --storage http**).
Each storage will ask the user for the inputs it needs, such as credentials or URLs.
# ADVANCED STORAGE
The user can edit the Embulk configuration file of the selected storage to run more complex scenarios.
* Please keep in mind that some knowledge of Embulk is required.
```
ebulk push <DATA_SET> --storage <STORAGE> --advanced
```
# CUSTOM
The user can request the installation of a new input storage by running the following command:
```
ebulk push <DATA_SET> --custom-storage
```
The tool will ask the user for the desired Embulk input plugin (gem) in order to install it.
The input gem can be picked from here: http://www.embulk.org/plugins/