nexedi / ebulk · Commits

Commit c6e035d3, authored Sep 19, 2018 by roqueporchetto@gmail.com
ebulk: features to handle modifications and deletions
parent 2912f542

Showing 10 changed files with 623 additions and 336 deletions (+623 −336).
.gitignore  +3 −0
ebulk  +3 −3
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/dataset_utils.rb  +267 −0
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/filelogger.rb  +6 −4
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/input/fif.rb  +114 −144
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/input/wendelin.rb  +180 −157
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/output/fif.rb  +15 −5
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/output/wendelin.rb  +11 −3
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/parser/binary.rb  +11 −4
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/wendelin_client.rb  +13 −16
.gitignore (new file, 0 → 100644)

*~
ebulk-data/config/*config.yml
ebulk (mode changed 100755 → 100644)

...
@@ -221,10 +221,10 @@ function checkCurl {
 function checkSoftware {
   # CHECK JAVA VERSION
-  if type -p java > /dev/null; then
-    _java=java
-  elif [[ -n "$JAVA_HOME" ]] && [[ -x "$JAVA_HOME/bin/java" ]]; then
+  if [[ -n "$JAVA_HOME" ]] && [[ -x "$JAVA_HOME/bin/java" ]]; then
     _java="$JAVA_HOME/bin/java"
+  elif type -p java > /dev/null; then
+    _java=java
   else
     javaNotInstalled >&2; return 1
   fi
...
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/dataset_utils.rb (new file, 0 → 100644)

require_relative 'filelogger'
require 'digest/md5'

# class that handles dataset tasks report
class DatasetUtils

  DATASET_REPORT_FILE = ".dataset-task-report"
  DATASET_COMPLETED_FILE = ".dataset-completed"
  RESUME_OPERATION_FILE = ".resume-operation"

  RUN_DONE = "done"
  RUN_ERROR = "error"
  RUN_ABORTED = "aborted"
  DELETE = "DELETE"
  INGESTION = "ingestion"

  MEGA = 1000000

  def initialize(data_set_directory)
    @data_set_directory = data_set_directory
    @logger = LogManager.instance()
    @task_report_file = @data_set_directory + DATASET_REPORT_FILE
    @completed_file = @data_set_directory + DATASET_COMPLETED_FILE
    @resume_operation_file = @data_set_directory + RESUME_OPERATION_FILE
  end

  def getLocalFiles(remove=nil)
    local_files = {}
    begin
      File.readlines(@task_report_file).each do |line|
        record = line.split(";")
        if record[1].chomp == RUN_DONE
          if (remove.nil?) || (remove != record[0])
            local_files[record[0]] = {"size" => record[2].chomp, "hash" => record[3].chomp, "status" => record[1].chomp, "modification_date" => record[4].chomp}
          end
        end
      end
    rescue Exception => e
      @logger.error("An error occurred in DatasetUtils method 'getLocalFiles':" + e.to_s)
      @logger.error(e.backtrace)
    end
    return local_files
  end

  def saveReport(local_files)
    begin
      File.delete(@task_report_file) if File.exist?(@task_report_file)
      if local_files.empty?
        File.open(@task_report_file, 'w') {}
      else
        local_files.each do |key, array|
          File.open(@task_report_file, 'ab') { |file| file.puts(key + ";" + array["status"] + ";" + array["size"].to_s + ";" + array["hash"] + ";" + array["modification_date"]) }
        end
      end
    rescue Exception => e
      @logger.error("An error occurred in DatasetUtils method 'saveReport':" + e.to_s)
      @logger.error(e.backtrace)
    end
  end

  def removeCurrentOperation()
    if File.exist?(@resume_operation_file)
      File.delete(@resume_operation_file)
    end
  end

  def saveCurrentOperation(operation, reference)
    if File.exist?(@resume_operation_file)
      File.delete(@resume_operation_file)
    end
    File.open(@resume_operation_file, 'w') { |file| file.puts(operation + ";" + reference) }
  end

  def reportUpToDate(data_stream_dict)
    begin
      if not reportFileExist() and not completedFileExist()
        # directory never downloaded -new or used for partial ingestions-
        return TRUE
      end
      if reportFileExist() and not completedFileExist()
        # download not finished
        return FALSE
      end
      if data_stream_dict["status_code"] == 2
        return FALSE
      end
      if data_stream_dict["status_code"] != 0
        return TRUE
      end
      changes = getRemoteChangedDataStreams(data_stream_dict["result"])
      if changes.empty?
        return TRUE
      elsif changes.length == 1
        # check if the unique detected change corresponds to an interrumped ingestion
        if File.exist?(@resume_operation_file)
          operation = File.open(@resume_operation_file).read.chomp.split(";")
          if operation[0] == INGESTION
            if operation[1] == changes[0]["reference"]
              File.delete(@resume_operation_file)
              return TRUE
            end
          end
        end
      end
      return FALSE
    rescue Exception => e
      @logger.error("An error occurred in DatasetUtils method 'reportUpToDate':" + e.to_s)
      @logger.error(e.backtrace)
      return FALSE
    end
  end

  def deleteCompletedFile()
    File.delete(@completed_file) if File.exist?(@completed_file)
  end

  def createCompletedFile()
    File.open(@completed_file, 'w') {}
  end

  def completedFileExist()
    return File.exist?(@completed_file)
  end

  def createReportFile()
    File.open(@task_report_file, 'w') {}
  end

  def reportFileExist()
    return File.exist?(@task_report_file)
  end

  def addToReport(reference, status, size, hash, data_set)
    local_files = {}
    begin
      data_set = data_set.end_with?("/") ? data_set : data_set + "/"
      file_path = @data_set_directory + reference.reverse.sub("/".reverse, ".".reverse).reverse.sub(data_set, "")
      file_path = file_path[0...-5] if file_path.end_with?(".none")
      modification_date = File.exist?(file_path) ? File.mtime(file_path).strftime("%Y-%m-%d-%H-%M-%S") : "not-modification-date"
      if not reportFileExist()
        File.open(@task_report_file, 'w') {}
      end
      new_file = TRUE
      File.readlines(@task_report_file).each do |line|
        record = line.split(";")
        if reference.to_s == record[0].to_s
          local_files[reference] = {"size" => size, "hash" => hash, "status" => status, "modification_date" => modification_date}
          new_file = FALSE
        else
          local_files[record[0]] = {"size" => record[2].chomp, "hash" => record[3].chomp, "status" => record[1].chomp, "modification_date" => record[4].chomp}
        end
      end
      if new_file
        local_files[reference] = {"size" => size, "hash" => hash, "status" => status, "modification_date" => modification_date}
      end
    rescue Exception => e
      @logger.error("An error occurred in DatasetUtils method 'addToReport':" + e.to_s)
      @logger.error(e.backtrace)
    end
    saveReport(local_files)
  end

  def deleteFromReport(reference, status)
    local_files = getLocalFiles(remove=reference)
    saveReport(local_files)
  end

  def getHash(file)
    begin
      chunk_size = 4 * MEGA
      md5 = Digest::MD5.new
      open(file) do |f|
        while chunk = f.read(chunk_size)
          md5.update(chunk)
        end
      end
      return md5.hexdigest
    rescue Exception => e
      @logger.error("An error occurred while getting hash of file " + file + ":" + e.to_s, print=TRUE)
      @logger.error(e.backtrace)
      raise e
    end
  end

  def getLocalChanges(files, data_set)
    new_files = []
    begin
      if reportFileExist()
        File.readlines(@task_report_file).each do |line|
          record = line.split(";")
          if record[1].chomp == RUN_DONE
            data_set = data_set.end_with?("/") ? data_set : data_set + "/"
            file_path = @data_set_directory + record[0].reverse.sub("/".reverse, ".".reverse).reverse.sub(data_set, "")
            file_path = file_path[0...-5] if file_path.end_with?(".none")
            if files.include? file_path
              modification_date = File.mtime(file_path).strftime("%Y-%m-%d-%H-%M-%S")
              if modification_date != record[4].chomp
                size = File.size(file_path).to_s
                hash = getHash(file_path).to_s
                if size == record[2].to_s
                  if hash != record[3].chomp
                    new_files.push({"path" => file_path, "size" => size, "hash" => hash})
                  end
                else
                  new_files.push({"path" => file_path, "size" => size, "hash" => hash})
                end
              end
              files.delete(file_path)
            else
              new_files.push({"path" => file_path, "size" => "", "hash" => DELETE})
            end
          end
        end
      end
      files.each do |path|
        new_files.push({"path" => path, "size" => "", "hash" => ""})
      end
    rescue Exception => e
      @logger.error("An error occurred in DatasetUtils method 'getLocalChanges':" + e.to_s)
      @logger.error(e.backtrace)
    end
    return new_files
  end

  def getRemoteChangedDataStreams(data_streams)
    pending_data_streams = []
    begin
      if reportFileExist()
        local_files = {}
        remote_files = []
        File.readlines(@task_report_file).each do |line|
          record = line.split(";")
          if record[1].chomp == RUN_DONE
            local_files[record[0]] = {"size" => record[2].chomp, "hash" => record[3].chomp}
          end
        end
        data_streams.each do |data_stream|
          remote_files.push(data_stream["reference"])
          pending = TRUE
          reference = data_stream["reference"]
          if local_files.has_key? reference
            size = local_files[reference]["size"]
            if size.to_s == data_stream["size"].to_s
              hash = local_files[reference]["hash"]
              if hash == data_stream["hash"] or data_stream["hash"] == ""
                pending = FALSE
              end
            end
          end
          if pending
            local_files.delete(reference)
            pending_data_streams.push(data_stream)
          end
        end
        local_files.each do |key, array|
          if not remote_files.include? key
            pending_data_streams.push({"reference" => key, "hash" => DELETE})
          end
        end
      end
    rescue Exception => e
      @logger.error("An error occurred in DatasetUtils method 'getRemoteChangedDataStreams':" + e.to_s)
      @logger.error(e.backtrace)
    end
    return pending_data_streams
  end

end
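A minimal usage sketch of the report handling above, assuming it runs next to dataset_utils.rb and its filelogger dependency; the directory layout, references, sizes and hashes are invented, while the file name and the semicolon-separated field order (reference;status;size;hash;modification_date) come from the class itself:

require_relative 'dataset_utils'
require 'tmpdir'

# Fake dataset directory with a finished task report for two files.
dir = Dir.mktmpdir + "/"
File.open(dir + ".dataset-task-report", 'w') do |f|
  f.puts("my_dataset/file_a/csv/end;done;12;abc123;2018-09-19-10-00-00")
  f.puts("my_dataset/file_b/csv/end;done;34;def456;2018-09-19-10-00-00")
end

utils = DatasetUtils.new(dir)

# Pretend the server still knows file_a (same size and hash) but not file_b.
remote = [{"reference" => "my_dataset/file_a/csv/end", "size" => "12", "hash" => "abc123"}]

# file_a is up to date; file_b comes back flagged with DatasetUtils::DELETE,
# which is how deletions are propagated to the local copy.
utils.getRemoteChangedDataStreams(remote).each do |ds|
  puts "#{ds["reference"]} -> #{ds["hash"]}"
end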
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/filelogger.rb

...
@@ -38,11 +38,13 @@ class LogManager
     log(message, print, type=ERROR)
   end
-  def abortExecution()
-    puts
+  def abortExecution(error=TRUE)
     info("PROCESS ABORTED")
-    unless @path.nil?
-      puts "PROCESS ABORTED : For more detailed information, please refer to the log file '#{@path}'"
+    if error
+      puts
+      unless @path.nil?
+        puts "PROCESS ABORTED : For more detailed information, please refer to the log file '#{@path}'"
+      end
     end
     exec("Process.kill 9, Process.pid >/dev/null 2>&1")
   end
...
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/input/fif.rb

(diff collapsed in the original view, not shown)
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/input/wendelin.rb

(diff collapsed in the original view, not shown)
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/output/fif.rb

require 'base64'
require 'fileutils'
require_relative '../dataset_utils'
require_relative '../filelogger'

module Embulk
...
@@ -39,14 +40,23 @@ module Embulk
         if ref.end_with?(".none")
           ref = ref[0...-5]
         end
-        dirname = File.dirname(data_set_directory + ref)
-        unless File.directory?(dirname)
-          FileUtils.mkdir_p(dirname)
+        file_path = data_set_directory + ref
+        write_mode = 'ab'
+        if record[3] == DatasetUtils::DELETE
+          File.delete(file_path) if File.exist?(file_path)
+        else
+          if record[3] == TRUE.to_s
+            write_mode = 'w'
+          end
+          dirname = File.dirname(data_set_directory + ref)
+          unless File.directory?(dirname)
+            FileUtils.mkdir_p(dirname)
+          end
+          File.open(file_path, write_mode) { |file| file.write(data_chunk) }
         end
-        File.open(data_set_directory + ref, 'ab') { |file| file.write(data_chunk) }
       end
     rescue Exception => e
-      @logger.error("An error occurred while writing file.", print=TRUE)
+      @logger.error("An error occurred while procesing file.", print=TRUE)
       @logger.error(e.backtrace)
       raise e
     end
...
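A condensed sketch of the per-record branching introduced above, pulled out into a standalone helper; the helper name and sample paths are illustrative, while the three cases (DatasetUtils::DELETE removes the local file, TRUE.to_s switches to truncating write mode, anything else appends) mirror the hunk:

require 'fileutils'

DELETE_MARKER = "DELETE"  # stand-in for DatasetUtils::DELETE

def apply_chunk(file_path, marker, data_chunk)
  if marker == DELETE_MARKER
    # deletion propagated from the server: drop the local copy
    File.delete(file_path) if File.exist?(file_path)
  else
    # 'w' rewrites the file from scratch, 'ab' appends the next chunk of a split download
    write_mode = (marker == true.to_s) ? 'w' : 'ab'
    dirname = File.dirname(file_path)
    FileUtils.mkdir_p(dirname) unless File.directory?(dirname)
    File.open(file_path, write_mode) { |file| file.write(data_chunk) }
  end
end

apply_chunk("/tmp/sample-dataset/data.csv", true.to_s, "id,value\n")  # first chunk rewrites
apply_chunk("/tmp/sample-dataset/data.csv", "", "1,42\n")             # later chunks append
apply_chunk("/tmp/sample-dataset/data.csv", DELETE_MARKER, nil)       # remote deletion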
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/output/wendelin.rb

require 'base64'
require_relative '../wendelin_client'
require_relative '../dataset_utils'

module Embulk
  module Output
...
@@ -41,10 +42,17 @@ module Embulk
         extension = record[3]
         eof = record[5]
         data_chunk = record[4]
-        reference = [supplier, dataset, filename, extension, eof].join("/")
+        size = record[6]
+        hash = record[7]
         begin
-          if not @wendelin.ingest(reference, data_chunk)
-            raise "could not ingest"
+          if eof == DatasetUtils::DELETE
+            reference = [dataset, filename, extension].join("/")
+            @wendelin.delete(reference)
+          else
+            reference = [supplier, dataset, filename, extension, eof, size, hash].join("/")
+            if not @wendelin.ingest(reference, data_chunk)
+              raise "could not ingest"
+            end
           end
         rescue Exception => e
           raise e
...
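A small sketch of how the outgoing reference string changes with this commit; the field values are invented, only the join orders are taken from the diff above:

supplier, dataset, filename, extension = "my-supplier", "my-dataset", "sample", "csv"
eof, size, hash = "EOF", "12345", "0123456789abcdef"

# before: five fields
old_reference = [supplier, dataset, filename, extension, eof].join("/")
# => "my-supplier/my-dataset/sample/csv/EOF"

# after: size and hash travel inside the reference as well
new_reference = [supplier, dataset, filename, extension, eof, size, hash].join("/")
# => "my-supplier/my-dataset/sample/csv/EOF/12345/0123456789abcdef"

# a DELETE marker in eof skips ingestion and sends a shorter reference to @wendelin.delete
delete_reference = [dataset, filename, extension].join("/")
# => "my-dataset/sample/csv"

puts old_reference, new_reference, delete_reference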
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/parser/binary.rb

...
@@ -40,7 +40,9 @@ module Embulk
           Column.new(2, "file", :string),
           Column.new(3, "extension", :string),
           Column.new(4, "data_chunk", :string),
-          Column.new(5, "eof", :string)
+          Column.new(5, "eof", :string),
+          Column.new(6, "size", :string),
+          Column.new(7, "hash", :string)
         ]
         yield(task, columns)
...
@@ -78,22 +80,27 @@ module Embulk
           data = next_byte
           if not next_byte
             if first
-              values = [task['supplier'], task['data_set'], filename, extension, "", EOF]
+              # this means this is an empty file
+              values = [task['supplier'], task['data_set'], filename, extension, "", "", "", ""]
               yield(values)
             end
             break
           end
           first = FALSE
           data += file.read(chunk_size)
           next_byte = file.read(1)
           if not next_byte
             eof = EOF
             if first
               # this means that the whole file will be ingested at once (not split)
               eof = ""
             end
           else
             npart += 1
             eof = npart.to_s.rjust(3, "0")
           end
           content = Base64.encode64(data)
-          values = [task['supplier'], task['data_set'], filename, extension, content, eof]
+          values = [task['supplier'], task['data_set'], filename, extension, content, eof, "", ""]
           first = FALSE
           yield(values)
         end
...
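A minimal sketch of the chunk numbering scheme the parser uses above, assuming an in-memory file and a tiny chunk size so the split is visible; the EOF constant's actual value is not shown in the diff, so a placeholder is used:

require 'base64'
require 'stringio'

EOF_MARKER = "EOF"  # placeholder for the plugin's EOF constant
chunk_size = 8      # deliberately tiny so the sample file splits

file = StringIO.new("0123456789abcdefXYZ")
npart = 0
first = true

loop do
  data = file.read(chunk_size)
  break if data.nil?
  next_byte = file.read(1)
  if next_byte.nil?
    # last chunk: "" means the whole file fit in one chunk, EOF closes a split file
    eof = first ? "" : EOF_MARKER
  else
    npart += 1
    eof = npart.to_s.rjust(3, "0")  # intermediate chunks are numbered 001, 002, ...
    file.seek(-1, IO::SEEK_CUR)     # put the peeked byte back
  end
  puts "part #{eof.inspect}: #{Base64.encode64(data).chomp}"
  first = false
end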
ebulk-data/embulk-wendelin-dataset-tool/lib/embulk/wendelin_client.rb

...
@@ -16,11 +16,6 @@ class WendelinClient
     @last_ingestion = Time.new - 2
   end
-  def removeEOF(reference)
-    root = reference.dup
-    return root[0...root.rindex('/')]
-  end
   def exists(reference)
     uri = URI("#{@erp5_url}/ingestionReferenceExists?reference=#{reference}")
     begin
...
@@ -34,20 +29,25 @@ class WendelinClient
     end
   end
+  def delete(reference)
+    @logger.info("Deletion requested for reference #{reference}", print=TRUE)
+    uri = URI("#{@erp5_url}/ERP5Site_invalidateIngestionObjects?reference=#{reference}")
+    res = handleRequest(uri)
+    if res == FALSE
+      @logger.abortExecution()
+    end
+  end
   def ingest(reference, data_chunk)
     @logger.info("Ingestion reference: #{reference}", print=TRUE)
     if @banned_references_list.include? removeEOF(reference)
       return FALSE
     end
     if Time.new - @last_ingestion < 2
       # avoid send ingestions to close (specially for split ones)
-      sleep 3
+      sleep 2
     end
     if exists(reference)
-      @logger.info("There is another ingestion already done for the pair data_set-filename. Reference " \
-        + removeEOF(reference), print=TRUE)
+      @logger.info("There is another ingestion already done for the pair data_set-filename. Reference " \
+        + reference, print=TRUE)
       @logger.info("Rename your reference or delete the older ingestion.", print=TRUE)
       @banned_references_list << removeEOF(reference)
       return FALSE
     end
     if reference.include? "#" or reference.include? "+"
...
@@ -91,7 +91,6 @@ class WendelinClient
   end
   def getDataStreams(data_set_reference)
     @logger.info("Getting file list for dataset '#{data_set_reference}'", print=TRUE)
     uri = URI("#{@erp5_url}getDataStreamList?data_set_reference=#{data_set_reference}")
     str = handleRequest(uri)
     if str == FALSE
...
@@ -115,7 +114,6 @@ class WendelinClient
       req.set_form_data('data_chunk' => data_chunk)
     rescue java.lang.OutOfMemoryError
       @logger.logOutOfMemoryError(reference)
       @banned_references_list << removeEOF(reference)
       return FALSE
     end
     @logger.info("Sending record:'#{reference}'...", print=TRUE)
...
@@ -125,7 +123,7 @@ class WendelinClient
     res = Net::HTTP.start(uri.hostname, uri.port,
       :use_ssl => (uri.scheme == 'https'),
       :verify_mode => OpenSSL::SSL::VERIFY_NONE,
-      :ssl_timeout => 32000, :open_timeout => 32000, :read_timeout => 32000,
+      :ssl_timeout => 20, :open_timeout => 20, :read_timeout => 20,
     ) do |http|
       http.request(req)
     end
...
@@ -135,7 +133,7 @@ class WendelinClient
       return FALSE
     else
       if res.kind_of?(Net::HTTPSuccess)  # res.code is 2XX
-        @logger.info("Done", print=TRUE)
+        @logger.info("Done")
         return res.body
       else
         @logger.error("HTTP FAIL - code: #{res.code}", print=TRUE)
...
@@ -146,7 +144,6 @@ class WendelinClient
       @logger.abortExecution()
     else
       @logger.error("Sorry, an error ocurred. If the error persists, please contact the administrator.", print=TRUE)
       #@logger.error(res.value)
     end
     return FALSE
   end
...
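A sketch of the two endpoints involved in the new deletion path, reduced to URI construction; the base URL and reference are placeholders, and handleRequest and the rest of the client are not reproduced here:

require 'uri'

erp5_url  = "https://erp5.example.invalid"  # placeholder for the configured ERP5/Wendelin URL
reference = "my-dataset/sample/csv"         # placeholder reference (dataset/filename/extension)

# WendelinClient#exists is asked before every ingestion:
exists_uri = URI("#{erp5_url}/ingestionReferenceExists?reference=#{reference}")

# WendelinClient#delete, added in this commit, invalidates the ingestion objects behind
# a reference and aborts the run if the request fails:
delete_uri = URI("#{erp5_url}/ERP5Site_invalidateIngestionObjects?reference=#{reference}")

puts exists_uri
puts delete_uri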