Commit 3714fc8f authored by Stan Hu's avatar Stan Hu

Fix large S3 uploads failing to finalize

When large files are uploaded to object storage by Workhorse,
CarrierWave is responsible for copying these files from their temporary
location to a final location. However, if the file is above 5 GB, the
upload will fail outright because AWS requires multipart uploads to be
used to copy files above that limit.

Even if multipart uploads were used, files containing several gigabytes
of data would usually fail to complete within the 60-second Web request
timeout. In one test, a 6 GB file took several minutes to copy with
fog-aws, while it only took 36 seconds with the aws CLI.  The main
difference: multithreading.

fog-aws now supports multipart, multithreaded uploads per these pull
requests:

* https://github.com/fog/fog-aws/pull/578
* https://github.com/fog/fog-aws/pull/579

For this to work, we also need to patch CarrierWave to use the
`File#copy` method instead of the Fog connection `copy_object`
method. We use a concurrency of 10 threads because this is what the AWS
SDK uses, and it appears to give good performance for large uploads.

`s3_multithreaded_uploads` feature flag. We enable it by default because
GitLab.com uses Google Compute Storage.

Relates to https://gitlab.com/gitlab-org/gitlab/-/issues/216442
parent ff86a2e8
---
title: Fix large S3 uploads failing to finalize
merge_request: 50922
author:
type: fixed
---
name: s3_multithreaded_uploads
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/50922
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/296772
milestone: '13.8'
type: development
group: group::continuous integration
default_enabled: true
......@@ -17,7 +17,22 @@ module CarrierWave
class Fog < Abstract
class File
def copy_to(new_path)
# fog-aws needs multipart uploads to copy files above 5 GB,
# and it is currently the only Fog provider that supports
# multithreaded uploads (https://github.com/fog/fog-aws/pull/579).
# Multithreaded uploads are essential for copying large amounts of data
# within the request timeout.
if ::Feature.enabled?(:s3_multithreaded_uploads, default_enabled: true) && fog_provider == 'AWS'
file.concurrency = 10 # AWS SDK uses 10 threads by default
file.copy(@uploader.fog_directory, new_path, copy_to_options)
else
# Some Fog providers may issue a GET request (https://github.com/fog/fog-google/issues/512)
# instead of a HEAD request after the transfer completes,
# which might cause the file to be downloaded locally.
# We fallback to the original copy_object for non-AWS providers.
connection.copy_object(@uploader.fog_directory, file.key, @uploader.fog_directory, new_path, copy_to_options)
end
CarrierWave::Storage::Fog::File.new(@uploader, @base, new_path)
end
......
......@@ -6,28 +6,76 @@ RSpec.describe 'CarrierWave::Storage::Fog::File' do
let(:uploader_class) { Class.new(CarrierWave::Uploader::Base) }
let(:uploader) { uploader_class.new }
let(:storage) { CarrierWave::Storage::Fog.new(uploader) }
let(:azure_options) do
let(:bucket_name) { 'some-bucket' }
let(:connection) { ::Fog::Storage.new(connection_options) }
let(:bucket) { connection.directories.new(key: bucket_name )}
let(:test_filename) { 'test' }
let(:test_data) { File.read(Rails.root.join('spec/support/gitlab_stubs/gitlab_ci.yml')) }
subject { CarrierWave::Storage::Fog::File.new(uploader, storage, test_filename) }
before do
require 'fog/azurerm'
require 'fog/aws'
stub_object_storage(connection_params: connection_options, remote_directory: bucket_name)
allow(uploader).to receive(:fog_directory).and_return(bucket_name)
allow(uploader).to receive(:fog_credentials).and_return(connection_options)
bucket.files.create(key: test_filename, body: test_data) # rubocop:disable Rails/SaveBang
end
context 'AWS' do
let(:connection_options) do
{
provider: 'AWS',
aws_access_key_id: 'AWS_ACCESS_KEY',
aws_secret_access_key: 'AWS_SECRET_KEY'
}
end
describe '#copy_to' do
let(:dest_filename) { 'copied.txt'}
it 'copies the file' do
result = subject.copy_to(dest_filename)
expect(result.exists?).to be true
expect(result.read).to eq(test_data)
# Sanity check that the file actually is there
copied = bucket.files.get(dest_filename)
expect(copied).to be_present
expect(copied.body).to eq(test_data)
end
end
end
context 'Azure' do
let(:connection_options) do
{
provider: 'AzureRM',
azure_storage_account_name: 'AZURE_ACCOUNT_NAME',
azure_storage_access_key: 'AZURE_ACCESS_KEY',
provider: 'AzureRM'
azure_storage_access_key: 'AZURE_ACCESS_KEY'
}
end
subject { CarrierWave::Storage::Fog::File.new(uploader, storage, 'test') }
describe '#copy_to' do
let(:dest_filename) { 'copied.txt'}
before do
require 'fog/azurerm'
allow(uploader).to receive(:fog_credentials).and_return(azure_options)
Fog.mock!
it 'copies the file' do
result = subject.copy_to(dest_filename)
# Fog Azure provider doesn't mock the actual copied data
expect(result.exists?).to be true
end
end
describe '#authenticated_url' do
context 'with Azure' do
it 'has an authenticated URL' do
expect(subject.authenticated_url).to eq("https://sa.blob.core.windows.net/test_container/test_blob?token")
end
end
context 'with custom expire_at' do
it 'properly sets expires param' do
......@@ -41,4 +89,5 @@ RSpec.describe 'CarrierWave::Storage::Fog::File' do
end
end
end
end
end
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment