-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS::SQS::Errors::SignatureDoesNotMatch errors on Rubinius #455
Comments
To clarify, we've been running the same code base under MRI for a good 2 months now without any of these errors popping up. Last Friday I switched things over to Rubinius and they've since started to occur. It could be a bug in Rubinius but the stack trace doesn't show any clear signs of this sadly. |
And it seems disabling checksum validation doesn't do anything, the error just re-occurred. |
Extra info: the error seems to occur when a thread encounters an error, kicks off the error handling process and is then re-started (using a |
Managed to reproduce it using the following script on Rubinius: require 'aws'
require 'thread'
AWS.config(
:access_key_id => '...SNIP...',
:secret_access_key => '...SNIP...',
:sqs_region => 'eu-west-1',
:sqs_endpoint => 'sqs.eu-west-1.amazonaws.com'
)
AWS.eager_autoload!(AWS::SQS)
AWS.eager_autoload!(AWS::Core)
def do_work
sqs = AWS::SQS.new
queue = sqs.queues.named('test')
queue.poll do |message|
sleep 2 # Fake some work
raise 'Fake error to showcase restarting of threads'
end
end
threads = []
mutex = Mutex.new
10.times do
threads << Thread.new do
begin
do_work
rescue => error
# Don't mess up console output
mutex.synchronize { puts "Error: #{error.message}" }
sleep 2
retry
end
end
end
threads.each(&:join) |
It's worth noting that the behaviour is rather random. In some runs I can't reproduce it, some other runs it happens basically right away. |
Its curious that it is generating a signature error. Thank you for all of the details. I'll see what I can do about reproducing on my end. |
Any news on this? I'll be performing some extra tests today on the latest version of Rbx but I believe the issue still persists. Not yet sure yet if it's an issue with the AWS SDK or Rbx though. |
Extra note: this issue still persists on Rbx 2.2.6. Not sure yet if it's an Rbx or aws-sdk issue. |
Any news on this? I'd love to somehow fix this in whatever way we can in Rbx. However, I'm not familiar with the inner workings of the aws-sdk and have never seen anything like this outside of it. |
Sorry for the slow response. I spent quite a bit of time reproducing and trying to resolve the issue from my end. Here are some of my findings:
The authorization header is contains a signature that is computed roughly like so: OpenSSL::HMAC.hexdigest( OpenSSL::Digest.new('sha256'), derived_key, string_to_sign) I have verified I am working with the correct string to sign. That leaves the possible sources of discrepancy to:
The derived key is is calculated like so: def derive_key(datetime)
k_secret = credentials.secret_access_key
k_date = hmac("AWS4" + k_secret, datetime[0,8])
k_region = hmac(k_date, region)
k_service = hmac(k_region, service_name)
k_credentials = hmac(k_service, 'aws4_request')
end Given the intermittent failure modes, this leeds me to strongly suspect a thread-safety issue. The v4 signer that computes the signature does not share state between calls to sign (a new digest is always created), except the read only credentials. Is it possible that the OpenSSL::HMAC or OpenSSL::Digest modules are not thread safe? I'm open to suggestions. |
I'm not aware of any thread-safety issues with Is there a way this can be tested without the whole aws-sdk stack? Since you mentioned that the various signatures match it could be that something else other than OpenSSL is interfering along the line. |
Hate to barge in with some possibly not very useful information. In the early development of EmAws I saw this error when I hadn't patched the Mutex with the Em.synchrony's fiber safe mutex. I 90% sure there use to be a mutex around fetching request credentials or a least a portion like the session_token but cannot be sure. I didn't create an issue in EmAws at the time because it was before it was ever released. This was so long ago it might not be related but thought I'd mention it. EmAws Mutex patch: https://github.com/JoshMcKin/em_aws/blob/master/lib/em_aws/patches.rb |
@yorickpeterse Yes, we should be able to test and isolate this further. I simply ran out of time yesterday. Some more background on the error message: The two strings the service echos back in the error message are the:
The canonical request is a string compiled from the request uri, headers and a full sha256 checksum of the http request body. After generating the canonical request, you generate a "string to sign" which is comprised of a fixed string prefix, the current time, the credential scope (date/region/service name/fixed suffix) and finally a hexdigest of the canonical request string. In my local test script (I will try to share this later), I am comparing the canonical request and string to sign that I generate with those returned by the service. They match perfectly in every error. Given the canonical request contains a sha256 checksum of the body, and the service checksum matches ours, I am positive the request is not getting corrupted over the wire. This points to a bad signature. The final signature is created by computing a hex digest of the string to sign using a derived key (the derived key is your secret access key signed recursively with each portion of the credential scope). The service validates the signature by computing and comparing signatures. In the case of failure, it echos back the computed fingerprints (canonical request and string to sign). It does not echo back the computed signature, but that is what fails to match. This is why I feel like the error is happening while generating the signature. @JoshMcKin Not unhelpful, but I don't think that is the cause. In my test example, I am creating a single sqs client (with a single credentials object). That object is used to get my queue by name outside the concurrent sections of code. This should guarantee my static credentials are ready to read/reuse. You are correct, the non-static credential providers use a mutex to allow them to update. I won't much have time to look at this today, I'm heading out of town this afternoon. I'll try to pick this back up on Monday though. |
@yorickpeterse I have not been able to reproduce this error locally single switching over to OpenSSL::Digest. Can you confirm? |
@trevorrowe I haven't tried this yet with using |
Sorry for the late reply. I gave this a shot using aws-sdk 1.43.2 and have not yet experienced the error. I'll do some further testing in the coming days to be absolutely sure it's gone. |
The Digest class has some thread safety issues. See aws/aws-sdk-ruby#455 See aws/aws-sdk-ruby#529
References: #49, #64, #65, #66, #67, aws/aws-sdk-ruby#455, aws/aws-sdk-ruby#529
Upon further testing this problem does seem to still occur on Rubinius 2.2.10 using aws-sdk 1.46.0. In this case I'm getting errors such as the following:
|
Contents of the Gemfile: source 'https://rubygems.org'
gem 'activerecord', '~> 4.0'
gem 'mysql2'
gem 'rollbar'
gem 'daemon-kit'
gem 'aws-sdk', '~> 1.0'
gem 'logstash-file'
gem 'oni', '~> 3.0'
gem 'dotenv'
gem 'json', '>= 1.8.1'
group :development, :test do
gem 'pry'
gem 'pry-doc'
gem 'pry-theme'
gem 'rspec', '~> 3.0'
gem 'simplecov'
gem 'rake'
end
group :yard do
gem 'yard'
gem 'kramdown'
end And Gemfile.lock:
|
Ok I did some testing and both require 'thread'
require 'digest'
require 'digest/sha1'
Thread.abort_on_exception = true
expected = '0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33'
threads = []
digest = Digest::SHA1.new
100.times do
threads << Thread.new do
loop do
got = digest.hexdigest('foo')
unless got == expected
raise "#{got} is not #{expected}"
end
end
end
end
threads.each(&:join) Using https://github.com/aws/aws-sdk-ruby/search?q=%22OpenSSL+Digest%22&type=Code there seem to be some places where Digest instances are kept around. For example,
I'll do some digging to see if those particular bits are shared between threads. |
The chunk signed stream class you linked to is currently not used by the SDK. Each client constructs a single signer and reuses that across requests. However, I don't believe any of the signer classes cache digest objects (except the un-used chunk signer). It could be fairly easy to test. The base client class defines a
https://github.com/aws/aws-sdk-ruby/blob/master/lib/aws/core/client.rb#L701 |
@trevorrowe The code in particular is a tad confusing and I'm not sure what method exactly to override. That is, Basically my idea is to override the method so that it doesn't memoize the signer, then test if that works without throwing any signature errors. |
Each service can choose a different signature version / signer. For this reason, there is a macro As a quick monkey patch, the following should work: class AWS::SQS::Client
def sign_request(req)
args = [credential_provider, 'sqs', req.region]
signer = AWS::Core::Signers::Version4.new(*args)
signer.sign_request(req)
req
end
end The code this is replacing looks like so: def signature_version(version, service_signing_name = nil)
define_method(:sign_request) do |req|
@signer ||= begin
signer_class = AWS::Core::Signers.const_get(version)
signer_args = (version == :Version4) ?
[credential_provider, service_signing_name, req.region] :
[credential_provider]
signer_class.new(*signer_args)
end
@signer.sign_request(req)
req
end
end Notice, the default implantation caches the signer to the client instance. The monkey patched version creates a new signer for each request. |
After testing said patch it appears that removing the caching of the signer instance doesn't prevent the errors from occurring. A global lock however seems to do the trick so far: class AWS::SQS::Client
GLOBAL_LOCK = Mutex.new
def sign_request(req)
GLOBAL_LOCK.synchronize do
args = [credential_provider, 'sqs', req.region]
signer = AWS::Core::Signers::Version4.new(*args)
signer.sign_request(req)
req
end
end
end This might suggest that the actual AWS client instance is shared between threads. This would be odd considering the application in question itself shares nothing AWS related between threads. Instead each SQS connection runs in its own, isolated thread with its own client. I'll try to see if the client is indeed accessed from multiple threads. |
I applied the following patch to our codebase: class AWS::SQS::Client
GLOBAL_LOCK = Mutex.new
def sign_request(req)
id = Thread.current.instance_variable_get(:@thread_id)
puts "Thread ##{id} accessing #{self.class} ##{object_id}"
GLOBAL_LOCK.synchronize do
args = [credential_provider, 'sqs', req.region]
signer = AWS::Core::Signers::Version4.new(*args)
signer.sign_request(req)
req
end
end
end Running the application results in the following output:
The last number in each line is the object ID of the client in question. Perhaps I'm missing something but this would suggest that there's a single |
Worth noting: the |
Thats interesting. Given the global mutex around the signing code seems to resolve the problem, that implies one of a few possible causes:
The v4 signer has 3 instance variables, which are the credential, the service name string and the region string. What type of credentials object are you testing with? Can you dump out the credentials class name? |
This may not matter but rubinius 2.2.9 is set to ruby version 2.1.0 but the open ssl threadsafe patch is in MRI Ruby 2.1.1+ |
If it helps, while I haven't seen these particular errors on JRuby I did just get the following error:
While this is in S3 and not SQS I suspect this code is suffering from the same problem. |
i see the label "Version 1" on here, but I am seeing this with v2.0.30. I am using MRI 2.1.3p242. error happens after awhile using threads. |
@quinn Can you reproduce it using the script mentioned in #455 (comment)? You probably have to adjust the script a bit for V2 of the SDK though. |
@yorickpeterse I think i was mistaken, i'm trying w/o threads and still seeing. sorry for the confusion |
@quinn If possible, could you at least share the error output if you still have it? |
sure:
It happens after awhile but always on the same object. I am batch copying from one bucket to another using Object#copy_from. |
I was not url encoding the copy source param, a bit embarrassing. Still, it's a cryptic error message for this type of error. |
@trevorrowe This particular problem also seems to occur when using V2 of the SDK and Rubinius 2.5.2:
There is also a huge amount of autoloading problems occurring when using V2. On V1 we also had these problems but they could be solved by using |
I did not implement As for the signature errors, the only fix I understand that we've tried that resolves the issue is to put a global mutex around the OpenSSL digest methods. This seems like it shouldn't be necessary. Thoughts? |
@trevorrowe At least on JRuby 1.7
In comment #455 (comment) I discussed this and set up a standalone script that couldn't reproduce anything close to the problem discussed in this issue. The code we use for OpenSSL is pulled directly from MRI commit ruby/ruby@5a58165. I still have to look into updating to the current version, I vaguely remember having problems last time I tried. Either way, in V1 the problem was that the client/signature signing instances were shared between threads without any synchronisation in place. Is such a pattern still the case with V2? |
What is not thread-safe about autoload in JRuby 1.7? |
@headius see #455 (comment). |
Those errors do not necessarily indicate that JRuby's autoload is unsafe. Constants can end up missing during any concurrent require logic, regardless of whether autoload is involved, if code attempts to access classes while they're still booting. This can happen if, for example, defined? is used to determine the existence of a class. This is because class definition is not atomic. Of course it could also indicate a thread-safety problem in JRuby's autoload, but it's definitely not the first place I'd look. |
Ok, looking at the code, there's an access of Base there, and Base is defined as an autoload...so there's a better chance this is a bug in autoload in JRuby. I'll see if I can come up with a case that reproduces it. |
At least this script https://gist.github.com/YorickPeterse/2efb97451fd27c34aec7 fails on Rubinius with constant missing errors, haven't managed to get it to fail on JRuby yet. |
In rubinius/rubinius@b57399f I took care of the autoloading problems (as far as I can tell). I finally managed to get my hands on a signature error using the scripts discussed in https://gist.github.com/YorickPeterse/2efb97451fd27c34aec7:
This indicates that the script does eventually reproduce the problem given enough time and luck. |
In a production application the autoloading problems are indeed resolved, sadly the signature errors remain (when using V2). I'll be looking into this today to see if I can solve this problem for good. |
@yorickpeterse In production, are you using static credentials, or are you using refreshing credentials, such as an IAM instance profile, assume role credentials, STS session credentials, etc? |
@trevorrowe I've tested this both on an EC2 instance (which uses IAM credentials) and locally (which uses static credentials). I'm currently looking into rubysl-openssl as I suspect it has some problems that might trigger this particular problem. Running my repro script mentioned above with |
We recently updated the OpenSSL code of Rubinius to match the code of the latest MRI version (rubysl-openssl is basically a 1:1 copy of MRI's OpenSSL code). Having looked at the code of V2 of the AWS SDK I couldn't find a point where mutable state was shared between threads, leading me to believe it might be OpenSSL that's broken. To confirm/deny this I set up a new standalone script to try and trigger the problem: require 'openssl'
Thread.abort_on_exception = true
input = 'the cake is possibly a lie'
digest = OpenSSL::Digest::SHA256.new
digest.update(input)
expected = digest.hexdigest
puts 'Starting threads...'
threads = 10.times.map do
Thread.new do
loop do
digest = OpenSSL::Digest::SHA256.new
digest.update(input)
got = digest.hexdigest
if got != expected
raise "Expected digest #{got.inspect} to equal #{expected.inspect}"
end
end
end
end
threads.each(&:join) On MRI this will run fine for all eternity. On Rubinius on the other hand this pretty much instantly crashes with the following output:
Interesting enough the digest appears to always be "70b9d40465992fa71a861e050481fb1a4bed5c8aa127272e7bf5838a3b0ab240". However, when running Rubinius with a CAPI lock (using To cut a long story short, it seems the OpenSSL extension is beyond broken in multi-threaded environments or Rubinius does some weird things to it. I'll continue digging until I'm sure what's to blame. If it turns out to be unrelated to the AWS SDK itself I'll close this issue. |
Per "rideliner" from the Rubinius Gitter channel, the above snippet is flawed. The local variable Either way, the actual signature problem still persists. I'll continue my investigation, but at least it seems that both rubysl-digest and rubysl-openssl are not the cause for these problems. |
@trevorrowe I literally just got this error on MRI 2.2 as well (though this is using V1), which seems to rule our Rubinius itself being the problem. The error/backtrace is as following (as taken from Rollbar):
Backtrace:
|
For the first time ever I also experienced this issue on JRuby 1.7:
Backtrace:
Ruby info:
|
This issue has been inactive for over a year, and V1 is end of life. Happy to continue to look at this, given all the effort in place so far, if we get more information. For now, closing. Again, feel free to reopen if more information comes in. |
Since switching a daemon over to Rubinius we've been getting quite a few of the following errors:
The corresponding stack trace:
The root frame of the error is our own code which creates an instance of
AWS::SQS
and then tries to retrieve a queue by its name. This happens in a threaded environment (10 threads to be exact) under Rubinius. MRI does not (so far) seem to be affected.It's not exactly clear to me what's causing it. The individual
AWS::SQS
and related instances are not shared between threads, nor are any of our own AWS related operations.As a temporary fix I disabled checksum validation for SQS operations but I'd rather not keep that disabled in the long run.
Setup info:
AWS.config
)The text was updated successfully, but these errors were encountered: