Mongoose findOneAndUpdate hangs under heavy load #14877
Output of script:
If mongoose is bypassed and the other code that is commented out is swapped in, the issue doesn't seem to occur.
It seems that if you make the schema bigger/more bloated with additional fields, even if they are not populated, the issue occurs earlier, after about 400 iterations. But this could be a fluke.
I have been fighting this issue and have not been able to go above version 8.2.2. What I have found in production is that the database connection is checked out from the pool and remains checked out. I assume what I'm experiencing must be in mongodb, since everything from where the connection is checked out to where it is checked back in happens down in the mongodb driver.
Here is a function I'm using to look at the mongodb internals and see if the connections have been checked out from the connection pool. I was suspecting that it was a connection pool issue, but that proved to be wrong. Could you check if you are leaking connections from connection pool in your tests?
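In the same spirit, checked-out connections can be counted without poking at driver internals, using the driver's public CMAP monitoring events. A minimal sketch (the `trackPool` name and `stats` shape are mine; the event names are part of the driver's connection monitoring API, and in a mongoose app the client can be obtained from `mongoose.connection.getClient()`):

```javascript
// Track how many connections are currently checked out of the pool via
// the driver's CMAP monitoring events. `client` is anything that emits
// these events, e.g. the MongoClient from mongoose.connection.getClient().
function trackPool(client) {
  const stats = { checkedOut: 0 };
  client.on('connectionCheckedOut', () => { stats.checkedOut += 1; });
  client.on('connectionCheckedIn', () => { stats.checkedOut -= 1; });
  return stats;
}

// Usage sketch: const stats = trackPool(mongoose.connection.getClient());
// If stats.checkedOut keeps growing over time, connections are leaking
// from the pool instead of being checked back in.
```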
Thank you for this! I've tried running my repro on several mongoose versions, and here are my findings: 8.3.0 reproduces the issue. The issue starts in 8.3.0. /cc @vkarpov15
Here's what I found: if we are experiencing the same thing, setting socketTimeoutMS will cause it to eventually throw MongoNetworkTimeoutError.
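As a diagnostic aid (not a fix), the timeout can be set at connect time so the hang surfaces as an error rather than a promise that never resolves. A sketch, assuming a standard mongoose setup:

```javascript
const mongoose = require('mongoose');

// With socketTimeoutMS set, a socket that stops receiving data errors
// out with MongoNetworkTimeoutError instead of hanging forever.
mongoose.connect(process.env.MONGODB_URI, {
  socketTimeoutMS: 30000, // 30s; the driver default of 0 disables the timeout
}).catch((err) => console.error(err));
```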
Update: I used yarn resolutions to pin the driver version. Indeed, this may be an upstream issue in the mongodb Node.js driver.
It appears that the problem started with …
This may be the same issue as: https://jira.mongodb.org/projects/NODE/issues/NODE-6370
I can confirm that we resolved our issue with the workaround of pinning the driver version. This is not a mongoose issue. I'll keep this issue open since the mongodb node driver doesn't have a GitHub issue tracker, and perhaps someone from the mongodb team will see it here. A temporary fix for mongoose might be to revert the mongodb driver bump back to 6.3.0.
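The pinning workaround mentioned above can be expressed in package.json. A sketch for Yarn (npm users would use the "overrides" field instead); 6.3.0 is the last known-good driver version mentioned in this thread:

```json
{
  "resolutions": {
    "mongodb": "6.3.0"
  }
}
```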
I think it's worth keeping this open because Mongoose should at least have some option for using the working 6.3.0 driver.
Seems it might be a good idea to have mongoose define a peer dependency on mongodb instead of a direct dependency. In this case the version of the driver would be user controllable. Not sure why it's not already that way.
I'm unable to repro this on my end with Mongoose 8.6.2, although I had to set …. Mongoose has always pinned exact versions of the MongoDB Node driver because, in the past, Mongoose was more tightly coupled with the MongoDB Node driver and patch releases would cause breaking issues. However, that hasn't been the case in some time, so we've considered relaxing Mongoose's Node driver version requirements. @andreialecu, if we changed Mongoose's dependencies to …, would that work?
That would probably work, but I think making mongodb a peer dependency might still be a good idea. Or a peer dependency with a default. See: yarnpkg/berry#3710 (comment)
You can probably get it to repro if you use upsert and update the same document repeatedly. I haven't tried optimizing the repro. I noticed it happens more easily with bigger documents; the linked issues in the mongodb issue tracker also mention this. It may still happen with smaller documents, just at a lower frequency. We've had some weird issues for a while which I think we can attribute to this problem.
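A hedged sketch of that kind of repro loop (`hammer`, the `key`/`blob` fields, and the payload size are my inventions, not from the thread; `model` is anything exposing a mongoose-style `findOneAndUpdate`):

```javascript
// Repeatedly upsert the same document with a large payload. Under the
// affected driver versions, one of these awaits may never resolve when
// run against a remote server such as Atlas.
async function hammer(model, iterations, payload) {
  for (let i = 0; i < iterations; i += 1) {
    await model.findOneAndUpdate(
      { key: 'a' },
      { $set: { blob: payload } },
      { upsert: true }
    );
  }
  return iterations;
}

// Usage sketch against a real mongoose model (assumed setup):
//   await hammer(Test, 1000, 'x'.repeat(1024 * 1024));
```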
I have a test case that seems pretty reliable, at least with Atlas. It might hit storage limits, but it is a simple …. I'm not certain if it requires the schema or not, but it's about 600 fairly large objects, and from what I can tell it would be kicking off 600 …
npm doesn't install peer dependencies by default AFAIK, which makes making mongodb a peer dep likely a non-starter. @hunkydoryrepair can you please send me the script? You can just email me val at karpov dot io, if that's preferred to posting the script on GitHub. |
It does, since 2020: https://github.blog/news-insights/product-news/presenting-v7-0-0-of-the-npm-cli/ This has actually been quite controversial; the Yarn maintainers were vehemently opposed to this default. Yarn and, I think, pnpm will not install peer dependencies automatically, but they will install peer dependencies with defaults (see my previous comment).
The case I thought was the same is a different bug, I think. I have a solid repro case, but it works reliably in 8.4.0 and consistently hangs in 8.4.1 or higher. I will open a separate bug.
@andreialecu
The test case I shared in my initial post was 100% reproducible for me. You weren't able to reproduce it? It's possible it could be related to latency to Atlas, as I was running it locally and my ping to the cluster is 30-60ms.
I wasn't able to consistently reproduce with it, even when connecting to Atlas. It timed out a few times for me, but it was pretty slow other times, so it wasn't super convincing. And I go through a VPN to connect to Atlas, so it is pretty slow, but I'm not sure it was reaching the idle timeouts. The complicated-schema cases seem to be a different bug from what I can tell, as they do some N! operations where N = document depth, so even getting 10 levels deep would seem to hang. You also said you could not reproduce locally, and I think this is why.
I was only able to reproduce it with certain mongodb driver versions, and it was consistent. In our production app, pinning to the driver version I mentioned resolved the issue.
@andreialecu Thank you. You inspired me to dig into this again using your use case. The issue I ran into is that your repro stops producing the problem if it is run multiple times, but I added a Test.deleteMany({}) at the beginning, and it became reproducible again. I spent the day tracking it down in the debugger and found a solution, which I added to this issue. The mongodb driver doesn't attach its listener to the socket's 'data' event in time to catch all of the 'data' events, so it misses the returned value. This happens only with larger commands that exceed the kernel socket buffer: socket.write returns a value indicating the socket needs to be drained, and the driver awaits the 'drain' event before listening for 'data'. Before the task awaiting 'drain' is processed, the 'data' event comes back and is emitted with no listener attached. Anyway... here's hoping they accept my solution!
Can we keep this open until a fix from mongodb is available and incorporated? |
The mongodb driver fix has been pulled into the 6.10 branch. The next release should include it.
Prerequisites
Mongoose version
8.5.1
Node.js version
20.17.0
MongoDB server version
7.0.12
Typescript version (if applicable)
No response
Description
We have a batch job that does a bunch of updates that used to work a while ago, but recently started hanging on calls to findOneAndUpdate. The promise stops resolving after a while. I have been able to create a repro for this, but it seems to only reproduce reliably on MongoDB Atlas (or probably other remote servers). On localhost it doesn't seem to reproduce.
Also if I bypass mongoose entirely and use the mongodb driver directly, it seems to work.
Steps to Reproduce
Expected Behavior
This must've started happening after some recent update to either mongoose, the mongodb driver, or the MongoDB server version. It's unclear which; we haven't run this job in a while and just noticed it occurring now.