FAQ
Q. I gave the Alfresco process a bunch more memory / CPU, but my import didn't speed up. Shouldn't it have gotten a lot faster?
A. No. Bulk imports are usually I/O bound, so adding more CPU or memory capacity when neither is the bottleneck isn't going to help much, if at all. Instead I'd focus on the classical performance tuning process:
1. Identify the performance objective (so you know when to stop)
2. Measure the system
3. Identify the bottleneck
4. Fix the bottleneck
5. Measure the system again. If the performance objective isn't met, go to step 3.
6. ???
7. PROFIT!!!1
Step #1 is critically important, otherwise this process becomes an infinite loop!
Q. Can I run more than one import at a time (e.g. on different cluster nodes) to speed things up?
A. Yes, though it may not accomplish much if your bottleneck is in a shared component (database, contentstore, network, source filesystem - see previous question).
Related Q. I tried to run the Bulk Import Tool on multiple cluster nodes and got a JobLockService exception.
A. You're using the embedded fork, which is a cluster-singleton process. One of numerous reasons to avoid the embedded fork.
Q. The target (imported) counters seem very bursty - long stretches with little activity, then sudden jumps. Why?
A. To avoid double counting (e.g. during a transactional retry), the tool only "counts" the target data when a transaction is committed. This makes the various target counters appear to be a lot more bursty than they actually are. The best solution is to focus on the moving average, since it's a better indicator of overall throughput.
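To see why the moving average is smoother, here's a minimal sketch (hypothetical class and method names - not the tool's actual code) of a throughput tracker that only learns about nodes at commit time, then smooths the resulting bursts with an exponentially weighted moving average:

```java
// Minimal sketch (hypothetical names) of commit-time counting plus a moving average.
public class ThroughputTracker
{
    private static final double SMOOTHING = 0.1;   // weight given to the newest sample

    private long   lastSampleTimeMs         = System.currentTimeMillis();
    private long   nodesSinceLastSample     = 0;
    private double movingAverageNodesPerSec = 0.0;

    // Called only when a transaction commits, so an entire batch is added at once.
    public synchronized void onTransactionCommitted(final long nodesInBatch)
    {
        nodesSinceLastSample += nodesInBatch;
    }

    // Called periodically (e.g. once a second) by the status reporter.
    public synchronized double sample()
    {
        final long   now           = System.currentTimeMillis();
        final double elapsedSecs   = Math.max((now - lastSampleTimeMs) / 1000.0, 0.001);
        final double instantaneous = nodesSinceLastSample / elapsedSecs;   // bursty

        // Exponentially weighted moving average - far less bursty than the raw value
        movingAverageNodesPerSec = (SMOOTHING * instantaneous) +
                                   ((1.0 - SMOOTHING) * movingAverageNodesPerSec);

        nodesSinceLastSample = 0;
        lastSampleTimeMs     = now;
        return movingAverageNodesPerSec;
    }
}
```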
Q. After a little while I'm seeing long periods of zero instantaneous activity, followed by a solitary large burst. What's going on?
A. This is partly related to the previous question, and is something I've observed in my test environment too. While I'm not 100% sure I know the answer, what I think is happening is that transaction commits across the various worker threads end up falling into alignment. Initially I figured it was just because I was starting all of the worker threads at the same time, but after adding in staggered startup logic what I saw was that the "coherence pattern" would eventually re-emerge anyway. It's possible this is specific to the database I'm testing on (MySQL 5.6.25) but regardless, I'd be very keen to hear from a database expert who might be able to explain the observed behaviour in more detail.
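For reference, the general idea of staggered startup can be sketched as follows (hypothetical names and delays; the tool's actual logic may differ) - each worker thread waits a random amount of time before it starts importing, in the hope of de-synchronising the commits:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of staggering worker thread startup with random jitter (illustrative names only).
public class StaggeredStartupExample
{
    public static void main(String[] args)
    {
        final int             workerCount = 4;
        final ExecutorService pool        = Executors.newFixedThreadPool(workerCount);

        for (int i = 0; i < workerCount; i++)
        {
            pool.submit(() -> {
                try
                {
                    // Random delay of up to 5 seconds before this worker starts importing
                    Thread.sleep(ThreadLocalRandom.current().nextLong(5000));
                }
                catch (InterruptedException ie)
                {
                    Thread.currentThread().interrupt();
                    return;
                }
                importBatchesUntilDone();   // hypothetical per-thread import loop
            });
        }

        pool.shutdown();
    }

    private static void importBatchesUntilDone()
    {
        // Placeholder for the actual worker loop
    }
}
```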
Q. At the start of an import, I see a high "nodes imported per second" reading, but "bytes imported per second" is stuck on zero. What's happening?
A. The tool imports the entire directory structure first, before importing any files. Directories count as nodes in the repository, but are (obviously) empty - they contain no data.
Q. At the start of an import, I see "Threads: 0 active of 0 total", but the import seems to be progressing. Why is this?
A. The tool imports the directory structure and the first couple of batches of content on a single thread, since:
- directories are "dependent" on each other (they can be nested), so it's not possible to reliably import the structure in a multi-threaded way - multi-threaded imports don't guarantee ordering, meaning a child folder could be imported before its parent (which would fail, for obvious reasons)
- for small imports, the cost of spinning up the multi-threaded import machinery outweighs the benefits, so the first couple of batches (a couple of hundred files) are imported serially, and only once a certain threshold is reached (currently 3 batches, but this is an internal implementation detail that may change) does multi-threading kick in
During this single-threaded phase the worker threads haven't been created yet, and so the tool reports that zero threads are active (it's reporting on the size of the worker thread pool). Arguably it should report that 1 thread is active, even though that thread is not part of the worker thread pool - feel free to raise an issue if you think this is problematic.
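As a rough illustration of this "serial first, parallel after a threshold" approach, a sketch might look like the following (hypothetical names; the threshold of 3 simply mirrors the internal detail mentioned above, and the tool's real code differs):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: import the first few batches on the calling thread, then hand the remainder
// to a worker thread pool. Names and the threshold value are illustrative.
public class PhasedImportExample
{
    private static final int SINGLE_THREADED_BATCH_THRESHOLD = 3;

    public void importBatches(final List<Batch> batches, final int workerThreads)
    {
        int index = 0;

        // Phase 1: directories plus the first few batches, on the current thread
        while (index < batches.size() && index < SINGLE_THREADED_BATCH_THRESHOLD)
        {
            importBatch(batches.get(index));
            index++;
        }

        // Phase 2: everything else, on the worker thread pool
        if (index < batches.size())
        {
            final ExecutorService pool = Executors.newFixedThreadPool(workerThreads);

            for (final Batch batch : batches.subList(index, batches.size()))
            {
                pool.submit(() -> importBatch(batch));
            }

            pool.shutdown();
        }
    }

    private void importBatch(final Batch batch) { /* import the nodes in this batch */ }

    static class Batch { /* placeholder for a batch of nodes */ }
}
```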
Q. What unit is the "weight" value measured in?
A. Nothing. "Weight" is a unitless value that's simply used for comparing the approximate size of each imported node while constructing batches. It's intended to be proportional to the amount of work the database will have to do while importing that node, but the value itself is meaningless - it's not "number of nodes" or "number of database rows" or anything like that.
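To illustrate how a unitless weight can drive batch construction, here's a small sketch (hypothetical names and threshold - not the tool's actual batching code) that accumulates nodes into the current batch until their combined weight crosses a threshold:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of weight-based batching: nodes are added to the current batch until the
// accumulated (unitless) weight reaches a threshold, then a new batch is started.
public class WeightedBatcher
{
    private static final long BATCH_WEIGHT_THRESHOLD = 100;   // illustrative value

    public List<List<Node>> createBatches(final List<Node> nodes)
    {
        final List<List<Node>> batches       = new ArrayList<>();
        List<Node>             currentBatch  = new ArrayList<>();
        long                   currentWeight = 0;

        for (final Node node : nodes)
        {
            currentBatch.add(node);
            currentWeight += node.weight();   // unitless proxy for database effort

            if (currentWeight >= BATCH_WEIGHT_THRESHOLD)
            {
                batches.add(currentBatch);
                currentBatch  = new ArrayList<>();
                currentWeight = 0;
            }
        }

        if (!currentBatch.isEmpty()) batches.add(currentBatch);
        return batches;
    }

    interface Node { long weight(); }   // e.g. heavier for nodes with many properties or versions
}
```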
Q. Why is the count of "folders scanned" so much higher than the actual number of folders on disk in the source content set?
A. The "Default" source (which imports from a filesystem) scans the source content set twice (see the FAQ item above regarding confusing thread counts, for an explanation of why this is done):
- to enumerate all of the folders in the source content set and submit them for import
- to enumerate all of the files in the source content set and submit them for import
While this may seem inefficient, it is preferable to scanning once and holding the entire set of filenames in memory - the memory usage with that approach would be O(N) on the number of folders + files in the source content set, and could easily exceed the total heap available to Alfresco in the presence of large (multi-million node) source content sets. In addition, on modern platforms, performing a recursive folder listing is reasonably fast - no file data is being read, just the index entries (inodes in Unixland) in the filesystem, and most operating systems have caches for these data structures anyway.
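For the curious, a two-pass scan along the lines described above can be sketched like this (hypothetical names; the Default source's actual scanner differs) - note that each pass visits every folder, which is why folders are counted twice:

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

// Sketch of the two-pass scan: pass 1 submits folders for import, pass 2 submits files.
// Neither pass holds the full listing in memory beyond what's needed to recurse.
public class TwoPassScanner
{
    public void scan(final Path sourceRoot) throws IOException
    {
        // Pass 1: folders only
        Files.walkFileTree(sourceRoot, new SimpleFileVisitor<Path>()
        {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
            {
                submitFolderForImport(dir);
                return FileVisitResult.CONTINUE;
            }
        });

        // Pass 2: files only
        Files.walkFileTree(sourceRoot, new SimpleFileVisitor<Path>()
        {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
            {
                submitFileForImport(file);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    private void submitFolderForImport(final Path folder) { /* enqueue folder for import */ }
    private void submitFileForImport(final Path file)     { /* enqueue file for import */ }
}
```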
Back to wiki home.
Copyright © Peter Monks. Licensed under the Apache 2.0 License.