
Performance overhead on HiveConnector::createDataSource if split preloading kicked in #10173

Open
zhztheplayer opened this issue Jun 13, 2024 · 9 comments
Labels
bug Something isn't working triage Newly created issue that needs attention.

Comments


zhztheplayer commented Jun 13, 2024

Bug description

In Gluten's use case, we observed that HiveConnector::createDataSource slows down data scans when split preloading is turned on.

In this case, a hotspot appeared in filter expression compilation (namely SimpleExpressionEvaluator::compile):

(profiling screenshot: hotspot in SimpleExpressionEvaluator::compile)

The filter expression contains bloom filters, so compilation took much longer than usual, since bloom-filter compilation can be slower than that of other expression types.

With split preloading turned off, the scan time is reduced by ~6x (~30s vs. ~5s). The estimated total split count is ~200K.

Related code:

void TableScan::preload(
    const std::shared_ptr<connector::ConnectorSplit>& split) {
  // The AsyncSource returns a unique_ptr to the shared_ptr of the
  // DataSource. The callback may outlive the Task, hence it captures
  // a shared_ptr to it. This is required to keep memory pools live
  // for the duration. The callback checks for task cancellation to
  // avoid needless work.
  split->dataSource = std::make_unique<AsyncSource<connector::DataSource>>(
      [type = outputType_,
       table = tableHandle_,
       columns = columnHandles_,
       connector = connector_,
       ctx = operatorCtx_->createConnectorQueryCtx(
           split->connectorId, planNodeId(), connectorPool_),
       task = operatorCtx_->task(),
       dynamicFilters = dynamicFilters_,
       split]() -> std::unique_ptr<connector::DataSource> {
        if (task->isCancelled()) {
          return nullptr;
        }
        auto debugString =
            fmt::format("Split {} Task {}", split->toString(), task->taskId());
        ExceptionContextSetter exceptionContext(
            {[](VeloxException::Type /*exceptionType*/, auto* debugString) {
               return *static_cast<std::string*>(debugString);
             },
             &debugString});
        auto dataSource =
            connector->createDataSource(type, table, columns, ctx.get());
        if (task->isCancelled()) {
          return nullptr;
        }
        for (const auto& entry : dynamicFilters) {
          dataSource->addDynamicFilter(entry.first, entry.second);
        }
        dataSource->addSplit(split);
        return dataSource;
      });
}

To solve the issue, perhaps the split-preloading procedure could adopt some kind of reuse logic to avoid compiling the expressions every time a split is preloaded.
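To illustrate the reuse idea, here is a minimal stand-alone sketch (not the Velox API; `CompiledExpr`, `ExprCompileCache`, and `getOrCompile` are hypothetical stand-ins): a task-scoped cache that compiles a filter expression once, so that preloading N splits does not pay the compilation cost N times.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Stand-in for a compiled filter expression.
struct CompiledExpr {
  std::string source;  // pretend this is the compiled form
};

// Hypothetical task-scoped cache; each preload callback would consult it
// instead of recompiling the remaining filter from scratch.
class ExprCompileCache {
 public:
  // Compile 'source' once; later calls with the same source reuse the result.
  std::shared_ptr<const CompiledExpr> getOrCompile(const std::string& source) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(source);
    if (it != cache_.end()) {
      ++hits_;
      return it->second;
    }
    auto compiled = std::make_shared<const CompiledExpr>(CompiledExpr{source});
    cache_.emplace(source, compiled);
    return compiled;
  }

  int hits() const { return hits_; }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<const CompiledExpr>> cache_;
  int hits_ = 0;
};
```

The mutex matters because preload callbacks run on executor threads concurrently; the cache must be safe to consult from several preloads at once.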

System information

Velox System Info v0.0.2
Commit: b98b20414faf414775f506b6d446f314cb27fe38
CMake Version: 3.29.2
System: Linux-5.4.0-156-generic
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 9.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 12.0.0
CMake Prefix Path: /usr/local;/usr;/;/opt/cmake;/usr/local;/usr/X11R6;/usr/pkg;/opt
@zhztheplayer zhztheplayer added bug Something isn't working triage Newly created issue that needs attention. labels Jun 13, 2024
@zhztheplayer (Contributor, Author)

cc @oerling @Yuhta @mbasmanova


Yuhta commented Jun 13, 2024

Can you put the bloom filter in dynamic filters instead? It does not sound right to put a bloom filter in remaining filter. Where is the bloom filter generated?

@zhztheplayer (Contributor, Author)

> It does not sound right to put a bloom filter in remaining filter.

What's the ideal use case for the remaining filter? I'm not very familiar with this part of Velox's design.

> Where is the bloom filter generated?

It's generated by the Spark query planner before creating the Velox task.


Yuhta commented Jun 13, 2024

The remaining filter is the part of the WHERE clause that cannot be converted to tuple domains (key-value filters). In your case it is probably easier to wrap the bloom filter in a Filter object and pass it using TableScan::addDynamicFilter before you kick off the task.
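A minimal sketch of that wrapping, with stand-in types (this is not the actual velox::common::Filter hierarchy; `Int64Filter`, `FakeBloomFilter`, and `testInt64` here are simplified stand-ins): the bloom-filter probe becomes a per-value keep/drop test the scan can apply directly, with no expression compilation involved.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>

// Stand-in for a deserialized Spark bloom filter; a real one is a
// probabilistic bit array, modeled here as an exact set of hashes.
struct FakeBloomFilter {
  std::unordered_set<uint64_t> bits;
  bool mayContain(uint64_t h) const { return bits.count(h) > 0; }
};

// Minimal Filter-like interface: a keep/drop decision per value.
struct Int64Filter {
  virtual ~Int64Filter() = default;
  virtual bool testInt64(int64_t value) const = 0;
};

// Wraps the bloom filter so it can be pushed down as a dynamic filter.
struct BloomFilterInt64 : Int64Filter {
  explicit BloomFilterInt64(FakeBloomFilter bf) : bf_(std::move(bf)) {}
  bool testInt64(int64_t value) const override {
    return bf_.mayContain(std::hash<int64_t>{}(value));
  }
  FakeBloomFilter bf_;
};
```

The appeal of this route is that dynamic filters are applied per row during the scan and are never compiled, so the per-split compilation hotspot disappears entirely.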


Yuhta commented Jun 13, 2024

A second way is to inherit the compiled expression in HiveDataSource::setFromDataSource.
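A stand-alone sketch of that idea (hypothetical stand-in types, not the real HiveDataSource): when one data source is initialized from another, it takes over the already-compiled remaining filter instead of recompiling it.

```cpp
#include <memory>
#include <string>
#include <utility>

// Stand-in for a compiled remaining-filter expression.
struct CompiledFilter {
  std::string text;
};

// Simplified model of a data source that can inherit state from another.
struct DataSourceSketch {
  std::shared_ptr<CompiledFilter> remainingFilter;

  // Take over the compiled filter from a preloaded source rather than
  // compiling a fresh copy.
  void setFromDataSource(DataSourceSketch&& other) {
    remainingFilter = std::move(other.remainingFilter);
  }
};
```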

@zhli1142015 (Contributor)

Or make the Bloom filter lazily initialized, perhaps something like below? Thanks.

template <typename T>
struct BloomFilterMightContainFunction {
  VELOX_DEFINE_FUNCTION_TYPES(T);

  using Allocator = std::allocator<uint64_t>;

  void initialize(
      const std::vector<TypePtr>& /*inputTypes*/,
      const core::QueryConfig&,
      const arg_type<Varbinary>* serialized,
      const arg_type<int64_t>*) {
    if (serialized != nullptr) {
      serialized_ = serialized->str();
    }
  }

  FOLLY_ALWAYS_INLINE void
  call(bool& result, const arg_type<Varbinary>&, const int64_t& input) {
    if (serialized_.has_value()) {
      bloomFilter_.merge(serialized_.value().c_str());
      serialized_ = std::nullopt;
    }
    result = bloomFilter_.isSet()
        ? bloomFilter_.mayContain(folly::hasher<int64_t>()(input))
        : false;
  }

 private:
  BloomFilter<Allocator> bloomFilter_;
  std::optional<std::string> serialized_;
};
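The deferral behavior above can be modeled in a few lines (stand-in types, not the Velox simple-function framework): `initialize()` only stashes the serialized blob, and the expensive "merge" happens at most once, on the first `call()`. A preloaded data source whose splits are never read therefore never pays the deserialization cost.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>

// Minimal model of the lazy-initialization pattern.
struct LazyBloom {
  std::optional<std::string> serialized;  // stashed at initialize() time
  int merges = 0;                         // counts deserializations performed
  bool initialized = false;

  // Cheap: just keep the serialized bytes around.
  void initialize(std::string blob) { serialized = std::move(blob); }

  // First call only: "merge" (deserialize) the blob, then drop it.
  bool call(uint64_t /*hash*/) {
    if (serialized) {
      ++merges;
      initialized = true;
      serialized.reset();
    }
    return initialized;  // stand-in for mayContain()
  }
};
```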


Yuhta commented Jun 14, 2024

@zhli1142015 If it is the same in all data sources, recompiling it everywhere is a waste of CPU.

@zhli1142015 (Contributor)

I see, that makes sense. Then how about storing the DataSources created in the preload threads in some collection and reusing them? Thanks.

@Yuhta
Copy link
Contributor

Yuhta commented Jun 14, 2024

Reusing them would be tricky. The most straightforward way would be to wrap the bloom filter in a Filter object and push it into the dynamic filters instead of modeling it as an expression (that seems more logically correct as well). Otherwise, keeping it as an expression would require a fairly large change to the connector to factor out remaining-filter compilation (the connector interface might be hard to change to accommodate this). We are not seeing compilation cost anywhere else, so it's a question whether it is worth doing just for the Spark bloom filter.
