-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Native Interface #91
Comments
From related issue above:
What's wrong with clickhouse-driver? Except official support from ClickHouse Inc. It's well documented and has excellent (96%) code coverage. Many people use it in production for years. It is also in top 1% downloads on pypi: https://pypistats.org/packages/clickhouse-driver. It’s good to see you chasing it in features and performance. But what’s the profit for the community in another one package? |
clickhouse-driver is a mature, solid project and we continue to recommend it the ClickHouse users in appropriate circumstances. However, there are many ClickHouse users who can't take advantage of clickhouse-driver because of its lack of HTTP support; in particular for deployments that depend on ChProxy or other HTTP caches/load balancers. At this point ClickHouse Connect performance is nearly equal to that of clickhouse-driver, and we continue to make performance improvements on a regular basis. Native support is just one of those planned improvements, and as I noted in the issue, at this point it's not a lot of work to replace the HTTP transport layer with a native protocol implementation. (It's also not our highest priority.) I guess I would reverse the question -- what's the harm to the community in having two options for drivers over the native protocol? |
It good to see • proxing: load balancing/caching Please correct me if I wrong.
Well, the difficulty of choice at the first. Let's think I'm newbie to ClickHouse and I need to get data from it in Python in most effective way. I need to learn about different internals: FORMATs (Native/HTTP/etc.), transport layers (TCP, HTTP), cache/load balancers. At the beginning I have one problem: how to read data from ClickHouse. At the end I have at least two problems: how to read data from ClickHouse and how to choose right driver for it. Of course I can spend some time to learn about internals. But should I do in the very beginning? There was another point: what if I picked "wrong" package and already write code in it? Then I need to pick another one library and rewrite code because another library have another interface. It’s natural for different libraries of course, but it's sad for me as developer that just wants to work with ClickHouse at the first time. There are a many such examples in Russian telegram chat. How you would continue to recommend There are TCP load balancers such as nginx/haproxy: https://kb.altinity.com/altinity-kb-setup-and-maintenance/load-balancers. Of course they are not so flexible as HTTP. But you can even balance queries on client-side with simple round-robin. There is an option for it: https://clickhouse-driver.readthedocs.io/en/latest/features.html#multiple-hosts. Both I don't have anything against suite package but I don't understand why you have to embed "core" packages in the suite. If you want to build "suite" package just specify "core" packages as requirements. This will make suite easier to maintain: minor changes in core packages will not require new version of the suite. As I can see the suite still relies on the SQLAlchemy. For some good reason this package is not embedded. Please don't get me wrong, I don't want to teach you how to make packages or write software. I have no doubt in it. All my points are about making better packages for the community and focusing on better integration with the Clickhouse server by reusing existing software not rewriting it without good reason for it. And focusing on the new integrations like Superset, etc. |
I've recommended clickhouse-driver to customers with high volume queries or inserts because of the performance advantage. For example, #56 (comment). As clickhouse-connect has improved in performance that difference has narrowed significantly and in many cases clickhouse-connect now has the advantage even though it uses HTTP. And while we certainly do not force anyone to use it, for the best up to date support and compatibility with other tools, we do think most users in the long run will be better off with an officially supported ClickHouse Inc. driver. (More than 10 months after it was released as an experimental feature, clickhouse-driver does not yet support JSON column inserts). My first interest in a Python HTTP driver started when I was at Comcast building an analytics platform based on ClickHouse. We wanted to use Superset but had only HTTP proxies and load balancers at the time to access a cluster of nearly 100 ClickHouse nodes. To be clear this has not been an actual year of development. ClickHouse Connect started as a side project using RowBinary while I was still a ClickHouse support engineer. I came across this issue early on:
We continued to have demand for better Superset compatibility, so I kept working on the project. We also took over dbt-clickhouse and HTTP support was one of the most requested features, so that was another incentive to build a solid driver. I wasn't satisfied with my original RowBinary performance so I started investigating the Native format. The design and performance of the project have grown organically as I have had time to work on it. (As for the suite aspect, that was originally for simplicity to provide Superset support with a dynamic EngineSpec. I don't know that we'll ever expand SQLAlchemy support beyond what's necessary for Superset (I think SQLAlchemy is an outdated technology, with a very cumbersome design), so making that into a stand alone project doesn't make a lot of sense, and we actually do plan on migrating the small Superset pieces to Superset itself. I don't claim the suite itself is a particularly good thing.) I assume you've known about clickhouse-connect for quite some time (it's been public since May or June I think) so it's unfortunate that you waited until this issue targeting native protocol support was opened to express your concerns. It's also unclear what you'd like us to do at this point. I think the time has passed to add HTTP support to clickhouse-driver. Finally, as I've said, clickhouse-driver is a mature and stable product and there's no doubt that it's been by far the best option for Python/ClickHouse connectivity for years. It's therefore not surprising that it's widely used in ClickHouse projects and therefore popular on pypi. But I obviously believe that ClickHouse Connect is a strong alternative, with a solid design, very few bugs (that are quickly addressed), good performance, and official ClickHouse Inc. support. |
Can we merge these two drivers together and make the transport protocol choice available under a common interface? |
I suppose we can. But how should we do it? |
Ha-ha :) That reminds me "Hitchhiker's guide to the galaxy": “But Mr Dent, the plans have been available in the local planning office for the last nine months.”
In my opinion the only advantage of HTTP is simplicity for driver creators. So if you want to write driver in 20 minutes - go with HTTP. The rest are DISadvantages (see below),
Instead of creating new driver on the top of HTTP it would be better to add support for native protocol to chproxy. The HTTP-protocol in clickhouse is bad for 'universal' drivers (and always was). In general, HTTP is not designed for RDBMS usage: the main problem is lack of multiplexing there, and several other problems.
Simple test which all drivers build on the top of HTTP fail: echo "SELECT throwIf(number=8, 'Ups, it is the error inside the data stream') as x FROM numbers(10)" | curl -D- 'http://localhost:8123/?max_threads=1&max_block_size=1&default_format=Native' --data-binary @- | xxd
## Those settings (max_threads & max_block_size are tuned just to simplify the test, the same can happen with normal settings too Yes, you can use
echo "SELECT max(toString(number)) as let_me_burn_your_cpu FROM numbers_mt(100000000000)" | timeout -v 2s curl -D- 'http://localhost:8123/?max_threads=32' --data-binary @-
# so the client was killed, but what is happening on the server? Psst, check SHOW PROCESSLIST! You can (try) to use
There was some old proposal how to make HTTP itself better: ClickHouse/ClickHouse#6294
Since it's impossible to close that Pandora box again - it would be best to merge them. |
I think the Hitchhiker's Guide comparison is a bit unfair :) -- it's been on our website for quite some time and mentioned in Alexey's webinars. Again no one seemed bothered by it until I opened this issue targeting native protocol support. I generally agree with the limitations of the HTTP protocol and the ClickHouse implementation in particular. A fair amount of the work in That said, the most popular driver for ClickHouse is I'm open to ideas about how to merge the projects, but like @xzkostyan I'm not sure what that would look like. |
@xzkostyan I will send you an invitation to a meeting to discuss our options |
I think I've repeatedly answered the question of "why creating new". Ideal or not, HTTP support is in high demand in the community (see dbt, Grafana -- using the Altinity Grafana plugin until last year, various jdbc based products like DbBeaver and Airflow, CHProxy, my own experience at Comcast with a 650TB/day cluster). At this point it's all well and good to say "the [still undocumented] native protocol is 'better'" but that doesn't help a community that (for whatever reason) prefers HTTP and the existing HTTP based ecosystem. Finally, the author of clickhouse-driver explicitly rejected HTTP support mymarilyn/clickhouse-driver#275, mymarilyn/clickhouse-driver#176 and said "build it from scratch" (or work with one of the existing HTTP drivers, which are really basic and slow). So after building something "from scratch", you might understand why I'm somewhat frustrated with being attacked for looking to improve it by implementing the "better" native protocol as an option. ClickHouse Connect already performs on par with clickhouse-driver even though it uses HTTP (and HTTP query performance will improve further when ClickHouse/ClickHouse#45593 is released, presumably this month) and significantly better in some circumstances (like Pandas dataframes). And yes, I'm obviously biased, but I think the rapid improvements in and responsive support for this package are valuable to the ClickHouse community even if the project doesn't have the long pedigree of clickhouse-driver. That said, in retrospect I should have coordinated better with @xzkostyan when it became clear my side project was gaining traction, and we should have made more of an effort to bring that project under the ClickHouse Inc. umbrella. I'm probably guilty of having just a bit too much fun making something that I think is a pretty solid driver. |
Hi @genzgd, @xzkostyan, @alexey-milovidov, and @filimonov, it seems to me there are a couple of levels of issues here. First practical. I'm skeptical that merging drivers is better for users, unless it's done by extending the Python clickhouse-driver. It's been around for years and there are many thousands of applications based on it. Their owners are not going to port to a new driver. Making backwards-incompatible changes through some sort of "split-the-difference" approach would also hurt them. Think Python2 -> Python3 and you know what I mean. It would likely result in 3 drivers, not two. At the same time it's not bad to have an open market for drivers, and in fact any software related to ClickHouse. That's the quickest way to make improvements. So if we don't merge and nobody wants to give up their work, what do we do? Have two drivers and acknowledge each other's work. Support them both. Let users decide. Next the strategic issues. Having unnecessary fights over basic infrastructure components is not good for users and not good for ClickHouse as a whole, including anyone who makes their living from it. We're wasting resources and fragmenting effort. It leads to a poor user experience. It also takes away focus from getting features like async insert or TTLs to work easily and reliably. Databases win on core functionality, not on drivers. I suggest that we back things up a bit and take this as an opportunity learn how to collaborate better as a community. In this particular case it would perhaps have been better for a new driver to start as a fork of clickhouse-driver. That's a common way to improve things and would have given users a natural path to a better driver if that's how it turns out. Or everybody could have agreed up front and attacked it as a community using code from a single project. That's water under the bridge. It's not a big deal if it happens occasionally. What would be bad is to repeat this pattern on every piece of software related to ClickHouse. That's a point that deserves some collective thought. |
A quick codicil: databases have traditionally prospered with both ODBC and native clients, which have a lot of overlap. It's not winner take all by any means. |
@hodgesrm Not surprisingly I share your skepticism about merging drivers, as they are fairly opinionated about how to handle datatypes and have different approaches for low level streaming/buffer/data conversion. Also for the record when I started this experiment I could not have done a fork of I also agree that basic competition in software is good. ClickHouse would not exist if someone had just said "oh, well fork MySQL to do analytic workloads, it's got a lot of stars on Github" (did Github exist in 2008?) I also think the current highly competitive environment, where ClickHouse has become a popular target, has pushed us to make it a better product. However, I don't think it's quite so simple as "let the users decide", since that doesn't tell us where to direct resources or what to emphasize with our customers. I think a significant reason for this "fight" is that ClickHouse Inc is officially supporting I further believe that Python support is a pretty core ClickHouse requirement. Snowflake and BigQuery offer "first class citizen" status for their Python integrations (not to mention DuckDB), and the data science/ML world is one of our biggest target markets. I've been quite surprised about the number of questions and issues I see about Pandas which so far have exceeded anything related to native Python data types. For that reason and others I don't consider working on an improved driver "a waste of resources", and I don't think the effort expended is taking away anything from other undertakings like async inserts (I doubt Alexey is going to draft me into the core C++ work any time soon). Anyway, the upshot is I don't have an answer, and the strategic choice is above my pay grade. Again we should have touched based with the |
Hi @genzgd, thanks for the thoughtful response. It seems like a missed opportunity not to work on drivers together. It still leaves open the question of what you think would be the right response to users who have committed to clickhouse-driver. I would expect them to continue to use @xzkostyan's work. We collectively have an obligation to support them. |
@genzgd, should we add a note with these recommendations to the new python driver documentation? Because currently the documentation is confusing and it suggests the new driver by default. And it's really hard to find the old one using a search in the documentation. A note will explain that an HTTP-based |
FYI, as one of chproxy maintainers, we're working to make chproxy compatible with the native protocol. |
As someone with no hard requirements on HTTP, the current state of the JSON issue leaves me with a very different impression @tavplubix and I think it'd be a disservice to new users to recommend it as a defacto option. I'm new to Clickhouse and unfortunately spent a non-trivial amount of time getting clickhouse-driver to work on AWS Lambda, only to stumble upon the JSON issue linked above. I don't want to downplay what's a really impressive effort from an outside contributor, but it definitely seems less risky to go with something driven by Clickhouse, and it's not surprising to see the official docs reflect that. |
@genzgd What about gRPC support? Looks pretty "HTTP" to me. :) |
gRPC would actually be fairly easy, but you're the first person I've seen mention it. :) Also my limited review of the ClickHouse gRPC interface suggests that the advantages are fairly limited, since it doesn't really provide low level object types (just a Result.output object that contains a binary blob of the actual response in the selected ClickHouse output format). Accordingly it feels like in some ways it would be harder to implement streaming, for example. In any case, feel free to open a feature request and see if you can generate additional interest, and we'll take that in account in our priorities. |
streamable Result is available :) |
It's good to have more than one successful driver. When one fails, you can check on the other. Next, you start an issue) |
The native interface still retains certain advantages over HTTP, including better compression and progress packets. It would be nice to support both options. At this point this would mean:
nativeclient
subclass and implementing the required methodsHowever this looks like a lot less work now than it did last March
The text was updated successfully, but these errors were encountered: