Universal SQL engine for Elixir.

This library implements a SQL engine to perform queries on user-provided databases, using a simple interface inspired by PostgreSQL's Foreign Data Wrappers. This allows you to use SQL on your own data and virtual tables.

For example, it includes a CSV reader and an HTTP client, so that you can run queries such as:
```sql
SELECT url, status_code
FROM urls
INNER JOIN request
ON urls.url = request.url
```
There is a simple REPL to be able to test ExoSQL:

```
iex> ExoSQL.repl()
exosql> SELECT m, SUM(price) FROM generate_series(10) as m LEFT JOIN (SELECT width_bucket(price, 0, 200, 10) AS n, price FROM products) ON n = m GROUP BY m
tmp.m.m | tmp.tmp.col_2
--------------------------
      1 | 31
      2 | 30
      3 | 0
      4 | 0
      5 | 0
      6 | 0
      7 | 0
      8 | 0
      9 | 0
     10 | 0
```
The library originated as a SQL layer over all the services connected to your Serverboards. Each service can export tables to be accessed via SQL, and the data can then be shown in the dashboards or the notebook, or used in the rules.
The package can be installed by adding `exosql` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:exosql, "~> 0.2"}
  ]
end
```
- `SELECT` over external databases (CSV, HTTP endpoints... programmable)
- `SELECT` over several tables
- `WHERE`
- `INNER JOIN`
- `LEFT JOIN`
- `RIGHT JOIN`
- `GROUP BY`
- `ORDER BY`
- `OFFSET` and `LIMIT`
- `DISTINCT` and `DISTINCT ON`
- `LIKE` and `ILIKE`
- `CASE WHEN THEN ELSE END` / `IF THEN ELIF ELSE END`
- `UNION` and `UNION ALL`
- `RANGE`
- `WITH` common table expressions
- `CROSS JOIN LATERAL`
- `CROSSTAB` / `CROSSTAB ON`
- Table and column aliases with `AS`
- Nested `SELECT`: at `FROM`, `SELECT`, `WHERE`...
- `generate_series` function tables
- Aggregation functions: `COUNT`, `SUM`, `AVG`...
- Builtin functions and operators: `*` `/` `+` `-` `||`, `or`, `and`, `in`, `not`; `round`, `concat`... See all.
- Builtin `format`, `strftime`, `regex` and more string and time formatting functions.
- Basic reflection over `self.tables`
- JSON support via JSON pointer.
- Array support: `[1, 2, 3, 4]`
- Variables
- Comments (`--` SQL style)
Check the tests for the currently available features.
Variables can be passed in a dictionary at `__vars__` inside the context, and referenced as `$name` in the SQL expression. This may change in the future to streamline it more with standard SQL (no need for `$`).

There is a special variable `debug` that, when true, enables extra debug output to the logger, for example each step of the execution of the query.
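As a sketch of how variables fit into a query (the `products` table follows the earlier examples, but the `created_at` column is made up for illustration), a value placed under `__vars__` in the context can be referenced directly:

```sql
-- assumes a context with "__vars__" => %{ "start" => "2018-01-01" }
SELECT name, price
FROM products
WHERE created_at >= $start
```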
Because some columns may need to be autogenerated depending on the query, if you want to access those columns you may need to use INNER JOINs. This way the planner asks for those specific column values.

For example:

```sql
SELECT * FROM request
```

does not know which URL you want to access, but:

```sql
SELECT * FROM request WHERE url = 'http://serverboards.io'
```

knows the URL and can get the data.

In the same way, on INNER JOINs this can be used to access autogenerated data:

```sql
SELECT url, status_code
FROM urls
INNER JOIN request
ON urls.url = request.url
```
ExoSQL can do limited lateral joins on table expressions. This allows, for example, a JSON array to be unnested into several rows:

```sql
SELECT id, email, name FROM json CROSS JOIN LATERAL unnest(json, 'email', 'name')
```

Currently only support for expressions is ready; nested queries are not functional yet.
Crosstab is a custom extension, not in the SQL standard, and other implementations may use other syntax. Serverboards tries another option:

```sql
SELECT CROSSTAB ON (sugus, lollipop)
    user, product, sum(amount)
FROM
    productsales
```

It will result in a table with this form:

user | sugus | lollipop |
---|---|---|
David | 10 | 20 |
Anna | NULL | 15 |
... | ... | ... |

It follows the syntax of `DISTINCT`. If `ON` is provided, only those columns are extracted, which in some cases may be completely empty. If it is not present, all possible columns are returned, in alphabetical order.
Crosstabs have the following caveats:
- Column names may not be known at plan time, so if you need any specific column for subqueries, you need to use the `ON` version. The first column name is always known.
- Crosstab is performed just after the select. If you need to order the rows, it has to be by column number. See `ORDER BY` for the full rationale. As a workaround you can use a nested select.
Sort can be used, per the SQL standard, by column name or result column number.

Due to the way ExoSQL works, the ORDER operation is done in two steps:

- Before the select, it orders by expressions.
- After the select, it orders by the resulting column number.

This is because after the select we no longer have access to all the column names, only the resulting ones; and before the select there is no access to the column number results.

This has two important implications:

- There is a bug when you mix `ORDER BY` expression and column number: the order by number will always take precedence over the order by expression.
- At `CROSSTAB`, `ORDER BY` name just doesn't work. When data gets into the crosstab algorithm the order is not specified for rows, and is alphabetical for columns. It is possible to order by number, as that happens after the crosstab.
Prints the value of that string to the Elixir `Logger.debug`. Returns the same value. Can be useful for debugging.
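A minimal sketch (the `urls` table comes from the earlier examples); since `debug` passes its value through, it can wrap any expression inside a query:

```sql
SELECT debug(url), status_code
FROM urls
INNER JOIN request
ON urls.url = request.url
```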
Formats a string using C sprintf-like parameters. Known placeholders are:

- `%s` -- String
- `%10s` -- String, aligned right, padded with spaces to 10 chars (padleft)
- `%-10s` -- String, aligned left, padded with spaces to 10 chars (padright)
- `%d` -- Integer number
- `+%d` -- Integer number, always add sign
- `%02d` -- Number padded with 0 to fill 2 chars
- `%f` -- Float
- `%.2f` -- Float with precision
- `+%f` -- Float, always add sign, 2 chars of precision
- `%k` -- Metric system suffix: k, M, G, T. Tries to show the most relevant information.
- `%.2k` -- Metric system suffix with precision
- `%,2k` -- Metric system, using `.` to separate thousands and `,` for decimals. Follows the Spanish numbering system.
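As an illustrative sketch (the `products` table follows the earlier examples, but the `downloads` column is made up), several placeholders can be combined in a single format string:

```sql
SELECT format("%-10s %.2f %k", name, price, downloads)
FROM products
```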
Lower cases a full string.

Joins all elements from a list into a string, using the given separator.

```
join([1,2,3,4], "/")
"1/2/3/4"
```

Splits a string into a list using the given separator.

```
split("1, 2,3 4")
["1", "2", "3", "4"]
```

Extracts a substring from the first argument. Can use negative indexes to count from the end.

```
substr('#test#', 1, -1)
"test"
```

Converts the given argument into a string.

```
to_string(1)
"1"
```
Upper cases a full string.

Returns how many of `unit` have passed from start to end.
Returns current datetime.
Converts a datetime to a string. Can also be used to extract some parts of a date, such as the day, year and so on.

Normally `strftime` can be used directly with a string or an integer, as it does the conversion to datetime implicitly.

It is based on Timex formatting.

Most common markers:

- `%Y` -- Year, four digits
- `%y` -- Year, two digits
- `%m` -- Month number
- `%d` -- Day of month
- `%H` -- Hour
- `%M` -- Minute
- `%S` -- Second
- `%V` -- ISO week (01-53)
- `%s` -- Unix time
- `%F` -- ISO date: yyyy-mm-dd
- `%T` -- Time: HH:MM:SS
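For instance (a sketch; the epoch value is arbitrary), extracting the year and month of a Unix timestamp, relying on the implicit datetime conversion:

```sql
SELECT strftime(1518108000, "%Y-%m")
```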
Converts the given string or integer to a date.

The string must be in an ISO 8601 subset format:

- `YYYY-mm-dd`
- `YYYY-mm-ddTHH:MM`
- `YYYY-mm-dd HH:MM`
- `YYYY-mm-ddTHH:MM:SS`
- `YYYY-mm-dd HH:MM:SS`

or a Unix epoch integer.

This is called implicitly on `strftime` calls, and is normally not needed.
The last argument can be a modifier to add or subtract time, or a timezone shifter.

`mod` MUST start with `+` or `-`.

If `mod` is given, it is a duration modifier as defined by ISO 8601, with the following changes:

- Must start with `+` or `-`
- A subsequent `P` is optional

For example:

- Subtract one month: `to_datetime(NOW(), "-1M")`
- Add 30 minutes: `to_datetime(NOW(), "+T30M")`
- One year and a half and 6 minutes ago: `to_datetime(NOW(), "-1Y1MT6M")`
If a timezone is provided, the datetime will be changed from the current timezone to the provided one, making the appropriate changes to the datetime.

For example, `to_datetime(0, 'Europe/Madrid')` gives the Madrid time for Unix epoch 0: `1970-01-01 01:00:00+01:00 CET Europe/Madrid`.
Check https://en.wikipedia.org/wiki/List_of_tz_database_time_zones for the timezone names.
Converts to boolean. Equivalent to `NOT NOT arg`.
Calculates the average of the calculated expression on the group rows. Equivalent to `sum(expr) / count(expr)`.

If there are no rows, returns `NULL`.
Counts the number of rows of the aggregated expression.
Returns the maximum value of the given expression for the group.
Returns the minimum value of the given expression for the group.
For each of the grouped rows, calculates the expression and returns the sum. If there are no rows, returns 0.
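A small sketch putting these aggregation functions together (the `products` table and its `price` column follow the earlier examples):

```sql
SELECT COUNT(*), SUM(price), AVG(price), MIN(price), MAX(price)
FROM products
```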
Returns the absolute value of a number.

Rounds a number up.

Rounds a number down.
Returns the natural logarithm of the given number.
Returns the base 10 logarithm of the given number.
Returns the power of the first number to the second. For example, `power(2,2)` returns `4`.
Returns a random float between 0 and 1.

Returns a random integer between `min` and `max`.
Returns the number rounded to the given precision. May be converted to an integer if precision is 0.
Returns `-1`, `0` or `1` depending on whether the number is `<0`, `=0` or `>0`.
Returns the square root of the given number. See the `power` function for more advanced roots.
Returns the number rounded to the given precision. May be converted to an integer if precision is 0. It is an alias for round.
Returns the first not NULL value.
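A brief sketch (the `users` table and its columns are invented for illustration); arguments are tried left to right until one is not NULL:

```sql
SELECT coalesce(nickname, name, 'anonymous') FROM users
```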
This function generates a virtual table with one column, where each row holds a value of the series. It can be reversed with a larger start than end and a negative step.

It can be used, for example, to fill all holes in a time series:

```sql
SELECT month, SUM(value)
FROM generate_series(12) AS month
LEFT JOIN purchases
ON strftime(purchases.datetime, "%m") == month
GROUP BY month
```

This will return 0 for empty months in the purchases table.
Gets the greatest value of all the given ones. It's similar to `max` in other languages, but in SQL we cannot use `max` as it is an aggregation function with different semantics.
Parses a string into a JSON object. NULL is nil.

Does JSON Pointer selection:

- Use `/` to navigate through the object keys or array indexes.
- If no data is found, returns `NULL`.
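For instance (a sketch; the `json` table follows the earlier examples, but the pointer path is made up), selecting a nested key from each row's JSON column:

```sql
SELECT jp(json, '/address/city') FROM json
```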
Gets the lowest value of all the given ones. It's similar to `min` in other languages, but in SQL we cannot use `min` as it is an aggregation function with different semantics.
Get the lower bound of a range.
Returns NULL if A and B are equal; else returns A. This is used, for example, to set a default value:

```sql
SELECT coalesce(nullif(name, ''), 'John Doe') FROM users
```
Returns a half-open interval `[start, end)` that can later be used to get an intersection (`*`) or membership. The range includes the `start` but not the `end` (all X such that `start <= X < end`). This is important for later datediff and similar.

For example, the following query will check if `NOW()` is in the intersection of some range given by parameters `$start` and `$end` and the range set by the columns `start` and `end`:

```sql
SELECT NOW() IN (range(start, end) * range($start, $end))
FROM dates
```

It works for dates, texts and numbers alike.

Ranges can be decomposed with `lower(range)` and `upper(range)`.
Performs a regex search on the string and returns the first match. It uses Elixir regexes, so use those as reference.

It can use groups and named groups for matching, and it will return a list or a map with the result. It can optionally use JSON pointer queries directly; see the `jp` function.

If it matches, the result will be "trueish" (or "falsy" if it doesn't), so it can be used as a boolean.

Similar to `regex(str, regex, query \\ nil)`, but returns all matches in a list.
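A hedged sketch (the pattern and pointer are chosen for illustration; the `urls` table follows the earlier examples), extracting a named group via a JSON pointer query:

```sql
SELECT regex(url, '^(?<proto>[a-z]+)://', '/proto') FROM urls
```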
Expands a JSON or an array to be used on `LATERAL` joins.

It converts the array or JSON representation of an array to a list of lists, as required by the `LATERAL` joins. Optionally, if column names are given, the items are expanded as such columns.

For example:

```sql
SELECT id, email, name FROM json CROSS JOIN LATERAL unnest(json, 'email', 'name')
```

For each column it expands the `json` array, getting only the `email` and `name` of each item, so the final result has all the emails and names for all the arrays at the json table.

A.json.id | tmp.unnest.email | tmp.unnest.name | A.json.json |
---|---|---|---|
1 | one@example.com | uno | [{"email": "one@example.com", "name": "uno"}, {"email": "two@example.com", "name": "dos"}] |
1 | two@example.com | dos | [{"email": "one@example.com", "name": "uno"}, {"email": "two@example.com", "name": "dos"}] |
2 | three@example.com | tres | [{"email": "three@example.com", "name": "tres"}, {"email": "four@example.com", "name": "cuatro"}] |
2 | four@example.com | cuatro | [{"email": "three@example.com", "name": "tres"}, {"email": "four@example.com", "name": "cuatro"}] |

The last column has the original JSON of each json line (2 lines), but it is expanded to four lines.
Get the upper bound of a range.
Parses a URL and returns a JSON. If a selector is given, it does the equivalent of calling `jp` with that selector.
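As an illustrative sketch (the URL is arbitrary, and the `/host` key is an assumption about the parsed JSON's shape), extracting one part of a URL:

```sql
SELECT urlparse('https://serverboards.io/download/?q=test', '/host')
```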
Given a value `n`, it is assigned a bucket between 0 and `buckets`, corresponding to the full width between `start` and `end`. If a value is out of bounds, it is set either to 0 or to `buckets - 1`.

This helper eases the generation of histograms. For example, a histogram of prices:

```sql
SELECT n, SUM(price)
FROM (SELECT width_bucket(price, 0, 200, 10) AS n, price
FROM products)
GROUP BY n
```

or, more complete, with filling of zeroes:

```sql
SELECT m, SUM(price)
FROM generate_series(10) AS m
LEFT JOIN (
SELECT width_bucket(price, 0, 200, 10) AS n, price
FROM products
)
ON n = m
GROUP BY m
```
ExoSQL has been developed with the idea of connecting to Serverboards services, and as such it does not provide more than some test extractors:

- CSV files
- HTTP requests

Creating new ones is a very straightforward process; the HTTP example can be followed.

This is not intended to be a full database system, but to be embedded into other Elixir programs and accessible from them by end users. As such, it contains only some basic extractors that are needed for proper testing.

There is no formal documentation yet, but you can check the `esql_test.exs` file to get an idea of how to use ExoSQL.
Example:

```elixir
context = %{
  "A" => {ExoSQL.Csv, path: "test/data/csv/"},
  "B" => {ExoSQL.HTTP, []},
  "__vars__" => %{ "start" => "2018-01-01" }
}

{:ok, result} = ExoSQL.query("
  SELECT urls.url, request.status_code
  FROM urls
  INNER JOIN request
  ON urls.url = request.url
", context)

%ExoSQL.Result{
  columns: [{"A", "urls", "url"}, {"B", "request", "status_code"}],
  rows: [
    ["https://serverboards.io/e404", 404],
    ["http://www.facebook.com", 302],
    ["https://serverboards.io", 200],
    ["http://www.serverboards.io", 301],
    ["http://www.google.com", 302]
  ]}
```
A simple extractor can be:

```elixir
defmodule MyExtractor do
  def schema(_config), do: {:ok, ["week"]}
  def schema(_config, "week"), do: {:ok, %{ columns: ["id", "nr", "name", "weekend"] }}

  def execute(_config, "week", _quals, _columns) do
    {:ok, %{
      columns: ["id", "nr", "name", "weekend"],
      rows: [
        [1, 0, "Sunday", true],
        [2, 1, "Monday", false],
        [3, 2, "Tuesday", false],
        [4, 3, "Wednesday", false],
        [5, 4, "Thursday", false],
        [6, 5, "Friday", false],
        [7, 6, "Saturday", true],
      ]
    }}
  end
end
```
And then a simple query:

```elixir
{:ok, res} = ExoSQL.query("SELECT * FROM week WHERE weekend", %{ "A" => {MyExtractor, []}})
ExoSQL.format_result(res)
```

A.week.id | A.week.nr | A.week.name | A.week.weekend |
---|---|---|---|
1 | 0 | Sunday | true |
7 | 6 | Saturday | true |
There are other implementations of this very same idea:
- Postgres Foreign Data Wrappers (FDW). Integrates any external source with a postgres database. Can be programmed in C and Python. Postgres FDW gave me the initial inspiration for ExoSQL.
- Apache Foundation's Drill. Integrates NoSQL database and SQL databases.
- Apache Foundation's Calcite. A Java-based library, very similar to ExoSQL, with many, many adapters. Many projects use parts of Calcite; for example, Drill uses its SQL parser.
If you know any other, please ping me and I will add it here.
I developed ExoSQL as I needed an Elixir solution for an existing project, and to learn how to create an SQL engine. ExoSQL is currently used in Serverboards KPI.
- When doing `ORDER BY [column id], [column name]`, it reverses the order. To avoid this, use one or the other; don't mix ordering by column name and by result column position. This is because the planner does the ordering by column name first, then the select (which limits the columns and reorders them), and then the ordering by column position.
- Cannot use variables inside aggregation functions.
- BEAM compile: expressions and functions first, then the full plan. Currently it uses an AST executor, which may be slower.
- Bytecode compile. Check also whether compiling to a bytecode may improve performance.
- `LIMIT` / `OFFSET` before `SELECT`. There is no need to do the full select over all the data, especially if there are format, expressions, aggregations and other expensive operations, if we only want some results. As it may be used with `SORT BY`, extra care is needed to calculate the required columns first, and then calculate the rest.
- Streaming extractors. Currently everything is stored in memory, which can be a lot.
- Multiprocessing. If possible, calculate parts of the operations in other processes. This must take care of dependencies, as `WITH` may depend on previous queries, or not.
- Allow user functions.