-
Notifications
You must be signed in to change notification settings - Fork 3
Home
It is a plugin module that enables SQLite as a standalone utility to query structured text files using its own SQL engine. I considered structured text as format that you can definitively specify its format via a set of parsing rule. It is the intent of this project to provide the flexibility to specify complex parsing rules.
Import data set can be a single file or a bunch of files with the same structure in a sub-directory. You can use wildcard characters like *
and ?
to pattern match file that are of interest to you. You can specify complex parsing instruction beyond just tabs and line feed. Some use cases are over here.
SQL is a widely use language to query data in databases. Why not extend it to query data in text files too but without the heavy infra-structure requirement of a traditional RDMS or Hadoop?
Aggregation would be the most common use case. You don't have any infrastructure ready/in-place or it is not economical to load up a database staging table to query the data. It can be used as an alternate tool to validate the output generated using a different tool but consuming the same input data set. It can be a replacement for the default CSV import utility to import complex delimited file. INSERT INTO ... SELECT ... construct can be used to load a local SQLite table.
Today machine generated data is abundant as in the log files that are spewing out 24 by 7. If the data files are not stored in an enabling platform like Hadoop, you may have to write custom programs or scripts to analyze your data on your local file system. Why not express your query in SQL?
This is just another tool for your tool chest. Use it where it fits but it is by no mean a single tool to solve all problems.
SQLite provide the SQL engine to execute your query. You can download it from here. This project provide the low level parser for your text data via an extension module that SQLite calls to get rows for its query engine.
There is nothing to install. Either you download a pre-compiled binary or build from the source code and place it in directory you have access to. Set your environment variable LD_LIBRARY_PATH
to include this directory . You also need access to the SQLite folder for its header files or the its executable interactive shell.
Click of the MD5 link below to download the compiled extension module for your platform. This module is developed under MinGW and GCC version 4.8.1. Other combination of environment and compiler version have yet to be tested.
SQLite | Windows 7 32-bit | Windows 7 64-bit |
---|---|---|
3.8.4.3 | 11f524bb14869eb1360230729f154dd7 | |
3.8.6.0 | cdf2952c0bb01603c3b6edaf1e4eddd1 | |
3.8.7.4 | 6a819ddecf270865711df8357d98fbf7 |
If you want to build it from scratch, download the source code and make sure the correct tool-chain for your platform is available. The SQLite download page is here. The following was capture from my terminal:
$ ls -g
-rw-r--r--@ 1 staff 120647 Apr 4 00:21 shell.c
-rw-r--r--@ 1 staff 5175934 Apr 4 00:21 sqlite3.c
-rw-r--r--@ 1 staff 355882 Apr 4 00:21 sqlite3.h
-rw-r--r--@ 1 staff 26110 Apr 4 00:21 sqlite3ex
-rw-r--r--@ 1 staff 106500 May 31 10:43 tfr.c
$
$ gcc -O2 -DSQLITE_THREADSAFE=0 shell.c sqlite3.c -ldl
$ gcc -O2 -shared -o tfr.dll tfr.c
The following are the general steps:
- Invoke SQLite from your command prompt. See documentation here.
- Load the extension module once. See documentation here.
- Create a virtual table using the loaded module. The TFR module grammar is describe with Railroad Diagram.
- Write a SQL query against the table created in step #3. See documentation here. Since your text files have no indices, do not expect it to behave like an indexed table.
The following was capture from my terminal:
$ sqlite3
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>
sqlite> .load tfr.dll
sqlite> CREATE VIRTUAL TABLE myVirtualTable USING tfr (
In addition to the ROWID pseudo column provided by SQLite, there are 7 hidden pseudo columns provided by this module, that you can use in your SQL query, defined as follows:
Column | Type | Comment |
---|---|---|
RecNum_ | INT | Unique Record number within the file. |
RecSize_ | INT | Raw record size in buffer. |
FileSize_ | INT | File size. |
FileDate_ | TEXT | File creation timestamp. |
FileName_ | TEXT | File name. |
FileDir_ | TEXT | File directory. |
HostName_ | TEXT | Host server. |
The following is an example:
SELECT FileName_ ,COUNT(1) RecCount
FROM MyVirtualTFRTable
GROUP BY FileName_
;
They are over here.
These tables you will be creating are virtual. If you start up SQLite as an in-memory database there is no where to store your table definition. However, you can use an init file (myTFR.init
) to save your definition just like a catalog. The following is the "init" file using SQLite special shell command:
.explain OFF
.header ON
.mode column
.nullvalue "(null)"
.separator " | "
.timer ON
.load tfr.dll
.print
.print 'Defining myVirtualTable table ...'
CREATE VIRTUAL TABLE myVirtualTable USING tfr.dll (
'path/to/your/file_*.txt'
,field1
,field2
,field3
);
You would start your session from the command line prompt by using the "-init" option as follows:
$ sqlite3 -init myTFR.init
The following are some ideas. They may or may not be a continuation of this project but a separate project by them self.
- More enhancements.
- Support variable position key/value fields. Think URLs query string.
- Directly read a compressed file. The following are the formats being consider:
- lz4, gz, bz2, lzo
- Better handle text other than ASCII.
- UTF8, UTF16, UTF32
- Big and Little endian.
- Byte Order Marker (BOM)
- EBCDIC
- If a header is provided in the first line of the data file, it can be used to default the column names
- Learn and convert to use the lemon parser. Need to replace my home grown parser.
- Statistical libraries.
- Port MADlib or Magma or equivalent over.
- Distinct estimation
- Implement other readers.
- ODBC reader.
- Parquet reader.
- AVRO reader.
- JSON reader.
- Distributed processing.
I want to thank the SQLite team lead by Dr. Richard Hipp for developing and maintaining a wonderful tool for our industry. I want to thank Jay A. Kreibich author of the book "Using SQLite" that provided me the guidance to write this external module having not touch the C language in more than 25 years. Also, Michael Owen's example found on the web. I want to thank my past and present professional colleagues that inspired me to take on this endeavor. Not to forget my family that absorbed all my domestic duties to free up time for me to write and debug code. Love those memory leaks. 😅