Skip to content
Edward Lau edited this page Jul 29, 2023 · 88 revisions

Welcome to the Text File Reader for SQLite

What does it do?

It is a plugin module that enables SQLite as a standalone utility to query structured text files using its own SQL engine. I considered structured text as format that you can definitively specify its format via a set of parsing rule. It is the intent of this project to provide the flexibility to specify complex parsing rules.

Import data set can be a single file or a bunch of files with the same structure in a sub-directory. You can use wildcard characters like * and ? to pattern match file that are of interest to you. You can specify complex parsing instruction beyond just tabs and line feed. Some use cases are over here.

SQL is a widely use language to query data in databases. Why not extend it to query data in text files too but without the heavy infra-structure requirement of a traditional RDMS or Hadoop?

What is it use for?

Aggregation would be the most common use case. You don't have any infrastructure ready/in-place or it is not economical to load up a database staging table to query the data. It can be used as an alternate tool to validate the output generated using a different tool but consuming the same input data set. It can be a replacement for the default CSV import utility to import complex delimited file. INSERT INTO ... SELECT ... construct can be used to load a local SQLite table.

Today machine generated data is abundant as in the log files that are spewing out 24 by 7. If the data files are not stored in an enabling platform like Hadoop, you may have to write custom programs or scripts to analyze your data on your local file system. Why not express your query in SQL?

This is just another tool for your tool chest. Use it where it fits but it is by no mean a single tool to solve all problems.

How is it done?

SQLite provide the SQL engine to execute your query. You can download it from here. This project provide the low level parser for your text data via an extension module that SQLite calls to get rows for its query engine.

How do you install it?

There is nothing to install. Either you download a pre-compiled binary or build from the source code and place it in directory you have access to. Set your environment variable LD_LIBRARY_PATH to include this directory . You also need access to the SQLite folder for its header files or the its executable interactive shell.

Click of the MD5 link below to download the compiled extension module for your platform. This module is developed under MinGW and GCC version 4.8.1. Other combination of environment and compiler version have yet to be tested.

SQLite Windows 7 32-bit Windows 7 64-bit
3.8.4.3 11f524bb14869eb1360230729f154dd7
3.8.6.0 cdf2952c0bb01603c3b6edaf1e4eddd1
3.8.7.4 6a819ddecf270865711df8357d98fbf7

If you want to build it from scratch, download the source code and make sure the correct tool-chain for your platform is available. The SQLite download page is here. The following was capture from my terminal:

$ ls -g
-rw-r--r--@ 1 staff   120647 Apr  4 00:21 shell.c
-rw-r--r--@ 1 staff  5175934 Apr  4 00:21 sqlite3.c
-rw-r--r--@ 1 staff   355882 Apr  4 00:21 sqlite3.h
-rw-r--r--@ 1 staff    26110 Apr  4 00:21 sqlite3ex
-rw-r--r--@ 1 staff   106500 May 31 10:43 tfr.c
$
$ gcc -O2 -DSQLITE_THREADSAFE=0 shell.c sqlite3.c -ldl
$ gcc -O2 -shared -o tfr.dll  tfr.c

How do you use it?

The following are the general steps:

  1. Invoke SQLite from your command prompt. See documentation here.
  2. Load the extension module once. See documentation here.
  3. Create a virtual table using the loaded module. The TFR module grammar is describe with Railroad Diagram.
  4. Write a SQL query against the table created in step #3. See documentation here. Since your text files have no indices, do not expect it to behave like an indexed table.

The following was capture from my terminal:

$ sqlite3
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> 
sqlite> .load  tfr.dll 
sqlite> CREATE VIRTUAL TABLE  myVirtualTable  USING  tfr (

What else is there?

In addition to the ROWID pseudo column provided by SQLite, there are 7 hidden pseudo columns provided by this module, that you can use in your SQL query, defined as follows:

Column Type Comment
RecNum_ INT Unique Record number within the file.
RecSize_ INT Raw record size in buffer.
FileSize_ INT File size.
FileDate_ TEXT File creation timestamp.
FileName_ TEXT File name.
FileDir_ TEXT File directory.
HostName_ TEXT Host server.

The following is an example:

SELECT    FileName_ ,COUNT(1) RecCount
FROM      MyVirtualTFRTable
GROUP  BY FileName_
;

Where are the examples?

They are over here.

Is there a catalog?

These tables you will be creating are virtual. If you start up SQLite as an in-memory database there is no where to store your table definition. However, you can use an init file (myTFR.init) to save your definition just like a catalog. The following is the "init" file using SQLite special shell command:

.explain    OFF
.header     ON
.mode       column
.nullvalue "(null)"
.separator " | "
.timer      ON

.load  tfr.dll

.print
.print 'Defining myVirtualTable table ...'
CREATE VIRTUAL TABLE myVirtualTable USING tfr.dll (
      'path/to/your/file_*.txt'
      ,field1
      ,field2
      ,field3
);

You would start your session from the command line prompt by using the "-init" option as follows:

$ sqlite3 -init myTFR.init

What is there left to do?

The following are some ideas. They may or may not be a continuation of this project but a separate project by them self.

  • More enhancements.
    • Support variable position key/value fields. Think URLs query string.
    • Directly read a compressed file. The following are the formats being consider:
      • lz4, gz, bz2, lzo
    • Better handle text other than ASCII.
      • UTF8, UTF16, UTF32
      • Big and Little endian.
        • Byte Order Marker (BOM)
      • EBCDIC
    • If a header is provided in the first line of the data file, it can be used to default the column names
  • Learn and convert to use the lemon parser. Need to replace my home grown parser.
  • Statistical libraries.
  • Implement other readers.
    • ODBC reader.
    • Parquet reader.
    • AVRO reader.
    • JSON reader.
  • Distributed processing.
    • Each processing node shall work on its own local data and send the results back to the queries client to amalgamate.

Who do I credit?

I want to thank the SQLite team lead by Dr. Richard Hipp for developing and maintaining a wonderful tool for our industry. I want to thank Jay A. Kreibich author of the book "Using SQLite" that provided me the guidance to write this external module having not touch the C language in more than 25 years. Also, Michael Owen's example found on the web. I want to thank my past and present professional colleagues that inspired me to take on this endeavor. Not to forget my family that absorbed all my domestic duties to free up time for me to write and debug code. Love those memory leaks. 😅