Tasks

[#C] Add a summarize class

Other data analysis toolkits have something like a summary function that computes arguably sensible properties of columns. It shows things like min, max, and mean for numeric columns, or the number of unique values for textual columns.

We may be able to implement a similar utility with a type class whose instance for each column type is a Fold.

For example, suppose we have the following Frame:

user id	age	gender	occupation	zip code
1	24	M	technician	85711
2	53	F	other	94043
3	23	M	writer	32067
4	24	M	techician	43537

The summary for a single column like occupation (a categorical variable) might look like this:

category	count
technician	2
other	1
writer	1

However, the summary for a column like age (an interval variable) could look like this:

parameter	value
min	24
max	53
mean	2.5
stddev	1.29
median	2.5

The summary of a whole frame might just go through summaries for each column:

entity	attribute	value
user_id	class	Num
user_id	class	Key
user_id	min	1
user_id	max	4
user_id	mean	2.5
user_id	stddev	1.29
user_id	median	2.5
user_id	unique	4
age	class	Interval
age	class	Ratio
age	min	24
age	max	53
age	mean	2.5
age	stddev	1.29
age	median	2.5
gender	class	Categorical
gender	numCategories	2
occupation	class	Categorical
occupation	numCategories	3
zip code	class	Categorical
zip code	numCategories	4

[#B] Support overriding specific columns for type inference

We currently support overriding all or none, but something more granular will be helpful. In particular with things like data types for categorical variables where the inference is more likely to be wrong.

[#A] Support types like `Either Categorical Double`

The idea is that a numeric column could have several “failure modes” that we might like to enumerate.

[#A] Demonstrate grouping

Show how to perform a fold that performs a computation based on some grouping criteria.

[#A] Revisit Joins

Get a benchmark sorted out.

Debugging inference in the REPLp

Some handy examples,

foldCoRec parsedTypeRep (bestRep "r" :: CoRec ColInfo CommonColumns)

> =Definitely Text

foldCoRec parsedTypeRep (bestRep "23" :: CoRec ColInfo CommonColumns)

> =Definitely Int

Stage Restrictions

One area where GHC’s stage restrictions on Template Haskell bite us is in stating how to parse a particular file. The problem is that we parse the file once at compile time to generate type declarations, and again at runtime to read the values. As a reminder, GHC’s stage restriction means that values we pass to a splice must be literals or imported from another module. To be clear, we can’t do this,

x :: Foo
x = foo 23

mySplice x "skidoo"

myData :: RuntimeFoo
myData = readStuff x "skidoo"

This means that a value representing parser options must be imported so that it can be used during both phases. At the moment, the only parser options are defining how columns are separated, and whether or not there is a header row (the absence of a header is indicated by explicitly providing column names). We can capture most of the needed functionality by passing a separator character and a list of strings to the TH splice. This is a slight wart as any further parser options would extend the type of every parsing function. Using a record for options would mean that we could add options without having to change every type signature.

A Benefit to Duplication?

Another drawback of passing parsing options as literals is that it exacerbates another problem: repeating the name of the file to be parsed. Specifically, we need to provide the name for the template haskell splice that produces all the relevant declarations, and again for the runtime code that reads the data file. A minor advantage of this duplication is that we can provide a model file for the type declarations, and a lower quality data file that we want to analyze. This offers a way to infer tighter types than the noisy data would allow so that malformed records can more easily be discarded when they fail to parse at the specific type.

Options

To be concrete, if we do not use a record for parser options, we could always pass the unpacked parser options wherever they are needed.

tableTypesOpt '|' ["name", "age", "occupation"] "Users" "data/users.dat"

userData :: Producer Users IO ()
userData = readTableOpt '|' ["name", "age", "occupation"] "data/users.dat"

The duplication of the column names is atrocious. We could declare all Users-related types and values, and the definition of userData at once to avoid repeating ourselves, but this seems like it might become an unwieldy splice.

The best choice is for the splice to declare a value usersParser that readTableOpt could then use. This works out quite nicely.

Prettying TH Splice Dumps

At the GHCi repl

> :set -XQuasiQuotes -XTemplateHaskell
> import Language.Haskell.TH
> putStrLn $(tableTypes "base" "base.csv" >>= stringE . show . ppr_list)

> set -XOverloadedStrings -XQuasiQuotes TempalteHaskell
> import Data.Char
> import Data.List
> import Frames
> import Frames.CSV
> let stripModule = until (\w -> length w == 1 || not ("." `isInfixOf` w)) (tail . dropWhile isAlpha)
> let onWords f xs = takeWhile isSpace xs ++ unwords (map f (words xs))
> putStrLn . unlines . map (onWords stripModule) $ lines $(tableTypes' (rowGen "data/ml-100k/u.user") {rowTypeName = "User", columnNames = ["user id", "age", "gender", "occupation", "zip code"], separator = "|"} >>= stringE . show . ppr_list)

Using ghc-mod and some elisp helpers

Dumping the definitions created by the TH splices results in a pretty unreadable mess. Here’s how to use these functions to clean things up:

Evaluate the three elisp definitions here
Hit C-c C-e to get ghc-mod to evaluate all splices
Copy the contents of the *GHC Info* buffer to somewhere like your *scratch* buffer (because *GHC Info* is read-only)
Run M-x pretty-splices in that buffer

(defun replace-stringf (from to)
  (beginning-of-buffer)
  (while (search-forward from nil t)
    (replace-match to nil t)))

(defun replace-regexpf (from to)
  (beginning-of-buffer)
  (while (re-search-forward from nil t)
    (replace-match to nil nil)))

(defun pretty-splices ()
  (interactive)
  ;; Fix newlines
  (replace-stringf (rx (char ?\0)) "
")
  ;; Unqualify names
  (replace-stringf "GHC.Types.:" "':")
  (replace-stringf "Data.Text.Internal." "")
  (replace-stringf "Data.Text." "T.")
  (replace-stringf "GHC.Types.Int" "Int")
  (replace-stringf "GHC.Base." "")
  (replace-stringf "Frames.Col." "")
  (replace-stringf "Data.Proxy." "")
  (replace-stringf "Data.Vinyl.TypeLevel." "")
  (replace-stringf "Data.Vinyl.Core." "")
  (replace-stringf "Frames.Rec." "")
  (replace-stringf "Data.Vinyl.Lens." "")
  (replace-stringf "Frames.CSV.ParserOptions" "ParserOptions")

  ;; Erase inferrable type
  (replace-regexpf "(Frames.TypeLevel.RIndex .*?)" "")

  ;; Make `:->' infix
  (replace-regexpf (rx (sequence "(:->) \""
                                 (group (0+ (not (in "\""))))
                                 "\" "
                                 (group (0+ (not (in " "))))))
                   "\"\\1\" :-> \\2")

  ;; Make `:' infix
  (replace-regexpf (rx (sequence "((':) (" (group (0+ (not (in ")")))) ") '[])"))
                   "[\\1]")
  (let ((x 10))
    (while (plusp x)
      (replace-regexpf (rx (sequence "((':) (" (group (0+ (not (in ")")))) ") ["
                                     (group (0+ (not (in "]")))) "])"))
                       "[\\1, \\2]")
      (decf x)))

  ;; Newline before top-level type signature
  (replace-regexpf "^    [^ ]+ ::" "
\\&")
  ;; Newline before single-line type synonym definitions
  (replace-regexpf "^    type [^ ]+ = [^ ]+.*$" "
\\&"))

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Frames-notes.org

Frames-notes.org

Tasks

[#C] Add a summarize class

[#B] Support overriding specific columns for type inference

[#A] Support types like `Either Categorical Double`

[#A] Demonstrate grouping

[#A] Revisit Joins

Debugging inference in the REPLp

Stage Restrictions

A Benefit to Duplication?

Options

Prettying TH Splice Dumps

At the GHCi repl

Using ghc-mod and some elisp helpers

Files

Frames-notes.org

Latest commit

History

Frames-notes.org

File metadata and controls

Tasks

[#C] Add a summarize class

[#B] Support overriding specific columns for type inference

[#A] Support types like Either Categorical Double

[#A] Demonstrate grouping

[#A] Revisit Joins

Debugging inference in the REPLp

Stage Restrictions

A Benefit to Duplication?

Options

Prettying TH Splice Dumps

At the GHCi repl

Using ghc-mod and some elisp helpers

[#A] Support types like `Either Categorical Double`