DataGrid
class DataGrid()
DataGrid instances have the following attributes:
- columns - a list of column names, or a dict of column names mapped to column types
- data - a list of lists where each is a row of data
- name - a name of the tabular data
def __init__(data=None,
columns=None,
name="Untitled",
datetime_format="%Y/%m/%d",
heuristics=False,
converters=None)
Create a DataGrid instance.
Arguments:
- data - (optional, list of lists) the rows of data
- columns - (optional, list of strings) the column titles
- name - (optional, str) a name for the tabular data
- datetime_format - (optional, str) the Python date format used to read dates. For example, use "%Y/%m/%d" for dates like "2022/12/01".
- heuristics - (optional, bool) if True, guess that some numbers might be dates
- converters - (optional, dict) dictionary of functions to convert items into values. Keys are str (to match column name)
NOTES:
The variable dg is used below as an example DataGrid instance.
If column names are not provided, then names will be generated in the sequence "A", "B", ... "Z", "AA", "BB", ...
The DataGrid instance can be imagined as a two-dimensional list of lists. The first dimension is the row, and the second dimension is the column. For example, dg[5][2] returns the value in the 6th row and 3rd column (indices are zero-based).
Likewise, you can use the dg.append(ROW), dg.extend(ROWS), and dg.pop(INDEX) methods.
Rows can be either lists of values, or JSON-like dictionaries of the form {"COLUMN NAME": VALUE, ...}.
These are common methods to use on a DataGrid:
- dg.info() - data about rows, columns, and datatypes
- dg.head() - show the first few rows of a DataGrid
- dg.tail() - show the last few rows of a DataGrid
- dg.show() - open up an IFrame (if in a Jupyter Notebook) or a web browser page showing the DataGrid UI
Examples:
>>> from kangas import DataGrid, Image
>>> import glob
>>> dg = DataGrid(name="Images", columns=["Image", "Score"])
>>> for filename in glob.glob("*.jpg"):
...     score = model.predict(filename)  # model is a placeholder, assumed to be defined elsewhere
...     dg.append([Image(filename), score])
>>> dg.show()
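The datetime_format argument uses Python's standard strptime format codes. As a minimal sketch using only the standard library (it does not call kangas, and how kangas applies the format internally is not shown here), this is how the default "%Y/%m/%d" reads a date such as "2022/12/01":

```python
from datetime import datetime

# The default format from the constructor signature above.
datetime_format = "%Y/%m/%d"

# strptime parses text according to the format string.
parsed = datetime.strptime("2022/12/01", datetime_format)
print(parsed.year, parsed.month, parsed.day)

# Text that does not match the format raises ValueError.
try:
    datetime.strptime("01-12-2022", datetime_format)
    matched = True
except ValueError:
    matched = False
print("second string matched:", matched)
```

If your dates are written differently, for example "01-12-2022", the matching format string would be "%d-%m-%Y".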
def show(filter=None,
host=None,
port=4000,
debug=None,
height="750px",
width="100%",
protocol="http",
hide_selector=None,
use_ngrok=False,
cli_kwargs=None,
**kwargs)
Open DataGrid in an IFrame in the jupyter environment or browser.
Arguments:
- host - (optional, str) the host name or IP number for the servers to listen to
- filter - (optional, str) a filter to set on the DataGrid
- port - (optional, int) the port number for the servers to listen to
- debug - (optional, str) will display additional information from the server (may not be visible in a notebook)
- height - (optional, str) the height of the iframe in px or percentage
- width - (optional, str) the width of the iframe in px or percentage
- use_ngrok - (optional, bool) force using ngrok as a proxy
- cli_kwargs - (dict) a dictionary whose keys are the names of kangas server flags and whose values are the settings (such as: {"backend-port": 8000})
- kwargs - additional URL parameters to pass to server
Example:
>>> import kangas as kg
>>> dg = kg.DataGrid()
>>> # append data to DataGrid
>>> dg.show()
>>> dg.show("{'Column Name'} == 'category three'")
>>> dg.show("{'Column Name'} == 'category three'",
... group="Another Column Name")
def set_columns(columns)
Set the columns. columns is either a list of column names, or a dict where the key is the column name, and the value is a DataGrid type.
Valid DataGrid types are: "INTEGER", "FLOAT", "BOOLEAN", "DATETIME", "TEXT", "JSON", "VECTOR", or "IMAGE-ASSET".
Example:
>>> dg = DataGrid()
>>> dg.set_columns(["Column 1", "Column 2"])
def __iter__()
Iterate over data.
def to_csv(filename,
sep=",",
header=True,
quotechar='"',
encoding="utf8",
converters=None)
Save a DataGrid as a Comma Separated Values (CSV) file.
Arguments:
- filename - (str) the file to save the CSV data to
- sep - (str) separator to use in CSV; default is ","
- header - (bool) if True, write out the header; default is True
- quotechar - (str) the character to use to surround text; default is '"'
- encoding - (str) the encoding to use in the saved file; default is "utf8"
- converters - (optional, dict) dictionary of functions to convert items into values. Keys are str (to match column name)
Example:
>>> dg.to_csv("results.csv")
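To see what the sep, header, and quotechar arguments control, here is a minimal sketch built on the standard library csv module. The helper rows_to_csv is hypothetical and is not the kangas implementation; it only illustrates the effect each option has on the text that gets written.

```python
import csv
import io

columns = ["ID", "Label"]
rows = [[1, "cat"], [2, 'large "dog"']]

def rows_to_csv(columns, rows, sep=",", header=True, quotechar='"'):
    """Hypothetical helper: serialise rows the way the arguments describe."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter=sep, quotechar=quotechar, lineterminator="\n"
    )
    if header:
        writer.writerow(columns)  # column titles go on the first line
    writer.writerows(rows)
    return buffer.getvalue()

text = rows_to_csv(columns, rows, sep=";")
print(text)
```

Values that contain the quote character are wrapped in it and the inner quotes are doubled, so reading the text back with the same separator recovers the original strings.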
def to_dataframe()
Convert a DataGrid into a pandas dataframe.
Example:
>>> df = dg.to_dataframe()
def to_dicts(column_names=None, format_map=None)
Iterate over data, returning dicts.
Arguments:
- column_names - (optional, list of str) only return the given column names
- format_map - (optional, dict) dictionary of column type to function that takes a value, and returns a new value.
>>> dg = DataGrid(columns=["column 1", "column 2"])
>>> dg.append([1, "one"])
>>> dg.append([2, "two"])
>>> dg.to_dicts()
[
{"column 1": value1_1, "column 2": value1_2, ...},
{"column 1": value2_1, "column 2": value2_2, ...},
]
>>> dg.to_dicts(["column 2"])
[
{"column 2": value1_2},
{"column 2": value2_2},
]
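The shape of the result can be sketched in plain Python. The helper rows_to_dicts below is hypothetical and only mimics the documented behaviour on an ordinary list of lists; it does not apply format_map and is not the kangas implementation.

```python
columns = ["column 1", "column 2"]
data = [[1, "one"], [2, "two"]]

def rows_to_dicts(columns, data, column_names=None):
    """Hypothetical helper: yield one dict per row, optionally restricted."""
    wanted = columns if column_names is None else column_names
    for row in data:
        record = dict(zip(columns, row))  # pair each title with its value
        yield {name: record[name] for name in wanted}

all_columns = list(rows_to_dicts(columns, data))
only_second = list(rows_to_dicts(columns, data, column_names=["column 2"]))
print(all_columns)
print(only_second)
```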
def __getitem__(item)
Get either a row or a column from the DataGrid.
Arguments:
- item - (str or int) if int, return the zero-based row; if str, then item is the column name to return
>>> dg = DataGrid(columns=["column 1", "column 2"])
>>> dg.append([1, "one"])
>>> dg.append([2, "two"])
>>> dg[0]
[1, "one"]
>>> dg["column 1"]
[1, 2]
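The two lookup modes can be illustrated with a small stand-in class. TinyGrid is hypothetical and exists only for this sketch; it assumes the data is held as a list of lists, as described in the class notes, and is not the kangas class.

```python
class TinyGrid:
    """Hypothetical stand-in for illustrating row and column lookup."""

    def __init__(self, columns, data):
        self.columns = columns
        self.data = data

    def __getitem__(self, item):
        if isinstance(item, int):
            # int: return the zero-based row
            return self.data[item]
        if isinstance(item, str):
            # str: return every value in the named column
            index = self.columns.index(item)
            return [row[index] for row in self.data]
        raise TypeError("item must be an int (row) or a str (column name)")

grid = TinyGrid(["column 1", "column 2"], [[1, "one"], [2, "two"]])
print(grid[0])
print(grid["column 1"])
print(grid[1][1])  # row 1, then column index 1
```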
@property
def nrows()
The number of rows in the DataGrid.
Example:
>>> dg.nrows
42
@property
def ncols()
The number of columns in the DataGrid.
Example:
>>> dg.ncols
10
@property
def shape()
The (rows, columns) in the DataGrid.
Example:
>>> dg.shape
(42, 10)
@classmethod
def download(cls, url, ext=None)
Download a file from a URL.
Example:
>>> DataGrid.download("https://example.com/file.zip")
@classmethod
def read_sklearn(cls, dataset_name)
Load a sklearn dataset by name.
Arguments:
- dataset_name - (str) one of: 'boston', 'breast_cancer', 'diabetes', 'digits', 'iris', 'wine'
Example:
>>> dg = DataGrid.read_sklearn("iris")
@classmethod
def read_parquet(cls, filename, **kwargs)
Takes a parquet filename or URL and returns a DataGrid.
Note: requires pyarrow to be installed.
Example:
>>> dg = DataGrid.read_parquet("userdata1.parquet")
@classmethod
def read_dataframe(cls, dataframe, **kwargs)
Takes a columnar pandas dataframe and returns a DataGrid.
Example:
>>> dg = DataGrid.read_dataframe(df)
@classmethod
def read_json(cls, filename, **kwargs)
Read JSON data, a JSON file, or a JSON Lines file [1]. JSON should be a list of objects, or a series of objects, one per line.
Arguments:
- filename - the name of the file or URL to read the JSON from, or the data
- datetime_format - (str) the Python date format used to read dates. For example, use "%Y/%m/%d" for dates like "2022/12/01".
- heuristics - (bool) whether to guess that some float values are datetime representations
- name - (str) the name to use for the DataGrid
- converters - (dict) dictionary of functions where the key is the column name, and the value is a function that takes a value and converts it to the proper type and form.
Note: the file or URL may end with a ".zip", ".tgz", ".gz", or ".tar" extension. If so, it will be downloaded and unarchived. The JSON file is assumed to be in the archive with the same name as the file/URL. If it is not, then please use the kangas.download() function to download, and then read from the downloaded file.
[1] - https://jsonlines.org/
Example:
>>> from kangas import DataGrid
>>> dg = DataGrid.read_json("json_line_file.json")
>>> dg = DataGrid.read_json("https://instances.social/instances.json")
>>> dg = DataGrid.read_json("https://company.com/data.json.zip")
>>> dg = DataGrid.read_json("https://company.com/data.json.gz")
>>> dg.save()
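The two accepted layouts, a single JSON list of objects or one object per line, can be told apart with a few lines of standard library code. The helper parse_json_records is hypothetical and is only a sketch of the idea; how kangas detects the layout is not documented here.

```python
import json

def parse_json_records(text):
    """Hypothetical helper: accept a JSON list of objects or JSON Lines."""
    stripped = text.strip()
    if stripped.startswith("["):
        # A single JSON document holding a list of objects.
        return json.loads(stripped)
    # JSON Lines: one complete JSON object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]

as_list = parse_json_records('[{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]')
as_lines = parse_json_records('{"id": 1, "label": "cat"}\n{"id": 2, "label": "dog"}\n')
print(as_list == as_lines)
```

Both inputs produce the same list of records, which is why either layout can feed the same table.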
@classmethod
def read_datagrid(cls, filename, **kwargs)
Read (load) a datagrid file.
Arguments:
- kwargs - any keyword to pass to the DataGrid constructor
Example:
>>> dg = DataGrid.read_datagrid("mnist.datagrid")
@classmethod
def read_csv(cls,
filename,
header=0,
sep=",",
quotechar='"',
datetime_format=None,
heuristics=False,
converters=None)
Takes a CSV filename and returns a DataGrid.
Arguments:
- filename - the CSV file to import
- header - (optional, int) row number (zero-based) of column headings
- sep - (optional, str) used in the CSV parsing
- quotechar - (optional, str) used in the CSV parsing
- datetime_format - (optional, str) the datetime format
- heuristics - (optional, bool) whether to guess that some float values are datetime representations
- converters - (optional, dict) a dictionary of functions for converting values in certain columns. Keys are column labels.
Example:
>>> dg = DataGrid.read_csv("results.csv")
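As a rough illustration of how header and converters interact, the sketch below parses CSV text with the standard library and applies a per-column conversion function. The helper read_rows and the sample data are hypothetical; this is not the kangas implementation, and unconverted columns are simply left as strings here.

```python
import csv
import io

csv_text = "ID,Score,Label\n1,0.25,cat\n2,0.75,dog\n"

# Keys are column labels; values are functions applied to each raw string.
converters = {"ID": int, "Score": float}

def read_rows(text, header=0, sep=",", quotechar='"', converters=None):
    """Hypothetical helper: split CSV text into column titles and rows."""
    converters = converters or {}
    lines = list(csv.reader(io.StringIO(text), delimiter=sep, quotechar=quotechar))
    columns = lines[header]  # zero-based row holding the column headings
    rows = [
        [converters.get(name, str)(value) for name, value in zip(columns, raw)]
        for raw in lines[header + 1:]
    ]
    return columns, rows

columns, rows = read_rows(csv_text, converters=converters)
print(columns)
print(rows)
```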
def info()
Display information about the DataGrid.
Example:
>>> dg.info()
DataGrid (on disk)
Name   : coco-500-with-bbox
Rows   : 500
Columns: 7
#   Column               Non-Null Count  DataGrid Type
--- -------------------- --------------- --------------------
1   ID                   500             INTEGER
2   Image                500             IMAGE-ASSET
3   Score                500             FLOAT
4   Confidence           500             FLOAT
5   Filename             500             TEXT
6   Category 5           500             TEXT
7   Category 10          500             TEXT
def head(n=5)
Display the first n rows of the DataGrid.
Arguments:
- n - (optional, int) number of rows to show
Example:
>>> dg.head()
row-id ID Score Confidence Filename
1 391895 0.4974163872616 0.5726406230662 COCO_val2014_00
2 522418 0.3612518386682 0.8539611863547 COCO_val2014_00
3 184613 0.1060265192042 0.1809083103203 COCO_val2014_00
4 318219 0.8879546879811 0.2918134509273 COCO_val2014_00
5 554625 0.5889039105388 0.8253719528139 COCO_val2014_00
[500 rows x 4 columns]
def tail(n=5)
Display the last n rows of the DataGrid.
Arguments:
- n - (optional, int) number of rows to show
Example:
>>> dg.tail()
row-id ID Score Confidence Filename
496 391895 0.4974163872616 0.5726406230662 COCO_val2014_00
497 522418 0.3612518386682 0.8539611863547 COCO_val2014_00
498 184613 0.1060265192042 0.1809083103203 COCO_val2014_00
499 318219 0.8879546879811 0.2918134509273 COCO_val2014_00
500 554625 0.5889039105388 0.8253719528139 COCO_val2014_00
[500 rows x 4 columns]
def get_columns()
Get the public-facing, non-hidden columns. Returns a list of strings.
Example:
>>> dg.get_columns()
['ID', 'Image', 'Score', 'Confidence', 'Filename']
def append_iou_columns(image_column_name, layer1, layer2)
Add Intersection Over Union columns between two layers on an image column.
def append_iou_column(image_column_name,
layer1,
layer2,
label,
new_column=None)
Add an Intersection Over Union column between two layers on an image column for the given label.
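For readers unfamiliar with the metric, Intersection Over Union (IoU) is the area shared by two regions divided by the area covered by either. The sketch below computes it for two axis-aligned bounding boxes. The (x, y, width, height) box layout and the function box_iou are assumptions made for illustration; how kangas extracts and compares the annotations in each layer is not described here.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

partial = box_iou((0, 0, 10, 10), (5, 5, 10, 10))
print(partial)
```

Identical boxes score 1.0 and boxes that do not touch score 0.0, so the value can be read as a measure of agreement between two layers.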
def remove_unused_assets()
Remove any assets that don't have a reference to them from the datagrid table.
def remove_select(where,
computed_columns=None,
limit=None,
offset=0,
debug=False)
Remove the items selected by a filter.
Arguments:
- where - (optional, str) a Python expression where column names are written as {"Column Name"}.
- limit - (optional, int) select at most this value
- offset - (optional, int) start selection at this offset
- computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.
def remove_rows(*row_ids)
Remove specific rows, and any associated assets.
def remove_columns(*column_names)
Delete columns from the saved DataGrid.
Arguments:
- column_names - one or more column names to delete, passed as separate arguments
Example:
>>> dg = kg.DataGrid(columns=["a", "b"])
>>> dg.save()
>>> dg.remove_columns("a")
def append_column(column_name, rows, verify=True)
Append a column to the DataGrid.
Arguments:
- column_name - column name to append
- rows - list of values
- verify - (optional, bool) if True, verify the data
NOTE: rows is a list of values, one for each row.
Example:
>>> dg.append_column("New Column Name", ["row1", "row2", "row3", "row4"])
def append_columns(columns, rows=None, verify=True)
Append columns to the DataGrid.
Arguments:
- columns - list of column names to append if rows is given, or a dictionary with column names as keys and column rows as values
- rows - (optional, list) list of list of values per row
- verify - (optional, bool) if True, verify the data
Example:
>>> dg = kg.DataGrid(columns=["a", "b"])
>>> dg.append([11, 12])
>>> dg.append([21, 22])
>>> dg.append_columns(
... ["New Column 1", "New Column 2"],
... [
... ["row1 col1",
... "row2 col1"],
... ["row1 col2",
... "row2 col2"],
... ])
>>> dg.append_columns(
... {"New Column 3": ["row1 col3",
... "row2 col3"],
... "New Column 4": ["row1 col4",
... "row2 col4"],
... })
>>> dg.head()
row-id a b New Column 1 New Column 2 New Column 3 New Column 4
1 11 12 row1 col1 row1 col2 row1 col3 row1 col4
2 21 22 row2 col1 row2 col2 row2 col3 row2 col4
[2 rows x 6 columns]
def pop(index)
Pop a row by index from an in-memory DataGrid.
Arguments:
- index - (int) position (zero-based) of row to remove
Example:
>>> row = dg.pop(0)
def append(row)
Append this row onto the datagrid data.
Example:
>>> dg.append(["column 1 value", "column 2 value", ...])
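As noted for the class, a row may be given either as a list of values or as a dictionary keyed by column name. A minimal sketch of how a dictionary row lines up with the column order is shown below. The helper row_to_list is hypothetical, and filling missing keys with None is an assumption of this sketch rather than documented kangas behaviour.

```python
columns = ["Image", "Score"]

def row_to_list(row, columns):
    """Hypothetical helper: put a dict row into column order."""
    if isinstance(row, dict):
        # Assumption: a column missing from the dict becomes None.
        return [row.get(name) for name in columns]
    return list(row)  # list rows are already in column order

data = []
data.append(row_to_list(["cat.jpg", 0.9], columns))
data.append(row_to_list({"Score": 0.4, "Image": "dog.jpg"}, columns))
print(data)
```

The key order inside the dictionary does not matter, since each value is placed by its column name.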
def get_asset_ids()
Get all of the asset IDs from the DataGrid.
Returns a list of asset IDs.
def extend(rows, verify=True)
Extend the datagrid with the given rows.
Example:
>>> dg.extend([
... ["row 1, column 1 value", "row 1, column 2 value", ...],
... ["row 2, column 1 value", "row 2, column 2 value", ...],
... ...,
... ])
def get_schema()
Get the DataGrid schema.
Example:
>>> dg.get_schema()
{'row-id': {'field_name': 'column_0', 'type': 'ROW_ID'},
'ID': {'field_name': 'column_1', 'type': 'INTEGER'},
'Image': {'field_name': 'column_2', 'type': 'IMAGE-ASSET'},
'Score': {'field_name': 'column_3', 'type': 'FLOAT'},
'Confidence': {'field_name': 'column_4', 'type': 'FLOAT'},
'Filename': {'field_name': 'column_5', 'type': 'TEXT'},
'Category 5': {'field_name': 'column_6', 'type': 'TEXT'},
'Category 10': {'field_name': 'column_7', 'type': 'TEXT'},
'Image--metadata': {'field_name': 'column_8', 'type': 'JSON'}}
def select_count(where="1")
Get the count of items given a where expression.
Arguments:
- where - a Python expression where column names are written as {"Column Name"}.
Example:
>>> dg.select_count("{'column 1'} > 0.5")
894
def select_dataframe(where="1",
sort_by=None,
sort_desc=False,
computed_columns=None,
limit=None,
offset=0,
select_columns=None)
Perform a selection on the database, optionally with a query, and return the rows as a dataframe in the requested sort order.
Arguments:
- where - (optional, str) a Python expression where column names are written as {"Column Name"}.
- select_columns - (optional, list of str) list of column names to select
- sort_by - (optional, str) name of column to sort on
- sort_desc - (optional, bool) sort descending?
- limit - (optional, int) select at most this value
- offset - (optional, int) start selection at this offset
- computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.
Example:
>>> df = dg.select_dataframe("{'column name 1'} == {'column name 2'} and {'score'} < -1")
def select(where="1",
sort_by=None,
sort_desc=False,
to_dicts=False,
count=False,
computed_columns=None,
limit=None,
offset=0,
debug=False,
select_columns=None)
Perform a selection on the database, optionally with a query, and return the rows in the requested sort order.
Arguments:
- where - (optional, str) a Python expression where column names are written as {"Column Name"}.
- select_columns - (optional, list of str) a list of column names to select
- sort_by - (optional, str) name of column to sort on
- sort_desc - (optional, bool) sort descending?
- limit - (optional, int) select at most this value
- offset - (optional, int) start selection at this offset
- to_dicts - (optional, bool) if True, return the rows in dicts where the keys are the column names.
- count - (optional, bool) if True, return the count of matching rows
- computed_columns - (optional, dict) a dictionary with the keys being the column name, and value is a string describing the expression of the column. Uses same syntax and semantics as the filter query expressions.
Example:
>>> dg.select("{'column name 1'} == {'column name 2'} and {'score'} < -1")
[
["row 1, column 1 value", "row 1, column 2 value", ...],
["row 2, column 1 value", "row 2, column 2 value", ...],
...
]
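The way where, sort_by, sort_desc, offset, and limit combine can be sketched on a plain list of dictionaries. This sketch assumes the usual database ordering of filter, then sort, then offset, then limit. The helper select_rows is hypothetical, and its where argument is an ordinary Python function rather than a kangas filter string.

```python
rows = [
    {"name": "a", "score": 0.9},
    {"name": "b", "score": 0.2},
    {"name": "c", "score": 0.7},
    {"name": "d", "score": 0.5},
]

def select_rows(rows, where=None, sort_by=None, sort_desc=False,
                limit=None, offset=0):
    """Hypothetical helper: filter, sort, then take a window of rows."""
    matched = [row for row in rows if where is None or where(row)]
    if sort_by is not None:
        matched.sort(key=lambda row: row[sort_by], reverse=sort_desc)
    end = None if limit is None else offset + limit
    return matched[offset:end]

# The two highest-scoring rows with a score above 0.3.
top = select_rows(rows, where=lambda r: r["score"] > 0.3,
                  sort_by="score", sort_desc=True, limit=2)
# The second page of two rows when sorted by ascending score.
page_two = select_rows(rows, sort_by="score", limit=2, offset=2)
print([r["name"] for r in top])
print([r["name"] for r in page_two])
```

Combining limit with an increasing offset in this way is a common pattern for paging through a large selection.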
def save(filename=None, create_thumbnails=None)
Create the SQLite database on disk.
Arguments:
- filename - (optional, str) the name of the file to save to
- create_thumbnails - (optional, bool) if True, then create thumbnail images for assets
Example:
>>> dg.save()
def set_about(markdown)
Set the about page for this DataGrid.
Arguments:
- markdown - (str) the markdown text for the about page
def set_about_from_script(filename)
Set the about page for this DataGrid from the script that created it.
Arguments:
- filename - (str) the file that created the DataGrid
def get_about()
Get the about page for this DataGrid.
def display_about()
Display the about page for this DataGrid as markdown.
Note: this requires being in an IPython-like environment.
def upgrade()
Upgrade to latest version of datagrid.
Kangas DataGrid is completely open source; sponsored by Comet ML