Quickly find tokens (words, phrases, etc.) within your data.
Data Filter is a lightweight data cleansing framework that can be easily extended to support different data types, structures, or processing requirements. It natively supports the following data types:

- CSV files
- Text files
- Text strings

Data Filter requires Python 3.6+.
To install, simply use pipenv (or pip):

```
$ pipenv install datafilter
```
```python
from datafilter import CSV

tokens = ["Lorem", "ipsum", "Volutpat est", "mi sit amet"]
data = CSV("test.csv", tokens=tokens)
data.save("filtered.csv")
```
In this example, we open a CSV file, search each row for normalized tokens, and flag the rows that contain them. The `.save()` method creates a new CSV file with all rows that weren't flagged.
```python
from datafilter import Text

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
data = Text(text, tokens=["Lorem"])
print(next(data.results()))
```
In this example, we search a text string for normalized tokens. We can then iterate over the results using the `.results()` method, which returns a generator that yields formatted results.
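For instance, the generator can be consumed in a loop. This sketch assumes the result format shown in the `.parse()` example later in this README:

```python
from datafilter import Text

data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])

# Each result is a dict describing one segment of the data.
for result in data.results():
    if result["flagged"]:
        print(result["describe"]["tokens"]["detected"])
```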
```python
from datafilter import TextFile

data = TextFile("test.txt", tokens=["Lorem", "ipsum"], re_split=r"(?<=\.)")
print(next(data.results()))
```
In this example, we open a text file and split the data based on a regular expression defined by `re_split`.
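To see what that pattern does on its own, here is a small standalone sketch (plain `re`, not Data Filter; note that zero-width splits like this lookbehind require Python 3.7+):

```python
import re

# The lookbehind splits after each period, so each sentence
# becomes a separate segment to normalize and scan.
print(re.split(r"(?<=\.)", "First. Second. Third."))
# ['First.', ' Second.', ' Third.', '']
```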
Data Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:
- Filters that can handle different data types such as Microsoft Word, Google Docs, etc.
- Filters that can handle incoming data from external APIs.
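As a rough sketch of what a custom filter might look like (hypothetical code: the exact `Base` hook points are assumptions based on the method descriptions below), a filter for an in-memory list of strings could be as simple as:

```python
from datafilter import Base

class StringList(Base):
    """Hypothetical filter that scans an in-memory list of strings."""

    def __init__(self, strings, **kwargs):
        super().__init__(**kwargs)
        self.strings = strings

    def results(self):
        # Assumption: Base.parse() normalizes a segment, matches tokens,
        # and returns the formatted result dict described below.
        for string in self.strings:
            yield self.parse(string)
```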
`Base` is an abstract base class that's subclassed by every filter. `Base` includes several methods to ensure data is properly normalized, formatted, and returned. The `.results()` method is an `@abstractmethod` to enforce its use in subclasses.
`Base` accepts the following parameters:

- `tokens` (type `<list>`): A list of strings that will be searched for within a set of data.
- type `<list>`: A list of strings that will be removed during normalization. Default: ``['0123456789', '(){}[]<>!?.:;,`\'"@#$%^&*+-|=~ββ/\\_', '\t\n\r\x0c\x0b']``
- type `<bool>`: When `True`, token matching will be bidirectional. Default: `True`. Note: A common obfuscation method is to reverse the offending string or phrase. Bidirectional matching helps detect those instances.
- type `<bool>`: When `True`, tokens and data are converted to lowercase during normalization. Default: `True`.
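To illustrate what bidirectional matching buys you, here is a hypothetical standalone sketch (not the library's code):

```python
# "lorem" obfuscated by reversal still matches when the
# reversed form of each token is scanned for as well.
token = "lorem"
data = "merol ipsum dolor sit amet"
candidates = {token, token[::-1]}  # forward and reversed
print(any(t in data for t in candidates))  # True
```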
`Base` includes the following methods:

- `.results()`: Abstract method used to return results within a filter. This is defined by each `Base` subclass.
- A method that returns a translation table used during normalization. Returns type `<dict>`.
- A method that returns normalized data. Normalization includes converting data to lowercase and removing strings. Accepts parameter `data`. Returns type `<tuple>`. Note: The first element of the tuple is the original data; the second element is the normalized data.
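Conceptually, normalization can be pictured with this standalone sketch (hypothetical code, not the library's internals):

```python
# Strings removed during normalization, as in the default above.
REMOVE = ['0123456789', '(){}[]<>!?.:;,`\'"@#$%^&*+-|=~/\\_', '\t\n\r\x0c\x0b']

def normalize(data):
    table = str.maketrans("", "", "".join(REMOVE))  # translation table (a dict)
    return (data, data.lower().translate(table))  # (original, normalized)

print(normalize("Lorem ipsum, dolor!"))
# ('Lorem ipsum, dolor!', 'lorem ipsum dolor')
```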
- `.parse()`: Returns parsed and formatted data. Accepts parameter `data`. Returns type `<dict>`.
Example: assume we're searching for the token "Lorem" in a very short text string.

```python
data = Text("Lorem ipsum dolor sit amet", tokens=["Lorem"])
print(next(data.results()))
```

The returned result would be formatted as:

```python
{
    "data": "Lorem ipsum dolor sit amet",
    "flagged": True,
    "describe": {
        "tokens": {
            "detected": ["Lorem"],
            "count": 1,
            "frequency": {
                "Lorem": 1,
            },
        }
    },
}
```
Note: `.parse()` should never be called directly. Use `.results()` instead.
Filters subclass and extend the `Base` class to support various data types and structures. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type, or structure.
`CSV` is a subclass of `Base` and inherits all parameters.

- type `<str>`: Path to a CSV file.

`CSV` is a subclass of `Base` and inherits all methods.

- `.save()`: Saves results to a file. Accepts parameter `path`. `path` is the absolute path and filename of the new file.
`Text` is a subclass of `Base` and inherits all parameters.

- type `<str>`: A text string.
- `re_split` (type `<str>`): A regular expression pattern or string that will be applied to the text with `re.split` before normalization.

`Text` is a subclass of `Base` and inherits all methods.

- `.save()`: Saves results to a file. Accepts parameters `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.
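For example, a hedged usage sketch (assuming `endofline` is accepted as a keyword argument):

```python
from datafilter import Text

data = Text("First. Second.", tokens=["first"], re_split=r"(?<=\.)")
# Assumption: endofline is passed by keyword; "\n" terminates each saved row.
data.save("filtered.txt", endofline="\n")
```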
`TextFile` is a subclass of `Base` and inherits all parameters.

- type `<str>`: Path to a text file.
- `re_split` (type `<str>`): A regular expression pattern or string that will be applied to the file's text with `re.split` before normalization.

`TextFile` is a subclass of `Base` and inherits all methods.

- `.save()`: Saves results to a file. Accepts parameters `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.