Implement updating values within string columns (#1524)
st-pasha authored Jan 10, 2019
1 parent 9cad681 commit 10f78ca
Showing 5 changed files with 90 additions and 16 deletions.
12 changes: 8 additions & 4 deletions CHANGELOG.md
@@ -48,7 +48,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
 
 - Module `datatable` now exposes C API, to allow other C/C++ libraries interact
   with datatable Frames natively (#1469). See "datatable/include/datatable.h"
-  for the description of the API functions.
+  for the description of the API functions. Thanks [Qiang Kou][] for testing
+  this functionality.
 
 - The column selector `j` in `DT[i, j]` can now be a list/iterator of booleans.
   This list should have length `DT.ncols`, and the entries in this list will
@@ -92,6 +93,9 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
 - FTRL algorithm now works correctly with view frames (#1502). Thanks to
   [Olivier][] for reporting this issue.
 
+- Partial column update (i.e. expression of the form `DT[i, j] = R`) now works
+  for string columns as well (#1523).
+
 
 ### Changed
 
@@ -196,13 +200,13 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
 - function `abs()` to find the absolute value of elements in the frame.
 
 - improved handling of Excel files by fread:
-  * sheet name can now be used as a path component in the file name,
+  - sheet name can now be used as a path component in the file name,
     causing only that particular sheet to be parsed;
 
-  * further, a cell range can be specified as a path component after the
+  - further, a cell range can be specified as a path component after the
     sheet name, forcing fread to consider only the provided cell range;
 
-  * fread can now handle the situation when a spreadsheet has multiple
+  - fread can now handle the situation when a spreadsheet has multiple
     separate tables in the same sheet. They will now be detected automatically
     and returned to the user as separate Frame objects (the name of each
     frame will contain the sheet name and cell range from where the data was
57 changes: 45 additions & 12 deletions c/column_string.cc
@@ -326,25 +326,58 @@ void StringColumn<T>::replace_values(
     RowIndex replace_at, const Column* replace_with)
 {
   reify();
-  if (!replace_with) {
+  Column* rescol = nullptr;
+
+  if (replace_with && replace_with->stype() != stype()){
+    replace_with = replace_with->cast(stype());
+  }
+  // This could be nullptr too
+  auto repl_col = static_cast<const StringColumn<T>*>(replace_with);
+
+  if (!replace_with || replace_with->nrows == 1) {
+    CString repl_value;  // Default constructor creates an NA string
+    if (replace_with) {
+      T off0 = repl_col->offsets()[0];
+      if (!ISNA<T>(off0)) {
+        repl_value = CString(repl_col->strdata(), static_cast<int64_t>(off0));
+      }
+    }
     MemoryRange mask = replace_at.as_boolean_mask(nrows);
     auto mask_indices = static_cast<const int8_t*>(mask.rptr());
-    Column* t = dt::map_str2str(this,
+    rescol = dt::map_str2str(this,
       [=](size_t i, CString& value, dt::fhbuf& sb) {
-        if (mask_indices[i]) {
-          sb.write_na();
-        } else {
+        sb.write(mask_indices[i]? repl_value : value);
+      });
+  }
+  else {
+    const char* repl_strdata = repl_col->strdata();
+    const T* repl_offsets = repl_col->offsets();
+
+    MemoryRange mask = replace_at.as_integer_mask(nrows);
+    auto mask_indices = static_cast<const int32_t*>(mask.rptr());
+    rescol = dt::map_str2str(this,
+      [=](size_t i, CString& value, dt::fhbuf& sb) {
+        int ir = mask_indices[i];
+        if (ir == -1) {
           sb.write(value);
+        } else {
+          T offstart = repl_offsets[ir - 1] & ~GETNA<T>();
+          T offend = repl_offsets[ir];
+          if (ISNA<T>(offend)) {
+            sb.write_na();
+          } else {
+            sb.write(repl_strdata + offstart, offend - offstart);
+          }
         }
       });
-    StringColumn<T>* scol = static_cast<StringColumn<T>*>(t);
-    std::swap(mbuf, scol->mbuf);
-    std::swap(strbuf, scol->strbuf);
-    delete scol;
-    if (stats) stats->reset();
-    return;
   }
-  throw NotImplError() << "StringColumn::replace_values() not implemented";
+
+  xassert(rescol);
+  StringColumn<T>* scol = static_cast<StringColumn<T>*>(rescol);
+  std::swap(mbuf, scol->mbuf);
+  std::swap(strbuf, scol->strbuf);
+  delete rescol;
+  if (stats) stats->reset();
 }
 
 
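Reviewer note: the element-wise branch above reads each replacement string through a pair of offsets, masking off an NA flag on the start offset and testing it on the end offset. The Python sketch below models that access pattern only. It is not datatable's actual storage code: the names `NA_FLAG`, `encode` and `read` are hypothetical, the assumption that NA is marked by the high bit of a 32-bit offset is inferred from the `& ~GETNA<T>()` expression, and the leading sentinel `0` stands in for the C++ code being able to read `repl_offsets[ir - 1]` when `ir == 0`.

```python
from typing import List, Optional, Tuple

# Assumption: an NA string is flagged by the high bit of its 32-bit end offset.
NA_FLAG = 1 << 31


def encode(strings: List[Optional[str]]) -> Tuple[bytes, List[int]]:
    """Pack strings into one byte buffer plus a list of end offsets.

    offsets[0] is a sentinel 0, so row i starts at offsets[i] and ends at
    offsets[i + 1]. An NA row repeats the previous end offset with NA_FLAG set.
    """
    data = bytearray()
    offsets = [0]
    for s in strings:
        if s is None:
            offsets.append(len(data) | NA_FLAG)
        else:
            data += s.encode("utf-8")
            offsets.append(len(data))
    return bytes(data), offsets


def read(data: bytes, offsets: List[int], i: int) -> Optional[str]:
    """Read row i, mirroring the offstart/offend logic in the hunk above."""
    start = offsets[i] & ~NA_FLAG   # previous row may have been NA: drop its flag
    end = offsets[i + 1]
    if end & NA_FLAG:               # this row is NA
        return None
    return data[start:end].decode("utf-8")


data, offsets = encode(["One", None, "five"])
rows = [read(data, offsets, i) for i in range(3)]
```

Masking the start offset matters because the row before the one being read may itself be NA, in which case its stored offset carries the flag bit.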
13 changes: 13 additions & 0 deletions c/rowindex.cc
@@ -254,6 +254,19 @@ MemoryRange RowIndex::as_boolean_mask(size_t nrows) const {
 }
 
 
+MemoryRange RowIndex::as_integer_mask(size_t nrows) const {
+  MemoryRange res = MemoryRange::mem(nrows * 4);
+  int32_t* data = static_cast<int32_t*>(res.xptr());
+  // NA index is -1 in byte, and also -1 in int32
+  std::memset(data, -1, nrows * 4);
+  iterate(0, size(), 1,
+    [&](size_t i, size_t j) {
+      data[j] = static_cast<int32_t>(i);
+    });
+  return res;
+}
+
+
 RowIndex RowIndex::negate(size_t nrows) const {
   if (isabsent()) {
     // No RowIndex is equivalent to having RowIndex over all rows. The inverse
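Reviewer note: a small Python model of what `as_integer_mask` appears to compute, for readers who want to check the semantics by hand. The function name is reused for clarity, but the list-based representation of a RowIndex is an assumption made for illustration. The second part checks the remark in the code comment that filling memory with byte `-1` (`0xFF`) also yields `-1` when the same bytes are read as 32-bit integers.

```python
import struct
from typing import List, Sequence


def as_integer_mask(row_index: Sequence[int], nrows: int) -> List[int]:
    """Return a list of length nrows: -1 for unselected rows, and the
    position k within row_index for each selected row row_index[k]."""
    mask = [-1] * nrows
    for k, row in enumerate(row_index):
        mask[row] = k
    return mask


# Rows 2, 0 and 4 are selected, in that order.
mask = as_integer_mask([2, 0, 4], 5)

# Twelve 0xFF bytes reinterpreted as three little-endian int32 values.
filled = struct.unpack("<3i", b"\xff" * 12)
```

With this mask, row `j` of the target column either keeps its value (`mask[j] == -1`) or takes element `mask[j]` of the replacement column.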
7 changes: 7 additions & 0 deletions c/rowindex.h
@@ -151,6 +151,13 @@ class RowIndex {
      */
     MemoryRange as_boolean_mask(size_t nrows) const;
 
+    /**
+     * Convert the RowIndex into an array `int32_t[nrows]`, where entries not
+     * selected by this RowIndex are -1, and the selected entries are
+     * consecutive integers 0, 1, ..., size()-1.
+     */
+    MemoryRange as_integer_mask(size_t nrows) const;
+
     /**
      * Return a RowIndex which is the negation of the current, when applied
      * to an array of `nrows` elements. That is, the returned RowIndex
17 changes: 17 additions & 0 deletions tests/munging/test_assign.py
@@ -70,3 +70,20 @@ def test_assign_frame():
     f0[:, "A"] = f1[:10, :]
     assert f0.names == ("A",)
     assert f0.ltypes == (dt.ltype.real,)
+
+
+def test_assign_string_columns():
+    f0 = dt.Frame(A=["One", "two", "three", None, "five"])
+    f0[dt.isna(f.A), f.A] = dt.Frame(["FOUR"])
+    assert f0.names == ("A", )
+    assert f0.stypes == (dt.stype.str32,)
+    assert f0.to_list() == [["One", "two", "three", "FOUR", "five"]]
+
+
+def test_assign_string_columns2():
+    f0 = dt.Frame(A=["One", "two", "three", None, "five"])
+    f0[[2, 0, 4], "A"] = dt.Frame([None, "Oh my!", "infinity"])
+    f0[1, "A"] = dt.Frame([None], stype=dt.str32)
+    assert f0.names == ("A", )
+    assert f0.stypes == (dt.stype.str32,)
+    assert f0.to_list() == [["Oh my!", None, None, None, "infinity"]]
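
Reviewer note: the expected values in the two tests above can be reproduced without datatable by modelling the two branches of `replace_values` on plain Python lists. This is a sketch of the intended semantics only. The free function `replace_values` below is hypothetical, it returns a new list rather than swapping buffers in place as the C++ method does, and `None` stands in for an NA string.

```python
from typing import List, Optional, Sequence


def replace_values(
    column: Sequence[Optional[str]],
    replace_at: Sequence[int],
    replace_with: Optional[Sequence[Optional[str]]] = None,
) -> List[Optional[str]]:
    nrows = len(column)
    if replace_with is None or len(replace_with) == 1:
        # Scalar branch: a boolean mask, with one value (or NA) broadcast
        # to every selected row.
        repl_value = None if replace_with is None else replace_with[0]
        selected = [False] * nrows
        for row in replace_at:
            selected[row] = True
        return [repl_value if selected[i] else v for i, v in enumerate(column)]
    # Element-wise branch: an integer mask, where -1 means "keep the old
    # value" and k >= 0 means "take replace_with[k]".
    mask = [-1] * nrows
    for k, row in enumerate(replace_at):
        mask[row] = k
    return [v if mask[i] == -1 else replace_with[mask[i]]
            for i, v in enumerate(column)]


col = ["One", "two", "three", None, "five"]
first = replace_values(col, [3], ["FOUR"])
second = replace_values(col, [2, 0, 4], [None, "Oh my!", "infinity"])
second = replace_values(second, [1], [None])
```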
