[WIP] update type inference for string columns #343

Merged on Apr 10, 2021 (3 commits)

Contributor

@Moh-Yakoub Moh-Yakoub commented Apr 5, 2021

Overview

Closes #249

I noticed a type inference error when a field contains NaNs, as described in the linked issue.
I further noticed that we don't check whether string columns can be cast to a float/int type. I've added an extra check to see if a string column can be cast to an int/float column and infer the proper data_type accordingly.

Update (04/10/2021)

  • Added a warning message when NaN columns are displayed as histograms
  • Resolved NaN series causing an error in histogram execute_binning
  • Rewrote is_numeric_nan_column to run 2x faster
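As a rough illustration of the NaN handling described in the update, the sketch below warns instead of failing when a column has no usable values. It is not Lux's implementation: `is_all_nan` and `safe_histogram` are hypothetical names, and the sketch assumes the failure is of the kind NumPy raises when `np.histogram` receives only NaN values and cannot determine a finite bin range.

```python
import warnings

import numpy as np
import pandas as pd


def is_all_nan(series: pd.Series) -> bool:
    """Hypothetical check: True when the column has no usable values."""
    return bool(series.isna().all())


def safe_histogram(series: pd.Series, bins: int = 10):
    """Bin a numeric column, warning instead of failing on all-NaN input."""
    if is_all_nan(series):
        # Nothing to bin: warn and return empty counts with no bin edges.
        warnings.warn("Column contains only NaN values; histogram is empty.")
        return np.zeros(bins, dtype=int), None
    # Drop missing values first so the bin range is finite.
    values = series.dropna().to_numpy(dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    return counts, edges
```

Calling `safe_histogram(pd.Series([1.0, 2.0, np.nan, 4.0]), bins=3)` bins only the three non-missing values, while an all-NaN column yields zero counts plus a warning.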

Changes

I've added logic to:

  1. Check if the column can be cast to int/double.
  2. Apply the respective data_type inference.
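The two steps above can be sketched roughly as follows. This is a minimal illustration using pandas, not the code in this PR: `infer_string_column_type` and its return labels are hypothetical, and the sketch assumes missing values should be ignored when deciding whether a column is numeric.

```python
import pandas as pd


def infer_string_column_type(series: pd.Series) -> str:
    """Hypothetical helper: classify an object/string column.

    Returns "integer", "float", or "string". Missing values are ignored
    when deciding, so a column such as ["1", None, "3"] still counts
    as integer-like.
    """
    non_null = series.dropna()
    if non_null.empty:
        return "string"
    # errors="coerce" turns unparseable entries into NaN instead of raising.
    converted = pd.to_numeric(non_null, errors="coerce")
    if converted.isna().any():
        return "string"  # at least one value is not a number
    # Every value parsed; integer-like if none has a fractional part.
    if (converted % 1 == 0).all():
        return "integer"
    return "float"
```

For example, `infer_string_column_type(pd.Series(["1", None, "3"]))` classifies the column as integer-like even though it contains a missing value, which is the situation the linked issue describes.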

Example Output

[Screenshot: Screen Shot 2021-04-05 at 11 51 40 PM]

The result shows that the two columns mentioned in the issue, # Instances and # Attributes, now have the correct data_type.


codecov bot commented Apr 5, 2021

Codecov Report

Merging #343 (eeee236) into master (1a72332) will increase coverage by 0.27%.
The diff coverage is 92.68%.

❗ Current head eeee236 differs from pull request most recent head 02a6a99. Consider uploading reports for the commit 02a6a99 to get more accurate results

@@            Coverage Diff             @@
##           master     #343      +/-   ##
==========================================
+ Coverage   79.98%   80.25%   +0.27%     
==========================================
  Files          50       50              
  Lines        3612     3632      +20     
==========================================
+ Hits         2889     2915      +26     
+ Misses        723      717       -6     
Impacted Files                                Coverage Δ
lux/vislib/altair/Choropleth.py               94.20% <0.00%> (ø)
lux/vislib/matplotlib/ScatterChart.py         76.47% <50.00%> (+1.17%) ⬆️
lux/action/univariate.py                      90.90% <100.00%> (+0.52%) ⬆️
lux/executor/PandasExecutor.py                96.07% <100.00%> (+0.06%) ⬆️
lux/utils/utils.py                            90.12% <100.00%> (+1.23%) ⬆️
lux/vislib/matplotlib/MatplotlibRenderer.py   87.30% <100.00%> (+0.63%) ⬆️
lux/interestingness/interestingness.py        87.56% <0.00%> (+1.08%) ⬆️
lux/vislib/matplotlib/Heatmap.py              98.33% <0.00%> (+1.66%) ⬆️
lux/vislib/altair/ScatterChart.py             96.96% <0.00%> (+3.03%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1a72332...02a6a99.

Member

dorisjlee commented Apr 10, 2021

Thanks @Moh-Yakoub! I made some changes to resolve the related issue in #249 and rewrote the helper function in a more optimized way. Congrats on your first contribution to Lux!

@dorisjlee dorisjlee merged commit bab48ff into lux-org:master Apr 10, 2021
Development

Successfully merging this pull request may close these issues.

Misdetected data type when numerical column contains null