Add read image and process labels notebook #162

Open
wants to merge 1 commit into base: main


sfc-gh-dan

Add notebook to show unstructured data processing on container runtime

Comment on lines +19 to +20
"from snowflake.snowpark.context import get_active_session\n",
"session = get_active_session()\n"
Collaborator

This won't work outside of Snowflake Notebooks

Collaborator

(Need to create the session from config first)

Author

Yeah, Ray Data won't work outside Snowflake Notebooks either; this notebook is meant to be used inside a snowbook.
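
The suggestion to create the session from config first could be sketched roughly as follows. It tries `get_active_session()` and falls back to `Session.builder.configs(...)` when no active session exists. The `SNOWFLAKE_*` environment variable names and both helper names are assumptions made for illustration; they are not part of this PR.

```python
import os


def connection_parameters_from_env(env=None):
    """Build Snowpark connection parameters from environment variables.

    The SNOWFLAKE_* variable names are an assumption, not a convention of this repo.
    """
    env = os.environ if env is None else env
    keys = ("account", "user", "password", "role", "warehouse", "database", "schema")
    params = {}
    for key in keys:
        value = env.get(f"SNOWFLAKE_{key.upper()}")
        if value:  # skip unset or empty values
            params[key] = value
    return params


def get_or_create_session(env=None):
    """Return the active session inside Snowflake Notebooks, else build one from config."""
    # Imported lazily so the config helper above is usable without Snowpark installed.
    from snowflake.snowpark import Session
    from snowflake.snowpark.context import get_active_session

    try:
        return get_active_session()
    except Exception:
        # No active session (for example, running locally): fall back to explicit config.
        return Session.builder.configs(connection_parameters_from_env(env)).create()
```

As the author notes, the Ray data sources may still require the container runtime, so this fallback only addresses session creation.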

Comment on lines +50 to +51
" database = \"ST_DB\",\n",
" schema = \"ST_SCHEMA\",\n",
Collaborator

This db/schema don't exist for users

Author

I wonder what the best practice would be here? I guess we cannot assume that any particular database and schema will exist on a customer account.
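
One possible approach, offered as a sketch rather than an agreed convention, is to avoid hardcoding `ST_DB` / `ST_SCHEMA` and fall back to whatever the session is currently using. `resolve_namespace` is a hypothetical helper name; `get_current_database()` and `get_current_schema()` are Snowpark `Session` methods.

```python
def resolve_namespace(session, database=None, schema=None):
    """Pick the database and schema to use, preferring explicit overrides.

    Falls back to the session's current database and schema, so the notebook
    does not depend on names such as ST_DB / ST_SCHEMA existing in the account.
    """
    database = database or session.get_current_database()
    schema = schema or session.get_current_schema()
    if not database or not schema:
        raise ValueError(
            "No database/schema set; pass them explicitly or set them on the session."
        )
    return database, schema
```

The returned pair could then be passed as the `database` and `schema` arguments of the data sources instead of literal strings.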

},
"source": [
"### Process both dataset to include addition columns\n",
"**Image Dataset**: add a join key, encode the images, standardize image\\n\n",
Collaborator

nit: remove \\n

"### Process both dataset to include addition columns\n",
"**Image Dataset**: add a join key, encode the images, standardize image\\n\n",
"\n",
"**Label Dataset**: add a join key, interrpet the labels"
Collaborator

nit: sp
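
The diff does not show how the join key is computed. As a minimal sketch, assuming image and label files share a base name (for example `images/cat_001.jpg` and `labels/cat_001.txt`, both hypothetical paths), a key could be derived from the file path like this:

```python
from pathlib import PurePosixPath


def join_key(file_path):
    """Derive a join key from a staged file path by dropping directories and the extension."""
    return PurePosixPath(file_path).stem
```

If the real dataset uses a different naming scheme, the key function would need to change accordingly.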

Comment on lines +45 to +72
"source": [
"from snowflake.ml.ray.datasource import SFStageImageDataSource, SFStageTextDataSource\n",
"\n",
"image_source = SFStageImageDataSource(\n",
" stage_location = \"@DATA_STAGE_RAY/images/\",\n",
" database = \"ST_DB\",\n",
" schema = \"ST_SCHEMA\",\n",
" image_size=(256, 256),\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2324e409-b4c5-4405-ad1c-267831be1773",
"metadata": {
"language": "python",
"name": "cell15"
},
"outputs": [],
"source": [
"label_source = SFStageTextDataSource(\n",
" stage_location = \"@DATA_STAGE_RAY/labels/\",\n",
" database = \"ST_DB\",\n",
" schema = \"ST_SCHEMA\",\n",
")"
]
},
Collaborator

Where should external users get the images and labels?

Author

Let me add a step before this notebook to prepare the data. To answer your question: this uses a public third-party dataset.

},
"source": [
"### Merge image source and label source into a single dataset\n",
"We have two ways of achieving this: 1) if customer is more famaliar with `pandas.Dataframe` and if the data fit into memory, then we can convert all data into pandas (or write into snowflake) and do the rest of the ops. 2) If the data does not fit into memory, we can directly leverage ray dataset to do the processing. \n",
Collaborator

nit: sp famaliar

"### Merge image source and label source into a single dataset\n",
"We have two ways of achieving this: 1) if customer is more famaliar with `pandas.Dataframe` and if the data fit into memory, then we can convert all data into pandas (or write into snowflake) and do the rest of the ops. 2) If the data does not fit into memory, we can directly leverage ray dataset to do the processing. \n",
"\n",
"**Note**: Ray dataset is not naturally architeched to support join ops, so it's better for to use other method (in memory / snowflake) to perform joins"
Collaborator

nit: sp architeched
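
For reference, the in-memory option described in the notebook text (option 1) amounts to a pandas merge on the join key. The sketch below uses made-up rows and hypothetical column names (`join_key`, `image`, `label`), since the notebook's actual schema is not visible in this diff.

```python
import pandas as pd

# Hypothetical rows standing in for the processed image and label datasets.
images = pd.DataFrame(
    {"join_key": ["a", "b", "c"], "image": ["<img a>", "<img b>", "<img c>"]}
)
labels = pd.DataFrame({"join_key": ["a", "b"], "label": ["cat", "dog"]})

# Inner join keeps only images that have a matching label.
merged = images.merge(labels, on="join_key", how="inner")
```

This only applies when both datasets fit into memory, as the notebook text already points out.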

"resultHeight": 46
},
"source": [
"## Save the Transformed Dataset to a snowflake table\n",
Collaborator

nit: capitalize Snowflake

Comment on lines +405 to +406
" database = \"ST_DB\",\n",
" schema = \"ST_SCHEMA\",\n",
Collaborator

(just a reminder that db/schema don't exist for users)

Comment on lines +435 to +439
"source": [
"# sql cell\n",
"\n",
"# SELECT * FROM RAY_DEMO_JAN21_IMAGE_DS;"
]
Collaborator

Convert to Snowpark Python call?
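
A Snowpark equivalent of the commented-out SQL cell might look like the sketch below. `preview_table` is a hypothetical helper name; `session.table(...)`, `limit(...)` and `to_pandas()` are Snowpark calls, and the row limit is an addition that the original `SELECT *` does not have.

```python
def preview_table(session, table_name, limit=10):
    """Return the first rows of a table as a pandas DataFrame via Snowpark."""
    return session.table(table_name).limit(limit).to_pandas()
```

In the notebook this could be called as `preview_table(session, "RAY_DEMO_JAN21_IMAGE_DS")`.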

2 participants