From 0b1088751d4dd5ff9aeb27a1e589c22c452a9745 Mon Sep 17 00:00:00 2001
From: Chris Wood <c.c.wood@gmail.com>
Date: Tue, 27 Feb 2024 19:21:46 +0000
Subject: [PATCH] Adding solutions for episode 3

---
 _episodes/03-starting-with-data.md | 82 +++++++++++++++++++++++++++++-
 1 file changed, 80 insertions(+), 2 deletions(-)

diff --git a/_episodes/03-starting-with-data.md b/_episodes/03-starting-with-data.md
index 13df5aa42..29ecbdcc4 100644
--- a/_episodes/03-starting-with-data.md
+++ b/_episodes/03-starting-with-data.md
@@ -310,6 +310,54 @@ Let's look at the data using these.
 >
 > 3. `waves_df.head()` Also, what does `waves_df.head(15)` do?
 > 4. `waves_df.tail()`
+>
+> > ## Solution
+> > 1. 
+> > ~~~
+> > Index(['record_id', 'buoy_id', 'Name', 'Date', 'Tz', 'Peak Direction', 'Tpeak',
+> >    'Wave Height', 'Temperature', 'Spread', 'Operations', 'Seastate',
+> >    'Quadrant'],
+> >    dtype='object')
+> > ~~~
+> > {: .output}
+> >
+> > 2. 
+> > ~~~
+> > (2073, 13)
+> > ~~~
+> > {: .output}
+> >
+> > It is a _tuple_ of the form (number of rows, number of columns).
+> >
+> > 3. 
+> > ~~~
+> >   record_id  buoy_id  ... Seastate Quadrant
+> > 0          1       14  ...    swell     west
+> > 1          2        7  ...    swell    south
+> > 2          3        5  ...  windsea     east
+> > 3          4        3  ...    swell    south
+> > 4          5       10  ...    swell     west
+> >
+> > [5 rows x 13 columns]
+> > ~~~
+> > {: .output}
+> >
+> > So, `waves_df.head()` returns the first 5 rows of the `waves_df` DataFrame (your Jupyter Notebook might show all columns). `waves_df.head(15)` returns the first 15 rows; i.e. the _default_ value (recall the functions lesson) is 5, but we can change this by passing an argument to the method.
+> > 4.
+> > ~~~
+> >       record_id  buoy_id              Name  ... Operations  Seastate  Quadrant
+> > 2068       2069       16  west of Hebrides  ...       crew     swell     north
+> > 2069       2070       16  west of Hebrides  ...       crew     swell     north
+> > 2070       2071       16  west of Hebrides  ...       crew     swell     north
+> > 2071       2072       16  west of Hebrides  ...       crew     swell     north
+> > 2072       2073       16  west of Hebrides  ...       crew     swell     north
+> >
+> > [5 rows x 13 columns]
+> > ~~~
+> > {: .output}
+> > 
+> > So, `waves_df.tail()` returns the final 5 rows of the DataFrame. As with `head()`, we can control how many rows are returned by passing an argument.
+> {: .solution}
 {: .challenge}
 
 
@@ -360,11 +408,38 @@ array(['SW Isles of Scilly WaveNet Site', 'Hayling Island Waverider',
 > ## Challenge - Statistics
 >
 > 1. Create a list of unique site IDs ("buoy_id") found in the waves data. Call it
->   `buoy_ids`. How many unique sites are there in the data? How many unique
+>   `buoy_ids`. How many unique
 >   buoys are in the data?
 >
 > 2. What is the difference between using `len(buoy_id)` and `waves_df['buoy_id'].nunique()`?
 >    in this case, the result is the same but when might be the difference be important?
+> 
+> > ## Solution
+> > 1. 
+> > ~~~
+> > buoy_ids = pd.unique(waves_df["buoy_id"])
+> > print(buoy_ids)
+> > ~~~
+> > {: .language-python}
+> >
+> > ~~~
+> > [14  7  5  3 10  9  2 11  6 16]
+> > ~~~
+> > {: .output}
+> > 
+> > We could count the elements of `buoy_ids` by eye, or we could use `len(buoy_ids)` or `waves_df["buoy_id"].nunique()`; either way, we get 10.
+> >
+> > We can see the difference between `len()` and `nunique()` if we create a DataFrame with a `None` value:
+> >
+> > ~~~
+> > length_test = pd.DataFrame([1,2,3,None])
+> > print(len(length_test))
+> > print(length_test.nunique())
+> > ~~~
+> > {: .language-python}
+> > 
+> > We can see that `len()` returns 4, while `nunique()` returns 3. This is because `nunique()` ignores missing (`None`/`NaN`) values by default, whereas `len()` counts every row.
+> {: .solution}
 {: .challenge}
 
 ## Groups in Pandas
@@ -464,7 +539,10 @@ is much larger than the wave heights classified as 'windsea'.
 >   - `grouped_data2.mean()`
 > 3. Summarize Temperature values for swell and windsea states in your data. 
 >
->> ## Solution to 3
+>> ## Solution
+>> 1. The most complete answer is `waves_df.groupby("Quadrant").count()["record_id"][["north", "west"]]`
+>> 2. It groups by the second column _within_ the groups of the first column, and then calculates the mean of each group (n.b. depending on your version of pandas, you might need `grouped_data2.mean(numeric_only=True)`)
+>> 3. 
 >> ~~~
 >> waves_df.groupby(['Seastate'])["Temperature"].describe()
 >> ~~~