library(tidyverse)
library(tidyfinance)
library(scales)
Working with Stock Returns
You are reading Tidy Finance with R. You can find the equivalent chapter for the sibling Tidy Finance with Python here.
The main aim of this chapter is to familiarize you with the tidyverse for working with stock market data. We focus on downloading and visualizing stock data from Yahoo Finance.
At the start of each session, we load the required R packages. Throughout the entire book, we always use the tidyverse (Wickham et al. 2019). In this chapter, we also load the tidyfinance package to download stock price data. This package provides a convenient wrapper for various quantitative functions compatible with the tidyverse and our book. Finally, the package scales (Wickham and Seidel 2022) provides useful scale functions for visualizations.

You typically have to install a package once before you can load it. In case you have not done this yet, call, for instance, install.packages("tidyfinance").
We first download daily prices for one stock symbol, e.g., the Apple stock, AAPL, directly from the data provider Yahoo Finance. To download the data, you can use the function download_data(). If you do not know how to use it, make sure you read the help file by calling ?download_data. We especially recommend taking a look at the examples section of the documentation. We request daily data for a period of more than 20 years.
prices <- download_data(
  type = "stock_prices",
  symbols = "AAPL",
  start_date = "2000-01-01",
  end_date = "2023-12-31"
)
prices
# A tibble: 6,037 × 8
  symbol date          volume  open   low  high close adjusted_close
  <chr>  <date>         <dbl> <dbl> <dbl> <dbl> <dbl>          <dbl>
1 AAPL   2000-01-03 535796800 0.936 0.908 1.00  0.999          0.843
2 AAPL   2000-01-04 512377600 0.967 0.903 0.988 0.915          0.772
3 AAPL   2000-01-05 778321600 0.926 0.920 0.987 0.929          0.783
4 AAPL   2000-01-06 767972800 0.948 0.848 0.955 0.848          0.716
5 AAPL   2000-01-07 460734400 0.862 0.853 0.902 0.888          0.749
# ℹ 6,032 more rows
The function download_data(type = "stock_prices") downloads stock market data from Yahoo Finance. It returns a tibble with eight quite self-explanatory columns: symbol, date, the daily volume (in the number of traded shares), the market prices at the open, high, low, and close, and the adjusted_close price in USD. The adjusted prices are corrected for anything that might affect the stock price after the market closes, e.g., stock splits and dividends. These actions affect the quoted prices, but they have no direct impact on the investors who hold the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns an investor would have earned by holding the stock continuously.
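To build intuition for this adjustment, consider the following minimal sketch with made-up numbers (not actual Apple data): a hypothetical 2-for-1 split halves the quoted price, but the value of an investor's position is unchanged because the number of shares doubles. Adjusted prices remove exactly this kind of artificial jump from the price series.

# Made-up illustration of a 2-for-1 stock split (hypothetical numbers):
# the quoted close drops by half, but the investor now holds twice as
# many shares, so the value of the position is unaffected.
split_example <- tibble(
  date = as.Date(c("2024-06-07", "2024-06-10")),
  close = c(100, 50),
  shares_held = c(1, 2)
)
split_example |>
  mutate(position_value = close * shares_held)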
Next, we use the ggplot2 package (Wickham 2016) to visualize the time series of adjusted prices in Figure 1. This package takes care of visualization tasks based on the principles of the grammar of graphics (Wilkinson 2012).
prices |>
  ggplot(aes(x = date, y = adjusted_close)) +
  geom_line() +
  labs(
    x = NULL,
    y = NULL,
    title = "Apple stock prices between beginning of 2000 and end of 2023"
  )
Instead of analyzing prices, we compute daily net returns defined as \(r_t = p_t / p_{t-1} - 1\), where \(p_t\) is the adjusted price on day \(t\). In that context, the function lag() is helpful: it returns the previous value in a vector.
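To see how lag() behaves before we apply it to our data, here is a minimal sketch on a made-up price vector:

p <- c(100, 102, 101) # made-up prices
lag(p)                # returns NA 100 102
p / lag(p) - 1        # returns NA 0.02 -0.0098 (approximately)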
returns <- prices |>
  arrange(date) |>
  mutate(ret = adjusted_close / lag(adjusted_close) - 1) |>
  select(symbol, date, ret)
returns
# A tibble: 6,037 × 3
  symbol date            ret
  <chr>  <date>        <dbl>
1 AAPL   2000-01-03 NA
2 AAPL   2000-01-04 -0.0843
3 AAPL   2000-01-05  0.0146
4 AAPL   2000-01-06 -0.0865
5 AAPL   2000-01-07  0.0474
# ℹ 6,032 more rows
The resulting tibble contains three columns, where the last contains the daily returns (ret). Note that the first entry naturally contains a missing value (NA) because there is no previous price. Obviously, the use of lag() would be meaningless if the time series were not ordered by ascending dates. The command arrange() provides a convenient way to order observations in the correct way for our application. In case you want to order observations by descending dates, you can use arrange(desc(date)).
For the upcoming examples, we remove missing values, as these would require separate treatment when computing, e.g., sample averages. In general, however, make sure you understand why NA values occur and carefully examine if you can simply get rid of these observations.
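As a quick illustration of why this matters, note how a single NA propagates through a sample average unless you treat it explicitly (made-up numbers):

x <- c(0.01, NA, -0.02)
mean(x)               # NA, because the missing value propagates
mean(x, na.rm = TRUE) # -0.005, computed on the non-missing values only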
returns <- returns |>
  drop_na(ret)
Next, we visualize the distribution of daily returns in a histogram in Figure 2. Additionally, we add a dashed line that indicates the 5 percent quantile of the daily returns to the histogram, which is a (crude) proxy for the worst return of the stock with a probability of at most 5 percent. The 5 percent quantile is closely connected to the (historical) value-at-risk, a risk measure commonly monitored by regulators. We refer to Tsay (2010) for a more thorough introduction to stylized facts of returns.
quantile_05 <- quantile(returns |> pull(ret), probs = 0.05)

returns |>
  ggplot(aes(x = ret)) +
  geom_histogram(bins = 100) +
  geom_vline(aes(xintercept = quantile_05),
    linetype = "dashed"
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Distribution of daily Apple stock returns"
  ) +
  scale_x_continuous(labels = percent)
Here, bins = 100 determines the number of bins used in the illustration and hence implicitly the width of the bins. Before proceeding, make sure you understand how to use the geom geom_vline() to add a dashed line that indicates the 5 percent quantile of the daily returns.
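If you prefer to fix the width of the bins instead of their number, geom_histogram() also accepts a binwidth argument; the value below (half a percentage point per bin) is just an illustrative choice:

returns |>
  ggplot(aes(x = ret)) +
  geom_histogram(binwidth = 0.005) # each bin spans 0.5 percentage points

A typical task before proceeding with any data is to compute summary statistics for the main variables of interest.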
returns |>
  summarize(across(
    ret,
    list(
      daily_mean = mean,
      daily_sd = sd,
      daily_min = min,
      daily_max = max
    )
  ))
# A tibble: 1 × 4
  ret_daily_mean ret_daily_sd ret_daily_min ret_daily_max
           <dbl>        <dbl>         <dbl>         <dbl>
1        0.00122       0.0247        -0.519         0.139
We see that the maximum daily return was 13.905 percent. Perhaps not surprisingly, the average daily return is close to, but slightly above, 0. In line with the illustration above, the large loss on the day with the minimum return indicates a strong asymmetry in the distribution of returns.
You can also compute these summary statistics for each year individually by imposing group_by(year = year(date)), where the call year(date) returns the year. More specifically, the few lines of code below compute the summary statistics from above for individual groups of data defined by year. The summary statistics, therefore, allow an eyeball analysis of the time-series dynamics of the return distribution.
returns |>
  group_by(year = year(date)) |>
  summarize(across(
    ret,
    list(
      daily_mean = mean,
      daily_sd = sd,
      daily_min = min,
      daily_max = max
    ),
    .names = "{.fn}"
  )) |>
  print(n = Inf)
# A tibble: 24 × 5
    year  daily_mean daily_sd daily_min daily_max
   <dbl>       <dbl>    <dbl>     <dbl>     <dbl>
 1  2000 -0.00346      0.0549  -0.519      0.137
 2  2001  0.00233      0.0393  -0.172      0.129
 3  2002 -0.00121      0.0305  -0.150      0.0846
 4  2003  0.00186      0.0234  -0.0814     0.113
 5  2004  0.00470      0.0255  -0.0558     0.132
 6  2005  0.00349      0.0245  -0.0921     0.0912
 7  2006  0.000950     0.0243  -0.0633     0.118
 8  2007  0.00366      0.0238  -0.0702     0.105
 9  2008 -0.00265      0.0367  -0.179      0.139
10  2009  0.00382      0.0214  -0.0502     0.0676
11  2010  0.00183      0.0169  -0.0496     0.0769
12  2011  0.00104      0.0165  -0.0559     0.0589
13  2012  0.00130      0.0186  -0.0644     0.0887
14  2013  0.000472     0.0180  -0.124      0.0514
15  2014  0.00145      0.0136  -0.0799     0.0820
16  2015  0.0000199    0.0168  -0.0612     0.0574
17  2016  0.000575     0.0147  -0.0657     0.0650
18  2017  0.00164      0.0111  -0.0388     0.0610
19  2018 -0.0000573    0.0181  -0.0663     0.0704
20  2019  0.00266      0.0165  -0.0996     0.0683
21  2020  0.00281      0.0294  -0.129      0.120
22  2021  0.00131      0.0158  -0.0417     0.0539
23  2022 -0.000970     0.0225  -0.0587     0.0890
24  2023  0.00168      0.0128  -0.0480     0.0469
In case you wonder: the additional argument .names = "{.fn}" in across() determines how to name the output columns. The specification is rather flexible and allows almost arbitrary column names, which can be useful for reporting. The print() function simply controls the output options for the R console.
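For instance, the default specification "{.col}_{.fn}" produced the column name ret_daily_mean above, while "{.fn}" keeps only the function label. Any glue-style combination works; the suffix in the following sketch is an arbitrary illustration:

returns |>
  summarize(across(ret, list(daily_mean = mean), .names = "{.fn}_of_{.col}"))
# returns a single column named daily_mean_of_ret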
Scaling Up the Analysis
As a next step, we generalize the code from before such that all the computations can handle an arbitrary vector of symbols (e.g., all constituents of an index). Following tidy principles, it is quite easy to download the data, plot the price time series, and tabulate the summary statistics for an arbitrary number of assets.

This is where the tidyverse magic starts: tidy data makes it extremely easy to generalize the computations from before to as many assets as you like. The following code takes any vector of symbols, e.g., symbol <- c("AAPL", "MMM", "BA"), and automates the download as well as the plot of the price time series. In the end, we create the table of summary statistics for an arbitrary number of assets. We perform the analysis with data from all current constituents of the Dow Jones Industrial Average index.
symbols <- download_data(type = "constituents", index = "Dow Jones Industrial Average")
symbols
# A tibble: 30 × 5
  symbol name                    location      exchange currency
  <chr>  <chr>                   <chr>         <chr>    <chr>
1 GS     GOLDMAN SACHS GROUP INC United States New Yor… USD
2 UNH    UNITEDHEALTH GROUP INC  United States New Yor… USD
3 MSFT   MICROSOFT CORP          United States NASDAQ   USD
4 HD     HOME DEPOT INC          United States New Yor… USD
5 CAT    CATERPILLAR INC         United States New Yor… USD
# ℹ 25 more rows
Conveniently, tidyfinance provides the functionality to get all stock prices from an index with a single call.
prices_daily <- download_data(
  type = "stock_prices",
  symbols = symbols$symbol,
  start_date = "2000-01-01",
  end_date = "2023-12-31"
)
The resulting tibble contains 177,925 daily observations for 30 different stocks (GS, UNH, MSFT, HD, CAT, SHW, CRM, V, AXP, MCD, AMGN, AAPL, TRV, JPM, HON, AMZN, IBM, BA, PG, CVX, JNJ, NVDA, MMM, DIS, MRK, WMT, NKE, KO, CSCO, and VZ). Figure 3 illustrates the time series of downloaded adjusted prices for each of the constituents of the Dow index. Make sure you understand every single line of code! What are the arguments of aes()? Which alternative geoms could you use to visualize the time series? Hint: if you do not know the answers, try to change the code to see what difference your intervention causes.
fig_prices <- prices_daily |>
  ggplot(aes(
    x = date,
    y = adjusted_close,
    color = symbol
  )) +
  geom_line() +
  labs(
    x = NULL,
    y = NULL,
    color = NULL,
    title = "Stock prices of Dow index constituents"
  ) +
  theme(legend.position = "none")
fig_prices
Do you notice the small differences relative to the code we used before? All we need to do to illustrate all stock symbols simultaneously is to include color = symbol in the ggplot aesthetics. In this way, we generate a separate line for each symbol. Of course, there are simply too many lines on this graph to identify the individual stocks properly, but it illustrates the point well.
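If you nevertheless want to tell individual stocks apart, one alternative to the color aesthetic is to draw small multiples with facet_wrap(); a minimal sketch (free y-scales are our choice here because price levels differ widely across stocks):

prices_daily |>
  ggplot(aes(x = date, y = adjusted_close)) +
  geom_line() +
  facet_wrap(~symbol, scales = "free_y") + # one panel per symbol
  labs(x = NULL, y = NULL)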
The same holds for stock returns. Before computing the returns, we use group_by(symbol) such that the mutate() command is performed for each symbol individually. The same logic also applies to the computation of summary statistics: group_by(symbol) is the key to aggregating the time series into symbol-specific variables of interest.
returns_daily <- prices_daily |>
  group_by(symbol) |>
  mutate(ret = adjusted_close / lag(adjusted_close) - 1) |>
  select(symbol, date, ret) |>
  drop_na(ret)

returns_daily |>
  group_by(symbol) |>
  summarize(across(
    ret,
    list(
      daily_mean = mean,
      daily_sd = sd,
      daily_min = min,
      daily_max = max
    ),
    .names = "{.fn}"
  )) |>
  print(n = Inf)
# A tibble: 30 × 5
   symbol daily_mean daily_sd daily_min daily_max
   <chr>       <dbl>    <dbl>     <dbl>     <dbl>
 1 AAPL    0.00122     0.0247   -0.519      0.139
 2 AMGN    0.000493    0.0194   -0.134      0.151
 3 AMZN    0.00107     0.0315   -0.248      0.345
 4 AXP     0.000544    0.0227   -0.176      0.219
 5 BA      0.000628    0.0222   -0.238      0.243
 6 CAT     0.000724    0.0203   -0.145      0.147
 7 CRM     0.00119     0.0266   -0.271      0.260
 8 CSCO    0.000322    0.0234   -0.162      0.244
 9 CVX     0.000511    0.0175   -0.221      0.227
10 DIS     0.000414    0.0194   -0.184      0.160
11 GS      0.000557    0.0229   -0.190      0.265
12 HD      0.000544    0.0192   -0.287      0.141
13 HON     0.000497    0.0191   -0.174      0.282
14 IBM     0.000297    0.0163   -0.155      0.120
15 JNJ     0.000379    0.0121   -0.158      0.122
16 JPM     0.000606    0.0238   -0.207      0.251
17 KO      0.000318    0.0131   -0.101      0.139
18 MCD     0.000536    0.0145   -0.159      0.181
19 MMM     0.000363    0.0151   -0.129      0.126
20 MRK     0.000371    0.0166   -0.268      0.130
21 MSFT    0.000573    0.0193   -0.156      0.196
22 NKE     0.000708    0.0193   -0.198      0.155
23 NVDA    0.00175     0.0376   -0.352      0.424
24 PG      0.000362    0.0133   -0.302      0.120
25 SHW     0.000860    0.0180   -0.208      0.153
26 TRV     0.000555    0.0181   -0.208      0.256
27 UNH     0.000948    0.0196   -0.186      0.348
28 V       0.000933    0.0185   -0.136      0.150
29 VZ      0.000238    0.0151   -0.118      0.146
30 WMT     0.000323    0.0148   -0.114      0.117
Note that you are now also equipped with all tools to download price data for each symbol listed in the S&P 500 index with the same number of lines of code. Just use symbol <- download_data(type = "constituents", index = "S&P 500"), which provides you with a tibble that contains each symbol that is (currently) part of the S&P 500. However, don’t try this if you are not prepared to wait for a couple of minutes, because this is quite a lot of data to download!
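For reference, a sketch of that workflow looks as follows; only the index name changes relative to the Dow example above, and the object names are our choice:

# Not run here: ~500 symbols over 20+ years is a sizable download
symbols_sp500 <- download_data(type = "constituents", index = "S&P 500")
prices_sp500 <- download_data(
  type = "stock_prices",
  symbols = symbols_sp500$symbol,
  start_date = "2000-01-01",
  end_date = "2023-12-31"
)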
Other Forms of Data Aggregation
Of course, aggregation across variables other than symbol can also make sense. For instance, suppose you are interested in answering the question: Are days with high aggregate trading volume likely followed by days with high aggregate trading volume? To provide some initial analysis of this question, we take the downloaded data and compute aggregate daily trading volume for all Dow index constituents in USD. Recall that the column volume is denoted in the number of traded shares. Thus, we multiply the trading volume by the daily adjusted closing price to get a proxy for the aggregate trading volume in USD. Scaling by 1e9 (R can handle scientific notation) denotes daily trading volume in billion USD.
trading_volume <- prices_daily |>
  group_by(date) |>
  summarize(trading_volume = sum(volume * adjusted_close))

fig_trading_volume <- trading_volume |>
  ggplot(aes(x = date, y = trading_volume)) +
  geom_line() +
  labs(
    x = NULL, y = NULL,
    title = "Aggregate daily trading volume of Dow index constituents"
  ) +
  scale_y_continuous(labels = unit_format(unit = "B", scale = 1e-9))
fig_trading_volume
Figure 4 indicates a clear upward trend in aggregated daily trading volume. In particular, since the outbreak of the COVID-19 pandemic, markets have processed substantial trading volumes, as analyzed, for instance, by Goldstein, Koijen, and Mueller (2021). One way to illustrate the persistence of trading volume would be to plot volume on day \(t\) against volume on day \(t-1\) as in the example below. In Figure 5, we add a dashed 45°-line to indicate a hypothetical one-to-one relation by geom_abline(), addressing potential differences in the axes’ scales.
fig_persistence <- trading_volume |>
  ggplot(aes(x = lag(trading_volume), y = trading_volume)) +
  geom_point() +
  geom_abline(aes(intercept = 0, slope = 1),
    linetype = "dashed"
  ) +
  labs(
    x = "Previous day aggregate trading volume",
    y = "Aggregate trading volume",
    title = "Persistence in daily trading volume of Dow index constituents"
  ) +
  scale_x_continuous(labels = unit_format(unit = "B", scale = 1e-9)) +
  scale_y_continuous(labels = unit_format(unit = "B", scale = 1e-9))
fig_persistence
Warning: Removed 1 rows containing missing values (`geom_point()`).
Do you understand where the warning comes from and what it means? Purely eyeballing reveals that days with high trading volume are often followed by similarly high trading volume days.
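The warning appears because lag(trading_volume) is NA for the first observation, which has no previous day, so geom_point() drops that one point. If you want to avoid the warning altogether, you can compute the lagged volume explicitly and remove the incomplete first observation before plotting; a minimal sketch (previous_volume is a column name of our choosing):

trading_volume |>
  mutate(previous_volume = lag(trading_volume)) |>
  drop_na(previous_volume) |> # removes the first day, which has no predecessor
  ggplot(aes(x = previous_volume, y = trading_volume)) +
  geom_point()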
Key Takeaways
In this chapter, you learned how to effectively use R to download, analyze, and visualize stock market data using tidy principles. From downloading adjusted stock prices to computing returns, summarizing statistics, and visualizing trends, we have laid a solid foundation for working with financial data. Key takeaways include the importance of using adjusted prices for return calculations, leveraging tidyverse tools for efficient data manipulation, and employing visualizations like histograms and line charts to uncover insights. Scaling up analyses to handle multiple stocks or broader indices demonstrates the flexibility of tidy data workflows. Equipped with these foundational techniques, you are now ready to apply them to different contexts in financial economics in subsequent chapters.
Exercises
1. Download daily prices for another stock market symbol of your choice from Yahoo Finance with download_data() from the tidyfinance package. Plot two time series of the symbol’s un-adjusted and adjusted closing prices. Explain the differences.
2. Compute daily net returns for an asset of your choice and visualize the distribution of daily returns in a histogram using 100 bins. Also, use geom_vline() to add a dashed red vertical line that indicates the 5 percent quantile of the daily returns. Compute summary statistics (mean, standard deviation, minimum, and maximum) for the daily returns.
3. Take your code from before and generalize it such that you can perform all the computations for an arbitrary vector of symbols (e.g., symbol <- c("AAPL", "MMM", "BA")). Automate the download, the plot of the price time series, and create a table of return summary statistics for this arbitrary number of assets.
4. Are days with high aggregate trading volume often also days with large absolute returns? Find an appropriate visualization to analyze the question using the symbol AAPL.
5. Compute monthly returns from the downloaded stock market prices. Compute the vector of historical average returns and the sample variance-covariance matrix. Compute the minimum variance portfolio weights and the portfolio volatility and average returns. Visualize the mean-variance efficient frontier. Choose one of your assets and identify the portfolio which yields the same historical volatility but achieves the highest possible average return.