# (All Papers) Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition

### Preface: The Machine Learning Tsunami

[Geoffrey E. Hinton et al., “A Fast Learning Algorithm for Deep Belief Nets”, Neural Computation 18 (2006): 1527–1554.](https://www.cs.toronto.edu/~hinton/absps/ncfast.pdf)

## Part I. The Fundamentals of Machine Learning

### Chapter 1. The Machine Learning Landscape

[Richard Socher et al., “Zero-Shot Learning Through Cross-Modal Transfer”](https://arxiv.org/abs/1301.3666)

[Peter Norvig et al., “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems 24, no. 2 (2009): 8–12.](https://ieeexplore.ieee.org/document/4804817)

[Michele Banko and Eric Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (2001): 26–33.](https://aclanthology.org/P01-1005/)

[David Wolpert, “The Lack of A Priori Distinctions Between Learning Algorithms”, Neural Computation 8, no. 7 (1996): 1341–1390.](https://www.researchgate.net/publication/2755783_The_Lack_of_A_Priori_Distinctions_Between_Learning_Algorithms)

### Chapter 2. End-to-End Machine Learning Project

[R. Kelley Pace and Ronald Barry, “Sparse Spatial Autoregressions”, Statistics & Probability Letters 33, no. 3 (1997): 291–297.](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/ch02.html#idm45720235861264)

[Lars Buitinck et al., “API Design for Machine Learning Software: Experiences from the Scikit-Learn Project”, arXiv preprint arXiv:1309.0238 (2013).](https://arxiv.org/abs/1309.0238)

### Chapter 3. Classification

### Chapter 4. Training Models

[Mark Schmidt et al., “Minimizing Finite Sums with the Stochastic Average Gradient Algorithm”](https://www.cs.ubc.ca/~schmidtm/Documents/2014_Google_SAG.pdf)

### Chapter 5. Support Vector Machines

[Chih-Jen Lin et al., “A Dual Coordinate Descent Method for Large-Scale Linear SVM”, Proceedings of the 25th International Conference on Machine Learning (2008): 408–415.](https://icml.cc/Conferences/2008/papers/166.pdf)

[John Platt, “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines” (Microsoft Research technical report, April 21, 1998).](https://www.researchgate.net/publication/2624239_Sequential_Minimal_Optimization_A_Fast_Algorithm_for_Training_Support_Vector_Machines)

[Stephen Boyd and Lieven Vandenberghe, “Convex Optimization” (Cambridge University Press).](https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf)

[Gert Cauwenberghs and Tomaso Poggio, “Incremental and Decremental Support Vector Machine Learning”, Proceedings of the 13th International Conference on Neural Information Processing Systems (2000): 388–394.](https://dl.acm.org/doi/10.5555/3008751.3008808)

[Antoine Bordes et al., “Fast Kernel Classifiers with Online and Active Learning”, Journal of Machine Learning Research 6 (2005): 1579–1619.](https://www.researchgate.net/publication/220320193_Fast_Kernel_Classifiers_with_Online_and_Active_Learning)

### Chapter 6. Decision Trees

[Sebastian Raschka, “Machine Learning FAQ”](https://sebastianraschka.com/faq/docs/decision-tree-binary.html)

### Chapter 7. Ensemble Learning and Random Forests

[Leo Breiman, “Bagging Predictors”, Machine Learning 24, no. 2 (1996): 123–140.](https://link.springer.com/article/10.1007/BF00058655)

[Leo Breiman, “Pasting Small Votes for Classification in Large Databases and On-Line”, Machine Learning 36, no. 1–2 (1999): 85–103.](https://sci2s.ugr.es/keel/pdf/algorithm/articulo/1999-ML-Breiman-Pasting%20Small%20Votes%20for%20Classification%20in%20Large%20Databases%20and%20On-Line.pdf)

[Gilles Louppe and Pierre Geurts, “Ensembles on Random Patches”, Lecture Notes in Computer Science 7523 (2012): 346–361.](https://link.springer.com/chapter/10.1007/978-3-642-33460-3_28)

[Tin Kam Ho, “The Random Subspace Method for Constructing Decision Forests”, IEEE Transactions on Pattern Analysis and Machine Intelligence 20, no. 8 (1998): 832–844.](https://ieeexplore.ieee.org/document/709601)

[Tin Kam Ho, “Random Decision Forests”, Proceedings of the Third International Conference on Document Analysis and Recognition 1 (1995): 278.](https://ieeexplore.ieee.org/document/598994)

[Pierre Geurts et al., “Extremely Randomized Trees”, Machine Learning 63, no. 1 (2006): 3–42.](https://link.springer.com/article/10.1007/s10994-006-6226-1)

[Yoav Freund and Robert E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”, Journal of Computer and System Sciences 55, no. 1 (1997): 119–139.](https://www.sciencedirect.com/science/article/pii/S002200009791504X)

[Ji Zhu et al., “Multi-Class AdaBoost”, Statistics and Its Interface 2, no. 3 (2009): 349–360.](https://www.researchgate.net/publication/228947999_Multi-class_AdaBoost)

[Leo Breiman, “Arcing the Edge”](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=db2792b8b003f6caefca0c254fa0a52c15197162)

[Jerome H. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine”](https://www.jstor.org/stable/2699986)

[David H. Wolpert, “Stacked Generalization”, Neural Networks 5, no. 2 (1992): 241–259.](https://www.sciencedirect.com/science/article/abs/pii/S0893608005800231)

### Chapter 8. Dimensionality Reduction

[Karl Pearson, “On Lines and Planes of Closest Fit to Systems of Points in Space”, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, no. 11 (1901): 559–572.](https://www.tandfonline.com/doi/abs/10.1080/14786440109462720)

[David A. Ross et al., “Incremental Learning for Robust Visual Tracking”, International Journal of Computer Vision 77, no. 1–3 (2008): 125–141.](https://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf)

[Sanjoy Dasgupta et al., “A neural algorithm for a fundamental computing problem”, Science 358, no. 6364 (2017): 793–796.](https://www.its.caltech.edu/~jkenny/nb250c/papers/Dasgupta-2017.pdf)

[Sam T. Roweis and Lawrence K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding”, Science 290, no. 5500 (2000): 2323–2326.](https://www.science.org/doi/10.1126/science.290.5500.2323)

### Chapter 9. Unsupervised Learning Techniques

[Stuart P. Lloyd, “Least Squares Quantization in PCM”, IEEE Transactions on Information Theory 28, no. 2 (1982): 129–137.](https://ieeexplore.ieee.org/document/1056489)

[David Arthur and Sergei Vassilvitskii, “k-Means++: The Advantages of Careful Seeding”, Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (2007): 1027–1035.](https://theory.stanford.edu/~sergei/papers/kMeansPP-soda)

[Charles Elkan, “Using the Triangle Inequality to Accelerate k-Means”, Proceedings of the 20th International Conference on Machine Learning (2003): 147–153.](https://www.researchgate.net/publication/2480121_Using_the_Triangle_Inequality_to_Accelerate_K-Means)

[David Sculley, “Web-Scale K-Means Clustering”, Proceedings of the 19th International Conference on World Wide Web (2010): 1177–1178.](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b452a856a3e3d4d37b1de837996aa6813bedfdcf)

## Part II. Neural Networks and Deep Learning

### Chapter 10. Introduction to Artificial Neural Networks with Keras

[Warren S. McCulloch and Walter Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity”, The Bulletin of Mathematical Biology 5, no. 4 (1943): 115–133.](https://link.springer.com/article/10.1007/BF02478259)

[David Rumelhart et al., “Learning Internal Representations by Error Propagation” (Defense Technical Information Center technical report, September 1985).](https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap8_PDP86.pdf)

[Heng-Tze Cheng et al., “Wide & Deep Learning for Recommender Systems”, Proceedings of the First Workshop on Deep Learning for Recommender Systems (2016): 7–10.](https://arxiv.org/abs/1606.07792)

[Lisha Li et al., “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization”, Journal of Machine Learning Research 18 (April 2018): 1–52.](https://arxiv.org/abs/1603.06560)

[Max Jaderberg et al., “Population Based Training of Neural Networks”, arXiv preprint arXiv:1711.09846 (2017).](https://arxiv.org/abs/1711.09846)

[Dominic Masters and Carlo Luschi, “Revisiting Small Batch Training for Deep Neural Networks”, arXiv preprint arXiv:1804.07612 (2018).](https://www.arxiv.org/abs/1804.07612)

[Elad Hoffer et al., “Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks”, Proceedings of the 31st International Conference on Neural Information Processing Systems (2017): 1729–1739.](https://arxiv.org/abs/1705.08741)

[Priya Goyal et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv preprint arXiv:1706.02677 (2017).](https://arxiv.org/abs/1706.02677)

[Leslie N. Smith, “A Disciplined Approach to Neural Network Hyper-Parameters: Part 1—Learning Rate, Batch Size, Momentum, and Weight Decay”, arXiv preprint arXiv:1803.09820 (2018).](https://arxiv.org/abs/1803.09820)

### Chapter 11. Training Deep Neural Networks

[Xavier Glorot and Yoshua Bengio, “Understanding the Difficulty of Training Deep Feedforward Neural Networks”, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (2010): 249–256.](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)

[Kaiming He et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, Proceedings of the 2015 IEEE International Conference on Computer Vision (2015): 1026–1034.](https://arxiv.org/abs/1502.01852)

[Bing Xu et al., “Empirical Evaluation of Rectified Activations in Convolutional Network”, arXiv preprint arXiv:1505.00853 (2015).](https://arxiv.org/abs/1505.00853)

[Djork-Arné Clevert et al., “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)”, Proceedings of the International Conference on Learning Representations, arXiv preprint (2015).](https://arxiv.org/abs/1511.07289)

[Günter Klambauer et al., “Self-Normalizing Neural Networks”, Proceedings of the 31st International Conference on Neural Information Processing Systems (2017): 972–981.](https://arxiv.org/pdf/1706.02515)

[Dan Hendrycks and Kevin Gimpel, “Gaussian Error Linear Units (GELUs)”, arXiv preprint arXiv:1606.08415 (2016).](https://arxiv.org/abs/1606.08415)

[Prajit Ramachandran et al., “Searching for Activation Functions”, arXiv preprint arXiv:1710.05941 (2017).](https://arxiv.org/abs/1710.05941)

[Diganta Misra, “Mish: A Self Regularized Non-Monotonic Activation Function”, arXiv preprint arXiv:1908.08681 (2019).](https://arxiv.org/abs/1908.08681)

[Sergey Ioffe and Christian Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, Proceedings of the 32nd International Conference on Machine Learning (2015): 448–456.](https://arxiv.org/abs/1502.03167)

[Razvan Pascanu et al., “On the Difficulty of Training Recurrent Neural Networks”, Proceedings of the 30th International Conference on Machine Learning (2013): 1310–1318.](https://www.arxiv.org/abs/1211.5063)

[Boris T. Polyak, “Some Methods of Speeding Up the Convergence of Iteration Methods”, USSR Computational Mathematics and Mathematical Physics 4, no. 5 (1964): 1–17.](https://www.researchgate.net/publication/243648538_Some_methods_of_speeding_up_the_convergence_of_iteration_methods)

[Yurii Nesterov, “A Method for Unconstrained Convex Minimization Problem with the Rate of Convergence O(1/k²)”, Doklady AN USSR 269 (1983): 543–547.](https://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=dan&paperid=46009&option_lang=eng)

[John Duchi et al., “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization”, Journal of Machine Learning Research 12 (2011): 2121–2159.](https://jmlr.org/papers/v12/duchi11a.html)

[Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, “Neural Networks for Machine Learning”](https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)

[Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization”, arXiv preprint arXiv:1412.6980 (2014).](https://arxiv.org/abs/1412.6980)

[Timothy Dozat, “Incorporating Nesterov Momentum into Adam” (2016).](https://openreview.net/pdf/OM0jvwB8jIp57ZJjtNEZ.pdf)

[Ilya Loshchilov and Frank Hutter, “Decoupled Weight Decay Regularization”, arXiv preprint arXiv:1711.05101 (2017).](https://arxiv.org/abs/1711.05101)

[Ashia C. Wilson et al., “The Marginal Value of Adaptive Gradient Methods in Machine Learning”, Advances in Neural Information Processing Systems 30 (2017): 4148–4158.](https://www.arxiv.org/abs/1705.08292)

[Leslie N. Smith, “A Disciplined Approach to Neural Network Hyper-Parameters: Part 1—Learning Rate, Batch Size, Momentum, and Weight Decay”, arXiv preprint arXiv:1803.09820 (2018).](https://arxiv.org/abs/1803.09820)

[Andrew Senior et al., “An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2013): 6724–6728.](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40808.pdf)

[Geoffrey E. Hinton et al., “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors”, arXiv preprint arXiv:1207.0580 (2012).](https://arxiv.org/abs/1207.0580)

[Nitish Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research 15 (2014): 1929–1958.](https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)

[Yarin Gal and Zoubin Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, Proceedings of the 33rd International Conference on Machine Learning (2016): 1050–1059.](https://proceedings.mlr.press/v48/gal16)

### Chapter 12. Custom Models and Training with TensorFlow

[Kaz Sato, “What makes TPUs fine-tuned for deep learning?”](https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning)

### Chapter 13. Loading and Preprocessing Data with TensorFlow

[Tomáš Mikolov et al., “Distributed Representations of Words and Phrases and Their Compositionality”, Proceedings of the 26th International Conference on Neural Information Processing Systems 2 (2013): 3111–3119.](https://arxiv.org/abs/1310.4546)

[Malvina Nissim et al., “Fair Is Better Than Sensational: Man Is to Doctor as Woman Is to Doctor”, arXiv preprint arXiv:1905.09866 (2019).](https://www.researchgate.net/publication/340110035_Fair_Is_Better_than_Sensational_Man_Is_to_Doctor_as_Woman_Is_to_Doctor)

### Chapter 14. Deep Computer Vision Using Convolutional Neural Networks

[David H. Hubel, “Single Unit Activity in Striate Cortex of Unrestrained Cats”, The Journal of Physiology 147 (1959): 226–238.](https://journals.scholarsportal.info/details/00223751/v147i0002/226_suaiscouc.xml)

[David H. Hubel and Torsten N. Wiesel, “Receptive Fields of Single Neurons in the Cat’s Striate Cortex”, The Journal of Physiology 148 (1959): 574–591.](https://www.bibsonomy.org/bibtex/202c5cf1ee910eadba5efa77b3cd043f6/idsia)

[David H. Hubel and Torsten N. Wiesel, “Receptive Fields and Functional Architecture of Monkey Striate Cortex”, The Journal of Physiology 195 (1968): 215–243.](https://pubmed.ncbi.nlm.nih.gov/4966457/)

[Kunihiko Fukushima, “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”, Biological Cybernetics 36 (1980): 193–202.](https://link.springer.com/article/10.1007/BF00344251)

[Yann LeCun et al., “Gradient-Based Learning Applied to Document Recognition”, Proceedings of the IEEE 86, no. 11 (1998): 2278–2324.](https://ieeexplore.ieee.org/document/726791)

[Alex Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, Proceedings of the 25th International Conference on Neural Information Processing Systems 1 (2012): 1097–1105.](https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf)

[Matthew D. Zeiler and Rob Fergus, “Visualizing and Understanding Convolutional Networks”, Proceedings of the European Conference on Computer Vision (2014): 818–833.](https://arxiv.org/abs/1311.2901)

[Christian Szegedy et al., “Going Deeper with Convolutions”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015): 1–9.](https://arxiv.org/abs/1409.4842)

[Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, arXiv preprint arXiv:1409.1556 (2014).](https://arxiv.org/abs/1409.1556)

[Kaiming He et al., “Deep Residual Learning for Image Recognition”, arXiv preprint arXiv:1512.03385 (2015).](https://arxiv.org/abs/1512.03385)

[Christian Szegedy et al., “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv preprint arXiv:1602.07261 (2016).](https://arxiv.org/abs/1602.07261)

[François Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions”, arXiv preprint arXiv:1610.02357 (2016).](https://arxiv.org/abs/1610.02357)

[Jie Hu et al., “Squeeze-and-Excitation Networks”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018): 7132–7141.](https://arxiv.org/abs/1709.01507)

[Saining Xie et al., “Aggregated Residual Transformations for Deep Neural Networks”, arXiv preprint arXiv:1611.05431 (2016).](https://arxiv.org/abs/1611.05431)

[Gao Huang et al., “Densely Connected Convolutional Networks”, arXiv preprint arXiv:1608.06993 (2016).](https://arxiv.org/abs/1608.06993)

[Andrew G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv preprint arXiv:1704.04861 (2017).](https://arxiv.org/abs/1704.04861)

[Chien-Yao Wang et al., “CSPNet: A New Backbone That Can Enhance Learning Capability of CNN”, arXiv preprint arXiv:1911.11929 (2019).](https://www.arxiv.org/abs/1911.11929)

[Mingxing Tan and Quoc V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, arXiv preprint arXiv:1905.11946 (2019).](https://arxiv.org/abs/1905.11946)

[Adriana Kovashka et al., “Crowdsourcing in Computer Vision”, Foundations and Trends in Computer Graphics and Vision 10, no. 3 (2014): 177–243.](https://www.researchgate.net/publication/311249150_Crowdsourcing_in_Computer_Vision)

[Jonathan Long et al., “Fully Convolutional Networks for Semantic Segmentation”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015): 3431–3440.](https://ieeexplore.ieee.org/document/7298965)

[Joseph Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016): 779–788.](https://ieeexplore.ieee.org/document/7780460)

[Wei Liu et al., “SSD: Single Shot Multibox Detector”, Proceedings of the 14th European Conference on Computer Vision 1 (2016): 21–37.](https://arxiv.org/abs/1512.02325)

[Shaoqing Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, Proceedings of the 28th International Conference on Neural Information Processing Systems 1 (2015): 91–99.](https://arxiv.org/abs/1506.01497)

[Mingxing Tan et al., “EfficientDet: Scalable and Efficient Object Detection”, arXiv preprint arXiv:1911.09070 (2019).](https://arxiv.org/abs/1911.09070)

[Nicolai Wojke et al., “Simple Online and Realtime Tracking with a Deep Association Metric”, arXiv preprint arXiv:1703.07402 (2017).](https://arxiv.org/abs/1703.07402)

[Kaiming He et al., “Mask R-CNN”, arXiv preprint arXiv:1703.06870 (2017).](https://arxiv.org/abs/1703.06870)

### Chapter 15. Processing Sequences Using RNNs and CNNs

[Vu Pham et al., “Dropout improves Recurrent Neural Networks for Handwriting Recognition”](https://arxiv.org/abs/1312.4569)

[Quoc V. Le et al., “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”](https://arxiv.org/abs/1504.00941)

[Nal Kalchbrenner and Phil Blunsom, “Recurrent Continuous Translation Models”, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013): 1700–1709.](https://www.semanticscholar.org/paper/Recurrent-Continuous-Translation-Models-Kalchbrenner-Blunsom/944a1cfd79dbfb6fef460360a0765ba790f4027a)

[César Laurent et al., “Batch Normalized Recurrent Neural Networks”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2016): 2657–2661.](https://arxiv.org/abs/1510.01378)

[Jimmy Lei Ba et al., “Layer Normalization”, arXiv preprint arXiv:1607.06450 (2016).](https://arxiv.org/abs/1607.06450)

[Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory”, Neural Computation 9, no. 8 (1997): 1735–1780.](https://www.researchgate.net/publication/13853244_Long_Short-term_Memory)

[Haşim Sak et al., “Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition”, arXiv preprint arXiv:1402.1128 (2014).](https://arxiv.org/abs/1402.1128)

[Wojciech Zaremba et al., “Recurrent Neural Network Regularization”, arXiv preprint arXiv:1409.2329 (2014).](https://arxiv.org/abs/1409.2329)

[Kyunghyun Cho et al., “Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation”, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014): 1724–1734.](https://arxiv.org/abs/1406.1078)

[Klaus Greff et al., “LSTM: A Search Space Odyssey”, IEEE Transactions on Neural Networks and Learning Systems 28, no. 10 (2017): 2222–2232.](https://arxiv.org/abs/1503.04069) This paper seems to show that all LSTM variants perform roughly the same.

[Aaron van den Oord et al., “WaveNet: A Generative Model for Raw Audio”, arXiv preprint arXiv:1609.03499 (2016).](https://arxiv.org/abs/1609.03499)

### Chapter 16. Natural Language Processing with RNNs and Attention

[Alan Turing, “Computing Machinery and Intelligence”, Mind 59 (1950): 433–460.](https://academic.oup.com/mind/article/LIX/236/433/986238)

[Alec Radford et al., “Learning to Generate Reviews and Discovering Sentiment”, arXiv preprint arXiv:1704.01444 (2017).](https://arxiv.org/abs/1704.01444)

[Rico Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units”, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics 1 (2016): 1715–1725.](https://arxiv.org/abs/1508.07909)

[Taku Kudo, “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates”, arXiv preprint arXiv:1804.10959 (2018).](https://arxiv.org/abs/1804.10959)

[Taku Kudo and John Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”, arXiv preprint arXiv:1808.06226 (2018).](https://arxiv.org/abs/1808.06226)

[Yonghui Wu et al., “Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation”, arXiv preprint arXiv:1609.08144 (2016).](https://arxiv.org/abs/1609.08144)

[Matthew Peters et al., “Deep Contextualized Word Representations”, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1 (2018): 2227–2237.](https://arxiv.org/abs/1802.05365)

[Jeremy Howard and Sebastian Ruder, “Universal Language Model Fine-Tuning for Text Classification”, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics 1 (2018): 328–339.](https://arxiv.org/abs/1801.06146)

[Daniel Cer et al., “Universal Sentence Encoder”, arXiv preprint arXiv:1803.11175 (2018).](https://arxiv.org/abs/1803.11175)

[Ilya Sutskever et al., “Sequence to Sequence Learning with Neural Networks”, arXiv preprint arXiv:1409.3215 (2014).](https://arxiv.org/abs/1409.3215)

[Samy Bengio et al., “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks”, arXiv preprint arXiv:1506.03099 (2015).](https://arxiv.org/abs/1506.03099)

[Sébastien Jean et al., “On Using Very Large Target Vocabulary for Neural Machine Translation”, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing 1 (2015): 1–10.](https://arxiv.org/abs/1412.2007)

[Dzmitry Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate”, arXiv preprint arXiv:1409.0473 (2014).](https://arxiv.org/abs/1409.0473)

[Minh-Thang Luong et al., “Effective Approaches to Attention-Based Neural Machine Translation”, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015): 1412–1421.](https://arxiv.org/abs/1508.04025)

[Ashish Vaswani et al., “Attention Is All You Need”, Proceedings of the 31st International Conference on Neural Information Processing Systems (2017): 6000–6010.](https://arxiv.org/abs/1706.03762)

[Alec Radford et al., “Improving Language Understanding by Generative Pre-Training” (2018).](https://paperswithcode.com/paper/improving-language-understanding-by)

[Jacob Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1 (2019).](https://arxiv.org/abs/1810.04805)

[Alec Radford et al., “Language Models Are Unsupervised Multitask Learners” (2019).](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

[William Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” (2021).](https://arxiv.org/abs/2101.03961)

[Victor Sanh et al., “DistilBERT, A Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”, arXiv preprint arXiv:1910.01108 (2019).](https://arxiv.org/abs/1910.01108)

[Colin Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”, arXiv preprint arXiv:1910.10683 (2019).](https://arxiv.org/abs/1910.10683)

[Aakanksha Chowdhery et al., “PaLM: Scaling Language Modeling with Pathways”, arXiv preprint arXiv:2204.02311 (2022).](https://arxiv.org/abs/2204.02311)

[Jason Wei et al., “Chain of Thought Prompting Elicits Reasoning in Large Language Models”, arXiv preprint arXiv:2201.11903 (2022).](https://arxiv.org/abs/2201.11903)

[Kelvin Xu et al., “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, Proceedings of the 32nd International Conference on Machine Learning (2015): 2048–2057.](https://proceedings.mlr.press/v37/xuc15)

[Marco Tulio Ribeiro et al., “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier”, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016): 1135–1144.](https://arxiv.org/abs/1602.04938)

[Nicolas Carion et al., “End-to-End Object Detection with Transformers”, arXiv preprint arXiv:2005.12872 (2020).](https://arxiv.org/abs/2005.12872)

[Alexey Dosovitskiy et al., “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”, arXiv preprint arXiv:2010.11929 (2020).](https://arxiv.org/abs/2010.11929)

[Hugo Touvron et al., “Training Data-Efficient Image Transformers & Distillation Through Attention”, arXiv preprint arXiv:2012.12877 (2020).](https://arxiv.org/abs/2012.12877)

[Andrew Jaegle et al., “Perceiver: General Perception with Iterative Attention”, arXiv preprint arXiv:2103.03206 (2021).](https://arxiv.org/abs/2103.03206)

[Mathilde Caron et al., “Emerging Properties in Self-Supervised Vision Transformers”, arXiv preprint arXiv:2104.14294 (2021).](https://arxiv.org/abs/2104.14294)

[Xiaohua Zhai et al., “Scaling Vision Transformers”, arXiv preprint arXiv:2106.04560v1 (2021).](https://dokumen.tips/documents/scaling-vision-transformers.html?page=1)

[Mitchell Wortsman et al., “Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy Without Increasing Inference Time”, arXiv preprint arXiv:2203.05482v1 (2022).](https://arxiv.org/abs/2203.05482)

[Alec Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, arXiv preprint arXiv:2103.00020 (2021).](https://arxiv.org/abs/2103.00020)

[Aditya Ramesh et al., “Zero-Shot Text-to-Image Generation”, arXiv preprint arXiv:2102.12092 (2021).](https://arxiv.org/abs/2102.12092)

[Aditya Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv preprint arXiv:2204.06125 (2022).](https://arxiv.org/abs/2204.06125)

[Jean-Baptiste Alayrac et al., “Flamingo: a Visual Language Model for Few-Shot Learning”, arXiv preprint arXiv:2204.14198 (2022).](https://arxiv.org/abs/2204.14198)

[Scott Reed et al., “A Generalist Agent”, arXiv preprint arXiv:2205.06175 (2022).](https://arxiv.org/abs/2205.06175)

### Chapter 17. Autoencoders, GANs, and Diffusion Models

[William G. Chase and Herbert A. Simon, “Perception in Chess”, Cognitive Psychology 4, no. 1 (1973): 55–81.](https://www.sciencedirect.com/science/article/abs/pii/0010028573900042)

[Yoshua Bengio et al., “Greedy Layer-Wise Training of Deep Networks”, Proceedings of the 19th International Conference on Neural Information Processing Systems (2006): 153–160.](https://www.researchgate.net/publication/200744514_Greedy_layer-wise_training_of_deep_networks)

[Jonathan Masci et al., “Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction”, Proceedings of the 21st International Conference on Artificial Neural Networks 1 (2011): 52–59.](https://link.springer.com/chapter/10.1007/978-3-642-21735-7_7)

[Pascal Vincent et al., “Extracting and Composing Robust Features with Denoising Autoencoders”, Proceedings of the 25th International Conference on Machine Learning (2008): 1096–1103.](https://www.researchgate.net/publication/221346269_Extracting_and_composing_robust_features_with_denoising_autoencoders)

[Pascal Vincent et al., “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion”, Journal of Machine Learning Research 11 (2010): 3371–3408.](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf)

[Diederik Kingma and Max Welling, “Auto-Encoding Variational Bayes”, arXiv preprint arXiv:1312.6114 (2013).](https://arxiv.org/abs/1312.6114)

[Ian Goodfellow et al., “Generative Adversarial Nets”, Proceedings of the 27th International Conference on Neural Information Processing Systems 2 (2014): 2672–2680.](https://papers.nips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html)

[Mario Lucic et al., “Are GANs Created Equal? A Large-Scale Study”, Proceedings of the 32nd International Conference on Neural Information Processing Systems (2018): 698–707.](https://papers.nips.cc/paper_files/paper/2018/hash/e46de7e1bcaaced9a54f1e9d0d2f800d-Abstract.html)

[Alec Radford et al., “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, arXiv preprint arXiv:1511.06434 (2015).](https://arxiv.org/abs/1511.06434)

[Mehdi Mirza and Simon Osindero, “Conditional Generative Adversarial Nets”, arXiv preprint arXiv:1411.1784 (2014).](https://arxiv.org/abs/1411.1784)

[Tero Karras et al., “Progressive Growing of GANs for Improved Quality, Stability, and Variation”, Proceedings of the International Conference on Learning Representations (2018).](https://arxiv.org/abs/1710.10196)

[Tero Karras et al., “A Style-Based Generator Architecture for Generative Adversarial Networks”, arXiv preprint arXiv:1812.04948 (2018).](https://arxiv.org/abs/1812.04948)

[Jascha Sohl-Dickstein et al., “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”, arXiv preprint arXiv:1503.03585 (2015).](https://arxiv.org/abs/1503.03585)

[Jonathan Ho et al., “Denoising Diffusion Probabilistic Models” (2020).](https://arxiv.org/abs/2006.11239)

[Alex Nichol and Prafulla Dhariwal, “Improved Denoising Diffusion Probabilistic Models” (2021).](https://arxiv.org/abs/2102.09672)

[Olaf Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv preprint arXiv:1505.04597 (2015).](https://arxiv.org/abs/1505.04597)

[Robin Rombach, Andreas Blattmann, et al., “High-Resolution Image Synthesis with Latent Diffusion Models”, arXiv preprint arXiv:2112.10752 (2021).](https://arxiv.org/abs/2112.10752)

### Chapter 18. Reinforcement Learning

[Volodymyr Mnih et al., “Playing Atari with Deep Reinforcement Learning”, arXiv preprint arXiv:1312.5602 (2013).](https://arxiv.org/abs/1312.5602)

[Volodymyr Mnih et al., “Human-Level Control Through Deep Reinforcement Learning”, Nature 518 (2015): 529–533.](https://www.researchgate.net/publication/272837232_Human-level_control_through_deep_reinforcement_learning)

[Ronald J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, Machine Learning 8 (1992): 229–256.](https://link.springer.com/article/10.1007/BF00992696)

[Richard Bellman, “A Markovian Decision Process”, Journal of Mathematics and Mechanics 6, no. 5 (1957): 679–684.](https://www.jstor.org/stable/24900506)

[Alex Irpan, “Deep Reinforcement Learning Doesn’t Work Yet”](https://www.alexirpan.com/2018/02/14/rl-hard.html)

[Hado van Hasselt et al., “Deep Reinforcement Learning with Double Q-Learning”, Proceedings of the 30th AAAI Conference on Artificial Intelligence (2015): 2094–2100.](https://arxiv.org/abs/1509.06461)

[Tom Schaul et al., “Prioritized Experience Replay”, arXiv preprint arXiv:1511.05952 (2015).](https://arxiv.org/abs/1511.05952)

[Ziyu Wang et al., “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint arXiv:1511.06581 (2015).](https://arxiv.org/abs/1511.06581)

[Matteo Hessel et al., “Rainbow: Combining Improvements in Deep Reinforcement Learning”, arXiv preprint arXiv:1710.02298 (2017): 3215–3222.](https://arxiv.org/abs/1710.02298)

[David Silver et al., “Mastering the Game of Go with Deep Neural Networks and Tree Search”, Nature 529 (2016): 484–489.](https://www.nature.com/articles/nature16961)

[David Silver et al., “Mastering the Game of Go Without Human Knowledge”, Nature 550 (2017): 354–359.](https://www.nature.com/articles/nature24270)

[David Silver et al., “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”, arXiv preprint arXiv:1712.01815 (2017).](https://www.arxiv.org/abs/1712.01815)

[Julian Schrittwieser et al., “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, arXiv preprint arXiv:1911.08265 (2019).](https://arxiv.org/abs/1911.08265)

[Volodymyr Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning”, Proceedings of the 33rd International Conference on Machine Learning (2016): 1928–1937.](https://arxiv.org/abs/1602.01783)

[Tuomas Haarnoja et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor”, Proceedings of the 35th International Conference on Machine Learning (2018): 1856–1865.](https://arxiv.org/abs/1801.01290)

[John Schulman et al., “Proximal Policy Optimization Algorithms”, arXiv preprint arXiv:1707.06347 (2017).](https://arxiv.org/abs/1707.06347)

[John Schulman et al., “Trust Region Policy Optimization”, Proceedings of the 32nd International Conference on Machine Learning (2015): 1889–1897.](https://arxiv.org/abs/1502.05477)

[Deepak Pathak et al., “Curiosity-Driven Exploration by Self-Supervised Prediction”, Proceedings of the 34th International Conference on Machine Learning (2017): 2778–2787.](https://arxiv.org/abs/1705.05363)

[Rui Wang et al., “Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions”, arXiv preprint arXiv:1901.01753 (2019).](https://arxiv.org/abs/1901.01753)

[Rui Wang et al., “Enhanced POET: Open-Ended Reinforcement Learning Through Unbounded Invention of Learning Challenges and Their Solutions”, arXiv preprint arXiv:2003.08536 (2020).](https://arxiv.org/abs/2003.08536)

[Open-Ended Learning Team et al., “Open-Ended Learning Leads to Generally Capable Agents”, arXiv preprint arXiv:2107.12808 (2021).](https://arxiv.org/abs/2107.12808)

### Chapter 19. Training and Deploying TensorFlow Models at Scale

[Jianmin Chen et al., “Revisiting Distributed Synchronous SGD”, arXiv preprint arXiv:1604.00981 (2016).](https://arxiv.org/abs/1604.00981)

[Aaron Harlap et al., “PipeDream: Fast and Efficient Pipeline Parallel DNN Training”, arXiv preprint arXiv:1806.03377 (2018).](https://arxiv.org/abs/1806.03377)

[Paul Barham et al., “Pathways: Asynchronous Distributed Dataflow for ML”, arXiv preprint arXiv:2203.12533 (2022).](https://arxiv.org/pdf/2203.12533)

### Appendix A. Machine Learning Project Checklist

[Jasper Snoek et al., “Practical Bayesian Optimization of Machine Learning Algorithms”, Proceedings of the 25th International Conference on Neural Information Processing Systems 2 (2012): 2951–2959.](https://arxiv.org/abs/1206.2944)