What do you need to do Machine Learning?

What you will learn

In this guide you will learn which elements you need to get started with Machine Learning.

The elements listed in this guide are of 3 types:

  • know-how
  • data
  • computational resources

These elements enable you to practice machine learning first-hand (small scale) or start machine learning projects to innovate an organization (big scale).

Prerequisites

Read Demystification of AI, ML and DS before this guide.

Time to complete

Approximately 45 minutes. Less than a Netflix episode!

Index

  • The three elements
  • Data
  • Know-how
  • Computational Power
  • Conclusions

The three elements

Virtually every area of science and economics has access to a quantity of data unimaginable even a decade ago.

In addition to this, today there is abundant and inexpensive computing capacity available through cloud service providers.

It also helps that the international research community is excited about machine learning and is therefore moving quickly in developing increasingly effective techniques.

In parallel, the engineering world is building the ML toolbox accordingly: wonderful frameworks and high-level interfaces that let you apply ML techniques quickly, moving towards ever more user-friendly "enablers".

Keras, a high-level framework that allows you to train a deep neural network in about 15 lines of code, is a perfect example of this phenomenon. In the same way, cloud providers offer you the possibility to host your ML models on their machines, with tooling powerful enough to build a complete AI/ML start-up from your laptop.
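To give you an idea of what "a network in a handful of lines" looks like, here is a minimal, hedged sketch. It assumes TensorFlow 2.x / Keras is installed and uses synthetic data, so treat it as a taste rather than a recipe:

```python
# A minimal sketch (not a recipe): a tiny dense network trained on synthetic data.
# Assumes TensorFlow 2.x / Keras is installed (`pip install tensorflow`).
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data: 1000 samples, 20 features.
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```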

The topics listed above have the following natural consequence:

Today, more than ever (and enormously more so than 5 years ago), it is easy and affordable for everyone to start practicing ML on everyday problems, and it is an obligation for any company that wants to remain competitive in a market that moves at crazy speed.

Let's see which elements you need to get started.

Data

Data comes from the Latin datum, "something given": a fact, something that has happened. Building ML programs means exploiting something that has already happened in the past (the data) to discover recurring patterns that suggest how to act in the future.

Building traditional programs is a tiring task: every single piece has to be built and then composed with the others. It's not too different from building a house: bricks are the modules that make it up, and the house is the program in its entirety.

When it comes to Machine Learning applications, however, the matter is different. An ML program is not something that we write ourselves, but rather something that we derive by observing and processing data.

A good analogy can be the following:

While building traditional software looks more similar to building a brick house, Machine Learning is similar to food cultivation. Cultivation leverages the earth to do most of the work. The farmer "guides" the land, supervises it and then picks up the fruits. Likewise, thanks to Machine Learning, the programmer tries to make the data do the bulk of the work. Growers combine seeds with land to grow plants while engineers combine data with algorithms to grow programs.

It is therefore easy to see that if the data is to do the bulk of the work, its quality (and quantity) is essential for a model to make accurate predictions, or more generally to extract knowledge from it effectively.
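To make the "programs grown from data" idea concrete, here is a tiny, hedged sketch (assuming scikit-learn is installed): instead of hand-coding the Celsius-to-Fahrenheit rule, we let a model recover it from a handful of observed examples.

```python
# A tiny illustration of "letting the data do the work", assuming scikit-learn
# is installed: instead of hand-coding the Celsius-to-Fahrenheit rule,
# we let a model recover it from a handful of observed examples.
import numpy as np
from sklearn.linear_model import LinearRegression

celsius = np.array([[-10.0], [0.0], [8.0], [15.0], [22.0], [38.0]])
fahrenheit = celsius.ravel() * 1.8 + 32   # the "facts" we observed in the past

model = LinearRegression().fit(celsius, fahrenheit)
print(model.coef_[0], model.intercept_)    # ~1.8 and ~32: the rule, learned from data
print(model.predict([[100.0]]))            # ~212, a prediction about a new case
```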

Today there is a large amount of data available to start practicing machine learning and discover its potential. Let's see how we can get hold of existing data and how it can be useful in solving our problems.

Freely available data

Be happy! A lot of data is free and easy to download! Well-known starting points include Kaggle Datasets, the UCI Machine Learning Repository, and Google Dataset Search.

Collect your own data

Data collection is a huge and critical topic, whose technical details fall outside the scope of this guide. We'll see in a dedicated Purgatorio guide how to collect large quantities of data through scraping and other sources. To get a first taste of the awesome free tools we can use, take a look at the ones from Digital Methods! Also take a look at the Snorkel project from Stanford, which seems very promising.
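Just to give a taste of what "scraping" means in practice, here is a minimal, hedged sketch. It assumes the requests and beautifulsoup4 packages are installed, the URL is a hypothetical placeholder, and you should always check a site's terms of service before scraping it.

```python
# A minimal scraping sketch, just to give a taste of data collection.
# Assumes `requests` and `beautifulsoup4` are installed; the URL is a
# hypothetical placeholder. Always check a site's terms of service and
# robots.txt before scraping it.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"  # placeholder: replace with a real source
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```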

Look under the bed

More and more organizations are realizing that they are sitting on a gold mine. Years and years of historical data from machines, conversation logs, transactions or sensor measurements are just waiting for you! They probably hide tremendous value to be unlocked, so why not take a look at that relational database?

The bottom line is: if you have a lot of data coming from a business process, it is probably worth its bytes in gold (uh, this analogy doesn't seem to fit well).

Ask "why?"

The five-whys technique is one of the key tools of critical thinking. With so much data around, and with the end results being data-dependent, it is crucial to get the data part right. So let's know our data before performing any action, task or transformation on it: let's learn everything we need to know about the source and nature of the data we will be learning from, processing, transforming, analysing and training on. There is a whole section dedicated to this aspect of data in this AI/ML/DL GitHub resource. Do scavenge through the material, the notebooks, presentations and videos under that section. Unfortunately, we tend to focus much less on critical thinking, which is a highly potent ingredient when working with mathematics and the sciences. Another good example is the online course "Calling Bullshit".

Know-how

Doing data science is not simple. It will never be, if you want to fully understand the mathematical processes behind the elegant models offered by high-level interfaces such as Keras. Learn to hate those who say "learn data science in 6 weeks". In one sense it's totally false, in the other it's disrespectful to those who study these issues for years, learning new things every day. This does not mean that we should be discouraged, on the contrary! Virgilio is here to help you in this fascinating process.

  • If you are a student or a technician in general who wants to experience these innovative themes at first hand, follow the path that has been prepared for you starting from Purgatorio, and with the right time and determination you will get incredible results.

  • If you are just interested in knowing what know-how is needed (for example in order to form a team that can develop large data science projects in your company) here are some tips for you.

Let's see what stages the process of developing a DS project consists of:

Figure 1-1

Let's have a look.

Once you have collected the requirements of the application you want to develop to solve the problem, the data science cycle usually consists of experimenting with solutions in an iterative way, collecting data and modeling the problem.

Data collection consists of putting together data sources, building pipelines to pre-process them, and having a place to store them while they wait for analysis. Modeling the problem consists of feature engineering (i.e. combining existing data in a meaningful way), training models and evaluating their performance.
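A minimal, hedged sketch of the modeling step follows, assuming scikit-learn is installed; a toy dataset shipped with the library stands in for your own data sources. It shows pre-processing, training and evaluation chained in a single pipeline:

```python
# A minimal sketch of the modeling step, assuming scikit-learn is installed:
# pre-process the features, train a model, and evaluate it on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A toy dataset shipped with scikit-learn stands in for your own data sources.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-processing and model chained in a single, reusable pipeline.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

print("test accuracy:", pipeline.score(X_test, y_test))
```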

Once a satisfactory model has been obtained, it must be put into production so that it can solve the user's requirement.
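In its simplest form, "putting into production" means persisting the trained model so a separate serving application can load it and answer requests. Here is a hedged sketch assuming scikit-learn and joblib are installed (the file name is just an example):

```python
# A minimal sketch of "putting a model into production", assuming scikit-learn
# and joblib are installed: persist the trained model so a separate serving
# application can load it and answer requests.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Training side: save the fitted model to disk once training is done.
joblib.dump(model, "model.joblib")

# Serving side (e.g. inside a small web service): load it and answer a request.
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:1]))
```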

Once in production, the model needs to be monitored, and it may soon have to be modified as requirements change. As you can easily guess, the data science process involves several professional roles for its success, even if the web services provided by large technology companies allow anyone to start developing ML solutions without worrying about the integration or management of the underlying systems.

These roles have blurred boundaries in terms of skills and knowledge, and in general the ideal would be for each individual to be able to move with agility between the different tasks of the data science process. While this is possible for small-scale projects, the needs of a company often require the coordinated effort of several people to be successful.

Names like "data scientist" or "data engineer" are described differently depending on the source. Here at Virgilio we believe that the classification made by KDNuggets is the most comprehensive and clear.

In this awesome blog post you can find in detail the differences between these figures.

If you're thinking about learning everything you need on your own, you're in the right place! Starting from Purgatorio, our purpose is to educate a "liquid" professional who can move easily among the different tasks and parts of the data science process.

Sooner or later you'll have to choose something to specialize in, maybe data visualization or model architecture, but until then the path we propose here at Virgilio mixes these "predefined" roles and covers all the interdisciplinary topics that a member of a data science team should at least be aware of.

Now that you know what know-how you need for your ML projects, let's take a look at the requirements in terms of computational resources.

Computational Power

The third and final element that you need is computational power. The calculations you find yourself doing most of the time when experimenting with Machine Learning are multiplications between matrices. In the subset of ML techniques called Deep Learning, neural networks are particularly hungry for these calculations.

In past decades such calculations were made directly on the CPU of computers, but CPUs proved insufficient to train the most complex networks. Fortunately, we are helped by GPUs (graphics processing units), whose job is exactly to do matrix calculations: in the world of gaming and video computing in general, these calculations are what happens daily behind the scenes of video games and 3D graphics. In particular, the main factor that indicates the usefulness of a GPU for Machine Learning calculations is the amount of video memory (expressed in gigabytes) that the card has.

Damn it! We have to buy an extreme gaming PC with 4 GPUs! I'm kidding, please don't do that. Thanks to the Internet and cloud providers, nowadays it is much more convenient to use machines prepared and maintained by companies like Amazon or Google, instead of investing money in physical hardware. In the next chapters I will argue why you should not buy an expensive physical machine, and if you still have to do that for some reason, I'll give you some tips to show you the trade-offs in terms of price and computational power.
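If you want to check what your current machine offers, here is a small, hedged sketch (assuming PyTorch is installed) that reports whether a CUDA GPU and how much video memory is available, and runs the kind of matrix multiplication ML workloads are made of:

```python
# A small sketch, assuming PyTorch is installed, that reports whether a CUDA
# GPU is available, how much video memory it has, and runs the kind of matrix
# multiplication that ML workloads are made of.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, video memory: {props.total_memory / 1024**3:.1f} GB")
    device = torch.device("cuda")
else:
    print("No CUDA GPU found, falling back to CPU.")
    device = torch.device("cpu")

# A large matrix multiplication: the bread and butter of deep learning.
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)
print((a @ b).shape)
```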

Cloudy Days

Outsourcing is one of the main drivers of human progress. Imagine one of the first farmers, around 10,000 B.C. He is used to carrying out the entire production process on his own: he sows and cultivates the wheat, harvests it and grinds it into small parts. He then uses it to create various products, such as flour, ready-made dough, or a primitive form of bread.

Nowadays, however, an individual or an organization can only deal with one of the previous sub-processes, and specialize highly in that. For example, by choosing to specialize in the production of bread, they could learn how to produce many forms of bread, look for new ones, improve the flavor and optimize all these processes, forming a specific and specialized know-how with respect to the production of bread.

This process, the progressive specialization of human activities, is the basis of the technical progress of our species. It is clearly based on outsourcing! If all the sub-processes preceding the production of bread were to disappear, the baker would be destined to close down. How would he grow wheat? How would he prevent insects from ruining it? How would he pay for, manage and maintain the mills that grind it into flour?

In the same way, companies today outsource as much as possible the processes that are not essential to their core know-how: in doing so, everyone can do "what they do best".

The same trend is being seen today in the use of software. It is more common than ever to use services provided over the network to solve sub-problems that companies do not want to worry about, such as managing the physical machines that host their applications, or keeping their data accessible from anywhere in the world in a safe and reliable way.

It is therefore not surprising that cloud computing is becoming a very important part of the software market, thanks to the flexibility and reduced costs that this entails.

The race towards the monopoly of the cloud market is fierce, particularly between Amazon (with AWS) and Microsoft (Azure).

How does this come in handy for Machine Learning?

With a few clicks you can deploy rows and rows of high-performance machines, managed in a transparent way and, above all, available on demand. This drastically reduces the cost and complexity of developing complex software and solutions, all from a dedicated web interface or client. It also allows you to serve solutions quickly and flexibly, without the classic headaches of managing large-scale distributed systems yourself. Being able to scale from one device to millions simply by turning on hundreds of managed, pay-per-use machines is an outstanding business and research enabler.

Recently, all the large tech companies have started to provide (and invest heavily in) services related to AI and ML. Emerging (and in-demand) professionals such as data engineers and data scientists can achieve incredible levels of productivity through end-to-end platforms that cover the entire development cycle of AI and ML solutions. Some examples are AWS SageMaker and Azure Machine Learning Studio.

To start experimenting in a simple way there are even solutions such as Google Colab, a free Google tool that lets you run Jupyter notebooks on Google's machines, without having to worry about installing packages or managing hardware.
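As a quick sanity check, here is a hedged two-liner you can run in a Colab cell after enabling a GPU runtime (Runtime > Change runtime type); it assumes TensorFlow, which Colab ships pre-installed at the time of writing:

```python
# A quick check you can run in a Colab cell after enabling a GPU runtime
# (Runtime > Change runtime type). Assumes TensorFlow, which Colab ships
# pre-installed at the time of writing.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```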

We have created a fantastic list that compares all cloud solutions in terms of cost and performance.

Don't believe anyone who tells you that you need to buy a high-performance machine to do ML. They probably live in 2010!

If you still have to buy a machine

Suppose you still have to buy a machine for other needs; let's look at some good options:

  • This video from Siraj Raval explains pretty well which components are important and why.
  • These Quora answers are straightforward (notice that they suggest leveraging the cloud first).
  • This Medium article shows you how to build an $800 rig for Machine Learning.

As we have seen, the important component for ML calculations is the GPU, in particular the amount of video memory it has. WARNING: you definitely want a GPU compatible with CUDA, the parallel computing platform of Nvidia, the largest GPU manufacturer.

[This article](https://blog.slavv.com/picking-a-gpu-for-deep-learning-3d4795c273b9) explains how to choose a GPU that supports CUDA.

Conclusions

We have seen how the three elements (data, know-how and computing power) are strictly necessary to develop ML solutions for everyday problems. If you are a student or just a curious cat you will have an easy time thanks to Google Colab and the free power it offers. If you're part of a team in a company thinking about starting to experiment with ML at scale, consider the ever richer and better ML services offered by the big technology vendors.


Written by clone95