Although this is the most common question, it is hard to answer since the amount of data mainly depends on how complex the learning concept is. In Machine Learning (ML), the learning complexity can be broken down into informational and computational complexities. Further, informational complexity considers two aspects, how many training examples are needed (sample complexity) and how fast a learner/model’s estimate can converge to the true population parameters (rate of convergence). Computational complexity refers the types of algorithms and the computational resources to extract the learner/model’s prediction within a reasonable time. As you can guess now, this blog will cover informational complexity to answer the question.
Let’s try to learn what banana is. In this example, banana is the learning concept (one hypothesis, that is ‘to be’ or ‘not to be banana’), and the various descriptions associated with banana can be features such as colors and shapes. Unlike the way human can process the concept of banana – the human does not require non-banana information to classify a banana, typical machine learning algorithm requires counter-examples. Although there is One Class Classification (OCC) which has been widely used for outlier or anomaly detection, this is harder than the problem of conventional binary/multi-class classification.
Let’s place another concept ‘Apple’ into this example and make this practice as a binary-class classification. By doing this, we just made the learning concept simpler, ‘to be banana = not apple’ and ‘not to be banana = apple’. This is little counter-intuitive since adding an additional learning concept into a model makes the model simpler: however, OCC basically refers one versus all others, and the number of all other cases are pretty much infinite. This is where we are in terms of ML; one of the simplest learning activities for human is the most difficult problem to solve in ML. Before generating some data for banana, we need to define some terms.
Ideally, we want to have all the instances in the training sample set S covering all the possible combinations of features with respect to t as you can see in Table 1. There are three possible values for f1 and two possible values for f2. Also, there are two classes in this example. Therefore, the number of all the possible instances |X| = |f1| x |f2| x |C| = 3 x 2 x 2 = 12. However, f2 is a lucky feature[iii] that is mutually exclusive between banana and apple. Hence, |f2| is considered as 1 in this case. In addition to that, we can subtract one case because there is no red banana. For this example, only 5 instances can exhaust the entire sample space H. In general, the number of features (columns) in a data set is exponentially proportional to the required number of training samples (|S| = n, where n is the unique number of samples in the set). If we assume that all features are binary like a simple value of yes or no, then |X| = 2 x 2 x 2 = 23. Two to the power of the number of columns is the minimum n in the simplest case. This example only works when the values in all the features are discrete values. If we use the gradient color values for Color (RGB 256 color pallet ranges from 0 to 16777215 in decimal), the required number of training samples will increase quite significantly because now you need to multiply 16777216 for f1 if all the possible colors exist in H.
It is worth noting that the number of instances we calculate here does not always guarantee that a learner/model can converge properly. If you have the number of data equal or below this number, the amount of data is simply too small for the most of algorithm except a ML algorithm evaluating one feature at a time such as a decision tree. As a rough rule of thumb, many statisticians say that a sample size of 30 is large enough. This rule can be applied for a regression based ML algorithm that assumes one smooth linear decision boundary. Although an optimal n could be different on a case-by-case basis, it is not a bad idea to start from the total number of samples of N = |X| x 30.
Table 1 Training sample set S
In ML, learning curve refers a plot of the classification accuracy against the training set size. This is not an estimation method; it requires building a classification model multiple times with the different size of training data set. This is a good technique for sanity-checks (underfitting – high bias and overfitting – high variance) for a model. It also can be utilized to optimize/improve the performance.
Back in the day, not a single ML paper was accepted without a learning curve. Without this simple plot, the entire performance claim will be unverifiable.
Internal web page
External web page
Contacts Americas Kihoon Yoon Sr. Principal Systems Dev Eng Kihoon.Yoon@dell.com +1 512 728 4191
Or attributes. A feature is an individual measurable property or characteristic of a phenomenon being observed.
[ii] Instance is also referred as example or sample
[iii] If a feature like f2 exists in a data set, we could make 100% accurate prediction simply by looking at f2.
[iv] Is there yellow apple? Yes, Golden Delicious is yellow.