## Abstract

The *k*-dimensional coding schemes refer to a collection of methods that attempt to represent data using a set of representative *k*-dimensional vectors and include nonnegative matrix factorization, dictionary learning, sparse coding, *k*-means clustering, and vector quantization as special cases. Previous generalization bounds for the reconstruction error of the *k*-dimensional coding schemes are mainly dimensionality-independent. A major advantage of these bounds is that they can be used to analyze the generalization error when data are mapped into an infinite- or high-dimensional feature space. However, many applications use finite-dimensional data features. Can we obtain dimensionality-dependent generalization bounds for *k*-dimensional coding schemes that are tighter than dimensionality-independent bounds when data are in a finite-dimensional feature space? Yes. In this letter, we address this problem and derive a dimensionality-dependent generalization bound for *k*-dimensional coding schemes by bounding the covering number of the loss function class induced by the reconstruction error. The bound is of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where *m* is the dimension of features, *k* is the number of columns in the linear implementation of coding schemes, and *n* is the sample size; $\lambda_n > 0.5$ when *n* is finite and $\lambda_n = 0.5$ when *n* is infinite. We show that our bound can be tighter than previous results because it avoids inducing the worst-case upper bound on *k* of the loss function. The proposed generalization bound is also applied to some specific coding schemes to demonstrate that the dimensionality-dependent bound is an indispensable complement to the dimensionality-independent generalization bounds.

## 1 Introduction

*k*-dimensional coding schemes (Maurer & Pontil, 2010) are abstract and general descriptions of a collection of methods, all of which encode a data point $x \in \mathcal{H}$ as a representative vector $y \in \mathbb{R}^k$ by a linear map *T*, where $\mathcal{H}$ denotes the Hilbert space. These coding schemes can be formulated as follows:
$$\min_{y \in Y} \|x - Ty\|^2,$$
where $Y \subseteq \mathbb{R}^k$ is called the codebook and the linear map $T: \mathbb{R}^k \to \mathcal{H}$ is called the implementation of the codebook. The implementation projects the codebook back to the data source space. The dimension of a data point *x* can be either finite or infinite. In this letter, we consider the data as having finite-dimensional features, that is, $\mathcal{H} = \mathbb{R}^m$.

A data point is represented by its best-approximating codeword *y* in the codebook. The reconstruction error of a data point *x* is defined as
$$f_T(x) = \min_{y \in Y} \|x - Ty\|^2. \tag{1.2}$$
The function $f_T$, whose variables are *x* and *T*, is also called the loss function. Nonnegative matrix factorization (NMF) (Lee & Seung, 1999; Févotte, Bertin, & Durrieu, 2009; Guan, Tao, Luo, & Yuan, 2012), dictionary learning (Chen, Donoho, & Saunders, 1999; Ivana & Pascal, 2011), sparse coding (Olshausen & Field, 1996; Amiri & Haykin, 2014; Gui, Sun, Ji, Tao, & Tan, 2016), *k*-means clustering (MacQueen, 1967; Anderberg, 1973; Liu, Liu, Wu, Tao, & Fu, 2015), and vector quantization (Gray, 1984; Schneider, Biehl, & Hammer, 2009a) are specific forms of *k*-dimensional coding schemes because they share the same form of the reconstruction error as equation 1.2. They have achieved great successes in the fields of pattern recognition and machine learning for their superior performances on a broad spectrum of applications (Liu, Shao, Li, & Fu, 2016; Pehlevan, Hu, & Chklovskii, 2015; Gong, Zhang, Schölkopf, Tao, & Geiger, 2015; Liu, Wu, Tao, Zhang, & Fu, 2015; Liu, Tao, Cheng, & Tang, 2014; Mairal, Bach, & Ponce, 2012; Hunt, Ibbotson, & Goodhill, 2012; Wright, Yang, Ganesh, Sastry, & Ma, 2009; Schneider, Biehl, & Hammer, 2009b; Dhillon, Guan, & Kulis, 2007; Quiroga, Nadasdy, & Ben-Shaul, 2004; Kanungo et al., 2002; Abbott & Dayan, 1999).
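As a concrete illustration of this shared reconstruction error (the function and variable names below are ours, not the letter's), a minimal NumPy sketch of $f_T(x) = \min_{y \in Y} \|x - Ty\|^2$ for a finite codebook:

```python
import numpy as np

def reconstruction_error(x, T, codebook):
    """f_T(x) = min over codes y in the codebook of ||x - T y||^2."""
    return min(np.sum((x - T @ y) ** 2) for y in codebook)

rng = np.random.default_rng(0)
m, k = 5, 3                       # feature dimension m, code dimension k
T = rng.standard_normal((m, k))   # implementation: a linear map R^k -> R^m

# Vector quantization / k-means corresponds to a codebook of standard bases.
basis_codebook = [np.eye(k)[i] for i in range(k)]

# A point that coincides with the second column of T has zero error.
x = T @ np.eye(k)[1]
assert abs(reconstruction_error(x, T, basis_codebook)) < 1e-12
```

Swapping in a different codebook (e.g., the nonnegative orthant for NMF, or sparse vectors for sparse coding) recovers the other special cases.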

The learning task is to choose the implementation *T*. A natural choice for *T* is the one that minimizes the expected reconstruction error. The expected reconstruction error with respect to the implementation *T* is defined as
$$R(T) = \int f_T(x)\, p(x)\, dx,$$
where $\mu$ is a Borel measure of the data source and $p$ is the probability density function. However, in most cases, $\mu$ is unknown, and $R(T)$ cannot be directly minimized. An alternative approach is the empirical risk minimization (ERM) method (Vapnik, 2000; Cucker & Smale, 2002). Given a finite number of independent and identically distributed (i.i.d.) observations $x_1, \ldots, x_n$, the empirical reconstruction error with respect to *T* is defined as
$$R_n(T) = \frac{1}{n} \sum_{i=1}^{n} f_T(x_i).$$
The ERM method searches for an implementation $T_n$ that minimizes $R_n(T)$, in the hope that $R(T_n)$ has a small distance to the smallest expected reconstruction error $\min_{T \in \mathcal{T}} R(T)$, where $\mathcal{T}$ denotes a particular class of linear operators *T*.
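The ERM procedure just described can be sketched numerically; this is an illustrative toy that searches a random candidate pool rather than running a real optimizer:

```python
import numpy as np

def f_T(x, T, codebook):
    """Reconstruction error of a single point under implementation T."""
    return min(np.sum((x - T @ y) ** 2) for y in codebook)

def empirical_error(X, T, codebook):
    """R_n(T) = (1/n) sum_i f_T(x_i): the empirical reconstruction error."""
    return float(np.mean([f_T(x, T, codebook) for x in X]))

rng = np.random.default_rng(1)
m, k, n = 4, 2, 200
X = rng.standard_normal((n, m))
codebook = [np.eye(k)[i] for i in range(k)]   # vector-quantization codebook

# ERM over a small random pool of candidate implementations T.
candidates = [rng.standard_normal((m, k)) for _ in range(50)]
T_hat = min(candidates, key=lambda T: empirical_error(X, T, codebook))
assert empirical_error(X, T_hat, codebook) <= empirical_error(X, candidates[0], codebook)
```

The generalization question studied in the letter is how far $R_n(T_n)$ can be from $R(T_n)$ uniformly over the class of implementations.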

All of these methods thus share the framework of *k*-dimensional coding schemes. Although different restrictions are imposed on the choices of *T* and *Y* for different concrete forms of *k*-dimensional coding schemes (e.g., NMF requires both *T* and *Y* to be nonnegative, and sparse coding requires sparsity in *Y*), they are closely related. For example, Ding, He, and Simon (2005) showed that NMF with orthogonal *Y* is identical to *k*-means clustering. Analyzing the generalization bounds together in this context has the advantage of exploiting the common properties and mutual cross-fertilization.

### 1.1 Related Work

Some previous work has analyzed generalization bounds for the general framework of *k*-dimensional coding schemes; other work has concentrated only on specific *k*-dimensional coding schemes. Since some previous work has studied consistency performance, which considers the quantity $R(T_n) - \min_{T \in \mathcal{T}} R(T)$ of the related ERM-based algorithms, we first note the relationship between the generalization error and the consistency performance:
$$R(T_n) - \min_{T \in \mathcal{T}} R(T) \le 2 \sup_{T \in \mathcal{T}} \left| R(T) - R_n(T) \right|.$$
Thus, analyzing the generalization error provides an approach for analyzing the consistency performance, and the consistency performance provides directions for generalization error analysis. We review the generalization error and consistency performance of *k*-dimensional coding schemes together:

- *Nonnegative matrix factorization (NMF)*. The only known generalization bounds of NMF have been developed by Maurer and Pontil (2010) and Gribonval et al. (2015).
- *Dictionary learning*. Maurer and Pontil (2010) developed dimensionality-independent generalization bounds. Vainsencher, Mannor, and Bruckstein (2011) and Gribonval et al. (2015) studied dimensionality-dependent generalization bounds.
- *Sparse coding*. A generalization bound for sparse coding was first derived by Maurer and Pontil (2010) and subsequently extended by Xu and Lafferty (2012), Mehta and Gray (2013), Maurer, Pontil, and Romera-Paredes (2013), and Gribonval et al. (2015). Maurer et al. (2013) derived a faster convergence rate upper bound of the consistency performance in a transfer learning setting.
- *k-means clustering and vector quantization*. Consistency performances of *k*-means clustering and vector quantization have mostly been studied for finite-dimensional data. Asymptotic and nonasymptotic consistency performances have been considered by Pollard (1982), Chou (1994), Linder, Lugosi, and Zeger (1994), Bartlett, Linder, and Lugosi (1998), Linder (2000), Antos, Györfi, and György (2005), Antos (2005), and Levrard (2013). Biau, Devroye, and Lugosi (2008), Maurer and Pontil (2010), and Levrard (2015) developed dimensionality-independent generalization bounds for *k*-means clustering.

We are aware that these specific forms of *k*-dimensional coding schemes have many applications for finite-dimensional data, and only a few dimensionality-dependent methods have been developed to analyze the generalization bounds for all these coding schemes.

In this letter, we develop a dimensionality-dependent method to analyze the generalization bounds for the framework of *k*-dimensional coding schemes. Our method is based on Hoeffding’s inequality (Hoeffding, 1963) and the Bennett-type inequalities (Boucheron, Lugosi, & Massart, 2013) and directly bounds the covering number of the loss function class induced by the reconstruction error, which avoids inducing the worst-case upper bound on *k* of the loss function. Our method allows a generalization bound of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $\lambda_n$ is much bigger than $0.5$ when *n* is small, which delicately describes the nonasymptotic behavior of the learning process. However, when *n* goes to infinity, $\lambda_n$ approaches $0.5$. The obtained dimensionality-dependent generalization bound can be much tighter than the previous ones when the number *k* of columns of the implementation is larger than the dimensionality *m*, which often happens for dictionary learning, sparse coding, *k*-means clustering, and vector quantization. We therefore obtain state-of-the-art generalization bounds for NMF, dictionary learning, sparse coding, *k*-means clustering, and vector quantization.

The remainder of the letter is organized as follows. We present our motivation in section 2 and main results in section 3. In section 4, we apply our results to specific coding schemes and empirically compare them with state-of-the-art generalization bounds. We prove our results in section 5 and conclude the letter in section 6.

## 2 Motivation

We first introduce the dimensionality-independent generalization bounds and demonstrate why our dimensionality-dependent bound complements them.

Assume that data points are drawn from a Hilbert space $\mathcal{H}$ with distribution $\mu$. For any $r > 0$, let $\mathcal{P}_r$ denote the set of probability distributions on $\mathcal{H}$ supported on the closed ball of radius *r* centered at the origin. In other words, $\mu \in \mathcal{P}_r$ means that $\mu(\{x : \|x\| \le r\}) = 1$. Let $\mathcal{T}$ be bounded in the operator norm; that is, for every $T \in \mathcal{T}$, it holds that $\|Tv\| \le c$ for all *v* with $\|v\| \le 1$. Then we also have that the columns of *T* are bounded as $\|Te_i\| \le c$, where $\{e_1, \ldots, e_k\}$ is the orthonormal basis of $\mathbb{R}^k$.

The following two theorems are equivalent to the main theorems proved by Maurer and Pontil (2010) but are represented in a different way. They are dimensionality-independent generalization bounds obtained in the frame of the *k*-dimensional coding schemes. They exploited the Rademacher complexity technique (Bartlett & Mendelson, 2003), which is suitable for deriving dimensionality-independent bounds (see Biau et al., 2008).

The dimensionality-independent generalization bound in theorem 1 is valuable because it shows a convergence rate of order $\mathcal{O}(1/\sqrt{n})$.

The condition that *Y* is a closed subset of the unit ball of $\mathbb{R}^k$ can be easily achieved by controlling the upper bound of the columns of *T*, because there is a trade-off between the bounds of the columns of *T* and the entries of *Y*.

We note that theorems 1 and 3 are more complicated than the original results presented in Maurer and Pontil (2010). This is because we have removed the normalization restrictions that were required to simplify their results, in order to reveal the intrinsic relationships between the order of *k* and the Rademacher complexities (discussed below). The proof methods of theorems 1 and 3 in this letter are exactly the same as those presented by Maurer and Pontil (2010).

If every code *y* is in the unit ball of $\mathbb{R}^k$, then for every $x$ with $\|x\| \le r$,
$$f_T(x) \le \big(\|x\| + \|Ty\|\big)^2 \le \big(r + c\sqrt{k}\big)^2,$$
because $\|Ty\| \le \sum_{i=1}^{k} |y_i| \, \|Te_i\| \le c \, \|y\|_1 \le c\sqrt{k} \, \|y\|_2$. Here *r*, the upper bound of the data point, can be reduced by normalization. However, *k* is a fixed integer, whose value is usually large in practice. Thus, $c\sqrt{k}$ is the dominant factor in the upper bound of $f_T$. It is evident that $f_T$ has the worst-case upper bound on *k* of order $\mathcal{O}(k)$; that is, the dependency with regard to *k* of the upper bound of $f_T$ has the worst-case order $\mathcal{O}(k)$. However, for some special forms of *k*-dimensional coding schemes, the upper bound of $f_T$ has a very small order about *k*. Taking NMF as an example, the order about *k* is zero because the zero code is admissible and therefore
$$f_T(x) = \min_{y \in Y} \|x - Ty\|^2 \le \|x - T0\|^2 = \|x\|^2 \le r^2.$$
It is evident that the complexity term in theorem 1 has the same order as that of the worst-case upper bound on *k* of $f_T$. It will therefore be loose for some specific *k*-dimensional coding schemes. Maurer and Pontil (2010) introduced the proof method of theorem 3 to overcome this problem; however, the bound there implies that the problem is only partially solved, because one of its terms still represents the worst-case upper bound on *k* of $\sup_x f_T(x)$ (details can be found in the proof therein). For example, in NMF, that term is of nonzero order in *k* (discussed in remark 9). The dimensionality-independent bound in theorem 3 faces the same problem because the proof method computes the Rademacher complexity, and the part for which the obtained bound is dimensionality independent involves the worst-case upper bound on *k* of $\sup_x f_T(x)$.
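The zero-order-in-*k* behavior for NMF can be checked numerically. A small sketch, assuming the zero code is admissible in the NMF codebook (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 6, 1.0
x = rng.standard_normal(m)
x = r * x / np.linalg.norm(x)            # a data point on the ball of radius r

for k in [2, 10, 100]:                    # the NMF bound does not grow with k
    T = np.abs(rng.standard_normal((m, k)))                 # nonnegative implementation
    codes = [np.zeros(k)] + [np.abs(rng.standard_normal(k)) for _ in range(20)]
    f = min(np.sum((x - T @ y) ** 2) for y in codes)
    # The zero code alone already gives f_T(x) <= ||x||^2 <= r^2.
    assert f <= r ** 2 + 1e-12
```

Without the nonnegativity structure, a generic codebook filling the unit ball can force the reconstruction error up to order *k*, which is exactly the worst case discussed above.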

We try to avoid the worst case by employing a covering number method to measure the complexity of the induced loss function class $\mathcal{F} = \{f_T : T \in \mathcal{T}\}$. However, in our setting, the dimensionality *m* of the data space must be finite.

## 3 Main Results

Before presenting our main results, we introduce the definition of covering number (Zhang, 2002).

Let $\mathcal{F} = \{f_T : T \in \mathcal{T}\}$ denote the induced loss function class. We can upper-bound the covering number of the induced loss function class of any *k*-dimensional coding scheme.

By employing Hoeffding’s inequality (Hoeffding, 1963), we can derive a dimensionality-dependent generalization bound for *k*-dimensional coding schemes.
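The way a covering number of size roughly $(c/\varepsilon)^{mk}$ combines with Hoeffding's inequality and a union bound can be sketched as a back-of-the-envelope calculator (not the letter's exact bound; the Lipschitz correction term is omitted and all parameter names are ours):

```python
import numpy as np

def cover_size(c, eps, m, k):
    """Grid cover of the implementation class: roughly (c/eps)^(m*k) elements."""
    return (c / eps) ** (m * k)

def hoeffding_uniform_bound(B, n, delta, c, eps, m, k):
    """Hoeffding + union bound over the eps-cover (Lipschitz term omitted).

    Deviation <= B * sqrt(log(2N/delta) / (2n)), N = cover size, so the
    bound scales like sqrt(m * k * log(c/eps) / n): dimensionality-dependent.
    """
    N = cover_size(c, eps, m, k)
    return B * np.sqrt(np.log(2 * N / delta) / (2 * n))

b1 = hoeffding_uniform_bound(B=1.0, n=10_000, delta=0.05, c=1.0, eps=0.01, m=10, k=5)
b2 = hoeffding_uniform_bound(B=1.0, n=10_000, delta=0.05, c=1.0, eps=0.01, m=20, k=5)
assert b2 > b1   # doubling m increases the covering term
```

The point of the letter's analysis is that *B*, the uniform bound on the loss, need not carry the worst-case order in *k* for specific schemes such as NMF.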

Our result is dimensionality dependent. Compared to the bound in theorem 3, our bound could be tighter if the number *k* of columns is large relative to the dimensionality *m*.

Our bound avoids the worst-case upper bound on *k* of the loss function, in contrast to those of theorems 1 and 3. Regarding NMF, if we consider only the orders of *m*, *k*, and *n*, our bound is of order $\mathcal{O}\big(\sqrt{mk\ln n/n}\big)$, while theorems 1 and 3 retain the worst-case dependency on *k*. Our bound is therefore tighter when *k* is large relative to $m \ln n$.

For dictionary learning, sparse coding, *k*-means clustering, and vector quantization, the number *k* of the columns of the linear implementation may be larger than the dimensionality *m*. If $k \gg m$, our bound will be much tighter than the dimensionality-independent generalization bound.

As shown in the proofs of lemma 7 and theorem 8, our result is based on the estimation of the Lipschitz constant of the loss function with regard to the implementation *T*. In particular, we proved the property
$$|f_T(x) - f_{T'}(x)| \le L \max_{1 \le i \le k} \|(T - T')e_i\|$$
for all *T* and $T'$ in $\mathcal{T}$, where *L* is a constant depending on the specific *k*-dimensional coding scheme. Similar to our idea, Gribonval et al. (2015) also developed dimensionality-dependent generalization bounds for *k*-dimensional coding schemes. However, their method is different from ours. Their results are essentially based on the property that
$$|f_T(x) - f_{T'}(x)| \le L' \|T - T'\|$$
for all *T* and $T'$ in $\mathcal{T}$, where $L'$ is also a constant and the operator norm of an $m \times k$ matrix is defined as $\|T\| = \sup_{\|v\| \le 1} \|Tv\|$. As a result, under some assumptions (see assumptions A1–A4, B1–B3, and C1–C2 therein) and with high probability, they obtain a bound whose constants depend on the specific *k*-dimensional coding scheme. Note that in most applications, both *m* and *k* are large. Their bound could be looser than the derived bound in theorem 8 because in these cases, it holds that $\|T - T'\| \le \sqrt{k} \max_{1 \le i \le k} \|(T - T')e_i\|$. Detailed comparisons are presented in section 4.

The result in theorem 8 can be improved by exploiting Bennett-type inequalities. We can make the upper bound have either a smaller constant or a faster convergence rate, as follows.

By employing Bernstein’s inequality, we show that a tighter generalization bound for *k*-dimensional coding schemes than that in theorem 8 can be derived.

The upper bound in theorem 12 can be much tighter than that in theorem 8. The dominant term in the upper bound of theorem 12 scales with the empirical reconstruction error. Since the empirical reconstruction error is no bigger than 1 and sometimes much smaller, the upper bound in theorem 12 can be much tighter than that in theorem 8.

We can also represent the result by using the inequality that $\sqrt{ab} \le (a + b)/2$ for all $a, b \ge 0$.

We have claimed that theorem 12 and proposition 14 can be tighter than theorem 8 because the empirical reconstruction error can be very small. However, sometimes this term can be large. If the empirical reconstruction error is close to its upper bound (note that the reconstruction error function is bounded), theorem 12 and proposition 14 will be looser than theorem 8.

The following theorem implies that, by employing Bennett’s inequality, the generalization bound can be improved no matter what the value of the empirical reconstruction error is.

Since $\lambda_n > 0.5$ in theorem 15 whenever the stated condition holds, the upper bound in theorem 15 matches that in theorem 8 except for a faster convergence rate. Thus, the upper bound in theorem 15 can be much tighter than that in theorem 8 in the sense that it converges much faster.

The generalization bound in theorem 8 is of order $\mathcal{O}\big((mk\ln(mkn)/n)^{0.5}\big)$, while the generalization bound in theorem 15 is of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $\lambda_n > 0.5$ when *n* is finite. The generalization bound in theorem 15, derived by employing Bennett’s inequality, converges faster when the sample size *n* is small, which is often the case in practice, and describes the nonasymptotic behavior of the learning process. More empirical discussions can be found in Zhang (2013). However, when the sample size *n* goes to infinity, $\lambda_n$ will approach $0.5$, which means that the upper bounds in theorems 15 and 8 describe the same asymptotic behavior of the learning process.
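The advantage of Bennett/Bernstein-type inequalities when the variance is small can be seen by comparing textbook tail bounds directly (standard forms for bounded variables, not the letter's theorems; parameter names are ours):

```python
import numpy as np

def hoeffding_tail(n, eps, B=1.0):
    """P(mean deviation >= eps) <= exp(-2 n eps^2 / B^2) for [0, B]-valued variables."""
    return np.exp(-2 * n * eps ** 2 / B ** 2)

def bernstein_tail(n, eps, var, B=1.0):
    """Bernstein: exp(-n eps^2 / (2 var + 2 B eps / 3)); sharper when var is small."""
    return np.exp(-n * eps ** 2 / (2 * var + 2 * B * eps / 3))

n, eps = 1000, 0.05
small_var = 0.01   # a small empirical reconstruction error implies small variance
assert bernstein_tail(n, eps, small_var) < hoeffding_tail(n, eps)
```

Inverting the exponents for a fixed confidence level is what turns the smaller tail into the faster effective rate discussed above.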

The result in theorem 15 looks complex, since the exponent in the convergence rate itself depends on the sample size in an implicit way. Here we show the superiority of theorem 15 by comparing it with theorem 8. The proof of theorem 15 rests on a Bennett-type tail inequality (see equation 5.13). An analogous tail inequality holds for Hoeffding’s inequality; thus, according to Hoeffding’s inequality and the proof method of theorem 15, theorem 8 can also be given a faster-rate interpretation that holds with high probability. Comparing the resulting bound with that in theorem 15, we can see that even if we interpret theorem 8 with a faster convergence rate, the upper bound obtained there is looser than that in theorem 15 in the regime where the Bennett-type tail dominates.

Our main results in theorems 8, 12, and 15 apply to all *k*-dimensional coding schemes because the covering number in lemma 7 measures the complexity of the loss function class that includes all the possible loss functions of *k*-dimensional coding schemes. However, for some specific *k*-dimensional coding schemes, the complexity of the corresponding induced loss function class can be refined. We discuss the details in the next section.^{1}

## 4 Applications

In this section, we apply our proof methods to specific *k*-dimensional coding schemes. We show that our methods provide state-of-the-art dimensionality-dependent generalization bounds.

### 4.1 Nonnegative Matrix Factorization

NMF factorizes a data matrix $X \in \mathbb{R}^{m \times n}$ into two nonnegative matrices $T \in \mathbb{R}^{m \times k}$ and $Y \in \mathbb{R}^{k \times n}$, where $k \le \min\{m, n\}$. NMF has been widely exploited since Lee and Seung (1999) provided a powerful psychological and physiological interpretation as a parts-based factorization and an efficient multiplicative update rule for obtaining a local solution. Many fast and robust algorithms then followed (e.g., Gillis & Vavasis, 2014; Guan, Tao, Luo, & Yuan, 2011; Liu & Tao, 2015). In all applications, both the data points and the columns of *T* are contained in the positive orthant of a finite-dimensional space. In this case, our method for deriving dimensionality-dependent generalization bounds is likely to be superior to the method for obtaining dimensionality-independent results.
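For concreteness, a minimal sketch of the Lee-Seung multiplicative updates for the Frobenius objective $\|X - TY\|_F^2$ (a simplified version; the iteration count and stabilizing constants are ours):

```python
import numpy as np

def nmf_multiplicative(X, k, iters=200, seed=0):
    """Lee-Seung multiplicative updates; preserves nonnegativity of T and Y."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    T = rng.random((m, k)) + 1e-3
    Y = rng.random((k, n)) + 1e-3
    for _ in range(iters):
        # Each factor is scaled elementwise; positive entries stay positive.
        Y *= (T.T @ X) / (T.T @ T @ Y + 1e-12)
        T *= (X @ Y.T) / (T @ Y @ Y.T + 1e-12)
    return T, Y

rng = np.random.default_rng(3)
X = rng.random((20, 30))              # nonnegative data matrix
T, Y = nmf_multiplicative(X, k=5)
assert T.min() >= 0 and Y.min() >= 0  # both factors remain nonnegative
assert np.linalg.norm(X - T @ Y) < np.linalg.norm(X)
```

Each column of *X* plays the role of a data point *x*, and each column of *Y* is its nonnegative code, so the fitted objective is exactly the empirical reconstruction error of the coding-scheme view.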

If we restrict and normalize *T*, the columns of *Y* will also be upper-bounded by *r*. This can be seen in the following lemma, which generalizes lemma 19 in Maurer and Pontil (2010):

For NMF with normalized *T*, if $\|x\| \le r$, then every column of *Y* is upper-bounded by *r*; that is, $\|y_j\| \le r$ for all $j$.

Using the same proof method as that of lemma 7, we have the following lemma:

Then, according to the proof methods of theorems 8, 12, and 15, we have the following dimensionality-dependent generalization bounds for NMF:

We now compare the bound of theorem 8 with state-of-the-art bounds. Theorem 8 gives a bound for NMF; under the setting of theorem 21, theorem 3 yields a second bound; and the result of Gribonval et al. (2015) gives a third.

We then carefully compare the above generalization bounds. For NMF problems, the dimensionality *m* is usually very large compared to the reduced dimensionality *k*, and we set the parameters accordingly. The comparisons are illustrated in Figure 1. The figure shows that in most cases, the derived generalization bound is tighter than state-of-the-art bounds. In Figure 1d, the bound in equation 1.3 is tighter than the derived bound over a small range because it is dimensionality independent and *m* is set to be much larger than the corresponding reduced dimensionality *k*.

### 4.2 Dictionary Learning

Let the columns of *T* be the atoms of the dictionary. For an observation *x*, the dictionary learning method represents *x* by a linear combination of the columns of *T*, that is, $x \approx Ty$. Thus, the reconstruction error of dictionary learning takes the same form as that of *k*-dimensional coding schemes.

Vainsencher et al. (2011) provided notable dimensionality-dependent generalization bounds for dictionary learning by considering two types of constraints on coefficient selection. For the $\ell_0$-norm regularized coefficient selection, where every signal is approximated by a combination of at most *p* dictionary atoms, their generalization bound holds under an approximate orthogonality assumption on the dictionary. For the $\ell_1$-norm regularized coefficient selection, their generalization bound requires that $\lambda$, the upper bound of the $\ell_1$-norm of the coefficients, be sufficiently large and that the signal *x* be mapped onto the unit sphere. Our result on *k*-dimensional coding schemes can also be applied to dictionary learning and provides a more general bound, which requires neither *x* to be on the unit sphere nor near-orthogonality of the dictionary, and directly applies to all dictionary learning problems.

The proof of theorem 22 is the same as that of theorem 21.

If we substitute an upper bound for the regularization parameter into the bound in Vainsencher et al. (2011), the bound in theorem 22 therein has the same order as the corresponding term in ours. However, our bound in theorem 18 also shows a faster convergence rate.

The method Vainsencher et al. (2011) used to upper-bound the covering number of the induced loss function class is very different from ours. To upper-bound the covering number of the induced loss function class for dictionary learning, Vainsencher et al. (2011) used the knowledge that a uniformly *L*-Lipschitz mapping between metric spaces converts $\varepsilon$-covers into $L\varepsilon$-covers. Then they focused on analyzing the Lipschitz property of the reconstruction error function that maps a dictionary to a reconstruction error, $T \mapsto f_T(x)$, as shown in their lemma 7. Also note that to upper-bound the Lipschitz constant of this mapping, they introduced the approximate orthogonality condition (a bound on the Babel function) on the dictionary.
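The cover-conversion fact used by Vainsencher et al. (an *L*-Lipschitz map turns an $\varepsilon$-cover of the domain into an $L\varepsilon$-cover of the image) can be checked numerically on a toy one-dimensional example (the particular map is ours):

```python
import numpy as np

L, eps = 3.0, 0.1
g = lambda t: L * np.sin(t)                  # an L-Lipschitz map on R

# An eps-cover of [0, 1]: grid points at spacing eps.
centers = np.arange(0.0, 1.0 + eps / 2, eps)

for x in np.linspace(0.0, 1.0, 101):
    j = int(np.argmin(np.abs(centers - x)))  # nearest cover center
    # Image points {g(c_j)} form an (L * eps)-cover of the image of g.
    assert abs(g(x) - g(centers[j])) <= L * eps + 1e-12
```

The same argument applied to the map from implementations *T* to loss functions $f_T$ is what lets a cover of the (finite-dimensional) implementation class control the induced loss function class.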

Analyzing the Lipschitz properties of the induced loss functions is essential for upper-bounding the generalization error of *k*-dimensional coding schemes. Different from the method used in Vainsencher et al. (2011), Maurer and Pontil (2010) employed Slepian’s lemma to exploit the Lipschitz property, while in this letter we propose a novel method, presented in the proof of theorem 8.

The comparisons of the generalization bounds for dictionary learning are similar to those for NMF because NMF can be regarded as dictionary learning in the positive orthant. We therefore omit the comparison. Many algorithms used in applications require sparsity in *Y* because sparsity has advantages in computation and storage. We therefore analyze sparse coding in section 4.3.

### 4.3 Sparse Coding

The following generalization bound for sparse coding is also from the work of Maurer and Pontil (2010), derived using the proof method of theorem 3.

We now consider the generalization bound of sparse coding using our method. The following lemma is proved in section 5.7.

Then we have the generalization bounds for sparse coding as follows:

The proof of theorem 28 is the same as that of theorem 21.

We then compare the above generalization bounds of sparse coding in Figure 2 under representative parameter settings. The comparisons show that the derived generalization bound is tighter than state-of-the-art bounds.

### 4.4 Vector Quantization and *k*-Means Clustering

The *k*-means clustering (or vector quantization) method aims to find *k* cluster centers such that observations can be partitioned into *k* clusters and represented by the *k* cluster centers with a small reconstruction error. Taking every column of *T* as a cluster center and setting *Y* to be the standard basis $\{e_1, \ldots, e_k\}$, we see that solving a *k*-means clustering problem amounts to finding an implementation *T*. The corresponding reconstruction error is
$$f_T(x) = \min_{1 \le i \le k} \|x - Te_i\|^2.$$
So the reconstruction error of *k*-means clustering and vector quantization is also within the framework of the reconstruction error of *k*-dimensional coding schemes.
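The equivalence between the coding-scheme form (codebook of standard basis vectors) and the usual nearest-center form of *k*-means can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(4)
m, k = 3, 4
T = rng.standard_normal((m, k))   # columns of T are the k cluster centers
x = rng.standard_normal(m)

# Coding-scheme form: codebook is the standard basis {e_1, ..., e_k}.
coding_error = min(np.sum((x - T @ np.eye(k)[i]) ** 2) for i in range(k))

# Clustering form: squared distance to the nearest cluster center.
kmeans_error = min(np.sum((x - T[:, i]) ** 2) for i in range(k))

assert abs(coding_error - kmeans_error) < 1e-12
```

Since $Te_i$ is exactly the $i$-th column of *T*, the two minima coincide term by term.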

The following lemma is essential for proving our dimensionality-dependent generalization bounds:

The proof of theorem 30 is the same as that of theorem 21.

The bound in equation 4.12 is of the same order as the bound obtained by Biau et al. (2008). The covering-number term in theorem 30 is of order $\mathcal{O}\big(\sqrt{mk\ln n/n}\big)$. If *k* is large relative to $m\ln n$, our bound can be tighter than that of Maurer and Pontil (2010) and the result in Biau et al. (2008). The generalization bounds derived by Maurer and Pontil (2010) and Biau et al. (2008) also have the advantage that they converge faster. As discussed in Bartlett et al. (1998), Linder et al. (1994), and Devroye, Györfi, and Lugosi (1996), the logarithmic factor in theorem 30 can be removed by the sophisticated uniform large-deviation inequalities of Alexander (1984) or Talagrand (1994). However, Devroye et al. (1996) proved (in their theorem 12.10) that the fast-convergence upper bound has an astronomically large constant. The corresponding convergence bound is therefore loose. Our generalization bound, which is derived by exploiting Bennett’s inequality, will be tighter if the empirical reconstruction error is small.

We compare the above generalization bounds of *k*-means clustering and vector quantization in Figure 3 under representative parameter settings. For *k*-means clustering and vector quantization problems, the dimensionality *m* can be independent of the reduced dimensionality *k*. Figure 3 shows that when *k* is not very large, the derived bound is tighter than state-of-the-art generalization bounds.

## 5 Proofs

### 5.1 Concentration Inequalities

In this section, we introduce the concentration inequalities that will be used to prove our assertions.

We first present Hoeffding’s inequality (Hoeffding, 1963), which is widely used for deriving generalization bounds.

We will also use Bernstein’s inequality and Bennett’s inequality (Boucheron et al., 2013; Zhang, 2013) to derive generalization bounds.

### 5.2 Proof of Lemma 7

We partition the region containing the columns of the implementations into *m*-dimensional regular solids of width proportional to $\varepsilon$; there are of the order $(c/\varepsilon)^m$ such regular solids per column and hence $(c/\varepsilon)^{mk}$ choices overall. If we pick out the centers of these regular solids and use them to make up *T*, we obtain candidate implementations $T_1, \ldots, T_N$. Then *N* upper-bounds the size of an $\varepsilon$-cover of the implementation class $\mathcal{T}$.
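The grid construction in this proof can be illustrated on a tiny instance (the toy parameter values are ours):

```python
import numpy as np
from itertools import product

def grid_cover(c, eps):
    """1-D eps-grid of [-c, c]: centers of intervals of width eps."""
    n_cells = int(np.ceil(2 * c / eps))
    return [-c + eps * (i + 0.5) for i in range(n_cells)]

c, eps, m, k = 1.0, 0.5, 2, 1        # tiny instance so the count stays small
axis = grid_cover(c, eps)             # ceil(2c/eps) points per coordinate
cover = list(product(axis, repeat=m * k))
assert len(cover) == int(np.ceil(2 * c / eps)) ** (m * k)   # (2c/eps)^(mk) cells

# Every entry of any T in [-c, c]^(m*k) is within eps/2 of some grid center.
T_flat = np.random.default_rng(5).uniform(-c, c, size=m * k)
nearest = [min(axis, key=lambda a: abs(a - t)) for t in T_flat]
assert max(abs(a - t) for a, t in zip(nearest, T_flat)) <= eps / 2 + 1e-12
```

The exponential count in $mk$ is precisely where the dimensionality dependence of the final bound comes from.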

### 5.3 Proof of Theorem 8

We first prove the following theorem, which is useful to prove theorem 8.

Consider the induced loss function class *F*. By a union bound of probability over the cover, we have that with probability at least $1 - \delta$, inequality 5.8 holds. It can be easily verified that inequality 5.9 also holds. Combining inequalities 5.8 and 5.9, and choosing the covering radius appropriately, we have that with probability at least $1 - \delta$, the claimed bound holds. This concludes the proof.

Theorem 8 can be proven by combining theorem 35 and lemma 7. We can also prove proposition 14 using the same method as that of theorem 8.

### 5.4 Proof of Theorem 12

According to Bernstein’s inequality, we have the following theorem, which is useful to prove theorem 12.

As in the proof of theorem 35, by a union bound of probability, we then have that with probability at least $1 - \delta$, the claimed bound holds. This concludes the proof.

Theorem 12 can be proven by combining theorem 36 and lemma 7.

### 5.5 Proof of Theorem 15

The following theorem, derived by exploiting Bennett’s inequality, is essential to prove theorem 15.

Theorem 37 can be easily proven by using Bernstein’s inequality. However, to show the faster convergence property, we propose a new method to prove Bennett’s inequality, which needs the following lemma:

Thus, there are many parameter pairs such that the first part of lemma 38 holds.

Similar to the proof of theorem 35, theorem 37 can be proven by using lemma 38 and a union bound of probability.

Theorem 15 can be proven by combining theorem 37 and lemma 7.

### 5.6 Proof of Lemma 19

The proof method is the same as that of lemma 19 in Maurer and Pontil (2010).

Suppose, to the contrary, that *y* is a minimizer of *h* and $\|y\| > r$. Because *T* is normalized, the columns of *T* are bounded in norm. Let the real-valued function *f* be defined as $f(t) = h(ty)$. One can then verify that *f* cannot have a minimum at $t = 1$, whence *y* cannot be a minimizer of *h*. Thus, the minimizer *y* must be contained in the ball with radius *r* in the *m*-dimensional space.

### 5.7 Proof of Lemma 27

### 5.8 Proof of Lemma 29

The proof method of lemma 29 is similar to that of lemma 7.

For *k*-means clustering and vector quantization, the Lipschitz property of the loss function with respect to *T* can easily be proved. As in the proofs of lemmas 7 and 27, we can pick out a set $\{T_1, \ldots, T_N\}$ having the property that for every *T*, there exists a $T_j$ in the set that is $\varepsilon$-close to *T*. The proof is as follows:

## 6 Conclusion

We have proposed a method to analyze the dimensionality-dependent generalization bounds for *k*-dimensional coding schemes, which are the abstract and general descriptions of a set of methods that encode random vectors in a Hilbert space. There are several specific forms of *k*-dimensional coding schemes, including NMF, dictionary learning, sparse coding, *k*-means clustering, and vector quantization, which have achieved great successes in pattern recognition and machine learning.

Our proof approach is based on an upper bound for the covering number of the loss function class induced by the reconstruction error. We explained that the covering number is more suitable for deriving dimensionality-dependent generalization bounds for *k*-dimensional coding schemes, because it avoids the worst-case dependency with regard to the number *k* of the columns of the linear implementation. If *k* is larger than the dimensionality *m*, our bound could be much tighter than the dimensionality-independent generalization bound. Moreover, according to Bennett’s inequality, we derived a dimensionality-dependent generalization bound of order $\mathcal{O}\big((mk\ln(mkn)/n)^{\lambda_n}\big)$, where $\lambda_n > 0.5$ when the sample size *n* is finite, for *k*-dimensional coding schemes. Our method therefore provides state-of-the-art dimensionality-dependent generalization bounds for NMF, dictionary learning, sparse coding, *k*-means clustering, and vector quantization.

## Acknowledgments

T.L. and D.T. were supported in part by Australian Research Council Projects DP-140102164 and FT-130101457. D.X. was partially supported by funding from the Faculty of Engineering and Information Technologies, University of Sydney, under the Faculty Research Cluster Program.

## References


## Note

1. Even though the faster convergence interpretation in theorem 15 is interesting, it looks complicated, and the upper bound is almost as tight as that of theorem 12. Therefore, we do not discuss its applications for specific *k*-dimensional coding schemes.