I Introduction
Independent Component Analysis (ICA) addresses the recovery of unknown statistically independent source signals from their observed mixtures, without full prior knowledge of the mixing model or the statistics of the source signals. The classical Independent Component Analysis framework usually assumes linear combinations of the independent sources over the field of real numbers. A special variant of the ICA problem arises when the sources, the mixing model and the observed signals are over a finite field, such as the Galois Field of order $q$, GF($q$).
Several types of generative mixing models may be assumed when working over GF($q$). These include modulo-additive operations, OR operations (over a binary field) and others. Existing solutions to ICA mainly differ in their assumptions on the generative model, the prior distribution of the mixing matrix (if such exists) and the noise model. Nevertheless, the assumption common to all of these approaches is that there exists a set of fully statistically independent source signals to be decomposed. However, these assumptions do not usually hold, and a more robust approach is required. In other words, we would like to decompose any given observed mixture into “as independent as possible” components, with no prior assumption on the way it was generated. This problem was first introduced by Barlow [1] as minimal redundancy representation, and was later referred to as factorial representation [2] or generalized ICA over finite alphabets [3].
A factorial representation has several advantages. The probability of occurrence of any realization can be simply computed as the product of the probabilities of the individual components that represent it (assuming such a decomposition exists). In addition, any method of finding factorial codes can be viewed as implementing the Occam’s razor principle, which prefers simpler models over more complex ones, where simplicity may be defined as the number of parameters necessary to represent the joint distribution of the data. In the context of supervised learning, independent features can also make later learning easier; if the input units to a supervised learning network are uncorrelated, then the Hessian of its error function is diagonal, allowing accelerated learning abilities [4]. There exists a large body of work which demonstrates the use of factorial codes in learning problems. This mainly manifests in artificial neural networks [5, 6], with applications to facial recognition [7, 8, 9] and deep learning [10, 11]. Recently, factorial codes were also shown to attain favorable compression rates in large alphabet source coding [12, 13, 14, 15].

In this work we focus on linear solutions to the factorial representation problem; we seek a linear transformation (over GF($q$)) that decomposes the observed mixture into “as statistically independent as possible” components. Linear solutions have several advantages. They are easy to implement in linear time-invariant (LTI) systems, they are robust to noisy measurements and they are storage space efficient. Importantly, they are usually simpler to derive and analyze than in the nonlinear case. For these reasons, linear ICA has received most of the attention over the years [16, 17, 18, 19, 20, 21, 22].
This paper is an extended version of our initial results, presented in [23]. In [23] we introduced a lower bound to the linear binary ICA problem, followed by a simple practical algorithm. Here we significantly extend the scope of this study. This includes the following contributions:

A comprehensive derivation of the computational and statistical properties of the methods presented in [23].

A novel computationally efficient algorithm which reduces the complexity of our previously suggested schemes.

Generalization of our framework to finite fields of any order.

A rigorous study of the flexibility of linear ICA methods, including both analytical and empirical results.

Comparative study of linear and nonlinear ICA methods.

A data compression application, which demonstrates the use of our suggested approach in large alphabet source coding.
II Previous Work
The problem considered in this paper has a long history. In his pioneering work, Barlow [1] presented a minimally redundant representation scheme for binary data. He claimed that a good representation should capture and remove the redundancy of the data. This leads to a factorial representation/encoding in which the components are as mutually independent as possible. Barlow suggested that such a representation may be achieved through minimum entropy encoding: an invertible transformation (i.e., with no information loss) which minimizes the sum of marginal entropies (as later presented in (1)). Unfortunately, Barlow did not propose any direct method for finding factorial codes. Atick and Redlich [24] proposed a cost function for Barlow’s principle for linear systems, which minimizes the redundancy of the data subject to a minimal information loss constraint. This is closely related to Plumbley’s [25] objective function, which minimizes the information loss subject to a fixed redundancy constraint. Schmidhuber [2] then proposed several ways of approximating Barlow’s minimum redundancy principle in the nonlinear case. This naturally implies much stronger results of statistical independence. However, Schmidhuber’s scheme is rather complex, and is subject to local minima [5]. Recently, we introduced a novel approach for finding factorial codes via nonlinear transformations [3]. In that work, Barlow’s objective is tightly approximated with a series of linear problems. Despite its favorable computational properties, the approach suggested in [3] is quite analytically involved. Later, we introduced a simpler suboptimal solution, which applies an order permutation according to the probability mass function of the observed mixture [14]. This method is shown to minimize the desired objective function up to a small constant, on the average, for increasing alphabet sizes.
In a second line of work, the factorial coding problem is studied under the assumptions that a fully independent decomposition exists, and that the mixture is linear. In his first contribution to this problem, Yeredor [16] considered a linear mixture of statistically independent sources over GF(2) (namely, binary ICA) and proposed a method for source separation based on entropy minimization. Yeredor assumed that the mixing model is a $d$-by-$d$ invertible matrix and proved that the XOR mixing model is invertible, and that there exists a unique transformation matrix to recover the independent components (up to permutation). Yeredor further suggested several algorithms for the linear binary ICA (BICA) problem, including the AMERICA and the enhanced MEXICO algorithms. Further, Yeredor generalized his work [17] to address the ICA problem over Galois fields of any prime order. His ideas were analyzed and improved by Gutch et al. [26]. In [21], Attux et al. extended Yeredor’s formulation to a more robust setup, in which the sources are not necessarily independent, and presented a heuristic immune-inspired algorithm, which was later improved and generalized to any GF($q$) [27].

In addition to the generative XOR model, there exist several alternative mixture assumptions. In [18] and [22] the authors considered a noisy-OR model, where the probabilistic dependency between observable vectors and latent vectors is modeled via the noisy-OR conditional distribution. Streich et al. [19] studied the binary ICA problem, where the observations are either drawn from a signal following OR mixtures or from a noise component. The key assumption made in this work is that the observations are conditionally independent given the model parameters (as opposed to the latent variables). This greatly reduces the computational complexity and makes the scheme amenable to an objective descent-based optimization solution. In [20], Nguyen and Zheng considered OR mixtures and proposed a deterministic iterative algorithm to determine the distribution of the latent variables and the mixing matrix.
III Linear Binary Independent Component Analysis
We begin our presentation by discussing the simpler binary case. Throughout this manuscript we use the following standard notation: underlines denote vector quantities, while their respective components are written without underlines but with an index. The probability function of the vector $\underline{X}$ is denoted by $P(\underline{x}) \triangleq P(\underline{X} = \underline{x})$, while $H(\underline{X})$ is the entropy of $\underline{X}$. This means $H(\underline{X}) = -\sum_{\underline{x}} P(\underline{x}) \log P(\underline{x})$, where the $\log$ function denotes a logarithm of base $2$, unless stated otherwise. Further, we denote the binary entropy of a Bernoulli distributed variable with parameter $p$ as $h_b(p) = -p \log p - (1-p)\log(1-p)$.
III-A Problem Formulation
Let $\underline{X}$ be a given random vector of dimension $d$. We are interested in an invertible transformation, $\underline{Y} = g(\underline{X})$, such that the components of $\underline{Y}$ are “as statistically independent as possible”. Notice that the common notion of ICA is not limited to invertible transformations (hence $\underline{X}$ and $\underline{Y}$ may be of different dimensions). However, in our work we focus on this setup, as we would like $g$ to be “lossless” in the sense that we do not lose any information. Further motivation for this setup is discussed in [1, 2].
We distinguish between linear and nonlinear invertible transformations. In the linear case, we seek $\underline{Y} = W\underline{X}$, where $W \in \{0,1\}^{d \times d}$ is an invertible matrix and all additions are over the XOR field, GF(2). In the nonlinear case, we notice that an invertible transformation of a vector $\underline{X}$ is a one-to-one mapping (i.e., permutation) of its $2^d$ words. This means there exist $2^d!$ invertible transformations, of which only $\prod_{k=0}^{d-1}\left(2^d - 2^k\right)$ are linear [17].
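These counts are easy to verify directly; a small sketch (function names are ours):

```python
from math import factorial

def num_invertible_maps(d):
    """All one-to-one mappings (permutations) of the 2^d binary words."""
    return factorial(2 ** d)

def num_linear_invertible(d):
    """Invertible d x d matrices over GF(2): the order of GL(d, 2),
    counting the choices of d linearly independent rows."""
    count = 1
    for k in range(d):
        count *= 2 ** d - 2 ** k   # rows independent of the k already chosen
    return count
```

For $d = 2$ there are already $24$ invertible maps but only $6$ linear ones; the linear fraction vanishes rapidly as $d$ grows.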
To quantify the statistical independence among the components of $\underline{Y}$ we use the well-known total correlation criterion, which was first introduced by Watanabe [28] as a multivariate generalization of the mutual information,

(1)  $C(\underline{Y}) = \sum_{j=1}^{d} H(Y_j) - H(\underline{Y}).$
This measure can also be viewed as the cost of coding the vector $\underline{Y}$ componentwise, as if its components were statistically independent, compared to its true entropy. Notice that the total correlation is nonnegative and equals zero iff the components of $\underline{Y}$ are mutually independent. Therefore, “as statistically independent as possible” may be quantified by minimizing $C(\underline{Y})$. The total correlation measure was first considered as an objective for factorial representation by Barlow [1].
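For concreteness, the total correlation of an empirical binary sample can be computed directly from its word frequencies; a minimal sketch (function names are ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def total_correlation(samples):
    """C(Y) = sum_j H(Y_j) - H(Y), from an (n, d) array of binary row-samples."""
    samples = np.asarray(samples)
    n, d = samples.shape
    marginal_sum = 0.0
    for j in range(d):
        p1 = float(samples[:, j].mean())
        marginal_sum += entropy([p1, 1.0 - p1])
    # Joint entropy from the empirical frequencies of the observed words.
    _, counts = np.unique(samples, axis=0, return_counts=True)
    return marginal_sum - entropy(counts / n)
```

For two independent uniform bits the total correlation is zero; for two identical bits it equals one bit, the full redundancy of the pair.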
Since we define $\underline{Y}$ to be an invertible transformation of $\underline{X}$, we have $H(\underline{Y}) = H(\underline{X})$, and so our minimization objective, in the binary case, is

(2)  $\min_{W} \; \sum_{j=1}^{d} h_b\big(P(Y_j = 1)\big) \quad \text{s.t. } \underline{Y} = W\underline{X}, \; W \text{ invertible over GF(2)},$

where $P(Y_j = 1)$ is the sum of probabilities of all words $\underline{y}$ whose $j$-th bit equals $1$. This means that the minimizing transformations are not unique. For example, we can invert the $j$-th bit of all words and achieve the same objective, or even shuffle the bits.
Notice that the probability mass function of $\underline{X}$ is defined over $2^d$ values. Therefore, any approach which exploits the full statistical description of $\underline{X}$ would require going over all $2^d$ possible words at least once. On the other hand, there exist at most $2^d!$ possible invertible transformations. The complexity of currently known binary ICA methods (and factorial codes) falls within this range. In the linear case (and under a complete independence assumption), the AMERICA algorithm [17], which assumes a XOR mixture, has a complexity of $O(d^2 \cdot 2^d)$. The MEXICO algorithm, which is an enhanced version of AMERICA, achieves a complexity of $O(2^d)$ under some restrictive assumptions on the mixing matrix. In the nonlinear case, an approximation of the optimal solution may be achieved at a computational complexity governed by an accuracy parameter [29], while the simpler order permutation [14] requires $O(d \cdot 2^d)$ operations, dominated by sorting the $2^d$ word probabilities.
III-B Lower Bound on Linear Binary ICA
In this section we introduce a lower bound on the binary linear ICA problem. Specifically, we present a method which obtains an infimum of (2) under the constraints $\underline{Y} = W\underline{X}$, with $W \in \{0,1\}^{d \times d}$ invertible over GF(2).
In his linear binary ICA work, Yeredor established a methodology based on a basic property of the binary entropy [17]. He showed that the binary entropy of the XOR of two independent binary variables is greater than (or equal to) each variable’s entropy. Unfortunately, there is no such guarantee when the variables are dependent. This means that in general, the entropy of the XOR of binary variables may or may not be greater than the entropy of each of the variables (for example, $H(X \oplus X) = 0 \le H(X)$, where $\oplus$ is the XOR operand). When minimizing (2) over linear transformations $W$, we notice that each $Y_j$ is a XOR of several, possibly dependent, variables $X_i$. This means that naively, we need to go over all possible subsets of $X_1, \dots, X_d$ and evaluate the entropy of their XOR. Formally, we define a matrix $M \in \{0,1\}^{2^d \times d}$ whose rows are all possible realizations of $d$ bits. Then, we would like to compute $\tilde{Y}_i = \underline{m}_i \cdot \underline{X}$ for all $i = 1, \dots, 2^d$, where $\underline{m}_i$ is the $i$-th row of $M$. This means that each row of the matrix $M$ corresponds to a subset of variables from $\underline{X}$, on which we apply a XOR. Then, we evaluate the binary entropy of each $\tilde{Y}_i$. A necessary condition for $W$ to be invertible is that it has no two identical rows. Therefore, a lower bound on (2) may be achieved by simply choosing the $d$ rows of the matrix $M$ for which $H(\tilde{Y}_i)$ are minimal. Notice that this lower bound is not necessarily tight or attainable. It defines a simple lower bound on (2), which is attained only if we are lucky enough to have chosen $d$ rows of the matrix $M$ which are linearly independent.

The derivation of our suggested bound requires the computation of $2^d$ binary entropy values, where each entropy corresponds to a different vectorial XOR operation over the observed sample. Then, we sort the entropy values in ascending order, using a simple quicksort implementation; this requires $O(d \cdot 2^d)$ operations. Therefore, the total computational complexity of our suggested bound is $O(d \cdot 2^d)$, up to the cost of the individual entropy evaluations. In other words, we attain a lower bound on any linear solution of (2) at a computational complexity that is asymptotically competitive with all currently known methods (both linear and nonlinear).
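The bound itself is straightforward to compute from data. The sketch below (our own naming, using an empirical sample in place of the true distribution) enumerates every nonzero XOR combination, scores it by its empirical binary entropy, and sums the $d$ smallest scores:

```python
import numpy as np
from itertools import product

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))

def linear_bica_lower_bound(samples):
    """Sum of the d smallest empirical entropies over all nonzero XOR
    combinations of the d components: a lower bound on the objective (2)."""
    samples = np.asarray(samples)
    n, d = samples.shape
    entropies = []
    for row in product([0, 1], repeat=d):
        if not any(row):
            continue                       # the all-zero row cannot appear in W
        y = (samples @ np.array(row)) % 2  # XOR of the selected components
        entropies.append(binary_entropy(float(y.mean())))
    entropies.sort()
    return sum(entropies[:d])
```

Note that the $d$ selected rows need not be linearly independent, which is exactly why this is only a bound.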
III-C A Greedy Algorithm for Linear Binary ICA
We now present our suggested algorithmic approach for the linear BICA problem, based on the methodology of the previous section. Again, we begin by evaluating $\tilde{Y}_i = \underline{m}_i \cdot \underline{X}$ for all rows $\underline{m}_i$ of $M$. Further, we evaluate the binary entropy of each $\tilde{Y}_i$. We then sort the rows of $M$ according to the binary entropy values of their corresponding $\tilde{Y}_i$. Denote the sorted matrix of rows as $\tilde{M}$. Our remaining challenge is to choose $d$ rows of $\tilde{M}$ such that the rank of these rows is $d$. Clearly, our objective suggests that we choose rows which are located at the top of the matrix $\tilde{M}$, as they result in lower entropy values. Our suggested greedy algorithm begins with an empty matrix $W$. It then goes over the rows of $\tilde{M}$ in ascending order. If the current row of $\tilde{M}$ is linearly independent of the current rows of $W$, it is added to $W$. Otherwise, it is skipped and the algorithm proceeds to the next row of $\tilde{M}$. The algorithm terminates once $W$ is of full rank. The rank of $W$ is evaluated by a simple Gauss elimination procedure (implemented on a Matlab platform as gfrank, for example) or by more efficient parallel computation techniques [30]. We denote our suggested algorithm as Greedy Linear ICA (GLICA).
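A compact sketch of GLICA follows (our own naming and data layout; rows are encoded as bit-mask integers and independence is tracked with a standard GF(2) XOR basis, rather than the Gauss elimination mentioned above):

```python
import numpy as np

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))

def try_insert(mask, basis):
    """XOR-basis insertion over GF(2); basis maps top-bit index -> reduced vector.
    Returns True iff mask is linearly independent of the rows inserted so far."""
    r = mask
    while r:
        top = r.bit_length() - 1
        if top not in basis:
            basis[top] = r
            return True
        r ^= basis[top]
    return False

def glica(samples):
    """GLICA sketch: score every nonzero XOR combination by its empirical binary
    entropy, sort, then greedily keep rows that remain linearly independent."""
    samples = np.asarray(samples)
    n, d = samples.shape
    scored = []
    for mask in range(1, 2 ** d):                       # every nonzero row of M
        row = np.array([(mask >> j) & 1 for j in range(d)])
        y = (samples @ row) % 2                         # XOR of selected components
        scored.append((binary_entropy(float(y.mean())), mask))
    scored.sort()                                       # ascending entropy
    basis, chosen = {}, []
    for _, mask in scored:
        if try_insert(mask, basis):                     # skip dependent rows
            chosen.append(mask)
        if len(chosen) == d:
            break
    return np.array([[(m >> j) & 1 for j in range(d)] for m in chosen])
```

For a sample in which the two bits always agree, the constant (zero-entropy) combination $X_1 \oplus X_2$ is selected first, and the resulting $W$ achieves a perfect decomposition.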
Although GLICA looks for linearly independent rows of $\tilde{M}$ in a no-regret manner, we may still consider its statistical properties and evaluate how much it deviates from the lower bound. Let us analyze the case where the order of the rows is considered random. This means that although the rows of $\tilde{M}$ are sorted according to the value of their corresponding entropies, we still consider the rows as randomly ordered, in the sense that the position of each row in $\tilde{M}$ is random and independent of the other rows. This assumption is typically difficult to justify. However, we later demonstrate in a series of experiments that the derived properties may explain the favorable performance of GLICA. This suggests that although there is typically dependence among the rows, it is practically low.
Let us examine GLICA at an iteration where it has already found $k$ linearly independent rows (rank $k$), and it is now seeking an additional independent row. Notice that there exist at most $2^k$ rows which are linearly dependent on the current rows of $W$ (all possible linear combinations of these rows). Assume we uniformly draw (with replacement) an additional row from the list of all $2^d$ possible rows. The probability of a drawn row being independent of the rows of $W$ is simply $1 - 2^{k-d}$. Therefore, the number of draws needed in order to find another linearly independent row follows a geometric distribution with parameter $p_k = 1 - 2^{k-d}$ (as the draws are i.i.d.). This means that the average number of draws needed to find an additional row is $1/(1-2^{k-d})$, while its variance is $2^{k-d}/(1-2^{k-d})^2$. Denote the number of rows that GLICA examines before termination by $T$. We have that $\mathbb{E}(T) = \sum_{k=0}^{d-1} \frac{1}{1-2^{k-d}}$ and $\mathrm{Var}(T) = \sum_{k=0}^{d-1} \frac{2^{k-d}}{(1-2^{k-d})^2}$. It can be shown [31] that $\mathbb{E}(T) \le d + 2$ and $\mathrm{Var}(T) \le 3$, for any $d$. We may now apply Chebyshev’s inequality to conclude that for every $t > 0$ we have

(3)  $P\left(T \ge d + 2 + t\right) \le \frac{3}{t^2}.$

This result implies that if we choose rows from $\tilde{M}$, even with replacement, our suggested algorithm skips no more than two rows on the average before terminating with a full rank matrix $W$, under the assumptions mentioned above. Further, the probability that our algorithm examines $d + 2 + t$ rows or more is bounded from above by $3/t^2$. Notice that this bound is independent of $d$. Therefore, the overhead over our suggested lower bound becomes negligible as $d$ increases. For example, for $t = 10$, the probability that we examine $d + 12$ rows or more is not greater than $0.03$.
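The constants appearing in this analysis are easy to check numerically under the independent-draws model; the functions below are our own illustration:

```python
def expected_extra_draws(d):
    """E[T] - d: average number of skipped rows; the draws needed at rank k are
    geometric with success probability 1 - 2^(k - d)."""
    return sum(1.0 / (1.0 - 2.0 ** (-j)) - 1.0 for j in range(1, d + 1))

def draws_variance(d):
    """Var(T): sum of the geometric variances (1 - p) / p^2 with p = 1 - 2^(k - d)."""
    return sum(2.0 ** (-j) / (1.0 - 2.0 ** (-j)) ** 2 for j in range(1, d + 1))
```

Both quantities converge quickly (to roughly $1.61$ and $2.74$, respectively), consistent with the bounds $\mathbb{E}(T) \le d + 2$ and $\mathrm{Var}(T) \le 3$.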
III-D Blockwise Approach for Linear Binary ICA
Despite its favorable computational and statistical properties, our suggested greedy algorithm (GLICA) still requires the computation of all $2^d$ possible XOR operations. This becomes quite costly as $d$ increases (or as the alphabet size grows, as we see in Section IV). Therefore, we suggest a simple modification to circumvent this computational burden.
Let us split the $d$ components of $\underline{X}$ into $B$ disjoint sets (blocks). For example, for $d = 6$ and $B = 3$ we may have the blocks $\{X_1, X_2\}$, $\{X_3, X_4\}$ and $\{X_5, X_6\}$. Let us now apply GLICA to each block. Denote the outcome of this operation as $\underline{Y} = \tilde{W}\underline{X}$, where the matrix $\tilde{W}$ is block diagonal. As discussed in Section III-A, our objective (2) is invariant to shuffling of components. It means we can randomly shuffle the components of $\underline{Y}$ (by multiplying it with a permutation matrix), and maintain the same objective value. In our example ($d = 6$, $B = 3$), the random shuffling may attain, for example, the new blocks $\{Y_1, Y_4\}$, $\{Y_2, Y_5\}$ and $\{Y_3, Y_6\}$. Notice that we now have a new set of blocks, on which we may again apply the GLICA algorithm, to further reduce our objective. We repeat this process for a configurable number of times, or until (local) convergence occurs. Notice that convergence is guaranteed, since we generate a sequence of nonincreasing objective values which is bounded from below by our suggested lower bound (Section III-B). The resulting linear transformation of this entire process is simply the product of all the matrices that we apply along the process. We denote this iterative blockwise algorithm as Block GLICA, or simply BloGLICA. Our suggested approach is summarized in Algorithm 1.
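A single BloGLICA round can be sketched as follows. Here `block_solver` is a stand-in for GLICA (or any other finite field ICA routine), and all names are ours:

```python
import numpy as np

def random_permutation_matrix(d, rng):
    """A d x d binary permutation matrix drawn uniformly at random."""
    P = np.zeros((d, d), dtype=int)
    P[np.arange(d), rng.permutation(d)] = 1
    return P

def bloglica_round(samples, block_solver, b, rng):
    """One BloGLICA round: solve each block of b components independently,
    then randomly shuffle the resulting components for the next round.
    Returns the transformed samples and the overall round matrix P @ W."""
    samples = np.asarray(samples)
    n, d = samples.shape
    W = np.zeros((d, d), dtype=int)
    for start in range(0, d, b):
        stop = min(start + b, d)
        W[start:stop, start:stop] = block_solver(samples[:, start:stop])
    P = random_permutation_matrix(d, rng)
    transformed = ((samples @ W.T) % 2 @ P.T) % 2   # per sample: y = P W x
    return transformed, (P @ W) % 2
```

With identity block solvers the round reduces to a pure shuffle, which leaves the objective (2) unchanged, as discussed above.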
Notice that this blockwise approach is a conceptual framework which reduces the computational complexity of any finite alphabet ICA method. In other words, we may replace the GLICA algorithm in line 4 of Algorithm 1 by any finite field ICA algorithm, and thereby reduce its computational burden. Let us set the maximal number of iterations to $R$. Further, set the maximal size of each block to $b$. Then, the computational complexity of BloGLICA is simply $O(R \cdot d \cdot 2^b)$. Notice that this complexity is typically much smaller than that of GLICA, since we do not examine all $2^d$ possible XOR operations and instead focus on local random searches implied by the block structure that we define.
III-E Illustrations
Let us now illustrate the performance of our suggested algorithms and bounds in several experiments. In the first experiment we examine our ability to recover independent binary sources that were mixed by an unknown matrix $A$. Let $\underline{X}$ be a $d$-dimensional binary vector. Assume that the components of $\underline{X}$ are i.i.d. with a parameter $p = P(X_j = 1)$. This means that the joint entropy of $\underline{X}$ is $H(\underline{X}) = d \cdot h_b(p)$. We draw $n$ i.i.d. samples from $\underline{X}$ and mix them with a binary matrix $A$. Then, we apply our binary ICA approach to recover the original samples and the mixing matrix. Figure 1 demonstrates the averaged results we achieve for different numbers of components $d$, where $A$ is an arbitrary invertible matrix that is randomly drawn prior to the experiment. We verify that the matrix $A$ is neither an identity nor a permutation matrix, to make the experiment meaningful. We compare our suggested approach with three alternative methods: AMERICA, MEXICO (described in Section II) and cobICA [27]. As previously described, cobICA is an immune-inspired algorithm. It starts with a random “population”, where each element in the population represents a valid transformation (an invertible matrix). At each step, the algorithm evaluates the objective function for each element in the population, and subsequently “clones” the elements. Then, the clones undergo a mutation process, generating a new set of individuals. This new set is evaluated again in order to select the individuals with the best objective values. The entire process is repeated until a preconfigured number of repetitions is executed. The cobICA algorithm requires careful tuning of several parameters. Here, and in the following experiments, we follow the guidelines of the authors; we set the general parameters as they appear in the parameter table of [27], while the rest of the parameters (concentration and suppression) are optimized over a predefined set of values in a preliminary independent experiment.
We first notice that for smaller values of $d$, both GLICA and AMERICA successfully recover the mixing matrix (up to a permutation of the sources), as they achieve an empirical sum of marginal entropies which equals the entropy of the samples prior to the mixture (blue curve at the bottom). Second, we notice that for the same values of $d$, our suggested lower bound (Section III-B) seems tight, as GLICA and AMERICA attain it. This is quite surprising, since our bound is not expected to be attainable. This phenomenon may be explained by the simple setting and the relatively small dimension. In fact, we notice that as the dimension increases, our suggested bound drops beneath the joint entropy $H(\underline{X})$, and GLICA fails to perfectly recover $A$. The green curve corresponds to MEXICO, which demonstrates inferior performance due to its design assumptions (see Section III-A). Finally, the red curve with the circles is cobICA, which is less competitive as the dimension increases. It is important to emphasize that while AMERICA and MEXICO are designed under the assumption that a perfect decomposition exists, cobICA and GLICA do not assume a specific generative model. Nevertheless, GLICA demonstrates competitive results, even when the dimension of the problem increases.
In our second experiment we consider a binary source vector over an alphabet of size $2^d$, whose joint distribution follows Zipf’s law,

$P(k; s, m) = \frac{k^{-s}}{\sum_{i=1}^{m} i^{-s}}, \quad k = 1, \dots, m,$

where $m$ is the alphabet size and $s$ is the “skewness” parameter. The Zipf’s law distribution is a commonly used heavy-tailed distribution, mostly in the modeling of natural (real-world) quantities. It is widely used in physical and social sciences, linguistics, economics and many other fields. In this experiment, we draw $n$ i.i.d. random samples from a Zipf’s law distribution with a fixed skewness parameter, where each sample is represented by a $d$-dimensional binary vector, and the representation is chosen at random. We evaluate the lower bound of (2) under linear transformations, followed by the GLICA algorithm. In addition, we examine the blockwise approach, BloGLICA (Section III-D). We compare our suggested bound and algorithms to cobICA [27]. Notice that in this experiment we omit the AMERICA and MEXICO algorithms, as they assume a generative mixture model which is heavily violated in our setup. Figure 2 demonstrates our objective (2) for an increasing number of components $d$.

We first notice that our suggested algorithms lie between the lower bound and cobICA. Interestingly, BloGLICA performs reasonably well, even as the number of blocks increases. On the computational side, Figure 3 demonstrates the runtime of each of these methods on a standard personal computer. While we do not claim that these are the optimal implementations of the algorithms, obvious optimizations were implemented. The fact that they were all implemented in the same language (Matlab) is assumed to give none of the methods a significant advantage over the others. The same runtime comparison approach was taken, for example, in [26].
Here we notice the extensive runtime required by cobICA, compared to our suggested methods. In addition, we see that as the dimension increases, GLICA necessitates a rapidly increasing runtime (as it requires computing and sorting $2^d$ entropy values). However, by introducing BloGLICA we avoid this problem at a relatively small overhead in the objective.
Finally, we repeat the previous experiment, where the samples are now drawn from a source distribution with greater entropy. Specifically, we consider $n$ i.i.d. draws from a Beta-Binomial distribution over an alphabet of size $2^d$,

$P(k; m, \alpha, \beta) = \binom{m}{k} \frac{B(k + \alpha,\; m - k + \beta)}{B(\alpha, \beta)}, \quad k = 0, \dots, m,$

where $\alpha$ and $\beta$ are the parameters of the distribution and $B(\cdot, \cdot)$ is the Beta function. We fix $\alpha$ and $\beta$ in our experiment. The Beta-Binomial distribution is the Binomial distribution in which the probability of success at each trial is not fixed but random, following a Beta distribution. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics, to capture overdispersion in Binomial-type distributed data. Although the entropy of this distribution does not have a closed-form analytical expression, it can be shown to be greater than that of the Zipf’s law source above, for the chosen parameters. As before, each drawn sample is represented by a $d$-dimensional binary vector, where the initial representation is chosen at random. Figure 4 demonstrates the results we achieve, applying the same binary ICA schemes as in the previous experiment.

Here again, we notice the same qualitative behavior as in the Zipf’s law experiment, where the main difference is that the ICA algorithms attain smaller values of (2). The reason for this phenomenon lies in the observation that sources with greater entropy are typically easier to decompose into independent components. This property is further discussed in Section VI.
IV Linear ICA over Finite Fields
Let us now extend our scope to general (possibly nonbinary) finite fields of higher order. Finite fields (Galois fields) with $q$ elements are commonly denoted as GF($q$). A finite field only exists if $q$ is a prime power, i.e., $q = p^r$ for some prime $p$ and a positive integer $r$. When $r = 1$, the field is called a prime field and its elements can be represented by the integers $\{0, 1, \dots, q-1\}$. Addition, subtraction and multiplication are then performed by modulo-$q$ operations, while division is not completely straightforward, but can be easily implemented using Bézout’s identity [26]. The simplicity of this field structure allows an easy implementation on any mathematical software – all it requires is multiplication, addition and the modulo operation on integers. For these reasons, we focus on linear ICA over prime fields, although our results can be easily extended to any finite field of prime power order. As with the binary case, we are interested in minimizing the sum of marginal entropies, which is now defined as
(4)  $\min_{W} \; \sum_{j=1}^{d} H(Y_j) \quad \text{s.t. } \underline{Y} = W\underline{X}, \; W \text{ invertible over GF}(q).$
Here, we have that $\underline{Y} = W\underline{X}$, where $W \in \{0, 1, \dots, q-1\}^{d \times d}$ and the multiplications and additions are modulo $q$. As before, each $Y_j$ is a linear combination (over GF($q$)) of possibly dependent variables $X_i$. This means that naively, we need to go over all $q^d$ possible linear combinations of $X_1, \dots, X_d$ to find the linear combinations that minimize the marginal entropies. Formally, we may define a matrix $M \in \{0, 1, \dots, q-1\}^{q^d \times d}$ of all possible linear combinations. We would like to compute $\tilde{Y}_i = \underline{m}_i \cdot \underline{X}$ for all $i = 1, \dots, q^d$. Then, we evaluate the entropy of each $\tilde{Y}_i$. As before, a necessary condition for $W$ to be invertible is that it has no two identical rows. Therefore, a lower bound on (4) may be achieved by simply choosing the $d$ rows of the matrix $M$ for which $H(\tilde{Y}_i)$ are minimal. As in the binary case, this lower bound is not necessarily tight or attainable; it defines a simple bound on (4), which is attained only if we are lucky enough to have chosen $d$ rows of the matrix $M$ that are linearly independent. Notice that this bound requires the computation of $q^d$ entropy values, followed by sorting them in ascending order. This leads to a computational complexity of $O(d \cdot q^d \log q)$, which may become quite costly as $d$ and $q$ increase.
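Checking that candidate rows are linearly independent over GF($q$) reduces to a rank computation by Gaussian elimination, with multiplicative inverses obtained from Fermat's little theorem (valid since $q$ is prime). A minimal sketch of our own:

```python
def gfq_rank(matrix, q):
    """Rank of an integer matrix over the prime field GF(q), via Gaussian
    elimination; pivots are normalized with pow(a, q - 2, q), the multiplicative
    inverse given by Fermat's little theorem (q prime)."""
    M = [[x % q for x in row] for row in matrix]
    if not M:
        return 0
    rank = 0
    for c in range(len(M[0])):
        # Find a row with a nonzero entry in column c, below the current rank.
        pivot = next((r for r in range(rank, len(M)) if M[r][c]), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        inv = pow(M[rank][c], q - 2, q)
        M[rank] = [(v * inv) % q for v in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][c]:
                f = M[r][c]
                M[r] = [(a - f * b) % q for a, b in zip(M[r], M[rank])]
        rank += 1
        if rank == len(M):
            break
    return rank
```

For instance, over GF(3) the rows $(1,2)$ and $(2,1)$ are dependent, since $2 \cdot (1,2) \equiv (2,1) \pmod 3$.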
As discussed in Section IIIC, we may derive a simple algorithm from our suggested bound, that seeks for independent rows from in a greedy manner. As in the binary case, we may derive the second order statistics of our suggested scheme, under the same assumptions mentioned in Section IIIC. Here, we have that the expected number of rows that our suggested algorithm examines before termination, , satisfies
(5)  $\mathbb{E}(T) = \sum_{j=1}^{d} \frac{1}{1 - q^{-j}} \le d + \frac{q}{(q-1)^2}.$
Further, the variance is bounded from above by

(6)  $\mathrm{Var}(T) = \sum_{j=1}^{d} \frac{q^{-j}}{\left(1 - q^{-j}\right)^2} \le \frac{q^2}{(q-1)^3},$
where the last inequality follows arguments similar to those of (5); a detailed derivation is given in [31]. Notice that both the expected overhead and the variance of $T$ converge to zero as the order of the field grows. We now apply Chebyshev’s inequality to conclude that for every $t > 0$ we have
(7)  $P\left(T \ge d + \frac{q}{(q-1)^2} + t\right) \le \frac{q^2}{(q-1)^3 \, t^2}.$
Again, this result implies that under the independent draws assumption, choosing rows from the sorted $\tilde{M}$, even with replacement, skips at most $\frac{q}{(q-1)^2}$ rows on the average, before terminating with a full rank matrix $W$. Further, the probability that our algorithm skips more rows than that is governed by the right-hand side of (7). Unfortunately, despite these desirable statistical properties, the computational complexity of this algorithm, $O(d \cdot q^d \log q)$, becomes quite intractable as both $d$ and $q$ increase. Therefore, we again suggest a block-oriented algorithm, in the same manner described in Section III-D. This reduces the computational complexity to $O(R \cdot d \cdot q^b \log q)$, where $R$ is the maximal number of iterations and $b$ is the maximal number of components in each of the blocks.
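The GF($q$) overhead terms can be checked numerically as in the binary case; the functions below are our own illustration of (5):

```python
def expected_overhead(d, q):
    """E[T] - d over GF(q): the expected number of skipped (dependent) rows,
    summing q^-j / (1 - q^-j) for j = 1..d."""
    return sum(q ** (-j) / (1.0 - q ** (-j)) for j in range(1, d + 1))

def overhead_bound(q):
    """Closed-form bound q / (q - 1)^2 on the expected overhead."""
    return q / float(q - 1) ** 2
```

As the field order grows, the expected overhead (and its bound) vanishes: for $q = 101$ fewer than $0.011$ rows are skipped on the average.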
Figure 5 illustrates the performance of our suggested bound and algorithms as the alphabet size increases. Let $\underline{X}$ be a six-dimensional random vector that follows Zipf’s law with a fixed skewness parameter. Here, we draw $n$ i.i.d. random samples of $\underline{X}$ for a dimension $d = 6$ and different prime numbers $q$. We evaluate our suggested lower bound, as previously discussed. We further apply the GLICA algorithm and examine BloGLICA with two different block configurations. Figure 5 demonstrates our objective for an increasing field order $q$.
We first notice that GLICA achieves a sum of marginal entropies that is remarkably close to the lower bound. In addition, we notice that BloGLICA results in only a slight deterioration in performance (for the examined block configurations). As we examine the runtime of each of these methods (Figure 6), we notice a dramatic increase in GLICA’s computational complexity as $q$ grows. As mentioned above, this computational drawback may be circumvented by applying the BloGLICA algorithm, at quite a mild overhead cost in the objective.
We further compare our suggested approach to AMERICA, MEXICO and cobICA, under a generative linear mixture model assumption, as described in Section III-E. Here again, we observe that both GLICA and AMERICA successfully decompose the mixture model for smaller values of $d$ and $q$, while MEXICO and cobICA are slightly inferior. The detailed results are located on the first author’s webpage (https://sites.google.com/site/amichaipainsky/supplemental), due to space limitations.
V Applications
Although ICA over finite fields does not originate from a specific application, it applies to a variety of problems. In [26], Gutch et al. suggested a specific application of this framework, in the context of eavesdropping on a multi-user MIMO digital communication system. Here, we show that ICA over finite fields also applies to large alphabet source coding. For simplicity of the presentation, we focus on the binary case.
Consider $d$ independent binary sources with corresponding Bernoulli parameters $p_1, \dots, p_d$. Assume that the sources are mixed by an unknown invertible matrix over GF(2). We draw $n$ i.i.d. samples from this mixture, which we would like to efficiently transmit to a receiver. Large alphabet source coding considers the case where the number of draws $n$ is significantly smaller than the alphabet size $2^d$. This means that the empirical entropy of the drawn samples is significantly smaller than the entropy of the source. Therefore, even if we assume that both the transmitter and the receiver know the source distribution, coding the samples according to the true distribution is quite wasteful. Large alphabet source coding has been extensively studied over the past decade. Several key contributions include theoretical performance bounds and practical methods in different setups (see [14] for an overview). Here, we show that under a mixed independent sources assumption, our suggested binary ICA framework demonstrates favorable properties, both in terms of code rate and runtime. Specifically, we show that by applying our suggested GLICA scheme, we (strive to) recover the original independent sources. Then, we encode each recovered source independently, so that the sources are no longer considered over a “large alphabet” (as the alphabet is now binary). Notice that the redundancy of this scheme consists of three terms: the decomposition cost (1), the (negligible) redundancy of encoding the recovered binary sources, and the cost of describing the matrix $W$ to the receiver ($d^2$ bits).
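The three redundancy terms suggest a simple per-sample rate estimate for the componentwise scheme. The following is a hypothetical accounting sketch (our own naming; the $d^2/n$ term amortizes the matrix description cost):

```python
import numpy as np

def componentwise_rate(samples):
    """Estimated bits per sample when each (transformed) binary component is
    encoded separately at its empirical marginal entropy, plus the d*d bits
    describing the transform matrix, amortized over the n samples."""
    samples = np.asarray(samples)
    n, d = samples.shape
    rate = 0.0
    for j in range(d):
        p = float(samples[:, j].mean())
        if 0.0 < p < 1.0:
            rate += -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)
    return rate + d * d / n
```

The closer the transformed components are to independence, the closer this estimate is to the true per-sample cost; the matrix overhead vanishes as $n$ grows.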
We illustrate our suggested coding scheme in the following experiment. We draw i.i.d. samples from a random mixture of sources, with a corresponding set of parameters . Notice that the joint entropy of this source is bits. We compare three different compression schemes. First, we examine the “textbook” approach for this problem. Here, we construct a Huffman code for the samples, based on their empirical distribution. Since the receiver is oblivious to this code, we need to transmit it as well. This results in a total compression size of at least times the empirical entropy, plus a dictionary size of , where is the number of unique symbols that appear in the samples. Second, we apply our suggested GLICA algorithm, followed by arithmetic coding of each of the recovered sources. Finally, we apply BloGLICA with , to reduce the runtime. Notice that the cost of describing the transformation is also reduced to . Figure 7 demonstrates the compression rate we achieve for different values of . The red curve with the asterisks corresponds to Huffman compression (following [32]) with its corresponding dictionary. We first observe the notable effect of the dictionary’s redundancy when , leading to a compression rate which is even greater than bits. However, as increases, the relative portion of the dictionary decreases, since grows much more slowly. This leads to a quick decay in the compression rate. The blue curve at the bottom is GLICA, while the black curve is the empirical sum of marginal entropies of the original sources. Here we observe that GLICA successfully decomposes the sources, where the difference between the two curves is due to the cost of describing (which becomes negligible as increases). Further, we compare GLICA with marginal encoding of each of the mixed components (that is, without trying to decompose the mixture first).
This results in a compression rate of approximately bits (magenta curve with rhombuses), as the mixture increased the empirical marginal entropies to almost their maximum. Finally, we apply BloGLICA (blue dashed curve). We notice an increased compression rate compared with GLICA, due to a greater sum of marginal entropies. However, BloGLICA is a much more practical approach, as it takes only seconds to apply (for ), compared with seconds for GLICA. Notice that we omit alternative methods that are inadequate or impractical for this high dimensional problem (cobICA, AMERICA, MEXICO).
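The accounting behind this comparison can be summarized in a rough cost model. The sketch below is a hypothetical back-of-the-envelope comparison: `bits_per_entry` for the Huffman dictionary is a modeling assumption, and the `d * d` term stands for describing a binary decomposing matrix; these are not the exact figures behind Figure 7.

```python
def textbook_scheme_bits(n, joint_entropy_hat, n_unique, bits_per_entry):
    # "Textbook" scheme: Huffman payload (at least n times the empirical
    # joint entropy) plus one dictionary entry per observed unique symbol.
    # bits_per_entry is an assumed per-entry dictionary cost.
    return n * joint_entropy_hat + n_unique * bits_per_entry

def ica_scheme_bits(n, marginal_entropies_hat, d):
    # ICA-based scheme: component-wise (arithmetic) coding of the recovered
    # sources plus d*d bits describing the d x d matrix over GF(2).
    return n * sum(marginal_entropies_hat) + d * d

# Toy numbers: the dictionary term dominates when n is small relative to the
# number of unique symbols, which is exactly the large alphabet regime.
print(textbook_scheme_bits(1_000, 9.5, 950, 20))
print(ica_scheme_bits(1_000, [0.6] * 15, 15))
```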
To conclude, we show that by applying our suggested scheme we are able to efficiently compress data in a large alphabet regime. Although our results strongly depend on the independent sources assumption, they are not limited to this model. In other words, we may apply GLICA to any set of samples and compare the resulting empirical sum of marginal entropies with the empirical joint entropy of the data; if the difference between the two is relatively small, then there is no significant loss in encoding the data component-wise, and the binary ICA compression scheme is indeed favorable.
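The acceptance test suggested in this paragraph, encoding component-wise only when the empirical entropy gap is small, can be sketched as follows; the function name and the synthetic data are illustrative.

```python
import numpy as np

def entropy_gap_bits(X):
    # X: n x d binary sample matrix. Returns (empirical joint entropy,
    # sum of empirical marginal entropies), both in bits; their difference
    # is the redundancy paid for encoding the components independently.
    n, d = X.shape
    symbols = X @ (1 << np.arange(d))          # pack each row into an integer
    counts = np.unique(symbols, return_counts=True)[1]
    pj = counts / n
    joint = float(-(pj * np.log2(pj)).sum())
    pm = np.clip(X.mean(axis=0), 1e-12, 1 - 1e-12)
    marg = float(-(pm * np.log2(pm) + (1 - pm) * np.log2(1 - pm)).sum())
    return joint, marg

rng = np.random.default_rng(1)
X = (rng.random((5_000, 6)) < 0.2).astype(int)  # already independent bits
joint, marg = entropy_gap_bits(X)
print(joint, marg, marg - joint)  # gap near zero -> encode component-wise
```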
VI On the flexibility of linear transformations
In the previous sections we introduced fundamental bounds and efficient algorithms for the linear ICA problem over finite fields. However, it is not quite clear how “powerful” this tool is for a given arbitrary mixture. In other words, compared to its alternatives, how well can a linear transformation minimize the sum of marginal entropies in the general case? To answer this question, we first need to specify an alternative approach.
VI-A Nonlinear binary ICA
Assume we are interested in minimizing (2) over nonlinear invertible transformations. As mentioned in Section II, this problem was originally introduced by Barlow [1] as minimally redundant representations, and is considered hard. In [3], the authors introduce a piecewise linear relaxation algorithm which tightly approximates the solution to (2) with a series of linear problems. Here we focus on a simplified greedy algorithm which strives to minimize (2) in a more computationally efficient manner with favorable theoretical properties. This approach, along with its computational and statistical characteristics, was previously introduced in [14]. Here, we briefly review these results.
VI-A1 The Order Permutation
As mentioned above, an invertible transformation is a one-to-one mapping (i.e., a permutation) of its alphabet symbols. In other words, an invertible transformation permutes the mapping of the symbols to the values of its probability mass function . Since our objective (1) is quite involved, we modify it by imposing an additional constraint. Specifically, we would like to sequentially minimize each term of the summation, , for , in a no-regret manner. As shown in [14], the optimal solution to this problem is the order permutation, which maps the smallest probability to the word (in its binary representation). This means, for example, that the all-zeros word is assigned to the smallest probability value in while the all-ones word is assigned to the maximal probability value.
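A minimal sketch of the order permutation (with illustrative dimensions): sort the probability mass function in ascending order, identify symbol i with the binary representation of i, and evaluate the resulting sum of marginal bit entropies.

```python
import numpy as np

def order_permutation_objective(p, d):
    # Assign the i-th smallest mass to word i (in binary), so the all-zeros
    # word carries the smallest probability and the all-ones word the
    # largest; return the resulting sum of marginal bit entropies (bits).
    q = np.sort(p)
    words = np.array([[(i >> b) & 1 for b in range(d)] for i in range(2 ** d)])
    marg = np.clip(words.T @ q, 1e-12, 1 - 1e-12)  # P(bit b = 1) under q
    return float(-(marg * np.log2(marg) + (1 - marg) * np.log2(1 - marg)).sum())

rng = np.random.default_rng(2)
d = 4
p = rng.dirichlet(np.ones(2 ** d))
joint = float(-(p * np.log2(p)).sum())
print(joint, order_permutation_objective(p, d))  # objective is always >= joint
```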
VI-A2 Worst-case and average-case performance
At this point, it is quite unclear how well the order permutation performs as a minimizer to (1). The following theorems present two theoretical properties which demonstrate its capabilities. These theorems (and their proofs) were first presented in [14].
We begin with the worst-case performance of the order permutation. Here, we denote our objective (1) as , as it solely depends on the probability vector and the transformation. Further, let us denote the order permutation as and the optimal permutation (which minimizes (1)) as . In the worst-case analysis, we would like to quantify the maximum of over all probability mass functions of a given alphabet size .
Theorem 1
For any random vector , over an alphabet size we have that
Theorem 1 shows that even the optimal nonlinear transformation achieves a sum of marginal entropies which is bits greater than the joint entropy in the worst case. This means that there exists at least one source with a probability mass function which is impossible to encode as if its components were independent without losing at least bits. This result is quite unfortunate, since we have that
Notice that this worst-case derivation obviously also applies to linear transformations. In other words, we can always find a worst-case source which no ICA algorithm (linear or nonlinear) can efficiently decompose.
We now turn to an average-case analysis. Here, we show that the expected value of is bounded by a small constant, when averaging uniformly over all possible over an alphabet size .
Theorem 2
Let be a random vector of an alphabet size and joint probability mass function . Let be the order permutation. For , the expected value of , over a prior uniform simplex of joint probability mass functions , satisfies
(8) 
This means that when the alphabet size is large enough, the order permutation achieves, on the average, a sum of marginal entropies which is only bits greater than the joint entropy, when all possible probability mass functions are equally likely to appear. Notice that the uniform prior assumption may not be adequate for every setup, but it does provide a universal result for the performance of this minimizer. Proofs of these theorems are located in [14] under different notation, and on the first author’s webpage (https://sites.google.com/site/amichaipainsky/supplemental) in the same notation that is used above.
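A Monte Carlo sketch in the spirit of Theorem 2 (dimensions, trial count and seed are our illustrative choices): draw pmfs uniformly from the simplex, apply the order permutation, and average the gap between the sum of marginal entropies and the joint entropy.

```python
import numpy as np

rng = np.random.default_rng(3)
d, trials = 8, 300
words = np.array([[(i >> b) & 1 for b in range(d)] for i in range(2 ** d)])
gaps = []
for _ in range(trials):
    # Order permutation of a uniform-simplex draw: sorted pmf on words 0..2^d-1.
    q = np.clip(np.sort(rng.dirichlet(np.ones(2 ** d))), 1e-300, None)
    joint = -(q * np.log2(q)).sum()
    marg = np.clip(words.T @ q, 1e-12, 1 - 1e-12)
    h = -(marg * np.log2(marg) + (1 - marg) * np.log2(1 - marg)).sum()
    gaps.append(h - joint)
print(np.mean(gaps))  # average overhead stays a small number of bits
```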
VI-B Average-case performance of linear ICA transformations
As with the nonlinear case, we would like to evaluate the averagecase performance of linear transformations.
Theorem 3
Let be a random vector of an alphabet size and joint probability mass function . Let be the optimal linear transformation (notice that the optimal transformation depends on ). Then, the expected value of , over a prior uniform simplex of joint probability mass functions , satisfies
(9) 
A proof of this theorem is provided in the Appendix. The theorem suggests that when the number of components is large enough, even the optimal linear transformation performs very poorly on the average, as it attains the maximal possible marginal entropy value (). Moreover, it is shown in the Appendix that this average-case performance is equivalent to applying no transformation at all. In other words, when the number of components is too large, linear transformations are useless in the worst case and on the average. The intuition behind these results is that our objective depends on the entire probability mass function of (which is of order ), while the number of free parameters, when applying linear transformations, is only . This means that when increases, linear transformations are simply not flexible enough to minimize the objective.
VI-C Illustrations
Let us now compare our suggested linear algorithms and bound with the nonlinear order permutation discussed above. In the first experiment we draw independent samples from a Zipf law distribution with and a varying number of components . We evaluate the lower bound of (2) under linear transformations, as discussed in Section III-B, and further apply the GLICA algorithm (Section III-C). We compare our linear results with the nonlinear order permutation on the one hand, and with applying no transformation at all on the other. Figure 8 demonstrates the results we achieve. As we can see, the order permutation outperforms any linear solution quite remarkably, for the reasons mentioned above. Moreover, we notice that the gap increases as the number of components grows.
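For a toy case the comparison can be made exhaustive. The sketch below (d = 3, a random pmf, all choices illustrative) brute-forces every invertible linear map over GF(2) and compares the best linear objective with the order permutation and with the joint entropy.

```python
import numpy as np
from itertools import product

d = 3
words = np.array([[(i >> b) & 1 for b in range(d)] for i in range(2 ** d)])

def sum_marginal_entropies(q):
    # Sum of marginal bit entropies (bits) of the pmf q on d-bit words.
    marg = np.clip(words.T @ q, 1e-12, 1 - 1e-12)
    return float(-(marg * np.log2(marg) + (1 - marg) * np.log2(1 - marg)).sum())

rng = np.random.default_rng(4)
p = rng.dirichlet(np.ones(2 ** d))
joint = float(-(p * np.log2(p)).sum())
order_obj = sum_marginal_entropies(np.sort(p))  # nonlinear order permutation

best_linear = np.inf                            # exhaustive search over GL(3, 2)
for flat in product([0, 1], repeat=d * d):
    A = np.array(flat).reshape(d, d)
    idx = (words @ A.T % 2) @ (1 << np.arange(d))  # image of each symbol under x -> Ax
    if len(set(idx)) < 2 ** d:                     # A is singular over GF(2)
        continue
    q = np.zeros(2 ** d)
    q[idx] = p                                     # pmf induced by the linear relabeling
    best_linear = min(best_linear, sum_marginal_entropies(q))
print(joint, order_obj, best_linear)
```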
We now turn to illustrate the average-case performance, as discussed in previous sections. Figure 9 shows the mean of our objective (empirically evaluated over independent draws from a uniform simplex), for the four methods mentioned above. We first see that the nonlinear order permutation converges to a small overhead constant, as indicated in Theorem 2. On the other hand, the lower bound of the linear solution behaves asymptotically like applying no transformation at all. This can be easily derived from Theorem 3 and the fact that , where is the digamma function, as shown in [14].
VII Conclusion
In this work we consider a framework for ICA over finite fields, in which we strive to decompose any given vector into “as statistically independent as possible” components, using only linear transformations. Over the years, several solutions have been proposed to this problem. Most of them strongly depend on the assumption that a perfect decomposition exists, while the rest suggest a variety of heuristics for the general case. In this work, we present a novel lower bound which provides a fundamental limit for the performance of any linear solution. Based on this bound, we suggest a simple greedy algorithm which shows favorable statistical properties, compared with our suggested bound. In addition, we introduce a simple modification to our suggested algorithm which significantly reduces its computational complexity at a relatively small cost in the objective.
In addition, we discuss the basic limitations of working with linear transformations. We show that this class of transformations becomes quite ineffective in the general case, when the dimension of the problem increases. Specifically, we show that when averaging over a uniform prior, applying the optimal linear transformation is equivalent to applying no transformation at all. This result should not come as a surprise, since the objective depends on the full statistical description of the problem (which is exponential in the number of components ), while linear transformations provide only free parameters. That being said, we do not intend to discourage the use of linear transformations for finite field ICA. Our analysis focuses on a universal setup in which no prior assumption is made. This is not always the case in real-world applications. For example, linear transformations may be quite effective in cases where we assume a generative linear model but the sources are not completely independent.
Although the focus of this paper is mostly theoretical, linear ICA is shown to have many applications in a variety of fields. We believe that the theoretical properties we introduce, together with the practical algorithms that address a variety of setups, make our contribution applicable to many disciplines.
This Appendix provides the proof of Theorem 3. We begin by introducing the following lemma:
Lemma 1
Let be a binary random vector of components, . Then,
where the expectation is over a prior of uniform simplex of probability mass functions and is the digamma function.

Proof: The probability vector consists of elements and follows a uniform simplex prior. This means that it follows a flat Dirichlet distribution with a parameter . The marginal probability of each component of is a summation of half of the elements of . Therefore, following the Dirichlet distribution properties, we have that again follows a Dirichlet distribution, with . This means that . Notice that for a symmetric Beta distributed random variable , we have that , where is the digamma function. Therefore,
(10)
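The aggregation step in this proof, namely that a fixed half of the cells of a flat Dirichlet vector sums to a symmetric Beta variable, can be checked numerically: Beta(a, a) has mean 1/2 and variance 1/(4(2a+1)), which with a = 2^{d-1} gives 1/(4(2^d+1)). Dimensions and trial count below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, trials = 6, 100_000
p = rng.dirichlet(np.ones(2 ** d), size=trials)  # flat Dirichlet over 2^d cells
marg = p[:, : 2 ** (d - 1)].sum(axis=1)          # mass of a fixed half of the cells
# Beta(2^(d-1), 2^(d-1)): mean 1/2, variance 1/(4 * (2^d + 1)).
print(marg.mean(), marg.var(), 1 / (4 * (2 ** d + 1)))
```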
Proof of Theorem 3: Let be a binary random vector of components, . For the simplicity of the presentation, denote our objective as . Its expectation over a prior of uniform simplex of probability mass functions is defined as where is the unit simplex and is a normalization constant such that . Let us use the symmetry of the simplex to reformulate this definition of expectation. Denote
where is the set of all ascending ordered probability vectors of size and is the set of all possible permutations of the vector . Notice we have that iff such that and . In other words, we can express every probability mass function as a permutation of an ascending ordered probability mass function. Therefore, .
Let us now focus on the set , for a fixed . Notice that this set represents the set of all invertible transformations of . In other words, assume , then is an invertible transformation iff . Linear invertible transformations define a (disjoint) partitioning of . This means that for every , we can define an invertible linear transformation such that . In other words, for every we can define a set of probability mass functions that are a result of all invertible linear transformations on it. These sets are disjoint since every set of linear transformations is closed; if and , then there exists a linear transformation from any element in to any element in , which contradicts the definition of above. Since these transformations are invertible (by definition), they are also elements in . Denote the size of the set as . Notice that while , as shown in [26]. Notice that these figures depend only on the dimension of the problem (and not on or ).
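The count of invertible linear transformations here is the order of GL(d, 2), the group of invertible d × d matrices over GF(2), which has the closed form prod_{i=0}^{d-1} (2^d − 2^i): each new row must avoid the span of the previously chosen rows. A quick sketch (the function name is ours):

```python
import math

def gl2_order(d):
    # |GL(d, 2)|: the i-th row may be any vector outside the 2^i-element
    # span of the rows already chosen.
    return math.prod(2 ** d - 2 ** i for i in range(d))

for d in (2, 3, 4):
    # Fraction of all 2^(d*d) binary matrices that are invertible.
    print(d, gl2_order(d), gl2_order(d) / 2 ** (d * d))
```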
In every set there exists (at least one) minimizer of our objective. We denote it by . Further, define the set of all linear minimizers as . We have that
(11)  
This means that
(12) 
since . Let us now multiply both sides by and average over all . We get that
(13)  
where the equality follows from . Finally, we have that
(14)  
due to Lemma 1 above. Notice that the expression on the left is exactly the expected value of the optimal linear solution, as desired.
References
 [1] H. Barlow, T. Kaushal, and G. Mitchison, “Finding minimum entropy codes,” Neural Computation, vol. 1, no. 3, pp. 412–423, 1989.
 [2] J. Schmidhuber, “Learning factorial codes by predictability minimization,” Neural Computation, vol. 4, no. 6, pp. 863–879, 1992.
 [3] A. Painsky, S. Rosset, and M. Feder, “Generalized independent component analysis over finite alphabets,” IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 1038–1053, 2016.
 [4] S. Becker and Y. Le Cun, “Improving the convergence of back-propagation learning with second order methods,” in Proceedings of the 1988 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann, 1988, pp. 29–37.

 [5] S. Becker and M. Plumbley, “Unsupervised neural network learning procedures for feature extraction and classification,” Applied Intelligence, vol. 6, no. 3, pp. 185–203, 1996.
 [6] D. Obradovic, An Information-Theoretic Approach to Neural Computing. Springer Science & Business Media, 1996.

 [7] S. Choi and O. Lee, “Factorial code representation of faces for recognition,” in Biologically Motivated Computer Vision. Springer, 2000, pp. 42–51.
 [8] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski, “Face recognition by independent component analysis,” IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1450–1464, 2002.
 [9] M. S. Bartlett, “Information maximization in face processing,” Neurocomputing, vol. 70, no. 13, pp. 2204–2217, 2007.
 [10] J. Schmidhuber, D. Cireşan, U. Meier, J. Masci, and A. Graves, “On fast deep nets for AGI vision,” in Artificial General Intelligence. Springer, 2011, pp. 243–246.
 [11] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
 [12] A. Painsky, S. Rosset, and M. Feder, “Universal compression of memoryless sources over large alphabets via independent component analysis,” in Data Compression Conference (DCC). IEEE, 2015, pp. 213–222.
 [13] ——, “A simple and efficient approach for adaptive entropy coding over large alphabets,” in Data Compression Conference (DCC). IEEE, 2016, pp. 369–378.
 [14] ——, “Large alphabet source coding using independent component analysis,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6514–6529, 2017.
 [15] A. Painsky, “Generalized independent component analysis over finite alphabets,” Ph.D. dissertation, arXiv preprint arXiv:1809.05043, 2018.
 [16] A. Yeredor, “ICA in Boolean XOR mixtures,” in Independent Component Analysis and Signal Separation. Springer, 2007, pp. 827–835.
 [17] ——, “Independent component analysis over Galois fields of prime order,” IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5342–5359, 2011.
 [18] T. Šingliar and M. Hauskrecht, “Noisy-OR component analysis and its application to link analysis,” The Journal of Machine Learning Research, vol. 7, pp. 2189–2213, 2006.
 [19] A. P. Streich, M. Frank, D. Basin, and J. M. Buhmann, “Multiassignment clustering for Boolean data,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 969–976.
 [20] H. Nguyen and R. Zheng, “Binary Independent Component Analysis with OR mixtures,” IEEE Transactions on Signal Processing, vol. 59, no. 7, pp. 3168–3181, 2011.
 [21] D. G. Silva, R. Attux, E. Z. Nadalin, L. T. Duarte, and R. Suyama, “An immune-inspired information-theoretic approach to the problem of ICA over a Galois field,” in Information Theory Workshop (ITW). IEEE, 2011, pp. 618–622.
 [22] F. Wood, T. Griffiths, and Z. Ghahramani, “A nonparametric Bayesian method for inferring hidden causes,” arXiv preprint arXiv:1206.6865, 2012.
 [23] A. Painsky, S. Rosset, and M. Feder, “Binary independent component analysis: Theory, bounds and algorithms,” in The International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016, pp. 1–6.
 [24] J. J. Atick and A. N. Redlich, “Towards a theory of early visual processing,” Neural Computation, vol. 2, no. 3, pp. 308–320, 1990.
 [25] M. D. Plumbley, “Efficient information transfer and anti-Hebbian neural networks,” Neural Networks, vol. 6, no. 6, pp. 823–833, 1993.
 [26] H. W. Gutch, P. Gruber, A. Yeredor, and F. J. Theis, “ICA over finite fields – separability and algorithms,” Signal Processing, vol. 92, no. 8, pp. 1796–1808, 2012.
 [27] D. G. Silva, J. Montalvao, and R. Attux, “cobICA: A concentration-based, immune-inspired algorithm for ICA over Galois fields,” in IEEE Symposium on Computational Intelligence for Multimedia, Signal and Vision Processing (CIMSIVP), 2014, pp. 1–8.
 [28] S. Watanabe, “Information theoretical analysis of multivariate correlation,” IBM Journal of Research and Development, vol. 4, no. 1, pp. 66–82, 1960.
 [29] A. Painsky, S. Rosset, and M. Feder, “Generalized binary independent component analysis,” in International Symposium on Information Theory (ISIT). IEEE, 2014, pp. 1326–1330.

 [30] K. Mulmuley, “A fast parallel algorithm to compute the rank of a matrix over an arbitrary field,” in Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing. ACM, 1986, pp. 338–339.
 [31] N. Shulman, “Communication over an unknown channel via common broadcasting,” Ph.D. dissertation, 2003.
 [32] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.