6

I have been posting this kind of question on Cross Validated, but since this one deals almost entirely with mathematics, I will post it here.

In signal reconstruction using compressed sensing, we want to sample a signal $f$ in order to obtain a smaller (compressed) signal $b$:

$$b = \Phi f$$

However, $f$ can also be expressed by a linear combination of basis functions $\Psi$ and its coefficients $c$:

$$f = \Psi c$$

So the first equation and second equation together produce:

$$b = \Phi \Psi c$$

Furthermore, we also want to promote the sparsity of $c$. This is done by solving the following optimization problem:

$$\min \|c\|_{\ell_1} \text{ subject to } b = \Phi \Psi c$$

where $\|c\|_{\ell_1}$ is the $\ell_1$-norm.
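To make this concrete for myself, here is a small sketch (my own code, not taken from the examples I discuss) that solves this minimization with SciPy, using the standard linear-programming reformulation $c = u - v$ with $u, v \ge 0$:

```python
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Solve min ||c||_1 subject to A c = b via linear programming.

    Standard reformulation: write c = u - v with u, v >= 0, then
    minimize sum(u) + sum(v) subject to A u - A v = b.
    """
    m, n = A.shape
    cost = np.ones(2 * n)              # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])          # A u - A v = b
    res = linprog(cost, A_eq=A_eq, b_eq=b, bounds=(0, None))
    return res.x[:n] - res.x[n:]

# toy problem: a signal with two active DCT coefficients,
# sampled at m random time points
n, m = 128, 40
rng = np.random.default_rng(1)
c_true = np.zeros(n)
c_true[[5, 30]] = [1.0, -0.7]                # sparse coefficients
Psi = idct(np.eye(n), norm='ortho', axis=0)  # f = Psi c
f = Psi @ c_true
perm = rng.permutation(n)[:m]                # Phi: random row selection
A = Psi[perm, :]                             # Phi Psi
b = f[perm]
c_hat = basis_pursuit(A, b)                  # should recover c_true (up to solver tolerance)
```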

I think I understand this part but I'd like to know more about the bases $\Psi$. How are they chosen? How are they implemented in an algorithm? Right now I'm trying to understand a couple of examples that use the following $\Psi$ and $\Phi$:

1.

D=dct(eye(n,n)); % \Psi
A=D(perm,:); % \Phi \Psi

where dct is the discrete cosine transform, n is the dimension of the original signal, and perm contains the first m entries of a randomly generated permutation of the indices:

r1 = permutation(arange(1, n))
perm = r1[0:m]
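To check my understanding of this snippet, I rewrote it in NumPy/SciPy (my own sketch; note that MATLAB's dct works column-wise, so axis=0 is needed to reproduce dct(eye(n,n))):

```python
import numpy as np
from scipy.fft import dct

n, m = 64, 16
rng = np.random.default_rng(0)

# D is the matrix representation of the DCT: D @ x == dct(x)
D = dct(np.eye(n), norm='ortho', axis=0)   # \Psi
x = rng.standard_normal(n)
assert np.allclose(D @ x, dct(x, norm='ortho'))

# perm: first m entries of a random permutation of the indices
r1 = rng.permutation(n)
perm = r1[:m]

# A keeps only the rows of D indexed by perm, so A @ x gives
# the m values of dct(x) at those indices
A = D[perm, :]                             # \Phi \Psi
assert np.allclose(A @ x, dct(x, norm='ortho')[perm])
```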

2.

Atest = zeros((nx, ny)).reshape(1, nx*ny)
Adelta = zeros((k, nx*ny))
for i, j in enumerate(r1k):
    Atest[0,j] = 1
    Adelta[i, :] = dct(Atest)
    Atest[0, j] = 0

here nx and ny are the dimensions of the original signal, and r1k is also a random permutation (similar to perm). Adelta is produced by setting one entry of the matrix Atest to 1, transforming Atest with dct, and adding the result as a row of the matrix Adelta.
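Here is my attempt at a self-contained version of this loop (the dimension values are made up for illustration), together with a one-shot equivalent suggesting that Adelta is just a row selection applied to the transpose of the DCT matrix, i.e. $\Phi$ and $\Psi$ combined in one step:

```python
import numpy as np
from scipy.fft import dct

nx, ny, k = 8, 8, 10
rng = np.random.default_rng(0)
r1k = rng.permutation(nx * ny)[:k]

# loop from the example: DCT-transform one standard basis vector e_j per pass
Atest = np.zeros((nx, ny)).reshape(1, nx * ny)
Adelta = np.zeros((k, nx * ny))
for i, j in enumerate(r1k):
    Atest[0, j] = 1
    Adelta[i, :] = dct(Atest)   # dct of e_j = column j of the DCT matrix
    Atest[0, j] = 0

# one-shot equivalent: rows of the transposed DCT matrix, picked by r1k
M = dct(np.eye(nx * ny), axis=0)
assert np.allclose(Adelta, M.T[r1k, :])
```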

In these pieces of code, I know A and Adelta represent $\Phi \Psi$ but I don't really understand why. I'd appreciate a few comments about this.

By the way, if you want to take a look at the full example, here is one using MATLAB and this one in Python. This example is also related to my other question in Cross Validated, so if you want to contribute to that also, you are more than welcome, although it deals with an implementation issue which may not be interesting enough.

UPDATE:

An additional question: how should we deal with the actual reconstruction? According to the second equation, $f = \Psi c$, once we know $c$, an approximation of $f$ is found simply by calculating $\Psi c$. However, what I see in practice is the direct application of a DCT to the coefficients $c$. Why?
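(For reference, if $\Psi$ is literally the DCT matrix, as in D = dct(eye(n,n)) above, then multiplying by $\Psi$ and calling the transform are the same computation, which may be what is happening:)

```python
import numpy as np
from scipy.fft import dct

n = 32
rng = np.random.default_rng(0)
c = rng.standard_normal(n)

# the matrix representation of the DCT, as in the first example
Psi = dct(np.eye(n), norm='ortho', axis=0)

# multiplying by the matrix == applying the transform directly
assert np.allclose(Psi @ c, dct(c, norm='ortho'))
```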

Thanks

r_31415
  • For part 1 above, $D$ is $\Psi$ and $A$ is $\Phi \Psi$. The value of $\Phi$ is implied by the selection of the random rows of $D$, so it's really just a row-selection matrix. For instance, if perm(1)=13, then the first row of $\Phi$ is all zeros except with a 1 in column 13. Likewise, if perm(2)=47, then the second row of $\Phi$ is all zeros except a 1 in column 47. Does this answer your question? – AnonSubmitter85 Jun 23 '13 at 00:51
  • Actually, it's the rows that you're selecting based on perm. I think you're also missing the dct function that is being applied on the rows before such selection takes place. – r_31415 Jun 23 '13 at 02:35
  • I understand that perm is selecting rows; that's why it implies that $\Phi$ is a row-selection matrix. As for the DCT function, taking the DCT of an identity matrix is just an easy way to create the DCT matrix. The following are equivalent: dct(X) == dct(eye(size(X))) * X. – AnonSubmitter85 Jun 23 '13 at 02:40
  • Are you confused about what A represents? If so, in the first case, it simply represents the values of the DCT that you have available, that is, your incomplete information from which you are going to reconstruct the entire signal. The DCT is a linear operator and as such has a matrix representation, thus, the line D = dct(eye(n,n)) is simply creating the matrix representation of the DCT. Removing rows from D is the same thing as removing values from the DCT of the signal. Thus, A * x is equal to values of dct(x) at the indices in perm. – AnonSubmitter85 Jun 23 '13 at 02:50
  • I don't think I'm confused about $A$. It is mostly "How are $\Psi$ and $\Phi$ chosen?" and "How are they implemented in an algorithm?" The second one might not be a good fit for this site, though. – r_31415 Jun 23 '13 at 03:25
  • By the way, thank you for all your interest. I really appreciate it. – r_31415 Jun 23 '13 at 03:56
  • I think you misunderstand the problem; you don't want to 'promote sparsity'. The general problem of $\ell_1$ reconstruction is that if the original signal $\mathbf{x}$ is "sparse enough", then solving $\min \Vert \mathbf{x} \Vert_1 \,\, \mathrm{s.t.} \,\, \Vert \mathbf{Ax} - \mathbf{b} \Vert_2 < \epsilon$ will recover the $N$ samples in $\mathbf{x}$ when only the $M$ measurements in $\mathbf{b}$ are available, where $M<N$. If this is possible, and how sparse $\mathbf{x}$ must be for it to work, depends on $\mathbf{A}$. This relationship is very complicated though (see the papers linked below). – AnonSubmitter85 Jun 23 '13 at 03:58
  • No, I understand the problem, although of course you're right that your statement is what guarantees this technique will work. However, the part about 'promoting sparsity' refers to the fact that if you use $\min \|x\|_2$ then you're going to get a lot of non-zero values in the coefficients of the expansion of your original signal. Therefore, you 'promote' sparsity by choosing the $\ell_1$-norm. – r_31415 Jun 23 '13 at 04:04
  • If you use $\ell_2$, then as far as I know, the problem does not have a unique solution, which is why it doesn't work. (I think people usually gloss over this fact by calling it "noisy".) I'd suggest you read at least the first three sections of Uncertainty Principles and Ideal Atomic Decomposition, where they show that the problem is actually well defined first for the $\ell_0$ norm and then the $\ell_1$ norm. I think it will really help you understand the role the linear constraints (that is, $A=\Phi \Psi$) play. – AnonSubmitter85 Jun 23 '13 at 04:11
  • I will, thanks. As far as I remember, the argument was something along the lines of "Why don't you pick $\ell_0$?" "Too slow" "Why don't you pick $\ell_2$?" "Fast but wrong" "Why don't you pick $\ell_1$?" "Alright... not so slow but somewhat unstable" – r_31415 Jun 23 '13 at 04:15
  • As for "how they are implemented," they are simply linear constraints, so there is no magic to it. At the most basic level, it's just a matrix-vector product. Usually, however, since matrix-vector products are slow, you'll see function handles used instead so that you can take advantage of fast algorithms for things like wavelets, DCTs, or Fourier transforms. Remember, though, that all of these are linear operators and every linear operator is equivalent to multiplication by some matrix. That's why in the literature you just see the general case of $\mathbf{A}\mathbf{x}=\Phi \Psi \mathbf{x}=\mathbf{b}$. – AnonSubmitter85 Jun 23 '13 at 04:16

3 Answers

2

I will not enter into the details of your code (which is not as readable as it should be); I will rather give you an abstract description: $\Phi$ is the matrix that represents the way you sample your signal $f$. For example, if you are doing subsampling, i.e. you just sample a subset of the entries of $f$, then $\Phi$ is the identity matrix with rows removed. On the other hand, $\Psi$ represents the basis in which you choose to expand your signal $f$. For example, if you know that $f$ is a time series containing only a couple of frequencies, then you can take $\Psi$ to be the discrete Fourier (or discrete cosine) matrix and then try to solve for $c$. Note that $c$ will be sparse, with nonzero entries associated to the distinct frequencies.

Manos
  • Thanks. Let me know what I can do to make the code readable (I didn't want to post the whole code because it would detract from the main question.) In any case, yes, that was my understanding of $\Psi$ and $\Phi$. Maybe the issue here is that I don't understand what leads to choose an operation such as DCT(I(n, n)) unlike the example you described in which it is very clear the purpose. – r_31415 Jun 22 '13 at 23:41
  • I would assume that this is because the signal that the designer of this example wants to sample is sparse in frequency. But which one is the signal $f$ in the code you show? I recommend this, and it's quite easy: why don't you design an example in MATLAB exactly along these lines? I.e., construct a time series with e.g. two frequencies, take some samples (the b vector) and then, using the DCT matrix, solve the $\ell_1$ problem and see if you can recover your time series. – Manos Jun 22 '13 at 23:50
  • I already did that (in Python) from an example in MATLAB (http://www.mathworks.com/company/newsletters/articles/clevescorner-compressed-sensing.html) It works but I still find rather mysterious the basis part. Actually, most of what I have done so far can be found in this other question (http://stats.stackexchange.com/questions/62280/example-of-signal-reconstruction) – r_31415 Jun 22 '13 at 23:56
  • By the way, how is your linear algebra? – Manos Jun 23 '13 at 01:11
  • It could be better but I don't think it will be an impediment right now. By the way, I didn't mean to say that in general I find mysterious the way bases work. It's mostly the choice of bases in these examples. – r_31415 Jun 23 '13 at 02:24
  • It all depends on what your signal is. That's the key to which basis one has to use. For example, in image analysis, it's very common to use the wavelet transform as $\Psi$. Why? Because images are usually sparse in the wavelet domain. So you should really think about what your signal $f$ is in your example or your application. – Manos Jun 23 '13 at 02:31
  • Oh, great. I think I was reading too much into the operation of these choices of $\Phi \Psi$. I suppose the second example is performing a selection $\Phi$ and a mapping $\Psi$ in one step. Is that right? – r_31415 Jun 23 '13 at 02:39
0

To take case 1, you know that $A=\Phi\Psi$. The DCT matrix, $D$, is $\Psi$. The value of $\Phi$ is implied by the random selection of rows in the line A = D(perm,:);. This is equivalent to multiplying $D$ by a row-selection matrix: A = P * D. To give a specific example, say that $D$ is a $5\times 5$ matrix and you want to make the $2\times 5$ matrix $A$ that is made up of rows 4 and 2 of $D$. Then

$$ P= \left[ \begin{array}{ccccc} 0 & 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0 & 0\\ \end{array} \right] $$
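This is easy to check numerically (a quick sketch, with a random stand-in for $D$):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 5))    # stand-in for the DCT matrix

perm = [3, 1]                      # rows 4 and 2, in 0-based indexing
P = np.zeros((2, 5))
P[np.arange(2), perm] = 1          # one 1 per row, in the selected column

# multiplying by the row-selection matrix == slicing out the rows
assert np.allclose(P @ D, D[perm, :])
```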

The following are good background material on the theoretical aspects of the problem:

Uncertainty Principles and Signal Recovery

Optimally Sparse Representation in General Nonorthogonal Dictionaries

Sparsity and Incoherence in Compressive Sampling

Decoding by Linear Programming

Uncertainty Principles and Ideal Atomic Decomposition

  • @RobertSmith To the contrary, they cover it in exacting detail. Also, I would not recommend reading a single part of these papers in isolation. – AnonSubmitter85 Jun 23 '13 at 00:04
  • Then would you mind to point out the page in which it's covered in exacting detail? – r_31415 Jun 23 '13 at 00:05
  • Pages 2-17 of "Optimally Sparse Representation in General Nonorthogonal Dictionaries". – AnonSubmitter85 Jun 23 '13 at 00:08
  • @RobertSmith You do realize that dictionary == bases, right? – AnonSubmitter85 Jun 23 '13 at 00:10
  • Right. However, there are only a couple of paragraphs that are directly applicable to my question. – r_31415 Jun 23 '13 at 00:13
  • I don't see how you can say that unless you are just flipping through them and giving up after not finding the exact wording you are hoping for. I'd suggest, if you really want to "know more about the bases," then the above papers are where you should start. It will take a non-trivial amount of effort on your part though, as I don't think you are going to find a simple one- or two-line comment (or a couple of isolated paragraphs in a paper) that explains things. – AnonSubmitter85 Jun 23 '13 at 00:26
  • Actually, I have a couple of days going through lots of material. When I said I want to know more about bases, I meant more about bases that is relevant to this particular question. However, if you think I missed something from those papers, please let me know by quoting the specific part that is relevant. – r_31415 Jun 23 '13 at 00:40
0

You say, "but I'd like to know more about the bases $\Psi$. How are they chosen? How are they implemented in an algorithm?"

$\Psi$ is chosen so that $c$ is sparse. If $\Psi$ is equal to the identity, then the signal is sparse in the canonical basis.

Let us imagine the signal is sparse in the canonical basis (i.e. $\Psi =\mathbf{I}$); then the difference between a simple sampling problem and a compressive sensing problem is the difference between $\Phi$ being the identity and $\Phi$ being an acceptable compressive sensing measurement matrix.

How is it implemented in an algorithm? $\Psi$ is either a matrix whose columns are the discretized versions of each basis function, or just a subroutine that applies the transform.
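For example (a sketch, using SciPy's LinearOperator to play the role of the "subroutine" version, with the orthonormal inverse DCT as $\Psi$):

```python
import numpy as np
from scipy.fft import dct, idct
from scipy.sparse.linalg import LinearOperator

n = 256

# explicit n x n matrix: column j is the j-th discretized basis function
Psi_matrix = idct(np.eye(n), norm='ortho', axis=0)

# same operator as a subroutine: the matrix is never stored,
# matvec/rmatvec just call the fast transforms
Psi_op = LinearOperator((n, n),
                        matvec=lambda c: idct(c, norm='ortho'),
                        rmatvec=lambda f: dct(f, norm='ortho'))

c = np.random.default_rng(0).standard_normal(n)
assert np.allclose(Psi_matrix @ c, Psi_op.matvec(c))
```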