Compress/decompress a file using Huffman codes

I wrote a small Python script that can compress and decompress a text file. On large files (more than 5 MB), it reduces the file to roughly 0.35 of its original size, which is not far from the best compression ratio. Since it is in pure Python, it is slow. The script requires the modules huffman and bitarray.

pip install huffman bitarray --user

The script accepts two options: -c for compression and -d for decompression. The following is the help message produced by the script.

./compress -h
usage: compress [-h] [--compress] [--decompress] file

Compress/Decompress a file using Huffman codes.

positional arguments:
 file File to compress/decompress.

optional arguments:
 -h, --help show this help message and exit
 --compress, -c Compress file

This script is intended for Information Theory students who are looking for a Python script to play with.

We read the input file and compute the frequency of each character (Counter from the standard Python library does the job). We pass this frequency information to the huffman module and ask it for a code-book (Huffman code). Next, we replace each character with its code and save the result to a binary file. We also prepend the code-book to the file; the code-book is necessary if we want to decompress it. The Huffman code gives the shortest binary codeword to the symbol which occurs most frequently, and the longest to the one which occurs most rarely.
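For the curious, here is a minimal sketch of the compression step. It assumes the huffman package's codebook() helper and bitarray's encode(); the function name and structure are mine, and the script's actual details may differ.

    from collections import Counter

    import huffman
    from bitarray import bitarray

    def compress(text):
        # Count character frequencies and build a Huffman code-book.
        freq = Counter(text)
        code = huffman.codebook(freq.items())  # symbol -> '0101...' string
        # bitarray wants the codewords as bitarray objects.
        table = {sym: bitarray(bits) for sym, bits in code.items()}
        encoded = bitarray()
        encoded.encode(table, text)  # replace each character with its code
        return code, encoded

I ran the script (with the time command) on a large file (76 MB).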

time ./compress ~/Downloads/camk12_pp54_voxel2_diff--PP1--1e-13-_suOFF.dat -c
Reading /home1/dilawars/Downloads/camk12_pp54_voxel2_diff--PP1--1e-13-_suOFF.dat (76 MB).. done.
Generating codebook.. done.
Compressing files .. done.
Average codeword length : 2.924465
| Optimal average code length: 2.869759
Compressed files is written to /home1/dilawars/Downloads/camk12_pp54_voxel2_diff--PP1--1e-13-_suOFF.dat.dx
| Original file size : 80169086
| Compressed file size : 29307774
| Compression ratio : 2.735421

real 0m40.917s
user 0m40.822s
sys 0m0.096s

gzip takes less than a second (0.75 sec) to compress the same file, while this script took more than 40 seconds. With PyPy, it took even more time, because bitarray is already a C module.

The point of this script is demonstration, not efficiency. gzip reduced the file to 2878272 bytes. Ours (29307774 bytes) is not bad. Note that we are storing our code-book in plain text and not in binary format.

We also print the optimal average code length, the information-theoretic optimal average codeword length; no one can do better than this. It is the Shannon entropy of the character distribution, and Huffman codes come quite close to it.
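A sketch of how one might compute that optimal value (the helper name is mine, not the script's):

    import math
    from collections import Counter

    def optimal_code_length(text):
        # Shannon entropy: -sum p*log2(p) over the character distribution.
        freq = Counter(text)
        total = sum(freq.values())
        return -sum((n / total) * math.log2(n / total) for n in freq.values())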

On small files, you won't see much reduction.

To decompress, we replace each codeword with its symbol and write the result to the output file.
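Under the same assumptions as the sketch above, decompression is essentially one call to bitarray's decode(), with table being the same symbol-to-bitarray mapping used during compression:

    def decompress(encoded, table):
        # Replace each codeword with its symbol.
        return ''.join(encoded.decode(table))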

The file can be found on GitHub.


Compressed Sensing: A Python demo

I am having trouble with the HTML format of this blog post. Here is a draft PDF.

I recommend http://www.pyrunner.com/weblog/2016/05/26/compressed-sensing-python/, which is much better written.

Can you solve \(Ax=b\) for \(x\) when the number of rows of \(A\) is less than the number of columns, i.e., there are more variables than equations? Such a system is called an underdetermined system. For example: give me 3 numbers whose average is 3, or, solve \(x/3+y/3+z/3=3\). You are right! There are many solutions to this problem!

If one adds more constraints, then one can hope to get a unique solution. Give me the solution of minimum length (minimize the \(L_2\)-norm)? The answer is 3,3,3. Give me the solution which is most sparse (fewest non-zero entries)? The answer is 9,0,0 (or 0,9,0, or 0,0,9).
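You can check the minimum-length answer with numpy: the pseudo-inverse gives the minimum \(L_2\)-norm solution of an underdetermined system (a quick illustration of my own, not part of the demo scripts).

    import numpy as np

    A = np.array([[1/3, 1/3, 1/3]])   # one equation, three unknowns
    b = np.array([3.0])
    x = np.linalg.pinv(A) @ b         # minimum L2-norm solution
    print(x)                          # [3. 3. 3.]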

Many signals are sparse (in some domain). Images are sparse in the Fourier/wavelet domain; neural activity is sparse in the time domain; programming activity is sparse in the spatial domain. Can we solve an under-determined \(Ax=b\) for a sparse \(x\)? Formally,

\min_x L_0(x) \quad \text{s.t.} \quad Ax = b

where \(L_0\) is the 0-norm, i.e., the number of non-zero entries.

This problem is hard to solve: \(L_0\) is not a convex function. In such cases, one uses the convex-relaxation trick and replaces the non-convex function with a convex one.

\min_x L_1(x) \quad \text{s.t.} \quad Ax = b

It is now well understood that, under certain assumptions on \(A\), we can recover \(x\) by solving this tractable convex-optimization problem rather than the original NP-hard problem. Solvers based on a greedy strategy also exist and work even faster (though, according to Terence Tao's blog, they do not guarantee the same performance as interior-point based methods).

Somewhat formal statement

If \(x\) (of size \(N\)) is \(S\)-sparse, then the under-determined system \(A_{k\times N}\, x_N = b_k\), where \(b\) is the vector of \(k\) observations, can be solved exactly, provided the measurement matrix \(A\) has the following property:

  • \(\left\Vert y\right\Vert^2 \approx \left\Vert x \right\Vert^2\), where \(y = A x\), for any sparse signal \(x\), i.e., applying the measurement matrix does not change the length of the sparse signal \(x\) 'very much'. See the Restricted Isometry Property in sec. 3.1.

Let \(x\) be a sparse signal of length \(N\). We take the dot product of \(x\) with a random vector \(A_i\) of size \(N\), i.e., \(b_i = A_i \cdot x\). We make \(k\) such measurements. We can stack all \(k\) vectors \(A_i\) into a matrix \(A\) and represent the process by a linear system:

\begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_k \end{bmatrix} x = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix} \qquad(1)

Note that each \(A_i\) has dimension \(1 \times N\). We can rewrite eq. 1 as follows:

A x = b \qquad(2)

where the dimensions of \(A\) are \(k \times N\) and \(k \ll N\). This is an under-determined system with the condition that \(x\) is sparse. Can we recover \(x\) (of size \(N\)) from \(b\) (of size \(k \ll N\))?

Fig. 1 shows a real system. Compressed sensing says that \(x\) can be recovered by solving the following linear program.

\min_x \left\Vert x \right\Vert_1 \; \text{given} \; Ax = b\qquad(3)

A demonstration using Python

Data for fig. 2 is generated by the script ./compressed_sensing.py, and data for fig. 3 by the script ./magic_reconstruction.py. Both scripts also generate the corresponding figures.

Dependencies: in addition to scipy, numpy, and matplotlib, we also need pyCSalgo. It is available via pip: pip install pyCSalgo --user.

The code is available on GitHub.

Algorithm

  1. Input: a sparse signal \(x\) of size \(N\).
  2. Generate a random matrix (measurement matrix) \(A\) of size \(k\times N\).
  3. Make \(k\) measurements of \(x\) using \(A\). That is \(b = A x\). Note that each measurement of \(x\) (i.e. entry of \(b\)) is some linear combination of values of \(x\) given by \(A_i x\) where \(A_i\) is ith row.
  4. Now find \(x\) by solving \(\min_x \left\Vert x \right\Vert_1 \text{given}\; Ax=b\).

For step 4, we use the function l1eq_pd from the Python library pyCSalgo. It is a reimplementation of the l1-magic routine written in MATLAB by Candès. When \(k \gg 2S\), where \(S\) is the sparsity of \(x\), we can recover \(x\) quite well. In practice, \(k \ge 4S\) almost always gives an exact result.
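If you do not have pyCSalgo at hand, the same \(\ell_1\) program (eq. 3) can be solved as a plain linear program. The sketch below uses scipy.optimize.linprog instead of l1eq_pd (a swap of solver, not the author's routine), splitting \(\min_x \Vert x\Vert_1\) into auxiliary variables \(t \ge |x|\):

    import numpy as np
    from scipy.optimize import linprog

    def l1_min(A, b):
        # Solve min ||x||_1 s.t. Ax = b as an LP over z = [x; t],
        # with constraints x <= t, -x <= t, and Ax = b.
        k, N = A.shape
        c = np.concatenate([np.zeros(N), np.ones(N)])
        I = np.eye(N)
        A_ub = np.block([[I, -I], [-I, -I]])
        b_ub = np.zeros(2 * N)
        A_eq = np.hstack([A, np.zeros((k, N))])
        bounds = [(None, None)] * N + [(0, None)] * N
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b, bounds=bounds)
        return res.x[:N]

With a random Gaussian \(A\) and \(k \ge 4S\), this should recover the sparse \(x\) to numerical precision, matching the claim above.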

Other algorithms are available which work faster; they are based on greedy strategies. We do not discuss them here.

Sparse in time domain

[figure: cs_time_sparse]

Figure 1: \(x\) is a sparse signal of length 256. Using the measurement matrix \(A\), we made 124 random measurements of \(x\), namely \(b\). The sparse solution of eq. 2 is the solution of \(\min_x \left\Vert x \right\Vert_1 \text{ given}\; Ax = b\).

Sparse in Fourier domain

[figure: magic_reconstruction]

Figure 3: CS solution when the signal is sparse in the Fourier domain. Note that we took 200 samples for a signal of size 2000. The recovery is pretty good.

With error in measurements

In this example, some error is added to the measurements. The reconstructed signal is no longer 'exact'. Still, given that \(x\) was a binary signal, we can recover it by thresholding.

[figure: compressed_sensing]

Mathematical definitions

Restricted isometry property (RIP)

The restricted isometry property (RIP) of a matrix \(A\) is defined as follows.

Given \(\delta_s \in (0,1)\), for any sub-matrix \(A_s\) of \(A\) and for any sparse vector \(y\), if the following holds

(1-\delta_s) \left\Vert y \right\Vert^2 \leq \left\Vert A_s y \right\Vert ^2 \leq (1+\delta_s) \left\Vert y \right\Vert ^2

then the matrix \(A\) is said to satisfy the \(s\)-restricted isometry property with restricted isometry constant \(\delta_s\). Note that a matrix \(A\) with this property does not change the length of a sparse signal \(y\) 'very much'. This lets us sense two different sparse signals \(x_1\) and \(x_2\) such that \(A x_1\) and \(A x_2\) are almost surely different.
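Verifying RIP exactly is intractable in general, but one can see it at work empirically: for a random Gaussian matrix with rows scaled by \(1/\sqrt{k}\), \(\Vert A y\Vert\) stays close to \(\Vert y\Vert\) for random sparse \(y\). A quick illustration (the dimensions are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(0)
    k, N, S = 100, 400, 5
    A = rng.normal(size=(k, N)) / np.sqrt(k)   # scaled Gaussian matrix
    for _ in range(5):
        y = np.zeros(N)
        idx = rng.choice(N, size=S, replace=False)
        y[idx] = rng.normal(size=S)
        print(np.linalg.norm(A @ y) / np.linalg.norm(y))  # ratios close to 1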

References

To learn about compressed sensing, I recommend the following articles.

There are many other great articles and papers on this subject, and some dedicated webpages as well.

Performance of random number generators (Mersenne Twister)

I compared four implementations of the Mersenne Twister (mt19937, 32-bit) algorithm: the C++11 standard library <random>, Boost, the GNU Scientific Library (GSL), and our own implementation for the MOOSE simulator. The code is here (you can customize it to run on your platform; you have to disable the MOOSE-related code). The -O3 flag was passed to the gcc-4.8 compiler, on OpenSUSE Leap 42.1.

I generated random numbers in a tight loop of size N. The numbers were stored in a pre-initialized vector. For the benchmarking, I subtracted the time taken to put N values into a std::vector. I ran the benchmark 5 times for each implementation. Here is a sample run:

Wrote 1000000 sampled to sample_numbers.csv file
GSL random number generator: mt19937
MOOSE=3135507266, STL=2357136044,BOOST=2357136044,GSL=4293858116
MOOSE=1811477324, STL=2546248239,BOOST=2546248239,GSL=699692587
MOOSE=2095834071, STL=3071714933,BOOST=3071714933,GSL=1213834231
MOOSE=258599318, STL=3626093760,BOOST=3626093760,GSL=4068197670
MOOSE=1470212236, STL=2588848963,BOOST=2588848963,GSL=994957275
MOOSE=3009017253, STL=3684848379,BOOST=3684848379,GSL=2082945813
MOOSE=2280525416, STL=2340255427,BOOST=2340255427,GSL=4112332215
MOOSE=2165689929, STL=3638918503,BOOST=3638918503,GSL=3196767107
MOOSE=551388967, STL=1819583497,BOOST=1819583497,GSL=2319469851
MOOSE=368560217, STL=2678185683,BOOST=2678185683,GSL=3178073856
N = 100
 Baseline (vector storage time) ends. Time 2.46e-07
 MOOSE start.. ends. Time 1.309e-06
 STL starts .. ends. Time 1.6e-07
 BOOST starts .. ends. Time 1.54e-07
 GSL starts .. ends. Time 5.07e-07
N = 1000
 Baseline (vector storage time) ends. Time 2.26e-07
 MOOSE start.. ends. Time 6.63e-06
 STL starts .. ends. Time 3.883e-06
 BOOST starts .. ends. Time 2.3e-06
 GSL starts .. ends. Time 5.844e-06
N = 10000
 Baseline (vector storage time) ends. Time 1.213e-06
 MOOSE start.. ends. Time 6.1885e-05
 STL starts .. ends. Time 4.4721e-05
 BOOST starts .. ends. Time 2.4108e-05
 GSL starts .. ends. Time 6.3407e-05
N = 100000
 Baseline (vector storage time) ends. Time 1.3876e-05
 MOOSE start.. ends. Time 0.000616378
 STL starts .. ends. Time 0.000445375
 BOOST starts .. ends. Time 0.00023736
 GSL starts .. ends. Time 0.000630819
N = 1000000
 Baseline (vector storage time) ends. Time 0.000281644
 MOOSE start.. ends. Time 0.00591533
 STL starts .. ends. Time 0.00426978
 BOOST starts .. ends. Time 0.00222824
 GSL starts .. ends. Time 0.00612132
N = 10000000
 Baseline (vector storage time) ends. Time 0.00418973
 MOOSE start.. ends. Time 0.0574826
 STL starts .. ends. Time 0.0411453
 BOOST starts .. ends. Time 0.0204587
 GSL starts .. ends. Time 0.0590613
N = 100000000
 Baseline (vector storage time) ends. Time 0.0465758
 MOOSE start.. ends. Time 0.569934
 STL starts .. ends. Time 0.399875
 BOOST starts .. ends. Time 0.195795
 GSL starts .. ends. Time 0.598316
N = 1000000000
 Baseline (vector storage time) ends. Time 0.458415
 MOOSE start.. ends. Time 5.69154
 STL starts .. ends. Time 4.09148
 BOOST starts .. ends. Time 2.02454
 GSL starts .. ends. Time 5.85985
[100%] Built target run_randnum_benchmark

Here is the summary: the Boost random number generator is approximately 2x faster than the C++ STL <random> implementation, and approximately 3x faster than GSL (which is about as fast as the MOOSE random number generator).

[figure: moose_stl_boost_gsl]
MOOSE, C++11 <random> (STL), Boost, and GSL are compared for the mt19937 random number generator. For N from 100 to 10^9 values, run times are plotted on a log scale.
[figure: moose_stl_boost_gslratio]
Comparison of speed. GSL and MOOSE are equally fast and the slowest. Boost is the fastest (roughly 3x faster than MOOSE or GSL, and 2x faster than STL).

Simulating Random Walks using Langevin Equation

Random walks (Brownian motion), in addition to their theoretical potency (they describe the macro-scale behavior of a gas starting from a micro-scale description), also describe the behavior of many processes in nature. A few of them, such as genetic networks and protein expression driven by mRNA, have been described/predicted well using stochasticity. Moreover, they are ideal noise sources caused by thermal fluctuations and are often found in natural processes. To get a solid foundation in this subject, see the classic "Stochastic Processes in Physics and Chemistry" by Van Kampen, and also the "Handbook of Stochastic Methods" by Gardiner.

Random walks (in fact, any stochastic process) can be described by the Fokker-Planck equation: it describes how the probability density function evolves over time. An equivalent description is the Master equation, which is much easier to visualize and solve (using the Gillespie algorithm, a variant of the Markov method). The Master equation can describe "almost" all of chemistry. In fact, Einstein built his theory of Brownian motion by writing down a variant of the Fokker-Planck equation. After Einstein published his work, the Frenchman Paul Langevin derived the same theory using a totally different approach, one that is "infinitely simpler" than Einstein's. I'd highly recommend reading the paper, which does not require more than intermediate mathematics: http://scitation.aip.org/content/aapt/journal/ajp/65/11/10.1119/1.18725

In this note, I present a Python recipe for solving the Langevin equation to simulate a random walk. The Langevin approach is computationally extremely cheap and works extremely well in practice. Of course, there is a cost involved: while Fokker-Planck gives you the "exact" mean and variance (if you can solve it), you need to produce many Langevin trajectories to see the mean and variance converge to fixed values.

But for simulating biological and physical processes, you don't worry too much about the overall mean and variance. One trajectory is good enough for introducing noise sources.

The Langevin equation looks something like the following.

dx = -f(x)\, dt + \alpha\, g(x) \sqrt{dt}

where \(\alpha\) is normally distributed with mean 0 and variance 1. Forget \(f(x)\) and \(g(x)\); the out-of-place thing about this equation is the square root of \(dt\) on the right-hand side. This led to stochastic calculus and stochastic differential equations.
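A minimal Euler-Maruyama sketch for this equation (my own illustration, not the solve.py attached to this note); with \(f = 0\) and constant \(g = \sqrt{2D}\) it reduces to a plain 1D random walk with diffusion coefficient \(D\):

    import numpy as np

    def langevin_trajectory(f, g, x0=0.0, dt=1e-3, steps=10000, seed=None):
        # Euler-Maruyama: x += -f(x)*dt + alpha*g(x)*sqrt(dt),
        # with alpha drawn from N(0, 1) at every step.
        rng = np.random.default_rng(seed)
        x = np.empty(steps + 1)
        x[0] = x0
        for n in range(steps):
            alpha = rng.normal()
            x[n + 1] = x[n] - f(x[n]) * dt + alpha * g(x[n]) * np.sqrt(dt)
        return x

    # Plain random walk: f = 0, g = sqrt(2*D) with D = 1.
    traj = langevin_trajectory(lambda x: 0.0, lambda x: np.sqrt(2.0), seed=1)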

For the sake of "web and coding", the problem statement and a Python recipe that simulates this equation can be found here. Five model trajectories of a 1D random walk generated by this equation are attached to this note. Many others can be generated using the script solve.py. The mean is as usual; the standard deviation relates to the diffusion coefficient.

PS: File solve.py implements the assignment. To produce these trajectories run the following command:

     $ python solve.py 1

For more about "Randomness in Biology", visit http://courses.ncbs.res.in/mod/resource/view.php?id=374 (log in as guest).

NOTES:

  1. Google for Sriram Ramaswamy's Resonance article on Einstein's derivation; it is a nice, readable article. Do check the original Langevin paper linked above.
  2. For those interested in control theory, it is instructive to read about biological processes: how a cell controls its size and the macro-molecules inside it. Biology is an unexplored gold mine for control theory. It is also worth thinking about how noise is minimized in cellular processes.

Calculating max-flow using push-relabel method

Calculating the maximum flow in a flow network is a fundamental problem. It has been observed that smart implementations of 'push-relabel' methods perform better than algorithms based on finding 'augmenting paths'. We have implemented one such variation of the push-relabel method, known as the 'relabel to front' algorithm. It is discussed in detail in Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein.
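To give a flavor of the method, here is a minimal generic push-relabel sketch in Python. It is not our boost::graph implementation, and it discharges active nodes in stack order rather than maintaining the relabel-to-front list, but the push and relabel operations are the same.

    def max_flow(capacity, s, t):
        """capacity: dict of dicts; capacity[u][v] is the capacity of edge u->v."""
        nodes = set(capacity)
        for u in capacity:
            nodes.update(capacity[u])
        n = len(nodes)
        # Residual graph, with reverse edges present from the start.
        res = {u: {} for u in nodes}
        for u in capacity:
            for v, c in capacity[u].items():
                res[u][v] = res[u].get(v, 0) + c
                res[v].setdefault(u, 0)
        height = dict.fromkeys(nodes, 0)
        excess = dict.fromkeys(nodes, 0)
        height[s] = n
        for v in list(res[s]):              # initial preflow: saturate edges out of s
            d, res[s][v] = res[s][v], 0
            res[v][s] += d
            excess[v] += d
        active = [u for u in nodes if u not in (s, t) and excess[u] > 0]
        while active:
            u = active.pop()
            while excess[u] > 0:            # discharge u completely
                pushed = False
                for v, c in res[u].items():
                    if c > 0 and height[u] == height[v] + 1:
                        d = min(excess[u], c)       # push d units along u -> v
                        res[u][v] -= d
                        res[v][u] += d
                        excess[u] -= d
                        excess[v] += d
                        if v not in (s, t) and v not in active:
                            active.append(v)
                        pushed = True
                        if excess[u] == 0:
                            break
                if not pushed:              # no admissible edge: relabel u
                    height[u] = 1 + min(height[v] for v, c in res[u].items() if c > 0)
        return excess[t]

    # Example: the maximum s -> t flow in this small network is 15.
    cap = {'s': {'a': 10, 'b': 5}, 'a': {'b': 15, 't': 10}, 'b': {'t': 10}}
    print(max_flow(cap, 's', 't'))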

We do not guarantee that our C++ implementation is exactly what is given in the book; in fact, it is slightly different, as we select nodes in some fixed order. We use the boost::graph library. Details are in the 'Readme.md' file. It is available in my GitHub repository. You can use it, but please do not modify it without notifying the author; I wish to keep track of bugs.