Matlab has its Parallel Computing Toolbox, but I never got into it, partly because I don't have a nicely configured cluster. A recent project forced me to find a way to fully utilize a single PC's resources. Without bothering with the parallel toolbox, MapReduce, or MPI, I used a quick-and-dirty way to do it on a single PC. It might be case-specific, but this kind of case shows up in many problems.
The study case: there are 3TB+ of binary data stored on a local network storage disk (NSD), which is mounted to a Windows machine with 32 GB of memory, a quad-core Intel i7-3770@3.40GHz (8 threads), and less than 1 TB of local hard disk. Those binary files are generated by a company's machine, and we don't know the format definition of the binary files, so we have to use a pre-compiled executable program (let's call it 'the-dumb.exe') to convert each binary file to an xml file; then we can use other code to parse the xml and extract the data we need. If we call the-dumb.exe to convert a binary file directly on the NSD, store the new xml also on the NSD, and run Matlab code to extract data from the xml into .mat files on the NSD, one step after another (only one process active, one task at a time), the running time for all the 40K files may last a week+ on multiple machines and never end. (I haven't heard of anyone actually finishing such a run. A rough estimate: converting one binary file takes about 2 minutes through the NSD, and extracting data with a Matlab xml parsing toolbox takes another 2~5 minutes.) So simply splitting the task and sending it to multiple machines cannot help. By observation, the CPU is only at 1% to 7% most of the time, and the memory is almost entirely idle.
To keep the problem simple, MPI and the Matlab parallel toolbox were not considered. Here I just list the solution steps.
0. Prepare a very efficient Matlab function that parses an xml file and extracts the desired data. It accepts a filename as input. Compile it to an executable (mcc -m Mfunc_name). Make this one highly efficient; don't let it become another bottleneck in the chain. I got it down to about 30 seconds for a 200~300 MB xml file.
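As a rough illustration (this is NOT my actual parser; the function name xml2mat and the <value> tag are hypothetical), a plain-text regexp scan is much faster than a DOM parser on files this large:

function xml2mat(xmlFile)
% Minimal sketch: assumes the numbers of interest live in
% <value>...</value> tags (a made-up tag name).
txt = fileread(xmlFile);                                % whole file as one string
tok = regexp(txt, '<value>([^<]*)</value>', 'tokens');  % grab tag contents
data = str2double([tok{:}]);                            % convert to a numeric vector
[p, n] = fileparts(xmlFile);
save(fullfile(p, [n '.mat']), 'data');                  % store .mat next to the xml
end

A nice side effect of compiling with mcc: the standalone executable only needs the MATLAB Compiler Runtime, so launching many copies of it in parallel doesn't consume extra Matlab licenses.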
1. Create a file list of the binary files to be processed. (Simple; done in a few seconds by listing the directory.)
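Something along these lines works (the NSD path and the .bin extension below are made-up placeholders):

binDir = '\\nsd\share\data';                % hypothetical NSD mount point
d = dir(fullfile(binDir, '*.bin'));         % list all binary files
flist = strcat(binDir, filesep, {d.name});  % full paths as a cell array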
2. Use Matlab to dynamically generate a dos-batch file (a sketch of the generator follows this sub-list), in which we:
(1). copy a binary file from the NSD to the local hard disk
(2). call 'the-dumb.exe' to convert it to an xml file with the same name, stored locally (can be done in less than 20~30 seconds for a 200~300 MB binary file)
(3). delete the binary file (we can't keep it: not enough space on the local machine; besides, it is useless after conversion)
(4). call the compiled Matlab executable to parse the xml into a mat file
(5). delete the xml file (same reason as (3))
(6). exit the batch file
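A minimal sketch of the generator (the function name make_batch is mine, and the command-line syntax of the-dumb.exe is an assumption; I don't know its real flags):

function make_batch(binPath, batDir, workDir)
% Sketch: write a one-shot batch file for a single binary file.
% binPath: full path on the NSD; workDir: local scratch directory.
[~, name, ext] = fileparts(binPath);
fid = fopen(fullfile(batDir, [name '.bat']), 'w');
fprintf(fid, 'copy "%s" "%s"\r\n', binPath, workDir);                    % (1) NSD -> local disk
fprintf(fid, 'the-dumb.exe "%s"\r\n', fullfile(workDir, [name ext]));    % (2) binary -> xml (assumed syntax)
fprintf(fid, 'del "%s"\r\n', fullfile(workDir, [name ext]));             % (3) remove the binary
fprintf(fid, 'xml2mat.exe "%s"\r\n', fullfile(workDir, [name '.xml']));  % (4) xml -> mat
fprintf(fid, 'del "%s"\r\n', fullfile(workDir, [name '.xml']));          % (5) remove the xml
fprintf(fid, 'exit\r\n');                                                % (6) done
fclose(fid);
end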
3. Check the number of active 'the-dumb.exe' and Matlab parsing processes. If the total is less than a pre-set NUM_WORKER, submit the next batch to the system through a non-blocking call in Matlab: system('cmd &'). With such a non-blocking system call, Matlab is free to move on to the next line to check and submit more tasks. To count the processes, I used
[status,result] = system('tasklist /FI "imagename eq the-dumb.exe" /fo table /nh');
num_proc = numel(strfind(result,'the-dumb'));
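Putting the pieces together, the dispatcher loop looks roughly like this (flist, batDir, workDir, and make_batch come from the earlier sketches; the polling interval is arbitrary):

NUM_WORKER = 10;                            % non-greedy worker cap
for k = 1:numel(flist)
    make_batch(flist{k}, batDir, workDir);  % write the one-shot batch file
    % wait until fewer than NUM_WORKER converters are running
    % (the real check also counts the compiled xml2mat.exe the same way)
    while true
        [~, result] = system('tasklist /FI "imagename eq the-dumb.exe" /fo table /nh');
        if numel(strfind(result, 'the-dumb')) < NUM_WORKER, break; end
        pause(5);                           % poll every few seconds
    end
    [~, name] = fileparts(flist{k});
    bat = fullfile(batDir, [name '.bat']);
    system(['"' bat '" &']);                % non-blocking: Matlab moves on at once
end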
In this way, a single running Matlab instance can keep NUM_WORKER processes going simultaneously on multiple binary files. The CPU can be fully utilized, up to 100%, and memory usage reaches 30% to 40%, depending on the file sizes. My rough observation is that 1K+ files can be processed per hour with a non-greedy NUM_WORKER=10, so the 40K files can be done in less than 40 hours (compared with a week+ on 3~4 machines). Of course, if we split the task over multiple machines, the whole job speeds up roughly linearly.
In short, in Matlab we can compile M-code (an independent task function) into an executable and use non-blocking system calls to make the best use of a single PC's resources. A simple batch file can assist with certain advanced needs. Also, I/O matters a lot when writing efficient code: staging files on the local disk instead of working through the NSD made a big difference here.