Matlab has its Parallel Computing Toolbox, but I never got into it, partly because I don't have a nicely configured cluster. A recent project forced me to find a way to fully utilize a single PC's resources. Without bothering with the parallel toolbox, MapReduce, or MPI, I used a quick-and-dirty way to do it on a single PC. It might be case-specific, but this kind of case shows up in many problems.
The study case: there are 3TB+ of binary data stored on a local network storage disk (NSD), which is mounted to a Windows machine with 32 GB of memory, a quad-core Intel i7-3770@3.40GHz (8 threads), and less than 1 TB of local hard disk. Those binary files are generated by a company's machine, and we don't know the format definition of the binary files, so we have to use a pre-compiled executable program (let's call it 'the-dumb.exe') to convert each binary file to an xml file; then we can use other code to parse the xml and extract the data we need. If we call the-dumb.exe to convert a binary file directly on the NSD, store the new xml also on the NSD, and run Matlab code to extract data from the xml into .mat files on the NSD, one step after another (only one process active, one task at a time), the running time for all the 40K files may last a week+ on multiple machines and never end. (I haven't heard of anyone actually finishing such a run. A rough estimate: converting one binary file takes about 2 minutes through the NSD, and extracting data with a Matlab xml parsing toolbox takes another 2~5 minutes.) So simply splitting the task and sending it to multiple machines cannot help. By observation, the CPU is only at 1% to 7% most of the time, and the memory is almost entirely idle.
To keep the problem simple, MPI and the Matlab parallel toolbox were not considered. Here I just list the solution steps.
0. Prepare a very efficient Matlab function that parses an xml file and extracts the desired data. It accepts a filename as input. Compile it to an executable (mcc -m Mfunc_name). Make this one highly efficient; don't let it become another bottleneck in the chain. I got it down to about 30 seconds for a 200~300 MB xml file.
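As a rough illustration (this is NOT my actual parser; the function name xml2mat and the <value> tag are hypothetical), a plain-text regexp scan is much faster than a DOM parser on files this large:

function xml2mat(xmlFile)
% Minimal sketch: assumes the numbers of interest live in
% <value>...</value> tags (a made-up tag name).
txt = fileread(xmlFile);                                % whole file as one string
tok = regexp(txt, '<value>([^<]*)</value>', 'tokens');  % grab tag contents
data = str2double([tok{:}]);                            % convert to a numeric vector
[p, n] = fileparts(xmlFile);
save(fullfile(p, [n '.mat']), 'data');                  % store .mat next to the xml
end

A nice side effect of compiling with mcc: the standalone executable only needs the MATLAB Compiler Runtime, so launching many copies of it in parallel doesn't consume extra Matlab licenses.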
1. Create a file list of the binary files to be processed. (Simple; done in a few seconds by listing the directory.)
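Something along these lines works (the NSD path and the .bin extension below are made-up placeholders):

binDir = '\\nsd\share\data';                % hypothetical NSD mount point
d = dir(fullfile(binDir, '*.bin'));         % list all binary files
flist = strcat(binDir, filesep, {d.name});  % full paths as a cell array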
2. Use Matlab to dynamically generate a dos-batch file (a sketch of the generator follows this sub-list), in which we:
(1). copy a binary file from the NSD to the local hard disk
(2). call 'the-dumb.exe' to convert it to an xml file with the same name, stored locally (can be done in less than 20~30 seconds for a 200~300 MB binary file)
(3). delete the binary file (we can't keep it: not enough space on the local machine; besides, it is useless after conversion)
(4). call the compiled Matlab executable to parse the xml into a mat file
(5). delete the xml file (same reason as (3))
(6). exit the batch file
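A minimal sketch of the generator (the function name make_batch is mine, and the command-line syntax of the-dumb.exe is an assumption; I don't know its real flags):

function make_batch(binPath, batDir, workDir)
% Sketch: write a one-shot batch file for a single binary file.
% binPath: full path on the NSD; workDir: local scratch directory.
[~, name, ext] = fileparts(binPath);
fid = fopen(fullfile(batDir, [name '.bat']), 'w');
fprintf(fid, 'copy "%s" "%s"\r\n', binPath, workDir);                    % (1) NSD -> local disk
fprintf(fid, 'the-dumb.exe "%s"\r\n', fullfile(workDir, [name ext]));    % (2) binary -> xml (assumed syntax)
fprintf(fid, 'del "%s"\r\n', fullfile(workDir, [name ext]));             % (3) remove the binary
fprintf(fid, 'xml2mat.exe "%s"\r\n', fullfile(workDir, [name '.xml']));  % (4) xml -> mat
fprintf(fid, 'del "%s"\r\n', fullfile(workDir, [name '.xml']));          % (5) remove the xml
fprintf(fid, 'exit\r\n');                                                % (6) done
fclose(fid);
end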
3. Check the number of active 'the-dumb.exe' and Matlab parsing processes. If the total is less than a pre-set NUM_WORKER, submit the next batch to the system through a non-blocking call in Matlab: system('cmd &'). With such a non-blocking system call, Matlab is free to move on to the next line to check and submit more tasks. To count the processes, I used
[status,result] = system('tasklist /FI "imagename eq the-dumb.exe" /fo table /nh');
num_proc = numel(strfind(result,'the-dumb'));
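Putting the pieces together, the dispatcher loop looks roughly like this (flist, batDir, workDir, and make_batch come from the earlier sketches; the polling interval is arbitrary):

NUM_WORKER = 10;                            % non-greedy worker cap
for k = 1:numel(flist)
    make_batch(flist{k}, batDir, workDir);  % write the one-shot batch file
    % wait until fewer than NUM_WORKER converters are running
    % (the real check also counts the compiled xml2mat.exe the same way)
    while true
        [~, result] = system('tasklist /FI "imagename eq the-dumb.exe" /fo table /nh');
        if numel(strfind(result, 'the-dumb')) < NUM_WORKER, break; end
        pause(5);                           % poll every few seconds
    end
    [~, name] = fileparts(flist{k});
    bat = fullfile(batDir, [name '.bat']);
    system(['"' bat '" &']);                % non-blocking: Matlab moves on at once
end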
In this way, a single running Matlab instance can keep NUM_WORKER processes going simultaneously on multiple binary files. The CPU can be fully utilized, up to 100%, and memory usage reaches 30% to 40%, depending on the file sizes. My rough observation is that 1K+ files can be processed per hour with a non-greedy NUM_WORKER=10, so the 40K files can be done in less than 40 hours (compared with a week+ on 3~4 machines). Of course, if we split the task over multiple machines, the whole job speeds up roughly linearly.
In short, in Matlab we can compile M-code (an independent task function) into an executable and use non-blocking system calls to make the best use of a single PC's resources. A simple batch file can assist with certain advanced needs. Also, I/O matters a lot when writing efficient code: staging files on the local disk instead of working through the NSD made a big difference here.