The instructor offers some ideas for honors options in the class. These options are published well ahead of the class schedule since, according to the requirements linked above, a student must submit a proposal by the 10th day of classes. All such projects provide 10% extra credit in the class. Alternative assignments are always open for discussion, as long as they have some relevance to the class content. Please speak to the instructor privately (and soon!) to assess that relevance.
This particular assignment has some general specifications, but the implementation is completely open, even in terms of programming language.
We are interested in compressing text through the classic LZW algorithm. However, most of today's text is scattered across a multitude of formats on the Web, on cloud drives, and on local drives. It is impractical to assume that we can gather the entirety of available English text into a single file and run LZW on it.
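As a refresher, the core of classic LZW can be sketched in a few lines. This is a minimal character-stream version for illustration only; real implementations typically work on bytes, seed the dictionary with all 256 byte values, and cap the dictionary size:

```python
def lzw_compress(text):
    """Classic LZW: emit integer codes while growing the dictionary.

    Minimal sketch: the dictionary is seeded with the distinct
    characters of the input rather than the full byte range.
    """
    dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    w = ""           # longest prefix matched so far
    out = []         # emitted codes
    for ch in text:
        wc = w + ch
        if wc in dictionary:
            w = wc                       # extend the current match
        else:
            out.append(dictionary[w])    # emit code for the match
            dictionary[wc] = next_code   # learn the new phrase
            next_code += 1
            w = ch
    if w:
        out.append(dictionary[w])
    return out, dictionary
```

The returned dictionary is the flat-table view of the LZW tree that the project's merging questions are about.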
Instead, we want a program that launches one slave thread per file. Each thread performs LZW on its file, returns a reference to the resulting LZW tree, and is then joined with the master thread. Once all slave threads have joined with the master, the master must merge all of the different trees into a common one. This raises questions about the algorithmic complexity of the merge that the student will have to answer, such as:
- What's the brute-force algorithm for merging two LZW trees? What is its algorithmic complexity as a function of the size of the two trees (number of nodes or height)?
- Can we improve upon this brute force algorithm in terms of complexity?
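To make the first question concrete, here is one candidate brute-force merge, assuming a hypothetical nested-dict trie representation (each key is an edge label, each value a child subtree). Note that this sketch deliberately ignores the harder part of the problem, reconciling the code numbers attached to the nodes, and it is one baseline among several, not *the* answer to the question:

```python
def merge_tries(a, b):
    """Brute-force merge: walk trie b and graft its branches into
    trie a in place."""
    for label, b_child in b.items():
        if label in a:
            merge_tries(a[label], b_child)   # shared prefix: descend
        else:
            a[label] = b_child               # new branch: attach whole subtree
    return a
```

Analyzing how this behaves as a function of the two trees' sizes (and whether it can be beaten) is exactly the student's job.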
These theoretical questions might already have been answered by existing literature: it will be the student's job to determine this and inform the instructor of the relevant sources.
This approach is just a suggestion; you are free to work on the problem in any way you please.
- Choose a programming language.
- Implement standard LZW on a text file (forget about PDFs, HTML scraping, etc.; just use text files). Unit-test it thoroughly.
- Build the multi-threaded architecture. The threading library you use is your own choice. If you use C/C++, pthreads or std::thread is probably your best choice. For Java / Scala, you might want to look into Thread, Runnable, or even Akka actors. For Python, the threading module might be of interest.
- Put it all together. Make sure the master thread waits for all threads appropriately and the application doesn't suffer from deadlocks, busy-waiting or starvation issues.
- Unit-test the result by comparing its output to that of your single-threaded LZW run on a concatenation of the text files on your drive. You can also measure and report the timing difference in that test!
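Putting the steps above together, the master/slave skeleton might look like the following in Python. This is a sketch under assumed names (`build_lzw_tree`, `compress_all` are illustrative, not required), and the dictionary union at the end is a placeholder for your actual tree-merging algorithm:

```python
import threading

def build_lzw_tree(path, results, index):
    """Slave thread body: run LZW over one file and store the resulting
    dictionary/tree in a dedicated slot for the master to collect."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    dictionary = {chr(i): i for i in range(256)}  # standard byte-range seed
    next_code, w = 256, ""
    for ch in text:
        wc = w + ch
        if wc in dictionary:
            w = wc
        else:
            dictionary[wc] = next_code
            next_code += 1
            w = ch
    results[index] = dictionary  # no lock needed: each thread owns its own slot

def compress_all(paths):
    """Master thread: launch one slave per file, join them all, then merge."""
    results = [None] * len(paths)
    threads = [threading.Thread(target=build_lzw_tree, args=(p, results, i))
               for i, p in enumerate(paths)]
    for t in threads:
        t.start()
    for t in threads:   # master blocks here until every slave is done
        t.join()
    merged = {}
    for d in results:
        merged.update(d)  # placeholder merge; code renumbering is the hard part
    return merged
```

Because every slave writes only to its own slot of `results` and the master reads them only after all `join()` calls return, this structure avoids locks, busy-waiting, and deadlock by construction.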
These extensions do not affect your grade at all; they are just ideas for building a more robust application that you could then showcase in your portfolio.
- If the base system works, expand the slave-thread code so that it can parse PDFs, scrape a web URL for the text inside an HTML <body> tag, etc.
- Enhance the input of your application such that it can parse command line arguments or read from stdin, with or without input redirection, and/or a network socket.
- What if we have a base LZW tree and then want to enhance it with information from a new input stream? The naive approach (which might also be the optimal one!) would be to fork two threads, both of which, of course, hold a reference to the root of our tree. One of the threads performs the actual compression while the other busy-waits; the trees are then merged as in the base system. Can we do better?
- This is multi-threaded document compression in its most embarrassingly parallel form: the algorithm itself is not parallelized at all. Could we perhaps come up with a spin-off project which parallelizes the data structure itself? For the purposes of this class, this would be an entirely new project.
- Parallelizing the merging itself will probably yield benefits in time. For n documents, there are n - 1 merges to perform. Clearly, this can be done in a linear number of sequential merges: begin with the first two trees in some arbitrary ordering, merge that result with the third, and so on. This might lead to costly merges down the road if our merging algorithm is affected by badly balanced inputs (that is, one of the two trees being much larger than the other one). We should then consider spawning threads that merge individual pairs of trees in parallel and report their results to the master before joining, repeating the entire process until we have a single common tree. This would merge our full tree set in O(log n) rounds of parallel merges, but the thread overhead might not be worth it for small trees or small n. The latter is unlikely with modern threading libraries like Akka.
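The round-based pairwise merging described in the last bullet can be sketched as follows, again using the hypothetical nested-dict trie representation. Caveat: in CPython the GIL prevents true CPU parallelism across threads, so a library like Akka (or Python's multiprocessing) would be needed for real speedups; the sketch shows only the structure:

```python
import threading

def merge_tries(a, b):
    """Graft trie b into trie a in place (nested-dict representation)."""
    for label, child in b.items():
        if label in a:
            merge_tries(a[label], child)
        else:
            a[label] = child
    return a

def parallel_merge(tries):
    """Merge a list of tries in pairwise rounds, one thread per pair,
    until a single tree remains: O(log n) rounds for n input tries."""
    while len(tries) > 1:
        pairs = [(tries[i], tries[i + 1]) for i in range(0, len(tries) - 1, 2)]
        threads = [threading.Thread(target=merge_tries, args=p) for p in pairs]
        for t in threads:
            t.start()
        for t in threads:          # master waits out the current round
            t.join()
        survivors = [p[0] for p in pairs]  # each merge landed in its left tree
        if len(tries) % 2:                 # odd one out advances unmerged
            survivors.append(tries[-1])
        tries = survivors
    return tries[0]
```

Each round's threads touch disjoint pairs of trees, so no locking is needed within a round; the join barrier between rounds plays the role of the "report to master" step.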
- If an efficient algorithm for merging LZW trees is developed, its complexity analyzed, and no relevant works reporting better results can be found in the community, the instructor would be happy to work with the student on an academic paper to be submitted to a Computer Science conference like FOCS or STACS, where the student would obviously be first author.
This, being an honors project, does not lend itself to rigid grading criteria. Roughly speaking:
- Single-threaded implementation of LZW that only works on a text file or from stdin: 30%.
- Single-threaded implementation of LZW that works on a plethora of inputs: 50%.
- Parallelization with the suggested master-slave hierarchy that merges the trees correctly (i.e., tested):
- 80% for brute-force approach
- 100% for either an optimized merging approach or an explanation of why we cannot improve upon the brute-force approach.
The project can begin at any point during the semester and must be delivered on the last day of it. The instructor will organize a number of meetings throughout the semester to gauge your progress and to receive and provide input.