GSoC 2022 - Final Work Report

In this gist, I present my work during the Google Summer of Code (GSoC) 2022 program. For this project, I contributed code to the published book Probabilistic Machine Learning: An Introduction (book1) and the upcoming book Probabilistic Machine Learning: Advanced Topics (book2), both by Kevin P. Murphy.

I mainly worked on two repositories: i) pyprobml (public): a repository containing most of the source code for the book; and ii) bookv2 (private): a repository containing the source code of the book.

Highlights

  • I worked on this project for 16 weeks, averaging 42 hours a week.
  • I refactored the pyprobml repository to shift from .py files to .ipynb files containing code to reproduce figures.
  • I added automatic code testing via GitHub workflows to pyprobml.
  • I generated code for several figures in the “Gaussian processes” chapter.
  • I developed a Bayesian optimization library in JAX called BIJAX.
  • I co-authored a section (34.7 - Active learning) with Dr. Kevin P. Murphy.
  • I learned advanced tools of the JAX library.

.py to .ipynb conversion

Background

In the earlier version of the repo, the code for each figure was stored in a .py file, and there was one Jupyter notebook per chapter containing cells that executed each of those .py files. Links to those cells were stored in a private Firestore database, and a short link was given with each figure in book1.

There were several challenges with this setup:

  • It was difficult to maintain this setup and keep it up to date because manual intervention was needed in multiple places.
  • Because each .py file was executed as a whole, interactivity with the code was minimal.

Solution

After brainstorming, we decided to move from one .py file per figure to one self-contained .ipynb file per figure, which removed the need for a notebook per chapter. This also lets readers experiment with each notebook interactively or simply view the cached figures in it. In the final version, the notebooks are stored in folders named after chapter numbers and are integrated with the book without the need for the Firestore database.
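As a rough illustration of what a one-script-to-one-notebook conversion involves, the sketch below wraps a Python script in a minimal notebook dictionary using only the standard library. It is not the conversion code used in pyprobml: the `# %%` cell marker, the function name `script_to_notebook`, and the example script are all assumptions made for illustration.

```python
import json

CELL_MARKER = "# %%"  # assumed cell delimiter (a common editor convention)


def script_to_notebook(source):
    """Wrap a Python script in a minimal nbformat-4 notebook dictionary."""
    chunks, current = [], []
    for line in source.splitlines():
        if line.startswith(CELL_MARKER):
            # A marker closes the current chunk and starts a new one.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)

    cells = []
    for chunk in chunks:
        text = "\n".join(chunk).strip()
        if text:  # skip empty cells
            cells.append({
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": text,
            })
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Hypothetical two-cell script.
script = "# %%\nimport math\n# %%\nprint(math.pi)\n"
notebook = script_to_notebook(script)
notebook_json = json.dumps(notebook, indent=1)  # text that could be saved as .ipynb
```

Writing `notebook_json` to a file with an .ipynb extension would give a notebook with one code cell per marked section.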

Automatic code testing

Background

The code in any repository needs to stay up to date with the packages it depends on. pyprobml depends on many packages, so an automatic GitHub workflow that tests the code on every push is beneficial.

Solution

I implemented the majority of the workflow, which: i) runs and checks all notebooks on every push; ii) statically checks whether imported modules are installed or will be installed locally; iii) checks code quality with black; and iv) pushes useful statistics to other branches, e.g. auto_generated_figure. The following PRs made the major changes:

URL      Title
PR 703   Add workflow and jupyter-book setup
PR 764   Split the workflow. Enable static module checker.
PR 795   Combine and publish results from parallel workflows
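The static module check (item ii above) can be sketched with the standard library alone. The following is a minimal illustration of the idea, not the checker from the workflow: the function names and the example source string are hypothetical, and a real checker would also need to handle modules that get installed inside the notebook itself.

```python
import ast
import importlib.util


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 keeps absolute imports and skips relative ones.
            modules.add(node.module.split(".")[0])
    return modules


def missing_modules(source):
    """Return imported top-level modules that cannot be found locally."""
    return sorted(
        name for name in imported_modules(source)
        if importlib.util.find_spec(name) is None
    )


# Hypothetical notebook cell; the last import is deliberately nonexistent.
example = "import json\nfrom os import path\nimport hypothetical_pkg_not_installed\n"
missing = missing_modules(example)
```

Because the source is only parsed and never executed, a check like this is cheap to run over every notebook on each push.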

Figures for Gaussian processes chapter

Gaussian processes are related to my thesis topic, so I wrote code for and/or latexified several figures in the Gaussian processes chapter. Here is the list of figures generated by my code:

  • Figure 18.2 [PR 727] explains ARD in Gaussian processes: a) same lengthscale for both dimensions; b) lengthscale in one dimension dominates the other.

image

  • Figures 18.5 and 18.6 [PR 721, PR 722] illustrate combinations of GP kernels via multiplication and summation.

image

image

  • Figure 18.7 [PR 728, PR 750] illustrates noisy and noiseless Gaussian process regression: a) samples from prior; b) noiseless regression; c) noisy regression.

image

  • Figure 18.18 [PR 919] illustrates GP regression with a) maximum likelihood and b) fully Bayesian inference. I added a blackjax version of the code for this.

image

  • Figure 18.23 [PR 874] illustrates spectral mixture kernel fit with GPs. I latexified this figure.

image

  • Figure 18.26 [PR 871] illustrates deep kernel learning with Gaussian processes: a) fit with Matern kernel; b) fit with MLP + Matern kernel. I latexified this figure.

image
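To make the ideas behind these figures concrete, here is a small JAX sketch of an ARD squared-exponential kernel, kernel combination by summation and multiplication, and noiseless versus noisy GP regression. It is not the code from the listed PRs: all function names, lengthscales, and data points are assumptions chosen for illustration, and the "noiseless" case uses a small jitter of 1e-6 for numerical stability.

```python
from functools import partial

import jax.numpy as jnp
from jax.scipy.linalg import cho_solve, solve_triangular


def ard_rbf(x1, x2, lengthscales, variance=1.0):
    """Squared-exponential kernel with one lengthscale per input dimension (ARD)."""
    diff = (x1[:, None, :] - x2[None, :, :]) / lengthscales
    return variance * jnp.exp(-0.5 * jnp.sum(diff**2, axis=-1))


def sum_kernel(k1, k2):
    """Sum of two kernels is again a valid kernel."""
    return lambda a, b: k1(a, b) + k2(a, b)


def product_kernel(k1, k2):
    """Elementwise product of two kernels is again a valid kernel."""
    return lambda a, b: k1(a, b) * k2(a, b)


def gp_posterior(kernel, x_train, y_train, x_test, noise_var):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    k_tt = kernel(x_train, x_train) + noise_var * jnp.eye(x_train.shape[0])
    k_ts = kernel(x_train, x_test)
    chol = jnp.linalg.cholesky(k_tt)
    alpha = cho_solve((chol, True), y_train)
    v = solve_triangular(chol, k_ts, lower=True)
    return k_ts.T @ alpha, kernel(x_test, x_test) - v.T @ v


# ARD: two points that differ only in the second input dimension.
a = jnp.array([[0.0, 0.0]])
b = jnp.array([[0.0, 5.0]])
k_iso = ard_rbf(a, b, jnp.array([1.0, 1.0]))      # both dimensions matter
k_ard = ard_rbf(a, b, jnp.array([1.0, 1000.0]))   # second dimension is ignored

# Combining a short- and a long-lengthscale kernel.
short = partial(ard_rbf, lengthscales=jnp.array([0.5]))
long = partial(ard_rbf, lengthscales=jnp.array([3.0]))
k_sum = sum_kernel(short, long)
k_prod = product_kernel(short, long)

# Noiseless (jitter only) versus noisy regression on three toy points.
x_train = jnp.array([[-2.0], [0.0], [2.0]])
y_train = jnp.sin(x_train[:, 0])
kernel = partial(ard_rbf, lengthscales=jnp.array([1.0]))
mean_free, cov_free = gp_posterior(kernel, x_train, y_train, x_train, noise_var=1e-6)
mean_noisy, cov_noisy = gp_posterior(kernel, x_train, y_train, x_train, noise_var=0.1)
```

With a very large lengthscale in one dimension the kernel value stays close to 1 regardless of the distance in that dimension, which is the ARD effect. In the noiseless case the posterior mean passes through the training targets, while with observation noise some posterior variance remains at the training inputs.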

BIJAX: Bayesian Inference in JAX

We initially planned to write a generic piece of code to generate multiple Bayesian-inference-related figures in book2. However, realizing the power of bijectors (transformation functions) in recent methods like ADVI, and the possibility of applications for a wider community, we are now developing BIJAX as a full library. BIJAX supports MAP, ADVI and Laplace approximation. We are currently developing SteinVI and other methods in BIJAX.

The following figures were developed with BIJAX:
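To illustrate why bijectors matter for methods like ADVI, the sketch below maps a positive parameter to the unconstrained real line and corrects the density with the log-determinant of the Jacobian. This is a generic change-of-variables example and does not use or reflect the BIJAX API: the function names and the Exponential(1) target are assumptions chosen for illustration.

```python
import jax
import jax.numpy as jnp


def exp_forward(z):
    """Bijector from the unconstrained real line to the positive reals."""
    return jnp.exp(z)


def exp_log_det_jacobian(z):
    """log |d exp(z) / dz| = z."""
    return z


def log_density_unconstrained(log_density_positive, z):
    """Log density of z = log(theta), obtained by change of variables."""
    return log_density_positive(exp_forward(z)) + exp_log_det_jacobian(z)


def exponential_log_density(theta):
    """Illustrative target: theta ~ Exponential(1), so log p(theta) = -theta."""
    return -theta


# The analytic log-det term agrees with the derivative from automatic differentiation.
z0 = 0.3
autodiff_log_det = jnp.log(jax.grad(exp_forward)(z0))

# The transformed density still integrates to about 1 (simple Riemann sum).
grid = jnp.linspace(-20.0, 5.0, 5001)
dz = grid[1] - grid[0]
total_mass = jnp.sum(jnp.exp(log_density_unconstrained(exponential_log_density, grid))) * dz
```

Working in the unconstrained space lets an optimizer or a Gaussian variational family operate freely on `z`, while the Jacobian term keeps the density over `z` properly normalized.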

  • Gaussian Mixture Model with ADVI [PR 1084]: a) MAP; b-d) posterior samples.

image

  • Estimating mean of 2d Gaussian distribution with ADVI [PR 1081]

image

Co-authored section 34.7 - Active learning

We work on active learning in our lab, and I had previously published an interactive article on active learning at VizxAI. I therefore helped Dr. Murphy write the section on active learning in book2. I also mentored two non-GSoC students (Nitish and Ankita) from our lab at IIT Gandhinagar, who generated all the figures in this section.

image

JAX: an ML library in sync with pure Python

During GSoC, I was introduced to JAX for the first time. After some exploration, I found JAX to be a library that is both quick to prototype with and computationally fast. Tools like vmap, jit, tree_util and flatten_util were essential to the code I wrote during GSoC. I summarized the JAX tricks we learned during the GSoC period here.
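For readers new to these tools, here is a minimal sketch of how vmap, jit, tree_util and flatten_util fit together. The linear model, parameter values, and variable names are hypothetical and chosen only to keep the example short.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

# Hypothetical parameters of a tiny linear model, stored as a pytree (a dict).
params = {"w": jnp.array([1.0, 2.0]), "b": jnp.array(0.5)}


def predict(params, x):
    """Prediction for a single input vector x."""
    return jnp.dot(params["w"], x) + params["b"]


# vmap maps predict over the leading axis of x; jit compiles the batched function.
batched_predict = jax.jit(jax.vmap(predict, in_axes=(None, 0)))
xs = jnp.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
preds = batched_predict(params, xs)

# tree_util applies a function to every leaf of the pytree.
doubled = jax.tree_util.tree_map(lambda leaf: 2 * leaf, params)

# flatten_util turns the pytree into one flat vector and back again.
flat, unravel = ravel_pytree(params)
restored = unravel(flat)
```

The flat vector is handy when an optimizer or sampler expects a single array, and `unravel` restores the original dictionary structure afterwards.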

Summary of PRs

The following table contains the list of PRs created during the GSoC period, sorted by time.

URL      Title
PR 690   Update beta_binom_post_plot.py
PR 703   Add workflow and jupyter-book setup
PR 704   Added discrete_prob_dist_plot.ipynb
PR 707   Do not auto generate scripts
PR 709   Rename environment variables
PR 719   Remove jupyter book
PR 721   add combining_kernels_by_multiplication notebook
PR 722   add combining_kernels_by_summation notebook
PR 727   Fig18.2 (Book2) add gprDemoArd.ipynb
PR 728   Fig18.7 (Book2) add gprDemoNoiseFreeAndNoisy.ipynb
PR 741   converted py files to notebooks.
PR 750   Update gprDemoNoiseFreeAndNoisy
PR 763   Mention Python version in README
PR 764   Split the workflow. Enable static module checker.
PR 779   pin Python 3.7.13 in workflow
PR 780   Copied d2l notebooks to pyprobml
PR 787   Move non-figure notebooks to pyprobml
PR 795   Combine and publish results from parallel workflows
PR 798   Updated README.md and CONTRIBUTING.md
PR 800   Copy probml-notebooks/notebooks to notebooks/misc
PR 805   Push updated notebooks to misc
PR 824   Add book2 notebooks
PR 834   GitHub to colab
PR 846   Add autogenerated contributors’ list
PR 871   Refactor deep kernel learning figure
PR 874   Fix workflow and latexify spectral mixture gp figures
PR 919   Latexified and blackjaxified figure 17.18 gp_kernel_opt.ipynb
PR 1046  AL notebooks added
PR 1081  added gaussian 2d ADVI
PR 1084  Added GMM ADVI notebook
PR 1099  Latexify vib_demo

Summary

During this GSoC, I worked on a large project that is useful to the academic community around the world, with a wonderful mentor, Dr. Kevin Murphy. At the same time, I learned JAX, which has become my default ML library, and gained confidence in my ability to work on such large projects. I will always be grateful to Dr. Kevin Murphy for this opportunity, which added a unique and valuable experience to my PhD journey.