Google Summer of Code 2017 Ideas

Alexej Gossmann edited this page Apr 1, 2017 · 22 revisions

Ideas for Google Summer of Code 2017.

Contact

Feel free to reach us by joining #sciruby on chat.freenode.net or via our mailing list.

IMPORTANT NOTICE: SciRuby encourages diversity. Scientific progress in general benefits from diversity and software development for science is no exception. We are really happy that the number of people from Asia, Africa and South America applying for GSoC projects is increasing. Our org admin this year is from India, our previous org admin was from Brazil. We have had students from Japan, India, Sri Lanka, Russia, etc. We have women software developers in our programme. We are happy to hear from you all!

Instructions for students

We strongly recommend that you pick one of the ideas listed below. We value contributions in advance of GSoC, even if they're just little ones. Go pick out something in one of our trackers and work on it, talk to folks on the listserv, and get an idea for what features are needed.

You don't need to know a lot about Ruby to work on a project: depending on how much you already know, it'll be pretty easy to learn enough to be able to contribute. However, you may need some familiarity with scientific computation. If you don't have any, take a look at "Numerical Recipes in C", which you'll probably find in your university's library.

In any case, if you feel your skills aren't enough for some project, please ask us on our IRC channel (see contact section above) or our Google Group (see sciruby.com to sign up) and we can help you.

See also:

Read this before you commit your first patches

The main SciRuby landing page on GitHub mostly holds the stable versions of the SciRuby gems, but developers and contributors should work on the very latest (bleeding-edge) repositories so that changes can be committed without conflicts.

Try reading Finding The SciRuby Development Repositories on Github if you would like a brief introduction on finding the latest development gems to work on from Github. Also go through the coding guidelines before sending your first patch.

How to submit a patch ("pull request")

Here's a great tutorial: http://www.thinkful.com/learn/github-pull-request-tutorial/

Have a look and feel free to ask if you have any questions.

Instructions for mentors

Guidelines for mentors to submit projects:

  • Specify the name of your project as a heading.
  • Write a paragraph or two with further details.
  • Write a small 'Skills' section detailing the skills that the student must possess to complete the project.
  • Write down your own GitHub handle and contact details in a 'Mentor Details' section so that the student can contact you.
  • If anyone else wants to co-mentor a project, please specify your details along with the mentor's details.

Project Ideas

NMatrix projects

NMatrix is SciRuby's numerical matrix core, implementing dense matrices as well as two types of sparse (linked-list-based and Yale/CSR). NMatrix is a fairly well-established project which has received Summer-of-Code-like grants from both Brighter Planet and the Ruby Association (in other words, from Matz, who created Ruby). Those who contribute to NMatrix will likely eventually become authors of a jointly-published peer-reviewed science article on the library. Additionally, NMatrix is a good place to gain practical C and C++ experience, while also working to improve Ruby.

NMatrix currently relies on ATLAS/CBLAS/CLAPACK and standard LAPACK for several of its linear algebra operations. In some cases, native versions of the functions are implemented, so that the libraries are not required. There are quite a number of areas for growth in terms of the capabilities of NMatrix here.

Speed up element-wise operations in NMatrix

  • Mentors: John Woods (@mohawkjohn)
  • Per this discussion, constraints of the Ruby language currently slow down element-wise addition and subtraction for NMatrix objects. There are possibly some work-arounds, described in that email thread. A successful proposal would involve some preliminary research and design work on how to speed up element-wise operations.
  • Recommended skills: Some C/C++ would be beneficial, as you'll need to be working under the hood on NMatrix.

Creating the fastest math libraries for Ruby by using the GPU through OpenCL and ArrayFire.

  • Mentors: Pjotr Prins (@pjotrp), Alexej (@agisga - confirm), Charles Nutter (@headius - confirm)
  • Almost all laptops have a GPU that can be used to speed up mathematical computations. ArrayFire is an abstraction of GPU computing that can work on all platforms. We have a prototype that covers some matrix computations. This GSoC we would like to mature that into a library for SciRuby that can automatically and transparently offload computations to the GPU for a number of SciRuby libraries. Ideally, support is included for both MRI and JRuby.
  • ArrayFire-rb would be a wrapper around ArrayFire. The architecture is heavily inspired by NMatrix and NArray.
  • The JRuby implementation would require ArrayFire-Java, which uses JNI (Java Native Interface). ArrayFire-Java doesn't have complete support for BLAS level-3 and LAPACK routines yet.
  • Benchmarks: This repo can be used to benchmark ArrayFire functionalities in Ruby. ArrayFire-rb is around 1e3 to 1e7 times faster than NMatrix.
  • Recommended skills: this is an advanced and highly visible project. The student should be comfortable with both Ruby and GPU computing.

Tensor methods for NMatrix

  • Mentors: Alexej Gossmann (@agisga)
  • This project involves either adding tensor methods to NMatrix or building a separate gem (NTensor or something). There seems to be a great lack of tensor libraries available to the scientific community, apart from a couple Matlab toolboxes (but not everyone has access to Matlab, or has any desire to use it). E.g., see this article: https://www.oreilly.com/ideas/lets-build-open-source-tensor-libraries-for-data-science.
  • A tensor is essentially a multidimensional array, which NMatrix already has some support for. This project would involve:
    1. Implementation of basic tensor operations, such as outer product, n-mode product, slicing, matricization, etc.
    2. Implementation of tensor decompositions, such as CANDECOMP/PARAFAC, Tucker, higher-order SVD, etc.
    3. Implementation of further tensor methods used in machine learning / data science, such as tensor regression.
  • Skills: the student should have a solid understanding of Kolda & Bader, "Tensor Decompositions and Applications".
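
To give a feel for the "basic tensor operations" in item 1, here is a minimal pure-Ruby sketch of the outer product and of mode-n matricization (unfolding) on nested arrays. It does not use NMatrix; the helper names `shape`, `each_index`, `outer_product` and `unfold` are made up for illustration, modes are zero-based, and the column ordering is assumed to follow the convention in Kolda & Bader (earlier modes vary fastest). A real implementation would work on NMatrix storage rather than nested Ruby arrays.

```ruby
# Shape of a rectangular nested array, e.g. [[1, 2], [3, 4]] gives [2, 2].
def shape(tensor)
  dims = []
  while tensor.is_a?(Array)
    dims << tensor.size
    tensor = tensor.first
  end
  dims
end

# Yields every index tuple of a tensor with the given dimensions.
def each_index(dims)
  ranges = dims.map { |d| (0...d).to_a }
  ranges.first.product(*ranges.drop(1)).each { |idx| yield idx }
end

# Outer product of two vectors: result[i][j] = a[i] * b[j].
def outer_product(a, b)
  a.map { |x| b.map { |y| x * y } }
end

# Mode-n unfolding (zero-based mode): rows are indexed by the chosen mode,
# columns by all remaining modes, with earlier modes varying fastest.
def unfold(tensor, mode)
  dims = shape(tensor)
  other = (0...dims.size).reject { |k| k == mode }

  # Stride of each remaining mode within the column index.
  strides = {}
  stride = 1
  other.each do |k|
    strides[k] = stride
    stride *= dims[k]
  end

  result = Array.new(dims[mode]) { Array.new(stride) }
  each_index(dims) do |idx|
    col = other.sum { |k| idx[k] * strides[k] }
    result[idx[mode]][col] = tensor.dig(*idx)
  end
  result
end

# A 2x2x2 tensor whose entries are numbered in column-major order.
t = Array.new(2) { |i| Array.new(2) { |j| Array.new(2) { |k| i + 2 * j + 4 * k + 1 } } }
p unfold(t, 0) # prints [[1, 3, 5, 7], [2, 4, 6, 8]]
```

The n-mode product of a tensor with a matrix can then be expressed as an ordinary matrix product with the unfolded tensor, which is one reason the unfolding is a natural first building block.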

Visualization Projects

Ruby Matplotlib

There are several data visualization packages for Ruby: Nyaplot, Ruby-gnuplot, Rubyvis, etc. However, none of these is as straightforward to use as Matplotlib and Pyplot in Python, nor as simple as the plotting routines in Matlab/Octave.

Ruby needs a native plotting capability. In the past, we've briefly explored extending Matplotlib to work directly in Ruby (see discussion, including link therein), but found that it's too tightly integrated with Python. We propose the creation of an abstracted Matplotlib, designed to be available for any language, but with the facility of the Matlab/Matplotlib APIs. The first available API would be in Ruby.

To be clear, this is a larger project than can be accomplished in one Google Summer of Code. SciRuby is prepared to allocate funds for ongoing support — either for the student who is accepted for this project idea, or for another. We are also prepared to provide support for applications for Ruby Association grants subsequent to GSOC.

  • Mentors: John Woods (@mohawkjohn)
  • Languages: Comfort with C/C++ and Ruby highly recommended

Data and Statistics Projects

A colossal amount of data is being generated every minute, and having good tools to analyse this data has become an essential feature of any modern language. These projects deal with making Ruby a viable language for data analysis and statistics tasks.

If you choose to contribute to any of these, you will be exposed to the inner workings of some very useful tools. In addition to this, most of the Ruby community as of now is still waking up to the endless possibilities that Ruby might hold for data analysis, so you will feel an immediate impact of any work that you do in this field. Ruby conferences around the world are opening up to host talks about data tools in Ruby, which will give you a great platform to showcase your work and derive long term career benefits from it.

Following are the ideas under this domain:

Paratext Ruby wrapper

Paratext is a super fast library for reading CSV files. Currently it has wrappers for Python, and a Ruby wrapper would be valuable for use in NMatrix and daru. The wrapper must interface directly with the NMatrix C API and should not waste much time converting data from C to Ruby and back. You will need to understand both the CRuby C API and the NMatrix C API and figure out a way to make them work together. The Python wrappers can be used as a reference. This thread provides further information.

This is not a full summer's worth of work and should ideally be coupled with some other project.

Skills: Understanding of Ruby and C APIs | Understanding of the CRuby C API.

Difficulty: Intermediate

Mentor: @v0dro

Rewrite slow parts of daru in Rubex

Daru is a DataFrame library for Ruby. While it has many methods for data wrangling, it is slow for a lot of use cases (check out these benchmarks). This task will involve figuring out the slow areas of daru and porting them to Rubex, which is a language for writing C extensions for Ruby.

Rubex is still far from complete, and you may need to dabble in compilers and the Ruby Garbage Collector for adding features to Rubex that are necessary for completing this task. You will also need to benchmark various daru methods and prove that porting them to Rubex will significantly impact performance.
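
As a rough illustration of the benchmarking step, the sketch below times a few candidate operations with Ruby's standard `Benchmark` module and ranks them slowest first. The `profile` helper is hypothetical, and the workloads are plain-Array stand-ins rather than real daru methods; in the actual project the lambdas would call the daru methods under investigation.

```ruby
require 'benchmark'

# Times each labelled lambda over `iterations` runs and returns
# a hash of label => total seconds, sorted slowest first.
def profile(candidates, iterations = 100)
  timings = candidates.map do |label, work|
    [label, Benchmark.realtime { iterations.times { work.call } }]
  end
  timings.sort_by { |_, seconds| -seconds }.to_h
end

# Stand-in data and workloads (not daru code).
data = Array.new(10_000) { |i| (i * 7919) % 1000 }
candidates = {
  'sort' => -> { data.sort },
  'group_by' => -> { data.group_by { |x| x % 10 } },
  'sum' => -> { data.sum }
}

results = profile(candidates)
results.each { |label, seconds| puts format('%-10s %.4fs', label, seconds) }
```

Running the same harness before and after a method is ported gives a like-for-like comparison, provided the input data and iteration count are kept fixed.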

Skills: Experience in data analysis | Experience in Ruby and C | General understanding of how compilers work | Understanding of good benchmarking practices

Difficulty: Advanced

Mentor: @v0dro

Make Daru more ready for integration with modern Web frameworks

Part 1: Import

Goal: allow easy creation of dataframes from "real life" data.

Deliverables:

  1. micro-framework/base class for easy definition of new importers
  2. importers at least for the following sources (a matter to discuss):
    • ActiveRecord
    • Sequel
    • JSON
    • Redis
  3. each importer should be an independent class (probably a subclass of some BaseImporter); its inclusion in the project should allow syntax like Daru::{DataFrame,Vector}.from_{data_source}(data_source, **options)
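
One possible shape for such a micro-framework is sketched below. Everything here is an assumption for illustration: `SketchFrame` stands in for Daru::DataFrame so the example runs without daru, and the `source` class macro, the `JSONImporter` class and the `only:` option are hypothetical names, not an existing daru API.

```ruby
require 'json'

# Stand-in for Daru::DataFrame so the sketch runs without daru installed.
class SketchFrame
  attr_reader :columns

  def initialize(columns)
    @columns = columns # hash of column name => array of values
  end
end

# Hypothetical base class: subclasses implement #call and declare a source name.
class BaseImporter
  # Defines SketchFrame.from_<name>(data_source, **options) for the subclass.
  def self.source(name)
    importer = self
    SketchFrame.define_singleton_method("from_#{name}") do |data_source, **options|
      importer.new(**options).call(data_source)
    end
  end

  def initialize(**options)
    @options = options
  end

  def call(_data_source)
    raise NotImplementedError, "#{self.class} must implement #call"
  end
end

# Imports a JSON array of objects, e.g. '[{"a": 1, "b": 2}]'.
class JSONImporter < BaseImporter
  source :json

  def call(json_string)
    rows = JSON.parse(json_string)
    names = rows.flat_map(&:keys).uniq
    only = @options[:only] # optional list of column names to keep
    names &= only.map(&:to_s) if only
    SketchFrame.new(names.map { |name| [name, rows.map { |row| row[name] }] }.to_h)
  end
end

frame = SketchFrame.from_json('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]', only: [:a])
p frame.columns['a'] # prints [1, 3]
```

With this layout, adding a new source means writing one subclass with a `source` declaration, and loading that file is what makes the corresponding `from_...` constructor available.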

Part 2: Export

Goal: Symmetrically, we need to be able to export Daru data to "real" environments.

Deliverables:

  1. micro-framework/base class for easy definition of new exporters
  2. exporters at least for:
    • ActiveRecord (bulk-inserting postprocessed CSV data into a database table is a great showcase)
    • JSON (for API rendering, including pagination support, e.g. data_frame.to_json(page: 100, per_page: 10))
    • xlsx, probably
  3. The exporter interface should be symmetrical to the importer interface
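
The paginated JSON export mentioned above could look roughly like the following sketch. The `JSONExporter` class, the one-based `page` convention and the shape of the returned payload (`page`, `per_page`, `total`, `data`) are all assumptions for illustration; rows are plain hashes standing in for dataframe rows.

```ruby
require 'json'

# Hypothetical exporter: serialises one page of row hashes to JSON.
class JSONExporter
  def initialize(rows)
    @rows = rows # array of hashes, one per row
  end

  # `page` is one-based; pages past the end yield an empty "data" array.
  def call(page: 1, per_page: 10)
    raise ArgumentError, 'page and per_page must be positive' if page < 1 || per_page < 1

    offset = (page - 1) * per_page
    payload = {
      'page' => page,
      'per_page' => per_page,
      'total' => @rows.size,
      'data' => @rows[offset, per_page] || []
    }
    JSON.generate(payload)
  end
end

rows = (1..25).map { |i| { 'id' => i } }
puts JSONExporter.new(rows).call(page: 3, per_page: 10)
```

Including the total row count in the payload lets an API client work out how many pages exist without a second request.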

Part 3: Presentation

Goal: integrate the representation of dataframes/vectors into web views, independently of the view library.

Deliverables:

  • daru(dataframe, **options) helper, which can be called from any view templating/layouting system and returns the dataframe formatted as HTML:
    • it should be more sophisticated than the current DataFrame#to_html;
    • it should use some modern JS table component, allowing filtering, sorting and pagination;
    • it should be easy to restyle the dataframe (e.g. proper HTML classes should be used while rendering) and to make it dynamic (set up endpoints/sources for "next page", for example)
  • daru_chart(dataframe, **options) -- helper to output various simple charts in a view template; requirements are the same (stylable, customizable, dynamic)
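
A framework-agnostic helper can be as simple as a function that returns an HTML string. The sketch below is a hypothetical `daru_table` helper that takes a hash of column name to values (standing in for a real dataframe) and a `css_class:` option for restyling; it covers only the static rendering part, not the JS table component or dynamic endpoints.

```ruby
# Minimal HTML escaping so cell values cannot inject markup.
HTML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&#39;'
}.freeze

def escape_html(value)
  value.to_s.gsub(/[&<>"']/, HTML_ESCAPES)
end

# Hypothetical view helper: `columns` maps column names to arrays of values.
# `css_class` lets the caller hook in their own styling.
def daru_table(columns, css_class: 'daru-table')
  header = columns.keys.map { |name| "<th>#{escape_html(name)}</th>" }.join
  row_count = columns.values.map(&:size).max || 0
  body = (0...row_count).map do |i|
    cells = columns.values.map { |values| "<td>#{escape_html(values[i])}</td>" }.join
    "<tr>#{cells}</tr>"
  end.join
  %(<table class="#{escape_html(css_class)}"><thead><tr>#{header}</tr></thead><tbody>#{body}</tbody></table>)
end

sample = { 'name' => ['Ada', '<Bob>'], 'score' => [90, 85] }
puts daru_table(sample, css_class: 'daru-table striped')
```

Because the helper only builds a string, it can be called from ERB, Haml, Slim or any other templating system without depending on a particular framework.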

Difficulty: Intermediate

Skills: Pragmatic experience with current state of web development, including Rails & front-end

Mentor: @zverok (with @v0dro and @lokesh being co-mentors)

Software integration

Create a reproducible deployment system for Ruby/JRuby

GNU Guix is a viable alternative to RVM, rbenv, brew and bundler. Unlike the other tools, Guix has full reproducibility built in, all the way down to glibc, and does away with dependency hell. At this point GNU Guix supports four versions of MRI (1.8.7, 2.1.10, 2.2.6 and 2.3.3) and hundreds of Ruby gems, and builds of Ruby on Rails exist; see https://www.gnu.org/software/guix/packages/r.html. GNU Guix also has facilities for binary installs of Guix packages without administrator privileges, which, together with IRuby notebook support, makes it an ideal environment for education and software development on shared systems.

The project idea for GSoC is to build up on this tooling to add support for the SciRuby gems with their dependencies (such as gsl and openblas) and make it easy to deploy them and introduce default environments for Travis-CI using Guix binary packages.

This is a highly visible project in the wider Ruby community.

  • Recommended skills: interest in web development, software integration and deployment, compiler technology, and Ruby and LISP

Machine Learning Library for Ruby

  • It's high time for Ruby to have its own machine learning library with off-the-shelf algorithms, like scikit-learn.

Ruby is currently very popular for web systems and Infrastructure as Code, but not for data science tasks. We cannot carry out a typical data science workflow using only the tools provided in the Ruby ecosystem. This is why people, even Rubyists, choose Python and R.

Providing bridge libraries to major data science systems is one way to make Ruby usable for data science tasks. It can be realized faster than developing original tools such as nmatrix, daru, and original ML libraries, because numpy, pandas, and scikit-learn are already mature. Moreover, many companies already have data science systems built on Python or R even if their main products are built with Ruby. Bridge libraries allow us to keep using the existing systems.

I (@mrkn) believe bridge libraries are the best way to provide practical tools for data science to Rubyists at the present time. So I'm working on developing bridge libraries for the base language systems of Python, R, and Julia.

The following ideas should be built on top of these base bridge libraries:

Bridges for PyData libraries

PyData is the de facto standard system for data science tasks. There are a lot of libraries in the Python ecosystem, but the commonly used set is not so large. The following libraries are heavily used in most cases:

  • pandas
  • xarray (formerly xray)
  • matplotlib
  • seaborn
  • scikit-learn
  • gensim
  • nltk
  • Pillow
  • scikit-image
  • etc.

Skills: Understanding of Ruby and its extension library APIs, Python and its extension library APIs, FFI, experience with the library you select

Mentors: @mrkn

Bridges for R libraries

R is also a de facto standard programming language for data science tasks. Candidate libraries include:

  • dplyr
  • tidyr
  • vars
  • forecast
  • tseries
  • ggplot2
  • etc.

Skills: Understanding of Ruby and its extension library APIs, R and its extension library APIs, FFI, experience with the library you select

Mentors: @mrkn

Bridges for Julia libraries

Julia is a high-level dynamic programming language designed for scientific computing use cases. Julia programs are easy to read and can run at speeds comparable to C and Fortran. I believe Julia is the next de facto standard programming language for data science tasks, so supporting Julia now brings a big advantage for future use.

Skills: Understanding of Ruby and its extension library APIs, Julia and its FFI APIs, experience with the library you select

Mentors: @mrkn

IRuby improvements

Introducing magic commands like IPython's

TBD.

Space Projects

Overhaul SpiceRub and continue porting NASA CSPICE Routines

SpiceRub was a GSoC 2016 project which aimed to create a Ruby wrapper around NASA's important CSPICE Library. Since SPICE has a large number of routines and the C extensions had to be written from scratch, a subset of the available routines considered most important at the time was ported along with the basic C backbone leveraging NMatrix's C API. While SpiceRub can currently perform many ephemerides calculations via a clean Ruby interface, many of the functions are ported directly as per the SPICE API, which is not a very convenient API to have to use in Ruby.

  • Mentors: John Woods (@mohawkjohn)

  • Currently the project won't build on recent Ruby versions due to an incompatibility with the NMatrix C API. As we may expect future projects to leverage NMatrix's C API for performance gains, this long-standing bug should be resolved. Refer to this and this for more information.

  • Expanding on the Geometry Finder Subsystem routines towards a Ruby-like API. Reference.

  • Modelling SPICE datatypes (such as time windows) as Ruby objects to allow seamless conversion between the two.

  • Increasing the routine coverage of SPICE by writing more C extensions.

  • Setting up a convenient distribution/installation system (CSPICE is only available as a statically linkable library as of now)

  • Recommended skills: C and Ruby, with an emphasis on writing Ruby-C extensions. There is extensive documentation on the SPICE Website, as well as tutorials for concepts involved.
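
As an illustration of what a Ruby-side model of a SPICE datatype might look like, here is a hedged sketch of a time window represented as an ordered set of disjoint [start, stop] intervals. The `TimeWindow` class and its methods are hypothetical and are not part of SpiceRub or CSPICE; times are plain numbers here, and conversion to and from the underlying C representation is left out entirely.

```ruby
# Hypothetical Ruby model of a SPICE-style window: an ordered set of
# disjoint [start, stop] intervals (times as plain numbers in this sketch).
class TimeWindow
  include Enumerable

  attr_reader :intervals

  def initialize(intervals = [])
    @intervals = merge(intervals)
  end

  # Iterates over the merged [start, stop] pairs in ascending order.
  def each(&block)
    @intervals.each(&block)
  end

  # Returns a new window covering every time in either window.
  def union(other)
    TimeWindow.new(@intervals + other.intervals)
  end

  def include_time?(time)
    @intervals.any? { |start, stop| time >= start && time <= stop }
  end

  private

  # Sorts the intervals and merges any that overlap or touch.
  def merge(intervals)
    intervals.sort.each_with_object([]) do |(start, stop), merged|
      if merged.any? && start <= merged.last[1]
        merged.last[1] = [merged.last[1], stop].max
      else
        merged << [start, stop]
      end
    end
  end
end

window = TimeWindow.new([[5, 7], [1, 3], [2, 4]])
p window.intervals # prints [[1, 4], [5, 7]]
```

Including Enumerable means the window can be used with `map`, `select` and friends, which is the kind of Ruby-like interface the overhaul is aiming for.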