1_Introduction.tex

\section{Introduction}
\label{sec:intro}
%\subsection{What Is the Problem}
With rapid developments in neural engineering, researchers are approaching the aims of understanding brain functions and building brain-like machines using this knowledge~\citep{furber2007neural}.
As a fast growing field, neuromorphic engineering has provided biologically-inspired sensors such as DVS~(Dynamic Vision Sensor) silicon retinas~\citep{serrano2013128, delbruck2008frame, yang2015dynamic, posch2014retinomorphic}, which are good examples of low-cost visual processing thanks to their event-driven and redundancy-reducing style of computation.
Moreover, SNN simulation tools~\citep{davison2008pynn, gewaltig2007nest, goodman2008brian} and neuromorphic hardware platforms~\citep{furber2014spinnaker,  schemmel2010wafer, merolla2014million} have been developed to allow exploration of the brain by mimicking its functions and developing large-scale practical applications~\citep{eliasmith2012large}.
Particularly for visual processing, the central visual system consists of several cortical areas which are placed in a hierarchical pattern according to anatomical experiments~\citep{felleman1991distributed}.
Fast object recognition takes place in  the feed-forward hierarchy of the ventral pathway, one of the two central visual pathways, which mainly handles the ``What'' tasks.
Experiments have revealed that the information is unfolded along the ventral stream to the  IT (Inferior Temporal) cortex~\citep{dicarlo2012does}.
Inspired by the  explicit  biological study of the central visual pathway, SNNs models have successfully been adapted to computer vision tasks. 
%become an active area of computer vision thanks to the
%There are two basic streams locating in the visual area: a dorsal and a ventral pathway.
%They differ in behavioural patterns according to the observation from brain lesions~\citep{prado2005two}, and also in functions where the ventral (`perception') stream perceives the world by means of object recognition and memory, while the dorsal (`action') stream provides real-time visual guidance for motor actions such as eye movements and grasping objects~\citep{goodale1992separate}. 

\cite{riesenhuber1999hierarchical} proposed a quantitative modelling framework of object recognition with position-, scale- and view-invariance based on the units of MAX-like operations.
The cortical-like model has been analysed on several datasets~\citep{serre2007robust}.
And recently~\cite{fu2012spiking} reported that their SNN implementation of the framework was capable of facial expression recognition with a classification accuracy (CA) of 97.35\% on the JAFFE dataset~\citep{lyons1998coding} which contains 213 images of 7 facial expressions posed by 10 individuals.
% 97.35\% on JAFFE dataset.
They employed simple integrate-and-fire neurons with rank order coding (ROC) where  the earliest pre-synaptic spikes have the strongest impact on the post synaptic potentials.
According to~\cite{vanrullen2002surfing}, the first wave of spikes  carry explicit information through the ventral stream and in each stage meaningful information is extracted and spikes are regenerated. 
Using one spike per neuron,~\cite{delorme2001face} reported 100\% and 97.5\% accuracies on the face identification task over changing  contrast and luminance training (40 individuals $\times$ 8 images) and testing data (40 individuals $\times$ 2 images) respectively.
%These developments yielded a large number of papers on SNNs based recognition, with a majority reporting outstanding recognition resulton limited-size databases.

The Convolutional Neural Network (CNN), also known as the \textit{ConvNet} developed by~\cite{lecun1998gradient}, is a well applied model of such a cortex-like framework.
%Reported results:
%Hand Gestures, Qian Liu
An early Convolutional Spiking Neural Network (CSNN) model identified faces of 35 persons with a CA of 98.3\% exploiting simple integrate and fire neurons~\citep{matsugu2002convolutional}.
Another CSNN model~\citep{zhao2014feedforward} was trained and tested both with DVS raw data and Leaky Integrate-and-Fire (LIF) neurons.
%The MAX operation, training and the switch are not only neuron involved.
It was capable of recognising three moving postures with a CA of about 99.48\% and 88.14\% on the MNIST-DVS dataset (see Chapter~\ref{sec:data}).
As one step forward,~\cite{camunas2012event} implemented a convolution processor module in hardware which could be combined with a DVS for high-speed recognition tasks.
The inputs of the ConvNet were continuous spike events instead of static images or frame-based videos. 
The chip detected four suits of a 52 card deck while the cards were fast browsed in only 410 ms.
Similarly, a real-time gesture recognition model~\citep{liu2014real} was implemented on a neuromorphic system with a DVS as a front-end and a SpiNNaker~\citep{furber2014spinnaker} machine as the back-end where LIF neurons built up the ConvNet configured with biological parameters.
In this study's largest configuration, a network of 74,210 neurons and 15,216,512 synapses used 290 SpiNNaker cores in parallel and reached 93.0\% accuracy. 

Deep Neural Networks (DNNs) together with deep learning are the most exciting research fields in vision recognition.
The spiking deep network has great potential to combine remarkable performance with the energy efficient training and running.
In the initial stage of the research, the study was focused on converting off-line trained deep network to SNNs~\citep{o2013real}.
The same network initially implemented on a FPGA achieved a CA of 92.0\%~\citep{neil2014minitaur}, while a later implementation on SpiNNaker scored 95.0\%~\citep{Stromatias2015scalable}.
%The performance was increased from 92.0\% to 95.0\%~\citep{Stromatias2015scalable} by implementing the model on SpiNNaker instead of the earlier FPGA version.
Recent attempts have contributed to better translation by utilising modified units in a ConvNet~\citep{cao2015spiking} and tuning the weights and thresholds~\citep{Diehl2015fast}).
The later paper claims a state-of-the-art performance (99.1\% on the MNIST dataset) comparing to original ConvNet.
The current trend of training Spiking DNNs on-line using biologically-plausible learning methods is also promising.
An event driven Contrastive Divergence (CD) training algorithm for RBMs (Restricted Boltzmann Machines) was proposed for Deep Belief Networks (DBN) using LIF neurons with STDP (Spike-Timing-Dependent Plasticity) synapses and verified on MNIST (91.9\%)~\citep{neftci2013event}.

STDP as a biological learning process is applied to vision tasks.
\cite{bichler2012extraction} demonstrated an unsupervised STDP learning model to classify car trajectories captured with a DVS retina. 
A similar model was tested on a Poissonian spike presentation of the MNIST dataset achieving a performance of 95.0\%~\citep{diehl2015unsupervised}.
Theoretical analysis~\citep{nessler2013bayesian} showed that unsupervised STDP was able to approximate a stochastic version of Expectation Maximization, a powerful learning algorithm in machine learning.
The computer simulation achieved~93.3\% CA on MNIST and could be implemented in a memrisitve device~\citep{bill2014compound}. 

Despite the promising research on SNN-based vision recognition, there is no commonly used database in the format of spikes.
In the studies listed above, all the vision data used are in one of the following formats:
(1) the grey-scale raw values of images;
(2) rate-based spike trains according to pixel intensities created by various Poissonian generators;
(3) unpublished DVS recorded spike-based videos.
As a consequence, a new series of spike-based vision datasets is now needed to quantitatively measure progress within this rapidly advancing field and to provide fair competition resources for researchers.
Apart from using spikes instead of the frame-based data of conventional computer vision, there are new concerns of evaluating neuromorphic vision in tasks other than recognition accuracy.
Therefore a common metric of performance evaluation on spike-based vision is also required to specify the measurements of algorithms and models. 
Different assessments should be taken into consideration when implementing models on neuromorphic hardware, especially the trade-offs between simulation time, precision and power consumption.
Thus benchmarking neuromorphic hardware with various network models will reveal the advantages and disadvantages of different platforms.
In this paper we propose a large dataset of spike-based visual stimuli, NE, and its complementary evaluation methodology.
The dataset expands and evolves as research develops and new problems are introduced.

In Section~\ref{sec:Related}, some example datasets of conventional non-spiking computer vision are introduced.
Section~\ref{sec:guide} defines the purpose and protocols of the proposed dataset.
The sub-datasets and their generation methods are described in detail in Section~\ref{sec:data}.
In accordance with the dataset, its evaluation methodology is demonstrated in Section~\ref{sec:eval}.
Moreover, two SNN models are provided as examples of benchmarking hardware platforms in Section~\ref{sec:test}.
Section~\ref{sec:summ} summarises the paper and discusses future work.