% $Id$

\documentclass[a4paper,twoside,11pt]{article}

\usepackage[round]{natbib}              % BiBTeX style
\usepackage{newlfont}                   % Special fonts
\usepackage{psfrag}                     % PSFrag (LaTeX style labels)
\usepackage{dcolumn}                    % Line up at decimal point in a table
\usepackage{graphicx}                   % Graphics stuff
\usepackage{color}                      % Colours
\usepackage{makeidx}                    % Enable making an index
\usepackage[small,all]{caption2}        % Captions: all=support subfigure,
%%                                        %           longtable and float
\usepackage[it]{subfigure}              % Subfigures
\usepackage{enumerate}                  % Enumerated lists
\usepackage{mathrsfs}                   % Serif fonts
%%\usepackage{longtable}                  % Stretch tables
%%\usepackage{lscape}                     % Landscape mode
\usepackage{afterpage}                  % Clear a page
%\usepackage[figuresright]{rotating}     % Rotate a figure
\usepackage{setspace}                   % Spacing tools

%%\graphicspath{{./fig/}
\usepackage{ulem}
\usepackage{bm}
\usepackage{url}
\usepackage{hyperref}
\topmargin -18mm
\headheight 14.5pt

\onehalfspacing        % One-and-a-half line spacing.  Use this for final
                        % submission.
%\doublespacing         % Double line spacing.

\newcommand{\fnurl}[1]{\footnote{\url{#1}}}

\begin{document}

\title{Array-valued Functions vs. Subroutines} % Title
\author{The Pencil Coders}          % Author
\date{\today,~ $ $Revision: 1.3 $ $}            % Date

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\maketitle
\label{firstpage}

%\begin{abstract}
%\end{abstract}
%%\pagenumbering{roman}
%%\tableofcontents

\pagenumbering{arabic}
\section{Results for wormhole.ncl.ac.uk}
wormhole is an Intel Core Duo (2\,GHz) machine with 2\,GB of RAM.  The Intel
Core processors are in fact Pentium III based chips but with many of the
technology improvements developed for the Pentium 4 series.  Most relevant
here are the
SSE\fnurl{http://en.wikipedia.org/wiki/SSE},
SSE2\fnurl{http://en.wikipedia.org/wiki/SSE2} and
SSE3\fnurl{http://en.wikipedia.org/wiki/SSE3} instructions.  SSE stands for
Streaming SIMD Extensions; SIMD\fnurl{http://en.wikipedia.org/wiki/SIMD}
(Single Instruction, Multiple Data) is the proper
name for vector instructions.  Compiler version:
\begin{verbatim}
Intel(R) Fortran Compiler for 32-bit applications, Version 9.0    Build
20051020Z Package ID: l_fc_c_9.0.028
Copyright (C) 1985-2005 Intel Corporation.  All rights reserved.
FOR NON-COMMERCIAL USE ONLY
\end{verbatim}

Standard test with nx=128, niter=1000000

\begin{tabular}{cccc}
   Compile opts     &       variant       &     absolute & relative [large is bad] \\
\hline
   -O3     &  Subroutine            &    1.3    &   1.0 \\
   -O3     &  Array-valued function &    2.2    &   1.7 \\
\hline
   -O2     &  Subroutine            &    1.3    &   1.0 \\
   -O2     &  Array-valued function &    2.2    &   1.7 \\
\hline
   -O0     &  Subroutine            &    6.1    &   1.0 \\
   -O0     &  Array-valued function &    17.    &   2.8 \\
\hline
\end{tabular}
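
The relative column appears to be the absolute figure divided by the
subroutine figure at the same optimisation level.  The symbols $t$ below are
introduced here only for illustration:
\[
  \mathrm{relative} = \frac{t_{\mathrm{variant}}}{t_{\mathrm{subroutine}}},
  \qquad \frac{2.2}{1.3} \approx 1.7,
  \qquad \frac{17}{6.1} \approx 2.8 .
\]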

Typical rather than averaged results are shown, since little or no variation
was observed across multiple ($> 10$) runs.

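In this specific example the array-valued function always loses out.
Comparing the -O0 and -O3 results, however, the optimizer seems to have
considerably more freedom to improve the array-valued function form: it gains
a factor of about 7.7 (17 to 2.2), against about 4.7 (6.1 to 1.3) for the
subroutine.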
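
For orientation, the two variants compared above might look roughly as
follows.  This is only a minimal sketch: the names scale\_sub and scale\_fn
and the operation (scalar times vector) are assumptions made for illustration,
not the code that was actually timed.

\begin{verbatim}
module variants
  implicit none
  integer, parameter :: nx = 128
contains
  ! Subroutine form: the result is written into a caller-supplied array.
  subroutine scale_sub(a, v, res)
    real, intent(in)  :: a
    real, intent(in)  :: v(nx)
    real, intent(out) :: res(nx)
    res = a*v
  end subroutine scale_sub

  ! Array-valued function form: the result is returned as a whole array.
  function scale_fn(a, v) result(res)
    real, intent(in) :: a
    real, intent(in) :: v(nx)
    real :: res(nx)
    res = a*v
  end function scale_fn
end module variants

program demo
  use variants
  implicit none
  real :: v(nx), w1(nx), w2(nx)
  v = 1.0
  call scale_sub(2.0, v, w1)   ! subroutine variant
  w2 = scale_fn(2.0, v)        ! array-valued function variant
  print *, all(w1 == w2)       ! both variants give the same values: T
end program demo
\end{verbatim}

In the function form the compiler may have to build the result in a temporary
array and then copy it into the target, which is one plausible source of the
extra cost at low optimisation levels.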

\section{Possible Issues with the present test}
\section{niter is a compile-time constant}
niter is a constant, so the optimizer may be able to optimise
the data handling (especially in the subroutine case) much more effectively
than is fair.  In our real code the looping is much more complex.  There are
nested loops, and the vector operations may be several steps down the calling
hierarchy from the loops (e.g.\ the pencil loop and the calc\_pencils\_XX and
dXX\_dt routines).
It seems far from obvious that the optimizer could do equally well there.
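
One way to test this concern is to make niter a run-time value, so that the
compiler cannot specialise the loop for a known trip count.  The following is
a minimal sketch under that assumption; the program name, the loop body and
reading niter from standard input are illustrative choices, not part of the
present test.

\begin{verbatim}
program runtime_niter
  implicit none
  integer, parameter :: nx = 128
  integer :: niter, i, ios
  real :: v(nx)

  ! Read niter at run time so that it is not a compile-time constant;
  ! fall back to the value used in the tests above if the read fails.
  read (*, *, iostat=ios) niter
  if (ios /= 0) niter = 1000000

  v = 1.0
  do i = 1, niter
    v = 0.999999*v          ! placeholder for the timed operation
  end do

  ! Using the result prevents the loop from being optimised away.
  print *, niter, sum(v)
end program runtime_niter
\end{verbatim}

It could be run as, for example:

\begin{verbatim}
echo 1000000 | ./a.out
\end{verbatim}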

\section{The function usage is unfair}
Using subroutine calls necessarily makes the operation in the test a two-step
operation.  However, it may be expressed as a one-liner in functional form.
I believe sub3() has precisely the same effect as sub1() and
sub2().  sub3(), however, expresses the problem in a way that is fair
to the functional form and impossible in the ``simple'' subroutine
form.\footnote{One could introduce a new subroutine which handles
scalar*scalar*vector, though (scalar*scalar) is intrinsically a
scalar, and as such a hybrid subroutine/functional form could be created.}
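
The difference between the two-step and one-line forms can be sketched as
follows.  The operation a*b*v, the names scale\_sub and scale\_fn, and the
three call patterns are assumptions for illustration; the actual sub1(),
sub2() and sub3() are not reproduced here.

\begin{verbatim}
module ops
  implicit none
  integer, parameter :: nx = 128
contains
  subroutine scale_sub(a, v, res)
    real, intent(in)  :: a
    real, intent(in)  :: v(nx)
    real, intent(out) :: res(nx)
    res = a*v
  end subroutine scale_sub

  function scale_fn(a, v) result(res)
    real, intent(in) :: a
    real, intent(in) :: v(nx)
    real :: res(nx)
    res = a*v
  end function scale_fn
end module ops

program forms
  use ops
  implicit none
  real :: a, b
  real :: v(nx), tmp(nx), w_sub(nx), w_nest(nx), w_one(nx)
  a = 2.0
  b = 3.0
  v = 1.0

  ! Subroutine form: two calls and an explicit temporary.
  call scale_sub(b, v, tmp)
  call scale_sub(a, tmp, w_sub)

  ! Functional form mirroring the two steps: nested calls.
  w_nest = scale_fn(a, scale_fn(b, v))

  ! Functional one-liner: the scalars are combined first.
  w_one = scale_fn(a*b, v)

  ! All three forms compute the same values here: T T
  print *, all(w_sub == w_nest), all(w_sub == w_one)
end program forms
\end{verbatim}

The one-liner touches the vector only once, whereas the other two forms pass
over it twice, so the three need not cost the same even though the results
agree.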

Running on wormhole.ncl.ac.uk (as above) with the ``fair'' functional form:

\begin{tabular}{cccc}
   Compile opts     &       variant       &     absolute & relative [large is bad] \\
\hline
   -O3     &  Subroutine            &    1.3    &   1.0 \\
   -O3     &  Array-valued fn       &    2.1    &   1.6 \\
   -O3     &  Fair Array-valued fn  &    1.3    &   0.98 \\
\hline
\end{tabular}

Clearly there are cases where one form or the other may be faster or
better optimised.  The present test does not seem to use the various
language constructs fairly and the results are highly dependent upon 
how the chosen operation is performed.  It is not therefore clear that
the results are transferable to a more complex situation.

Adding a hybrid variant suggests that the best approach is to use a
combination of the two forms:

\begin{tabular}{cccc}
   Compile opts     &       variant       &     absolute & relative [large is bad] \\
\hline
   -O3     &  Subroutine            &    1.3    &   1.0 \\
   -O3     &  Array-valued fn       &    2.1    &   1.6 \\
   -O3     &  Fair Array-valued fn  &    1.3    &   0.98 \\
   -O3     &  Hybrid (sub with int. vector op) &    0.83    &   0.63 \\
\hline
\end{tabular}
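
The text does not spell out what the hybrid variant does.  Two possible
readings, both assumptions, are a subroutine call whose scalar argument is the
product a*b (as in the footnote), or one whose vector argument is an intrinsic
array expression such as b*v.  A minimal sketch of both, using the
hypothetical subroutine scale\_sub:

\begin{verbatim}
module hybrid_ops
  implicit none
  integer, parameter :: nx = 128
contains
  subroutine scale_sub(a, v, res)
    real, intent(in)  :: a
    real, intent(in)  :: v(nx)
    real, intent(out) :: res(nx)
    res = a*v
  end subroutine scale_sub
end module hybrid_ops

program hybrid
  use hybrid_ops
  implicit none
  real :: a, b
  real :: v(nx), w1(nx), w2(nx)
  a = 2.0
  b = 3.0
  v = 1.0

  ! Reading 1: combine the scalars in an expression, then one call.
  call scale_sub(a*b, v, w1)

  ! Reading 2: pass an intrinsic array expression as the vector argument.
  call scale_sub(a, b*v, w2)

  print *, all(w1 == w2)   ! same values either way: T
end program hybrid
\end{verbatim}

Either way only one subroutine call is needed, which may be why this form
comes out ahead in the table above.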

\label{lastpage}

\end{document}
