Annual runoff prediction using a nearest-neighbour method based on cosine angle distance for similarity estimation

The Nearest Neighbour Method (NNM) is a data-driven, non-parametric scheme based on the similarity characteristics of hydrological phenomena. An important part of NNM is the choice of a proper distance measure. The Euclidean distance (EUD) is a commonly used distance measure; it represents the absolute distance between points in space and is directly related to the coordinates of the points, but it is not sensitive to the direction of the feature vector. This paper uses the cosine angle distance (CAD) as the similarity measure, which better reflects differences in direction, and compares it with EUD. The technique is applied to annual runoff at Yichang station on the Yangtze River. The results show that NNM with CAD performs better than NNM with EUD.


INTRODUCTION
Annual runoff is influenced by climate, land cover and human activities, and its prediction is quite complex because it has a longer lead time than daily or monthly prediction. The commonly used prediction methods for annual runoff can be divided into traditional and modern methods. The traditional methods are based on the variation characteristics of runoff, such as hydrological analysis and hydrological statistics (Chen et al., 1985; Chen, 1997; Kenea and Thian, 2009; Xu et al., 2010). The modern methods, which developed with computing technology, include artificial neural networks (Seckin et al., 2013), the fuzzy method (Zhu et al., 2009), the chaos method (Sivakumar, 2000), the grey method (Liu, 2009), etc., and have achieved some good results. However, most of them follow the prediction pattern "assumption-calibration-verification", which requires parameter calibration before prediction.
The Nearest-Neighbour Method (NNM) is data-driven and non-parametric, which gives it a potential advantage: it needs no assumption about the form of the dependence or the probability distribution, and no estimation of many parameters. Since Karlsson and Yakowitz (1987) used NNM for rainfall-runoff forecasting, its use for modelling hydrological processes and dynamics in rivers and streams has been well documented (e.g. Lall and Sharma, 1996; Yuan et al., 2000; Wang et al., 2001; Mehrotra and Sharma, 2006; Lee et al., 2011; Liu et al., 2012). An important part of NNM is the choice of a proper distance measure, as different distance measures may behave quite differently (Qian et al., 2004). Euclidean distance (EUD) is a commonly used distance measure; it represents the absolute distance between points in space and is directly related to the coordinates of the points. The cosine angle distance (CAD) is another popular distance measure, which is sensitive to the direction of the feature vector, but it has not been used for hydrological time series.
This paper uses CAD as the similarity measure in annual runoff prediction, with the annual runoff at Yichang station as a case study. Section 2 presents the nearest neighbour method for hydrological time series; Section 3 is a theoretical analysis of CAD for runoff; Section 4 applies NNM with CAD to the prediction of runoff at Yichang station on the Yangtze River; and the conclusions are summarized in Section 5.

NNM FOR HYDROLOGICAL TIME SERIES
Generally, hydrological phenomena are correlated through time. Thus, the succeeding runoff $X_t$ depends on the historical runoff $Q_{t-1}, Q_{t-2}, \ldots, Q_{t-P}$. The vector $D_t = (Q_{t-1}, Q_{t-2}, \ldots, Q_{t-P})$ is called the feature vector of the runoff series, and $X_t = (Q_t, Q_{t+1}, \ldots, Q_{t+m-1})$ $(t = P+1, P+2, \ldots, n-m+1)$ is defined as the succeeding value of $D_t$. Among the $D_t$ $(t = P+1, P+2, \ldots, n)$ constituted by the series $\{Q_t\}_n$, there must be some feature vectors that are nearest neighbours to the current feature vector $D_i$. Suppose the number of nearest-neighbour feature vectors is $K$ and they are denoted $D_{1(i)}, D_{2(i)}, \ldots, D_{K(i)}$; then $X_{1(i)}, X_{2(i)}, \ldots, X_{K(i)}$ are the succeeding values of the corresponding feature vectors. Nearness is judged by the difference between $D_i$ and $D_t$, which is usually calculated as the Euclidean distance:

$$r_{t(i)} = \sqrt{\sum_{j=1}^{P} \left(d_{ij} - d_{tj}\right)^2} \qquad (1)$$

where $r_{t(i)}$ represents the difference between $D_i$ and $D_t$, $d_{ij}$ and $d_{tj}$ are the $j$-th variables of $D_i$ and $D_t$ respectively, and $P$ is the dimension of the feature vector. Then $r_{j(i)}$ $(j = 1, 2, \ldots, K)$ denotes the difference between $D_{j(i)}$ and $D_i$, with $r_{1(i)} < r_{2(i)} < \cdots < r_{K(i)}$ (the index $j$ is ordered according to the value of $r_{j(i)}$). The smaller $r_{j(i)}$ is, the nearer $D_i$ and $D_{j(i)}$ are, and the more similar $X_i$ is to $X_{j(i)}$. Let $G_{j(i)}$ be the nearest neighbour bootstrapping weight of $X_{j(i)}$, which expresses the similarity between $X_i$ and $X_{j(i)}$. Obviously, $G_{j(i)}$ is related to $r_{j(i)}$.

As discussed above, the relative value of the $l$-th variable of the $j$-th nearest neighbour succeeding vector $X_{j(i)}$ is known. The succeeding vector $X_i$ can be obtained by multiplying each $X_{j(i)}$ by its weight $G_{j(i)}$ and summing. Thus, the ultimate formula of the NNM model can be given as:

$$X_i = \sum_{j=1}^{K} G_{j(i)} X_{j(i)} \qquad (2)$$

The NNM model is determined once the number of nearest neighbours $K$, the dimension of the feature vector $P$, and the nearest neighbour bootstrapping weights $G_{j(i)}$ are estimated. Generally, $K = \mathrm{int}\left(\sqrt{n-P}\right)$ is given. If $P \ge 2$, the dimension of the feature vector $P$ can be estimated from a runoff auto-correlation graph or by the trial and error method.
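The construction of feature vectors and the neighbour search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`build_features`, `euclidean`, `k_nearest`, `default_k`) are hypothetical, indices are 0-based, and `default_k` assumes the rule of thumb reads K = int(sqrt(n - P)).

```python
import math


def build_features(q, p, m=1):
    """Split a runoff series q into feature vectors and their successors.

    For each time t (0-based), the feature vector holds the P preceding
    values (Q_{t-1}, ..., Q_{t-P}) and the successor holds (Q_t, ..., Q_{t+m-1}).
    """
    feats, succs = [], []
    for t in range(p, len(q) - m + 1):
        feats.append(tuple(q[t - j] for j in range(1, p + 1)))
        succs.append(tuple(q[t:t + m]))
    return feats, succs


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(current, feats, k, dist=euclidean):
    """Indices of the k feature vectors closest to `current`, nearest first."""
    order = sorted(range(len(feats)), key=lambda idx: dist(current, feats[idx]))
    return order[:k]


def default_k(n, p):
    """Assumed rule of thumb for the number of neighbours: int(sqrt(n - P))."""
    return int(math.sqrt(n - p))
```

Passing a different `dist` function to `k_nearest` is one way to swap the distance measure without touching the rest of the procedure.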
There are a number of methods for estimating the bootstrapping weight $G_{j(i)}$. First, its constraint must be satisfied: the bootstrapping weights must sum to one (equation (3)):

$$\sum_{j=1}^{K} G_{j(i)} = 1 \qquad (3)$$

Second, $G_{j(i)}$ should be related to $r_{j(i)}$. As the index $j$ is ordered according to the value of $r_{j(i)}$, the formula adopted in this paper depends on the rank $j$ only, so once $K$ is fixed, $G_{j(i)}$ needs to be calculated only once.
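As an illustration of a weight that sums to one and needs to be computed only once for a given K, the sketch below uses the rank-based kernel of Lall and Sharma (1996), in which the weight of the j-th nearest neighbour is proportional to 1/j. Whether this is exactly the formula adopted in this paper is an assumption, and the function names `rank_weights` and `nnm_forecast` are hypothetical.

```python
def rank_weights(k):
    """Assumed rank kernel: G_j = (1/j) / sum_{l=1..K} (1/l), j = 1..K.

    The weights depend only on K, decrease with rank, and sum to one.
    """
    total = sum(1.0 / l for l in range(1, k + 1))
    return [(1.0 / j) / total for j in range(1, k + 1)]


def nnm_forecast(successors, weights):
    """Weighted sum of the successors of the K nearest neighbours.

    `successors` must be ordered nearest first, matching `weights`.
    """
    m = len(successors[0])
    return [sum(w * s[l] for w, s in zip(weights, successors)) for l in range(m)]
```

With K = 2, for example, the nearest neighbour receives weight 2/3 and the second nearest 1/3.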

COSINE ANGLE DISTANCE FOR NNM
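The cosine of the angle between $D_t$ and $D_i$ is defined as follows:

$$\cos(D_t, D_i) = \frac{\sum_{j=1}^{P} d_{tj}\, d_{ij}}{\sqrt{\sum_{j=1}^{P} d_{tj}^2}\;\sqrt{\sum_{j=1}^{P} d_{ij}^2}}$$

where $D_t$ is the feature vector and $D_i$ is the current vector. The smaller the angle between $D_t$ and $D_i$ (that is, the larger $\cos(D_t, D_i)$), the nearer $D_t$ and $D_i$ are.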
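A minimal Python sketch of the angle between two feature vectors follows. The function names `cosine` and `angle` are hypothetical, and the sketch assumes that neighbours are ranked by angle, smallest first, in line with the comparison made for Figure 1.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def angle(a, b):
    """Angle in radians between a and b; smaller means more similar direction."""
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    return math.acos(max(-1.0, min(1.0, cosine(a, b))))
```

Because the angle ignores vector length, two feature vectors that differ only by a constant factor are at angle zero.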
We compare EUD and CAD using a 2-dimensional space. Figure 1 shows a 2-dimensional space in which A is a query point. Suppose that NN(A) is the nearest neighbour of A by EUD, and that the EUD between the query point A and NN(A) is r. B and C are two points on ssp(A, r), so they are at the same Euclidean distance from A: dist(A,B) = dist(A,C). With EUD it is therefore not possible to judge which is the better neighbour. With CAD, however, angle(A,C) < angle(A,B), so C is nearer to A.
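The situation can be reproduced numerically. In the sketch below the coordinates of A, B and C are made up for illustration and are not taken from Figure 1: B and C are both at Euclidean distance 2 from A, but C points in the same direction as A while B does not.

```python
import math


def angle(u, v):
    """Angle in radians between two non-zero 2-D vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return math.acos(max(-1.0, min(1.0, dot / norm)))


# Hypothetical points: B and C are equidistant from the query point A.
A = (4.0, 4.0)
B = (2.0, 4.0)
C = (4.0 + math.sqrt(2.0), 4.0 + math.sqrt(2.0))

eud_AB = math.dist(A, B)   # Euclidean distance A-B
eud_AC = math.dist(A, C)   # Euclidean distance A-C (same as A-B)
cad_AB = angle(A, B)       # angle between A and B
cad_AC = angle(A, C)       # angle between A and C (close to zero)

print(f"EUD: A-B = {eud_AB:.3f}, A-C = {eud_AC:.3f}")
print(f"CAD: A-B = {cad_AB:.3f} rad, A-C = {cad_AC:.3f} rad")
```

EUD ties the two candidates, whereas the angle separates them and ranks C first.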
In annual runoff prediction, the angle of the feature vector is very important. We therefore propose to predict annual runoff using NNM with CAD for similarity estimation. NNM based on CAD can also be used for daily or monthly runoff prediction, as the hydrological variation trend is reflected by the angle of the feature vector.

Data
The data used in this study are annual runoff (1890-2010) from Yichang station on the Yangtze River (Fig. 2). Yichang station (drainage area 1 005 501 km²) is the controlling station for the Three Gorges Dam. The data from 1890 to 1989 are used to develop the model, and the data from 1990 to 2010 are used for prediction and model assessment.
First, the annual runoff time series of Yichang station was examined to determine any trends during the past 121 years. Figure 3 shows that it has a decreasing trend.

Prediction results and discussion
Through a preliminary selection by trial and error, the model parameters were set to P = 3 and K = 6 nearest neighbours. The annual runoff of the years 1890-1989 is then used to constitute the feature vectors Dt (t = 1, 2, 3, …, 97), a total of 97, which are used to predict the annual runoff of the years 1990-2010. The mean relative error (MRE) is 5.10%, the qualified rates (QR, error less than 10%, 20% and 30%) are all 100.0%, and r is 0.897 (Table 1, Figs 4 and 5). Compared with EUD, CAD is clearly better, with a lower MRE and a higher QR.
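The paper does not spell out how the assessment measures are computed. The sketch below uses their conventional definitions, which is an assumption: MRE as the mean of |predicted - observed| / observed, QR as the percentage of years whose relative error is below a tolerance, and r as the Pearson correlation coefficient. The function names are hypothetical.

```python
import math


def relative_errors(obs, pred):
    """Absolute relative error of each prediction (as a fraction)."""
    return [abs(p - o) / abs(o) for o, p in zip(obs, pred)]


def mre(obs, pred):
    """Mean relative error in percent."""
    errs = relative_errors(obs, pred)
    return 100.0 * sum(errs) / len(errs)


def qualified_rate(obs, pred, tol):
    """Percentage of predictions whose relative error is below tol (a fraction)."""
    errs = relative_errors(obs, pred)
    return 100.0 * sum(1 for e in errs if e < tol) / len(errs)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

For instance, `qualified_rate(obs, pred, 0.10)` would give the QR at the 10% level for a pair of observed and predicted series.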
CAD and EUD each consider only one aspect of similarity; developing a distance measure that considers both direction and absolute distance would further improve runoff prediction using NNM. In addition, NNM, like other models, has low accuracy for annual runoff prediction, and it breaks down when the future behaviour of the series departs from the patterns found in the historical data.

CONCLUSION
This paper predicted annual runoff using NNM with two distance measures, CAD and EUD. The prediction results for Yichang station showed that CAD performs significantly better than EUD, as CAD better reflects differences in direction. Developing a distance measure that considers both direction and absolute distance would further improve runoff prediction using NNM.

Fig. 1 Difference between Euclidean distance and cosine angle distance.

Fig. 2 The location of Yichang station on the Yangtze River, China.

Fig. 3 Annual runoff time series of Yichang station.
Fig. 4 Comparisons between measured and simulated annual runoff during the period of 1990-2010: (a) model performances with EUD, and (b) model performances with CAD.
Fig. 5 Comparisons between measured and simulated annual runoff during the period of 1990-2010: (a) model performances with EUD, and (b) model performances with CAD.

Table 1 Prediction performance based on CAD and EUD.