Thursday, January 16, 2014

Data down sampling using wavelet transform and principal component analysis


Hello, and welcome. I am pleased to present another fine piece of work. The artificial neural network (two different Mongolian terms are in use for it) is a method applied in many areas of water-sector research. It is especially common in optimizing and modelling the operation of several hydropower plants and reservoirs arranged in a cascade. The operation of multiple reservoirs located on one river or in one basin is described by many parameters that depend on the water users. Researchers can study this method in depth, and may meet and consult B. Bunchingiv, who specializes in this field in Mongolia.


Data down sampling using wavelet transform and principal component analysis

Bunchingiv Bazartseren (Ph.D.)
Institute of Geoecology, Division of Water Resources and Water Use

Abstract

Processes whose physics is not fully understood are difficult to describe by mathematical equations, and in such cases parametric relations are mostly used. One example is the movement of bed sediment, in particular its motion and transport under wave action, in other words the change in the shape of the bed. Data-driven methods, for instance artificial intelligence and one of its branches, Artificial Neural Networks (ANN), offer an alternative way of studying such processes. These methods are comparatively easy and fast to model with, do not require each ongoing process to be described by mathematical equations, and derive the functional relationship from the available input and output data.

Applying these methods requires working with data that are extensive in both time and space. Processing the data efficiently and reducing them to parameters that preserve their spatial and temporal characteristics not only improves the modelling results but can also make the process easier to understand. In this study, near-shore bed morphology measurements were reduced by Principal Component Analysis (PCA) and by wavelet transform, and the reduced data were used to predict short-term changes in bed morphology with the networks mentioned above (ANN). The study is based on bed morphology data from a part of Kiel Bay on the Baltic Sea coast. Although the data cover a short period of only 2 years, relatively accurate results were obtained. Changes in summer could be predicted somewhat more accurately than those in winter.
Keywords: data reduction, feature extraction, principal components, wavelet transform

1.   Introduction

There are processes whose physics is not fully understood and which therefore cannot be described well mathematically. An example is sediment transport and the resulting morphological development. Cases of this nature are conventionally captured by parametric relations. Data-driven soft computing techniques, such as Artificial Neural Networks (ANN), offer a cost-effective alternative that does not require physical insight into the underlying processes. These techniques generalize the relations contained in existing input/output data sets. Data-driven modelling therefore normally involves data sets that are spatially and/or temporally extensive and that require efficient data handling. In addition, suitable data analysis techniques are needed to help understand the driving factors of the evolving process.
In this paper, the possibility of analyzing and downsampling bathymetry data by wavelet transform and Principal Component Analysis (PCA) is investigated comparatively. In other words, spatially extensive data are reduced to a small number of feature variables. Both methods gave relatively accurate downsampling without a significant loss of information. Furthermore, an attempt was made to estimate the state of the near-shore morphology with the ANN on the basis of the downsampled data. The ANN prediction of profile development produced encouraging results for a short lead time.

2.   Data

The data investigated are near-shore bathymetry measurements from a part of Kiel Bay on the German coast of the Baltic Sea. The bathymetry around the bay was measured at irregular intervals in 10 sectors of different sizes between 1974 and 2000. The most frequent measurements, made in Sector 6 during 1992-1993 (about once every 6 weeks), are chosen for this study. The measurement campaigns for this sector were carried out along a number of cross-shore profiles 450 m long that reach water depths of approximately 5 m (see Fig. 1). Altogether 13 profiles are spaced 70 m apart, with 42 points at 10 m intervals along each profile. The maximum error of the original point elevation measurements is ±12 cm [1].

Figure 1. Arbitrary bathymetry measurement for the Sector 6.

3.   Methodology

3.1      Hypothesis

The hypothesis is that the geometric feature at a certain point is a function of forcing variables and spatially neighboring geometric features at the previous measurement instances (time-lagging). In other words, the changes in morphology and the tendency can be captured by studying the local geometry with its direct neighbors. Moreover, the temporal variation of the extracted geometry features can be predicted by the neural networks.

3.2      Modelling technique: Artificial Neural Networks

Multi-layer error backpropagation networks, more commonly called Multi-Layer Perceptrons (MLP), are trained by supervised learning and generalize the relations contained in sequences of a few candidate variables. An MLP consists of several layers of computational units. An example of a typical two-layer neural network with n, m and k nodes in the input, hidden and output layers respectively is shown in Figure 2. The nodes of adjacent layers are connected, with a weight on each connection and a trainable bias on each node. Training the network is a numerical optimization of a usually nonlinear objective function by adjusting the weight and bias values.

Figure 2. Typical two-layer MLP network architecture.

The output of each node is the value of its transfer function for the weighted sum of the inputs from the preceding-layer nodes connected to it. Differentiable transfer functions are preferred, chosen to suit the task for which the network is being trained. The outputs of one layer are the inputs to the next layer in the form of a weighted sum. The outputs of the network are compared to the target values to calculate the error, usually the Mean Square Error (MSE). The error is 'back-propagated' by the chain rule: the derivatives of the cost function with respect to the weights of each layer are calculated from the last layer to the first, each step reusing the derivatives already obtained for the layers processed before it. In a gradient descent algorithm, the weight changes are calculated as

$$\Delta w_{ij}^{m} = -\alpha \, \frac{\partial E}{\partial w_{ij}^{m}}$$

where E is the cost function, α is the learning rate (0 < α < 1) and $w_{ij}^{m}$ is the connection weight in layer m between the nodes i and j.
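The gradient descent update described in this section can be illustrated with a minimal sketch. It assumes a single hidden layer with tanh transfer functions, one linear output node, a squared-error cost and per-sample updates; the function name `train_mlp`, the toy data and all parameter values are hypothetical and are not taken from the study.

```python
import math
import random

def train_mlp(samples, n_hidden=3, alpha=0.1, epochs=300, seed=0):
    """Train a tiny two-layer MLP by plain gradient descent.

    samples: list of (inputs, target) pairs with a scalar target.
    Returns the mean squared error recorded during each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # w1[j][i]: weight from input i to hidden node j, b1[j]: hidden bias
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    # w2[j]: weight from hidden node j to the output node, b2: output bias
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    history = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, target in samples:
            # Forward pass: weighted sums passed through the transfer functions
            hidden = [math.tanh(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
                      for j in range(n_hidden)]
            y = sum(w2[j] * hidden[j] for j in range(n_hidden)) + b2
            err = y - target                     # dE/dy for E = 0.5 * err**2
            squared_error += err * err
            # Backward pass: chain rule from the output layer to the hidden layer
            for j in range(n_hidden):
                grad_hidden = err * w2[j] * (1.0 - hidden[j] ** 2)
                w2[j] -= alpha * err * hidden[j]  # delta_w = -alpha * dE/dw
                for i in range(n_in):
                    w1[j][i] -= alpha * grad_hidden * x[i]
                b1[j] -= alpha * grad_hidden
            b2 -= alpha * err
        history.append(squared_error / len(samples))
    return history

# Toy data (hypothetical): learn y = x^2 on a small grid
toy = [([-1.0 + 0.2 * i], (-1.0 + 0.2 * i) ** 2) for i in range(11)]
mse_history = train_mlp(toy)
print(mse_history[0], mse_history[-1])
```

The learning rate plays the role of α in the text: larger values speed up training but can make the updates overshoot.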

3.3      Data downsampling techniques

3.3.1  Wavelet transform
The wavelet transform is a signal processing technique descended from the Fourier transform. Unlike the Fourier transform, it localizes the frequency components of a signal. Although it is a relatively new technique, its applications cover a wide range of scientific and engineering disciplines, including feature extraction, the solution of differential equations and, most often, the reduction of image sizes [3].
In the wavelet transform, the signal is decomposed into shifted and scaled versions of the original wavelet, which is a waveform of limited duration with an average value of zero. Two definitions are essential here: the approximation (A), which holds the low-frequency content of the signal, and the detail (D), which holds the high-frequency corrections. The signal is thus decomposed into $s = A_1 + D_1$. At the next level of decomposition, the approximation is split again into $A_1 = A_2 + D_2$, and so on. As a result, the coefficients of the approximations (cA) and details (cD) are derived from the scaled and shifted forms of the applied wavelet function. In the one-dimensional discrete wavelet transform, the coefficients as a function of scale a and position b are calculated as follows:

$$C(a, b) = C(j, k) = \sum_{n \in \mathbb{Z}} s(n)\, \psi_{j,k}(n)$$

where $\psi_{j,k}$ is the wavelet scaled by the dyadic dilation $a = 2^j$ and shifted to the dyadic position $b = k 2^j$, and j and k are integers that represent the decomposition level and the discrete time respectively. At each level of decomposition, the number of coefficients is halved by keeping every second coefficient, and this is how the data reduction is realized.
The inverse wavelet transform reconstructs the details and approximations from the coefficients. When the details and approximations of all decomposition levels are summed, the original signal is restored: for decomposition level j, the reconstruction is $s = A_j + D_j + D_{j-1} + \dots + D_1$. At every decomposition level, the number of approximation coefficients is roughly halved. For the analysis and processing of the bathymetry data, each cross-shore profile is treated as a one-dimensional signal, so the spatial variable takes the place of the temporal one. Although the results do not differ significantly, the order and type of the wavelet function determine the number and values of the resulting decomposition coefficients. After a preliminary analysis, the Symlet function of 5th order was chosen for use throughout this study.
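The split of a signal into an approximation and a detail, and its exact restoration by the inverse transform, can be sketched in a few lines. For brevity the sketch uses the Haar wavelet rather than the 5th-order Symlet chosen in the study, because its filters have only two taps; the function names and the sample depths are hypothetical.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT (len(signal) must be even).

    Returns the approximation coefficients cA and detail coefficients cD,
    each half as long as the input.
    """
    s = 1.0 / math.sqrt(2.0)
    cA = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    cD = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return cA, cD

def haar_idwt(cA, cD):
    """Inverse of haar_dwt: rebuild the signal from cA and cD."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(cA, cD):
        out.extend([(a + d) * s, (a - d) * s])
    return out

# Hypothetical water depths (m) along a short cross-shore profile
profile = [-0.2, -0.6, -1.1, -0.9, -1.4, -2.0, -2.3, -2.9]
cA, cD = haar_dwt(profile)
zeros = [0.0] * len(cA)
A1 = haar_idwt(cA, zeros)   # low-frequency part of the profile
D1 = haar_idwt(zeros, cD)   # high-frequency correction
restored = [a + d for a, d in zip(A1, D1)]   # s = A1 + D1
print(len(cA), len(cD))
```

Applying `haar_dwt` again to `cA` gives the second decomposition level, which is how the number of approximation coefficients is halved at each level.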
3.3.2  Principal Component Analysis
Principal Component Analysis (PCA) is a multivariate statistical technique, applied mostly to reduce the dimensionality of a data set and to identify new, meaningful underlying variables. It is an old technique, nearly 100 years old, for which many references can be found (see Jackson [6] for details). Dean and Dalrymple [4] applied it to analyze four profiles observed relatively far apart in time and gave other examples of similar applications of the PCA.
The Principal Components (PCs) are the eigenvectors of the covariance matrix of the data set. The largest eigenvalue corresponds to the first principal component, the second largest to the second, and so on. Each eigenvalue indicates the amount of information that its principal component explains. When the components are ranked by their eigenvalues, the most important information in the data set can be extracted. By discarding the components that explain only minor parts of the data variance, a very high rate of data compression with a low reconstruction error can normally be achieved. The PCA seeks projection vectors that:
·         maximize the variance retained in the projected data
·         give uncorrelated projected distributions, and
·         minimize the least square reconstruction error
The basic idea of the PCA is that, for the elevation h at i points along the beach profiles measured k times, there should be a coefficient matrix C that fulfils the following condition:

$$h_{ik} = \sum_{n} C_{nk}\, e_{ni}$$

where $e_{ni}$ are the new transformed variables, i.e. the principal components or eigenvectors. The eigenvectors are orthogonal and therefore uncorrelated with each other. The coefficient matrix is found by minimizing the sum of squares of the local error with respect to C. To find the eigenvectors, their contribution to the variance is maximized. Differentiating and defining the covariance matrix S leads to the eigenvalue matrix equation.
In our case, the cross-shore profiles are treated as a function of the distance from the coastline (rows) that varies in time (columns). In this way, the time series of each cross-shore profile is analyzed separately. A few main eigenvectors, resulting from the orthogonal transformation, define the basic shape of the profile and the perturbations that diverge from this main shape. By choosing only the first few eigenvectors, the dimension of the data is significantly reduced. The eigenvectors or principal components remain invariant over time, whereas the eigenvector coefficients or PC scores carry the time-dependent information of an individual profile.
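The decomposition of one profile's time series into time-invariant eigenvectors and time-dependent scores can be sketched with power iteration and deflation. The sketch assumes that the data are not mean-centred, so that the first component carries the mean profile shape; the study does not state how the eigenvectors were computed. The function name `pca_profiles` and the synthetic profile (a linear slope plus a seasonally varying bar) are hypothetical.

```python
import math
import random

def pca_profiles(H, n_components, iters=200, seed=0):
    """Leading principal components of H by power iteration with deflation.

    H[i][k]: elevation at point i for survey k (rows: distance, columns: time).
    Returns (components, scores) with H[i][k] ~ sum_n scores[n][k] * components[n][i].
    """
    rng = random.Random(seed)
    n_points, n_surveys = len(H), len(H[0])
    R = [row[:] for row in H]              # residual not yet explained

    def project(v):
        # Scores of the residual on direction v: c_k = sum_i R_ik * v_i
        return [sum(R[i][k] * v[i] for i in range(n_points))
                for k in range(n_surveys)]

    components, scores = [], []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n_points)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            c = project(v)
            # Multiply by R R^T without forming the matrix explicitly
            w = [sum(R[i][k] * c[k] for k in range(n_surveys))
                 for i in range(n_points)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        c = project(v)
        components.append(v)
        scores.append(c)
        for i in range(n_points):          # deflate: remove the explained part
            for k in range(n_surveys):
                R[i][k] -= v[i] * c[k]
    return components, scores

# Synthetic profile: 42 points every 10 m, 18 surveys (values are made up)
x = [10.0 * i for i in range(42)]
H = [[-0.01 * xi
      + 0.3 * math.sin(2.0 * math.pi * k / 9.0) * math.exp(-((xi - 150.0) / 40.0) ** 2)
      for k in range(18)] for xi in x]
components, scores = pca_profiles(H, n_components=2)
total = sum(h * h for row in H for h in row)
explained = [sum(c * c for c in s) / total for s in scores]
max_err = max(abs(H[i][k] - sum(scores[n][k] * components[n][i] for n in range(2)))
              for i in range(42) for k in range(18))
print(explained, max_err)
```

In this synthetic case two components restore the data almost exactly because the profile was built from two shapes; measured profiles need more components.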

4.   Results

4.1      PCA

Performing the PCA on every profile yields eigenvectors of the same size (42x18) and eigenvector coefficients (18x18). Averaged over the 13 profiles, the first PCs explain 99.53%, the second PCs 0.16%, the third PCs 0.08% and the fourth PCs 0.05% of the total variance. The first PC (Fig. 3.a) in sector 6 represents the main profile pattern over time, and the following ones represent perturbations and deviations from this main shape of each profile. The second PCs appear to match the location of the main sandbar (Fig. 3.c). The eigenvector coefficients that correspond to the main PCs of the profiles in sector 6 also show seasonal behaviour: the coefficients are higher in winter and lower in summer. Over time, the perturbations move diagonally towards the right-hand side of the area in the off-shore direction. This feature can be observed in all three sets of eigenvector coefficients plotted.
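
As a quick check on these figures, the share of variance retained by the first four PCs can be summed directly from the averages quoted above (18 components per profile are assumed here, following the 18x18 coefficient matrix):

$$99.53\% + 0.16\% + 0.08\% + 0.05\% = 99.82\%$$

so the remaining 14 components together account for about 0.18% of the variance.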

Figure 3. New projected data (eigenvectors) and their coefficients in sector 6 a) first PCs b) coefficients of first PCs c) second PCs d) coefficients of second PCs e) third PCs f) coefficients of third PCs.

4.2      Wavelet transform

A one-dimensional discrete decomposition is performed on the individual cross-shore profiles at every measurement instance. After a profile has been decomposed over a few levels, the approximation of the last level preserves the general shape of the profile, and the remaining coefficients, which are the details of every decomposition level, contain the deviations from that general shape. In contrast to the PC coefficients, the wavelet decomposition coefficients indicate the location and degree of the perturbations along the profile at a given time instance.
Four decomposition levels of the 42 bathymetry points along each profile generated a total of 77 coefficients: 11 detail and 11 approximation coefficients at the fourth level, 13 detail coefficients at the third level, 17 at the second and 25 at the first. Like the coefficients of the main PCs, the wavelet decomposition coefficients show a shift of the perturbations over time, from the left-hand side to the right, in this sector. In addition, the small-scale structures preserved in the coefficients are parallel to the main sandbar (Fig. 4). When the details of each level are generated separately, it is possible to see which details are preserved at the individual decomposition levels.
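
The coefficient counts quoted above can be reproduced with a short calculation. The sketch assumes that the profile is extended non-periodically at its ends (for example by symmetric padding), in which case common DWT implementations return floor((n + L - 1) / 2) coefficients per level for a filter of length L, and that the 5th-order Symlet has L = 10 taps. The study does not state which software or extension mode was used, and the function name is hypothetical.

```python
def dwt_coeff_lengths(n_samples, filter_len, levels):
    """Number of detail coefficients at each level and of the final
    approximation, for a DWT with non-periodic signal extension."""
    detail_lengths = []
    n = n_samples
    for _ in range(levels):
        n = (n + filter_len - 1) // 2   # length after filtering and keeping every 2nd value
        detail_lengths.append(n)
    return detail_lengths, n            # the final approximation has the last length

details, approx = dwt_coeff_lengths(n_samples=42, filter_len=10, levels=4)
total = sum(details) + approx
print(details, approx, total)   # [25, 17, 13, 11] 11 77
```

The counts shrink by slightly less than a factor of two per level because the filter overlaps the profile ends, which is why the total exceeds the 42 original points.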

Figure 4. Wavelet decomposition for the profiles on Nov, 93 a) original data b) fourth level approximation c) fourth level detail d) third level detail e) second level detail and f) first level detail coefficients.

4.3      Reconstruction of profile data

It is essential to ensure that the profiles can be restored from the reduced data without losing details that would exceed the measurement error. In the case of the PCA, the bathymetry data are restored by multiplying the dominant PCs by their coefficients. Using the 5 main PCs results in an average absolute error (AAE) of about 4.5 cm. When the bathymetry data are restored by an inverse wavelet transform, certain high-frequency components, in this case the first-level details, are discarded for the purpose of data reduction. All other detail and approximation coefficients are retained. Although the total number of coefficients is not smaller than the original number of points, each decomposition level gives a set of coefficients that show the approximate location and extent of certain frequency components along the profile. The decomposition levels can be analyzed and considered separately.
Restoring the original data by either technique results in a small degree of smoothing, although all the necessary details are preserved. In the case of the PCA, the residuals at the grid points, averaged over the 18 measurements, are distributed over the whole area and are not necessarily concentrated in the vicinity of the main sandbar. For the profiles restored by an inverse wavelet transform, however, the largest residuals are concentrated around the sandbar. The average residuals for distance ranges along the cross-shore profiles indicate that the accuracy generally increases as the ranges extend farther seawards (Table 1).

Table 1. AAE of restoring bathymetry data for distance ranges from the shoreline (AAE in m).

Range, m    0-20      20-50     50-100    100-200   200-300   300-420
PCA         0.0444    0.0418    0.0406    0.0375    0.0328    0.0294
Wavelet     0.0820    0.0410    0.0432    0.0340    0.0212    0.0199

4.4      Prediction of morphology developments

The morphology has been predicted a short time ahead on the basis of the downsampled features of the cross-shore profiles. Time-lagged features of the neighbouring profiles are also provided as inputs to the neural networks. The ANN outputs are then used to restore the bathymetry of the study area. The forcing variables are not taken into account, so the prediction can be considered a tendency estimation. The lead time is about 6 weeks. A preliminary study showed no significant difference in performance between various neural network structures, so only MLP networks are used throughout. The bathymetry data for June and November 1993 are used for verification and the rest for training the network. The performance indices are the correlation coefficient (r) and the Root Mean Square Error (RMSE).
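
The input/target arrangement described here can be sketched as follows. The sketch assumes that each sample pairs the features of a profile and its two direct neighbours at the previous survey with the features of that profile at the current survey, and that the profiles at the edges of the sector reuse their own features in place of the missing neighbour. The function name, the edge handling and the survey indices held out for verification are hypothetical, not taken from the study.

```python
def build_lagged_samples(features, verify_times):
    """Build (inputs, target) pairs from downsampled profile features.

    features[t][p]: list of feature values (e.g. PC scores or wavelet
    coefficients) of profile p at survey t.
    verify_times: set of survey indices held out for verification.
    """
    n_times, n_profiles = len(features), len(features[0])
    train, verify = [], []
    for t in range(1, n_times):
        for p in range(n_profiles):
            inputs = []
            for q in (p - 1, p, p + 1):
                q = min(max(q, 0), n_profiles - 1)   # clamp at the sector edges (assumption)
                inputs.extend(features[t - 1][q])    # time-lagged neighbour features
            sample = (inputs, list(features[t][p]))
            (verify if t in verify_times else train).append(sample)
    return train, verify

# Dummy features: 18 surveys, 13 profiles, 2 features per profile
features = [[[float(t), float(p)] for p in range(13)] for t in range(18)]
train, verify = build_lagged_samples(features, verify_times={12, 17})
print(len(train), len(verify), len(train[0][0]))   # 195 26 6
```

Holding out whole surveys, rather than random samples, keeps the verification data unseen in time, which matches the short-lead-time prediction task.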

4.4.1  Use of principal component coefficients

Altogether 13 profiles are used to set up the ANN model. The correlation coefficients between the verification results and the target values are very high; the highest is found for the coefficients of the second PCs (Table 2). The coefficients that correspond to the first PCs vary within a very narrow range, and their predictions are therefore rather accurate as well. The accuracy on the actual profiles is evaluated in comparison with the profiles reconstructed from the target coefficients, as well as with the actual target profiles (Fig. 5).

Table 2. Correlation coefficients for individual PCs at the sector 6.

          PC1       PC2       PC3       PC4       PC5
Jun-93    0.9885    0.9941    0.9943    0.9921    0.9927
Nov-93    0.9971    0.9982    0.9928    0.9916    0.9949


Figure 5. Results of prediction November, 1993 a) coefficients of Profile 2 b) coefficients of Profile 7 c) Profile 2 reconstructed d) Profile 7 reconstructed.

4.4.2  Use of wavelet decomposition coefficients

The prediction accuracy for the coefficients, expressed as correlation coefficients, is relatively good, except for the second-level detail (Table 3). The approximation coefficients are predicted well, but the extreme values of the detail coefficients are again underestimated, which shifts the estimated position of the main sandbar crest (Fig. 6). The profiles reconstructed from the results obtained at profile No.7 for November 1993 are plotted in Fig. 7. The neural networks predicted the seaward displacement of the sandbar in November 1993. In contrast to the winter profiles, those for June 1993 are qualitatively better predicted with respect to the location and height of the bar crest.

Table 3. Accuracy of wavelet coefficient prediction in sector 6, correlation coefficients.

          A4        D4        D3        D2
Jun-93    0.9984    0.9823    0.9745    0.8540
Nov-93    0.9987    0.9681    0.9601    0.8274


Figure 6. Wavelet coefficients predicted by neural networks profile No.7 in sector 6 for November, 1993 a) A4 b) D4 c) D3 and d) D2.

Figure 7. Reconstructed profiles in sector 6, November, 1993. a) Profile No.2 b) Profile No.7.

4.4.3  Discussion

The performance indices show that the models which use the PC scores clearly outperform those which use the wavelet decomposition coefficients (Table 4). After the profiles had been restored from the ANN results, the largest residuals were found to be concentrated around the main sandbar when the wavelet decomposition coefficients were used. The summer profiles are qualitatively better predicted than the winter ones.
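
The two performance indices can be computed as in the sketch below, which also illustrates why both are reported: a prediction that is shifted by a constant offset still correlates perfectly with the target, and only the RMSE reveals the error. The function names and the depth values are hypothetical.

```python
import math

def rmse(pred, target):
    """Root Mean Square Error between predicted and target values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

def pearson_r(pred, target):
    """Pearson correlation coefficient between predicted and target values."""
    n = len(target)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

# Hypothetical depths (m): the prediction is 5 cm too shallow everywhere
target = [-0.5, -1.0, -1.5, -2.0]
pred = [t + 0.05 for t in target]
r_value = pearson_r(pred, target)
rmse_value = rmse(pred, target)
print(round(r_value, 4), round(rmse_value, 4))   # 1.0 0.05
```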



Table 4. Average accuracies of profile reconstruction using ANN results.

                                Restored from NN output     Target profiles
Method     Performance index    Jun-1993     Nov-1993       Jun-1993     Nov-1993
PCA        Correlation, r       0.9999       0.9999         0.9994       0.9995
           RMSE                 0.0078       0.0079         0.0418       0.0403
Wavelet    Correlation, r       0.9976       0.9977         0.9973       0.9973
           RMSE                 0.0892       0.0845         0.0955       0.0918

5.   Conclusions

To investigate morphological evolution with the ANN, the near-shore bathymetry data needed to be downsampled, and the wavelet transform and the PCA were applied comparatively for this purpose. For the case study considered, the predictions of the extracted features by the neural network produced promising results, although the temporal coverage of the bathymetry data is relatively short. The wavelet decomposition coefficients are better suited to data analysis, since they give the degree to which certain frequency components are present in the profile shape and are therefore related to the geometry. Neural networks that use the PC scores outperform those that use the wavelet decomposition coefficients.

6.   References

[1] Amt fuer Land und Wasserwirtschaft Kiel und das Landesamt fuer Natur und Umwelt Schleswig-Holstein, 1997, “Vorstranddynamik einer Tidefreien Kueste”, Abschlussbericht (in German)
[2] Bazartseren, B., 2005, “Applicability of artificial neural networks for investigating short-term developments of near-shore morphology”, Dissertation, Publications of the Institut Bauinformatik, Brandenburg University of Technology of Cottbus, ISBN 3-934934-09-9 (in publication)
[3] Chui, C.K., 1992, “An introduction to wavelets”, Academic Press, Inc., San Diego, CA
[4] Dean, R.G. and Dalrymple, R.A., 2001, “Coastal processes with engineering applications”, Cambridge University Press, ISBN 0-521-49535-0
[5] Haykin, S., 1999, “Neural networks: A comprehensive foundation”, 2nd edition, Prentice Hall, New Jersey
[6] Jackson, J.E., 1991, “A user’s guide to principal components”, John Wiley & Sons, Inc., New York
[7] Oosterlaan, L.M., 2000, “Prediction of Near-shore Morphology along the Dutch Coast”, Joint workshop on Artificial Intelligence in Civil Engineering Applications, Schleider, O.H. and Zijderveld, A., editors, BTU Cottbus, pages 101-111

      
