Open Access Repository
Statistical methodology for tabcharts : data reduction techniques in laser ablation analyses
Whole thesis (PDF, 20MB): whole_YungChiHo...pdf. Available under University of Tasmania Standard License.
Abstract
Researchers in CODES and the School of Earth Sciences operate a laboratory to study the composition of rock samples collected from field sites. The rock samples are placed in a machine that produces a series plot for each element (called a tabchart), which indicates the distribution of that element in the sample. In a tabchart, a significant signal change implies a change of composition in the sample, and a flat part implies a mineral layer (phase) in the sample. Currently, the researchers identify these features using their knowledge and experience. In some situations the features are not obvious or clear, and it is difficult for the researchers to make a judgement. An automatic and systematic method is therefore needed to help them solve this problem.
A total of 1848 tabcharts (66 samples x 28 elements in each sample) of primary and secondary samples were provided by the School of Earth Sciences. These primary and secondary samples are not real rock samples; they are standard samples created in the laboratory. Their tabcharts share a common shape: background noise (a flat part) at the first stage, a jump (a significant signal change) at the second stage, a plateau (another flat part) at the third stage and a drop (another significant signal change) at the last stage. Although this project focuses on the standard samples only, the analysis and results can be extended to real samples. The first four chapters describe the laboratory equipment, the mechanism and process of geological analysis of a sample, and the tabchart. These chapters provide background knowledge for reference only and are not the main interest of this project, which concentrates on the mathematical analysis of the tabchart.
The problem described in the first paragraph is a change point analysis (detection) or time series segmentation problem in mathematics and statistics. Many methods for solving it have been proposed in the literature, including cumulative sums of differences (CUSUM), perceptually important points (PIP), fuzzy set theory and genetic algorithms. To the best of my knowledge, most of these methods focus on change point detection only. In addition to this detection, the researchers need a method that also provides statistical summaries of the flat parts of the tabchart (i.e. the mean, standard deviation and trend of the element amount in the layer). A time series model is therefore a good candidate for achieving both targets. It also has the advantage of being easy to implement in a worksheet. However, some algorithms and modifications are needed so that the time series model can identify change points in the tabchart.
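As an illustration of the point-detection methods mentioned above, the following is a minimal sketch of one simple CUSUM variant for a single change in level. It is not taken from the thesis: the function name `cusum_change_point` and the rule of taking the largest absolute cumulative sum of deviations from the overall mean are assumptions made for illustration. It returns a location only and gives no summary of the flat parts.

```python
def cusum_change_point(y):
    """Estimate a single change point in level (illustrative sketch).

    Returns (index, cusum), where index is the position of the last
    observation before the estimated change and cusum is the list of
    cumulative sums of deviations from the overall mean.
    """
    if not y:
        raise ValueError("y must contain at least one observation")
    mean = sum(y) / len(y)
    running = 0.0
    cusum = []
    for value in y:
        running += value - mean
        cusum.append(running)
    # The cumulative sum is furthest from zero just before the level shifts.
    index = max(range(len(cusum)), key=lambda i: abs(cusum[i]))
    return index, cusum


if __name__ == "__main__":
    signal = [0.0] * 5 + [10.0] * 5   # flat, then a jump to a new level
    index, _ = cusum_change_point(signal)
    print("last index before the change:", index)
```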
Among various time series models, the linear Holt exponential smoothing model is selected in this project. The selection was made after comparing some common time series models. The simple exponential smoothing model has only one estimate or equation (i.e. smooth) and one parameter (i.e. α). This model is rejected because signal changes (i.e. jump and drop) have a large slope whereas flat parts (i.e. background and plateau) have a gentle slope, and one estimate is not enough to reflect this slope property of the tabchart. The damped-trend linear exponential smoothing model has two estimates or equations (i.e. smooth and trend) and three parameters (i.e. α, β and γ). Although two estimates are enough to reflect the slope property, three parameters may complicate the analysis, and there is a better choice: the linear Holt exponential smoothing model. This model also has two estimates (two equations) but only two parameters (i.e. α and β). There is no guarantee that this model is the best choice, but any results and findings from it can help in exploring other time series models in further studies.
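For reference, a common textbook form of simple exponential smoothing and of the linear Holt model is sketched below. The notation is an assumption (y_t for the observed signal at time t, S_t for the smooth estimate, b_t for the trend estimate), and the thesis may use a different but equivalent parameterization.

```latex
% Simple exponential smoothing: one estimate, one parameter
\begin{align}
S_t &= \alpha\, y_t + (1-\alpha)\, S_{t-1}
\end{align}

% Linear Holt exponential smoothing: two estimates, two parameters
\begin{align}
S_t &= \alpha\, y_t + (1-\alpha)\,(S_{t-1} + b_{t-1}), \\
b_t &= \beta\,(S_t - S_{t-1}) + (1-\beta)\, b_{t-1}, \\
\hat{y}_{t+1} &= S_t + b_t .
\end{align}
```

The second estimate b_t tracks the local slope, which is what distinguishes a jump or drop from a background or plateau.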
The linear exponential smoothing model is modified before it is fitted to the tabchart. The modified model has variable (dynamic) parameters, α and β, and a threshold value, T. During fitting, if the trend estimate of the model exceeds the threshold value, the variable parameters take the values α1 and β1; otherwise they take the values α2 and β2. This policy is based on the difference in slope between significant signal changes (i.e. jump and drop) and flat parts (i.e. background and plateau). The parameters and the threshold value are adjusted manually until the model fits the tabchart well. After fitting the model to all tabcharts, it was found that the threshold value T is more influential in finding a well-fitted model than the two parameters α and β. In addition, the fitted curve of the model is spiky when the threshold value is small but becomes smooth as the value gets larger. This parameter policy is only an initial trial and is not perfect; the experimental results reveal its drawbacks.
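The parameter-switching policy described above can be sketched as follows. This is a minimal illustration, not the thesis implementation. Several details are assumptions: the textbook Holt recursion, the comparison of the absolute value of the previous trend estimate with the threshold (so that a drop, with negative trend, also counts as a signal change), the initial values (first observation and zero trend), and the function name `holt_threshold`.

```python
def holt_threshold(y, alpha1, beta1, alpha2, beta2, threshold):
    """Holt smoothing with parameters that switch on the trend estimate.

    (alpha1, beta1) are used when |previous trend| > threshold,
    (alpha2, beta2) otherwise. Returns (smooth, trend, steep), where
    steep[t] records which parameter pair was used at time t.
    """
    if not y:
        raise ValueError("y must contain at least one observation")
    smooth = [float(y[0])]   # assumed initial smooth estimate
    trend = [0.0]            # assumed initial trend estimate (flat start)
    steep = [False]
    for t in range(1, len(y)):
        is_steep = abs(trend[-1]) > threshold
        alpha, beta = (alpha1, beta1) if is_steep else (alpha2, beta2)
        s = alpha * y[t] + (1 - alpha) * (smooth[-1] + trend[-1])
        b = beta * (s - smooth[-1]) + (1 - beta) * trend[-1]
        smooth.append(s)
        trend.append(b)
        steep.append(is_steep)
    return smooth, trend, steep


if __name__ == "__main__":
    # Hypothetical tabchart shape: background, jump, plateau.
    signal = [0.0] * 5 + [10.0] * 5
    smooth, trend, steep = holt_threshold(signal, 0.8, 0.8, 0.5, 0.5, 1.0)
    for t, (s, b, flag) in enumerate(zip(smooth, trend, steep)):
        print(t, round(s, 3), round(b, 3), "steep" if flag else "flat")
```

A change in the `steep` flag from one observation to the next marks a candidate stage boundary, which is the idea the classification method builds on.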
For convenience, the modified linear smoothing model above is henceforth called the HOLT model. The first algorithm for detecting change points (or segmenting the time series) with the HOLT model is called the classification method. Its main idea is as follows. A change in the variable parameters (i.e. α and β) indicates a stage change in the tabchart (i.e. from a significant signal change to a flat part or vice versa). For example, the parameter values change from (α1, β1) to (α2, β2) as the tabchart moves from the background stage to the jump stage in a standard sample. The classification method is not practical for the researchers and is not automatic, because the parameters must be adjusted by hand beforehand. However, the method serves two purposes. Firstly, it is a step towards developing an automatic classification method. Secondly, it is used as a tool to analyse the data reduction and fitted error of the HOLT model.
The second algorithm for detecting change points with the HOLT model is called classification rules. This algorithm is automatic because it uses a set of rules to divide the tabchart into different stages. The classification rules are developed from the classification method in the following way. After each tabchart is well fitted by the HOLT model, graphs of the trend estimate versus the smooth estimate are plotted for all tabcharts. The rules are drawn up by comparing these trend-smooth graphs with the tabcharts. The background stage of the tabchart has small trend and smooth values in the trend-smooth graph. The jump stage has larger trend and smooth values. The plateau stage has larger smooth values but small trend values. The drop stage has large negative trend values. The performance of the classification rules is tested by applying them to the tabcharts of the standard samples. Some guidelines for evaluating the performance are set to minimize personal and biased judgement, since different people may judge borderline cases differently. The classification rules have a success rate ranging from 45% to 80% across the various elements. However, after excluding the tabcharts in which the background and plateau levels are close, the success rate of the classification rules is at least 65%. This project provides a good method and platform for identifying change points. One promising way of improving the performance is to refine the rules. Although the rules seem to play a more important role in the performance than the parameters (i.e. α and β), an experiment should be carried out to investigate any effect of the parameters on the performance.
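The qualitative rules above can be sketched as a small classifier on (smooth, trend) pairs. This is an illustration under stated assumptions, not the rule set of the thesis: the cut-off values `smooth_cut` and `trend_cut`, the function names, and the simplification that a jump is recognised from its trend alone are all hypothetical.

```python
def classify_stage(smooth, trend, smooth_cut, trend_cut):
    """Assign one observation to a stage from its smooth and trend estimates."""
    if trend <= -trend_cut:
        return "drop"         # large negative trend
    if trend >= trend_cut:
        return "jump"         # large positive trend
    if smooth >= smooth_cut:
        return "plateau"      # small trend, large smooth value
    return "background"       # small trend, small smooth value


def segment(smooth_values, trend_values, smooth_cut, trend_cut):
    """Collapse per-observation labels into (stage, first_index, last_index) runs."""
    runs = []
    for i, (s, b) in enumerate(zip(smooth_values, trend_values)):
        label = classify_stage(s, b, smooth_cut, trend_cut)
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i)
        else:
            runs.append((label, i, i))
    return runs


if __name__ == "__main__":
    # Hypothetical smooth and trend estimates from a fitted model.
    smooth = [0, 0, 5, 10, 10, 5, 0]
    trend = [0, 0, 5, 5, 0, -5, -5]
    for run in segment(smooth, trend, smooth_cut=8, trend_cut=1):
        print(run)
```

With fixed cut-offs like these, a tabchart whose background and plateau levels are close cannot be separated by the smooth value, which matches the misclassification cases reported for such tabcharts.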
Both the classification method and the classification rules suffer from misclassification: they perform very poorly on some tabcharts, in which many observations are wrongly classified. All of these tabcharts have the property that the levels of the background and plateau are very close. This is therefore a possible cause of the misclassification, although a gentle signal change (i.e. a jump or drop with a gentle slope) is another possible cause. This problem gives a hint for improving the performance of both the method and the rules; for example, another set of rules is needed to handle these tabcharts. The classification method performs better than the classification rules because rules can hardly replace human visual judgement. However, the classification rules are automatic and more practical than the classification method.
Data reduction is another characteristic of the HOLT model. The model is capable of removing noise or variation from the tabchart, and the researchers can gain useful information by observing the fitted curve (values) of the model. Apart from the graphs, a quantitative analysis is included in this thesis to examine the data reduction from another angle. The square-root-of-sample-size formula (the standard deviation divided by the square root of the sample size) is used to calculate the standard error over the background and plateau. This formula assumes independent data. Many tabcharts are autocorrelated, as shown by the fact that ARIMA models can be fitted to them, but this violation does not cause a problem here: the formula is not used to estimate the statistical summaries of the underlying process but to measure the degree of variation and fluctuation in the background and plateau. Over the plateau, the standard error of the HOLT model (i.e. of the mean of the fitted values) is smaller than that of the tabchart (i.e. of the mean of the actual observations). This supports the claim that the HOLT model reduces the variation over the plateau. However, the situation is reversed over the background, where the standard error of the HOLT model is greater than that of the tabchart. This implies that the HOLT model has difficulty with backgrounds and plateaus of small signal (i.e. trace amounts of an element). A constant trend (i.e. a smoother fitted curve) should be used on the background by choosing appropriate parameters. In other words, the background (or a plateau of small signal) should use different α and β parameters.
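The comparison described above can be sketched with the usual standard error of the mean, s / sqrt(n), computed once for the raw observations and once for the fitted values over the same stage. The function name and the example numbers are hypothetical and are not data from the thesis.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    return statistics.stdev(values) / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical plateau: raw observations and smoother fitted values.
    raw_plateau = [10.0, 12.0, 8.0, 10.0]
    fitted_plateau = [10.0, 10.1, 9.9, 10.0]
    print("raw:   ", standard_error(raw_plateau))
    print("fitted:", standard_error(fitted_plateau))
```

Used this way, the quantity is a measure of fluctuation within a stage rather than an estimate for the underlying process, so the independence assumption is not relied on.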
After fitting the time series model to the raw data, an analysis of the fitted error should also be provided. The analysis shows that the fitted error of the HOLT model increases as the level of the plateau increases. The fitted error also decreases as the mass of the element increases, but this is because the amount of heavy elements in the sample is smaller than the amount of light elements. Therefore, the only factor affecting the fitted error is the level of the plateau. This result implies that the standard deviation of the fitting error of an element's concentration can be minimized if the background noise can be controlled or minimized. Controlling the background noise can thus help to estimate trace amounts of an element in the sample more accurately, but it cannot deliver any significant improvement for the major elements in the sample.
For a comprehensive analysis, ARIMA models are also fitted to the tabcharts. Since there are many tabcharts to fit, a policy is devised to speed up the ARIMA model fitting. The procedure has three steps. The first step is to check whether the tabchart is stationary; if it is not, differencing is carried out on the background of the tabchart. The second step is to fit AR(1), AR(2), MA(1), MA(2) and ARMA(1,1) models to the background and to choose the best model as the one with the lowest MSE and significant parameters. If none of the five models can be fitted to the background, it may be a random walk or follow another ARIMA model. The third step is to apply the first two steps to the plateau of the tabchart. The results are as follows. Over the background, no differencing is needed and most of the tabcharts are ARMA(1,1), random walk or inconclusive. Over the plateau, most plateaus with an upward or downward trend are ARIMA(0,1,1), whereas most plateaus with a horizontal trend are AR(1) or ARMA(1,1); the other plateaus are random walk or inconclusive. These results also indicate that most of the tabcharts have an autocorrelation problem.
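The autocorrelation noted above can be checked directly on a background or plateau segment before any ARIMA fitting. The sketch below computes the sample autocorrelation at a given lag using only the standard library. The function names are hypothetical, and the bound 2 / sqrt(n) is the usual rough large-sample guide for judging whether the value differs from zero, not a procedure described in the thesis. The sketch does not fit any ARIMA model.

```python
import math


def sample_autocorrelation(y, lag=1):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(y)")
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    num = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return num / denom


def looks_autocorrelated(y, lag=1):
    """Rough check: is |r_lag| larger than 2 / sqrt(n)?"""
    return abs(sample_autocorrelation(y, lag)) > 2 / math.sqrt(len(y))


if __name__ == "__main__":
    trending_plateau = [float(v) for v in range(1, 11)]  # hypothetical segment
    print(sample_autocorrelation(trending_plateau, lag=1))
    print(looks_autocorrelated(trending_plateau, lag=1))
```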
In the model selection, there are several reasons for not choosing the ARIMA model to tackle the researchers' problem. Firstly, the equations and structure of ARIMA are too complicated to be modified to identify change points. Secondly, although the simple exponential smoothing model is a special case of the ARIMA model (equivalent to an ARIMA(0,1,1) model without a constant term), this model was rejected in the selection for the reason given earlier. Thirdly, the researchers are looking for an automatic and fast method of extracting information from the tabchart, and the ARIMA model does not seem practical for automated use. However, ARIMA can be used to detect autocorrelation in the tabchart before the HOLT model is applied. The tabchart could be preprocessed if the autocorrelation is serious and a precise estimate is required to make a judgement on the tabchart or sample.
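The equivalence between simple exponential smoothing and ARIMA(0,1,1) without a constant term can be seen from a short derivation. The notation is an assumption: y_t is the observation, S_t the smooth estimate used as the one-step forecast of y_{t+1}, and e_t the one-step forecast error.

```latex
\begin{align}
S_t &= \alpha\, y_t + (1-\alpha)\, S_{t-1}, \qquad e_t = y_t - S_{t-1}, \\
S_t &= S_{t-1} + \alpha\, e_t, \\
y_t - y_{t-1} &= (S_{t-1} - S_{t-2}) + e_t - e_{t-1}
              = e_t - (1-\alpha)\, e_{t-1}.
\end{align}
```

The first difference of the series is therefore a first-order moving average in the forecast errors with coefficient 1 - α, which is the ARIMA(0,1,1) form with no constant.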
Some parts of this project are inadequate, and more work is needed on them in further studies. Firstly, the classification rules are rather approximate and should be refined; more advanced mathematical techniques could be employed to devise better rules. Secondly, other models, for example locally weighted regression, should be explored in case they perform better in change point detection and allow the researchers to gain more information from the tabchart. Thirdly, this project does not study the parameters of the HOLT model in enough depth. One approach is to investigate the impact of the parameters on the performance of the classification rules, because only one set of parameters is used in this analysis. Another is to investigate the relationship between the parameters and data reduction, because the data reduction does not work on backgrounds or plateaus of small signal. Lastly, some studies should be done on handling autocorrelated tabcharts and on the influence of autocorrelation on the statistical estimates obtained from the tabchart by the HOLT model.
Moreover, several directions are worth considering as long-term goals. Firstly, the classification rules can be extended to tabcharts with multiple significant signal changes (i.e. jumps and drops) and multiple flat parts (i.e. backgrounds and plateaus). Secondly, in the study of the standard samples in this project, the tabchart of each element in a sample is investigated independently. However, a mineral in a real sample is a chemical compound of several elements, so detecting a change of composition in real samples will involve more than one tabchart. Thirdly, the knowledge (i.e. model and method) in this thesis is not limited to the investigation of the composition of minerals or rock samples; it can be generalized and applied to other areas (e.g. charts in other problems). Lastly, the HOLT model should be compared with other methods of change point analysis, especially in terms of performance. Since the data reduction of the HOLT model traces the trend of the tabchart and removes variation or noise, the HOLT model may be incorporated into other methods to obtain better performance.
Item Type:  Thesis  Research Master 

Authors/Creators:  Yung, CH 
Copyright Information:  Copyright 2008 the author  The University is continuing to endeavour to trace the copyright owner(s) and in the meantime this item has been reproduced here in good faith. We would be pleased to hear from the copyright owner(s). 
Additional Information:  Thesis (MSc), University of Tasmania, 2008. Includes bibliographical references.