1. Introduction
1 The origin of this work is a collaboration with a French administration, the Midi-Pyrénées Dreal (Direction Régionale Environnement Aménagement Logement), concerning the merging of several administrative databases with different spatial supports. It was necessary, for example, to disaggregate the number of housing units, originally available at the commune level, onto a fine regular square grid. Similarly, many administrative agencies nowadays face the problem of merging information of different administrative origins collected on several incompatible partitions of the zone of interest into spatial units. An easy way to combine data on incompatible supports is to align them on a common grid. For this reason, the EU directive Inspire (2007), INfrastructure for SPatial InfoRmation, states principles to “give infrastructure for spatial information to support community environment policies”. One of its objectives is to ensure that “it is possible to combine spatial data and services from different sources across the community in a consistent way and share them between several users and applications”, and one requirement is that reference data should “enable merging of data from various sources”.
2 The reason for the existence of incompatible partitions is a historical lack of coordination between collecting agencies, each using its favorite spatial division. Another origin can be the change of administrative boundaries through time, so that the combination of data from different historical periods results in incompatible spatial supports. The support of spatial data refers to the spatial domain on which each characteristic is reported. One often needs to combine national census statistics with other sources of data, for example in geomarketing or natural sciences. Other examples of such situations arise when some planning task is undertaken, such as deciding where a new school or store should be located, and the planners need to transfer census data to their particular catchment areas. Even when it is possible to reaggregate the data from the individual level, this solution is time consuming and expensive and may raise confidentiality problems. An easy way to combine data on several different supports is to align them on a common grid and to reallocate all sources to this single target partition. This option (called carroyage in French) is currently being exploited in France at Insee.
3 This problem is also referred to as the areal interpolation problem. More generally, the change of support problem may involve point-to-point, area-to-point or point-to-area interpolation. For example, the point interpolation problem arises when a target variable is available for a set of point locations and is needed at another location where the data is not available. Gotway and Young (2002) describe these different types and give an overview of the methods. We will focus here on the area-to-area case with a particular emphasis on disaggregation. A discussion of some methods relevant to this framework can also be found in Goodchild et al. (1993), but we go one step further in the degree of formalization and unification.
4 After introducing the vocabulary and definitions in section 2, we will see in section 3 that there are three main types of such techniques. The first type is the family of proportional weighting schemes, also called dasymetric methods, which are illustrated in Yuan et al. (1997), Voss et al. (1979), Reibel and Bufalino (2005), Mennis and Hultgren (2006) and Gregory (2002). The second type consists of regression-based interpolation and can be found in Flowerdew et al. (1991), Goodchild et al. (1993) and Flowerdew and Green (1992) for the simplest versions. The third type comprises smoothing techniques, which are described for example in Tobler (1979), Martin (1989), Bracken and Martin (1991), Rase (2001) and Kyriakidis (2004). The set of methods can be classified by the type of variable they apply to (continuous or discrete, extensive or intensive), the satisfaction of the volume-preserving (pycnophylactic) property, the presence of auxiliary information, and the use of simplifying assumptions. We could not provide in a single paper the same level of detail for the simple methods and for the more complex ones; we therefore decided to concentrate on the simple ones, which are the ones more likely to be adopted by practitioners, and to give just some of the main references for the more complex methods. We use a simulated toy example to illustrate some of the methods. In order to ease the practitioner's choice, we present a synoptic table (Table 1) summarizing this classification. We believe that presenting the methods in such a unified way can help practitioners clarify the relationships between the very diverse presentations found in the literature. Note that a more detailed and lengthy presentation for practitioners has been written for the Dreal (Vignes et al., 2013). This work of clarification also led us to extensions of some of these methods to new cases: for example, in section 3.2.1 we extend the ordinary dasymetric weighting method to the case of an intensive target variable Y and an extensive auxiliary quantitative variable X, and in section 3.2.2 we show that the assumption of intersection units nested within control zones is unnecessary. Finally, this approach helped us lay the groundwork for a future mathematical evaluation of the respective accuracy of the methods. A brief point on the current state of the literature about accuracy and software issues is made in the conclusion.
2. Data, definitions and notations
5 The variable of interest that needs to be interpolated is called the target variable and it needs to have a meaning on any subregion of the given space. $Y_D$ will denote the value of the target variable on a subregion $D$ of the region of interest $\Omega$. We restrict attention to the case of quantitative target variables (see for example Chakir (2009) for the case of categorical target variables).
6 In the general area-to-area reallocation problem, the original data for the target variable is available for a set of source zones that will be denoted by $S_s$, $s = 1, \dots, S$, and has to be transferred to an independent set of target zones that will be denoted by $T_t$, $t = 1, \dots, T$. The variable $Y_{S_s}$ will be denoted by $Y_s$ for simplicity and similarly $Y_{T_t}$ by $Y_t$. The source zones and target zones are not necessarily nested and their boundaries do not usually coincide. Figure 1 illustrates these two partitions of the region of interest.
7 With a set of source zones and target zones, one can create a set of doubly indexed intersection zones $A_{s,t} = S_s \cap T_t$, $s$ standing for the index of the source zone and $t$ for that of the target zone. For simplicity, $Y_{A_{s,t}}$ will be denoted by $Y_{s,t}$. Figure 2 illustrates the partition into intersection zones with a zoom on a particular target on the left. Many methods involve the areas of different subregions (sources, targets or other). We will denote by $|A|$ the area of any subregion $A$.
8 Most methods will first proceed to the interpolation of the data from the sources to the intersections and in a second step combine the interpolated intersection values to get the target interpolated values. This combination step requires an aggregation rule: one needs to specify how the value $Y_\Omega$ of the target variable Y on a zone $\Omega$ relates to the values of Y on a set of subzones $\Omega_k$, $k = 1, \dots, p$, forming a partition of $\Omega$. The literature distinguishes between two types of aggregation rules. Let us start with two examples: population and population density. The overall population of a region $\Omega$ is obtained by simple summation of the population of each subregion $\Omega_k$. The same is true for any counting variable, and such variables are named extensive. Otherwise stated, an extensive variable is a variable which is expected to take half the region's value in each half of the region. Now the population density of the region $\Omega$ can be obtained from the densities of the subregions by a weighted average with weights given by $|\Omega_k| / |\Omega|$, since
$$Y_\Omega = \frac{\mathrm{pop}(\Omega)}{|\Omega|} = \sum_{k=1}^{p} \frac{|\Omega_k|}{|\Omega|}\,\frac{\mathrm{pop}(\Omega_k)}{|\Omega_k|} = \sum_{k=1}^{p} \frac{|\Omega_k|}{|\Omega|}\, Y_{\Omega_k}.$$
9 This type of variable is called intensive with weights $|\Omega_k| / |\Omega|$. More generally, linear aggregation takes the general form
$$Y_\Omega = \sum_{k=1}^{p} w_{\Omega_k}\, Y_{\Omega_k}$$
10 for a set of weights $w_{\Omega_k}$. If all weights are equal to 1, the variable is called extensive and it is called intensive otherwise. For variables such as population density, we will make use of the following areal weights matrix: the $(s, t)$ element of the areal weights matrix W is given by the ratio $|A_{s,t}| / |S_s|$, which is the share of the area of source zone $s$ that lies in target zone $t$. Another example of intensive variable is given by the average price of housing units in a given subregion for a data set of house prices. In this case, the weighting scheme is different and is given by $n_{\Omega_k} / n$, where $n_{\Omega_k}$ is the number of housing units in $\Omega_k$ and $n$ is the total number of housing units in $\Omega$. More generally, proportions and rates are intensive variables. Although never really stated, the weights are not allowed to depend upon Y, but they may be related to another extensive variable Z by
$$w_{\Omega_k} = \frac{Z_{\Omega_k}}{Z_\Omega}. \qquad (1)$$
11 In that case, note that the weights are non-negative and that they sum to one. These notions of extensive and intensive variables are also found in physics. Some variables are neither extensive nor intensive: the target variable defined by the maximum price on the subregion A is neither extensive nor intensive.
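To make these aggregation rules concrete, here is a small numerical sketch with made-up figures for two subzones: population aggregates by a plain sum (extensive), population density by an area-weighted average, and average housing price by a count-weighted average.

```python
# A small numerical illustration with made-up figures for two subzones:
# population is extensive (plain sum), population density is intensive with
# areal weights, and average price is intensive with weights n_k / n.
subzones = [
    {"area": 2.0, "population": 300, "mean_price": 1500.0},
    {"area": 3.0, "population": 100, "mean_price": 2100.0},
]

total_area = sum(z["area"] for z in subzones)
total_population = sum(z["population"] for z in subzones)   # extensive: all weights equal to 1

# intensive with areal weights |Omega_k| / |Omega|
density = sum(z["area"] / total_area * (z["population"] / z["area"]) for z in subzones)
# equals total_population / total_area = 400 / 5 = 80

# intensive with weights n_k / n (n = number of individuals, here the population)
mean_price = sum(z["population"] / total_population * z["mean_price"] for z in subzones)
# equals (300 * 1500 + 100 * 2100) / 400 = 1650

print(total_population, density, mean_price)
```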
12 Let us show that it is always possible to associate an intensive variable with a given extensive variable by the following scheme. If Y is extensive, and if $w$ is a weighting scheme of the form Equation 1, the variable
$$\tilde{Y}_A = \frac{Y_A}{Z_A} \qquad (2)$$
13 is intensive since
$$\sum_{k=1}^{p} w_{\Omega_k}\, \tilde{Y}_{\Omega_k} = \sum_{k=1}^{p} \frac{Z_{\Omega_k}}{Z_\Omega}\,\frac{Y_{\Omega_k}}{Z_{\Omega_k}} = \frac{1}{Z_\Omega}\sum_{k=1}^{p} Y_{\Omega_k} = \frac{Y_\Omega}{Z_\Omega} = \tilde{Y}_\Omega.$$
14 Reversely, if one starts from an intensive variable Y with weighting scheme of the form Equation 1, it can be transformed into an extensive variable by
$$\tilde{Y}_A = Y_A\, Z_A. \qquad (3)$$
15 Indeed we have
$$\sum_{k=1}^{p} \tilde{Y}_{\Omega_k} = \sum_{k=1}^{p} Y_{\Omega_k}\, Z_{\Omega_k} = Z_\Omega \sum_{k=1}^{p} \frac{Z_{\Omega_k}}{Z_\Omega}\, Y_{\Omega_k} = Z_\Omega\, Y_\Omega = \tilde{Y}_\Omega.$$
16 Depending on the relative sizes of sources and targets, the areal interpolation problem can be rather of aggregation or of disaggregation type. If sources are much smaller in size than targets, one will recover a target value by aggregating the sources that fall inside this target and possibly a few border intersections: this is the aggregation type. In the reverse situation, a given target will contain the intersections of itself with possibly several sources. An intermediate case is when the sizes of sources are comparable to those of targets. Figures 3 and 4 illustrate these cases. We will concentrate here on the disaggregation type.
17 One property, which is often quoted, is the so-called pycnophylactic property. According to Rase (2001), this name comes from the Greek words “pyknos” for mass and “phylax” for guard. This property requires the preservation of the initial data in the following sense: the predicted value on source $S_s$ obtained by aggregating the predicted values on its intersections with the targets should coincide with the observed value on $S_s$. In the case of an extensive variable, this is equivalent to
$$\sum_{t=1}^{T} \hat{Y}_{s,t} = Y_s \quad \text{for all } s.$$
18 In the case of an intensive variable with weighting scheme given by $w_{s,t} = Z_{A_{s,t}} / Z_{S_s}$, this is equivalent to
$$\sum_{t=1}^{T} \frac{Z_{A_{s,t}}}{Z_{S_s}}\, \hat{Y}_{s,t} = Y_s \quad \text{for all } s.$$
19 In the literature, one usually encounters this property for the extensive case.
20 One assumption which is often used to compensate for the absence of information is that of homogeneity. For an extensive target variable, we will say that it is homogeneous in a given zone A if it is evenly distributed within A, meaning that its value on a sub-zone of A is equal to the share of the area of the sub-zone times its value on A. For an intensive variable, we will use the same vocabulary when the variable is constant on each sub-zone of A. The two notions indeed correspond to each other by the relationships (2) and (3).
21 Let us introduce the toy example that will be used to demonstrate some properties. In Figure 5, we can see a square divided into 25 equal cells and three source regions made of unions of cells. Figure 5 presents the values of an auxiliary variable X in the center panel and the values of two target variables on the left and on the right. We can see that there is inhomogeneity within sources. The target zones are visible in Figures 6 through 9, which compare some methods through the target prediction errors.
3. Methods
22 One early method cannot easily be classified with the others. It is called “point in polygon” and we will describe it first. The others fall into three main classes: the class of dasymetric methods, the class of regression methods and the class of smoothing methods.
23 Some methods use auxiliary information contained in the observation of an additional related variable X to improve the reallocation. When this information is categorical, the level sets of this variable define the so-called control zones. The spatial support of this auxiliary information can be at the source, target, intersection or control level. For the use of X to improve the reallocation of Y, we need to believe that Y and X are correlated enough. This raises some questions since Y as well as X are spatial variables, hence possibly spatially autocorrelated, and it is unclear how to take this into account to correct the classical correlation measures.
24 Some methods require additional assumptions on the target variable, for example: Y is homogeneous on the sources or on the targets, or the distribution of Y is known to be Poisson or gaussian. We start with the most elementary methods requiring no additional information and progressively increase complexity.
3.1. Elementary methods
3.1.1. Point in polygon
25 The centroid assignment method also called “point in polygon” allocates the source data to a target if and only if the source polygon centroid is located within the target polygon. The areal data is thus collapsed to a point datum via a representative point such as the centroid. Voss et al. (1979) report that it is the least accurate method. Moreover, it does not satisfy the pycnophylactic property.
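As an illustration, here is a minimal sketch of the centroid assignment rule using the shapely library; the polygons and values are hypothetical and not those of the paper.

```python
from shapely.geometry import Polygon

# Hypothetical source zones (two horizontal bands) with their observed values Y_s.
sources = {
    "s1": (Polygon([(0, 0), (3, 0), (3, 1), (0, 1)]), 100.0),
    "s2": (Polygon([(0, 1), (3, 1), (3, 2), (0, 2)]), 60.0),
}
# Hypothetical target zones (a narrow and a wide vertical band).
targets = {
    "t1": Polygon([(0, 0), (1, 0), (1, 2), (0, 2)]),
    "t2": Polygon([(1, 0), (3, 0), (3, 2), (1, 2)]),
}

# Each source value is allocated entirely to the target containing its centroid.
estimates = {t: 0.0 for t in targets}
for spoly, y_s in sources.values():
    centroid = spoly.centroid
    for t, tpoly in targets.items():
        if tpoly.contains(centroid):
            estimates[t] += y_s
            break

print(estimates)   # {'t1': 0.0, 't2': 160.0}
```

With these shapes both source centroids fall in the wider target, which receives the whole total while the other target receives nothing, which illustrates the crudeness noted above.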
3.1.2. Areal weighting interpolation
26 It can be applied to an extensive or an intensive variable and does not require auxiliary information. For an extensive variable, it is based on the homogeneity assumption that $Y_A$ is proportional to the area $|A|$. It thus consists in allocating to each subregion a value proportional to the fraction of the area of the source that lies within that subregion. For $s$ such that $|A_{s,t}| > 0$,
$$\hat{Y}_{s,t} = \frac{|A_{s,t}|}{|S_s|}\, Y_s. \qquad (4)$$
27 After the combination step, this results in the following formula
$$\hat{Y}_t = \sum_{s=1}^{S} \frac{|A_{s,t}|}{|S_s|}\, Y_s. \qquad (5)$$
28 For an intensive variable with areal weights, the areal weighting interpolation is based on the assumption that Y is uniform on the sources. It thus consists in allocating to the intersection $A_{s,t}$ the value of $Y_s$, leading to
$$\hat{Y}_{s,t} = Y_s. \qquad (6)$$
29 After the combination step, this results in the following formula
$$\hat{Y}_t = \sum_{s=1}^{S} \frac{|A_{s,t}|}{|T_t|}\, Y_s. \qquad (7)$$
30 It is easy to see that this method does satisfy the pycnophylactic property.
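A minimal sketch of areal weighting for an extensive variable, again with hypothetical shapely polygons, could look as follows; the last line checks the pycnophylactic property numerically.

```python
from shapely.geometry import Polygon

# Same hypothetical zones as in the point-in-polygon sketch above.
sources = {
    "s1": (Polygon([(0, 0), (3, 0), (3, 1), (0, 1)]), 100.0),
    "s2": (Polygon([(0, 1), (3, 1), (3, 2), (0, 2)]), 60.0),
}
targets = {
    "t1": Polygon([(0, 0), (1, 0), (1, 2), (0, 2)]),
    "t2": Polygon([(1, 0), (3, 0), (3, 2), (1, 2)]),
}

# Equation 4 at intersection level, summed over sources (Equation 5).
estimates = {t: 0.0 for t in targets}
for spoly, y_s in sources.values():
    for t, tpoly in targets.items():
        a_st = spoly.intersection(tpoly).area      # |A_{s,t}|
        estimates[t] += y_s * a_st / spoly.area

print(estimates)   # {'t1': 53.33..., 't2': 106.66...}
# Pycnophylactic check: the target estimates add back to the source total.
assert abs(sum(estimates.values()) - 160.0) < 1e-9
```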
31 From now on, all subsequent methods require additional auxiliary information except in section 3.3.1.
3.2. Dasymetric weighting
32 Bajat et al. (2011) trace this method back to the 19th century, with George Julius Poulett Scrope mapping classes of global population density in 1833. The word dasymetry was introduced into the English language by Wright (1936). The class of dasymetric weighting methods comprises generalizations of the areal weighting methods. In order to improve upon areal weighting, the idea is to get rid of the assumption of the count density being uniform throughout the source zones, because this assumption is almost never accurate. To reflect density variation within source zones, these methods use other relevant and available information X to distribute Y accordingly. This approach should help allocate Y to the small intersection zones within the sources, provided the relationship between X and Y is of a proportionality type with a strong enough correlation. Of course it replaces the previous assumption by the assumption that the data is proportional to the auxiliary information on any subregion. This raises the question of how to check the validity of this assumption.
33 These methods are described in the literature for an extensive variable Y and extensive auxiliary information X. However, they can be adapted to the case of an intensive Y, as we will see below.
34 There are some classical examples of auxiliary information for socio-demographic count data or other socio-economic trends coming from road structure or remotely sensed urban land cover data. Yuan et al. (1997) observe a high correlation between population counts and land cover types.
35 These methods satisfy the pycnophylactic property.
3.2.1. Ordinary dasymetric weighting
36 It is assumed here that the auxiliary information is known at the intersection level and that it is of a quantitative nature. It might seem difficult to find auxiliary information at intersection level but the following example should convince the user that it is possible. Voss et al. (1999) and Reibel and Bufalino (2005) propose to use the network of road segments with auxiliary variables like length of roads or number of road nodes to allocate demographic characteristics such as population or number of housing units. The weight of a given subzone is then proportional to the aggregated length of streets and roads in that subzone.
37 For the case of an extensive target variable with an extensive auxiliary quantitative variable X, the following formulae extend Equations 4 and 5 by substituting X for the area:
$$\hat{Y}_{s,t} = \frac{X_{s,t}}{X_s}\, Y_s, \qquad (8)$$
38 yielding after the combination step:
$$\hat{Y}_t = \sum_{s=1}^{S} \frac{X_{s,t}}{X_s}\, Y_s. \qquad (9)$$
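The following sketch illustrates ordinary dasymetric weighting for an extensive target variable, assuming hypothetical auxiliary values X (say, road length) already tabulated at the intersection level; all identifiers are illustrative.

```python
# Known source values Y_s and hypothetical auxiliary values X_{s,t} at each
# non-empty source/target intersection.
y_source = {"s1": 100.0, "s2": 60.0}
x_intersection = {
    ("s1", "t1"): 5.0, ("s1", "t2"): 15.0,
    ("s2", "t1"): 8.0, ("s2", "t2"): 2.0,
}

# X_s, the auxiliary total over each source.
x_source = {}
for (s, _), x in x_intersection.items():
    x_source[s] = x_source.get(s, 0.0) + x

# Equation 8 at intersection level, then combination over sources (Equation 9).
estimates = {}
for (s, t), x in x_intersection.items():
    estimates[t] = estimates.get(t, 0.0) + y_source[s] * x / x_source[s]

print(estimates)   # {'t1': 73.0, 't2': 87.0}; sums back to 160, hence pycnophylactic
```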
39 We propose to extend this method to the case of an intensive target variable with weights given by Equation 1 for a given extensive variable Z and an extensive auxiliary quantitative variable X. We define the corresponding extensive and intensive variables by introducing the intensive-to-extensive and extensive-to-intensive transformations of Equations 3 and 2. The following formula is obtained using this intensive-extensive correspondence (see the annex for a proof):
$$\hat{Y}_t = \frac{1}{Z_t}\sum_{s=1}^{S} \frac{X_{s,t}}{X_s}\, Z_s\, Y_s. \qquad (10)$$
40 Similar formulae can be obtained easily in the case Y extensive with X intensive and Y intensive with X intensive.
41 Let us illustrate this method with the toy example introduced at the end of section 2. Figure 6 presents a comparison between the results of the areal weighting method and the dasymetric method for one of the two target variables. Figure 7 does the same for the other. In each target we can see the true value of $Y_t$ (left), the value of the prediction $\hat{Y}_t$ (right) and, below, the relative prediction error $(\hat{Y}_t - Y_t)/Y_t$. We can see that the dasymetric method yields better results than areal weighting for the variable of Figure 6 because of the inhomogeneity within sources (indeed the sum of squared errors is 10 percent smaller for dasymetric). However, for the variable of Figure 7, for which the level of inhomogeneity within sources is not as high, this is not the case and areal weighting does better than dasymetric, with a ratio of sums of squared errors of 48 percent.
3.2.2. Dasymetric weighting with control zones
42 This is the case when the auxiliary information is categorical, its level sets thus defining the so-called control zones. The most classical case, called binary dasymetric mapping, is the case of population estimation when there are two control zones: one which is known to be populated and the other one unpopulated. It is assumed that the count density is uniform throughout control zones. A first step estimates these densities for control zone $c$ by
$$\hat{D}_c = \frac{\sum_{s} Y_s\, \mathbf{1}(S_s \in c)}{\sum_{s} |S_s|\, \mathbf{1}(S_s \in c)},$$
43 where the membership relation $S_s \in c$ may have several meanings (containment, centroid, percent cover). For this method, it is often assumed in the literature that intersection units are nested within control zones, in which case the intersection zone prediction is given by
$$\hat{Y}_{s,t} = Y_s\, \frac{\hat{D}_{c(s,t)}\, |A_{s,t}|}{\sum_{t'} \hat{D}_{c(s,t')}\, |A_{s,t'}|},$$
44 where $c(s, t)$ denotes the control zone which contains the intersection zone $A_{s,t}$. One can see through this formula that this is the same as using ordinary dasymetric weighting with the auxiliary information being a first-step crude estimate of the variable Y, based on the assumption that its corresponding intensive variable (Equation 2) is constant throughout control zones. The assumption that intersection units are nested within control zones is not so restrictive since it can be restated as “the control zones are unions of intersection units”: control zone information being rather coarse, they can be designed to respect this constraint. However, let us prove that this assumption is unnecessary. Indeed, if one denotes by $A_{s,t,c}$ the intersection between source zone $s$, target zone $t$ and control zone $c$, the following gives a prediction for the target values
$$\hat{Y}_t = \sum_{s}\sum_{c} Y_s\, \frac{\hat{D}_c\, |A_{s,t,c}|}{\sum_{t'}\sum_{c'} \hat{D}_{c'}\, |A_{s,t',c'}|}.$$
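A minimal sketch of dasymetric weighting with control zones, written directly from the two formulas above with hypothetical areas and a containment-type assignment of sources to controls, is given below.

```python
# Hypothetical areas |A_{s,t,c}| of the source/target/control intersections,
# observed source values, source areas and a crude assignment of each source
# to a control class used only for the density step.
area_stc = {
    ("s1", "t1", "urban"): 1.0, ("s1", "t2", "rural"): 2.0,
    ("s2", "t1", "rural"): 1.0, ("s2", "t2", "urban"): 2.0,
}
y_source = {"s1": 100.0, "s2": 60.0}
area_source = {"s1": 3.0, "s2": 3.0}
source_control = {"s1": "urban", "s2": "rural"}

# Step 1: estimated density of Y in each control class.
density = {}
for c in ("urban", "rural"):
    num = sum(y_source[s] for s, cc in source_control.items() if cc == c)
    den = sum(area_source[s] for s, cc in source_control.items() if cc == c)
    density[c] = num / den

# Step 2: redistribute each Y_s proportionally to density times intersection area.
estimates = {}
for s in y_source:
    weights = {(t, c): density[c] * a for (ss, t, c), a in area_stc.items() if ss == s}
    total = sum(weights.values())
    for (t, c), w in weights.items():
        estimates[t] = estimates.get(t, 0.0) + y_source[s] * w / total

print(estimates)   # each Y_s is fully redistributed, so the result is pycnophylactic
```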
45 Mennis and Hultgren (2006) illustrate this approach with American census data using land cover auxiliary information coming from manual interpretation of aerial photographs.
3.2.3. Two-step dasymetric weighting
46 This method aims at relieving the constraint of the ordinary dasymetric weighting that the auxiliary information should be known at the intersection level, thus allowing a larger choice of such information. It is assumed here that the information is known at the level of some control zones, which means that the auxiliary information has two components: a quantitative one and a qualitative one. There is a constraint though on the control zones: they should be nested within source zones. The first step is just an ordinary dasymetric step using control zones as targets and the auxiliary information on control zones. In this case, the intersection level is the source-control intersection, which is the same as the control level since controls are nested within sources. The second step consists in areal weighting with the controls as sources (using the controls estimates of the first step) and the original targets as final targets. The homogeneity assumption used in the second step concerns the control level but since control zones are usually smaller than source zones, the assumption is less constraining. Gregory (2002) presents the implementation of this approach with historical British census data.
47 If controls are not nested within sources, the method can be adapted by adding an additional step of areal weighting to distribute the control information on the control-source intersections.
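The two steps can be sketched as follows, with hypothetical controls nested within sources: step 1 is an ordinary dasymetric allocation with controls as targets, and step 2 an areal weighting step from controls to the final targets.

```python
# Hypothetical data: auxiliary X at control level, controls nested within
# sources, control areas and control/target intersection areas.
x_control = {"c1": 4.0, "c2": 1.0, "c3": 5.0}
control_source = {"c1": "s1", "c2": "s1", "c3": "s2"}
y_source = {"s1": 100.0, "s2": 60.0}
area_control = {"c1": 2.0, "c2": 1.0, "c3": 3.0}
area_ct = {("c1", "t1"): 2.0, ("c2", "t1"): 0.5, ("c2", "t2"): 0.5, ("c3", "t2"): 3.0}

# Step 1: ordinary dasymetric with the controls playing the role of targets.
x_source = {}
for c, s in control_source.items():
    x_source[s] = x_source.get(s, 0.0) + x_control[c]
y_control = {c: y_source[s] * x_control[c] / x_source[s] for c, s in control_source.items()}

# Step 2: areal weighting with the controls as sources and the original targets.
estimates = {}
for (c, t), a in area_ct.items():
    estimates[t] = estimates.get(t, 0.0) + y_control[c] * a / area_control[c]

print(estimates)   # the two steps together remain pycnophylactic
```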
3.3. Regression techniques
48 The dasymetric weighting schemes have several restrictions: the assumption of proportionality between Y and X, the fact that the auxiliary information should be known at the intersection level, and the limitation to a unique auxiliary variable (exceptionally two in the case of the two-step dasymetric method). The regression techniques overcome these three constraints. Another characteristic of the dasymetric methods is that, when predicting at the level of an intersection, only the source datum within which the intersection is nested is used for prediction; this will not be the case for regression. In general, the regression techniques involve a regression of the source level data of Y on the target or control values of X. The regression without auxiliary information of section 3.3.1 can be regarded as an extension of the areal weighting method since it relies on the “proportionality to area” principle. The regression with control zones of section 3.3.2 is a regression version of the dasymetric weighting with control zones of section 3.2.2. The regression with auxiliary information at target level of section 3.3.3 can be compared to the ordinary dasymetric weighting of section 3.2.1.
49 These regression methods raise some estimation issues in the sense that very often the target variable is non-negative and one would therefore like the corresponding predictions to satisfy this constraint. In order to address this issue, one sometimes resorts to Poisson regression (as in Flowerdew et al., 1991), to ordinary least squares with constraints on the coefficients (see Goodchild et al., 1993), or to lognormal regression (see Goodchild et al., 1993).
3.3.1. Regression without auxiliary information
50 A first idea discussed in Goodchild et al. (1993) consists in deriving a system of equations linking the known source values to the unknown target values using an aggregation formula and an additional assumption of homogeneity of the target variable on the target zones.
51 In the case of an extensive variable, the homogeneity assumption allows us to allocate Y to intersection units proportionally to their area, yielding the following system
$$Y_s = \sum_{t=1}^{T} \frac{|A_{s,t}|}{|T_t|}\, Y_t, \qquad s = 1, \dots, S.$$
52 For the case of an intensive variable, the homogeneity assumption is that Y is uniform on targets and that its weighting system is given by areal weights. This yields the following relationship between source and target values
$$Y_s = \sum_{t=1}^{T} \frac{|A_{s,t}|}{|S_s|}\, Y_t, \qquad s = 1, \dots, S.$$
53 These systems are then solved using an ordinary least squares procedure forced through the origin provided the number of source units is larger than the number of target units. This last condition is not satisfied for disaggregation problems. In that case, one can adapt the technique by combining it with the use of control zones as in section 3.3.2.
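A minimal sketch of this system for the extensive case, with hypothetical intersection areas and more sources than targets, solved by least squares through the origin with numpy:

```python
import numpy as np

area_st = np.array([        # hypothetical |A_{s,t}|: S = 3 sources (rows), T = 2 targets (columns)
    [1.0, 0.0],
    [0.5, 1.5],
    [0.0, 2.0],
])
y_source = np.array([40.0, 80.0, 70.0])     # known source values Y_s
area_target = area_st.sum(axis=0)           # |T_t|, assuming the sources cover the whole region

# System Y_s = sum_t (|A_{s,t}| / |T_t|) Y_t, solved through the origin (no intercept).
A = area_st / area_target
y_target_hat, *_ = np.linalg.lstsq(A, y_source, rcond=None)
print(y_target_hat)                         # least squares estimates of the unknown Y_t
```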
3.3.2. Regression with control zones
54 Using control zones as in section 3.2.2, Goodchild et al. (1993) propose a two-step procedure where the first step is the technique of section 3.3.1 with controls playing the role of targets. The number of such control zones is chosen by the user and hence can be forced to be smaller than the number of sources, thus relieving the constraint on the number of targets of section 3.3.1. The assumption of homogeneity on targets becomes homogeneity on controls, hence it is not restrictive because the controls are usually built to reflect homogeneity zones for the target variable. At the end of the first step, one can recover estimates of the target variable at the control level. Using the uniformity-on-controls assumption, one gets from the control level to the control-target level. The second step in Goodchild et al. (1993) involves a simple aggregation from the control-target intersection level to the target level with homogeneity weights. Yuan et al. (1998) rather apply a dasymetric second step, which they call “scaling”, using the first step target variable prediction as an auxiliary variable, thus enforcing the pycnophylactic property. Reibel and Agrawal (2006) superimpose a fine grid on the set of source and target zones. They first compute the proportion of each source zone corresponding to each land cover type and then regress the target variable (population) at source level on these proportions. With the estimated coefficients, they can derive a coarse grid cell based map of the population surface. They rescale these estimates to impose the pycnophylactic property. Then with an aggregation formula they get population estimates for any combination of grid cells, namely for target regions.
3.3.3. Regression with auxiliary information at target level
55 This family of methods allows the use of more than one auxiliary variable and of different natures (quantitative or categorical, or a mixture of both). In Flowerdew et al. (1991), the emphasis is on extensive target variables with a Poisson or binomial distribution (case 1 hereafter) and in Flowerdew and Green (1992), it is on intensive target variables with a gaussian distribution (case 2 hereafter). In the gaussian case, it is assumed that the target variable on A is a sample mean of some underlying gaussian variable measured on a number $n_A$ of individuals. Therefore the intensive weights are given by Equation 1 with $Z_A = n_A$ and are approximated by areal weights when the counts are not known. In case 1, we have for example $Y_{s,t} \sim \mathcal{P}(\mu_{s,t})$ in the Poisson case, and similarly in case 2 we have $Y_{s,t} \sim \mathcal{N}(\mu_{s,t}, \sigma^2 / n_{s,t})$, where the means $\mu_{s,t}$ are in both cases functions of some parameters and of the auxiliary information at target level $X_t$. In case 2, moreover, it makes sense to assume that $\mu_{s,t} = \mu_t$.
56 With the EM algorithm.
57 Except for a variant in Flowerdew and Green (1992) (see below), the interpolation problem is cast as a missing data problem, considering the intersection values of the target variable as unknown and the source values as known, therefore allowing the use of the EM algorithm to overcome the difficulty.
58 The algorithm is initialized with areal weighting estimates of the $\hat{Y}_{s,t}$. The E-step consists in calculating the conditional expectation of $Y_{s,t}$ given the known values $Y_s$. In case 1, this yields the following formula
$$E(Y_{s,t} \mid Y_s) = Y_s\, \frac{\mu_{s,t}}{\sum_{t'} \mu_{s,t'}},$$
59 which yields the following predictor $\hat{Y}_{s,t} = Y_s\, \hat{\mu}_{s,t} / \sum_{t'} \hat{\mu}_{s,t'}$, and it is clear that the pycnophylactic property is satisfied.
60 In case 2, the corresponding formula is
$$E(Y_{s,t} \mid Y_s) = \mu_t + (Y_s - \mu_s),$$
61 where $\mu_s$ is obtained from the $\mu_t$ by applying the aggregation formula to the sources subdivided into the intersections and by taking expectation on both sides, yielding
$$\mu_s = \sum_{t=1}^{T} \frac{n_{s,t}}{n_s}\, \mu_t. \qquad (11)$$
62 Therefore the E-step yields the following predictor $\hat{Y}_{s,t} = \hat{\mu}_t + (Y_s - \hat{\mu}_s)$, where the $\hat{\mu}_t$ come from the previous step and the $\hat{\mu}_s$ from the estimation version of (11).
63 One can then check that this step enforces the pycnophylactic property since
$$\sum_{t=1}^{T} \frac{n_{s,t}}{n_s}\, \hat{Y}_{s,t} = \sum_{t=1}^{T} \frac{n_{s,t}}{n_s}\, \hat{\mu}_t + (Y_s - \hat{\mu}_s) = \hat{\mu}_s + Y_s - \hat{\mu}_s = Y_s.$$
64 In the M-step, the intersection values obtained at the previous E-step are considered as i.i.d. observations from the Poisson distribution in case 1 and from the gaussian distribution in case 2. Recall that in both cases the intersection means are functions of some parameters and of the auxiliary information at target level, plus possibly some information at intersection level such as the areas of the intersections. For example, in case 1, Flowerdew et al. (1991) consider population as target variable and geology as auxiliary information, assuming that the population density will be different in clay areas ($\lambda_1$) and in limestone areas ($\lambda_2$), so that $\mu_{s,t} = \lambda_t\, |A_{s,t}|$, where $\lambda_t$ is either $\lambda_1$ or $\lambda_2$ depending on whether target zone $t$ is in the clay or the limestone area. One then performs maximum likelihood with a Poisson regression in case 1 and weighted least squares in case 2.
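A minimal sketch of these EM iterations for case 1, using the clay/limestone example with hypothetical counts and areas (the two densities and the intersection allocations are updated in turn):

```python
import numpy as np

area_st = np.array([[1.0, 2.0],       # hypothetical |A_{s,t}|, 3 sources x 2 targets
                    [2.0, 1.0],
                    [1.5, 1.5]])
y_source = np.array([50.0, 20.0, 30.0])   # observed counts Y_s
geology = np.array([0, 1])                # target 1 on clay (0), target 2 on limestone (1)

# Initialization: areal weighting estimates of the intersection counts.
y_st = area_st / area_st.sum(axis=1, keepdims=True) * y_source[:, None]

for _ in range(100):
    # M-step: maximum likelihood estimates of the two densities lambda_1, lambda_2.
    lam = np.array([y_st[:, geology == g].sum() / area_st[:, geology == g].sum()
                    for g in (0, 1)])
    mu_st = lam[geology] * area_st        # Poisson means mu_{s,t} = lambda_t |A_{s,t}|
    # E-step: E(Y_{s,t} | Y_s) allocates each Y_s proportionally to mu_{s,t}.
    y_st = mu_st / mu_st.sum(axis=1, keepdims=True) * y_source[:, None]

y_target_hat = y_st.sum(axis=0)           # combination step
print(lam, y_target_hat)                  # estimated densities and target predictions
```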
65 Without the EM algorithm.
66 In case 2, Flowerdew and Green (1992) describe a simplified alternative version for the case when one is ready to make the uniform target zone assumption. Namely, since the auxiliary information X is available at target zone level, it does not hurt to assume $\mu_{s,t} = \mu_t$. Let $X$ denote the $T \times p$ design matrix, where $p$ is the number of explanatory factors in X and T the number of targets, let $Y^S$ denote the $S \times 1$ vector of source values, $Y^T$ the $T \times 1$ vector of target values, and W the weights matrix whose elements are given by $w_{s,t} = n_{s,t} / n_s$. If we combine the following information:
- the relation between Y and X at target level: $E(Y^T) = X\beta$,
- the aggregation equation $Y_s = \sum_{t} w_{s,t}\, Y_{s,t}$,
- the uniformity at target level assumption $E(Y_{s,t}) = \mu_t$,
70 we get the following regression equation
$$E(Y^S) = W X \beta \qquad (12)$$
71 between the target means at source level and the auxiliary information at target level. Using the data at the source level and Equation 12, we can estimate the parameters $\beta$ by weighted least squares with weights $n_s$. Then $\hat{Y}_t = X_t \hat{\beta}$ is a prediction for $Y_t$.
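A minimal sketch of this weighted least squares step with hypothetical data (design matrix at target level, weight matrix W and counts n_s used as regression weights):

```python
import numpy as np

X = np.array([[1.0, 0.2],            # hypothetical T x p design matrix at target level
              [1.0, 0.8],
              [1.0, 0.5]])
W = np.array([[0.6, 0.4, 0.0],       # hypothetical S x T weight matrix w_{s,t} (rows sum to 1)
              [0.0, 0.3, 0.7],
              [0.5, 0.0, 0.5],
              [0.2, 0.5, 0.3]])
y_source = np.array([10.0, 14.0, 12.0, 11.0])    # intensive source values Y_s
n_source = np.array([120.0, 80.0, 100.0, 60.0])  # counts n_s

D = W @ X                                        # regressors of Equation 12
sw = np.sqrt(n_source)                           # weighted least squares via row rescaling
beta, *_ = np.linalg.lstsq(D * sw[:, None], y_source * sw, rcond=None)
y_target_hat = X @ beta                          # predictions X_t beta at target level
print(beta, y_target_hat)
```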
72 Let us consider again the toy example defined earlier to illustrate this technique, adapted to the case of Poisson regression. Figure 8 (respectively 9) compares the results of this regression technique with the dasymetric method based on the same auxiliary information for the two target variables of Figures 6 and 7. For one of the two variables, the regression method is better than the dasymetric one, with a ratio of sums of squared errors of 12 percent. For the other, however, the dasymetric method is better than the regression, with a ratio of sums of squared errors of 82 percent. The reason is that the latter variable has been constructed to be almost proportional to X (which is in line with the spirit of dasymetric) whereas the former is not. Note that the dasymetric method uses more information than the regression method because it uses the auxiliary values at intersection level whereas the regression method uses them at target level.
73 Alternative with control zones.
74 In case 2, Flowerdew and Green (1992) consider another alternative with a set of control zones, assuming that the auxiliary information is at control zone level and that it is reasonable to believe that the means are uniform on controls. The same arguments as above then yield the equation
$$E(Y^S) = W X^C \beta, \qquad (13)$$
75 where $X^C$ denotes the $C \times p$ design matrix, with C the number of control zones, $Y^C$ denotes the $C \times 1$ vector of control values, and W is the weight matrix at the source-intersection-control level. Using the data at the source level and Equation 13, we can estimate the parameters $\beta$ by weighted least squares with weights $n_s$. Then $\hat{Y}_c = X_c \hat{\beta}$ and, using the aggregation equation for target and control, one gets that $\hat{Y}_t = \sum_{c} w_{t,c}\, \hat{Y}_c$ is a prediction for $Y_t$. Note that one needs two sets of weights, $w_{s,c}$ and $w_{t,c}$.
3.4. A short overview of more elaborate methods
3.4.1. Other regression methods
76 In this section, we briefly describe alternative regression methods. A detailed development of these more sophisticated techniques would require many more tools and notations. Because one of our objectives is to give priority to the practitioner's point of view, we do not develop them in this presentation but just give some of the main references. Murakami and Tsutsumi (2011) combine the EM algorithm approach of Flowerdew and Green with a spatial econometrics regression model to take into account spatial autocorrelation at the intersection unit level. Mugglin and Carlin (1998) propose a hierarchical bayesian version of the Poisson regression method of Flowerdew et al. (1991) with a Markov chain Monte Carlo estimation step and illustrate it on disease counts. The advantage of the hierarchical bayesian approaches is that they provide full posterior distribution estimates enabling accuracy evaluation, but their approach requires that the spatial support of the auxiliary information be nested within both target and source units. Mugglin et al. (2000) extend this approach by introducing Markov random field priors on the source and target mean parameters: this allows them to introduce some spatial autocorrelation in the model. They illustrate their approach with population counts reallocation with 39 sources and 160 targets. Huang et al. (2002) introduce multiresolution tree structured autoregressive models.
3.4.2. Smoothing techniques
77 Initially meant for visual display and exploratory analysis, smoothing techniques can solve the point-to-point or the area-to-point interpolation problems. By laying a fine lattice over the study area and predicting the target variable at each lattice node, they enable mapping the target variable. However, they can be used as an intermediate step towards area-to-area interpolation in the sense that, once a point prediction is obtained, it is enough to use aggregation rules (integrate the point prediction) to obtain target zone predictions.
78 In this sense, choropleth mapping is a coarse interpolation technique, which amounts, in the intensive variable case, to allocating the areal data value to any point within the support of the corresponding source unit.
79 Martin (1989) and Bracken and Martin (1991) propose an adaptive kernel density estimation from the target variable values collapsed at the centroids of the source zones. This method is not pycnophylactic. A similar kernel based method is described in Grasland et al. (2000) with a discussion of the relationship between the choice of the bandwidth parameter and the level of aggregation of the initial information.
80 Tobler (1979) introduces a spline based approach for area-to-point interpolation. His predictor is a discrete approximation (finite difference algorithm) of the solution to an optimization problem defining a type of interpolating spline with a smoothness criterion based on second partial derivatives. He includes additional constraints such as non-negative point predictions and mass preservation. His choice of smoothness criterion has been criticized by Dyn et al. (1979). In contrast with Tobler's method, which requires a regular grid of prediction points, Rase (2001) adapts Tobler's procedure by replacing the regular grid with a triangulation of the space based on the observed centroid locations, and by using kernel smoothing with inverse distance weighting instead of splines.
81 Kyriakidis (2004) casts the problem into a geostatistical framework. Indeed, the reverse problem of point-to-area interpolation is solved by block kriging in geostatistics, which is classical because of mining practices: it is of interest, for example, to predict the total ore content of an area knowing the point data values. Kyriakidis (2004) shows that the area-to-point problem can be solved with similar methods but requires the modeling of all area-to-area and area-to-point covariances. The resulting prediction satisfies the pycnophylactic property. Moreover, he proves that choropleth mapping, kernel smoothing and Tobler's pycnophylactic method can be viewed as particular cases of his framework, corresponding to various ways of specifying the covariance model (choropleth mapping corresponding to the absence of correlation at the point support level). A very interesting aspect of the method is that it offers a measure of reliability (a standard error for each point prediction). The method can accommodate constraints such as maximum or minimum allowable values or prescribed values of the target variable: for example, a zero population value over water bodies or high altitude regions. The method can handle large problems, possibly using moving local neighborhoods. Yoo et al. (2010) adapt it to accommodate more general constraints such as non-negativity. However, estimating the point covariance from areal data is difficult: it is possible, for example, with a maximum likelihood procedure based on a multivariate gaussian assumption. Liu et al. (2008) propose to combine this approach with regression in an area-to-point residual kriging approach, which can be used to disaggregate the regression residuals. Other generalizations can be found in Kelsall and Wakefield (2002) with log-normal kriging.
Table 1. Synoptic classification of the methods

| Methods | Target variable Y: nature | Target variable Y: additional assumption | Auxiliary variable X: dimension | Auxiliary variable X: nature | Auxiliary variable X: support | Control zones | Pycnophylactic property |
|---|---|---|---|---|---|---|---|
| Areal weighting | Extensive or intensive | Homogeneous on sources | none | none | none | none | yes |
| Dasymetric | Extensive or intensive | None | 1 | Extensive or intensive | Intersections | none | yes |
| Dasymetric with control zones | Extensive or intensive | Homogeneous on controls | 1 | Categorical | Controls | yes | no |
| Regression without auxiliary info | Extensive or intensive | Homogeneous on targets | none | none | none | none | no |
| Regression with auxiliary info | Extensive | None | >= 1 | Extensive or intensive | Targets | none | no |
| Regression with auxiliary info | Intensive | Areal weights | >= 1 | Extensive or intensive | Targets | none | no |
| Point in polygon | Extensive | None | none | none | none | none | no |
4. Conclusion
82 We have described the main classes of methods for the area-to-area spatial interpolation problem including proportional weighting schemes also called dasymetric methods, smoothing techniques and regression based interpolation. As we pointed out in the introduction, we have focused on the basic methods, which are more likely to be adopted by practitioners, and a summary of the main characteristics of these methods can be found in Table 1.
83 We have not addressed in this review the case of categorical target variable. Chakir (2009) proposes a technique for reallocating multinomial type data (namely land use shares) given sampled information at a disaggregated level and observation of aggregated land use shares with a generalized cross-entropy approach.
84 In terms of implementation of these methods in standard software, there is not much available. Bloom et al. (1996) describe their implementation of areal weighting from Flowerdew et al. (1991) with Mapinfo. With R, it is possible to use the “pycno” package by C. Brunsdon. From our experience with some real data cases, we believe that in large real applications the more sophisticated methods are not yet manageable because of size problems and are far too complicated to communicate to the typical users in public offices. Simplicity and convenience considerations are certainly the prime arguments for the best choice.
85 As mentioned in the introduction, one motivation of this paper was to be a first step towards a further study of the comparative precision of these prediction methods. Let us briefly summarize what can be found in the literature so far. Overall one finds two points of view: a methodological one and an empirical one. Unfortunately, there is not much from the methodological point of view, since we only found the work of Sadahiro (2000), who considers the point-in-polygon approach and compares it to the areal weighting scheme. He uses a counting process with a fixed number of i.i.d. points with a given density to model the target variable distribution. The target zone is modeled with a fixed shape but a random position. The sources realize a tiling partition of the space with geometric shapes (considered as unbounded to avoid boundary problems). The last step of the evaluation is of an empirical nature. He finds that the accuracy of point-in-polygon depends upon the target zone size (the bigger, the better) and the concentration of the underlying distribution of points. One needs a concentration of points around the representative point in an area of at most 12-15 percent of the total for the point-in-polygon method to compare favorably with the areal weighting method, which is quite unrealistic in applications. He also studies the optimal location of representative points, which is found to be at the spatial median of the source zone.
86 The rest of this literature contains many papers of an empirical nature. The comparison of areal weighting with the alternative dasymetric methods is found in Reibel and Bufalino (2005), Voss (1992), Mennis (2006), Fisher and Langford (1995), Gregory (2002). The dasymetric methods are always found to have better performance than the simple areal weighting with reported improvements up to 71 per cent in relative mean square error (Reibel and Bufalino, 2005).
87 The comparison of regression methods with several alternatives is found in Flowerdew and Green (1992), Flowerdew et al. (1991), Reibel and Agrawal (2007), Gregory (2002). Flowerdew et al. (1991) find that the EM algorithm regression for the Poisson or binomial models performs better than areal weighting by factors of 50 − 60 per cent (Poisson case) and 25 − 55 per cent (Binomial case) in target deviance. Murakami and Tsutsumi (2011) compare their spatial regression method to more classical regression approaches and find that their spatial lag model performs better. Overall regression methods are found to perform better than dasymetric methods.
88 For the smoothing methods, Goodchild and Lam (1980) compare areal weighting and Tobler's pycnophylactic interpolation and they do not report any significant advantage for the smoothing method. This may be due to the fact that “count density gradients are not in fact typically smooth up to and beyond tract boundaries” (from Reibel and Agrawal, 2007).
89 Finally, it is important to point out that the only methods that come along with an accuracy measure are area-to-point kriging and the hierarchical bayesian methods. We think that more attention should be paid to systematic comparisons of the relative accuracies of all these methods in the future.
Acknowledgments
We would like to thank A. Ruiz-Gazen and C. Vignes for fruitful discussions. This work was supported by the French Agence Nationale de la Recherche through the ModULand project (ANR-11-BSH1-005).

Annex: Ordinary dasymetric weighting for intensive target variable and extensive auxiliary variable
90 For an intensive target variable with weights given by Equation 1 for an extensive variable Z, and an extensive auxiliary quantitative variable X, we define the corresponding extensive variable using (3) by $\tilde{Y}_A = Y_A\, Z_A$ and recover an intensive prediction using (2). Applying Equation 8 to $\tilde{Y}$ we have
$$\hat{\tilde{Y}}_{s,t} = \frac{X_{s,t}}{X_s}\, \tilde{Y}_s = \frac{X_{s,t}}{X_s}\, Z_s\, Y_s,$$
91 and therefore
$$\hat{Y}_{s,t} = \frac{\hat{\tilde{Y}}_{s,t}}{Z_{s,t}} = \frac{X_{s,t}}{X_s}\, \frac{Z_s}{Z_{s,t}}\, Y_s,$$
92 which is similar to Equation 8. After the combination step, we get
$$\hat{Y}_t = \sum_{s=1}^{S} \frac{Z_{s,t}}{Z_t}\, \hat{Y}_{s,t} = \frac{1}{Z_t}\sum_{s=1}^{S} \frac{X_{s,t}}{X_s}\, Z_s\, Y_s.$$
References
- Bajat B, Krunic N, Kilibarda M (2011) Dasymetric mapping of spatial distribution of population in Timok Region. Proceedings of the International Conference Professional Practice and Education in Geodesy and Related Fields, Kladovo-Djerdap, Serbia.
- Bloom LM, Pedler PJ, Wragg GE (1996) Implementation of enhanced areal interpolation using Mapinfo. Computers and Geosciences 22(5): 459-466.
- Chakir R (2009) Spatial Downscaling of agricultural land-use data: an econometric approach using cross-entropy. Land Economics 85(2): 238-251.
- Dyn N, Wahba G, Wong WH (1979) Comment on “Smooth pycnophylactic interpolation for geographical regions” by Tobler. Journal of the American Statistical Association 74(367): 530-535.
- Fisher P, Langford M (1996) Modelling sensitivity to accuracy in classified imagery: a study of areal interpolation. The Professional Geographer 48: 299-309.
- Flowerdew R, Green M, Kehris E (1991) Using areal interpolation methods in geographical information systems. Papers in Regional Science 70(3): 303-315.
- Flowerdew R, Green M (1992) Developments in areal interpolation methods and GIS. The Annals of Regional Science 26(1): 67-78.
- Goodchild MF, Anselin L, Deichman U (1993) A framework for the areal interpolation of socio-economic data. Environment and Planning A 25: 383-397.
- Goodchild MF, Lam N (1980) Areal interpolation: a variant of the traditional spatial problem. Geo-Processing 1: 297-312.
- Gotway CA, Young LJ (2002) Combining incompatible spatial data. Journal of the American Statistical Association 97(458): 632-648.
- Grasland C, Mathian H, Vincent J-M (2000) Multiscalar analysis and map generalisation of discrete social phenomena: statistical problems and political consequences. Statistical Journal of the United Nations ECE 17 IOS Press: 157-188.
- Gregory IN (2002) The accuracy of areal interpolation techniques: standardizing 19th and 20th century census data to allow long-term comparisons. Computers, environments and urban systems 26(4): 293-314.
- Kelsall J, Wakefield J (2002) Modeling spatial variation in disease risk: a geostatistical approach. Journal of the American Statistical Association 97: 692-701.
- Kyriakidis PC (2004) A geostatistical framework for areal-to-point spatial interpolation. Geographical Analysis 36(3): 259-289.
- Lam NS (1982) An evaluation of areal interpolation methods. Proceedings, Fifth International Symposium on Computer-Assisted Cartography (AutoCarto 5) 2: 471-479.
- Langford M (2007) Rapid facilitation of dasymetric-based population interpolation by means of raster pixel maps. Computers, Environment and Urban Systems 31: 19-32.
- Liu XH, Kyriakidis PC, Goodchild MF (2008) Population-density estimation using regression and area-to-point residual kriging. International Journal of geographical information science 22(4): 199-213.
- Martin D (1989) Mapping population data from zone centroid locations. Transactions of the Institute of British Geographers, New Series 14(1): 90-97.
- Martin D, Bracken I (1991) Techniques for modeling population-related raster databases. Environment and Planning A 23(7): 1069-1075.
- Mugglin AS, Carlin BP (1998) Hierarchical modeling in geographic information systems: population interpolation over incompatible zones. Journal of Agricultural, Biological and Environmental Statistics 3(2): 111-130.
- Mugglin AS, Carlin BP,Gelfand A E (2000) Fully model-based approaches for spatially misaligned data. Journal of the American Statistical Association 95(451): 877-887.
- Murakami D, Tsutsumi M (2011) A new areal interpolation technique based on spatial econometrics. Procedia-Social and Behavioral Sciences 21: 230-239.
- Rase W (2001) Volume-preserving interpolation of a smooth surface from polygon-related data. Journal of Geographical Systems 3: 199-213.
- Reibel M, Bufalino ME (2005) Street-weighted interpolation techniques for demographic count estimation in incompatible zone systems. Environment and Planning A 37(1): 127-139.
- Reibel M, Agrawal A (2007) Areal interpolation of population counts using pre-classified land cover data. Population Research and Policy Review 26(5-6): 619-633.
- Sadahiro Y (2000) Accuracy of count data estimated by the point-in-polygon method. Geographical Analysis 32(1): 64-89.
- Tobler WR (1979) Smooth pycnophylactic interpolation for geographical regions. Journal of the American Statistical Association 74(367): 519-530.
- Vignes C, Rimbourg S, Ruiz-Gazen A, Thomas-Agnan C (2013) Fiches méthodologiques, Méthodes statistiques d'allocation spatiale : interpolation de données surfaciques. Technical report.
- Voss PR, Long DD, Hammer RB (1999) When census geography does not work: using ancillary information to improve the spatial interpolation of demographic data. CDE working paper, 99-26, Wisconsin, Madison.
- Wright JK (1936) A method of mapping densities of population with Cape Cod as an example. Geographical Review 26(1): 103-110.
- Yoo E, Kyriakidis PC, Tobler W (2010) Reconstructing population density surfaces from areal data: a comparison of Tobler's pycnophylactic interpolation method and area-to-point kriging. Geographical Analysis 42(1): 78-98.
- Yoo E, Kyriakidis PC (2006) Area-to-point kriging with inequality-type data. Journal of geographical systems 8(4): 357-390.
- Yuan Y, Smith RM, Limp WF (1997) Remodeling census population with spatial information from Landsat TM imagery. Computers, Environment and Urban Systems 21(3): 245-258.
Keywords: pycnophylactic property, polygon overlay problem, spatial disaggregation, change of support, areal spatial interpolation
Published online: 26/06/2015
https://doi.org/10.3917/reru.151.0027