Article de revue

Spatial reallocation of areal data – another look at basic methods

Pages 27 à 58

Citer cet article


  • Huyen Do, V.,
  • Thomas-Agnan, C.
  • et Vanhems, A.
(2015). Spatial reallocation of areal data – another look at basic methods. Revue d’Économie Régionale & Urbaine, mai(1), 27-58. https://doi.org/10.3917/reru.151.0027.

  • Huyen Do, Van.,
  • et al.
« Spatial reallocation of areal data – another look at basic methods ». Revue d’Économie Régionale & Urbaine, 2015/1 mai, 2015. p.27-58. CAIRN.INFO, shs.cairn.info/revue-d-economie-regionale-et-urbaine-2015-1-page-27?lang=en.

  • HUYEN DO, Van,
  • THOMAS-AGNAN, Christine
  • et VANHEMS, Anne,
2015. Spatial reallocation of areal data – another look at basic methods. Revue d’Économie Régionale & Urbaine, 2015/1 mai, p.27-58. DOI : 10.3917/reru.151.0027. URL : https://shs.cairn.info/revue-d-economie-regionale-et-urbaine-2015-1-page-27?lang=en.

https://doi.org/10.3917/reru.151.0027


1.  Introduction

1 The origin of this work is in a collaboration with a French administration, the Midi-Pyrenées Dreal (Direction Régionale Environnement Aménagement Logement) about the merge of several administrative data bases with different spatial support. It was necessary for example to disaggregate the number of housing units, originally available at the commune level, on a fine regular square grid. Similarly, many administrative agencies nowadays are facing the problem of merging information from different administrative origins collected on several incompatible partitions of the zone of interest into spatial units. An easy way to combine data on incompatible supports is to align them on a common grid. For this reason, the Eu directive Inspire (2007), INfrastructure for SPatial InfoRmation, states principles to “give infrastructure for spatial information to support community environment policies”. One of its objectives is to ensure that “it is possible to combine spatial data and services from different sources across the community in a consistent way and share them between several users and applications” and one requirement is that reference data should “enable merging of data from various sources”.

2 The reasons for the existence of incompatible partitions is a historical lack of coordination between collecting agencies, each using its favorite spatial division. Another origin can be the changes of administrative boundaries through time so that the combination of data from different historical periods results in incompatible spatial supports. The support of spatial data refers to the spatial domain informed by each characteristic. It is often that one needs to combine national census statistics with other sources of data, for example in geomarketing or natural sciences. Other examples of such situations arise when some planification task is undertaken such as where a new school or store should be located and the planners need to transfer census data to their particular catchment areas. Even when it is possible to reaggregate the data from the individual level, this solution is time consuming and expensive and may raise confidentiality problems. An easy way to combine data on several different supports is to align them on a common grid and to reallocate all sources to this single target partition. This option (called carroyage in French) is currently being exploited in France at Insee.

3 This problem is also referred to as the areal interpolation problem. More generally, the change of support problem may involve point-to-point, area-to-point or point-to-area interpolation. For example, the point interpolation problem is the case of a target variable available for a set of point locations and needed at another location where the data is not available. Gotway and Young (2002) describe these different types and give an overview of the methods. We will focus here on the area-to-area case with a particular emphasis on disaggregation. A discussion of some methods relative to this framework can also be found in Goodchild et al. (1993) but we go one step further in the degree of formalization and unification.

4 After introducing the vocabulary and definitions in section 2, we will see that there are three main types of such techniques in section 3. The first type is the family of proportional weighting schemes, also called dasymetric methods, which are illustrated in Yuan et al. (1997), Voss et al. (1979), Reibel and Bufalino (2005), Mennis and Hultgren (2006) and Gregory (2002). The second type is made of regression based interpolation and can be found in Flowerdew et al. (1991), Godchild et al. (1993), Flowerdew and Green (1992) for the simplest ones. The third type comprises smoothing techniques, which are described for example in Tobler (1979), Martin (1989), Bracken and Martin (1991), Rase (2001) and Kyriakidis (2004). The set of methods can be classified by the type of variable they apply to (continuous or discrete, extensive or intensive), the volume preserving property satisfaction (pycnophylactic property), the presence of auxiliary information, and the use of simplifying assumptions. We could not provide in a single paper the same level of details for the simple methods and for more complex ones, therefore we decided to concentrate on the simple ones, which are the ones more likely to be adopted by practitioners and to give just some of the main references for the more complex methods. We use a simulated toy example to illustrate some of the methods. In order to ease the practitioner’s choice, we present a synoptic table (Table 1) to summarize this classification. We believe that presenting the methods in such a unified way can help the practitioners clarifying the relationships between the very diverse presentations found in the literature. Note that a more detailed and lengthy presentation for practitioners has been written for the Dreal (Vignes et al., 2013). This work of clarification also leads us to find extensions of some of these methods to new cases: for example in section 3.2.1, we extend the ordinary dasymetric weighting method to the case of an intensive target variable Y and an extensive auxiliary quantitative variable X and in section 3.2.2, we show that the assumption of intersection units nested within control zones is unnecessary. Finally, this approach helped us laying the groundwork for a future mathematical evaluation of the respective accuracy of the methods. A brief point on the current state of the literature about accuracy and software issues is done in the conclusion.

2.  Data, definitions and notations

5 The variable of interest that needs to be interpolated is called the target variable and it needs to have a meaning on any subregion of the given space. Description de l'image par IA : Y majuscule indice D majuscule

will denote the value of the target variable on the subregion D of the region of interest Ω. We restrict attention to the case of quantitative target variables (see for example Chakir (2009) for the case of categorical target variables).

6 In the general area-to-area reallocation problem, the original data for the target variable is available for a set of source zones that will be denoted by Description de l'image par IA : S majuscule indice s

; s = 1,..., S and has to be transferred to an independent set of target zones that will be denoted by Description de l'image par IA : T majuscule indice t ; t = 1, ..., T. The variable Description de l'image par IA : Y majuscule indice S majuscule sub-indice s will be denoted by Description de l'image par IA : Y majuscule indice s for simplicity and similarly Description de l'image par IA : Y majuscule indice T majuscule sub-indice i by Description de l'image par IA : Y majuscule indice t. The source zones and target zones are not necessarily nested and their boundaries do not usually coincide. Figure 1 illustrates these two partitions of the region of interest.

Figure 1

Source zones, target zones and grid target zones

Description de l'image par IA : Three irregularly shaped zones labeled S1, S2, and S3, with S1 and S3 larger than S2.
Description de l'image par IA : A map divided into nine labeled zones, each marked with a distinct identifier (T1 to T9).
Description de l'image par IA : A map with a grid overlay, showing various zones within an irregular boundary.

Source zones, target zones and grid target zones

7 With a set of source zones and target zones, one can create a set of doubly indexed intersection zones Description de l'image par IA : A majuscule indice s virgule t position de base égale S majuscule indice s position de base intersection de la famille T majuscule indice t

, s standing for the index of the source zone and t for that of the target zone. For simplicity, Description de l'image par IA : Y majuscule indice A majuscule sub-indice s virgule t will be denoted by Description de l'image par IA : Y majuscule indice s virgule t. Figure 2 illustrates the partition with intersection zones with a zoom on a particular target on the left. Many methods involve the areas of different subregions (sources, targets or other). We will denote by |A| the area of any subregion A.

Figure 2

Intersection zones

Description de l'image par IA : Two shapes with labeled sections, one transforming into another with an arrow.

Intersection zones

8 Most methods will then first proceed to the interpolation of the data from the source to the intersection and in a second step combine the interpolated intersection values to get the target interpolated values. This combination step will require an aggregation rule: one needs to explain how the value of the target variable Y on a zone Ω,  Description de l'image par IA : Y majuscule indice Oméga majuscule en normal

,  relates to the value of Y on a set of subzones Description de l'image par IA : Oméga majuscule en normal indice kk = 1,..., p forming a partition of Ω. The literature distinguishes between two types of aggregation rules. Let us start with two examples: population and population density. The overall population Description de l'image par IA : P majuscule indice Oméga majuscule en normal of a region Ω is obtained by simple summation of the population of each subregion Description de l'image par IA : P majuscule indice Oméga majuscule en normal sub-indice k. Same is true for any counting variable and such variables are named extensive. Otherwise stated, an extensive variable is a variable, which is expected to take half the region’s value in each half of the region. Now the population density Description de l'image par IA : Y majuscule indice Oméga majuscule en normal of the region Ω can be obtained from the densities of the subregions Description de l'image par IA : Y majuscule indice Oméga majuscule en normal sub-indice k by a weighted average with weights given by Description de l'image par IA : Equation with fraction involving Ω and k.,  since

Description de l'image par IA : Y majuscule indice Oméga majuscule en normal position de base égale début fraction sommation début souscript k fin scripts P majuscule indice Oméga majuscule en normal sub-indice k sub position de base sur début valeur absolue Oméga majuscule en normal fin valeur absolue fin fraction égale sommation début souscript k fin scripts début fraction début valeur absolue Oméga majuscule en normal indice k position de base fin valeur absolue sur début valeur absolue Oméga majuscule en normal fin valeur absolue fin fraction début fraction P majuscule indice Oméga majuscule en normal sub-indice k sub position de base sur début valeur absolue Oméga majuscule en normal indice k position de base fin valeur absolue fin fraction égale sommation début souscript k fin scripts w indice Oméga majuscule en normal sub-indice k position de base Y majuscule indice Oméga majuscule en normal sub-indice k

9 This type of variable is called intensive with weights Description de l'image par IA : w indice Oméga majuscule en normal sub-indice k

. More generally linear aggregation takes the general form

Description de l'image par IA : Y majuscule indice Oméga majuscule en normal position de base égale sommation début souscript k fin scripts w indice Oméga majuscule en normal sub-indice k position de base Y majuscule indice Oméga majuscule en normal sub-indice k

10 for a set of weights Description de l'image par IA : w indice Oméga majuscule en normal sub-indice k

. If all weights are equal to 1, the variable is called extensive and it is called intensive otherwise. For variables such as population density, we will make use of the following areal weights matrix: the (st) element of the areal weights matrix W is given by the ratio Description de l'image par IA : Equation with fraction: W sub s,f equals A sub s,f divided by S sub s. which is the share of the area of source zone s that lies in target zone t. Another example of intensive variable is given by the average price of housing units in a given subregion for a data set of house prices. In this case, the weighting scheme is different and is given by Description de l'image par IA : w indice Oméga majuscule en normal sub-indice k position de base égale début fraction n indice k position de base sur n fin fraction,  where Description de l'image par IA : n indice k is the number of housing units in Description de l'image par IA : Oméga majuscule en normal indice k and n is the total number of housing units Description de l'image par IA : Mathematical equation showing the summation of nk from k=1 to n.. More generally, proportions and rates are intensive variables. Although never really stated, the weights are not allowed to depend upon Y but they may be related to another extensive variable Z by

Description de l'image par IA : w indice Oméga majuscule en normal sub-indice k sub position de base égale début fraction Z majuscule indice Q majuscule sub-indice k sub position de base sur Z majuscule indice Q majuscule position de base fin fraction barre oblique inversée parenthèse droite barre oblique inversée parenthèse droite barre oblique inversée parenthèse droite barre oblique inversée parenthèse droite barre oblique inversée parenthèse droite 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 2 2 7 7

11 In that case, note that Description de l'image par IA : w indice Oméga majuscule en normal position de base égale 1

  and that Description de l'image par IA : sommation début souscript k fin scripts w indice Oméga majuscule en normal sub-indice k position de base égale 1. These notions of extensive/intensive variables are also found in physics. Some variables are neither extensive nor intensive: the target variable Description de l'image par IA : Y majuscule indice A majuscule defined by the maximum price on the subregion A is neither extensive nor intensive.

12 Let us show that it is always possible to associate an intensive variable to a given extensive variable by the following scheme. If Y is extensive, and if Description de l'image par IA : w indice A majuscule

is a weighting scheme of the form Equation 1, the variable

Description de l'image par IA : suscrire Y majuscule avec tilde indice A majuscule position de base égale début fraction Y majuscule indice A majuscule position de base sur Z majuscule indice A majuscule position de base fin fraction parenthèse gauche 2 parenthèse droite

13 is intensive since

Description de l'image par IA : suscrire Y majuscule avec tilde indice Oméga majuscule en normal position de base égale début fraction sommation début souscript k fin scripts Y majuscule indice Oméga majuscule en normal sub-indice k sub position de base sur Z majuscule indice Oméga majuscule en normal position de base fin fraction égale sommation début souscript k fin scripts début fraction Z majuscule indice Oméga majuscule en normal sub-indice k sub position de base sur Z majuscule indice Oméga majuscule en normal position de base fin fraction début fraction Y majuscule indice Oméga majuscule en normal sub-indice k sub position de base sur Z majuscule indice Oméga majuscule en normal sub-indice k sub position de base fin fraction égale sommation début souscript k fin scripts w indice Oméga majuscule en normal sub-indice k sub position de base suscrire Y majuscule avec tilde indice Oméga majuscule en normal sub-indice k sub position de base point

14 Reversely, if one starts from an intensive variable Y with weighting scheme Description de l'image par IA : w indice A majuscule

of the form Equation 1, it can be transformed into an extensive variable by

Description de l'image par IA : suscrire Y majuscule en gras avec tilde indice A majuscule position de base égale Z majuscule indice A majuscule position de base Y majuscule indice A majuscule position de base suscrire Y majuscule en gras avec tilde en gras indice A majuscule position de base égale Z majuscule indice A majuscule position de base Y majuscule indice A majuscule position de base suscrire Y majuscule en gras avec tilde en gras indice A majuscule position de base égale Z majuscule indice A majuscule position de base Y majuscule indice A majuscule position de base suscrire Y majuscule en gras avec tilde en gras indice A majuscule position de base égale suscrire Y majuscule en gras avec tilde en gras indice A majuscule position de base

15 Indeed we have

Description de l'image par IA : suscrire Y majuscule avec tilde indice Oméga majuscule en normal position de base égale Z majuscule indice Oméga majuscule en normal position de base Y majuscule indice Oméga majuscule en normal position de base égale Z majuscule indice Oméga majuscule en normal position de base sommation début souscript k fin scripts w indice Oméga majuscule en normal sub-indice k sub position de base Y majuscule indice Oméga majuscule en normal sub-indice k sub position de base égale sommation début souscript k fin scripts Z majuscule indice Oméga majuscule en normal sub-indice k sub position de base Y majuscule indice Oméga majuscule en normal sub-indice k sub position de base égale sommation début souscript k fin scripts suscrire Y majuscule avec tilde indice Oméga majuscule en normal sub-indice k sub position de base point

16 Depending on the relative sizes of sources and targets, the areal interpolation problem can be rather of aggregation or disaggregation type. If sources are much smaller in size than targets, one will recover a target value by aggregating sources that will fall inside this target and possibly a few border intersections: this is an aggregation type. In the reverse situation a given target will contain intersections of itself with possible several sources. An intermediate case is when the sizes of sources are comparable to that of targets. Figures 3 and 4 illustrate these cases. We will concentrate here on the disaggregation type.

Figure 3

Aggregation case

Description de l'image par IA : Three diagrams showing different stages of partitioning a shape into labeled sections T1, T2, and T3.

Aggregation case

Figure 4

Disaggregation case

Description de l'image par IA : Three maps showing different levels of disaggregation of a region into smaller sections labeled S1, S2, and S3.

Disaggregation case

17 One property, which is often quoted, is the so called pycnophylactic property. According to Rase (2001), this name comes from the Greek words “pyknos” for mass and “phylax” for guard. This property requires the preservation of the initial data in the following sense: the predicted value on source Description de l'image par IA : S majuscule indice s

obtained by aggregating the predicted values on intersections with Description de l'image par IA : S majuscule indice s should coincide with the observed value on Description de l'image par IA : S majuscule indice s. In the case of an extensive variable, this is equivalent to

Description de l'image par IA : Y majuscule en gras indice s position de base égale sommation début souscript t en gras t en gras intersection s pas égal à ensemble vide en normal fin scripts suscrire Y majuscule en gras avec circonflexe indice s virgule t position de base point

18 In the case of an intensive variable with weighting scheme given by Description de l'image par IA : w indice A majuscule

, this is equivalent to

Description de l'image par IA : Y majuscule indice s position de base égale sommation début souscript t deux points t intersection s pas égal à ensemble vide en normal fin scripts w indice s virgule t position de base suscrire Y majuscule avec circonflexe indice s virgule t position de base point

19 In the literature, one usually encounters this property for the extensive case.

20 One assumption, which is often used to compensate for the absence of information is that of homogeneity. For an extensive target variable, we will say that it is homogeneous in a given zone A if it is evenly distributed within A, meaning that its value on a sub-zone of A is equal to the share of the area of the sub-zone times its value on A. For an intensive variable, we will use the same vocabulary when the variable is constant in each sub-zone of A. The two notions indeed correspond to each other by the relationships (2) and (3).

Figure 5

Toy example. Data on cells. Description de l'image par IA : Y majuscule indice 1 (left), X (central), Description de l'image par IA : Y majuscule indice 2 (right)

Description de l'image par IA : A grid of numbers in cells, some grouped in boxes.
Description de l'image par IA : A grid with numbers in cells, some cells are highlighted with a border.
Description de l'image par IA : A grid with numbers in cells, divided into two sections. Top section has larger numbers, bottom section has smaller numbers.

Toy example. Data on cells. Description de l'image par IA : Y majuscule indice 1 (left), X (central), Description de l'image par IA : Y majuscule indice 2 (right)

21 Let us introduce the toy example that will be used to demonstrate some properties. On Figure 5, we can see a square divided into 25 equal cells and three source regions made of unions of cells. Figure 5 presents the values of an auxiliary variable X in the center panel and the values of two target variables Description de l'image par IA : Y majuscule indice 1

 on the left and Description de l'image par IA : Y majuscule indice 2 on the right. We can see that there is inhomogeneity within sources. The target zones are visible on Figures 6 through 9, which compare some methods through the targets prediction errors.

Figure 6

Toy example. Target variable Description de l'image par IA : Y majuscule indice 1: Areal weighting (left) and dasymetric with X (right)

Description de l'image par IA : A heatmap comparing absolute relative errors in two different methods, with percentages and error ranges indicated in each cell.
Description de l'image par IA : Matrix chart comparing errors in two methods.

Toy example. Target variable Description de l'image par IA : Y majuscule indice 1: Areal weighting (left) and dasymetric with X (right)

Figure 7

Toy example. Target variable Description de l'image par IA : Y majuscule indice 2: Areal weighting (left) and dasymetric with X (right)

Description de l'image par IA : A heatmap displaying absolute relative error percentages in a grid format, with varying shades indicating different error ranges.
Description de l'image par IA : Grid chart showing absolute relative error values with percentages.

Toy example. Target variable Description de l'image par IA : Y majuscule indice 2: Areal weighting (left) and dasymetric with X (right)

Figure 8

Toy example. Target variable Description de l'image par IA : Y majuscule indice 1: Dasymetric (left) and Regression (right)

Description de l'image par IA : A chart comparing absolute relative errors between Dasymetric (left) and Regression (right) methods, categorized by error ranges.
Description de l'image par IA : A chart comparing absolute relative errors between Dasymetric and Regression methods, categorized by error ranges.

Toy example. Target variable Description de l'image par IA : Y majuscule indice 1: Dasymetric (left) and Regression (right)

Figure 9

Toy example. Target variable Description de l'image par IA : Y majuscule indice 2: Dasymetric (left) and Regression (right)

Description de l'image par IA : Grid chart showing absolute relative error ranges with percentages.
Description de l'image par IA : A chart displaying absolute and relative errors for two variables, Dasymetric and Regression, with color-coded error ranges.

Toy example. Target variable Description de l'image par IA : Y majuscule indice 2: Dasymetric (left) and Regression (right)

3.  Methods

22 One early method cannot easily be classified as the others. It is called “point in polygon” and we will describe it first. The others fall into three main classes: the class of dasymetric methods, the class of regression methods and the class of smoothing methods.

23 Some methods use auxiliary information contained in the observation of an additional related variable X to improve the reallocation. When this information is categorical, the level sets of this variable define the so-called control zones. The spatial support of this auxiliary information can be at the source, target, intersection level or control levels. To expect that the use of X improves the reallocation of Y, we need to believe that Y and X are correlated enough. This raises some questions since Y as well as X are spatial variables hence they can be spatially autocorrelated and it is unclear how to take this into account to correct the classical correlation measures.

24 Some methods require additional assumptions on the target variable, like for example: Y is homogeneous on the sources, or on targets, or the distribution of Y is known to be Poisson or gaussian. We start with the most elementary methods requiring no additional information and complexify progressively.

3.1.  Elementary methods

3.1.1.  Point in polygon

25 The centroid assignment method also called “point in polygon” allocates the source data Description de l'image par IA : Y majuscule indice s

to a target Description de l'image par IA : T majuscule indice t if and only if the source polygon centroid is located within the target polygon. The areal data is thus collapsed to a point datum via a representative point such as the centroid. Voss et al. (1979) report that it is the least accurate method. Moreover, it does not satisfy the pycnophylactic property.

3.1.2.  Areal weighting interpolation

26 It can be applied to an extensive or intensive variable and does not require auxiliary information. For an extensive variable, it is based on the homogeneity assumption that Description de l'image par IA : Y majuscule indice A majuscule

is proportional to the area |A|. It thus consists in allocating to each subregion a value proportional to the fraction of the area of the source that lies within that subregion. For s such that Description de l'image par IA : s intersection de la famille t pas égal à ensemble vide en normal

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice s virgule t position de base égale début fraction début valeur absolue A majuscule indice s virgule t position de base fin valeur absolue sur début valeur absolue S majuscule indice s position de base fin valeur absolue fin fraction Y majuscule indice s position de base virgule parenthèse gauche 4 parenthèse droite

27 After the combination step, this results in the following formula

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice t position de base égale sommation début souscript s s intersection k pas égal à 0 fin scripts début fraction début valeur absolue A majuscule indice s s position de base fin valeur absolue sur début valeur absolue S majuscule indice s position de base fin valeur absolue fin fraction Y majuscule indice s position de base point

28 For an intensive variable with areal weights, the areal weighting interpolation is based on the assumption that Y is uniform on the sources. It thus consists in allocating to the intersection Description de l'image par IA : A majuscule indice s virgule t

the value of Description de l'image par IA : Y majuscule indice s leading to

Description de l'image par IA : suscrire Y majuscule en gras avec circonflexe indice s virgule t position de base égale Y majuscule en gras indice t position de base virgule début ensemble 6 fin ensemble

29 After the combination step, this results in the following formula

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice t position de base égale sommation début souscript s s intersection t pas égal à 0 fin scripts début fraction début valeur absolue A majuscule indice s s position de base fin valeur absolue sur début valeur absolue T majuscule indice t position de base fin valeur absolue fin fraction Y majuscule indice s position de base point

30 It is easy to see that this method does satisfy the pycnophylactic property.

31 From now on, all subsequent methods require additional auxiliary information except in section 3.3.1.

3.2.  Dasymetric weighting

32 Bajatet al. (2011) trace this method back to the 19th century with George Julius Poulett Scrope in 1833 mapping the classes of global population density. The word dasymetry was introduced in the English language by Wright (1936). The class of dasymetric weighting methods comprises generalizations of areal weighting methods. In order to improve upon areal weighting, the idea is to get rid of the assumption of the count density being uniform throughout the source zones because this assumption is almost never accurate. For reflecting density variation within source zone, they use other relevant and available information X to distribute Y accordingly. This approach should help allocating  Description de l'image par IA : Y majuscule indice s

to the small intersection zones within the sources provided the relationship between X and Y be of a proportionality type with a strong enough correlation. Of course it replaces the previous assumption by the assumption that the data is proportional to the auxiliary information on any subregion. This raises the question of how to check the validity of this assumption.

33 These methods are described in the literature for an extensive variable Y and an extensive auxiliary information X. However it can be adapted to the case of intensive Y as we will see below.

34 There are some classical examples of auxiliary information for socio-demographic count data or other socio-economic trends coming from road structure or remotely sensed urban land cover data. Yuan et al. (1997) observe a high correlation between population counts and land cover types.

35 These methods satisfy the pycnophylactic property.

3.2.1.  Ordinary dasymetric weighting

36 It is assumed here that the auxiliary information is known at the intersection level and that it is of a quantitative nature. It might seem difficult to find auxiliary information at intersection level but the following example should convince the user that it is possible. Voss et al. (1999) and Reibel and Bufalino (2005) propose to use the network of road segments with auxiliary variables like length of roads or number of road nodes to allocate demographic characteristics such as population or number of housing units. The weight of a given subzone is then proportional to the aggregated length of streets and roads in that subzone.

37 For the case of an extensive target variable with an extensive auxiliary quantitative variable X, the following formulae extend Equations 4 and 5 by substituting X for the area:

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice s virgule i position de base égale début fraction X majuscule indice s virgule i position de base sur X majuscule indice i position de base fin fraction Y majuscule indice i position de base virgule parenthèse gauche 8 parenthèse droite

38 yielding after the combination step:

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice i position de base égale sommation début souscript s x intersection k pas égal à 0 fin scripts début fraction X majuscule indice s virgule t position de base sur X majuscule indice i position de base fin fraction Y majuscule indice r

39 We propose to extend this method to the case of an intensive target variable with weights given by Description de l'image par IA : Equation with variables w, Z_A, Z_Q, and w_A. Fraction w_A = Z_A / Z_Q.

for a given variable Z and an extensive auxiliary quantitative variable X. We define the corresponding extensive variables Description de l'image par IA : suscrire Y majuscule avec tilde and intensive variable Description de l'image par IA : sus-suscrire suscrire X majuscule avec tilde avec tilde by introducing the transformations from intensive to extensive Description de l'image par IA : suscrire Y majuscule avec tilde indice A majuscule position de base égale Z majuscule indice A majuscule position de base Y majuscule indice A majuscule and from extensive to intensive Description de l'image par IA : Equation with variables X_A and Z_A, and a fraction where X_A is the numerator and Z_A is the denominator.. The following formula is obtained using the correspondence intensive-to-extensive given by Equation 2 (see the annex for a proof).

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice t position de base égale sommation début souscript s x intersection t pas égal à 0 fin scripts début fraction X majuscule indice s virgule t position de base sur X majuscule indice t position de base fin fraction début fraction Z majuscule indice t position de base sur Z majuscule indice t position de base fin fraction Y majuscule indice s position de base point

40 Similar formulae can be obtained easily in the case Y extensive with X intensive and Y intensive with X intensive.

41 Let us illustrate this method with the toy example introduced at the end of section 2. Figure 6 presents a comparison between the results of the areal weighting method and the dasymetric method for target variable Description de l'image par IA : Y majuscule indice 1

. Figure 7 does the same for target variable Description de l'image par IA : Y majuscule indice 2. In each target we can see the true value of Description de l'image par IA : Y majuscule indice 1 (left) and the value of the prediction (right) and the relative prediction error below Description de l'image par IA : début fraction suscrire Y majuscule avec circonflexe indice 1 position de base moins Y majuscule indice 1 position de base sur Y majuscule indice 1 position de base fin fraction. We can see that the dasymetric method yields better results than areal weighting for variable Description de l'image par IA : Y majuscule indice 2 because of the inhomogeneity within sources (indeed the sum of squared errors is 10 percent smaller for dasymetric). However for variable Description de l'image par IA : Y majuscule indice 1,  for which the level of inhomogeneity within sources is not as high, this is not the case and areal weighting is doing better than dasymetric with a ratio of sum of squared errors of 48 percent.

3.2.2.  Dasymetric weighting with control zones

42 This is the case when the auxiliary information is categorical, its level sets thus defining the so called control zones. The most classical case, called binary dasymetric mapping, is the case of population estimation when there are two control zones: one which is known to be populated and the other one unpopulated. It is assumed that the count density is uniform throughout control zones. A first step estimates these densities Description de l'image par IA : D majuscule indice c

for control zone c by

Description de l'image par IA : Equation with summation symbols, mathematical variables, and Greek letters.

43 where Description de l'image par IA : s appartient à c

may have several meanings (containment, centroid, percent cover). For this method, it is often assumed in the literature that intersection units are nested within control zones in which case the intersection zone prediction is given by

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice s virgule t position de base égale début fraction début valeur absolue A majuscule indice s virgule t position de base fin valeur absolue suscrire D majuscule avec circonflexe indice c parenthèse gauche s virgule t parenthèse droite position de base sur sommation début souscript t prime deux points t prime intersection s pas égal à ensemble vide en normal fin scripts début valeur absolue A majuscule indice s virgule t sub-exposant prime sub position de base fin valeur absolue suscrire D majuscule avec circonflexe indice c parenthèse gauche s virgule t sub-exposant prime sub parenthèse droite position de base fin fraction Y majuscule indice s position de base virgule

44 where c(st) denotes the control zone which contains the intersection zone Description de l'image par IA : A majuscule indice s virgule t

. One can see through this formula that this is the same as using ordinary dasymetric with the auxiliary information being a first step crude estimate of variable Y based on the assumption that its corresponding intensive variable Equation 3 is constant throughout control zones. The assumption that intersection units are nested within control zones is not so restrictive since it can be restated as “the control zones are unions of intersections units”: control zone information being rather coarse, they can be designed to respect this constraint. However let us prove that this assumption is unnecessary. Indeed if one denotes by Description de l'image par IA : A majuscule indice s virgule t virgule c the intersection between source zone s, target zone t and control zone c, the following gives a prediction for the target values

Description de l'image par IA : suscrire Y majuscule de ronde avec circonflexe indice t position de base égale sommation début souscript s u intersection t pas égal à 0 fin scripts début fraction sommation début souscript c fin scripts début valeur absolue A majuscule en gras indice s virgule t virgule c position de base fin valeur absolue suscrire D majuscule de ronde avec circonflexe indice c position de base sur sommation début souscript t exposant prime position de base fin scripts sommation début souscript c fin scripts début valeur absolue A majuscule en gras indice s virgule t prime virgule c position de base fin valeur absolue suscrire D majuscule de ronde avec circonflexe indice c position de base fin fraction Y majuscule en gras indice s position de base point

45 Mennis and Hultgren (2006) illustrate this approach with American census data using land cover auxiliary information coming from manual interpretation of aerial photographs.

3.2.3.  Two steps dasymetric weighting

46 This method aims at relieving the constraint of the ordinary dasymetric weighting that the auxiliary information should be known at the intersection level, thus allowing a larger choice of such information. It is assumed here that the information is known at the level of some control zones, which means that the auxiliary information has two components: a quantitative one and a qualitative one. There is a constraint though on the control zones: they should be nested within source zones. The first step is just an ordinary dasymetric step using control zones as targets and the auxiliary information on control zones. In this case, the intersection level is the source-control intersection, which is the same as the control level since controls are nested within sources. The second step consists in areal weighting with the controls as sources (using the controls estimates of the first step) and the original targets as final targets. The homogeneity assumption used in the second step concerns the control level but since control zones are usually smaller than source zones, the assumption is less constraining. Gregory (2002) presents the implementation of this approach with historical British census data.

47 If controls are not nested within sources, the method can be adapted by adding an additional step of areal weighting to distribute the control information on the control-source intersections.

3.3.  Regression techniques

48 The dasymetric weighting schemes have several restrictions: the assumption of proportionality of Y and X, the fact that the auxiliary information should be known at intersection level and the limitation to a unique auxiliary variable (exceptionally two in the case of two steps dasymetric). The regression techniques will overcome these three constraints. Another characteristic of dasymetric method is that when predicting at the level of the Description de l'image par IA : A majuscule indice s virgule t

intersection only the areal data Description de l'image par IA : Y majuscule indice s within which the intersection is nested is used for prediction and this will not be the case for regression. In general the regression techniques involve a regression of the source level data of Y on the target or control values of X. The regression without auxiliary information of section 3.3.1 can be regarded as an extension of the areal weighting method since it relies on the “proportionality to area” principle. The regression with control zones of section 3.3.2 is a regression version of the dasymetric weighting with control zones of section 3.2.2. The regression with auxiliary information at target level of section 3.3.3 can be compared to ordinary dasymetric weighting of section 3.2.1.

49 These regression methods raise some estimation issues in the sense that very often the target variable is non negative and therefore one would like the corresponding predictions to satisfy this constraint. In order to solve this issue, people resort sometimes to Poisson regression (as in Flowerdew et al., 1991), or ordinary least squares with constraints on the coefficients (see Goodchild et al., 1993), or lognormal regression (see Goodchild et al., 1993).

3.3.1.  Regression without auxiliary information

50 A first idea discussed in Goodchild et al. (1993) consists in deriving a system of equations linking the known source values Description de l'image par IA : Y majuscule indice s

to the unknown target values Description de l'image par IA : Y majuscule indice t using an aggregation formula and an additional assumption of homogeneity of the target variable on the target zones.

51 In the case of an extensive variable, the homogeneity assumption allows to allocate Y to intersection units proportionally to their area yielding the following system

Description de l'image par IA : Y majuscule indice s position de base égale sommation début souscript t fin scripts suscrire Y majuscule avec circonflexe indice s virgule t position de base égale sommation début souscript t fin scripts début fraction début valeur absolue A majuscule indice s virgule t position de base fin valeur absolue sur début valeur absolue T majuscule indice t position de base fin valeur absolue fin fraction suscrire Y majuscule avec circonflexe indice t position de base point

52 For the case of an intensive variable, the homogeneity assumption is that Y is uniform on targets and that its weighting system is given by areal weights. This yields the following relationship between source and target values

Description de l'image par IA : Y majuscule indice s position de base égale sommation début souscript t fin scripts début fraction début valeur absolue A majuscule indice s virgule t position de base fin valeur absolue sur début valeur absolue S majuscule indice s position de base fin valeur absolue fin fraction suscrire Y majuscule avec circonflexe indice s virgule t position de base égale sommation début souscript t fin scripts début fraction début valeur absolue A majuscule indice s virgule t position de base fin valeur absolue sur début valeur absolue S majuscule indice s position de base fin valeur absolue fin fraction suscrire Y majuscule avec circonflexe indice t position de base point

53 These systems are then solved using an ordinary least squares procedure forced through the origin provided the number of source units is larger than the number of target units. This last condition is not satisfied for disaggregation problems. In that case, one can adapt the technique by combining it with the use of control zones as in section 3.3.2.

3.3.2.  Regression with control zones

54 Using control zones as in section 3.3.2, Goodchild et al. (1993) propose a two steps procedure where the first step is the technique of section 3.3.1 with controls playing the role of targets. The number of such control zones is handled by the user and hence can be forced to be smaller than the number of sources thus relieving the constraint on the number of targets of section 3.3.1. The assumption of homogeneity on targets becomes homogeneity on controls hence it is not restrictive because the controls are usually built to reflect homogeneity zones for the target variable. At the end of the first step, one can recover estimates of the target variable at the control level. Using the uniformity on control assumption, one gets from the control level to the control-target level. The second step in Goodchild et al. (1993) involves a simple aggregation from the control-target intersections level to the target level with homogeneity weights. Yuan et al. (1998) apply rather a dasymetric second step, which they call “scaling” using the first step target variable prediction as an auxiliary variable, thus enforcing the pycnophylactic property. Reibel and Agrawal (2006) superimpose a fine grid on the set of source and target zones. They first compute the proportion of each source zone corresponding to each land cover type and then regress the target variable (population) at source level on theses proportions. With the estimated coefficients, they can derive a coarse grid cell based map of the population surface. They rescale these estimates to impose the pycnophylatic property. Then with an aggregation formula they get population estimates for any combination of grid cells, namely for target regions.

3.3.3.  Regression with auxiliary information at target level

55 This family of methods allows to use more than one auxiliary variable and of different natures (quantitative or categorical, or a mixture of both). In Flowerdew et al. (1991), the emphasis is on extensive target variables with a Poisson or binomial distribution (case 1 hereafter) and in Flowerdew and Green (1992), it is on intensive target variables with a gaussian distribution (case 2 hereafter). In the gaussian case, it is assumed that the target variable Description de l'image par IA : Y majuscule indice A majuscule

on A is a sample mean of some underlying gaussian variable measured on a number Description de l'image par IA : n indice A majuscule of individuals. Therefore the intensive weights are given by Equation 1 with Description de l'image par IA : Z majuscule indice A majuscule position de base égale n indice A majuscule  and are approximated by areal weights when the counts Description de l'image par IA : n indice A majuscule are not known. In case 1, we have Description de l'image par IA : Y majuscule indice s virgule t position de base opérateur tilde P majuscule de ronde parenthèse gauche mû indice s virgule t position de base parenthèse droite  and similarly in case 2 we have Description de l'image par IA : Y majuscule indice s virgule t position de base opérateur tilde N majuscule en normal parenthèse gauche mû indice s virgule t position de base virgule début fraction sigma au carré sur n indice s virgule t position de base fin fraction parenthèse droite  where the means Description de l'image par IA : mû indice s virgule t are in both cases functions of some parameters Description de l'image par IA : bêta and the auxiliary information at target level Description de l'image par IA : X majuscule indice t. In case 2, moreover, it makes sense to assume that Description de l'image par IA : Equation showing covariance formula with variables and sigma squared..

56 With the EM algorithm.

57 Except for a variant in Flowerdew and Green (1992) (see paragraph 2), the interpolation problem is cast as a missing data problem considering the intersection values of the target variable as unknown and the source values as known therefore allowing to use the EM algorithm to overcome the difficulty.

58 The algorithm is initialized with areal weighting estimates for Description de l'image par IA : mû indice s virgule t

. The E-step consists in calculating the conditional expectation of Description de l'image par IA : Y majuscule indice s virgule t given the known values Description de l'image par IA : Y majuscule indice s. In case 1, this yields the following formula

Description de l'image par IA : E majuscule ajouré Y majuscule indice s virgule t position de base est un diviseur de Y majuscule indice s position de base égale début fraction mû indice s virgule t position de base sur sommation début souscript t exposant prime position de base fin scripts mû indice s virgule t sub-exposant prime sub position de base fin fraction Y majuscule indice s position de base

59 which yields the following predictor Description de l'image par IA : suscrire Y majuscule avec circonflexe indice s virgule t position de base égale début fraction suscrire mû avec circonflexe indice s virgule t position de base sur sommation début souscript t exposant prime position de base fin scripts suscrire mû avec circonflexe indice s virgule t sub-exposant prime sub position de base fin fraction Y majuscule indice s

and it is clear that the pycnophylactic property is satisfied.

60 In case 2, the corresponding formula is

Description de l'image par IA : E majuscule en gras Y majuscule en gras indice s virgule t position de base est un diviseur de Y majuscule indice s position de base égale mû indice s virgule t position de base début fraction c w en italique gras parenthèse gauche Y majuscule indice s virgule t position de base virgule Y majuscule indice s position de base parenthèse droite sur v a r parenthèse gauche Y majuscule indice s position de base parenthèse droite fin fraction parenthèse gauche Y majuscule indice s position de base moins mû indice s position de base parenthèse droite égale mû indice s virgule t position de base parenthèse gauche Y majuscule indice s position de base moins mû indice s position de base parenthèse droite

61 where Description de l'image par IA : mû indice s

  is obtained from the Description de l'image par IA : mû indice s virgule t  by applying the aggregation formula to the sources subdivided into the intersections and by taking expectation on both sides yielding

Description de l'image par IA : mû indice s en normal position de base égale E majuscule ajouré parenthèse gauche Y majuscule indice s en normal position de base parenthèse droite égale E majuscule ajouré parenthèse gauche sommation début souscript t fin scripts début fraction n indice s en normal virgule t position de base sur n indice s en normal position de base fin fraction Y majuscule indice s en normal virgule t position de base parenthèse droite égale sommation début souscript t fin scripts début fraction n indice s en normal virgule t position de base sur n indice s en normal position de base fin fraction mû indice s en normal virgule t position de base point

62 Therefore the E-step yields the following predictor Description de l'image par IA : suscrire Y majuscule avec circonflexe indice s virgule t position de base égale suscrire mû en italique gras avec circonflexe indice s virgule t position de base parenthèse gauche Y majuscule indice s position de base moins suscrire mû en italique gras avec circonflexe indice s position de base parenthèse droite

,  where the Description de l'image par IA : suscrire mû avec circonflexe indice s virgule t come from the previous step and the Description de l'image par IA : suscrire mû avec circonflexe indice s from the estimation version of (11).

63 One can then check that this step enforces the pycnophylactic property since

Description de l'image par IA : sommation début souscript t deux points r intersection s pas égal à ensemble vide en normal fin scripts début fraction n indice s virgule t position de base sur n indice s position de base fin fraction suscrire Y majuscule avec circonflexe indice s virgule t position de base égale sommation début souscript s deux points intersection s pas égal à ensemble vide en normal fin scripts début fraction n indice s virgule t position de base sur n indice s position de base fin fraction suscrire mû avec circonflexe indice s virgule t position de base sommation début souscript s deux points intersection s pas égal à ensemble vide en normal fin scripts début fraction n indice s virgule t position de base sur n indice s position de base fin fraction parenthèse gauche Y majuscule indice s position de base moins suscrire mû avec circonflexe indice s position de base parenthèse droite égale suscrire mû avec circonflexe indice s position de base parenthèse gauche Y majuscule indice s position de base moins suscrire mû avec circonflexe indice s position de base parenthèse droite égale Y majuscule indice s

64 In the M-step, the intersection values obtained at the previous E-step are considered as i.i.d. observations from the Poisson Description de l'image par IA : P majuscule de ronde parenthèse gauche mû indice s virgule t position de base parenthèse droite

in case 1 and from the gaussian Description de l'image par IA : N majuscule en normal parenthèse gauche mû indice s virgule t position de base virgule début fraction sigma au carré sur n indice s virgule t position de base fin fraction parenthèse droite in case 2. Recall that in both cases, the intersection means are functions of some parameters Description de l'image par IA : bêta and the auxiliary information at target level Description de l'image par IA : X majuscule indice t plus possibly some information at intersection level such as the area of the intersections. For example in case 1, Flowerdew et al. (1991) consider population as target variable and geology as auxiliary information assuming that the population density will be different in clay areas (Description de l'image par IA : lambda indice 1) and in limestone areas (Description de l'image par IA : lambda indice 2) so that Description de l'image par IA : mû indice s virgule t position de base égale lambda indice t position de base début valeur absolue A majuscule indice s virgule t position de base fin valeur absolue,  where λt is either Description de l'image par IA : lambda indice 1 or Description de l'image par IA : lambda indice 2 depending on whether target zone t is in the clay or the limestone area. One then performs maximum likelihood with a Poisson regression in case 1 and a weighted least squares in case 2.

65 Without the EM algorithm.

66 In case 2, Flowerdew and Green (1992) describe a simplified alternative version in the case when one is ready to make the uniform target zone assumption. Namely, since the auxiliary information X is available at target zone level, it does not hurt to assume Description de l'image par IA : mû en italique gras indice s virgule t position de base égale mû en italique gras indice t

. Let Description de l'image par IA : X majuscule indice T majuscule denotes the T × p design matrix where p is the number of explanatory factors in X and T the number of targets, Description de l'image par IA : mû indice S majuscule denote the S × 1 vector of source values, Description de l'image par IA : mû indice T majuscule denote the T × 1 vector of target values, W denote the weights matrix whose elements are given by Description de l'image par IA : Equation with variables n_s,t and n_s, divided by n_s.. If we combine the following information:

67 - the relation between Y and X at target level: Description de l'image par IA : mû indice T majuscule position de base égale X majuscule indice T majuscule position de base bêta

68 - the aggregation equation Description de l'image par IA : mû indice s position de base égale sommation début souscript t fin scripts début fraction n indice s virgule t position de base sur n indice s position de base fin fraction mû indice s virgule t

,

69 - the uniformity at target level assumption Description de l'image par IA : mû en italique gras indice s virgule t position de base égale mû en italique gras indice t

,

70 we get the following regression equation

Description de l'image par IA : mû indice S majuscule position de base égale W majuscule X majuscule indice T majuscule position de base bêta parenthèse gauche 1 2 parenthèse droite

71 between target means at source level and auxiliary information at target level. Using the data at the source level Description de l'image par IA : Y majuscule indice s

and Equation 12, we can estimate the parameters β by weighted least squares with weights ns. Then Description de l'image par IA : suscrire mû avec circonflexe indice t position de base égale X majuscule indice t position de base suscrire bêta avec circonflexe is a prediction for Description de l'image par IA : Y majuscule indice t.

72 Let us consider again the toy example defined earlier to illustrate this technique adapted to the case of Poisson regression. Figure 8 (respectively 9) compares the results of this regression technique with the dasymetric method based on the same auxiliary information for Description de l'image par IA : Y majuscule indice 1

( respectively Description de l'image par IA : Y majuscule indice 2). For Description de l'image par IA : Y majuscule indice 1, the regression method is better than the dasymetric with a ratio of sum of squared errors of 12 percent. For Description de l'image par IA : Y majuscule indice 2 however, the dasymetric is better than the regression with a ratio of sum of squared errors of 82 percent. The reason is that indeed the variable Description de l'image par IA : Y majuscule indice 2 has been constructed to be almost proportional to X (which is in line with the spirit of dasymetric) whereas Description de l'image par IA : Y majuscule indice 1 is not. Note that the dasymetric method uses more information than the regression method because it uses the auxiliary value at intersection level whereas the regression method uses it at target level.

73 Alternative with control zone.

74 In case 2, Flowerdew and Green (1992) consider another alternative with a set of control zones assuming that auxiliary information is at control zone level and that it is reasonable to believe that means are uniform on controls Description de l'image par IA : mû indice s virgule c position de base égale mû indice c

. The same arguments as above then yield the equations

Description de l'image par IA : mû indice G majuscule position de base égale X majuscule indice C majuscule ajouré position de base bêta parenthèse gauche 1 3 parenthèse droite
Description de l'image par IA : mû indice S majuscule position de base égale W majuscule X majuscule indice C majuscule position de base bêta parenthèse gauche 1 4 parenthèse droite

75 where Description de l'image par IA : X majuscule indice C majuscule

denotes the C × p design matrix with C being the number of control zones, Description de l'image par IA : mû indice C majuscule  denotes the C × 1 vector of control values, and W being the weight matrix at the source-intersection-control levels. Using the data at the source level and Equation 13, we can estimate the parameters β by weighted least squares with weights ns. Then Description de l'image par IA : suscrire mû avec circonflexe indice t position de base égale X majuscule indice c position de base suscrire bêta avec circonflexe and using the aggregation equation for target and control, one gets that Description de l'image par IA : Equation with summation and subscripted variables, involving n_c, J, and μ_c. is a prediction for Description de l'image par IA : Y majuscule indice t. Note that one needs two sets of weights Description de l'image par IA : début fraction n indice s virgule c position de base sur n indice s position de base fin fraction and Description de l'image par IA : début fraction n indice c virgule t position de base sur n indice c position de base fin fraction.

3.4.  A short overview of more elaborate methods

3.4.1.  Other regression methods

76 In this section, we briefly describe alternative regression methods. A detailed development of these more sophisticated techniques would require much more tools and notations. Because one of our objectives is to give priority to the practitioner point of view, we do not develop them in this presentation but just give some of the main references. Murakami and Tsutsumi (2011) combine Flowerdew and Green EM algorithm approach with a spatial econometrics regression model to take into account spatial autocorrelation at the intersection unit level. Mugglin and Carlin (1998) propose a hierarchical bayesian version of the Poisson regression method of Flowerdew et al. (1991) with a Markov chain Monte Carlo estimation step and illustrate it on disease counts. The advantage of the hierarchical bayesian approaches is that they provide full posterior distribution estimates enabling accuracy evaluation but their approach requires that the spatial support of the auxiliary information be nested within both targets and source units. Mugglin et al. (2000) extend this approach introducing Markov random field priors on the source and target mean parameters: this allows them to introduce some spatial autocorrelation in the model. They illustrate their approach with population counts reallocation with 39 sources and 160 targets. Huang et al. (2002) introduce multiresolution tree structured autoregressive models.

3.4.2.  Smoothing techniques

77 Initially meant for visual display and exploratory analysis, smoothing techniques can solve the point-to-point or the areal-to-point interpolation problems. By laying a fine lattice over the study area and predicting the target variable at each lattice node, they enable mapping the target variable. However they can be used as an intermediate step towards the areal-to-areal interpolation in the sense that once a point prediction is obtained, it is enough to use aggregation rules (integrate the point prediction) to obtain target zones predictions.

78 In this sense, choropleth mapping is a coarse interpolation technique, which amounts, for the intensive variable case, to allocate the areal data value to any point within the support of the corresponding source unit.

79 Martin (1989) and Bracken and Martin (1991) propose an adaptive kernel density estimation from the target variable values collapsed at the centroids of the source zones. This method is not pycnophylactic. A similar kernel based method is described in Grasland et al. (2000) with a discussion of the relationship between the choice of the bandwidth parameter and the level of aggregation of the initial information.

80 Tobler (1979) introduces a spline based approach for areal-to-point interpolation. His predictor is a discrete approximation (finite difference algorithm) of the solution to an optimization problem defining a type of interpolating spline with a smoothness criterion based on second partial derivatives. He includes additional constraints such as non-negative point predictions and mass-preservation. His choice of smoothness criterion has been criticized by Dyn et al. (1979). In contrast with Tobler's method which requires a regular grid of prediction points, Rase (2001) adapts Tobler's procedure replacing the regular grid by a triangulation of the space based on the observed centroids locations, and using some kernel smoothing with inverse distance weighting instead of splines.

81 Kyriakidis (2004) casts the problem into a geostatistical framework. Indeed the reverse problem of point-to-area interpolation is solved by the block Kriging in geostatistics which is classical due to mining practices: it is of interest for example to predict the total ore content of an area knowing the point data values. Kyriakidis (2004) shows that the area-to-point problem can be solved with similar methods but requires the modeling of all area-to-area and area-to-point covariances. The resulting prediction satisfies the pycnophylactic property. Moreover he proves that choropleth mapping, kernel smoothing and Tobler's pycnophylactic method can be viewed as particular cases of his framework, corresponding to various ways of specifying the covariance model (choropleth mapping corresponding to the absence of correlation at the point support level). A very interesting aspect of the method is that it offers a measure of reliability (standard error of each point prediction). The method can accomodate constraints such as maximum-minimum allowable value or prescribed value of the target variable: for example, zero population value over water bodies or high altitude regions. The method can handle large problems, possibly using moving local neighborhoods. Yoo et al. (2010) adapt it to accomodate more general constraints such as non-negativity. However estimating point covariance from areal data is difficult: it is possible for example with a maximum likelihood procedure based on multivariate gaussian assumption. Liu et al. (2008) propose to combine this approach with regression in an area-to-point residual kriging approach which can be used to disaggregate the regression residuals. Other generalizations can be found in Kelsall and Wakefield (2002) with log-normal kriging.

Tableau 1

Summary of methods

Methods Target variableY Auxiliary variable X Control zones Pycnophylactic property
Nature Additional assumption Dimention Nature Support
Areal weighting Extensive
Intensive
Homogeneous on sources none none none none yes
Dasymetric Extensive
Intensive
None 1 Extensive or Intensive Intersections none yes
Dasymetric with control zones Extensive
Intensive
Homogeneous on controls 1 Categorical Controls yes no
Regression without auxiliary infor Extensive
Intensive
Homogeneous on targets none none none none no
Regression with auxiliary info Extensive None >=1 Extensive or Intensive Targets none no
Intensive Weight area
Point in polygon Extensive None none none none none no
Table comparing statistical methods and properties.

Summary of methods

4.  Conclusion

82 We have described the main classes of methods for the area-to-area spatial interpolation problem including proportional weighting schemes also called dasymetric methods, smoothing techniques and regression based interpolation. As we pointed out in the introduction, we have focused on the basic methods, which are more likely to be adopted by practitioners, and a summary of the main characteristics of these methods can be found in Table 1.

83 We have not addressed in this review the case of categorical target variable. Chakir (2009) proposes a technique for reallocating multinomial type data (namely land use shares) given sampled information at a disaggregated level and observation of aggregated land use shares with a generalized cross-entropy approach.

84 In terms of implementation of these methods in usual softwares, there is not much available. Bloom et al. (1996) describe their implementation of areal weighting from Flowerdew et al. (1991) with Mapinfo. With R, it is possible to use the “pycno” package by C. Brundson. From our experience with some real data cases, we believe that in large size real applications, the more sophisticated methods are not yet manageable because of size problems and are far too complicated to communicate to the public offices typical users. Simplicity and convenience considerations are certainly the prime arguments for the best choice.

85 As mentioned in the introduction, one motivation of this paper was to be a first step for a further study of the comparative precision of these prediction methods. Let us briefly summarize what can be found in the literature so far. Overall one finds two types of point of views: a methodological or an empirical one. Unfortunately, there is not much from the methodological point of view since we only found the work of Sadahiro (2000) who considers the point-in-polygon approach and compares it to the areal weighting scheme. He uses a counting process with a fixed number of i.i.d. points with a given density to model the target variable distribution. The target zone is modeled with a fixed shape but a random position. The sources realize a tiling partition of the space with geometric shapes (considered as unbounded to avoid boundary problems). The last step of the evaluation is of an empirical nature. He finds that the accuracy of point-in-polygon depends upon the target zone size (the bigger, the better) and the concentration of the underlying distribution of points. One needs a concentration of points around the representative point in an area of at most 12-15 percent of the total for the point-in-polygon to compare favorably with the areal weighting method, which is quite unrealistic in applications. He also studies the optimal location of representative points which is found to be at the spatial median of the source zone.

86 The rest of this literature contains many papers of an empirical nature. The comparison of areal weighting with the alternative dasymetric methods is found in Reibel and Bufalino (2005), Voss (1992), Mennis (2006), Fisher and Langford (1995), Gregory (2002). The dasymetric methods are always found to have better performance than the simple areal weighting with reported improvements up to 71 per cent in relative mean square error (Reibel and Bufalino, 2005).

87 The comparison of regression methods with several alternatives is found in Flowerdew and Green (1992), Flowerdew et al. (1991), Reibel and Agrawal (2007), Gregory (2002). Flowerdew et al. (1991) find that the EM algorithm regression for the Poisson or binomial models performs better than areal weighting by factors of 50 − 60 per cent (Poisson case) and 25 − 55 per cent (Binomial case) in target deviance. Murakami and Tsutsumi (2011) compare their spatial regression method to more classical regression approaches and find that their spatial lag model performs better. Overall regression methods are found to perform better than dasymetric methods.

88 For the smoothing methods, Goodchild and Lam (1980) compare areal weighting and Tobler's pycnophylactic interpolation and they do not report any significant advantage for the smoothing method. This may be due to the fact that “count density gradients are not in fact typically smooth up to and beyond tract boundaries” (from Reibel and Agrawal, 2007).

89 Finally, it is important to point out that the only methods that come along with an accuracy measure are area-to-point kriging and the hierarchical bayesian methods. We think that more attention should be paid to systematic comparisons of the relative accuracies of all these methods in the future.

Acknowledgments

We would like to thank A. Ruiz-Gazen and C. Vignes for fruitful discussions. This work was supported by the French Agence Nationale de la Recherche through the ModULand project (ANR-11-BSH1-005).

Ordinary dasymetric weighting for intensive target variable and extensive auxiliary variable

90 For an intensive target variable with weights given by Description de l'image par IA : Equation with variables w, Z_A, Z_Q, and w_A. Fraction w_A = Z_A / Z_Q.

and an extensive auxiliary quantitative variable X,  we define the corresponding extensive variable Description de l'image par IA : suscrire Y majuscule avec tilde using (3) by Description de l'image par IA : suscrire Y majuscule avec tilde indice A majuscule position de base égale Z majuscule indice A majuscule position de base Y majuscule indice A majuscule and intensive variable Description de l'image par IA : Equation with variables X_A and Z_A, and a fraction where X_A is the numerator and Z_A is the denominator.. Using Equation 8 we have

Description de l'image par IA : suscrire Y majuscule avec tilde indice s virgule t position de base égale début fraction X majuscule indice s virgule t position de base sur X majuscule indice s position de base fin fraction suscrire Y majuscule avec tilde indice s position de base égale début fraction X majuscule indice s virgule t position de base sur X majuscule indice s position de base fin fraction Z majuscule indice s position de base Y majuscule indice s position de base virgule

91 and therefore

Description de l'image par IA : suscrire Y majuscule en gras avec circonflexe indice s virgule t position de base égale début fraction sus-suscrire suscrire Y majuscule en gras avec tilde avec circonflexe sur Z majuscule indice s virgule t position de base fin fraction égale début fraction X majuscule indice s virgule t position de base sur X majuscule indice s position de base fin fraction début fraction Z majuscule indice s position de base sur Z majuscule indice s virgule t position de base fin fraction Y majuscule en gras indice s position de base égale début fraction X majuscule indice s virgule t position de base divisé par Z majuscule indice s virgule t position de base sur X majuscule indice s position de base divisé par Z majuscule indice s position de base fin fraction Y majuscule en gras indice s position de base égale début fraction suscrire X majuscule en gras avec tilde indice s virgule t position de base sur sus-suscrire suscrire X majuscule en gras avec tilde avec tilde indice s position de base fin fraction Y majuscule en gras indice s

92 which is similar to Equation 8. After the combination step, we get

Description de l'image par IA : suscrire Y majuscule avec circonflexe indice t position de base égale sommation indice s s intersection t pas égal à 0 position de base début fraction Z majuscule indice s virgule t position de base sur Z majuscule indice t position de base fin fraction suscrire Y majuscule avec circonflexe indice s virgule t position de base égale sommation indice s s intersection t pas égal à 0 position de base début fraction Z majuscule indice s virgule t position de base sur Z majuscule indice t position de base fin fraction début fraction X majuscule indice s virgule t position de base sur X majuscule indice s position de base fin fraction début fraction Z majuscule indice s position de base sur Z majuscule indice s virgule t position de base fin fraction Y majuscule indice s position de base égale sommation indice s s intersection t pas égal à 0 position de base début fraction X majuscule indice s virgule t position de base sur X majuscule indice s position de base fin fraction début fraction Z majuscule indice s position de base sur Z majuscule indice t position de base fin fraction Y majuscule indice s position de base point

References

  • Bajatb B, Krunic N,Kilibarda M (2011) Dasymetric mapping of spatial distribution of population in Timok Region. Proceedings of International conference Professional practice and education in geodesy and related fields, Klavodo-Djerdap, Serbia.
  • Bloom LM, Pedler PJ, Wragg GE (1996) Implementation of enhanced areal interpolation using Mapinfo. Computers and Geosciences 22(5): 459-466.
  • Chakir R (2009) Spatial Downscaling of agricultural land-use data: an econometric approach using cross-entropy. Land Economics 85(2): 238-251.
  • Dynn, Wahba G,Wong W (1979) Comment on “Smooth pycnophylactic interpolation for geographical regions” by Tobler. Journal of the American Statistical Association 74(367): 530-535.
  • Fisher P, Langford M (1996) Modelling sensitivity to accuracy in classified imagery: a study of areal interpolation. The Professional Geographer 48: 299-309.
  • Flowerdew R, Green M,Kehris E (1991) Using areal interpolation methods in geographical information systems. Papers in Regional Science 70(3): 303-315.
  • Flowerdew R,Green M (1992) Developments in areal interpolation methods and GIS. The Annals of Regional Science 26(1): 67-78.
  • Goodchild MF, Anselin L, Deichman U (1993) A framework for the areal interpolation of socio-economic data. Environment and Planning A 25: 383-397.
  • Goodchild MF,Lam N (1980) Areal interpolation: a variant of the traditional spatial problem. Geo-Processing 1: 297-312.
  • Gotway CA, Young LJ (2002) Combining incompatible spatial data. Journal of the American Statistical Association 97(458): 632-648.
  • Grasland C, Mathian H, Vincent J-M (2000) Multiscalar analysis and map generalisation of discrete social phenomena: statistical problems and political consequences. Statistical Journal of the United Nations ECE 17 IOS Press: 157-188.
  • Gregory IN (2002) The accuracy of areal interpolation techniques: standardizing 19th and 20th century census data to allow long-term comparisons. Computers, environments and urban systems 26(4): 293-314.
  • Kelsal J, Wakefield J (2002) Modeling spatial variation in disease risk: a geostatistical approach. Journal of the American Statistical Association 97: 692-701.
  • Kyriakidis PC (2004) A geostatistical framework for areal-to-point spatial interpolation. Geographical Analysis 36(3): 259-289.
  • Lam NS (1982) An evaluation of areal interpolation methods. Proceedings, Fifth International Symposium on Computer-Assisted Cartography (AutoCarto 5) 2: 471-479.
  • Langford M (2007) Rapid facilitation of dasymetric-based population interpolation by means of raster pixel maps. Computers, Environment and Urban Systems 31: 19-32.
  • Liu XH, Kyriakidis PC, Goodchild MF (2008) Population-density estimation using regression and area-to-point residual kriging. International Journal of geographical information science 22(4): 199-213.
  • Martin D (1989) Mapping population data from zone centroid locations. Transactions of the Institute of British Geographers, New Series 14(1): 90-97.
  • Martin D, Bracken I (1991) Techniques for modeling population-related raster databases. Environment and Planning A 23(7): 1069-1075.
  • Mugglin AS, Carlin BP (1998) Hierarchical modeling in geographic information systems: population interpolation over incompatible zones. Journal of Agricultural, Biological and Environmental Statistics 3(2): 111-130.
  • Mugglin AS, Carlin BP,Gelfand A E (2000) Fully model-based approaches for spatially misaligned data. Journal of the American Statistical Association 95(451): 877-887.
  • Murakami D, Tsutsumi M (2011) A new areal interpolation technique based on spatial econometrics. Procedia-Social and Behavioral Sciences 21: 230-239.
  • Rase W (2001) Volume-preserving interpolation of a smooth surface from polygon-related data. Journal of Geographycal Systems 3: 199-213.
  • Reibel M, Bufalino ME (2005) Street-weighted interpolation techniques for demographic count estimation in incompatible zone systems. Environment and Planning A 37(1): 127-139.
  • Reibel M, Agrawal A (2007) Areal interpolation of population counts using pre-classified land cover data. Population Research and Policy Review 26(5-6): 619-633.
  • Sadahiro Y (2000) Accuracy of count data estimated by the point-in-polygon method. Geographical Analysis 32(1): 64-89.
  • Tobler WR (1979) Smooth pycnophylactic interpolation for geographical regions. Journal of the American Statistical Association 74(367): 519-530.
  • Vignes C, Rimbourg S, Ruiz-Gazen A,Thomas-Agnan C (2013) Fiches méthodologiques, Méthodes statistiques d’allocation spatiale : interpolation de données surfaciques. Technical report.
  • Voss PR, Long DD, Hammer RB (1999) When census geography does not work: using ancillary information to improve the spatial interpolation of demographic data. CDE working paper, 99-26, Wisconsin, Madison.
  • Wright JK (1936) A method of mapping densities of population with Cape Cod as an example. Geographical Review 26(1): 103-110.
  • Yoo E, Kyriakidis PC,Tobler W (2010) Reconstructing population density surfaces from areal data: a comparison of Tobler’s pycnophylactic interpolation method and area-o-point kriging. Geographical Analysis 42(1): 78-98.
  • Yoo E, Kyriakidis PC (2006) Area-to-point kriging with inequality-type data. Journal of geographical systems 8(4): 357-390.
  • Yuan Y, Smth R M,Limp W F (1997) Remodeling census population with spatial information from Landsat TM imagery. Computers, Environment and Urban Systems 21(3): 245-258.

Mots-clés éditeurs : changement de support, désagrégation spatiale, interpolation spatiale surfacique, problème de superposition des polygones, propriété pycnophylactique

Date de mise en ligne : 26/06/2015

https://doi.org/10.3917/reru.151.0027