CONTROLE DE QUALIDADE DE BASES DE DADOS ESPACIAIS ATRAVÉS DE UMA AMOSTRAGEM DE ZERO-DEFEITOS COM RETIFICAÇÃO
Quality control of a spatial database by a zero-defect sampling with rectification procedure

Quality is commonly used to indicate the superiority of a manufactured good or the degree of excellence of a product, service or performance. Since a database can be viewed as the result of a production process, and the reliability of that process imparts value and utility to the database, sampling procedures can be applied to evaluate whether the database meets the specifications set by the user. In this paper, we present the optimum sample size to be extracted from a digital file generated by a conversion process. A zero-defect acceptance sampling scheme with rectification was considered, with quadrats as area sampling frames. The procedures are implemented in a program using the software Matlab and illustrated by an application to digital data related to the blocks of a region of São Paulo downtown.


INTRODUCTION
Quality is commonly used to indicate the superiority of a manufactured good or the degree of excellence of a product, service or performance. In manufacturing processes, quality may be stated as a desirable goal to be achieved by management through the control of the production process (usually employing tools such as control charts, for example). These same ideas may be easily extended or adapted to evaluate the quality of databases, since a database can be viewed as the result of a production process, and the reliability of that process imparts value and utility to the database.
In manufacturing, the characteristics to be evaluated are easily identified and usually classified into two main groups: attributes (conforming or non-conforming) or variables (some measurement of interest). In data quality, users face a basic problem: what are the dimensions of geographical data quality, given that features of the real world are represented in the database by objects such as points, lines, polygons or areas (for example, rivers or roads are represented by lines)? According to VEREGIN (1999), the conventional view is that geographical data are "spatial", and the terms "geographical data" and "spatial data" have been used interchangeably. However, this approach is not adequate, since it ignores the inherent coupling of space and time (geographical entities are actually events unfolding over space and time) and the fact that geography is connected by themes (not space). Space (or space-time) is just the framework inside which theme is measured; in the absence of theme, only geometry is present. So a better definition of geographical data should include three dimensions: space, time and theme (where-when-what). These three dimensions are the basis for all geographical observation, and data quality must address them through components such as accuracy, precision, consistency and completeness.
Evaluating the quality of digital products is not an easy task, and different aspects of the quality of a spatial (sometimes cartographic) database have been discussed in the literature. Some contributions may be listed. For example, REINGRUBER and GREGORY (1994) and CHENGALUR-SMITH, BALLOU and PAZER (1999) have pointed out the influence of spatial database quality on the decision process. The control of cartographic objects in the quality evaluation of a spatial database has also been a subject of interest; see, for example, LEUNG and YANG (1998), SHI and LIU (2000) and VEREGIN (1999 and 2000). Related to the spatial database building process, the following contributions may be listed: COUCLELIS (1992); NUGENT (1995); LIU, SHI and TONG (1999); QUINTANILHA (2002); QUINTANILHA and HO (2002).
Consider a situation in which a digital file intended for a spatial database is generated by a conversion process (for example, documents, maps or other cartographic products in paper format converted to a digital file). This file will be used in a geographical information application, and it is necessary to evaluate whether the specifications settled by the users (for example, specification limits and restrictions for spatial features, attribute value considerations and other relevant aspects) are met.
Similarly to the evaluation of a manufacturing process, a sample of the database is randomly selected using some area sampling frame (as we are dealing with spatial data, quadrats are the most common frame). Each sampling unit is evaluated to verify whether it satisfies previously fixed criteria, and a rule is chosen to decide whether the database meets the specification or not. In this paper we consider the following acceptance sampling scheme:
1 - Consider an area covered by T sheets in a fixed scale. Each sheet can be divided into n independent quadrats [see KISH (1965); SHAW and WHEELER (1985)] of a fixed format (in our case, a square) and size.
2 - A random sample of m < n quadrats is extracted from each sheet.
3 - The subsets of files corresponding to the m quadrats are examined; if all the information in each file is conforming, the examined sheet is accepted. Otherwise, all n quadrats of the sheet are inspected and corrected, and then the file of the examined sheet is accepted.
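The three steps above can be sketched in a short simulation. This is a Python illustration, not the authors' Matlab program, and the values of n, m and the defect rate are illustrative assumptions:

```python
import random

def inspect_sheet(quadrat_ok, m, rng):
    """Zero-defect acceptance sampling with rectification for one sheet.

    quadrat_ok -- list of n booleans, True meaning the quadrat's subset of
    files is conforming.  Returns (accepted_at_sampling_stage, inspected).
    """
    n = len(quadrat_ok)
    sample = rng.sample(range(n), m)        # step 2: random sample of m quadrats
    if all(quadrat_ok[i] for i in sample):  # step 3: zero defects -> accept sheet
        return True, m
    return False, n                         # otherwise inspect/rectify all n

rng = random.Random(42)
p = 0.001  # illustrative probability of a non-conforming quadrat
sheet = [rng.random() > p for _ in range(5000)]
accepted, inspected = inspect_sheet(sheet, m=200, rng=rng)
```

With these illustrative numbers, the sheet is accepted whenever the 200 sampled quadrats are all conforming; otherwise all 5000 quadrats are inspected.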
Figure 1 illustrates the described sampling procedure. Such a sampling scheme is known as zero-defect with rectification, and it is usually used to evaluate high-quality manufacturing processes by attributes. In those processes, we have batches instead of sheets or cartographic products, and items or products are examined in place of the files related to the quadrats.
In the technical literature, some papers about zero-defect sampling with rectification can be found. We may mention the contributions of HAHN (1986); BRUSH, HOADLEY and SAPERSTEIN (1990); GREENBERG and STOKES (1992, 1995); and ANDERSON, GREENBERG and STOKES (2001). In those papers, the main objective is to present estimators for the number of non-conforming items in an accepted batch (here, non-conforming features in an accepted sheet). ANDERSON, GREENBERG and STOKES (2001) introduced the possibility that the classification criteria present diagnosis errors in the zero-defect with rectification procedure; that is, an examined item/product is classified as non-conforming when in reality it is conforming, or an item/product is classified as conforming when it is non-conforming.
Similarly, when we are evaluating a spatial database, the subset of files related to a quadrat may be examined and classified as non-conforming when in reality it is conforming, or classified as conforming when it is non-conforming [for more details about diagnosis errors, see JOHNSON, KOTZ and WU (1991)]. Such diagnosis errors can occur either in the inspection or in the rectification stage. MARKOWSKI and MARKOWSKI (2002) presented a methodology to minimize the impact of such diagnosis errors in acceptance sampling.
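The two diagnosis errors can be made concrete with a small sketch (Python for illustration; the error rates below are invented, while e1 and e2 follow the paper's notation for the false non-conforming and false conforming probabilities):

```python
import random

def classify(is_conforming, e1, e2, rng):
    """Imperfect inspection of one quadrat's subset of files.

    A conforming quadrat is wrongly flagged non-conforming with probability e1;
    a non-conforming quadrat wrongly passes as conforming with probability e2.
    Returns True when the quadrat is *classified* as conforming.
    """
    if is_conforming:
        return rng.random() >= e1
    return rng.random() < e2

rng = random.Random(1)
trials = 10_000
# Share of truly conforming quadrats that the inspector also labels conforming:
share_passed = sum(classify(True, 0.02, 0.05, rng) for _ in range(trials)) / trials
```

With e1 = 0.02, roughly 98% of conforming quadrats pass the inspection in this simulation.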
However, before extracting the sample of files, it is important to determine how large the sample must be in order to meet some criteria (statistical and/or economical ones). In this paper, we consider the determination of an optimum sample size m that minimizes a cost function. The components of this function include the inspection cost, the costs due to the presence of non-conforming quadrat subsets of files in accepted sheets, and the costs due to diagnosis errors. Economical models found in the literature do not include the possibility of diagnosis errors in the inspection stage; the determination of the sample size including such errors is the focus of this paper.
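Once an expected cost function of m is available, finding the optimum sample size is a one-dimensional search over 1..n. The sketch below uses a simple grid search with a stand-in convex cost (linear inspection cost plus a risk term shrinking in m); it is not the paper's actual E(Cm):

```python
def optimal_sample_size(cost, n):
    """Return the m in 1..n that minimizes cost(m), by exhaustive search."""
    return min(range(1, n + 1), key=cost)

def toy_cost(m, c0=1.0, risk=5000.0):
    # Stand-in cost: inspection grows linearly in m, residual risk decays as 1/m.
    return c0 * m + risk / m

m_star = optimal_sample_size(toy_cost, 500)  # near sqrt(5000), i.e. m = 71
```

An exhaustive search is cheap here because n is at most a few thousand quadrats per sheet and the cost is evaluated once per candidate m.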
In Section 2, we introduce the notation and hypotheses considered in this paper. The expected cost function is developed in Section 3, and the procedure is illustrated by a numerical example in Section 4. We finish this paper with a discussion and extensions for future work.

NOTATION AND HYPOTHESIS
Consider an area covered by a sheet. This sheet can be divided into n independent quadrats of a fixed format and size. A random sample of m quadrats is selected, and p is the probability of a quadrat being non-conforming. The value of p is equal to zero with probability (1 − π), and it can vary from one sheet to another according to a Beta(a, b) distribution with probability π (the probability that p > 0). Let:

e1 → the probability that a quadrat from a subset of files is wrongly classified as non-conforming when it is conforming;
e2 → the probability that a quadrat from a subset of files is wrongly classified as conforming when it is non-conforming;
c0 → the cost to inspect a quadrat from a subset of files;
c1 → the cost of a non-conforming and non-rectified quadrat subset of files in an accepted sheet;
c2 → the cost of erroneously judging a quadrat from a subset of files as conforming when it is non-conforming;
D1i → the number of non-conforming quadrats from subsets of files in a sample of size m in sheet i;
D2i → the number of non-conforming quadrats from subsets of files in the (n − m) non-sampled quadrats in sheet i;
Di → the total number of non-conforming quadrats from subsets of files in sheet i;
Y1i → the number of non-conforming quadrats from subsets of files observed in a sample of size m in sheet i;
Y2i → the number of non-conforming quadrats from subsets of files observed in the (n − m) non-sampled quadrats in sheet i;
Yi → the number of non-conforming quadrats from subsets of files observed in sheet i.

I[·] denotes an indicator function and E(·) the expected value of a random variable.

COST FUNCTION AND DETERMINATION OF THE OPTIMUM SAMPLE SIZE m

In this section, an expected cost function per sheet, E(Cm), is developed employing the earlier notation and hypotheses. The total mean cost to evaluate T sheets is T E(Cm), and the m that minimizes E(Cm) will also minimize the total mean cost. So, hereafter, the index i is suppressed in the expression of the expected cost function per sheet. The expected cost function is decomposed as

E(Cm) = E(Cm1) + E(Cm2) + E(Cm3).   (1)

The first component, E(Cm1), is related to the inspection cost. It comprises the cost to inspect the m sampled quadrats and the possible cost to inspect the (n − m) non-sampled quadrats; the latter is conditioned on the presence of at least one quadrat classified as non-conforming in the initial inspection of the m quadrats. So E(Cm1) is given by

E(Cm1) = c0 [m + (n − m) U],

where 1 − U is the probability that the sheet is accepted at the sampling stage. To obtain the value of 1 − U, we have to consider two scenarios:
1 - In the random sample of m quadrats, all are conforming and all must be correctly classified as conforming; the probability of this event is

[(1 − e1)(1 − p)]^m.   (2)

2 - Among the m examined quadrats, D1 are non-conforming, but all m quadrats must be classified as conforming; the probability of this event, conditioned on fixed values of p and D1, is

C(m, D1) p^{D1} (1 − p)^{m − D1} e2^{D1} (1 − e1)^{m − D1}.   (3)

Combining equations (2) and (3) and averaging over the distribution of p (zero with probability 1 − π, Beta(a, b) with probability π), we obtain

1 − U = (1 − π)(1 − e1)^m + π ∫ Σ_{D1=0}^{m} C(m, D1) p^{D1} (1 − p)^{m − D1} e2^{D1} (1 − e1)^{m − D1} Beta(a, b)(p) dp.

The second component, E(Cm2), in (1) is due to the possibility of a quadrat being classified as conforming when in reality it is non-conforming: each non-conforming and non-rectified quadrat subset of files in an accepted sheet carries cost c1, and each erroneous judgment of a non-conforming quadrat as conforming carries cost c2. Such misclassification can alter the expenses both when the sheet is accepted and when it is rejected in the inspection stage; as D = D1 + D2, the corresponding expression can be written in terms of the total number of non-conforming quadrats in the sheet.

The last component, E(Cm3), in (1) is due to the consequence of classifying a quadrat as non-conforming when it is conforming. In this case, the sheet is rejected and consequently all n quadrats are inspected as if non-conforming, so some of them may be rectified unnecessarily.

NUMERICAL EXAMPLE

The example described in this section is based on an application to digital data related to the blocks of a small region of São Paulo downtown, Brazil. The attribute of interest was to verify whether the block drafts were or were not correctly located (their presence or absence). It is known that the area is covered by T sheets and each one is made up of n = 5000 quadrats. They will be inspected by a zero-defect with rectification procedure, and the inspection consists of checking visually the presence or absence of block drafts on the screen or in a plot. In this context, it is reasonable to expect the occurrence of misclassifications. Let us consider the following costs: c0 = $1.00 per inspected quadrat, c1 = $100.00 and c2 = $500.00.

Fig. 2: Values of m versus expected cost (c0 = 1.00; c1 = 100.00 and c2 = 500.00).
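The acceptance probability 1 − U of Section 3 can be evaluated in closed form, since the Beta prior integrates against each binomial term. The sketch below is a Python reconstruction (the paper's own program is in Matlab), and the parameter values for π, a, b, e1 and e2 are illustrative assumptions, not the paper's:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def accept_prob(m, pi, a, b, e1, e2):
    """1 - U: probability that a sheet is accepted at the sampling stage,
    with p = 0 w.p. (1 - pi), p ~ Beta(a, b) w.p. pi, and diagnosis
    errors e1 (false non-conforming) and e2 (false conforming)."""
    term0 = (1 - pi) * (1 - e1) ** m        # p = 0: every quadrat must pass
    term1 = 0.0
    for d in range(m + 1):                  # d truly non-conforming in the sample
        # E[p^d (1 - p)^(m - d)] under Beta(a, b), in closed form:
        w = exp(log_beta(a + d, b + m - d) - log_beta(a, b))
        term1 += comb(m, d) * (e2 ** d) * ((1 - e1) ** (m - d)) * w
    return term0 + pi * term1

def inspection_cost(m, n, c0, **kw):
    """E(Cm1) = c0 [m + (n - m) U]."""
    return c0 * (m + (n - m) * (1 - accept_prob(m, **kw)))

params = dict(pi=0.10, a=1.0, b=99.0, e1=0.01, e2=0.05)  # illustrative values
cost_at = {m: inspection_cost(m, n=5000, c0=1.0, **params) for m in (10, 100, 500)}
```

Sweeping m over 1..n and adding the E(Cm2) and E(Cm3) terms would produce curves of the kind shown in Fig. 2.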