An Efficient Variant of Ranked Set Sampling, Probability Proportional to Size with Application to Economic Data

In this paper, we apply the Ranked Set Sampling (RSS) technique to economic data in the form of a homescan market research data set for the meat food group. The RSS method is then extended to select sampling units based on the Probability-Proportional-to-Size (PPS) approach. The proposed sampling method, which applies ranked set sampling to PPS-selected units and is denoted RPPS, is assessed via Monte Carlo investigations and an extensive homescan data set. The results are promising and in line with the theory: the RPPS estimate remains unbiased and has a smaller variance than the PPS estimate.


Introduction
Ranked Set Sampling, hereafter referred to as RSS, is a sampling approach whose basic structure can lead to improved statistical inference in situations where the actual measurement of the variable of interest is difficult or expensive to obtain, while sampling units can be easily and cheaply ordered by some means, such as visual inspection, without actual quantification. It is an intriguing development in data collection techniques that enables one to gather a more informative sample than can be obtained through simple random sampling. RSS is a two-stage sampling technique: first, a number of sampling units are ranked with respect to the variable of interest; second, measurements are taken from a fraction of these ranked units. Rank-based sampling designs are powerful alternatives to simple random sampling (SRS) that often offer notable improvements in precision. They have been used in diverse applications, including ecological and environmental studies (e.g., Al-Saleh and Zheng (2002) and Kvam (2003)), forestry (Halls and Dell (1966)), medical studies (Samawi and Al-Sagheer (2001) and Chen, Stasny, and Wolfe (2005)), and reliability (Mahdizadeh and Zamanzade (2018)), among others. Such applications have attracted widespread attention; in this paper, we consider an application of RSS to price data.
Heravi and Morgan (2014) evaluated various sampling methods for meat prices, stratifying by kind of meat and other attributes, such as brand, method of storage/preservation, and region, to estimate the Consumer Prices Index (CPI). This index is an important macroeconomic indicator that attempts to summarize the changes in the price of a 'typical' basket of goods and is widely used for formulating economic policy and indexing pensions and welfare benefits. Its accurate measurement is therefore critical, and it is clearly of interest to know how various sampling schemes perform in the construction of such a price index. Heravi and Morgan (2014) suggested that Probability Proportional to Size (PPS) sampling is an accurate method for estimating the CPI.
It is well known that RSS is superior to SRS, and Heravi and Morgan (2014) show that PPS can give an accurate estimate of the CPI, so it is of interest to study a hybrid of RSS and PPS. In this paper, we extend the RSS method using PPS (denoted RPPS) and evaluate the performance of the proposed sampling technique. In fact, RSS is not so much a sampling technique as a data management method; accordingly, combining RSS with sampling methods such as PPS is of great interest. The performance measures considered in this study are the bias and standard deviation of the mean estimate. We consider PPS with replacement to keep the selection probabilities constant. PPS deals with finite populations. Inference under RSS with a finite population has been considered for different designs; see Deshpande, Frey, and Ozturk (2006) and Ozturk (2016). More recently, PPS was considered in Ozturk (2019) and Ozturk (2020), which dealt with stratified populations. Here we focus on RPPS and show its application to macroeconomic data. Next, we provide an overview of the data structure of an RSS and summarize the RSS method. Section 3 discusses PPS, Section 4 explores the performance of RPPS relative to PPS, and Section 5 describes two numerical studies that explore the finite-sample properties of the proposed method and presents an application of the proposed technique to the homescan data set. Section 6 provides some concluding remarks.

Foundation of ranked set sampling
Perfect RSS has two stages. In the first stage, units are identified and ranked perfectly. In the second stage, measurements are taken from a fraction of the ranked elements. To obtain an RSS of size k, one chooses an SRS of k units, $\{Y_{11}, \ldots, Y_{1k}\}$, ranks them without measuring the variable of interest, $Y_{(11)} \le \ldots \le Y_{(1k)}$, and selects the smallest one, $Y_{(11)}$. Next, the second smallest unit, $Y_{(22)}$, is selected from a second SRS of k units. This procedure is repeated until k observations have been collected. Let us consider one cycle of an RSS sample and denote it by $Y = \{Y_{(1)}, \ldots, Y_{(k)}\}$. If $Y \sim F(\cdot)$ with $\sigma^2 < \infty$, the estimate of the population mean µ and its variance are

$$\bar{Y}_{rss} = \frac{1}{k}\sum_{i=1}^{k} Y_{(i)}, \qquad V(\bar{Y}_{rss}) = \frac{\sigma^2}{k} - \frac{1}{k^2}\sum_{i=1}^{k}\left(\mu_{(i)} - \mu\right)^2,$$

where µ is the mean of the population and $\mu_{(i)}$ denotes the mean of the ith order statistic in an SRS of size k. Takahasi and Wakimoto (1968) considered the precision of the RSS estimate of the population mean relative to the SRS estimate and showed that, for any distribution with finite variance, the relative precision is bounded below by 1 and above by (k + 1)/2; the upper bound is achieved when sampling from the uniform distribution.
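As a check on the upper bound, the uniform case can be worked out directly. The following is a sketch under the assumption $Y \sim U(0, 1)$, using the standard moments of uniform order statistics; the symbols $\sigma^2_{(i)}$ (variance of the ith order statistic), $\bar{Y}_{srs}$ (the SRS mean of k units), and RP (relative precision) are notation introduced here for illustration.

```latex
\begin{align*}
\mu_{(i)} &= \frac{i}{k+1}, \qquad
\sigma^2_{(i)} = \frac{i\,(k-i+1)}{(k+1)^2 (k+2)}, \\
V(\bar{Y}_{rss}) &= \frac{1}{k^2}\sum_{i=1}^{k} \sigma^2_{(i)}
  = \frac{1}{k^2}\cdot\frac{k(k+1)(k+2)/6}{(k+1)^2 (k+2)}
  = \frac{1}{6k(k+1)}, \\
V(\bar{Y}_{srs}) &= \frac{\sigma^2}{k} = \frac{1}{12k}, \\
\mathrm{RP} &= \frac{V(\bar{Y}_{srs})}{V(\bar{Y}_{rss})}
  = \frac{6k(k+1)}{12k} = \frac{k+1}{2}.
\end{align*}
```

The second line uses $\sum_{i=1}^{k} i(k-i+1) = k(k+1)(k+2)/6$. For k = 5, for instance, the relative precision is 3.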
For a finite population, following Arnold, Balakrishnan, and Nagaraja (2008), the probability mass function can be expressed as

$$P_{(r)}(y) = F_{(r)}(y) - F_{(r)}(y^-),$$

where $P_{(r)}$ and $F_{(r)}$ are the probability mass function and the cumulative distribution function of a ranked statistic with rank r, and $F_{(r)}(y^-)$ is the left limit of $F_{(r)}$ at y. This representation helps us to work with the discrete distribution. Using this expression, it can be shown that

$$\frac{1}{k}\sum_{r=1}^{k} P_{(r)}(y) = P(y). \qquad (2)$$

An easier way to show that statement (2) holds is to use the density function of the order statistics; its use here is straightforward, and details are given in Arnold et al. (2008).

The process of ranking may not be error-free. In that case, the probability mass function of a ranked statistic with rank r is no longer $P_{(r)}(y)$ and is denoted by $P_{[r]}(y)$; see Chen, Bai, and Sinha (2004). Let $p_{sr}$ be the probability that the sth order statistic is judged to have rank r. For the same probability of judging, we have

$$P_{[r]}(y) = \sum_{s=1}^{k} p_{sr} P_{(s)}(y), \qquad (4)$$

where $\sum_{r=1}^{k} p_{sr} = 1$, and obviously $\frac{1}{k}\sum_{s=1}^{k}\left(\sum_{r=1}^{k} p_{sr}\right) P_{(s)}(y) = P(y)$.
Note that (4) holds under the assumption that the value of the sth order statistic and the event that it receives judgment rank r are independent; see Presnell and Bohn (1999). A complete discussion of imperfect ranking and of tests for imperfect ranking can be found in Amiri, Modarres, and Zwanzig (2017) and references therein.
To obtain a total of n = km units, the whole procedure is repeated m times.
Let $Y_{(r)j}$ denote the measurement on the jth measured unit with rank r. This results in an RSS of size n from the underlying population, written as $\{Y_{(r)j};\ r = 1, \ldots, k,\ j = 1, \ldots, m\}$. It is worth mentioning that, in RSS designs, $Y_{(1)j}, \ldots, Y_{(k)j}$ are independent order statistics (as they are obtained from independent sets), and each $Y_{(r)j}$ provides information about a different stratum of the population.
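The balanced RSS procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration, assuming perfect ranking (sorting on the true values) and equal-probability sets; the function names `ranked_set_sample` and `rss_mean` and the population of integers are hypothetical choices made for this example.

```python
import random

def ranked_set_sample(population, k, m, rng):
    """Balanced RSS of size n = k*m under perfect ranking.

    For each cycle and each rank r, a fresh simple random sample of k
    units is drawn, sorted (perfect ranking), and only the unit with
    rank r is measured. Returns m cycles, each a list of k values.
    """
    cycles = []
    for _ in range(m):
        cycle = []
        for r in range(k):
            ranked = sorted(rng.sample(population, k))  # one SRS set, ranked
            cycle.append(ranked[r])                      # keep the (r+1)-th smallest
        cycles.append(cycle)
    return cycles

def rss_mean(cycles):
    """Average of all measured units over all cycles."""
    measured = [y for cycle in cycles for y in cycle]
    return sum(measured) / len(measured)

rng = random.Random(1)
population = [float(v) for v in range(1, 101)]  # hypothetical population 1..100
sample = ranked_set_sample(population, k=5, m=4, rng=rng)
estimate = rss_mean(sample)
print(f"RSS estimate of the mean: {estimate:.3f}")
```

Each measured value comes from its own independent set, which is what makes the k order statistics within a cycle independent.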

Probability proportional to size sampling
Probability proportional to size sampling is a method of sample selection in which the units are selected with probability proportional to a given measure of size related to the characteristic under study. It is also known as unequal probability sampling. Here sampling with replacement is considered, as explained in Cochran (1977); PPS with replacement was proposed in Hansen and Hurwitz (1943) to estimate the population total.
This assumption helps in developing the theory, and it is compatible with the Consumer Prices Index (CPI) setting because the data are generated continuously and the probabilities stay constant. Here the population mean is of interest.
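To draw inference, let us consider a finite population $\{y_1, \ldots, y_N\}$ in which the probability of selecting unit j is $\pi_j = P(Y = y_j)$. Let us consider the following probability mass function for PPS:

$$P(y) = \sum_{j=1}^{N} \pi_j\, I(y = y_j), \qquad \sum_{j=1}^{N} \pi_j = 1,$$

where N is the size of the population. Consider a sample $\{Y_1, \ldots, Y_n\}$ of size n drawn with replacement from the finite population $\{y_1, \ldots, y_N\}$. Then the estimate of the mean is

$$\bar{y}_{pps} = \frac{1}{nN}\sum_{i=1}^{n} \frac{Y_i}{\pi^*_i}, \qquad (5)$$

where $\pi^*_i$ denotes the selection probability of the ith sampled unit. This estimate is unbiased, with variance

$$V(\bar{y}_{pps}) = \frac{1}{nN^2}\sum_{i=1}^{N} \pi_i\left(\frac{y_i}{\pi_i} - N\eta\right)^2, \qquad (6)$$

where $\eta = \frac{1}{N}\sum_{i=1}^{N} y_i$. Equations (5) and (6) can easily be obtained using the technique given in Cochran (1977), p. 253. Here a direct approach is used:

$$E(\bar{y}_{pps}) = \frac{1}{nN}\sum_{i=1}^{n} E\left(\frac{Y_i}{\pi^*_i}\right) = \frac{1}{N}\sum_{j=1}^{N} \frac{y_j}{\pi_j}\,\pi_j = \eta.$$

The variance is

$$V(\bar{y}_{pps}) = \frac{1}{nN^2}\, V\left(\frac{Y}{\pi^*}\right) = \frac{1}{n}\left(\frac{1}{N^2}\sum_{j=1}^{N} \frac{y_j^2}{\pi_j} - \eta^2\right).$$

The estimate of the variance from a sample of size n is

$$\hat{V}(\bar{y}_{pps}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{Y_i}{N\pi^*_i} - \bar{y}_{pps}\right)^2. \qquad (7)$$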
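A small Python sketch may help make PPS with replacement concrete. It assumes the Hansen and Hurwitz form of the estimator, in which each sampled value is divided by N times its selection probability, together with the usual with-replacement variance estimate; the function names and the four-unit population are hypothetical.

```python
import random

def pps_draw(probs, n, rng):
    """Draw n unit indices with replacement; unit i has probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=n)

def hansen_hurwitz_mean(values, probs, idx):
    """Return (mean estimate, variance estimate) from the sampled indices."""
    N, n = len(values), len(idx)
    z = [values[i] / (N * probs[i]) for i in idx]   # weighted observations
    mean = sum(z) / n
    var_hat = sum((zi - mean) ** 2 for zi in z) / (n * (n - 1))
    return mean, var_hat

rng = random.Random(7)
values = [2.0, 4.0, 6.0, 8.0]                 # hypothetical population, mean 5

# Probabilities exactly proportional to y: every weighted observation equals
# the population mean, so the estimate has no sampling variability.
size_probs = [v / sum(values) for v in values]
mean_exact, var_exact = hansen_hurwitz_mean(
    values, size_probs, pps_draw(size_probs, 6, rng))

# Equal probabilities: the estimator reduces to the ordinary sample mean.
equal_probs = [0.25] * 4
mean_eq, var_eq = hansen_hurwitz_mean(
    values, equal_probs, pps_draw(equal_probs, 6, rng))

print(mean_exact, var_exact)
print(mean_eq, var_eq)
```

The first case shows why PPS can be very efficient when the size measure is close to proportional to the variable of interest; the second shows that it offers no gain over SRS with replacement when the probabilities carry no information.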

Using RSS to achieve PPS
To collect n observations by combining RSS and PPS, denoted RPPS hereafter, we first obtain k sampling units selected with PPS; the unit with rank 1 is identified and taken for measurement, $Y_{(1)1}$, and the remaining units are disregarded. The procedure is repeated m times to obtain m iid units with rank 1. Next, another k units are drawn with PPS and the unit with rank 2 is measured, $Y_{(2)1}$. The procedure continues until m units with rank k have been collected, giving n = km observations in total. The sample is then

$$\{Y_{(r)j};\ r = 1, \ldots, k,\ j = 1, \ldots, m\}.$$

The sample mean is estimated by

$$\bar{y}_{rpps} = \frac{1}{nN}\sum_{r=1}^{k}\sum_{j=1}^{m} \frac{Y_{(r)j}}{\pi_{rj}}, \qquad (9)$$

where $\pi_{rj}$ is the probability of selecting from the population the unit that appears as the jth measurement in cycle r, $\pi_{rj} = P(Y_{(r)j} = y_j)$. Using the definition, the expected value of the rth order statistic in a finite population is

$$E(Y_{(r)}) = \sum_{j=1}^{N} y_j P_{(r)}(y_j).$$

Using the following proposition, we prove that RPPS with m = 1 provides an unbiased estimate with a lower variance than standard balanced PPS.
Proposition 1. Suppose the samples are selected with unequal probabilities, where $\sum_{i=1}^{N} y_i^2/\pi_i < \infty$, from a finite population, and $\{Y_{(1)}, \ldots, Y_{(k)}\}$ are collected according to the proposed algorithm under perfect ranking from this population. Then $\bar{y}_{rpps}$ is unbiased for the population mean, and its variance does not exceed that of $\bar{y}_{pps}$.

Proof. Recall the mean of these observations, $\bar{y}_{rpps}$. Taking its expected value shows that RPPS provides an unbiased estimate of the population mean, and the variance is obtained by a direct calculation.

Proposition 1 can be used to generalize the result to RSS with n = mk. According to Proposition 1, the variance of $\bar{y}_{rpps}$ has two parts. An estimate of the first part is given in (7), but an unbiased estimate of the second part for m = 1 is not trivial; see Zamanzade and Vock (2015) for a discussion of the variance. A practical alternative is the bootstrap method; the bootstrap for RSS is discussed in Amiri, Jafari Jozani, and Modarres (2014). The bootstrap is a standard tool for statistical inference. In this study, the non-parametric bootstrap is used to approximate the population distribution function.
The bootstrap can be used to obtain the sampling distribution of a statistic of interest. It allows estimation of the standard error of any well-defined statistic and enables us to draw inferences when the exact or asymptotic distribution of the statistic is unavailable. To estimate the variance of $\bar{y}_{rpps}$, the bootstrap method can be used; see Algorithm 1.
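The RPPS procedure and a bootstrap variance estimate can be sketched as follows. This is not a transcription of Algorithm 1: it is a minimal pooled bootstrap, assuming perfect ranking on y, a Hansen and Hurwitz style weight y / (N π) for each measured unit, and a pool in which every measured unit receives probability 1/(km). All function names, the population, and the size measure are hypothetical.

```python
import random
import statistics

def rpps_sample(values, probs, k, m, rng):
    """Balanced RPPS sample: for each rank r, m PPS sets of size k are drawn,
    ranked on y (perfect ranking), and only the unit with rank r is kept."""
    units = range(len(values))
    picked = []
    for r in range(k):
        for j in range(m):
            ranked = sorted(rng.choices(units, weights=probs, k=k),
                            key=lambda i: values[i])
            picked.append(ranked[r])
    return picked  # indices of the n = k*m measured units

def weighted_mean(values, probs, idx):
    """Mean of y / (N * pi) over the measured units (assumed weighting)."""
    N = len(values)
    return sum(values[i] / (N * probs[i]) for i in idx) / len(idx)

def bootstrap_variance(values, probs, picked, k, m, B, rng):
    """Pooled bootstrap: ranked sets are re-drawn with equal probability
    from the measured units and the estimate is recomputed B times."""
    replicates = []
    for b in range(B):
        boot = []
        for r in range(k):
            for j in range(m):
                ranked = sorted(rng.choices(picked, k=k),
                                key=lambda i: values[i])
                boot.append(ranked[r])
        replicates.append(weighted_mean(values, probs, boot))
    return statistics.variance(replicates)

rng = random.Random(3)
values = [float(v) for v in range(1, 31)]            # hypothetical population
weights = [1 + (i % 3) for i in range(len(values))]  # hypothetical size measure
probs = [w / sum(weights) for w in weights]
k, m = 3, 2
picked = rpps_sample(values, probs, k, m, rng)
estimate = weighted_mean(values, probs, picked)
var_boot = bootstrap_variance(values, probs, picked, k, m, B=200, rng=rng)
print(f"estimate = {estimate:.3f}, bootstrap variance = {var_boot:.3f}")
```

With only km = 6 measured units the bootstrap variance is itself quite noisy; larger designs give a more stable figure.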

Numerical evaluation
In this section, we first study the performance of the proposed method for estimating the population mean under the proposed designs. We then apply our method to a real data set, where we also study the performance of the proposed rank-based technique.

Simulation
Algorithm 1, step 2. Combine all the observations to form $Z^\diamond = \{Z_1, \ldots, Z_k\}$ and assign the probability 1/km to each element of $Z^\diamond$. Here $\bar{Z}$ is the average of the $Z^*_b$.

Monte Carlo simulations were used to investigate the finite-sample properties of the proposed RPPS algorithm. We examine desirable features such as unbiasedness and smaller variance. Balanced RSS designs with k = 5 and different sizes are used to study the performance of the methods discussed. The design $D_i = (i, i, i, i, i)$ denotes RSS data in which each order statistic is gathered i times.
Let us consider an artificial finite population P. To study the estimation of the mean using the proposed methods, a PPS sample of size $n_i$, i = 1, ..., 5, is collected from P and the sample mean is calculated via (5).
For its competitor, RPPS, a sample of size $n_i$ with the ith design is collected from P by the procedure discussed above, and the sample mean is calculated via (9). The whole procedure is repeated 10,000 times, and the resulting means and variances (given in parentheses) are reported in Table 2. The RPPS estimate of the mean has lower variance for the different designs and probabilities, as expected from the theory in Section 4.
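A Monte Carlo comparison of this kind can be sketched in Python. The population below is hypothetical (the integers 1 to 50 with a cyclic size measure), not the artificial population used for Table 2, and the sketch uses 2,000 replications with k = 5 and m = 1 rather than 10,000. Both estimators are assumed to weight each measured unit by y / (N π).

```python
import random
import statistics

def hh_mean(values, probs, idx):
    """Mean of y / (N * pi) over the sampled unit indices."""
    N = len(values)
    return sum(values[i] / (N * probs[i]) for i in idx) / len(idx)

def pps_indices(probs, n, rng):
    """n draws with replacement, unit i having probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=n)

def rpps_indices(values, probs, k, m, rng):
    """For each rank r, draw m PPS sets of size k and keep the rank-r unit."""
    idx = []
    for r in range(k):
        for j in range(m):
            ranked = sorted(pps_indices(probs, k, rng), key=lambda i: values[i])
            idx.append(ranked[r])
    return idx

rng = random.Random(2024)
values = [float(v) for v in range(1, 51)]     # hypothetical population
weights = [1 + (i % 3) for i in range(50)]    # hypothetical size measure
probs = [w / sum(weights) for w in weights]
eta = sum(values) / len(values)               # true population mean
k, m, reps = 5, 1, 2000

pps_est = [hh_mean(values, probs, pps_indices(probs, k * m, rng))
           for _ in range(reps)]
rpps_est = [hh_mean(values, probs, rpps_indices(values, probs, k, m, rng))
            for _ in range(reps)]

pps_mean, pps_var = statistics.mean(pps_est), statistics.variance(pps_est)
rpps_mean, rpps_var = statistics.mean(rpps_est), statistics.variance(rpps_est)
print(f"true mean {eta:.3f}")
print(f"PPS : mean {pps_mean:.3f}, variance {pps_var:.3f}")
print(f"RPPS: mean {rpps_mean:.3f}, variance {rpps_var:.3f}")
```

Both samples contain the same number of measured units (km), so the comparison isolates the effect of ranking; RPPS does, however, draw k times as many units for ranking purposes.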
Studying the behavior of the proposed methods under imperfect ranking is important because an imperfect ranking process often causes a loss of efficiency. Statistical tests of the perfectness of rankings have received attention in the RSS literature; see Frey, Ozturk, and Deshpande (2007), Vock and Balakrishnan (2011), and Amiri et al. (2017), among others. Several mechanisms have been proposed to produce imperfect RSS samples; see Amiri et al. (2017) and references therein. We use the Fraction of Neighbors technique to generate imperfect rankings in the data. Denoting ranks obtained under imperfect ranking by [.], we assume

$$F_{[r]} = \frac{\lambda}{2} F_{(r-1)} + (1 - \lambda) F_{(r)} + \frac{\lambda}{2} F_{(r+1)},$$

where λ is the fraction of incorrectly chosen statistics. Here λ = 1/3 is used, and for the extreme judgment order statistics $F_{(0)} := F_{(1)}$ and $F_{(k+1)} := F_{(k)}$. Perfect rankings are obtained by setting λ = 0. Table 3 reports the estimates of the mean and variance under RSS with imperfect ranking. Comparing Tables 2 and 3 shows that applying the RSS procedure to samples collected using PPS gives better precision than using PPS alone. Note that RPPS may not always improve on SRS, as PPS does not always improve on SRS in precision.

Mean (variance): 50.273 (110.395), 50.323 (72.200), 50.230 (54.003), 50.197 (44.303)
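The Fraction of Neighbors mechanism can be simulated directly. The sketch below assumes the usual reading of the model: the unit judged to have rank r is the true (r-1)th or (r+1)th order statistic with probability λ/2 each, and the true rth order statistic otherwise, with ranks clamped to 1..k at the extremes. The function names are hypothetical.

```python
import random

def measured_rank(r, k, lam, rng):
    """True rank (1..k) of the unit measured when judgment rank r is requested."""
    u = rng.random()
    if u < lam / 2:
        s = r - 1            # lower neighbor chosen by mistake
    elif u < lam:
        s = r + 1            # upper neighbor chosen by mistake
    else:
        s = r                # ranking was correct
    return min(max(s, 1), k)  # clamp: rank 0 counts as 1, rank k+1 as k

def imperfect_pick(values_in_set, r, lam, rng):
    """Value measured from one set of k units under imperfect ranking."""
    ranked = sorted(values_in_set)
    return ranked[measured_rank(r, len(ranked), lam, rng) - 1]

rng = random.Random(11)
draws = [measured_rank(3, 5, 1 / 3, rng) for _ in range(6000)]
share_correct = draws.count(3) / len(draws)
print(f"share of correctly ranked units for r = 3: {share_correct:.3f}")
```

Setting `lam=0.0` reproduces perfect ranking, so the same sampling code can serve both Table 2 style and Table 3 style experiments.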

Experience with real data
In this section, we compare the methods on real data. To this end, we consider the data set supplied by Taylor Nelson Sofres (TNS, now part of the Kantar World Panel)¹, which contains 60 million transactions from a sample panel of 35,000 households for about 400,000 products. The households were chosen to cover all ages, genders, and social classes and to represent every region of the UK. Householders were required to scan their shopping purchases in their own homes. The main data set contains the details of the transactions, including the bar codes, household numbers, product codes, shop codes, product descriptions, market categories, year/month/week/day of transaction, expenditure, and the number of packs bought. For the meat data, for example, there are around eighteen product attributes. To explore the theory using real data, we consider the subset of these data that lists meat sold in London in December 2005, which includes 5,553 observations. For the purposes of the study, the meat's attributes are categorized as Frozen meat, Cooked Ham, Total Fresh Foods, prepackaged Fresh (meat, veg, pastry), non organic, pork, and so on. The summary statistics of price (per pack) are given in Table 4. Table 4 and the histogram in Figure 1 show that the data are positively skewed.
We used this data set as the population, where $\frac{1}{N}\sum_{i=1}^{N} y_i = 2.3909$. To achieve PPS sampling, we first considered equal probabilities, i.e., $\pi_{V,i} = \frac{1}{N}$, where $f_k$, k = 1, ..., 7, is the frequency of each category and the $\pi_{V,i}$ sum to one. To explore the proposed methods under unequal probabilities, we consider probabilities $\pi_{VI,i}$ and $\pi_{VII,i}$ that depend on the frequency of the category j to which item i belongs. The motivation behind these choices is that the probability of an element is expressed in terms of frequency. Table 5 shows the frequencies and the probabilities assigned to the observations. The settings that combine the elements with the proposed probabilities $\pi_V$, $\pi_{VI}$, $\pi_{VII}$ are denoted V, VI, VII. We also consider the situation in which the inclusion probabilities are inversely proportional to the group frequencies (suggested by a referee): $\pi_{VIII,i} \propto 1/f_j$ when unit i belongs to category j. The estimates of the mean (variance) under perfect ranking are given in Table 6. For the given probabilities, RPPS leads to an unbiased estimate of the mean with a smaller variance. Besides yielding a sample with lower variance, RSS also has the advantage of reducing the cost of data collection when, for example, measuring the sampled units is time-consuming and expensive while the ranking variable is cheap. To sample prices for the actual CPI, price collectors physically visit the shops, which is inherently expensive. However, prices from the last period are highly correlated with current prices and could serve as a useful ranking variable. In this example, we used the total shopping expenditure as such a variable. The estimates of the mean (variance) under imperfect ranking are also given in Table 6. To generate the imperfect ranking, we rank on a concomitant variable, the total shopping expenditure. The results show that RPPS leads to a better estimate than PPS. However, as expected, comparing the variance of RPPS under imperfect and perfect ranking reveals that the variance increases under imperfect ranking.
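The two selection schemes that are fully specified in the text, equal probabilities and probabilities inversely proportional to the category frequency, can be computed as follows. This is a small sketch with hypothetical category labels and function names; the frequency-based schemes VI and VII are not reproduced here.

```python
from collections import Counter

def equal_probs(categories):
    """Scheme V: every unit has probability 1/N."""
    n = len(categories)
    return [1.0 / n] * n

def inverse_frequency_probs(categories):
    """Scheme VIII: unit i in category j gets probability proportional
    to 1/f_j, normalized so that all probabilities sum to one."""
    freq = Counter(categories)
    raw = [1.0 / freq[c] for c in categories]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical category labels for six units
labels = ["pork", "pork", "pork", "frozen", "frozen", "ham"]
p5 = equal_probs(labels)
p8 = inverse_frequency_probs(labels)
for label, prob in zip(labels, p8):
    print(f"{label:7s} {prob:.4f}")
```

Under the inverse-frequency scheme every category receives the same total probability, so units from rare categories are selected more often than under equal probabilities.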

Conclusions
A considerable amount of research has been done to develop RSS. Rank-based sampling techniques are designed to use additional information from inexpensive and easily obtained sources to collect a more representative sample than can be obtained from simple random sampling. Owing to the structure of RSS, researchers can obtain estimates with lower variability, which helps in drawing better inference.
This paper defined a ranked sampling procedure for PPS sampling; we explored the RSS and PPS approaches and considered the possibility of achieving the latter using the former (denoted by RPPS). The properties of these sampling methods were studied theoretically, and we proved that RPPS outperforms PPS, giving an unbiased estimate with lower variance. The Monte Carlo simulations under perfect and imperfect ranking confirmed the theoretical results. In the settings we examined, RPPS was superior to PPS, with a reduction in variance of up to 50% in some cases. Taking the TNS database as the population of interest, we also examined the two sampling methods with real data, which were fairly skewed. The results indicated that RPPS provides an unbiased estimate with lower variance and can be considered an efficient sampling technique.
Several extensions are possible. We considered a finite population sampled with replacement; the method can be extended to sampling without replacement and to unbalanced RSS with missing data, which we leave for future research.

Figure 1: The histogram of the price of meat (per pack)

Table 2: Simulation of mean and variance for the proposed approach under perfect ranking

Table 3: Simulation of mean and variance for the proposed approach under imperfect ranking

Table 4: Summary statistics for the price of meat (per pack)

¹ Office for National Statistics (ONS) was provided the real data

Table 5: The frequency of meat's categories and the assigned probabilities
* Meat, Veg, and Pastry

Table 6: Study of mean and variance for the proposed approach on the real data.