What is the general population. General and sample population. Two-sample t-test for independent samples

The set of social objects, phenomena and processes that are the subject of a sociological study forms the general population. Any general population is characterized by an explicitly specified attribute (or set of attributes) whose value always makes it possible to determine unambiguously whether a given object belongs to the general population.

The part of the general population whose objects are actually observed is called the sample population.

In other words, if the general population includes all the units, without exception, that make up the object of research, then the sample population is a specially selected part of it. The sample population is designed so that, with a minimum number of objects under study, it represents the entire general population with the required degree of reliability.

Sampling units are the elements of the general population that serve as the units of count in the various selection procedures that form the sample.

Observation units are the elements of the formed sample, which are directly exposed to the study.

The sampling unit and the observation unit are social objects with characteristics that are essential for the subject of a particular sociological study. They may coincide (in simple selection schemes) or differ (in complex, combined selection schemes). Sampling units can be individuals as well as entire collectives or groups (for example, when every member of a selected group is surveyed).

If the observation unit coincides with the sampling unit, single-stage (simple) sampling is used; if they differ, multi-stage (complex) sampling is used.

The sample size depends on a number of factors:

· the purpose and objectives of the study,

· the degree of homogeneity of the general population,

· the chosen confidence probability,

· the required accuracy of the results (the acceptable error of representativeness).

Table 4 shows the ratio of the general population and the sample size.

Table 4. The ratio of the volumes of the general and sample populations.

The table reflects many years of experience of sociologists. It is often used when there are no data on the general population, which makes it impossible to apply a sample size formula.
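The text does not state which formula is meant. As an illustration only, the sketch below uses a commonly taught sample size formula for estimating a proportion (often attributed to Cochran), with an optional correction for a finite general population. The function name `sample_size` and the default inputs (z = 1.96 for a 95% confidence probability, p = 0.5 as the least homogeneous case, e = 0.05 as the acceptable error) are assumptions, not values taken from the source.

```python
import math

def sample_size(z, p, e, population_size=None):
    """Illustrative sample size for estimating a proportion.

    z: quantile for the chosen confidence probability (1.96 for about 95%)
    p: assumed share of the attribute (0.5 is the most cautious choice)
    e: acceptable error of representativeness, as a fraction
    population_size: size N of a finite general population, if known
    """
    n0 = z ** 2 * p * (1 - p) / e ** 2
    if population_size is not None:
        # Finite population correction: a smaller N needs fewer units.
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)

n_unlimited = sample_size(1.96, 0.5, 0.05)
n_for_10000 = sample_size(1.96, 0.5, 0.05, population_size=10_000)
print(n_unlimited, n_for_10000)
```

The sketch shows how the three factors listed above (confidence probability, homogeneity, accuracy) enter the calculation; it is not a substitute for the table.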

Determining the size of the sample population is not enough; the type of sample must also be chosen.

Samples are either probabilistic or targeted.

The model of probabilistic (random) sampling rests on the concept of probability, which is widely used in many social sciences. In the most general case, the probability of an expected event is the ratio of the number of expected (favorable) outcomes to the number of all possible outcomes. The total number of events should be large enough (statistically significant). In addition, the units must be selected under conditions of equal probability: every element of the general population must have the same chance of being included in the sample. This is possible when the elements of the general population are uniformly distributed.

There are various methods of probabilistic (random) sampling:

· simple random selection proper,

· random selection without replacement,

· random selection with replacement,

· mechanical sampling (for example, every tenth element of the general population is included in the sample).
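The selection methods listed above can be sketched as follows. This is a minimal illustration, assuming the general population is available as a numbered list; the function names and the population of 1000 numbered units are hypothetical.

```python
import random

def sample_without_replacement(population, n, rng):
    """Random selection without replacement: no unit can be drawn twice."""
    return rng.sample(population, n)

def sample_with_replacement(population, n, rng):
    """Random selection with replacement: a unit may be drawn repeatedly."""
    return rng.choices(population, k=n)

def mechanical_sample(population, step, rng):
    """Mechanical sampling: every step-th unit, starting at a random offset."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(42)             # fixed seed so the run is reproducible
population = list(range(1, 1001))   # hypothetical list of 1000 numbered units

srs = sample_without_replacement(population, 100, rng)
wr = sample_with_replacement(population, 100, rng)
mech = mechanical_sample(population, 10, rng)
print(len(srs), len(wr), len(mech))
```

Note that mechanical sampling gives every unit the same inclusion probability only if the list order is unrelated to the characteristic being studied.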

A fairly accurate method of selection that is often used is serial sampling. Its essence is to divide the general population into homogeneous parts (series) according to a given attribute. Respondents are then selected within each series according to a given criterion.

In addition, there is the nested (cluster) sampling method. A "nest" is a group of objects made up of a number of elements. The units of research are not individual respondents but groups or collectives.

Along with probabilistic sampling, sociological research also uses targeted sampling. Targeted sampling does not rely on probability theory; instead it uses one of the following methods:

· spontaneous sampling,

· the main array method,

· quota sampling.

Spontaneous sampling is most often used in journalism. An example of spontaneous sampling is a postal survey. The reliability and quality of the information obtained in this way are very low, and the results apply only to the people actually surveyed.

The main array method is used as a "sounding" during a pilot study, with 60-70% of the general population being studied.

Quota sampling can be considered the most accurate of the targeted sampling methods. However, it can be applied only if statistical data on the general population are available. The characteristics of the general population act as quotas, and their individual numerical values act as the quota parameters. In a quota sample, respondents are selected purposefully in accordance with the quota parameters. No more than four characteristics can act as quotas, for example gender, age, work experience and educational level.

Determining the size and type of the sample is not sufficient to justify extending the research findings to the entire general population. Of all the possible sample sets, the most accurate one must be chosen. The ability of a sample to reflect, or model, the significant properties of the general population is the representativeness of the sample.

The deviation of the results of a sample study from the essential characteristics of the general population is called the error of representativeness.

Errors of representativeness can be random or systematic. Random errors of representativeness are probabilistic in nature and, when the measurement is repeated, vary according to probabilistic laws. Systematic errors of representativeness are bias errors that distort the accuracy of the sample. They arise from miscalculations at the sample design stage, from a lack of information about the social object, or from incorrect sampling. Bias can be unintentional (for example, a miscalculation at the design stage) or deliberate (due to ideological, economic or other factors).

When studying the general population, the sampling method greatly simplifies the researcher's task; however, the difficulties it may involve must be kept in mind.
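The difference between random and systematic errors can be illustrated with a small simulation. This is only a sketch under stated assumptions: the general population is an artificial set of 10,000 normally distributed values, and the "faulty" sampling frame that omits all values of 45 or below is a made-up example of a design miscalculation.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible

# Artificial general population (an assumption made for this illustration).
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def sample_mean_errors(frame, n, repeats):
    """Deviation of each sample mean from the true population mean."""
    return [statistics.fmean(rng.sample(frame, n)) - true_mean
            for _ in range(repeats)]

# Random error: correct frame, deviations scatter around zero.
random_errors = sample_mean_errors(population, 100, 200)

# Systematic error: a faulty frame that leaves out all low values.
biased_frame = [x for x in population if x > 45]
systematic_errors = sample_mean_errors(biased_frame, 100, 200)

print(round(statistics.fmean(random_errors), 2))
print(round(statistics.fmean(systematic_errors), 2))
```

Averaged over many repetitions, the random errors largely cancel out, while the bias caused by the faulty frame persists however many samples are drawn.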

As a result of studying the material in Chapter 2, the student should:

know

  • basic concepts of the general and sample populations;
  • estimation methods, types and properties of estimates of parameters of the general population;
  • basic methods of statistical testing of hypotheses regarding the parameters of one-dimensional and multidimensional general populations;

be able to

  • find, based on sample data, estimates of the parameters of one-dimensional and multidimensional general populations;
  • analyze the properties of parameters;
  • test hypotheses regarding the parameters and type of distribution of the general population;
  • compare the parameters of several general populations;

have command of

  • skills of statistical estimation of parameters of one-dimensional and multidimensional general populations;
  • the skills of testing hypotheses regarding the parameters and type of distribution of the general population when conducting socio-economic research using analytical software.

Population distribution

Probabilistic-statistical methods of data analysis assume that the laws governing the investigated variable (random variable) are completely determined by the complex of conditions for its observation. Mathematically, these patterns are set by the corresponding probability distribution law. However, when conducting statistical research, the concept of the general population is more convenient.

Thus, the mathematical concepts "general population", "random variable" and "probability distribution law" corresponding to a given set of conditions can be considered synonyms in a certain sense.

The general population is the set of all conceivable observations that could be made under a given set of conditions.

Since the definition refers to conceivable observations (or objects), the general population is an abstract concept and should not be confused with the real populations subject to statistical research. Thus, even after surveying every enterprise of a sub-sector, we can regard them as representatives of a hypothetically possible broader set of enterprises that could operate under the same set of conditions.

The general population can be either finite or infinite. A finite population occurs, for example, in a survey of family budgets, when a sample is taken from the families actually living in the country and the income and expenditure of the selected families are then monitored. An infinite general population occurs, for example, in scientific research when we are interested in the average result of a large number of experiments.

In the simplest case, the general population is a one-dimensional random variable $X$ with distribution function $F(x) = P(X < x)$, which gives the probability that $X$ takes a value less than a fixed real number $x$.

In the general case, the general populations under study include several features (usually more than two). The set of features is denoted by the vector $X = (X_1, X_2, \ldots, X_k)^T$, which has $k$ components, each characterizing the corresponding feature. Multivariate statistical methods are used to analyze the vector $X$.

Thus, the object of study in multivariate analysis is a random vector $X$, also called a random point in $k$-dimensional Euclidean space, a system of $k$ random (one-dimensional) variables, or a $k$-dimensional random variable.

The distribution function of a random vector $X$ is the deterministic non-negative quantity $F(x)$ defined by the formula

$$F(x) = F(x_1, x_2, \ldots, x_k) = P(X_1 < x_1, X_2 < x_2, \ldots, X_k < x_k),$$

where $x = (x_1, x_2, \ldots, x_k)^T$ is a $k$-dimensional vector of fixed real numbers.

Distinguish:

  • continuous k-dimensional random variables, all components of which are continuous (one-dimensional) random variables;
  • discrete k-dimensional random variables, all components of which are discrete random variables;
  • mixed k-dimensional random variables, among the components of which there are both discrete and continuous random variables.

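The distribution function $F(x)$ of a continuous $k$-dimensional random variable is continuous by definition.

The probability density of a continuous $k$-dimensional random variable satisfies the condition

$$f(x_1, \ldots, x_k) = \frac{\partial^k F(x_1, \ldots, x_k)}{\partial x_1 \cdots \partial x_k}.$$

The density $f(x)$ has the following properties:

The area (for $k > 1$, the volume) bounded above by the density plot is always equal to one:

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_k)\, dx_1 \cdots dx_k = 1,$$

where $k$ is the total number (multiplicity) of integrals;

The probability that the point $(X_1, \ldots, X_k)$ falls in some region $G$ is equal to

$$P(X \in G) = \int \cdots \int_G f(x_1, \ldots, x_k)\, dx_1 \cdots dx_k.$$

It follows from the definition of the density that if we integrate the joint density of two variables $X_1, X_2$ over one of them, for example $x_1$, within infinite limits, we obtain the probability density of the other variable:

$$\int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1 = f_2(x_2).$$

Similarly, we have

$$\int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2 = f_1(x_1).$$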

The probability densities and distribution functions of subsystems of a system of $k$ random variables are called partial, or marginal, distributions.

A conditional distribution of a random vector $X$ is the distribution of a subsystem of its components, provided that the remaining components are fixed. The fixed components are separated from the non-fixed ones by a slash.

For a continuous random variable, for example, the following formula gives the conditional density of the two-dimensional random variable $(X_1, X_2)$, a subsystem of the system $(X_1, X_2, X_3, X_4, X_5)$, provided that the last three components are fixed:

$$f(x_1, x_2 / x_3, x_4, x_5) = \frac{f(x_1, x_2, x_3, x_4, x_5)}{f_{3,4,5}(x_3, x_4, x_5)},$$

where $f_{3,4,5}$ is the marginal density of the fixed components.

A subsystem of components $(X_1, \ldots, X_m)$ and the complementary subsystem $(X_{m+1}, \ldots, X_k)$ of the vector $X$ are called independent (stochastically, probabilistically) if the equality

$$f(x_1, \ldots, x_k) = f_{1,\ldots,m}(x_1, \ldots, x_m)\, f_{m+1,\ldots,k}(x_{m+1}, \ldots, x_k)$$

holds.

In particular, the components of the vector $X$ are called independent if

$$f(x_1, \ldots, x_k) = f_1(x_1)\, f_2(x_2) \cdots f_k(x_k).$$

In the case of independence, analogous formulas hold for products of the densities or probabilities of the marginal distributions, and the conditional distributions coincide with the corresponding marginal distributions [23].
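For discrete random variables the same relations hold with probabilities in place of densities and sums in place of integrals. The sketch below is a discrete analogue only, using two small made-up joint distributions of a pair $(X_1, X_2)$; the function names are illustrative.

```python
from fractions import Fraction as F

# Hypothetical joint distributions P(X1 = a, X2 = b), keyed by (a, b).
independent = {(0, 0): F(1, 8), (0, 1): F(3, 8),
               (1, 0): F(1, 8), (1, 1): F(3, 8)}
dependent = {(0, 0): F(1, 2), (1, 1): F(1, 2)}

def marginal(joint, axis):
    """Marginal distribution of one component: sum over the other one."""
    result = {}
    for point, p in joint.items():
        result[point[axis]] = result.get(point[axis], 0) + p
    return result

def conditional_x1_given_x2(joint, x2):
    """Conditional distribution of X1 with X2 fixed: joint / marginal."""
    p_x2 = marginal(joint, 1)[x2]
    return {a: p / p_x2 for (a, b), p in joint.items() if b == x2}

def is_independent(joint):
    """Independence: the joint equals the product of the marginals."""
    m1, m2 = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((a, b), 0) == m1[a] * m2[b]
               for a in m1 for b in m2)

print(marginal(independent, 0))
print(conditional_x1_given_x2(independent, 1))
print(is_independent(independent), is_independent(dependent))
```

Exact fractions are used so that the equality check for independence is not affected by floating-point rounding.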

General population - the set of people about whom the sociologist seeks to obtain information in a study. The broader the research topic, the broader the general population.

Sample population - a reduced model of the general population: the people to whom the sociologist distributes questionnaires, who are called respondents and who are the actual object of the sociological study.

Who belongs to the general population is determined by the objectives of the study, while who is included in the sample is decided by mathematical methods. If a sociologist intends to look at the war in Afghanistan through the eyes of its participants, the general population will include all veterans of that war, but only a small part of them, the sample population, will have to be interviewed. For the sample to reflect the general population accurately, the sociologist follows the rule that every veteran, regardless of place of residence, place of work, state of health and other circumstances, must have the same probability of being included in the sample.

Once the sociologist has decided whom to interview, the sampling frame has been determined. The next question is the type of sample.

Samples are divided into three large classes:

a) complete (censuses, referendums), in which all units of the general population are surveyed;

b) random;

c) non-random.

Random and non-random samples are, in turn, subdivided into several types.

Random samples include:

1) probabilistic;

2) systematic;

3) zoned (stratified);

4) cluster (nested).

Non-random samples include:

1) "spontaneous";

2) quota;

3) the "main array" method.

A complete and accurate list of the units from which the sample is drawn forms the sampling frame. The items to be selected are called sampling units. The sampling units may coincide with the units of observation, a unit of observation being an element of the general population from which information is collected directly. Usually the unit of observation is an individual person. Selection from a list is best done by numbering the units and using a table of random numbers, although a quasi-random method is often used, in which every n-th element is taken from a simple list.

While the sampling frame is a list of sampling units, the sample structure implies grouping them by some important characteristics, for example the distribution of individuals by profession, qualifications, sex or age. If the general population consists, say, of 30% young people, 50% middle-aged people and 20% elderly people, then the same percentages of the three age groups should be observed in the sample. Social class, gender, nationality and so on can be added to age. For each characteristic, the percentage proportions in the general and sample populations are established. Thus, the sample structure is the set of percentage proportions of the characteristics of the object on the basis of which the sample is compiled.

If the type of sample tells us how people get into the sample population, the sample size tells us how many of them are in it.

Sample size - the number of units in the sample. Since the sample population is a part of the general population selected by special methods, its size is always smaller than that of the general population. It is therefore important that the part does not distort the picture of the whole, that is, that it represents it.

The reliability of the data depends not so much on the quantitative characteristic of the sample (its size) as on a qualitative characteristic of the general population: the degree of its homogeneity. The discrepancy between the general and sample populations is called the error of representativeness; a deviation of up to 5% is considered permissible.
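The age-group example above can be turned into numbers of respondents as follows. This is a sketch under stated assumptions: the function name `proportional_allocation` and the sample size of 200 are made up, and shares are passed as whole percentages so that the arithmetic stays exact.

```python
def proportional_allocation(percentages, n):
    """Split a sample of size n across groups in proportion to their shares.

    percentages: group -> share of the general population, in whole percent.
    Leftover units (from rounding down) go to the largest remainders.
    """
    exact = {group: pct * n for group, pct in percentages.items()}  # in 1/100 units
    allocation = {group: value // 100 for group, value in exact.items()}
    leftover = n - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] % 100, reverse=True)
    for group in by_remainder[:leftover]:
        allocation[group] += 1
    return allocation

age_structure = {"young": 30, "middle-aged": 50, "elderly": 20}
print(proportional_allocation(age_structure, 200))
```

With a sample of 200 the shares divide evenly; for sizes that do not, the largest-remainder step keeps the total equal to n.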

Here are some ways to avoid this error:

    each unit of the general population should have an equal probability of being included in the sample;

    it is desirable to select from homogeneous populations;

    the characteristics of the general population must be known;

    random and systematic errors must be taken into account when compiling the sample.

If the sample is compiled correctly, the sociologist obtains reliable results that characterize the entire general population.

What are the main sampling methods?

Mechanical sampling method. The required number of respondents is selected from the general list of the general population at regular intervals (for example, every 10th person).

Serial sampling method. The general population is divided into homogeneous parts, and units of analysis are selected proportionally from each (for example, 20% of the men and 20% of the women at an enterprise).

Nested sampling method. The selection units are not individual respondents but groups, each of which is then surveyed in full. Such a sample is representative if the composition of the groups is similar (for example, one group of students from each stream of a university department).

Main array method. A survey of 60-70% of the general population.

Quota sampling method. The most complex method, which requires defining at least four criteria by which respondents are selected. It is usually used when the general population is large.
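The nested method described above, in which whole groups are selected and then surveyed in full, can be sketched as follows. The student groups and the function name `nested_sample` are hypothetical.

```python
import random

def nested_sample(groups, n_groups, rng):
    """Select whole groups at random; every member of a chosen group is surveyed."""
    chosen = rng.sample(sorted(groups), n_groups)
    return {name: list(groups[name]) for name in chosen}

# Hypothetical student groups of a university department.
groups = {
    "group A": ["a1", "a2", "a3"],
    "group B": ["b1", "b2", "b3", "b4"],
    "group C": ["c1", "c2"],
    "group D": ["d1", "d2", "d3"],
}

rng = random.Random(7)  # fixed seed so the run is reproducible
selected = nested_sample(groups, 2, rng)
respondents = [person for members in selected.values() for person in members]
print(sorted(selected), len(respondents))
```

The unit of selection here is the group, while the unit of observation remains the individual student.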

Research usually begins with an assumption that needs to be verified against facts. This assumption, a hypothesis, is formulated about the connection between phenomena or properties in a certain set of objects. To test such assumptions against facts, the corresponding properties of their carriers must be measured. But it is impossible to measure, for example, anxiety in all adolescents. Therefore researchers confine themselves to a relatively small group of representatives of the population in question.

General population - the whole set of objects about which the research hypothesis is formulated. Theoretically, the size of the general population is considered unlimited. In practice it is always limited and varies with the subject of observation and the task the psychologist has to solve. Typically, the general population includes a very large number of objects: university students, schoolchildren, employees of enterprises, retirees and so on. A complete survey of a general population is extremely difficult, so as a rule only a small part of it is studied, called the sample population, or sample.

Sample - a group of objects limited in number (in psychology, subjects or respondents), specially selected from the general population in order to study its properties. Accordingly, studying the properties of the general population on a sample is called a sample study. Almost all psychological studies are sample studies, and their findings are extended to general populations.

A sample must satisfy a number of mandatory requirements, determined first of all by the goals and objectives of the study. It must be such that generalizing the conclusions of the sample study, that is, extending them to the general population, is justified.

The sample must meet the following conditions:



1. This is a group of objects available for study. The sample size is determined by the objectives and possibilities of observation and experiment.

2. It is part of a predetermined population.

3. This is a group selected at random so that any object in the general population has the same probability of being included in the sample.

The main criteria for the validity of the research conclusions are the representativeness of the sample and the statistical reliability of the (empirical) results.

Representativeness is the ability of a sample to characterize the corresponding general population with a certain accuracy and sufficient reliability. If the sample of subjects is representative of the general population in its characteristics, there are grounds for extending the results obtained on it to the entire general population.

Ideally, a representative sample should be such that each of the main characteristics, traits, personality features and so on studied by the psychologist is represented in it in proportion to the same characteristics in the general population.

Errors of representativeness arise in two cases:

1. The sample characterizing the general population is too small.

2. The properties (parameters) of the sample do not match the parameters of the general population.

Statistical reliability, or statistical significance, of research results is determined using statistical inference methods. These methods will be discussed in more detail in the "Hypothesis Testing" topic. Note that they have certain requirements for the size, or sample size.

The largest sample size is required when developing a diagnostic technique: from 200 to 1000-2500 people.

If two samples are to be compared, their combined size should be at least 50 people, and the sizes of the compared samples should be approximately equal.

If the relationship between properties is being studied, the sample size should be at least 30-35 people.

The greater the variability of the property under study, the larger the sample must be. Variability can be reduced by making the sample more homogeneous, for example with respect to sex or age, but this naturally narrows the scope for generalizing the conclusions.

Dependent and independent samples. A typical research situation is one in which a property of interest is studied on two or more samples in order to compare them. Depending on how they are organized, these samples can stand in different relations to each other. Independent samples are characterized by the fact that the probability of selecting any subject for one sample does not depend on the selection of any subject for the other sample. Dependent samples, on the contrary, are characterized by the fact that each subject in one sample is matched, according to a certain criterion, with a subject from the other sample.

A typical example of independent samples is a comparison of men and women in terms of intelligence.
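Two independent samples are commonly compared with the two-sample t-test named in the title. The sketch below computes only the t statistic and the degrees of freedom, assuming equal variances in both general populations (the pooled, Student form); obtaining a p-value would additionally require the t distribution. The function name `two_sample_t` and the scores are made up, and the groups are far smaller than the sizes recommended above.

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled-variance t statistic for two independent samples.

    Assumes both samples come from populations with equal variances.
    Returns the t statistic and the degrees of freedom n1 + n2 - 2.
    """
    n1, n2 = len(a), len(b)
    mean1, mean2 = statistics.fmean(a), statistics.fmean(b)
    var1, var2 = statistics.variance(a), statistics.variance(b)
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, n1 + n2 - 2

# Toy scores for two independent groups (not real data).
group_1 = [1, 2, 3, 4, 5]
group_2 = [2, 3, 4, 5, 6]

t, df = two_sample_t(group_1, group_2)
print(t, df)
```

For dependent (matched) samples this form is not appropriate, because the observations in the two groups are paired rather than independent.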

Mathematical statistics is the science that, using the methods of probability theory, deals with the systematization and processing of statistical data in order to obtain scientific and practical conclusions.

Statistical data are information about the number of objects that possess certain characteristics.

A group of objects united by some qualitative or quantitative criterion is called a statistical population. The objects that make up the population are called its elements, and their total number is called its size.

The general population is the set of all conceivable observations that could be made under a given real set of conditions or, more strictly, a random variable $\xi$ together with the associated probability space $(\Omega, \Im, P)$.

The distribution of the random variable $\xi$ is called the distribution of the general population (one speaks, for example, of a normally distributed, or simply normal, general population).

For example, if a series of independent measurements of a random variable $\xi$ is made, the general population is theoretically infinite (that is, the general population is an abstract, conventional mathematical concept); if the number of defective items in a batch of $N$ items is checked, the batch is regarded as a finite general population of size $N$.

In socio-economic research, a general population of size $N$ may be the population of a city, region or country, and the measured characteristics may be the income, expenses or savings of an individual. If a feature is qualitative (for example gender, nationality, social status or occupation) but takes values from a finite set of options, it can also be coded with a number (as is often done in questionnaires).

If the number of objects $N$ is large enough, a complete survey is difficult and sometimes physically impossible (for example, checking the quality of every cartridge). In that case a limited number of objects are selected at random from the general population and studied.

The sample population, or simply a sample of size $n$, is a sequence $x_1, x_2, \ldots, x_n$ of independent identically distributed random variables, each of which has the same distribution as the random variable $\xi$.

For example, the results of the first $n$ measurements of a random variable $\xi$ are customarily regarded as a sample of size $n$ from an infinite general population. The data obtained are called observations of the random variable $\xi$, and the random variable $\xi$ is said to "take the values" $x_1, x_2, \ldots, x_n$.


The main task of mathematical statistics is to draw scientifically substantiated conclusions about the distribution of one or more unknown random variables or about their relationship with one another. The method in which conclusions about the numerical characteristics and the distribution law of a random variable (general population) are drawn from the properties and characteristics of a sample is called the sampling method.

For the characteristics of a random variable obtained by the sampling method to be objective, the sample must be representative, i.e. it must represent the quantity under study sufficiently well. By virtue of the law of large numbers, it can be argued that a sample will be representative if it is drawn at random, i.e. if all objects in the general population have the same probability of being included in the sample. There are different kinds of sampling for this purpose.

1. Simple random selection is selection in which objects are drawn one at a time from the entire general population.

2. Stratified selection consists in dividing the initial general population of size $N$ into subsets (strata) of sizes $N_1, N_2, \ldots, N_k$, so that $N_1 + N_2 + \ldots + N_k = N$. Once the strata are defined, a simple random sample of size $n_1, n_2, \ldots, n_k$ respectively is drawn from each of them. A special case of stratified selection is typical selection, in which objects are selected not from the entire general population but from each typical part of it.

Combined selection combines several types of selection at once, which form different phases of a sample survey. There are other sampling methods as well.

A sample is called a sample with replacement if the selected object is returned to the general population before the next one is chosen, and a sample without replacement if the selected object is not returned. For a finite general population, random selection without replacement makes the individual observations dependent at each step, whereas random, equally probable selection with replacement makes the observations independent. In practice, one usually deals with samples without replacement. However, when the population size $N$ is many times larger than the sample size $n$ (for example, hundreds or thousands of times), the dependence of the observations can be neglected.

Thus, a random sample $x_1, x_2, \ldots, x_n$ is the result of sequential and independent observations of the random variable $\xi$ that represents the general population, and all sample elements have the same distribution as the original random variable $\xi$.

The distribution function $F_\xi(x)$ and the numerical characteristics of the random variable $\xi$ will be called theoretical, in contrast to sample characteristics, which are determined from the results of observations.

Let the sample be the result of independent observations of the random variable $\xi$ in which $k$ distinct values $x_1, x_2, \ldots, x_k$ occurred: $x_1$ was observed $n_1$ times, $x_2$ was observed $n_2$ times, ..., $x_k$ was observed $n_k$ times, so that $\sum_{i=1}^{k} n_i = n$ is the sample size. The number $n_i$, which shows how many times the value $x_i$ appeared in $n$ observations, is called the frequency of that value, and the ratio $w_i = n_i / n$ is called its relative frequency. Obviously, the numbers $w_i$ are rational and $\sum_{i=1}^{k} w_i = 1$.
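Stratified selection as described above can be sketched as follows: a simple random sample of a given size is drawn from each stratum. The strata, the sample sizes and the function name `stratified_sample` are hypothetical.

```python
import random

def stratified_sample(strata, sizes, rng):
    """Draw a simple random sample (without replacement) from every stratum.

    strata: stratum name -> list of units in that stratum
    sizes:  stratum name -> number of units to draw from that stratum
    """
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}

# Hypothetical general population of N = 1000 units split into two strata.
strata = {
    "men": [f"m{i}" for i in range(400)],
    "women": [f"w{i}" for i in range(600)],
}
sizes = {"men": 40, "women": 60}  # 10% from each stratum

rng = random.Random(1)  # fixed seed so the run is reproducible
sample = stratified_sample(strata, sizes, rng)
print({name: len(units) for name, units in sample.items()})
```

Here the sample sizes are proportional to the stratum sizes, which is one possible design choice rather than a requirement of the method.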

A statistical population arranged in ascending order of the characteristic is called a variation series. Its members are denoted $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and are called variants. A variation series is called discrete if its members take specific isolated values. The statistical distribution of a sample of a discrete random variable $\xi$ is the list of variants together with their relative frequencies $w_i$. The resulting table is called a statistical series.

$x_{(1)}$   $x_{(2)}$   ...   $x_{(k)}$
$w_1$       $w_2$       ...   $w_k$

The smallest and greatest values of the variation series are denoted $x_{\min}$ and $x_{\max}$ and are called the extreme members of the variation series.
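The construction of a statistical series from raw observations can be sketched as follows; the ten observations are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observations of a discrete random variable.
observations = [2, 1, 3, 2, 2, 1, 3, 2, 4, 2]
n = len(observations)

frequencies = Counter(observations)   # n_i for every observed value
variants = sorted(frequencies)        # distinct values in ascending order

# Statistical series: (variant x_i, frequency n_i, relative frequency w_i).
series = [(x, frequencies[x], Fraction(frequencies[x], n)) for x in variants]

x_min, x_max = variants[0], variants[-1]  # extreme members

for x, n_i, w_i in series:
    print(x, n_i, w_i)
```

Relative frequencies are kept as exact fractions, which makes it easy to confirm that they sum to one.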

If a continuous random variable is studied, the grouping consists in dividing the interval of observed values into $k$ partial intervals of equal length $h$ and counting the number of observations that fall into each interval. The resulting numbers are taken as the frequencies $n_i$ (of a new, now discrete, random variable). The midpoints of the intervals are usually taken as the new values of the variants $x_i$ (or the intervals themselves are listed in the table). According to Sturges' formula, the recommended number of partial intervals is $k \approx 1 + \log_2 n$, and the length of each partial interval is $h = (x_{\max} - x_{\min}) / k$. The entire interval is assumed to have the form $[x_{\min}, x_{\max}]$.
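Sturges' formula and the interval length can be computed as in the sketch below. Rounding the number of intervals up to the next integer is an assumption of this sketch, and the function name `sturges_intervals` is illustrative.

```python
import math

def sturges_intervals(n, x_min, x_max):
    """Number of intervals by Sturges' formula and the resulting interval length.

    Returns the unrounded value 1 + log2(n), the integer k actually used
    (rounded up here) and the interval length h = (x_max - x_min) / k.
    """
    k_exact = 1 + math.log2(n)
    k = math.ceil(k_exact)
    h = (x_max - x_min) / k
    return k_exact, k, h

# Example inputs: 100 observations ranging from 5.05 to 5.85.
k_exact, k, h = sturges_intervals(100, 5.05, 5.85)
print(round(k_exact, 2), k, round(h, 3))
```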

Statistical series can be graphically presented as a polygon, a histogram, or a graph of accumulated frequencies.

A polygon of frequencies is a broken line whose segments connect the points $(x_1, n_1), (x_2, n_2), \ldots, (x_k, n_k)$. A polygon of relative frequencies is a broken line whose segments connect the points $(x_1, w_1), (x_2, w_2), \ldots, (x_k, w_k)$. Polygons are usually used to represent a sample of a discrete random variable (Fig. 7.1.1).

Fig. 7.1.1.

A histogram of relative frequencies is a stepped figure made up of rectangles whose bases are the partial intervals of length $h$ and whose heights are equal to $w_i / h$.

A histogram is usually used to represent a sample of a continuous random variable. The area of the histogram is equal to one (Fig. 7.1.2). If we connect the midpoints of the upper sides of the rectangles of the histogram of relative frequencies, the resulting broken line forms a polygon of relative frequencies. The histogram can therefore be regarded as the graph of the empirical (sample) distribution density $f_n(x)$. If the theoretical distribution has a finite density, the empirical density is an approximation of the theoretical one.

A graph of accumulated frequencies is a figure constructed like a histogram, except that the heights of the rectangles are not the simple relative frequencies but the accumulated ones, i.e. the values $w_1 + w_2 + \ldots + w_i$. These values do not decrease, so the graph of accumulated frequencies looks like a staircase (rising from 0 to 1).

The cumulative frequency plot is used in practice to approximate the theoretical distribution function.

Problem. A sample of 100 small businesses in the region is analyzed. The purpose of the survey is to measure the ratio of borrowed to own funds ($x_i$) at each $i$-th enterprise. The results are presented in Table 7.1.1.

Table 7.1.1. Ratios of borrowed to own funds of enterprises.

5.56  5.45  5.48  5.45  5.39  5.37  5.46  5.59  5.61  5.31
5.46  5.61  5.11  5.41  5.31  5.57  5.33  5.11  5.54  5.43
5.34  5.53  5.46  5.41  5.48  5.39  5.11  5.42  5.48  5.49
5.36  5.40  5.45  5.49  5.68  5.51  5.50  5.68  5.21  5.38
5.58  5.47  5.46  5.19  5.60  5.63  5.48  5.27  5.22  5.37
5.33  5.49  5.50  5.54  5.40  5.58  5.42  5.29  5.05  5.79
5.79  5.65  5.70  5.71  5.85  5.44  5.47  5.48  5.47  5.55
5.67  5.71  5.73  5.05  5.35  5.72  5.49  5.61  5.57  5.69
5.54  5.39  5.32  5.21  5.73  5.59  5.38  5.25  5.26  5.81
5.27  5.64  5.20  5.23  5.33  5.37  5.24  5.55  5.60  5.51

Build a histogram and a graph of the accumulated frequencies.

Solution. Let's construct a grouped series of observations:

1. Find the extreme values in the sample: $x_{\min} = 5.05$ and $x_{\max} = 5.85$;

2. Divide the entire range into $k$ equal intervals: $k \approx 1 + \log_2 100 = 7.64$; take $k = 8$, hence the length of each interval is $h = (5.85 - 5.05) / 8 = 0.1$.

Table 7.1.2. Grouped series of observations

Interval no.   Interval     Midpoint x_i   w_i    Accumulated w_i   f_n(x) = w_i / h
1              5.05-5.15    5.1            0.05   0.05              0.5
2              5.15-5.25    5.2            0.08   0.13              0.8
3              5.25-5.35    5.3            0.12   0.25              1.2
4              5.35-5.45    5.4            0.20   0.45              2.0
5              5.45-5.55    5.5            0.26   0.71              2.6
6              5.55-5.65    5.6            0.15   0.86              1.5
7              5.65-5.75    5.7            0.10   0.96              1.0
8              5.75-5.85    5.8            0.04   1.00              0.4
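The last two columns of the grouped series follow directly from the relative frequencies. The sketch below recomputes them from the $w_i$ column with $h = 0.1$; rounding to two decimal places is used only to suppress floating-point noise.

```python
from itertools import accumulate

h = 0.1  # interval length from the solution
w = [0.05, 0.08, 0.12, 0.20, 0.26, 0.15, 0.10, 0.04]  # relative frequencies w_i

# Accumulated relative frequencies: w_1, w_1 + w_2, ...
accumulated = [round(value, 2) for value in accumulate(w)]

# Heights of the histogram rectangles (empirical density): w_i / h.
density = [round(w_i / h, 2) for w_i in w]

# The area of the histogram is the sum of height * base over all intervals.
area = sum(d * h for d in density)

print(accumulated)
print(density)
print(round(area, 2))
```

The accumulated values give the heights of the steps in the graph of accumulated frequencies, and the densities give the heights of the histogram rectangles.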

Figs. 7.1.3 and 7.1.4, plotted from Table 7.1.2, show the histogram and the graph of accumulated frequencies. The curves correspond to the density and the distribution function of the normal distribution fitted to the data.

Thus, the distribution of the sample is some approximation of the distribution of the general population.



 