# How do you define statistical data?

## Statistical data

Statistical data is the basis of any quantitative analysis, whether you want to describe your sample, make statements about its population, or discover new relationships. It is therefore important to be particularly careful when collecting and processing it.

### Statistical data - what is it?

Statistical data takes many different forms. A question in a questionnaire can offer few or many answer options (“Do you have a university degree?” vs. “What is your highest level of education?”). For other questions, a specific number or range of values must be selected. Subjective ratings on the so-called Likert scale are also common. Here, respondents express their degree of agreement with numbers (for example from 1 to 5) or words (“strongly agree”, “partly agree, partly disagree”, “strongly disagree”) (cf. Kuckartz et al. 2013: 244). You can find an example of this in Figure 1.

Figure 1: Likert scale, source: https://www.ethodsberatung.uzh.ch/de/skalenniveau.html

In addition to such closed formats, there are also so-called "open questions". Any characters, i.e. letters, numbers or combinations of both, can be entered here. Because this increases the risk of errors, it makes sense to check the plausibility of the answers during input, or at the latest during the analysis.
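A plausibility check for an open question could look roughly like the following. This is only a minimal sketch: the question ("How old are you?"), the function name `check_age` and the upper bound of 120 are assumptions chosen for illustration, not part of any particular survey tool.

```python
def check_age(raw):
    """Return (value, problem) for one free-text age answer."""
    text = raw.strip()
    if not text:
        return None, "missing"
    if not text.isdecimal():
        # Letters, signs or mixed input cannot be read as a whole number.
        return None, "not a whole number"
    age = int(text)
    if age > 120:  # assumed upper bound for a plausible age
        return age, "outside plausible range"
    return age, None


# Invented example answers, as they might be typed into an open text field.
answers = ["34", " 27 ", "abc", "", "340"]
for raw in answers:
    value, problem = check_age(raw)
    print(repr(raw), "->", value, problem or "ok")
```

The function does not silently correct anything. It only reports what looks wrong, so the decision about how to handle the answer stays with you.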

### What other statistical data is there?

Questionnaires are a common, but by no means the only, way of generating statistical data. For example, you can analyze running text from academic papers, newspapers or books. Does Stephen King let his protagonists curse more often than Sebastian Fitzek? Which word was used particularly frequently in the headlines of German daily newspapers in 2018? And there is more. Would you have thought that images are statistical data? For example, fMRI can be used to display the blood flow in different areas of the brain, which can then be examined with statistical methods. Movement patterns are also statistical data. Which route do customers choose through a supermarket? How does my stylus move across the tablet when I write my name? You can collect and analyze all of this with the right methods. Statistical consulting can also help you with these questions.
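To give an idea of how text becomes countable data, here is a small sketch that counts word frequencies in a handful of headlines. The headlines are invented for illustration, and the helper `word_counts` is a hypothetical name. A real analysis would also need decisions about stop words, word stems and so on.

```python
import re
from collections import Counter


def word_counts(texts):
    """Count how often each word occurs across all texts (case-insensitive)."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including umlauts.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


# Invented example headlines, not real newspaper data.
headlines = [
    "Rain expected again on Monday",
    "Rain, rain and more rain",
    "Monday traffic delayed by storm",
]
counts = word_counts(headlines)
print(counts.most_common(3))
```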

### Nominal scale

Scale levels (levels of measurement) are a way to describe the nature of your data in more detail. They also determine which methods you can use for your analysis. Nominally scaled data is only suitable for comparing whether values are the same or different. It groups your observations into categories (e.g. hair colors), but makes no statement about magnitude or ranking. You can therefore determine the mode, but you cannot calculate the median or the mean. Many other methods are also not applicable to nominally scaled data. In general, the information content of your data increases with the scale level (cf. Mayer 2013: 71).
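As a small illustration, the mode of nominal data can be found simply by counting categories. The hair colors below are made-up sample values.

```python
from collections import Counter

# Invented sample of a nominal variable.
hair_colors = ["brown", "blond", "brown", "black", "red", "brown", "blond"]

counts = Counter(hair_colors)
# most_common(1) returns the category with the highest count;
# if several categories tie, the one seen first is returned.
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)
```

Counting is all that happens here. The categories are never sorted or averaged, which is exactly the restriction of the nominal scale.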

### Ordinal scale

Things look different with ordinally scaled data. Here there is a clear ranking. For example, the Abitur is a higher educational qualification than the Mittlere Reife, but you cannot make an exact statement about the distances between the categories. In other words, you don't know whether the Hauptschulabschluss and the Mittlere Reife are as far apart as the Mittlere Reife and the Abitur. You may therefore determine the median here, but not the mean.
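The median of ordinal data can be sketched as follows. The three-level ordering of German school certificates (Hauptschulabschluss, Mittlere Reife, Abitur) and the sample are assumptions for illustration. The ranks 0, 1, 2 only encode the order and carry no information about distances.

```python
import statistics

# Assumed ordering of categories, from lowest to highest.
LEVELS = ["Hauptschulabschluss", "Mittlere Reife", "Abitur"]
RANK = {level: position for position, level in enumerate(LEVELS)}


def ordinal_median(values):
    """Return the middle category of an ordinal sample."""
    ranks = [RANK[value] for value in values]
    # median_low always returns an actual data point, so the result
    # is a real category even for an even number of observations.
    return LEVELS[statistics.median_low(ranks)]


# Invented sample.
sample = [
    "Abitur", "Mittlere Reife", "Hauptschulabschluss",
    "Abitur", "Mittlere Reife", "Mittlere Reife",
]
print(ordinal_median(sample))
```

`median_low` is used instead of `median` because averaging the two middle ranks would silently assume equal distances between the categories.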

### Interval scale

This is where interval-scaled data comes into play. It also has a clear ranking, and you can now make statements about distances because the values are "equidistant". This means that successive values are exactly the same distance apart. Calendar years are a good example of interval-scaled data. The year 2020 is just as far away from 2018 as 2018 is from 2016. You can now calculate the mean as well, along with all measures of the lower scale levels. For the next higher scale level, however, one decisive feature is still missing: the natural zero point. A data analysis service can help you if problems arise here.
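The following sketch shows what the missing natural zero point means in practice. It moves the calendar's zero point to a hypothetical new origin (our year 2000, chosen purely for illustration) and compares what stays the same and what changes.

```python
import statistics


def shift(values, origin):
    """Re-express values relative to a different, arbitrary zero point."""
    return [value - origin for value in values]


years = [2016, 2018, 2020]
shifted = shift(years, 2000)  # hypothetical calendar starting in our year 2000

# Distances are unaffected by the choice of zero point.
print(years[2] - years[0], shifted[2] - shifted[0])

# The mean simply moves along with the zero point.
print(statistics.mean(years), statistics.mean(shifted))

# Ratios, however, depend entirely on where zero happens to be.
print(years[2] / years[0], shifted[2] / shifted[0])
```

Differences and means survive the shift, while the ratio changes from roughly 1.002 to 1.25. That is why ratios are not meaningful on an interval scale.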

### Ratio scale

Calendar years are not ratio-scaled because they do not have a natural zero point. After all, the zero point of our calendar does not mark the actual beginning of time, but was chosen more or less arbitrarily. Things are different with weight or length, for example. A stone cannot weigh -4 kilograms and a ruler cannot be -0.3 centimeters long. Nor can there be -5 people in a room. This means that the zero point cannot simply be chosen here; it is given naturally by external factors. Therefore, you can now not only multiply and divide your data (it makes no difference whether you calculate with 10 meters or 1,000 centimeters), but also state ratios. So if there are initially 10 people in a room and half an hour later there are 20, the number of people has doubled. As simple as this statement appears at first, you cannot make it on any of the other scales.
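A short sketch of the two claims above: ratios can be stated, and they do not depend on the unit. The helper name `ratio` and the second length (5 meters) are assumptions added for the example.

```python
def ratio(after, before):
    """Return how many times larger `after` is than `before`."""
    if before == 0:
        raise ValueError("ratio is undefined for a starting value of zero")
    return after / before


# 10 people at first, 20 half an hour later: the number has doubled.
print(ratio(20, 10))

# The same comparison of lengths in meters and in centimeters.
lengths_m = (10, 5)  # 5 m is an assumed comparison value
lengths_cm = tuple(value * 100 for value in lengths_m)
print(ratio(*lengths_m), ratio(*lengths_cm))
```

Changing the unit rescales every value by the same factor, so the ratio stays the same. Moving the zero point, as with calendar years, would not have this property.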

Assigning scale levels often feels unfamiliar at first, but with a little practice you will quickly get the hang of it. You can try it out with the Service Center for Teaching at the University of Kassel. If you have any difficulties, a statistics service can help.

### How do you prepare statistical data?

In order to evaluate your data in a meaningful way, it must be in the correct form and free of errors. That doesn't sound complicated at first, but it often requires a lot of preparatory work. First of all, of course, you need data. Here you should ask yourself the following: Can you access existing data? Or do you have to collect new data? If so, what should this survey look like? You can then either pull your data from existing databases or enter it from scratch. Sometimes you may need to draw on several data sources. In this case, you have to think about how to link them in a meaningful way. The data format can also cause problems. A Word file overwhelms most statistical programs, and if your table looks like Figure 2, they will complain too.
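Linking two data sources usually requires a shared key. The sketch below assumes that both sources identify respondents by the same numeric ID. The field names, the values and the helper `merge_by_id` are invented for illustration.

```python
def merge_by_id(left, right):
    """Combine records that share an ID and report IDs found in only one source."""
    merged = {}
    for key in sorted(left.keys() & right.keys()):
        merged[key] = {**left[key], **right[key]}
    unmatched = sorted(left.keys() ^ right.keys())
    return merged, unmatched


# Two invented sources keyed by a hypothetical respondent ID.
survey = {101: {"age": 34}, 102: {"age": 27}, 103: {"age": 45}}
registry = {
    101: {"degree": "Abitur"},
    103: {"degree": "Mittlere Reife"},
    104: {"degree": "Abitur"},
}

merged, unmatched = merge_by_id(survey, registry)
print(merged)
print(unmatched)
```

Returning the unmatched IDs separately is a deliberate choice: records that appear in only one source are a frequent sign of typos or differing coverage and deserve a second look before they are dropped.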

Figure 2: Example table, source: own illustration

### Incorrect statistical data

Figure 2 gives a good idea of what can go wrong with statistical data. Dots instead of commas can be cleaned up quickly once you have discovered the error. The entry "-3" is more difficult. Should this value be recorded as missing? If so, how do you deal with missing values in the further analysis? It becomes even more problematic when values are improbable but not completely implausible. For example, a monthly salary of € 20,000 is unusual, but it can happen. It could also be a simple typo. Leaving the value in the data set can lead to bias, but so can removing it. Therefore, when cleaning up your data, you often have to make and justify individual decisions. Data cleansing in SPSS is a convenient way to adjust data.
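The cleaning decisions described here can be sketched in a few lines. Several assumptions are built in and should be adapted to your own data: the variable is a monthly salary, entries may use either a comma or a dot as the decimal separator, negative entries are treated as missing, and values above 15,000 are flagged for manual review rather than deleted. Entries with thousands separators (such as "20.000,00") are not handled and would be reported as missing.

```python
import math


def parse_salary(raw):
    """Convert one raw entry to a float, or None if it cannot be used."""
    text = raw.strip().replace(",", ".")  # unify the decimal separator
    try:
        value = float(text)
    except ValueError:
        return None  # not a number at all
    if not math.isfinite(value) or value < 0:
        return None  # assumption: negative entries count as missing
    return value


REVIEW_THRESHOLD = 15000  # assumed cut-off for "unusual but possible"

# Invented raw entries, as they might appear in a messy table.
raw_values = ["3200,50", "2800.00", "-3", "20000", "n/a"]
salaries = [parse_salary(raw) for raw in raw_values]
to_review = [value for value in salaries
             if value is not None and value > REVIEW_THRESHOLD]

print(salaries)
print(to_review)
```

Flagging instead of deleting keeps the decision visible, so you can justify it later for each individual case.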

Statistical data is versatile. It can be text, numbers, intervals or images. No matter what exactly it looks like: before you analyze it, you not only have to clean it up and bring it into a processable form, but also consider what statements you can make with it. Even if your real interest is the analysis itself, careful preparation of your data is essential. You should therefore always allow enough time for this part of your statistical work.

### Literature

Mayer, Horst (2013): Interview and written survey: Basics and methods of empirical social research, 6th edition, Berlin.

Kuckartz, Udo / Rädiker, Stefan / Ebert, Thomas / Schehl, Julia (2013): Statistics: An understandable introduction, 2nd edition, Wiesbaden.