The processing of data
Version of 9 March 2018 at 14:12
The analysis of quantitative data is very different from the analysis of qualitative data; they are two very different or even opposed research practices.
We will focus on quantitative analysis, which is in fact easier than the analysis of qualitative data, if only because there are institutionalized routines.
Data matrix
It is a matrix that cross-references the cases studied with a number of variables, with the variables in columns and the cases in rows.
A code must be assigned to non-responses so that those who did not respond can be excluded from the analysis and distinguished from those who did.
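As a minimal sketch of such a matrix, the following Python snippet stores cases as rows and variables as columns, and leaves out non-responses before computing a mean. The respondents, the variable names and the non-response code 99 are all invented for illustration; a real codebook defines its own codes.

```python
from statistics import mean

# Hypothetical non-response code (assumption; real codebooks define their own).
MISSING = 99

# Data matrix: one row per case (respondent), one column per variable.
# All values are invented for illustration.
data_matrix = [
    {"id": 1, "age": 34, "interest": 3},
    {"id": 2, "age": 51, "interest": MISSING},
    {"id": 3, "age": 27, "interest": 1},
    {"id": 4, "age": 45, "interest": 4},
]


def valid_values(rows, variable, missing=MISSING):
    """Return the values of one column, leaving out non-responses."""
    return [row[variable] for row in rows if row[variable] != missing]


interest = valid_values(data_matrix, "interest")
mean_interest = mean(interest)  # computed on the three valid answers only
```

Without the filter, the code 99 would be treated as a real answer and would inflate the mean.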
There are three types of analysis, which correspond to three different objectives:
- univariate analyses: analyses that are performed on a single variable or characteristic.
- bivariate analyses: two variables are linked; we cross the data to analyze variations more finely, such as interest in politics according to city or age.
- multivariate analyses: we assume that a phenomenon is never explained by a single independent variable; in addition, we want to introduce controls in order to purify the relationships of the influence of third variables.
A distinction must also be made between descriptive analysis, which attempts to describe a "factual state" and can be univariate or bivariate, and explanatory analysis.
Types of univariate analyses
Types of variables and operations between categories
There are different types of univariate analyses; the techniques depend on the type of variable:
- nominal variables: only equivalence or difference operations can be performed.
- ordinal variables: the categories can also be ordered, i.e. ranked from the smallest to the largest.
Note: Nominal and ordinal variables are categorical, discrete data; the distances between categories cannot be measured.
- cardinal variables: allow the four basic arithmetic operations to be performed in addition to the previous operations.
Measures of central tendency
When carrying out a quantitative analysis, it is necessary to consider the type of variables and then choose the tool to use. We can distinguish between two main types of measures, i.e. between two types of information that we want to have about a single variable:
- measures of central tendency
- measures of dispersion.
Note: Depending on the type of variable, the appropriate measures are different.
The mean is a measure of central tendency that can be applied to cardinal variables, but not to categorical variables. The median is the category that separates the ordered statistical series in two, with the same number of cases on either side; it requires a variable that is at least ordinal.
This is important information that forms the starting point for this type of data description, allowing us to know what to do next in the case of more sophisticated analyses.
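The link between the type of variable and the measure of central tendency can be sketched with Python's standard `statistics` module. The data and variable names below are invented; the mode for nominal data is a standard choice that is added here and is not named in the text above.

```python
from statistics import mean, median_low, mode

# Invented answers for illustration only.
canton = ["GE", "VD", "GE", "ZH", "GE"]  # nominal: only the mode is meaningful
interest = [1, 2, 2, 3, 4, 4, 4]         # ordinal codes (1 = none ... 4 = strong)
times_voted = [0, 2, 3, 3, 7]            # cardinal: number of votes attended

modal_canton = mode(canton)              # most frequent category
median_interest = median_low(interest)   # middle category of the ordered series
mean_votes = mean(times_voted)           # arithmetic mean
```

`median_low` is used so that the result is always one of the observed categories, which suits ordinal data.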
Measures of dispersion
Measures of dispersion are also distinguished. The basic measure is the variance, which indicates how individuals are distributed around the mean; the standard deviation is the square root of the variance and is therefore expressed in the same units as the variable.
Variance is very important for calculating the probability of error. Different measures are required depending on the level of measurement of the variable, and both central tendency and dispersion must be taken into account; the standard deviation is the key coefficient throughout quantitative analysis.
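A minimal sketch of the two dispersion measures, on invented data. It uses the population formulas (division by the number of cases); for a sample, the usual estimator divides by the number of cases minus one.

```python
from math import sqrt
from statistics import mean, pvariance

times_voted = [0, 2, 3, 3, 7]  # invented cardinal data

m = mean(times_voted)
# Variance: average squared distance of the cases from the mean.
variance = sum((x - m) ** 2 for x in times_voted) / len(times_voted)
# Standard deviation: square root of the variance, in the units of the variable.
std_dev = sqrt(variance)

# The standard library gives the same population variance.
same_as_library = abs(variance - pvariance(times_voted)) < 1e-9
```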
Types of bivariate analyses
In this context we wish to cross two characteristics, from either a descriptive or an explanatory perspective. Depending on the type of variable, different techniques are used to analyze and process the data.
Both the dependent and the independent variable need to be considered. When crossing them, we must look at whether each one is categorical (nominal or ordinal) or cardinal; this makes it possible to distinguish three main families of analysis:
- categorical / nominal - nominal variables: contingency tables are made; other techniques cannot be used. Most of the time in political science we are dealing with this type of variable, because survey answers give rise to nominal or ordinal variables. There are coefficients that give a single measure of the relationship between the two variables, such as Cramér's V, which measures the association between categorical variables. To interpret the table, the percentages should always be computed within the categories of the independent variable, so that each of these categories totals 100%; we want to see how the distribution on the dependent variable depends on the categories of the independent variable. The number of cases must also be reported, because the sample size affects how reliable the measure is.
- cardinal-cardinal variables: we no longer cross-tabulate; other tools are available, in particular correlation and regression:
  - covariation: with two continuous variables, when one increases the other tends to increase (positive covariation) or to decrease (negative covariation); the two variables are linked in that direction.
  - correlation: it is simply a standardised covariation that lies between -1 and +1. Standardisation is used to compare variables that are measured in different ways; if, for example, we have scales from 0 to 10 and scales from 0 to 5, we cannot compare these variables directly, so we must make sure that the information is standardised. The variables can be recoded onto the same scale, or software can calculate the standardised correlation directly.
  - regression: with a correlation one is in a descriptive perspective; one does not look for a direction of causality, but only wants to see whether two variables are associated. With a regression one is in an explanatory perspective: one postulates that the independent variable influences the dependent variable.
- nominal independent variables - cardinal dependent variables: cross tabulations, correlations and regressions cannot be applied; an analysis of variance or covariance is carried out, the simplest case of which is a comparison of means, for example the mean number of times that individuals participate in an election according to the canton.
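For the first family above, the contingency table and Cramér's V can be sketched in plain Python. The cases are invented, the variable names are hypothetical, and the chi-square statistic is computed without any continuity correction.

```python
from collections import Counter
from math import sqrt

# Invented cases: (independent variable, dependent variable).
cases = (
    [("left", "votes")] * 30 + [("left", "abstains")] * 10
    + [("right", "votes")] * 20 + [("right", "abstains")] * 40
)

counts = Counter(cases)
iv_cats = sorted({iv for iv, _ in cases})
dv_cats = sorted({dv for _, dv in cases})
n = len(cases)

iv_totals = {iv: sum(counts[(iv, dv)] for dv in dv_cats) for iv in iv_cats}
dv_totals = {dv: sum(counts[(iv, dv)] for iv in iv_cats) for dv in dv_cats}

# Percentages computed within each category of the independent variable,
# so that every category of the independent variable sums to 100 %.
percentages = {
    (iv, dv): 100 * counts[(iv, dv)] / iv_totals[iv]
    for iv in iv_cats for dv in dv_cats
}

# Chi-square statistic: observed versus expected counts under independence.
chi2 = sum(
    (counts[(iv, dv)] - iv_totals[iv] * dv_totals[dv] / n) ** 2
    / (iv_totals[iv] * dv_totals[dv] / n)
    for iv in iv_cats for dv in dv_cats
)

# Cramér's V: chi-square rescaled to lie between 0 and 1.
cramers_v = sqrt(chi2 / (n * (min(len(iv_cats), len(dv_cats)) - 1)))
```

In this invented table, 75% of the "left" cases vote against about 33% of the "right" cases, and Cramér's V summarises that difference in a single coefficient of roughly 0.41.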
Linear regression
Regression analysis is a very varied and sometimes complex set of tools. Linear regression is its main element; much of the quantitative analysis done in social science relies on linear regression.
We talk about linearity because we postulate that there is a linear relationship between the variables we are studying, in other words that there is a linear function behind this relationship; however, we can also envisage regressions that are not linear.
It is assumed that what we want to explain is a linear function of one or more independent variables. This is crucial, as linear regression is only a subset of a larger family of regression analyses, not all of which assume a linear relationship between the variables.
The simplest model has a single explanatory variable, for example political participation explained by political interest.
In descriptive terms, there is a strong correlation between these two variables; if a hypothesis says that it is the interest in politics that influences participation, then a regression analysis is done.
There is always a problem of endogeneity in this type of analysis: we postulate that the interest in politics determines participation, but we could equally postulate that the more we participate, the more we develop an interest in politics.
Political participation is a linear function of the interest in politics "plus" a constant factor: the value of Y when X is equal to 0, i.e. the level of participation when the interest in politics is nil. Graphically, it is the point where the regression line crosses the ordinate axis.
In multivariate analysis there is always a margin of error. With survey data, one source is the sampling error between population and sample. But regardless of whether one works on a sample or on an entire population, an error term comes into play, because there is always something that influences what we want to explain and that is not included in the regression model, such as education, age, or social, institutional and other factors.
The E groups together the unexplained variance, i.e. everything that could explain Y but is not introduced into the model. This is the problem of under-specification, i.e. the model specification issue: the fewer relevant variables a model contains, the more likely it is to be under-specified, the less of the variation in Y it explains and the larger E is; the more relevant variables it contains, the less likely it is to be under-specified.
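Written out, the model described above can be sketched as follows. The symbols Y, X, B and E are those used in the text; the symbol A for the constant and the indices 1 to k are assumptions introduced here for illustration.

```latex
% Bivariate model: one explanatory variable
Y = A + B X + E

% Multivariate model: k explanatory variables, effects added together
Y = A + B_1 X_1 + B_2 X_2 + \dots + B_k X_k + E
```

Any variable that influences Y but does not appear among the X terms is absorbed by E.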
This means that not including some variables in an explanatory model has two major consequences:
- the model is under-specified: it explains little of the variability of Y, because factors strongly correlated with what we want to study are left out.
- the second consequence is related to the control of variables: if only interest in politics is introduced, a third variable may influence both interest in politics and participation in politics, so that the observed association is misleading.
We therefore want to include as many as possible of the variables that we think can influence Y directly, or indirectly by making the relationship between X and Y false or only apparent.
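The second consequence can be illustrated with a partial correlation, one simple way of controlling for a third variable (an illustrative technique chosen here, not necessarily the one used in the course). In the invented data below, education drives both interest and participation, and the apparent association between the two disappears once education is controlled for.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


def partial_correlation(x, y, z):
    """Correlation between x and y once the third variable z is controlled for."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented scores, built so that education is the only common cause.
education = [0, 0, 0, 0, 1, 1, 1, 1]
interest = [0, 0, 2, 2, 4, 4, 6, 6]
participation = [0, 2, 0, 2, 4, 6, 4, 6]

raw = pearson(interest, participation)
controlled = partial_correlation(interest, participation, education)
```

Here `raw` is 0.8 while `controlled` is zero up to rounding error: the association between interest and participation is only apparent in this constructed example.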
The B is the regression coefficient, i.e. the slope of the regression line. It gives the strength of the effect of X: because it is multiplied by X, the higher B is, the stronger the effect of X.
B can be unstandardized or standardized. Standardization expresses the coefficients in a common unit (standard deviations), and the goal is to be able to compare different coefficients.
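The difference between the two kinds of coefficient can be sketched with an ordinary least-squares fit written in plain Python. The data are invented and the function name `ols_line` is hypothetical. Measuring interest on a 0 to 5 scale instead of a 0 to 10 scale changes the unstandardized B but not the standardized one.

```python
from statistics import mean, pstdev


def ols_line(x, y):
    """Least-squares intercept and slope of the line y = a + b * x."""
    mx, my = mean(x), mean(y)
    b = (
        sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        / sum((xi - mx) ** 2 for xi in x)
    )
    a = my - b * mx
    return a, b


# Invented data: interest on a 0-10 scale, participation as a count of acts.
interest = [0, 2, 4, 6, 8, 10]
participation = [1, 1, 3, 2, 4, 4]

a, b = ols_line(interest, participation)
# Standardized coefficient: b rescaled by the ratio of standard deviations.
beta = b * pstdev(interest) / pstdev(participation)

# Same answers recorded on a 0-5 scale: b doubles, beta is unchanged.
interest_0_5 = [x / 2 for x in interest]
_, b_rescaled = ols_line(interest_0_5, participation)
beta_rescaled = b_rescaled * pstdev(interest_0_5) / pstdev(participation)
```

With a single explanatory variable, the standardized coefficient is equal to the correlation between X and Y.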
The logic is additive, hence the "+" signs: one assumes that the variation of Y is a linear, additive (cumulative) function of the effects of all the variables introduced in the model.
Regression line
The regression line represents the linear regression function. We want to see by how much Y increases when we increase X. Suppose that one axis (0;12) is the interest in politics and the other one political participation; we can see that there is a fairly strong correlation between the two: when political interest increases, political participation increases.
The blue dots represent the cases and the regression line is the estimate of the values, so we look at how well this line fits the cloud of points.
The quality of the model has to do with the quality of the estimate, which depends very much on how the points are distributed. Two point clouds may be estimated by straight lines that have the same slope, so that the estimated effect is the same, while the quality of the estimate is different: the straight line is a much more accurate approximation of the point cloud whose points lie close to the line.
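This point can be illustrated with two invented point clouds that share the same regression line but differ in how closely the points lie around it. The quality of fit is measured here with R squared, a standard indicator that the text above does not name, so its use is an assumption of this sketch.

```python
from statistics import mean


def fit_and_r2(x, y):
    """Return (slope, R squared) of the least-squares line through the points."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)  # share of the variance of Y accounted for
    return slope, r2


x = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
scatter = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]  # deviations from the line

# Same underlying line (1 + 2x), small versus large deviations around it.
y_tight = [1 + 2 * xi + 0.5 * e for xi, e in zip(x, scatter)]
y_loose = [1 + 2 * xi + 3.0 * e for xi, e in zip(x, scatter)]

slope_tight, r2_tight = fit_and_r2(x, y_tight)
slope_loose, r2_loose = fit_and_r2(x, y_loose)
```

Both slopes equal 2, but R squared is about 0.97 for the tight cloud and about 0.47 for the loose one.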
It should be kept in mind that correlation and regression analysis are among the main instruments of quantitative analysis when using interval or cardinal variables.
Linear regression, which is a subset of a larger set, is based on the idea of a linear function between X and Y; we try to estimate a cloud of points that represents the crossing of the two variables in the sample, so we analyze the regression line and its slope. If the slope is 0, then Y does not change when X changes: one may be very interested in politics, yet always participate at the same level.
Multivariate analyses
Regression analysis
Depending on the type of variable one wishes to explain, the linear regression tool may or may not be applicable. For dummy variables (absence or presence), for example, there is logistic regression; linear regression cannot be applied, because its basic assumptions are not met.
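One reason a linear function is ill-suited to a presence/absence outcome is that its predictions are not bounded between 0 and 1, whereas the logistic function keeps them in that interval. The sketch below only illustrates this; the coefficients are invented, not estimated, and a fitted logistic model would have different coefficients from a fitted linear one.

```python
from math import exp


def linear_prediction(x, a, b):
    """Linear model: the prediction is not bounded."""
    return a + b * x


def logistic_prediction(x, a, b):
    """Logistic model: the same linear term is mapped into the interval (0, 1)."""
    return 1 / (1 + exp(-(a + b * x)))


# Invented coefficients for a dummy outcome (1 = voted, 0 = did not vote).
a, b = -0.2, 0.15
interest_levels = [0, 5, 10]

linear = [linear_prediction(x, a, b) for x in interest_levels]
logistic = [logistic_prediction(x, a, b) for x in interest_levels]
```

The linear predictions fall below 0 at the lowest level of interest and exceed 1 at the highest, which cannot be read as probabilities; the logistic predictions always can.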
Path analysis (analysis of causal paths)
One of the problems of regression analysis is that Y is assumed to be a linear function of the sum of all the independent variables; in doing so, one looks only at the direct effects of the variables in a model. But what happens when one wants to look at indirect effects?
One carries out an analysis of "causal paths". Some regression coefficients may be significant and others not, but one can see causal paths, i.e. how left-wing values influence participation not directly but indirectly: being left-wing makes one more likely to be integrated into certain types of networks, which develop an interest in a certain issue, which in turn develops a sense of individual efficacy, leading to a higher intensity of participation. Intermediate variables are introduced.
Instead of a single equation, there are several, because each variable can in turn be a dependent variable; one works with a set of equations.
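With a single intermediate variable, such a set of equations can be sketched as follows. All symbols are assumptions introduced for illustration: X stands for left-wing values, M for an intermediate variable such as network integration, and Y for participation.

```latex
% Equation 1: the intermediate variable is itself a dependent variable
M = a_1 + b_1 X + e_1

% Equation 2: participation depends on X directly and on M
Y = a_2 + b_2 X + b_3 M + e_2

% Direct effect of X on Y:   b_2
% Indirect effect via M:     b_1 b_3
% Total effect:              b_2 + b_1 b_3
```

The direct coefficient b_2 may be non-significant while the indirect path b_1 b_3 is substantial, which is the situation described above.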
Factor analysis
This analysis aims to reduce the complexity that arises when a data matrix contains many variables and cases and one wants a more succinct index.
When discussing the operationalization of complex concepts, we arrived at a final stage of construction; factor analysis makes it possible to build indices by analysing the underlying links that explain the variation across a set of multiple indicators.
It is a tool frequently used in political science, in particular when studying value change.
Multilevel analysis
Previously, all the measures concerned individual variables; now there are properties of the context, not of the individual, that can influence political participation, such as the electoral system or the type of political system.
In ordinary regression there are ways of working around the problem, but contextual factors cannot be integrated into the analysis; one can simply compare contexts.
Multilevel analysis makes it possible to run a multilevel regression: properties of the context are added, not only individual properties, so that individual and contextual properties are integrated in the same model. This is an important development in political science.
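One common formulation of such a model, given here only as a sketch, is a regression with a random intercept per context. All symbols are assumptions introduced for illustration: i indexes individuals, j indexes contexts (for example countries), X is an individual property and Z a contextual property.

```latex
% Participation of individual i in context j
Y_{ij} = \gamma_{00} + \gamma_{10} X_{ij} + \gamma_{01} Z_{j} + u_{j} + e_{ij}

% \gamma_{10}: effect of the individual property (e.g. interest in politics)
% \gamma_{01}: effect of the contextual property (e.g. electoral system)
% u_{j}: error term at the level of the context
% e_{ij}: error term at the level of the individual
```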
Types of qualitative methods
A distinction can be made between content analysis and discourse analysis. There is no consensus on these terms in the literature: some consider discourse analysis to be a type of content analysis, others do not.
Content analysis
Content analysis is concerned with the weight of what is said; it is more descriptive and looks at the different issues raised by people. A further distinction can be made:
- thematic: one counts the number of times a given theme appears in a discourse.
- lexical: analysis based on the occurrences or co-occurrences of words, i.e. a qualitative analysis that contains elements of quantitative analysis.
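The counting step of a thematic analysis can be sketched in a few lines of Python. The coding scheme, the keywords and the speech extract are all invented; in practice themes are usually coded by a human reader and not matched by keyword, so this is only a crude approximation of the idea.

```python
from collections import Counter
import re

# Hypothetical coding scheme: each theme is identified by a few keywords.
coding_scheme = {
    "environment": {"climate", "pollution", "energy"},
    "economy": {"jobs", "taxes", "wages"},
}

# Invented speech extract.
speech = "Climate policy creates jobs. Lower taxes, better wages, clean energy."

# Lexical step: split the text into lower-case words.
words = re.findall(r"[a-z]+", speech.lower())

# Thematic step: count how often each theme is mentioned.
theme_counts = Counter()
for word in words:
    for theme, keywords in coding_scheme.items():
        if word in keywords:
            theme_counts[theme] += 1
```

On this extract the economy theme is counted three times and the environment theme twice.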
Discourse analysis
It is an interpretive analysis; the term covers a family of techniques concerned with how a given discourse works and with its effects.
To simplify, content analysis is rather descriptive and discourse analysis rather explanatory.
Stages of thematic analysis
There are five main stages:
- familiarization (pre-analysis): one must first become familiar with the material available.
- identification of a thematic framework (coding scheme, index): deciding how to code the information, i.e. identifying the thematic framework.
- indexing (coding): reducing the information.
- charting (categorization and data reduction): creating typologies and classifications and reducing the data so that they can be interpreted.
- mapping and interpretation (analysis and interpretation)
Stages of discourse analysis
- Pre-analysis
- Identification of relevant elements
- Systematic analysis based on identified elements

