The processing of data

From Baripedia
{{Infobox Lecture
   
| image =
| image_caption =
| faculté = [[Faculté des sciences de la société]]
| département = [[Département de science politique et relations internationales]]
| professeurs = [[Marco Giugni]]<ref>[https://unige.ch/sciences-societe/speri/membres/marco-giugni/ Marco Giugni's personal page on the University of Geneva website]</ref>
| assistants = 
| enregistrement =
| cours = [[Introduction to the methods of political science]]
  | lectures =
* [[Introductory course on the methods of political science]]
* [[The positivist paradigm and the interpretative paradigm]]
* [[Fundamental scientific methods]]
* [[From theory to data]]
* [[Data collection]]
* [[The processing of data]]
}}
 
The analysis of quantitative data is very different from the analysis of qualitative data; they are two distinct, even opposed, research practices.

We will focus on quantitative analysis, which is in fact easier than the analysis of qualitative data, if only because it relies on institutionalized routines.


{{Translations
| fr = Le traitement des données
| es = Procesamiento de datos
| it = Trattamento dei dati
}}


= Data matrix =
[[Fichier:Matrice des données.png|500px|vignette|centré]]
It is a matrix that crosses the cases studied with a number of variables, with the variables in columns and the cases in rows.
 
A specific code must be assigned to those who did not respond, so that they can be told apart from those who did and excluded from the analysis.
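
As a minimal sketch of the idea, the data matrix can be represented as a list of rows (cases) whose keys are the variables. The variable names, the values and the missing-value code <code>-9</code> are assumptions made for illustration only.

```python
# Hypothetical data matrix: one row per case, one key per variable.
MISSING = -9  # assumed code for "did not respond"

data_matrix = [
    {"id": 1, "city": "Geneva", "age": 34, "interest": 3},
    {"id": 2, "city": "Zurich", "age": 51, "interest": MISSING},
    {"id": 3, "city": "Geneva", "age": 27, "interest": 1},
    {"id": 4, "city": "Bern", "age": 45, "interest": 4},
]

def valid_values(rows, variable, missing=MISSING):
    """Return the values of one variable, leaving out non-respondents."""
    return [row[variable] for row in rows if row[variable] != missing]

interest = valid_values(data_matrix, "interest")
print(interest)  # [3, 1, 4]
```

Keeping the missing code distinct from every valid answer is what makes the exclusion safe: case 2 stays in the matrix but does not enter the calculation.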
 
There are three types of analysis, corresponding to three different objectives:
*'''univariate analyses''': analyses performed on a single variable or characteristic.
*'''bivariate analyses''': two variables are related to one another; the data are crossed in order to analyze variations more finely, for example interest in politics by city or by age.
*'''multivariate analyses''': a phenomenon is assumed never to be explained by a single independent variable; in addition, controls are introduced in order to test relationships, using the purification technique.
   
   
A distinction must be made between descriptive analysis, which seeks to describe a "state of affairs" and is carried out in univariate or bivariate form, and explanatory analysis.


= Types of univariate analyses =
[[Fichier:Types d’analyses univariées.png|500px|vignette|centré]]


== Types of variables and operations between categories ==
There are different types of univariate analysis; the techniques depend on the type of variable:
* '''nominal variables''': only operations of equivalence or difference can be performed.
* '''ordinal variables''': in addition, the categories can be ordered, i.e. ranked from the smallest to the largest.
* '''cardinal variables''': in addition to the previous operations, the four basic arithmetic operations can be performed.
Note: nominal and ordinal variables are categorical; they are discrete data, and the distances between categories cannot be measured.


== Measures of central tendency ==
When carrying out a quantitative analysis, one must first consider the type of variable and then choose the appropriate tool. Two main types of measures can be distinguished, that is, two types of information one wants about a single variable:
* measures of central tendency;
* measures of dispersion.
Note: the appropriate measures differ depending on the type of variable.
 
The mean is a measure of central tendency that can be applied to cardinal variables, but not to categorical variables. The median is the value or category that splits the statistical series in two, with the same number of cases on either side.
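
The following sketch illustrates how the choice of measure depends on the type of variable, using Python's standard <code>statistics</code> module. The data are invented, and the mode (the most frequent category) is added here as the usual measure for nominal variables; it is not discussed in the text above.

```python
import statistics

# Hypothetical data for three types of variable.
party = ["left", "right", "left", "centre", "left"]  # nominal
interest = [1, 2, 2, 3, 4, 4, 4]                     # ordinal codes (1 = low, 4 = high)
age = [22, 35, 41, 58, 64]                           # cardinal

mode_party = statistics.mode(party)            # most frequent category
median_interest = statistics.median(interest)  # middle case of the ordered series
mean_age = statistics.mean(age)                # arithmetic mean

print(mode_party, median_interest, mean_age)  # left 3 44
```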
 
This information is important: it is the starting point of any description of the data and indicates what to do next when moving on to more sophisticated analyses.
 
== Measures of dispersion ==
There are also measures of dispersion. The basic measure is the standard deviation, which is the square root of the variance; the variance indicates how the individuals are spread around the mean.

The variance is very important for calculating the probability of error. Different measures are required depending on the level of measurement of the variable, and both the measure of central tendency and the measure of dispersion must be taken into account, in particular the standard deviation, which is the key coefficient throughout quantitative analysis.
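
A short sketch of the relationship between the two measures, with invented data. It uses the population versions (<code>pvariance</code>, <code>pstdev</code>) from the standard <code>statistics</code> module; with a sample one would normally use <code>variance</code> and <code>stdev</code> instead.

```python
import statistics

# Hypothetical number of participations per individual.
participation = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.pvariance(participation)  # mean squared distance to the mean
std_dev = statistics.pstdev(participation)      # square root of the variance

print(variance, std_dev)  # 4 2.0
```

Unlike a correlation, the standard deviation is expressed in the units of the variable and has no upper bound; it can never be negative.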


= Types of bivariate analyses =
[[Fichier:Types d’analyses bivariées.png|500px|vignette|centré]]


Here the aim is to cross characteristics, from either a descriptive or an explanatory perspective. Depending on the type of variable, different techniques are used to analyze and process the data.
 
Both the dependent and the independent variables must be considered. When crossing them, one must check, for the dependent and for the independent variable, whether each is categorical or cardinal; this distinguishes three main families of analysis:
*'''categorical variables (nominal - nominal)''': contingency tables are used; the other techniques cannot be applied. Most of the time in political science we deal with this type of variable, because survey answers usually give rise to nominal or ordinal variables. Some coefficients give a single measure of the relationship between the two variables, such as Cramér's V, which measures the association between categorical variables. For interpretation, percentages must always be calculated within the categories of the independent variable, since we want to see how the distribution of the dependent variable depends on the category of the independent variable. Reporting the number of cases makes it possible to judge whether the result is statistically meaningful, since the sample size affects the measure.
*'''cardinal - cardinal variables''': cross-tabulation is no longer used; other tools are available, in particular correlation and regression:
**'''covariation''': with two continuous variables, when one increases the other increases or decreases proportionally; the two variables are linked in this sense.
**'''correlation''': simply a standardized covariation, i.e. one that lies between -1 and +1. Standardizing makes it possible to compare variables that are originally measured differently; if, for example, one scale runs from 0 to 10 and another from 0 to 5, the variables cannot be compared directly, so the information must be standardized. One can either bring the variables onto the same scale or use software that calculates a standardized correlation.
**'''regression''': a correlation is descriptive; we only want to see whether two variables are associated, without looking for a direction of causality. A regression is explanatory: we postulate that one variable influences the other.
   
   
*'''nominal independent variables - cardinal dependent variables''': neither cross-tabulations nor correlations and regressions can be applied; an analysis of variance or covariance is carried out, the simplest case of which is a comparison of means, for example the number of times individuals take part in an election according to canton.
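
To illustrate the first family, here is a minimal sketch of a contingency table with percentages calculated within the categories of the independent variable, followed by Cramér's V computed from the chi-square statistic. The cases and category names are invented for the example.

```python
import math
from collections import Counter

# Hypothetical cases: (independent variable, dependent variable).
cases = ([("urban", "votes")] * 30 + [("urban", "abstains")] * 20
         + [("rural", "votes")] * 20 + [("rural", "abstains")] * 30)

counts = Counter(cases)
n = len(cases)
iv_levels = sorted({iv for iv, _ in cases})
dv_levels = sorted({dv for _, dv in cases})
iv_totals = {iv: sum(counts[(iv, dv)] for dv in dv_levels) for iv in iv_levels}
dv_totals = {dv: sum(counts[(iv, dv)] for iv in iv_levels) for dv in dv_levels}

# Percentages within each category of the independent variable.
percent = {(iv, dv): 100 * counts[(iv, dv)] / iv_totals[iv]
           for iv in iv_levels for dv in dv_levels}

# Chi-square: gap between observed counts and counts expected under independence.
chi2 = 0.0
for iv in iv_levels:
    for dv in dv_levels:
        expected = iv_totals[iv] * dv_totals[dv] / n
        chi2 += (counts[(iv, dv)] - expected) ** 2 / expected

# Cramér's V rescales chi-square to the range 0 (no association) to 1.
cramers_v = math.sqrt(chi2 / (n * (min(len(iv_levels), len(dv_levels)) - 1)))

print(percent[("urban", "votes")], percent[("rural", "votes")])  # 60.0 40.0
print(round(cramers_v, 2))  # 0.2
```

Reading the table by category of the independent variable shows the relationship directly: 60% of the urban cases vote, against 40% of the rural ones.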


= Linear regression =
[[Fichier:Régression linéaire.png|500px|vignette|centré]]


Regression is a very varied and sometimes complex set of tools, and linear regression is the main one: much of the quantitative analysis done in the social sciences relies on it.
 
We speak of linearity because we postulate a linear relationship between the variables studied; in other words, a linear function lies behind the relationship. Non-linear regressions can, however, also be envisaged.

It is assumed that what we want to explain is a linear function of one or more independent variables. This is crucial, because linear regression is only one subset of a larger family of regression analyses, not all of which rest on the idea of linearity between the variables.

The simplest model has a single explanatory variable, for example political participation as a function of interest in politics.

In descriptive terms, there is a strong correlation between these two variables; if a hypothesis states that interest in politics influences participation, then a regression analysis is carried out.

This type of analysis always faces the problem of endogeneity: we postulate that interest in politics determines participation, but we could equally postulate that the more one participates, the more one develops an interest in politics.

Political participation is a linear function of interest in politics "plus" a constant: the value of Y when X equals 0, i.e. the level of participation when interest in politics is nil. In effect, it is the point where the regression line crosses the vertical axis.
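
Written out, the simple model described here has the following form. The letters B (slope) and E (error term) are those used in the rest of this section; the letter A for the constant is an assumption made for this sketch.

```latex
Y = A + B X + E
```

Here Y is political participation, X is interest in politics, and A is the value of Y when X = 0.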
 
There is always a margin of error in this kind of analysis. With survey data, one source is the margin of error between population and sample. But whether we work on a sample or on a whole population, an error term also comes into play, because there is always something that influences what we want to explain and that is not included in the regression model, such as education, age, or the social and institutional context.

E groups together the unexplained variance, i.e. everything that could explain Y but has not been introduced into the model. This is the problem of under-specification, which is what is at stake in specifying a model: the fewer relevant variables a model contains, the more likely it is to be under-specified, the less of the variation in Y it explains and the larger the error E. The aim is to reduce E as much as possible.

This means that failing to include certain variables in an explanatory model has two major consequences:
*the model is under-specified: it explains little of the variability of Y, because factors strongly correlated with what we want to study are left out.
*the second consequence concerns the control of variables: if we introduce interest in politics, a third variable may influence both interest in politics and political participation, in which case the association is spurious.
   
   
We therefore want to include as many as possible of the variables that we think can influence Y directly, or indirectly in a way that makes the relationship between X and Y false or merely apparent.
 
B is the regression coefficient, i.e. the slope of the regression line. It gives the strength of the effect of X, because it multiplies X: the stronger the effect of X, the higher B is.
 
B can be unstandardized or standardized. Standardizing means putting the coefficients on a common scale, the goal being to be able to compare different coefficients.
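
As a rough illustration, the constant, the unstandardized B and the standardized B of a simple regression can be computed by hand from the covariance and the standard deviations. The data and variable names are invented; real analyses would normally rely on statistical software.

```python
import statistics

# Hypothetical data: X = interest in politics, Y = political participation.
interest = [0, 1, 2, 3, 4]
participation = [1, 3, 2, 5, 4]

def covariance(x, y):
    """Population covariance of two equally long lists."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)

cov_xy = covariance(interest, participation)
sd_x = statistics.pstdev(interest)
sd_y = statistics.pstdev(participation)

# Unstandardized B: change in Y for a one-unit change in X (the slope).
b = cov_xy / statistics.pvariance(interest)
# Constant: value of Y when X equals 0.
a = statistics.mean(participation) - b * statistics.mean(interest)
# Standardized B: the slope once X and Y are expressed in standard deviations.
b_standardized = b * sd_x / sd_y
# Correlation: covariation standardized to lie between -1 and +1.
r = cov_xy / (sd_x * sd_y)

print(round(b, 2), round(a, 2), round(b_standardized, 2), round(r, 2))
# 0.8 1.4 0.8 0.8
```

With a single explanatory variable the standardized coefficient equals the correlation, which is why the last two values coincide.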
 
The logic is additive, hence the "+" signs: the variation of Y is assumed to be a linear, additive (cumulative) function of the effects of all the variables introduced in the model.
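
With several independent variables, the additive logic can be written as follows. The indices and the letter A for the constant are notational assumptions; B and E are the symbols used in this section.

```latex
Y = A + B_1 X_1 + B_2 X_2 + \dots + B_k X_k + E
```

Each coefficient B_j gives the effect of X_j on Y, and the effects simply add up.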


= Regression line =
[[Fichier:Droite de régression.png|500px|vignette|centré]]


The regression line represents the linear regression function. We want to see by how much Y increases when X increases. Suppose that one axis (here running from 0 to 12) is interest in politics and the other is political participation: there is a fairly strong correlation between the two, and when interest in politics increases, political participation increases.
 
The blue dots represent the cases and the regression line is the estimate of the values, so we look at how well this line fits the cloud of points.
 
The quality of the model has to do with the quality of the estimate, which depends heavily on how the points are distributed. Two clouds of points may be estimated by lines with the same slope, so that the estimated effect is the same, and yet the quality of the estimate differs: the line is a much more accurate approximation of a cloud whose points lie close to the line.
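
The point can be sketched with two invented clouds of points that share the same regression line but differ in how tightly they surround it. The quality of fit is summarized here by R² (the share of the variation in Y accounted for by the line), a common measure that is introduced for this example and not named in the text.

```python
def fit_line(x, y):
    """Least-squares constant (a) and slope (b) of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(x, y):
    """Share of the variation in y accounted for by the fitted line."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - residual / total

x = [0, 1, 2, 3, 4, 5]
# Deviations from the line, chosen so that they leave the slope unchanged.
scatter = [1, -1, 0, 0, -1, 1]
tight_cloud = [1 + 2 * xi + 0.2 * e for xi, e in zip(x, scatter)]
loose_cloud = [1 + 2 * xi + 2.0 * e for xi, e in zip(x, scatter)]

for name, y in (("tight", tight_cloud), ("loose", loose_cloud)):
    a, b = fit_line(x, y)
    print(name, round(b, 2), round(r_squared(x, y), 3))
```

Both clouds yield the same slope, but the line accounts for far more of the variation in the tight cloud than in the loose one.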
 
It should be kept in mind that correlation and regression analysis is one of the main instruments of quantitative analysis when dealing with interval or cardinal variables.
 
Linear regression, which is a subset of a larger family, rests on the idea of a linear function between X and Y. The line is fitted to a cloud of points representing the crossing of the two variables in the sample, and the regression line and its slope are then analyzed. If the slope is 0, Y does not change when X changes: one may be very interested in politics yet always participate at the same level.
 
= Multivariate analyses =
 
== Regression analysis ==
Depending on the type of variable to be explained, the linear regression tool may or may not be applicable. For dummy variables (absence or presence), for example, linear regression cannot be applied because its basic assumptions are not guaranteed; logistic regression is used instead.
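
One way to see the problem, sketched with invented coefficients: with a dummy dependent variable, a straight line can predict values below 0 or above 1, whereas the logistic function always stays strictly between 0 and 1 and can be read as a probability. The coefficients below are hypothetical and are not estimated from any data.

```python
import math

def linear_prediction(a, b, x):
    """Prediction of a linear model: can fall outside the 0-1 range."""
    return a + b * x

def logistic_prediction(a, b, x):
    """Prediction of a logistic model: always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-(a + b * x)))

a, b = -0.2, 0.15  # hypothetical coefficients
interest_scale = range(0, 13)  # interest in politics from 0 to 12

linear = [linear_prediction(a, b, x) for x in interest_scale]
logistic = [logistic_prediction(a, b, x) for x in interest_scale]

print(min(linear), max(linear))  # below 0 and above 1 at the ends of the scale
print(all(0 < p < 1 for p in logistic))  # True
```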
 
== Path analysis ==
One of the problems with regression analysis is that Y is assumed to be a linear function of the sum of all the independent variables; as a result, only the direct effects of the variables in the model are considered. What happens, then, when we want to look at indirect effects?

In that case an analysis of "causal paths" is carried out. The regression coefficients may or may not be significant, but the causal paths become visible: we can see how left-wing values influence participation not directly but indirectly. Being left-wing makes one more likely to be integrated into certain types of network, which fosters an interest in a particular issue, which in turn develops a sense of individual efficacy and leads to more intense participation. Intermediate variables are introduced.
 
Instead of a single equation there are several, because each variable can in turn be a dependent variable: the model is a set of equations.
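
A minimal sketch of how an indirect effect is usually quantified in linear path models: the coefficients along a path are multiplied, and the result is added to the direct effect to obtain the total effect. The path follows the example above, but all coefficients are hypothetical.

```python
import math

# Hypothetical standardized path coefficients along one causal path.
pathway = [
    ("left-wing values", "network integration", 0.5),
    ("network integration", "interest in an issue", 0.4),
    ("interest in an issue", "sense of individual efficacy", 0.5),
    ("sense of individual efficacy", "participation", 0.5),
]
direct_effect = 0.1  # hypothetical direct effect of left-wing values on participation

# Indirect effect: product of the coefficients along the path.
indirect_effect = math.prod(coefficient for _, _, coefficient in pathway)
total_effect = direct_effect + indirect_effect

print(round(indirect_effect, 3), round(total_effect, 3))  # 0.05 0.15
```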
 
== Factor analysis ==
The objective of this analysis is to reduce the complexity that arises when a data matrix has many variables and cases and a more succinct index is wanted.

When we discussed the operationalization of complex concepts, the final stage was the construction of indices; factor analysis makes it possible to construct indices by analyzing the underlying links that explain the variation across a set of indicators.
 
It is a tool frequently used in political science, especially when studying changes in values.
 
== Multi-level analysis ==
Up to this point all the measures concerned individual-level variables. There are also properties of the context, not of the individual, that can influence political participation, such as the electoral system or the type of political system.


With ordinary regression there are ways of working around the problem: contextual factors cannot be integrated into the analysis, so one can only compare across contexts.


Multi-level analysis makes it possible to run a multi-level regression that adds properties of the context to individual properties, integrating the two. This is an important development in political science.


= Types of qualitative methods =
A distinction can be made between content analysis and discourse analysis. There is no consensus in the literature on these terms: some consider discourse analysis to be a type of content analysis, others do not.


== Content analysis ==
Content analysis is concerned with weight, that is, with how much place the different issues raised by people occupy; it is more descriptive. A further distinction can be made:
* '''thematic''': the number of times a given theme appears in a discourse is counted.
* '''lexical''': analysis based on the occurrences or co-occurrences of words, i.e. a qualitative analysis that contains elements of quantitative analysis.
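
The thematic variant can be sketched as a simple count of how often the keywords attached to each theme appear in a text. The coding scheme, the keywords and the sample text are all invented for illustration; an actual coding scheme would come out of the researcher's own reading of the material.

```python
import re
from collections import Counter

# Hypothetical coding scheme: each theme is identified by a set of keywords.
THEMES = {
    "economy": {"jobs", "taxes", "wages"},
    "environment": {"climate", "energy"},
}

speech = ("Jobs and wages come first. Climate policy must protect jobs, "
          "and energy taxes must be fair.")

def count_themes(text, themes):
    """Count how many times each theme's keywords occur in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for word in words:
        for theme, keywords in themes.items():
            if word in keywords:
                counts[theme] += 1
    return counts

theme_counts = count_themes(speech, THEMES)
print(theme_counts["economy"], theme_counts["environment"])  # 4 2
```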


== Discourse analysis ==
It is an interpretative analysis, or rather a family of techniques, interested in the "how" of a given discourse and in its effects.


To simplify, content analysis is rather descriptive and discourse analysis rather explanatory.


== Analyse de discours ==
= Steps of the thematic analysis =
C’est une analyse interprétative, on parle d’une famille de techniques, on peut dire qu’on s’intéresse à comment et aux effets d’un discours donné.
There are five main stages:
# '''Familiarization''' (pre-analysis): Familiarity with the available equipment is essential.
Pour simplifier, l‘analyse de contenu est plutôt descriptive et l’analyse de discours explicatif.
# '''identification of a thematic framework''' (coding scheme, index): way of coding information or identifying the thematic framework.
# '''indexing''' (coding): reducing information.
# '''cartography''' (categorization and reduction of data): creation of typologies, classifications, reduction of data in order to be able to interpret them.
# '''mapping and interpretation''' (analysis and interpretation)


= Étapes de l’analyse thématique =
=Stages of discourse analysis=
Il y a cinq grandes étapes :
*Pre-analysis
#'''familiarisation''' (préanalyse) : il faut d’abord se familiariser avec le matériel à disposition.
*Identification of relevant elements
#'''identification d’un cadre thématique''' (schéma de codage, index) : manière de coder l’information soit d’identifier le cadre thématique.
*Systematic analysis based on identified elements
#'''indexation''' (codage) : réduire l’information.
#'''cartographie''' (catégorisation et réduction des données) : création de typologies, de classifications, réduction des données afin de pouvoir les interpréter.
#'''mapping et interprétation''' (analyse et interprétation)
=Étapes de l’analyse de discours=
*Préanalyse
*Identification d'éléments pertinents
*Analyse systématique à partir des éléments identifiés


= References =
= References =
<references/>
<references />


[[Category:science politique]]
[[Category:science politique]]

Current version as of 17 August 2022, 17:00


The analysis of quantitative data is very different from the analysis of qualitative data; they are two very different or even opposed research practices.

We will focus on quantitative analysis, which is in fact easier than the analysis of qualitative data, if only because there are institutionalized routines.

Data matrix

[Figure: data matrix (Matrice des données.png)]

It is a matrix that crosses the cases studied with a number of variables: the variables are in columns and the cases in rows.

A code must be assigned to non-responses so that they can be excluded from the analysis and distinguished from actual answers.
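As a minimal sketch of such a matrix, the following Python snippet stores one row per case and one column per variable. The variable names, the values and the missing-value code `99` are all invented for illustration; any code outside the valid range of the variable would do.

```python
# Hypothetical data matrix: one row per case (respondent), one column per variable.
MISSING = 99  # assumed code for "no answer"; it must lie outside the valid range

data_matrix = [
    {"case": 1, "age": 34, "interest": 3},
    {"case": 2, "age": 51, "interest": MISSING},
    {"case": 3, "age": 27, "interest": 1},
    {"case": 4, "age": 45, "interest": 4},
]

def valid_values(rows, variable, missing=MISSING):
    """Return the values of one variable, leaving out the non-responses."""
    return [row[variable] for row in rows if row[variable] != missing]

interest = valid_values(data_matrix, "interest")
print(interest)                          # [3, 1, 4]
print(len(data_matrix) - len(interest))  # 1 non-response excluded
```

Keeping the non-response as an explicit code, rather than deleting the row, preserves the case for analyses of the other variables.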

There are three analyses that correspond to three different objectives:

  • univariate analyses: analyses performed on a single variable or characteristic.
  • bivariate analyses: two variables are linked; we cross data in order to analyze variations more finely, such as interest in politics according to city or age.
  • multivariate analyses: on the one hand, a phenomenon is never explained by a single independent variable; on the other hand, we want to introduce controls in order to check relationships through the purification technique.

A distinction must also be made between descriptive analysis, which attempts to describe a "factual state" and is carried out with univariate or bivariate techniques, and explanatory analysis.

Types of univariate analyses

[Figure: types of univariate analyses (Types d’analyses univariées.png)]

Types of variables and operations between categories

There are different types of univariate analyses; the techniques depend on the type of variable:

  • nominal variables: only equivalence or difference operations can be performed.
  • ordinal variables: in addition, categories can be ordered, i.e. ranked from the smallest to the largest.
  • cardinal variables: the four basic arithmetic operations can be performed in addition to the previous operations.

Note: nominal and ordinal variables are categorical, discrete data; the distances between categories cannot be measured.

Measures of central tendency

When carrying out a quantitative analysis, it is necessary to consider the type of variable first and then choose the tool to use. We can distinguish between two main types of measures, i.e. two types of information that we want to have about a single variable:

  • measures of central tendency;
  • measures of dispersion.

Note: the appropriate measures differ depending on the type of variable.

The mean is a measure of central tendency that can be applied to cardinal variables, but not to categorical variables. The median is the category that separates the statistical series in two, with the same number of cases on either side.
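The link between variable type and measure can be sketched with Python's standard `statistics` module. The three small data sets below are invented for illustration; the mode is added here as the usual measure for nominal variables, which the text above does not name.

```python
import statistics

# Hypothetical answers from nine respondents (invented for illustration).
canton = ["GE", "VD", "GE", "ZH", "GE", "VD", "ZH", "GE", "BE"]  # nominal
interest = [1, 1, 2, 2, 3, 3, 3, 4, 4]  # ordinal, 1 = none ... 4 = high
age = [22, 25, 31, 38, 40, 47, 52, 60, 72]  # cardinal

# Nominal: only the mode (most frequent category) is meaningful.
print(statistics.mode(canton))          # GE

# Ordinal: the median category splits the ordered cases in two.
# median_low always returns a category that was actually observed.
print(statistics.median_low(interest))  # 3

# Cardinal: the mean is defined because distances between values are meaningful.
print(statistics.mean(age))             # 43
```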

This is important information that forms the starting point for this type of data description, allowing us to know what to do next in the case of more sophisticated analyses.

Measures of dispersion

Among measures of dispersion, the basic one is the standard deviation, which is the square root of the variance. The variance indicates how individuals are distributed around the mean; the standard deviation expresses this spread in the units of the variable itself.

Variance is very important for calculating the probability of error. Different measures are required depending on the level of measurement of the variable, and both a measure of central tendency and a measure of dispersion must be taken into account. The standard deviation is a key coefficient throughout quantitative analysis.
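The relation between variance and standard deviation can be checked by hand on a small example. The scores below are invented, and the formulas are the population versions (division by n); this is a sketch, not a recommendation of one formula over the other.

```python
import math
import statistics

# Hypothetical scores on a 0-10 scale (invented for illustration).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(scores) / len(scores)
# Variance: average squared distance of each case from the mean.
variance = sum((s - mean) ** 2 for s in scores) / len(scores)
# Standard deviation: square root of the variance, in the units of the variable.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0

# The standard library gives the same population values.
assert math.isclose(variance, statistics.pvariance(scores))
assert math.isclose(std_dev, statistics.pstdev(scores))
```

When the data are a sample rather than a whole population, the division is by n - 1 instead; `statistics.variance` and `statistics.stdev` implement that version.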

Types of bivariate analyses

[Figure: types of bivariate analyses (Types d’analyses bivariées.png)]

In this context we wish to cross characteristics either in a descriptive or an explanatory perspective. Depending on the type of variable, different techniques are used to analyze and process the data.

Both the dependent and the independent variable need to be considered. When crossing them, we must check whether each one is categorical or cardinal, which makes it possible to distinguish three main families of analysis:

  • categorical (nominal or ordinal) variables on both sides: contingency tables are built; other techniques cannot be used. Most of the time in political science we are dealing with this type of variable, because survey answers give rise to nominal or ordinal variables. Some coefficients give a single measure of the relationship between the two variables, such as Cramér's V, which measures the association between categorical variables. For interpretation, percentages should always be computed within the categories of the independent variable, since we want to see how the distribution on the dependent variable depends on the categories of the independent variable. The number of cases must also be considered, as the sample size affects how much confidence can be placed in the measure.
  • cardinal-cardinal variables: we no longer cross-tabulate; we have other tools, in particular correlation and regression:
    • covariation: with two continuous variables, when one increases the other increases or decreases accordingly; the two variables are linked in this sense.
    • correlation: it is simply a standardized covariation that lies between -1 and +1. Standardization is used to compare variables that are measured in different ways; if, for example, we have scales from 0 to 10 and scales from 0 to 5, we cannot compare these variables directly, so the information must be standardized. The variables can be rescaled to the same scale, or software can compute a standardized correlation.
    • regression: in a correlation one is in a descriptive perspective and only wants to see whether two variables are associated, without looking for a direction of causality; in a regression one posits a direction, with an independent variable explaining a dependent variable.
  • nominal independent variables - cardinal dependent variables: cross-tabulations, correlations and regressions cannot be applied; an analysis of variance or covariance is carried out, the simplest case of which is a comparison of means, for example the number of times that individuals participate in an election according to the canton.
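For the first family, the following sketch builds a contingency table, computes percentages within each category of the independent variable, and derives Cramér's V from the chi-square statistic. The survey counts and category names are invented for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical survey: (independent, dependent) = (age group, voting behaviour).
cases = (
    [("young", "votes")] * 20 + [("young", "abstains")] * 30
    + [("old", "votes")] * 40 + [("old", "abstains")] * 10
)

counts = Counter(cases)
groups = sorted({g for g, _ in cases})     # categories of the independent variable
outcomes = sorted({o for _, o in cases})   # categories of the dependent variable
n = len(cases)

group_totals = {g: sum(counts[(g, o)] for o in outcomes) for g in groups}
outcome_totals = {o: sum(counts[(g, o)] for g in groups) for o in outcomes}

# Percentages are computed within each category of the independent variable.
for g in groups:
    shares = {o: 100 * counts[(g, o)] / group_totals[g] for o in outcomes}
    print(g, shares)

# Chi-square compares observed counts with those expected under independence.
chi2 = sum(
    (counts[(g, o)] - group_totals[g] * outcome_totals[o] / n) ** 2
    / (group_totals[g] * outcome_totals[o] / n)
    for g in groups
    for o in outcomes
)

# Cramér's V rescales chi-square to a value between 0 and 1.
cramers_v = sqrt(chi2 / (n * (min(len(groups), len(outcomes)) - 1)))
print(round(cramers_v, 3))  # 0.408
```

Here 80% of the older group votes against 40% of the younger group, and V summarizes that difference in a single coefficient.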

Linear regression

[Figure: linear regression (Régression linéaire.png)]

Regression is a very varied and sometimes complex set of tools, and linear regression is the main one: much of the quantitative analysis done in social science relies on it.

We talk about linearity because we postulate that there is a linear relationship between the variables we are studying; in other words, there is a linear function behind this relationship. However, we can also envisage regressions that are not linear.

It is assumed that what we want to explain is a linear function of one or more independent variables. This is crucial, as linear regression is only a subset of a larger family of regression analyses, not all of which assume linearity between the variables.

The simplest model is with an explanatory variable such as, for example, political participation based on political interest.

In descriptive terms, there is a strong correlation between these two variables; if a hypothesis says that it is the interest in politics that influences participation, then a regression analysis is done.

There is always the problem of endogeneity in this type of analysis: we postulate that interest in politics determines participation, but we could also postulate that the more we participate, the more we develop an interest in politics.

Political participation is a linear function of interest in politics "plus" a constant factor: the value of Y when X is equal to 0, i.e. the level of participation when interest in politics is nil. Graphically, it is the point where the regression line crosses the ordinate axis.

In multivariate analysis, there is always a margin of error. With survey data, part of it is the sampling error between sample and population; but whether one works on a sample or on an entire population, there is also an error term, because there is always something that influences what we want to explain and that is not included in the regression model, such as education, age, or social, institutional and other factors.

The E groups together the unexplained variance, i.e. everything that could explain Y but is not introduced into the model. This is the problem of under-specification, i.e. the model specification issue: the fewer relevant variables a model includes, the more likely it is to be underspecified, the less of the variation in Y it explains and the larger E is.

Leaving some variables out of an explanatory model therefore has two major consequences:

  • the model is under-specified: it explains little of the variability of Y, because factors strongly correlated with what we want to study are missing.
  • control is lost: if one introduces only interest in politics, a third variable may influence both interest in politics and political participation, in which case the association is spurious.

We therefore want to include as many variables as possible that we think can influence Y directly, or indirectly by making the relationship between X and Y spurious or only apparent.

B is the regression coefficient, i.e. the slope of the regression line. It gives the strength of the effect of X because it is multiplied by X: the stronger the effect of X, the larger B is.

B can be non-standardized or standardized. Standardization puts the coefficients on a common scale, and the goal is to be able to compare different coefficients.

The logic is additive, hence the "+" signs: one assumes that the variation of Y is a linear, additive or cumulative, function of the effects of all the variables introduced in the model.
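The elements discussed in this section can be gathered in a single equation. The symbols Y, X, B and E are those used in the text; the label A for the constant and the subscripts are assumptions introduced here for readability.

```latex
% Simple model: one explanatory variable
Y = A + B X + E

% Additive model with k explanatory variables
Y = A + B_1 X_1 + B_2 X_2 + \dots + B_k X_k + E
```

A is the value of Y when every X is 0, each B is the slope associated with one X, and E collects everything that influences Y without being included in the model.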

Regression line

[Figure: regression line (Droite de régression.png)]

The regression line represents the linear regression function. We want to see by how much Y increases when we increase X. Suppose that one axis (0;12) is interest in politics and the other one political participation; we can see that there is a fairly strong correlation between the two: when political interest increases, political participation increases.

The blue dots represent the cases and the regression line is the estimate of the values, so we look at how well this line fits the cloud of dots.

The quality of the model has to do with the quality of the estimate, which depends very much on how the points are distributed. Two point clouds may be estimated by straight lines that have the same slope, so the estimated effect is the same; the quality of the estimate nevertheless differs, because the straight line approximates the point cloud much more accurately when the points lie close to the line.

It should be kept in mind that correlation and regression analysis are among the main instruments of quantitative analysis when using interval or cardinal variables.

Linear regression, which is a subset of a larger set, is based on the idea of a linear function between X and Y. We try to summarize a cloud of points representing the crossing of the two variables in the sample, so we analyze the regression line and its slope. If the slope is 0, Y does not change when X changes: you may be very interested in politics, but you always participate at the same level.
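The point about two clouds sharing the same slope but not the same quality of fit can be sketched in a few lines of Python. The data are invented, and R² (the share of the variation in Y accounted for by the line) is used here as one common measure of fit; the text itself does not name a specific measure.

```python
def fit_line(x, y):
    """Least-squares fit of Y = a + b*X; returns (a, b, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope: change in Y per unit of X
    a = mean_y - b * mean_x    # intercept: value of Y when X = 0
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Hypothetical interest in politics on a 0-12 scale.
interest = [0, 2, 4, 6, 8, 10, 12]
# Deviations chosen so that they do not change the fitted line.
pattern = [1, -2, 1, 0, 1, -2, 1]

# Same underlying line, points close to it (tight) or far from it (loose).
tight = [1 + 0.5 * x + 0.2 * e for x, e in zip(interest, pattern)]
loose = [1 + 0.5 * x + 2.0 * e for x, e in zip(interest, pattern)]

for name, participation in (("tight", tight), ("loose", loose)):
    a, b, r2 = fit_line(interest, participation)
    print(f"{name}: intercept={a:.2f} slope={b:.2f} R2={r2:.2f}")
```

Both clouds yield a slope of 0.5, but R² is roughly 0.98 for the tight cloud and roughly 0.37 for the loose one.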

Multivariate analyses

Regression analysis

Depending on the type of variable that you want to explain, you may or may not be able to apply linear regression. For example, when the dependent variable is a dummy (absence or presence), linear regression cannot be applied because its basic assumptions are not met; logistic regression is used instead.
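One reason a straight line fits a dummy poorly is that its predictions are not bounded between 0 and 1. The sketch below contrasts a linear prediction with the logistic function applied to the same linear predictor. The coefficients are invented for illustration; in practice each model would be estimated separately and their coefficients would not coincide.

```python
import math

def linear_prediction(a, b, x):
    """Straight-line prediction; nothing keeps it between 0 and 1."""
    return a + b * x

def logistic_prediction(a, b, x):
    """Pass the same linear predictor through the logistic function."""
    return 1 / (1 + math.exp(-(a + b * x)))

# Hypothetical coefficients; x is interest in politics on a 0-12 scale,
# and the outcome is a dummy (1 = participates, 0 = does not).
a, b = -0.2, 0.15

for x in (0, 6, 12):
    print(x, round(linear_prediction(a, b, x), 2), round(logistic_prediction(a, b, x), 2))
```

With these values the linear prediction falls below 0 at x = 0 and exceeds 1 at x = 12, whereas the logistic prediction always stays strictly between 0 and 1 and can be read as a probability.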

Path analysis

One of the problems with regression analysis is that Y is assumed to be a linear function of the sum of all independent variables; as a result, we only look at the direct effects of the model's variables. What happens when we want to look at indirect effects?

We then carry out a causal path analysis. There are regression coefficients that may or may not be significant, but we can also see causal pathways: for example, left-wing values may influence participation not directly but indirectly, in that being left-wing makes it more likely to be integrated in certain types of network and to develop an interest in a certain issue, which fosters a feeling of individual efficacy and a higher intensity of participation. Intermediate variables are introduced.

Instead of a single equation we have several, because each variable can in turn be a dependent variable; we estimate a set of equations.
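With a single intermediate variable, the set of equations can be sketched as follows. The symbol M for the intermediate variable and all subscripts are assumptions introduced here; X, Y, B and E follow the notation of the regression section.

```latex
% The intermediate variable M depends on X
M = A_1 + B_1 X + E_1

% Y depends on X directly and on M
Y = A_2 + B_2 X + B_3 M + E_2

% Decomposition of the effect of X on Y (linear case)
\text{direct effect} = B_2, \qquad
\text{indirect effect} = B_1 B_3, \qquad
\text{total effect} = B_2 + B_1 B_3
```

In the example above, X would be left-wing values, M an intermediate variable such as network integration, and Y the intensity of participation; longer pathways add one equation per intermediate variable.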

Factor analysis

The objective of this analysis is to reduce the complexity of a data matrix that contains many variables and cases, in order to obtain a more succinct index.

When we talked about the operationalization of complex concepts, we arrived at a final stage of construction; factor analysis allows us to construct indexes by analyzing the underlying links that explain the variation on a multiple set of indicators.

It is a tool frequently used in political science, especially when studying changes in values.

Multi-level analysis

Previously, all measures concerned individual variables. However, there are properties of the context, not of the individual, that can influence political participation, such as the electoral system or the type of political system.

In ordinary regression there are ways to work around the problem, but contextual factors cannot be integrated in the analysis; one can only compare contexts.

Multi-level analysis makes it possible to run a multi-level regression, which integrates properties of the context alongside individual properties. This is an important development in political science.
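A simple form of such a model, sketched here under assumed notation, adds a contextual variable and a context-level error term to the regression equation. The indices i (individual) and j (context), the symbols Z, C and U, and the label A for the constant are assumptions introduced for illustration.

```latex
% Individual i in context j (for example, a country or a canton)
Y_{ij} = A + B X_{ij} + C Z_j + U_j + E_{ij}
```

X_{ij} is an individual property such as interest in politics, Z_j is a property of the context such as the electoral system, U_j captures what contexts have in common beyond Z_j, and E_{ij} is the individual error term.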

Types of qualitative methods

A distinction can be made between content analysis and discourse analysis. There is no consensus in the literature on these terms: some consider discourse analysis a type of content analysis and others do not.

Content analysis

Content analysis is concerned with weight; it is more descriptive and looks at the different issues raised by people. A further distinction can be made:

  • thematic: we count the number of times a given theme appears in a discourse.
  • lexical: analysis based on occurrences or co-occurrences, i.e. a qualitative analysis that has elements of quantitative analysis.
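Both variants can be sketched with a few lines of Python. The coding scheme (themes and keywords) and the text are invented for illustration, and the unit of counting chosen here is the sentence, which is only one possible choice.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical coding scheme: each theme is signalled by a few keywords.
themes = {
    "economy": {"jobs", "taxes", "wages"},
    "environment": {"climate", "energy"},
}

speech = (
    "We will protect jobs and raise wages. "
    "Climate policy must not cost jobs. "
    "Clean energy is our future."
)

sentences = [s for s in re.split(r"[.!?]\s*", speech.lower()) if s]
theme_counts = Counter()
cooccurrences = Counter()

for sentence in sentences:
    words = set(re.findall(r"[a-z]+", sentence))
    present = sorted(t for t, keywords in themes.items() if words & keywords)
    theme_counts.update(present)                     # thematic: sentences mentioning a theme
    cooccurrences.update(combinations(present, 2))   # lexical: themes appearing together

print(dict(theme_counts))    # {'economy': 2, 'environment': 2}
print(dict(cooccurrences))   # {('economy', 'environment'): 1}
```

The counts are quantitative, but building the coding scheme itself remains an interpretative, qualitative step.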

Discourse analysis

It is an interpretative analysis, and the term covers a family of techniques. Broadly, we are interested in how a given discourse works and in its effects.

To simplify, content analysis is rather descriptive and discourse analysis rather explanatory.

Steps of the thematic analysis

There are five main stages:

  1. familiarization (pre-analysis): one must first become familiar with the available material.
  2. identification of a thematic framework (coding scheme, index): deciding how to code the information, i.e. identifying the thematic framework.
  3. indexing (coding): reducing information.
  4. cartography (categorization and reduction of data): creation of typologies and classifications, reduction of data in order to be able to interpret them.
  5. mapping and interpretation (analysis and interpretation).

Stages of discourse analysis

  • Pre-analysis
  • Identification of relevant elements
  • Systematic analysis based on identified elements

References