Factor Analysis

Factor Analysis is located under Model Studio (  ) in Data Preparation, in the left task pane. Use the drag-and-drop method to use the algorithm in the canvas. Click the algorithm to view and select different properties for analysis.

Refer to Properties of Factor Analysis.


Steps in Factor Analysis:
  1. Define the problem statement: why do you want to perform Factor Analysis?
  2. Construct the correlation matrix.
  3. Determine the Pearson correlation between variables and identify which variables are correlated.
  4. Decide the method to be used for Factor Analysis:
    1. Varimax rotation
    2. Maximum likelihood
  5. Determine the number of relevant factors for the study. For example, you may have a set of seven variables that you want to reduce to three. This is an individual decision based on the dataset and analytical requirements, and is mostly determined by trial and error.
  6. Rotate the factors and interpret the results.
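As a rough illustrative sketch of steps 2 and 3 outside the product (assuming Python with NumPy; the dataset, seed, and variable count are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical dataset: 100 observations of 5 numeric variables;
# variables 3 and 4 are driven by variable 0, so they correlate with it.
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               rng.normal(size=(100, 2)),
               base + 0.2 * rng.normal(size=(100, 2))])

# Step 2: construct the correlation matrix.
R = np.corrcoef(X, rowvar=False)

# Step 3: identify correlated pairs, e.g. variables 0 and 3.
print(round(float(R[0, 3]), 2))
```

Strongly correlated pairs (values near -1 or 1) are the ones Factor Analysis can compress into common factors.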

Properties of Factor Analysis

The available properties of the Factor Analysis are shown in the figure below.


The table below describes the different fields present on the Properties pane of Factor Analysis.

Field

Description

Remark

Run

It allows you to run the node.

-

Explore

It allows you to explore the successfully executed node.

-
Vertical Ellipses

The available options are

  • Run till node
  • Run from node
  • Publish as a model
  • Publish code
-

Task Name

It is the name of the task selected on the workbook canvas.

  • You can click the text field to edit or modify the task's name as required.
  • To read more about the algorithm/functionality, hover over the Help Icon (  ) next to the Task Name heading.

Independent Variable


It allows you to select the unknown variables to determine the factors.

  • You can select any variable of a numerical data type as an independent variable.
  • These variables are reduced to common attributes called factors used for regression or modeling purposes.
  • In Factor Analysis, all the variables are considered independent.

Number of Factors


It allows you to select the number of factors to which you want to reduce the variables.

  • By default, the number of factors is one (1).
  • The number of factors is always an integer value.
  • Normally, you should select a number less than the number of independent variables in the dataset, because the aim of Factor Analysis is data reduction.
  • The number of factors can be at most one less than the number of variables.

Advanced



Scores

It allows you to select the factor score for Factor Analysis.

  • It is also known as the component score.
  • The following methods are available for calculating the factor score.
    • None
    • Regression (default)
    • Bartlett
  • Selecting None means that no factor score is calculated.
  • The Bartlett score is associated with a test statistic (sigma value) used to test the null hypothesis that the variables in the dataset are not correlated.
  • Generally, if the Bartlett sigma value is less than 0.05, Factor Analysis is recommended.
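The regression (default) score method can be sketched in NumPy as scores = Z R⁻¹ L, where Z is the standardized data, R the correlation matrix, and L the loading matrix. Everything below (data, sizes, single-factor extraction by the principal-factor method) is a made-up illustration, not the product's implementation:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical data: 300 observations of 4 variables driven by one factor
F = rng.normal(size=(300, 1))
X = F @ np.ones((1, 4)) + 0.5 * rng.normal(size=(300, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize

R = np.corrcoef(Z, rowvar=False)
vals, vecs = np.linalg.eigh(R)                # eigenvalues in ascending order
L = vecs[:, -1:] * np.sqrt(vals[-1])          # loadings for one factor

scores = Z @ np.linalg.inv(R) @ L             # regression-method factor scores
```

The resulting scores track the underlying factor closely (up to an arbitrary sign).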

Rotations

It allows you to select the factor rotation method.

  • The factor rotation method decides how the axes are rotated. The factors undergo multiple rotations until we obtain the best possible combination of variables for a given factor.
  • By default, the method selected is Varimax.
  • You can select from the following methods.
    • None
    • Varimax
    • Promax
  • Selecting None means that no factor rotation method is applied.
  • Varimax is an orthogonal rotation method. It constrains the factors to be uncorrelated. It minimizes the number of variables with high factor loading on each factor, which simplifies the factor interpretation.
  • Promax is an oblique rotation method. It allows the correlation of factors. Promax is useful for large datasets since it can be calculated quickly.
  • Varimax is a more common and recommended method of factor rotation.
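Varimax rotation can be implemented in a few lines of NumPy; this is a generic sketch of the standard SVD-based algorithm, not the product's code, and the loading matrix is made up:

```python
import numpy as np

def varimax(A, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate loading matrix A to maximize the varimax criterion."""
    p, k = A.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = A @ R
        # Gradient of the varimax criterion with respect to the rotation
        G = A.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt
        if s.sum() < crit * (1 + tol):
            break
        crit = s.sum()
    return A @ R

# Hypothetical unrotated loadings: 4 variables x 2 factors
A = np.array([[0.7, 0.5], [0.6, 0.6], [0.5, -0.6], [0.6, -0.5]])
rotated = varimax(A)
```

Because the rotation matrix is orthogonal, each variable's communality (row sum of squared loadings) is unchanged; only the distribution of loadings across factors changes.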

Node Configuration

It allows you to select the instance of the AWS server to provide control over the execution of a task in a workbook or workflow.

For more details, refer to Worker Node Configuration.

Results of Factor Analysis

The following table elaborates on the findings obtained from Factor Analysis.

Result

Significance

Loadings

  • Factor Loadings show how much a factor can explain a variable.
  • The range of factor loadings is from -1 to 1.
  • Values of factor loadings close to -1 or 1 show that the factor strongly influences the variable.
  • If the factor loading is close to zero, the influence is weak.
  • Some factors may have a simultaneous high influence on multiple variables.

KMO Test (Kaiser-Meyer-Olkin Test)

  • The KMO test measures the suitability of data for Factor Analysis.
  • It measures the sampling adequacy for each variable and for the complete model.
  • The test yields a statistic called the KMO statistic.
  • It measures the proportion of variance among the variables that might be common variance.
  • The value of the KMO statistic lies between 0 and 1.
  • The following table gives the interpretation of various KMO values.

KMO Value

Interpretation

Close to zero

Widespread correlation among the variables

Less than 0.6

  • Inadequate Sampling
  • Action is needed to resolve the sampling issue.
  • Any KMO statistic below 0.5 is unacceptable.

Between 0.6 & 1

  • Adequate Sampling
  • Any KMO statistic above 0.9 is considered excellent

Note: Practically, a KMO accuracy value above 0.5 is also considered acceptable for the validity of the test. Below 0.5, the value indicates that more data collection is essential since the sample is inadequate.
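The overall KMO statistic can be sketched from the correlation matrix alone: it compares squared correlations against squared partial (anti-image) correlations. This is a generic NumPy illustration with made-up data, not the product's computation:

```python
import numpy as np

def kmo_overall(R):
    """Overall KMO statistic from a (non-singular) correlation matrix R."""
    inv_R = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv_R))
    partial = -inv_R / np.outer(d, d)      # partial (anti-image) correlations
    np.fill_diagonal(partial, 0.0)
    off = R - np.eye(R.shape[0])           # off-diagonal correlations
    r2, p2 = (off ** 2).sum(), (partial ** 2).sum()
    return r2 / (r2 + p2)

rng = np.random.default_rng(3)
F = rng.normal(size=(500, 2))              # two hypothetical common factors
X = F @ rng.normal(size=(2, 6)) + 0.4 * rng.normal(size=(500, 6))
kmo = kmo_overall(np.corrcoef(X, rowvar=False))
```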

Uniqueness

  • Uniqueness gives the value of variance unique to a variable.
  • It is not shared with any other variable in the dataset.
  • Uniqueness = 1 – Communality,
    • where communality is the variance a variable shares with the other variables.
    • For example, a communality of 0.75 means a uniqueness of 0.25. Thus, the variable shares 75% of its variance with the other variables, and only 25% of its variance is unique.
  • A greater uniqueness value indicates that the variable is less relevant for factor analysis.

Communalities

  • Communality is the common variance found in any variable.
  • It is an important measure to determine the value of a variable in Factor Analysis (or Principal Component Analysis).
  • Communality tells us what proportion of the variable's variance results from the correlation between the variable and the individual factors.
  • In Factor Analysis, communality is denoted by h2.
  • The communality of a variable lies between 0 and 1.
  • If the communality of variance is 0, the variable is unique. It cannot be explained by any other variable at all.
  • If the communality is 1, the variable does not have any unique variance and can be completely explained using other variables.

Practically, the communality extraction value for a variable should be greater than 0.5. If it is lower, you can remove the variable and re-run the Factor Analysis.

Note: However, values of 0.3 and above are also considered depending on the dataset.
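The arithmetic linking communality and uniqueness can be sketched directly from a loading matrix (the loadings below are made up):

```python
import numpy as np

# Hypothetical loading matrix: 4 variables x 2 factors
L = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7],
              [0.3, 0.8]])

communality = (L ** 2).sum(axis=1)   # h2: variance explained by the factors
uniqueness = 1.0 - communality       # variance unique to each variable
```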

Scree Plot

  • A scree plot is a line graph that displays the eigenvalues in a multivariate analysis.
  • The eigenvalues indicate
    • the factors to be retained in an exploratory factor analysis, or
    • the principal components to be retained in a principal component analysis.
  • The eigenvalues are plotted on the Y-axis on a scree plot, while the number of factors is plotted on the X-axis.

  • A scree plot displays these eigenvalues in a downward curve from largest to the smallest.
    • The first factor or component usually explains a major part of the variability.
    • The subsequent few components explain the variability moderately, while the components after that explain only a minuscule fraction of the variability.
  • The scree plot results from the scree test, a method to determine the most significant factors or components.
    • The 'elbow point' (where the eigenvalues seem to be leveling off) is determined from this test.
    • All eigenvalues to the left of this elbow are considered significant and are retained, while the values to the right are non-significant and are discarded.
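The eigenvalues behind a scree plot come from the correlation matrix; a quick NumPy sketch with made-up two-factor data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 6 observed variables driven by 2 underlying factors
F = rng.normal(size=(400, 2))
X = F @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(400, 6))

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # Y-axis of the scree plot
# With two true factors, the leading eigenvalues should dominate and the
# curve should level off (the 'elbow') shortly after.
```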

Bartlett Test of Sphericity

  • Bartlett's test of sphericity is a test of comparison between the observed correlation matrix and the identity matrix.
  • Bartlett's test checks
    • the redundancy between variables, and
    • whether we can reduce the redundant variables so that the data can be summarized with only a few factors.
  • For Bartlett's test,
    • Null Hypothesis: The variables are orthogonal; that is, they are not correlated.
    • Alternate Hypothesis: The variables are not orthogonal; that is, they are correlated.
  • Thus, Bartlett's test ensures that the correlation matrix diverges prominently from the identity matrix. It helps us in selecting the data reduction technique.
  • This test yields a p-value, which is then compared with the chosen significance level (usually 0.01, 0.05, or 0.1).
  • If the p-value is less than the significance level, the dataset contains redundant variables and is suitable for the data reduction technique.
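The test statistic can be sketched with NumPy and SciPy using the standard formula -(n - 1 - (2p + 5)/6) * ln det(R), which follows a chi-square distribution with p(p - 1)/2 degrees of freedom; the data below are made up:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test of sphericity; H0: the correlation matrix is an identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) // 2
    return chi2, df, stats.chi2.sf(chi2, df)

rng = np.random.default_rng(5)
base = rng.normal(size=(200, 1))
X = base + 0.5 * rng.normal(size=(200, 4))   # four strongly correlated variables
chi2, df, p_value = bartlett_sphericity(X)
```

A p-value below the chosen significance level rejects the null hypothesis, so the data are suitable for reduction.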

Example of Factor Analysis

Consider a Hyundai Stock price dataset.

A snippet of the input data is shown in the figure given below.



For applying Factor Analysis, the following properties are selected.

Independent Variables

Open, High, Low, Close and so on

Number of Factors

4

Scores

None

Rotation

Varimax

The image below shows the Results page of the Factor Analysis.


We first look at the Bartlett Test of Sphericity results on the Results page, followed by the Scree Plot. The statistics obtained in this test are crucial in deciding whether Factor Analysis is required.

Bartlett's Test of Sphericity tells you whether you can go ahead with Factor Analysis for data reduction. The KMO test tells you whether the Factor Analysis is appropriate and accurate.

Bartlett Test of Sphericity:

  • The Approximate value of Chi-Square = 48478.6146
  • Degrees of Freedom (df) = 6
  • Sigma (sig) = 0

Inference:

  • Since the sigma value (0) is less than the significance level (assumed as 0.05), the dataset is suitable for data reduction, in our case, Factor Analysis.
  • It also indicates a substantial amount of correlation within the data.

KMO Test:

  • You can see the individual KMO values for each variable.
  • It is 0.7749 for Open, 0.7745 for High, and so on.
  • The overall KMO accuracy is 0.7798 (greater than 0.6).

Inference:

  • Since the individual KMO scores for each variable are greater than 0.6, the individual data points are adequate for Sampling.
  • Since the overall KMO score is greater than 0.6, the overall Sampling is adequate.

After the results from these two tests are analyzed, you can study other results. Among the remaining, you first analyze the communality extraction values for various variables.

Communalities:

  • You can see the communality extraction score for each variable.
    • It is 0.9974 for Open, 0.9982 for High, and so on.
    • It is maximum for High (0.9982) and minimum for Close (0.995).

Inference:

  • The closer the communality is to one (1), the better is the variable explained by the factors.
  • The closer the communality extraction values of the variables are to each other, the better the chances that those variables belong to a group (or community) and share a communal variance.
  • For example, High, Low, and Close have high chances of belonging to a group.

Loadings:

  • You can see the variance values of a variable as determined by each factor.
    • For example, the variance of variable Open is better explained by factor F0 (0.998) compared to F2 (-0.0322).
  • The factor loading scores indicate which variables fall into which factor category to be combined.

 Notes:

  • In unsupervised machine learning, the variables are unknown.
  • After Factor Analysis, the variables are grouped into factors, and then the factors are suitably named.
  • For example, you want to stitch shirts for students in a class. You take each student's Height and Shoulder Width measurements (variables) and then classify them into sizes Small, Medium, and Large (factors) as required.

Uniqueness:

  • You can see the uniqueness extraction values for each variable.
  • It is 0.0026 for Open, 0.0018 for High, and so on.
  • It is maximum for Close (0.005) and minimum for High (0.0018).
  • Uniqueness is calculated as '1 – Communality'. This means that the more communal a variable is, the less unique it is.
  • Thus, Close is the most unique variable, while High is the least unique.

Correlation Plot:

  • It gives the values of Pearson correlation for all the variables.
  • A value of Pearson correlation closer to one (1) indicates a strong correlation between two variables, while a value close to zero (0) indicates a weak correlation.

Scree Plot:

  • It plots the eigenvalues of factors against the factor number.
  • It displays the PC (Principal Component) value for the components.
