Locally Linear Embedding (LLE) is another very powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections like the previous algorithms. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved (more details shortly). This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.
For example, the following code uses Scikit-Learn’s LocallyLinearEmbedding class to unroll the Swiss roll. The resulting 2D dataset is shown in Figure below As you can see, the Swiss roll is completely unrolled and the distances between instances are locally well preserved. However, distances are not preserved on a larger scale: the left part of the unrolled Swiss roll is stretched, while the right part is squeezed. Nevertheless, LLE did a pretty good job at modeling the manifold
from sklearn.manifold import LocallyLinearEmbedding lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10) X_reduced = lle.fit_transform(X)

Here’s how LLE works: first, for each training instance x(i) , the algorithm identifies its k closest neighbors (in the preceding code k = 10), then tries to reconstruct x(i) as a linear function of these neighbors. More specifically, it finds the weights wi,j such that the squared distance between x(i) and ∑j = 1,m wi, jx(j) is as small as possible, assuming wi,j= 0 if x(j)is not one of the k closest neighbors of x(i) . Thus the first step of LLE is the constrained optimization problem described in Equation 8-4, where W is the weight matrix containing all the weights wi,j. The second constraint simply normalizes the weights for each training instance x(i).
LLE step 1: linearly modeling local relationships

After this step, the weight matrix W (containing the weights wi, j) encodes the local linear relationships between the training instances. Now the second step is to map the training instances into a d-dimensional space (where d < n) while preserving these local relationships as much as possible. If z(i) is the image of x(i) in this d-dimensional space, then we want the squared distance between z(i) and ∑j = 1 m wi, j z(j) to be as small as possible. This idea leads to the unconstrained optimization problem described in Equation 8-5. It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: keeping the weights fixed and finding the optimal position of the instances’ images in the lowdimensional space. Note that Z is the matrix containing all z(i).
LLE step 2: reducing dimensionality while preserving relationships

Scikit-Learn’s LLE implementation has the following computational complexity: O(m log(m)n log(k)) for finding the k nearest neighbors, O(mnk^3 ) for optimizing the weights, and O(dm^2 ) for constructing the low-dimensional representations. Unfortunately, the m^2 in the last term makes this algorithm scale poorly to very large datasets.
Other Dimensionality Reduction Techniques
There are many other dimensionality reduction techniques, several of which are
available in Scikit-Learn. Here are some of the most popular:
- Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve
the distances between the instances - Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances9 between the instances.
- t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality
while trying to keep similar instances close and dissimilar instances apart. It is
mostly used for visualization, in particular to visualize clusters of instances in
high-dimensional space (e.g., to visualize the MNIST images in 2D). Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier.
