In [1]:

```
import numpy
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
def generate_data(curve_type, num_points=1000):
if curve_type=='swissroll':
tt = numpy.array((3*numpy.pi/2)*(1+2*numpy.random.rand(num_points)))
height = numpy.array((numpy.random.rand(num_points)-0.5))
X = numpy.array([tt*numpy.cos(tt), 10*height, tt*numpy.sin(tt)])
return X,tt
if curve_type=='scurve':
tt = numpy.array((3*numpy.pi*(numpy.random.rand(num_points)-0.5)))
height = numpy.array((numpy.random.rand(num_points)-0.5))
X = numpy.array([numpy.sin(tt), 10*height, numpy.sign(tt)*(numpy.cos(tt)-1)])
return X,tt
if curve_type=='helix':
tt = numpy.linspace(1, num_points, num_points).T / num_points
tt = tt*2*numpy.pi
X = numpy.r_[[(2+numpy.cos(8*tt))*numpy.cos(tt)],
[(2+numpy.cos(8*tt))*numpy.sin(tt)],
[numpy.sin(8*tt)]]
return X,tt
```

Just to start, lets pick some algorithm and one of the data sets, for example lets see what embedding of the Swissroll is produced by the Isomap algorithm. The Isomap algorithm is basically a slightly modified Multidimensional Scaling (MDS) algorithm which finds embedding as a solution of the following optimization problem:

$$ \min_{x'_1, x'_2, \dots} \sum_i \sum_j \| d'(x'_i, x'_j) - d(x_i, x_j)\|^2, $$with defined $x_1, x_2, \dots \in X~~$ and unknown variables $x_1, x_2, \dots \in X'~~$ while $\text{dim}(X') < \text{dim}(X)~~~$, $d: X \times X \to \mathbb{R}~~$ and $d': X' \times X' \to \mathbb{R}~~$ are defined as arbitrary distance functions (for example Euclidean).

Speaking less math, the MDS algorithm finds an embedding that preserves pairwise distances between points as much as it is possible. The Isomap algorithm changes quite small detail: the distance - instead of using local pairwise relationships it takes global factor into the account with shortest path on the neighborhood graph (so-called geodesic distance). The neighborhood graph is defined as graph with datapoints as nodes and weighted edges (with weight equal to the distance between points). The edge between point $x_i~$ and $x_j~$ exists if and only if $x_j~$ is in $k~$ nearest neighbors of $x_i$. Later we will see that that 'global factor' changes the game for the swissroll dataset.

However, first we prepare a small function to plot any of the original data sets together with its embedding.

In [2]:

```
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
def plot(data, embedded_data, colors='m'):
fig = plt.figure()
fig.set_facecolor('white')
ax = fig.add_subplot(121,projection='3d')
ax.scatter(data[0],data[1],data[2],c=colors,cmap=plt.cm.Spectral)
plt.axis('tight'); plt.axis('off')
ax = fig.add_subplot(122)
ax.scatter(embedded_data[0],embedded_data[1],c=colors,cmap=plt.cm.Spectral)
plt.axis('tight'); plt.axis('off')
plt.show()
```

In [3]:

```
from shogun import Isomap, RealFeatures, MultidimensionalScaling
# wrap data into Shogun features
data, colors = generate_data('swissroll')
features = RealFeatures(data)
# create instance of Isomap converter and configure it
isomap = Isomap()
isomap.set_target_dim(2)
# set the number of neighbours used in kNN search
isomap.set_k(20)
# create instance of Multidimensional Scaling converter and configure it
mds = MultidimensionalScaling()
mds.set_target_dim(2)
# embed Swiss roll data
embedded_data_mds = mds.embed(features).get_feature_matrix()
embedded_data_isomap = isomap.embed(features).get_feature_matrix()
plot(data, embedded_data_mds, colors)
plot(data, embedded_data_isomap, colors)
```

*intrinsic* dimension of the input data. Although the original data is in a three-dimensional space, its intrinsic dimension is lower, since the only degree of freedom are the polar angle and distance from the centre, or height.

Finally, we use yet another method, Stochastic Proximity Embedding (SPE) to embed the helix:

In [4]:

```
from shogun import StochasticProximityEmbedding
# wrap data into Shogun features
data, colors = generate_data('helix')
features = RealFeatures(data)
# create MDS instance
converter = StochasticProximityEmbedding()
converter.set_target_dim(2)
# embed helix data
embedded_features = converter.embed(features)
embedded_data = embedded_features.get_feature_matrix()
plot(data, embedded_data, colors)
```

- Lisitsyn, S., Widmer, C., Iglesias Garcia, F. J. Tapkee: An Efficient Dimension Reduction Library. (Link to paper in JMLR.)
- Tenenbaum, J. B., de Silva, V. and Langford, J. B. A Global Geometric Framework for Nonlinear Dimensionality Reduction. (Link to Isomap's website.)