clarkson_woodruff_transform(input_matrix, sketch_size, seed=None)
Given an input_matrix A of size (n, d), compute a matrix A' of size (sketch_size, d) so that
with high probability via the Clarkson-Woodruff Transform, otherwise known as the CountSketch matrix.
To make the statement
precise, observe the following result which is adapted from the proof of Theorem 14 of via Markov's Inequality. If we have a sketch size sketch_size=k which is at least
Then for any fixed vector x,
with probability at least one minus delta.
This implementation takes advantage of sparsity: computing a sketch takes time proportional to A.nnz. Data A which is in scipy.sparse.csc_matrix format gives the quickest computation time for sparse input.
>>> import numpy as np
>>> from scipy import linalg
>>> from scipy import sparse
>>> rng = np.random.default_rng()
>>> n_rows, n_columns, density, sketch_n_rows = 15000, 100, 0.01, 200
>>> A = sparse.rand(n_rows, n_columns, density=density, format='csc')
>>> B = sparse.rand(n_rows, n_columns, density=density, format='csr')
>>> C = sparse.rand(n_rows, n_columns, density=density, format='coo')
>>> D = rng.standard_normal((n_rows, n_columns))
>>> SA = linalg.clarkson_woodruff_transform(A, sketch_n_rows) # fastest
>>> SB = linalg.clarkson_woodruff_transform(B, sketch_n_rows) # fast
>>> SC = linalg.clarkson_woodruff_transform(C, sketch_n_rows) # slower
>>> SD = linalg.clarkson_woodruff_transform(D, sketch_n_rows) # slowest
That said, this method does perform well on dense inputs, just slower on a relative scale.
Input matrix, of shape (n, d).
Number of rows for the sketch.
If seed is None (or np.random), the numpy.random.RandomState singleton is used. If seed is an int, a new RandomState instance is used, seeded with seed. If seed is already a Generator or RandomState instance then that instance is used.
Sketch of the input matrix A, of size (sketch_size, d).
Applies a Clarkson-Woodruff Transform/sketch to the input matrix.
import numpy as np
from scipy import linalg
n_rows, n_columns = 15000, 100
rng = np.random.default_rng()
A = rng.standard_normal((n_rows, n_columns))
sketch_n_rows = 200
sketch = linalg.clarkson_woodruff_transform(A, sketch_n_rows, seed=rng)
sketch.shape
linalg.norm(A)
linalg.norm(sketch)
b = rng.standard_normal(n_rows)
x = linalg.lstsq(A, b)[0]
Ab = np.hstack((A, b.reshape(-1, 1)))
SAb = linalg.clarkson_woodruff_transform(Ab, sketch_n_rows, seed=rng)
SA, Sb = SAb[:, :-1], SAb[:, -1]
x_sketched = linalg.lstsq(SA, Sb)[0]
linalg.norm(A @ x - b)
linalg.norm(A @ x_sketched - b)
Hover to see nodes names; edges to Self not shown, Caped at 50 nodes.
Using a canvas is more power efficient and can get hundred of nodes ; but does not allow hyperlinks; , arrows or text (beyond on hover)
SVG is more flexible but power hungry; and does not scale well to 50 + nodes.
All aboves nodes referred to, (or are referred from) current nodes; Edges from Self to other have been omitted (or all nodes would be connected to the central node "self" which is not useful). Nodes are colored by the library they belong to, and scaled with the number of references pointing them