This section gives a brief introduction to multidimensional arrays by walking through common matrix operations.
In NumPy, the * operator multiplies arrays element by element. For matrix multiplication, use dot() (or the @ operator, available since Python 3.5) instead. Let's see some examples:
In [66]: c = np.ones((4, 4))
c*c
Out[66]: array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]])
In [67]: c.dot(c)
Out[67]: array([[ 4., 4., 4., 4.],
[ 4., 4., 4., 4.],
[ 4., 4., 4., 4.],
[ 4., 4., 4., 4.]])
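Because every entry of c is 1, the difference between the two operations is easy to miss. The following is a minimal sketch with two small matrices (a and b are illustrative names, not from the examples above) that makes the contrast between element-wise and matrix multiplication more visible:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b        # multiplies matching entries
matrix_product = a.dot(b)  # rows of a times columns of b
operator_product = a @ b   # same result as a.dot(b) for 2D arrays

print(elementwise)      # [[ 5 12], [21 32]]
print(matrix_product)   # [[19 22], [43 50]]
```

For 2D arrays, a @ b and a.dot(b) give the same result, so the choice between them is mostly a matter of readability.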
One of the most important topics in working with multidimensional arrays is stacking, in other words how to merge two arrays. hstack stacks arrays horizontally (column-wise) and vstack stacks them vertically (row-wise). The arrays must agree along the other axis: the same number of columns for vstack and the same number of rows for hstack. You can split arrays again with the hsplit and vsplit functions, which divide along the same axes that hstack and vstack join:
In [68]: y = np.arange(15).reshape(3,5)
x = np.arange(10).reshape(2,5)
new_array = np.vstack((y,x))
new_array
Out[68]: array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9]])
In [69]: y = np.arange(15).reshape(5,3)
x = np.arange(10).reshape(5,2)
new_array = np.hstack((y,x))
new_array
Out[69]: array([[ 0, 1, 2, 0, 1],
[ 3, 4, 5, 2, 3],
[ 6, 7, 8, 4, 5],
[ 9, 10, 11, 6, 7],
[12, 13, 14, 8, 9]])
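The stacking examples can be reversed with vsplit and hsplit. The sketch below rebuilds the same arrays and splits them back at the original boundaries; passing a list of indices (here [3]) is one way to split into unequal parts, and the variable names are illustrative only:

```python
import numpy as np

y = np.arange(15).reshape(3, 5)
x = np.arange(10).reshape(2, 5)
stacked = np.vstack((y, x))             # shape (5, 5)

# Split before row 3 to recover the 3-row and 2-row pieces.
top, bottom = np.vsplit(stacked, [3])

wide = np.hstack((np.arange(15).reshape(5, 3),
                  np.arange(10).reshape(5, 2)))   # shape (5, 5)

# Split before column 3 to recover the 3-column and 2-column pieces.
left, right = np.hsplit(wide, [3])

print(top.shape, bottom.shape)   # (3, 5) (2, 5)
print(left.shape, right.shape)   # (5, 3) (5, 2)
```

Passing an integer instead of a list, for example np.hsplit(arr, 5), splits into that many equal parts and raises an error if the axis length is not divisible by it.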
These functions are very useful in machine learning applications, especially when building datasets. After you stack your arrays, you can check their descriptive statistics with scipy.stats. Imagine you have 100 records, each with 10 features, that is, a 2D matrix with 100 rows and 10 columns. The following example shows how to get descriptive statistics for each feature. Each row of the output corresponds to one feature, and the columns are the mean, variance, skewness, kurtosis, minimum, and maximum (the values differ on every run because the data is random):
In [70]: from scipy import stats
x = np.random.rand(100,10)
n, min_max, mean, var, skew, kurt = stats.describe(x)
new_array = np.vstack((mean,var,skew,kurt,min_max[0],min_max[1]))
new_array.T
Out[70]: array([[ 5.46011575e-01, 8.30007104e-02, -9.72899085e-02,
-1.17492785e+00, 4.07031246e-04, 9.85652100e-01],
[ 4.79292653e-01, 8.13883169e-02, 1.00411352e-01,
-1.15988275e+00, 1.27241020e-02, 9.85985488e-01],
[ 4.81319367e-01, 8.34107619e-02, 5.55926602e-02,
-1.20006450e+00, 7.49534810e-03, 9.86671083e-01],
[ 5.26977277e-01, 9.33829059e-02, -1.12640661e-01,
-1.19955646e+00, 5.74237697e-03, 9.94980830e-01],
[ 5.42622228e-01, 8.92615897e-02, -1.79102183e-01,
-1.13744108e+00, 2.27821933e-03, 9.93861532e-01],
[ 4.84397369e-01, 9.18274523e-02, 2.33663872e-01,
-1.36827574e+00, 1.18986562e-02, 9.96563489e-01],
[ 4.41436165e-01, 9.54357485e-02, 3.48194314e-01,
-1.15588500e+00, 1.77608372e-03, 9.93865324e-01],
[ 5.34834409e-01, 7.61735119e-02, -2.10467450e-01,
-1.01442389e+00, 2.44706226e-02, 9.97784091e-01],
[ 4.90262346e-01, 9.28757119e-02, 1.02682367e-01,
-1.28987137e+00, 2.97705706e-03, 9.98205307e-01],
[ 4.42767478e-01, 7.32159267e-02, 1.74375646e-01,
-9.58660574e-01, 5.52410464e-04, 9.95383732e-01]])
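The result of stats.describe can also be read by field name rather than by tuple position, which makes it harder to mix up the columns. The following is a sketch under two assumptions: a seeded generator (the seed 0 is arbitrary and only there for reproducibility) and np.column_stack in place of vstack followed by a transpose:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
x = rng.random((100, 10))        # 100 records, 10 features

result = stats.describe(x)       # axis=0 by default: one value per feature

# One row per feature; columns are mean, variance, skewness,
# kurtosis, minimum, maximum.
summary = np.column_stack((result.mean, result.variance, result.skewness,
                           result.kurtosis, result.minmax[0], result.minmax[1]))

print(summary.shape)   # (10, 6)
```

Note that describe computes the variance with ddof=1 by default, so it matches x.var(axis=0, ddof=1) rather than NumPy's default x.var(axis=0).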
NumPy has a useful module named numpy.ma for masking array elements. It is handy when you want to mask (ignore) some elements during your calculations. A masked element is treated as invalid and is not taken into account in computations:
In [71]: import numpy.ma as ma
x = np.arange(6)
print(x.mean())
masked_array = ma.masked_array(x, mask=[1,0,0,0,0,0])
masked_array.mean()
2.5
Out[71]: 3.0
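The same mask can be built from a condition instead of a hand-written list. The sketch below uses ma.masked_where on the same array x and masks the same first element; count and filled are shown only to illustrate how a masked array can be inspected:

```python
import numpy as np
import numpy.ma as ma

x = np.arange(6)

# Mask every element that satisfies the condition (here only the first one).
masked = ma.masked_where(x == 0, x)

print(masked.mean())       # 3.0, the mean of the five unmasked values
print(masked.count())      # 5, the number of unmasked elements
print(masked.filled(-1))   # [-1  1  2  3  4  5]
```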
In the preceding code, you have an array x = [0,1,2,3,4,5]. You mask the first element of the array and then calculate the mean. Where the mask is 1 (True), the element at that index is masked, so the mean is taken over the remaining five values and becomes 3.0 instead of 2.5. This approach is also very useful for replacing NaN values:
In [72]: x = np.arange(25, dtype = float).reshape(5,5)
x[x<5] = np.nan
x
Out[72]: array([[ nan, nan, nan, nan, nan],
[ 5., 6., 7., 8., 9.],
[ 10., 11., 12., 13., 14.],
[ 15., 16., 17., 18., 19.],
[ 20., 21., 22., 23., 24.]])
In [73]: np.where(np.isnan(x), ma.array(x, mask=np.isnan(x)).mean(axis=0), x)
Out[73]: array([[ 12.5, 13.5, 14.5, 15.5, 16.5],
[ 5. , 6. , 7. , 8. , 9. ],
[ 10. , 11. , 12. , 13. , 14. ],
[ 15. , 16. , 17. , 18. , 19. ],
[ 20. , 21. , 22. , 23. , 24. ]])
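As a cross-check on the masked-array version, the same column-mean fill can be sketched with np.nanmean, which ignores nan values directly. This is an alternative route to the same result, and col_means and filled are illustrative names:

```python
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)
x[x < 5] = np.nan                       # values below 5 become nan

col_means = np.nanmean(x, axis=0)       # column means, ignoring nan
filled = np.where(np.isnan(x), col_means, x)

print(col_means)   # [12.5 13.5 14.5 15.5 16.5]
```

np.where broadcasts the five column means across the rows, so each nan is replaced by the mean of its own column.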
In the preceding code, we set the first five elements to nan by using a Boolean condition as the index: x[x<5] selects the elements whose values are less than 5, which here are 0, 1, 2, 3, and 4 (the first row). We then overwrite these values with the mean of each column (excluding nan values). There are many other useful array functions that help make your code more concise:
Method | Description
np.concatenate | Join a sequence of arrays along an existing axis
np.repeat | Repeat the elements of an array along a specific axis
np.delete | Return a new array with sub-arrays along an axis deleted
np.insert | Insert values along the given axis before the specified indices
np.unique | Find the unique values in an array
np.tile | Create an array by repeating a given input a given number of times
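The functions in the table can be tried on small arrays. The following is a minimal sketch with one call per function; the arrays a and b and the inserted values are made up for illustration:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6]])

joined = np.concatenate((a, b), axis=0)          # [[1, 2], [3, 4], [5, 6]]
repeated = np.repeat(a, 2, axis=0)               # [[1, 2], [1, 2], [3, 4], [3, 4]]
deleted = np.delete(joined, 1, axis=0)           # [[1, 2], [5, 6]]
inserted = np.insert(a, 1, [9, 9], axis=0)       # [[1, 2], [9, 9], [3, 4]]
uniques = np.unique(np.array([3, 1, 3, 2, 1]))   # [1, 2, 3]
tiled = np.tile(np.array([0, 1]), 3)             # [0, 1, 0, 1, 0, 1]
```

All of these return new arrays and leave their inputs unchanged.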