Table indexing¶
Once a Table
has been created, it is possible to create indexes on one or
more columns of the table. An index internally sorts the rows of a table based
on the index column(s), allowing for element retrieval by column value and
improved performance for certain table operations.
Warning
The table indexing engine is new and is not yet considered stable. It is recommended to avoid using this engine in production code for now.
Creating an index¶
To create an index on a table, use the add_index()
method:
>>> from astropy.table import Table
>>> t = Table([(2, 3, 2, 1), (8, 7, 6, 5)], names=('a', 'b'))
>>> t.add_index('a')
The optional argument “unique” may be specified to create an index with uniquely valued elements.
To create a composite index on multiple columns, pass a list of columns instead:
>>> t.add_index(['a', 'b'])
In particular, the first index created using the
add_index()
method is considered the default index or the “primary key”. To
retrieve an index from a table, use the indices
property:
>>> t.indices['a']
a rows
--- ----
1 3
2 0
2 2
3 1
>>> t.indices['a', 'b']
a b rows
--- --- ----
1 5 3
2 6 2
2 8 0
3 7 1
Row retrieval using indices¶
Row retrieval can be accomplished using two table properties: loc
and
iloc
. The loc
property can be indexed either by column value, range of
column values (including the bounds), or a list or ndarray of column values:
>>> t = Table([(1, 2, 3, 4), (10, 1, 9, 9)], names=('a', 'b'), dtype=['i8', 'i8'])
>>> t.add_index('a')
>>> t.loc[2]
<Row index=1>
a b
int64 int64
----- -----
2 1
>>> t.loc[[1, 4]]
<Table length=2>
a b
int64 int64
----- -----
1 10
4 9
>>> t.loc[1:3]
<Table length=3>
a b
int64 int64
----- -----
1 10
2 1
3 9
>>> t.loc[:]
<Table length=4>
a b
int64 int64
----- -----
1 10
2 1
3 9
4 9
Note that by default, loc
uses the primary index, which here is column
‘a’. To use a different index, pass the indexed column name before the
retrieval data:
>>> t.add_index('b')
>>> t.loc['b', 8:10]
<Table length=3>
a b
int64 int64
----- -----
3 9
4 9
1 10
The property iloc
works similarly, except that the retrieval information must
be either an int or a slice, and relates to the sorted order of the index
rather than column values. For example:
>>> t.iloc[0] # smallest row by value 'a'
<Row index=0>
a b
int64 int64
----- -----
1 10
>>> t.iloc['b', 1:] # all but smallest value of 'b'
<Table length=3>
a b
int64 int64
----- -----
3 9
4 9
1 10
Effects on performance¶
Table operations change somewhat when indices are present, and there are a number of factors to consider when deciding whether the use of indices will improve performance. In general, indexing offers the following advantages:
- Table grouping and sorting based on indexed column(s) become faster
- Retrieving values by index is faster than custom searching
There are certain caveats, however:
- Creating an index requires time and memory
- Table modifications become slower due to automatic index updates
- Slicing a table becomes slower due to index relabeling
See here for an IPython notebook profiling various aspects of table indexing.
Index modes¶
The index_mode()
method allows for some flexibility in the behavior of table
indexing by allowing the user to enter a specific indexing mode via a context manager. There are
currently three indexing modes: freeze, copy_on_getitem, and
discard_on_copy. The freeze mode prevents automatic index updates whenever
a column of the index is modified, and all indices refresh themselves after the
context ends:
>>> with t.index_mode('freeze'):
... t['a'][0] = 0
... print(t.indices['a']) # unmodified
a rows
--- ----
1 0
2 1
3 2
4 3
>>> print(t.indices['a']) # modified
a rows
--- ----
0 0
2 1
3 2
4 3
The copy_on_getitem mode forces columns to copy and relabel their indices upon slicing. In the absence of this mode, table slices will preserve indices while column slices will not:
>>> ca = t['a'][[1, 3]]
>>> ca.info.indices
[]
>>> with t.index_mode('copy_on_getitem'):
... ca = t['a'][[1, 3]]
... print(ca.info.indices)
[ a rows
--- ----
2 0
4 1]
The discard_on_copy mode prevents indices from being copied whenever a column or table is copied:
>>> t2 = Table(t)
>>> t2.indices['a']
a rows
--- ----
0 0
2 1
3 2
4 3
>>> t2.indices['b']
b rows
--- ----
1 1
9 2
9 3
10 0
>>> with t.index_mode('discard_on_copy'):
... t2 = Table(t)
... print(t2.indices)
[]
Updating row using indices¶
Row updates can be accomplished by assigning the table property: loc
a complete row or a list of rows:
>>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
>>> t.add_index('a')
>>> t.loc['x']
<Row index=1>
a b
str1 int64
---- -----
x 1
>>> t.loc['x'] = ['a', 12]
>>> t
<Table length=4>
a b
str1 int64
---- -----
w 10
a 12
y 9
z 9
>>> t.loc[['w', 'y']]
<Table length=2>
a b
str1 int64
---- -----
w 10
y 9
>>> t.loc[['w', 'z']] = [['b',23], ['c',56]]
>>> t
<Table length=4>
a b
str1 int64
---- -----
b 23
a 12
y 9
c 56
Retrieving the location of rows using indices¶
Retrieval of the location of rows can be accomplished using a table property: loc_indices
.
The loc_indices
property can be indexed either by column value, range of
column values (including the bounds), or a list or ndarray of column values:
>>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
>>> t.add_index('a')
>>> t.loc_indices['x']
1
Engines¶
When creating an index via add_index()
, the keyword argument “engine” may be
specified to use a particular indexing engine. The available engines are
SortedArray
, a sorted array engine using an underlying sorted TableSCEngine
, a sorted list engine using the Sorted Containers packageFastRBT
, a C-based red-black tree engineFastBST
, a C-based binary search tree engineBST
, a Python-based binary search tree engine
Note that FastRBT and FastBST depend on the bintrees dependency; without this
dependency, both classes default to BST
. The SCEngine depends
on the sortedcontainers dependency. For a comparison of
engine performance, see this IPython notebook. Probably
the most important takeaway is that SortedArray
(the default
engine) is usually best, although SCEngine
may be more
appropriate for an index created on an empty column since adding new values is quicker.