Coding categorical data
Patsy allows great flexibility in how categorical data is coded,
via the function C(). C() marks some data as being
categorical (including data which would not automatically be treated
as categorical, such as a column of integers), while also optionally
setting the preferred coding scheme and level ordering.
Let’s get some categorical data to work with:
In [1]: from patsy import dmatrix, demo_data, ContrastMatrix, Poly
In [2]: data = demo_data("a", nlevels=3)
In [3]: data
Out[3]: {'a': ['a1', 'a2', 'a3', 'a1', 'a2', 'a3']}
As you know, simply giving Patsy a categorical variable causes it
to be coded using the default Treatment coding
scheme. (Strings and booleans are treated as categorical by default.)
In [4]: dmatrix("a", data)
Out[4]:
DesignMatrix with shape (6, 3)
  Intercept  a[T.a2]  a[T.a3]
          1        0        0
          1        1        0
          1        0        1
          1        0        0
          1        1        0
          1        0        1
  Terms:
    'Intercept' (column 0)
    'a' (columns 1:3)
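To see what Treatment coding does mechanically, the sketch below rebuilds the rows of the matrix above in plain Python. It is an illustration only, not Patsy's implementation, and the helper name `treatment_rows` is made up for this example. It assumes the first level in sorted order acts as the reference and therefore gets no column of its own.

```python
def treatment_rows(values, levels=None):
    """Return (column_names, rows) for an intercept plus a treatment-coded factor.

    Hypothetical helper for illustration; Patsy does this work internally.
    """
    if levels is None:
        levels = sorted(set(values))
    # levels[0] is the reference level; every other level gets an indicator column.
    others = levels[1:]
    names = ["Intercept"] + ["a[T.%s]" % lvl for lvl in others]
    rows = [[1] + [1 if v == lvl else 0 for lvl in others] for v in values]
    return names, rows


data = {"a": ["a1", "a2", "a3", "a1", "a2", "a3"]}
names, rows = treatment_rows(data["a"])
print(names)
for row in rows:
    print(row)
```

Passing an explicit `levels` list to this helper changes which level is treated as the reference, and with it the meaning of each column.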
We can also alter the level ordering, which is useful for, e.g.,
Diff coding:
In [5]: l = ["a3", "a2", "a1"]
In [6]: dmatrix("C(a, levels=l)", data)
Out[6]:
DesignMatrix with shape (6, 3)
  Intercept  C(a, levels=l)[T.a2]  C(a, levels=l)[T.a1]
          1                     0                     1
          1                     1                     0
          1                     0                     0
          1                     0                     1
          1                     1                     0
          1                     0                     0
  Terms:
    'Intercept' (column 0)
    'C(a, levels=l)' (columns 1:3)
But the default coding is just that – a default. The easiest alternative is to use one of the other built-in coding schemes, like orthogonal polynomial coding:
In [7]: dmatrix("C(a, Poly)", data)
Out[7]:
DesignMatrix with shape (6, 3)
  Intercept  C(a, Poly).Linear  C(a, Poly).Quadratic
          1           -0.70711               0.40825
          1           -0.00000              -0.81650
          1            0.70711               0.40825
          1           -0.70711               0.40825
          1           -0.00000              -0.81650
          1            0.70711               0.40825
  Terms:
    'Intercept' (column 0)
    'C(a, Poly)' (columns 1:3)
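The numbers in the Linear and Quadratic columns can be reproduced by hand. The sketch below orthonormalizes the powers of the level scores 1, 2, 3 against a constant column using Gram-Schmidt in plain Python. It is meant as a check on the output above, assuming equally spaced scores; it is not a description of how Patsy computes the matrix internally, and the function name `orthonormal_poly` is hypothetical.

```python
import math


def orthonormal_poly(n_levels):
    """Orthonormal polynomial contrast columns for equally spaced levels (sketch)."""
    scores = [float(i + 1) for i in range(n_levels)]
    # Start with a normalized constant column so later columns are orthogonal to it.
    basis = [[1.0 / math.sqrt(n_levels)] * n_levels]
    for degree in range(1, n_levels):
        col = [s ** degree for s in scores]
        # Remove the component along every column found so far.
        for b in basis:
            dot = sum(c * x for c, x in zip(col, b))
            col = [c - dot * x for c, x in zip(col, b)]
        norm = math.sqrt(sum(c * c for c in col))
        basis.append([c / norm for c in col])
    # Drop the constant column; what remains are the contrast columns.
    return basis[1:]


linear, quadratic = orthonormal_poly(3)
print([round(x, 5) for x in linear])
print([round(x, 5) for x in quadratic])
```

Up to rounding, the two lists should agree with the values for levels a1, a2, a3 in the design matrix above: the linear column is proportional to (-1, 0, 1) and the quadratic column to (1, -2, 1).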
There are a number of built-in coding schemes; for details you can check the API reference. But we aren’t restricted to those. We can also provide a custom contrast matrix, which allows us to produce all kinds of strange designs:
In [8]: contrast = [[1, 2], [3, 4], [5, 6]]
In [9]: dmatrix("C(a, contrast)", data)
Out[9]:
DesignMatrix with shape (6, 3)
  Intercept  C(a, contrast)[custom0]  C(a, contrast)[custom1]
          1                        1                        2
          1                        3                        4
          1                        5                        6
          1                        1                        2
          1                        3                        4
          1                        5                        6
  Terms:
    'Intercept' (column 0)
    'C(a, contrast)' (columns 1:3)
In [10]: dmatrix("C(a, [[1], [2], [-4]])", data)
Out[10]:
DesignMatrix with shape (6, 2)
  Intercept  C(a, [[1], [2], [-4]])[custom0]
          1                                1
          1                                2
          1                               -4
          1                                1
          1                                2
          1                               -4
  Terms:
    'Intercept' (column 0)
    'C(a, [[1], [2], [-4]])' (column 1)
Hmm, those [custom0], [custom1] names that Patsy
auto-generated for us are a bit ugly looking. We can attach names to
our contrast matrix by creating a ContrastMatrix object, and
make things prettier:
In [11]: contrast_mat = ContrastMatrix(contrast, ["[pretty0]", "[pretty1]"])
In [12]: dmatrix("C(a, contrast_mat)", data)
Out[12]:
DesignMatrix with shape (6, 3)
  Intercept  C(a, contrast_mat)[pretty0]  C(a, contrast_mat)[pretty1]
          1                            1                            2
          1                            3                            4
          1                            5                            6
          1                            1                            2
          1                            3                            4
          1                            5                            6
  Terms:
    'Intercept' (column 0)
    'C(a, contrast_mat)' (columns 1:3)
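Conceptually, a custom contrast matrix has one row per level and one column per output column, and each observation picks up the row belonging to its level. The plain-Python sketch below illustrates that lookup for the `contrast` matrix used above. `apply_contrast` is a hypothetical helper written for this illustration, not part of Patsy.

```python
def apply_contrast(values, levels, contrast, intercept=True):
    """Map each observation to its level's row of the contrast matrix (sketch)."""
    if len(contrast) != len(levels):
        raise ValueError("contrast matrix needs exactly one row per level")
    row_for = dict(zip(levels, contrast))
    lead = [1] if intercept else []
    return [lead + list(row_for[v]) for v in values]


values = ["a1", "a2", "a3", "a1", "a2", "a3"]
levels = ["a1", "a2", "a3"]
contrast = [[1, 2], [3, 4], [5, 6]]
for row in apply_contrast(values, levels, contrast):
    print(row)
```

The same helper handles the single-column matrix `[[1], [2], [-4]]`, which yields one coded column instead of two.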
And, finally, if we want to get really fancy, we can also define our
own “smart” coding schemes like Poly. Just define a class
that has two methods, code_with_intercept() and
code_without_intercept(). They have identical signatures, taking
a list of levels as their argument and returning a
ContrastMatrix. Patsy will automatically choose the
appropriate method to call to produce a full-rank design matrix
without redundancy; see Redundancy and categorical factors for the full details on how
Patsy makes this decision.
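To make the two methods concrete, consider a treatment-style scheme with three levels. The plain-Python sketch below builds the kind of matrix each method would be expected to describe: a full-rank identity with one column per level, and a reduced-rank matrix in which the reference level has an all-zero row and no column of its own. The function names are hypothetical, and the matrices are plain nested lists rather than ContrastMatrix objects.

```python
def full_rank(levels):
    """One indicator column per level (an identity matrix)."""
    n = len(levels)
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def reduced_rank(levels, reference=0):
    """Indicator columns for every level except the reference.

    The reference level's row is all zeros.
    """
    n = len(levels)
    eye = [[1.0 if i == j else 0.0 for j in range(n - 1)] for i in range(n - 1)]
    zero_row = [0.0] * (n - 1)
    # Splice the zero row in at the reference level's position.
    return eye[:reference] + [zero_row] + eye[reference:]


levels = ["a1", "a2", "a3"]
print(full_rank(levels))
print(reduced_rank(levels))
print(reduced_rank(levels, reference=2))
```

The reduced-rank matrix has one column fewer than the full-rank one, which is what keeps the design matrix free of redundancy when another term, such as the intercept, already spans the missing direction.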
As an example, here’s a simplified version of the built-in
Treatment coding object:
import numpy as np

class MyTreat(object):
    def __init__(self, reference=0):
        # Index of the level to use as the reference (baseline).
        self.reference = reference

    def code_with_intercept(self, levels):
        # Full-rank coding: one indicator column per level.
        return ContrastMatrix(np.eye(len(levels)),
                              ["[My.%s]" % (level,) for level in levels])

    def code_without_intercept(self, levels):
        # Reduced-rank coding: the reference level gets an all-zero row
        # and no column of its own.
        eye = np.eye(len(levels) - 1)
        contrasts = np.vstack((eye[:self.reference, :],
                               np.zeros((1, len(levels) - 1)),
                               eye[self.reference:, :]))
        suffixes = ["[MyT.%s]" % (level,) for level in
                    levels[:self.reference] + levels[self.reference + 1:]]
        return ContrastMatrix(contrasts, suffixes)
And it can now be used just like the built-in methods:
# Full rank:
In [13]: dmatrix("0 + C(a, MyTreat)", data)
# Reduced rank:
In [14]: dmatrix("C(a, MyTreat)", data)
# With argument:
In [15]: dmatrix("C(a, MyTreat(2))", data)