instance: The instance to run on
feature_group: The optional feature group to run the extractor for.
output_file: Optional file to write the output to.
runsolver_args: The arguments for runsolver. If not present,
will run the extractor without runsolver.
cutoff_time: The maximum runtime.
log_dir: Directory path for logs.
extractor_path: Path to the executable
instance: Path to the instance to run on
feature_group: The feature group to compute. Must be supported by the
extractor.
output_file: Target output file. If None, output is piped to the RunRunner job.
cutoff_time: CPU cutoff time in seconds
log_dir: Directory to write logs. Defaults to CWD.
Returns:
The features, or None if an output file is used or the features cannot be found.
Run the Extractor CLI and write result to the FeatureDataFrame.
Args:
instance_set: The instance set to run the Extractor on.
feature_dataframe: The feature dataframe to write to.
cutoff_time: CPU cutoff time in seconds
feature_group: The feature group to compute. If left empty,
will run on all feature groups.
run_on: The runner to use.
sbatch_options: Additional options to pass to sbatch.
srun_options: Additional options to pass to srun.
parallel_jobs: Number of parallel jobs to run.
slurm_prepend: Slurm script to prepend to the sbatch
dependencies: List of dependencies to add to the job.
log_dir: The directory to write logs to.
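A hypothetical invocation is sketched below; the method name run_cli and the Runner enum are illustrative assumptions, only the argument names come from the docstring above.

>>> from pathlib import Path
>>> extractor.run_cli(                        # hypothetical method name
...     instance_set=instance_set,            # instances to compute features for
...     feature_dataframe=feature_dataframe,  # FeatureDataFrame receiving the results
...     cutoff_time=60,                       # CPU cutoff time in seconds
...     feature_group=None,                   # None -> run all feature groups
...     run_on=Runner.SLURM,                  # assumed runner enum
...     sbatch_options=["--partition=short"],
...     log_dir=Path("Output/Logs"))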
Even when the index of other is the same as the index of the DataFrame,
the Series will not be reoriented. If index-wise alignment is desired,
DataFrame.add() should be used with axis='index'.
>>> s2 = pd.Series([0.5, 1.5], index=['elk', 'moose'])
>>> df[['height', 'weight']] + s2
       elk  height  moose  weight
elk    NaN     NaN    NaN     NaN
moose  NaN     NaN    NaN     NaN
Export the pandas DataFrame as an Arrow C stream PyCapsule.
This relies on pyarrow to convert the pandas DataFrame to the Arrow
format (and follows the default behaviour of pyarrow.Table.from_pandas
in its handling of the index, i.e. store the index as a column except
for RangeIndex).
This conversion is not necessarily zero-copy.
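As a rough illustration of what the export relies on, pyarrow's default pandas conversion can be called directly; the capsule interface itself is normally consumed by libraries rather than user code. A minimal sketch, assuming pyarrow is installed:

>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({"a": [1, 2, 3]})   # default RangeIndex
>>> table = pa.Table.from_pandas(df)      # RangeIndex is not stored as a column
>>> table.column_names
['a']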
t : str, the type of setting error
force : bool, default False
If True, then force showing an error.
Validate that we are doing a setitem on a chained copy.
It is technically possible to figure out that we are setting on
a copy even WITH a multi-dtyped pandas object. In other words, some
blocks may be views while other are not. Currently _is_view will ALWAYS
return False for multi-blocks to avoid having to handle this case.
# This technically need not raise SettingWithCopy if both are views
# (which is not generally guaranteed but is usually True). However,
# this is in general not a good practice and we recommend using .loc.
df.iloc[0:5]['group'] = 'a'
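A minimal sketch of the recommended pattern: a single .loc call replaces the chained assignment above (the column name and values are illustrative).

>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('xxyyz'), 'value': range(5)})
>>> # the chained form df.iloc[0:5]['group'] = 'a' may write to a temporary copy;
>>> # one .loc call both selects and assigns on the original object
>>> df.loc[df.index[0:5], 'group'] = 'a'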
Ensures new columns (which go into the BlockManager as new blocks) are
always copied (or a reference is being tracked to them under CoW)
and converted into an array.
Internal version of the take method that sets the _is_copy
attribute to keep track of the parent dataframe (used in indexing
for the SettingWithCopyWarning).
For Series this does the same as the public take (it never sets _is_copy).
See the docstring of take for full explanation of the parameters.
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Add an extractor and its feature names to the dataframe.
Arguments:
extractor: Name of the extractor
extractor_features: Tuples of [FeatureGroup, FeatureName]
values: Initial values of the Extractor per instance in the dataframe.
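A hypothetical call is shown below; the extractor name, feature group and feature names are illustrative, and the behaviour of values=None (leaving the new cells unset) is an assumption.

>>> feature_dataframe.add_extractor(
...     extractor='ExampleExtractor',              # hypothetical extractor name
...     extractor_features=[('base', 'n_vars'),    # [FeatureGroup, FeatureName] tuples
...                         ('base', 'n_clauses')],
...     values=None)                               # assumed: leave cells at the missing-value marker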
DataFrame.apply : Perform any type of operations.
DataFrame.transform : Perform transformation type operations.
pandas.DataFrame.groupby : Perform operations over groups.
pandas.DataFrame.resample : Perform operations over resampled bins.
pandas.DataFrame.rolling : Perform operations over rolling window.
pandas.DataFrame.expanding : Perform operations over expanding window.
pandas.core.window.ewm.ExponentialMovingWindow : Perform operation over exponential weighted window.
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to
numpy.mean(arr_2d,axis=0).
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected
behavior or errors and are not supported. See gotchas.udf-mutation
for more details.
A passed user-defined-function will be passed a Series for evaluation.
DataFrame.apply : Perform any type of operations.
DataFrame.transform : Perform transformation type operations.
pandas.DataFrame.groupby : Perform operations over groups.
pandas.DataFrame.resample : Perform operations over resampled bins.
pandas.DataFrame.rolling : Perform operations over rolling window.
pandas.DataFrame.expanding : Perform operations over expanding window.
pandas.core.window.ewm.ExponentialMovingWindow : Perform operation over exponential weighted window.
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to
numpy.mean(arr_2d,axis=0).
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected
behavior or errors and are not supported. See gotchas.udf-mutation
for more details.
A passed user-defined-function will be passed a Series for evaluation.
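A small sketch of aggregation over the default (index) axis, using the standard pandas API:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
>>> df.agg(['sum', 'min'])          # one result row per aggregating function
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0
>>> df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})   # per-column functions
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0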
other : DataFrame or Series
join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
Type of alignment to be performed.
left: use only keys from left frame, preserve key order.
right: use only keys from right frame, preserve key order.
outer: use union of keys from both frames, sort keys lexicographically.
inner: use intersection of keys from both frames,
preserve the order of the left keys.
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None).
level : int or level name, default None
Broadcast across a level, matching Index values on the
passed MultiIndex level.
copy : bool, default True
Always returns new objects. If copy=False and no reindexing is
required then original objects are returned.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy-on-write: pd.options.mode.copy_on_write = True
fill_value : scalar, default np.nan
Value to use for missing values. Defaults to NaN, but can be any
“compatible” value.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
Deprecated since version 2.1.
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
Deprecated since version 2.1.
fill_axis : {0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame, default 0
Filling axis, method and limit.
Deprecated since version 2.1.
broadcast_axis : {0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame, default None
Broadcast values along this axis, if aligning two objects of
different dimensions.
>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
     A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900
Align on columns:
>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
     A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN
We can also align on the index:
>>> left, right = df.align(other, join="outer", axis=0)
>>> left
     D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
       A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0
Finally, the default axis=None will align on both index and columns:
>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
axis : {0 or 'index', 1 or 'columns', None}, default 0
Indicate which axis or axes should be reduced. For Series this parameter
is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the
original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the
original index.
None : reduce all axes, return a scalar.
bool_only : bool, default False
Include only boolean columns. Not implemented for Series.
skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is
True, then the result will be True, as for an empty row/column.
If skipna is False, then NA are treated as True, because these are not
equal to zero.
axis : {0 or 'index', 1 or 'columns', None}, default 0
Indicate which axis or axes should be reduced. For Series this parameter
is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the
original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the
original index.
None : reduce all axes, return a scalar.
bool_only : bool, default False
Include only boolean columns. Not implemented for Series.
skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is
True, then the result will be False, as for an empty row/column.
If skipna is False, then NA are treated as True, because these are not
equal to zero.
numpy.any : Numpy version of this method.
Series.any : Return whether any element is True.
Series.all : Return whether all elements are True.
DataFrame.any : Return whether any element is True over requested axis.
DataFrame.all : Return whether all elements are True over requested axis.
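For illustration, a small any/all sketch showing the three axis choices described above:

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df.all()              # reduce the index; one value per column
col1     True
col2    False
dtype: bool
>>> df.all(axis=1)        # reduce the columns; one value per row
0     True
1    False
dtype: bool
>>> df.any(axis=None)     # reduce all axes to a scalar
True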
Objects passed to the function are Series objects whose index is
either the DataFrame’s index (axis=0) or the DataFrame’s columns
(axis=1). By default (result_type=None), the final return type
is inferred from the return type of the applied function. Otherwise,
it depends on the result_type argument.
Determines if row or column is passed as a Series or ndarray object:
False : passes each row or column as a Series to the
function.
True : the passed function will receive ndarray objects
instead.
If you are just applying a NumPy reduction function this will
achieve much better performance.
‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding
list-like results. This is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape
of the DataFrame, the original index and columns will be
retained.
The default behaviour (None) depends on the return value of the
applied function: list-like results will be returned as a Series
of those. However if the apply function returns a Series these
are expanded to columns.
args : tuple
Positional arguments to pass to func in addition to the
array/series.
by_row : False or "compat", default "compat"
Only has an effect when func is a listlike or dictlike of funcs
and the func isn’t a string.
If "compat", will if possible first translate the func into pandas
methods (e.g. Series().apply(np.sum) will be translated to
Series().sum()). If that doesn't work, will try to call apply again with
by_row=True and, if that fails, will call apply again with
by_row=False (backward compatible).
If False, the funcs will be passed the whole Series at once.
Added in version 2.1.0.
engine : {'python', 'numba'}, default 'python'
Choose between the python (default) engine or the numba engine in apply.
The numba engine will attempt to JIT compile the passed function,
which may result in speedups for large DataFrames.
It also supports the following engine_kwargs :
nopython (compile the function in nopython mode)
nogil (release the GIL inside the JIT compiled function)
parallel (try to apply the function in parallel over the DataFrame)
Note: Due to limitations within numba/how pandas interfaces with numba,
you should only use this if raw=True
Note: The numba compiler only supports a subset of
valid Python/numpy operations.
Pass keyword arguments to the engine.
This is currently only used by the numba engine,
see the documentation for the engine argument for more information.
DataFrame.map: For elementwise operations.
DataFrame.aggregate: Only perform aggregating type operations.
DataFrame.transform: Only perform transforming type operations.
Passing result_type='broadcast' will ensure the same shape
result, whether list-like or scalar is returned by the function,
and broadcast it along the axis. The resulting column names will
be the originals.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
   A  B
0  1  2
1  1  2
2  1  2
DataFrame.apply : Apply a function along input axis of DataFrame.
DataFrame.map : Apply a function along input axis of DataFrame.
DataFrame.replace: Replace values given in to_replace with value.
Returns the original data conformed to a new index with the specified
frequency.
If the index of this Series/DataFrame is a PeriodIndex, the new index
is the result of transforming the original index with
PeriodIndex.asfreq (so the original index
will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and
last entries in the original index (see pandas.date_range()). The
values corresponding to any timesteps in the new index which were not present
in the original index will be null (NaN), unless a method for filling
such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of
timesteps (such as an aggregate) is necessary to represent the data at the new
frequency.
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any
NaN is taken.
In case of a DataFrame, the last row without NaN is taken,
considering only the subset of columns (if not None).
If there is no good value, NaN is returned for a Series, or
a Series of NaN values for a DataFrame.
The column names are keywords. If the values are
callable, they are computed on the DataFrame and
assigned to the new columns. The callable must not
change input DataFrame (though pandas doesn’t check it).
If the values are not callable, (e.g. a Series, scalar, or array),
they are simply assigned.
Assigning multiple columns within the same assign is possible.
Later items in ‘**kwargs’ may refer to newly created or modified
columns in ‘df’; items are computed and assigned into ‘df’ in order.
dtype : str, data type, Series or Mapping of column name -> data type
Use a str, numpy.dtype, pandas.ExtensionDtype or Python type to
cast entire pandas object to the same type. Alternatively, use a
mapping, e.g. {col: dtype, …}, where col is a column label and dtype is
a numpy.dtype or Python type to cast one or more of the DataFrame’s
columns to column-specific types.
copy : bool, default True
Return a copy when copy=True (be very careful setting
copy=False as changes to values then may propagate to other
pandas objects).
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy-on-write: pd.options.mode.copy_on_write = True
errors : {'raise', 'ignore'}, default 'raise'
Control raising of exceptions on invalid data for provided dtype.
raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object.
to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to a numeric type.
numpy.ndarray.astype : Cast a numpy array to a specified type.
Changed in version 2.0.0: Using astype to convert from timezone-naive dtype to
timezone-aware dtype will raise an exception.
Use Series.dt.tz_localize() instead.
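A brief astype sketch, casting the whole frame or a single column via a mapping:

>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})   # both int64
>>> df.astype('int32').dtypes                             # cast everything
col1    int32
col2    int32
dtype: object
>>> df.astype({'col1': 'int32'}).dtypes                   # cast one column only
col1    int32
col2    int64
dtype: object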
between_time : Select values between particular times of the day.
first : Select initial periods of time series based on a date offset.
last : Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time : Get just the index locations for values at a particular time of the day.
at_time : Select values at a particular time of the day.
first : Select initial periods of time series based on a date offset.
last : Select final periods of time series based on a date offset.
DatetimeIndex.indexer_between_time : Get just the index locations for values between particular times of the day.
axis : {0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame
Axis along which to fill missing values. For Series
this parameter is unused and defaults to 0.
inplace : bool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
limit_area : {None, 'inside', 'outside'}, default None
If limit is specified, consecutive NaNs will be filled with this
restriction.
None: No fill restriction.
'inside': Only fill NaNs surrounded by valid values
(interpolate).
'outside': Only fill NaNs outside valid values (extrapolate).
Added in version 2.2.0.
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
Return the bool of a single element Series or DataFrame.
Deprecated since version 2.1.0: bool is deprecated and will be removed in a future version of pandas.
For Series use pandas.Series.item.
This must be a boolean scalar value, either True or False. It will raise a
ValueError if the Series or DataFrame does not have exactly 1 element, or that
element is not boolean (integer values 0 and 1 will also raise an exception).
Series.astype : Change the data type of a Series, including to boolean.
DataFrame.astype : Change the data type of a DataFrame, including to boolean.
numpy.bool_ : NumPy boolean data type, used by pandas for boolean values.
Make a box-and-whisker plot from DataFrame columns, optionally grouped
by some other columns. A box plot is a method for graphically depicting
groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data,
with a line at the median (Q2). The whiskers extend from the edges
of box to show the range of the data. By default, they extend no more than
1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest
data point within that interval. Outliers are plotted as separate dots.
For further details see
Wikipedia’s entry for boxplot.
Column name or list of names, or vector.
Can be any valid input to pandas.DataFrame.groupby().
by : str or array-like, optional
Column in the DataFrame to pandas.DataFrame.groupby().
One box-plot will be done per value of columns in by.
ax : object of class matplotlib.axes.Axes, optional
The matplotlib axes to be used by boxplot.
fontsize : float or str
Tick label font size in points or as a string (e.g., large).
rot : float, default 0
The rotation angle of labels (in degrees)
with respect to the screen coordinate system.
grid : bool, default True
Setting this to True will show the grid.
figsize : A tuple (width, height) in inches
The size of the figure to create in matplotlib.
layout : tuple (rows, columns), optional
For example, (3, 5) will display the subplots
using 3 rows and 5 columns, starting from the top-left.
return_type : {'axes', 'dict', 'both'} or None, default 'axes'
The kind of object to return. The default is axes.
‘axes’ returns the matplotlib axes the boxplot is drawn on.
‘dict’ returns a dictionary whose values are the matplotlib
Lines of the boxplot.
‘both’ returns a namedtuple with the axes and dict.
when grouping with by, a Series mapping columns to
return_type is returned.
If return_type is None, a NumPy array
of axes with the same shape as layout is returned.
backend : str, default None
Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set
pd.options.plotting.backend.
The return type depends on the return_type parameter:
‘axes’ : object of class matplotlib.axes.Axes
‘dict’ : dict of matplotlib.lines.Line2D objects
‘both’ : a namedtuple with structure (ax, lines)
For data grouped with by, return a Series of the above or a numpy
array:
Series
array (for return_type=None)
Use return_type='dict' when you want to tweak the appearance
of the lines after plotting. In this case a dict containing the Lines
making up the boxes, caps, fliers, medians, and whiskers is returned.
Boxplots can be created for every column in the dataframe
by df.boxplot() or indicating the columns to be used:
Boxplots of variables distributions grouped by the values of a third
variable can be created using the option by. For instance:
A list of strings (i.e. ['X','Y']) can be passed to boxplot
in order to group the data by combination of the variables in the x-axis:
The layout of boxplot can be adjusted giving a tuple to layout:
Additional formatting can be done to the boxplot, like suppressing the grid
(grid=False), rotating the labels in the x-axis (i.e. rot=45)
or changing the fontsize (i.e. fontsize=15):
The parameter return_type can be used to select the type of element
returned by boxplot. When return_type='axes' is selected,
the matplotlib axes on which the boxplot is drawn are returned:
Assigns values outside boundary to boundary values. Thresholds
can be singular values or array like, and in the latter case
the clipping is performed element-wise in the specified axis.
Series.clip : Trim values at input threshold in series.
DataFrame.clip : Trim values at input threshold in dataframe.
numpy.clip : Clip (limit) the values in an array.
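A small clip sketch with scalar and per-row thresholds (data taken from the standard pandas example):

>>> import pandas as pd
>>> df = pd.DataFrame({'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]})
>>> df.clip(-4, 6)                 # every element limited to the interval [-4, 6]
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4
>>> t = pd.Series([2, -4, -1, 6, 3])
>>> bounded = df.clip(t, t + 4, axis=0)   # element-wise bounds aligned along the rows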
Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func
to element-wise combine columns. The row and column indexes of the
resulting DataFrame will be the union of the two.
Example using a true element-wise combine function.
>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3
Using fill_value fills Nones prior to passing the column to the
merge function.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0
However, if the same element in both dataframes is None, that None
is preserved
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  3.0
Example that demonstrates the use of overwrite and behavior when
the axis differ between the dataframes.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1]}, index=[1, 2])
>>> df1.combine(df2, take_smaller)
    A    B     C
0 NaN  NaN   NaN
1 NaN  3.0 -10.0
2 NaN  3.0   1.0
>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0
Demonstrating the preference of the passed in dataframe.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df2.combine(df1, take_smaller)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B    C
0  0.0  NaN  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two. The resulting
dataframe contains the ‘first’ dataframe values and overrides the
second one values where both first.loc[index, col] and
second.loc[index, col] are not missing values, upon calling
first.combine_first(second).
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0
Null values still persist if the location of that null value
does not exist in other
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0
>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0
Align the differences on columns
>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
Assign result_names
>>> df.compare(df2, result_names=("left", "right"))
  col1        col3
  left right  left right
0    a     c   NaN   NaN
2  NaN   NaN   3.0   4.0
Stack the differences on rows
>>> df.compare(df2, align_axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0
Keep the equal values
>>> df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0
Keep all original rows and columns
>>> df.compare(df2, keep_shape=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN
Keep all original rows and columns and also all original values
>>> df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
infer_objects : bool, default True
Whether object dtypes should be converted to the best possible types.
convert_string : bool, default True
Whether object dtypes should be converted to StringDtype().
convert_integer : bool, default True
Whether, if possible, conversion can be done to integer extension types.
convert_boolean : bool, default True
Whether object dtypes should be converted to BooleanDtypes().
convert_floating : bool, default True
Whether, if possible, conversion can be done to floating extension types.
If convert_integer is also True, preference will be given to integer
dtypes if the floats can be faithfully casted to integers.
By default, convert_dtypes will attempt to convert a Series (or each
Series in a DataFrame) to dtypes that support pd.NA. By using the options
convert_string, convert_integer, convert_boolean and
convert_floating, it is possible to turn off individual conversions
to StringDtype, the integer extension types, BooleanDtype
or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference
rules as during normal Series/DataFrame construction. Then, if possible,
convert to StringDtype, BooleanDtype or an appropriate integer
or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an
appropriate integer extension type. Otherwise, convert to an
appropriate floating extension type.
In the future, as new dtypes are added that support pd.NA, the results
of this method will change to support those new dtypes.
When deep=True (default), a new object will be created with a
copy of the calling object’s data and indices. Modifications to
the data or indices of the copy will not be reflected in the
original object (see notes below).
When deep=False, a new object will be created without copying
the calling object’s data or index (only references to the data
and index are copied). Any changes to the data of the original
will be reflected in the shallow copy (and vice versa).
Note
The deep=False behaviour as described above will change
in pandas 3.0. Copy-on-Write
will be enabled by default, which means that the “shallow” copy
that is returned with deep=False will still avoid making
an eager copy, but changes to the data of the original will no
longer be reflected in the shallow copy (or vice versa). Instead,
it makes use of a lazy (deferred) copy mechanism that will copy
the data only when any changes to the original or shallow copy are
made.
You can already get the future behavior and improvements through
enabling copy-on-write: pd.options.mode.copy_on_write = True
When deep=True, data is copied but actual Python objects
will not be copied recursively, only the reference to the object.
This is in contrast to copy.deepcopy in the Standard Library,
which recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying
numpy array is not copied for performance reasons. Since Index is
immutable, the underlying data can be safely shared and a copy
is not needed.
Since pandas is not thread safe, see the
gotchas when copying in a threading
environment.
When copy_on_write in pandas config is set to True, the
copy_on_write config takes effect even when deep=False.
This means that any changes to the copied data would make a new copy
of the data upon write (and vice versa). Changes made to either the
original or copied variable would not be reflected in the counterpart.
See Copy_on_Write for more information.
Updates to the data shared by shallow copy and original are reflected
in both (NOTE: this will no longer be true for pandas >= 3.0);
deep copy remains unchanged.
Note that when copying an object containing Python objects, a deep copy
will copy the data, but will not do so recursively. Updating a nested
data object will be reflected in the deep copy.
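A compact copy sketch; the exact behaviour of the shallow copy depends on the Copy-on-Write setting discussed above (output shown for the pre-3.0 default, CoW disabled):

>>> import pandas as pd
>>> s = pd.Series([1, 2], index=['a', 'b'])
>>> deep = s.copy()                 # deep=True is the default
>>> shallow = s.copy(deep=False)    # shares data and index with s
>>> s.iloc[0] = 99
>>> int(shallow['a']), int(deep['a'])   # shallow sees the change, deep does not
(99, 1)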
method : {'pearson', 'kendall', 'spearman'} or callable
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays
and returning a float. Note that the returned matrix from corr
will have 1 along the diagonals and will be symmetric
regardless of the callable’s behavior.
min_periods : int, optional
Minimum number of observations required per pair of columns
to have a valid result. Currently only available for Pearson
and Spearman correlation.
numeric_only : bool, default False
Include only float, int or boolean data.
Added in version 1.5.0.
Changed in version 2.0.0: The default value of numeric_only is now False.
Pairwise correlation is computed between rows or columns of
DataFrame with rows or columns of Series or DataFrame. DataFrames
are first aligned along both axes before computing the
correlations.
Series.count: Number of non-NA elements in a Series.
DataFrame.value_counts: Count unique combinations of columns.
DataFrame.shape: Number of DataFrame rows and columns (including NA
elements).
DataFrame.isna: Boolean same-sized DataFrame showing places of NA elements.
>>> df=pd.DataFrame({"Person":... ["John","Myla","Lewis","John","Myla"],... "Age":[24.,np.nan,21.,33,26],... "Single":[False,True,True,True,False]})>>> df Person Age Single0 John 24.0 False1 Myla NaN True2 Lewis 21.0 True3 John 33.0 True4 Myla 26.0 False
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame.
The returned data frame is the covariance matrix of the columns
of the DataFrame.
Both NA and null values are automatically excluded from the
calculation. (See the note below about bias from missing values.)
A threshold can be set for the minimum number of
observations for each value created. Comparisons with observations
below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to
understand the relationship between different measures
across time.
Minimum number of observations required per pair of columns
to have a valid result.
ddof : int, default 1
Delta degrees of freedom. The divisor used in calculations
is N-ddof, where N represents the number of elements.
This argument is applicable only when no NaN is in the dataframe.
numeric_only : bool, default False
Include only float, int or boolean data.
Added in version 1.5.0.
Changed in version 2.0.0: The default value of numeric_only is now False.
Returns the covariance matrix of the DataFrame’s time series.
The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that
data is missing at random)
the returned covariance matrix will be an unbiased estimate
of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable
because the estimate covariance matrix is not guaranteed to be positive
semi-definite. This could lead to estimate correlations having
absolute values which are greater than one, and/or a non-invertible
covariance matrix. See Estimation of covariance matrices for more details.
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795
Minimum number of periods
This method also supports an optional min_periods keyword
that specifies the required minimum number of non-NA observations for
each column pair in order to have a valid result:
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
Descriptive statistics include those that summarize the central
tendency, dispersion and shape of a
dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well
as DataFrame column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.
The percentiles to include in the output. All should
fall between 0 and 1. The default is
[.25,.5,.75], which returns the 25th, 50th, and
75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored
for Series. Here are the options:
‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the
provided data types.
To limit the result to numeric types submit
numpy.number. To limit it instead to object columns submit
the numpy.object data type. Strings
can also be used in the style of
select_dtypes (e.g. df.describe(include=['O'])). To
select pandas categorical columns, use 'category'
None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional
A black list of data types to omit from the result. Ignored
for Series. Here are the options:
A list-like of dtypes : Excludes the provided data types
from the result. To exclude numeric types submit
numpy.number. To exclude object columns submit the data
type numpy.object. Strings can also be used in the style of
select_dtypes (e.g. df.describe(exclude=['O'])). To
exclude pandas categorical columns, use 'category'
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype.
For numeric data, the result’s index will include count,
mean, std, min, max as well as lower, 50 and
upper percentiles. By default the lower percentile is 25 and the
upper percentile is 75. The 50 percentile is the
same as the median.
For object data (e.g. strings or timestamps), the result’s index
will include count, unique, top, and freq. The top
is the most common value. The freq is the most common value’s
frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the
count and top results will be arbitrarily chosen from
among those with the highest count.
For mixed data types provided via a DataFrame, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If include='all' is provided as an option, the result
will include a union of attributes of each type.
The include and exclude parameters can be used to limit
which columns in a DataFrame are analyzed for the output.
The parameters are ignored when analyzing a Series.
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as
an attribute.
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
For boolean dtypes, this uses operator.xor() rather than
operator.sub().
The result is calculated according to current dtype in DataFrame,
however dtype of the result is always float64.
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
If other is a Series, return the matrix product between self and
other as a Series. If other is a DataFrame or a numpy.array, return
the matrix product of self and other as a DataFrame or a numpy array.
The dimensions of DataFrame and other must be compatible in order to
compute the matrix multiplication. In addition, the column names of
DataFrame and the index of other must contain the same values, as they
will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the
matrix product here.
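A short dot sketch (the standard pandas example): with a Series the result is the row-wise inner product, and with a DataFrame the column labels of the left operand must match the index of the right.

>>> import pandas as pd
>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0   -4
1    5
dtype: int64
>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)                # df's column labels align with other's index
   0  1
0  1  4
1  2  2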
Remove rows or columns by specifying label names and corresponding
axis, or by directly specifying index or column names. When using a
multi-index, labels on different levels can be removed by specifying
the level. See the user guide
for more information about the now unused levels.
Drop a specific index combination from the MultiIndex
DataFrame, i.e., drop the combination 'falcon' and
'weight', which deletes only the corresponding row
>>> df=pd.DataFrame({"name":['Alfred','Batman','Catwoman'],... "toy":[np.nan,'Batmobile','Bullwhip'],... "born":[pd.NaT,pd.Timestamp("1940-04-25"),... pd.NaT]})>>> df name toy born0 Alfred NaN NaT1 Batman Batmobile 1940-04-252 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25
Drop the columns where at least one element is missing.
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against
each other to see if they have the same shape and elements. NaNs in
the same location are considered equal.
The row/column index do not need to have the same type, as long
as the values are considered equal. Corresponding columns and
index must be of the same dtype.
DataFrames df and different_column_type have the same element
types and values, but have different types for the column labels,
which will still return True.
DataFrames df and different_data_type have different types for the
same values for their elements, and will return False even though
their column labels are the same values and types.
Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows
eval to run arbitrary code, which can make you vulnerable to code
injection if you pass user input to this function.
If the expression contains an assignment, whether to perform the
operation inplace and mutate the existing DataFrame. Otherwise,
a new DataFrame is returned.
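A minimal eval sketch, showing a plain column expression and an assignment (with and without inplace):

>>> import pandas as pd
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df.eval('A + B')                      # element-wise expression over the columns
0    11
1    10
2     9
3     8
4     7
dtype: int64
>>> df2 = df.eval('C = A + B')            # assignment returns a new frame with column C
>>> df.eval('C = A + B', inplace=True)    # ... or mutates df itself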
Exactly one of com, span, halflife, or alpha must be
provided if times is not provided. If times is provided,
halflife and one of com, span or alpha may be provided.
halflife : float, str, timedelta, optional
If times is specified, a timedelta convertible unit over which an
observation decays to half its value. Only applicable to mean(),
and halflife value will not apply to the other functions.
alpha : float, optional
Specify smoothing factor \(\alpha\) directly
\(0 < \alpha \leq 1\).
min_periods : int, default 0
Minimum number of observations in window required to have a value;
otherwise, result is np.nan.
adjust : bool, default True
Divide by decaying adjustment factor in beginning periods to account
for imbalance in relative weightings (viewing EWMA as a moving average).
When adjust=True (default), the EW function is calculated using weights
\(w_i = (1 - \alpha)^i\). For example, the EW moving average of the series
[\(x_0, x_1, ..., x_t\)] would be:
\(y_t = \frac{x_t + (1 - \alpha) x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}\)
When ignore_na=False (default), weights are based on absolute positions.
For example, the weights of \(x_0\) and \(x_2\) used in calculating
the final weighted average of [\(x_0\), None, \(x_2\)] are
\((1-\alpha)^2\) and \(1\) if adjust=True, and
\((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_na=True, weights are based
on relative positions. For example, the weights of \(x_0\) and \(x_2\)
used in calculating the final weighted average of
[\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if
adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
axis : {0, 1}, default 0
If 0 or 'index', calculate across the rows.
If 1 or 'columns', calculate across the columns.
For Series this parameter is unused and defaults to 0.
times : np.ndarray, Series, default None
Only applicable to mean().
Times corresponding to the observations. Must be monotonically increasing and
datetime64[ns] dtype.
If 1-D array like, a sequence with the same shape as the observations.
method : str {'single', 'table'}, default 'single'
Added in version 1.4.0.
Execute the rolling operation per single column or row ('single')
or over the entire object ('table').
This argument is only implemented when specifying engine='numba'
in the method call.
Column(s) to explode.
For multiple columns, specify a non-empty list in which each element
is a str or tuple; the list-like data in all specified columns must
have matching lengths on the same row of the frame.
Added in version 1.3.0: Multi-column explode
ignore_index : bool, default False
If True, the resulting index will be labeled 0, 1, …, n - 1.
This routine will explode list-likes including lists, tuples, sets,
Series, and np.ndarray. The result dtype of the subset rows will
be object. Scalars will be returned unchanged, and empty list-likes will
result in a np.nan for that row. In addition, the ordering of rows in the
output will be non-deterministic when exploding sets.
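An explode sketch (the standard pandas example), showing list-likes expanded to rows, scalars left unchanged, and an empty list becoming NaN:

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]], 'B': 1})
>>> df.explode('A')
     A  B
0    0  1
0    1  1
0    2  1
1  foo  1
2  NaN  1
3    3  1
3    4  1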
axis : {0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame
Axis along which to fill missing values. For Series
this parameter is unused and defaults to 0.
inplace : bool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
limit_area : {None, 'inside', 'outside'}, default None
If limit is specified, consecutive NaNs will be filled with this
restriction.
None: No fill restriction.
'inside': Only fill NaNs surrounded by valid values
(interpolate).
'outside': Only fill NaNs outside valid values (extrapolate).
Added in version 2.2.0.
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0
>>> df.ffill()
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). Values not
in the dict/Series/DataFrame will not be filled. This value cannot
be a list.
Method to use for filling holes in reindexed Series:
ffill: propagate last valid observation forward to next valid.
backfill / bfill: use next valid observation to fill gap.
Deprecated since version 2.1.0: Use ffill or bfill instead.
axis : {0 or 'index'} for Series, {0 or 'index', 1 or 'columns'} for DataFrame
Axis along which to fill missing values. For Series
this parameter is unused and defaults to 0.
inplace : bool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
ffill : Fill values by propagating the last valid observation to next valid.
bfill : Fill values by using the next valid observation to fill the gap.
interpolate : Fill NaN values using interpolation.
reindex : Conform object to new index.
asfreq : Convert TimeSeries to specified frequency.
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0
Replace all NaN elements with 0s.
>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1,
2, and 3 respectively.
>>> values={"A":0,"B":1,"C":2,"D":3}>>> df.fillna(value=values) A B C D0 0.0 2.0 2.0 0.01 3.0 4.0 2.0 1.02 0.0 1.0 2.0 3.03 0.0 3.0 2.0 4.0
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0
When filling using a DataFrame, replacement happens along
the same column names and same indices
>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0
Note that column D is not affected since it is not present in df2.
Keep labels from axis for which “like in label == True”.
regex : str (regular expression)
Keep labels from axis for which re.search(regex, label) == True.
axis : {0 or 'index', 1 or 'columns', None}, default None
The axis to filter on, expressed either as an index (int)
or axis name (str). By default this is the info axis, ‘columns’ for
DataFrame. For Series this parameter is unused and defaults to None.
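A small filter sketch showing items, regex and like selection (data from the standard pandas example):

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df.filter(items=['one', 'three'])      # select columns by exact name
        one  three
mouse     1      3
rabbit    4      6
>>> df.filter(regex='e$', axis=1)          # columns whose name ends in 'e'
        one  three
mouse     1      3
rabbit    4      6
>>> df.filter(like='bbi', axis=0)          # rows whose label contains 'bbi'
        one  two  three
rabbit    4    5      6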
last : Select final periods of time series based on a date offset.
at_time : Select values at a particular time of the day.
between_time : Select values between particular times of the day.
Notice that the data for the first 3 calendar days were returned, not the
first 3 days observed in the dataset, and therefore data for 2018-04-13 was
not returned.
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the
object, applying a function, and combining the results. This can be
used to group large amounts of data and compute operations on these
groups.
by : mapping, function, label, pd.Grouper or list of such
Used to determine the groups for the groupby.
If by is a function, it’s called on each value of the object’s
index. If a dict or Series is passed, the Series or dict VALUES
will be used to determine the groups (the Series’ values are first
aligned; see .align() method). If a list or ndarray of length
equal to the selected axis is passed (see the groupby user guide),
the values are used as-is to determine the groups. A label or list
of labels may be passed to group by the columns in self.
Notice that a tuple is interpreted as a (single) key.
axis : {0 or 'index', 1 or 'columns'}, default 0
Split along rows (0) or columns (1). For Series this parameter
is unused and defaults to 0.
Deprecated since version 2.1.0: Will be removed and behave like axis=0 in a future version.
For axis=1, do frame.T.groupby(...) instead.
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular
level or levels. Do not specify both by and level.
as_index : bool, default True
Return object with group labels as the
index. Only relevant for DataFrame input. as_index=False is
effectively “SQL-style” grouped output. This argument has no effect
on filtrations (see the filtrations in the user guide),
such as head(), tail(), nth() and in transformations
(see the transformations in the user guide).
sort : bool, default True
Sort group keys. Get better performance by turning this off.
Note this does not influence the order of observations within each
group. Groupby preserves the order of rows within each group. If False,
the groups will appear in the same order as they did in the original DataFrame.
This argument has no effect on filtrations (see the filtrations in the user guide),
such as head(), tail(), nth() and in transformations
(see the transformations in the user guide).
Changed in version 2.0.0: Specifying sort=False with an ordered categorical grouper will no
longer sort the values.
group_keys : bool, default True
When calling apply and the by argument produces a like-indexed
(i.e. a transform) result, add group keys to
index to identify pieces. By default group keys are not included
when the result’s index (and column) labels match the inputs, and
are included otherwise.
Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the
result from apply is a like-indexed Series or DataFrame.
Specify group_keys explicitly to include the group keys or
not.
Changed in version 2.0.0: group_keys now defaults to True.
observed : bool, default False
This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
Deprecated since version 2.1.0: The default value will change to True in a future version of pandas.
dropna : bool, default True
If True, and if group keys contain NA values, NA values together
with row/column will be dropped.
If False, NA values will also be treated as the key in groups.
See the user guide for more
detailed usage and examples, including splitting an object into groups,
iterating through groups, selecting a group, aggregation, and more.
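A minimal groupby sketch (the standard pandas example), grouping by a column label and aggregating:

>>> import pandas as pd
>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby('Animal').mean()            # group labels become the index
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
>>> df.groupby('Animal', as_index=False).mean()   # "SQL-style" flat output
   Animal  Max Speed
0  Falcon      375.0
1  Parrot       25.0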
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
This function returns the first n rows for the object based
on position. It is useful for quickly testing if your object
has the right type of data in it.
For negative values of n, this function returns all rows except
the last |n| rows, equivalent to df[:n].
If n is larger than the number of rows, this function returns all rows.
A histogram is a representation of the distribution of data.
This function calls matplotlib.pyplot.hist(), on each series in
the DataFrame, resulting in one histogram per column.
If passed, will be used to limit data to a subset of columns.
by : object, optional
If passed, then used to form histograms for separate groups.
grid : bool, default True
Whether to show axis grid lines.
xlabelsize : int, default None
If specified changes the x-axis label size.
xrot : float, default None
Rotation of x axis labels. For example, a value of 90 displays the
x labels rotated 90 degrees clockwise.
ylabelsize : int, default None
If specified changes the y-axis label size.
yrot : float, default None
Rotation of y axis labels. For example, a value of 90 displays the
y labels rotated 90 degrees clockwise.
ax : Matplotlib axes object, default None
The axes to plot the histogram on.
sharex : bool, default True if ax is None else False
In case subplots=True, share x axis and set some x axis labels to
invisible; defaults to True if ax is None otherwise False if an ax
is passed in.
Note that passing in both an ax and sharex=True will alter all x axis
labels for all subplots in a figure.
sharey : bool, default False
In case subplots=True, share y axis and set some y axis labels to
invisible.
figsize : tuple, optional
The size in inches of the figure to create. Uses the value in
matplotlib.rcParams by default.
layout : tuple, optional
Tuple of (rows, columns) for the layout of the histograms.
bins : int or sequence, default 10
Number of histogram bins to be used. If an integer is given, bins + 1
bin edges are calculated and returned. If bins is a sequence, gives
bin edges, including left edge of first bin and right edge of last
bin. In this case, bins is returned unmodified.
backend : str, default None
Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set
pd.options.plotting.backend.
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped
columns, leaving non-object and unconvertible
columns unchanged. The inference rules are the
same as during normal Series/DataFrame construction.
Whether to make a copy for non-object or non-inferable columns
or Series.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to numeric type.
convert_dtypes : Convert argument to best possible dtype.
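A minimal sketch of the soft conversion described above (not from the original docstring), assuming pandas is imported as pd:
>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
>>> df = df.iloc[1:]          # the remaining values are integers stored as object
>>> df.dtypes
A    object
dtype: object
>>> df.infer_objects().dtypes
A    int64
dtype: object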
Whether to print the full summary. By default, the setting in
pandas.options.display.max_info_columns is followed.
buf : writable buffer, defaults to sys.stdout
Where to send the output. By default, the output is printed to
sys.stdout. Pass a writable buffer if you need to further process
the output.
max_cols : int, optional
When to switch from the verbose to the truncated output. If the
DataFrame has more than max_cols columns, the truncated output
is used. By default, the setting in
pandas.options.display.max_info_columns is used.
memory_usage : bool, str, optional
Specifies whether total memory usage of the DataFrame
elements (including the index) should be displayed. By default,
this follows the pandas.options.display.memory_usage setting.
True always shows memory usage. False never shows memory usage.
A value of ‘deep’ is equivalent to “True with deep introspection”.
Memory usage is shown in human-readable units (base-2
representation). Without deep introspection a memory estimation is
made based on column dtype and number of rows assuming values
consume the same memory amount for corresponding dtypes. With deep
memory introspection, a real memory usage calculation is performed
at the cost of computational resources. See the
Frequently Asked Questions for more
details.
show_counts : bool, optional
Whether to show the non-null counts. By default, this is shown
only if the DataFrame is smaller than
pandas.options.display.max_info_rows and
pandas.options.display.max_info_columns. A value of True always
shows the counts, and False never shows the counts.
‘linear’: Ignore the index and treat the values as equally
spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate
given length of interval.
‘index’, ‘values’: use the actual numerical values of the index.
‘pad’: Fill in NaNs using existing values.
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’,
‘barycentric’, ‘polynomial’: Passed to
scipy.interpolate.interp1d, whereas ‘spline’ is passed to
scipy.interpolate.UnivariateSpline. These methods use the numerical
values of the index. Both ‘polynomial’ and ‘spline’ require that
you also specify an order (int), e.g.
df.interpolate(method='polynomial', order=5). Note that the
‘slinear’ method in pandas refers to the SciPy first-order spline
rather than the pandas first-order spline.
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’,
‘cubicspline’: Wrappers around the SciPy interpolation methods of
similar names. See Notes.
‘from_derivatives’: Refers to
scipy.interpolate.BPoly.from_derivatives.
axis : {0 or ‘index’, 1 or ‘columns’, None}, default None
Axis to interpolate along. For Series this parameter is unused
and defaults to 0.
limit : int, optional
Maximum number of consecutive NaNs to fill. Must be greater than
0.
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’
methods are wrappers around the respective SciPy implementations of
similar names. These use the actual numerical values of the index.
For more information on their behavior, see the
SciPy documentation.
Filling in NaN in a Series via polynomial interpolation or splines:
Both ‘polynomial’ and ‘spline’ methods require that you also specify
an order (int).
Fill the DataFrame forward (that is, going down) along each column
using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently,
because there is no entry after it to use for interpolation.
Note how the first entry in column ‘b’ remains NaN, because there
is no entry before it to use for interpolation.
>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0
frame.isetitem(loc,value) is an in-place method as it will
modify the DataFrame in place (not returning a new object). In contrast to
frame.iloc[:,i]=value which will try to update the existing values in
place, frame.isetitem(loc,value) will not update the values of the column
itself in place, it will instead insert a new array.
In cases where frame.columns is unique, this is equivalent to
frame[frame.columns[i]]=value.
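A small illustrative sketch (not from the original docstring): isetitem inserts a new array at the given column position rather than updating the existing values in place.
>>> df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
>>> df.isetitem(1, [30, 40])   # replace the column at position 1; returns None
>>> df
   A   B
0  1  30
1  2  40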
The result will only be true at a location if all the
labels match. If values is a Series, that’s the index. If
values is a dict, the keys must be the column names,
which must match. If values is a DataFrame,
then both the index and column labels must match.
DataFrame.eq: Equality test for DataFrame.
Series.isin: Equivalent method on Series.
Series.str.contains : Test if pattern or regex is contained within a string of a Series or Index.
Return a boolean same-sized object indicating if the values are NA.
NA values, such as None or numpy.NaN, get mapped to True
values.
Everything else gets mapped to False values. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Because iterrows returns a Series for each row,
it does not preserve dtypes across the rows (dtypes are
preserved across columns for DataFrames).
To preserve dtypes while iterating over the rows, it is better
to use itertuples() which returns namedtuples of the values
and which is generally faster than iterrows.
You should never modify something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
An object to iterate over namedtuples for each row in the
DataFrame with the first field possibly being the index and
following fields being the column values.
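For illustration (not from the original docstring), iterating with itertuples, which preserves dtypes and is generally faster than iterrows:
>>> df = pd.DataFrame({"num_legs": [4, 2], "num_wings": [0, 2]},
...                   index=["dog", "hawk"])
>>> for row in df.itertuples():
...     print(row)
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)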
other : DataFrame, Series, or a list containing any combination of them
Index should be similar to one of the columns in this one. If a
Series is passed, its name attribute must be set, and that will be
used as the column name in the resulting joined DataFrame.
on : str, list of str, or array-like, optional
Column or index level name(s) in the caller to join on the index
in other, otherwise joins index-on-index. If multiple
values given, the other DataFrame must have a MultiIndex. Can
pass an array as the join key if it is not already contained in
the calling DataFrame. Like an Excel VLOOKUP operation.
>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN
If we want to join using the key columns, we need to set key to be
the index in both df and other. The joined DataFrame will have
key as its index.
>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN
Another option to join using the key columns is to use the on
parameter. DataFrame.join always uses other’s index but we can use
any column in df. This method preserves the original DataFrame’s
index in the result.
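A brief sketch of the on-parameter behaviour described above (not from the original docstring); the caller's RangeIndex is preserved in the result:
>>> df = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
>>> other = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})
>>> df.join(other.set_index("key"), on="key")
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2  NaN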
first : Select initial periods of time series based on a date offset.
at_time : Select values at a particular time of the day.
between_time : Select values between particular times of the day.
Notice the data for 3 last calendar days were returned, not the last
3 observed days in the dataset, and therefore data for 2018-04-11 was
not returned.
DataFrame.apply : Apply a function along input axis of DataFrame.
DataFrame.replace: Replace values given in to_replace with value.
Series.map : Apply a function elementwise on a Series.
cond : bool Series/DataFrame, array-like, or callable
Where cond is False, keep the original value. Where
True, replace with corresponding value from other.
If cond is callable, it is computed on the Series/DataFrame and
should return boolean Series/DataFrame or array. The callable must
not change input Series/DataFrame (though pandas doesn’t check it).
other : scalar, Series/DataFrame, or callable
Entries where cond is True are replaced with
corresponding value from other.
If other is callable, it is computed on the Series/DataFrame and
should return scalar or Series/DataFrame. The callable must not
change input Series/DataFrame (though pandas doesn’t check it).
If not specified, entries will be filled with the corresponding
NULL value (np.nan for numpy dtypes, pd.NA for extension
dtypes).
inplace : bool, default False
Whether to perform the operation in place on the data.
axis : int, default None
Alignment axis if needed. For Series this parameter is
unused and defaults to 0.
The mask method is an application of the if-then idiom. For each
element in the calling DataFrame, if cond is False the
element is used; otherwise the corresponding element from the DataFrame
other is used. If the axis of other does not align with axis of
cond Series/DataFrame, the misaligned index positions will be filled with
True.
The signature for DataFrame.where() differs from
numpy.where(). Roughly df1.where(m,df2) is equivalent to
np.where(m,df1,df2).
For further details and examples see the mask documentation in
indexing.
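An illustrative contrast of where and mask (not from the original docstring):
>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)   # keep values where the condition is True
0    10
1    10
2     2
3     3
4     4
dtype: int64
>>> s.mask(s > 1, 10)    # replace values where the condition is True
0     0
1     1
2    10
3    10
4    10
dtype: int64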
The dtype of the object takes precedence. The fill value is cast to
the object’s dtype, if this can be done losslessly.
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (id_vars), while all other
columns, considered measured variables (value_vars), are “unpivoted” to
the row axis, leaving just two non-identifier columns, ‘variable’ and
‘value’.
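A minimal unpivot sketch (not from the original docstring):
>>> df = pd.DataFrame({"A": ["a", "b"], "B": [1, 2], "C": [3, 4]})
>>> df.melt(id_vars=["A"], value_vars=["B", "C"])
   A variable  value
0  a        B      1
1  b        B      2
2  a        C      3
3  b        C      4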
Specifies whether to include the memory usage of the DataFrame’s
index in returned Series. If index=True, the memory usage of
the index is the first item in the output.
deep : bool, default False
If True, introspect the data deeply by interrogating
object dtypes for system-level memory consumption, and include
it in the returned values.
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes will be ignored. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.
When performing a cross merge, no column specifications to merge on are
allowed.
Warning
If both key columns contain rows where the key is a null value, those
rows will be matched against each other. This is different from usual SQL
join behaviour and can lead to unexpected results.
left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order
of the left keys.
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_index : bool, default False
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as
left_index.
sort : bool, default False
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixes : list-like, default is (“_x”, “_y”)
A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in
left and right respectively. Pass a value of None instead
of a string to indicate that the column name from left or
right should be left as-is, with no suffix. At least one of the
values must not be None.
copy : bool, default True
If False, avoid copy if possible.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
indicator : bool or str, default False
If True, adds a column to the output DataFrame called “_merge” with
information on the source of each row. The column can be given a different
name by providing a string argument. The column will have a Categorical
type with the value of “left_only” for observations whose merge key only
appears in the left DataFrame, “right_only” for observations
whose merge key only appears in the right DataFrame, and “both”
if the observation’s merge key is found in both DataFrames.
validate : str, optional
If specified, checks if merge is of specified type.
“one_to_one” or “1:1”: check if merge keys are unique in both
left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left
dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right
dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
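For illustration (not from the original docstring), an outer merge with the indicator column; with how='outer' the keys are sorted lexicographically as described above:
>>> left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
>>> right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})
>>> left.merge(right, on="key", how="outer", indicator=True)
  key  lval  rval      _merge
0   a   1.0   4.0        both
1   b   2.0   5.0        both
2   c   3.0   NaN   left_only
3   d   NaN   6.0  right_only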
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
Any single or multiple element data structure, or list-like object.
axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN
By default, missing values are not considered, and the mode of wings
are both 0 and 2. Because the resulting DataFrame has two rows,
the second row of species and legs contains NaN.
>>> df.mode()
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0
Setting dropna=False, NaN values are considered and they can be
the mode (like for wings).
>>> df.mode(dropna=False)
  species  legs  wings
0    bird     2    NaN
Setting numeric_only=True, only the mode of numeric columns is
computed, and columns of other types are ignored.
>>> df.mode(numeric_only=True)
   legs  wings
0   2.0    0.0
1   NaN    2.0
To compute the mode over columns and not rows, use the axis parameter:
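A sketch of the example this sentence introduces, using the df defined above (the output shown here is reconstructed, not copied from the source):
>>> df.mode(axis='columns', numeric_only=True)
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN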
Any single or multiple element data structure, or list-like object.
axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in
descending order. The columns that are not specified are returned as
well, but not used for ordering.
This method is equivalent to
df.sort_values(columns,ascending=False).head(n), but more
performant.
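A small illustrative sketch (not from the original docstring):
>>> df = pd.DataFrame({"population": [59000000, 65000000, 434000],
...                    "GDP": [1937894, 2583560, 12011]},
...                   index=["Italy", "France", "Malta"])
>>> df.nlargest(2, "population")
        population      GDP
France    65000000  2583560
Italy     59000000  1937894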
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
NA values, such as None or numpy.NaN, get mapped to False
values.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born   name    toy
0   True  False   True  False
1   True   True   True   True
2  False   True   True   True
DataFrame.notnull is an alias for DataFrame.notna.
Detect existing (non-missing) values.
Return the first n rows ordered by columns in ascending order.
Return the first n rows with the smallest values in columns, in
ascending order. The columns that are not specified are returned as
well, but not used for ordering.
This method is equivalent to
df.sort_values(columns,ascending=True).head(n), but more
performant.
Fractional change between the current and a prior element.
Computes the fractional change from the immediately previous row by
default. This is useful in comparing the fraction of change in a time
series of elements.
Note
Despite the name of this method, it calculates fractional change
(also known as per unit change or relative change) and not
percentage change. If you need the percentage change, multiply
these values by 100.
Series.diff : Compute the difference of two elements in a Series.
DataFrame.diff : Compute the difference of two elements in a DataFrame.
Series.shift : Shift the index by some number of periods.
DataFrame.shift : Shift the index by some number of periods.
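An illustrative sketch of the fractional change (not from the original docstring):
>>> s = pd.Series([90, 91, 85])
>>> s.pct_change()
0         NaN
1    0.011111
2   -0.065934
dtype: float64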
Function to apply to the Series/DataFrame.
args, and kwargs are passed into func.
Alternatively a (callable,data_keyword) tuple where
data_keyword is a string indicating the keyword of
callable that expects the Series/DataFrame.
DataFrame.apply : Apply a function along input axis of DataFrame.
DataFrame.map : Apply a function elementwise on a whole DataFrame.
Series.map : Apply a mapping correspondence on a Series.
If you have a function that takes the data as (say) the second
argument, pass a tuple indicating which keyword expects the
data. For example, suppose national_insurance takes its data as df
in the second argument:
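A sketch of the example this sentence introduces; the body of national_insurance is assumed here purely for illustration:
>>> def national_insurance(rate, df):     # hypothetical helper that takes the data second
...     return df * rate
>>> df = pd.DataFrame({"salary": [3000, 4400]})
>>> df.pipe((national_insurance, "df"), rate=0.12)
   salary
0   360.0
1   528.0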
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a “pivot” table) based on column values. Uses
unique values from specified index / columns to form axes of the
resulting DataFrame. This function does not support data
aggregation, multiple values will result in a MultiIndex in the
columns. See the User Guide for more on reshaping.
Column to use to make new frame’s index. If not given, uses existing index.
values : str, object or a list of the previous, optional
Column(s) to use for populating new frame’s values. If not
specified, all remaining columns will be used and the result will
have hierarchically indexed columns.
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t
>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar')['baz']
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
     baz       zoo
bar    A  B  C   A  B  C
foo
one    1  2  3   x  y  z
two    4  5  6   q  w  t
You could also assign a list of column names or a list of index names.
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values")
lev3         1    2
lev1 lev2
1    1     0.0  1.0
     2     2.0  NaN
2    1     4.0  3.0
     2     NaN  5.0
A ValueError is raised if there are any duplicates.
>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4
Notice that the first two rows are the same for our index
and columns arguments.
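A sketch of the error this paragraph introduces (the exception text may vary between pandas versions):
>>> df.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
   ...
ValueError: Index contains duplicate entries, cannot reshape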
index : column, Grouper, array, or list of the previous
Keys to group by on the pivot table index. If a list is passed,
it can contain any of the other types (except list). If an array is
passed, it must be the same length as the data and will be used in
the same manner as column values.
columns : column, Grouper, array, or list of the previous
Keys to group by on the pivot table column. If a list is passed,
it can contain any of the other types (except list). If an array is
passed, it must be the same length as the data and will be used in
the same manner as column values.
aggfunc : function, list of functions, dict, default “mean”
If a list of functions is passed, the resulting pivot table will have
hierarchical columns whose top level are the function names
(inferred from the function objects themselves).
If a dict is passed, the key is column to aggregate and the value is
function or list of functions. If margins=True, aggfunc will be
used to calculate the partial aggregates.
fill_value : scalar, default None
Value to replace missing values with (in the resulting pivot table,
after aggregation).
margins : bool, default False
If margins=True, special All columns and rows
will be added with partial group aggregates across the categories
on the rows and columns.
dropna : bool, default True
Do not include columns whose entries are all NaN. If True,
rows with a NaN value in any column will be omitted before
computing margins.
margins_name : str, default ‘All’
Name of the row / column that will contain the totals
when margins is True.
observed : bool, default False
This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
Deprecated since version 2.2.0: The default value of False is deprecated and will change to
True in a future version of pandas.
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
...                          "bar", "bar", "bar", "bar"],
...                    "B": ["one", "one", "one", "two", "two",
...                          "one", "one", "two", "two"],
...                    "C": ["small", "large", "large", "small",
...                          "small", "large", "small", "small",
...                          "large"],
...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
>>> df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9
This first example aggregates values by taking the sum.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc="sum")
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
We can also fill missing values using the fill_value parameter.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc="sum", fill_value=0)
>>> table
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6
The next example aggregates by taking the mean across multiple columns.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': "mean", 'E': "mean"})
>>> table
                  D         E
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333
We can also calculate multiple types of aggregations for any given
value column.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': "mean",
...                                 'E': ["min", "max", "mean"]})
>>> table
                  D   E
               mean max      mean  min
A   C
bar large  5.500000   9  7.500000    6
    small  5.500000   9  8.500000    8
foo large  2.000000   5  4.500000    4
    small  2.333333   6  4.333333    2
Any single or multiple element data structure, or list-like object.
axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Axis for the function to be applied on.
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.prod with axis=None is deprecated;
in a future version this will reduce over both axes and return a scalar.
To retain the old behavior, pass axis=0 (or do not pass axis).
Added in version 2.0.0.
skipna : bool, default True
Exclude NA/null values when computing the result.
numeric_only : bool, default False
Include only float, int, boolean columns. Not implemented for Series.
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
This optional parameter specifies the interpolation method to use,
when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the
fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
method : {‘single’, ‘table’}, default ‘single’
Whether to compute quantiles per-column (‘single’) or over all columns
(‘table’). When ‘table’, the only allowed interpolation methods are
‘nearest’, ‘lower’, and ‘higher’.
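An illustrative sketch (not from the original docstring):
>>> df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 100, 100, 100]})
>>> df.quantile(0.5)
a      2.5
b    100.0
Name: 0.5, dtype: float64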
You can refer to variables
in the environment by prefixing them with an ‘@’ character like
@a+b.
You can refer to column names that are not valid Python variable names
by surrounding them in backticks. Thus, column names containing spaces
or punctuations (besides underscores) or starting with digits must be
surrounded by backticks. (For example, a column named “Area (cm^2)” would
be referenced as `Area (cm^2)`). Column names which are Python keywords
(like “list”, “for”, “import”, etc) cannot be used.
For example, if one of your columns is called aa and you want
to sum it with b, your query should be `aa`+b.
inplace : bool
Whether to modify the DataFrame rather than creating a new one.
The result of the evaluation of this expression is first passed to
DataFrame.loc and if that fails because of a
multidimensional key (e.g., a DataFrame) then the result will be passed
to DataFrame.__getitem__().
This method uses the top-level eval() function to
evaluate the passed query.
The query() method uses a slightly
modified Python syntax by default. For example, the & and |
(bitwise) operators have the precedence of their boolean cousins,
and and or. This is syntactically valid Python,
however the semantics are different.
You can change the semantics of the expression by passing the keyword
argument parser='python'. This enforces the same semantics as
evaluation in Python space. Likewise, you can pass engine='python'
to evaluate an expression using Python itself as a backend. This is not
recommended as it is inefficient compared to using numexpr as the
engine.
The DataFrame.index and
DataFrame.columns attributes of the
DataFrame instance are placed in the query namespace
by default, which allows you to treat both the index and columns of the
frame as a column in the frame.
The identifier index is used for the frame index; you can also
use the name of the index to identify it in a query. Please note that
Python keywords may not be used as identifiers.
For further details and examples see the query documentation in
indexing.
Backtick quoted variables
Backtick quoted variables are parsed as literal Python code and
are converted internally to a Python valid identifier.
This can lead to the following problems.
During parsing a number of disallowed characters inside the backtick
quoted string are replaced by strings that are allowed as a Python identifier.
These characters include all operators in Python, the space character, the
question mark, the exclamation mark, the dollar sign, and the euro sign.
For other characters that fall outside the ASCII range (U+0001..U+007F)
and those that are not further specified in PEP 3131,
the query parser will raise an error.
This excludes whitespace different than the space character,
but also the hashtag (as it is used for comments) and the backtick
itself (backtick can also not be escaped).
In a special case, quotes that make a pair around a backtick can
confuse the parser.
For example, `it's`>`that's` will raise an error,
as it forms a quoted string ('s>`that') with a backtick inside.
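A brief sketch of backtick-quoted column names in query (not from the original docstring):
>>> df = pd.DataFrame({"A": range(1, 6),
...                    "B": range(10, 0, -2),
...                    "C C": range(10, 5, -1)})
>>> df.query("A > B")
   A  B  C C
4  5  2    6
>>> df.query("B == `C C`")
   A   B  C C
0  1  10   10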
Any single or multiple element data structure, or list-like object.
axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
The following example shows how the method behaves with the above
parameters:
default_rank: this is the default behaviour obtained without using
any parameter.
max_rank: setting method='max' the records that have the
same values are ranked using the highest rank (e.g.: since ‘cat’
and ‘dog’ are both in the 2nd and 3rd position, rank 3 is assigned.)
NA_bottom: choosing na_option='bottom', if there are records
with NaN values they are placed at the bottom of the ranking.
pct_rank: when setting pct=True, the ranking is expressed as
percentile rank.
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
Any single or multiple element data structure, or list-like object.
axis : {0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns
(1 or ‘columns’). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Conform DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object
is produced unless the new index is equivalent to the current one and
copy=False.
Method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next
valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copy : bool, default True
Return a new object, even if the passed indexes are the same.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : scalar, default np.nan
Value to use for missing values. Defaults to NaN, but can be any
“compatible” value.
limit : int, default None
Maximum number of consecutive elements to forward or backward fill.
tolerance : optional
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation abs(index[indexer]-target)<=tolerance.
Tolerance may be a scalar value, which applies the same tolerance
to all values, or list-like, which applies variable tolerance per
element. List-like includes list, tuple, array, Series, and must be
the same size as the index and its dtype must exactly match the
index’s type.
DataFrame.set_index : Set row labels.
DataFrame.reset_index : Remove row labels or move them to new columns.
DataFrame.reindex_like : Change to same indices as other DataFrame.
Create a new index and reindex the dataframe. By default
values in the new index that do not have corresponding
records in the dataframe are assigned NaN.
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
We can fill in the missing values by passing a value to
the keyword fill_value. Because the index is not monotonically
increasing or decreasing, we cannot use arguments to the keyword
method to fill the NaN values.
To further illustrate the filling functionality in
reindex, we will create a dataframe with a
monotonically increasing index (for example, a sequence
of dates).
The index entries that did not have a value in the original data frame
(for example, ‘2009-12-29’) are by default filled with NaN.
If desired, we can fill in the missing values using one of several
options.
For example, to back-propagate the last valid value to fill the NaN
values, pass bfill as an argument to the method keyword.
Please note that the NaN value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the NaN values present
in the original dataframe, use the fillna() method.
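An illustrative sketch of filling while reindexing (not from the original docstring); note the pre-existing NaN is left untouched, as explained above:
>>> date_index = pd.date_range("2010-01-01", periods=4, freq="D")
>>> s = pd.Series([1.0, np.nan, 3.0, 4.0], index=date_index)
>>> s.reindex(pd.date_range("2009-12-30", periods=6, freq="D"), method="bfill")
2009-12-30    1.0
2009-12-31    1.0
2010-01-01    1.0
2010-01-02    NaN
2010-01-03    3.0
2010-01-04    4.0
Freq: D, dtype: float64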
Return an object with matching indices as other object.
Conform the object to the same index on all axes. Optional
filling logic, placing NaN in locations having no value
in the previous index. A new object is produced unless the
new index is equivalent to the current one and copy=False.
Method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: propagate last valid observation forward to next
valid
backfill / bfill: use next valid observation to fill gap
nearest: use nearest valid observations to fill gap.
copy : bool, default True
Return a new object, even if the passed indexes are the same.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
limit : int, default None
Maximum number of consecutive labels to fill for inexact matches.
tolerance : optional
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation abs(index[indexer]-target)<=tolerance.
Tolerance may be a scalar value, which applies the same tolerance
to all values, or list-like, which applies variable tolerance per
element. List-like includes list, tuple, array, Series, and must be
the same size as the index and its dtype must exactly match the
index’s type.
DataFrame.set_index : Set row labels.
DataFrame.reset_index : Remove row labels or move them to new columns.
DataFrame.reindex : Change to new indices or expand indices.
>>> df2
            temp_celsius windspeed
2014-02-12          28.0       low
2014-02-13          30.0       low
2014-02-15          35.1    medium
>>> df2.reindex_like(df1)
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          28.0              NaN       low
2014-02-13          30.0              NaN       low
2014-02-14           NaN              NaN       NaN
2014-02-15          35.1              NaN    medium
Dict-like or function transformations to apply to
that axis’ values. Use either mapper and axis to
specify the axis to target with mapper, or index and
columns.
index : dict-like or function
Alternative to specifying axis (mapper,axis=0
is equivalent to index=mapper).
columns : dict-like or function
Alternative to specifying axis (mapper,axis=1
is equivalent to columns=mapper).
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis to target with mapper. Can be either the axis name
(‘index’, ‘columns’) or number (0, 1). The default is ‘index’.
copy : bool, default True
Also copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
inplace : bool, default False
Whether to modify the DataFrame rather than creating a new one.
If True then value of copy is ignored.
level : int or level name, default None
In case of a MultiIndex, only rename labels in the specified
level.
errors : {‘ignore’, ‘raise’}, default ‘ignore’
If ‘raise’, raise a KeyError when a dict-like mapper, index,
or columns contains labels that are not present in the Index
being transformed.
If ‘ignore’, existing keys will be renamed and extra keys will be
ignored.
index, columns : scalar, list-like, dict-like or function, optional
A scalar, list-like, dict-like or functions transformations to
apply to that axis’ values.
Note that the columns parameter is not allowed if the
object is a Series. This parameter only applies to DataFrame
type objects.
Use either mapper and axis to
specify the axis to target with mapper, or index
and/or columns.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to rename. For Series this parameter is unused and defaults to 0.
copy : bool, default None
Also copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
inplace : bool, default False
Modifies the object directly, instead of creating a new Series
or DataFrame.
DataFrame.rename_axis supports two calling conventions
(index=index_mapper,columns=columns_mapper,...)
(mapper,axis={'index','columns'},...)
The first calling convention will only modify the names of
the index and/or the names of the Index object that is the columns.
In this case, the parameter copy is ignored.
The second calling convention will modify the names of the
corresponding index if mapper is a list or a scalar.
However, if mapper is dict-like or a function, it will use the
deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your
intent.
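A small illustrative sketch using the first calling convention (not from the original docstring):
>>> df = pd.DataFrame({"num_legs": [4, 4, 2]},
...                   index=["dog", "cat", "monkey"])
>>> df.rename_axis("animal")
        num_legs
animal
dog            4
cat            4
monkey         2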
Values of the Series/DataFrame are replaced with other values dynamically.
This differs from updating with .loc or .iloc, which require
you to specify a location to update with some value.
to_replace : str, regex, list, dict, Series, int, float, or None
How to find the values that will be replaced.
numeric, str or regex:
numeric: numeric values equal to to_replace will be
replaced with value
str: string exactly matching to_replace will be replaced
with value
regex: regexs matching to_replace will be replaced with
value
list of str, regex, or numeric:
First, if to_replace and value are both lists, they
must be the same length.
Second, if regex=True then all of the strings in both
lists will be interpreted as regexs otherwise they will match
directly. This doesn’t matter much for value since there
are only a few possible substitution regexes you can use.
str, regex and numeric rules apply as above.
dict:
Dicts can be used to specify different replacement values
for different existing values. For example,
{'a':'b','y':'z'} replaces the value ‘a’ with ‘b’ and
‘y’ with ‘z’. To use a dict in this way, the optional value
parameter should not be given.
For a DataFrame a dict can specify that different values
should be replaced in different columns. For example,
{'a':1,'b':'z'} looks for the value 1 in column ‘a’
and the value ‘z’ in column ‘b’ and replaces these values
with whatever is specified in value. The value parameter
should not be None in this case. You can treat this as a
special case of passing two lists except that you are
specifying the column to search in.
For a DataFrame nested dictionaries, e.g.,
{'a':{'b':np.nan}}, are read as follows: look in column
‘a’ for the value ‘b’ and replace it with NaN. The optional value
parameter should not be specified to use a nested dict in this
way. You can nest regular expressions as well. Note that
column names (the top-level dictionary keys in a nested
dictionary) cannot be regular expressions.
None:
This means that the regex argument must be a string,
compiled regular expression, or list, dict, ndarray or
Series of such elements. If value is also None then
this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
Value to replace any values matching to_replace with.
For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be
filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplace : bool, default False
If True, performs operation inplace and returns None.
limit : int, default None
Maximum size gap to forward or backward fill.
Deprecated since version 2.1.0.
regex : bool or same types as to_replace, default False
Whether to interpret to_replace and/or value as regular
expressions. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
to_replace must be None.
method : {‘pad’, ‘ffill’, ‘bfill’}
The method to use for replacement when to_replace is a
scalar, list or tuple and value is None.
Series.fillna : Fill NA values.
DataFrame.fillna : Fill NA values.
Series.where : Replace values based on boolean condition.
DataFrame.where : Replace values based on boolean condition.
DataFrame.map: Apply a function to a Dataframe elementwise.
Series.map: Map values of Series according to an input mapping or function.
Series.str.replace : Simple string replacement.
Regex substitution is performed under the hood with re.sub. The
rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you
cannot provide, for example, a regular expression matching floating
point numbers and expect the columns in your frame that have a
numeric dtype to be matched. However, if those floating point
numbers are strings, then you can do this.
This method has a lot of options. You are encouraged to experiment
and play with this method to gain intuition about how it works.
When dict is used as the to_replace value, it is like
key(s) in the dict are the to_replace part and
value(s) in the dict are the value parameter.
>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e
Regular expression `to_replace`
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz
Compare the behavior of s.replace({'a':None}) and
s.replace('a',None) to understand the peculiarities
of the to_replace parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
When one uses a dict as the to_replace value, it is like the
value(s) in the dict are equal to the value parameter.
s.replace({'a':None}) is equivalent to
s.replace(to_replace={'a':None},value=None,method=None):
When value is not explicitly passed and to_replace is a scalar, list
or tuple, replace uses the method parameter (default ‘pad’) to do the
replacement. So this is why the ‘a’ values are being replaced by 10
in rows 1 and 2 and ‘b’ in row 4 in this case.
>>> s.replace('a')
0    10
1    10
2    10
3     b
4     b
dtype: object
Deprecated since version 2.1.0: The ‘method’ parameter and padding behavior are deprecated.
On the other hand, if None is explicitly passed for value, it will
be respected:
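A sketch of the example this sentence introduces, continuing the s defined above:
>>> s.replace('a', None)
0      10
1    None
2    None
3       b
4    None
dtype: object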
Convenience method for frequency conversion and resampling of time series.
The object must have a datetime-like index (DatetimeIndex, PeriodIndex,
or TimedeltaIndex), or the caller must pass the label of a datetime-like
series/index to the on/level keyword parameter.
The offset string or object representing target conversion.
axis : {0 or 'index', 1 or 'columns'}, default 0
Which axis to use for up- or down-sampling. For Series this parameter
is unused and defaults to 0. Must be
DatetimeIndex, TimedeltaIndex or PeriodIndex.
Deprecated since version 2.0.0: Use frame.T.resample(…) instead.
closed : {'right', 'left'}, default None
Which side of bin interval is closed. The default is 'left'
for all frequency offsets except for 'ME', 'YE', 'QE', 'BME',
'BA', 'BQE', and 'W' which all have a default of 'right'.
label : {'right', 'left'}, default None
Which bin edge label to label bucket with. The default is 'left'
for all frequency offsets except for 'ME', 'YE', 'QE', 'BME',
'BA', 'BQE', and 'W' which all have a default of 'right'.
kind : {'timestamp', 'period'}, optional, default None
Pass 'timestamp' to convert the resulting index to a
DatetimeIndex or 'period' to convert it to a PeriodIndex.
By default the input representation is retained.
Deprecated since version 2.2.0: Convert index to desired type explicitly instead.
on : str, optional
For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
level : str or int, optional
For a MultiIndex, level (name or number) to use for
resampling. level must be datetime-like.
origin : Timestamp or str, default 'start_day'
The timestamp on which to adjust the grouping. The timezone of origin
must match the timezone of the index.
If string, must be one of the following:
‘epoch’: origin is 1970-01-01
‘start’: origin is the first value of the timeseries
‘start_day’: origin is the first day at midnight of the timeseries
‘end’: origin is the last value of the timeseries
‘end_day’: origin is the ceiling midnight of the last day
Added in version 1.3.0.
Note
Only takes effect for Tick-frequencies (i.e. fixed frequencies like
days, hours, and minutes, rather than months or quarters).
offset : Timedelta or str, default None
An offset timedelta added to the origin.
group_keys : bool, default False
Whether to include the group keys in the result index when using
.apply() on the resampled object.
Added in version 1.5.0: Not specifying group_keys will retain values-dependent behavior
from pandas 1.4 and earlier (see pandas 1.5.0 Release notes for examples).
Changed in version 2.0.0: group_keys now defaults to False.
Series.resample : Resample a Series.
DataFrame.resample : Resample a DataFrame.
groupby : Group Series/DataFrame by mapping, function, label, or list of labels.
asfreq : Reindex a Series/DataFrame with the given frequency without grouping.
Downsample the series into 3 minute bins as above, but label each
bin using the right edge instead of the left. Please note that the
value in the bucket used as the label is not included in the bucket
it labels. For example, in the original series the
bucket 2000-01-01 00:03:00 contains the value 3, but the summed
value in the resampled bucket with the label 2000-01-01 00:03:00
does not include 3 (if it did, the summed value would be 6, not 3).
In contrast with the start_day, you can use end_day to take the ceiling
midnight of the largest Timestamp as the end of the bins and drop the bins
not containing data:
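A minimal sketch of both origin options, assuming a small Series ts with a datetime index (the values and frequency below are illustrative only):
>>> idx = pd.date_range('2000-10-01 23:30:00', periods=5, freq='17min')
>>> ts = pd.Series([1, 2, 3, 4, 5], index=idx)
>>> ts.resample('17min', origin='start_day').sum()   # bins anchored at midnight of the first day
>>> ts.resample('17min', origin='end_day').sum()     # bins anchored at the ceiling midnight of the last day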
Using the given string, rename the DataFrame column which contains the
index data. If the DataFrame has a MultiIndex, this has to be a list or
tuple with length equal to the number of levels.
DataFrame.set_index : Opposite of reset_index.
DataFrame.reindex : Change to new indices or expand indices.
DataFrame.reindex_like : Change to same indices as other DataFrame.
>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
When we reset the index, the old index is added as a column, and a
new sequential index is used:
>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
We can use the drop parameter to avoid the old index being added as
a column:
>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN
You can also use reset_index with MultiIndex.
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    (24.0, 'fly'),
...                    (80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump
Using the names parameter, choose a name for the index column:
>>> df.reset_index(names=['classes', 'names'])
  classes   names  speed species
                     max    type
0    bird  falcon  389.0     fly
1    bird  parrot   24.0     fly
2  mammal    lion   80.5     run
3  mammal  monkey    NaN    jump
If the index has multiple levels, we can reset a subset of them:
>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
If we are not dropping the index, by default, it is placed in the top
level. We can place it in another level:
>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
When the index is inserted under another level, we can specify under
which one with the parameter col_fill:
>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
If we specify a nonexistent level for col_fill, it is created:
>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or 'index') or columns
(1 or 'columns'). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
window : int, timedelta, str, offset, or BaseIndexer subclass
Size of the moving window.
If an integer, the fixed number of observations used for
each window.
If a timedelta, str, or offset, the time period of each window. Each
window will be of variable size based on the observations included in
the time-period. This is only valid for datetimelike indexes.
To learn more about the offsets & frequency strings, please see this link.
If a BaseIndexer subclass, the window boundaries
based on the defined get_window_bounds method. Additional rolling
keyword arguments, namely min_periods, center, closed and
step will be passed to get_window_bounds.
min_periods : int, default None
Minimum number of observations in window required to have a value;
otherwise, result is np.nan.
For a window that is specified by an offset, min_periods will default to 1.
For a window that is specified by an integer, min_periods will default
to the size of the window.
center : bool, default False
If False, set the window labels as the right edge of the window index.
If True, set the window labels as the center of the window index.
win_type : str, default None
Certain Scipy window types require additional parameters to be passed
in the aggregation function. The additional parameters must match
the keywords specified in the Scipy window type method signature.
on : str, optional
For a DataFrame, a column label or Index level on which
to calculate the rolling window, rather than the DataFrame's index.
Provided integer column is ignored and excluded from result since
an integer index is not used to calculate the rolling window.
axis : int or str, default 0
If 0 or 'index', roll across the rows.
If 1 or 'columns', roll across the columns.
For Series this parameter is unused and defaults to 0.
Deprecated since version 2.1.0: The axis keyword is deprecated. For axis=1,
transpose the DataFrame first instead.
closed : str, default None
If 'right', the first point in the window is excluded from calculations.
If 'left', the last point in the window is excluded from calculations.
If 'both', no points in the window are excluded from calculations.
If 'neither', the first and last points in the window are excluded
from calculations.
Default None ('right').
step : int, default None
Added in version 1.5.0.
Evaluate the window at every step result, equivalent to slicing as
[::step]. window must be an integer. Using a step argument other
than None or 1 will produce a result with a different shape than the input.
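A short sketch of these window options on a toy frame (illustrative only; outputs omitted):
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df.rolling(window=2).sum()                    # fixed window of two observations
>>> df.rolling(window=2, min_periods=1).sum()     # allow windows with a single valid value
>>> df.rolling(window=2, step=2).sum()            # evaluate only every second window (pandas >= 1.5)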
decimals : int, dict, Series
Number of decimal places to round each column to. If an int is
given, round each column to the same number of places.
Otherwise dict and Series round to variable numbers of places.
Column names should be in the keys if decimals is a
dict-like, or in the index if decimals is a Series. Any
columns not included in decimals will be left as is. Elements
of decimals which are not columns of the input will be
ignored.
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or 'index') or columns
(1 or 'columns'). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
n : int, optional
Number of items from axis to return. Cannot be used with frac.
Default = 1 if frac = None.
frac : float, optional
Fraction of axis items to return. Cannot be used with n.
replace : bool, default False
Allow or disallow sampling of the same row more than once.
weights : str or ndarray-like, optional
Default ‘None’ results in equal probability weighting.
If passed a Series, will align with target object on index. Index
values in weights not found in sampled object will be ignored and
index values in sampled object not in weights will be assigned
weights of zero.
If called on a DataFrame, will accept the name of a column
when axis = 0.
Unless weights are a Series, weights must be same length as axis
being sampled.
If weights do not sum to 1, they will be normalized to sum to 1.
Missing values in the weights column will be treated as zero.
Infinite values not allowed.
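A small illustrative sketch of n, frac, replace and weights (the frame below is hypothetical):
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0], 'num_wings': [2, 0, 0, 0]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df.sample(n=2, random_state=1)                        # two rows without replacement
>>> df.sample(frac=0.5, replace=True, random_state=1)     # half the rows, sampled with replacement
>>> df.sample(n=1, weights='num_legs', random_state=1)    # weight rows by a column when axis=0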
To select all numeric types, use np.number or 'number'
To select strings you must use the object dtype, but note that
this will return all object dtype columns. With
pd.options.future.infer_string enabled, using "str" will
work to select all string columns.
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sem with axis=None is deprecated;
in a future version this will reduce over both axes and return a scalar.
To retain the old behavior, pass axis=0 (or do not pass axis).
skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
numeric_only : bool, default False
Include only float, int, boolean columns. Not implemented for Series.
The axis to update. The value 0 identifies the rows. For Series
this parameter is unused and defaults to 0.
copy : bool, default True
Whether to make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
allows_duplicate_labels : bool, optional
Whether the returned object allows duplicate labels.
This method returns a new object that’s a view on the same data
as the input. Mutating the input or the output values will be reflected
in the other.
This method is intended to be used in method chains.
“Flags” differ from “metadata”. Flags reflect properties of the
pandas object (the Series or DataFrame). Metadata refer to properties
of the dataset, and should be stored in DataFrame.attrs.
Set the DataFrame index (row labels) using one or more existing
columns or arrays (of the correct length). The index can replace the
existing index or expand on it.
This parameter can be either a single column key, a single array of
the same length as the calling DataFrame, or a list containing an
arbitrary combination of column keys and arrays. Here, “array”
encompasses Series, Index, np.ndarray, and
instances of Iterator.
drop : bool, default True
Delete columns to be used as the new index.
append : bool, default False
Whether to append columns to existing index.
inplace : bool, default False
Whether to modify the DataFrame rather than creating a new one.
verify_integrity : bool, default False
Check the new index for duplicates. Otherwise defer the check until
necessary. Setting to False will improve the performance of this
method.
DataFrame.reset_index : Opposite of set_index.
DataFrame.reindex : Change to new indices or expand indices.
DataFrame.reindex_like : Change to same indices as other DataFrame.
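A brief sketch of the accepted key types (column labels, arrays, or a mix), using a hypothetical frame:
>>> df = pd.DataFrame({'month': [1, 4, 7, 10],
...                    'year': [2012, 2014, 2013, 2014],
...                    'sale': [55, 40, 84, 31]})
>>> df.set_index('month')                           # a single column becomes the index
>>> df.set_index(['year', 'month'])                 # a MultiIndex built from two columns
>>> df.set_index([pd.Index([1, 2, 3, 4]), 'year'])  # mix of an array and a column key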
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data.
If freq is passed (in this case, the index must be date or datetime,
or it will raise a NotImplementedError), the index will be
increased using the periods and the freq. freq can be inferred
when specified as “infer” as long as either freq or inferred_freq
attribute is set in the index.
Number of periods to shift. Can be positive or negative.
If an iterable of ints, the data will be shifted once by each int.
This is equivalent to shifting by one value at a time and
concatenating all resulting frames. The resulting columns will have
the shift suffixed to their column names. For multiple periods,
axis must not be 1.
freq : DateOffset, tseries.offsets, timedelta, or str, optional
Offset to use from the tseries module or time rule (e.g. 'EOM').
If freq is specified then the index values are shifted but the
data is not realigned. That is, use freq if you would like to
extend the index when shifting and preserve the original data.
If freq is specified as "infer" then it will be inferred from
the freq or inferred_freq attributes of the index. If neither of
those attributes exist, a ValueError is thrown.
axis : {0 or 'index', 1 or 'columns', None}, default None
Shift direction. For Series this parameter is unused and defaults to 0.
fill_value : object, optional
The scalar value to use for newly introduced missing values.
The default depends on the dtype of self.
For numeric data, np.nan is used.
For datetime, timedelta, or period data, etc. NaT is used.
For extension dtypes, self.dtype.na_value is used.
suffix : str, optional
If str and periods is an iterable, this is added after the column
name and before the shift value for each shifted column name.
>>> df.shift(periods=3)
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
Choice of sorting algorithm. See also numpy.sort() for more
information. mergesort and stable are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {'first', 'last'}, default 'last'
Puts NaNs at the beginning if first; last puts NaNs at the end.
Not implemented for MultiIndex.
sort_remaining : bool, default True
If True and sorting by level and index is multilevel, sort by other
levels too (in order) after sorting by specified level.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
key : callable, optional
If not None, apply the key function to the index values
before sorting. This is similar to the key argument in the
builtin sorted() function, with the notable difference that
this key function should be vectorized. It should expect an
Index and return an Index of the same shape. For MultiIndex
inputs, the key is applied per level.
Choice of sorting algorithm. See also numpy.sort() for more
information. mergesort and stable are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {'first', 'last'}, default 'last'
Puts NaNs at the beginning if first; last puts NaNs at the
end.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
key : callable, optional
Apply the key function to the values
before sorting. This is similar to the key argument in the
builtin sorted() function, with the notable difference that
this key function should be vectorized. It should expect a
Series and return a Series with the same shape as the input.
It will be applied to each column in by independently.
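A minimal sketch of the vectorized key argument (the frame is illustrative only):
>>> df = pd.DataFrame({'a': ['B', 'a', 'C', 'b']})
>>> df.sort_values(by='a')                                      # uppercase sorts before lowercase
>>> df.sort_values(by='a', key=lambda col: col.str.lower())     # case-insensitive ordering
>>> df.sort_values(by='a', na_position='first', ignore_index=True)  # relabel the result 0, 1, ..., n - 1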
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
This method is most useful when you don’t know if your
object is a Series or DataFrame, but you do know it has just a single
column. In that case you can safely call squeeze to ensure you have a
Series.
Series.iloc : Integer-location based indexing for selecting scalars.
DataFrame.iloc : Integer-location based indexing for selecting Series.
Series.to_frame : Inverse of DataFrame.squeeze for a single-column DataFrame.
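A short sketch of the three squeeze outcomes (hypothetical frame, outputs omitted):
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
>>> df[['a']].squeeze('columns')    # single-column DataFrame -> Series
>>> df.loc[[0], ['a']].squeeze()    # single element -> scalar
>>> df.squeeze()                    # several rows and columns -> returned unchanged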
Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level
index with one or more new inner-most levels compared to the current
DataFrame. The new inner-most levels are created by pivoting the
columns of the current dataframe:
if the columns have a single level, the output is a Series;
if the columns have multiple levels, the new index
level(s) is (are) taken from the prescribed level(s) and
the output is a DataFrame.
Level(s) to stack from the column axis onto the index
axis, defined as one index or label, or a list of indices
or labels.
dropna : bool, default True
Whether to drop rows in the resulting Frame/Series with
missing values. Stacking a column level onto the index
axis can create combinations of index and column values
that are missing from the original dataframe. See Examples
section.
sort : bool, default True
Whether to sort the levels of the resulting MultiIndex.
future_stack : bool, default False
Whether to use the new implementation that will replace the current
implementation in pandas 3.0. When True, dropna and sort have no impact
on the result and must remain unspecified. See pandas 2.1.0 Release
notes for more details.
The function is named by analogy with a collection of books
being reorganized from being side by side on a horizontal
position (the columns of the dataframe) to being stacked
vertically on top of each other (in the index of the
dataframe).
It is common to have missing values when stacking a dataframe
with multi-level columns, as the stacked dataframe typically
has more values than the original dataframe. Missing values
are filled with NaNs:
>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack(future_stack=True)
        weight  height
cat kg     1.0     NaN
    m      NaN     2.0
dog kg     3.0     NaN
    m      NaN     4.0
Prescribing the level(s) to be stacked
The first parameter controls which level or levels are stacked:
>>> df_multi_level_cols2.stack(0, future_stack=True)
             kg    m
cat weight  1.0  NaN
    height  NaN  2.0
dog weight  3.0  NaN
    height  NaN  4.0
>>> df_multi_level_cols2.stack([0, 1], future_stack=True)
cat  weight  kg    1.0
     height  m     2.0
dog  weight  kg    3.0
     height  m     4.0
dtype: float64
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.std with axis=None is deprecated;
in a future version this will reduce over both axes and return a scalar.
To retain the old behavior, pass axis=0 (or do not pass axis).
skipna : bool, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
numeric_only : bool, default False
Include only float, int, boolean columns. Not implemented for Series.
Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
Whether to compare by the index (0 or 'index') or columns
(1 or 'columns'). For Series input, axis to match Series index on.
level : int or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_value : float or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Axis for the function to be applied on.
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sum with axis=None is deprecated;
in a future version this will reduce over both axes and return a scalar.
To retain the old behavior, pass axis=0 (or do not pass axis).
Added in version 2.0.0.
skipna : bool, default True
Exclude NA/null values when computing the result.
numeric_only : bool, default False
Include only float, int, boolean columns. Not implemented for Series.
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
>>> df = pd.DataFrame(
...     {"Grade": ["A", "B", "A", "C"]},
...     index=[
...         ["Final exam", "Final exam", "Coursework", "Coursework"],
...         ["History", "Geography", "History", "Geography"],
...         ["January", "February", "March", "April"],
...     ],
... )
>>> df
                                  Grade
Final exam  History    January        A
            Geography  February       B
Coursework  History    March          A
            Geography  April          C
In the following example, we will swap the levels of the indices.
Here, we will swap the levels column-wise, but levels can be swapped row-wise
in a similar manner. Note that column-wise is the default behaviour.
By not supplying any arguments for i and j, we swap the last and second to
last indices.
>>> df.swaplevel()
                                  Grade
Final exam  January    History        A
            February   Geography      B
Coursework  March      History        A
            April      Geography      C
By supplying one argument, we can choose which index to swap the last
index with. We can for example swap the first index with the last one as
follows.
>>> df.swaplevel(0)
                                  Grade
January   History    Final exam      A
February  Geography  Final exam      B
March     History    Coursework      A
April     Geography  Coursework      C
We can also define explicitly which indices we want to swap by supplying values
for both i and j. Here, we for example swap the first and second indices.
>>> df.swaplevel(0, 1)
                                  Grade
History    Final exam  January        A
Geography  Final exam  February       B
History    Coursework  March          A
Geography  Coursework  April          C
This function returns last n rows from the object based on
position. It is useful for quickly verifying data, for example,
after sorting or appending rows.
For negative values of n, this function returns all rows except
the first |n| rows, equivalent to df[|n|:].
If n is larger than the number of rows, this function returns all rows.
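A minimal sketch of positive and negative n (hypothetical frame):
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion', 'monkey',
...                               'parrot', 'shark', 'whale', 'zebra']})
>>> df.tail()       # last 5 rows by default
>>> df.tail(3)      # last 3 rows
>>> df.tail(-3)     # all rows except the first 3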
Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in
the index attribute of the object. We are indexing according to the
actual position of the element in the object.
indices : array-like
An array of ints indicating which positions to take.
axis : {0 or 'index', 1 or 'columns', None}, default 0
The axis on which to select elements. 0 means that we are
selecting rows, 1 means that we are selecting columns.
For Series this parameter is unused and defaults to 0.
DataFrame.loc : Select a subset of a DataFrame by labels.
DataFrame.iloc : Select a subset of a DataFrame by positions.
numpy.take : Take elements from an array along an axis.
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
Take elements at positions 0 and 3 along the axis 0 (default).
Note how the actual indices selected (0 and 1) do not correspond to
our selected indices 0 and 3. That’s because we are selecting the 0th
and 3rd rows, not rows whose indices equal 0 and 3.
>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN
Take elements at indices 1 and 2 along the axis 1 (column selection).
>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN
We may take elements using negative integers for positive indices,
starting from the end of the object, just like with Python lists.
>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
path_or_buf : str, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like
object implementing a write() function. If None, the result is
returned as a string. If a non-binary file object is passed, it should
be opened with newline='', disabling universal newlines. If a binary
file object is passed, mode might need to contain a 'b'.
sep : str, default ','
String of length 1. Field delimiter for the output file.
na_rep : str, default ''
Missing data representation.
float_format : str, Callable, default None
Format string for floating point numbers. If a Callable is given, it takes
precedence over other numeric formatting parameters, like decimal.
columns : sequence, optional
Columns to write.
header : bool or list of str, default True
Write out the column names. If a list of strings is given it is
assumed to be aliases for the column names.
index : bool, default True
Write row names (index).
index_label : str or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and
header and index are True, then the index names are used. A
sequence should be given if the object uses MultiIndex. If
False do not print fields for index names. Use index_label=False
for easier importing in R.
mode : {'w', 'x', 'a'}, default 'w'
Forwarded to either open(mode=) or fsspec.open(mode=) to control
the file opening. Typical values include:
‘w’, truncate the file first.
‘x’, exclusive creation, failing if the file already exists.
‘a’, append to the end of file if it exists.
encoding : str, optional
A string representing the encoding to use in the output file,
defaults to 'utf-8'. encoding is not supported if path_or_buf
is a non-binary file object.
compression : str or dict, default 'infer'
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
May be a dict with key ‘method’ as compression mode
and other entries as additional compression options if
compression mode is ‘zip’.
Passing compression options as keys in dict is
supported for compression modes ‘gzip’, ‘bz2’, ‘zstd’, and ‘zip’.
quoting : optional constant from csv module
Defaults to csv.QUOTE_MINIMAL. If you have set a float_format
then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
will treat them as non-numeric.
quotechar : str, default '"'
String of length 1. Character used to quote fields.
lineterminator : str, optional
The newline character or character sequence to use in the output
file. Defaults to os.linesep, which depends on the OS in which
this method is called (e.g. '\n' for Linux, '\r\n' for Windows).
Changed in version 1.5.0: Previously was line_terminator, changed for consistency with
read_csv and the standard library 'csv' module.
chunksize : int or None
Rows to write at a time.
date_format : str, default None
Format string for datetime objects.
doublequote : bool, default True
Control quoting of quotechar inside a field.
escapechar : str, default None
String of length 1. Character used to escape sep and quotechar
when appropriate.
decimal : str, default '.'
Character recognized as decimal separator. E.g. use ',' for
European data.
errors : str, default 'strict'
Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list
of options.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
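A brief sketch of common to_csv calls (file names below are placeholders):
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'], 'mask': ['red', 'purple']})
>>> csv_text = df.to_csv(index=False)                # no path: the CSV text is returned as a string
>>> df.to_csv('out.csv', sep=';', na_rep='NA')       # write to disk with a custom delimiter
>>> df.to_csv('out.csv.gz', compression='infer')     # compression inferred from the .gz suffix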
‘records’ : list like
[{column -> value}, … , {column -> value}]
‘index’ : dict like {index -> {column -> value}}
Added in version 1.4.0: ‘tight’ as an allowed value for the orient argument
into : class, default dict
The collections.abc.MutableMapping subclass used for all Mappings
in the return value. Can be the actual class or an empty
instance of the mapping type you want. If you want a
collections.defaultdict, you must pass it initialized.
index : bool, default True
Whether to include the index item (and index_names item if orient
is ‘tight’) in the returned dictionary. Can only be False
when orient is ‘split’ or ‘tight’.
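A small sketch of the orient and into parameters (illustrative frame):
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['row1', 'row2'])
>>> df.to_dict()                                      # default 'dict' orient: {column -> {index -> value}}
>>> df.to_dict(orient='records')                      # list of {column -> value} mappings, one per row
>>> from collections import defaultdict
>>> df.to_dict(orient='index', into=defaultdict(list))  # pass an initialized mapping instance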
To write a single object to an Excel .xlsx file it is only necessary to
specify a target file name. To write to multiple sheets it is necessary to
create an ExcelWriter object with a target file name, and specify a sheet
in the file to write to.
Multiple sheets may be written to by specifying unique sheet_name.
With all data written to the file it is necessary to save the changes.
Note that creating an ExcelWriter object with a file name that already
exists will result in the contents of the existing file being erased.
excel_writer : path-like, file-like, or ExcelWriter object
File path or existing ExcelWriter.
sheet_name : str, default 'Sheet1'
Name of sheet which will contain DataFrame.
na_rep : str, default ''
Missing data representation.
float_format : str, optional
Format string for floating point numbers. For example
float_format="%.2f" will format 0.1234 to 0.12.
columns : sequence or list of str, optional
Columns to write.
header : bool or list of str, default True
Write out the column names. If a list of strings is given it is
assumed to be aliases for the column names.
index : bool, default True
Write row names (index).
index_label : str or sequence, optional
Column label for index column(s) if desired. If not specified, and
header and index are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex.
startrow : int, default 0
Upper left cell row to dump data frame.
startcol : int, default 0
Upper left cell column to dump data frame.
engine : str, optional
Write engine to use, 'openpyxl' or 'xlsxwriter'. You can also set this
via the options io.excel.xlsx.writer or
io.excel.xlsm.writer.
merge_cells : bool, default True
Write MultiIndex and Hierarchical Rows as merged cells.
inf_rep : str, default 'inf'
Representation for infinity (there is no native representation for
infinity in Excel).
freeze_panes : tuple of int (length 2), optional
Specifies the one-based bottommost row and rightmost column that
is to be frozen.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
Added in version 1.2.0.
engine_kwargs : dict, optional
Arbitrary keyword arguments passed to excel engine.
to_csv : Write DataFrame to a comma-separated values (csv) file.
ExcelWriter : Class for writing DataFrame objects into excel sheets.
read_excel : Read an Excel file into a pandas DataFrame.
read_csv : Read a comma-separated values (csv) file into DataFrame.
io.formats.style.Styler.to_excel : Add styles to Excel sheet.
To set the library that is used to write the Excel file,
you can pass the engine keyword (the default engine is
automatically chosen depending on the file extension):
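A sketch of the calls described above; the file name and the frames df1 and df2 are placeholders, and the chosen engine must be installed:
>>> df1.to_excel('output.xlsx', engine='xlsxwriter')   # pick the writer library explicitly
>>> with pd.ExcelWriter('output.xlsx') as writer:      # several sheets in one workbook
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')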
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. If a string or a path,
it will be used as Root Directory path when writing a partitioned dataset.
This function writes the dataframe as a feather file. Requires a default
index. For saving the DataFrame with your custom index use a method that
supports custom indices e.g. to_parquet.
Changed in version 1.5.0: Default value is changed to True. Google has deprecated the
auth_local_webserver=False "out of band" (copy-paste) flow.
table_schema : list of dicts, optional
List of BigQuery table fields to which according DataFrame
columns conform to, e.g. [{'name': 'col1', 'type': 'STRING'}, ...]. If schema is not provided, it will be
generated according to dtypes of DataFrame columns. See
BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
location : str, optional
Location where the load job should run. See the BigQuery locations
documentation for a
list of available locations. The location must match that of the
target dataset.
New in version 0.5.0 of pandas-gbq.
progress_bar : bool, default True
Use the library tqdm to show the progress bar for the upload,
chunk by chunk.
Credentials for accessing Google APIs. Use this parameter to
override default credentials, such as to use Compute Engine
google.auth.compute_engine.Credentials or Service
Account google.oauth2.service_account.Credentials
directly.
Write the contained data to an HDF5 file using HDFStore.
Hierarchical Data Format (HDF) is self-describing, allowing an
application to interpret the structure and contents of a file with
no outside information. One HDF file can hold a mix of related objects
which can be accessed as a group or as individual objects.
In order to add another DataFrame or Series to an existing HDF file
please use append mode and a different key.
Warning
One can store a subclass of DataFrame or Series to HDF5,
but the type of the subclass is lost upon storing.
Specifies the compression library to be used.
These additional compressors for Blosc are supported
(default if no compressor specified: ‘blosc:blosclz’):
{‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’,
‘blosc:zlib’, ‘blosc:zstd’}.
Specifying a compression library which is not available issues
a ValueError.
append : bool, default False
For Table formats, append the input data to the existing.
format : {'fixed', 'table', None}, default 'fixed'
Possible values:
'fixed': Fixed format. Fast writing/reading. Not-appendable,
nor searchable.
'table': Table format. Write as a PyTables Table structure
which may perform worse but allow more flexible operations
like searching / selecting subsets of the data.
If None, pd.get_option('io.hdf.default_format') is checked,
followed by fallback to "fixed".
index : bool, default True
Write DataFrame index as a column.
min_itemsize : dict or int, optional
Map column names to minimum string sizes for columns.
nan_rep : Any, optional
How to represent null values as str.
Not allowed with append=True.
dropna : bool, default False, optional
Remove missing values.
data_columns : list of columns or True, optional
List of columns to create as indexed data columns for on-disk
queries, or True to use all columns. By default only the axes
of the object are indexed. See
Query via data columns. for
more information.
Applicable only to format=’table’.
errors : str, default 'strict'
Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list
of options.
read_hdf : Read from HDF file.
DataFrame.to_orc : Write a DataFrame to the binary orc format.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
DataFrame.to_sql : Write to a SQL table.
DataFrame.to_feather : Write out feather-format for DataFrames.
DataFrame.to_csv : Write out to a csv file.
buf : str, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
columns : array-like, optional, default None
The subset of columns to write. Writes all columns by default.
col_space : str or int, list or dict of int or str, optional
The minimum width of each column in CSS length units. An int is assumed to be px units.
header : bool, optional
Whether to print column labels, default True.
index : bool, optional, default True
Whether to print index (row) labels.
na_rep : str, optional, default 'NaN'
String representation of NaN to use.
formatters : list, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns' elements by position or
name.
The result of each function must be a unicode string.
List/tuple must be of length equal to the number of columns.
Formatter function to apply to columns’ elements if they are
floats. This function must return a unicode string and will be
applied only to the non-NaN elements, with NaN being
handled by na_rep.
sparsify : bool, optional, default True
Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row.
index_names : bool, optional, default True
Prints the names of the indexes.
justify : str, default None
How to justify the column labels. If None uses the option from
the print configuration (controlled by set_option), ‘right’ out
of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rows : int, optional
Maximum number of rows to display in the console.
max_cols : int, optional
Maximum number of columns to display in the console.
show_dimensions : bool, default False
Display DataFrame dimensions (number of rows by number of columns).
decimal : str, default '.'
Character recognized as decimal separator, e.g. ',' in Europe.
bold_rows : bool, default True
Make the row labels bold in the output.
classes : str or list or tuple, default None
CSS class(es) to apply to the resulting html table.
escape : bool, default True
Convert the characters <, >, and & to HTML-safe sequences.
notebook : {True, False}, default False
Whether the generated HTML is for IPython Notebook.
border : int
A border=border attribute is included in the opening
<table> tag. Default pd.options.display.html.border.
table_id : str, optional
A css id is included in the opening <table> tag if specified.
‘records’ : list like [{column -> value}, … , {column -> value}]
‘index’ : dict like {index -> {column -> value}}
‘columns’ : dict like {column -> {index -> value}}
‘values’ : just the values array
‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}}
Describing the data, where data component is like orient='records'.
date_format : {None, 'epoch', 'iso'}
Type of date conversion. 'epoch' = epoch milliseconds,
'iso' = ISO8601. The default depends on the orient. For
orient='table', the default is 'iso'. For all other orients,
the default is 'epoch'.
double_precision : int, default 10
The number of decimal places to use when encoding
floating point values. The possible maximal value is 15.
Passing double_precision greater than 15 will raise a ValueError.
force_ascii : bool, default True
Force encoded string to be ASCII.
date_unit : str, default 'ms' (milliseconds)
The time unit to encode to, governs timestamp and ISO8601
precision. One of 's', 'ms', 'us', 'ns' for second, millisecond,
microsecond, and nanosecond respectively.
default_handler : callable, default None
Handler to call if object cannot otherwise be converted to a
suitable format for JSON. Should receive a single argument which is
the object to convert and return a serialisable object.
lines : bool, default False
If 'orient' is 'records' write out line-delimited json format. Will
throw ValueError if incorrect 'orient' since others are not
list-like.
compression : str or dict, default 'infer'
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
index : bool or None, default None
The index is only used when 'orient' is 'split', 'index', 'column',
or 'table'. Of these, 'index' and 'column' do not support
index=False.
indent : int, optional
Length of whitespace used to indent each record.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with "s3://", and "gcs://") the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
mode : str, default 'w' (writing)
Specify the IO mode for output when supplying a path_or_buf.
Accepted args are ‘w’ (writing) and ‘a’ (append) only.
mode=’a’ is only supported when lines is True and orient is ‘records’.
The behavior of indent=0 varies from the stdlib, which does not
indent the output but does insert newlines. Currently, indent=0
and the default indent=None are equivalent in pandas, though this
may change in a future release.
orient='table' contains a ‘pandas_version’ field under ‘schema’.
This stores the version of pandas used in the latest revision of the
schema.
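A minimal sketch of a few orient and compression combinations (the file name is a placeholder):
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                   index=['row 1', 'row 2'], columns=['col 1', 'col 2'])
>>> df.to_json(orient='split')                    # returned as a string when no path_or_buf is given
>>> df.to_json(orient='records', lines=True)      # line-delimited JSON, one record per line
>>> df.to_json('out.json.gz', orient='table', indent=4)   # gzip inferred from the file extension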
buf : str, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
columns : list of label, optional
The subset of columns to write. Writes all columns by default.
header : bool or list of str, default True
Write out the column names. If a list of strings is given,
it is assumed to be aliases for the column names.
index : bool, default True
Write row names (index).
na_rep : str, default 'NaN'
Missing data representation.
formatters : list of functions or dict of {str: function}, optional
Formatter functions to apply to columns' elements by position or
name. The result of each function must be a unicode string.
List must be of length equal to the number of columns.
float_format : one-parameter function or str, optional, default None
Formatter for floating point numbers. For example
float_format="%.2f" and float_format="{:0.2f}".format will
both result in 0.1234 being formatted as 0.12.
sparsify : bool, optional
Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row. By default, the value will be
read from the config module.
index_names : bool, default True
Prints the names of the indexes.
bold_rows : bool, default False
Make the row labels bold in the output.
column_format : str, optional
The columns format as specified in LaTeX table format e.g. 'rcl' for 3
columns. By default, 'l' will be used for all columns except
columns of numbers, which default to 'r'.
longtable : bool, optional
Use a longtable environment instead of tabular. Requires
adding a \usepackage{longtable} to your LaTeX preamble.
By default, the value will be read from the pandas config
module, and set to True if the option styler.latex.environment is
“longtable”.
Changed in version 2.0.0: The pandas option affecting this argument has changed.
escape : bool, optional
By default, the value will be read from the pandas config
module and set to True if the option styler.format.escape is
“latex”. When set to False prevents from escaping latex special
characters in column names.
Changed in version 2.0.0: The pandas option affecting this argument has changed, as has the
default value to False.
encoding : str, optional
A string representing the encoding to use in the output file,
defaults to 'utf-8'.
decimal : str, default '.'
Character recognized as decimal separator, e.g. ',' in Europe.
multicolumn : bool, default True
Use multicolumn to enhance MultiIndex columns.
The default will be read from the config module, and is set
as the option styler.sparse.columns.
Changed in version 2.0.0: The pandas option affecting this argument has changed.
multicolumn_format : str, default 'r'
The alignment for multicolumns, similar to column_format
The default will be read from the config module, and is set as the option
styler.latex.multicol_align.
Changed in version 2.0.0: The pandas option affecting this argument has changed, as has the
default value to "r".
multirow : bool, default True
Use multirow to enhance MultiIndex rows. Requires adding a
\usepackage{multirow} to your LaTeX preamble. Will print
centered labels (instead of top-aligned) across the contained
rows, separating groups via clines. The default will be read
from the pandas config module, and is set as the option
styler.sparse.index.
Changed in version 2.0.0: The pandas option affecting this argument has changed, as has the
default value to True.
caption : str or tuple, optional
Tuple (full_caption, short_caption),
which results in \caption[short_caption]{full_caption};
if a single string is passed, no short caption will be set.
label : str, optional
The LaTeX label to be placed inside \label{} in the output.
This is used with \ref{} in the main .tex file.
position : str, optional
The LaTeX positional argument for tables, to be placed after
\begin{} in the output.
As of v2.0.0 this method has changed to use the Styler implementation as
part of Styler.to_latex() via jinja2 templating. This means
that jinja2 is a requirement, and needs to be installed, for this method
to function. It is advised that users switch to using Styler, since that
implementation is more frequently updated and contains much more
flexibility with the output.
buf : str, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
mode : str, optional
Mode in which file is opened, "wt" by default.
index : bool, optional, default True
Add index (row) labels.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
By default, the dtype of the returned array will be the common NumPy
dtype of all types in the DataFrame. For example, if the dtypes are
float16 and float32, the results dtype will be float32.
This may require copying data and coercing values, which may be
expensive.
copy : bool, default False
Whether to ensure that the returned value is not a view on
another array. Note that copy=False does not ensure that
to_numpy() is no-copy. Rather, copy=True ensures that
a copy is made, even if not strictly necessary.
na_value : Any, optional
The value to use for missing values. The default value depends
on dtype and the dtypes of the DataFrame columns.
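A short sketch of dtype, copy and na_value (illustrative frames, outputs omitted):
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.5]})
>>> df.to_numpy()                              # common dtype float64 for this mixed int/float frame
>>> df.to_numpy(dtype='float32', copy=True)    # force a dtype and an explicit copy
>>> pd.DataFrame({'A': [1.0, np.nan]}).to_numpy(na_value=0.0)   # substitute missing values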
If a string, it will be used as Root Directory path
when writing a partitioned dataset. By file-like object,
we refer to objects with a write() method, such as a file handle
(e.g. via builtin open function). If path is None,
a bytes object is returned.
engine : {'pyarrow'}, default 'pyarrow'
ORC library to use.
index : bool, optional
If True, include the dataframe's index(es) in the file output.
If False, they will not be written to the file.
If None, similar to infer, the dataframe's index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn't require much space and is faster. Other indexes will
be included as columns in the file output.
engine_kwargs : dict[str, Any] or None, default None
Additional keyword arguments passed to pyarrow.orc.write_table().
read_orc : Read an ORC file.
DataFrame.to_parquet : Write a parquet file.
DataFrame.to_csv : Write a csv file.
DataFrame.to_sql : Write to a sql table.
DataFrame.to_hdf : Write to hdf.
This function writes the dataframe as a parquet file. You can choose different parquet
backends, and have the option of compression. See
the user guide for more details.
path : str, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. If None, the result is
returned as bytes. If a string or path, it will be used as Root Directory
path when writing a partitioned dataset.
engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
Parquet library to use. If 'auto', then the option
io.parquet.engine is used. The default io.parquet.engine
behavior is to try 'pyarrow', falling back to 'fastparquet' if
'pyarrow' is unavailable.
compression : str or None, default 'snappy'
Name of the compression to use. Use None for no compression.
Supported options: 'snappy', 'gzip', 'brotli', 'lz4', 'zstd'.
index : bool, default None
If True, include the dataframe's index(es) in the file output.
If False, they will not be written to the file.
If None, similar to True, the dataframe's index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn't require much space and is faster. Other indexes will
be included as columns in the file output.
partition_cols : list, optional, default None
Column names by which to partition the dataset.
Columns are partitioned in the order they are given.
Must be None if path is not a string.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
read_parquet : Read a parquet file.
DataFrame.to_orc : Write an orc file.
DataFrame.to_csv : Write a csv file.
DataFrame.to_sql : Write to a sql table.
DataFrame.to_hdf : Write to hdf.
If you want to get a buffer to the parquet content you can use a io.BytesIO
object, as long as you don’t use partition_cols, which creates multiple files.
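A brief sketch of writing to a path and to an in-memory buffer (requires pyarrow or fastparquet; the file name is a placeholder):
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip', compression='gzip')
>>> import io
>>> buf = io.BytesIO()                 # a buffer works as long as partition_cols is not used
>>> df.to_parquet(buf)
>>> pd.read_parquet(buf)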
If False then underlying input data is not copied.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. File path where
the pickled object will be stored.
compression : str or dict, default 'infer'
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
protocol : int
Int which indicates which protocol should be used by the pickler,
default HIGHEST_PROTOCOL (see [1]_ paragraph 12.1.2). The possible
values are 0, 1, 2, 3, 4, 5. A negative value for the protocol
parameter is equivalent to setting its value to HIGHEST_PROTOCOL.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
read_pickle : Load pickled pandas object (or any object) from file.
DataFrame.to_hdf : Write DataFrame to an HDF5 file.
DataFrame.to_sql : Write DataFrame to a SQL database.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
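A hedged round-trip sketch of the compression options described above (the file name is arbitrary):
>>> import pandas as pd
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> # gzip is inferred from the ".gz" suffix; compresslevel and mtime are forwarded to gzip.GzipFile
>>> original_df.to_pickle(
...     "./dummy.pkl.gz",
...     compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
... )
>>> unpickled_df = pd.read_pickle("./dummy.pkl.gz")
>>> unpickled_df.equals(original_df)
True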
Include index in resulting record array, stored in ‘index’
field or using the index label, if set.
column_dtypes : str, type, dict, default None
If a string or type, the data type to store all columns. If
a dictionary, a mapping of column names and indices (zero-indexed)
to specific data types.
index_dtypes : str, type, dict, default None
If a string or type, the data type to store all index levels. If
a dictionary, a mapping of index level names and indices
(zero-indexed) to specific data types.
con : sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects. The user
is responsible for engine disposal and connection closure for the SQLAlchemy
connectable. See here.
If passing a sqlalchemy.engine.Connection which is already in a transaction,
the transaction will not be committed. If passing a sqlite3.Connection,
it will not be possible to roll back the record insertion.
schema : str, optional
Specify the schema (if database flavor supports this). If None, use
default schema.
replace: Drop the table before inserting new values.
append: Insert new values to the existing table.
index : bool, default True
Write DataFrame index as a column. Uses index_label as the column
name in the table. Creates a table index for this column.
index_label : str or sequence, default None
Column label for index column(s). If None is given (default) and
index is True, then the index names are used.
A sequence should be given if the DataFrame uses MultiIndex.
chunksize : int, optional
Specify the number of rows in each batch to be written at a time.
By default, all rows will be written at once.
dtype : dict or scalar, optional
Specifying the datatype for columns. If a dictionary is used, the
keys should be the column names and the values should be the
SQLAlchemy types or strings for the sqlite3 legacy mode. If a
scalar is provided, it will be applied to all columns.
method : {None, ‘multi’, callable}, optional
Controls the SQL insertion clause used:
None : Uses standard SQL INSERT clause (one per row).
‘multi’: Pass multiple values in a single INSERT clause.
callable with signature (pd_table,conn,keys,data_iter).
Details and a sample callable implementation can be found in the
section insert method.
Number of rows affected by to_sql. None is returned if the callable
passed into method does not return an integer number of rows.
The number of returned rows affected is the sum of the rowcount
attribute of sqlite3.Cursor or the SQLAlchemy connectable, which may not
reflect the exact number of written rows as stipulated in the
sqlite3 or SQLAlchemy documentation.
Timezone aware datetime columns will be written as
Timestamp with timezone type with SQLAlchemy if supported by the
database. Otherwise, the datetimes will be stored as timezone unaware
timestamps local to the original timezone.
Not all datastores support method="multi". Oracle, for example,
does not support multi-value insert.
Use method to define a callable insertion method to do nothing
if there’s a primary key conflict on a table in a PostgreSQL database.
>>> from sqlalchemy.dialects.postgresql import insert
>>> def insert_on_conflict_nothing(table, conn, keys, data_iter):
...     # "a" is the primary key in "conflict_table"
...     data = [dict(zip(keys, row)) for row in data_iter]
...     stmt = insert(table.table).values(data).on_conflict_do_nothing(index_elements=["a"])
...     result = conn.execute(stmt)
...     return result.rowcount
>>> df_conflict.to_sql(name="conflict_table", con=conn, if_exists="append",
...                    method=insert_on_conflict_nothing)
0
For MySQL, a callable to update columns b and c if there’s a conflict
on a primary key.
Specify the dtype (especially useful for integers with missing values).
Notice that while pandas is forced to store the data as floating point,
the database supports nullable integers. When fetching the data with
Python, we get back integer scalars.
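A small sketch of the basic flow above, using an in-memory SQLite connection (the table and column names are illustrative; the integer return value assumes the default row insertion method):
>>> import sqlite3
>>> import pandas as pd
>>> conn = sqlite3.connect(":memory:")
>>> df = pd.DataFrame({"name": ["User 1", "User 2"], "score": [1.5, 2.5]})
>>> df.to_sql(name="users", con=conn, if_exists="replace", index=False)  # rows written
2
>>> pd.read_sql("SELECT * FROM users", conn)
     name  score
0  User 1    1.5
1  User 2    2.5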
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function.
convert_dates : dict
Dictionary mapping columns containing datetime types to stata
internal format to use when writing the dates. Options are ‘tc’,
‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer
or a name. Datetime columns that do not have a conversion type
specified will be converted to ‘tc’. Raises NotImplementedError if
a datetime column has timezone information.
write_index : bool
Write the index to Stata dataset.
byteorder : str
Can be “>”, “<”, “little”, or “big”. default is sys.byteorder.
time_stamp : datetime
A datetime to use as file creation date. Default is the current
time.
data_label : str, optional
A label for the data set. Must be 80 characters or smaller.
variable_labels : dict
Dictionary containing columns as keys and variable labels as
values. Each label must be 80 characters or smaller.
version : {114, 117, 118, 119, None}, default 114
Version to use in the output dta file. Set to None to let pandas
decide between 118 or 119 formats depending on the number of
columns in the frame. Version 114 can be read by Stata 10 and
later. Version 117 can be read by Stata 13 or later. Version 118
is supported in Stata 14 and later. Version 119 is supported in
Stata 15 and later. Version 114 limits string variables to 244
characters or fewer while versions 117 and later allow strings
with lengths up to 2,000,000 characters. Versions 118 and 119
support Unicode characters, and version 119 supports more than
32,767 variables.
Version 119 should usually only be used when the number of
variables exceeds the capacity of dta format 118. Exporting
smaller datasets in format 119 may have unintended consequences,
and, as of November 2020, Stata SE cannot read version 119 files.
convert_strl : list, optional
List of column names to convert to string columns to Stata StrL
format. Only available if version is 117. Storing strings in the
StrL format can produce smaller dta files if strings have more than
8 characters and values are repeated.
compression : str or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
value_labels : dict of dicts
Dictionary containing columns as keys and dictionaries of column value
to labels as values. Labels for a single variable must be 32,000
characters or smaller.
read_stata : Import Stata data files.
io.stata.StataWriter : Low-level writer for Stata data files.
io.stata.StataWriter117 : Low-level writer for version 117 files.
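A brief sketch of writing a Stata file with a date conversion (the file name and column names are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "animal": ["falcon", "parrot"],
...     "observed": pd.to_datetime(["2024-01-01", "2024-01-02"]),
... })
>>> # 'td' stores the datetime column at daily resolution
>>> df.to_stata("animals.dta", convert_dates={"observed": "td"}, write_index=False)
>>> pd.read_stata("animals.dta")["animal"].tolist()
['falcon', 'parrot']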
bufstr, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
columnsarray-like, optional, default None
The subset of columns to write. Writes all columns by default.
col_space : int, list or dict of int, optional
The minimum width of each column. If a list of ints is given, each integer corresponds to one column. If a dict is given, the key references the column, while the value defines the space to use.
headerbool or list of str, optional
Write out the column names. If a list of columns is given, it is assumed to be aliases for the column names.
indexbool, optional, default True
Whether to print index (row) labels.
na_repstr, optional, default ‘NaN’
String representation of NaN to use.
formatterslist, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns’ elements by position or
name.
The result of each function must be a unicode string.
List/tuple must be of length equal to the number of columns.
float_format : one-parameter function, optional, default None
Formatter function to apply to columns’ elements if they are
floats. This function must return a unicode string and will be
applied only to the non-NaN elements, with NaN being
handled by na_rep.
sparsifybool, optional, default True
Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row.
index_namesbool, optional, default True
Prints the names of the indexes.
justifystr, default None
How to justify the column labels. If None uses the option from
the print configuration (controlled by set_option), ‘right’ out
of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rowsint, optional
Maximum number of rows to display in the console.
max_colsint, optional
Maximum number of columns to display in the console.
show_dimensionsbool, default False
Display DataFrame dimensions (number of rows by number of columns).
decimalstr, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe.
line_widthint, optional
Width to wrap a line in characters.
min_rowsint, optional
The number of rows to display in the console in a truncated repr
(when number of rows is above max_rows).
max_colwidthint, optional
Max width to truncate each column in characters. By default, no limit.
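A short sketch combining a few of these formatting parameters (the column names are illustrative; only a substring check is shown to keep the output version-independent):
>>> import pandas as pd
>>> df = pd.DataFrame({"price": [1.5, None, 3.25], "qty": [10, 20, 30]})
>>> text = df.to_string(na_rep="-", float_format="{:.2f}".format, index=False)
>>> "1.50" in text and "-" in text  # floats use two decimals, NaN is rendered as '-'
True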
Convention for converting period to timestamp; start of period
vs. end.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert (the index by default).
copybool, default True
If False then underlying input data is not copied.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
path_or_bufferstr, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like
object implementing a write() function. If None, the result is returned
as a string.
indexbool, default True
Whether to include index in XML document.
root_namestr, default ‘data’
The name of root element in XML document.
row_namestr, default ‘row’
The name of row element in XML document.
na_repstr, optional
Missing data representation.
attr_colslist-like, optional
List of columns to write as attributes in row element.
Hierarchical columns will be flattened with underscore
delimiting the different levels.
elem_colslist-like, optional
List of columns to write as children in row element. By default,
all columns output as children of row element. Hierarchical
columns will be flattened with underscore delimiting the
different levels.
namespacesdict, optional
All namespaces to be defined in root element. Keys of dict
should be prefix names and values of dict corresponding URIs.
Default namespaces should be given empty string key. For
example,
namespaces={"":"https://example.com"}
prefixstr, optional
Namespace prefix to be used for every element and/or attribute
in document. This should be one of the keys in namespaces
dict.
encodingstr, default ‘utf-8’
Encoding of the resulting document.
xml_declarationbool, default True
Whether to include the XML declaration at start of document.
pretty_printbool, default True
Whether output should be pretty printed with indentation and
line breaks.
parser{‘lxml’,’etree’}, default ‘lxml’
Parser module to use for building of tree. Only ‘lxml’ and
‘etree’ are supported. With ‘lxml’, the ability to use XSLT
stylesheet is supported.
stylesheetstr, path object or file-like object, optional
A URL, file-like object, or a raw string containing an XSLT
script used to transform the raw XML output. Script should use
layout of elements and attributes from original output. This
argument requires lxml to be installed. Only XSLT 1.0
scripts are currently supported; later versions are not.
compressionstr or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buffer’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
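A minimal sketch of these XML options (root_name, row_name and the frame contents are illustrative; parser='etree' avoids the lxml dependency, and only a substring check is shown to stay version-independent):
>>> import pandas as pd
>>> df = pd.DataFrame({"shape": ["square", "circle"], "sides": [4, None]})
>>> xml = df.to_xml(index=False, root_name="shapes", row_name="shape_row",
...                 na_rep="missing", parser="etree")
>>> xml.startswith("<?xml") and "<shapes>" in xml and "missing" in xml
True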
Function to use for transforming the data. If a function, must either
work when passed a DataFrame or when passed to DataFrame.apply. If func
is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g. [np.exp,'sqrt']
dict-like of axis labels -> functions, function names or list-like of such.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’: apply function to each column.
If 1 or ‘columns’: apply function to each row.
>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
Whether to copy the data after transposing, even for DataFrames
with a single dtype.
Note that a copy is always required for mixed dtype DataFrames,
or for DataFrames with any extension types.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
Transposing a DataFrame with mixed dtypes will result in a homogeneous
DataFrame with the object dtype. In such a case, a copy of the data
is always made.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Axis to truncate. Truncates the index (rows) by default.
For Series this parameter is unused and defaults to 0.
copy : bool, default True
Return a copy of the truncated section.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
...                    'B': ['f', 'g', 'h', 'i', 'j'],
...                    'C': ['k', 'l', 'm', 'n', 'o']},
...                   index=[1, 2, 3, 4, 5])
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o
>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n
The columns of a DataFrame can be truncated.
>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j
For Series, only rows can be truncated.
>>> df['A'].truncate(before=2, after=4)
2    b
3    c
4    d
Name: A, dtype: object
The index values in truncate can be datetimes or string
dates.
Because the index is a DatetimeIndex containing only dates, we can
specify before and after as strings. They will be coerced to
Timestamps before truncation.
Note that truncate assumes a 0 value for any unspecified time
component (midnight). This differs from partial string slicing, which
returns any partially matching dates.
Target time zone. Passing None will convert to
UTC and remove the timezone information.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert
levelint, str, default None
If axis is a MultiIndex, convert a specific level. Otherwise
must be None.
copybool, default True
Also make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
Time zone to localize. Passing None will remove the
time zone information and preserve local time.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to localize
level : int, str, default None
If axis is a MultiIndex, localize a specific level. Otherwise
must be None.
copybool, default True
Also make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
When clocks moved backward due to DST, ambiguous times may arise.
For example in Central European Time (UTC+01), when going from
03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at
00:30:00 UTC and at 01:30:00 UTC. In such a situation, the
ambiguous parameter dictates how ambiguous times should be
handled.
‘infer’ will attempt to infer fall dst-transition hours based on
order
bool-ndarray where True signifies a DST time, False designates
a non-DST time (note that this flag is only applicable for
ambiguous times)
‘NaT’ will return NaT where there are ambiguous times
‘raise’ will raise an AmbiguousTimeError if there are ambiguous
times.
nonexistentstr, default ‘raise’
A nonexistent time does not exist in a particular timezone
where clocks moved forward due to DST. Valid values are:
‘shift_forward’ will shift the nonexistent time forward to the
closest existing time
‘shift_backward’ will shift the nonexistent time backward to the
closest existing time
‘NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise a NonExistentTimeError if there are
nonexistent times.
If the DST transition causes nonexistent times, you can shift these
dates forward or backward with a timedelta object or ‘shift_forward’
or ‘shift_backward’.
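As an illustration of the nonexistent options above (this mirrors the standard spring-forward example for Europe/Warsaw):
>>> import pandas as pd
>>> s = pd.Series(range(2),
...               index=pd.DatetimeIndex(['2015-03-29 02:30:00',
...                                       '2015-03-29 03:30:00']))
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
2015-03-29 03:00:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1h'))
2015-03-29 03:30:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64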
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)
>>> s
one  a    1.0
     b    2.0
two  a    3.0
     b    4.0
dtype: float64
>>> s.unstack(level=-1)
       a    b
one  1.0  2.0
two  3.0  4.0
>>> s.unstack(level=0)
   one  two
a  1.0  3.0
b  2.0  4.0
>>> df = s.unstack(level=0)
>>> df.unstack()
one  a    1.0
     b    2.0
two  a    3.0
     b    4.0
dtype: float64
other : DataFrame, or object coercible into a DataFrame
Should have at least one matching index/column label
with the original DataFrame. If a Series is passed,
its name attribute must be set, and that will be
used as the column name to align with the original DataFrame.
join{‘left’}, default ‘left’
Only left join is implemented, keeping the index and columns of the
original object.
overwritebool, default True
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values
with values from other.
False: only update values that are NA in
the original DataFrame.
The DataFrame’s length does not increase as a result of the update,
only values at matching index/column labels are updated.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'f']}, index=[0, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  y
2  c  f
For Series, its name attribute must be set.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e', 'f'], name='B')
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  e
2  c  f
If other contains NaNs the corresponding values are not updated
in the original dataframe.
The returned Series will have a MultiIndex with one level per input
column but an Index (non-multi) for a single label. By default, rows
that contain any NA values are omitted from the result. By default,
the resulting Series will be in descending order so that the first
element is the most frequently-occurring row.
With dropna set to False we can also count rows with NA values.
>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
>>> df
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
>>> df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
Name: count, dtype: int64
>>> df.value_counts(dropna=False)
first_name  middle_name
Anne        NaN            1
Beth        Louise         1
John        Smith          1
            NaN            1
Name: count, dtype: int64
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.var with axis=None is deprecated,
in a future version this will reduce over both axes and return a scalar
To retain the old behavior, pass axis=0 (or do not pass axis).
skipnabool, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
ddofint, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
numeric_onlybool, default False
Include only float, int, boolean columns. Not implemented for Series.
condbool Series/DataFrame, array-like, or callable
Where cond is True, keep the original value. Where
False, replace with corresponding value from other.
If cond is callable, it is computed on the Series/DataFrame and
should return boolean Series/DataFrame or array. The callable must
not change input Series/DataFrame (though pandas doesn’t check it).
otherscalar, Series/DataFrame, or callable
Entries where cond is False are replaced with
corresponding value from other.
If other is callable, it is computed on the Series/DataFrame and
should return scalar or Series/DataFrame. The callable must not
change input Series/DataFrame (though pandas doesn’t check it).
If not specified, entries will be filled with the corresponding
NULL value (np.nan for numpy dtypes, pd.NA for extension
dtypes).
inplacebool, default False
Whether to perform the operation in place on the data.
axisint, default None
Alignment axis if needed. For Series this parameter is
unused and defaults to 0.
The where method is an application of the if-then idiom. For each
element in the calling DataFrame, if cond is True the
element is used; otherwise the corresponding element from the DataFrame
other is used. If the axis of other does not align with axis of
cond Series/DataFrame, the misaligned index positions will be filled with
False.
The signature for DataFrame.where() differs from
numpy.where(). Roughly df1.where(m,df2) is equivalent to
np.where(m,df1,df2).
For further details and examples see the where documentation in
indexing.
The dtype of the object takes precedence. The fill value is casted to
the object’s dtype, if this can be done losslessly.
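A minimal sketch of the if-then idiom described above, shown on a Series:
>>> import pandas as pd
>>> s = pd.Series(range(5))
>>> s.where(s > 1, 10)  # keep values where the condition holds, otherwise use 10
0    10
1    10
2     2
3     3
4     4
dtype: int64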
Even when the index of other is the same as the index of the DataFrame,
the Series will not be reoriented. If index-wise alignment is desired,
DataFrame.add() should be used with axis=’index’.
>>> s2 = pd.Series([0.5, 1.5], index=['elk', 'moose'])
>>> df[['height', 'weight']] + s2
       elk  height  moose  weight
elk    NaN     NaN    NaN     NaN
moose  NaN     NaN    NaN     NaN
Export the pandas DataFrame as an Arrow C stream PyCapsule.
This relies on pyarrow to convert the pandas DataFrame to the Arrow
format (and follows the default behaviour of pyarrow.Table.from_pandas
in its handling of the index, i.e. store the index as a column except
for RangeIndex).
This conversion is not necessarily zero-copy.
t : str, the type of setting error
force : bool, default False
If True, then force showing an error.
validate if we are doing a setitem on a chained copy.
It is technically possible to figure out that we are setting on
a copy even WITH a multi-dtyped pandas object. In other words, some
blocks may be views while other are not. Currently _is_view will ALWAYS
return False for multi-blocks to avoid having to handle this case.
# This technically need not raise SettingWithCopy if both are view
# (which is not generally guaranteed but is usually True. However,
# this is in general not a good practice and we recommend using .loc.
df.iloc[0:5]['group'] = 'a'
Ensures new columns (which go into the BlockManager as new blocks) are
always copied (or a reference is being tracked to them under CoW)
and converted into an array.
Internal version of the take method that sets the _is_copy
attribute to keep track of the parent dataframe (using in indexing
for the SettingWithCopyWarning).
For Series this does the same as the public take (it never sets _is_copy).
See the docstring of take for full explanation of the parameters.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Add a new solver to the dataframe. Initializes value to None by default.
Args:
solver_name: The name of the solver to be added.
configurations: A list of configuration keys for the solver.
initial_value: The value assigned for each index of the new solver.
If not None, must match the index dimension (n_obj * n_inst * n_runs).
DataFrame.apply : Perform any type of operations.
DataFrame.transform : Perform transformation type operations.
pandas.DataFrame.groupby : Perform operations over groups.
pandas.DataFrame.resample : Perform operations over resampled bins.
pandas.DataFrame.rolling : Perform operations over rolling window.
pandas.DataFrame.expanding : Perform operations over expanding window.
pandas.core.window.ewm.ExponentialMovingWindow : Perform operation over exponential
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to
numpy.mean(arr_2d,axis=0).
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected
behavior or errors and are not supported. See gotchas.udf-mutation
for more details.
A passed user-defined-function will be passed a Series for evaluation.
DataFrame.apply : Perform any type of operations.
DataFrame.transform : Perform transformation type operations.
pandas.DataFrame.groupby : Perform operations over groups.
pandas.DataFrame.resample : Perform operations over resampled bins.
pandas.DataFrame.rolling : Perform operations over rolling window.
pandas.DataFrame.expanding : Perform operations over expanding window.
pandas.core.window.ewm.ExponentialMovingWindow : Perform operation over exponential
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to
numpy.mean(arr_2d,axis=0).
agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected
behavior or errors and are not supported. See gotchas.udf-mutation
for more details.
A passed user-defined-function will be passed a Series for evaluation.
other : DataFrame or Series
join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
Type of alignment to be performed.
left: use only keys from left frame, preserve key order.
right: use only keys from right frame, preserve key order.
outer: use union of keys from both frames, sort keys lexicographically.
inner: use intersection of keys from both frames,
preserve the order of the left keys.
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None).
levelint or level name, default None
Broadcast across a level, matching Index values on the
passed MultiIndex level.
copybool, default True
Always returns new objects. If copy=False and no reindexing is
required then original objects are returned.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
fill_valuescalar, default np.nan
Value to use for missing values. Defaults to NaN, but can be any
“compatible” value.
Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
Deprecated since version 2.1.
limitint, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
Deprecated since version 2.1.
fill_axis{0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame, default 0
Filling axis, method and limit.
Deprecated since version 2.1.
broadcast_axis{0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame, default None
Broadcast values along this axis, if aligning two objects of
different dimensions.
>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
     A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900
Align on columns:
>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
     A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN
We can also align on the index:
>>> left, right = df.align(other, join="outer", axis=0)
>>> left
     D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
       A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0
Finally, the default axis=None will align on both index and columns:
>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
axis{0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced. For Series this parameter
is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the
original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the
original index.
None : reduce all axes, return a scalar.
bool_onlybool, default False
Include only boolean columns. Not implemented for Series.
skipnabool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is
True, then the result will be True, as for an empty row/column.
If skipna is False, then NA are treated as True, because these are not
equal to zero.
axis{0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced. For Series this parameter
is unused and defaults to 0.
0 / ‘index’ : reduce the index, return a Series whose index is the
original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the
original index.
None : reduce all axes, return a scalar.
bool_onlybool, default False
Include only boolean columns. Not implemented for Series.
skipnabool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is
True, then the result will be False, as for an empty row/column.
If skipna is False, then NA are treated as True, because these are not
equal to zero.
numpy.any : Numpy version of this method.
Series.any : Return whether any element is True.
Series.all : Return whether all elements are True.
DataFrame.any : Return whether any element is True over requested axis.
DataFrame.all : Return whether all elements are True over requested axis.
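A short sketch of how the axis argument changes the reduction (column names are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"col1": [True, True], "col2": [True, False]})
>>> df.all()            # reduce the index: one value per column
col1     True
col2    False
dtype: bool
>>> df.all(axis=1)      # reduce the columns: one value per row
0     True
1    False
dtype: bool
>>> df.all(axis=None)   # reduce both axes to a scalar
False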
Objects passed to the function are Series objects whose index is
either the DataFrame’s index (axis=0) or the DataFrame’s columns
(axis=1). By default (result_type=None), the final return type
is inferred from the return type of the applied function. Otherwise,
it depends on the result_type argument.
Determines if row or column is passed as a Series or ndarray object:
False : passes each row or column as a Series to the
function.
True : the passed function will receive ndarray objects
instead.
If you are just applying a NumPy reduction function this will
achieve much better performance.
‘expand’ : list-like results will be turned into columns.
‘reduce’ : returns a Series if possible rather than expanding
list-like results. This is the opposite of ‘expand’.
‘broadcast’ : results will be broadcast to the original shape
of the DataFrame, the original index and columns will be
retained.
The default behaviour (None) depends on the return value of the
applied function: list-like results will be returned as a Series
of those. However if the apply function returns a Series these
are expanded to columns.
argstuple
Positional arguments to pass to func in addition to the
array/series.
by_row : False or “compat”, default “compat”
Only has an effect when func is a listlike or dictlike of funcs
and the func isn’t a string.
If “compat”, will if possible first translate the func into pandas
methods (e.g. Series().apply(np.sum) will be translated to
Series().sum()). If that doesn’t work, it will try to call apply again with
by_row=True and if that fails, will call apply again with
by_row=False (backward compatible).
If False, the funcs will be passed the whole Series at once.
Added in version 2.1.0.
engine{‘python’, ‘numba’}, default ‘python’
Choose between the python (default) engine or the numba engine in apply.
The numba engine will attempt to JIT compile the passed function,
which may result in speedups for large DataFrames.
It also supports the following engine_kwargs :
nopython (compile the function in nopython mode)
nogil (release the GIL inside the JIT compiled function)
parallel (try to apply the function in parallel over the DataFrame)
Note: Due to limitations within numba/how pandas interfaces with numba,
you should only use this if raw=True
Note: The numba compiler only supports a subset of
valid Python/numpy operations.
Pass keyword arguments to the engine.
This is currently only used by the numba engine,
see the documentation for the engine argument for more information.
DataFrame.map: For elementwise operations.
DataFrame.aggregate: Only perform aggregating type operations.
DataFrame.transform: Only perform transforming type operations.
Passing result_type='broadcast' will ensure the same shape
result, whether list-like or scalar is returned by the function,
and broadcast it along the axis. The resulting column names will
be the originals.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
   A  B
0  1  2
1  1  2
2  1  2
DataFrame.apply : Apply a function along input axis of DataFrame.
DataFrame.map : Apply a function along input axis of DataFrame.
DataFrame.replace: Replace values given in to_replace with value.
Returns the original data conformed to a new index with the specified
frequency.
If the index of this Series/DataFrame is a PeriodIndex, the new index
is the result of transforming the original index with
PeriodIndex.asfreq (so the original index
will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq), where start and end are, respectively, the first and
last entries in the original index (see pandas.date_range()). The
values corresponding to any timesteps in the new index which were not present
in the original index will be null (NaN), unless a method for filling
such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of
timesteps (such as an aggregate) is necessary to represent the data at the new
frequency.
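A brief sketch of upsampling to a higher frequency, where new timesteps are filled with NaN unless a fill method or fill_value is given:
>>> import pandas as pd
>>> index = pd.date_range('1/1/2000', periods=4, freq='min')
>>> df = pd.DataFrame({'s': [0.0, None, 2.0, 3.0]}, index=index)
>>> df.asfreq(freq='30s')
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  NaN
2000-01-01 00:03:00  3.0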
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any
NaN is taken.
In case of a DataFrame, the last row without NaN
considering only the subset of columns (if not None)
If there is no good value, NaN is returned for a Series or
a Series of NaN values for a DataFrame
The column names are keywords. If the values are
callable, they are computed on the DataFrame and
assigned to the new columns. The callable must not
change input DataFrame (though pandas doesn’t check it).
If the values are not callable, (e.g. a Series, scalar, or array),
they are simply assigned.
Assigning multiple columns within the same assign is possible.
Later items in ‘**kwargs’ may refer to newly created or modified
columns in ‘df’; items are computed and assigned into ‘df’ in order.
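A minimal sketch of the callable form described above (the column names are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]}, index=['Portland', 'Berkeley'])
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0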
dtypestr, data type, Series or Mapping of column name -> data type
Use a str, numpy.dtype, pandas.ExtensionDtype or Python type to
cast entire pandas object to the same type. Alternatively, use a
mapping, e.g. {col: dtype, …}, where col is a column label and dtype is
a numpy.dtype or Python type to cast one or more of the DataFrame’s
columns to column-specific types.
copybool, default True
Return a copy when copy=True (be very careful setting
copy=False as changes to values then may propagate to other
pandas objects).
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
errors{‘raise’, ‘ignore’}, default ‘raise’
Control raising of exceptions on invalid data for provided dtype.
raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object.
to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to a numeric type.
numpy.ndarray.astype : Cast a numpy array to a specified type.
Changed in version 2.0.0: Using astype to convert from timezone-naive dtype to
timezone-aware dtype will raise an exception.
Use Series.dt.tz_localize() instead.
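A short sketch of casting a single column with the mapping form of dtype:
>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object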
between_time : Select values between particular times of the day.
first : Select initial periods of time series based on a date offset.
last : Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time : Get just the index locations for
Return the best configuration for the given objective over the instances.
Args:
solver: The solver for which we determine the best configuration
objective: The objective for which we calculate the best configuration
instances: The instances which should be selected for the evaluation
Returns:
The best configuration id and its aggregated performance.
Return the best performance for each instance in the portfolio.
Args:
objective: The objective for which we calculate the best performance
instances: The instances which should be selected for the evaluation
run_id: The run for which we calculate the best performance. If None,
we consider all runs.
exclude_solvers: List of (solver, config_id) to exclude in the calculation.
Returns:
The best performance for each instance in the portfolio.
at_time : Select values at a particular time of the day.
first : Select initial periods of time series based on a date offset.
last : Select final periods of time series based on a date offset.
DatetimeIndex.indexer_between_time : Get just the index locations for
axis{0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame
Axis along which to fill missing values. For Series
this parameter is unused and defaults to 0.
inplacebool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limitint, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
If limit is specified, consecutive NaNs will be filled with this
restriction.
None: No fill restriction.
‘inside’: Only fill NaNs surrounded by valid values
(interpolate).
‘outside’: Only fill NaNs outside valid values (extrapolate).
Added in version 2.2.0.
downcastdict, default is None
A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
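These parameters belong to the forward/backward fill family; assuming ffill, a minimal sketch of the limit behaviour described above:
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
>>> s.ffill(limit=1)  # only the first NaN of the gap is filled
0    1.0
1    1.0
2    NaN
3    NaN
4    5.0
dtype: float64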
Return the bool of a single element Series or DataFrame.
Deprecated since version 2.1.0: bool is deprecated and will be removed in future version of pandas.
For Series use pandas.Series.item.
This must be a boolean scalar value, either True or False. It will raise a
ValueError if the Series or DataFrame does not have exactly 1 element, or that
element is not boolean (integer values 0 and 1 will also raise an exception).
Series.astype : Change the data type of a Series, including to boolean.
DataFrame.astype : Change the data type of a DataFrame, including to boolean.
numpy.bool_ : NumPy boolean data type, used by pandas for boolean values.
Make a box-and-whisker plot from DataFrame columns, optionally grouped
by some other columns. A box plot is a method for graphically depicting
groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data,
with a line at the median (Q2). The whiskers extend from the edges
of box to show the range of the data. By default, they extend no more than
1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest
data point within that interval. Outliers are plotted as separate dots.
For further details see
Wikipedia’s entry for boxplot.
Column name or list of names, or vector.
Can be any valid input to pandas.DataFrame.groupby().
bystr or array-like, optional
Column in the DataFrame to pandas.DataFrame.groupby().
One box-plot will be done per value of columns in by.
ax : object of class matplotlib.axes.Axes, optional
The matplotlib axes to be used by boxplot.
fontsizefloat or str
Tick label font size in points or as a string (e.g., large).
rotfloat, default 0
The rotation angle of labels (in degrees)
with respect to the screen coordinate system.
gridbool, default True
Setting this to True will show the grid.
figsize : a tuple (width, height) in inches
The size of the figure to create in matplotlib.
layouttuple (rows, columns), optional
For example, (3, 5) will display the subplots
using 3 rows and 5 columns, starting from the top-left.
return_type{‘axes’, ‘dict’, ‘both’} or None, default ‘axes’
The kind of object to return. The default is axes.
‘axes’ returns the matplotlib axes the boxplot is drawn on.
‘dict’ returns a dictionary whose values are the matplotlib
Lines of the boxplot.
‘both’ returns a namedtuple with the axes and dict.
when grouping with by, a Series mapping columns to
return_type is returned.
If return_type is None, a NumPy array
of axes with the same shape as layout is returned.
backendstr, default None
Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set
pd.options.plotting.backend.
The return type depends on the return_type parameter:
‘axes’ : object of class matplotlib.axes.Axes
‘dict’ : dict of matplotlib.lines.Line2D objects
‘both’ : a namedtuple with structure (ax, lines)
For data grouped with by, return a Series of the above or a numpy
array:
Series
array (for return_type=None)
Use return_type='dict' when you want to tweak the appearance
of the lines after plotting. In this case a dict containing the Lines
making up the boxes, caps, fliers, medians, and whiskers is returned.
Boxplots can be created for every column in the dataframe
by df.boxplot() or indicating the columns to be used:
Boxplots of variables distributions grouped by the values of a third
variable can be created using the option by. For instance:
A list of strings (i.e. ['X','Y']) can be passed to boxplot
in order to group the data by combination of the variables in the x-axis:
The layout of boxplot can be adjusted giving a tuple to layout:
Additional formatting can be done to the boxplot, like suppressing the grid
(grid=False), rotating the labels in the x-axis (i.e. rot=45)
or changing the fontsize (i.e. fontsize=15):
The parameter return_type can be used to select the type of element
returned by boxplot. When return_type='axes' is selected,
the matplotlib axes on which the boxplot is drawn are returned:
Assigns values outside boundary to boundary values. Thresholds
can be singular values or array like, and in the latter case
the clipping is performed element-wise in the specified axis.
Series.clip : Trim values at input threshold in series.
DataFrame.clip : Trim values at input threshold in dataframe.
numpy.clip : Clip (limit) the values in an array.
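A minimal sketch of clipping with scalar thresholds (the column names are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]})
>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4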
Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func
to element-wise combine columns. The row and column indexes of the
resulting DataFrame will be the union of the two.
Example using a true element-wise combine function.
>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3
Using fill_value fills Nones prior to passing the column to the
merge function.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0
However, if the same element in both dataframes is None, that None
is preserved
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  3.0
Example that demonstrates the use of overwrite and behavior when
the axis differ between the dataframes.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1]}, index=[1, 2])
>>> df1.combine(df2, take_smaller)
    A    B     C
0 NaN  NaN   NaN
1 NaN  3.0 -10.0
2 NaN  3.0   1.0
>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0
Demonstrating the preference of the passed in dataframe.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df2.combine(df1, take_smaller)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B    C
0  0.0  NaN  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two. The resulting
dataframe contains the ‘first’ dataframe values and overrides the
second one values where both first.loc[index, col] and
second.loc[index, col] are not missing values, upon calling
first.combine_first(second).
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0
Null values still persist if the location of that null value
does not exist in other
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0
>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0
Align the differences on columns
>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
Assign result_names
>>> df.compare(df2, result_names=("left", "right"))
  col1        col3
  left right left right
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
Stack the differences on rows
>>> df.compare(df2, align_axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0
Keep the equal values
>>> df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0
Keep all original rows and columns
>>> df.compare(df2, keep_shape=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN
Keep all original rows and columns and also all original values
>>> df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
Return the (best) configuration performance for objective over the instances.
Args:
solver: The solver for which we evaluate the configuration
configuration: The configuration (id) to evaluate
objective: The objective for which we find the best value
instances: The instances which should be selected for the evaluation
per_instance: Whether to return the performance per instance,
or aggregated.
Returns:
The (best) configuration id and its aggregated performance.
infer_objects : bool, default True
Whether object dtypes should be converted to the best possible types.
convert_string : bool, default True
Whether object dtypes should be converted to StringDtype().
convert_integer : bool, default True
Whether, if possible, conversion can be done to integer extension types.
convert_boolean : bool, default True
Whether object dtypes should be converted to BooleanDtypes().
convert_floating : bool, default True
Whether, if possible, conversion can be done to floating extension types.
If convert_integer is also True, preference will be given to integer
dtypes if the floats can be faithfully cast to integers.
By default, convert_dtypes will attempt to convert a Series (or each
Series in a DataFrame) to dtypes that support pd.NA. By using the options
convert_string, convert_integer, convert_boolean and
convert_floating, it is possible to turn off individual conversions
to StringDtype, the integer extension types, BooleanDtype
or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference
rules as during normal Series/DataFrame construction. Then, if possible,
convert to StringDtype, BooleanDtype or an appropriate integer
or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an
appropriate integer extension type. Otherwise, convert to an
appropriate floating extension type.
In the future, as new dtypes are added that support pd.NA, the results
of this method will change to support those new dtypes.
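A hedged sketch of the conversion described above (exact dtype reprs vary slightly across pandas versions, so only simple checks are shown):
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", None]})
>>> converted = df.convert_dtypes()
>>> # "a" becomes the nullable Int64 extension type, "b" becomes StringDtype,
>>> # and missing entries become pd.NA rather than NaN/None
>>> converted["a"].dtype
Int64Dtype()
>>> converted["a"].isna().sum()
1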
When deep=True (default), a new object will be created with a
copy of the calling object’s data and indices. Modifications to
the data or indices of the copy will not be reflected in the
original object (see notes below).
When deep=False, a new object will be created without copying
the calling object’s data or index (only references to the data
and index are copied). Any changes to the data of the original
will be reflected in the shallow copy (and vice versa).
Note
The deep=False behaviour as described above will change
in pandas 3.0. Copy-on-Write
will be enabled by default, which means that the “shallow” copy
that is returned with deep=False will still avoid making
an eager copy, but changes to the data of the original will no
longer be reflected in the shallow copy (or vice versa). Instead,
it makes use of a lazy (deferred) copy mechanism that will copy
the data only when any changes to the original or shallow copy is
made.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
When deep=True, data is copied but actual Python objects
will not be copied recursively, only the reference to the object.
This is in contrast to copy.deepcopy in the Standard Library,
which recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying
numpy array is not copied for performance reasons. Since Index is
immutable, the underlying data can be safely shared and a copy
is not needed.
Since pandas is not thread safe, see the
gotchas when copying in a threading
environment.
When copy_on_write in pandas config is set to True, the
copy_on_write config takes effect even when deep=False.
This means that any changes to the copied data would make a new copy
of the data upon write (and vice versa). Changes made to either the
original or copied variable would not be reflected in the counterpart.
See Copy_on_Write for more information.
Updates to the data shared by shallow copy and original are reflected
in both (NOTE: this will no longer be true for pandas >= 3.0);
deep copy remains unchanged.
Note that when copying an object containing Python objects, a deep copy
will copy the data, but will not do so recursively. Updating a nested
data object will be reflected in the deep copy.
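A minimal sketch contrasting deep and shallow copies under the pre-3.0 (non Copy-on-Write) behaviour described above; the data is illustrative:
>>> import pandas as pd
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> shallow = s.copy(deep=False)   # shares data and index with s
>>> deep = s.copy(deep=True)       # independent data
>>> s.iloc[0] = 99
>>> # Without Copy-on-Write, shallow reflects the change while deep still holds 1.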
method{‘pearson’, ‘kendall’, ‘spearman’} or callable
Method of correlation:
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
callable: callable with input two 1d ndarrays
and returning a float. Note that the returned matrix from corr
will have 1 along the diagonals and will be symmetric
regardless of the callable’s behavior.
min_periodsint, optional
Minimum number of observations required per pair of columns
to have a valid result. Currently only available for Pearson
and Spearman correlation.
numeric_onlybool, default False
Include only float, int or boolean data.
Added in version 1.5.0.
Changed in version 2.0.0: The default value of numeric_only is now False.
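A minimal sketch of DataFrame.corr with a non-default method; the data is illustrative:
>>> import pandas as pd
>>> df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})
>>> df.corr(method="spearman")   # rank-based: x and y are perfectly monotone, so all entries are 1.0
>>> df.corr(min_periods=3)       # require at least 3 paired observations per column pair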
Pairwise correlation is computed between rows or columns of
DataFrame with rows or columns of Series or DataFrame. DataFrames
are first aligned along both axes before computing the
correlations.
Series.count: Number of non-NA elements in a Series.
DataFrame.value_counts: Count unique combinations of columns.
DataFrame.shape: Number of DataFrame rows and columns (including NA
elements).
DataFrame.isna: Boolean same-sized DataFrame showing places of NA elements.
>>> df=pd.DataFrame({"Person":... ["John","Myla","Lewis","John","Myla"],... "Age":[24.,np.nan,21.,33,26],... "Single":[False,True,True,True,False]})>>> df Person Age Single0 John 24.0 False1 Myla NaN True2 Lewis 21.0 True3 John 33.0 True4 Myla 26.0 False
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame.
The returned data frame is the covariance matrix of the columns
of the DataFrame.
Both NA and null values are automatically excluded from the
calculation. (See the note below about bias from missing values.)
A threshold can be set for the minimum number of
observations for each value created. Comparisons with observations
below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to
understand the relationship between different measures
across time.
Minimum number of observations required per pair of columns
to have a valid result.
ddofint, default 1
Delta degrees of freedom. The divisor used in calculations
is N-ddof, where N represents the number of elements.
This argument is applicable only when no NaN is in the DataFrame.
numeric_onlybool, default False
Include only float, int or boolean data.
Added in version 1.5.0.
Changed in version 2.0.0: The default value of numeric_only is now False.
Returns the covariance matrix of the DataFrame’s time series.
The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that
data is missing at random)
the returned covariance matrix will be an unbiased estimate
of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable
because the estimated covariance matrix is not guaranteed to be positive
semi-definite. This could lead to estimate correlations having
absolute values which are greater than one, and/or a non-invertible
covariance matrix. See Estimation of covariance matrices for more details.
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795
Minimum number of periods
This method also supports an optional min_periods keyword
that specifies the required minimum number of non-NA observations for
each column pair in order to have a valid result:
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
Descriptive statistics include those that summarize the central
tendency, dispersion and shape of a
dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well
as DataFrame column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes
below for more detail.
The percentiles to include in the output. All should
fall between 0 and 1. The default is
[.25,.5,.75], which returns the 25th, 50th, and
75th percentiles.
include‘all’, list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored
for Series. Here are the options:
‘all’ : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the
provided data types.
To limit the result to numeric types submit
numpy.number. To limit it instead to object columns submit
the numpy.object data type. Strings
can also be used in the style of
select_dtypes (e.g. df.describe(include=['O'])). To
select pandas categorical columns, use 'category'
None (default) : The result will include all numeric columns.
excludelist-like of dtypes or None (default), optional,
A black list of data types to omit from the result. Ignored
for Series. Here are the options:
A list-like of dtypes : Excludes the provided data types
from the result. To exclude numeric types submit
numpy.number. To exclude object columns submit the data
type numpy.object. Strings can also be used in the style of
select_dtypes (e.g. df.describe(exclude=['O'])). To
exclude pandas categorical columns, use 'category'
DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding
columns based on their dtype.
For numeric data, the result’s index will include count,
mean, std, min, max as well as lower, 50 and
upper percentiles. By default the lower percentile is 25 and the
upper percentile is 75. The 50 percentile is the
same as the median.
For object data (e.g. strings or timestamps), the result’s index
will include count, unique, top, and freq. The top
is the most common value. The freq is the most common value’s
frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the
count and top results will be arbitrarily chosen from
among those with the highest count.
For mixed data types provided via a DataFrame, the default is to
return only an analysis of numeric columns. If the dataframe consists
only of object and categorical data without any numeric columns, the
default is to return an analysis of both the object and categorical
columns. If include='all' is provided as an option, the result
will include a union of attributes of each type.
The include and exclude parameters can be used to limit
which columns in a DataFrame are analyzed for the output.
The parameters are ignored when analyzing a Series.
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as
an attribute.
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
For boolean dtypes, this uses operator.xor() rather than
operator.sub().
The result is calculated according to current dtype in DataFrame,
however dtype of the result is always float64.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
If other is a Series, return the matrix product between self and
other as a Series. If other is a DataFrame or a numpy.array, return
the matrix product of self and other in a DataFrame of a np.array.
The dimensions of DataFrame and other must be compatible in order to
compute the matrix multiplication. In addition, the column names of
DataFrame and the index of other must contain the same values, as they
will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the
matrix product here.
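A minimal sketch of DataFrame.dot, where the columns of the left frame align with the index of the right (labels are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
>>> other = pd.DataFrame([[5, 6], [7, 8]], index=["a", "b"])
>>> df.dot(other)   # 2x2 matrix product; raises if the labels do not align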
Remove rows or columns by specifying label names and corresponding
axis, or by directly specifying index or column names. When using a
multi-index, labels on different levels can be removed by specifying
the level. See the user guide
for more information about the now unused levels.
Drop a specific index combination from the MultiIndex
DataFrame, i.e., drop the combination 'falcon' and
'weight', which deletes only the corresponding row
>>> df=pd.DataFrame({"name":['Alfred','Batman','Catwoman'],... "toy":[np.nan,'Batmobile','Bullwhip'],... "born":[pd.NaT,pd.Timestamp("1940-04-25"),... pd.NaT]})>>> df name toy born0 Alfred NaN NaT1 Batman Batmobile 1940-04-252 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25
Drop the columns where at least one element is missing.
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against
each other to see if they have the same shape and elements. NaNs in
the same location are considered equal.
The row/column index do not need to have the same type, as long
as the values are considered equal. Corresponding columns and
index must be of the same dtype.
DataFrames df and different_column_type have the same element
types and values, but have different types for the column labels,
which will still return True.
DataFrames df and different_data_type have different types for the
same values for their elements, and will return False even though
their column labels are the same values and types.
Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows
eval to run arbitrary code, which can make you vulnerable to code
injection if you pass user input to this function.
If the expression contains an assignment, whether to perform the
operation inplace and mutate the existing DataFrame. Otherwise,
a new DataFrame is returned.
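A minimal sketch of DataFrame.eval with an assignment expression (column names are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
>>> df.eval("C = A + B")                  # returns a new frame with the extra column C
>>> df.eval("C = A + B", inplace=True)    # mutates df instead of returning a new frame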
Exactly one of com, span, halflife, or alpha must be
provided if times is not provided. If times is provided,
halflife and one of com, span or alpha may be provided.
If times is specified, a timedelta convertible unit over which an
observation decays to half its value. Only applicable to mean(),
and halflife value will not apply to the other functions.
alphafloat, optional
Specify smoothing factor \(\alpha\) directly
\(0 < \alpha \leq 1\).
min_periodsint, default 0
Minimum number of observations in window required to have a value;
otherwise, result is np.nan.
adjustbool, default True
Divide by decaying adjustment factor in beginning periods to account
for imbalance in relative weightings (viewing EWMA as a moving average).
When adjust=True (default), the EW function is calculated using weights
\(w_i = (1 - \alpha)^i\). For example, the EW moving average of the series
[\(x_0, x_1, ..., x_t\)] would be:
When ignore_na=False (default), weights are based on absolute positions.
For example, the weights of \(x_0\) and \(x_2\) used in calculating
the final weighted average of [\(x_0\), None, \(x_2\)] are
\((1-\alpha)^2\) and \(1\) if adjust=True, and
\((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_na=True, weights are based
on relative positions. For example, the weights of \(x_0\) and \(x_2\)
used in calculating the final weighted average of
[\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if
adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
axis{0, 1}, default 0
If 0 or 'index', calculate across the rows.
If 1 or 'columns', calculate across the columns.
For Series this parameter is unused and defaults to 0.
times : np.ndarray, Series, default None
Only applicable to mean().
Times corresponding to the observations. Must be monotonically increasing and
datetime64[ns] dtype.
If 1-D array like, a sequence with the same shape as the observations.
methodstr {‘single’, ‘table’}, default ‘single’
Added in version 1.4.0.
Execute the rolling operation per single column or row ('single')
or over the entire object ('table').
This argument is only implemented when specifying engine='numba'
in the method call.
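A minimal sketch of an exponentially weighted mean using the alpha parameterisation described above (values are illustrative):
>>> import pandas as pd
>>> s = pd.Series([1.0, 2.0, 3.0])
>>> s.ewm(alpha=0.5, adjust=True).mean()                   # weights (1 - alpha)**i, normalised in early periods
>>> s.ewm(alpha=0.5, adjust=False, min_periods=2).mean()   # recursive form; the first value becomes NaN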
Column(s) to explode.
For multiple columns, specify a non-empty list in which each element
is a str or tuple, and the list-like data in all specified columns
must have matching lengths on the same row of the frame.
Added in version 1.3.0: Multi-column explode
ignore_indexbool, default False
If True, the resulting index will be labeled 0, 1, …, n - 1.
This routine will explode list-likes including lists, tuples, sets,
Series, and np.ndarray. The result dtype of the subset rows will
be object. Scalars will be returned unchanged, and empty list-likes will
result in a np.nan for that row. In addition, the ordering of rows in the
output will be non-deterministic when exploding sets.
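A minimal sketch of DataFrame.explode (data is illustrative); note how the empty list becomes NaN:
>>> import pandas as pd
>>> df = pd.DataFrame({"id": [1, 2], "vals": [[1, 2, 3], []]})
>>> df.explode("vals", ignore_index=True)   # one row per list element; the empty list yields NaN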
axis{0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame
Axis along which to fill missing values. For Series
this parameter is unused and defaults to 0.
inplacebool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limitint, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
If limit is specified, consecutive NaNs will be filled with this
restriction.
None: No fill restriction.
‘inside’: Only fill NaNs surrounded by valid values
(interpolate).
‘outside’: Only fill NaNs outside valid values (extrapolate).
Added in version 2.2.0.
downcastdict, default is None
A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0
>>> df.ffill()
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). Values not
in the dict/Series/DataFrame will not be filled. This value cannot
be a list.
Method to use for filling holes in reindexed Series:
ffill: propagate last valid observation forward to next valid.
backfill / bfill: use next valid observation to fill gap.
Deprecated since version 2.1.0: Use ffill or bfill instead.
axis{0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame
Axis along which to fill missing values. For Series
this parameter is unused and defaults to 0.
inplacebool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limitint, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
downcastdict, default is None
A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
ffill : Fill values by propagating the last valid observation to next valid.
bfill : Fill values by using the next valid observation to fill the gap.
interpolate : Fill NaN values using interpolation.
reindex : Conform object to new index.
asfreq : Convert TimeSeries to specified frequency.
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0
Replace all NaN elements with 0s.
>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1,
2, and 3 respectively.
>>> values={"A":0,"B":1,"C":2,"D":3}>>> df.fillna(value=values) A B C D0 0.0 2.0 2.0 0.01 3.0 4.0 2.0 1.02 0.0 1.0 2.0 3.03 0.0 3.0 2.0 4.0
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0
When filling using a DataFrame, replacement happens along
the same column names and same indices
>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0
Note that column D is not affected since it is not present in df2.
Keep labels from axis for which “like in label == True”.
regexstr (regular expression)
Keep labels from axis for which re.search(regex, label) == True.
axis{0 or ‘index’, 1 or ‘columns’, None}, default None
The axis to filter on, expressed either as an index (int)
or axis name (str). By default this is the info axis, ‘columns’ for
DataFrame. For Series this parameter is unused and defaults to None.
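A minimal sketch of DataFrame.filter using the items, like and regex selectors (column names are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"one": [1], "two": [2], "three": [3]})
>>> df.filter(items=["one"])        # keep exactly these labels
>>> df.filter(like="t", axis=1)     # keep labels containing "t": 'two' and 'three'
>>> df.filter(regex="^o")           # keep labels matching the regular expression: 'one'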
last : Select final periods of time series based on a date offset.
at_time : Select values at a particular time of the day.
between_time : Select values between particular times of the day.
Notice that data for the first 3 calendar days were returned, not the
first 3 days observed in the dataset, and therefore data for 2018-04-13
was not returned.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Return a list of performance computation jobs that are still to be done.
Get a list of tuple[instance, solver] to run from the performance data.
If rerun is False (default), get only the tuples that don’t have a
value, else (True) get all the tuples.
Args:
rerun: Boolean indicating if we want to rerun all jobs
Returns:
A tuple of (solver, config, instance, run) combinations
Group DataFrame using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the
object, applying a function, and combining the results. This can be
used to group large amounts of data and compute operations on these
groups.
bymapping, function, label, pd.Grouper or list of such
Used to determine the groups for the groupby.
If by is a function, it’s called on each value of the object’s
index. If a dict or Series is passed, the Series or dict VALUES
will be used to determine the groups (the Series’ values are first
aligned; see .align() method). If a list or ndarray of length
equal to the selected axis is passed (see the groupby user guide),
the values are used as-is to determine the groups. A label or list
of labels may be passed to group by the columns in self.
Notice that a tuple is interpreted as a (single) key.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
Split along rows (0) or columns (1). For Series this parameter
is unused and defaults to 0.
Deprecated since version 2.1.0: Will be removed and behave like axis=0 in a future version.
For axis=1, do frame.T.groupby(...) instead.
levelint, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular
level or levels. Do not specify both by and level.
as_indexbool, default True
Return object with group labels as the
index. Only relevant for DataFrame input. as_index=False is
effectively “SQL-style” grouped output. This argument has no effect
on filtrations (see the filtrations in the user guide),
such as head(), tail(), nth() and in transformations
(see the transformations in the user guide).
sortbool, default True
Sort group keys. Get better performance by turning this off.
Note this does not influence the order of observations within each
group. Groupby preserves the order of rows within each group. If False,
the groups will appear in the same order as they did in the original DataFrame.
This argument has no effect on filtrations (see the filtrations in the user guide),
such as head(), tail(), nth() and in transformations
(see the transformations in the user guide).
Changed in version 2.0.0: Specifying sort=False with an ordered categorical grouper will no
longer sort the values.
group_keysbool, default True
When calling apply and the by argument produces a like-indexed
(i.e. a transform) result, add group keys to
index to identify pieces. By default group keys are not included
when the result’s index (and column) labels match the inputs, and
are included otherwise.
Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the
result from apply is a like-indexed Series or DataFrame.
Specify group_keys explicitly to include the group keys or
not.
Changed in version 2.0.0: group_keys now defaults to True.
observedbool, default False
This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
Deprecated since version 2.1.0: The default value will change to True in a future version of pandas.
dropnabool, default True
If True, and if group keys contain NA values, NA values together
with row/column will be dropped.
If False, NA values will also be treated as the key in groups.
See the user guide for more
detailed usage and examples, including splitting an object into groups,
iterating through groups, selecting a group, aggregation, and more.
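A minimal sketch of a groupby aggregation using some of the keywords described above (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"key": ["a", "b", "a", None], "val": [1, 2, 3, 4]})
>>> df.groupby("key", as_index=False, sort=True)["val"].sum()   # the NA key is dropped by default
>>> df.groupby("key", dropna=False)["val"].sum()                # NA is treated as its own group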
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
This function returns the first n rows for the object based
on position. It is useful for quickly testing if your object
has the right type of data in it.
For negative values of n, this function returns all rows except
the last |n| rows, equivalent to df[:n].
If n is larger than the number of rows, this function returns all rows.
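A minimal sketch of head with positive and negative n (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"n": range(6)})
>>> df.head(3)    # first three rows
>>> df.head(-2)   # all rows except the last two, i.e. df[:-2]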
A `histogram`_ is a representation of the distribution of data.
This function calls matplotlib.pyplot.hist(), on each series in
the DataFrame, resulting in one histogram per column.
If passed, will be used to limit data to a subset of columns.
byobject, optional
If passed, then used to form histograms for separate groups.
gridbool, default True
Whether to show axis grid lines.
xlabelsizeint, default None
If specified changes the x-axis label size.
xrotfloat, default None
Rotation of x axis labels. For example, a value of 90 displays the
x labels rotated 90 degrees clockwise.
ylabelsizeint, default None
If specified changes the y-axis label size.
yrotfloat, default None
Rotation of y axis labels. For example, a value of 90 displays the
y labels rotated 90 degrees clockwise.
axMatplotlib axes object, default None
The axes to plot the histogram on.
sharexbool, default True if ax is None else False
In case subplots=True, share x axis and set some x axis labels to
invisible; defaults to True if ax is None otherwise False if an ax
is passed in.
Note that passing in both an ax and sharex=True will alter all x axis
labels for all subplots in a figure.
shareybool, default False
In case subplots=True, share y axis and set some y axis labels to
invisible.
figsizetuple, optional
The size in inches of the figure to create. Uses the value in
matplotlib.rcParams by default.
layouttuple, optional
Tuple of (rows, columns) for the layout of the histograms.
binsint or sequence, default 10
Number of histogram bins to be used. If an integer is given, bins + 1
bin edges are calculated and returned. If bins is a sequence, gives
bin edges, including left edge of first bin and right edge of last
bin. In this case, bins is returned unmodified.
backendstr, default None
Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set
pd.options.plotting.backend.
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped
columns, leaving non-object and unconvertible
columns unchanged. The inference rules are the
same as during normal Series/DataFrame construction.
Whether to make a copy for non-object or non-inferable columns
or Series.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
to_datetime : Convert argument to datetime.
to_timedelta : Convert argument to timedelta.
to_numeric : Convert argument to numeric type.
convert_dtypes : Convert argument to best possible dtype.
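A minimal sketch of infer_objects on a frame stored with object dtype (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}, dtype="object")
>>> df.infer_objects().dtypes   # 'a' is inferred as int64; 'b' stays object (strings are not converted)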
Whether to print the full summary. By default, the setting in
pandas.options.display.max_info_columns is followed.
bufwritable buffer, defaults to sys.stdout
Where to send the output. By default, the output is printed to
sys.stdout. Pass a writable buffer if you need to further process
the output.
max_colsint, optional
When to switch from the verbose to the truncated output. If the
DataFrame has more than max_cols columns, the truncated output
is used. By default, the setting in
pandas.options.display.max_info_columns is used.
memory_usagebool, str, optional
Specifies whether total memory usage of the DataFrame
elements (including the index) should be displayed. By default,
this follows the pandas.options.display.memory_usage setting.
True always show memory usage. False never shows memory usage.
A value of ‘deep’ is equivalent to “True with deep introspection”.
Memory usage is shown in human-readable units (base-2
representation). Without deep introspection a memory estimation is
made based on column dtype and number of rows assuming values
consume the same memory amount for corresponding dtypes. With deep
memory introspection, a real memory usage calculation is performed
at the cost of computational resources. See the
Frequently Asked Questions for more
details.
show_countsbool, optional
Whether to show the non-null counts. By default, this is shown
only if the DataFrame is smaller than
pandas.options.display.max_info_rows and
pandas.options.display.max_info_columns. A value of True always
shows the counts, and False never shows the counts.
‘linear’: Ignore the index and treat the values as equally
spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate
given length of interval.
‘index’, ‘values’: use the actual numerical values of the index.
‘pad’: Fill in NaNs using existing values.
‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’,
‘barycentric’, ‘polynomial’: Passed to
scipy.interpolate.interp1d, whereas ‘spline’ is passed to
scipy.interpolate.UnivariateSpline. These methods use the numerical
values of the index. Both ‘polynomial’ and ‘spline’ require that
you also specify an order (int), e.g.
df.interpolate(method='polynomial',order=5). Note that the
'slinear' method in pandas refers to the SciPy first-order spline
rather than the pandas first-order spline.
‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’,
‘cubicspline’: Wrappers around the SciPy interpolation methods of
similar names. See Notes.
‘from_derivatives’: Refers to
scipy.interpolate.BPoly.from_derivatives.
axis{{0 or ‘index’, 1 or ‘columns’, None}}, default None
Axis to interpolate along. For Series this parameter is unused
and defaults to 0.
limitint, optional
Maximum number of consecutive NaNs to fill. Must be greater than
0.
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’
methods are wrappers around the respective SciPy implementations of
similar names. These use the actual numerical values of the index.
For more information on their behavior, see the
SciPy documentation.
Filling in NaN in a Series via polynomial interpolation or splines:
Both ‘polynomial’ and ‘spline’ methods require that you also specify
an order (int).
Fill the DataFrame forward (that is, going down) along each column
using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently,
because there is no entry after it to use for interpolation.
Note how the first entry in column ‘b’ remains NaN, because there
is no entry before it to use for interpolation.
>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0
frame.isetitem(loc,value) is an in-place method as it will
modify the DataFrame in place (not returning a new object). In contrast to
frame.iloc[:,i]=value which will try to update the existing values in
place, frame.isetitem(loc,value) will not update the values of the column
itself in place, it will instead insert a new array.
In cases where frame.columns is unique, this is equivalent to
frame[frame.columns[i]]=value.
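A minimal sketch of isetitem versus positional iloc assignment (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
>>> df.isetitem(0, [10, 20])      # inserts a new array at column position 0
>>> df.iloc[:, 1] = [5.0, 6.0]    # in contrast, tries to update column 1's values in place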
The result will only be true at a location if all the
labels match. If values is a Series, that’s the index. If
values is a dict, the keys must be the column names,
which must match. If values is a DataFrame,
then both the index and column labels must match.
DataFrame.eq: Equality test for DataFrame.
Series.isin: Equivalent method on Series.
Series.str.contains: Test if pattern or regex is contained within a
string of a Series or Index.
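A minimal sketch of isin with a list, a dict and a Series (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"num": [1, 2], "name": ["falcon", "dog"]})
>>> df.isin([1, "falcon"])                       # elementwise membership test
>>> df.isin({"num": [2]})                        # dict keys restrict the check to those columns
>>> df.isin(pd.Series(["falcon"], index=[0]))    # a Series is matched on the index as well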
Return a boolean same-sized object indicating if the values are NA.
NA values, such as None or numpy.NaN, get mapped to True
values.
Everything else gets mapped to False values. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Return a boolean same-sized object indicating if the values are NA.
NA values, such as None or numpy.NaN, get mapped to True
values.
Everything else gets mapped to False values. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Because iterrows returns a Series for each row,
it does not preserve dtypes across the rows (dtypes are
preserved across columns for DataFrames).
To preserve dtypes while iterating over the rows, it is better
to use itertuples() which returns namedtuples of the values
and which is generally faster than iterrows.
You should never modify something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
An object to iterate over namedtuples for each row in the
DataFrame with the first field possibly being the index and
following fields being the column values.
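A minimal sketch of itertuples, which preserves dtypes better than iterrows (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": [0.5, 0.75]}, index=["x", "y"])
>>> for row in df.itertuples(name="Row"):
...     print(row.Index, row.a, row.b)   # the first field is the index label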
otherDataFrame, Series, or a list containing any combination of them
Index should be similar to one of the columns in this one. If a
Series is passed, its name attribute must be set, and that will be
used as the column name in the resulting joined DataFrame.
onstr, list of str, or array-like, optional
Column or index level name(s) in the caller to join on the index
in other, otherwise joins index-on-index. If multiple
values given, the other DataFrame must have a MultiIndex. Can
pass an array as the join key if it is not already contained in
the calling DataFrame. Like an Excel VLOOKUP operation.
>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN
If we want to join using the key columns, we need to set key to be
the index in both df and other. The joined DataFrame will have
key as its index.
>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN
Another option to join using the key columns is to use the on
parameter. DataFrame.join always uses other’s index but we can use
any column in df. This method preserves the original DataFrame’s
index in the result.
first : Select initial periods of time series based on a date offset.
at_time : Select values at a particular time of the day.
between_time : Select values between particular times of the day.
Notice the data for 3 last calendar days were returned, not the last
3 observed days in the dataset, and therefore data for 2018-04-11 was
not returned.
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
DataFrame.apply : Apply a function along input axis of DataFrame.
DataFrame.replace: Replace values given in to_replace with value.
Series.map : Apply a function elementwise on a Series.
Return the marginal contribution of the solver configuration on the instances.
Args:
objective: The objective for which we calculate the marginal contribution.
instances: The instances which should be selected for the evaluation
sort: Whether to sort the results afterwards
Returns:
The marginal contribution of each solver (configuration) as:
[(solver, config_id, marginal_contribution, portfolio_best_performance_without_solver)]
condbool Series/DataFrame, array-like, or callable
Where cond is False, keep the original value. Where
True, replace with corresponding value from other.
If cond is callable, it is computed on the Series/DataFrame and
should return boolean Series/DataFrame or array. The callable must
not change input Series/DataFrame (though pandas doesn’t check it).
otherscalar, Series/DataFrame, or callable
Entries where cond is True are replaced with
corresponding value from other.
If other is callable, it is computed on the Series/DataFrame and
should return scalar or Series/DataFrame. The callable must not
change input Series/DataFrame (though pandas doesn’t check it).
If not specified, entries will be filled with the corresponding
NULL value (np.nan for numpy dtypes, pd.NA for extension
dtypes).
inplacebool, default False
Whether to perform the operation in place on the data.
axisint, default None
Alignment axis if needed. For Series this parameter is
unused and defaults to 0.
The mask method is an application of the if-then idiom. For each
element in the calling DataFrame, if cond is False the
element is used; otherwise the corresponding element from the DataFrame
other is used. If the axis of other does not align with axis of
cond Series/DataFrame, the misaligned index positions will be filled with
True.
The signature for DataFrame.where() differs from
numpy.where(). Roughly df1.where(m,df2) is equivalent to
np.where(m,df1,df2).
For further details and examples see the mask documentation in
indexing.
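A minimal sketch of mask, the inverse of where (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3, 4]})
>>> df.mask(df > 2, other=0)   # entries where the condition is True are replaced by 0
>>> df.mask(df > 2)            # with no other, masked entries become the NULL value (NaN here)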
The dtype of the object takes precedence. The fill value is casted to
the object’s dtype, if this can be done losslessly.
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (id_vars), while all other
columns, considered measured variables (value_vars), are “unpivoted” to
the row axis, leaving just two non-identifier columns, ‘variable’ and
‘value’.
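A minimal sketch of melt, unpivoting measured columns into 'variable'/'value' pairs (column names are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"id": [1, 2], "height": [1.5, 1.7], "weight": [60, 70]})
>>> df.melt(id_vars="id", value_vars=["height", "weight"])   # long format: id, variable, value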
Specifies whether to include the memory usage of the DataFrame’s
index in returned Series. If index=True, the memory usage of
the index is the first item in the output.
deepbool, default False
If True, introspect the data deeply by interrogating
object dtypes for system-level memory consumption, and include
it in the returned values.
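A minimal sketch of memory_usage with and without deep introspection (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"obj": ["a", "bb"], "num": [1, 2]})
>>> df.memory_usage(index=True)   # shallow estimate per column; the index is the first entry
>>> df.memory_usage(deep=True)    # also counts the Python strings behind the object column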
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes will be ignored. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.
When performing a cross merge, no column specifications to merge on are
allowed.
Warning
If both key columns contain rows where the key is a null value, those
rows will be matched against each other. This is different from usual SQL
join behaviour and can lead to unexpected results.
left: use only keys from left frame, similar to a SQL left outer join;
preserve key order.
right: use only keys from right frame, similar to a SQL right outer join;
preserve key order.
outer: use union of keys from both frames, similar to a SQL full outer
join; sort keys lexicographically.
inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
cross: creates the cartesian product from both frames, preserves the order
of the left keys.
onlabel or list
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
left_onlabel or list, or array-like
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
right_onlabel or list, or array-like
Column or index level names to join on in the right DataFrame. Can also
be an array or list of arrays of the length of the right DataFrame.
These arrays are treated as if they are columns.
left_indexbool, default False
Use the index from the left DataFrame as the join key(s). If it is a
MultiIndex, the number of keys in the other DataFrame (either the index
or a number of columns) must match the number of levels.
right_indexbool, default False
Use the index from the right DataFrame as the join key. Same caveats as
left_index.
sortbool, default False
Sort the join keys lexicographically in the result DataFrame. If False,
the order of the join keys depends on the join type (how keyword).
suffixeslist-like, default is (“_x”, “_y”)
A length-2 sequence where each element is optionally a string
indicating the suffix to add to overlapping column names in
left and right respectively. Pass a value of None instead
of a string to indicate that the column name from left or
right should be left as-is, with no suffix. At least one of the
values must not be None.
copybool, default True
If False, avoid copy if possible.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
indicatorbool or str, default False
If True, adds a column to the output DataFrame called “_merge” with
information on the source of each row. The column can be given a different
name by providing a string argument. The column will have a Categorical
type with the value of “left_only” for observations whose merge key only
appears in the left DataFrame, “right_only” for observations
whose merge key only appears in the right DataFrame, and “both”
if the observation’s merge key is found in both DataFrames.
validatestr, optional
If specified, checks if merge is of specified type.
“one_to_one” or “1:1”: check if merge keys are unique in both
left and right datasets.
“one_to_many” or “1:m”: check if merge keys are unique in left
dataset.
“many_to_one” or “m:1”: check if merge keys are unique in right
dataset.
“many_to_many” or “m:m”: allowed, but does not result in checks.
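A minimal sketch of merge combining how, indicator and validate (data is illustrative):
>>> import pandas as pd
>>> left = pd.DataFrame({"key": ["a", "b"], "lval": [1, 2]})
>>> right = pd.DataFrame({"key": ["b", "c"], "rval": [3, 4]})
>>> left.merge(right, on="key", how="outer", indicator=True, validate="1:1")
>>> # the _merge column reports left_only / right_only / both for each row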
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN
By default, missing values are not considered, and the modes of wings
are both 0.0 and 2.0. Because the resulting DataFrame has two rows,
the second row of species and legs contains NaN.
>>> df.mode()
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0
Setting dropna=False, NaN values are considered and they can be
the mode (like for wings).
>>> df.mode(dropna=False)
  species  legs  wings
0    bird     2    NaN
Setting numeric_only=True, only the mode of numeric columns is
computed, and columns of other types are ignored.
>>> df.mode(numeric_only=True)
   legs  wings
0   2.0    0.0
1   NaN    2.0
To compute the mode over columns and not rows, use the axis parameter:
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
DataFrame.eq : Compare DataFrames for equality elementwise.
DataFrame.ne : Compare DataFrames for inequality elementwise.
DataFrame.le : Compare DataFrames for less than inequality
or equality elementwise.
DataFrame.lt : Compare DataFrames for strictly less than
inequality elementwise.
DataFrame.ge : Compare DataFrames for greater than inequality
or equality elementwise.
DataFrame.gt : Compare DataFrames for strictly greater than
inequality elementwise.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in
descending order. The columns that are not specified are returned as
well, but not used for ordering.
This method is equivalent to
df.sort_values(columns,ascending=False).head(n), but more
performant.
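A minimal sketch of nlargest (data is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"population": [30, 10, 20], "gdp": [3, 1, 2]})
>>> df.nlargest(2, "population")   # the two rows with the largest population; other columns kept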
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
NA values, such as None or numpy.NaN, get mapped to False
values.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
DataFrame.notnull is an alias for DataFrame.notna.
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA.
Non-missing values get mapped to True. Characters such as empty
strings '' or numpy.inf are not considered NA values
(unless you set pandas.options.mode.use_inf_as_na=True).
NA values, such as None or numpy.NaN, get mapped to False
values.
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
Return the first n rows ordered by columns in ascending order.
Return the first n rows with the smallest values in columns, in
ascending order. The columns that are not specified are returned as
well, but not used for ordering.
This method is equivalent to
df.sort_values(columns,ascending=True).head(n), but more
performant.
Fractional change between the current and a prior element.
Computes the fractional change from the immediately previous row by
default. This is useful in comparing the fraction of change in a time
series of elements.
Note
Despite the name of this method, it calculates fractional change
(also known as per unit change or relative change) and not
percentage change. If you need the percentage change, multiply
these values by 100.
Series.diff : Compute the difference of two elements in a Series.
DataFrame.diff : Compute the difference of two elements in a DataFrame.
Series.shift : Shift the index by some number of periods.
DataFrame.shift : Shift the index by some number of periods.
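A minimal sketch of pct_change (values are illustrative); the result is fractional, not a percentage:
>>> import pandas as pd
>>> s = pd.Series([10.0, 11.0, 12.1])
>>> s.pct_change()         # NaN, 0.10, 0.10
>>> s.pct_change() * 100   # multiply by 100 to express the change in percent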
Function to apply to the Series/DataFrame.
args, and kwargs are passed into func.
Alternatively a (callable,data_keyword) tuple where
data_keyword is a string indicating the keyword of
callable that expects the Series/DataFrame.
DataFrame.apply : Apply a function along input axis of DataFrame.
DataFrame.map : Apply a function elementwise on a whole DataFrame.
Series.map : Apply a mapping correspondence on a Series.
If you have a function that takes the data as (say) the second
argument, pass a tuple indicating which keyword expects the
data. For example, suppose national_insurance takes its data as df
in the second argument:
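A minimal sketch of pipe with the (callable, data_keyword) form; national_insurance is the hypothetical helper mentioned above, taking the data as its second argument:
>>> import pandas as pd
>>> def national_insurance(rate, data):   # hypothetical helper: the data arrives second
...     return data["salary"] * rate
>>> df = pd.DataFrame({"salary": [1000, 2000]})
>>> df.pipe((national_insurance, "data"), rate=0.12)   # df is passed as the 'data' keyword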
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a “pivot” table) based on column values. Uses
unique values from specified index / columns to form axes of the
resulting DataFrame. This function does not support data
aggregation, multiple values will result in a MultiIndex in the
columns. See the User Guide for more on reshaping.
Column to use to make new frame’s index. If not given, uses existing index.
valuesstr, object or a list of the previous, optional
Column(s) to use for populating new frame’s values. If not
specified, all remaining columns will be used and the result will
have hierarchically indexed columns.
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t
>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar')['baz']
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
      baz       zoo
bar     A  B  C   A  B  C
foo
one     1  2  3   x  y  z
two     4  5  6   q  w  t
You could also assign a list of column names or a list of index names.
>>> df.pivot(index=["lev1","lev2"],columns=["lev3"],values="values") lev3 1 2lev1 lev2 1 1 0.0 1.0 2 2.0 NaN 2 1 4.0 3.0 2 NaN 5.0
A ValueError is raised if there are any duplicates.
>>> df=pd.DataFrame({"foo":['one','one','two','two'],... "bar":['A','A','B','C'],... "baz":[1,2,3,4]})>>> df foo bar baz0 one A 11 one A 22 two B 33 two C 4
Notice that the first two rows are the same for our index
and columns arguments.
indexcolumn, Grouper, array, or list of the previous
Keys to group by on the pivot table index. If a list is passed,
it can contain any of the other types (except list). If an array is
passed, it must be the same length as the data and will be used in
the same manner as column values.
columnscolumn, Grouper, array, or list of the previous
Keys to group by on the pivot table column. If a list is passed,
it can contain any of the other types (except list). If an array is
passed, it must be the same length as the data and will be used in
the same manner as column values.
aggfuncfunction, list of functions, dict, default “mean”
If a list of functions is passed, the resulting pivot table will have
hierarchical columns whose top level are the function names
(inferred from the function objects themselves).
If a dict is passed, the key is column to aggregate and the value is
function or list of functions. If margins=True, aggfunc will be
used to calculate the partial aggregates.
fill_valuescalar, default None
Value to replace missing values with (in the resulting pivot table,
after aggregation).
marginsbool, default False
If margins=True, special All columns and rows
will be added with partial group aggregates across the categories
on the rows and columns.
dropnabool, default True
Do not include columns whose entries are all NaN. If True,
rows with a NaN value in any column will be omitted before
computing margins.
margins_namestr, default ‘All’
Name of the row / column that will contain the totals
when margins is True.
observedbool, default False
This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers.
If False: show all values for categorical groupers.
Deprecated since version 2.2.0: The default value of False is deprecated and will change to
True in a future version of pandas.
>>> df=pd.DataFrame({"A":["foo","foo","foo","foo","foo",... "bar","bar","bar","bar"],... "B":["one","one","one","two","two",... "one","one","two","two"],... "C":["small","large","large","small",... "small","large","small","small",... "large"],... "D":[1,2,2,3,3,4,5,6,7],... "E":[2,4,5,5,6,6,8,9,9]})>>> df A B C D E0 foo one small 1 21 foo one large 2 42 foo one large 2 53 foo two small 3 54 foo two small 3 65 bar one large 4 66 bar one small 5 87 bar two small 6 98 bar two large 7 9
This first example aggregates values by taking the sum.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc="sum")
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
We can also fill missing values using the fill_value parameter.
>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc="sum", fill_value=0)
>>> table
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6
The next example aggregates by taking the mean across multiple columns.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': "mean", 'E': "mean"})
>>> table
                  D         E
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333
We can also calculate multiple types of aggregations for any given
value column.
>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': "mean",
...                                 'E': ["min", "max", "mean"]})
>>> table
                  D   E
               mean max      mean  min
A   C
bar large  5.500000   9  7.500000    6
    small  5.500000   9  8.500000    8
foo large  2.000000   5  4.500000    4
    small  2.333333   6  4.333333    2
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Axis for the function to be applied on.
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.prod with axis=None is deprecated,
in a future version this will reduce over both axes and return a scalar
To retain the old behavior, pass axis=0 (or do not pass axis).
Added in version 2.0.0.
skipnabool, default True
Exclude NA/null values when computing the result.
numeric_onlybool, default False
Include only float, int, boolean columns. Not implemented for Series.
min_countint, default 0
The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
Axis for the function to be applied on.
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.prod with axis=None is deprecated,
in a future version this will reduce over both axes and return a scalar
To retain the old behavior, pass axis=0 (or do not pass axis).
Added in version 2.0.0.
skipnabool, default True
Exclude NA/null values when computing the result.
numeric_onlybool, default False
Include only float, int, boolean columns. Not implemented for Series.
min_countint, default 0
The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
This optional parameter specifies the interpolation method to use,
when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the
fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
method{‘single’, ‘table’}, default ‘single’
Whether to compute quantiles per-column (‘single’) or over all columns
(‘table’). When ‘table’, the only allowed interpolation methods are
‘nearest’, ‘lower’, and ‘higher’.
You can refer to variables
in the environment by prefixing them with an ‘@’ character like
@a+b.
You can refer to column names that are not valid Python variable names
by surrounding them in backticks. Thus, column names containing spaces
or punctuations (besides underscores) or starting with digits must be
surrounded by backticks. (For example, a column named “Area (cm^2)” would
be referenced as `Area (cm^2)`). Column names which are Python keywords
(like “list”, “for”, “import”, etc) cannot be used.
For example, if one of your columns is called aa and you want
to sum it with b, your query should be `aa`+b.
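For instance, assuming a small hypothetical frame with a column named C C (the space forces backticks):

>>> df = pd.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df.query('A > B')
   A  B  C C
4  5  2    6
>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10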
inplacebool
Whether to modify the DataFrame rather than creating a new one.
The result of the evaluation of this expression is first passed to
DataFrame.loc and if that fails because of a
multidimensional key (e.g., a DataFrame) then the result will be passed
to DataFrame.__getitem__().
This method uses the top-level eval() function to
evaluate the passed query.
The query() method uses a slightly
modified Python syntax by default. For example, the & and |
(bitwise) operators have the precedence of their boolean cousins,
and and or. This is syntactically valid Python,
however the semantics are different.
You can change the semantics of the expression by passing the keyword
argument parser='python'. This enforces the same semantics as
evaluation in Python space. Likewise, you can pass engine='python'
to evaluate an expression using Python itself as a backend. This is not
recommended as it is inefficient compared to using numexpr as the
engine.
The DataFrame.index and
DataFrame.columns attributes of the
DataFrame instance are placed in the query namespace
by default, which allows you to treat both the index and columns of the
frame as a column in the frame.
The identifier index is used for the frame index; you can also
use the name of the index to identify it in a query. Please note that
Python keywords may not be used as identifiers.
For further details and examples see the query documentation in
indexing.
Backtick quoted variables
Backtick quoted variables are parsed as literal Python code and
are converted internally to a Python valid identifier.
This can lead to the following problems.
During parsing a number of disallowed characters inside the backtick
quoted string are replaced by strings that are allowed as a Python identifier.
These characters include all operators in Python, the space character, the
question mark, the exclamation mark, the dollar sign, and the euro sign.
For other characters that fall outside the ASCII range (U+0001..U+007F)
and those that are not further specified in PEP 3131,
the query parser will raise an error.
This excludes whitespace different than the space character,
but also the hashtag (as it is used for comments) and the backtick
itself (backtick can also not be escaped).
In a special case, quotes that make a pair around a backtick can
confuse the parser.
For example, `it's`>`that's` will raise an error,
as it forms a quoted string ('s>`that') with a backtick inside.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
The following example shows how the method behaves with the above
parameters:
default_rank: this is the default behaviour obtained without using
any parameter.
max_rank: setting method='max' the records that have the
same values are ranked using the highest rank (e.g.: since ‘cat’
and ‘dog’ are both in the 2nd and 3rd position, rank 3 is assigned.)
NA_bottom: choosing na_option='bottom', if there are records
with NaN values they are placed at the bottom of the ranking.
pct_rank: when setting pct=True, the ranking is expressed as
percentile rank.
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Conform DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object
is produced unless the new index is equivalent to the current one and
copy=False.
Method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next
valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copybool, default True
Return a new object, even if the passed indexes are the same.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
levelint or name
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuescalar, default np.nan
Value to use for missing values. Defaults to NaN, but can be any
“compatible” value.
limitint, default None
Maximum number of consecutive elements to forward or backward fill.
toleranceoptional
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation abs(index[indexer]-target)<=tolerance.
Tolerance may be a scalar value, which applies the same tolerance
to all values, or list-like, which applies variable tolerance per
element. List-like includes list, tuple, array, Series, and must be
the same size as the index and its dtype must exactly match the
index’s type.
DataFrame.set_index : Set row labels.
DataFrame.reset_index : Remove row labels or move them to new columns.
DataFrame.reindex_like : Change to same indices as other DataFrame.
Create a new index and reindex the dataframe. By default
values in the new index that do not have corresponding
records in the dataframe are assigned NaN.
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
We can fill in the missing values by passing a value to
the keyword fill_value. Because the index is not monotonically
increasing or decreasing, we cannot use arguments to the keyword
method to fill the NaN values.
To further illustrate the filling functionality in
reindex, we will create a dataframe with a
monotonically increasing index (for example, a sequence
of dates).
The index entries that did not have a value in the original data frame
(for example, ‘2009-12-29’) are by default filled with NaN.
If desired, we can fill in the missing values using one of several
options.
For example, to back-propagate the last valid value to fill the NaN
values, pass bfill as an argument to the method keyword.
Please note that the NaN value present in the original dataframe
(at index value 2010-01-03) will not be filled by any of the
value propagation schemes. This is because filling while reindexing
does not look at dataframe values, but only compares the original and
desired indexes. If you do want to fill in the NaN values present
in the original dataframe, use the fillna() method.
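A sketch of that behaviour on a small hypothetical date-indexed frame; note the pre-existing NaN is left untouched while the new label is backfilled:

>>> date_index = pd.date_range('2010-01-01', periods=3, freq='D')
>>> df = pd.DataFrame({'prices': [100.0, 101.0, np.nan]}, index=date_index)
>>> new_dates = pd.date_range('2009-12-31', periods=4, freq='D')
>>> df.reindex(new_dates, method='bfill')
            prices
2009-12-31   100.0
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN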
Return an object with matching indices as other object.
Conform the object to the same index on all axes. Optional
filling logic, placing NaN in locations having no value
in the previous index. A new object is produced unless the
new index is equivalent to the current one and copy=False.
Method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: propagate last valid observation forward to next
valid
backfill / bfill: use next valid observation to fill gap
nearest: use nearest valid observations to fill gap.
copybool, default True
Return a new object, even if the passed indexes are the same.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
limitint, default None
Maximum number of consecutive labels to fill for inexact matches.
toleranceoptional
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation abs(index[indexer]-target)<=tolerance.
Tolerance may be a scalar value, which applies the same tolerance
to all values, or list-like, which applies variable tolerance per
element. List-like includes list, tuple, array, Series, and must be
the same size as the index and its dtype must exactly match the
index’s type.
DataFrame.set_index : Set row labels.
DataFrame.reset_index : Remove row labels or move them to new columns.
DataFrame.reindex : Change to new indices or expand indices.
>>> df2
            temp_celsius windspeed
2014-02-12          28.0       low
2014-02-13          30.0       low
2014-02-15          35.1    medium

>>> df2.reindex_like(df1)
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          28.0              NaN       low
2014-02-13          30.0              NaN       low
2014-02-14           NaN              NaN       NaN
2014-02-15          35.1              NaN    medium
Dict-like or function transformations to apply to
that axis’ values. Use either mapper and axis to
specify the axis to target with mapper, or index and
columns.
indexdict-like or function
Alternative to specifying axis (mapper,axis=0
is equivalent to index=mapper).
columnsdict-like or function
Alternative to specifying axis (mapper,axis=1
is equivalent to columns=mapper).
axis{0 or ‘index’, 1 or ‘columns’}, default 0
Axis to target with mapper. Can be either the axis name
(‘index’, ‘columns’) or number (0, 1). The default is ‘index’.
copybool, default True
Also copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
inplacebool, default False
Whether to modify the DataFrame rather than creating a new one.
If True then value of copy is ignored.
levelint or level name, default None
In case of a MultiIndex, only rename labels in the specified
level.
errors{‘ignore’, ‘raise’}, default ‘ignore’
If ‘raise’, raise a KeyError when a dict-like mapper, index,
or columns contains labels that are not present in the Index
being transformed.
If ‘ignore’, existing keys will be renamed and extra keys will be
ignored.
index, columnsscalar, list-like, dict-like or function, optional
A scalar, list-like, dict-like or functions transformations to
apply to that axis’ values.
Note that the columns parameter is not allowed if the
object is a Series. This parameter only applies to DataFrame
objects.
Use either mapper and axis to
specify the axis to target with mapper, or index
and/or columns.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to rename. For Series this parameter is unused and defaults to 0.
copybool, default None
Also copy underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
inplacebool, default False
Modifies the object directly, instead of creating a new Series
or DataFrame.
DataFrame.rename_axis supports two calling conventions
(index=index_mapper,columns=columns_mapper,...)
(mapper,axis={'index','columns'},...)
The first calling convention will only modify the names of
the index and/or the names of the Index object that is the columns.
In this case, the parameter copy is ignored.
The second calling convention will modify the names of the
corresponding index if mapper is a list or a scalar.
However, if mapper is dict-like or a function, it will use the
deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your
intent.
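A sketch of the first (keyword) convention on a hypothetical frame, naming both axes:

>>> df = pd.DataFrame({'num_legs': [4, 4, 2]},
...                   index=['dog', 'cat', 'monkey'])
>>> df.rename_axis(index='animal', columns='attribute')
attribute  num_legs
animal
dog               4
cat               4
monkey            2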
Values of the Series/DataFrame are replaced with other values dynamically.
This differs from updating with .loc or .iloc, which require
you to specify a location to update with some value.
to_replacestr, regex, list, dict, Series, int, float, or None
How to find the values that will be replaced.
numeric, str or regex:
numeric: numeric values equal to to_replace will be
replaced with value
str: string exactly matching to_replace will be replaced
with value
regex: regexs matching to_replace will be replaced with
value
list of str, regex, or numeric:
First, if to_replace and value are both lists, they
must be the same length.
Second, if regex=True then all of the strings in both
lists will be interpreted as regexs otherwise they will match
directly. This doesn’t matter much for value since there
are only a few possible substitution regexes you can use.
str, regex and numeric rules apply as above.
dict:
Dicts can be used to specify different replacement values
for different existing values. For example,
{'a':'b','y':'z'} replaces the value ‘a’ with ‘b’ and
‘y’ with ‘z’. To use a dict in this way, the optional value
parameter should not be given.
For a DataFrame a dict can specify that different values
should be replaced in different columns. For example,
{'a':1,'b':'z'} looks for the value 1 in column ‘a’
and the value ‘z’ in column ‘b’ and replaces these values
with whatever is specified in value. The value parameter
should not be None in this case. You can treat this as a
special case of passing two lists except that you are
specifying the column to search in.
For a DataFrame nested dictionaries, e.g.,
{'a':{'b':np.nan}}, are read as follows: look in column
‘a’ for the value ‘b’ and replace it with NaN. The optional value
parameter should not be specified to use a nested dict in this
way. You can nest regular expressions as well. Note that
column names (the top-level dictionary keys in a nested
dictionary) cannot be regular expressions.
None:
This means that the regex argument must be a string,
compiled regular expression, or list, dict, ndarray or
Series of such elements. If value is also None then
this must be a nested dictionary or Series.
See the examples section for examples of each of these.
valuescalar, dict, list, str, regex, default None
Value to replace any values matching to_replace with.
For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be
filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplacebool, default False
If True, performs operation inplace and returns None.
limitint, default None
Maximum size gap to forward or backward fill.
Deprecated since version 2.1.0.
regexbool or same types as to_replace, default False
Whether to interpret to_replace and/or value as regular
expressions. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
to_replace must be None.
method{‘pad’, ‘ffill’, ‘bfill’}
The method to use for replacement when to_replace is a
scalar, list or tuple and value is None.
Series.fillna : Fill NA values.
DataFrame.fillna : Fill NA values.
Series.where : Replace values based on boolean condition.
DataFrame.where : Replace values based on boolean condition.
DataFrame.map: Apply a function to a Dataframe elementwise.
Series.map: Map values of Series according to an input mapping or function.
Series.str.replace : Simple string replacement.
Regex substitution is performed under the hood with re.sub. The
rules for substitution for re.sub are the same.
Regular expressions will only substitute on strings, meaning you
cannot provide, for example, a regular expression matching floating
point numbers and expect the columns in your frame that have a
numeric dtype to be matched. However, if those floating point
numbers are strings, then you can do this.
This method has a lot of options. You are encouraged to experiment
and play with this method to gain intuition about how it works.
When dict is used as the to_replace value, it is like
key(s) in the dict are the to_replace part and
value(s) in the dict are the value parameter.
>>> df.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e

>>> df.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e

>>> df.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e
Regular expression `to_replace`
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz

>>> df.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz

>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz

>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz
Compare the behavior of s.replace({'a':None}) and
s.replace('a',None) to understand the peculiarities
of the to_replace parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
When one uses a dict as the to_replace value, it is like the
value(s) in the dict are equal to the value parameter.
s.replace({'a':None}) is equivalent to
s.replace(to_replace={'a':None},value=None,method=None):
When value is not explicitly passed and to_replace is a scalar, list
or tuple, replace uses the method parameter (default ‘pad’) to do the
replacement. So this is why the ‘a’ values are being replaced by 10
in rows 1 and 2 and ‘b’ in row 4 in this case.
>>> s.replace('a')
0    10
1    10
2    10
3     b
4     b
dtype: object
Deprecated since version 2.1.0: The ‘method’ parameter and padding behavior are deprecated.
On the other hand, if None is explicitly passed for value, it will
be respected:
Convenience method for frequency conversion and resampling of time series.
The object must have a datetime-like index (DatetimeIndex, PeriodIndex,
or TimedeltaIndex), or the caller must pass the label of a datetime-like
series/index to the on/level keyword parameter.
The offset string or object representing target conversion.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
Which axis to use for up- or down-sampling. For Series this parameter
is unused and defaults to 0. Must be
DatetimeIndex, TimedeltaIndex or PeriodIndex.
Deprecated since version 2.0.0: Use frame.T.resample(…) instead.
closed{‘right’, ‘left’}, default None
Which side of bin interval is closed. The default is ‘left’
for all frequency offsets except for ‘ME’, ‘YE’, ‘QE’, ‘BME’,
‘BA’, ‘BQE’, and ‘W’ which all have a default of ‘right’.
label{‘right’, ‘left’}, default None
Which bin edge label to label bucket with. The default is ‘left’
for all frequency offsets except for ‘ME’, ‘YE’, ‘QE’, ‘BME’,
‘BA’, ‘BQE’, and ‘W’ which all have a default of ‘right’.
Pass ‘timestamp’ to convert the resulting index to a
DateTimeIndex or ‘period’ to convert it to a PeriodIndex.
By default the input representation is retained.
Deprecated since version 2.2.0: Convert index to desired type explicitly instead.
onstr, optional
For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
levelstr or int, optional
For a MultiIndex, level (name or number) to use for
resampling. level must be datetime-like.
originTimestamp or str, default ‘start_day’
The timestamp on which to adjust the grouping. The timezone of origin
must match the timezone of the index.
If string, must be one of the following:
‘epoch’: origin is 1970-01-01
‘start’: origin is the first value of the timeseries
‘start_day’: origin is the first day at midnight of the timeseries
‘end’: origin is the last value of the timeseries
‘end_day’: origin is the ceiling midnight of the last day
Added in version 1.3.0.
Note
Only takes effect for Tick-frequencies (i.e. fixed frequencies like
days, hours, and minutes, rather than months or quarters).
offsetTimedelta or str, default is None
An offset timedelta added to the origin.
group_keysbool, default False
Whether to include the group keys in the result index when using
.apply() on the resampled object.
Added in version 1.5.0: Not specifying group_keys will retain values-dependent behavior
from pandas 1.4 and earlier (see pandas 1.5.0 Release notes for examples).
Changed in version 2.0.0: group_keys now defaults to False.
Series.resample : Resample a Series.
DataFrame.resample : Resample a DataFrame.
groupby : Group Series/DataFrame by mapping, function, label, or list of labels.
asfreq : Reindex a Series/DataFrame with the given frequency without grouping.
Downsample the series into 3 minute bins as above, but label each
bin using the right edge instead of the left. Please note that the
value in the bucket used as the label is not included in the bucket,
which it labels. For example, in the original series the
bucket 2000-01-01 00:03:00 contains the value 3, but the summed
value in the resampled bucket with the label 2000-01-01 00:03:00
does not include 3 (if it did, the summed value would be 6, not 3).
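A sketch of that downsampling, assuming a nine-point minute series:

>>> index = pd.date_range('1/1/2000', periods=9, freq='min')
>>> series = pd.Series(range(9), index=index)
>>> series.resample('3min', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3min, dtype: int64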
In contrast with the start_day, you can use end_day to take the ceiling
midnight of the largest Timestamp as the end of the bins and drop the bins
not containing data:
Using the given string, rename the DataFrame column which contains the
index data. If the DataFrame has a MultiIndex, this has to be a list or
tuple with length equal to the number of levels.
DataFrame.set_index : Opposite of reset_index.
DataFrame.reindex : Change to new indices or expand indices.
DataFrame.reindex_like : Change to same indices as other DataFrame.
>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
When we reset the index, the old index is added as a column, and a
new sequential index is used:
>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
We can use the drop parameter to avoid the old index being added as
a column:
>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN
You can also use reset_index with MultiIndex.
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    (24.0, 'fly'),
...                    (80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump
Using the names parameter, choose a name for the index column:
>>> df.reset_index(names=['classes', 'names'])
  classes   names  speed species
                     max    type
0    bird  falcon  389.0     fly
1    bird  parrot   24.0     fly
2  mammal    lion   80.5     run
3  mammal  monkey    NaN    jump
If the index has multiple levels, we can reset a subset of them:
>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
If we are not dropping the index, by default, it is placed in the top
level. We can place it in another level:
>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
When the index is inserted under another level, we can specify under
which one with the parameter col_fill:
>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
If we specify a nonexistent level for col_fill, it is created:
>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
windowint, timedelta, str, offset, or BaseIndexer subclass
Size of the moving window.
If an integer, the fixed number of observations used for
each window.
If a timedelta, str, or offset, the time period of each window. Each
window will be a variable sized based on the observations included in
the time-period. This is only valid for datetimelike indexes.
To learn more about the offsets & frequency strings, please see this link.
If a BaseIndexer subclass, the window boundaries
based on the defined get_window_bounds method. Additional rolling
keyword arguments, namely min_periods, center, closed and
step will be passed to get_window_bounds.
min_periodsint, default None
Minimum number of observations in window required to have a value;
otherwise, result is np.nan.
For a window that is specified by an offset, min_periods will default to 1.
For a window that is specified by an integer, min_periods will default
to the size of the window.
centerbool, default False
If False, set the window labels as the right edge of the window index.
If True, set the window labels as the center of the window index.
Certain Scipy window types require additional parameters to be passed
in the aggregation function. The additional parameters must match
the keywords specified in the Scipy window type method signature.
onstr, optional
For a DataFrame, a column label or Index level on which
to calculate the rolling window, rather than the DataFrame’s index.
Provided integer column is ignored and excluded from result since
an integer index is not used to calculate the rolling window.
axisint or str, default 0
If 0 or 'index', roll across the rows.
If 1 or 'columns', roll across the columns.
For Series this parameter is unused and defaults to 0.
Deprecated since version 2.1.0: The axis keyword is deprecated. For axis=1,
transpose the DataFrame first instead.
closedstr, default None
If 'right', the first point in the window is excluded from calculations.
If 'left', the last point in the window is excluded from calculations.
If 'both', no points in the window are excluded from calculations.
If 'neither', the first and last points in the window are excluded
from calculations.
Default None ('right').
step : int, default None
Added in version 1.5.0.
Evaluate the window at every step result, equivalent to slicing as
[::step]. window must be an integer. Using a step argument other
than None or 1 will produce a result with a different shape than the input.
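A brief sketch of window and min_periods on hypothetical data:

>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df.rolling(window=2).sum()
     B
0  NaN
1  1.0
2  3.0
3  NaN
4  NaN
>>> df.rolling(window=2, min_periods=1).sum()
     B
0  0.0
1  1.0
2  3.0
3  2.0
4  4.0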
Number of decimal places to round each column to. If an int is
given, round each column to the same number of places.
Otherwise dict and Series round to variable numbers of places.
Column names should be in the keys if decimals is a
dict-like, or in the index if decimals is a Series. Any
columns not included in decimals will be left as is. Elements
of decimals which are not columns of the input will be
ignored.
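For example, passing a per-column dict on a hypothetical frame:

>>> df = pd.DataFrame({'dogs': [0.21, 0.01], 'cats': [0.32, 0.45]})
>>> df.round({'dogs': 1, 'cats': 0})
   dogs  cats
0   0.2   0.0
1   0.0   0.0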
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Number of items from axis to return. Cannot be used with frac.
Default = 1 if frac = None.
fracfloat, optional
Fraction of axis items to return. Cannot be used with n.
replacebool, default False
Allow or disallow sampling of the same row more than once.
weightsstr or ndarray-like, optional
Default ‘None’ results in equal probability weighting.
If passed a Series, will align with target object on index. Index
values in weights not found in sampled object will be ignored and
index values in sampled object not in weights will be assigned
weights of zero.
If called on a DataFrame, will accept the name of a column
when axis = 0.
Unless weights are a Series, weights must be same length as axis
being sampled.
If weights do not sum to 1, they will be normalized to sum to 1.
Missing values in the weights column will be treated as zero.
Infinite values not allowed.
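For example, weighting rows by an existing column of a hypothetical frame (which rows are drawn depends on random_state):

>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_specimen_seen
falcon         2                 10
fish           0                  8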
To select all numeric types, use np.number or 'number'
To select strings you must use the object dtype, but note that
this will return all object dtype columns. With
pd.options.future.infer_string enabled, using "str" will
work to select all string columns.
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sem with axis=None is deprecated,
in a future version this will reduce over both axes and return a scalar
To retain the old behavior, pass axis=0 (or do not pass axis).
skipnabool, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
ddofint, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
numeric_onlybool, default False
Include only float, int, boolean columns. Not implemented for Series.
The axis to update. The value 0 identifies the rows. For Series
this parameter is unused and defaults to 0.
copybool, default True
Whether to make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
allows_duplicate_labelsbool, optional
Whether the returned object allows duplicate labels.
This method returns a new object that’s a view on the same data
as the input. Mutating the input or the output values will be reflected
in the other.
This method is intended to be used in method chains.
“Flags” differ from “metadata”. Flags reflect properties of the
pandas object (the Series or DataFrame). Metadata refer to properties
of the dataset, and should be stored in DataFrame.attrs.
Set the DataFrame index (row labels) using one or more existing
columns or arrays (of the correct length). The index can replace the
existing index or expand on it.
This parameter can be either a single column key, a single array of
the same length as the calling DataFrame, or a list containing an
arbitrary combination of column keys and arrays. Here, “array”
encompasses Series, Index, np.ndarray, and
instances of Iterator.
dropbool, default True
Delete columns to be used as the new index.
appendbool, default False
Whether to append columns to existing index.
inplacebool, default False
Whether to modify the DataFrame rather than creating a new one.
verify_integritybool, default False
Check the new index for duplicates. Otherwise defer the check until
necessary. Setting to False will improve the performance of this
method.
DataFrame.reset_index : Opposite of set_index.
DataFrame.reindex : Change to new indices or expand indices.
DataFrame.reindex_like : Change to same indices as other DataFrame.
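For example, building a MultiIndex from two existing columns of a hypothetical frame:

>>> df = pd.DataFrame({'month': [1, 4, 7, 10],
...                    'year': [2012, 2014, 2013, 2014],
...                    'sale': [55, 40, 84, 31]})
>>> df.set_index(['year', 'month'])
            sale
year month
2012 1        55
2014 4        40
2013 7        84
2014 10       31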
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data.
If freq is passed (in this case, the index must be date or datetime,
or it will raise a NotImplementedError), the index will be
increased using the periods and the freq. freq can be inferred
when specified as “infer” as long as either freq or inferred_freq
attribute is set in the index.
Number of periods to shift. Can be positive or negative.
If an iterable of ints, the data will be shifted once by each int.
This is equivalent to shifting by one value at a time and
concatenating all resulting frames. The resulting columns will have
the shift suffixed to their column names. For multiple periods,
axis must not be 1.
freqDateOffset, tseries.offsets, timedelta, or str, optional
Offset to use from the tseries module or time rule (e.g. ‘EOM’).
If freq is specified then the index values are shifted but the
data is not realigned. That is, use freq if you would like to
extend the index when shifting and preserve the original data.
If freq is specified as “infer” then it will be inferred from
the freq or inferred_freq attributes of the index. If neither of
those attributes exist, a ValueError is thrown.
axis{0 or ‘index’, 1 or ‘columns’, None}, default None
Shift direction. For Series this parameter is unused and defaults to 0.
fill_valueobject, optional
The scalar value to use for newly introduced missing values.
The default depends on the dtype of self.
For numeric data, np.nan is used.
For datetime, timedelta, or period data, etc. NaT is used.
For extension dtypes, self.dtype.na_value is used.
suffixstr, optional
If str and periods is an iterable, this is added after the column
name and before the shift value for each shifted column name.
>>> df.shift(periods=3)
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
Choice of sorting algorithm. See also numpy.sort() for more
information. mergesort and stable are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position{‘first’, ‘last’}, default ‘last’
Puts NaNs at the beginning if first; last puts NaNs at the end.
Not implemented for MultiIndex.
sort_remainingbool, default True
If True and sorting by level and index is multilevel, sort by other
levels too (in order) after sorting by specified level.
ignore_indexbool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
keycallable, optional
If not None, apply the key function to the index values
before sorting. This is similar to the key argument in the
builtin sorted() function, with the notable difference that
this key function should be vectorized. It should expect an
Index and return an Index of the same shape. For MultiIndex
inputs, the key is applied per level.
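For example, a vectorized key that sorts a string index case-insensitively (hypothetical data):

>>> df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4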
Choice of sorting algorithm. See also numpy.sort() for more
information. mergesort and stable are the only stable algorithms. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position{‘first’, ‘last’}, default ‘last’
Puts NaNs at the beginning if first; last puts NaNs at the
end.
ignore_indexbool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
keycallable, optional
Apply the key function to the values
before sorting. This is similar to the key argument in the
builtin sorted() function, with the notable difference that
this key function should be vectorized. It should expect a
Series and return a Series with the same shape as the input.
It will be applied independently to each column listed in by.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
This method is most useful when you don’t know if your
object is a Series or DataFrame, but you do know it has just a single
column. In that case you can safely call squeeze to ensure you have a
Series.
Series.iloc : Integer-location based indexing for selecting scalars.
DataFrame.iloc : Integer-location based indexing for selecting Series.
Series.to_frame : Inverse of DataFrame.squeeze for a single pandas object.
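A sketch on a hypothetical frame, squeezing a single-column selection to a Series and a single-element selection to a scalar:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df[['a']].squeeze('columns')
0    1
1    2
Name: a, dtype: int64
>>> df.loc[df['a'] == 1, 'b'].squeeze()
3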
Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level
index with one or more new inner-most levels compared to the current
DataFrame. The new inner-most levels are created by pivoting the
columns of the current dataframe:
if the columns have a single level, the output is a Series;
if the columns have multiple levels, the new index
level(s) is (are) taken from the prescribed level(s) and
the output is a DataFrame.
Level(s) to stack from the column axis onto the index
axis, defined as one index or label, or a list of indices
or labels.
dropnabool, default True
Whether to drop rows in the resulting Frame/Series with
missing values. Stacking a column level onto the index
axis can create combinations of index and column values
that are missing from the original dataframe. See Examples
section.
sortbool, default True
Whether to sort the levels of the resulting MultiIndex.
future_stackbool, default False
Whether to use the new implementation that will replace the current
implementation in pandas 3.0. When True, dropna and sort have no impact
on the result and must remain unspecified. See pandas 2.1.0 Release
notes for more details.
The function is named by analogy with a collection of books
being reorganized from being side by side on a horizontal
position (the columns of the dataframe) to being stacked
vertically on top of each other (in the index of the
dataframe).
It is common to have missing values when stacking a dataframe
with multi-level columns, as the stacked dataframe typically
has more values than the original dataframe. Missing values
are filled with NaNs:
>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack(future_stack=True)
        weight  height
cat kg     1.0     NaN
    m      NaN     2.0
dog kg     3.0     NaN
    m      NaN     4.0
Prescribing the level(s) to be stacked
The first parameter controls which level or levels are stacked:
>>> df_multi_level_cols2.stack(0, future_stack=True)
             kg    m
cat weight  1.0  NaN
    height  NaN  2.0
dog weight  3.0  NaN
    height  NaN  4.0
>>> df_multi_level_cols2.stack([0, 1], future_stack=True)
cat  weight  kg    1.0
     height  m     2.0
dog  weight  kg    3.0
     height  m     4.0
dtype: float64
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.std with axis=None is deprecated,
in a future version this will reduce over both axes and return a scalar
To retain the old behavior, pass axis=0 (or do not pass axis).
skipnabool, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
ddofint, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
numeric_onlybool, default False
Include only float, int, boolean columns. Not implemented for Series.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Axis for the function to be applied on.
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.sum with axis=None is deprecated,
in a future version this will reduce over both axes and return a scalar
To retain the old behavior, pass axis=0 (or do not pass axis).
Added in version 2.0.0.
skipnabool, default True
Exclude NA/null values when computing the result.
numeric_onlybool, default False
Include only float, int, boolean columns. Not implemented for Series.
min_countint, default 0
The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
Series.sum : Return the sum.
Series.min : Return the minimum.
Series.max : Return the maximum.
Series.idxmin : Return the index of the minimum.
Series.idxmax : Return the index of the maximum.
DataFrame.sum : Return the sum over the requested axis.
DataFrame.min : Return the minimum over the requested axis.
DataFrame.max : Return the maximum over the requested axis.
DataFrame.idxmin : Return the index of the minimum over the requested axis.
DataFrame.idxmax : Return the index of the maximum over the requested axis.
>>> df = pd.DataFrame(
...     {"Grade": ["A", "B", "A", "C"]},
...     index=[
...         ["Final exam", "Final exam", "Coursework", "Coursework"],
...         ["History", "Geography", "History", "Geography"],
...         ["January", "February", "March", "April"],
...     ],
... )
>>> df
                                    Grade
Final exam  History     January         A
            Geography   February        B
Coursework  History     March           A
            Geography   April           C
In the following example, we will swap the levels of the indices.
Here, we will swap the levels column-wise, but levels can be swapped row-wise
in a similar manner. Note that column-wise is the default behaviour.
By not supplying any arguments for i and j, we swap the last and second to
last indices.
>>> df.swaplevel()
                                    Grade
Final exam  January     History         A
            February    Geography       B
Coursework  March       History         A
            April       Geography       C
By supplying one argument, we can choose which index to swap the last
index with. We can for example swap the first index with the last one as
follows.
>>> df.swaplevel(0)
                                    Grade
January    History    Final exam        A
February   Geography  Final exam        B
March      History    Coursework        A
April      Geography  Coursework        C
We can also define explicitly which indices we want to swap by supplying values
for both i and j. Here, we for example swap the first and second indices.
>>> df.swaplevel(0, 1)
                                    Grade
History    Final exam  January          A
Geography  Final exam  February         B
History    Coursework  March            A
Geography  Coursework  April            C
This function returns last n rows from the object based on
position. It is useful for quickly verifying data, for example,
after sorting or appending rows.
For negative values of n, this function returns all rows except
the first |n| rows, equivalent to df[|n|:].
If n is larger than the number of rows, this function returns all rows.
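For example, on a hypothetical five-row frame:

>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion', 'monkey']})
>>> df.tail(3)
   animal
2  falcon
3    lion
4  monkey
>>> df.tail(-3)
   animal
3    lion
4  monkey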
Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in
the index attribute of the object. We are indexing according to the
actual position of the element in the object.
An array of ints indicating which positions to take.
axis{0 or ‘index’, 1 or ‘columns’, None}, default 0
The axis on which to select elements. 0 means that we are
selecting rows, 1 means that we are selecting columns.
For Series this parameter is unused and defaults to 0.
DataFrame.loc : Select a subset of a DataFrame by labels.
DataFrame.iloc : Select a subset of a DataFrame by positions.
numpy.take : Take elements from an array along an axis.
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
Take elements at positions 0 and 3 along the axis 0 (default).
Note how the actual indices selected (0 and 1) do not correspond to
our selected indices 0 and 3. That’s because we are selecting the 0th
and 3rd rows, not rows whose indices equal 0 and 3.
>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN
Take elements at indices 1 and 2 along the axis 1 (column selection).
>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN
We may take elements using negative integers for positive indices,
starting from the end of the object, just like with Python lists.
>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
path_or_bufstr, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like
object implementing a write() function. If None, the result is
returned as a string. If a non-binary file object is passed, it should
be opened with newline=’’, disabling universal newlines. If a binary
file object is passed, mode might need to contain a ‘b’.
sepstr, default ‘,’
String of length 1. Field delimiter for the output file.
na_repstr, default ‘’
Missing data representation.
float_formatstr, Callable, default None
Format string for floating point numbers. If a Callable is given, it takes
precedence over other numeric formatting parameters, like decimal.
columnssequence, optional
Columns to write.
headerbool or list of str, default True
Write out the column names. If a list of strings is given it is
assumed to be aliases for the column names.
indexbool, default True
Write row names (index).
index_labelstr or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and
header and index are True, then the index names are used. A
sequence should be given if the object uses MultiIndex. If
False do not print fields for index names. Use index_label=False
for easier importing in R.
mode{‘w’, ‘x’, ‘a’}, default ‘w’
Forwarded to either open(mode=) or fsspec.open(mode=) to control
the file opening. Typical values include:
‘w’, truncate the file first.
‘x’, exclusive creation, failing if the file already exists.
‘a’, append to the end of file if it exists.
encodingstr, optional
A string representing the encoding to use in the output file,
defaults to ‘utf-8’. encoding is not supported if path_or_buf
is a non-binary file object.
compressionstr or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
May be a dict with key ‘method’ as compression mode
and other entries as additional compression options if
compression mode is ‘zip’.
Passing compression options as keys in dict is
supported for compression modes ‘gzip’, ‘bz2’, ‘zstd’, and ‘zip’.
quotingoptional constant from csv module
Defaults to csv.QUOTE_MINIMAL. If you have set a float_format
then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
will treat them as non-numeric.
quotecharstr, default ‘"’
String of length 1. Character used to quote fields.
lineterminatorstr, optional
The newline character or character sequence to use in the output
file. Defaults to os.linesep, which depends on the OS in which
this method is called (e.g. ‘\n’ for Linux, ‘\r\n’ for Windows).
Changed in version 1.5.0: Previously was line_terminator, changed for consistency with
read_csv and the standard library ‘csv’ module.
chunksizeint or None
Rows to write at a time.
date_formatstr, default None
Format string for datetime objects.
doublequotebool, default True
Control quoting of quotechar inside a field.
escapecharstr, default None
String of length 1. Character used to escape sep and quotechar
when appropriate.
decimalstr, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for
European data.
errorsstr, default ‘strict’
Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list
of options.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
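To tie the path, index, and compression options above together, the following is a minimal, non-authoritative sketch; the file name and DataFrame contents are invented for illustration.

import pandas as pd

df = pd.DataFrame({"name": ["Raphael", "Donatello"],
                   "mask": ["red", "purple"]})

# Write a reproducible gzip-compressed CSV; compresslevel and mtime are
# forwarded to gzip.GzipFile, as described for the compression dict above.
df.to_csv(
    "out.csv.gz",
    index=False,
    compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
)

# With path_or_buf=None the CSV text is returned as a string instead.
csv_text = df.to_csv(index=False)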
‘records’ : list like
[{column -> value}, … , {column -> value}]
‘index’ : dict like {index -> {column -> value}}
Added in version 1.4.0: ‘tight’ as an allowed value for the orient argument
intoclass, default dict
The collections.abc.MutableMapping subclass used for all Mappings
in the return value. Can be the actual class or an empty
instance of the mapping type you want. If you want a
collections.defaultdict, you must pass it initialized.
indexbool, default True
Whether to include the index item (and index_names item if orient
is ‘tight’) in the returned dictionary. Can only be False
when orient is ‘split’ or ‘tight’.
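As a small sketch of how orient and into interact (the values here are arbitrary):

import pandas as pd
from collections import OrderedDict, defaultdict

df = pd.DataFrame({"col1": [1, 2], "col2": [0.5, 0.75]}, index=["row1", "row2"])

# 'records' orient: one mapping per row.
df.to_dict(orient="records")

# Use a different MutableMapping subclass for the result.
df.to_dict(into=OrderedDict)

# A defaultdict must be passed initialized.
df.to_dict(into=defaultdict(list))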
To write a single object to an Excel .xlsx file it is only necessary to
specify a target file name. To write to multiple sheets it is necessary to
create an ExcelWriter object with a target file name, and specify a sheet
in the file to write to.
Multiple sheets may be written to by specifying unique sheet_name.
With all data written to the file it is necessary to save the changes.
Note that creating an ExcelWriter object with a file name that already
exists will result in the contents of the existing file being erased.
excel_writerpath-like, file-like, or ExcelWriter object
File path or existing ExcelWriter.
sheet_namestr, default ‘Sheet1’
Name of sheet which will contain DataFrame.
na_repstr, default ‘’
Missing data representation.
float_formatstr, optional
Format string for floating point numbers. For example
float_format="%.2f" will format 0.1234 to 0.12.
columnssequence or list of str, optional
Columns to write.
headerbool or list of str, default True
Write out the column names. If a list of strings is given it is
assumed to be aliases for the column names.
indexbool, default True
Write row names (index).
index_labelstr or sequence, optional
Column label for index column(s) if desired. If not specified, and
header and index are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex.
startrowint, default 0
Upper left cell row to dump data frame.
startcolint, default 0
Upper left cell column to dump data frame.
enginestr, optional
Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this
via the options io.excel.xlsx.writer or
io.excel.xlsm.writer.
merge_cellsbool, default True
Write MultiIndex and Hierarchical Rows as merged cells.
inf_repstr, default ‘inf’
Representation for infinity (there is no native representation for
infinity in Excel).
freeze_panestuple of int (length 2), optional
Specifies the one-based bottommost row and rightmost column that
is to be frozen.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
Added in version 1.2.0.
engine_kwargsdict, optional
Arbitrary keyword arguments passed to excel engine.
to_csv : Write DataFrame to a comma-separated values (csv) file.
ExcelWriter : Class for writing DataFrame objects into excel sheets.
read_excel : Read an Excel file into a pandas DataFrame.
read_csv : Read a comma-separated values (csv) file into DataFrame.
io.formats.style.Styler.to_excel : Add styles to Excel sheet.
To set the library that is used to write the Excel file,
you can pass the engine keyword (the default engine is
automatically chosen depending on the file extension):
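For instance, in a minimal sketch (file and sheet names are arbitrary, and the xlsxwriter engine must be installed for the explicit-engine call):

import pandas as pd

df1 = pd.DataFrame([["a", "b"], ["c", "d"]],
                   index=["row 1", "row 2"],
                   columns=["col 1", "col 2"])

df1.to_excel("output.xlsx")  # default engine chosen from the extension
df1.to_excel("output1.xlsx", engine="xlsxwriter")  # explicit engine

# Multiple sheets: reuse one ExcelWriter and give each sheet a unique name.
# Note that writing to an existing file name erases its prior contents.
df2 = df1.copy()
with pd.ExcelWriter("output.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet_name_1")
    df2.to_excel(writer, sheet_name="Sheet_name_2")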
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. If a string or a path,
it will be used as Root Directory path when writing a partitioned dataset.
This function writes the dataframe as a feather file. Requires a default
index. For saving the DataFrame with your custom index use a method that
supports custom indices e.g. to_parquet.
Changed in version 1.5.0: Default value is changed to True. Google has deprecated the
auth_local_webserver = False “out of band” (copy-paste) flow.
table_schemalist of dicts, optional
List of BigQuery table fields to which the corresponding DataFrame
columns conform, e.g. [{'name':'col1','type':'STRING'},...]. If schema is not provided, it will be
generated according to dtypes of DataFrame columns. See
BigQuery API documentation on available names of a field.
New in version 0.3.1 of pandas-gbq.
locationstr, optional
Location where the load job should run. See the BigQuery locations
documentation for a
list of available locations. The location must match that of the
target dataset.
New in version 0.5.0 of pandas-gbq.
progress_barbool, default True
Use the library tqdm to show the progress bar for the upload,
chunk by chunk.
Credentials for accessing Google APIs. Use this parameter to
override default credentials, such as to use Compute Engine
google.auth.compute_engine.Credentials or Service
Account google.oauth2.service_account.Credentials
directly.
Write the contained data to an HDF5 file using HDFStore.
Hierarchical Data Format (HDF) is self-describing, allowing an
application to interpret the structure and contents of a file with
no outside information. One HDF file can hold a mix of related objects
which can be accessed as a group or as individual objects.
In order to add another DataFrame or Series to an existing HDF file
please use append mode and a different key.
Warning
One can store a subclass of DataFrame or Series to HDF5,
but the type of the subclass is lost upon storing.
Specifies the compression library to be used.
These additional compressors for Blosc are supported
(default if no compressor specified: ‘blosc:blosclz’):
{‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’,
‘blosc:zlib’, ‘blosc:zstd’}.
Specifying a compression library which is not available issues
a ValueError.
appendbool, default False
For Table formats, append the input data to the existing.
format{‘fixed’, ‘table’, None}, default ‘fixed’
Possible values:
‘fixed’: Fixed format. Fast writing/reading. Not-appendable,
nor searchable.
‘table’: Table format. Write as a PyTables Table structure
which may perform worse but allow more flexible operations
like searching / selecting subsets of the data.
If None, pd.get_option(‘io.hdf.default_format’) is checked,
followed by fallback to “fixed”.
indexbool, default True
Write DataFrame index as a column.
min_itemsizedict or int, optional
Map column names to minimum string sizes for columns.
nan_repAny, optional
How to represent null values as str.
Not allowed with append=True.
dropnabool, default False, optional
Remove missing values.
data_columnslist of columns or True, optional
List of columns to create as indexed data columns for on-disk
queries, or True to use all columns. By default only the axes
of the object are indexed. See
Query via data columns for more information.
Applicable only to format=’table’.
errorsstr, default ‘strict’
Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list
of options.
read_hdf : Read from HDF file.
DataFrame.to_orc : Write a DataFrame to the binary orc format.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
DataFrame.to_sql : Write to a SQL table.
DataFrame.to_feather : Write out feather-format for DataFrames.
DataFrame.to_csv : Write out to a csv file.
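A minimal sketch of the key/append workflow described above (requires the optional PyTables dependency; the file name and keys are arbitrary):

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])

# 'table' format is slower to write but appendable and queryable on disk.
df.to_hdf("data.h5", key="df", mode="w", format="table")

# Add a second object to the same file under a different key.
s = pd.Series([1, 2, 3, 4])
s.to_hdf("data.h5", key="s", mode="a")

pd.read_hdf("data.h5", "df")
pd.read_hdf("data.h5", "s")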
bufstr, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
columnsarray-like, optional, default None
The subset of columns to write. Writes all columns by default.
col_spacestr or int, list or dict of int or str, optional
The minimum width of each column in CSS length units. An int is assumed to be px units.
headerbool, optional
Whether to print column labels, default True.
indexbool, optional, default True
Whether to print index (row) labels.
na_repstr, optional, default ‘NaN’
String representation of NaN to use.
formatterslist, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns’ elements by position or
name.
The result of each function must be a unicode string.
List/tuple must be of length equal to the number of columns.
Formatter function to apply to columns’ elements if they are
floats. This function must return a unicode string and will be
applied only to the non-NaN elements, with NaN being
handled by na_rep.
sparsifybool, optional, default True
Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row.
index_namesbool, optional, default True
Prints the names of the indexes.
justifystr, default None
How to justify the column labels. If None uses the option from
the print configuration (controlled by set_option), ‘right’ out
of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rowsint, optional
Maximum number of rows to display in the console.
max_colsint, optional
Maximum number of columns to display in the console.
show_dimensionsbool, default False
Display DataFrame dimensions (number of rows by number of columns).
decimalstr, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe.
bold_rowsbool, default True
Make the row labels bold in the output.
classesstr or list or tuple, default None
CSS class(es) to apply to the resulting html table.
escapebool, default True
Convert the characters <, >, and & to HTML-safe sequences.
notebook{True, False}, default False
Whether the generated HTML is for IPython Notebook.
borderint
A border=border attribute is included in the opening
<table> tag. Default pd.options.display.html.border.
table_idstr, optional
A css id is included in the opening <table> tag if specified.
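A brief sketch of how a few of these rendering options combine (the class names and id are invented for illustration):

import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [4, 3]})

# classes and table_id end up on the opening <table> tag; border=0 overrides
# the default border attribute, and na_rep controls how missing values render.
html = df.to_html(classes=["table", "table-striped"], table_id="my-table",
                  border=0, na_rep="-")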
‘records’ : list like [{column -> value}, … , {column -> value}]
‘index’ : dict like {index -> {column -> value}}
‘columns’ : dict like {column -> {index -> value}}
‘values’ : just the values array
‘table’ : dict like {‘schema’: {schema}, ‘data’: {data}}
Describing the data, where data component is like orient='records'.
date_format{None, ‘epoch’, ‘iso’}
Type of date conversion. ‘epoch’ = epoch milliseconds,
‘iso’ = ISO8601. The default depends on the orient. For
orient='table', the default is ‘iso’. For all other orients,
the default is ‘epoch’.
double_precisionint, default 10
The number of decimal places to use when encoding
floating point values. The possible maximal value is 15.
Passing double_precision greater than 15 will raise a ValueError.
force_asciibool, default True
Force encoded string to be ASCII.
date_unitstr, default ‘ms’ (milliseconds)
The time unit to encode to, governs timestamp and ISO8601
precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond,
microsecond, and nanosecond respectively.
default_handlercallable, default None
Handler to call if object cannot otherwise be converted to a
suitable format for JSON. Should receive a single argument which is
the object to convert and return a serialisable object.
linesbool, default False
If ‘orient’ is ‘records’ write out line-delimited json format. Will
throw ValueError if incorrect ‘orient’ since others are not
list-like.
compressionstr or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buf’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
indexbool or None, default None
The index is only used when ‘orient’ is ‘split’, ‘index’, ‘column’,
or ‘table’. Of these, ‘index’ and ‘column’ do not support
index=False.
indentint, optional
Length of whitespace used to indent each record.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
modestr, default ‘w’ (writing)
Specify the IO mode for output when supplying a path_or_buf.
Accepted args are ‘w’ (writing) and ‘a’ (append) only.
mode=’a’ is only supported when lines is True and orient is ‘records’.
The behavior of indent=0 varies from the stdlib, which does not
indent the output but does insert newlines. Currently, indent=0
and the default indent=None are equivalent in pandas, though this
may change in a future release.
orient='table' contains a ‘pandas_version’ field under ‘schema’.
This stores the version of pandas used in the latest revision of the
schema.
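A small sketch of the lines and orient behaviour noted above (the data is arbitrary):

import pandas as pd

df = pd.DataFrame({"col 1": ["a", "c"], "col 2": ["b", "d"]})

# Line-delimited JSON: one record per line; only valid with orient='records'.
print(df.to_json(orient="records", lines=True))

# orient='table' embeds a JSON Table Schema (including the 'pandas_version' field).
print(df.to_json(orient="table", indent=2))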
bufstr, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
columnslist of label, optional
The subset of columns to write. Writes all columns by default.
headerbool or list of str, default True
Write out the column names. If a list of strings is given,
it is assumed to be aliases for the column names.
indexbool, default True
Write row names (index).
na_repstr, default ‘NaN’
Missing data representation.
formatterslist of functions or dict of {str: function}, optional
Formatter functions to apply to columns’ elements by position or
name. The result of each function must be a unicode string.
List must be of length equal to the number of columns.
float_formatone-parameter function or str, optional, default None
Formatter for floating point numbers. For example
float_format="%.2f" and float_format="{{:0.2f}}".format will
both result in 0.1234 being formatted as 0.12.
sparsifybool, optional
Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row. By default, the value will be
read from the config module.
index_namesbool, default True
Prints the names of the indexes.
bold_rowsbool, default False
Make the row labels bold in the output.
column_formatstr, optional
The columns format as specified in LaTeX table format e.g. ‘rcl’ for 3
columns. By default, ‘l’ will be used for all columns except
columns of numbers, which default to ‘r’.
longtablebool, optional
Use a longtable environment instead of tabular. Requires
adding a \usepackage{longtable} to your LaTeX preamble.
By default, the value will be read from the pandas config
module, and set to True if the option styler.latex.environment is
“longtable”.
Changed in version 2.0.0: The pandas option affecting this argument has changed.
escapebool, optional
By default, the value will be read from the pandas config
module and set to True if the option styler.format.escape is
“latex”. When set to False prevents from escaping latex special
characters in column names.
Changed in version 2.0.0: The pandas option affecting this argument has changed, as has the
default value to False.
encodingstr, optional
A string representing the encoding to use in the output file,
defaults to ‘utf-8’.
decimalstr, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe.
multicolumnbool, default True
Use multicolumn to enhance MultiIndex columns.
The default will be read from the config module, and is set
as the option styler.sparse.columns.
Changed in version 2.0.0: The pandas option affecting this argument has changed.
multicolumn_formatstr, default ‘r’
The alignment for multicolumns, similar to column_format
The default will be read from the config module, and is set as the option
styler.latex.multicol_align.
Changed in version 2.0.0: The pandas option affecting this argument has changed, as has the
default value to “r”.
multirowbool, default True
Use multirow to enhance MultiIndex rows. Requires adding a
\usepackage{multirow} to your LaTeX preamble. Will print
centered labels (instead of top-aligned) across the contained
rows, separating groups via clines. The default will be read
from the pandas config module, and is set as the option
styler.sparse.index.
Changed in version 2.0.0: The pandas option affecting this argument has changed, as has the
default value to True.
captionstr or tuple, optional
Tuple (full_caption, short_caption),
which results in \caption[short_caption]{full_caption};
if a single string is passed, no short caption will be set.
labelstr, optional
The LaTeX label to be placed inside \label{} in the output.
This is used with \ref{} in the main .tex file.
positionstr, optional
The LaTeX positional argument for tables, to be placed after
\begin{} in the output.
As of v2.0.0 this method has changed to use the Styler implementation as
part of Styler.to_latex() via jinja2 templating. This means
that jinja2 is a requirement, and needs to be installed, for this method
to function. It is advised that users switch to using Styler, since that
implementation is more frequently updated and contains much more
flexibility with the output.
bufstr, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
modestr, optional
Mode in which file is opened, “wt” by default.
indexbool, optional, default True
Add index (row) labels.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
By default, the dtype of the returned array will be the common NumPy
dtype of all types in the DataFrame. For example, if the dtypes are
float16 and float32, the results dtype will be float32.
This may require copying data and coercing values, which may be
expensive.
Whether to ensure that the returned value is not a view on
another array. Note that copy=False does not ensure that
to_numpy() is no-copy. Rather, copy=True ensures that
a copy is made, even if not strictly necessary.
na_valueAny, optional
The value to use for missing values. The default value depends
on dtype and the dtypes of the DataFrame columns.
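A minimal sketch of how dtype, copy, and na_value interact (the frame is made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None]}, dtype="Int64")

# The nullable integer column holds a missing value; na_value supplies the
# representation to use so a plain float array can be produced.
arr = df.to_numpy(dtype="float64", na_value=np.nan)

# copy=True guarantees a fresh array even when no conversion is needed.
arr2 = df.to_numpy(copy=True)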
If a string, it will be used as Root Directory path
when writing a partitioned dataset. By file-like object,
we refer to objects with a write() method, such as a file handle
(e.g. via builtin open function). If path is None,
a bytes object is returned.
engine{‘pyarrow’}, default ‘pyarrow’
ORC library to use.
indexbool, optional
If True, include the dataframe’s index(es) in the file output.
If False, they will not be written to the file.
If None, similar to infer, the dataframe’s index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn’t require much space and is faster. Other indexes will
be included as columns in the file output.
engine_kwargsdict[str, Any] or None, default None
Additional keyword arguments passed to pyarrow.orc.write_table().
read_orc : Read a ORC file.
DataFrame.to_parquet : Write a parquet file.
DataFrame.to_csv : Write a csv file.
DataFrame.to_sql : Write to a sql table.
DataFrame.to_hdf : Write to hdf.
This function writes the dataframe as a parquet file. You can choose different parquet
backends, and have the option of compression. See
the user guide for more details.
pathstr, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. If None, the result is
returned as bytes. If a string or path, it will be used as Root Directory
path when writing a partitioned dataset.
Parquet library to use. If ‘auto’, then the option
io.parquet.engine is used. The default io.parquet.engine
behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if
‘pyarrow’ is unavailable.
compressionstr or None, default ‘snappy’
Name of the compression to use. Use None for no compression.
Supported options: ‘snappy’, ‘gzip’, ‘brotli’, ‘lz4’, ‘zstd’.
indexbool, default None
If True, include the dataframe’s index(es) in the file output.
If False, they will not be written to the file.
If None, similar to True, the dataframe’s index(es)
will be saved. However, instead of being saved as values,
the RangeIndex will be stored as a range in the metadata so it
doesn’t require much space and is faster. Other indexes will
be included as columns in the file output.
partition_colslist, optional, default None
Column names by which to partition the dataset.
Columns are partitioned in the order they are given.
Must be None if path is not a string.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
read_parquet : Read a parquet file.
DataFrame.to_orc : Write an orc file.
DataFrame.to_csv : Write a csv file.
DataFrame.to_sql : Write to a sql table.
DataFrame.to_hdf : Write to hdf.
If you want to get a buffer to the parquet content you can use a io.BytesIO
object, as long as you don’t use partition_cols, which creates multiple files.
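A small sketch of that in-memory round trip (pyarrow or fastparquet must be installed; the data is arbitrary):

import io
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Keep the parquet bytes in a buffer; this only works without partition_cols,
# which would create multiple files.
buf = io.BytesIO()
df.to_parquet(buf)

buf.seek(0)
restored = pd.read_parquet(buf)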
If False then underlying input data is not copied.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function. File path where
the pickled object will be stored.
compressionstr or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
protocolint
Int which indicates which protocol should be used by the pickler,
default HIGHEST_PROTOCOL (see [1]_ paragraph 12.1.2). The possible
values are 0, 1, 2, 3, 4, 5. A negative value for the protocol
parameter is equivalent to setting its value to HIGHEST_PROTOCOL.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
read_pickle : Load pickled pandas object (or any object) from file.
DataFrame.to_hdf : Write DataFrame to an HDF5 file.
DataFrame.to_sql : Write DataFrame to a SQL database.
DataFrame.to_parquet : Write a DataFrame to the binary parquet format.
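A minimal round-trip sketch (the file name is arbitrary; compression is inferred from the extension):

import pandas as pd

original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
original_df.to_pickle("./dummy.pkl")

# Reading it back restores the object unchanged.
unpickled_df = pd.read_pickle("./dummy.pkl")
assert original_df.equals(unpickled_df)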
Include index in resulting record array, stored in ‘index’
field or using the index label, if set.
column_dtypesstr, type, dict, default None
If a string or type, the data type to store all columns. If
a dictionary, a mapping of column names and indices (zero-indexed)
to specific data types.
index_dtypesstr, type, dict, default None
If a string or type, the data type to store all index levels. If
a dictionary, a mapping of index level names and indices
(zero-indexed) to specific data types.
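A short sketch of the dtype overrides described above (the values are arbitrary):

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [0.5, 0.75]}, index=["a", "b"])

# The index becomes the 'index' field of the record array (or its label, if set).
rec = df.to_records()

# Override storage dtypes per column and for the index level.
rec32 = df.to_records(column_dtypes={"A": "int32"}, index_dtypes="<S2")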
consqlalchemy.engine.(Engine or Connection) or sqlite3.Connection
Using SQLAlchemy makes it possible to use any DB supported by that
library. Legacy support is provided for sqlite3.Connection objects. The user
is responsible for engine disposal and connection closure for the SQLAlchemy
connectable. See here.
If passing a sqlalchemy.engine.Connection which is already in a transaction,
the transaction will not be committed. If passing a sqlite3.Connection,
it will not be possible to roll back the record insertion.
schemastr, optional
Specify the schema (if database flavor supports this). If None, use
default schema.
replace: Drop the table before inserting new values.
append: Insert new values to the existing table.
indexbool, default True
Write DataFrame index as a column. Uses index_label as the column
name in the table. Creates a table index for this column.
index_labelstr or sequence, default None
Column label for index column(s). If None is given (default) and
index is True, then the index names are used.
A sequence should be given if the DataFrame uses MultiIndex.
chunksizeint, optional
Specify the number of rows in each batch to be written at a time.
By default, all rows will be written at once.
dtypedict or scalar, optional
Specifying the datatype for columns. If a dictionary is used, the
keys should be the column names and the values should be the
SQLAlchemy types or strings for the sqlite3 legacy mode. If a
scalar is provided, it will be applied to all columns.
method{None, ‘multi’, callable}, optional
Controls the SQL insertion clause used:
None : Uses standard SQL INSERT clause (one per row).
‘multi’: Pass multiple values in a single INSERT clause.
callable with signature (pd_table, conn, keys, data_iter).
Details and a sample callable implementation can be found in the
section insert method.
Number of rows affected by to_sql. None is returned if the callable
passed into method does not return an integer number of rows.
The number of returned rows affected is the sum of the rowcount
attribute of sqlite3.Cursor or SQLAlchemy connectable which may not
reflect the exact number of written rows as stipulated in the
sqlite3 or SQLAlchemy documentation.
Timezone aware datetime columns will be written as
Timestamp with timezone type with SQLAlchemy if supported by the
database. Otherwise, the datetimes will be stored as timezone unaware
timestamps local to the original timezone.
Not all datastores support method="multi". Oracle, for example,
does not support multi-value insert.
Use method to define a callable insertion method to do nothing
if there’s a primary key conflict on a table in a PostgreSQL database.
>>> from sqlalchemy.dialects.postgresql import insert
>>> def insert_on_conflict_nothing(table, conn, keys, data_iter):
...     # "a" is the primary key in "conflict_table"
...     data = [dict(zip(keys, row)) for row in data_iter]
...     stmt = insert(table.table).values(data).on_conflict_do_nothing(index_elements=["a"])
...     result = conn.execute(stmt)
...     return result.rowcount
>>> df_conflict.to_sql(name="conflict_table", con=conn, if_exists="append", method=insert_on_conflict_nothing)
0
For MySQL, a callable to update columns b and c if there’s a conflict
on a primary key.
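A sketch of such a MySQL upsert callable, assuming SQLAlchemy’s MySQL dialect and reusing the hypothetical df_conflict, conn, and table/column names from the PostgreSQL example above; this is illustrative, not the only way to write it.

from sqlalchemy.dialects.mysql import insert

def insert_on_duplicate_update(table, conn, keys, data_iter):
    # Build one dict per incoming row, then update columns "b" and "c" with the
    # incoming values whenever the primary key already exists.
    data = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(data)
    stmt = stmt.on_duplicate_key_update(b=stmt.inserted.b, c=stmt.inserted.c)
    result = conn.execute(stmt)
    return result.rowcount

df_conflict.to_sql(name="conflict_table", con=conn,
                   if_exists="append", method=insert_on_duplicate_update)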
Specify the dtype (especially useful for integers with missing values).
Notice that while pandas is forced to store the data as floating point,
the database supports nullable integers. When fetching the data with
Python, we get back integer scalars.
String, path object (implementing os.PathLike[str]), or file-like
object implementing a binary write() function.
convert_datesdict
Dictionary mapping columns containing datetime types to stata
internal format to use when writing the dates. Options are ‘tc’,
‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer
or a name. Datetime columns that do not have a conversion type
specified will be converted to ‘tc’. Raises NotImplementedError if
a datetime column has timezone information.
write_indexbool
Write the index to Stata dataset.
byteorderstr
Can be “>”, “<”, “little”, or “big”. Default is sys.byteorder.
time_stampdatetime
A datetime to use as file creation date. Default is the current
time.
data_labelstr, optional
A label for the data set. Must be 80 characters or smaller.
variable_labelsdict
Dictionary containing columns as keys and variable labels as
values. Each label must be 80 characters or smaller.
version{114, 117, 118, 119, None}, default 114
Version to use in the output dta file. Set to None to let pandas
decide between 118 or 119 formats depending on the number of
columns in the frame. Version 114 can be read by Stata 10 and
later. Version 117 can be read by Stata 13 or later. Version 118
is supported in Stata 14 and later. Version 119 is supported in
Stata 15 and later. Version 114 limits string variables to 244
characters or fewer while versions 117 and later allow strings
with lengths up to 2,000,000 characters. Versions 118 and 119
support Unicode characters, and version 119 supports more than
32,767 variables.
Version 119 should usually only be used when the number of
variables exceeds the capacity of dta format 118. Exporting
smaller datasets in format 119 may have unintended consequences,
and, as of November 2020, Stata SE cannot read version 119 files.
convert_strllist, optional
List of column names to convert to string columns to Stata StrL
format. Only available if version is 117. Storing strings in the
StrL format can produce smaller dta files if strings have more than
8 characters and values are repeated.
compressionstr or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
value_labelsdict of dicts
Dictionary containing columns as keys and dictionaries of column value
to labels as values. Labels for a single variable must be 32,000
characters or smaller.
read_stata : Import Stata data files.
io.stata.StataWriter : Low-level writer for Stata data files.
io.stata.StataWriter117 : Low-level writer for version 117 files.
bufstr, Path or StringIO-like, optional, default None
Buffer to write to. If None, the output is returned as a string.
columnsarray-like, optional, default None
The subset of columns to write. Writes all columns by default.
col_spaceint, list or dict of int, optional
The minimum width of each column. If a list of ints is given, every integer corresponds with one column. If a dict is given, the key references the column, while the value defines the space to use.
headerbool or list of str, optional
Write out the column names. If a list of columns is given, it is assumed to be aliases for the column names.
indexbool, optional, default True
Whether to print index (row) labels.
na_repstr, optional, default ‘NaN’
String representation of NaN to use.
formatterslist, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns’ elements by position or
name.
The result of each function must be a unicode string.
List/tuple must be of length equal to the number of columns.
Formatter function to apply to columns’ elements if they are
floats. This function must return a unicode string and will be
applied only to the non-NaN elements, with NaN being
handled by na_rep.
sparsifybool, optional, default True
Set to False for a DataFrame with a hierarchical index to print
every multiindex key at each row.
index_namesbool, optional, default True
Prints the names of the indexes.
justifystr, default None
How to justify the column labels. If None uses the option from
the print configuration (controlled by set_option), ‘right’ out
of the box. Valid values are
left
right
center
justify
justify-all
start
end
inherit
match-parent
initial
unset.
max_rowsint, optional
Maximum number of rows to display in the console.
max_colsint, optional
Maximum number of columns to display in the console.
show_dimensionsbool, default False
Display DataFrame dimensions (number of rows by number of columns).
decimalstr, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe.
line_widthint, optional
Width to wrap a line in characters.
min_rowsint, optional
The number of rows to display in the console in a truncated repr
(when number of rows is above max_rows).
max_colwidthint, optional
Max width to truncate each column in characters. By default, no limit.
Convention for converting period to timestamp; start of period
vs. end.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert (the index by default).
copybool, default True
If False then underlying input data is not copied.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
path_or_bufferstr, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like
object implementing a write() function. If None, the result is returned
as a string.
indexbool, default True
Whether to include index in XML document.
root_namestr, default ‘data’
The name of root element in XML document.
row_namestr, default ‘row’
The name of row element in XML document.
na_repstr, optional
Missing data representation.
attr_colslist-like, optional
List of columns to write as attributes in row element.
Hierarchical columns will be flattened with underscore
delimiting the different levels.
elem_colslist-like, optional
List of columns to write as children in row element. By default,
all columns output as children of row element. Hierarchical
columns will be flattened with underscore delimiting the
different levels.
namespacesdict, optional
All namespaces to be defined in root element. Keys of dict
should be prefix names and values of dict corresponding URIs.
Default namespaces should be given empty string key. For
example,
namespaces={"":"https://example.com"}
prefixstr, optional
Namespace prefix to be used for every element and/or attribute
in document. This should be one of the keys in namespaces
dict.
encodingstr, default ‘utf-8’
Encoding of the resulting document.
xml_declarationbool, default True
Whether to include the XML declaration at start of document.
pretty_printbool, default True
Whether output should be pretty printed with indentation and
line breaks.
parser{‘lxml’,’etree’}, default ‘lxml’
Parser module to use for building of tree. Only ‘lxml’ and
‘etree’ are supported. With ‘lxml’, the ability to use XSLT
stylesheet is supported.
stylesheetstr, path object or file-like object, optional
A URL, file-like object, or a raw string containing an XSLT
script used to transform the raw XML output. Script should use
layout of elements and attributes from original output. This
argument requires lxml to be installed. Only XSLT 1.0
scripts, and not later versions, are currently supported.
compressionstr or dict, default ‘infer’
For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buffer’ is
path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’
(otherwise no compression).
Set to None for no compression.
Can also be a dict with key 'method' set
to one of {'zip', 'gzip', 'bz2', 'zstd', 'xz', 'tar'} and
other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile,
bz2.BZ2File, zstandard.ZstdCompressor, lzma.LZMAFile or
tarfile.TarFile, respectively.
As an example, the following could be passed for faster compression and to create
a reproducible gzip archive:
compression={'method':'gzip','compresslevel':1,'mtime':1}.
Added in version 1.5.0: Added support for .tar files.
Changed in version 1.4.0: Zstandard support.
storage_optionsdict, optional
Extra options that make sense for a particular storage connection, e.g.
host, port, username, password, etc. For HTTP(S) URLs the key-value pairs
are forwarded to urllib.request.Request as header options. For other
URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are
forwarded to fsspec.open. Please see fsspec and urllib for more
details, and for more examples on storage options refer here.
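A compact sketch combining a few of the options above (the namespace URI and column layout are illustrative):

import pandas as pd

df = pd.DataFrame({"shape": ["square", "circle"],
                   "degrees": [360, 360],
                   "sides": [4.0, None]})

# Write 'shape' as an attribute of each row element and the remaining columns
# as child elements, under a default namespace.
xml = df.to_xml(attr_cols=["shape"],
                elem_cols=["degrees", "sides"],
                namespaces={"": "https://example.com"})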
Function to use for transforming the data. If a function, must either
work when passed a DataFrame or when passed to DataFrame.apply. If func
is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g. [np.exp,'sqrt']
dict-like of axis labels -> functions, function names or list-like of such.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’: apply function to each column.
If 1 or ‘columns’: apply function to each row.
>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
Whether to copy the data after transposing, even for DataFrames
with a single dtype.
Note that a copy is always required for mixed dtype DataFrames,
or for DataFrames with any extension types.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
Transposing a DataFrame with mixed dtypes will result in a homogeneous
DataFrame with the object dtype. In such a case, a copy of the data
is always made.
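A short sketch of the mixed-dtype behaviour just described (the column names are made up):

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [9.5, 8.0]})

# Mixed dtypes: the transposed frame is homogeneous with object dtype,
# so a copy is made regardless of the copy keyword.
df.T.dtypes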
Any single or multiple element data structure, or list-like object.
axis{0 or ‘index’, 1 or ‘columns’}
Whether to compare by the index (0 or ‘index’) or columns.
(1 or ‘columns’). For Series input, axis to match Series index on.
levelint or label
Broadcast across a level, matching Index values on the
passed MultiIndex level.
fill_valuefloat or None, default None
Fill existing missing (NaN) values, and any new element needed for
successful DataFrame alignment, with this value before computation.
If data in both corresponding DataFrame locations is missing
the result will be missing.
Axis to truncate. Truncates the index (rows) by default.
For Series this parameter is unused and defaults to 0.
copybool, default is True,
Return a copy of the truncated section.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
...                    'B': ['f', 'g', 'h', 'i', 'j'],
...                    'C': ['k', 'l', 'm', 'n', 'o']},
...                   index=[1, 2, 3, 4, 5])
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o
>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n
The columns of a DataFrame can be truncated.
>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j
For Series, only rows can be truncated.
>>> df['A'].truncate(before=2, after=4)
2    b
3    c
4    d
Name: A, dtype: object
The index values in truncate can be datetimes or string
dates.
Because the index is a DatetimeIndex containing only dates, we can
specify before and after as strings. They will be coerced to
Timestamps before truncation.
Note that truncate assumes a 0 value for any unspecified time
component (midnight). This differs from partial string slicing, which
returns any partially matching dates.
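The datetime behaviour described above can be sketched as follows (the index and values are invented):

import pandas as pd

dates = pd.date_range("2016-01-01", "2016-02-01", freq="s")
df = pd.DataFrame(index=dates, data={"A": 1})

# before/after strings are coerced to Timestamps; the unspecified time
# component defaults to midnight, unlike partial string slicing.
df.truncate(before="2016-01-05", after="2016-01-10").tail()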
Target time zone. Passing None will convert to
UTC and remove the timezone information.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert
levelint, str, default None
If axis is a MultiIndex, convert a specific level. Otherwise
must be None.
copybool, default True
Also make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
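A minimal sketch of converting a tz-aware index (the timestamp is arbitrary):

import pandas as pd

s = pd.Series([1],
              index=pd.DatetimeIndex(["2018-09-15 01:30:00+02:00"]))

s.tz_convert("Asia/Shanghai")  # convert to another time zone
s.tz_convert(None)             # convert to UTC and drop the tz info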
Time zone to localize. Passing None will remove the
time zone information and preserve local time.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
The axis to localize
levelint, str, default None
If axis is a MultiIndex, localize a specific level. Otherwise
must be None.
copybool, default True
Also make a copy of the underlying data.
Note
The copy keyword will change behavior in pandas 3.0.
Copy-on-Write
will be enabled by default, which means that all methods with a
copy keyword will use a lazy copy mechanism to defer the copy and
ignore the copy keyword. The copy keyword will be removed in a
future version of pandas.
You can already get the future behavior and improvements through
enabling copy on write pd.options.mode.copy_on_write=True
When clocks moved backward due to DST, ambiguous times may arise.
For example in Central European Time (UTC+01), when going from
03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at
00:30:00 UTC and at 01:30:00 UTC. In such a situation, the
ambiguous parameter dictates how ambiguous times should be
handled.
‘infer’ will attempt to infer fall dst-transition hours based on
order
bool-ndarray where True signifies a DST time, False designates
a non-DST time (note that this flag is only applicable for
ambiguous times)
‘NaT’ will return NaT where there are ambiguous times
‘raise’ will raise an AmbiguousTimeError if there are ambiguous
times.
nonexistentstr, default ‘raise’
A nonexistent time does not exist in a particular timezone
where clocks moved forward due to DST. Valid values are:
‘shift_forward’ will shift the nonexistent time forward to the
closest existing time
‘shift_backward’ will shift the nonexistent time backward to the
closest existing time
‘NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
‘raise’ will raise an NonExistentTimeError if there are
nonexistent times.
If the DST transition causes nonexistent times, you can shift these
dates forward or backward with a timedelta object or ‘shift_forward’
or ‘shift_backward’.
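A minimal sketch of the nonexistent options described above, using a spring-forward date (the time zone and data are illustrative):

import pandas as pd

# Clocks jump from 02:00 to 03:00 on this date in Warsaw, so 02:30 does not exist.
s = pd.Series(range(2),
              index=pd.DatetimeIndex(["2015-03-29 02:30:00",
                                      "2015-03-29 03:30:00"]))

s.tz_localize("Europe/Warsaw", nonexistent="shift_forward")
s.tz_localize("Europe/Warsaw", nonexistent="shift_backward")
s.tz_localize("Europe/Warsaw", nonexistent=pd.Timedelta("1h"))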
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)
>>> s
one  a   1.0
     b   2.0
two  a   3.0
     b   4.0
dtype: float64
>>> s.unstack(level=-1)
       a    b
one  1.0  2.0
two  3.0  4.0
>>> s.unstack(level=0)
   one  two
a  1.0  3.0
b  2.0  4.0
>>> df = s.unstack(level=0)
>>> df.unstack()
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
dtype: float64
otherDataFrame, or object coercible into a DataFrame
Should have at least one matching index/column label
with the original DataFrame. If a Series is passed,
its name attribute must be set, and that will be
used as the column name to align with the original DataFrame.
join{‘left’}, default ‘left’
Only left join is implemented, keeping the index and columns of the
original object.
overwritebool, default True
How to handle non-NA values for overlapping keys:
True: overwrite original DataFrame’s values
with values from other.
False: only update values that are NA in
the original DataFrame.
The DataFrame’s length does not increase as a result of the update,
only values at matching index/column labels are updated.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'f']}, index=[0, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  y
2  c  f
For Series, its name attribute must be set.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e', 'f'], name='B')
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  e
2  c  f
If other contains NaNs the corresponding values are not updated
in the original dataframe.
The returned Series will have a MultiIndex with one level per input
column but an Index (non-multi) for a single label. By default, rows
that contain any NA values are omitted from the result. By default,
the resulting Series will be in descending order so that the first
element is the most frequently-occurring row.
With dropna set to False we can also count rows with NA values.
>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
>>> df
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
>>> df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
Name: count, dtype: int64
>>> df.value_counts(dropna=False)
first_name  middle_name
Anne        NaN            1
Beth        Louise         1
John        Smith          1
            NaN            1
Name: count, dtype: int64
For Series this parameter is unused and defaults to 0.
Warning
The behavior of DataFrame.var with axis=None is deprecated,
in a future version this will reduce over both axes and return a scalar.
To retain the old behavior, pass axis=0 (or do not pass axis).
skipnabool, default True
Exclude NA/null values. If an entire row/column is NA, the result
will be NA.
ddofint, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
numeric_onlybool, default False
Include only float, int, boolean columns. Not implemented for Series.
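A short sketch contrasting the default sample variance with ddof=0 (the numbers are arbitrary):

import pandas as pd

df = pd.DataFrame({"age": [21, 25, 62, 43],
                   "height": [1.61, 1.87, 1.49, 2.01]})

df.var()        # divides by N - 1 (ddof=1, the default)
df.var(ddof=0)  # population variance, divides by N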
Users are allowed to use the Performance DataFrame without the second and
fourth dimension (Objective and Run respectively) in case they only
have one objective or only do one run. This method adjusts the indexing for
those cases accordingly.
Args:
objective: The given objective name
run_id: The given run index
Returns:
A tuple representing the (possibly adjusted) Objective and Run index.
Method to check whether the specified objective is valid.
Users are allowed to index the dataframe without specifying all dimensions.
However, when dealing with multiple objectives this is not allowed and this
is verified here. If we have only one objective this is returned. Otherwise,
if an objective is specified by the user this is returned.
condbool Series/DataFrame, array-like, or callable
Where cond is True, keep the original value. Where
False, replace with corresponding value from other.
If cond is callable, it is computed on the Series/DataFrame and
should return boolean Series/DataFrame or array. The callable must
not change input Series/DataFrame (though pandas doesn’t check it).
otherscalar, Series/DataFrame, or callable
Entries where cond is False are replaced with
corresponding value from other.
If other is callable, it is computed on the Series/DataFrame and
should return scalar or Series/DataFrame. The callable must not
change input Series/DataFrame (though pandas doesn’t check it).
If not specified, entries will be filled with the corresponding
NULL value (np.nan for numpy dtypes, pd.NA for extension
dtypes).
inplacebool, default False
Whether to perform the operation in place on the data.
axisint, default None
Alignment axis if needed. For Series this parameter is
unused and defaults to 0.
The where method is an application of the if-then idiom. For each
element in the calling DataFrame, if cond is True the
element is used; otherwise the corresponding element from the DataFrame
other is used. If the axis of other does not align with axis of
cond Series/DataFrame, the misaligned index positions will be filled with
False.
The signature for DataFrame.where() differs from
numpy.where(). Roughly df1.where(m,df2) is equivalent to
np.where(m,df1,df2).
For further details and examples see the where documentation in
indexing.
The dtype of the object takes precedence. The fill value is casted to
the object’s dtype, if this can be done losslessly.
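A minimal sketch of the if-then idiom described above (the data is arbitrary):

import numpy as np
import pandas as pd

s = pd.Series(range(5))

s.where(s > 1)       # keep values where the condition holds, NaN elsewhere
s.where(s > 1, 10)   # replace with a scalar instead of NaN

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=["A", "B"])
df.where(df % 3 == 0, -df)   # roughly np.where(df % 3 == 0, df, -df)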
selection_scenario: The scenario to construct the Selector for.
run_on: Which runner to use. Defaults to slurm.
job_name: Name to give the construction job when submitting.
sbatch_options: Additional options to pass to sbatch.
slurm_prepend: Slurm script to prepend to the sbatch
base_dir: The base directory to run the Selector in.
Run the Selector CLI and write result to the Scenario PerformanceDataFrame.
Args:
scenario_path: The path to the scenario with the Selector to run.
instance_set: The instance set to run the Selector on.
feature_data: The instance feature data to use.
run_on: Which runner to use. Defaults to slurm.
sbatch_options: Additional options to pass to sbatch.
slurm_prepend: Slurm script to prepend to the sbatch
job_name: Name to give the Slurm job when submitting.
dependencies: List of dependencies to add to the job.
log_dir: The directory to write logs to.
Build the solver call on an instance with a configuration.
Args:
instance: Path to the instance.
objectives: List of sparkle objectives.
seed: Seed of the solver.
cutoff_time: Cutoff time for the solver.
configuration: Configuration of the solver.
log_dir: Directory path for logs.
Returns:
List of commands and arguments to execute the solver.
solver_output: The output of the solver run which needs to be parsed
solver_call: The solver call used to run the solver
objectives: The objectives to apply to the solver output
verifier: The verifier to check the solver output
Run the solver on an instance with a certain configuration.
Args:
instances: The instance(s) to run the solver on, list in case of multi-file.
In case of an instance set, will run on all instances in the set.
objectives: List of sparkle objectives.
seed: Seed to run the solver with. Fill with an arbitrary int in case of a
deterministic solver.
cutoff_time: The cutoff time for the solver, measured through RunSolver.
If None, will be executed without RunSolver.
configuration: The solver configuration to use. Can be empty.
run_on: Whether to run on slurm or locally.
sbatch_options: The sbatch options to use.
slurm_prepend: The script to prepend to a slurm script.
log_dir: The log directory to use.
Returns:
Solver output dict possibly with runsolver values.
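For the local case, the execution step could be sketched with the standard library as below; note this uses a wall-clock timeout rather than RunSolver's CPU-time measurement, and the status labels are illustrative:

import subprocess

def run_locally_sketch(cmd: list[str], cutoff_time: float | None) -> dict:
    # Illustrative only: a wall-clock timeout stands in for RunSolver's CPU accounting.
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=cutoff_time)
        return {"status": "completed", "returncode": proc.returncode, "stdout": proc.stdout}
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": ""}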
Run the solver and place the results in the performance dataframe.
In practice this runs Solver.run, with a small script before and after
to read from and write to the performance dataframe.
Args:
instances: The instance(s) to run the solver on. In case of an instance set,
or list, will create a job for all instances in the set/list.
config_ids: The config indices to use in the performance dataframe.
performance_dataframe: The performance dataframe to use.
run_ids: List of run ids to use. If list of list, a list of runs is given
per instance. Otherwise, all runs are used for each instance.
cutoff_time: The cutoff time for the solver, measured through RunSolver.
objective: The objective to use, only relevant when determining the best
configuration.
train_set: The training set to use. If present, will determine the best
configuration of the solver using these instances and run with it on
all instances in the instance argument.
sbatch_options: List of slurm batch options to use
slurm_prepend: Slurm script to prepend to the sbatch
dependencies: List of slurm runs to use as dependencies
log_dir: Path where to place output files. Defaults to CWD.
base_dir: Path where to place output files.
job_name: Name of the job
If None, will generate a name based on Solver and Instances
run_on: On which platform to run the jobs. Default: Slurm.
log_dir: Directory to store job logs
sbatch_options: Options to pass to sbatch
slurm_prepend: Script to prepend to sbatch script
run_on: Determines to which RunRunner queue the job is added
Returns:
A list of Run objects. Empty when running locally.
This method is shared by the configurators and should be called by the
implementation/subclass of the configurator.
Args:
configuration_commands: List of configurator commands to execute
data_target: Performance data to store the results.
output: Output directory.
scenario: ConfigurationScenario to execute.
configuration_ids: List of configuration ids that are to be created
validate_after: Whether the configurations should be validated
sbatch_options: List of slurm batch options to use
slurm_prepend: Slurm script to prepend to the sbatch
num_parallel_jobs: The maximum number of jobs to run in parallel
base_dir: The base_dir of RunRunner where the sbatch scripts will be placed
run_on: On which platform to run the jobs. Default: Slurm.
Method to restructure and clean up after a single configurator call.
Args:
output_source: Path to the output file of the configurator run.
output_target: Path to the Performance DataFrame to store result.
scenario: ConfigurationScenario of the configuration.
configuration_id: ID (of the run) of the configuration.
If the output_target is None, return the configuration.
Args:
scenario: ConfigurationScenario of the configuration. Should be removed.
configuration_id: ID (of the run) of the configuration.
configuration: Configuration to save.
output_target: Path to the Performance DataFrame to store result.
selection_scenario: The scenario to construct the Selector for.
run_on: Which runner to use. Defaults to slurm.
job_name: Name to give the construction job when submitting.
sbatch_options: Additional options to pass to sbatch.
slurm_prepend: Slurm script to prepend to the sbatch
base_dir: The base directory to run the Selector in.
Run the Selector CLI and write result to the Scenario PerformanceDataFrame.
Args:
scenario_path: The path to the scenario with the Selector to run.
instance_set: The instance set to run the Selector on.
feature_data: The instance feature data to use.
run_on: Which runner to use. Defaults to slurm.
sbatch_options: Additional options to pass to sbatch.
slurm_prepend: Slurm script to prepend to the sbatch
job_name: Name to give the Slurm job when submitting.
dependencies: List of dependencies to add to the job.
log_dir: The directory to write logs to.
instance: The instance to run on
feature_group: The optional feature group to run the extractor for.
output_file: Optional file to write the output to.
runsolver_args: The arguments for runsolver. If not present,
will run the extractor without runsolver.
cutoff_time: The maximum runtime.
log_dir: Directory path for logs.
extractor_path: Path to the executable
instance: Path to the instance to run on
feature_group: The feature group to compute. Must be supported by the
extractor to use.
output_file: Target output. If None, piped to the RunRunner job.
cutoff_time: CPU cutoff time in seconds
log_dir: Directory to write logs. Defaults to CWD.
Returns:
The features, or None if an output file is used or the features cannot be found.
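As a sketch of how such an invocation can be wrapped in runsolver to enforce the CPU cutoff: the runsolver flags below follow its documented interface, while the extractor's positional arguments are an assumption for illustration:

from pathlib import Path

def extractor_command_sketch(extractor_path: Path, instance: Path, output_file: Path,
                             cutoff_time: float, log_dir: Path) -> list[str]:
    # Hypothetical extractor argument order; only the runsolver flags are standard.
    return ["runsolver",
            "-C", str(cutoff_time),                    # CPU time limit in seconds
            "-w", str(log_dir / "extractor.watcher"),  # runsolver watcher/log output
            "-v", str(log_dir / "extractor.values"),   # measured resource values
            str(extractor_path), str(instance), str(output_file)]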
Run the Extractor CLI and write result to the FeatureDataFrame.
Args:
instance_set: The instance set to run the Extractor on.
feature_dataframe: The feature dataframe to write to.
cutoff_time: CPU cutoff time in seconds
feature_group: The feature group to compute. If left empty,
will run on all feature groups.
run_on: The runner to use.
sbatch_options: Additional options to pass to sbatch.
srun_options: Additional options to pass to srun.
parallel_jobs: Number of parallel jobs to run.
slurm_prepend: Slurm script to prepend to the sbatch
dependencies: List of dependencies to add to the job.
log_dir: The directory to write logs to.
selection_scenario: The scenario to construct the Selector for.
run_on: Which runner to use. Defaults to slurm.
job_name: Name to give the construction job when submitting.
sbatch_options: Additional options to pass to sbatch.
slurm_prepend: Slurm script to prepend to the sbatch
base_dir: The base directory to run the Selector in.
Run the Selector CLI and write result to the Scenario PerformanceDataFrame.
Args:
scenario_path: The path to the scenario with the Selector to run.
instance_set: The instance set to run the Selector on.
feature_data: The instance feature data to use.
run_on: Which runner to use. Defaults to slurm.
sbatch_options: Additional options to pass to sbatch.
slurm_prepend: Slurm script to prepend to the sbatch
job_name: Name to give the Slurm job when submitting.
dependencies: List of dependencies to add to the job.
log_dir: The directory to write logs to.
Build the solver call on an instance with a configuration.
Args:
instance: Path to the instance.
objectives: List of sparkle objectives.
seed: Seed of the solver.
cutoff_time: Cutoff time for the solver.
configuration: Configuration of the solver.
log_dir: Directory path for logs.
Returns:
List of commands and arguments to execute the solver.
solver_output: The output of the solver run which needs to be parsed
solver_call: The solver call used to run the solver
objectives: The objectives to apply to the solver output
verifier: The verifier to check the solver output
Run the solver on an instance with a certain configuration.
Args:
instances: The instance(s) to run the solver on, list in case of multi-file.
In case of an instance set, will run on all instances in the set.
objectives: List of sparkle objectives.
seed: Seed to run the solver with. Fill with an arbitrary int in case of
a deterministic solver.
cutoff_time: The cutoff time for the solver, measured through RunSolver.
If None, will be executed without RunSolver.
configuration: The solver configuration to use. Can be empty.
run_on: Whether to run on slurm or locally.
sbatch_options: The sbatch options to use.
slurm_prepend: The script to prepend to a slurm script.
log_dir: The log directory to use.
Returns:
Solver output dict possibly with runsolver values.
Run the solver and place the results in the performance dataframe.
In practice this runs Solver.run, with a small script before and after
to read from and write to the performance dataframe.
Args:
instances: The instance(s) to run the solver on. In case of an instance set,
or list, will create a job for all instances in the set/list.
config_ids: The config indices to use in the performance dataframe.
performance_dataframe: The performance dataframe to use.
run_ids: List of run ids to use. If list of list, a list of runs is given
per instance. Otherwise, all runs are used for each instance.
cutoff_time: The cutoff time for the solver, measured through RunSolver.
objective: The objective to use, only relevant when determining the best
configuration.
train_set: The training set to use. If present, will determine the best
configuration of the solver using these instances and run with it on
all instances in the instance argument.
sbatch_options: List of slurm batch options to use
slurm_prepend: Slurm script to prepend to the sbatch
dependencies: List of slurm runs to use as dependencies
log_dir: Path where to place output files. Defaults to CWD.
base_dir: Path where to place output files.
job_name: Name of the job
If None, will generate a name based on Solver and Instances
run_on: On which platform to run the jobs. Default: Slurm.
Add an extractor and its feature names to the dataframe.
Arguments:
extractor: Name of the extractor
extractor_features: Tuples of [FeatureGroup, FeatureName]
values: Initial values of the Extractor per instance in the dataframe.
Add a new solver to the dataframe. Initializes value to None by default.
Args:
solver_name: The name of the solver to be added.
configurations: A list of configuration keys for the solver.
initial_value: The value assigned for each index of the new solver.
If not None, must match the index dimension (n_obj * n_inst * n_runs).
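The dimension requirement can be pictured with plain pandas. This is only a sketch of the layout, assuming rows indexed by (objective, instance, run) and columns by (solver, configuration); the real PerformanceDataFrame internals may differ:

import itertools
import pandas as pd

objectives, instances, runs = ["PAR10"], ["i1", "i2"], [1, 2]
rows = pd.MultiIndex.from_tuples(list(itertools.product(objectives, instances, runs)),
                                 names=["Objective", "Instance", "Run"])
cols = pd.MultiIndex.from_tuples([("solver_a", "conf_1")], names=["Solver", "Configuration"])
pdf = pd.DataFrame(None, index=rows, columns=cols)

# Adding a solver: initial values must cover n_obj * n_inst * n_runs entries.
pdf[("solver_b", "conf_1")] = [10.0, 12.0, 9.5, 11.0]  # 1 * 2 * 2 = 4 values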
Return the best configuration for the given objective over the instances.
Args:
solver: The solver for which we determine the best configuration
objective: The objective for which we calculate the best configuration
instances: The instances which should be selected for the evaluation
Returns:
The best configuration id and its aggregated performance.
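The aggregation behind such a query might look like the following sketch; mean aggregation and minimisation are assumptions here, as the objective itself determines both in the real method:

import pandas as pd

# Rows: instances, columns: configuration ids of one solver (illustrative values).
perf = pd.DataFrame({"conf_1": [10.0, 30.0], "conf_2": [12.0, 20.0]}, index=["i1", "i2"])

aggregated = perf.mean(axis=0)     # aggregate per configuration over instances
best_config = aggregated.idxmin()  # assume the objective is minimised
best_value = aggregated.min()
print(best_config, best_value)     # conf_2 16.0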
Return the best performance for each instance in the portfolio.
Args:
objective: The objective for which we calculate the best performance
instances: The instances which should be selected for the evaluation
run_id: The run for which we calculate the best performance. If None,
we consider all runs.
exclude_solvers: List of (solver, config_id) to exclude in the calculation.
Returns:
The best performance for each instance in the portfolio.
Return the (best) configuration performance for objective over the instances.
Args:
solver: The solver for which we evaluate the configuration
configuration: The configuration (id) to evaluate
objective: The objective for which we find the best value
instances: The instances which should be selected for the evaluation
per_instance: Whether to return the performance per instance,
or aggregated.
Returns:
The (best) configuration id and its aggregated performance.
Return a list of performance computation jobs that are to be done.
Get a list of tuple[instance, solver] to run from the performance data.
If rerun is False (default), get only the tuples that do not yet have a
value; if True, get all the tuples.
Args:
rerun: Boolean indicating if we want to rerun all jobs
Returns:
A tuple of (solver, config, instance, run) combinations
Return the marginal contribution of the solver configuration on the instances.
Args:
objective: The objective for which we calculate the marginal contribution.
instances: The instances which should be selected for the evaluation
sort: Whether to sort the results afterwards
Returns:
The marginal contribution of each solver (configuration) as:
[(solver, config_id, marginal_contribution, portfolio_best_performance_without_solver)]
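One way to picture the computation is the virtual-best portfolio with and without each solver configuration; the difference used below is only one possible notion of marginal contribution and may not match the exact formula used here:

import pandas as pd

# Rows: instances, columns: (solver, config) pairs, values: a minimised objective.
perf = pd.DataFrame({("s1", "c1"): [10.0, 50.0], ("s2", "c1"): [20.0, 30.0]},
                    index=["i1", "i2"])

vbs_all = perf.min(axis=1).mean()  # portfolio (virtual best) performance with all solvers

for col in perf.columns:
    vbs_without = perf.drop(columns=[col]).min(axis=1).mean()
    print(col, vbs_without - vbs_all, vbs_without)  # degradation when col is removed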
Users are allowed to use the Performance Dataframe without the second and
fourth dimension (Objective and Run respectively) in the case they only
have one objective or only do one run. This method adjusts the indexing for
those cases accordingly.
Args:
objective: The given objective name
run_id: The given run index
Returns:
A tuple representing the (possibly adjusted) Objective and Run index.
Method to check whether the specified objective is valid.
Users are allowed to index the dataframe without specifying all dimensions.
However, when dealing with multiple objectives this is not allowed and this
is verified here. If we have only one objective this is returned. Otherwise,
if an objective is specified by the user this is returned.
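A compact sketch of the described rule (argument and variable names chosen for illustration, not taken from the actual method):

def resolve_indices_sketch(objective: str | None, run_id: int | None,
                           objectives: list[str], runs: list[int]) -> tuple:
    # Single-objective / single-run shortcut as described above.
    if objective is None:
        if len(objectives) > 1:
            raise ValueError("Multiple objectives defined; an objective must be specified.")
        objective = objectives[0]
    if run_id is None and len(runs) == 1:
        run_id = runs[0]
    return objective, run_id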
Exports a config space object to a specific PCS convention.
Args:
configspace: ConfigurationSpace, the space to convert
pcs_format: PCSConvention, the convention to convert to
file: Path, the file to write to. If None, will return string.
Returns:
String in case of no file path given, otherwise None.
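For the conversion itself, the ConfigSpace package ships PCS readers and writers; the module layout varies between ConfigSpace versions, so the snippet below is a sketch rather than the wrapper's exact code path:

from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter
from ConfigSpace.read_and_write import pcs_new  # writer for the "new" PCS convention

cs = ConfigurationSpace()
cs.add_hyperparameter(UniformFloatHyperparameter("alpha", 0.0, 1.0, default_value=0.5))

pcs_string = pcs_new.write(cs)  # returns the PCS text as a string
print(pcs_string)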
Gather the additional parameters for the solver call.
Args:
args_dict: Dictionary mapping argument names to their currently held values
prefix: Prefix of the command line options
postfix: Postfix of the command line options
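A minimal sketch of the idea, with the formatting rule (prefix + name + postfix + value) assumed for illustration:

def gather_cli_args_sketch(args_dict: dict, prefix: str = "--", postfix: str = "=") -> list[str]:
    # E.g. {"seed": 42} with prefix "--" and postfix "=" yields ["--seed=42"].
    return [f"{prefix}{name}{postfix}{value}"
            for name, value in args_dict.items() if value is not None]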
Try to resolve the objective class by (case-sensitive) name.
convention: objective_name(variable-k)?(:[min|max])?(:[metric|objective])?
Here, min|max refers to the minimisation or maximisation of the objective
and metric|objective refers to whether the objective should be optimized
or just recorded.
Order of resolving:
class_name of user defined SparkleObjectives
class_name of sparkle defined SparkleObjectives
default SparkleObjective with minimization unless specified as max
Args:
objective_name: The name of the objective class. Can include parameter value k.
Returns:
Instance of the Objective class or None if not found.
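A sketch of parsing the naming convention (the regex and field names are illustrative; the real resolver additionally looks the name up among the known SparkleObjective classes):

import re

PATTERN = re.compile(r"^(?P<name>[A-Za-z_]+?)(?P<k>\d+)?"
                     r"(?::(?P<direction>min|max))?(?::(?P<kind>metric|objective))?$")

def parse_objective_name_sketch(objective_name: str) -> dict:
    match = PATTERN.match(objective_name)
    if match is None:
        return {}
    parts = match.groupdict()
    return {"name": parts["name"],
            "k": int(parts["k"]) if parts["k"] else None,
            "minimise": parts["direction"] != "max",  # minimise unless specified as max
            "is_metric": parts["kind"] == "metric"}

print(parse_objective_name_sketch("PAR10:min:metric"))
# {'name': 'PAR', 'k': 10, 'minimise': True, 'is_metric': True}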