Chapter 2 Pandas
2.1 Pandas Basics
2.1.1 Generate Matrix from Arrays
Go back to fan’s Python Code Examples Repository (bookdown site) or the pyfan Package (API).
import numpy as np
import pandas as pd
import random as random
import string as string
2.1.1.1 Single Arrays to Matrix
Given various arrays, generate a matrix.
np.random.seed(123)
# Concatenate to matrix
mt_abc = np.column_stack(np.random.randint(10, size=(5, 3)))
# Matrix to data frame with columns and row names
df_abc = pd.DataFrame(data=mt_abc,
                      index=[ 'r' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[0]+1))],
                      columns=[ 'c' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[1]+1))])
# Print
print(df_abc)
## c1 c2 c3 c4 c5
## r1 2 1 6 1 0
## r2 2 3 1 9 9
## r3 6 9 0 0 3
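The example above passes a single 5 by 3 array to np.column_stack, which treats each of its rows as a separate array and so yields the 3 by 5 result shown. As a minimal sketch of the more literal "separate arrays to matrix" case, the block below stacks three 1-D arrays as columns. The arrays ar_a, ar_b, ar_c and the names mt_from_arrays and df_from_arrays are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Three hypothetical 1-D arrays of equal length (illustrative values only)
ar_a = np.array([1, 2, 3])
ar_b = np.array([4, 5, 6])
ar_c = np.array([7, 8, 9])

# Each 1-D array becomes one column of the resulting 3 by 3 matrix
mt_from_arrays = np.column_stack([ar_a, ar_b, ar_c])

# Label rows r1..r3 and columns c1..c3, following the naming used above
df_from_arrays = pd.DataFrame(
    data=mt_from_arrays,
    index=['r' + str(it_row) for it_row in range(1, mt_from_arrays.shape[0] + 1)],
    columns=['c' + str(it_col) for it_col in range(1, mt_from_arrays.shape[1] + 1)])

print(df_from_arrays)
```

With a list of 1-D inputs, the number of rows equals the length of each array and the number of columns equals the number of arrays.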
2.1.1.2 Generate a Testing Dataframe with String and Numeric Values
Generate a test dataframe with string and numeric variables, for testing purposes.
# Seed
np.random.seed(456)
random.seed(456)
# Numeric matrix 3 rows 4 columns
mt_numeric = np.random.randint(10, size=(3, 4))
# String block 5 letters per word, 3 rows and 3 columns of words
st_rand_word_block = ''.join(random.choice(string.ascii_lowercase) for ctr in range(5*3*3))
ls_st_rand_word = [st_rand_word_block[ctr: ctr + 5].capitalize() for ctr in range(0, len(st_rand_word_block), 5)]
mt_string = np.reshape(ls_st_rand_word, [3,3])
# Combine string and numeric matrix
mt_data = np.column_stack([mt_numeric, mt_string])
# Matrix to dataframe
df_data = pd.DataFrame(data=mt_data,
                       index=[ 'r' + str(it_col) for it_col in np.array(range(1, mt_data.shape[0]+1))],
                       columns=[ 'c' + str(it_col) for it_col in np.array(range(1, mt_data.shape[1]+1))])
# Print table
print(df_data)
## c1 c2 c3 c4 c5 c6 c7
## r1 5 9 4 5 Xoonm Zubtx Zqdkp
## r2 7 1 8 3 Ydcpw Obiee Gfxmq
## r3 5 2 4 2 Tzrwu Srwvp Kcsrb
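One point worth keeping in mind with this approach: a NumPy array has a single dtype, so stacking an integer matrix next to a string matrix produces an all-string array, and the numeric-looking columns of the resulting dataframe hold strings. The sketch below illustrates this on a smaller, hand-written matrix and converts the numeric columns back with pd.to_numeric. The small inputs and the name bl_numeric_before are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small hand-written stand-ins for the numeric and string blocks
mt_numeric = np.array([[5, 9], [7, 1]])
mt_string = np.array([['Xoonm'], ['Ydcpw']])

# Stacking integers next to strings yields a string-typed array
mt_data = np.column_stack([mt_numeric, mt_string])
print(mt_data.dtype)

df_data = pd.DataFrame(data=mt_data, index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

# Before conversion, column c1 is not a numeric column
bl_numeric_before = pd.api.types.is_numeric_dtype(df_data['c1'])

# Convert the numeric-looking columns back to numbers, one column at a time
for st_col in ['c1', 'c2']:
    df_data[st_col] = pd.to_numeric(df_data[st_col])

print(df_data.dtypes)
```

After the conversion, arithmetic such as df_data['c1'].sum() operates on numbers rather than concatenating strings.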
2.1.2 Select Rows and Columns from Dataframe
import numpy as np
import pandas as pd
import random as random
import string as string
2.1.2.1 Generate a Testing Dataframe
Generate a testing dataframe for selection and other tests.
# Seed
np.random.seed(999)
random.seed(999)
# Numeric matrix 5 rows 4 columns
mt_numeric = np.random.randint(10, size=(5, 4))
# String block 5 letters per word, 5 rows and 3 columns of words
st_rand_word_block = ''.join(random.choice(string.ascii_lowercase) for ctr in range(5*5*3))
mt_string = np.reshape([st_rand_word_block[ctr: ctr + 5].capitalize() for ctr in range(0, len(st_rand_word_block), 5)], [5,3])
# Combine string and numeric matrix
mt_data = np.column_stack([mt_numeric, mt_string])
# Matrix to dataframe
df_data = pd.DataFrame(data=mt_data,
                       index=[ 'r' + str(it_col) for it_col in np.array(range(1, mt_data.shape[0]+1))],
                       columns=[ 'c' + str(it_col) for it_col in np.array(range(1, mt_data.shape[1]+1))])
# Replace values
df_data = df_data.replace(['Zvcss', 'Dugei', 'Ciagu'], 'Zqovt')
# Print table
print(df_data)
## c1 c2 c3 c4 c5 c6 c7
## r1 0 5 1 8 Zqovt Rppez Ukuzu
## r2 1 9 3 0 Zqovt Sbwyi Mzhum
## r3 5 8 8 0 Zqovt Qgfvk Fcrto
## r4 5 2 5 7 Wxlev Upoax Bhdxu
## r5 4 6 2 7 Hmziq Lbyfo Dntrz
2.1.2.2 Select Rows Based on Column/Variable Values
Given a dataframe with many rows, select the subset of rows where a particular column/variable's value is equal to some value.
# Select rows where column c5 equals 'Zqovt'
df_data_subset = df_data.loc[df_data['c5'] == 'Zqovt']
# Print
print(df_data_subset)
## c1 c2 c3 c4 c5 c6 c7
## r1 0 5 1 8 Zqovt Rppez Ukuzu
## r2 1 9 3 0 Zqovt Sbwyi Mzhum
## r3 5 8 8 0 Zqovt Qgfvk Fcrto
See How to select rows from a DataFrame based on column values.
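The same boolean-mask pattern extends to other common selections. The sketch below shows matching against a list of values, combining two conditions, and negating a condition. The dataframe df_demo is a hand-written stand-in that reuses the c4 and c5 values printed above (with c4 stored as integers, which is an assumption made here for illustration).

```python
import pandas as pd

# Hand-written stand-in using the c4 and c5 values from the table above
df_demo = pd.DataFrame(
    {'c4': [8, 0, 0, 7, 7],
     'c5': ['Zqovt', 'Zqovt', 'Zqovt', 'Wxlev', 'Hmziq']},
    index=['r1', 'r2', 'r3', 'r4', 'r5'])

# Rows where c5 matches any value in a list
df_isin = df_demo.loc[df_demo['c5'].isin(['Wxlev', 'Hmziq'])]

# Rows satisfying two conditions; each condition needs its own parentheses
df_both = df_demo.loc[(df_demo['c5'] == 'Zqovt') & (df_demo['c4'] == 0)]

# Rows where c5 differs from the value
df_not = df_demo.loc[df_demo['c5'] != 'Zqovt']

print(df_isin)
print(df_both)
print(df_not)
```

The parentheses around each condition matter because & binds more tightly than == in Python.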
2.1.3 Pandas Importing and Exporting
2.1.3.1 Export a Dataframe to CSV in User Download with Automatic File Name
During debugging and testing, a large dataframe is generated, but a certain operation produces an error. To debug fully, drop into the debugger on error in PyCharm, and use the console to generate a dataframe of just the matrix at issue. Now export this dataframe to CSV in the fastest way possible.
- Find the user home path, and generate a download subdirectory if it does not exist.
- Export the current dataframe to a CSV file in that subdirectory, with automatic row and column names.
- The file will be named after the current array variable name, with a time suffix added.
Replace the mt_abc line below with the matrix at issue; its variable name will appear in the saved file name.
# Import Pathlib and pandas
import pandas as pd
import numpy as np
from pathlib import Path
import time
# replace mt_abc line by the matrix currently used
mt_abc = np.column_stack(np.random.randint(10, size=(5, 3)))
# Save results to C:\Users\fan\Downloads\PythonDebug, generate if does not exist.
srt_pydebug = Path.joinpath(Path.home(), "Downloads", "PythonDebug")
srt_pydebug.mkdir(parents=True, exist_ok=True)
# Matrix to data frame with columns and row names
df2export = pd.DataFrame(data=mt_abc,
                         index=['r' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[0] + 1))],
                         columns=['c' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[1] + 1))])
# Export File Name
spn_csv_path = Path.joinpath(srt_pydebug, f'{mt_abc=}'.split('=')[0] + '-' + time.strftime("%Y%m%d-%H%M%S") + '.csv')
# export
df2export.to_csv(spn_csv_path, sep=",")
# print
print(f'{srt_pydebug=}')
## srt_pydebug=WindowsPath('C:/Users/fan/Downloads/PythonDebug')
print(f'{spn_csv_path=}')
## spn_csv_path=WindowsPath('C:/Users/fan/Downloads/PythonDebug/mt_abc-20201228-220153.csv')
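The file name above relies on the f-string = specifier (available from Python 3.8), where f'{mt_abc=}' expands to the text mt_abc= followed by the array's representation, so splitting on = recovers the variable name. For repeated use, the steps could be wrapped in a small helper along the following lines. This is only a sketch: export_debug_csv and its parameters are hypothetical names, and the name is passed explicitly because inside a function the f-string trick would return the parameter name rather than the caller's variable name.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def export_debug_csv(mt_data, st_name, srt_folder=None):
    """Hypothetical helper: save a 2-D array as a time-stamped CSV file."""
    # Default to <home>/Downloads/PythonDebug, as in the example above
    if srt_folder is None:
        srt_folder = Path.home() / "Downloads" / "PythonDebug"
    srt_folder = Path(srt_folder)
    srt_folder.mkdir(parents=True, exist_ok=True)
    # Automatic row names r1, r2, ... and column names c1, c2, ...
    df2export = pd.DataFrame(
        data=mt_data,
        index=['r' + str(it_row) for it_row in range(1, mt_data.shape[0] + 1)],
        columns=['c' + str(it_col) for it_col in range(1, mt_data.shape[1] + 1)])
    # File name: <name>-<YYYYmmdd-HHMMSS>.csv
    spn_csv_path = srt_folder / (st_name + '-' + time.strftime("%Y%m%d-%H%M%S") + '.csv')
    df2export.to_csv(spn_csv_path, sep=",")
    return spn_csv_path


# Usage: write into a temporary folder so nothing lands in Downloads
srt_tmp = Path(tempfile.mkdtemp())
mt_abc = np.arange(6).reshape(2, 3)
spn_saved = export_debug_csv(mt_abc, 'mt_abc', srt_folder=srt_tmp)

# Read the file back, using the first column as row names
df_back = pd.read_csv(spn_saved, index_col=0)
print(spn_saved.name)
print(df_back)
```

Leaving srt_folder at its default reproduces the Downloads/PythonDebug location used in the example above.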