Chapter 2 Pandas

2.1 Panda Basics

2.1.1 Generate Matrix from Arrays

Go back to fan’s Python Code Examples Repository (bookdown site) or the pyfan Package (API).

import numpy as np
import pandas as pd
import random as random
import string as string

2.1.1.1 Single Arrays to Matrix

Given various arrays, generate a matrix

np.random.seed(123)
# Concatenate to matrix
mt_abc = np.column_stack(np.random.randint(10, size=(5, 3)))
# Matrix to data frame with columns and row names
df_abc = pd.DataFrame(data=mt_abc,
            index=[ 'r' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[0]+1))],
            columns=[ 'c' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[1]+1))])
# Print
print(df_abc)
##     c1  c2  c3  c4  c5
## r1   2   1   6   1   0
## r2   2   3   1   9   9
## r3   6   9   0   0   3

2.1.1.2 Generate a Testing Dataframe with String and Numeric Values

Generate a test dataframe with string and numeric variables. For testing purposes.

# Seed
np.random.seed(456)
random.seed(456)

# Numeric matrix 3 rows 4 columns
mt_numeric = np.random.randint(10, size=(3, 4))

# String block 5 letters per word, 3 rows and 3 columns of words
st_rand_word_block = ''.join(random.choice(string.ascii_lowercase) for ctr in range(5*3*3))
ls_st_rand_word = [st_rand_word_block[ctr: ctr + 5].capitalize() for ctr in range(0, len(st_rand_word_block), 5)]
mt_string = np.reshape(ls_st_rand_word, [3,3])

# Combine string and numeric matrix
mt_data = np.column_stack([mt_numeric, mt_string])

# Matrix to dataframe
df_data = pd.DataFrame(data=mt_data,
                       index=[ 'r' + str(it_col) for it_col in np.array(range(1, mt_data.shape[0]+1))],
                       columns=[ 'c' + str(it_col) for it_col in np.array(range(1, mt_data.shape[1]+1))])
            
# Print table
print(df_data)
##    c1 c2 c3 c4     c5     c6     c7
## r1  5  9  4  5  Xoonm  Zubtx  Zqdkp
## r2  7  1  8  3  Ydcpw  Obiee  Gfxmq
## r3  5  2  4  2  Tzrwu  Srwvp  Kcsrb

2.1.2 Select Rows and Columns from Dataframe

Go back to fan’s Python Code Examples Repository (bookdown site) or the pyfan Package (API).

import numpy as np
import pandas as pd
import random as random
import string as string

2.1.2.1 Generate a Testing Dataframe

Generate a testing dataframe for selection and other tests.

# Seed
np.random.seed(999)
random.seed(999)
# Numeric matrix 3 rows 4 columns
mt_numeric = np.random.randint(10, size=(5, 4))
st_rand_word_block = ''.join(random.choice(string.ascii_lowercase) for ctr in range(5*5*3))
mt_string = np.reshape([st_rand_word_block[ctr: ctr + 5].capitalize() for ctr in range(0, len(st_rand_word_block), 5)], [5,3])
mt_data = np.column_stack([mt_numeric, mt_string])

# Matrix to dataframe
df_data = pd.DataFrame(data=mt_data,
                       index=[ 'r' + str(it_col) for it_col in np.array(range(1, mt_data.shape[0]+1))],
                       columns=[ 'c' + str(it_col) for it_col in np.array(range(1, mt_data.shape[1]+1))])

# Replace values
df_data = df_data.replace(['Zvcss', 'Dugei', 'Ciagu'], 'Zqovt')

# Print table
print(df_data)
##    c1 c2 c3 c4     c5     c6     c7
## r1  0  5  1  8  Zqovt  Rppez  Ukuzu
## r2  1  9  3  0  Zqovt  Sbwyi  Mzhum
## r3  5  8  8  0  Zqovt  Qgfvk  Fcrto
## r4  5  2  5  7  Wxlev  Upoax  Bhdxu
## r5  4  6  2  7  Hmziq  Lbyfo  Dntrz

2.1.2.2 Select Rows Based on Column/Variable Values

There is a dataframe with many rows, select a subset of rows where a particular column/variable’s value is equal to some value.

# Concatenate to matrix
df_data_subset = df_data.loc[df_data['c5'] == 'Zqovt']
# Print
print(df_data_subset)
##    c1 c2 c3 c4     c5     c6     c7
## r1  0  5  1  8  Zqovt  Rppez  Ukuzu
## r2  1  9  3  0  Zqovt  Sbwyi  Mzhum
## r3  5  8  8  0  Zqovt  Qgfvk  Fcrto

See How to select rows from a DataFrame based on column values.

2.1.3 Pandas Importing and Exporting

Go back to fan’s Python Code Examples Repository (bookdown site) or the pyfan Package (API).

2.1.3.1 Export a Dataframe to CSV in User Download with Automatic File Name

During debugging and testing, a large dataframe is generated, but certain operation produces error. To fully debug, drop into debugger on error in PyCharm, and use console to generate a dataframe of just the matrix at issue. Now export this dataframe to csv in the fastest way possible.

  1. Find user home path, generate a download subdirectory if it does not exist.
  2. Export the current dataframe to csv in that file, with auto row and column names.
  3. The dataframe will be named after the current variable array name, and will have a time suffix added.

Replace the mt_abc line below, use a different name that should appear in the saved file name.

# Import Pathlib and panddas
import pandas as pd
import numpy as np
from pathlib import Path
import time

# replace mt_abc line by the matrix currently used
mt_abc = np.column_stack(np.random.randint(10, size=(5, 3)))
# Save results to C:\Users\fan\Downloads\PythonDebug, generate if does not exist.
srt_pydebug = Path.joinpath(Path.home(), "Downloads", "PythonDebug")
srt_pydebug.mkdir(parents=True, exist_ok=True)
# Matrix to data frame with columns and row names
df2export = pd.DataFrame(data=mt_abc,
                         index=['r' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[0] + 1))],
                         columns=['c' + str(it_col) for it_col in np.array(range(1, mt_abc.shape[1] + 1))])
# Export File Name
spn_csv_path = Path.joinpath(srt_pydebug, f'{mt_abc=}'.split('=')[0] + '-' + time.strftime("%Y%m%d-%H%M%S") + '.csv')
# export
df2export.to_csv(spn_csv_path, sep=",")

# print
print(f'{srt_pydebug=}')
## srt_pydebug=WindowsPath('C:/Users/fan/Downloads/PythonDebug')
print(f'{spn_csv_path=}')
## spn_csv_path=WindowsPath('C:/Users/fan/Downloads/PythonDebug/mt_abc-20201228-220153.csv')