```python
# Import modules
import pandas as pd
import numpy as np
```
```python
# Create a dataframe
raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'nationality', 'age'])
df
```
| | first_name | nationality | age |
|---|---|---|---|
| 0 | Jason | USA | 42 |
| 1 | Molly | USA | 52 |
| 2 | NaN | France | 36 |
| 3 | NaN | UK | 24 |
| 4 | NaN | UK | 70 |
Method 1: Using Boolean Variables
```python
# Create variable with TRUE if nationality is USA
american = df['nationality'] == "USA"

# Create variable with TRUE if age is greater than 50
elderly = df['age'] > 50

# Select all cases where nationality is USA and age is greater than 50
df[american & elderly]
```
| | first_name | nationality | age |
|---|---|---|---|
| 1 | Molly | USA | 52 |
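The same selection can also be written inline, without the intermediate variables:

```python
df[(df['nationality'] == "USA") & (df['age'] > 50)]
```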
Method 2: Using variable attributes
```python
# Select all cases where the first name is not missing
# and nationality is USA
df[df['first_name'].notnull() & (df['nationality'] == "USA")]
```
| | first_name | nationality | age |
|---|---|---|---|
| 0 | Jason | USA | 42 |
| 1 | Molly | USA | 52 |
And with my own data set I wrote a simple piece of code that works; on the cmd screen I see:
```python
import pandas as pd
import numpy as np

dataset = pd.read_csv("C:\\Users\\ANTRAS\\Documents\\PANDASTEST.txt", delimiter="\t")
#print(dataset)
numerous = dataset['Values'] > 30000
forprint = dataset[numerous]
print(forprint)
```
```
C:\Python27\python.exe C:/Users/ANTRAS/.PyCharmCE2017.2/config/scratches/scratch_11.py
     Name      ID   Values  Rules
1   FIRST  4152.0  50000.0   SELL
2  SECOND  4153.0  70000.0    BUY
3   THIRD  4154.0  40000.0  OFFER

Process finished with exit code 0
```
And with values less than 30000 I get:
```python
numerous = dataset['Values'] < 30000
```

```
      Name      ID  Values Rules
4   FOURTH  4155.0  1500.0  SWAP
5    FIFTH  4156.0  2000.0  FREE
6    SIXTH  4157.0   500.0   GET
7  SEVENTH  4158.0  1000.0    GO

Process finished with exit code 0
```
Other operations I still need are to delete rows whose names contain only underscores (since with regex I remove all non-alphanumeric characters) and rows whose Values cell contains non-numerical values. Afterwards I will extract data into a new data frame using a keyword list and export it to a CSV file.
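A minimal sketch of that plan, assuming the Name, Values and Rules columns from my test file; the keyword list here is hypothetical:

```python
import pandas as pd

dataset = pd.read_csv("C:\\Users\\ANTRAS\\Documents\\PANDASTEST.txt", delimiter="\t")

# Drop rows whose Name consists only of underscores
dataset = dataset[~dataset['Name'].str.match(r'^_+$', na=False)]

# Drop rows whose Values cell is not numeric
dataset = dataset[pd.to_numeric(dataset['Values'], errors='coerce').notnull()]

# Extract rows matching a keyword list and export to CSV
keywords = ['SELL', 'BUY']  # hypothetical keyword list
dataset[dataset['Rules'].isin(keywords)].to_csv('extracted.csv', index=False)
```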
Here again I copy Chris Albon https://chrisalbon.com/
Dropping Rows And Columns In pandas Dataframe
Import modules
```python
import pandas as pd
```
Create a dataframe
```python
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
Drop an observation (row)
```python
df.drop(['Cochice', 'Pima'])
```
| | name | reports | year |
|---|---|---|---|
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
Drop a variable (column)
Note: axis=1 denotes that we are referring to a column, not a row
```python
df.drop('reports', axis=1)
```
| | name | year |
|---|---|---|
| Cochice | Jason | 2012 |
| Pima | Molly | 2012 |
| Santa Cruz | Tina | 2013 |
| Maricopa | Jake | 2014 |
| Yuma | Amy | 2014 |
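Note that drop returns a new data frame and leaves df itself unchanged; to keep the result, reassign it:

```python
df = df.drop('reports', axis=1)
```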
Drop a row if it contains a certain value (in this case, "Tina")
Specifically: Create a new dataframe called df that includes all rows where the value of a cell in the name column does not equal "Tina"
```python
df[df.name != 'Tina']
```
| | name | reports | year |
|---|---|---|---|
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
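To drop rows matching several values at once, the same pattern works with isin, which I will also need for my keyword list:

```python
# Keep rows whose name is not in the given list
df[~df.name.isin(['Tina', 'Jake'])]
```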
Finding whether a value is numeric or non-numeric requires more sophisticated methods. On Stack Overflow I found solutions for detecting numeric and non-numeric values with numpy's isreal method. Here we come to the Biblical alphanumeric notions. But the main method is Pandas applymap:
`DataFrame.applymap(func)` — apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing `map(func, series)` for each series in the DataFrame.
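A minimal sketch of how applymap with numpy's isreal could flag non-numeric cells; the toy frame here is hypothetical:

```python
import pandas as pd
import numpy as np

# Toy frame mixing numbers and strings in the Values column
df = pd.DataFrame({'Name': ['FIRST', 'inser'],
                   'Values': [50000, 'htpp//']})

# applymap applies np.isreal to every cell; True marks numeric cells
mask = df.applymap(np.isreal)

# Keep only rows whose Values cell is numeric
print(df[mask['Values']])
```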
I make a Google search for "pandas data frame select rows containing keyword attributes":
Deleting rows where specific columns do not hold only integers seems a complex task. So I tested many Boolean approaches on the "Values" column to determine whether its values are integers.
One candidate is pandas.DataFrame.dropna; see also https://stackoverflow.com/questions/14991195/how-to-remove-rows-with-null-values-from-kth-column-onward-. dropna drops rows if the specified column holds NA or NaN, the default null value generated by Pandas.
To remove rows that contain any string, Stack Overflow has good solutions that integrate the regex library; the ix method was unknown to me. There is also selection of columns by partial string match:

```python
df.loc[:, df.columns.str.contains('a').tolist()]
```

https://stackoverflow.com/questions/11350770/pandas-dataframe-select-by-partial-string
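For selecting rows (rather than columns) by partial string, the same thread uses str.contains; a sketch with my own column names:

```python
# Rows whose Rules column contains the substring 'SELL'
dataset[dataset['Rules'].str.contains('SELL', na=False)]
```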
The same search brings a solution with a list of search arguments from Stack Overflow. For a single search value:
```python
df[df.values == "banana"]
```

or

```python
df[df.isin(['banana'])]
```
For multiple search terms:

```python
df[(df.values == "banana") | (df.values == "apple")]
```

or

```python
df[df.isin(['banana', "apple"])]
#         A       B      C
# 1   apple  banana    NaN
# 2     NaN     NaN  apple
# 3  banana     NaN    NaN
# 4   apple   apple    NaN
```
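Note that isin returns an elementwise mask, which is why the non-matching cells above show NaN; to select whole rows instead, reduce the mask with any(axis=1):

```python
df[df.isin(['banana', 'apple']).any(axis=1)]
```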
From Divakar's answer, rows containing both terms are returned:

```python
select_rows(df, ['apple', 'banana'])
#        A       B     C
# 0  apple  banana  pear
```
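select_rows is a helper defined in Divakar's answer; a rough equivalent of the idea, keeping only rows that contain every search term (my own sketch, not his exact code):

```python
import numpy as np

def select_rows(df, search_vals):
    # For each term, a boolean per row: does any cell equal the term?
    hits = [(df.values == v).any(axis=1) for v in search_vals]
    # Keep rows where every term was found somewhere in the row
    return df[np.all(hits, axis=0)]
```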
Applying dropna as suggested in the Stack Overflow thread above (df2 is the example frame from that thread):

```python
df2.dropna(subset=['three', 'four', 'five'], how='all')
```
And applied to my data set:

```python
import pandas as pd
import numpy as np

dataset = pd.read_csv("C:\\Users\\ANTRAS\\Documents\\PANDASTEST.txt", delimiter="\t")
result3 = dataset.dropna(subset=['Values'], how='all')
print(result3)  # OK
```
To remove rows whose Values column contains any alphabetic string, the Stack Overflow solution integrates the regex library with the ix method:
```python
import re

# Match any run of letters; rows whose Values cell contains letters are dropped
regex = re.compile("[a-zA-Z]+")

# Keep rows where the regex finds no letters in the Values cell
# (.ix is deprecated in later pandas; .loc with the same boolean mask works too)
result3 = dataset.ix[dataset.Values.map(lambda x: regex.search(x) is None)]
print(result3)
```
Before, the test file contains:

```
Name      ID    Values  Rules
FIRST     4152  50000   SELL
SECOND    4153  70000   BUY
second45  7896  sduisl  zet
THIRD     4154  40000   OFFER
FOURTH1   4161  540     SELLING
inser     4786  htpp//  deit
FOURTH    4155  1500    SWAP
FIFTH     4156  2000    FREE
SIXTH     4157  500     GET
SEVENTH   4158  1000    GO
sweasd    7896  rjasj   REX
```
After:
```
C:\Python27\python.exe C:/Users/ANTRAS/.PyCharmCE2017.2/config/scratches/scratch_11.py
      Name    ID Values    Rules
0    FIRST  4152  50000     SELL
1   SECOND  4153  70000      BUY
3    THIRD  4154  40000    OFFER
4  FOURTH1  4161    540  SELLING
6   FOURTH  4155   1500     SWAP
7    FIFTH  4156   2000     FREE
8    SIXTH  4157    500      GET
9  SEVENTH  4158   1000       GO

Process finished with exit code 0
```
And I got a bug which I cannot explain: after cleaning the data frame with Pandas, I cannot make a selection by values in the Values column. Still, the bug-fix research was rewarding, and I found the Programming with Python lessons for CU libraries blog (Software Carpentry):
Indexing, slicing and subsetting DataFrames with pandas
The blog is solely about step-by-step automation with Pandas.
The bug with the lambda function appears when there are no string fields left in the "Values" column: regex.search expects a string, so on a purely numeric column the lambda fails. You need a Boolean condition so that when the field is already numeric you exit, i.e. do not execute the ix method.
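A guard along those lines could look like this (a sketch continuing the script above, using a boolean mask instead of ix):

```python
# Only run the regex filter while the column still holds strings;
# regex.search raises TypeError on a purely numeric column
if dataset['Values'].dtype == object:
    mask = dataset['Values'].map(lambda x: regex.search(str(x)) is None)
    dataset = dataset[mask]
```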
On Stack Overflow I found another solution, still to be tested, which uses Pandas data frames and numpy methods. I tested these pieces of code; they do not make any change:
```python
dataset[dataset['Values'].apply(lambda x: isinstance(x, (int, np.int64)))]
dataset[dataset['Values'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
```
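If read_csv already parsed the column as numeric, every cell passes those type tests, which would explain why nothing changes. pd.to_numeric gives a filter that also works when the column is still of object dtype (a sketch):

```python
# Coerce to numbers; non-numeric cells become NaN and their rows are dropped
numeric = pd.to_numeric(dataset['Values'], errors='coerce')
dataset = dataset[numeric.notnull()]
```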
```python
#testui = senabaze = pd.read_table('test.txt', col_index='keys_ids', col=names)
#testui.to_csv('testui.csv')
```