Search This Blog

Saturday, October 28, 2017

Pandas operations with data

After pdf.read_csv('anyfile.txt') I need to sort and selct data. With search " pandas dataframe select data lines will value more than" google shows link Selecting Pandas DataFrame Rows Based On Conditions

# Import modules
import pandas as pd
import numpy as np

# Create a dataframe
# Create a dataframe
raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan], 
        'nationality': ['USA', 'USA', 'France', 'UK', 'UK'], 
        'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'nationality', 'age'])


Method 1: Using Boolean Variables

# Create variable with TRUE if nationality is USA
american = df['nationality'] == "USA"

# Create variable with TRUE if age is greater than 50
elderly = df['age'] > 50

# Select all casess where nationality is USA and age is greater than 50
df[american & elderly]

Method 2: Using variable attributes
# Select all cases where the first name is not missing and nationality is USA 
df[df['first_name'].notnull() & (df['nationality'] == "USA")]

And with my data set I code simple code with works and on cmd screen I see:

import pandas as pd
import numpy as np
dataset = pd.read_csv("C:\\Users\\ANTRAS\\Documents\\PANDASTEST.txt", delimiter="\t")
numerous = dataset['Values'] > 30000
forprint = dataset[numerous]

C:\Python27\python.exe C:/Users/ANTRAS/.PyCharmCE2017.2/config/scratches/
     Name      ID   Values  Rules
1   FIRST  4152.0  50000.0   SELL
2  SECOND  4153.0  70000.0    BUY
3   THIRD  4154.0  40000.0  OFFER

Process finished with exit code 0

And with value less than 30000 I get

numerous = dataset['Values'] < 30000

C:\Python27\python.exe C:/Users/ANTRAS/.PyCharmCE2017.2/config/scratches/
      Name      ID  Values Rules
4   FOURTH  4155.0  1500.0  SWAP
5    FIFTH  4156.0  2000.0  FREE
6    SIXTH  4157.0   500.0   GET
7  SEVENTH  4158.0  1000.0    GO

Process finished with exit code 0

Another operations needed are to delete rows which names contain only underscores  (since with regex I remove all non-alphanumeric characters) and values cell contain non-numerical values. Afterwards I will extract data using keywords list to new data frame and export it to csv file.

Here again I copy Chris Alon

Dropping Rows And Columns In pandas Dataframe

Import modules

import pandas as pd
Create a dataframe
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
Santa CruzTina312013

Drop an observation (row)

df.drop(['Cochice', 'Pima'])
Santa CruzTina312013

Drop a variable (column)

Note: axis=1 denotes that we are referring to a column, not a row
df.drop('reports', axis=1)
Santa CruzTina2013

Drop a row if it contains a certain value (in this case, "Tina")

Specifically: Create a new dataframe called df that includes all rows where the value of a cell in the name column does not equal "Tina"
df[ != 'Tina']
To find wheather it is numeric or non-numeric value requires more sophisticated methods.

On stackoverflow I found solutions how to find with numpy isreal method numeric and non-numeric values. Here we come with Biblical alpha numeric notions. But main method is Pandas applymap  


Apply a function to a DataFrame that is intended to operate elementwise, i.e. like doing map(func, series) for each series in the DataFrame

I make google search "pandas data frame select rows containing keyword attributes":

def.loc[:, df.columns.str.contains('a').tolist()]

Same search brings solution of list of arguments from stackoverflow.
For single search value
df[df.values  == "banana"]
For multiple search terms:
  df[(df.values  == "banana")|(df.values  == "apple" ) ]
df[df.isin(['banana', "apple"])]

  #         A       B      C
  #  1   apple  banana    NaN
  #  2     NaN     NaN  apple
  #  3  banana     NaN    NaN
  #  4   apple   apple    NaN
From Divakar: lines with both are returned.

 #         A       B     C
 #   0  apple  banana  pear

To delete rows with specific columns not having only intengers seems complex tasks. 

So I tested many codes for Boolean how determine "Values" column to find if values are intenger.



df2.dropna(subset=['three', 'four', 'five'], how='all')

dropna drops rows if in specified column there is NA and NaN since it is default value for null generated by Pandas.

import pandas as pd
import numpy as np
dataset = pd.read_csv("C:\\Users\\ANTRAS\\Documents\\PANDASTEST.txt", delimiter="\t")
result3 = dataset.dropna(subset=['Values'], how='all') #OKprint(result3)

Pandas remove rows which any string. On stackoverflow great solutions to integrate regex library, ix method is not known to me, 
import re
regex = re.compile("[a-zA-Z]+")
result3 = dataset.ix[ x: is None)]

Name ID Values Rules
FIRST 4152 50000 SELL
SECOND 4153 70000 BUY
second45 7896 sduisl zet
THIRD 4154 40000 OFFER
inser 4786 htpp// deit
FOURTH 4155 1500 SWAP
FIFTH 4156 2000 FREE
SIXTH 4157 500 GET
SEVENTH 4158 1000 GO
sweasd 7896 rjasj REX

C:\Python27\python.exe C:/Users/ANTRAS/.PyCharmCE2017.2/config/scratches/
      Name    ID Values    Rules
0    FIRST  4152  50000     SELL
1   SECOND  4153  70000      BUY
3    THIRD  4154  40000    OFFER
4  FOURTH1  4161    540  SELLING
6   FOURTH  4155   1500     SWAP
7    FIFTH  4156   2000     FREE
8    SIXTH  4157    500      GET
9  SEVENTH  4158   1000       GO

Process finished with exit code 0

And got but which I can not explain, after I cleaned with Pandas data frame I can not make selection with values in Values column. Though bug fix research was rewarding and I found Programming with Python Python lessons for CU libraries blog. Software Carpentry. 
Indexing, slicing and subsetting DataFrames with pandas
The blog is solely for stepby step automation with Pandas.
The bug with lambda function appears when there is no intenger fields in "Values column", so you need to write condition boolean if field is just with numeric exit, i.e. do not execute ix method. 

On stackoverflow I found other solution. To  be tested. Use Pandas data frames and numpy methods.
I tested these code, they not make any change.
dataset[dataset['Values'].apply(lambda x: isinstance(x, (int, np.int64)))]dataset[dataset['Values'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
#testui = senabaze = pd.read_table('test.txt', col_index='keys_ids', col=names) #testui.to_csv('testui.csv')

I was looking to pandas cheatsheet and made searches using cheatsheet terminology "boolean indexing", objects and etc.

So I test pandas methods.
C:\Python27\python.exe C:/Users/ANTRAS/.PyCharmCE2017.2/config/scratches/
Name      object
ID         int64
Values    object
Rules     object

dtype: object
I there strong or object in column cell value, you get object type.

The search for advance Pandas Data Manipulation methods was rewarding. I found Python Pandas tutorial for business analysis.
So my problem can be solved by addtional function to check fo column values and exit on float64 type, o investigate and test with numpy for variablve string (object values) in specific column values.

So the problem with dtypes is that for mathematical comparison you have to have float64 type. You can convert int64 type to float64 using astype(float64) method pd (your data name) column(name of colum) plus astype(float64), i.e. example.columninexample.astype(float64). And below working data frame method to concanate data frames, remove duplicates and select rows with values more than 10.000. Note that if you read text file having add header, the data frame digits will be object type and you will not be able to convert directly to float64 (not tested for int64) type/

import pandas as pd

user_cols = ['keys_ids', 'Name', 'Members', 'Rules']
senabaze = pd.read_table('test.txt', sep=',', header=None, index_col='keys_ids', names=user_cols)

#reikia su dtypes tikrimti, skaiciai del header txt file buna object, be header int64 kuriuos per astype reikia padaryti float64
senabaze['Members'] = senabaze.Members.astype(float)

naujiduomenys = pd.read_table('test1.txt', sep=',', header=None, index_col='keys_ids', names=user_cols)

#reikia su dtypes tikrimti, skaiciai del header txt file buna object, be header int64 kuriuos per astype reikia padaryti float64
naujiduomenys['Members'] = naujiduomenys.Members.astype(float)

atnaujinta = pd.concat([senabaze, naujiduomenys])
netoatnaujinta = atnaujinta.drop_duplicates()

#SSS =  netoatnaujinta.Members
#print SSS

booleans = []
for svarbios in netoatnaujinta.Members:

    if svarbios >= 10000:

daugiauneidesimttukst = pd.Series(booleans)

daugiauneidesimttukst = netoatnaujinta.Members >= 10000


netoatnaujinta.loc[netoatnaujinta.Members >= 10000, 'Name']
TTT = netoatnaujinta.loc[netoatnaujinta.Members >= 10000, 'Name']


#testui = senabaze = pd.read_table('test.txt', col_index='keys_ids', col=names)

Python Pandas tutorial for business analysis
Creating Pandas DataFrames & Selecting Data

I encountered problem with reading Adwords data csv file to Pandas Data Frame. Main problem I do not see numeric data when data import, only NaN and astype(float) neither convert(objects)work. The file also is corrupted to convert to text file, null value. I also had to to choose python=engine parameter. I need to import to data frame to match with pytrends Google Trends data. Google trends data bring most relevant to category keyword whether search, picture, youtube or shopper. 100 average most relevant keyword. Google Trends data from cmd has text markups, so to automate export of five keywords data is necessary to csv or text files for further merging with glob library.

The problem to read csv file downloaded from Adwords is UTF-16 coding. Therefore you need to recode csv file to UTF-8. With UTF-8 csv file decreases in size twice.

import codecs
import csv

#for Adwords files

with'TOTAL.csv', 'rU', 'UTF-16') as infile:
    with open('TOTAL8.csv', 'wb') as outfile:
        for line in infile:

So this pandas match of Adwords, Google Trends category relevance for images and searches. The match suggest to use 5 main keywords. Note that Adwords keywords were scrapped from competing sites urls.

Keyword AvgMonth Competition Suggestedbid Inplan Number Traffic_x Region_x Traffic_y Region_y
0 freesia 550 0.64 0.7 N 48 1 England 2 England
1 plants 550 0.59 0.96 Y 52 100 England 100 England
2 alstroemeria 55 0.58 0.74 Y 53 0 England 0 England
3 roses 550 0.57 0.78 Y 54 95 England 95 England
4 florist 550 0.23 2.12 Y 4 85 England 67 London
5 carnation 550 0.16 0.39 N 8 2 England 3 London
6 horseshoe 55 0.16 0.05 N 9 3 England 3 England
7 weddings 55 0.14 0.25 Y 10 38 England 38 England
8 gerberas 55 0.02
Y 17 0 England 0 England
9 butterfly 550 0.01
Y 19 1 England 1 England
10 congratulations 550 0 0.08 Y 22 8 England 8 England
11 germini 55 0
Y 26 13 England 13 England

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.