
Saturday, October 14, 2017

Useful interactive short tutorial on functions from DataCamp

    Today I plan to finalize the integration of the already developed programs that process data from text files and match it against other text files. I will need functions, so I found a short tutorial, "Python Functions Tutorial" by Karlijn Willems (August 8th, 2017), on the DataCamp website. DataCamp is worth exploring since it offers free educational materials. If I write any new functions for the integrated Facebook Groups program, I will show them. Part of the integrated software video is below, and the adapted code is posted in HTML in the post python 12345678910OPERACIJA.py:

                             

    Today is rainy, so I had to take a four-hour nap, but I finalized the 1st module: bulk extraction of Facebook Groups' unique IDs and group names for further matching. The 2nd module turns group members into a list of names, and the third matches two text files, the outcomes of the 1st and 2nd modules. Watch below for the 1st module's outcome. The adapted code is in HTML in the post python 12345678910OPERACIJA.py:



The second module automates the merge of Facebook Group information and converts data scraped by hand from a Group's member information (names, membership and policies) into a text file. The 2nd module merges the data text files, splits them into separate text files and converts them to lists for further processing and matching; a small sketch of that step follows below. The data file still needs to be turned into numerical data, which will be done by the 3rd module.
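A very rough sketch of what the 2nd module does (the filenames group_names.txt and group_members.txt are placeholders, not my real files): merge two scraped text files and turn the lines into a list for matching.

# merge two scraped text files and convert the lines to a list for matching
with open("group_names.txt", encoding="utf-8") as a, \
     open("group_members.txt", encoding="utf-8") as b:
    merged_lines = a.read().splitlines() + b.read().splitlines()

# keep only non-empty lines; this list is what the matching step works on
records = [line.strip() for line in merged_lines if line.strip()]
print(len(records), "records ready for matching")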




On Stack Overflow I found code to convert csv to txt files and vice versa, and to extract csv columns into separate csv files. So I can process csv columns separately, zip them as text files into new text files, and get the text file I need. On YouTube I found Data School's Pandas tutorials, which are great for beginners: they explain concisely how to start working with Pandas and what the advantages and benefits of using it are. I share the Data School videos on how to start working with DataFrames or .tsv files and how to read tabular data with the Pandas library. The tutorial uses the IPython Notebook from Jupyter, so you can easily test solutions. I personally try things in PyCharm too.
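As a minimal sketch of the kind of Stack Overflow snippets mentioned above (the filenames groups.csv and column_N.txt are just placeholders), splitting each csv column into its own text file can look like this:

import csv

# read the whole csv into memory as rows
with open("groups.csv", newline="", encoding="utf-8") as src:
    rows = list(csv.reader(src))

# transpose rows into columns and write each column to a separate .txt file
for index, column in enumerate(zip(*rows)):
    with open(f"column_{index}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(column))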

    While searching YouTube for ready-made Pandas code, I recalled a computer scientist with a PhD who comes from Libya but lives and teaches at a university in London. He has a great and extensive series of Pandas video lessons that explain practical applications, including business ones. See "Pandas DataFrames: Crosstabs, Cross Tabulation, Generating Contingency Tables" and the Wikipedia article on contingency tables; this lesson will be useful, if not for this automation code, then for numerous business applications.
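As a small illustration of the crosstab idea from that lesson (the group names and policies below are invented example data, not my real scrape), pandas builds a contingency table with pd.crosstab:

import pandas as pd

df = pd.DataFrame({
    "group": ["Buy&Sell", "Buy&Sell", "Dating", "Jobs", "Jobs", "Jobs"],
    "policy": ["open", "closed", "open", "open", "closed", "closed"],
})

# counts of policy per group, i.e. a contingency table
print(pd.crosstab(df["group"], df["policy"]))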


                              

     While coding, extensive search can bring miracle solutions to complete tasks (besides Stack Overflow, GitHub, Bitbucket, etc.). On Google I searched "blog edu python pandas files match tutorial": blog for practical examples, edu to include universities, tutorial to show coding. The first results page showed Analytics Vidhya (a digital marketing analytics guru website) with a blog article on how to merge dataframes with the pandas.DataFrame.merge function.
    So for the moment, keeping in mind that there is also a video tutorial on the pandas drop function and its axis argument, I have all the necessary technological solutions to automate matching csv files and extracting the data I need. Besides Pandas I also use a script that extracts info using a list of specified keywords. Pandas is a more sophisticated business- and science-oriented library that is worth knowing, since you can download Google Analytics data and process it with Pandas and other libraries to improve conversions and to forecast the performance and impact of online sales channels.
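A hedged sketch of how the merge and drop steps could fit together; the column names group_id, group_name and member are placeholders for my real headers:

import pandas as pd

groups = pd.DataFrame({"group_id": [1, 2, 3],
                       "group_name": ["Alpha", "Beta", "Gamma"]})
members = pd.DataFrame({"group_id": [1, 1, 3],
                        "member": ["Ana", "Tomas", "Rita"]})

# inner join on the shared ID column, like an SQL join
merged = groups.merge(members, on="group_id", how="inner")

# drop a column that is no longer needed (axis=1 means columns)
print(merged.drop("group_name", axis=1))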
     Two modules are already done; their outcome can be used for matching against the Groups membership data in LibreOffice. I have also collected all the remaining technological solutions needed to finalize the matching, and I can add a script that removes matched lines containing undesired groups like "Post is payble", "Mother..", "Dating..." and so on.
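A possible version of that keyword filter, with placeholder filenames and only the example keywords from above:

# drop lines that mention any of the undesired groups
undesired = ["Post is payble", "Mother", "Dating"]

with open("matched_groups.txt", encoding="utf-8") as src, \
     open("matched_groups_clean.txt", "w", encoding="utf-8") as out:
    for line in src:
        # keep the line only if none of the unwanted keywords occur in it
        if not any(keyword in line for keyword in undesired):
            out.write(line)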
   
    I wrote a program plan for how to process and merge the files, and I found that I need to loop over a csv or text file to extract column data into separate files. With the magic search "Python blog edu loop csv files to extract column data to new files" I found the necessary code on Reddit, which is quite a common source for code. Below is a sample I found in the Reddit learnpython community, but it has not been tested yet. This is an example of how coding can sometimes be fancy, but sometimes knowledge and practice of Python examples and libraries is what is needed. Anyway, my personal experience shows that good search skills help you program fast.
Extract specific column values from a csv from learnpython
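The Reddit snippet itself is not reproduced here; an untested, generic version of the same task (with placeholder file and header names) might look like this:

import csv

def extract_column(path, column_name):
    # return every value of the given column from the csv file at path
    with open(path, newline="", encoding="utf-8") as src:
        return [row[column_name] for row in csv.DictReader(src)]

# usage with made-up names:
# ids = extract_column("groups.csv", "group_id")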

     The promise to build the full automation software in one day failed. The only remaining issue is the Pandas merge of dataframes to match IDs and group members. I am new to Pandas, but I have tested five false merge cases and exported the data, and it works. Pandas has many methods, and I still have to find the ones needed for the automation and further applications. There may also be an issue with how the text file columns are separated for csv format. I hope to finalize it within a day.

    For Pandas I can use underscores instead of whitespace, and I found on Stack Overflow the code I needed to convert txt to csv; after testing, it runs as csv files. To make the code shorter I need a regex that cleans the underscores before and between digits (it does not look sophisticated). I found two great sites for building regexes to clean text files: rexegg.com and regex101.com. I also shortened today's code with a new word-group replacement method based on a dict.
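A rough sketch of that dict-based replacement; the actual pairs in my script differ, these are illustrations only:

# word groups to replace, applied one after another
replacements = {
    " ": "_",        # underscores instead of whitespace, as described above
    "Group:": "",    # strip a label (made-up example)
}

def clean(text):
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

print(clean("Group: Buy and Sell"))   # -> "_Buy_and_Sell"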

    With the search "regex tutorial exercise" I found a site with interactive lessons and a solution for the underscores that follow a comma:
change = re.sub(r',_{0,2}', r",", text1)
The pattern ,_{0,2} deletes the leading underscore after a comma in numbers like _dd_ddd or _d_ddd.
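A quick runnable check of that substitution, where text1 is just sample data shaped like my numbers:

import re

text1 = "name,_12_345,_1_234"
change = re.sub(r',_{0,2}', r",", text1)
print(change)   # -> name,12_345,1_234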

    With search "python lib regex" Google showed Pypi Alternative regular expression module, to replace re; regex 2017.09.23 : Python Package Index,  https://pypi.python.org/pypi/regex/. 
    It was worthwhile to spend time on regex, since it will be applied to scraping the necessary info (the more precise your definition of the info you need, the faster and in bigger volumes you can process it), and it can also be used for text analysis and a file automation system.
     I guess that the fuzzy matching from the PyPI regex 2017.09.23 package would solve the problem, but for now I will complete my automation software with splitting columns, processing the data and combining them.
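A guess at how the fuzzy matching from that regex package could be used here; the pattern and the allowed error count are illustrative only:

import regex   # the PyPI module, not the built-in re

# match "facebook" while allowing up to one character error (insertion, deletion or substitution)
m = regex.search(r'(?:facebook){e<=1}', "faceboook group members")
print(m.group() if m else "no match")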

     Another useful website, with things like regex cheat sheets, was regular-expressions.info.
And finally I got the data into csv format, with the text converted to csv. It sounds like nothing special, but I was splitting columns and there was an error on the last column. So I used a trick: with a script I appended the word "end" as a suffix to the end of each line of the text file, and it works. Python now extracts the last column without errors, with the data delimited by tabs; afterwards I split the file, deleted the underscores among the digits and, with itertools zip functions, added the columns back in the order Pandas needs.
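An assumed reconstruction of that "end" suffix trick, with placeholder filenames and column order:

from itertools import zip_longest

# append a sentinel word so the last tab-separated column is never empty
with open("raw.txt", encoding="utf-8") as src:
    lines = [line.rstrip("\n") + "\tend" for line in src]

# transpose lines into columns, padding short rows with empty strings
columns = list(zip_longest(*(line.split("\t") for line in lines), fillvalue=""))

# put the columns back in the order Pandas needs (indices are placeholders)
wanted_order = [1, 0, 2]
with open("for_pandas.csv", "w", encoding="utf-8") as out:
    for row in zip(*(columns[i] for i in wanted_order)):
        out.write(",".join(row) + "\n")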
     Yesterday I tried files in a Pandas-friendly layout but delimited by tabs. The text had underscore breaks, and I first had to solve the problem of the leading underscore symbol. On Stack Overflow I found the necessary regex code, re.sub(r'(?:^|(?<=\s))@(?!\s)', '', s), with a link to https://regex101.com/r/tX2bH4/44. In Pandas, the tab-delimited dataframe built from the text gives over 35 thousand rows, while the comma-delimited one gives only one thousand two hundred. So I redid the scripts to generate text files delimited just with commas and having headers. But while finalizing the integration I hit a PyCharm bug with a duplicated script. So I am taking a few days' break to draft the final program plan and the necessary folders, since in the end I will have to add the Pandas (http://pandas.pydata.org/) code to match and merge the two dataframes.
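As a closing hedged sketch of that delimiter check (the filenames are placeholders): reading the same data with sep="\t" versus sep="," shows quickly whether a file was written with the wrong delimiter before I add the merge code.

import pandas as pd

tabs = pd.read_csv("groups_tab.txt", sep="\t")
commas = pd.read_csv("groups_comma.csv", sep=",")

# compare row counts before matching and merging the dataframes
print(len(tabs), len(commas))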