
Wednesday, October 18, 2017

SEO productivity tools

    For this task I found and installed the SerpScrap project, to scrape data from websites and analyze them. You paste the code into a .py file and run it from its location via cmd: python anyword.py. It looks great in the documentation; no need to search further for free SEO tools.
SerpScrap, however, fails with an error on "from urllib.parse import urlparse": ImportError: No module named parse, even though urlparse is part of the standard Python library. I found a good article about parsing in Python, "Parsing In Python: Tools And Libraries": https://tomassetti.me/parsing-in-python/#tools. Scanning articles like this made me understand that parsing sites requires some theoretical grounding. I also upgraded numpy and installed http-parser 0.8.3. For now I cannot use the SerpScrap project.
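The ImportError actually means the SerpScrap code is written for Python 3, where urlparse lives in the urllib.parse module, while I was running it under Python 2, where urlparse is a top-level module. A minimal compatibility shim that works on both versions:

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

print(urlparse("http://learningpythontobuildbusiness.blogspot.com/").netloc)

This also explains why upgrading numpy or installing http-parser could not help: the problem is the interpreter version, not a missing third-party package.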

     So for competitive analysis I searched "python python github projects for semantic analysis", and the search returned many projects. This kind of software reads text from a URL and finds the semantically central words, which should hopefully be related to each other. These projects use machine learning libraries. Analogously, I found semantic analysis projects on PyPI, and took the first result of the search "python pypi semantic analysis projects".

    Sumy is a module for automatic summarization of text documents and HTML pages. I installed it and tested it on this blog (which references the posts Python for small business and Small business productivity with business). Here is the outcome:
C:\Users\ANTRAS\Documents\Test>sumy lex-rank --length=10 --url=http://learningpythontobuildbusiness.blogspot.com/
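Sumy can also be called as a library instead of via cmd. A sketch based on the usage shown in sumy's documentation (it needs the nltk tokenizer data installed; the URL is this blog):

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

url = "http://learningpythontobuildbusiness.blogspot.com/"
parser = HtmlParser.from_url(url, Tokenizer("english"))  # fetch and tokenize the page
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, 10):  # 10 sentences, like --length=10
    print(sentence)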
The second module merges data text files, splits them into separate text files, and converts them to lists for further matching. So I can process CSV file columns separately, zip them together as text files into new text files, and get the text file I need.
So for the moment, keeping in mind that there is also a video tutorial on the pandas drop function with the axis argument, I have all the technological pieces needed to automate matching CSV files and extracting the necessary data.
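For reference, dropping a column in pandas is done with drop and axis=1; a minimal sketch (the file and column names here are made up for illustration):

import pandas as pd

df = pd.read_csv("members.csv")  # hypothetical input file
df = df.drop("unused_column", axis=1)  # axis=1 drops a column, axis=0 drops a row
df.to_csv("members_clean.csv", index=False)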
Two modules are already done; their output can be used to match against the Libre Office Groups membership list. I have collected the remaining technological solutions to finalize the matching, and I can also add a script that removes matched lines containing undesired groups like "Post is payble", "Mother..", "Dating..." and so on.
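A minimal sketch of such a removal script, assuming the matches sit in a plain text file, one match per line (the file names are hypothetical; the patterns come from the group names above):

import re

# group names to reject, as listed above; extend the pattern as needed
UNDESIRED = re.compile(r"Post is payble|Mother|Dating", re.IGNORECASE)

with open("matches.txt") as src, open("matches_clean.txt", "w") as dst:
    for line in src:
        if not UNDESIRED.search(line):  # keep only lines without undesired groups
            dst.write(line)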
I wrote a program plan for processing and merging the files, and found that I need to loop over a CSV or text file to extract column data into separate files.
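A sketch of that loop with the standard csv module, assuming the input is a CSV whose first row is the header (the file name and layout are assumptions):

import csv

with open("data.csv") as f:
    reader = csv.reader(f)
    header = next(reader)  # first row holds the column names
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)

# write each column into its own text file, one value per line
for name, values in columns.items():
    with open(name + ".txt", "w") as out:
        out.write("\n".join(values))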
For pandas I can use underscores instead of whitespace in column names, and I found the necessary code on Stack Overflow to convert txt files to csv; after testing, they run as CSV files.
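A sketch of that txt-to-csv conversion in pandas, also swapping whitespace in column names for underscores (the delimiter of the source file is an assumption):

import pandas as pd

df = pd.read_csv("input.txt", delim_whitespace=True)  # assume whitespace-delimited text
df.columns = [c.replace(" ", "_") for c in df.columns]  # underscores instead of spaces
df.to_csv("output.csv", index=False)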
I found two great sites for building regex patterns to clean text files: rexegg.com and regex101.com.
I installed everything, it works, and it is great.
For CSV file automation you can also look at the csv match projects on GitHub.
I also found the Scrapy tutorial on GitHub, which I hope to work through soon.
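I have not worked through the tutorial yet, but the basic shape of a Scrapy spider looks like this (the spider name and start URL are placeholders of mine):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["http://learningpythontobuildbusiness.blogspot.com/"]

    def parse(self, response):
        # yield every link on the page as a scraped item
        for href in response.css("a::attr(href)").extract():
            yield {"link": href}

Saved as link_spider.py, it can be run with: scrapy runspider link_spider.py -o links.csv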

    So in digital marketing and SEO you can use Python tools productively: to analyze competitors, to collect information from numerous sites, to merge it, to find the most common terms, and so on.
    With the search "python github srap website linking" I found the GitHub project Scrapely, a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages. I will test it and write about it. These kinds of tools are necessary for productive, competitive SEO.
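Based on the usage example in scrapely's README, training on one annotated page and scraping a similar one looks roughly like this:

from scrapely import Scraper

s = Scraper()
# train on one example page plus the data it should extract
train_url = 'http://pypi.python.org/pypi/w3lib/1.1'
data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
s.train(train_url, data)
# scrape a similar page with the learned template
print(s.scrape('http://pypi.python.org/pypi/Django/1.3'))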

The Scrapely tests copied from https://github.com/scrapy/scrapely/blob/master/tests/ show errors when run from cmd:

C:\Users\User\Desktop>python TEST_1.py
Traceback (most recent call last):
  File "TEST_1.py", line 10, in <module>
    from nose_parameterized import parameterized

ImportError: No module named nose_parameterized

C:\Users\User\Desktop>pip install nose-parameterized
Collecting nose-parameterized
  Downloading nose_parameterized-0.6.0-py2.py3-none-any.whl
Installing collected packages: nose-parameterized
Successfully installed nose-parameterized-0.6.0

C:\Users\User\Desktop>python TEST_1.py
C:\Python27\lib\site-packages\nose_parameterized\__init__.py:7: UserWarning: The 'nose-parameterized' package has been renamed 'parameterized'. For the two step migration instructions, see: https://github.com/wolever/parameterized#migrating-from-nose-parameterized-to-parameterized (set NOSE_PARAMETERIZED_NO_WARN=1 to suppress this warning)
  "The 'nose-parameterized' package has been renamed 'parameterized'. "
Traceback (most recent call last):
  File "TEST_1.py", line 12, in <module>
    from scrapely.htmlpage import HtmlPage

ImportError: No module named scrapely.htmlpage

The second ImportError suggests scrapely itself never got installed (only its test dependency did), so the likely fix is pip install scrapely. For now I cannot use Scrapely either.

    Not yet tested: useful Python projects, found with the search "python github  extract inbound links of website"
1. Links-Extractor
Extracts all internal and external links from a URL in Python.
To run, type: python extractor.py [http://url1] [https://url2] and so on.
How the script works, and the complete tutorial, is here:

http://com.puter.tips/2016/12/extract-all-internal-and-external-links.html
   A useful tool for links, although my question was about inbound links, which are the important ones for SEO.
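I have not dug into extractor.py itself yet; a minimal sketch of the same internal/external split with requests and BeautifulSoup (the function name and structure here are mine, not the project's):

import sys
import requests
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2

def extract_links(url):
    # split every link on the page into internal and external sets
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    base = urlparse(url).netloc
    internal, external = set(), set()
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])  # resolve relative links against the page
        (internal if urlparse(link).netloc == base else external).add(link)
    return internal, external

if __name__ == "__main__":
    for page in sys.argv[1:]:
        internal, external = extract_links(page)
        print("%s: %d internal, %d external links" % (page, len(internal), len(external)))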

To test, I ran:

import extraction
import requests

url = "http://learningpythontobuildbusiness.blogspot.lt/"
html = requests.get(url).text
extracted = extraction.Extractor().extract(html, source_url=url)
extracted.title
print extracted.title, extracted.description, extracted.image, extracted.url
print extracted.titles, extracted.descriptions, extracted.images, extracted.urls

And I get:

C:\Users\User\Desktop>python TEST_1.py
C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified,
 so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, 
but if you run this code on another system, or in a different virtual environment, it may use a different
 parser and behave differently.

The code that caused this warning is on line 8 of the file TEST_1.py. To get rid of this warning, 
change code that looks like this:

 BeautifulSoup(YOUR_MARKUP})
to this:
 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
Run business automating tasks with python programming. Learning to automate small business 
operations with python language. Become competitive creatively integrating 
codes. https://resources.blogblog.com/img/icon18_wrench_allbkg.png 
http://learningpythontobuildbusiness.blogspot.com/

[u'Run business automating tasks with python programming.', u'links-from-link-header 0.1.0', 
u'extraction 0.2', u'Search This Blog', u'Wednesday, October 18, 2017', u'Saturday, October 14, 2017',
 u'SEO productivity tools'] [u'Learning to automate small business operations with python language. 
Become competitive creatively integrating codes.',
 u'Small business requires many tasks to be completed on PC. With python language you need to 
learn to integrate many libraries which can run your daily tasks or to code own solutions to outpace
competitors. There are billions lines of python codes, so the limit is integrating knowledge and 
efficiency. To run business you can predict demand, use artificial intelligence to bring leads, automate
 routines tasks, integrate other programs. From salesmen to scientists use coding. Become curious!']
 ['https://resources.blogblog.com/img/icon18_wrench_allbkg.png',
 'https://resources.blogblog.com/img/icon18_edit_allbkg.gif', 
'https://2.bp.blogspot.com/-jDCSjtYVQII/WdJ7vnoi7lI/AAAAAAAACdw/Zka2cphPTxIpF0VZ6z0z
Gaz0VAwoXqzlQCLcBGAs/s320/WORKFLOW.jpg', 'https://4.bp.blogspot.com/-3NV3jEl6ynk/WdJ
72Nol40I/AAAAAAAACd0/6XaM3MzHYGcsPTvncsfGuJIBBDFcEg7WgCLcBGAs/s320/
PROJECT_STRUCTURE.jpg', 'https://4.bp.blogspot.com/-jsK5Nc4S0kM/WdJ74bgm3SI/
AAAAAAAACd4/RBtN10Dt5NYZHUKb_GciOJ35rYARtvMvQCLcBGAs/s320/
First_glance_3.jpg', 'https://video.google.com/ThumbnailServer2?app=blogger&contentid=
cf39a4434b35c83f&offsetms=5000&itag=w160&sigh=taZlXuotUFjbF31TqFpxmh5b1Xs']
 ['http://learningpythontobuildbusiness.blogspot.com/']

The test works, though note that it actually exercises the extraction package (item 4 below) rather than extractor.py.

Search "python pypi  extract inbound links of website" 
2. summary-extraction 0.2 
So I rub test:
import summarys = summary.Summary('https://github.com/svven/summary')s.extract()s.title
C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the
 best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on
 another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 6 of the file TEST_1.py. To get rid of this warning, change code that
 looks like this:
 BeautifulSoup(YOUR_MARKUP})
to this:
 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
Traceback (most recent call last):
  File "TEST_1.py", line 6, in <module>
    s.extract()
  File "C:\Python27\lib\site-packages\summary\__init__.py", line 251, in extract
    body = self._get_tag(response, tag_name="body")
  File "C:\Python27\lib\site-packages\summary\__init__.py", line 184, in _get_tag
    for chunk in response.iter_content(CHUNK_SIZE, decode_unicode=True):
  File "C:\Python27\lib\site-packages\requests\utils.py", line 440, in stream_decode_response_unicode
    for chunk in iterator:
  File "C:\Python27\lib\site-packages\requests\models.py", line 745, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "C:\Python27\lib\site-packages\urllib3\response.py", line 432, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "C:\Python27\lib\site-packages\urllib3\response.py", line 598, in read_chunked
    self._update_chunk_length()
  File "C:\Python27\lib\site-packages\urllib3\response.py", line 540, in _update_chunk_length
    line = self._fp.fp.readline()

AttributeError: 'NoneType' object has no attribute 'readline'

The crash happens deep inside urllib3's chunked-response reading, so it looks more like an incompatibility between the summary package and the installed requests/urllib3 versions than a mistake in the test itself. Either way, I cannot run this project for now.

3.  links-from-link-header 0.1.0 
import links_from_header
header = '<https://api.github.com/user/repos?page=1>; rel="first", <https://api.github.com/user/repos?page=9>; rel="prev", <https://api.github.com/user/repos?page=11>; rel="next", <https://api.github.com/user/repos?page=50>; rel="last"'
links_from_header.extract(header)

The outcome: no error, just a dict printed to cmd. For my purposes the project is a waste of time, so I uninstalled the package.

4. extraction 0.2
     Extraction is a Python package for extracting titles, descriptions, images and canonical urls from web pages.  You might want to use Extraction if you’re building a link aggregator where users submit links and you want to display them (like submitting a link to Facebook, Digg or Delicious).
    Extraction is not a web crawling or content retrieval mechanism; rather, it is a tool to use on data which has already been retrieved or crawled by a different tool.
    So for SEO work I now have the necessary tools to choose from and can do competitive analysis productively.

So I tested:

import extraction
import requests
url = "http://lethain.com/social-hierarchies-in-engineering-organizations/"
html = requests.get(url).text
extracted = extraction.Extractor().extract(html, source_url=url)
extracted.title
print extracted.title, extracted.description, extracted.image, extracted.url
print extracted.titles, extracted.descriptions, extracted.images, extracted.urls
And I get this result:

C:\Users\User\Desktop>python TEST_1.py
C:\Python27\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the
best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another
system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 8 of the file TEST_1.py. To get rid of this warning, change code that looks
like this:
 BeautifulSoup(YOUR_MARKUP})
to this:
 BeautifulSoup(YOUR_MARKUP, "lxml")

  markup_type=markup_type))
Social Hierarchies in Engineering Organizations When things get bad, people start complaining about percieved social hierarchies. Few things piss off the already angry engineer like knowing they're less important than an architect.
http://lethain.com/static/author.png https://lethain.com/social-hierarchies-in-engineering-organizations/
[u'Social Hierarchies in Engineering Organizations', u'Irrational Exuberance !', u'Classical Engineering Hierarchies',
u'Decouple Role and Title'] [u"When things get bad, people start complaining about percieved social hierarchies.
Few things piss off the already angry engineer like knowing they're less important than an architect.", u'November 4,
2012. Filed under management software-engineering', u"I recently read Anthropology of Mid-Sized Startups , which argues that as technology companies grow they begin behaving as tribes or religions. The article's premise captured me more ,than the execution, and left me thinking about social hierarchies.", u"In the good times, people usually don't explicitly notice social hierarchies--they're too busy doing something they enjoy to analyze why they're pissed off--but once things hit a rough patch, hierarchies tend to become a vocal sore point.", u"They're irksome because they represent an institutionalized incongruence in the valuation of employees: you're not as important as Jim, he's an architect, you're just an engineer.", u"It seems natural that engineers would complain about the management track--managers are always recipients of great hatred, probably because we do a lot of harm--but I've found engineers much more forgiving
of line management than of another leadership track: architects."] ['http://lethain.com/static/author.png']
['https://lethain.com/social-hierarchies-in-engineering-organizations/',
'https://lethain.com//social-hierarchies-in-engineering-organizations/']


A useful source for finding Python projects is GitHub's trending page for Python: https://github.com/trending/python

So I have projects that work and will help me be productive. But the main remaining issue, quoted below, is the one that matters for scraping; it needs reading and understanding:
 BeautifulSoup(YOUR_MARKUP})
to this:
 BeautifulSoup(YOUR_MARKUP, "lxml")

So I looked into the BeautifulSoup problem. It turns out to be a parser-selection warning rather than a bug or a conflicting library: BeautifulSoup simply wants the parser named explicitly instead of silently picking the best one available on the system. BeautifulSoup is an important library for scraping the web and can be used together with Scrapy. I set deeper debugging aside for a while, but lxml was updated and html5lib was installed. The second issue to solve is the installation and path of the Microsoft Visual C++ Redistributable for Visual Studio. Microsoft Visual C++ (often abbreviated to MSVC) is an integrated development environment (IDE) product from Microsoft for the C, C++, and C++/CLI programming languages. Some Microsoft applications are programmed in C++ and can be connected to Python.
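The warning itself shows the fix: pass the parser name explicitly instead of letting BeautifulSoup guess. A minimal example (any of "lxml", "html5lib" or the built-in "html.parser" works, as long as it is installed):

import requests
from bs4 import BeautifulSoup

html = requests.get("http://learningpythontobuildbusiness.blogspot.com/").text
soup = BeautifulSoup(html, "lxml")  # explicit parser, no more UserWarning
print(soup.title.string)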

So I updated to the latest Python version, and there is no problem with the parse module any more. If only every project were free of bugs.