Gautam Kandlikar

Work hard, ball harder. That's the key.

If this happens, the S&P 500 is in real trouble: Pro – Yahoo Finance

This technical analysis business sometimes feels like it’s straight out of the Onion.

http://finance.yahoo.com/news/happens-p-real-trouble-yamada-182953086.html

Using AWS to solve your problems, case 376

I really don’t know why I didn’t think of it before.

I have been working on a project. It involves going through a bunch of company pages on Crunchbase and trying to figure out what ‘industry’ or ‘category’ a company belongs to.

Crunchbase has some 200,000 companies, so accessing all of it is not as simple as pulling up an Excel sheet. They do have an API, though it is not well documented and doesn’t always return well-formed JSON documents. Sadly, the post I linked to about the malformed JSON was written about a year ago, and nothing seems to have changed.
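To give a flavor, here’s roughly what a single request looks like (a minimal sketch; the v1-style endpoint path and the placeholder key are assumptions, modeled on the query strings in my code below):

import httplib
import json

API_KEY = 'YOUR_API_KEY'  # hypothetical placeholder
conn = httplib.HTTPConnection('api.crunchbase.com')
conn.request('GET', '/v/1/company/dropbox.js?api_key=' + API_KEY)
raw = conn.getresponse().read()
company = json.loads(raw)  # can raise ValueError when the JSON is malformed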

I ran into a bunch of hurdles, most of which boiled down to the fact that it took forever to get all this data. I tried a variety of things to speed it up (writing information to flat files for later processing, writing it to a SQLite database for later processing, and so on), but nothing seemed to stick.

I decided to revamp my approach over the last couple of days to leverage StarCluster and S3. StarCluster is essentially an easy way to spin up an EC2 cluster pre-configured with applications that facilitate distributed computing. I wasn’t doing heavy distributed computing per se, but I did need to establish a whole bunch of connections to Crunchbase, so having multiple instances available helped a lot. Similarly, I wanted a place to store the JSON docs I was retrieving without worrying too much about file I/O or network bandwidth. Furthermore, I wanted my documents to be persistent, not wiped out if I terminated my EC2 instances. That made S3 a natural choice.

The nice thing about this combination is how much less time it took me to troubleshoot things and handle exceptions. Here are some good examples:
1. To check whether I had already retrieved a document, I used to have to write code like:

def fetch_company_file(cb_conn, company):
    global COMPANY_EI, API_KEY, EXCEPTION_FILE, OUTPUTDIR
    #build the API query and the local cache path for this company
    query = ''.join((COMPANY_EI, company['permalink'], '.js?api_key=', API_KEY))
    company_file = ''.join((OUTPUTDIR, '/', company['permalink'], '.json'))
    cb_conn.request("GET", query)
    data = cb_conn.getresponse().read()
    #'a+' creates the file if it's missing ('r+' would raise IOError);
    #only write if nothing was cached before
    with open(company_file, 'a+') as f:
        f.seek(0)
        if len(f.read()) < 1:
            f.write(data)

but now it’s more like:


def fetch_company_to_s3(cb_conn, s3bucket, company):

    #keys on my s3 bucket are of the form crunchbase/data/company-name.json
    keyname = ''.join(('crunchbase', '/data/', company['permalink'], '.json'))

    #unambiguous check for whether the file already exists and should be skipped
    if keyname in [obj.name for obj in s3bucket.list(prefix=keyname)]:
        return

    #next 3 lines have no change from before
    #(ep and apikey are module-level config, like COMPANY_EI and API_KEY above)
    query = ''.join((ep, company['permalink'], '.js?api_key=', apikey))
    cb_conn.request("GET", query)
    data = cb_conn.getresponse().read()

    #just create a new key and dump the variable data into it
    key = s3bucket.new_key(keyname)
    key.set_contents_from_string(data)
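For reference, s3bucket above comes from a plain boto connection, something like this (the bucket name is hypothetical):

import boto

#credentials are picked up from the environment or ~/.boto
s3_conn = boto.connect_s3()
s3bucket = s3_conn.get_bucket('my-crunchbase-bucket')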

2. It also lets me recover pretty easily from a broken connection or the like. I can now just skip through already-fetched files left and right instead of figuring out where I’d dropped my connection and where best to restart. I had originally tried a couple of workarounds, e.g. using a counter to keep track of how many elements of the company list I had gone through, but I realized that just splitting a couple of hundred thousand items over 16 computers was much better than any other kind of handling.

3. It frees up my computer! No more overheating! I can sit back and smoke a hookah for all I care.

Which is pretty much what I’m doing right now.

Except not really. I’m just sitting here watching my EC2 cluster do its work. I already have 172K records. On my first attempt, I had not been able to score even 20K records without this kind of parallel computing at my disposal!

[Screenshot of the record counter, 2014-03-28, 12:11 AM]

I should be done in several minutes here. If I haven’t already passed out on the couch waiting for the counter to tick closer to 220K, I will try to post some summary statistics on the files. But don’t count on it.

Some of the things about parallel computing are interesting. By which I mean, the way Python handles them is interesting.

There’s this notion of sync/async execution that takes a lot of getting used to. Then there are functions that let you ‘scatter’ an iterable across engines (in chunks or round-robin, I think). Then you can ‘push’ or ‘pull’ variables to and from the remote clients, and you can even move functions around this way. This last bit was a great find because it saved me from having to write different sets of code for each machine.
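For instance, here’s a minimal sketch of the scatter/push pattern with IPython.parallel, the tooling StarCluster sets up. It assumes a companies list like the one I’ve been iterating over, reuses the fetch function from above, and assumes each engine has already created its own cb_conn and s3bucket:

from IPython.parallel import Client

rc = Client()   #connect to the cluster's controller
dview = rc[:]   #a DirectView across all engines

#'scatter' splits the company list into one chunk per engine
dview.scatter('companies', companies)

#'push' copies variables -- even functions -- out to every engine
dview.push({'fetch_company_to_s3': fetch_company_to_s3})

#each engine loops over only its own chunk
dview.execute('for c in companies: fetch_company_to_s3(cb_conn, s3bucket, c)')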

The biggest thing I wanted out of this was to have the whole app contained within one cell of an IPython notebook. I don’t know if that necessarily makes it a good thing, but I think it’s easier to read, and you don’t have to explain to others what you’re trying to do because all the code is right there.

Anyway, if you want to do cool things, don’t always solve them the conventional way; people have spent a lot of time trying to make things easier for you. Start using those things!

-Gautam

 

My contrarian thinking.

If you asked me, this is how I would put ~$10,000 into the stock market today, assuming $7 per transaction.

I want to check out, for giggles, what it’s like being contrarian. My screening algorithm was:

* Market cap > $300M
* Trading in the low end of its 52-week range; mostly less than 10% above its 52-week low
* Trading below its 20-day moving average
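In code, the screen would look something like this (a hypothetical pandas sketch; the column names are made up):

import pandas as pd

def contrarian_screen(df):
    #assumed columns: mkt_cap, price, low_52wk, ma_20d
    return df[(df['mkt_cap'] > 300e6) &
              (df['price'] < 1.10 * df['low_52wk']) &
              (df['price'] < df['ma_20d'])]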

This gave me a list of about 240 companies, from which I chose 8. I basically chose them based on whether or not I thought they were cool. It’s stupid, I know. But humor me.

For reference, the S&P 500 was at 1849.04 at today’s close.


I just attended an #AWSSummit session on this: Project Rhino

http://www.pythian.com/blog/intel-hadoop-distribution/

When I heard that Intel announced their own Hadoop distribution, my first thought was “Why would they do that?”. This blog post is an attempt to explore why anyone would need their own Hadoop distribution, what Intel can gain by having their own and who is likely to adopt Intel’s distribution.

 

[...]

Let’s start from basics: Intel sells CPUs. That’s their main line of business, but they also write software. For example, Intel’s C compiler is first rate. I used to love working with it. Intel wrote their own compiler so executables generated with it will always use the best Intel features. This means that popular software would run faster on Intels, because their performance features will be used even when developers don’t know about them (Oracle Optimizer attempts to do the same, but with less success).

How does it apply to Hadoop? Clearly Intel noticed that Hadoop clusters tend to have lots of CPUs, and they are interested in making sure that these CPUs are always Intel, possibly by making sure that Hadoops run faster on Intel CPUs.

Let’s look at Intel’s blog post on the topic: http://blogs.intel.com/technology/2013/02/big-data-buzz-intel-jumps-into-hadoop

“The Intel Distribution for Apache Hadoop software is a 100% open source software product that delivers Hardware enhanced performance and security (via features like Intel® AES-NI™ and SSE to accelerate encryption, decryption, and compression operation by up to 14 times).”

“With this distribution Intel is contributing to a number of open source projects relevant to big data such as enabling Hadoop and HDFS to fully utilize the advanced features of the Xeon™ processor, Intel SSD, and Intel 10GbE networking.”

“Intel is contributing enhancements to enable granular access control and demand driven replication in Apache HBase to enhance security and scalability, optimizations to Apache Hive to enable federated queries and reduce latency. ”

Intel is doing for Hadoop the same thing it did for C compilers – make sure they use the best hardware enhancements available in the CPUs and other hardware components available from Intel. The nice thing is that the enhancements are available as open source – Intel doesn’t care that the software is free, since they are selling the hardware!

My first summit as an attendee. #AWSsummit in SF.

[Photo: created with VSCO Cam]

VirtualEnv and Anaconda – Frustrations

Having Anaconda on your Mac is nice (IPython Notebook!)

But it totally breaks things if you are trying to use virtualenv on top of that.

Whenever you create a virtualenv, you get an ugly error that is not very easy to parse.

GKs-MacBook-Air:~ gk$ virtualenv venv
New python executable in venv/bin/python
Installing setuptools, pip...
  Complete output from command /Users/gk/playtestr/bin/python -c "import sys, pip; sys...d\"] + sys.argv[1:]))" setuptools pip:
  Ignoring indexes: https://pypi.python.org/simple/
Exception:
Traceback (most recent call last):
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv_support/pip-1.5.4-py2.py3-none-any.whl/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv_support/pip-1.5.4-py2.py3-none-any.whl/pip/commands/install.py", line 236, in run
    session = self._build_session(options)
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv_support/pip-1.5.4-py2.py3-none-any.whl/pip/basecommand.py", line 52, in _build_session
    session = PipSession()
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv_support/pip-1.5.4-py2.py3-none-any.whl/pip/download.py", line 216, in __init__
    super(PipSession, self).__init__(*args, **kwargs)
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv_support/pip-1.5.4-py2.py3-none-any.whl/pip/_vendor/requests/sessions.py", line 200, in __init__
    self.headers = default_headers()
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv_support/pip-1.5.4-py2.py3-none-any.whl/pip/_vendor/requests/utils.py", line 550, in default_headers
    'User-Agent': default_user_agent(),
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv_support/pip-1.5.4-py2.py3-none-any.whl/pip/_vendor/requests/utils.py", line 519, in default_user_agent
    _implementation = platform.python_implementation()
  File "/Users/gk/anaconda/lib/python2.7/platform.py", line 1499, in python_implementation
    return _sys_version()[0]
  File "/Users/gk/anaconda/lib/python2.7/platform.py", line 1464, in _sys_version
    repr(sys_version))
ValueError: failed to parse CPython sys.version: '2.7.5 (default, Aug 25 2013, 00:04:04) \n[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]'

Storing debug log for failure in /Users/gk/.pip/pip.log
----------------------------------------
...Installing setuptools, pip...done.
Traceback (most recent call last):
  File "/Users/gk/anaconda/bin/virtualenv", line 11, in 
    sys.exit(main())
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv.py", line 824, in main
    symlink=options.symlink)
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv.py", line 992, in create_environment
    install_wheel(to_install, py_executable, search_dirs)
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv.py", line 960, in install_wheel
    'PIP_NO_INDEX': '1'
  File "/Users/gk/anaconda/lib/python2.7/site-packages/virtualenv.py", line 902, in call_subprocess
    % (cmd_desc, proc.returncode))
OSError: Command /Users/gk/venv/bin/python -c "import sys, pip; sys...d\"] + sys.argv[1:]))" setuptools pip failed with error code 2

I started off trying to figure out what the hell CPython was, and if I could get rid of it. Turns out, it’s a fool’s errand. Don’t bother with my silly ways.

This is what CPython is (and hopefully you see why messing with it is bad) – From StackOverflow:

So what is CPython

CPython is the original Python implementation. It is the implementation you download from Python.org. People call it CPython to distinguish it from other, later, Python implementations, and to distinguish the implementation of the language engine from the Python programming language itself.

The latter part is where your confusion comes from; you need to keep Python-the-language separate from whatever runs the Python code.

CPython happens to be implemented in C. That is just an implementation detail, really. CPython compiles your python code into bytecode (transparently) and interprets that bytecode in an evaluation loop.

What about Jython, etc.

Jython, IronPython and PyPy are the current ‘other’ implementations of the Python programming language; these are implemented in Java, C# and RPython (a subset of Python), respectively. Jython compiles your Python code to Java bytecode, so your Python code can run on the JVM. IronPython lets you run Python on the Microsoft CLR. And PyPy, being implemented in (a subset of) Python, lets you run Python code faster than CPython, which rightly should blow your mind. :-)

Actually compiling to C

So CPython does not itself translate your Python code to C. Instead, it runs an interpreter loop. There is a project that does translate Python-ish code to C, and that is called Cython. Cython adds a few extensions to the Python language, and lets you compile your code to C extensions, code that plugs into the CPython interpreter.

Here’s a basic explanation of what’s going on:

  • I’m trying to create a virtualenv for an app
  • Virtualenv makes some calls to underlying libraries to do its thing
  • Something is wrong in one of those libraries

After about an hour of searching, I came across this on the python.org bug tracker.

Digging through the whole thing, it became clearer what was going on: Anaconda ships a modified platform.py whose version-parsing regex expects Anaconda’s own sys.version format, which breaks things for a stock Python.

The user wesmadrigal had said he’d fixed it:

I just commented out the _sys_version_parser regular expression in 
anaconda/lib/python2.7/platform.py on line 1363 and replaced it with the
_sys_version_parser from /usr/lib/python2.7/platform.py and everything worked fine.

What he’s saying is that if you do this:

GKs-MacBook-Air:~ gk$ python
Python 2.7.6 |Anaconda 1.9.1 (x86_64)| (default, Jan 10 2014, 11:23:15) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform
>>> platform.__file__
'/Users/gk/anaconda/lib/python2.7/platform.pyc'

and open up the .py file (not the .pyc!) referenced therein, and search for ‘_sys_version_parser’, it should look like this:

_sys_version_parser = re.compile(
    r'([\w.+]+)\s*'
    '\(#?([^,]+),\s*([\w ]+),\s*([\w :]+)\)\s*'
    '\[([^\]]+)\]?')

and not like this:

_sys_version_parser = re.compile(
    r'([\w.+]+)\s*'
    '\|[^|]*\|\s*' # version extra
    '\(#?([^,]+),\s*([\w ]+),\s*([\w :]+)\)\s*'
    '\[([^\]]+)\]?')
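You can see why the patched regex chokes on a stock version string with a quick check (pasting the Anaconda variant of the regex in directly):

import re

_sys_version_parser = re.compile(
    r'([\w.+]+)\s*'
    '\|[^|]*\|\s*' # the 'version extra' Anaconda adds
    '\(#?([^,]+),\s*([\w ]+),\s*([\w :]+)\)\s*'
    '\[([^\]]+)\]?')

stock = ('2.7.5 (default, Aug 25 2013, 00:04:04) \n'
         '[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)]')
print(_sys_version_parser.match(stock))  # None, which is why platform.py raises ValueError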

It looks like wesmadrigal has submitted a bug to the folks at Continuum, who make Anaconda.

So if you run into this error, hopefully the above will fix things.

I don’t know if

conda update conda

or

conda update anaconda

will break this again, though. Gotta keep an eye out.

 

Whither advertising

My click patterns probably indicate less about what I want to buy this weekend than about what I want to buy several weeks, months, or years from now.

I know I went to cars.com, I know I didn’t go to enterprise.com, and I did go to getaround.com to explore. But seriously, people who have the dirt on me should know by this point that I possess a functional car, yeah?
