Thursday, November 12, 2009

Remove a property from GAE model

This article, “Updating Your Model's Schema,” is already great and clear, but it does not have a complete code example. I decided to write one and add some explanations, just in case I need it later.

It takes four steps to remove a property from a data model:
  1. Inherit from db.Expando if the model does not already inherit from it.
  2. Remove the obsolete property from the model definition.
  3. Delete the attribute (the property) from each entity: del entity.obsolete
  4. Switch back to db.Model if the model originally inherited from it.

How to actually do it:

Assume a model looks like:
class MyModel(db.Model):
  foo = db.TextProperty()
  obsolete = db.TextProperty()

Re-define the model to:
class MyModel(db.Expando):
#class MyModel(db.Model):
  foo = db.TextProperty()
#  obsolete = db.TextProperty()

Make sure the model inherits from db.Expando, and comment out (or just delete) the line of the obsolete property.

Here is the example code to delete the attribute, the property:

from google.appengine.ext import db
from google.appengine.runtime import DeadlineExceededError

def del_obsolete(self):

  count = 0
  last_key = ''
  try:
    q = MyModel.all()
    cont = self.request.get('continue')
    if cont:
      q.filter('__key__ >=', db.Key(cont))
    q.order('__key__')
    entities = q.fetch(100)
    while entities:
      for entity in entities:
        last_key = str(entity.key())
        try:
          del entity.obsolete
        except AttributeError:
          pass
        entity.put()
        count += 1
      q.filter('__key__ >', entities[-1].key())
      entities = q.fetch(100)
  except DeadlineExceededError:
    self.response.out.write('%d processed, please continue to %s?continue=%s' % (count, self.request.path_url, last_key))
    return
  self.response.out.write('%d processed, all done.' % count)

Note that this snippet is meant to be used as the get method of a webapp.RequestHandler, which is why it can use self.request and self.response.

It uses the entities' keys to walk through every entity, which is efficient and safe. You may also want to put your application into maintenance mode to keep other code from adding new entities while this runs. Keys of new entities seem to only increase, so the walk would reach them as well, but processing them just wastes CPU time because new entities have no obsolete property.

Because it has to go through all entities, it takes a lot of time, so a mechanism to continue processing the remaining entities is necessary. The code catches google.appengine.runtime.DeadlineExceededError if it cannot finish within one request, and then returns a link that you can follow to continue. If you have lots of entities, you may want to use a task instead of manual continuation. You may also want to cap the number of entities processed in one request, say at 1000.
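The continuation idea, together with a cap per request, can be sketched without the datastore at all. This is only a sketch under assumptions: a plain dict stands in for the MyModel kind, and STORE, MAX_PER_REQUEST and process_batch are hypothetical names, not App Engine APIs.

```python
# Hypothetical stand-in for the datastore: key -> entity (a plain dict).
STORE = dict(('key%04d' % i, {'foo': 'x', 'obsolete': 'y'}) for i in range(2500))
MAX_PER_REQUEST = 1000  # assumed cap on entities handled per request


def process_batch(store, start_after=None, limit=MAX_PER_REQUEST):
    """Strip 'obsolete' from up to `limit` entities whose key sorts after `start_after`.

    Returns (processed_count, last_key); last_key is None once nothing is left.
    """
    keys = sorted(k for k in store if start_after is None or k > start_after)
    batch = keys[:limit]
    for key in batch:
        # Same effect as `del entity.obsolete`, but safe to run twice.
        store[key].pop('obsolete', None)
    more = len(keys) > len(batch)
    return len(batch), (batch[-1] if more else None)


# Each loop iteration plays the role of one request (or one task).
total, cursor = 0, None
while True:
    count, cursor = process_batch(STORE, cursor)
    total += count
    if cursor is None:
        break
print('%d processed, all done.' % total)
```

The returned key plays the same role as the continue parameter in the handler above: whoever makes the next request passes it back in.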

Once it has done its job, change the model definition back to db.Model and remove the obsolete property line:
class MyModel(db.Model):
  foo = db.TextProperty()


That's it.

Wednesday, November 11, 2009

Walking through/counting all entities in GAE datastore

I needed to count how many entities of kind Blog have the boolean property accepted set to True, but I suddenly realized that OFFSET in a query is of no use to me (in fact, it is not really useful at all).

In SDK 1.1.0, OFFSET on the Development Server does what you would expect if you are new to GAE and have SQL experience, but it still behaves differently from the Production Server.

Basically, if you have 1002 entities in Blog and you want to get the 1002nd entity, the following will not get you that entity:
q = Blog.all()
# Doing filter here
# Order here
# Then fetch
r = q.fetch(1, 0)[0]    # 1st
r = q.fetch(1, 1)[0]    # 2nd
r = q.fetch(1, 999)[0]  # 1000th
r = q.fetch(1, 1000)[0] # 1001st
r = q.fetch(1, 1001)[0] # 1002nd

You will get an exception on the last one, like:
BadRequestError: Offset may not be above 1000.
BadRequestError: Too big query offset.
The first one is from the Production Server, the second is from the Development Server.

OFFSET takes effect after these steps:
  1. filter data (WHERE clause)
  2. sort data (ORDER clause)
  3. truncate to the first 1001 entities (even though count() only returns 1000 at most)
Only after filtering, sorting, and truncating to the first 1001 entities is your OFFSET applied. If you have read “Updating Your Model's Schema,” it warns you:
“A word of caution: when writing a query that retrieves entities in batches, avoid OFFSET (which doesn't work for large sets of data) and instead limit the amount of data returned by using a WHERE condition.”
The only way is to filter data (the WHERE clause), so you will need a unique property if you need to walk through all entities.
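To make the difference concrete, here is a small simulation in plain Python. It only models the behavior described above (the cap of 1000 is taken from the production error message). The names fetch and walk_all and the id field are hypothetical and are not the datastore API.

```python
MAX_OFFSET = 1000  # cap taken from the production error message above

# 1002 fake entities with a unique, sortable 'id' property.
ENTITIES = [{'id': i} for i in range(1, 1003)]


def fetch(entities, limit, offset=0, id_after=None):
    """Simulated query: WHERE id > id_after, ORDER BY id, then OFFSET and LIMIT."""
    if offset > MAX_OFFSET:
        raise ValueError('Offset may not be above %d.' % MAX_OFFSET)
    rows = sorted((e for e in entities if id_after is None or e['id'] > id_after),
                  key=lambda e: e['id'])
    return rows[offset:offset + limit]


def walk_all(entities, batch_size=100):
    """Visit every entity by moving the WHERE condition forward, never using OFFSET."""
    last_id = None
    while True:
        batch = fetch(entities, batch_size, id_after=last_id)
        if not batch:
            return
        for entity in batch:
            yield entity
        last_id = batch[-1]['id']


try:
    fetch(ENTITIES, 1, 1001)  # asking for the 1002nd entity by offset
except ValueError as e:
    print('offset failed: %s' % e)

print(list(walk_all(ENTITIES))[-1])  # the 1002nd entity, reached via the filter
```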

The nice thing is that you don't need to create a new property: there is already one in all of your kinds, the Key, which is __key__ in a query.

The benefits of using it:
  • No additional property,
  • No additional index (Because it's already created by default), and
  • Combining the two above, you don't need additional datastore quota, since both indexes and properties use quota.
Here is a code snippet that I use to count Blog entities. You should be able to adapt it if you need to process data:
def get_count(q):
  r = q.fetch(1000)
  count = 0 
  while True:
    count += len(r)
    if len(r) < 1000:
      break
    q.filter('__key__ >', r[-1])
    r = q.fetch(1000)
  return count

q = db.Query(blog.Blog, keys_only=True)
q.order('__key__')
total_count = get_count(q)

q = db.Query(blog.Blog, keys_only=True)
q.filter('accepted =', True)
q.order('__key__')
accepted_count = get_count(q)
  
q = db.Query(blog.Blog, keys_only=True)
q.filter('accepted =', False)
q.order('__key__')
blocked_count = get_count(q)

Note that:
  • Remove keys_only=True if you need to process data. You will then need to filter with r[-1].key().
  • Add a way to resume, because it really uses a lot of CPU time when it works on a large set of data.

Dump from GAE and upload to Development Server

I just downloaded the data from one of my App Engine applications by following “Uploading and Downloading.” I used the new and experimental bulkloader.py to download data into a sqlite3 database. You don't need to create the Loader/Exporter classes with this new method.

It does explain how to download and upload, but as far as I can tell, the uploading part only covers the production server. You have to look into the command line options, but it's not complicated.

Here is a complete example to dump data:
$ python googleappengine/python/bulkloader.py --dump --kind=Kind --url=http://app-id.appspot.com/remote_api --filename=app-id-Kind.db /path/to/app.yaml/
[INFO    ] Logging to bulkloader-log-20091111.001712
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Opening database: bulkloader-progress-20091111.001712.sql3
[INFO    ] Opening database: bulkloader-results-20091111.001712.sql3
[INFO    ] Connecting to brps.appspot.com/remote_api
Please enter login credentials for app-id.appspot.com
Email: username@gmail.com
Password for username@gmail.com: 
.[INFO    ] Kind: No descending index on __key__, performing serial download
.......................................................................................................................................................................................
.................................
[INFO    ] Have 2160 entities, 0 previously transferred
[INFO    ] 2160 entities (0 bytes) transferred in 134.6 seconds

And the following uploads to the Development Server, using the sqlite3 database which we just downloaded (not the CSV):
$ python googleappengine/python/bulkloader.py --restore --kind=Kind --url=http://localhost:8080/remote_api --filename=app-id-Kind.db --app_id=app-id
[INFO    ] Logging to bulkloader-log-20091111.004013
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Opening database: bulkloader-progress-20091111.004013.sql3
Please enter login credentials for localhost
Email: ksdf@sdfk.com <- This does not matter, type anything
Password for ksdf@sdfk.com: <- Does not matter
[INFO    ] Connecting to localhost:8080/remote_api
[INFO    ] Starting import; maximum 10 entities per post
........................................................................................................................................................................................................................
[INFO    ] 2160 entites total, 0 previously transferred
[INFO    ] 2160 entities (0 bytes) transferred in 31.3 seconds
[INFO    ] All entities successfully transferred

You will need to specify the app id, which must match the one the Development Server is running.

This may no longer be needed once bulkloader.py is stable.

Monday, November 9, 2009

help in Python Interactive shell

Someone asked why "help(import)" does not work. I know the reason, but that's not what I want to write about here. One reply revealed that I didn't know much about help. It showed a usage that I had never known before:
help('import')

You can pass a string. I had thought help just prints out __doc__. Yes, a string also has __doc__, but why would you want to get the __doc__ of an instance of int, str, list, etc.? So I had never tried passing a string to help.

Therefore I didn't know I could even get help about keywords. Moreover, I thought help was a function, but after digging in I found it is not: help is an instance of site._Helper. The site module is loaded automatically when you fire up the Python interactive shell, and once it is loaded, the help in the shell is an instance of site._Helper.

If you invoke help without any arguments, help(), it brings you into interactive help. I had never tried to use help without passing an object before.

This actually invokes site._Helper.__call__, an instance method, which means the instance of site._Helper is callable, and that's how you get into interactive help.

site._Helper also overrides the __repr__ method. If you just type help and hit enter, the interactive shell invokes this __repr__ method, and that's how we get this hint:
Type help() for interactive help, or help(object) for help about object.
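The two behaviors described above, a custom __repr__ shown at the prompt and a __call__ that makes the instance callable, can be imitated with a toy class. This is only an illustration of the mechanism, not the real source of site._Helper; the class name Helper and the returned strings are made up for the example.

```python
class Helper(object):
    """Toy imitation of the idea behind site._Helper (not its real source)."""

    def __repr__(self):
        # Shown when you type the bare name at the interactive prompt.
        return ('Type helper() for interactive help, '
                'or helper(object) for help about object.')

    def __call__(self, *args):
        # Runs when the instance is called like a function.
        if not args:
            return 'entering interactive help'
        return 'help about %r' % (args[0],)


helper = Helper()
print(repr(helper))      # what the shell would echo for a bare `helper`
print(helper())          # no argument: the interactive branch
print(helper('import'))  # a string argument works like any other object
```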

Note that this hint does not directly mention that you can use help('string'), where string could be a module name, a keyword, or a topic. But you can learn it from the message shown after you quit interactive help:
>>> help()

Welcome to Python 2.6!  This is the online help utility.

If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/tutorial/.

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules.  To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, or topics, type "modules",
"keywords", or "topics".  Each module also comes with a one-line summary
of what it does; to list the modules whose summaries contain a given word
such as "spam", type "modules spam".

help> quit

You are now leaving help and returning to the Python interpreter.
If you want to ask for help on a particular object directly from the
interpreter, you can type "help(object)".  Executing "help('string')"
has the same effect as typing a particular string at the help> prompt.

Maybe this is my excuse for not knowing help better.

Tuesday, October 27, 2009

pxss.py: Pure Python to access libXss via ctypes

pxss.py is a replacement for PyXSS/src/__init__.py, but not for the entire PyXSS. You get IdleTracker, XSSTracker, and get_info(), and that's all.

It accesses libXss.so via ctypes. You only need to put it next to your script; no installation or compilation is required.

A quick example of getting the idle time:
import pxss
print pxss.get_info().idle, 'ms'

get_info() returns the same data as in PyXSS.

If you have another display, you should be able to pass it (after you open it) with other necessary variables to get_info():
def get_info(p_display=None, default_root_window=None, p_info=None):

and get the XScreenSaverInfo.

I made this for another helper script of mine, and its quality is very poor. If you are interested in ctypes, this script might give you some ideas. But this is only my second time using ctypes; my first time was on Windows, for accessing GDI+.
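For readers curious what the ctypes side can look like, here is a minimal sketch. It is not the code of pxss.py: the function name get_idle_ms is hypothetical, and the sketch assumes libX11 and libXss are installed and that the default display can be opened. The structure layout follows XScreenSaverInfo from the X11 scrnsaver.h header.

```python
import ctypes
import ctypes.util


class XScreenSaverInfo(ctypes.Structure):
    """Mirror of the C struct XScreenSaverInfo from X11/extensions/scrnsaver.h."""
    _fields_ = [
        ('window', ctypes.c_ulong),        # screen saver window
        ('state', ctypes.c_int),
        ('kind', ctypes.c_int),
        ('til_or_since', ctypes.c_ulong),  # milliseconds
        ('idle', ctypes.c_ulong),          # milliseconds
        ('eventMask', ctypes.c_ulong),
    ]


def get_idle_ms():
    """Return the idle time in ms, or None if X11/Xss or the display is unavailable."""
    x11_name = ctypes.util.find_library('X11')
    xss_name = ctypes.util.find_library('Xss')
    if not x11_name or not xss_name:
        return None
    xlib = ctypes.cdll.LoadLibrary(x11_name)
    xss = ctypes.cdll.LoadLibrary(xss_name)

    # Declare pointer-sized types so they are not truncated to C int.
    xlib.XOpenDisplay.restype = ctypes.c_void_p
    xlib.XOpenDisplay.argtypes = [ctypes.c_char_p]
    xlib.XDefaultRootWindow.restype = ctypes.c_ulong
    xlib.XDefaultRootWindow.argtypes = [ctypes.c_void_p]
    xlib.XCloseDisplay.argtypes = [ctypes.c_void_p]
    xlib.XFree.argtypes = [ctypes.c_void_p]
    xss.XScreenSaverAllocInfo.restype = ctypes.POINTER(XScreenSaverInfo)
    xss.XScreenSaverQueryInfo.argtypes = [
        ctypes.c_void_p, ctypes.c_ulong, ctypes.POINTER(XScreenSaverInfo)]

    display = xlib.XOpenDisplay(None)  # None means "use $DISPLAY"
    if not display:
        return None
    try:
        info = xss.XScreenSaverAllocInfo()
        xss.XScreenSaverQueryInfo(display, xlib.XDefaultRootWindow(display), info)
        idle = info.contents.idle
        xlib.XFree(info)
        return idle
    finally:
        xlib.XCloseDisplay(display)
```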

Thursday, October 22, 2009

Blit cursor in Matplotlib

I have been writing a program to show quotes from the Yahoo Finance service. After a few searches I learned that Matplotlib has matplotlib.widgets.Cursor for the task; here is the example code. It's not the kind of cursor we want, though: the cursor in such a program must snap its horizontal line to the price in the figure.

So this snap version example could fit the need. This version draws the cursor manually. It works fine if your figure only has one or two data lines. If you use something like matplotlib.finance.CandleSticks, which plots many things on your figure, you will see the cursor lag behind as you move it.
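The snapping step itself is independent of Matplotlib: given the mouse's x position, pick the nearest data point and draw the cursor through it. A minimal sketch, assuming the x values are sorted in ascending order; the function name snap and the sample numbers are made up for illustration.

```python
import bisect


def snap(xs, ys, x):
    """Return the data point (xs[i], ys[i]) whose x value is closest to `x`.

    `xs` must be non-empty and sorted in ascending order.
    """
    i = bisect.bisect_left(xs, x)
    if i == 0:
        return xs[0], ys[0]
    if i == len(xs):
        return xs[-1], ys[-1]
    # Pick whichever neighbour is nearer to the mouse position.
    if x - xs[i - 1] <= xs[i] - x:
        i -= 1
    return xs[i], ys[i]


days = [1, 2, 3, 4, 5]
prices = [10.0, 10.5, 9.8, 11.2, 11.0]  # made-up sample prices
print(snap(days, prices, 3.4))  # the point the cursor should jump to
```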

The first example has a way to deal with that, called blit; you can read more about it at this page. Basically, you save the rendered image, and every time you need to draw your cursor, you restore that saved image and then draw your cursor on top. That saves a lot of time. There is another example code for blit.
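The save, restore, and draw cycle can be sketched roughly as follows. This is not the linked example or my own code, just a minimal sketch: it assumes the off-screen Agg backend so it runs without a GUI, and cursor_line and move_cursor are names made up for the illustration.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend, so this sketch needs no GUI
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(range(10), [x * x for x in range(10)])  # stand-in for the price data

# animated=True keeps the cursor line out of the normal (slow) full draw.
cursor_line = ax.axhline(0, color='red', animated=True)

fig.canvas.draw()                                # render everything once
background = fig.canvas.copy_from_bbox(ax.bbox)  # save the rendered image


def move_cursor(y):
    """Redraw only the cursor line on top of the saved background."""
    fig.canvas.restore_region(background)  # put the saved image back
    cursor_line.set_ydata([y, y])
    ax.draw_artist(cursor_line)            # draw just the one artist
    fig.canvas.blit(ax.bbox)               # push the result to the canvas


move_cursor(25)
```

In a GUI backend, move_cursor would be called from a motion_notify_event handler, and the background would need to be saved again whenever the figure is fully redrawn, for example after a resize.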

I wrote my own example; you can read it here. A quick screenshot:


Sunday, October 18, 2009

Flu Data Viewer

I wrote this to get familiar with GTK, Glade, and matplotlib. This post is not a walkthrough or tutorial for any of them; I just wanted to write down some notes. The Flu Data Viewer is an example, or code I could copy from for other projects (so this code is in the Public Domain), and it is unlikely to be updated in the future. The flu data is from Google Flu Trends.

This program draws one or more countries' flu data in one figure. It downloads data from Google and saves it to fludata.csv in the current directory. I was thinking of doing some processing, but I didn't know what I could do and I wasn't really interested in that.

Here is a screenshot:



You can download the code at Google Code.

Window.visible

When I first used Glade to create the UI, I didn't know that window.visible is False by default, so you have to either set it in Glade or run window.show().

gtk.glade.XML(gladefile, windowname)

windowname must match window.name in Glade. It's obvious, but I thought one Glade XML meant one window, so I set it to something else, and therefore my window never showed up. One Glade XML can have many windows.

missing int type data in csv

loading using matplotlib.mlab.csv2rec

d = dict(zip(fields, [lambda value: int(value) if value != '' else nan]*len(fields)))
self.rec = mlab.csv2rec(CSV_FILENAME, converterd=d)

Where fields is a list of field names. If a value is missing, it will be an empty string ''. I think NaN (Not a Number, numpy.nan) may be a better way to represent it than 0 (zero).

There is another way to deal with missing data, by giving missingd to csv2rec, but the version of matplotlib on my computer doesn't have it. It should be like converterd, but with the values which you want to assign when data is missing.
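The same empty-string-to-NaN conversion can be tried with only the standard library, which is handy for checking the idea without matplotlib. This is a sketch under assumptions: the column names and numbers in SAMPLE are made up and are not real Flu Trends data, and to_int_or_nan is a hypothetical helper name.

```python
import csv
import io
import math

NAN = float('nan')


def to_int_or_nan(value):
    """Convert a CSV cell to int, or NaN when the cell is empty."""
    return int(value) if value != '' else NAN


# Made-up sample: a date column followed by one column per country.
SAMPLE = u'Date,CountryA,CountryB\n2009-10-04,120,\n2009-10-11,,98\n'

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    date = row.pop('Date')
    rows.append((date, dict((name, to_int_or_nan(v)) for name, v in row.items())))

# NaN never compares equal to anything, so test for it with math.isnan.
missing = sum(1 for _, values in rows for v in values.values() if math.isnan(v))
print('%d rows, %d missing values' % (len(rows), missing))
```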

feeding to gtk.TreeView

If you set the column type to int in gtk.ListStore, it will not accept numpy.nan, so the str type may be okay to use.

Multiselection mode in gtk.TreeView

treeview.get_selection().set_mode(gtk.SELECTION_MULTIPLE)

Don't forget NavigationToolbar*

Users still need a way to zoom in/out.
from matplotlib.backends.backend_gtkagg import NavigationToolbar2GTKAgg as NavigationToolbar

vbox.pack_start(canvas, True, True)
vbox.pack_start(NavigationToolbar(canvas, window), False, False)

I hope I can find a way to do data line tracing. It's hard to tell Y=? when X=123.