6 Python Language Features You Must Know

Python provides a lot out of the box. There are often several ways to do the same thing: some are efficient and concise, others are long and inefficient. In this post I will cover 6 Python features that I think are very useful to know and to use in your daily programming.

Iterators

From the Python docs: an iterator is an object representing a stream of data. Basically, it means that the data is returned one item at a time instead of all at once. For example, when you iterate over a list, you get one item at a time, and when you iterate over a string, you get one character at a time. Many built-in data types in Python are iterable. Strings, lists, dicts, and tuples are the most common ones.

Why use it?
Since iterators return only one item at a time, not everything has to be loaded into memory. For example, if you want to count the number of words in a file, you can either load the whole file into memory and do the operation, or you can read the file line by line (which means only one line at a time is in memory) and do the word count.
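Here is a minimal sketch of the line-by-line word count. The sample file is created on the fly just so the example runs on its own; the function name count_words is mine.

```python
import os
import tempfile

def count_words(path):
    """Count words by reading the file one line at a time."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:  # a file object is an iterator over its lines
            total += len(line.split())
    return total

# Throwaway sample file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("one two three\nfour five\n")
    sample_path = tmp.name

word_count = count_words(sample_path)
os.remove(sample_path)
print(word_count)  # 5

# iter() and next() show the one-item-at-a-time behaviour directly.
letters = iter("abc")
first = next(letters)   # 'a'
second = next(letters)  # 'b'
```

Only one line of the file is held in memory at any point, no matter how large the file is.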

 

Enumeration

How many times have you written code that keeps a separate counter variable just to track the index of the current item?

There is a better way: use enumerate.

[Figure: python enumerate]

The results are the same, but enumerate is cleaner: there is no counter to initialize and increment by hand.
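A small sketch of both approaches side by side. The list of fruits is made-up sample data.

```python
fruits = ["apple", "banana", "cherry"]

# The manual way: keep and increment a counter yourself.
manual = []
index = 0
for fruit in fruits:
    manual.append((index, fruit))
    index += 1

# With enumerate: each step yields an (index, item) pair.
with_enumerate = []
for index, fruit in enumerate(fruits):
    with_enumerate.append((index, fruit))

print(manual == with_enumerate)  # True

# The optional start argument changes the first index.
numbered = list(enumerate(fruits, start=1))
```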

Lambda expressions

Simply put, lambda expressions are used to create anonymous functions. Anonymous functions are functions defined without a name. They are usually very short (typically a single expression on a single line).
The code below implements a simple lambda function that adds two numbers. While this implementation is not too exciting, lambdas are used frequently elsewhere, especially with the map and filter functions.

[Figure: python lambda and its output]
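A minimal sketch of a lambda that adds two numbers, plus one use with filter. The variable names are mine.

```python
# A lambda is a single expression; its value is the return value.
add = lambda a, b: a + b
result = add(2, 3)
print(result)  # 5

# Lambdas are handy as throwaway arguments, e.g. with filter.
evens = list(filter(lambda n: n % 2 == 0, range(10)))
print(evens)  # [0, 2, 4, 6, 8]
```

Binding a lambda to a name like add is only for demonstration. For a function you want to keep around, a regular def is usually the clearer choice.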

any()/all()

These are quite interesting functions.

The any function iterates over an iterable and returns True if at least one of its elements is true.
Similarly, the all function iterates over an iterable and returns True only if all of its elements are true.
Let's say we want to find out whether any of the given words is in a string. One simple way is to use “or” in the if condition. However, if there are many words to check, listing out all the alternatives becomes pretty tiresome.

map()

The map function applies a function to every item in a sequence. We pass the function we want to apply and the sequence (list, tuple, etc.). In Python 3, map returns an iterator, so wrap it in list() if you need a list.
The code below demonstrates usage of map together with lambda functions.

[Figure: python map]
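A minimal sketch of map combined with lambda. The numbers are sample data.

```python
numbers = [1, 2, 3, 4]

# Apply the lambda to every item; list() consumes the map iterator.
squares = list(map(lambda n: n * n, numbers))
print(squares)  # [1, 4, 9, 16]

# With several sequences, the function receives one item from each.
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))
print(sums)  # [11, 22, 33]
```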

 

Full source code for reference. Also check out this page https://docs.python.org/3.6/library/functions.html for more information.

Dynamic Scheduling in OpenMP

 

This post uses a simple example to demonstrate when and how dynamic scheduling in OpenMP can be useful. In dynamic scheduling, the iteration space is divided into blocks of chunk size, and the blocks are handed to threads in the order in which the threads finish their previous blocks. This is a nice way to overcome load imbalance, where some iterations (and therefore some threads) do more work than others.

The code snippet below presents a scenario of load imbalance in the iteration and is executed using both static and dynamic scheduling.

Here we have N = 16 iterations, 4 threads, and a chunk size of 2 (i.e. each block contains 2 iterations). Inside the for loop there is a condition that simulates load imbalance: if i is 0 or 9, the program sleeps for 2 seconds (to simulate a heavy workload).

With static scheduling, Thread 0 gets iterations 0 and 1, Thread 1 gets 2 and 3, and so on until Thread 3 gets 6 and 7. Since static scheduling works in a round-robin fashion, Thread 0 is then assigned iterations 8 and 9, and so on.

We can clearly see that Thread 0 will sleep (in a real case it would be doing heavy work) for a total of 4 seconds (2 seconds for iteration 0 and 2 seconds for iteration 9), whereas the other threads finish in no time at all and wait for Thread 0 to finish.

With dynamic scheduling, however, the iteration blocks are assigned at run time, in the order in which the threads finish their previous blocks. In this run, Thread 0 gets iterations 0 and 1, Thread 1 gets 2 and 3, Thread 2 gets 6 and 7 (it's dynamic!) and Thread 3 gets 4 and 5. While Thread 0 is “sleeping” for 2 seconds, the other threads have already finished their blocks and are free, so OpenMP assigns them another block. Since Thread 1 finished first, it gets iterations 8 and 9 (where it “sleeps” for 2 seconds). Since Thread 0 and Thread 1 are basically “sleeping” in parallel, the overall time is just a little over 2 seconds in this case. The figure below shows the output.

[Figure: Dynamic Scheduling in OpenMP]
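To make the comparison concrete, here is a small Python simulation of the two schedules. It is not OpenMP code: it only mimics how blocks are handed out. The 10 ms cost for the light iterations is my assumption (in the scenario above those iterations do essentially nothing), added so that threads become free in a realistic order. Real runs are nondeterministic, so the exact thread-to-block assignment can differ.

```python
import heapq

N, THREADS, CHUNK = 16, 4, 2
HEAVY_MS, LIGHT_MS = 2000, 10  # LIGHT_MS is an assumed cost, not from the post

def cost_ms(i):
    """Simulated work per iteration: iterations 0 and 9 are the heavy ones."""
    return HEAVY_MS if i in (0, 9) else LIGHT_MS

blocks = [list(range(s, min(s + CHUNK, N))) for s in range(0, N, CHUNK)]

# Static: block b always goes to thread b % THREADS (round robin).
static_busy = [0] * THREADS
for b, block in enumerate(blocks):
    static_busy[b % THREADS] += sum(cost_ms(i) for i in block)
static_makespan = max(static_busy)

def simulate_dynamic(blocks, n_threads):
    """Hand each block to whichever thread becomes free first."""
    free_at = [(0, t) for t in range(n_threads)]  # (time thread is free, thread)
    heapq.heapify(free_at)
    assignment = {t: [] for t in range(n_threads)}
    makespan = 0
    for block in blocks:
        t_free, thread = heapq.heappop(free_at)
        finish = t_free + sum(cost_ms(i) for i in block)
        assignment[thread].extend(block)
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, thread))
    return assignment, makespan

dynamic_assignment, dynamic_makespan = simulate_dynamic(blocks, THREADS)

print("static :", static_makespan, "ms")   # 4020 ms
print("dynamic:", dynamic_makespan, "ms")  # 2030 ms
```

Under these assumptions the static schedule takes about 4 seconds because both heavy iterations land on Thread 0, while the dynamic schedule takes a little over 2 seconds because the second heavy block goes to a thread that is already free.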

Static Scheduling in OpenMP

There are a number of ways we can tell OpenMP to schedule the iterations. If nothing is specified, the default is implementation-defined, but in practice it is usually static scheduling. With static scheduling and a chunk size, OpenMP divides the iterations into blocks of chunk size and assigns the blocks to the threads in round-robin fashion.

Syntax of this directive is

#pragma omp parallel for schedule(static [, chunkSize])

chunkSize is optional.

In the code below, we have N = 30 iterations, 7 threads and static scheduling with chunkSize = 2

The output is shown below. Each thread is assigned a block of iterations, where the block size is 2 (i.e. chunkSize). The first thread, whose rank is 0, is assigned iterations 0 and 1. Then the 2nd thread, whose rank is 1, is assigned 2 and 3. Once the 7th thread, whose rank is 6, has been assigned iterations 12 and 13, OpenMP starts over with the remaining iterations. The first thread (rank 0) is assigned iterations 14 and 15, the second thread is assigned 16 and 17, and so on until every iteration is assigned to some thread.
[Figure: Static Scheduling in OpenMP]
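The round-robin assignment described above can be reproduced with a few lines of Python. This is only a sketch of the rule (block number modulo thread count), not output from OpenMP itself, and the function name is mine.

```python
def static_chunk_assignment(n_iterations, n_threads, chunk):
    """Return {rank: [iterations]} for schedule(static, chunk)."""
    assignment = {rank: [] for rank in range(n_threads)}
    for i in range(n_iterations):
        block = i // chunk                        # block that iteration i belongs to
        assignment[block % n_threads].append(i)   # blocks go round robin
    return assignment

plan = static_chunk_assignment(30, 7, 2)
for rank, iterations in plan.items():
    print(f"Rank {rank}: {iterations}")
```

With 30 iterations there are 15 blocks, so Rank 0 ends up with three blocks (0-1, 14-15, 28-29) while every other rank gets two.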

 

Now if we remove the chunkSize parameter from the #pragma directive,

#pragma omp parallel for num_threads(7) schedule(static)

Then we get the following output. Notice the iteration numbers that have been assigned to each thread.

[Figure: Static Scheduling in OpenMP]
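Without a chunk size, the OpenMP specification only says that the iterations are divided into chunks of approximately equal size, with at most one chunk per thread. The exact sizes are up to the implementation. The sketch below assumes one common choice, where the first few threads get one extra iteration. Your compiler may split the 30 iterations differently, so treat this as an illustration of the idea rather than a prediction.

```python
def static_no_chunk(n_iterations, n_threads):
    """One contiguous block per thread; an assumed, near-equal split."""
    base, extra = divmod(n_iterations, n_threads)
    assignment = {}
    start = 0
    for rank in range(n_threads):
        # Assumption: the first `extra` ranks take one more iteration each.
        size = base + (1 if rank < extra else 0)
        assignment[rank] = list(range(start, start + size))
        start += size
    return assignment

contiguous = static_no_chunk(30, 7)
for rank, iterations in contiguous.items():
    print(f"Rank {rank}: {iterations}")
```

The key difference from the chunked version is that each thread gets a single contiguous range instead of several small blocks spread across the iteration space.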

Getting started with OpenMPI and Netbeans 8.1

I wanted to learn MPI and tried doing it in Windows but I had some issues with MSMPI and couldn’t run it. So I decided to try it in Linux with OpenMPI and finally I managed to set up everything.

I am using Lubuntu in my VirtualBox.

So, first of all install OpenMPI using the command

sudo apt-get install libopenmpi-dev

Now we need to configure Netbeans to use the OpenMPI compiler.

  • Go to Tools -> Options
  • On C/C++ tab
  • Click Add button as marked in the figure. You might only see one entry (GNU) in the Tool Collection.

[Figure 1]

  •  In the dialog box, enter /usr/bin in the Base Directory, leave Tool Collection Family as GNU and type OpenMPI  in Tool Collection Name. You can put any name there.

[Figure 2]

  •  Select the newly added Tool Collection and enter /usr/bin/mpicc for C Compiler and /usr/bin/mpic++ for C++ Compiler.

[Figure 3]

  • Switch to Code Assistance tab and make sure that Include Directories are populated. Otherwise you can add them.

[Figure 4]

  •  Now when creating a new C/C++ project, you can choose the newly created Tool Collection so that it uses mpi compiler.

[Figure 5]

  • If you have already created the project, then you can go to the Project Properties -> Build and in Tool Collection, choose OpenMPI.

[Figure 6]

 

Now that we have set up Netbeans, let's run a simple MPI program.

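#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int numprocs, myrank;

    /* Start the MPI runtime; this must come before any other MPI call. */
    MPI_Init(&argc, &argv);

    /* Total number of processes, and the rank of this process. */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Only rank 0 reports the process count. */
    if (myrank == 0) printf("Number of processes = %d \n", numprocs);

    printf("My rank is %d \n", myrank);

    MPI_Finalize();
    return 0;
}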

Right click the project and click Build.

Now in your terminal, go to <yourprojectdir>/dist/Debug/OpenMPI-Linux and run the following command

mpiexec -np 5 cppapplication_1

I got the following output

[Figure 7]

Happy coding!

Nepali Text Processing Library (Reference)

There are a few classes and functions that you will use for processing Nepali text.

For stemming, there is the SubediStemmer class (fully qualified name: np.com.sanjayasubedi.core.SubediStemmer).

Use the String stem(String word) function of the class to stem a word. This function accepts a String as input and returns a String. Note that the input should be a single word and nothing else!

Example

It produces the following output

Another class that you might use is NepaliString. It has three functions.

public static int getLength(String content)

This function counts the number of characters in the string. Unlike the regular String length, which counts each and every character in the string, this function counts only the letters of the Nepali alphabet. It ignores characters such as modifiers and symbols.

Example

It will output

 

public static String[] tokenize(String content)

This function will tokenize the text and return an array of String. It treats any non-Nepali character as a splitting point.

Example

Output will be

public static String readFile(String filePath)

This function will read a file in UTF-8 format as text and return the contents of the file as String.

 

Finally, there is the FilterStopwords class.

You can either use the built-in stop word list or use your own list. It all depends on how you initialize the object of this class.

The constructor parameter is the path to the file in which you have defined your own stop words. The file should contain one word per line.

It has one function, Boolean isStopWord(String word), which checks whether the word is a stop word.

Analysis On Profiles of Nepali Elance Users

Elance is a great place for freelancers as well as clients from all around the world. Clients specify the tasks they need done and freelancers bid on the projects. If taken seriously, freelancing can be a great source of income and experience.
As in the physical world, you won’t get anywhere without a solid profile. Your profile is how potential clients see you. So it is absolutely necessary that your profile states your experience, interests, skills and such.
The objective of this study is to find out what services Nepali freelancers provide. I intend to find the most popular or common services and also the least common services provided by Nepali freelancers. For this study, the “tag”, “overview” and “services” attributes of 174 Nepali freelancers have been collected. Take a look at my Elance profile if you have no idea what those mean. “tag” is, well, a tag line. Then comes “overview”, where freelancers post about their experience, skills and other convincing stuff. Finally there is “services”, where freelancers write about the services they provide.
So basically, these attributes contain the “work area” in which a freelancer is willing to work. If a guy is a software developer, then one can expect to see words like “software”, “architecture”, “Java”, “C#”, “development” etc. in his “tag”, “overview” or “services”.
For data processing, I created tokens or words from the above attributes by:
1. Tokenizing based on non-letter characters (i.e. breaking the sentences into words at non-letter characters)
2. Stopword filtering (i.e. remove the words that are in a pre-defined list of stopwords)
3. Lowercase transform (transform to lower or uppercase so that “Age” and “age” are counted as same word rather than different ones)
I use “token” and “word” interchangeably.
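The three steps above can be sketched in a few lines of Python. This is an illustration of the idea, not the exact tool I used: the stop-word set is a tiny hypothetical list and the function name is made up.

```python
import re

# Hypothetical stop-word list; the real list is much longer.
STOPWORDS = {"am", "as", "of", "a", "the", "and"}

def preprocess(text, stopwords=STOPWORDS):
    # 1. Tokenize: every run of letters is a token, anything else splits.
    tokens = re.findall(r"[A-Za-z]+", text)
    # 2. Stopword filtering.
    tokens = [t for t in tokens if t not in stopwords]
    # 3. Lowercase transform.
    return [t.lower() for t in tokens]

tokens = preprocess("I am working as software developer.")
print(tokens)  # ['i', 'working', 'software', 'developer']
```

Note that splitting at non-letters turns “I'm” into the two tokens “i” and “m”. Also, with the steps in this order a capitalized stop word such as “The” would slip past a lowercase list, so a real pipeline may want to compare case-insensitively.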
 
Now that we have a list of words along with their frequencies, let's see which are the 20 most frequent words.

Word  Total Occurrence  Document Occurrence
i 515 121
web 323 109
development 258 85
wordpress 229 78
design 227 84
php 154 82
read 137 134
years 128 93
html 124 70
css 123 60
work 115 70
experience 110 80
psd 90 36
developer 79 58
working 77 54
software 75 31
data 69 27
services 69 41
mysql 67 46
expert 66 50
Total occurrence is how many times a word has been used across all the documents, i.e. user profiles.
Document occurrence is the number of documents, i.e. user profiles, in which the word has been used. So document occurrence basically tells us how many users use the word in their profile. Remember that we have 174 user profiles, which means we have 174 documents.
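To make the difference between the two counts concrete, here is a small sketch. The three token lists stand in for three user profiles and are made up.

```python
from collections import Counter

def occurrence_counts(documents):
    """Return (total occurrence, document occurrence) counters."""
    total = Counter()
    per_document = Counter()
    for tokens in documents:
        total.update(tokens)              # every use of the word counts
        per_document.update(set(tokens))  # at most once per document
    return total, per_document

profiles = [
    ["web", "design", "web"],
    ["web", "php"],
    ["design"],
]
total, per_document = occurrence_counts(profiles)
print(total["web"], per_document["web"])  # 3 2
```

Here “web” is used three times in total but only in two profiles, so its total occurrence is 3 and its document occurrence is 2.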

The table shows that “I” is the most frequent word. Of course, everyone is praising themselves in their profiles, so one shouldn’t be surprised! Then come the “words of the web”: web, development, wordpress, design, php. Look at their document occurrence too: each of them appears in roughly 80 or more user profiles (78 to 109). So out of 174 users, around 80 or more have listed web, wordpress, php or design in their profiles.

The next interesting words are years, experience, and work. Many of us state “I have 5 years of experience”, “I work hard” etc., so no wonder these are frequently used.

Now let's see some infrequent words. For this study, I consider a word infrequent if it has a document occurrence of only 1. There are too many to list, so here is a random selection of such words.

Word  Total Occurrence  Document Occurrence
wpf 1 1
wpm 1 1
actionscript 1 1
armchair 1 1
azure 1 1
wxpython 1 1
xcart 1 1
xcode 1 1
blackened 1 1
yam 1 1
criminals 1 1
youth 1 1
yui 1 1
dreamwaver 1 1
zoho 1 1
sugarcrm 1 1
dsl 1 1

Now that we have seen frequent and infrequent individual words, let's see which word pairs (bi-grams) are most common. If you are wondering how i_working can be frequent, remember that some words (i.e. stopwords) such as “of” and “am” have been removed. So a sentence like “I am working as software developer.” becomes “I working software developer”.

Word  Total Occurrence  Document Occurrence
web_design 61 41
html_css 57 41
web_development 49 31
php_mysql 46 32
design_development 38 30
data_entry 33 15
years_experience 31 30
application_development 27 17
e_commerce 26 22
web_developer 26 22
i_working 24 19
wordpress_theme 24 13
i_m 23 18
psd_html 22 17
search_engine 21 18
web_applications 21 19
software_development 20 13
Want even more details? How about the frequency of three-word sequences (tri-grams)?
Word  Total Occurrence  Document Occurrence
web_design_development 18 16
search_engine_optimization 13 11
wordpress_theme_development 12 10
design_web_development 8 6
e_commerce_solutions 8 7
responsive_retina_ready 8 7
i_look_forward 7 7
look_forward_working 7 7
team_web_professionals 7 7
web_professionals_builds 7 7
wordpress_plugin_development 7 6
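Keys like the bi-grams and tri-grams in the tables above can be generated from a token list with a sliding window. This is a minimal sketch and the function name is mine.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "working", "software", "developer"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)   # ['i_working', 'working_software', 'software_developer']
print(trigrams)  # ['i_working_software', 'working_software_developer']
```

Because the stop words are removed before the n-grams are built, words that were not adjacent in the original sentence can end up in the same n-gram.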
We can conclude that there is tough competition in “web design <and> development”: a total of 16 users specifically target this kind of work. Similarly, “search engine optimization”, “wordpress theme development” and others are also competitive.
So from these observations, you can either build your skills and submit proposals in popular but competitive areas, or build your skills in “unexplored” areas like wxpython, wpf, actionscript, zoho etc.
Let me know about your experiences in freelancing and about this article. Hope you enjoyed it!

Text Processing Library For Nepali Language

There are lots of NLP libraries out there, but basic tasks like stemming, tokenization and stopword filtering are not available for the Nepali language, and missing these basic but important tasks makes the whole Nepali NLP effort a bit unfruitful. So I have developed a library that can be used for tokenization, stopword removal and stemming.

  • Stemming algorithm is based on the algorithm published in my report “Text Stemming in Nepali” (Download pdf report: Text Stemming in Nepali). Currently, only suffixes are removed. Prefix removal has not yet been done.
  • Tokenization is done considering all non-Nepali characters as splitting points.
  • Stopword removal is done based on stop words that I have collected. See details here.
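To give a feel for the tokenization and stopword removal rules, here is a rough Python approximation. It is not the library's code. I am assuming that “Nepali characters” means the Unicode Devanagari block (U+0900 to U+097F) without the danda and double danda punctuation marks, and the three stop words are just a small sample.

```python
import re

# Assumption: Devanagari block minus danda (U+0964) and double danda (U+0965).
NEPALI_RUN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")

# A small sample of stop words, for illustration only.
SAMPLE_STOPWORDS = {"यो", "पनि", "हो"}

def tokenize(content):
    """Every run of Nepali characters is a token; anything else splits."""
    return NEPALI_RUN.findall(content)

def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    return [t for t in tokens if t not in stopwords]

tokens = tokenize("यो पनि राम्रो हो। OK 123")
filtered = remove_stopwords(tokens)
print(tokens)
print(filtered)
```

The Latin letters, digits, spaces and the danda all act as splitting points, so only the four Nepali words survive tokenization, and only one of them survives the stop-word filter.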

The library is available for Java and as an extension for the Rapidminer platform. You can Download Java library if you intend to develop a text processing application in Java. After downloading the library, you might want to check this post about the classes and functions in the library.

Or, Download Rapidminer extension for processing Nepali text in Rapidminer. Please note that the extension depends on the “Text Processing Extension”. If you have not installed it already, install it first. To install this extension, you need to copy the jar file to the “plugins” directory of Rapidminer. It is usually located in “C:\Program Files (x86)\Rapid-I\RapidMiner5\lib\plugins”. The path may vary depending on the OS you use and the folder where you have installed Rapidminer.

Nepali Text Classification Using Rapidminer

Text classification is the process of assigning a text to a category. This process can be automated and the accuracy of the classification is acceptable. For this project, I will be using Rapidminer 5 and the “Nepali Text Processing” extension that I have built for it. Refer to this post to download the extension. A Java library is also available if you want to use the Nepali text processing functions. The basic text processing functions that we will use are tokenization, stopword removal and stemming.

To start with, I have collected over 600 Nepali news articles from the web and have organized them according to their category in different folders. ( I don’t have the rights to distribute those articles. As soon as the authors give their permission, I will post them here. )

[Figure: Folder structure]

The Rapidminer process looks like

[Figure: Rapidminer process]

The “Process Document” operator reads the files from the folders, tokenizes, filters stopwords and finally performs stemming. The flow of “Process Document” is shown below.

[Figure: Process documents]

The above process creates a word vector using TF-IDF method. Then the data is split into training and testing data (70% and 30% of the total data respectively).
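For readers who have not met TF-IDF before, here is a bare-bones sketch of the idea together with a 70/30 split. Rapidminer's exact weighting and normalization may differ. I am using the textbook form (term frequency times log of N over document frequency), and the sample documents and function names are made up.

```python
import math
import random
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and cut it at the given fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

docs = [["news", "goal"], ["news", "budget"], ["news", "goal", "club"]]
vectors = tf_idf(docs)
train, test = train_test_split(range(10))
print(vectors[0])
print(len(train), len(test))  # 7 3
```

A word that appears in every document (“news” here) gets a weight of zero, which is exactly why TF-IDF is useful: words that do not help tell the categories apart are pushed down.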

The Naive Bayes algorithm was used as the learning algorithm because of its simplicity. K-Nearest Neighbours and SVM are also quite popular for text classification. Since there are lots of attributes, choosing a simpler and faster algorithm can be beneficial. In this simple project alone, there are over 7600 attributes.

The result for training data is shown below

[Figure: performance_train]

The result for test data is shown below.

[Figure: performance_test]

 

As we can see, the accuracy for the training data is around 82% and the accuracy for the test data is 80%. This is an acceptable accuracy for an automated system.

For the interesting part, let's see some of the words that are strongly related to a certain category. This data was derived from the distribution table of Naive Bayes.

Politics Technology Sports Finance Tourism
मत सामसुङ गोल बजेट पर्यटक
संविध टि्वटर रन वृद्धिदर होटल
सहम एनसेल केजी अर्गान चलचित्र
प्रधानमन्त्र आइफो क्रिकेट मोदी उड
ओली टेलिकम लिभरपुल महत बोर्ड
बिप एप्पल क्लब आयात लज
दल प्रयोगकर्ता सेकेन्ड वृद्धि पर्यट्
कोइराला ट्वीटर गोल निगम जहाज
समिति गुगल मड्रिड अर्ब हेलिकप्टर
पार्ट चन्द्रमा एपीएफ अर्थ विमानस्थल
संविध पृथ्वी इब्राहिमोभिच करोड फेवाताल
बैठक मोबाइल बार्सिलोना महासंघ निगम
संच् स्मार्ट खेल बीपी छानवि
संसद् घडी रियल कफ जिब्रो
काँग्रेस फोन चेल्सी प्रतिश् त्रिवि
प्रचण्ड पूरानो खेल उजुर अस्ट्रेलिया
प्रतिवेद् ओपेर् मिटर निर्यात एयरवेज
पार्टी एप्स मिनेट वस्तु चय्

Please note that some of the words are not valid words, because they have been “stemmed” incorrectly by the stemming algorithm, but this is not an issue for the text classification problem. As long as the stemming algorithm consistently produces the same output for the same input based on the rules given to it, text classification is not affected. However, there might be other uses of a stemming algorithm where error-free output may be desired.

 

Download Rapidminer Process. Unrar the archive and copy the contents of XML file. Then create a new process in Rapidminer, go to “XML” tab and paste the contents of the file you downloaded. Change the “Text directories” property of “Process Document” operator according to your project setup.

Nepali Stop Words

Stop words are words that are removed before Natural Language Processing (NLP) is done. I couldn’t find any available stop word collection for the Nepali language, so I decided to build such a list. To build it, I collected around 600 news articles and generated a word frequency table. Based on that table, I collected words that are frequent but pretty much useless in NLP tasks.

Below is the list of stop words in Nepali. You can also download the file in .txt format. Download Nepali Stop Words



पनि
छन्
लागि
भएको
गरेको
भने
गर्न
गर्ने
हो
तथा
यो
रहेको
उनले
थियो
हुने
गरेका
थिए
गर्दै
तर
नै
को
मा
हुन्
भन्ने
हुन
गरी

हुन्छ
अब
के
रहेका
गरेर
छैन
दिए
भए
यस
ले
गर्नु
औं
सो
त्यो
कि
जुन
यी
का
गरि
ती

छु
छौं
लाई
नि