This example dataset is a simplified record of item sales made by a chain of retail shops.
Each record represents an individual item purchased from a shop belonging to this imaginary company.

 

Schema

 

Name	Type	Example Value	Description
UTCTimestamp	float	1451914602.0	Unix timestamp representing the time the transaction took place. Timezone is UTC.
TransactionID	str	T0000	Identifier given to an individual transaction, or shopping basket. Several records can have the same TransactionID if the items were all bought together.
ShopID	str	Glasgow	Identifier given to the shop branch at which the item was sold.
ProductID	str	A63	Identifier given to the type of product that was sold.
SaleAmount	float	23.0	Amount the item was sold for.

 

Example Records

 

1451914602.0	T0000	Glasgow	G12	26.0
1451914602.0	T0000	Glasgow	A63	35.0
1451914602.0	T0000	Glasgow	D95	23.0
1451914602.0	T0000	Glasgow	A58	81.0
1451914602.0	T0000	Glasgow	B01	30.0
1451914602.0	T0000	Glasgow	E64	4.0
1451914602.0	T0000	Glasgow	E67	42.0
1451914602.0	T0000	Glasgow	B57	16.0
1451914602.0	T0000	Glasgow	G17	13.0
1451914632.0	T0001	Glasgow	E64	81.0
1451914632.0	T0001	Glasgow	G56	68.0
1451914632.0	T0001	Glasgow	D11	21.0
1451914632.0	T0001	Glasgow	B36	35.0
1451914632.0	T0001	Glasgow	C87	16.0
1451914632.0	T0001	Glasgow	A85	16.0
1451914646.0	T0002	Cardiff	B75	75.0
1451914646.0	T0002	Cardiff	B33	43.0
1451914646.0	T0002	Cardiff	G72	91.0
1451914646.0	T0002	Cardiff	A90	6.0
1451914646.0	T0002	Cardiff	B53	77.0
1451914646.0	T0002	Cardiff	D02	14.0
1451914646.0	T0002	Cardiff	G97	19.0
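
The records above are tab-separated, so a single line could be turned back into typed fields roughly as follows. This is a minimal sketch: parse_record is a hypothetical helper that is not part of the original material, and it assumes every line has exactly the five fields listed in the schema. It is written to run on both Python 2.7 and Python 3.

```python
def parse_record(line):
    """Split one tab-separated record into typed fields (hypothetical helper)."""
    # Remove only the trailing newline, then split on the tab separators.
    timestamp, transaction_id, shop_id, product_id, amount = \
        line.rstrip("\n").split("\t")
    return {
        "UTCTimestamp": float(timestamp),
        "TransactionID": transaction_id,
        "ShopID": shop_id,
        "ProductID": product_id,
        "SaleAmount": float(amount),
    }


record = parse_record("1451914602.0\tT0000\tGlasgow\tG12\t26.0\n")
print(record["ShopID"])      # Glasgow
print(record["SaleAmount"])  # 26.0
```

A line with a different number of fields would raise a ValueError at the unpacking step, which is usually what you want for malformed input.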

 

Generating Sample Data Sets

 

import time
import math
import random

CAT_LETTERS = "ABCDEFG"


def get_transaction_id(transaction_number):
    """
    :param transaction_number: int: Global counter representing number of
                                    transactions/shopping baskets generated so
                                    far.
    :return: str: A new TransactionID (the next in the sequence).

    TransactionID is "T" followed by the transaction number, zero-padded to
    at least 4 digits, e.g. "T0128".

    Note: transaction_number is updated by the record creating loop, not this
    function.
    """
    new_transaction_id = "T%04d" % transaction_number
    return new_transaction_id


def get_shop_id():
    """
    :return: str: A valid ShopID.

    In this case, a ShopID is the name of the town or city a shop is located in.

    Methodology: Select a random shop from a predefined list.
    """
    shop_ids = ["Cardiff", "Exeter", "Glasgow"]
    return random.choice(shop_ids)


def get_product_id():
    """
    :return: str: A valid ProductID.

    A ProductID is a string consisting of a letter between A and G inclusive
    and two digits.
    e.g. "D45"

    Methodology: Randomly select a letter from "ABCDEFG", then randomly select a
    number between 0 and 99 inclusive and zero pad it.
    """
    selected_category = random.choice(CAT_LETTERS)
    selected_number = random.randint(0, 99)
    return "%s%02d" % (selected_category, selected_number)


def get_sale_amount():
    """
    :return: float: A valid SaleAmount.

    Methodology: Select a random whole number (in floating point form) between
    1.0 and 100.0 inclusive.
    """
    return float(random.randint(1, 100))


# Number of data points to generate.
N = 10000
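# File to store the data points in. Fields are tab-separated despite the
# .csv extension. Opened in text mode because we write text, not bytes.
data_file = open("sample_data.csv", "w")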
# Number of transactions/shopping baskets generated so far.
num_baskets_generated = 0
# Timestamp to use as starting point, kept as a float to match the schema.
current_time = float(math.floor(time.time()))


# We'll decrement N for every record we create and stop when N hits 0.
while N > 0:
    # Generate timestamp, transaction ID and shop ID.
    timestamp_utc = current_time
    current_time += random.randint(0, 100)
    transaction_id = get_transaction_id(num_baskets_generated)
    num_baskets_generated += 1
    shop_id = get_shop_id()

    # Decide how many products to create in this shopping basket: between 1
    # and 10, but never more than the number of records left to create.
    num_items_to_create = random.randint(1, min(10, N))

    # Generate that many product entries.
    for _ in range(num_items_to_create):
        product_id = get_product_id()
        sale_amount = get_sale_amount()
        # Write to the sample data file.
        entries = map(str, [timestamp_utc,
                            transaction_id,
                            shop_id,
                            product_id,
                            sale_amount])
        data_file.write("\t".join(entries)+"\n")

    # Take away the number of entries just generated from the overall counter.
    N -= num_items_to_create

# Close the file now that we've finished writing records.
data_file.close()

Imagine that you have an online shop, and from your sales data you want to find all the countries you have ever sold and shipped a product to. You don’t care how many sales you made to each country, just that you have sold something there at least once.

This is a good example of where using sets would be easier and faster than a solution that uses lists.
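
To make the comparison concrete, here is a minimal sketch of both approaches. The sales list is made-up illustrative data, not output from a real shop, and the set operations it uses are described in the sections that follow. The code runs on both Python 2.7 and Python 3.

```python
# Hypothetical sales data: (order number, destination country) pairs.
sales = [
    (1, "France"),
    (2, "Japan"),
    (3, "France"),
    (4, "Brazil"),
    (5, "Japan"),
]

# List approach: we must check for duplicates ourselves, and each
# "not in" check scans the whole list.
countries_list = []
for _, country in sales:
    if country not in countries_list:
        countries_list.append(country)

# Set approach: duplicates are discarded automatically on add().
countries_set = set()
for _, country in sales:
    countries_set.add(country)

print(sorted(countries_list))  # ['Brazil', 'France', 'Japan']
print(sorted(countries_set))   # ['Brazil', 'France', 'Japan']
```

Both versions give the same answer here; the set version needs less code and avoids the repeated scans, which matters more as the number of sales grows.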

 

Sets

 

Sets can be described as follows:

  • Each element in a set is unique.
  • The elements are unordered within the set.
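
Both properties can be seen in a short sketch (the variable names are illustrative only):

```python
numbers = [3, 1, 3, 2, 1]
unique_numbers = set(numbers)

# Uniqueness: the duplicates in the list collapse to a single element each.
print(len(numbers))         # 5
print(len(unique_numbers))  # 3

# Unordered: two sets are equal if they hold the same elements,
# regardless of the order in which the elements were written.
print({1, 2, 3} == {3, 2, 1})  # True

# Because there is no order, sets do not support indexing:
# unique_numbers[0] would raise a TypeError.
```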

 

Initialising a set

 

An empty set can be initialised by:

my_set = set()

A pre-populated set can be initialised by:

my_set = set(["one", 5, "hello"])

Or (preferred method):

my_set = {"one", 5, "hello"}

 

Note from the examples above that a set, like a list, can hold elements of different data types (i.e. they do not all have to be the same type).
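
One restriction worth knowing is that set elements must be hashable. Strings, numbers and tuples qualify, but mutable containers such as lists do not. A small sketch of what happens, using illustrative values:

```python
# Strings, ints and tuples are hashable, so they can all live in one set.
mixed = {"one", 5, (1, 2)}

try:
    mixed.add([1, 2])  # lists are mutable and therefore unhashable
    added = True
except TypeError:
    added = False

print(added)       # False
print(len(mixed))  # 3
```

If you need to store a list-like value in a set, converting it to a tuple first is a common workaround.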

 

add

 

You can add an element to a set by using add():

my_set = set()
my_set.add("hello")
print my_set
set(['hello'])

 

remove

 

You can remove an element from a set by using remove():

my_set = {"bop", "bit", 5}
my_set.remove("bop")
print my_set
set(['bit', 5])
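
Note that remove() raises a KeyError if the element is not in the set. When you are not sure whether the element is present, discard() does the same job without raising an error. A short sketch:

```python
my_set = {"bop", "bit", 5}

# remove() complains if the element is missing...
try:
    my_set.remove("missing")
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# ...whereas discard() silently does nothing in that case.
my_set.discard("missing")
my_set.discard("bop")

print("bop" in my_set)  # False
print(len(my_set))      # 2
```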

 

union

 

If you have two sets and want to create a new set with all the elements from both sets, you can use union():

set_one = {"hello", 12, 7}
set_two = {"apple", "hello", 7, 18}

set_union = set_one.union(set_two)
print set_union
set([18, 'apple', 7, 12, 'hello'])

 

intersection

 

If you have two sets and want to find the elements that are in both sets, you can use intersection():

set_one = {"hello", 12, 7}
set_two = {"apple", "hello", 7, 18}

set_intersection = set_one.intersection(set_two)
print set_intersection
set(['hello', 7])

 

difference

 

If you have two sets and want a new set containing the elements of the first set that do not appear in the second, you can use difference():

set_one = {"hello", 12, 7}
set_two = {"apple", "hello", 7, 18}

set_difference = set_one.difference(set_two)
print set_difference
set([12])
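
Two details are easy to miss: the result depends on which set you call difference() on, and the original set is left unchanged. The sketch below illustrates both, and also shows difference_update(), the in-place variant:

```python
set_one = {"hello", 12, 7}
set_two = {"apple", "hello", 7, 18}

# The order of the operands matters.
print(set_one.difference(set_two) == {12})           # True
print(set_two.difference(set_one) == {"apple", 18})  # True

# difference() returns a new set; set_one still has all three elements.
print(len(set_one))  # 3

# difference_update() modifies the set it is called on instead.
set_one.difference_update(set_two)
print(set_one == {12})  # True
```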

 

symmetric difference

 

If you have two sets and want to find the elements that appear in exactly one of them (in either set, but not in both), you can use symmetric_difference():

set_one = {"hello", 12, 7}
set_two = {"apple", "hello", 7, 18}

set_symmetric_difference = set_one.symmetric_difference(set_two)
print set_symmetric_difference
set([18, 12, 'apple'])
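
The four operations above also have operator forms, which some people find more readable. The following sketch checks that each operator gives the same result as the corresponding method:

```python
set_one = {"hello", 12, 7}
set_two = {"apple", "hello", 7, 18}

print((set_one | set_two) == set_one.union(set_two))                 # True
print((set_one & set_two) == set_one.intersection(set_two))          # True
print((set_one - set_two) == set_one.difference(set_two))            # True
print((set_one ^ set_two) == set_one.symmetric_difference(set_two))  # True

# The methods accept any iterable, not just sets; the operators
# require both operands to be sets.
print(set_one.intersection(["hello", "other"]) == {"hello"})  # True
```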

Some programming languages leave you to write your own functions for basic string processing tasks, but Python 2.7.x has several built in. They have already been optimised for performance, so they are usually a better option than writing your own implementations.

 

Built-In String Functions

 

split

 

split() takes a string and breaks it up into a list of strings, using a string you provide as the point at which it splits the original string.

 

example_string = "cat-dog-parrot-mouse"
split_list = example_string.split("-")
print split_list
['cat', 'dog', 'parrot', 'mouse']
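
Two behaviours of split() are worth seeing side by side. Called with no argument, it splits on any run of whitespace and ignores leading and trailing whitespace. Called with an explicit separator, it keeps the empty strings between adjacent separators. The input strings below are illustrative only:

```python
messy = "  cat   dog\tparrot\nmouse  "

# No argument: split on runs of whitespace, dropping empty strings.
print(messy.split())      # ['cat', 'dog', 'parrot', 'mouse']

# Explicit separator: adjacent separators produce an empty string.
print("a--b".split("-"))  # ['a', '', 'b']
```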

 

 

Faster split

 

split() will process the entire string if you don’t tell it otherwise, and you might only want to split on the first few instances of a given string. If this is the case then you can avoid the performance overhead of splitting on every single instance of the string and instead tell it to split only n times, then stop.

 

example_string = "cat-dog-parrot-mouse"
split_list = example_string.split("-", 1)
print split_list
['cat', 'dog-parrot-mouse']

 

In the above example, it splits on the first instance of "-" then stops, putting the rest of the unprocessed string in split_list[-1].

 

example_string = "cat-dog-parrot-mouse"
split_list = example_string.split("-", 2)
print split_list
['cat', 'dog', 'parrot-mouse']
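
As a practical illustration, suppose you only need the first or last field of a tab-separated record. Limiting the number of splits avoids breaking up the fields you do not care about. This sketch uses a sample record in the same shape as the sales dataset; rsplit() works like split() but starts from the right-hand end of the string:

```python
record = "1451914602.0\tT0000\tGlasgow\tG12\t26.0"

# Split once from the left to isolate the first field.
timestamp, rest = record.split("\t", 1)
print(timestamp)         # 1451914602.0
print(rest.count("\t"))  # 3

# Split once from the right to isolate the last field.
everything_else, sale_amount = record.rsplit("\t", 1)
print(sale_amount)       # 26.0
```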

 

 

join

 

join() takes a list (or any other iterable) of strings and creates a new string from its elements, inserting a given separator string between each pair of adjacent elements.

 

list_of_strings = ["cat", "dog", "parrot", "mouse"]
new_string = "-".join(list_of_strings)
print new_string
cat-dog-parrot-mouse
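
join() only accepts strings, so a list that contains numbers has to be converted first. A minimal sketch, using illustrative field values:

```python
fields = [1451914602.0, "T0000", "Glasgow", "G12", 26.0]

# Joining non-string elements directly raises a TypeError.
try:
    "\t".join(fields)
    joined_directly = True
except TypeError:
    joined_directly = False
print(joined_directly)  # False

# Converting each element with str() first makes the join work.
line = "\t".join(str(field) for field in fields)
print(line == "1451914602.0\tT0000\tGlasgow\tG12\t26.0")  # True
```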

 

 

replace

 

replace() will create a new string from the string you specify, but will replace instances of a given substring with another substring.

 

example_string = "cat-dog-parrot-mouse"
new_string = example_string.replace("-", "+")
print new_string
cat+dog+parrot+mouse
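
replace() also accepts an optional third argument that limits how many occurrences are replaced, and replacing with an empty string is a simple way to delete a substring. The original string is never modified, because Python strings are immutable:

```python
example_string = "cat-dog-parrot-mouse"

# Replace only the first occurrence.
print(example_string.replace("-", "+", 1))  # cat+dog-parrot-mouse

# Replace with an empty string to remove every occurrence.
print(example_string.replace("-", ""))      # catdogparrotmouse

# The original string is unchanged.
print(example_string)                       # cat-dog-parrot-mouse
```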

 

 

strip, lstrip and rstrip

 

These three functions are most useful for removing leading and trailing whitespace characters from your data, but you can also use them to remove other characters.

Unlike replace(), however, these functions only operate on the ends of the string. lstrip() removes the characters from the left-hand side of the string, rstrip() removes them from the right-hand side, and strip() removes them from both ends.

To clarify the expected behaviour of each function, in the examples below they have been called with a non-whitespace string as the parameter ("-"). Whitespace characters can be removed by calling with no parameter.

 

 

strip

 

example_string = "-cat-dog-parrot-mouse-"
new_string = example_string.strip("-")
print new_string
cat-dog-parrot-mouse

 

 

lstrip

 

example_string = "-cat-dog-parrot-mouse-"
new_string = example_string.lstrip("-")
print new_string
cat-dog-parrot-mouse-

 

 

rstrip

 

example_string = "-cat-dog-parrot-mouse-"
new_string = example_string.rstrip("-")
print new_string
-cat-dog-parrot-mouse
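
The sketch below shows the no-argument form removing whitespace, using an illustrative input line. It also shows a detail that often surprises people: when you do pass an argument, it is treated as a set of characters to remove, not as a single substring.

```python
raw_line = "  \tGlasgow \n"

# With no argument, spaces, tabs and newlines are all removed.
print(raw_line.strip() == "Glasgow")       # True
print(raw_line.lstrip() == "Glasgow \n")   # True
print(raw_line.rstrip() == "  \tGlasgow")  # True

# The argument is a set of characters: any mix of "x" and "y" is
# removed from both ends, in whatever order the characters appear.
print("xyxcat-dogyx".strip("xy"))          # cat-dog
```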

 

 

upper

 

Converts all lower-case letters in a string to their upper-case equivalents.

 

example_string = "heLLO AnD wELcomE"
new_string = example_string.upper()
print new_string
HELLO AND WELCOME

 

 

lower

 

Converts all upper-case letters in a string to their lower-case equivalents.

 

example_string = "heLLO AnD wELcomE"
new_string = example_string.lower()
print new_string
hello and welcome
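
A common use for lower() (or upper()) is comparing strings without regard to case, for example when the same value has been typed inconsistently. The shop names below are illustrative data:

```python
raw_shop_ids = ["Glasgow", "GLASGOW", "glasgow", "Cardiff"]

# A plain comparison is case-sensitive.
print("Glasgow" == "GLASGOW")                  # False

# Normalising both sides first makes the comparison case-insensitive.
print("Glasgow".lower() == "GLASGOW".lower())  # True

# Normalising before building a set merges the differently cased spellings.
normalised = {shop_id.lower() for shop_id in raw_shop_ids}
print(sorted(normalised))                      # ['cardiff', 'glasgow']
```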

 

 

More Useful String Functions

 

Length of a string

 

Strings are sequences of characters, much like lists, so you can call len() to get the length of a string.

 

sample_string = "Count the characters in this string!"
length = len(sample_string)
print length
36