By the end of this chapter, you should be able to:
A Python string is an iterable series of characters. You can loop through a string just like a list:
for x in "word":
    print(x)
# w
# o
# r
# d
String literals can be defined with either single or double quotes (your choice), and can be defined on multiple lines with a backslash like so:
x = "my favorite " \
    "string"
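Two other common ways to spread a string over several lines are sketched below; the variable names are only illustrative. Adjacent literals inside parentheses are joined automatically, and triple quotes keep the line breaks as part of the string.

```python
# Adjacent string literals inside parentheses are concatenated,
# so no backslash is needed.
joined = (
    "my favorite "
    "string"
)
print(joined)  # my favorite string

# Triple-quoted strings keep the newline characters they contain.
two_lines = """my favorite
string"""
print(two_lines.split("\n"))  # ['my favorite', 'string']
```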
Most importantly, strings in Python are immutable. This means you cannot change strings like so:
my_str = "can't touch this"
my_str[6] = " " # TypeError
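Since a string can't be changed in place, the usual workaround is to build a new string from pieces of the old one. Here is a minimal sketch using slicing; the names error_message and changed are just for illustration.

```python
my_str = "can't touch this"

try:
    my_str[5] = "-"  # item assignment is not supported on str
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Slicing and concatenation produce a brand-new string instead.
changed = my_str[:5] + "-" + my_str[6:]
print(changed)  # can't-touch this
print(my_str)   # can't touch this (the original is unchanged)
```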
Also, when you build strings with += in a loop, you're creating a new string every iteration:
new_str = "hello "
for c in "world":
    new_str += c # new string created every single time!
print(new_str) # hello world
This has serious implications for time and space complexity, which are discussed in the computer science fundamentals section about Big O.
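A common alternative to repeated concatenation is to collect the pieces in a list and join them once at the end. This is only a sketch of that idea (the pieces list is a name chosen for illustration); for a string this short the difference won't be noticeable.

```python
# Collect the pieces in a list, then join them once at the end.
pieces = ["hello "]
for c in "world":
    pieces.append(c)  # lists are mutable, so this changes pieces in place

new_str = "".join(pieces)
print(new_str)  # hello world
```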
In Python 2, strings are stored internally as sequences of 8-bit bytes (ASCII by default). But in Python 3, all strings are represented in Unicode.
Uh, what?
Before we talk about methods on strings in Python, let's learn a little bit about the history of character encodings. If you would like a longer description, feel free to read this excellent article.
When we as humans see text on a computer screen, we are viewing something quite different from what a computer processes. Remember that computers deal with bits and bytes, so we need a way to encode (or map) characters to something a computer can work with. In 1968, the American Standard Code for Information Interchange (or ASCII) was standardized as a character encoding. ASCII assigned each character a numeric code in the range 0 to 127.
Why this range? Remember that computers work in base 2 or binary, so each bit represents a power of two. This means that 7 bits can get us 2^7 = 128 different binary numbers; since each bit can equal 0 or 1, with 7 bits we can represent all numbers from 0000000 up to 1111111. With ASCII, we can then map each of these numbers to a distinct character. Since there are only 26 letters in the alphabet (52 if you care about the distinction between upper and lower case), plus a handful of digits and punctuation characters, ASCII should more than cover our needs, right?
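You can poke at this mapping from a REPL. The sketch below uses the built-in ord and chr functions; the variable names are only illustrative.

```python
# 7 bits give 2**7 distinct values, numbered 0 through 127.
total_codes = 2 ** 7
print(total_codes)  # 128

# ord() maps a character to its numeric code; chr() goes the other way.
code_for_A = ord("A")
print(code_for_A)                 # 65
print(format(code_for_A, "07b"))  # 1000001 (the same code as 7 binary digits)
print(chr(97))                    # a

# Every character of a plain English sentence fits in the ASCII range.
fits_in_ascii = all(ord(ch) < 128 for ch in "Hello, world!")
print(fits_in_ascii)  # True
```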
ASCII was a great start, but issues arose when non-English characters like é or ö could not be represented and would just be converted to e and o. By the 1980s, most computers used 8-bit bytes. The highest binary number that fits in 8 bits is 11111111, or 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7, or 255. Different machines used the values 128 to 255 for accented characters in different ways, and no common standard existed until the International Organization for Standardization (or ISO) published one.
Even with an additional 128 characters, we started running into lots of issues once the web grew. Languages with completely different character sets like Russian, Chinese, Arabic, and many more had to be encoded in completely different character sets, causing a nightmare when trying to deliver a single text file with multiple character sets.
In the late 1980s, work began on a new standard called Unicode. Unicode's mission was to encode all possible characters and still be backward compatible with ASCII. The most popular Unicode encoding, and the dominant one on the web now, is UTF-8, which uses 8-bit code units but a variable number of them per character (from one to four), so that all sorts of characters can be encoded.
TL;DR: in Python 3, strings are Unicode by default.
Python contains quite a few helpful string methods; here are a few. Try running these in a REPL to see what they do!
Let's start with a simple variable:
string = "this Is nIce"
To convert every character to upper-case we can use the upper method.
string.upper() # 'THIS IS NICE'
To convert every character to lower-case we can use the lower method.
string.lower() # 'this is nice'
To convert the first character in a string to upper-case and everything else to lower-case we can use the capitalize method.
string.capitalize() # 'This is nice'
To convert the first character of every word in a string to upper-case and everything else to lower-case we can use the title method.
string.title() # 'This Is Nice'
To find a subset of characters in a string we can use the find method. This will return the index at which the first match occurs. If the characters are not found, find will return -1.
instructor = 'elie'
instructor.find('e') # 0
instructor.find('E') # -1 it IS case sensitive!
string.find("i") # 2, since the character "i" is at index 2
string.find('Tim') # -1
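A few close relatives of find are worth trying in the same REPL session. This is a short sketch using the built-in in operator and the rfind and index methods; index_failed is just an illustrative name.

```python
instructor = "elie"

# If you only need a yes/no answer, the in operator is enough.
print("e" in instructor)  # True

# find accepts an optional start index; rfind searches from the right.
print(instructor.find("e", 1))  # 3
print(instructor.rfind("e"))    # 3

# index works like find, but raises ValueError instead of returning -1.
try:
    instructor.index("E")
    index_failed = False
except ValueError:
    index_failed = True
print(index_failed)  # True
```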
To see if all characters are alphabetic we can use the isalpha method.
string.isalpha() # False
string[0].isalpha() # True
To see if a character or all characters are empty spaces, we can use the isspace method.
string.isspace() # False
string[0].isspace() # False
string[4].isspace() # True
To see if a character or all characters are lower-cased, we can use the islower method (there is also a method called isupper, which does the inverse).
string.islower() # False
string[0].islower() # True
string[5].islower() # False
string.lower().islower() # True
To see if a string is a "title" (first character of each word is capitalized), we can use the istitle method.
string.istitle() # False
string.title().istitle() # True
"not Awesome Sauce".istitle() # False
"Awesome Sauce".istitle() # True
To see if a string ends with a certain set of characters we can use the endswith method.
"string".endswith('g') # True
"awesome".endswith('foo') # False
To partition a string based on a certain character, we can use the partition method.
string.partition('i') # what's the type of what you get back?
"awesome".partition('e') # ('aw', 'e', 'some')
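Because partition always hands back three parts, it pairs nicely with unpacking. Here is a small sketch; the setting value is a made-up example, not something from earlier in the chapter.

```python
setting = "color=blue"  # hypothetical example value

# Unpack the three parts: text before, the separator, text after.
key, separator, value = setting.partition("=")
print(key)        # color
print(separator)  # =
print(value)      # blue

# When the separator is missing, the whole string comes back first.
missing = "awesome".partition("z")
print(missing)  # ('awesome', '', '')
```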
One of the most common string methods you'll use is the format method. This is a powerful method that can do all kinds of string manipulation, but it's most commonly just used to pass variables into strings. In general this is preferred over string concatenation, which can quickly get cumbersome if you're mixing a lot of variables with strings. For example:
first_name = "Matt"
last_name = "Lane"
city = "San Francisco"
mood = "great"
greeting = "Hi, my name is " + first_name + " " + last_name + ", I live in " + city + " and I feel " + mood + "."
greeting # 'Hi, my name is Matt Lane, I live in San Francisco and I feel great.'
Here, the greeting variable looks fine, but all that string concatenation isn't easy on the eyes. It's very easy to forget about a + sign, or to forget to separate words with extra whitespace at the beginning and end of our strings.
This is one reason why format is nice. Here's a refactor:
greeting = "Hi, my name is {} {}, I live in {} and I feel {}.".format(first_name, last_name, city, mood)
When we call format on a string, we can pass variables into the string! The variables will be passed in order, wherever format finds a set of curly braces.
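The curly braces don't have to be empty. As a brief sketch, format also accepts numbered and named placeholders, which helps when the order in the string differs from the order of the arguments (by_position and by_name are illustrative names).

```python
first_name = "Matt"
last_name = "Lane"

# Numbered placeholders pick arguments by position, in any order.
by_position = "{1}, {0}".format(first_name, last_name)
print(by_position)  # Lane, Matt

# Named placeholders are filled from keyword arguments.
by_name = "{first} {last}".format(first=first_name, last=last_name)
print(by_name)  # Matt Lane
```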
Starting in Python 3.6, however, we have f-strings, which are a cleaner way of doing string interpolation. Simply put f in front of the string, and then put the actual variable names inside the curly braces.
greeting = f"Hi, my name is {first_name} {last_name}, I live in {city}, and I feel {mood}."
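The braces in an f-string can hold more than a bare variable name. The sketch below shows an expression and a format specifier; price is a made-up value for illustration.

```python
first_name = "Matt"
price = 3.14159  # hypothetical value for illustration

# Any expression can go inside the braces.
shout = f"{first_name.upper()} has {len(first_name)} letters"
print(shout)  # MATT has 4 letters

# A format specifier after the colon controls how the value is displayed.
rounded = f"{price:.2f}"
print(rounded)  # 3.14
```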
When you're ready, move on to Boolean Logic