Chapter 3 - Filtering dataframes

# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 5)


# This is necessary to show lots of columns in pandas 0.12. 
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000) 
pd.set_option('display.max_columns', 60)

Let’s continue with our NYC 311 service requests example.

complaints = pd.read_csv('311-service-requests.csv')

3.1 Selecting only noise complaints

I’d like to know which borough has the most noise complaints. First, we’ll take a look at the data to see what it looks like:

complaints[:5]

Output:

Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
02658965110/31/2013 02:08:41 AMNaNNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud TalkingStreet/Sidewalk1143290-03 169 STREET169 STREET90 AVENUE91 AVENUENaNNaNADDRESSJAMAICANaNPrecinctAssigned10/31/2013 10:08:41 AM10/31/2013 02:35:17 AM12 QUEENSQUEENS1042027197389UnspecifiedQUEENSUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.708275-73.791604(40.70827532593202, -73.79160395779721)
12659369810/31/2013 02:01:04 AMNaNNYPDNew York City Police DepartmentIllegal ParkingCommercial Overnight ParkingStreet/Sidewalk1137858 AVENUE58 AVENUE58 PLACE59 STREETNaNNaNBLOCKFACEMASPETHNaNPrecinctOpen10/31/2013 10:01:04 AMNaN05 QUEENSQUEENS1009349201984UnspecifiedQUEENSUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.721041-73.909453(40.721040535628305, -73.90945306791765)
22659413910/31/2013 02:00:24 AM10/31/2013 02:40:32 AMNYPDNew York City Police DepartmentNoise - CommercialLoud Music/PartyClub/Bar/Restaurant100324060 BROADWAYBROADWAYWEST 171 STREETWEST 172 STREETNaNNaNADDRESSNEW YORKNaNPrecinctClosed10/31/2013 10:00:24 AM10/31/2013 02:39:42 AM12 MANHATTANMANHATTAN1001088246531UnspecifiedMANHATTANUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.843330-73.939144(40.84332975466513, -73.93914371913482)
32659572110/31/2013 01:56:23 AM10/31/2013 02:21:48 AMNYPDNew York City Police DepartmentNoise - VehicleCar/Truck HornStreet/Sidewalk10023WEST 72 STREETWEST 72 STREETCOLUMBUS AVENUEAMSTERDAM AVENUENaNNaNBLOCKFACENEW YORKNaNPrecinctClosed10/31/2013 09:56:23 AM10/31/2013 02:21:10 AM07 MANHATTANMANHATTAN989730222727UnspecifiedMANHATTANUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.778009-73.980213(40.7780087446372, -73.98021349023975)
42659093010/31/2013 01:53:44 AMNaNDOHMHDepartment of Health and Mental HygieneRodentCondition Attracting RodentsVacant Lot10027WEST 124 STREETWEST 124 STREETLENOX AVENUEADAM CLAYTON POWELL JR BOULEVARDNaNNaNBLOCKFACENEW YORKNaNNaNPending11/30/2013 01:53:44 AM10/31/2013 01:59:54 AM10 MANHATTANMANHATTAN998815233545UnspecifiedMANHATTANUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.807691-73.947387(40.80769092704951, -73.94738703491433)

5 rows × 52 columns

To get the noise complaints, we need to find the rows where the “Complaint Type” column is “Noise - Street/Sidewalk”. I’ll show you how to do that, and then explain what’s going on.

noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints[:3]

Output:

Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
02658965110/31/2013 02:08:41 AMNaNNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud TalkingStreet/Sidewalk1143290-03 169 STREET169 STREET90 AVENUE91 AVENUENaNNaNADDRESSJAMAICANaNPrecinctAssigned10/31/2013 10:08:41 AM10/31/2013 02:35:17 AM12 QUEENSQUEENS1042027197389UnspecifiedQUEENSUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.708275-73.791604(40.70827532593202, -73.79160395779721)
162659408610/31/2013 12:54:03 AM10/31/2013 02:16:39 AMNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud Music/PartyStreet/Sidewalk10310173 CAMPBELL AVENUECAMPBELL AVENUEHENDERSON AVENUEWINEGAR LANENaNNaNADDRESSSTATEN ISLANDNaNPrecinctClosed10/31/2013 08:54:03 AM10/31/2013 02:07:14 AM01 STATEN ISLANDSTATEN ISLAND952013171076UnspecifiedSTATEN ISLANDUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.636182-74.116150(40.63618202176914, -74.1161500428337)
252659157310/31/2013 12:35:18 AM10/31/2013 02:41:35 AMNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud TalkingStreet/Sidewalk1031224 PRINCETON LANEPRINCETON LANEHAMPTON GREENDEAD ENDNaNNaNADDRESSSTATEN ISLANDNaNPrecinctClosed10/31/2013 08:35:18 AM10/31/2013 01:45:17 AM03 STATEN ISLANDSTATEN ISLAND929577140964UnspecifiedSTATEN ISLANDUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.553421-74.196743(40.55342078716953, -74.19674315017886)

3 rows × 52 columns

If you look at noise_complaints, you’ll see that this worked, and it only contains complaints with the right complaint type. But how does this work? Let’s deconstruct it into two pieces

complaints['Complaint Type'] == "Noise - Street/Sidewalk"

Output:

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
...
111054     True
111055    False
111056    False
111057    False
111058    False
111059     True
111060    False
111061    False
111062    False
111063    False
111064    False
111065    False
111066     True
111067    False
111068    False
Name: Complaint Type, Length: 111069, dtype: bool

This is a big array of Trues and Falses, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where our boolean array evaluated to True. It’s important to note that for row filtering by a boolean array the length of our dataframe’s index must be the same length as the boolean array used for filtering.

You can also combine more than one condition with the & operator like this:

is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]

Output:

Unique KeyCreated DateClosed DateAgencyAgency NameComplaint TypeDescriptorLocation TypeIncident ZipIncident AddressStreet NameCross Street 1Cross Street 2Intersection Street 1Intersection Street 2Address TypeCityLandmarkFacility TypeStatusDue DateResolution Action Updated DateCommunity BoardBoroughX Coordinate (State Plane)Y Coordinate (State Plane)Park Facility NamePark BoroughSchool NameSchool NumberSchool RegionSchool CodeSchool Phone NumberSchool AddressSchool CitySchool StateSchool ZipSchool Not FoundSchool or Citywide ComplaintVehicle TypeTaxi Company BoroughTaxi Pick Up LocationBridge Highway NameBridge Highway DirectionRoad RampBridge Highway SegmentGarage Lot NameFerry DirectionFerry Terminal NameLatitudeLongitudeLocation
312659556410/31/2013 12:30:36 AMNaNNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud Music/PartyStreet/Sidewalk11236AVENUE JAVENUE JEAST 80 STREETEAST 81 STREETNaNNaNBLOCKFACEBROOKLYNNaNPrecinctOpen10/31/2013 08:30:36 AMNaN18 BROOKLYNBROOKLYN1008937170310UnspecifiedBROOKLYNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.634104-73.911055(40.634103775951736, -73.91105541883589)
492659555310/31/2013 12:05:10 AM10/31/2013 02:43:43 AMNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud TalkingStreet/Sidewalk1122525 LEFFERTS AVENUELEFFERTS AVENUEWASHINGTON AVENUEBEDFORD AVENUENaNNaNADDRESSBROOKLYNNaNPrecinctClosed10/31/2013 08:05:10 AM10/31/2013 01:29:29 AM09 BROOKLYNBROOKLYN995366180388UnspecifiedBROOKLYNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.661793-73.959934(40.6617931276793, -73.95993363978067)
1092659465310/30/2013 11:26:32 PM10/31/2013 12:18:54 AMNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud Music/PartyStreet/Sidewalk11222NaNNaNNaNNaNDOBBIN STREETNORMAN STREETINTERSECTIONBROOKLYNNaNPrecinctClosed10/31/2013 07:26:32 AM10/31/2013 12:18:54 AM01 BROOKLYNBROOKLYN996925203271UnspecifiedBROOKLYNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.724600-73.954271(40.724599563793525, -73.95427134534344)
2362659199210/30/2013 10:02:58 PM10/30/2013 10:23:20 PMNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud TalkingStreet/Sidewalk11218DITMAS AVENUEDITMAS AVENUENaNNaNNaNNaNLATLONGBROOKLYNNaNPrecinctClosed10/31/2013 06:02:58 AM10/30/2013 10:23:20 PM01 BROOKLYNBROOKLYN991895171051UnspecifiedBROOKLYNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.636169-73.972455(40.63616876563881, -73.97245504682485)
3702659416710/30/2013 08:38:25 PM10/30/2013 10:26:28 PMNYPDNew York City Police DepartmentNoise - Street/SidewalkLoud Music/PartyStreet/Sidewalk11218126 BEVERLY ROADBEVERLY ROADCHURCH AVENUEEAST 2 STREETNaNNaNADDRESSBROOKLYNNaNPrecinctClosed10/31/2013 04:38:25 AM10/30/2013 10:26:28 PM12 BROOKLYNBROOKLYN990144173511UnspecifiedBROOKLYNUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedUnspecifiedNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN40.642922-73.978762(40.6429222774404, -73.97876175474585)

5 rows × 52 columns

Or if we just wanted a few columns:

complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]

Output:

Complaint TypeBoroughCreated DateDescriptor
31Noise - Street/SidewalkBROOKLYN10/31/2013 12:30:36 AMLoud Music/Party
49Noise - Street/SidewalkBROOKLYN10/31/2013 12:05:10 AMLoud Talking
109Noise - Street/SidewalkBROOKLYN10/30/2013 11:26:32 PMLoud Music/Party
236Noise - Street/SidewalkBROOKLYN10/30/2013 10:02:58 PMLoud Talking
370Noise - Street/SidewalkBROOKLYN10/30/2013 08:38:25 PMLoud Music/Party
378Noise - Street/SidewalkBROOKLYN10/30/2013 08:32:13 PMLoud Talking
656Noise - Street/SidewalkBROOKLYN10/30/2013 06:07:39 PMLoud Music/Party
1251Noise - Street/SidewalkBROOKLYN10/30/2013 03:04:51 PMLoud Talking
5416Noise - Street/SidewalkBROOKLYN10/29/2013 10:07:02 PMLoud Talking
5584Noise - Street/SidewalkBROOKLYN10/29/2013 08:15:59 PMLoud Music/Party

10 rows × 4 columns

3.2 A digression about numpy arrays

On the inside, the type of a column is pd.Series

pd.Series([1,2,3])

Output:

0    1
1    2
2    3
dtype: int64

and pandas Series are internally numpy arrays. If you add .values to the end of any Series, you’ll get its internal numpy array

np.array([1,2,3])

Output:

array([1, 2, 3])
pd.Series([1,2,3]).values

Output:

array([1, 2, 3])

So this binary-array-selection business is actually something that works with any numpy array:

arr = np.array([1,2,3])
arr != 2

Output:

array([ True, False,  True], dtype=bool)
arr[arr != 2]

Output:

array([1, 3])

3.3 So, which borough has the most noise complaints?

is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()

Output:

MANHATTAN        917
BROOKLYN         456
BRONX            292
QUEENS           226
STATEN ISLAND     36
Unspecified        1
dtype: int64

It’s Manhattan! But what if we wanted to divide by the total number of complaints, to make it make a bit more sense? That would be easy too:

noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()
noise_complaint_counts / complaint_counts

Output:

BRONX            0
BROOKLYN         0
MANHATTAN        0
QUEENS           0
STATEN ISLAND    0
Unspecified      0
dtype: int64

Oops, why was that zero? That’s no good. This is because of integer division in Python 2. Let’s fix it, by converting complaint_counts into an array of floats.

noise_complaint_counts / complaint_counts.astype(float)

Output:

BRONX            0.014833
BROOKLYN         0.013864
MANHATTAN        0.037755
QUEENS           0.010143
STATEN ISLAND    0.007474
Unspecified      0.000141
dtype: float64

Now let’s plot it!

(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')

Output:

Plot which borough has the most noise complaints

So Manhattan really does complain more about noise than the other boroughs! Neat.

Help us improve this content by editing this page on GitHub