A Quick Introduction to Data Analysis With Pandas

Ronak Mutha

Artificial Intelligence / Machine Learning

Tags:

pandas python library

data science

python

pandas

data analytics

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas aims to integrate the functionality of NumPy and matplotlib to give you a convenient tool for data analytics and visualization. Besides the integration, it also makes both far easier to use.

In this blog, I’ll give you a list of useful Pandas snippets that you can reuse over and over again. They should save you some of the time you would otherwise spend skimming through the comprehensive Pandas docs.

Pandas has two main data structures, Series and DataFrame, and both can hold elements of any type.

Series

A one-dimensional labeled object that can hold any data type, such as integers, floats, and strings.

A Series can be created from a list of values. You can think of it as a Python list in which every element also carries an index label.

In the example below, NaN is NumPy’s marker for a missing value ("not a number"). It is itself stored as a float, so it can sit alongside numeric data. The dtype of this Series is object because it mixes strings and numbers.

	>>> import pandas as pd
	>>> import numpy as np
	>>> series = pd.Series([12,32,54,2, np.nan, "a string", 6])
	>>> series
	0          12
	1          32
	2          54
	3           2
	4         NaN
	5    a string
	6           6
	dtype: object


If we use only numerical values, the Series gets a NumPy numeric dtype instead. Here it is float64, because NaN is a float and the integers are upcast to match:

	>>> series = pd.Series([1,2,np.nan, 4])
	>>> series
	0    1.0
	1    2.0
	2    NaN
	3    4.0
	dtype: float64
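
As a minimal sketch of the two ideas in this section (dtype inference and index labels), the snippet below builds a few small Series. The variable names and the tea/coffee/juice labels are made up for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Integers plus a missing value: pandas upcasts to float64 so NaN fits.
numbers = pd.Series([1, 2, np.nan, 4])

# Without a missing value the integers stay integers.
whole = pd.Series([1, 2, 4])

# A custom index gives each element a label, much like dictionary keys.
prices = pd.Series([9.5, 12.0, 7.25], index=["tea", "coffee", "juice"])

print(numbers.dtype)       # float64
print(whole.dtype)         # an integer dtype
print(prices["coffee"])    # 12.0 (lookup by label)
print(prices.iloc[0])      # 9.5  (lookup by position)
```

Looking a value up by label (`prices["coffee"]`) and by position (`prices.iloc[0]`) are separate operations, which is the main practical difference from a plain Python list.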


DataFrame

A two-dimensional labeled data structure where columns can be of different types.

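Each column in a Pandas DataFrame is a Series object.

Converting a Python object (a dictionary, a list of lists, and so on) to a DataFrame is straightforward. With a dictionary, the keys become the column names and each value supplies that column’s data. In the example below, stats has five values while the other columns have four; because the values are Series, Pandas aligns them on their index and fills the missing fifth row of year and intake with NaN.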

	>>> d = {
	...     "stats": pd.Series(np.arange(10, 15, 1.0)),
	...     "year": pd.Series(["2012", "2007", "2012", "2003"]),
	...     "intake": pd.Series(["SUMMER", "WINTER", "WINTER", "SUMMER"]),
	... }
	>>> df = pd.DataFrame(d)
	>>> df
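
The sketch below shows the same dictionary-to-DataFrame conversion with plain lists, followed by a small case where Series of different lengths are aligned on their index. The `uneven` frame and its column names are hypothetical and only serve to show the alignment behavior.

```python
import pandas as pd

# Keys become column names; each list becomes that column's values.
data = {
    "stats": [10.0, 11.0, 12.0, 13.0],
    "year": [2012, 2007, 2012, 2003],
    "intake": ["SUMMER", "WINTER", "WINTER", "SUMMER"],
}
df = pd.DataFrame(data)

print(df.shape)              # (4, 3): four rows, three columns
print(list(df.columns))      # ['stats', 'year', 'intake']
print(type(df["year"]))      # each column is a pandas Series

# Series values are aligned on their index, so a shorter Series
# leaves NaN in the rows it does not cover.
uneven = pd.DataFrame({"a": pd.Series([1, 2, 3]), "b": pd.Series([10, 20])})
print(uneven["b"].isna().sum())   # 1
```

Plain lists, unlike Series, must all have the same length, so passing lists is a simple way to catch length mismatches early.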


Reading CSV files

Pandas can read many file types. For any of them, the pattern to remember is:

pd.read_filetype()


Replace “filetype” with the actual type of the file, such as csv or excel (giving pd.read_csv() or pd.read_excel()). Pass the path of the file as the first argument. You can also pass further arguments that control how the file is opened and parsed.

	>>> df = pd.read_csv('companies.csv')
	>>> df.head()
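
Since companies.csv itself is not included with this post, here is a self-contained sketch that reads equivalent CSV text from memory. The rows are invented and the column names (name, vertical, url, year) are an assumption based on the columns used later in this post.

```python
import io

import pandas as pd

# Hypothetical stand-in for companies.csv; rows and column names are assumptions.
csv_text = """name,vertical,url,year
Acme,B2B,https://acme.example,2009
Globex,B2C,https://globex.example,2011
Initech,B2B,https://initech.example,2009
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are parsed; nrows limits how many rows are read.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["name", "year"], nrows=2)

print(df.head())
print(subset.shape)   # (2, 2)
```

Options such as `usecols` and `nrows` are useful for taking a quick look at a large file without loading all of it.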


Accessing Columns and Rows

A DataFrame comprises three components: the index, the columns, and the data (also known as values).

The index is a sequence of labels, one per row. When a DataFrame is displayed, the index always appears on the left side, and notebooks render it in bold font. Each individual value of the index is called a label, which is why the index is sometimes referred to as the row labels. A row’s position, by contrast, is simply its integer offset from the top of the table. In all the examples below, labels and positions coincide: both are integers running from 0 to n-1, where n is the number of rows in the table.

Selecting rows is done using loc and iloc:

loc gets rows (or columns) with particular labels from the index. It raises KeyError when the items are not found.
iloc gets rows (or columns) at particular integer positions (so it only takes integers). It raises IndexError if a requested indexer is out of bounds.

>>> df.loc[:5] # rows labeled 0 to 5 inclusive, one more row than df.head()
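
The difference between labels and positions is easiest to see when they do not coincide. The sketch below uses a hypothetical four-row frame whose index labels are 10, 20, 30, 40; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data; the index labels deliberately differ from the positions.
df = pd.DataFrame(
    {"name": ["Acme", "Globex", "Initech", "Umbrella"],
     "year": [2009, 2011, 2009, 2013]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[10:30]       # label slice: both end points are included
by_position = df.iloc[0:2]     # position slice: the stop is excluded, like a list

print(len(by_label))           # 3
print(len(by_position))        # 2
print(df.loc[20, "name"])      # Globex
print(df.iloc[1, 0])           # Globex

# A label that is not in the index raises KeyError with loc.
try:
    df.loc[5]
except KeyError:
    print("label 5 not found")
```

Note that a label slice with loc includes its end point, while a position slice with iloc excludes it.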


Accessing the data using column names

Pandas goes a step further: loc also accepts column labels, so rows and columns can be selected in a single step.

>>> df.loc[:5, ["name","vertical", "url"]]


In Pandas, selecting data is very easy and similar to accessing an element from a dictionary or a list.

You can select a single column with df[col_name], which returns the column labeled col_name as a Series, because each column of a DataFrame is stored as a Series. To select several columns, pass a list, df[[col_name_1, col_name_2]], which returns the columns as a new DataFrame.
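
A short sketch of the return types, using a hypothetical two-row frame with the same column names as the examples in this post:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Acme", "Globex"],
    "vertical": ["B2B", "B2C"],
    "year": [2009, 2011],
})

one_column = df["name"]              # single label -> Series
two_columns = df[["name", "year"]]   # list of labels -> DataFrame
still_a_frame = df[["name"]]         # one-element list -> DataFrame with one column

print(type(one_column).__name__)     # Series
print(type(two_columns).__name__)    # DataFrame
print(still_a_frame.shape)           # (2, 1)
```

The one-element list form is handy when later code expects a DataFrame rather than a Series.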

Filtering DataFrames with Conditional Logic

Let’s say we want all the companies whose vertical is B2B. The logic would be:

>>> df[(df['vertical'] == 'B2B')]


If we want the companies for the year 2009, we would use:

>>> df[(df['year'] == 2009)]


Need to combine them both? Here’s how you would do it:

>>> df[(df['vertical'] == 'B2B') & (df['year'] == 2009)]
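
Each comparison above produces a boolean Series (a mask), and masks are combined with `&` (and), `|` (or) and `~` (not). The parentheses are needed because `&` binds more tightly than `==` in Python. The sketch below uses made-up company rows to show the combinations.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Acme", "Globex", "Initech", "Umbrella"],
    "vertical": ["B2B", "B2C", "B2B", "B2C"],
    "year": [2009, 2011, 2010, 2009],
})

is_b2b = df["vertical"] == "B2B"     # boolean mask, one value per row
is_2009 = df["year"] == 2009

both = df[is_b2b & is_2009]          # rows where both conditions hold
either = df[is_b2b | is_2009]        # rows where at least one holds
neither = df[~(is_b2b | is_2009)]    # rows where none holds
recent = df[df["year"].isin([2010, 2011])]   # membership test

print(both["name"].tolist())      # ['Acme']
print(either["name"].tolist())    # ['Acme', 'Initech', 'Umbrella']
print(neither["name"].tolist())   # ['Globex']
print(recent["name"].tolist())    # ['Globex', 'Initech']
```

Naming the masks first keeps long filters readable and lets you reuse a condition in several places.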


Figure: Filtering DataFrames with conditional logic, showing all companies with vertical B2B for the year 2009.

Sort and Groupby

Sorting

Sort values by a certain column in ascending order by using:

>>> df.sort_values(colname)


Pass ascending=False to sort in descending order instead:

>>> df.sort_values(colname, ascending=False)


Furthermore, it’s also possible to sort by multiple columns with a different order for each. Here colname_1 is sorted in ascending order and colname_2 in descending order:

>>> df.sort_values([colname_1,colname_2],ascending=[True,False])
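
Here is a concrete sketch of the three sorting calls on a small, made-up frame. It also shows that sort_values() returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Acme", "Globex", "Initech", "Umbrella"],
    "vertical": ["B2B", "B2C", "B2B", "B2C"],
    "year": [2009, 2011, 2010, 2009],
})

oldest_first = df.sort_values("year")
newest_first = df.sort_values("year", ascending=False)

# vertical ascending, then year descending within each vertical
mixed = df.sort_values(["vertical", "year"], ascending=[True, False])

print(oldest_first["year"].tolist())   # [2009, 2009, 2010, 2011]
print(newest_first["year"].tolist())   # [2011, 2010, 2009, 2009]
print(mixed["name"].tolist())          # ['Initech', 'Acme', 'Globex', 'Umbrella']
print(df["name"].tolist())             # original order is unchanged
```

Sorted frames keep their original index labels; call reset_index(drop=True) on the result if you want a fresh 0 to n-1 index.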


Grouping

This operation involves three steps: splitting the data into groups, applying a function to each group, and finally combining the results into a data structure. It can be used to group large amounts of data and compute operations on these groups.

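df.groupby(colname) returns a GroupBy object grouped by the values of one column, while df.groupby([col1, col2]) groups by the combinations of values from multiple columns. Nothing is computed until an aggregation such as sum() or mean() is applied to the GroupBy object.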
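
The split, apply, combine steps can be sketched as follows. The data is made up, and the employees column is a hypothetical addition that gives the groups something numeric to aggregate.

```python
import pandas as pd

# Hypothetical data; "employees" is an invented column for illustration.
df = pd.DataFrame({
    "vertical": ["B2B", "B2C", "B2B", "B2C", "B2B"],
    "year": [2009, 2011, 2010, 2009, 2009],
    "employees": [50, 200, 120, 80, 30],
})

# Split by vertical, apply sum to each group, combine into a Series.
per_vertical = df.groupby("vertical")["employees"].sum()

# Several aggregations at once produce a DataFrame, one column per function.
summary = df.groupby("vertical")["employees"].agg(["mean", "max"])

# Grouping by two columns: one group per (vertical, year) combination.
per_pair = df.groupby(["vertical", "year"]).size()

print(per_vertical["B2B"])          # 200
print(per_vertical["B2C"])          # 280
print(summary.loc["B2C", "mean"])   # 140.0
print(per_pair[("B2B", 2009)])      # 2
```

The result of a grouped aggregation is indexed by the group keys, so individual groups can be looked up by label as shown in the print calls.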

Data Cleansing

Data cleaning is a very important step in data analysis.

Checking missing values in the data

Check null values in the DataFrame by using:

>>> df.isnull()


This returns a DataFrame of booleans with the same shape as df: True where a value is missing and False where it is not. To count the missing values in each column, sum the result:

>>> df.isnull().sum()


Check for non-null values in the DataFrame using pd.notnull(df) or df.notnull(). It returns a DataFrame of booleans that is the exact inverse of df.isnull().
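
A small sketch of these checks on a made-up frame with three missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Acme", None, "Initech"],
    "year": [2009, np.nan, np.nan],
})

missing = df.isnull()                    # DataFrame of booleans, same shape as df
missing_per_column = df.isnull().sum()   # True counts as 1, so this counts per column
present = df.notnull()                   # element-wise opposite of isnull()

print(missing_per_column["name"])        # 1
print(missing_per_column["year"])        # 2
print(present.equals(~missing))          # True
```

Calling .sum() a second time, as in df.isnull().sum().sum(), gives the total number of missing values in the whole frame.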

Removing Empty Values

Dropping empty values can be done easily by using:

>>> df.dropna()


This drops every row that contains at least one empty value. Use df.dropna(axis=1) to drop the columns that contain empty values instead.

To fill the missing values instead, use df.fillna(x), which replaces every missing value with x (any value you choose). For a Series s, s.fillna(s.mean()) replaces missing values with the mean; mean() can be swapped for another summary function such as median().
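
The sketch below applies dropna() and fillna() to a small, made-up frame. The score column is hypothetical and exists only to give the frame a second column with gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Acme", "Globex", "Initech"],
    "year": [2009.0, np.nan, 2011.0],
    "score": [np.nan, np.nan, 3.0],
})

complete_rows = df.dropna()            # keeps only rows with no missing value
complete_cols = df.dropna(axis=1)      # keeps only columns with no missing value

numeric = df[["year", "score"]]
zero_filled = numeric.fillna(0)                      # constant fill
year_filled = df["year"].fillna(df["year"].mean())   # fill with the column mean

print(complete_rows["name"].tolist())   # ['Initech']
print(list(complete_cols.columns))      # ['name']
print(year_filled.tolist())             # [2009.0, 2010.0, 2011.0]
```

Both methods return new objects and leave df unchanged, so the result has to be assigned to a variable to be kept.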

Operations on Complete Rows, Columns, or Even All Data

>>> df["url_len"] = df["url"].map(len)


The .map() operation applies a function to each element of a column (a Series).

.apply() applies a function to each column of a DataFrame. Use .apply(func, axis=1) to apply it to each row instead.
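
To make the difference concrete, the sketch below runs .map() on one column and .apply() along both axes. The data and the employees column are made up for illustration.

```python
import pandas as pd

# Hypothetical data; "employees" is an invented column.
df = pd.DataFrame({
    "url": ["https://acme.example", "https://globex.example"],
    "year": [2009, 2011],
    "employees": [50, 200],
})

# map: the function receives one element at a time.
df["url_len"] = df["url"].map(len)

numeric = df[["year", "employees"]]

# apply (default axis=0): the function receives one whole column at a time.
column_max = numeric.apply(lambda col: col.max())

# apply with axis=1: the function receives one whole row at a time.
row_total = numeric.apply(lambda row: row["year"] + row["employees"], axis=1)

print(df["url_len"].tolist())    # [20, 22]
print(column_max["year"])        # 2011
print(row_total.tolist())        # [2059, 2211]
```

Row-wise apply calls the Python function once per row, so for simple arithmetic a direct column expression such as `numeric["year"] + numeric["employees"]` is usually the faster choice.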

Iterating over rows

Iterating over a DataFrame is very similar to iterating over Python’s built-in containers such as lists, tuples, and dictionaries.

	>>> for i, row in df.iterrows():
	...     print("Index {0}".format(i))
	...     print("Row {0}".format(row))


.iterrows() yields two values on each pass: the index label of the row and the row itself as a Series. In the code above, i is the index label and row is the row.
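
As a runnable sketch of the loop above, the snippet below collects values from each row of a made-up frame. It also shows itertuples(), a related method that is not covered in this post but is commonly used for the same purpose.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"name": ["Acme", "Globex"], "year": [2009, 2011]})

collected = []
for label, row in df.iterrows():
    # row is a Series indexed by column name
    collected.append((label, row["name"], row["year"]))

print(collected)   # [(0, 'Acme', 2009), (1, 'Globex', 2011)]

# itertuples() yields one namedtuple per row; fields are the column names.
names = [record.name for record in df.itertuples(index=False)]
print(names)       # ['Acme', 'Globex']
```

Explicit row loops are slow on large frames, so they are best kept for cases that cannot be expressed as column operations.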

Tips & Tricks

NumPy’s ufuncs (universal functions) are worth knowing. Pandas has .apply(), which applies a function to columns or rows. Ufuncs can be used in the same way while preprocessing. So what is the difference between ufuncs and .apply()?

Ufuncs are NumPy functions implemented in C, which makes them highly efficient: they operate on whole arrays at once and are often around 10 times faster than the equivalent .apply() call.

A list of common Ufuncs:

isinf: Element-wise checks for positive or negative infinity.

isnan: Element-wise checks for NaN and returns result as a boolean array.

isnat: Element-wise checks for NaT (not a time) and returns result as a boolean array.

trunc: Return the truncated value of the input, element-wise.

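.dt accessor: Element-wise processing for datetime columns. Strictly speaking this is a Pandas accessor rather than a NumPy ufunc, but it is vectorized in the same way.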
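
The functions in this list can be called directly on a Series. The sketch below uses made-up values and also compares one ufunc with the equivalent .apply() call to show that the results match.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical values for illustration.
values = pd.Series([1.7, -2.3, np.nan, np.inf])

print(np.isnan(values).tolist())   # [False, False, True, False]
print(np.isinf(values).tolist())   # [False, False, False, True]

truncated = np.trunc(values)       # drops the fractional part, keeps the sign
print(truncated.iloc[0], truncated.iloc[1])   # 1.0 -2.0

# isnat works on datetime data; NaT marks a missing timestamp.
dates = pd.Series(pd.to_datetime(["2009-03-01", None]))
print(np.isnat(dates.to_numpy()).tolist())    # [False, True]
print(dates.dt.year.iloc[0])                  # 2009.0 (float because of the NaT)

# The ufunc and the per-element apply give the same numbers.
squares = pd.Series([1.0, 4.0, 9.0])
print(np.sqrt(squares).equals(squares.apply(math.sqrt)))   # True
```

The ufunc version processes the whole array in compiled code, while .apply() calls a Python function once per element, which is where the speed difference comes from.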

High-Performance Pandas

Pandas performs various vectorized/broadcasted operations and grouping-type operations. These operations are efficient and effective.

As of version 0.13, Pandas includes tools that give direct access to C-speed operations without the costly allocation of intermediate arrays. These are the eval() and query() functions.

pd.eval() for efficient operations:

	>>> import numpy as np
	>>> import pandas as pd
	>>> nrows, ncols = 100000, 100
	>>> rng = np.random.RandomState(42)
	>>> df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
	...                       for i in range(4))


To compute the sum of df1, df2, df3, and df4 DataFrames using the typical Pandas approach, we can just write the sum:

	>>> %timeit df1 + df2 + df3 + df4

	10 loops, best of 3: 103.1 ms per loop


The same operation can be computed more efficiently via pd.eval(), which takes the expression as a string:

	>>> %timeit pd.eval('df1 + df2 + df3 + df4')

	10 loops, best of 3: 53.6 ms per loop


%timeit is an IPython magic command that measures the execution time of small code snippets.

In the timings above, the eval() version takes about half the time (it also consumes much less memory).

And it produces the same result:

	>>> np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))

	True


np.allclose() is a numpy function which returns True if two arrays are element-wise equal within a tolerance.

Column-Wise & Assignment Operations Using df.eval()

The usual way to take the first character of each value in a column and assign it back to the same column is:

>>> df['batch'] = df['batch'].str[0]


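With df.eval(), the same assignment can be written as a string expression, which may run faster on large DataFrames. Note that df.eval() returns a new DataFrame unless inplace=True is passed, so the result has to be assigned back:

>>> df = df.eval("batch=batch.str[0]")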
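
The sketch below shows column-wise assignment with df.eval() on numeric data. The revenue and cost columns are hypothetical and do not come from the companies dataset; they only illustrate how column names are used directly inside the expression.

```python
import pandas as pd

# Hypothetical numeric columns for illustration.
df = pd.DataFrame({
    "revenue": [100.0, 250.0, 80.0],
    "cost": [60.0, 200.0, 90.0],
})

# By default eval() returns a new DataFrame with the extra column.
with_profit = df.eval("profit = revenue - cost")
print("profit" in df.columns)             # False: df itself is unchanged
print(with_profit["profit"].tolist())     # [40.0, 50.0, -10.0]

# inplace=True adds the column to df directly.
df.eval("margin = (revenue - cost) / revenue", inplace=True)
print(df["margin"].tolist())              # [0.4, 0.2, -0.125]
```

Inside the expression, bare names refer to columns of the frame, which keeps longer formulas compact compared with repeating df["..."] for every term.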


DataFrame.query() for efficient operations:

As with the conditional filtering shown earlier, the standard way to filter rows with vertical B2B and year 2009 is:

	>>> %timeit df[(df['vertical'] == 'B2B') & (df['year'] == 2009)]

	1.69 ms ± 57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


With .query(), the same filtering takes about half the time:

	>>> %timeit df.query("vertical == 'B2B' and year == 2009")

	875 µs ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
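
To check that the two forms select the same rows, the sketch below runs both on a small, made-up frame. It also uses the `@` prefix, which lets a query string refer to a Python variable; the variable name target_year is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "name": ["Acme", "Globex", "Initech", "Umbrella"],
    "vertical": ["B2B", "B2C", "B2B", "B2C"],
    "year": [2009, 2011, 2010, 2009],
})

target_year = 2009

# Boolean-mask form
mask_result = df[(df["vertical"] == "B2B") & (df["year"] == target_year)]

# query() form: column names are bare, Python variables are prefixed with @
query_result = df.query("vertical == 'B2B' and year == @target_year")

print(query_result["name"].tolist())     # ['Acme']
print(mask_result.equals(query_result))  # True
```

On a frame this small the mask form will likely be the quicker of the two; the timing advantage reported above applies to larger data.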


When to use eval() and query()?

There are two aspects to consider: computation time and memory usage.

Memory usage: Every compound operation on NumPy arrays or Pandas DataFrames implicitly creates temporary intermediate arrays. When those temporaries are large relative to the available memory, eval() and query() are an appropriate choice for reducing memory usage.

Computation time: The traditional way of performing NumPy/Pandas operations is faster for smaller arrays! The real benefit of eval()/query() comes mainly from the saved memory, and also from the cleaner syntax they offer.

Conclusion

Pandas is a powerful and fun library for data manipulation and analysis. It comes with easy syntax and fast operations. This blog highlights the most commonly used Pandas operations and optimizations. The best way to master Pandas is to work with real datasets, for example by starting with Kaggle kernels to learn how to use Pandas for data analysis. For more, check out our posts on real-time text classification using Kafka and Scikit-learn and on explanatory vs. predictive models in machine learning.


About the Author

Ronak is a Software Engineer who is passionate about data science and analytics. He currently works on building applications using Python/Django. In his free time, he loves reading and trekking.
