Dataconomy

Python Packages For Data Mining

by dataaspirant
April 24, 2017
in Data Science 101

Just because you have a “hammer”, doesn’t mean that every problem you come across will be a “nail”.

The intelligent thing is knowing how to use the same hammer on whatever problem you come across. In the same way, when we set out to solve a data mining problem we will face many issues, but we can solve them by using Python intelligently.

In the very next post I am going to get your hands dirty by solving one interesting data mining problem using the Python programming language. So in this post I am going to explain some powerful Python weapons (packages).

Before stepping directly to Python packages, let me clear up any doubts you may have about why you should be using Python.

Table of Contents




  • Why Python?
    • Python is Easy
    • Python is Efficient
    • Python is Fast
  • NumPy
  • SciPy
  • Pandas
  • Matplotlib
  • IPython
  • scikit-learn

Why Python?

We all know that Python is a powerful programming language, but what does that mean, exactly? What makes Python a powerful programming language?

Python is Easy

Python has gained a near-universal reputation for being easy to learn. The syntax of the Python programming language is designed to be easily readable. Python is also very popular in scientific computing, where the people doing the work are scientists first and programmers second.

Python is Efficient

Nowadays we work with bulk amounts of data, popularly known as big data. The more data you have to process, the more important it becomes to manage the memory you use. Python copes well here, especially when the data is held in compact structures such as NumPy arrays (covered below) rather than plain lists.
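As a rough illustration of the memory point, the sketch below compares a plain Python list of integers with a NumPy array holding the same values. The variable names and the choice of one million elements are assumptions made for illustration, and the exact byte counts depend on the platform and Python version.

```python
import sys
import numpy as np

n = 1_000_000  # illustrative size, not taken from the article

# A list stores pointers to separate int objects, so count both parts.
py_list = list(range(n))
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# A NumPy array stores the raw 64-bit values in one contiguous buffer.
np_array = np.arange(n, dtype=np.int64)
array_bytes = np_array.nbytes

print("list :", list_bytes, "bytes (approximate)")
print("array:", array_bytes, "bytes")
```

The list figure is only an estimate, but it shows why typed arrays are usually preferred once the data gets large.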

Python is Fast

We all know Python is an interpreted language, so we may think that it is slow, but a great deal of work has been done over the past years to improve Python’s performance. My point is that if you want to do high-performance computing, Python is a viable option today.
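One common way to get that performance is to push loops into compiled library code. The sketch below, under the assumption that NumPy is installed, computes the same sum of squares with a pure-Python loop and with a single NumPy call. The array size is arbitrary and the timings will differ from machine to machine, so no numbers are quoted here.

```python
import time
import numpy as np

values = np.random.rand(200_000)  # arbitrary illustrative size

# Pure-Python loop: the interpreter handles every element.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# Vectorized: the same sum of squares runs in compiled code inside NumPy.
start = time.perf_counter()
vector_total = float(np.dot(values, values))
vector_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
```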

I hope that clears up any doubts about “Why Python?”, so let me jump to the Python packages for data mining.

NumPy


 About:

NumPy is the fundamental package for scientific computing with Python. NumPy is an extension to the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications.

Original author(s): Travis Oliphant
Developer(s): Community project
Initial release: As Numeric, 1995; as NumPy, 2006
Stable release: 1.9.0 (7 September 2014)
Written in: Python, C
Operating system: Cross-platform
Type: Technical computing
License: BSD-new license
Website: www.numpy.org

Installing numpy:

If Python is not installed on your computer, please install it first.

Installing numpy in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-numpy

Sample numpy code for using reshape function

[code language="python"]import numpy as np

# Build the integers 0..11, then view them as three 2x2 blocks
a = np.arange(12)
a = a.reshape(3, 2, 2)
print(a)[/code]

Script output

[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
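To show the "high-level mathematical functions" side of NumPy as well, here is a small sketch that continues from the same 3x2x2 array. The names block_sum, factors and scaled are made up for this illustration.

```python
import numpy as np

a = np.arange(12).reshape(3, 2, 2)

# Sum over the first axis: adds the three 2x2 blocks element by element.
block_sum = a.sum(axis=0)
print(block_sum)
# [[12 15]
#  [18 21]]

# Broadcasting: scale each 2x2 block by its own factor.
factors = np.array([1, 10, 100]).reshape(3, 1, 1)
scaled = a * factors
print(scaled[2])
# [[ 800  900]
#  [1000 1100]]

# Element-wise square root of the first block.
print(np.sqrt(a[0]))
```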

SciPy


About:

SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world’s leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, SciPy is the tool for the job.

Original author(s): Travis Oliphant, Pearu Peterson, Eric Jones
Developer(s): Community library project
Stable release: 0.14.0 (3 May 2014)
Written in: Python, Fortran, C, C++
Operating system: Cross-platform
Type: Technical computing
License: BSD-new license
Website: www.scipy.org

 Installing SciPy in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-scipy

Sample SciPy code

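[code language="python"]import numpy as np
import matplotlib.pyplot as plt
from scipy import special, optimize

# Negate the Bessel function J3 so that minimizing f finds a maximum of J3
f = lambda x: -special.jv(3, x)
sol = optimize.minimize(f, 1.0)

x = np.linspace(0, 10, 5000)
plt.plot(x, special.jv(3, x), '-', sol.x, -sol.fun, 'o')
plt.savefig('plot.png', dpi=96)[/code]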

 Script output

[Figure: the Bessel function J3(x) plotted over 0 to 10, with the located maximum marked by a dot, as saved to plot.png]
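The description above also mentions numerical integration, which the sample does not cover. A minimal sketch using scipy.integrate.quad follows. The integrand (sin on the interval 0 to pi) is chosen here only because its exact integral, 2, is easy to check.

```python
import numpy as np
from scipy import integrate

# Integrate sin(x) from 0 to pi; the exact answer is 2.
value, abs_error = integrate.quad(np.sin, 0, np.pi)

print("integral:", value)
print("estimated absolute error:", abs_error)
```

quad returns both the estimate and an upper bound on its absolute error, which is handy when deciding whether the result is accurate enough.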

Pandas


About:

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.

Installing Pandas in Linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-pandas

Sample Pandas code about Pandas Series

[code language="python"]import numpy as np
import pandas as pd

# Wrap a NumPy array in a Series; pandas adds a default integer index
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
ser = pd.Series(values)
print(ser)[/code]

Script output

0     2.0000
1     1.0000
2     5.0000
3     0.9700
4     3.0000
5    10.0000
6     0.0599
7     8.0000
dtype: float64
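The kinds of data listed earlier (tabular data with mixed column types, and time series) map onto the DataFrame. The sketch below uses a small, entirely hypothetical sales table. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: columns of different types, as in an SQL table.
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Paris"],
    "date": pd.to_datetime(
        ["2017-04-01", "2017-04-01", "2017-04-02", "2017-04-02"]
    ),
    "sales": [120.0, 90.5, 130.0, 99.5],
})

# Group-wise aggregate: mean sales per city.
mean_sales = df.groupby("city")["sales"].mean()
print(mean_sales)

# Time series: index by date and total the sales per day.
daily = df.set_index("date")["sales"].resample("D").sum()
print(daily)
```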

Matplotlib


About:
matplotlib is a plotting library for the Python programming language and its NumPy numerical mathematics extension. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB. SciPy makes use of matplotlib.

Original author(s): John Hunter
Developer(s): Michael Droettboom, et al.
Stable release: 1.4.2 (26 October 2014)
Written in: Python
Operating system: Cross-platform
Type: Plotting
License: matplotlib license
Website: matplotlib.org

Installing Matplotlib in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo apt-get install python-matplotlib

Sample Matplotlib code to Create Histograms

[code language="python"]import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# example data
mu = 100  # mean of distribution
sigma = 15  # standard deviation of distribution
x = mu + sigma * np.random.randn(10000)

num_bins = 50
# the histogram of the data (density=True replaces the removed normed=1)
n, bins, patches = plt.hist(x, num_bins, density=True, facecolor='green', alpha=0.5)
# add a 'best fit' line (norm.pdf replaces the removed mlab.normpdf)
y = norm.pdf(bins, mu, sigma)
plt.plot(bins, y, 'r--')
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'Histogram of IQ: $\mu=100$, $\sigma=15$')

# Tweak spacing to prevent clipping of ylabel
plt.subplots_adjust(left=0.15)
plt.show()[/code]

Script output

[Figure: histogram of the sample with the normal density drawn over it as a dashed red line]
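The histogram sample uses the procedural pyplot calls. For comparison, here is a rough sketch of the object-oriented style mentioned in the description, where the figure and axes are handled as explicit objects. It assumes the non-interactive Agg backend so that it runs without a display, and it writes the image to an in-memory buffer rather than a file. The sine curve is just placeholder data.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

# Work with explicit Figure and Axes objects instead of pyplot's current state.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Render the figure as PNG bytes held in memory.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=96)
plt.close(fig)
print(len(buffer.getvalue()), "bytes of PNG data")
```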

IPython


IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, that offers enhanced introspection, rich media, additional shell syntax, tab completion, and rich history. IPython currently provides the following features:

  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into one’s own projects.
  • Easy to use, high performance tools for parallel computing.

Original author(s): Fernando Perez and others
Stable release: 2.3 (1 October 2014)
Written in: Python, JavaScript, CSS, HTML
Operating system: Cross-platform
Type: Shell
License: BSD
Website: www.ipython.org

Installing IPython in linux

Open your terminal and copy these commands:

sudo apt-get update
sudo pip install ipython

Sample IPython code

This piece of code plots a figure that illustrates the integral as the area under a curve.

[code language="python"]import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon


def func(x):
    return (x - 3) * (x - 5) * (x - 7) + 85


a, b = 2, 9  # integral limits
x = np.linspace(0, 10)
y = func(x)

fig, ax = plt.subplots()
plt.plot(x, y, 'r', linewidth=2)
plt.ylim(ymin=0)

# Make the shaded region
ix = np.linspace(a, b)
iy = func(ix)
verts = [(a, 0)] + list(zip(ix, iy)) + [(b, 0)]
poly = Polygon(verts, facecolor='0.9', edgecolor='0.5')
ax.add_patch(poly)

plt.text(0.5 * (a + b), 30, r"$\int_a^b f(x)\mathrm{d}x$",
         horizontalalignment='center', fontsize=20)

plt.figtext(0.9, 0.05, '$x$')
plt.figtext(0.1, 0.9, '$y$')

ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.xaxis.set_ticks_position('bottom')

ax.set_xticks((a, b))
ax.set_xticklabels(('$a$', '$b$'))
ax.set_yticks([])

plt.show()[/code]

Script output

[Figure: the curve f(x) with the area between a and b shaded]

scikit-learn


The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a “SciKit” (SciPy Toolkit), a separately-developed and distributed third-party extension to SciPy. The original codebase was later extensively rewritten by other developers. Of the various scikits, scikit-learn as well as scikit-image were described as “well-maintained and popular” in November 2012.

Original author(s): David Cournapeau
Initial release: June 2007
Stable release: 0.15.1 (August 1, 2014)
Written in: Python, Cython, C and C++
Operating system: Linux, Mac OS X, Microsoft Windows
Type: Library for machine learning
License: BSD License
Website: scikit-learn.org

Installing Scikit-learn in linux

Open your terminal and copy these commands

sudo apt-get update
sudo apt-get install python-sklearn

Sample Scikit-learn code

[code language="python"]import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature (column 2), kept as a 2-D array of shape (n, 1)
diabetes_X = diabetes.data[:, np.newaxis]
diabetes_X_temp = diabetes_X[:, :, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X_temp[:-20]
diabetes_X_test = diabetes_X_temp[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error on the test set (the printed label keeps the original wording)
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()[/code]

Script output

Coefficients:
[ 938.23786125]
Residual sum of squares: 2548.07
Variance score: 0.47

[Figure: test points in black with the fitted regression line in blue]
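The two scores printed above can also be computed with the helpers in sklearn.metrics. The sketch below is a shorter variant of the same experiment, not a replacement for it. The variable names are shortened for this illustration. Note that the quantity labelled "Residual sum of squares" in the sample is in fact a mean of squared errors, which is what mean_squared_error returns.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:, [2]]   # column 2 only, kept 2-D with shape (n, 1)
y = diabetes.target

# Same split as the sample: the last 20 rows are held out for testing.
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # mean of (prediction - target) ** 2
r2 = r2_score(y_test, y_pred)             # same value as regr.score(X_test, y_test)

print("Mean squared error: %.2f" % mse)
print("R^2 score: %.2f" % r2)
```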

I have explained the packages which we are going to use in coming posts to solve some interesting problems.

Please leave your comment if you have any other Python data mining packages to add to this list.

Originally published here.

(Image credit: Thomas Hawk)


Comments 11

  1. Peter Wang says:
    8 years ago

    For installation stuff, you might want to mention using a distribution like Anaconda (http://continuum.io/anaconda) which is critical for Windows and OS X users, and helpful even for Linux users that have decent package managers. (Official Linux distro versions of packages can sometimes be woefully out of date.)

  2. Dave Fuller says:
    8 years ago

    I’d also recommend using pyenv on linux, rather than installing packages within the system’s version of python. This can cause a lot of problems. With pyenv, you can control which version of python you want to use on a per directory basis (and some other ways as well) and you don’t risk causing OS-level problems.

  3. Juanlu001 says:
    7 years ago

    That is *not* pandas logo.

    http://python-madrid.es/

  4. Oliver Slay says:
    7 years ago

    I would recommend looking at TensorFlow.org which is Google’s new Open Source Software Library for Deep Machine Learning…

  5. Robin White says:
    7 years ago

    Great info! Thanks for sharing it.
    http://www.thedevmasters.com/python-for-data-scientist/ is the course that I would recommend and that you should check out to understand those.

