Data Wrangling with PySpark for Data Scientists Who Know Pandas – Andrew Ray




Data scientists spend more time wrangling data than making models. Traditional tools like Pandas provide a very powerful data manipulation toolset. Transitioning to big data tools like PySpark allows one to work with much larger datasets, but can come at the cost of productivity.

In this session, learn about data wrangling in PySpark from the perspective of an experienced Pandas user. Topics will include best practices, common pitfalls, performance considerations, and debugging.

Session hashtag: #SFds12

Learn more:
Developing Custom Machine Learning Algorithms in PySpark
https://databricks.com/blog/2017/08/30/developing-custom-machine-learning-algorithms-in-pyspark.html

Introducing Pandas UDF for PySpark
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Best Practices for Running PySpark
https://databricks.com/session/best-practices-for-running-pyspark

Session Overview:
– Why?
– What do I get with PySpark?
– Primer
– Important Concepts
– Architecture
– Setup
– Run
– Load CSV
– View DataFrame
– Rename Columns
– Drop Column
– Filtering
– Add Column
– Fill Nulls
– Aggregation
– Standard Transformations
– Keep it in the JVM
– Row Conditional Statements
– Python when Required
– Merge/Join DataFrames
– Pivot table
– Summary Statistics
– Histogram
– SQL
– Make sure to
– Things not to do
– If things go wrong
– Thank you

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/


Comment List

  • Databricks
    December 14, 2020

    He provided a really good comparison between the two!

  • Databricks
    December 14, 2020

    Too quiet, please fix.

  • Databricks
    December 14, 2020

    Just use Koalas.

  • Databricks
    December 14, 2020

    Which is better in the Databricks environment: Python, R, or SQL? Reply in the comments.

  • Databricks
    December 14, 2020

    great presentation!

  • Databricks
    December 14, 2020

    For those searching for a tutorial: https://www.rrighart.com/pyspark/pyspark-in-the-context-of-retail. Disclaimer: I am the author ;-).

  • Databricks
    December 14, 2020

    Best videos for beginners to start with!

  • Databricks
    December 14, 2020

    Thank you for such a great presentation

  • Databricks
    December 14, 2020

    Would this be a good tool for quickly combining a large number of CSVs into a single dataframe, then performing manipulations on that dataframe before outputting a single CSV?

  • Databricks
    December 14, 2020

    19:12: pandas now has SQL support.

  • Databricks
    December 14, 2020

    Great tech video, but volume really …

  • Databricks
    December 14, 2020

    Thank you for such a great presentation for beginners!

  • Databricks
    December 14, 2020

    What's with the volume?

  • Databricks
    December 14, 2020

    Very good presentation! Thanks

  • Databricks
    December 14, 2020

    This is a great video. Exactly what I'm looking for, thanks very much.

  • Databricks
    December 14, 2020

    Really nice how we see pandas and pyspark functions side-by-side!

  • Databricks
    December 14, 2020

    Volume is low! 🙁

  • Databricks
    December 14, 2020

    Cool talk and key differences nicely illustrated.

  • Databricks
    December 14, 2020

    LOL, good presentation, but unprepared for the Q&A.

  • Databricks
    December 14, 2020

    Must-watch Q&A session at the end. I loved it.

  • Databricks
    December 14, 2020

    Just downloading and writing this code will not work. You have to create a session.

  • Databricks
    December 14, 2020

    Hey Andrew, could you send me your GitHub link?
